A Comprehensive Machine Learning Framework for the Exact Prediction of the Age of Onset in Familial and Sporadic Alzheimer’s Disease

Machine learning (ML) algorithms are widely used to develop predictive frameworks. Accurate prediction of Alzheimer's disease (AD) age of onset (ADAOO) is crucial to investigate potential treatments, follow-up, and therapeutic interventions. Although genetic and non-genetic factors affecting ADAOO have been elucidated by other research groups and ours, the comprehensive and sequential application of ML to provide an exact estimate of the actual ADAOO, rather than a high-confidence interval within which it may fall, remains to be explored. Here, we assessed the performance of ML algorithms for predicting ADAOO using two AD cohorts, one with early-onset familial AD and one with late-onset sporadic AD, combining genetic and demographic variables. Performance of ML algorithms was assessed using the root mean squared error (RMSE), the R-squared (R2), and the mean absolute error (MAE) with a 10-fold cross-validation procedure. For predicting ADAOO in familial AD, boosting-based ML algorithms performed best. In the sporadic cohort, boosting-based ML algorithms performed best in the training data set, while regularization methods performed best for unseen data. ML algorithms represent a feasible alternative for accurately predicting ADAOO with little human intervention. Future studies may include predicting the speed of cognitive decline in our cohorts using ML.


Introduction
Alzheimer's disease (AD; OMIM 104300) is a neurodegenerative disorder characterized by progressive loss of neurological, mental, and cognitive functions, including memory, and by changes in judgment, behavior, and emotions [1][2][3][4]. AD is the most common cause of dementia and constitutes an increasing challenge to society because of its public health and economic costs [5][6][7][8]. As of 2016, ~44 million people had AD or a related dementia worldwide [9]. Without new medicines to prevent, delay, or stop the disease, this figure is projected to increase dramatically to ~66 million dementia cases by 2030 and ~116 million by 2050 [10]. The financial burden associated with the disease was estimated at USD 818 billion worldwide in 2015 [11,12].
AD neuropathological damage is characterized by extracellular deposits of the beta-amyloid (Aβ) peptide and the formation of intracellular neurofibrillary tangles of hyperphosphorylated tau protein. ML algorithms have been used to support AD diagnosis and to predict AD conversion in individuals with Mild Cognitive Impairment (MCI) [66,67]. Interestingly, optimization procedures for tuning the parameters of ML algorithms have been reported to increase the sensitivity, specificity, and accuracy of ML for AD diagnosis [68]. Other ML alternatives include the use of artificial intelligence (AI), namely deep learning (DL), to assess AD diagnosis and progression with brain radiological images [69,70]. Although these results are promising, their main limitation is that the predictive model provided either an estimate of an individual's risk of developing AD or the range within which the ADAOO may fall with high confidence (i.e., early- or late-onset, based on whether the ADAOO was before or after a threshold, respectively), but not an estimate of the actual ADAOO. Moreover, a comprehensive exploration of advanced ML algorithms for ADAOO prediction is yet to be conducted.
In this study, we comprehensively assess the feasibility of ML algorithms applied to fAD and sAD cohorts, with the overarching aims of (1) accurately predicting ADAOO and improving on the scope and performance previously reached; and (2) expanding the possibilities of quantifying ADAOO in the clinical setting. Our results suggest that ML constitutes a feasible and easy-to-implement methodology to predict ADAOO, especially in the clinical setting, substantially improving on our previous results and paving the way for new possibilities to define follow-up and counseling strategies for patients and their family members.

The Cohort of Familial Cases
This cohort comprises carriers of the E280A PSEN1 mutation [33,36]. Detailed clinical assessment and ascertainment procedures of this pedigree have been presented elsewhere [31,71-73].

The Cohort of Sporadic Cases
Fifty-four individuals with sAD were included in this study (43 [80%] were women, and 11 [20%] men). Clinical, neurological, and neuropsychological assessment of sAD patients has been reported elsewhere [35]. ADAOO was determined during anamnesis with the information provided by patients or their families, with confirmation by several sources. Because some patients started their follow-up during MCI, ADAOO was defined during the follow-up stage based on Petersen's criteria [74]. This strategy was recently proven to be highly accurate [75]. AD affection status was defined based on the DSM-IV criteria [76].

Variants Associated with ADAOO
We previously studied the association of common exonic functional variants (CEFVs) with ADAOO (Table 1) [35,36] using single- and multi-locus linear mixed-effects models [77] and recursive partitioning ML algorithms [36]. These variants were found to delay ADAOO by up to ~17 years in carriers of the E280A PSEN1 mutation and to accelerate it by up to ~14 years in individuals with sAD [35,36].

ADAOO Prediction Using ML
Predictive models of ADAOO were constructed with ML algorithms in individuals carrying the E280A PSEN1 mutation and in individuals with sAD. The set of predictor variables consisted of demographic variables (i.e., sex and years of education) and genomic variants previously identified as ADAOO modifiers (Table 1). The complete list of ML algorithms is provided in the Supplementary Materials. Construction, parameter tuning, validation, and testing of these predictive models were performed in R version 4.0.2 Patched (2020-06-30 r78761) [80] with the methods implemented in the caret package [47,48], using a 10-fold cross-validation procedure with five repetitions. The training/testing data sets consisted of 70%/30% of individuals per cohort. Given the continuous nature of the outcome variable (i.e., ADAOO), the root mean squared error (RMSE), the R-squared (R2), and the mean absolute error (MAE) were used to evaluate the performance of the ML algorithms. In ML-based predictive models, high values of R2 and low values of RMSE and MAE indicate good performance. To represent the performance of these ML algorithms graphically and to identify similarities among them, we combined K-means clustering [81] and principal component analysis (PCA) [82,83]; the number of K-means clusters and the number of principal components were determined using the methods implemented in the NbClust [84] and paran [85] packages for R.

To evaluate the stability of each predictor's variable importance, we implemented the following resampling strategy, a slight modification of the empirical bootstrap [86,87]. First, we constructed B = 1000 training data sets at random, keeping the 70%/30% training/testing proportion initially used to identify the best-performing ML model. Secondly, for the b-th training data set (b = 1, 2, . . . , B), this model was fitted, and the variable importance measure associated with each predictor was computed.
Thus, for any predictor X, we obtained the values X(1), X(2), . . . , X(B), with X(b) representing the variable importance of X calculated in the b-th randomly generated training data set. Finally, we calculated bootstrap-based 95% confidence intervals (CIs) from the 2.5% and 97.5% percentiles of X(1), X(2), . . . , X(B).

Table 2 presents the performance measures for the collection of ML algorithms for predicting AOO in the E280A pedigree. The training/testing data sets consisted of 51/20 individuals, respectively. When predicting AOO in the training data set, the xgbLinear ML algorithm outperformed all other algorithms in the RMSE, R2, and MAE performance measures. When evaluating these ML algorithms' performance for unseen data (i.e., the testing data set), the glmboost ML algorithm outperformed all other alternatives. Following our results, the performance of these ML algorithms can be grouped into three classes. For the training data set, class 1 comprises the rf, xgbTree, xgbLinear, and qrf algorithms (Figure 1a; yellow); class 2 is constituted by the mlp, treebag, rpart1SE, rpart2, rpart, knn, gbm, svmRadial, svmLinear, and svmLinear2 algorithms (Figure 1a; red); and class 3 by the bstTree, glmnet, glmboost, and svmPoly algorithms (Figure 1a; blue). In the testing data set, the svmPoly, xgbTree, xgbLinear, gbm, bstTree, rpart, and qrf algorithms belong to class 1 (Figure 1b; yellow); treebag, rpart1SE, rpart2, svmLinear, svmLinear2, rf, knn, and svmRadial form class 2 (Figure 1b; red); and glmnet and glmboost constitute class 3 (Figure 1b; blue). Overall, the best-performing algorithms are grouped into class 1 for the training data set and into class 3 for the testing data set; the xgbLinear algorithm outperforms all other alternatives in class 1 (Table 2 and Figure 1a), while the glmboost algorithm outperforms those in class 3 (Table 2 and Figure 1b).
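The evaluation scheme described in the Methods (a 70%/30% train/test split plus 10-fold cross-validation with five repetitions, scored with RMSE, R2, and MAE) was run in R with caret; the following is only an illustrative sketch of the same scheme in Python with scikit-learn, on simulated genotype data, since the cohort data are not public.

```python
# Sketch of the evaluation scheme: 70%/30% split, 10-fold CV repeated
# five times, scored with RMSE, R2, and MAE. The authors used R's caret;
# this version uses scikit-learn, and the random data below stands in
# for the (unavailable) cohort genotypes and ADAOO values.
import numpy as np
from sklearn.model_selection import train_test_split, RepeatedKFold, cross_validate
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(71, 10)).astype(float)            # toy genotypes (0/1/2)
y = 45 + X @ rng.normal(0, 2, size=10) + rng.normal(0, 1, 71)  # toy ADAOO (years)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)
scores = cross_validate(
    GradientBoostingRegressor(random_state=1), X_tr, y_tr, cv=cv,
    scoring=("neg_root_mean_squared_error", "r2", "neg_mean_absolute_error"),
)
rmse = -scores["test_neg_root_mean_squared_error"].mean()
r2 = scores["test_r2"].mean()
mae = -scores["test_neg_mean_absolute_error"].mean()
print(f"CV RMSE={rmse:.2f}  R2={r2:.2f}  MAE={mae:.2f}")
```

A gradient-boosted regressor stands in here for the boosting learners (xgbLinear, glmboost) compared in Table 2; any of the other candidate algorithms could be dropped into the same `cross_validate` call.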

Figure 1c depicts variable importance plots for the xgbLinear, glmnet, and glmboost algorithms. Our results suggest that, for the xgbLinear algorithm, which is more suitable for assessing ADAOO in the training data set, years of education (Schooling), the genetic variants GPR20-rs36092215 and PYNLIP-rs2682585, and sex (i.e., being male) are the most important predictors of ADAOO (Figure 1c, left). For the glmnet and glmboost algorithms, which outperform the other alternatives when predicting ADAOO for unseen data, the most important predictors are the genetic variants APOE-rs7412, FCRL5-rs16838748, GPR20-rs36092215, IFI16-rs62621173, AOAH-rs12701506, and PYNLIP-rs2682585, followed by years of education (Figure 1c, center and right).

Table 3 presents the performance measures for the collection of ML algorithms used to predict AOO in individuals of the sAD cohort. The training and testing data sets consisted of 40 and 14 individuals, respectively. When predicting AOO in the training data set, the svmLinear and xgbLinear ML algorithms perform reasonably well, with the latter outperforming all others in terms of the RMSE, R2, and MAE performance measures. Despite its remarkable performance in the training data set, the predictive power of the xgbLinear algorithm is rather weak in unseen data (i.e., possible overlearning). Thus, the svmLinear algorithm seems to be a better alternative. On the other hand, when evaluating the performance of these ML algorithms for the testing data set, the lasso algorithm outperforms the other alternatives in terms of the RMSE and R2, while the glmnet algorithm does so in terms of the MAE (Table 3); in contrast to xgbLinear, these ML algorithms remain strong learners in unseen data. Our results indicate that these ML algorithms' performance can be grouped into three classes.
For the training data set, class 1 comprises the bstTree, glmboost, rf, and svmRadial algorithms (Figure 2a; yellow); class 2 is constituted by the xgbTree, svmPoly, qrf, svmLinear, svmLinear2, lasso, glmnet, and xgbLinear algorithms (Figure 2a; red); and class 3 by the treebag, knn, rpart1SE, rpart, and rpart2 algorithms (Figure 2a; blue). In the testing data set, the glmboost, xgbTree, rf, svmRadial, and bstTree algorithms belong to class 1 (Figure 2b; yellow); the svmPoly, svmLinear, svmLinear2, lasso, and glmnet algorithms belong to class 2 (Figure 2b; red); and treebag, rpart, rpart1SE, rpart2, and qrf constitute class 3 (Figure 2b; blue). Overall, the best-performing algorithms are grouped into class 2 for both the training and testing data sets; the xgbLinear algorithm outperforms all other alternatives for the training data set (Table 3 and Figure 2a), while the lasso and glmnet algorithms seem to be the best options for unseen data (Table 3 and Figure 2b).

Figure 2c depicts variable importance plots for the svmLinear, lasso, and glmnet algorithms. We identified that, for the svmLinear and lasso algorithms, the most important predictors of ADAOO are the variant HERC6-rs7677237, years of education, and the variants GPR45-rs35946826, NFATC1-rs754093, FRAS1-rs6835769, MAGI3-rs61742849, and CENPJ-rs17081389 (Figure 2c, left and center). Interestingly, under the svmLinear and lasso ML algorithms, sex is a seemingly significant predictor of ADAOO. In terms of variable importance, the glmnet ML algorithm yields results similar to those of the svmLinear and lasso algorithms, but highlights the relevance of variants GPR45-rs35946826, MAGI3-rs61742849, C16orf96-rs17137138, and C3orf20-rs34230332, and the small contribution of sex and years of education to ADAOO in unseen individuals with sAD (Figure 2c, right).
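The performance classes shown in Figures 1 and 2 follow from combining PCA and K-means on each algorithm's performance profile, as described in the Methods. A minimal sketch (Python/scikit-learn rather than the authors' R packages NbClust and paran; the performance numbers below are invented, not those of Tables 2 and 3):

```python
# Group ML algorithms into performance classes: standardize each
# algorithm's (RMSE, R2, MAE) profile, project onto two principal
# components, then cluster with K-means (K = 3, as in Figures 1 and 2).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

algos = ["xgbLinear", "glmboost", "glmnet", "rf", "knn",
         "rpart", "svmRadial", "treebag", "gbm"]
perf = np.array([  # toy [RMSE, R2, MAE] rows, one per algorithm
    [2.1, 0.85, 1.6], [2.4, 0.80, 1.9], [2.5, 0.79, 2.0],
    [2.2, 0.84, 1.7], [4.8, 0.40, 3.9], [4.5, 0.45, 3.6],
    [3.3, 0.62, 2.7], [3.5, 0.60, 2.9], [2.3, 0.82, 1.8],
])

coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(perf))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)
for algo, lab in sorted(zip(algos, labels), key=lambda t: t[1]):
    print(f"class {lab + 1}: {algo}")
```

In the paper, the number of clusters and components was chosen with NbClust and paran rather than fixed a priori; K = 3 is hard-coded here only to mirror the three classes reported in the Results.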
Figure 3 shows the results of our resampling strategy for evaluating the stability of each predictor's variable importance in the best-performing ML algorithm. When predicting ADAOO in individuals carrying the E280A mutation, the most important predictor is, by far, the APOE-rs7412 genetic variant, and the least essential predictors are sex, the genetic variant RC3H1-rs10798302, and years of education (Figure 3a).

Variable Importance: Stability and Relationship with Effect on ADAOO
In individuals with sAD, the most important ADAOO predictor is the genetic variant GPR45-rs35946826, followed by variants MAGI3-rs61742849, C16orf96-rs17137138, and C3orf20-rs34230332. Interestingly, sex and years of education (not shown) are among the least important predictors (Figure 3b). Variable importance bootstrap-based distributions are provided in Figures S1 and S2 (Supplementary Materials). Figure 4 shows scatterplots between β and variable importance for predicting ADAOO (Tables 2 and 3), confirming that, in contrast to fAD, the essential predictors of ADAOO in sAD correspond to several genetic variants of small effect [29,30]. See Table 1 for more details.
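The bootstrap-based stability analysis behind Figure 3 and Figures S1 and S2 can be sketched as follows. This illustrative Python version uses a random forest's impurity-based importance in place of caret's varImp, B = 50 instead of the paper's B = 1000, and simulated data.

```python
# Empirical-bootstrap stability of variable importance: draw B random 70%
# training subsets, refit the chosen model on each, record each predictor's
# importance, and take the 2.5% / 97.5% percentiles as a 95% CI.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p, B = 71, 6, 50
X = rng.integers(0, 3, size=(n, p)).astype(float)          # toy genotypes
y = 45 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, n)   # predictors 0 and 1 matter

imps = np.empty((B, p))
for b in range(B):
    idx = rng.choice(n, size=int(0.7 * n), replace=False)  # fresh 70% training set
    model = RandomForestRegressor(n_estimators=100, random_state=b)
    imps[b] = model.fit(X[idx], y[idx]).feature_importances_

lo, hi = np.percentile(imps, [2.5, 97.5], axis=0)          # per-predictor 95% CIs
for j in range(p):
    print(f"predictor {j}: 95% CI = [{lo[j]:.3f}, {hi[j]:.3f}]")
```

A narrow interval far from zero (as for the first two simulated predictors) corresponds to a stably important variable, the pattern reported for APOE-rs7412 in fAD and GPR45-rs35946826 in sAD.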


Discussion
Machine learning (ML) algorithms have recently caught the scientific community's attention because of their flexibility, ease of use, and ability to learn from the data provided [55,56]. Via ML, it has been possible to develop models to identify individuals more susceptible to developing common and rare diseases [58][59][60][61][62][63]67,[88][89][90][91][92][93] and to determine diverse phenotypic response profiles in infectious diseases [94][95][96]. Considering that ML- and computational-based models have the potential to overcome the limitations of currently established clinical models for the diagnosis and follow-up of neurodegenerative diseases, including AD [97], here we studied the feasibility of ML algorithms for predicting Alzheimer's disease age of onset (ADAOO) in individuals from the Paisa genetic isolate. We argue that these ML-based predictive models will improve our understanding of the disease and provide a more accurate and precise definition of the landmarks of the natural history of AD.
We previously identified protective (β > 0; Table 1) and harmful (β < 0; Table 1) ADAOO-modifying variants of significant effect in this community from whole-exome genotyping and whole-exome sequencing data [35,36], using linear mixed-effects models and some ML methods [77]. Thus, the presence of the APOE*E2 allele alone delays ADAOO by up to ~12 years in PSEN1 E280A mutation carriers. Furthermore, this same allele delays ADAOO by up to ~17 years when included in an AD oligogenic model (Table 1) [36]. Subsequent analysis led to the development of a classification tree using advanced recursive partitioning to determine whether individuals carrying this mutation would develop early-onset or late-onset familial AD [36]. Following a similar approach, our group identified ADAOO modifier variants in individuals with sporadic AD (Table 1) [35].
After evaluating several ML-based predictive algorithms for ADAOO in individuals suffering from the most aggressive form of AD (Figure 1 and Table 2) and in individuals with sporadic AD (Figure 2 and Table 3), we identified that the glmboost and glmnet algorithms perform best for predicting ADAOO in unseen data for each cohort, respectively.
These ML-based predictive models showed promising results that can be easily extended to the clinical setting [98]. In particular, the glmboost algorithm yielded MAE values below 4% and RMSE values of ~4 in E280A PSEN1 AD (Table 2), while the glmnet algorithm yielded MAE values below 1% and RMSE values < 1 in sAD (Table 3), suggesting that predicting AOO in these cohorts is feasible. Using these ML-based ADAOO predictive models, AD diagnosis could be made earlier, and potential treatments could be provided long before symptoms begin to appear.
Analysis of variable importance shows that the most relevant ADAOO predictors in fAD are the variants APOE-rs7412, FCRL5-rs16838748, GPR20-rs36092215, IFI16-rs62621173, AOAH-rs12701506, and PYNLIP-rs2682585 (Figures 1b and 3a). Furthermore, the protective variants APOE-rs7412, GPR20-rs36092215, and FCRL5-rs16838748 both have the highest effect on ADAOO and are the most important predictors of ADAOO, while the variants TRIM22-rs12364019, IFI16-rs62621173, and AOAH-rs12701506 both have the most harmful effect on ADAOO and are among its most important predictors (Figure 4a). Comparing these results with those of previous models predicting AD status (early- vs. late-onset) [36] shows some discrepancies in how the genetic variants are ranked and in the relevance of demographic information (i.e., sex and years of education) for predicting AD status. Although predicting AD status may be of interest in some clinical settings, the use of ML-based predictive algorithms for ADAOO is a step forward in both our understanding of the disease and our goal of providing timely clinical care to individuals from this community. While AD cannot be cured and there is currently no way to stop or slow its progression, our approach offers the possibility of treating symptoms several years before they begin to appear [4,99,100] under an individually tailored biomarker scheme, rather than a one-size-fits-all population-average strategy [99][100][101], while taking individual variability into account. Although our results can certainly be used to move AD research in this direction, it is also important to consider the legal implications and the preparation that health providers, neurologists, and centers specializing in AD and neurodegeneration must have in order to interpret these findings and provide proper counseling to patients and their families [102][103][104].
Another challenge in the years to come is to significantly reduce the misinformed conclusions produced by ML methods in the absence of clinical domain expertise [105]. In this regard, a deep understanding of the clinical background of AD, of how ML methods operate, and of how the results can be interpreted and translated to patients and their relatives is crucial [57].
Variants GPR45-rs35946826 and MAGI3-rs61742849 both have a more harmful effect on ADAOO and are the most important predictors of ADAOO in individuals with sAD (Figure 4b). Interestingly, the harmful effect on ADAOO of variants MYCBPAP-rs61749930 and EBLN1-rs838759 differs from that of other variants, but their importance for predicting ADAOO is lower, while variants CHGB-rs236150 and WDR46-rs3130257 accelerate ADAOO and have higher variable importance (Figure 4b). Among the protective genetic variants, the highest effect is produced by OPRM1-rs675026, followed by HERC6-rs7677237 and C3orf20-rs34230332, with the former being the least important. Intriguingly, the variant C16orf96-rs17137138 is the most important ADAOO predictor despite its small effect (Figure 4b).
In summary, we explored the feasibility of ML algorithms for predicting ADAOO using demographic and genetic data in individuals from the world's most extensive pedigree segregating a severe form of AD, caused by a fully penetrant mutation in the PSEN1 gene, and in individuals with sAD inhabiting the same geographical region. Based on the RMSE, MAE, and R2 performance measures, our results indicate that ML algorithms are a feasible and promising alternative for assessing ADAOO in these individuals. Interestingly, the most important predictors in these ML-based predictive models were genetic variants, which makes it possible to assess ADAOO at the individual level and opens new personalized medicine and predictive genomics alternatives for AD [98][99][100][101].
Future studies should assess the performance of the ML-based predictive models for ADAOO presented herein on out-of-sample data (i.e., determine how closely the model predicts ADAOO in a patient with known genetic data who was not part of our cohorts) and should pursue the development of ML-based models of disease progression [38,50,51,60]. Ultimately, these models could provide an easy-to-use platform, with potential application in the clinical setting, yielding early and accurate estimates of ADAOO and of the evolution of AD in individuals with a family history of the disease.