A Machine Learning Approach to Parkinson’s Disease Blood Transcriptomics

The increased incidence and the significant health burden associated with Parkinson's disease (PD) have stimulated substantial research efforts towards the identification of effective treatments and diagnostic procedures. Despite technological advancements, a cure is still not available and PD is often diagnosed long after onset, when irreversible damage has already occurred. Blood transcriptomics represents a potentially disruptive technology for the early diagnosis of PD. We used transcriptome data from the PPMI study, a large cohort study with early PD subjects and age-matched controls (HC), to perform the classification of PD vs. HC in around 550 samples. Using a nested feature selection procedure based on Random Forests and XGBoost, we reached an AUC of 72% and found 493 candidate genes. We further discuss the importance of the selected genes through a functional analysis based on GO terms and KEGG pathways.


Introduction
Parkinson's disease (PD) is a chronic, degenerative disease of the central nervous system with a pattern of incidence that increases with age; as the population ages, its prevalence is therefore expected to rise.

PPMI Data
We downloaded PPMI whole blood transcriptome data from the LONI Image and Data Archive (IDA) (data downloaded in July 2021). From the available set of sequenced samples, we selected 579 samples collected from different individuals, namely 390 subjects in the early PD cohort and 189 age-matched subjects in the HC cohort; the dataset therefore contained roughly twice as many PD cases as HCs. Each sample had expression values (read counts) for a total of around 60,000 transcripts. The early PD cohort included subjects with PD that were not treated with dopaminergic medications, that were not carriers of 'LRRK2', 'GBA' or 'SNCA' mutations, and that did not have a first-degree relative with one or more mutations. Sequence data had been aligned to GRCh37 (hs37d5) by STAR (v2.4K) [22] using exon-exon junctions from GENCODE v19, and gene count data had been obtained via featureCounts [23] with the same GENCODE annotations. Samples that failed quality control were excluded [24].
Subject metadata downloaded from the PPMI website included biological variables such as age, sex and clinical site; clinical measures of motor symptoms, such as indicators of tremor dominant (TD) or postural instability gait difficulty (PIGD) phenotypes; measures of non-motor symptoms, such as categorical REM sleep behavior disorder (RBD); measures of cognitive impairment (CI), such as the Montreal Cognitive Assessment (MoCA) index adjusted for education; and a measure of olfactory function, the University of Pennsylvania Smell Identification Test (UPSIT) score. Additional metadata included technical variables such as RIN (RNA integrity number), percent usable bases, total number of reads and sequencing plate. Table 1 reports some statistics on the metadata.

Overview of the Methodology
Our computational workflow consists of three main phases: (i) a preprocessing phase, which was essential to manage the informative content of highly heterogeneous and computationally demanding data such as transcriptomes; (ii) a learning phase, which exploited a feature importance evaluation embedded in a Random Forest (RF) classification procedure [25,26] and whose best features were used to feed an eXtreme Gradient Boosting (XGBoost) algorithm [27]; (iii) finally, an unbiased evaluation of classification performances and of the set of important features, obtained through a nested cross-validation scheme. A schematic overview of our workflow is presented in Figure 1. A detailed description of the processing steps is presented in the following methodological subsections. For all analyses we used R version 4.0.3, packages xgboost v1.6.0.1, caret v6.0-90, and Bioconductor packages DESeq2 v1.30.1, limma v3.46.0, enrichR v3.0, AnnotationDbi v1.52.0, and org.Hs.eg.db v3.12.0. The code used to conduct this research is available upon request.

Empowering Informative Content of Gene Expression Values
The first phase of our workflow consists of multiple preprocessing procedures. This phase is essential given the large number of features, namely gene expression values based on the GENCODE v19 comprehensive annotation. A number of label-independent filtering steps, where the labels are "PD" and "HC", were required to extract informative content.
First, we selected only transcripts corresponding to protein coding genes and long intergenic non-coding RNAs (lincRNAs). Second, we discarded 2667 transcripts driving technical variance [24], which left us with 18,727 protein coding genes and 7444 lincRNAs. Third, we removed lowly expressed genes by keeping only genes that had more than five counts in at least 10% of the individuals, which left us with 21,273 genes. Fourth, we estimated size factors, normalized the library size bias using these factors and performed independent filtering to remove lowly expressed genes using the mean of normalized counts as a filter statistic. This left us with 12,612 genes. Finally, we applied a variance stabilizing transformation to accommodate the problem of unequal variance across the range of mean values. We used DESeq2 to perform these steps [28].
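The count-based filtering and size-factor estimation above can be sketched as follows. This is an illustrative Python/NumPy translation of the logic (the actual analysis used DESeq2 in R); the `size_factors` function mirrors DESeq2's median-of-ratios approach, and the toy count matrix is made up:

```python
import numpy as np

def filter_low_expression(counts, min_count=5, min_frac=0.10):
    """Keep genes (rows) with more than `min_count` reads in at
    least `min_frac` of the samples (columns)."""
    n_samples = counts.shape[1]
    keep = (counts > min_count).sum(axis=1) >= min_frac * n_samples
    return counts[keep], keep

def size_factors(counts):
    """DESeq2-style median-of-ratios size factors: per sample, the
    median ratio to the gene-wise geometric mean (genes containing
    zeros are excluded from the median, as in DESeq2)."""
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts.astype(float))
    log_geo_means = log_counts.mean(axis=1)   # -inf for genes with any zero
    finite = np.isfinite(log_geo_means)
    ratios = log_counts[finite] - log_geo_means[finite, None]
    return np.exp(np.median(ratios, axis=0))

# Toy matrix: 4 genes x 2 samples; the second sample is sequenced twice as deeply
counts = np.array([[10, 20], [30, 60], [5, 10], [0, 0]])
filtered, keep = filter_low_expression(counts, min_count=5, min_frac=0.5)
sf = size_factors(filtered)
```

Dividing each column by its size factor would then remove the library-size bias before the variance stabilizing transformation.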
After these steps, we were left with a total of 548 samples. We then removed confounding effects due to sex and RIN value with limma. Thus, for the subsequent analyses we considered a dataset including 548 subjects described by 12,612 genes.

Differential Expression Analysis
Before moving to the second phase of our workflow, namely the learning phase (see next section), we performed differential expression (DE) analysis, which is a classical, univariate approach towards the identification of biomarkers from RNASeq data. We also tested the performance of our ML approach (XGBoost) when using as input the set of DE genes obtained with DESeq2 instead of the set of genes selected with RF. In the discussion we will contrast the results of this univariate approach with the results of our multivariate machine learning approach. For DE analysis we used DESeq2 [28], a popular tool. As is standard procedure, we used as input to the algorithm the counts prior to independent filtering, batch correction and variance stabilization, and defined a design matrix with four variables: the normalized RIN value, factor site, factor sex and the disease label. For the comparison between PD and HC, DESeq2 returns a positive fold change value to indicate an increase of expression of a gene in PD subjects vs. HC, and a negative fold change to indicate a decrease in expression. It also uses a shrinkage procedure to combine information from multiple genes, but its approach is univariate, as it tests each gene individually for DE using a negative binomial generalized linear model. DESeq2 corrects for multiple testing with the Benjamini-Hochberg procedure. Genes with adjusted p-value < 0.05 are called significantly differentially expressed in the two classes. We evaluated the fold change of genes with its associated error and adjusted p-value and compared the results with a multivariate analysis that uses Machine Learning algorithms.
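The Benjamini-Hochberg correction applied by DESeq2 can be illustrated with a short sketch. This is a generic Python re-implementation equivalent to R's `p.adjust(..., method = "BH")`, not DESeq2's internal code, and the p-values are made up:

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values: sort, scale the i-th
    smallest p-value by n/i, then enforce monotonicity from the
    largest p-value downwards."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(scaled[::-1])[::-1]  # running min from the top
    out = np.empty(n)
    out[order] = np.minimum(adj, 1.0)
    return out

adj = bh_adjust([0.005, 0.01, 0.03, 0.04])
# genes with adj < 0.05 would be called significantly DE
```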

A Robust Learning Scheme
After performing DE analysis, we moved to the second phase of our workflow, the learning phase.
Our filtering procedure described in Section 2.3 had already significantly reduced the amount of gene expression to consider. Nonetheless, we designed and implemented an additional feature selection procedure (nested within the learning phase) to further reduce the number of genes with the two-fold goal of enhancing classification performances and optimizing model interpretability.
Within a repeated stratified (to tackle the control-patient mismatch) 10-fold cross-validation framework (20 iterations), we trained multiple RF models (100 repetitions, where each repetition used a different seed of the random generation process) to evaluate permutation feature importance measures. We chose RF for two main reasons: on the one hand, RF is easy to tune as it only depends on two parameters, namely the number of trees to grow and the number of features randomly selected at each split; on the other hand, RF is an extremely efficient algorithm on high dimensional data. Each forest was grown using 1000 trees, a value sufficient to allow the algorithm to reach a stable plateau of the out-of-bag internal error. The number of features considered at each split was √f, with f being the overall number of genes, which is the default value for this parameter. As already mentioned, another important advantage of the RF classifier is its embedded feature importance evaluation: during the training phase, the algorithm can assess how much each feature decreases the impurity of a tree, or the likelihood of incorrect classification of a new instance of a random variable, and then average over all trees [26]. Using this embedded feature importance procedure, we determined the overall feature importance ranking by averaging over the 100 repetitions. Then, a subset of size C of the most important features was used to train an XGBoost model; the XGBoost classification performance was evaluated on the validation set, for the twenty 10-fold cross validation iterations, in order to obtain an unbiased performance evaluation. Like RF, the XGBoost algorithm belongs to the family of ensemble learning approaches, which combine the predictions of several weak models to obtain a more robust model. While RF relies on bagging (Bootstrap aggregation), XGBoost exploits the Gradient Boosting framework.
In the Gradient Boosting method, new models are applied to predicting residuals or errors of previous models and then added together to obtain the final predictive model. This approach implements a gradient descent algorithm to minimize the loss when including new models [27].
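The rank-with-RF-then-boost scheme can be sketched as follows. This is an illustrative Python version on synthetic data using scikit-learn, with `GradientBoostingClassifier` standing in for XGBoost, impurity importance standing in for permutation importance, and fold/tree counts reduced with respect to the values used in the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the expression matrix (samples x genes);
# the ~2:1 class weights roughly mimic the PD-vs-HC imbalance.
X, y = make_classification(n_samples=200, n_features=100, n_informative=10,
                           weights=[0.35, 0.65], random_state=0)

C = 20  # number of top-ranked genes fed to the boosted model
aucs = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # 10 folds in the paper
for train_idx, val_idx in cv.split(X, y):
    # Rank features on the training fold only, to keep the evaluation unbiased
    rf = RandomForestClassifier(n_estimators=100, random_state=0)  # 1000 trees in the paper
    rf.fit(X[train_idx], y[train_idx])
    top = np.argsort(rf.feature_importances_)[::-1][:C]
    # Train the booster on the C selected features, score the held-out fold
    gb = GradientBoostingClassifier(random_state=0)
    gb.fit(X[train_idx][:, top], y[train_idx])
    prob = gb.predict_proba(X[val_idx][:, top])[:, 1]
    aucs.append(roc_auc_score(y[val_idx], prob))

median_auc = float(np.median(aucs))
```

Because feature ranking happens inside each training fold, the held-out AUC is never contaminated by the selection step.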
Overall, our procedure is very robust because, in addition to the high number of iterations implemented, we also use two different classification algorithms in the training and test phases, which makes the results independent of the model. Then, to compare the performance of the ML approach with that of a simpler XGBoost classification algorithm using as input features the set of DE genes obtained with DESeq2, we trained the latter algorithm on 90% of the data and tested it on 10%.
Finally, we tested whether the predicted probability of the algorithm differed between PD subjects with different endo-phenotypes: (i) PDs with MoCA ≤ 26 and PDs with MoCA higher than 26; (ii) PDs with RBD and PDs without RBD; (iii) PDs with TD and PDs with PIGD or undetermined; (iv) PDs with normosmia and PDs with hyposmia or anosmia; (v) PD subjects belonging to different age categories, namely age ≥ 56 and age < 56.
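The group comparison above can be sketched as follows; this is an illustrative Python version on synthetic predicted probabilities (the two-sample Wilcoxon rank-sum test corresponds to SciPy's `mannwhitneyu`; the group sizes and Beta distributions are made-up stand-ins):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical predicted PD probabilities for two endo-phenotype groups
prob_age_lt56 = rng.beta(7, 3, size=80)    # stand-in for PD subjects with age < 56
prob_age_ge56 = rng.beta(8, 2, size=120)   # stand-in for PD subjects with age >= 56

# Two-sided Wilcoxon rank-sum (Mann-Whitney U) test on the two groups
stat, pval = mannwhitneyu(prob_age_ge56, prob_age_lt56, alternative="two-sided")
```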

Performance Evaluation
The last phase of our workflow is performance evaluation. A binary classification problem has only two class labels; therefore, the resulting model decisions can fall into four categories: true positives (TP) when the model correctly predicts the positive class, erroneous positive predictions (false positives, FP) and, analogously, true negatives (TN) and false negatives (FN).
Given these four cases, one can define several metrics; in particular, we considered here sensitivity, specificity, AUC, F1 score, accuracy and balanced accuracy. Sensitivity and specificity evaluate how well the model performs on the positive and the negative class, respectively. The other metrics provide an overall performance evaluation. Although these "overall" metrics are roughly equivalent, their values can ease the comparison of our results with the state-of-the-art.
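These metrics follow directly from the four counts; a minimal sketch in plain Python (the counts in the example call are hypothetical):

```python
def binary_metrics(tp, fp, tn, fn):
    """Classification metrics derived from the confusion-matrix counts."""
    sensitivity = tp / (tp + fn)              # recall on the positive (PD) class
    specificity = tn / (tn + fp)              # recall on the negative (HC) class
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "balanced_accuracy": balanced_accuracy,
            "f1": f1}

metrics = binary_metrics(tp=80, fp=30, tn=40, fn=10)  # hypothetical counts
```

With imbalanced classes, accuracy can stay high while specificity drops; the balanced accuracy averages the two per-class recalls and therefore exposes the gap.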

Evaluating the Informative Content of Transcriptomic Data
The first research question addressed by this work concerned the evaluation of the informative content provided by blood transcriptomic data. We first assessed the informative content through a univariate DE analysis and we found a total of 1368 up-regulated genes and 911 down-regulated genes with an adjusted p-value less than 0.05. Of the DE genes, however, only one gene, namely 'RAP1GAP', had a log fold change (lfc) higher than 0.5 in absolute value (lfc = −0.65 ± 0.15, adjusted p-value ≈ 10^−5). In general, the DE signal, except for this gene, was very low.
We then evaluated the informative content of blood transcriptomic data using a multivariate ML procedure, with the classification AUC as a performance measure. For each of the 20 runs of 10-fold cross-validation (CV), we collected the AUC values obtained at different thresholds C (equivalently, at a different number of genes) and interpolated these values to build a curve; the median of the 20 curves gives the cross-validation median AUC. Figure 3 shows the cross-validation median AUC with its mean absolute deviation for a different number C of input features. The maximum median AUC of 72%, with a mean absolute deviation of 1.5%, is reached with a number of input features equal to C = 493. Although classification results depend on the number of features (genes) used to learn the model, this analysis shows that the informative content remains stable and accurate over an extremely broad range of features. A detailed overview of the other classification metrics obtained using the previously mentioned 493 features is presented in Table 2. The model is generally accurate, as shown by "global" metrics (AUC, F1, accuracy); it is worth noting the performance drop revealed by the balanced accuracy, which reflects the data imbalance. The same consideration holds for the performance gap in terms of sensitivity and specificity.
We tested the performance of an XGBoost classification algorithm that used as input features the set of DE genes obtained with DESeq2. We obtained an AUC of 64%, considerably lower than the performance of our ML approach based on RF and XGBoost, which shows that a multivariate ML model can be more effective on this type of data than classical DE approaches.
A final note concerns the performance of the algorithm with respect to PD endo-phenotypes. The predicted probability of the algorithm differed between PD age categories: the average predicted probability was higher for PD individuals with age ≥ 56 (0.84 vs. 0.77 for age < 56; p-value 0.004, Wilcoxon test), while there was no significant difference between PD subjects belonging to the other considered endo-phenotypic classes.

Evaluating Gene Importance
As the RF feature importance procedure in principle returns a different feature ranking at each iteration (both because of the different cross-validation splits and because of the intrinsic RF variability), we designed an experiment to investigate which genes were the most important for classification. Given that the highest performance value was obtained with 493 features, within the cross-validation scheme we evaluated the probability that an input feature (gene) is one of the top 493 genes, see Figure 4. Among the most frequently selected genes, the 20 most important (according to the average importance ranking) are listed in Table 3; a list of the genes that were selected in at least 70% of the iterations is presented in Table A1. This list includes 434 protein coding genes and 61 lincRNAs. Table 3. List of the 20 most important protein coding genes and lincRNAs, ordered by importance. LincRNAs are marked with an asterisk. For each gene, four attributes are listed: (i) up-arrow/down-arrow: significant over-/under-expression in PD subjects compared to HC; (ii) HGNC (HUGO Gene Nomenclature Committee) symbol (or Ensembl ID when missing); (iii) average XGBoost importance over 20 runs of 5-fold cross validation; (iv) number of times that a gene is selected over 20 runs of feature selection. For a more complete list, including the genes that are selected 70% of the times, see Table A1.
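The selection-frequency computation can be sketched as follows; this is an illustrative Python version where random permutations stand in for the RF importance rankings produced in each CV iteration (gene, run and top-k counts are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_runs, top_k = 50, 20, 10

# One ranking per CV run; mark the top-k genes of each run as selected
selected = np.zeros((n_runs, n_genes), dtype=bool)
for r in range(n_runs):
    ranking = rng.permutation(n_genes)   # stand-in for the RF importance ranking
    selected[r, ranking[:top_k]] = True

freq = selected.mean(axis=0)             # per-gene selection probability over runs
stable = np.flatnonzero(freq >= 0.7)     # genes selected in at least 70% of runs
```

In the actual workflow, genes with a selection frequency of at least 70% form the stable list reported in Table A1.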

Gene Set Enrichment Analysis
We performed KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway and GO (Gene Ontology) functional annotation enrichment analysis with respect to biological processes, cellular components and molecular functions using enrichR [32] on the list of most frequent genes (Table A1). Figures 5-7 report all the resulting significant groups at a False Discovery Rate (FDR) < 0.05; no GO molecular function was significant.

A Robust Machine Learning Model
With the robust methodology we implemented, we identified a set of around 500 genes that could discriminate between PD and HC with an AUC of 72%. Over 20 runs of cross validation (Figure 3) the AUC had a slightly increasing pattern for increasing values of C, reached a maximum at C = 493 features, then slowly decreased. This behavior showed that the informative content of the selected genes was stable and accurate. While there was an imbalance between sensitivity and specificity, it was moderate and, if needed, this discrepancy could be mitigated with additional under-sampling or over-sampling techniques that could be embedded in the described methodology.
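A simple way to embed under-sampling in such a scheme is sketched below. This is an illustrative Python sketch, not part of the published workflow; the function name and the equal-class-size target are assumptions, and in practice it would have to be applied inside each training fold to avoid leakage:

```python
import numpy as np

def undersample(X, y, random_state=0):
    """Randomly drop majority-class samples so both classes end up with
    the same size (a simple mitigation of the ~2:1 PD-vs-HC imbalance)."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original sample order
    return X[keep], y[keep]

X = np.arange(30).reshape(15, 2)
y = np.array([0] * 5 + [1] * 10)       # 2:1 imbalance, like PD vs. HC
X_bal, y_bal = undersample(X, y)       # 5 samples per class remain
```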
Comparing our performance with the state-of-the-art is not straightforward because of the nature of the data and because of ongoing research in the area. A comparable study on a large cohort of 523 individuals performed on blood microarray gene expression data and using Support Vector Machines reports an AUC of 79% on the validation set and of 74% on the test set [17]. However in that study PD subjects with a positive family history were not excluded and most importantly PD patients were treated with dopaminergic medication. Dopaminergic medication alters gene expression and thus confounds the underlying signal: higher discriminative performances are to be expected but are misleading.
A multivariate study, not yet published [33] and performed on the same PPMI cohort as ours, used a multi-modal approach that combines the informative content of transcriptomics, clinico-demographic data, genome sequencing data and polygenic risk scores (PRS). To compare our results with theirs, we considered their transcriptomics-only model. They used Support Vector Machines (after testing and tuning 12 different ML algorithms) and divided the PPMI cohort at baseline into a training (70%) and a validation set (30%); they then tested the resulting model on independent data from the PDBP (Parkinson's Disease Biomarkers Program) cohort, although performances of the uni-modal model were not reported for this test set. After careful preprocessing, where they used limma to adjust for the additional covariates of sex, plate, age, ten principal components and percentage usable bases and then normalized counts, they used the significantly over- or under-expressed protein coding genes determined through logistic regression (p-value < 0.01) on the training set as input features to a Support Vector Machine classification algorithm. Using only transcriptome data they reached, on the validation set, an AUC of 79.73%, 73.89% accuracy, 54.60% balanced accuracy, 97% sensitivity and 12% specificity. When they combined transcriptomics with the other multi-modal data, using the union of the features as input, they reached after tuning an AUC of 85.03%, 75% accuracy, 68.09% balanced accuracy, 93% sensitivity and 43% specificity on the independent test set. By comparing the relative importances of the input features, they determined that the UPSIT score, as well as the PRS, contributed most to the predictive power of the model, although their accuracy was supplemented by many smaller-effect transcripts and risk SNPs.
The strength of our work is its high balanced accuracy in delineating cases and controls and its robustness. Our feature selection procedure identified a robust set of around 500 genes listed in Table A1 that may have some impact on PD biology.

Candidate Genes, GOs and KEGG Pathways
Accurate characterization of the selected genes and of significantly enriched gene sets is beyond the scope of this paper; however, we report a few comments on the enrichment analysis and a few notes on the selected genes.
Our analyses revealed a number of significant functions and pathways, some of which have already been linked to the pathogenesis of PD, such as oxidative stress, inflammation, mitochondrial and vesicular dysfunction, as well as associations between PD and diseases such as diabetes mellitus or inflammatory bowel disease (IBD) (see Figures 5-7). Oxidative stress plays an important role in the degeneration of dopaminergic neurons [34]; its involvement in PD is further substantiated by Reactive Oxygen Species (ROS) induced Parkinsonian models and elevated oxidative markers in clinical PD samples [35]. Glutathione (GSH) is a ubiquitous thiol tripeptide that protects against oxidative stress-induced damage by neutralizing reactive oxygen species; its deficiency has been identified as an early event in the progression of PD [36]. Inflammation is another important contributor to the pathogenesis of the disease [37]. Interestingly, our GO analysis has identified biological processes that involve neutrophils. A very recent meta-analysis studying the association between the neutrophil-to-lymphocyte ratio (NLR), a well-established indicator of the overall inflammatory status of the organism, and clinical characteristics in PD has demonstrated that PD patients have an altered peripheral immune profile [38]. Neuronal expression of major histocompatibility complex I (MHC-I) and II (MHC-II) also plays a neuroinflammatory role in PD [39,40]. The MHC gene family encodes molecules on the surface of cells that enable the immune system to recognize presented self- and foreign-derived peptides. MHC class II-positive microglia are a sensitive index of neuropathological change and are actively associated with damaged neurons and neurites in PD [41]. Mitochondrial dysfunction is another pathway that has been implicated in the pathophysiology of PD through both environmental exposure and genetic factors.
The discovery of the role of the PD familial genes 'PTEN'-induced putative kinase 1 ('PINK1') and parkin ('PRKN') in mediating mitochondrial degradation reaffirmed the importance of this process in PD aetiology [42]. Vesicular dysfunction is another known contributor to PD [43]. Finally, diabetes mellitus and inflammatory bowel diseases (IBD) are known PD risk factors. In fact, population-based cohort studies indicate that diabetes and IBD are associated with an increased PD risk of about 38% [44] and 22% [45], respectively.
A few notes on the set of selected genes follow. In Table 3 we reported the 20 most important protein coding genes and lincRNAs in our analysis. We included lincRNAs because long non coding RNAs assume various roles, including regulatory ones, and can thus modulate the expression of protein coding genes; they are also very relevant in neurobiology, as many are associated with neurological pathologies [9]. 'MYOM1', Myomesin 1, the most important gene, is a protein coding gene and is up-regulated in PD subjects. Notably, 'ENSG00000272688' (Lnc-MYOM1-4) falls within an intron of 'MYOM1' and is the fifth most important lincRNA; gene 'MYOM2' is also in the list of selected genes and was selected in all 20 repeated runs. Gene 'MYOM1' is significantly up-regulated in the human substantia nigra pars compacta of PD patients [46] and is also one of the most important genes in [33], together with 'SQLE', 'LGALS2', and 'NCR1'. The intersection between our set and theirs might be larger, as only 29 genes out of a much larger selected set are reported in that paper. Gene 'SLC25A20', Solute Carrier Family 25 Member 20, the second most important gene, was up-regulated in PD and was one of the nine PD biomarkers identified by Jiang et al. [47], who used a meta-analysis of microarray gene expression data from [17,48,49]. 'PTGDS', another gene in our set, was also one of these nine biomarkers. Six other genes in our set, 'SLC18B1', 'SLC25A3', 'SLC11A2', 'SLC25A25', 'SLC25A43' and 'SLC38A11', belong to the solute carrier (SLC) superfamily, one of the major sub-groups of membrane proteins in mammalian cells. Their role in neurodegenerative disorders is described thoroughly in [50]. 'NRM', the third most important gene, encodes the integral nuclear membrane protein Nurim, which plays a role in the suppression of apoptosis [51], and apoptosis is the main mechanism of neuronal loss in Parkinson's disease [52].
'PHF7', PHD Finger Protein 7, is a candidate gene for a PD risk locus identified with a meta-analysis of genome-wide association studies [53]. Both protein coding gene 'NUP50' (Nucleoporin 50) and lincRNA 'NUP50-DT' ('NUP50' divergent transcript) are in our gene list. 'CERS4', Ceramide Synthase 4, is involved in Sphingolipid metabolism and its relation to PD is described in [54]. Dysregulation of metabolic pathways by carnitine palmitoyl-transferase 1 'CPT1A' plays a key role in central nervous system disorders [55].
Gene 'RAP1GAP' has been identified by both the DE analysis and the ML methodology (it is selected 20 times over 20 repetitions) (see Table A1). This gene is under-expressed in PD subjects and has a role in orchestrating the development and maintenance of different populations of central and peripheral neurons [56].

Final Considerations
Two final comments. First, a classification algorithm that used as input features the DE genes found by DESeq2 showed much lower performances than those obtained with the set of features selected with the ML algorithm, thus confirming the validity of our methodology and the importance of using ML models with gene expression data from RNA sequencing of whole blood, where the signal is weak. Furthermore, we note that some of the genes selected by the ML algorithm are not DE between the PD and HC classes (see Table A1) but nonetheless carry a relevant signal.
Last, the different average predicted probability between subjects that fall into different age-of-onset classes (early-onset and late-onset PD subtypes) could reflect the heterogeneity of PD at different ages. In fact, it has been observed that PD patients with an older age of onset have more severe motor and non-motor burdens and a more widespread involvement of striatal structures [57].

Conclusions
We used a robust ML approach to make predictions on PD from whole blood expression data. The studied cohort included 390 early stage drug-naive PD subjects and 189 age-matched HCs. After careful preprocessing, including batch correction and independent gene filtering, we used a feature selection procedure based on RF and re-sampling and an XGBoost algorithm to evaluate PD vs. HC classification performances within a nested 10-fold cross validation scheme. We explored classification performances for different values of C, the number of features selected, and identified a set of around 500 genes listed in Table A1 that corresponded to maximum discriminative power. We also performed an enrichment analysis on this set of genes and identified significant GO terms and KEGG pathways, many of which are in line with the current literature on PD, although further analysis of these sets is needed and is outside the scope of our work. A strength of our methodology is its robustness. The balanced accuracy of our algorithm compares favorably with the state-of-the-art.
This area of research is cutting edge and requires further investigation. A possible extension of our work could be the evaluation of the predictive power of the selected set of genes on an independent dataset. We are also working on a multi-modal approach that combines transcriptome data with epigenomic data (and possibly other data) with the final aim of increasing the predictive performances of our model.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The pseudocode used for the analysis is reported as Algorithm A1 in Appendix A. R codes used to perform the preprocessing and the analysis are available upon request.

Conflicts of Interest: The authors declare no conflict of interest.
Appendix A

Algorithm A1 Pseudocode.
1: Let F be the total number of features
2: for r = 1 to 20 do
3:     Divide data into 10 stratified folds using random seed r
4:     for fold k = 1 to 10 do
5:         Set fold k as validation_set and the remaining 9 folds as training_set
6:         for s = 1 to 100 do
7:             Divide training_set into 5 stratified folds using random seed s
8:             Take 4 of the folds as the new training set
9:             Train a RF on this training set with 1000 trees
10:            for f = 1 to F do
11:                Set is_outlier[r,s,f] = 0
12:                Estimate importance[r,s,f]
13:            end for
14:            Evaluate th[r,s] = MEDIAN_f(importance[r,s,f]) + 1.5 * IQR_f(importance[r,s,f]), where MEDIAN_f denotes the median over the F values of f
15:            for f = 1 to F do
16:                is_outlier[r,s,f] = IFELSE(importance[r,s,f] > th[r,s], 1, 0)
17:            end for
18:        end for
19:        for f = 1 to F do
20:            Set percentage_outlier[r,f] = 0
21:            for s = 1 to 100 do
22:                percentage_outlier[r,f] += is_outlier[r,s,f]
23:            end for
24:        end for
25:        for C = 1 to 100 do
26:            Evaluate is_selected[r,f,C] = IFELSE(percentage_outlier[r,f] > C, 1, 0)
27:            Train XGBoost on the training_set using only features f with is_selected[r,f,C] = 1
28:            Estimate performance ROCAUC[r,k,C] on the validation_set
29:        end for
30:    end for
31: end for

Table A1. Complete list of the most frequent protein coding genes and lincRNAs, ordered by importance. LincRNAs are marked with an asterisk. For each gene, four attributes are listed: (i) no arrow, an upward-pointing arrow or a downward-pointing arrow indicate no significant DE between PD and HC, significant over-expression in PD subjects and significant under-expression in PD subjects, respectively; (ii) HGNC symbol (or Ensembl ID when missing); (iii) average importance over 20 runs of 5-fold cross validation; (iv) frequency of occurrence over 20 repetitions.