Article

A Machine Learning Approach to Parkinson’s Disease Blood Transcriptomics

1
Istituto Nazionale di Fisica Nucleare (INFN), Sezione di Bari, Via A. Orabona 4, 70125 Bari, Italy
2
Dipartimento di Scienze Mediche di Base, Neuroscienze e Organi di Senso, Università degli Studi di Bari Aldo Moro, Piazza G. Cesare 11, 70124 Bari, Italy
3
Dipartimento Interateneo di Fisica M. Merlin, Università degli Studi di Bari Aldo Moro, Via G. Amendola 173, 70125 Bari, Italy
4
Dipartimento di Farmacia-Scienze del Farmaco, Università degli Studi di Bari Aldo Moro, Via A. Orabona 4, 70125 Bari, Italy
5
Centro per le Malattie Neurodegenerative e l’Invecchiamento Cerebrale, Dipartimento di Ricerca Clinica in Neurologia, Università degli Studi di Bari Aldo Moro, Pia Fondazione Cardinale G. Panico, 73039 Tricase, Italy
6
Institute of Psychiatry, Psychology and Neuroscience, King’s College London, De Crespigny Park, London SE5 8AF, UK
7
Dipartimento di Bioscienze, Biotecnologie e Biofarmaceutica, Università degli Studi di Bari Aldo Moro, Via A. Orabona 4, 70125 Bari, Italy
8
Istituto di Biomembrane, Bioenergetica e Biotecnologie Molecolari, Consiglio Nazionale delle Ricerche, Via G. Amendola 122/O, 70126 Bari, Italy
9
Istituto di Nanotecnologia (NANOTEC), Consiglio Nazionale delle Ricerche, Via Monteroni, 73100 Lecce, Italy
10
Dipartimento di Scienze del Suolo, della Pianta e degli Alimenti, Università degli Studi di Bari Aldo Moro, Via A. Orabona 4, 70125 Bari, Italy
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
These authors contributed equally to this work.
Genes 2022, 13(5), 727; https://doi.org/10.3390/genes13050727
Submission received: 12 March 2022 / Revised: 16 April 2022 / Accepted: 18 April 2022 / Published: 21 April 2022
(This article belongs to the Special Issue Parkinson's Disease: Genetics and Pathogenesis)

Abstract

The increased incidence and the significant health burden associated with Parkinson’s disease (PD) have stimulated substantial research efforts towards the identification of effective treatments and diagnostic procedures. Despite technological advancements, a cure is still not available and PD is often diagnosed a long time after onset when irreversible damage has already occurred. Blood transcriptomics represents a potentially disruptive technology for the early diagnosis of PD. We used transcriptome data from the PPMI study, a large cohort study with early PD subjects and age matched controls (HC), to perform the classification of PD vs. HC in around 550 samples. Using a nested feature selection procedure based on Random Forests and XGBoost we reached an AUC of 72% and found 493 candidate genes. We further discussed the importance of the selected genes through a functional analysis based on GOs and KEGG pathways.

1. Introduction

Parkinson’s disease (PD) is a chronic, degenerative disease of the central nervous system with an incidence that increases with age; as the population ages, its burden is poised to increase [1]. Despite considerable research efforts, PD is incurable: available treatments can only help manage the symptoms, and diagnosis often occurs long after onset, following substantial loss of function of substantia nigra dopamine neurons [2].
Massively parallel analysis of cellular RNAs can provide an unbiased set of biomarkers of PD and can generate hypotheses about disease mechanisms. It may be particularly useful for decoding a disease with considerable environmental and epigenetic contributions not readily explained by variations in the genomic fingerprint such as PD [3]. Brain transcriptomics has already shown its potential to uncover the functional mechanisms at the basis of this disease although its signal is confounded by underlying differences in cell type composition and it can only be performed after death [4]. Whole blood transcriptomics represents a convenient and less invasive alternative to brain transcriptomics for early PD diagnosis, as blood is a readily accessible peripheral biofluid and blood and brain share significant transcriptional profile similarities [5,6] although more investigations are needed in this field. A number of experimental observations have shown molecular and biochemical changes in the blood cells of PD subjects [7,8] and RNA-sequencing experiments on blood leukocytes have revealed the diagnostic potential of long non coding RNAs (lncRNAs) [9]. Some studies have identified biomarkers from blood that are robust and have great potential for helping reduce misdiagnosis [10,11,12].
As high throughput technologies such as transcriptome sequencing can now generate huge amounts of biological data at relatively low costs, the processing and extraction of relevant signal requires the adoption of artificial intelligence methodologies. A number of Machine Learning (ML) approaches have been undertaken for PD classification that use as input vocal and gait [13] or neuroimaging [14] features, or genetic risk scores from Genome Wide Association Studies (GWAS) [15] and microarray transcriptional profiles [16,17]. We used advanced Machine Learning techniques for feature selection and classification of early (drug-naive) PD patients and healthy controls (HCs) using gene expression data from blood RNA sequencing.
For blood transcriptomics, experience suggests that large cohorts are needed, and that drug-naive patients should be used, as medications certainly affect gene expression [18]. Microarray assays for whole blood transcriptomics have been used to classify early stage drug-naive PDs vs. HCs [19,20] with a small number of samples (less than 50 PD subjects), while previous experiments with a large number of samples used PD subjects on dopaminergic medication [17].
Given the importance of using large cohorts of drug-naive patients, we used open access gene expression data from the Parkinson’s Progression Markers Initiative (PPMI), an international study that has enrolled the largest to date cohort of untreated PD patients (around 430 subjects) across multiple sites (www.ppmi-info.org/data accessed on 11 March 2022) [21].

2. Materials and Methods

2.1. PPMI Data

We downloaded PPMI whole blood transcriptome data from the LONI Image and Data Archive (IDA) (data downloaded in July 2021). From the available set of sequenced samples, we selected 579 samples collected from different individuals, namely 390 subjects in the early PD cohort and 189 age-matched subjects in the HC cohort. Therefore the dataset consisted of approximately twice as many PD cases as HCs. Each sample had expression values (read counts) for a total of around 60,000 transcripts. The early PD cohort included subjects with PD who were not treated with dopaminergic medications, who were not carriers of ‘LRRK2’, ‘GBA’ or ‘SNCA’ mutations, and who did not have a first-degree relative with one or more of these mutations. Sequence data had been aligned to GRCh37(hs37d5) by STAR (v2.4K) [22] using exon-exon junctions from GENCODE v19, and gene count data had been obtained via featureCounts [23] using the same GENCODE annotations. Samples that failed quality control were excluded [24].
Subject metadata that we downloaded from the PPMI website included biological variables such as age, sex, clinical site and clinical measures of motor symptoms such as indicators of tremor dominant (TD) or postural instability gait difficulty (PIGD), of non motor symptoms such as categorical REM sleep behavior disorder (RBD), of cognitive impairment (CI) such as the Montreal Cognitive Assessment or MoCA index (adjusted for education), and of olfactory function (UPSIT or University of Pennsylvania Smell Identification Test score). Additional metadata included technical variables such as, for instance, RIN (RNA integrity number), percent usable bases, total number of reads, sequencing plate. Table 1 reports some statistics on the metadata.
For up-to-date information on the study and for access to the data, visit www.ppmi-info.org accessed on 11 March 2022.

2.2. Overview of the Methodology

Our computational workflow consists of three main phases: (i) a first preprocessing phase, which was essential to manage the informative content of highly heterogeneous and computationally demanding data such as transcriptomes; (ii) a second learning phase, which exploited a feature importance evaluation embedded in a Random Forest (RF) classification procedure [25,26] and whose best features were used to feed an eXtreme Gradient Boosting (XGBoost) algorithm [27]; (iii) finally, an unbiased evaluation of classification performances and of the set of important features obtained through a nested cross-validation scheme. A schematic overview of our workflow is presented in Figure 1. A detailed description of the previously mentioned processing steps is presented in the following methodological subsections.
For all analyses we used R version 4.0.3, packages xgboost v1.6.0.1, caret v6.0-90, and Bioconductor packages DESeq2 v1.30.1, limma v3.46.0, enrichR v3.0, AnnotationDbi v1.52.0, and org.Hs.eg.db v3.12.0. The code used to conduct this research is available upon request.

2.3. Empowering Informative Content of Gene Expression Values

The first phase of our workflow consists of multiple preprocessing procedures. This phase is essential given the large number of features, namely gene expression values based on the GENCODE v19 comprehensive annotation. A number of label independent filtering steps, where the labels are “PD” and “HC”, were required to extract informative content.
First, we selected only transcripts corresponding to protein coding genes and long intergenic non-coding RNAs (lincRNAs). Second, we discarded 2667 transcripts driving technical variance [24], which left us with 18,727 protein coding genes and 7444 lincRNAs. Third, we removed lowly expressed genes by keeping only genes that had more than five counts in at least 10% of the individuals, which left us with 21,273 genes. Fourth, we estimated size factors, normalized the library size bias using these factors, and performed independent filtering to remove lowly expressed genes using the mean of normalized counts as a filter statistic. This left us with 12,612 genes. Finally, we applied a variance stabilizing transformation to address the problem of unequal variance across the range of mean values. We used DESeq2 to perform these steps [28].
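As a rough illustration of the third filtering step (keeping genes with more than five counts in at least 10% of the individuals), the following minimal Python sketch applies the rule to a toy count table. The function name `filter_low_counts`, the gene names, and the toy counts are hypothetical; the actual analysis was carried out in R with DESeq2, and this sketch does not reproduce that code.

```python
from typing import Dict, List


def filter_low_counts(counts: Dict[str, List[int]],
                      min_count: int = 5,
                      min_fraction: float = 0.10) -> Dict[str, List[int]]:
    """Keep genes with more than `min_count` reads in at least
    `min_fraction` of the samples (hypothetical helper)."""
    kept = {}
    for gene, values in counts.items():
        # Number of samples in which the gene exceeds the count threshold.
        n_expressed = sum(1 for v in values if v > min_count)
        if n_expressed >= min_fraction * len(values):
            kept[gene] = values
    return kept


# Toy count table (assumed data): 3 genes measured in 10 samples.
toy_counts = {
    "geneA": [0, 1, 0, 2, 0, 0, 1, 0, 0, 0],   # never above 5 counts
    "geneB": [9, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # above 5 counts in 1 of 10 samples
    "geneC": [50, 60, 40, 55, 70, 65, 45, 52, 58, 61],
}
filtered = filter_low_counts(toy_counts)
print(sorted(filtered))
```

In this toy table, `geneB` sits exactly at the 10% boundary and is therefore retained, while `geneA` is discarded.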
Afterwards, we used control samples to estimate the batch effect of the site, which we subsequently removed in both controls and cases [29] using limma [30]. To perform this step we removed subjects from sites with no control samples or with only one control sample, i.e., sites “14” (1 sample), “26” (16 samples), “55” (4 samples), and “59” (10 samples), see Figure 2.
After this step, we were left with a total of 548 samples. Then we removed further confounding effects due to sex and RIN value, again with limma. Thus, for the subsequent analyses we considered a database including 548 subjects described by 12,612 genes.
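The idea of estimating the site effect on controls only and then removing it from all samples can be sketched for a single gene as follows. This is a deliberately simplified stand-in (per-site mean shift relative to the grand mean of controls), not the linear-model procedure implemented in limma; the function name `remove_site_effect` and the toy values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import List


def remove_site_effect(values: List[float], sites: List[str],
                       is_control: List[bool]) -> List[float]:
    """Subtract from every sample the offset of its site, where offsets
    are estimated on control samples only (simplified sketch)."""
    controls_by_site = defaultdict(list)
    for v, s, c in zip(values, sites, is_control):
        if c:
            controls_by_site[s].append(v)
    grand_mean = mean(v for v, c in zip(values, is_control) if c)
    # Site offset = mean of that site's controls minus mean of all controls.
    offset = {s: mean(vs) - grand_mean for s, vs in controls_by_site.items()}
    # A site without controls has no offset (KeyError here), which mirrors
    # why such sites had to be excluded before this step.
    return [v - offset[s] for v, s in zip(values, sites)]


# Toy expression values of one gene (assumed data): two sites, 3 samples each.
values = [10.0, 12.0, 11.0, 14.0, 16.0, 15.0]
sites = ["s1", "s1", "s1", "s2", "s2", "s2"]
is_control = [True, True, False, True, True, False]
corrected = remove_site_effect(values, sites, is_control)
print(corrected)
```

After the correction, the control means of the two toy sites coincide, while the within-site differences between samples are preserved.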

2.4. Differential Expression Analysis

Before moving to the second phase of our workflow, namely the learning phase (see next section), we performed differential expression (DE) analysis, which is a classical univariate approach to the identification of biomarkers from RNA-Seq data. We also tested the performance of our ML approach (XGBoost) when using as input the set of DE genes obtained with DESeq2 instead of the set of genes selected with RF. In the discussion we contrast the results of this univariate approach with those of our multivariate machine learning approach. For DE analysis we used DESeq2 [28], a popular tool. As is standard procedure, we used as input to the algorithm the counts prior to independent filtering, batch correction and variance stabilization, and defined a design matrix with four variables: the normalized RIN value, factor site, factor gender and the disease label. For the comparison between PD and HC, DESeq2 returns a positive fold change value to indicate an increase of expression of a gene in PD subjects vs. HC, and a negative fold change to indicate a decrease in expression. It also uses a shrinkage procedure to combine information from multiple genes, but its approach is univariate as it tests each gene individually for DE using a negative binomial generalized linear model. DESeq2 corrects for multiple testing using a Benjamini–Hochberg adjusted p-value. Genes with adjusted p-value < 0.05 are called significantly differentially expressed in the two classes. We evaluated the fold change of each gene with its associated error and adjusted p-value, and compared the results with a multivariate analysis that uses Machine Learning algorithms.
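The Benjamini–Hochberg adjustment mentioned above can be illustrated with a short, self-contained Python sketch. It implements the standard step-up procedure on a toy list of p-values; the function name and the p-values are hypothetical, and the sketch does not reproduce the additional steps (such as independent filtering) that DESeq2 performs before adjustment.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that adjusted
    # values are monotone non-decreasing in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for four genes (assumed data).
p = [0.001, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(p)
significant = [a < 0.05 for a in adj]
print(adj, significant)
```

In this toy case three raw p-values are below 0.05, but only the first gene remains significant after adjustment, which shows how the correction limits false discoveries when many genes are tested.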

2.5. A Robust Learning Scheme

After performing DE analysis, we moved to the second phase of our workflow, the learning phase.
Our filtering procedure described in Section 2.3 had already significantly reduced the amount of gene expression to consider. Nonetheless, we designed and implemented an additional feature selection procedure (nested within the learning phase) to further reduce the number of genes with the two-fold goal of enhancing classification performances and optimizing model interpretability.
Within a repeated stratified (to tackle the control-patient mismatch) 10-fold cross-validation framework (20 iterations), we trained multiple RF models (100 repetitions, where each repetition used a different seed of the random generation process) to evaluate permutation feature importance measures. We chose RF for two main reasons: on the one hand, RF is easy to tune as it only depends on two parameters, namely the number of trees to grow and the number of features randomly selected at each split; on the other hand, RF is an extremely efficient algorithm on high dimensional data. Each forest was grown using 1000 trees, a sufficient value to allow the algorithm to reach a stable plateau of the out-of-bag internal error. The number of features randomly selected at each split was $\sqrt{f}$, with $f$ being the overall number of genes, which is the default value for this parameter. As already mentioned, another important advantage of the RF classifier is its embedded feature importance evaluation; during the training phase, the algorithm can assess how much each feature decreases the impurity of a tree, that is, the likelihood of incorrect classification of a new instance, and then average this quantity over all trees [26]. Using this embedded feature importance procedure, we determined the overall feature importance ranking by averaging over the 100 repetitions. Then, a subset of size C of the most important features was used to train an XGBoost model; the XGBoost classification performance was evaluated on the validation set, for the twenty 10-fold cross-validation iterations, in order to obtain an unbiased performance evaluation. Like RF, the XGBoost algorithm belongs to the family of ensemble learning approaches, which combine the predictions of several weak models to obtain a more robust model. While RF relies on bagging (bootstrap aggregation), XGBoost exploits the Gradient Boosting framework.
In the Gradient Boosting method, new models are applied to predicting residuals or errors of previous models and then added together to obtain the final predictive model. This approach implements a gradient descent algorithm to minimize the loss when including new models [27].
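To make the residual-fitting idea concrete, here is a minimal Python sketch of gradient boosting with squared-error loss, where each new weak model (a one-split decision stump) is fitted to the residuals of the current ensemble. This is a plain illustration of the general principle under simplifying assumptions (one feature, squared loss, no regularization); it is not the XGBoost algorithm, and all names and toy data are hypothetical.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, left value, right value)


def fit_stump(x: List[float], r: List[float]) -> Stump:
    """Fit a one-split regression stump to residuals r by least squares."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right) if right else lv
        sse = sum((ri - (lv if xi <= t else rv)) ** 2 for xi, ri in zip(x, r))
        if best is None or sse < best[0]:
            best = (sse, (t, lv, rv))
    return best[1]


def predict_stump(stump: Stump, xi: float) -> float:
    t, lv, rv = stump
    return lv if xi <= t else rv


def boost(x: List[float], y: List[float], n_rounds: int = 20,
          learning_rate: float = 0.5) -> List[float]:
    """Return training predictions after n_rounds of boosting."""
    pred = [sum(y) / len(y)] * len(y)  # start from the mean
    for _ in range(n_rounds):
        # For squared loss the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + learning_rate * predict_stump(stump, xi)
                for pi, xi in zip(pred, x)]
    return pred


# Toy data (assumed): a single feature that separates two groups.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
pred = boost(x, y)
mse_start = sum((yi - 0.5) ** 2 for yi in y) / len(y)
mse_final = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
print(mse_start, mse_final)
```

Each round shrinks the remaining residuals by the learning rate, so the training error decreases step by step as weak models are added to the sum.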
Overall, our procedure is robust because, in addition to the high number of iterations, we use two different algorithms (RF for feature ranking and XGBoost for classification), which makes the results less dependent on any single model. Then, to compare the performance of our ML approach with that of a simpler XGBoost classifier that uses as input features the set of DE genes obtained with DESeq2, we trained the latter on 90% of the data and tested it on 10%.
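The repeated stratified 10-fold splitting that underlies the learning scheme can be sketched as follows. The sketch assigns samples of each class to folds in round-robin fashion after shuffling, so that every fold keeps roughly the overall PD to HC ratio; the function name, the seeds, and the toy cohort sizes are assumptions for illustration, and the original analysis relied on R packages rather than this code.

```python
import random
from typing import List


def stratified_folds(labels: List[str], k: int, seed: int) -> List[List[int]]:
    """Split sample indices into k folds, preserving class proportions."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        # Deal the shuffled samples of this class to the folds in turn.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Toy cohort (assumed sizes): 40 PD and 20 HC samples, 2:1 imbalance.
labels = ["PD"] * 40 + ["HC"] * 20
all_repeats = [stratified_folds(labels, k=10, seed=rep) for rep in range(3)]
for folds in all_repeats:
    for fold in folds:
        n_pd = sum(1 for i in fold if labels[i] == "PD")
        n_hc = len(fold) - n_pd
        print(n_pd, n_hc)
```

Changing the seed at each repetition yields a different partition, which is what allows the performance to be summarized over many independent splits.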
Finally, we tested whether the predicted probability of the algorithm differed between PD subjects with different endo-phenotypes: (i) MoCA ≤ 26 and MoCA higher than 26; (ii) PDs with RBD and PDs without RBD; (iii) PDs with TD and PDs with PIGD or undetermined; (iv) PDs with Normosmia and PDs with Hyposmia or Anosmia; (v) PD subjects belonging to different age categories, namely age ≥ 56 and age < 56.

2.6. Performance Evaluation

The last phase of our workflow is performance evaluation. A binary classification problem has only two class labels; therefore, the resulting model decisions can fall into four categories: true positives (TP) when the model correctly predicts the positive class, erroneous positive predictions (false positives, FP) and, analogously, true negatives (TN) and false negatives (FN).
Given these four cases, one can define several metrics; in particular, we considered here [31]:
  • Accuracy
    $\dfrac{TP + TN}{TP + TN + FP + FN}$;
  • Sensitivity
    $\dfrac{TP}{TP + FN}$;
  • Specificity
    $\dfrac{TN}{TN + FP}$;
  • Balanced Accuracy
    $\dfrac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2}$;
  • F1
    $\dfrac{2\,TP}{2\,TP + FP + FN}$;
  • Area Under the Receiver Operating Characteristic (ROC) Curve (AUC), where the ROC curve plots sensitivity against 1 − specificity (the false positive rate) as the decision threshold varies.
Sensitivity and specificity evaluate how well the model performs on the positive and the negative class, respectively. The other metrics provide an overall performance evaluation. Although these “overall” metrics are roughly equivalent, their values can ease the comparison of our results with the state-of-the-art.
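The metrics listed above can be computed directly from the four confusion matrix counts, as in the following Python sketch. The AUC is computed here through its rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one), which is equivalent to the area under the ROC curve. Function names and the toy counts and scores are hypothetical.

```python
from typing import Dict, List


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the threshold-dependent metrics from confusion matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def auc_from_scores(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy confusion matrix (assumed): 100 positives and 50 negatives.
metrics = classification_metrics(tp=90, tn=30, fp=20, fn=10)
# Toy scores (assumed): three positives and one negative.
auc = auc_from_scores([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 1])
print(metrics, auc)
```

With an imbalanced toy sample such as this one, accuracy (0.80) exceeds balanced accuracy (0.75), because the larger positive class dominates the overall count of correct predictions.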

3. Results

3.1. Evaluating the Informative Content of Transcriptomic Data

The first research question addressed by this work concerned the evaluation of the informative content provided by blood transcriptomic data. We first assessed the informative content through a univariate DE analysis and found a total of 1368 up-regulated genes and 911 down-regulated genes with an adjusted p-value less than 0.05. Of the DE genes, however, only one gene, namely ‘RAP1GAP’, had a log fold change (lfc) higher than 0.5 in absolute value (lfc = −0.65 ± 0.15, adjusted p-value ∼ $10^{-5}$). In general, the DE signal, except for this gene, was very low.
We then evaluated the informative content of blood transcriptomic data using a multivariate ML procedure and the classification AUC as a performance measure, see Figure 3.
Figure 3 shows the cross-validation median AUC with its mean absolute deviation for different numbers C of input features. The maximum median AUC of 72%, with a mean absolute deviation of 1.5%, is reached with C = 493 input features. Although classification results depend on the number of features (genes) used to learn the model, this analysis shows that the performance remains stable over an extremely broad range of C. The other classification metrics obtained using these 493 features are reported in Table 2.
The model is generally accurate, as shown by “global” metrics (AUC, F1, accuracy); it is worth noting the performance drop revealed by the balanced accuracy, which reflects the data imbalance. The same consideration holds for the performance gap in terms of sensitivity and specificity.
We tested the performance of an XGBoost classification algorithm that used as input features the set of DE genes obtained with DESeq2. We obtained an AUC of 64%, which is considerably lower than the performance of our ML approach based on RF and XGBoost. This suggests that a multivariate ML model can be more effective on this type of data than classical DE approaches.
A final note concerns the performance of the algorithm with respect to PD endo-phenotypes. The predicted probability of the algorithm differed between PD subjects in different age categories: the average predicted probability was higher for PD individuals with age ≥ 56 (p-value 0.004, Wilcoxon test; average predicted probability = 0.77 for age < 56 and 0.84 for age ≥ 56), while there was no significant difference between PD subjects belonging to the other endo-phenotypic classes considered.

3.2. Evaluating Gene Importance

As the RF feature importance procedure in principle returns a different feature ranking at each iteration (because of both the different cross-validation splits and the intrinsic RF variability), we designed an experiment to investigate which genes were the most important for classification. Given that the highest performance value was obtained with 493 features, we evaluated, within the cross-validation scheme, the probability that an input feature (gene) is one of the top 493 genes, see Figure 4.
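The selection probability described above is simply the fraction of iterations in which a gene appears among the top C features. A minimal Python sketch follows; the function name, the gene identifiers `g1` to `g4`, and the toy rankings are hypothetical, and a real run would use C = 493 and the rankings produced by the RF models.

```python
from collections import Counter
from typing import Dict, List


def selection_frequency(rankings: List[List[str]], top_c: int) -> Dict[str, float]:
    """Fraction of iterations in which each gene is among the top_c features."""
    counts: Counter = Counter()
    for ranking in rankings:
        counts.update(ranking[:top_c])
    n_iterations = len(rankings)
    return {gene: counts[gene] / n_iterations for gene in counts}


# Toy importance rankings from four iterations (assumed data),
# each ordered from most to least important.
rankings = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g4", "g3"],
    ["g1", "g3", "g2", "g4"],
    ["g1", "g2", "g4", "g3"],
]
freq = selection_frequency(rankings, top_c=2)
# Genes selected in at least 70% of the iterations.
stable_genes = sorted(g for g, f in freq.items() if f >= 0.70)
print(freq, stable_genes)
```

Thresholding the frequency, for example at 70% of the iterations, gives a list of genes that are selected consistently rather than by chance in a single split.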
Among the most frequently selected genes, the 20 most important genes (according to the average importance ranking) are listed in Table 3; a list with the genes that have been selected in at least 70% of the iterations is presented in Table A1. This list includes 434 protein coding genes and 61 lincRNAs (lincRNAs are marked with an asterisk).

3.3. Gene Set Enrichment Analysis

We performed KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway and GO (Gene Ontology) functional annotation enrichment analysis with respect to biological processes, cellular components and molecular functions using enrichR [32] on the list of most frequent genes (Table A1). Figure 5, Figure 6 and Figure 7 report all the resulting significant groups at a False Discovery Rate (FDR) < 0.05; no GO molecular function was significant.
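For readers unfamiliar with enrichment analysis, the following Python sketch shows an over-representation test based on the hypergeometric distribution, a common choice for this kind of analysis. It is given only to illustrate the principle: it is not claimed to be the exact computation performed by enrichR, and the function name and all numbers are hypothetical toy values.

```python
from math import comb


def hypergeom_enrichment_p(n_universe: int, n_in_term: int,
                           n_selected: int, n_overlap: int) -> float:
    """Probability of observing at least n_overlap genes of a term among
    n_selected genes drawn at random from n_universe genes, n_in_term of
    which are annotated to the term."""
    upper = min(n_in_term, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(comb(n_in_term, k) * comb(n_universe - n_in_term, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


# Toy example (assumed numbers): 20 genes in the universe, 5 annotated to
# a term, 5 genes selected, all 5 of which belong to the term.
p_enriched = hypergeom_enrichment_p(20, 5, 5, 5)
# With no overlap required, the tail covers every outcome.
p_trivial = hypergeom_enrichment_p(20, 5, 5, 0)
print(p_enriched, p_trivial)
```

In practice, one such p-value is computed per GO term or KEGG pathway, and the resulting p-values are then corrected for multiple testing to control the FDR.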

4. Discussion

4.1. A Robust Machine Learning Model

With the robust methodology we implemented, we identified a set of around 500 genes that could discriminate between PD and HC with an AUC of 72%. Over 20 runs of cross validation (Figure 3) the AUC had a slightly increasing pattern for increasing values of C, reached a maximum at a number of features C = 493, and then slowly decreased. This behavior showed that the informative content of the selected genes was stable and accurate. While there was an imbalance between sensitivity and specificity, it was moderate and, if needed, this discrepancy could be mitigated with additional under-sampling or over-sampling techniques that could be embedded in the described methodology.
Comparing our performance with the state-of-the-art is not straightforward because of the nature of the data and because of ongoing research in the area. A comparable study on a large cohort of 523 individuals performed on blood microarray gene expression data and using Support Vector Machines reports an AUC of 79% on the validation set and of 74% on the test set [17]. However in that study PD subjects with a positive family history were not excluded and most importantly PD patients were treated with dopaminergic medication. Dopaminergic medication alters gene expression and thus confounds the underlying signal: higher discriminative performances are to be expected but are misleading.
A multivariate study not yet published [33], performed on the same PPMI cohort as ours, used a multi-modal approach that combines the informative content of transcriptomics, clinico-demographic data, genome sequencing data, and polygenic risk scores (PRS). To compare our results with theirs, we considered their transcriptomics-only model. They used Support Vector Machines (after testing and tuning 12 different ML algorithms) and divided the PPMI cohort at baseline into a training set (70%) and a validation set (30%); they then tested the resulting model on independent data from the PDBP (Parkinson’s Disease Biomarkers Program) cohort, although performances of the uni-modal model were not reported for this test set. After careful preprocessing, in which they used limma to adjust for the additional covariates of sex, plate, age, ten principal components, and percentage usable bases and then normalized counts, they used significantly over- or under-expressed protein coding genes, as determined through logistic regression (p-value < 0.01) on the training set, as input features to a Support Vector Machine classification algorithm. Using only transcriptome data they reached an AUC of 79.73% on the validation set, with 73.89% accuracy, 54.60% balanced accuracy, 97% sensitivity, and 12% specificity. When they combined transcriptomics with the other multi-modal data, using the union of the features as input, they reached, after tuning, an AUC of 85.03%, 75% accuracy, 68.09% balanced accuracy, 93% sensitivity and 43% specificity on the independent test set. By comparing the relative importances of the input features, they determined that the UPSIT score and the PRS contributed most to the predictive power of the model, while the accuracy was supplemented by many smaller-effect transcripts and risk SNPs.
The strength of our work is its high balanced accuracy in delineating cases and controls and its robustness. Our feature selection procedure identified a robust set of around 500 genes listed in Table A1 that may have some impact on PD biology.

4.2. Candidate Genes, GOs and KEGG Pathways

Accurate characterization of the selected genes and of significantly enriched gene sets is beyond the scope of this paper; however, we report a few comments on the enrichment analysis and a few notes on the selected genes.
Our analyses revealed a number of significant functions and pathways, some of which have already been linked to the pathogenesis of PD, such as oxidative stress, inflammation, mitochondrial and vesicular dysfunction, as well as associations between PD and diseases such as diabetes mellitus or inflammatory bowel disease (IBD) (see Figure 5, Figure 6 and Figure 7). Oxidative stress plays an important role in the degeneration of dopaminergic neurons [34]; its involvement in PD is further substantiated by Reactive Oxygen Species (ROS) induced Parkinsonian models and elevated oxidative markers in clinical PD samples [35]. Glutathione (GSH) is a ubiquitous thiol tripeptide that protects against oxidative stress-induced damage by neutralizing reactive oxygen species; its deficiency has been identified as an early event in the progression of PD [36]. Inflammation is another important contributor to the pathogenesis of the disease [37]. Interestingly, our GO analysis has identified biological processes that involve neutrophils. A very recent meta-analysis studying the association between the neutrophil-to-lymphocyte ratio (NLR), a well-established indicator of the overall inflammatory status of the organism, and clinical characteristics in PD has demonstrated that PD patients have an altered peripheral immune profile [38]. Neuronal expression of major histocompatibility complex I (MHC-I) and II (MHC-II) also plays a neuroinflammatory role in PD [39,40]. The MHC gene family encodes molecules on the surface of cells that enable the immune system to recognize presented self- and foreign-derived peptides. MHC class II-positive microglia are a sensitive index of neuropathological change and are actively associated with damaged neurons and neurites in PD [41]. Mitochondrial dysfunction is another pathway that has been implicated in the pathophysiology of PD through both environmental exposure and genetic factors.
The discovery of the role of the PD familial genes PTEN-induced putative kinase 1 (‘PINK1’) and parkin (‘PRKN’) in mediating mitochondrial degradation reaffirmed the importance of this process in PD aetiology [42]. Vesicular dysfunction is another known contributor to PD [43]. Finally, diabetes mellitus and inflammatory bowel diseases (IBD) are known PD risk factors. In fact, population-based cohort studies indicate that diabetes and IBD are associated with an increase in PD risk of about 38% [44] and 22% [45], respectively.
A few notes on the set of genes selected follow. In Table 3 we reported the first 20 most important protein coding genes and lincRNAs in our analysis. We included lincRNAs because long non coding RNAs in general assume various roles, which include regulatory roles, and can thus modulate gene expression of protein coding genes; also they are very relevant in neurobiology, as many are associated with neurological pathologies [9].
‘MYOM1’, Myomesin1, the most important gene, is a protein coding gene and is up-regulated in PD subjects. Notably, ‘ENSG00000272688’ (Lnc-MYOM1-4) falls within an intron of ‘MYOM1’ and is the fifth most important lincRNA; gene ‘MYOM2’ is also in the list of selected genes and was selected in all the 20 repeated runs. Gene ‘MYOM1’ is significantly up-regulated in human substantia nigra pars compacta from PD patients [46] and is also one of the most important genes in [33], together with ‘SQLE’, ‘LGALS2’, and ‘NCR1’. The intersection between our set and theirs might be larger, as that paper reports only 29 genes out of a much larger selected set. Gene ‘SLC25A20’, Solute Carrier Family 25 Member 20, the second most important gene, was up-regulated in PD, and was one of the nine PD biomarkers identified by Jiang et al. [47], who used a meta-analysis of microarray gene expression data from [17,48,49]. ‘PTGDS’, another gene in our set, was also one of these nine biomarkers. In our set of genes, 6 other genes, ‘SLC18B1’, ‘SLC25A3’, ‘SLC11A2’, ‘SLC25A25’, ‘SLC25A43’, ‘SLC38A11’, belong to the solute carrier (SLC) superfamily, one of the major sub-groups of membrane proteins in mammalian cells. Their role in neurodegenerative disorders is described thoroughly in [50]. ‘NRM’, the third most important gene, encodes the integral nuclear membrane protein Nurim, which plays a role in the suppression of apoptosis [51]; apoptosis is the main mechanism of neuronal loss in Parkinson’s disease [52]. ‘PHF7’, PHD Finger Protein 7, is a candidate gene for a PD risk locus identified with a meta-analysis of genome-wide association studies [53]. Both protein coding gene ‘NUP50’ (Nucleoporin 50) and lincRNA ‘NUP50-DT’ (‘NUP50’ divergent transcript) are in our gene list. ‘CERS4’, Ceramide Synthase 4, is involved in sphingolipid metabolism and its relation to PD is described in [54].
Dysregulation of metabolic pathways by carnitine palmitoyl-transferase 1 ‘CPT1A’ plays a key role in central nervous system disorders [55].
Gene ‘RAP1GAP’ has been identified by both the DE analysis and the ML methodology (it is selected 20 times over 20 repetitions) (see Table A1). This gene is under-expressed in PD subjects and has a role in orchestrating the development and maintenance of different populations of central and peripheral neurons [56].

4.3. Final Considerations

Two final comments. First, a classification algorithm that used DE genes, as found by DESeq2, as input features performed considerably worse than one using the set of features selected with the ML algorithm. This supports the validity of our methodology and the value of ML models for gene expression data from RNA sequencing of whole blood, where the signal is weak. Furthermore, we note that some of the genes selected by the ML algorithm are not DE between the class of PD and the class of HC subjects (see Table A1) but nonetheless contain a relevant signal.
Last, the difference in average predicted probability between subjects that fall in different age-of-onset classes (early-onset and late-onset PD subtypes) could reflect the heterogeneity of PD at different ages. In fact, it has been observed that PD patients with an older age at onset have more severe motor and non-motor burdens and a more widespread involvement of striatal structures [57].

5. Conclusions

We used a robust ML approach to make predictions on PD from whole blood expression data. The studied cohort included 390 early-stage drug-naive PD subjects and 189 age-matched HCs. After careful preprocessing, including batch correction and independent gene filtering, we used a feature selection procedure based on RF and re-sampling, followed by an XGBoost algorithm, to evaluate PD vs. HC classification performance within a nested 10-fold cross validation scheme. We explored classification performance for different values of C, the threshold that determines the number of features selected, and identified a set of around 500 genes, listed in Table A1, that corresponded to maximum discriminative power. We also performed an enrichment analysis on this set of genes and identified significant GO terms and KEGG pathways, many of which are in line with the current literature on PD, although further analysis of these sets is needed and is outside the scope of our work. A strength of our methodology is its robustness. The balanced accuracy of our algorithm compares favorably with the state of the art.
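As an illustration of how the threshold C can be tuned, the following Python sketch picks the value of C with the highest median AUC, taking first the median over folds within each repetition and then the median over repetitions. It is a minimal sketch under stated assumptions: the function name `best_threshold`, the nested-list layout of `auc`, and the toy AUC values are hypothetical and are not the R code used for the analysis.

```python
from statistics import median


def best_threshold(auc, thresholds):
    """Return (C*, summary) where summary[C] is the median-of-medians AUC.

    auc[r][k][i] is the AUC of repetition r, fold k, threshold thresholds[i]
    (an assumed layout chosen for this example).
    """
    summary = {}
    for i, C in enumerate(thresholds):
        # Median over the folds of each repetition ...
        per_repetition = [median(fold[i] for fold in repetition) for repetition in auc]
        # ... then median over the repetitions.
        summary[C] = median(per_repetition)
    best_C = max(summary, key=summary.get)
    return best_C, summary


# Toy values: 2 repetitions, 3 folds, 3 candidate thresholds (hypothetical numbers).
toy_auc = [
    [[0.60, 0.70, 0.65], [0.62, 0.72, 0.66], [0.58, 0.74, 0.64]],
    [[0.61, 0.69, 0.66], [0.63, 0.71, 0.65], [0.59, 0.73, 0.67]],
]
best_C, summary = best_threshold(toy_auc, thresholds=[10, 20, 30])
print(best_C)
```

Using medians at both levels keeps a single unusually good or bad fold from driving the choice of C.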
This is an active area of research that requires further investigation. A possible extension of our work could be the evaluation of the predictive power of the selected set of genes on an independent dataset. We are also working on a multi-modal approach that combines transcriptome data with epigenomic data (and possibly other data types), with the final aim of increasing the predictive performance of our model.

Author Contributions

Conceptualization, E.P. (Ester Pantaleo), A.M., A.L., N.A., S.T. and D.U.; methodology, E.P. (Ester Pantaleo), A.M., A.L., N.A. and S.T.; software, E.P. (Ester Pantaleo); validation, E.P. (Ernesto Picardi), N.A. and A.L.; formal analysis, E.P. (Ester Pantaleo), A.M., A.L., S.T., N.A., C.L.G. and E.P. (Ernesto Picardi); investigation, E.P. (Ernesto Picardi); resources, E.P. (Ester Pantaleo) and D.U.; data curation, E.P. (Ester Pantaleo); writing (original draft preparation), E.P. (Ester Pantaleo), N.A. and A.M.; writing (review and editing), E.P. (Ester Pantaleo), A.M., N.A., A.L., L.B., D.U., B.T., S.N., C.L.G. and E.P. (Ernesto Picardi); visualization, E.P. (Ester Pantaleo) and A.M.; supervision, R.B., G.L. and G.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by Regione Puglia and CNR funds to “Tecnopolo per la Medicina di Precisione”, D.G.R. n. 2117 of 21.11.2018 (B84I18000540002). A.L.’s position is funded by the Program Research for Innovation (REFIN), funded by Regione Puglia (Italy) in the framework of the POR Puglia FESR FSE 2014-2020 Asse X-Azione 10.4, project code 928A7C98 (Biomarcatori di connettività cerebrale da imaging multimodale per la diagnosi precoce e stadiazione personalizzata di malattie neurodegenerative con metodi avanzati di intelligenza artificiale in ambiente di calcolo distribuito).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The pseudocode used for the analysis is reported in Algorithm A1. The R code used to perform the preprocessing and the analysis is available upon request.

Acknowledgments

PPMI, a public-private partnership, is funded by the Michael J. Fox Foundation for Parkinson’s Research and funding partners, including Abbvie, Acurex Therapeutics, Allergan, Amathus Therapeutics, Aligning Science Across Parkinson’s (ASAP), Avid Radiopharmaceuticals, Bial Biotech, Biogen, Biolegend, Bristol Myers Squibb, Calico, Celgene, Dacapo Brainscience, Denali, Edmond J. Safra Philanthropic Foundation, 4D Pharma plc, GE Healthcare, Genentech, GSK (GlaxoSmithKline), Golub Capital, Handl Therapeutics, Insitro, Janssen Neuroscience, Lilly, Lundbeck, Merck, Meso Scale Discovery (MSD), Neurocrine Biosciences, Pfizer, Piramal, Prevail Therapeutics, Roche, Sanofi Genzyme, Servier, Takeda, TEVA, UCB, Vanqua Bio, Verily, Voyager Therapeutics, and Yumanity Therapeutics.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Algorithm A1 Pseudocode.
Let F be the total number of features
for r = 1 to 20 do
    Divide data into 10 stratified folds using random seed r
    for fold k = 1 to 10 do
        Set fold k as validation_set and the remaining 9 folds as training_set
        for s = 1 to 100 do
            Divide training_set into 5 stratified folds using random seed s
            Take 4 of the folds as the new training set
            Train a RF on this training set with 1000 trees
            for f = 1 to F do
                Set is_outlier_{r,s,f} = 0
                Estimate importance_{r,s,f}
            end for
            Evaluate th_{r,s} = MEDIAN_f(importance_{r,s,f}) + 1.5 * IQR_f(importance_{r,s,f}), where MEDIAN_f means median over the F values of f
            for f = 1 to F do
                is_outlier_{r,s,f} = IFELSE(importance_{r,s,f} > th_{r,s}, 1, 0)
            end for
        end for
        for f = 1 to F do
            Set percentage_outlier_{r,f} = 0
            for s = 1 to 100 do
                percentage_outlier_{r,f} += is_outlier_{r,s,f}
            end for
        end for
        for C = 1 to 100 do
            Evaluate is_selected_{r,f,C} = IFELSE(percentage_outlier_{r,f} > C, 1, 0)
            Train XGBoost on the training_set using only features f with is_selected_{r,f,C} = 1
            Estimate performance ROCAUC_{r,k,C} on the validation_set
        end for
    end for
end for
for C = 1 to 100 do
    Evaluate m_ROCAUC_{r,C} = MEDIAN_k(ROCAUC_{r,k,C}) over the 10 values of k
end for
for C = 1 to 100 do
    Evaluate m_ROCAUC_C = MEDIAN_r(ROCAUC_{r,C}) over the 20 values of r
end for
Let C* = ARGMAX_C(m_ROCAUC_C)
for f = 1 to F do
    Set count_selected_f = 0
    for r = 1 to 20 do
        count_selected_f += is_selected_{r,f,C*}
    end for
end for
Table A1. Complete list of the most frequent protein coding genes and lincRNAs, ordered by importance. LincRNAs are marked with an asterisk. For each gene, four attributes are listed: (i) No arrow, an upward-pointing arrow, or a downward-pointing arrow indicates no significant DE between PD and HC, significant over-expression in PD subjects, or significant under-expression in PD subjects, respectively; (ii) HGNC symbol (or Ensembl ID when missing); (iii) Average importance over 20 runs of 5-fold cross validation; (iv) Frequency of occurrence over 20 repetitions.
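| Symbol | imp | f | Symbol | imp | f | Symbol | imp | f |
|---|---|---|---|---|---|---|---|---|
| MYOM1 | 82.1 | 20 | SLC25A20 | 62.7 | 20 | NRM | 46.4 | 20 |
| PHF7 | 45.9 | 20 | ENSG00000277763 * | 39.4 | 20 | ICA1 | 36 | 20 |
| CPT1A | 33.8 | 20 | LINC02422 * | 33.3 | 20 | GSTM1 | 32.4 | 20 |
| PCDHGA6 | 31.6 | 20 | AK5 | 31.5 | 20 | GCNT2 | 29.9 | 20 |
| CERS4 | 29.7 | 20 | YJU2 | 29.4 | 20 | SURF6 | 27.7 | 20 |
| ENSG00000281181 * | 26.7 | 20 | ENSG00000285774 * | 26.7 | 20 | ENSG00000272688 * | 26.2 | 20 |
| SERF1B | 25.8 | 20 | ENSG00000284773 | 25.7 | 20 | TREML4 | 25.7 | 20 |
| ANKRD34B | 25.1 | 14 | NDUFB9 | 25 | 20 | ERLIN2 | 24.6 | 20 |
| ENSG00000276651 * | 24.4 | 20 | CCR4 | 23.9 | 20 | NFE2L3 | 23.9 | 20 |
| FGGY | 23.5 | 20 | ENSG00000234902 * | 23.3 | 20 | ARRDC4 | 23.1 | 20 |
| TMTC4 | 21.6 | 20 | BPHL | 20.1 | 20 | C2orf42 | 20 | 20 |
| BTBD19 | 19.8 | 15 | LOXHD1 | 19.4 | 20 | DHFR | 19.1 | 20 |
| LINC02470 * | 18.9 | 20 | SHISA4 | 18.8 | 20 | FKBP5 | 18.4 | 20 |
| ENSG00000234426 * | 18.3 | 20 | TKTL1 | 18.1 | 20 | ATP6V0A2 | 18 | 20 |
| GPR19 | 17.9 | 20 | ZNF584 | 17.4 | 20 | FAN1 | 17.4 | 20 |
| MRPS6 | 17.2 | 18 | TSPAN2 | 16.9 | 20 | CRAT | 16.5 | 20 |
| CCRL2 | 16 | 20 | GTF2IRD2 | 15.9 | 20 | PUDP | 15.8 | 20 |
| NOP16 | 15.7 | 20 | LINC00243 * | 15.6 | 20 | CEP19 | 15.6 | 20 |
| GAB3 | 15.6 | 20 | ENSG00000269399 * | 15.5 | 20 | YOD1 | 15.3 | 20 |
| GET1 | 15.3 | 19 | NREP | 15.2 | 20 | YES1 | 14.8 | 15 |
| COL9A3 | 14.5 | 20 | NSUN4 | 14.4 | 20 | FARSB | 14.3 | 20 |
| GZMB | 14.2 | 20 | B4GALNT3 | 14.1 | 20 | TBL2 | 14.1 | 20 |
| RAP1GAP | 13.7 | 20 | BASP1 | 13.5 | 20 | PRUNE2 | 13.5 | 19 |
| FBN2 | 13.3 | 20 | VNN1 | 13.2 | 20 | LSMEM1 | 13.2 | 20 |
| ZSCAN21 | 13.1 | 20 | CLEC12A | 13 | 20 | COA4 | 13 | 20 |
| DPY19L2 | 12.9 | 20 | RNASET2 | 12.8 | 20 | DCXR | 12.4 | 20 |
| WDR49 | 12.4 | 20 | CRYZ | 12.3 | 15 | LINC00623 * | 12.2 | 20 |
| ZNF714 | 12.2 | 20 | TOR1B | 12 | 20 | ADGRE5 | 11.9 | 20 |
| SULF2 | 11.7 | 20 | MSH3 | 11.7 | 20 | PCDHGB3 | 11.6 | 20 |
| SPHK1 | 11.3 | 20 | G6PC3 | 11 | 20 | MASTL | 11 | 20 |
| LINC01806 * | 11 | 20 | SQLE | 11 | 20 | PWP2 | 10.9 | 20 |
| TXLNB | 10.8 | 20 | ZSWIM3 | 10.7 | 19 | SFXN4 | 10.6 | 20 |
| RUBCNL | 10.6 | 20 | PNO1 | 10.2 | 20 | SMIM12 | 10.2 | 18 |
| TNFRSF10B | 10.2 | 20 | GPR162 | 10 | 16 | KRT1 | 10 | 20 |
| B3GAT1 | 9.8 | 20 | PILRB | 9.8 | 20 | TAP2 | 9.8 | 20 |
| MSR1 | 9.7 | 18 | LINC00482 * | 9.6 | 18 | OSER1-DT * | 9.4 | 15 |
| ASCC1 | 9.4 | 20 | ZNF429 | 9.4 | 20 | SSPN | 9.3 | 20 |
| GYPA | 9.3 | 20 | FAT4 | 9.3 | 20 | SLC18B1 | 9.2 | 20 |
| TIPIN | 9.1 | 20 | IL18RAP | 9 | 15 | GYPE | 8.9 | 20 |
| HYAL3 | 8.7 | 20 | PREX1 | 8.7 | 19 | KLRC2 | 8.6 | 20 |
| FCER1A | 8.6 | 19 | ENDOG | 8.6 | 20 | GSTM3 | 8.6 | 20 |
| TMEM252 | 8.6 | 20 | SRGAP2C | 8.5 | 17 | ATP5MC1 | 8.5 | 20 |
| FIS1 | 8.3 | 20 | ARRDC3-AS1 * | 8.3 | 19 | LINC01948 * | 8.3 | 16 |
| TPPP3 | 8.2 | 20 | HDHD3 | 8.2 | 19 | LINC01560 * | 8.2 | 20 |
| IFRD2 | 8.1 | 20 | STK11 | 8.1 | 17 | TARBP1 | 7.9 | 20 |
| LINC00299 * | 7.9 | 19 | XCL1 | 7.8 | 20 | ZNF491 | 7.6 | 20 |
| LINC00570 * | 7.4 | 20 | PARS2 | 7.4 | 20 | INPP1 | 7.4 | 20 |
| TMEM245 | 7.4 | 20 | NECAP2 | 7.3 | 19 | PER3 | 7.3 | 20 |
| CDCA4 | 7.2 | 20 | NUP210L | 7.2 | 19 | GTF2H2 | 7.1 | 20 |
| APOLD1 | 7.1 | 20 | ETFDH | 7 | 20 | GPX4 | 6.9 | 20 |
| PPP1R14B | 6.9 | 20 | GOLGA6L9 | 6.9 | 20 | ATP6AP1L | 6.9 | 20 |
| GFUS | 6.9 | 20 | ENSG00000268240 * | 6.7 | 20 | MAST4 | 6.7 | 20 |
| ENSG00000225750 * | 6.7 | 20 | BCL6 | 6.5 | 20 | DDO | 6.5 | 20 |
| TMEM185B | 6.4 | 20 | UPB1 | 6.4 | 20 | CCR3 | 6.4 | 20 |
| PLIN2 | 6.4 | 20 | RALY-AS1 * | 6.3 | 20 | DDIT4 | 6.3 | 20 |
| FGFR2 | 6.3 | 20 | PAICS | 6.3 | 20 | ENSG00000278384 | 6.2 | 19 |
| HLA-DRB5 | 6.2 | 16 | VDR | 6.1 | 19 | ENSG00000254789 * | 6.1 | 20 |
| ENSG00000260077 * | 6.1 | 18 | CLEC18A | 6.1 | 15 | LINC02193 * | 6.1 | 20 |
| SBNO1 | 6.1 | 20 | VAV1 | 6.1 | 17 | SCN3A | 6 | 20 |
| CCL4L2 | 5.9 | 20 | ASB3 | 5.8 | 18 | GSTM2 | 5.8 | 20 |
| KDM5B | 5.8 | 20 | GNAL | 5.7 | 18 | KCNMB1 | 5.6 | 15 |
| CSGALNACT1 | 5.6 | 20 | RNASEH1 | 5.6 | 20 | ENSG00000285476 | 5.5 | 20 |
| FBXL13 | 5.4 | 20 | VLDLR | 5.4 | 20 | FPR2 | 5.4 | 20 |
| PPP1R3B | 5.4 | 20 | SRSF8 | 5.4 | 20 | APOO | 5.3 | 20 |
| TXNIP | 5.3 | 20 | MPG | 5.3 | 19 | TAS2R43 | 5.2 | 20 |
| FLVCR2 | 5.1 | 20 | SLC25A3 | 5.1 | 20 | CD36 | 5.1 | 19 |
| CENPK | 5 | 18 | C5 | 5 | 14 | PRRG4 | 5 | 20 |
| DYRK1B | 4.9 | 17 | APTR * | 4.8 | 20 | TMEM14C | 4.8 | 19 |
| PF4V1 | 4.8 | 20 | ZNF789 | 4.7 | 20 | UBR7 | 4.7 | 17 |
| HMOX2 | 4.6 | 19 | PID1 | 4.6 | 20 | LERFS * | 4.5 | 20 |
| ENSG00000266302 | 4.5 | 20 | AKAP5 | 4.5 | 20 | DPCD | 4.4 | 20 |
| TMTC2 | 4.4 | 20 | NKAP | 4.4 | 17 | ENSG00000276476 * | 4.4 | 20 |
| EDAR | 4.4 | 20 | VSTM1 | 4.4 | 20 | PDK4 | 4.3 | 20 |
| HIF1A | 4.3 | 20 | GRHPR | 4.3 | 20 | TUBB2A | 4.3 | 20 |
| PALLD | 4.3 | 20 | LINC01303 * | 4.3 | 20 | FPR3 | 4.3 | 20 |
| TMEM45B | 4.3 | 20 | RGMB | 4.3 | 20 | CREM | 4.3 | 20 |
| LYRM9 | 4.3 | 18 | VSIG10 | 4.3 | 20 | TSPAN17 | 4.2 | 20 |
| BBLN | 4.1 | 16 | LTA4H | 4.1 | 20 | U2AF1 | 4.1 | 20 |
| PPAN | 4.1 | 20 | ARL17B | 4.1 | 20 | ENSG00000274922 * | 4.1 | 19 |
| TM9SF1 | 4 | 20 | EPPK1 | 4 | 20 | THBD | 4 | 20 |
| DRAXIN | 4 | 14 | USP12 | 3.9 | 20 | SLC11A2 | 3.9 | 19 |
| ENSG00000259071 * | 3.9 | 20 | SPON2 | 3.8 | 20 | ENSG00000256427 * | 3.8 | 14 |
| FAM124B | 3.8 | 20 | NBDY | 3.8 | 20 | MBNL3 | 3.8 | 20 |
| COMMD9 | 3.7 | 20 | CTSK | 3.7 | 20 | CYREN | 3.7 | 20 |
| LINC00654 * | 3.7 | 20 | ENSG00000270972 * | 3.6 | 20 | SVBP | 3.6 | 20 |
| TMEM185A | 3.6 | 18 | CDK6 | 3.6 | 20 | MFSD9 | 3.6 | 20 |
| CRTAP | 3.5 | 14 | CSTB | 3.5 | 20 | PTRHD1 | 3.5 | 20 |
| PPIE | 3.5 | 20 | HLA-DMB | 3.5 | 15 | DSC1 | 3.5 | 20 |
| CEP85 | 3.4 | 20 | RNF182 | 3.4 | 20 | HSD17B8 | 3.4 | 20 |
| NKX3-1 | 3.4 | 20 | F2R | 3.4 | 20 | ENSG00000224635 * | 3.4 | 19 |
| HDHD5 | 3.4 | 20 | ZKSCAN4 | 3.4 | 20 | KPNB1 | 3.3 | 20 |
| LAMP1 | 3.3 | 20 | ENSG00000277369 * | 3.3 | 20 | SNHG4 * | 3.3 | 20 |
| MYG1 | 3.3 | 15 | SLC25A25 | 3.3 | 18 | U2AF1L5 | 3.3 | 20 |
| ETHE1 | 3.2 | 20 | KAT2B | 3.2 | 20 | MIR378D2HG * | 3.2 | 16 |
| TLCD4 | 3.2 | 20 | SPTY2D1 | 3.2 | 20 | MYOM2 | 3.2 | 20 |
| IL18R1 | 3.1 | 20 | UBE2E2 | 3.1 | 20 | KREMEN1 | 3.1 | 20 |
| ENSG00000227920 * | 3.1 | 19 | COX5A | 3 | 16 | LINC00920 * | 3 | 20 |
| NRG1 | 3 | 17 | GPR15 | 3 | 20 | UROS | 3 | 20 |
| LINC02520 * | 3 | 20 | TGM3 | 3 | 20 | CCZ1B | 3 | 20 |
| S100B | 3 | 20 | NR4A2 | 2.9 | 20 | SULT1A1 | 2.9 | 19 |
| TMEM273 | 2.9 | 20 | LINC00381 * | 2.9 | 18 | FMN1 | 2.8 | 20 |
| CCDC144A | 2.8 | 20 | LMTK2 | 2.8 | 20 | HSDL2 | 2.8 | 20 |
| BMX | 2.8 | 20 | ZNF559 | 2.8 | 20 | ELL | 2.8 | 17 |
| MIR646HG * | 2.8 | 20 | CREG1 | 2.8 | 20 | DACT1 | 2.8 | 19 |
| TBC1D30 | 2.7 | 20 | JUN | 2.7 | 20 | CLEC4F | 2.7 | 20 |
| ENSG00000259652 * | 2.7 | 19 | POMC | 2.6 | 14 | THAP7 | 2.6 | 20 |
| YDJC | 2.6 | 20 | NFE4 | 2.6 | 20 | PDZD4 | 2.6 | 20 |
| FTCDNL1 | 2.6 | 20 | GABARAPL1 | 2.6 | 20 | TIMM9 | 2.6 | 20 |
| ANKRD9 | 2.6 | 19 | RNF11 | 2.5 | 19 | ATP6V1F | 2.5 | 20 |
| MTCH2 | 2.5 | 20 | SCO1 | 2.5 | 19 | NOTCH2NLA | 2.5 | 20 |
| GATD3A | 2.5 | 20 | MAP3K7CL | 2.5 | 20 | NCAM1 | 2.4 | 20 |
| LINC02273 * | 2.4 | 20 | PI16 | 2.4 | 14 | CLCN4 | 2.4 | 20 |
| CTXN2-AS1 * | 2.4 | 19 | MECR | 2.4 | 20 | ENSG00000273243 * | 2.4 | 20 |
| COL18A1 | 2.4 | 20 | TLK2 | 2.3 | 20 | HMBS | 2.3 | 17 |
| CCDC102A | 2.3 | 15 | TTF2 | 2.3 | 19 | C16orf91 | 2.3 | 16 |
| HERPUD1 | 2.3 | 20 | SLA | 2.2 | 20 | TMEM102 | 2.2 | 20 |
| HLA-DQB1 | 2.2 | 20 | DUSP19 | 2.2 | 20 | KCTD3 | 2.2 | 14 |
| FOLR3 | 2.2 | 20 | C1orf220 * | 2.1 | 15 | PRDM8 | 2.1 | 19 |
| KIF1B | 2.1 | 19 | LINC00298 * | 2.1 | 18 | LINC01410 * | 2.1 | 20 |
| LINC02218 | 2.1 | 20 | NKAPL | 2.1 | 20 | RAB34 | 2.1 | 20 |
| GSTZ1 | 2.1 | 19 | ENSG00000267575 * | 2.1 | 16 | SYNM | 2 | 17 |
| RNF149 | 2 | 20 | CSRNP1 | 2 | 17 | LSG1 | 2 | 19 |
| TOP1 | 2 | 20 | IRF1 | 2 | 14 | SYTL3 | 2 | 20 |
| ZNRD2 | 2 | 20 | ICAM4 | 2 | 20 | CLEC12B | 1.9 | 20 |
| NDRG3 | 1.9 | 20 | PAQR8 | 1.9 | 20 | LGALS2 | 1.9 | 20 |
| WDR11 | 1.9 | 17 | HDAC9 | 1.9 | 20 | RRS1 | 1.9 | 15 |
| ANKRD55 | 1.9 | 16 | NIT2 | 1.8 | 14 | ENSG00000272908 * | 1.8 | 20 |
| PMVK | 1.8 | 20 | RFC5 | 1.8 | 20 | PRADC1 | 1.8 | 18 |
| HSD17B13 | 1.8 | 18 | ZNF487 | 1.8 | 20 | NUP50-DT * | 1.8 | 20 |
| TOR3A | 1.7 | 20 | ADAM15 | 1.7 | 20 | ENSG00000285492 * | 1.7 | 20 |
| CA4 | 1.7 | 20 | PARN | 1.7 | 18 | AKR1A1 | 1.7 | 20 |
| DOCK4 | 1.7 | 20 | IRS2 | 1.7 | 20 | CHST2 | 1.7 | 20 |
| C3orf18 | 1.6 | 20 | ZNF69 | 1.6 | 20 | CCN3 | 1.6 | 20 |
| CLMN | 1.6 | 20 | GCAT | 1.6 | 14 | TXN2 | 1.6 | 15 |
| TPST1 | 1.6 | 20 | MIR3945HG * | 1.5 | 20 | PTPRN2 | 1.5 | 20 |
| ADGRB3 | 1.5 | 18 | ENSG00000281831 * | 1.5 | 15 | EIF2D | 1.5 | 20 |
| OAS1 | 1.5 | 14 | ACSL1 | 1.5 | 20 | SRP19 | 1.5 | 20 |
| NUP50 | 1.5 | 20 | XK | 1.5 | 20 | COA1 | 1.4 | 19 |
| KRT72 | 1.4 | 20 | ROPN1L | 1.4 | 16 | SLC25A43 | 1.4 | 20 |
| ENSG00000251093 * | 1.4 | 20 | ABCA1 | 1.4 | 19 | AFDN | 1.4 | 18 |
| TMEM176B | 1.3 | 20 | SERINC3 | 1.3 | 18 | CEMIP2 | 1.3 | 20 |
| NAXD | 1.3 | 20 | NFXL1 | 1.3 | 20 | ALKBH7 | 1.3 | 19 |
| ENSG00000259959 * | 1.3 | 20 | ENSG00000275765 * | 1.3 | 15 | BSCL2 | 1.2 | 18 |
| CISD2 | 1.2 | 20 | DCAF4L1 | 1.2 | 19 | CD93 | 1.2 | 19 |
| APRT | 1.2 | 20 | CYBRD1 | 1.1 | 16 | NBPF26 | 1.1 | 20 |
| MRPS27 | 1.1 | 18 | GIMAP1 | 1.1 | 20 | RRP7A | 1.1 | 20 |
| ISCA1 | 1.1 | 20 | FADS2 | 1.1 | 19 | TRANK1 | 1.1 | 18 |
| PHACTR1 | 1.1 | 20 | VNN3 | 1 | 20 | HLX | 1 | 20 |
| JADE1 | 1 | 20 | KNOP1 | 1 | 20 | HLA-DQA2 | 1 | 19 |
| XKR3 | 1 | 20 | P2RX4 | 0.9 | 16 | CPA3 | 0.9 | 19 |
| C8orf33 | 0.9 | 19 | MS4A4E | 0.9 | 20 | ENSG00000274979 * | 0.9 | 20 |
| RPGRIP1 | 0.9 | 14 | NCR1 | 0.9 | 20 | PRF1 | 0.9 | 20 |
| PEA15 | 0.8 | 19 | S100A10 | 0.8 | 19 | ERO1A | 0.8 | 20 |
| ADGRG3 | 0.8 | 16 | BTNL8 | 0.8 | 20 | EMC9 | 0.8 | 20 |
| LONRF3 | 0.8 | 20 | SLC38A11 | 0.7 | 20 | BAZ1A | 0.7 | 17 |
| ACAD11 | 0.7 | 15 | C1orf109 | 0.7 | 20 | SUV39H1 | 0.7 | 14 |
| PAAF1 | 0.7 | 18 | MGST3 | 0.7 | 20 | PHTF1 | 0.7 | 20 |
| CD55 | 0.6 | 20 | MTPAP | 0.6 | 20 | ZNF80 | 0.6 | 18 |
| SIPA1L2 | 0.6 | 20 | PTGDS | 0.6 | 19 | SNX3 | 0.6 | 20 |
| KLF9 | 0.6 | 17 | TGFA | 0.6 | 20 | HLA-DQA1 | 0.5 | 20 |
| AMACR | 0.5 | 20 | NCAPG2 | 0.5 | 14 | CTSH | 0.5 | 15 |
| ENSG00000282988 | 0.5 | 17 | PANX1 | 0.5 | 20 | HLA-A | 0.5 | 20 |
| CPD | 0.5 | 20 | NHS | 0.4 | 16 | KRT73 | 0.4 | 20 |
| METRNL | 0.3 | 17 | PIGW | 0.3 | 16 | AVIL | 0.3 | 20 |
| ABCG1 | 0.2 | 20 | RAB27A | 0.2 | 20 | DNAJC3 | 0 | 20 |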
LincRNAs are marked with an asterisk.
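The frequency f can be read as the number of repetitions in which a gene was selected. A minimal Python sketch of this counting step follows; the gene labels and the helper names are hypothetical, and the cutoff of 14 out of 20 repetitions is an assumption that corresponds to a 70% selection frequency.

```python
from collections import Counter


def selection_frequency(selected_per_repetition):
    """Count in how many repetitions each gene was selected."""
    counts = Counter()
    for genes in selected_per_repetition:
        # A set guards against counting a gene twice within one repetition.
        counts.update(set(genes))
    return counts


def frequent_genes(counts, min_count):
    """Genes selected in at least min_count repetitions, in alphabetical order."""
    return sorted(gene for gene, count in counts.items() if count >= min_count)


# Toy example with 20 repetitions and placeholder gene labels.
toy_repetitions = (
    [["GENE_A", "GENE_B", "GENE_C"]] * 13
    + [["GENE_A", "GENE_B"]]
    + [["GENE_A"]] * 6
)
toy_counts = selection_frequency(toy_repetitions)
print(frequent_genes(toy_counts, min_count=14))  # ['GENE_A', 'GENE_B']
```

An integer cutoff is used instead of a fraction so that the comparison is not affected by floating-point rounding.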

References

  1. GBD Disease Incidence, Prevalence Collaborators. Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990–2017: A systematic analysis for the Global Burden of Disease Study 2017. Lancet 2018, 392, 1789–1858.
  2. Schapira, A.H.V.; Chaudhuri, K.R.; Jenner, P. Non-motor features of Parkinson disease. Nat. Rev. Neurosci. 2017, 18, 435–450.
  3. Angelopoulou, E.; Paudel, Y.N.; Papageorgiou, S.G.; Piperi, C. Environmental Impact on the Epigenetic Mechanisms Underlying Parkinson’s Disease Pathogenesis: A Narrative Review. Brain Sci. 2022, 12, 175.
  4. Nido, G.S.; Dick, F.; Toker, L.; Petersen, K.; Alves, G.; Tysnes, O.B.; Jonassen, I.; Haugarvoll, K.; Tzoulis, C. Common gene expression signatures in Parkinson’s disease are driven by changes in cell composition. Acta Neuropathol. Commun. 2020, 8, 55.
  5. Sullivan, P.F.; Fan, C.; Perou, C.M. Evaluating the comparability of gene expression in blood and brain. Am. J. Med. Genet. B Neuropsychiatr. Genet. 2006, 141, 261–268.
  6. Soreq, L.; Salomonis, N.; Bronstein, M.; Greenberg, D.S.; Israel, Z.; Bergman, H.; Soreq, H. Small RNA sequencing-microarray analyses in Parkinson leukocytes reveal deep brain stimulation-induced splicing changes that classify brain region transcriptomes. Front. Mol. Neurosci. 2013, 6, 10.
  7. Haas, R.H.; Nasirian, F.; Nakano, K.; Ward, D.; Pay, M.; Hill, R.; Shults, C.W. Low platelet mitochondrial complex I and complex II/III activity in early untreated Parkinson’s disease. Ann. Neurol. 1995, 37, 714–722.
  8. Barbanti, P.; Fabbrini, G.; Ricci, A.; Cerbo, R.; Bronzetti, E.; Caronti, B.; Calderaro, C.; Felici, L.; Stocchi, F.; Meco, G.; et al. Increased expression of dopamine receptors on lymphocytes in Parkinson’s disease. Mov. Disord. 1999, 14, 764–771.
  9. Soreq, L.; Guffanti, A.; Salomonis, N.; Simchovitz, A.; Israel, Z.; Bergman, H.; Soreq, H. Long non-coding RNA and alternative splicing modulations in Parkinson’s leukocytes identified by RNA sequencing. PLoS Comput. Biol. 2014, 10, e1003517.
  10. Grünblatt, E.; Zehetmayer, S.; Jacob, C.P.; Müller, T.; Jost, W.H.; Riederer, P. Pilot study: Peripheral biomarkers for diagnosing sporadic Parkinson’s disease. J. Neural Transm. 2010, 117, 1387–1393.
  11. Shehadeh, L.A.; Yu, K.; Wang, L.; Guevara, A.; Singer, C.; Vance, J.; Papapetropoulos, S. SRRM2, a potential blood biomarker revealing high alternative splicing in Parkinson’s disease. PLoS ONE 2010, 5, e9104.
  12. Molochnikov, L.; Rabey, M.R.; Dobronevsky, E.; Bonuccelli, U.; Ceravolo, R.; Frosini, D.; Grünblatt, E.; Riederer, P.; Jacob, C.; Aharon-Peretz, J.; et al. A molecular signature in blood identifies early Parkinson’s disease. Mol. Neurodegener. 2012, 7, 26.
  13. Su, C.; Tong, J.; Wang, F. Mining genetic and transcriptomic data using machine learning approaches in Parkinson’s disease. NPJ Park. Dis. 2020, 6, 24.
  14. Amoroso, N.; La Rocca, M.; Monaco, A.; Bellotti, R.; Tangaro, S. Complex networks reveal early MRI markers of Parkinson’s disease. Med. Image Anal. 2018, 48, 12–24.
  15. Nalls, M.A.; McLean, C.Y.; Rick, J.; Eberly, S.; Hutten, S.J.; Gwinn, K.; Sutherland, M.; Martinez, M.; Heutink, P.; Williams, N.M.; et al. Parkinson’s Disease Biomarkers Program and Parkinson’s Progression Marker Initiative investigators. Diagnosis of Parkinson’s disease on the basis of clinical and genetic classification: A population-based modelling study. Lancet Neurol. 2015, 14, 1002–1009.
  16. Monaco, A.; Pantaleo, E.; Amoroso, N.; Bellantuono, L.; Lombardi, A.; Tateo, A.; Tangaro, S.; Bellotti, R. Identifying potential gene biomarkers for Parkinson’s disease through an information entropy based approach. Phys. Biol. 2020, 18, 016003.
  17. Shamir, R.; Klein, C.; Amar, D.; Vollstedt, E.J.; Bonin, M.; Usenovic, M.; Wong, Y.C.; Maver, A.; Poths, S.; Safer, H.; et al. Analysis of blood-based gene expression in idiopathic Parkinson disease. Neurology 2017, 89, 1676–1683.
  18. Chen-Plotkin, A. Blood transcriptomics for Parkinson disease? Nat. Rev. Neurol. 2018, 14, 5–6.
  19. Babu, G.S.; Suresh, S. Parkinson’s disease prediction using gene expression – A projection based learning meta-cognitive neural classifier approach. Expert Syst. Appl. 2013, 40, 1519–1529.
  20. Karlsson, M.K.; Sharma, P.; Aasly, J.; Toft, M.; Skogar, O.; Sæbø, S.; Lönneborg, A. Found in transcription: Accurate Parkinson’s disease classification in peripheral blood. J. Park. Dis. 2013, 3, 19–29.
  21. Marek, K.; Chowdhury, S.; Siderowf, A.; Lasch, S.; Coffey, C.S.; Caspell-Garcia, C.; Simuni, T.; Jennings, D.; Tanner, C.M.; Trojanowski, J.Q.; et al. Parkinson’s Progression Markers Initiative. The Parkinson’s progression markers initiative (PPMI) – Establishing a PD biomarker cohort. Ann. Clin. Transl. Neurol. 2018, 5, 1460–1477.
  22. Dobin, A.; Davis, C.A.; Schlesinger, F.; Drenkow, J.; Zaleski, C.; Jha, S.; Batut, P.; Chaisson, M.; Gingeras, T.R. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29, 15–21.
  23. Liao, Y.; Smyth, G.K.; Shi, W. featureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 2014, 30, 923–930.
  24. Hutchins, E.; Craig, D.; Violich, I.; Alsop, E.; Casey, B.; Hutten, S.; Reimer, A.; Whitsett, T.G.; Crawford, K.L.; Toga, A.W.; et al. Quality Control Metrics for Whole Blood Transcriptome Analysis in the Parkinson’s Progression Markers Initiative (PPMI). medRxiv 2021.
  25. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  26. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
  27. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794.
  28. Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15, 550.
  29. Gibbons, S.M.; Duvallet, C.; Alm, E.J. Correcting for batch effects in case-control microbiome studies. PLoS Comput. Biol. 2018, 14, e1006102.
  30. Ritchie, M.E.; Phipson, B.; Wu, D.; Hu, Y.; Law, C.W.; Shi, W.; Smyth, G.K. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015, 43, e47.
  31. Monaco, A.; Pantaleo, E.; Amoroso, N.; Lacalamita, A.; Lo Giudice, C.; Fonzino, A.; Fosso, B.; Picardi, E.; Tangaro, S.; Pesole, G.; et al. A primer on machine learning techniques for genomic applications. Comput. Struct. Biotechnol. J. 2021, 19, 4345–4359.
  32. Kuleshov, M.V.; Jones, M.R.; Rouillard, A.D.; Fernandez, N.F.; Duan, Q.; Wang, Z.; Koplev, S.; Jenkins, S.L.; Jagodnik, K.M.; Lachmann, A.; et al. Enrichr: A Comprehensive Gene Set Enrichment Analysis Web Server 2016 Update. Nucleic Acids Res. 2016, 44, W90–W97.
  33. Makarious, M.B.; Leonard, H.L.; Vitale, D.; Iwaki, H.; Sargent, L.; Dadu, A.; Violich, I.; Hutchins, E.; Saffo, D.; Bandres-Ciga, S.; et al. Multi-Modality Machine Learning Predicting Parkinson’s Disease. bioRxiv 2021.
  34. Gaki, G.S.; Papavassiliou, A.G. Oxidative stress-induced signaling pathways implicated in the pathogenesis of Parkinson’s disease. Neuromol. Med. 2014, 16, 217–230.
  35. Wei, Z.; Li, X.; Liu, Q.; Cheng, Y. Oxidative stress in Parkinson’s disease: A systematic review and meta-analysis. Front. Mol. Neurosci. 2018, 11, 236.
  36. Garcia, A.; León-Martinez, R.; Blanco-Lezcano, L.; Pavón-Fuentes, N.; Lorigados-Pedre, L. Transient glutathione depletion in the substantia nigra compacta is associated with neuroinflammation in rats. Neuroscience 2016, 335, 207–220.
  37. Tufekci, K.U.; Meuwissen, R.; Genc, S.; Genc, K. Inflammation in Parkinson’s disease. Adv. Protein Chem. Struct. Biol. 2012, 88, 69–132.
  38. Muñoz-Delgado, L.; Macías-García, D.; Jesús, S.; Martín-Rodríguez, J.F.; Labrador-Espinosa, M.Á.; Jiménez-Jaraba, M.V.; Adarmes-Gómez, A.; Carrillo, F.; Mir, P. Peripheral Immune Profile and Neutrophil-to-Lymphocyte Ratio in Parkinson’s Disease. Mov. Disord. 2021, 36, 2426–2430.
  39. Sulzer, D.; Alcalay, R.N.; Garretti, F.; Cote, L.; Kanter, E.; Agin-Liebes, J.; Liong, C.; McMurtrey, C.; Hildebrand, W.H.; Mao, X.; et al. T cells from patients with Parkinson’s disease recognize α-synuclein peptides. Nature 2017, 546, 656–661.
  40. Tan, J.S.Y.; Chao, Y.X.; Rötzschke, O.; Tan, E.K. New Insights into Immune-Mediated Mechanisms in Parkinson’s Disease. Int. J. Mol. Sci. 2020, 21, 9302.
  41. Imamura, K.; Hishikawa, N.; Sawada, M.; Nagatsu, T.; Yoshida, M.; Hashizume, Y. Distribution of major histocompatibility complex class II-positive microglia and cytokine profile of Parkinson’s disease brains. Acta Neuropathol. 2003, 106, 518–526.
  42. Malpartida, A.B.; Williamson, M.; Narendra, D.P.; Wade-Martins, R.; Ryan, B.J. Mitochondrial Dysfunction and Mitophagy in Parkinson’s Disease: From Mechanism to Therapy. Trends Biochem. Sci. 2021, 46, 329–343.
  43. Ebanks, K.; Lewis, P.A.; Bandopadhyay, R. Vesicular Dysfunction and the Pathogenesis of Parkinson’s Disease: Clues From Genetic Studies. Front. Neurosci. 2020, 13, 1381.
  44. Yue, X.; Li, H.; Yan, H.; Zhang, P.; Chang, L.; Li, T. Risk of Parkinson Disease in Diabetes Mellitus: An Updated Meta-Analysis of Population-Based Cohort Studies. Medicine 2016, 95, e3549.
  45. Villumsen, M.; Aznar, S.; Pakkenberg, B.; Jess, T.; Brudek, T. Inflammatory bowel disease increases the risk of Parkinson’s disease: A Danish nationwide cohort study 1977–2014. Gut 2019, 68, 18–24.
  46. Grünblatt, E.; Mandel, S.; Jacob-Hirsch, J.; Zeligson, S.; Amariglo, N.; Rechavi, G.; Li, J.; Ravid, R.; Roggendorf, W.; Riederer, P.; et al. Gene expression profiling of parkinsonian substantia nigra pars compacta; alterations in ubiquitin-proteasome, heat shock protein, iron and oxidative stress regulated proteins, cell adhesion/cellular matrix and vesicle trafficking genes. J. Neural Transm. 2004, 111, 1543–1573.
  47. Jiang, F.; Wu, Q.; Sun, S.; Bi, G.; Guo, L. Identification of potential diagnostic biomarkers for Parkinson’s disease. FEBS Open Bio. 2019, 9, 1460–1468.
  48. Scherzer, C.R.; Eklund, A.C.; Morse, L.J.; Liao, Z.; Locascio, J.L.; Fefer, D.; Schwarzschild, M.A.; Schlossmacher, M.G.; Hauser, M.A.; Vance, J.M.; et al. Molecular markers of early Parkinson’s disease based on gene expression in blood. Proc. Natl. Acad. Sci. USA 2007, 104, 955–960.
  49. Calligaris, R.; Banica, M.; Roncaglia, P.; Robotti, E.; Finaurini, S.; Vlachouli, C.; Antonutti, L.; Iorio, F.; Carissimo, A.; Cattaruzza, T.; et al. Blood transcriptomics of drug-naive sporadic Parkinson’s disease patients. BMC Genom. 2015, 16, 876.
  50. Ayka, A.; Şehirli, A.Ö. The Role of the SLC Transporters Protein in the Neurodegenerative Disorders. Clin. Psychopharmacol. Neurosci. 2020, 18, 174–187.
  51. Chen, H.; Chen, K.; Chen, J.; Cheng, H.; Zhou, R. The integral nuclear membrane protein nurim plays a role in the suppression of apoptosis. Curr. Mol. Med. 2012, 12, 1372–1382.
  52. Erekat, N.S. Apoptosis and its Role in Parkinson’s Disease. In Parkinson’s Disease: Pathogenesis and Clinical Aspects; Stoker, T.B., Greenland, J.C., Eds.; Codon Publications: Brisbane, Australia, 2018; Chapter 4.
  53. Chang, D.; Nalls, M.A.; Hallgrímsdóttir, I.B.; van der Brug, M.; Cai, F.; International Parkinson’s Disease Genomics Consortium; 23andMe Research Team; Kerchner, G.A.; Ayalon, G.; Bingol, B.; et al. A meta-analysis of genome-wide association studies identifies 17 new Parkinson’s disease risk loci. Nat. Genet. 2017, 49, 1511–1516.
  54. Custodia, A.; Aramburu-Núñez, M.; Correa-Paz, C.; Posado-Fernández, A.; Gómez-Larrauri, A.; Castillo, J.; Gómez-Muñoz, A.; Sobrino, T.; Ouro, A. Ceramide Metabolism and Parkinson’s Disease Therapeutic Targets. Biomolecules 2021, 11, 945.
  55. Trabjerg, M.S.; Mørkholt, A.S.; Lichota, J.; Oklinski, M.K.E.; Andersen, D.C.; Jønsson, K.; Mørk, K.; Skjønnemand, M.N.; Kroese, L.J.; Pritchard, C.E.J.; et al. Dysregulation of metabolic pathways by carnitine palmitoyl-transferase 1 plays a key role in central nervous system disorders: Experimental evidence based on animal models. Sci. Rep. 2020, 10, 15583.
  56. Paratcha, G.; Ledda, F. The GTPase-activating protein Rap1GAP: A new player to modulate Ret signaling. Cell Res. 2011, 21, 217–219.
  57. Pagano, G.; Ferrara, N.; Brooks, D.J.; Pavese, N. Age at onset and Parkinson disease phenotype. Neurology 2016, 86, 1400–1407.
Figure 1. Schematic workflow of the performed analyses. The main phases are: (i) preprocessing, (ii) learning and (iii) performance evaluation.
Figure 1. Schematic workflow of the performed analyses. The main phases are: (i) preprocessing, (ii) learning and (iii) performance evaluation.
Genes 13 00727 g001
Figure 2. Samples were collected across 25 different sites labeled with an integer number. Sites “14”, “26”, “55”, and “59” had 0 or 1 control sample only (horizontal dotted line) and were excluded from the classification analysis as batch effects due to site could not be estimated and therefore corrected for.
Figure 2. Samples were collected across 25 different sites labeled with an integer number. Sites “14”, “26”, “55”, and “59” had 0 or 1 control sample only (horizontal dotted line) and were excluded from the classification analysis as batch effects due to site could not be estimated and therefore corrected for.
Genes 13 00727 g002
Figure 3. In black, the median AUC over 20 runs of 10-fold cross validation; in red, the median AUC ± its mean absolute deviation; in blue, the number of features (genes) where the maximum median AUC (72%) was reached. For each run, we collected the AUC values obtained at different thresholds C (or equivalently a different number of genes) and we interpolated these values to build a curve. Then we obtained the black curve as the median of 20 curves, one for each 10-fold Cross-Validation (CV) run.
Figure 3. In black, the median AUC over 20 runs of 10-fold cross validation; in red, the median AUC ± its mean absolute deviation; in blue, the number of features (genes) where the maximum median AUC (72%) was reached. For each run, we collected the AUC values obtained at different thresholds C (or equivalently a different number of genes) and we interpolated these values to build a curve. Then we obtained the black curve as the median of 20 curves, one for each 10-fold Cross-Validation (CV) run.
Genes 13 00727 g003
Figure 4. Histogram of the frequency of occurrence of the top 493 genes over 20 repetitions. At each repetition we collected the 493 most important genes; over 20 repetitions we gathered in total around 800 genes, many of which (365) appeared in all 20 repetitions.
Figure 4. Histogram of the frequency of occurrence of the top 493 genes over 20 repetitions. At each repetition we collected the 493 most important genes; over 20 repetitions we gathered in total around 800 genes, many of which (365) appeared in all 20 repetitions.
Genes 13 00727 g004
Figure 5. List of all the GO Biological Processes that are enriched in the selected genes, with the respective number of genes belonging to each term. The analysis was performed with enrichR at an FDR < 0.05.
Figure 5. List of all the GO Biological Processes that are enriched in the selected genes, with the respective number of genes belonging to each term. The analysis was performed with enrichR at an FDR < 0.05.
Genes 13 00727 g005
Figure 6. List of all the GO Cellular Components that are enriched in the selected genes with the respective number of genes belonging to each term. The analysis was performed with enrichR at an FDR < 0.05.
Figure 6. List of all the GO Cellular Components that are enriched in the selected genes with the respective number of genes belonging to each term. The analysis was performed with enrichR at an FDR < 0.05.
Genes 13 00727 g006
Figure 7. List of all the KEGG pathways that are enriched in the selected genes with the respective number of genes belonging to each term. The analysis was performed with enrichR at an FDR < 0.05.
Table 1. Relevant clinical, pathological and technical metadata of the cohort divided by disease status.
| Variable | PD | HC |
|---|---|---|
| Gender (male %) | 252/390 (64%) | 123/189 (65%) |
| Age at enrollment | 62 ± 10 | 61 ± 11 |
| Disease duration | 2 ± 2 | - |
| RBD | 37% | 20% |
| TD | 70% | 13% |
| Number of sites | 25 | 23 |
| MoCA 26 (CI-adjusted for education) | 33% | 0.5% |
| RIN | 8 ± 1.7 | 8 ± 1.7 |
Table 2. Average performances of XGBoost over 20 runs of 10-fold cross validation.
| Metric | Mean | Standard Deviation |
|---|---|---|
| AUC | 71.3 | 1.2 |
| Accuracy | 69.3 | 1.2 |
| Sensitivity | 81.7 | 1.6 |
| Specificity | 45.5 | 2.3 |
| Balanced Accuracy | 63.6 | 1.3 |
| F1 | 77.8 | 0.9 |
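The metrics in Table 2 are related through the confusion matrix; for instance, the reported balanced accuracy is the average of the reported sensitivity and specificity, (81.7 + 45.5) / 2 = 63.6. The sketch below computes all five threshold-based metrics from confusion-matrix counts. It assumes PD is the positive class, and the counts in the usage example are hypothetical, not taken from the paper.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fn: positive (assumed PD) subjects classified correctly/incorrectly.
    tn/fp: negative (assumed HC) subjects classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fn=20, tn=25, fp=25)
```

When the positive class is larger than the negative class, as in this cohort (390 PD versus 189 HC), plain accuracy is pulled toward the sensitivity, which is why balanced accuracy is the more conservative summary.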
Table 3. List of the 20 most important protein-coding genes and lincRNAs, ordered by importance. LincRNAs are marked with an asterisk. For each gene, four attributes are listed: (i) up arrow/down arrow: significant over-/under-expression in PD subjects compared to HC; (ii) HGNC (HUGO Gene Nomenclature Committee) symbol, or Ensembl ID when missing; (iii) average XGBoost importance over 20 runs of 5-fold cross validation; (iv) number of times that a gene is selected over 20 runs of feature selection. For a more complete list, including the genes that are selected 70% of the time, see Table A1.
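| e | Symbol | imp | f | e | Symbol | imp | f |
|---|---|---|---|---|---|---|---|
| | MYOM1 | 82.1 | 20 | | SLC25A20 | 62.7 | 20 |
| | NRM | 46.4 | 20 | | PHF7 | 45.9 | 20 |
| | ENSG00000277763 * | 39.4 | 20 | | ICA1 | 36 | 20 |
| | CPT1A | 33.8 | 20 | | LINC02422 * | 33.3 | 20 |
| | GSTM1 | 32.4 | 20 | | PCDHGA6 | 31.6 | 20 |
| | AK5 | 31.5 | 20 | | GCNT2 | 29.9 | 20 |
| | CERS4 | 29.7 | 20 | | YJU2 | 29.4 | 20 |
| | SURF6 | 27.7 | 20 | | ENSG00000281181 * | 26.7 | 20 |
| | ENSG00000285774 * | 26.7 | 20 | | ENSG00000272688 * | 26.2 | 20 |
| | SERF1B | 25.8 | 20 | | ENSG00000284773 | 25.7 | 20 |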
LincRNAs are marked with an asterisk.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

MDPI and ACS Style

Pantaleo, E.; Monaco, A.; Amoroso, N.; Lombardi, A.; Bellantuono, L.; Urso, D.; Lo Giudice, C.; Picardi, E.; Tafuri, B.; Nigro, S.; et al. A Machine Learning Approach to Parkinson’s Disease Blood Transcriptomics. Genes 2022, 13, 727. https://doi.org/10.3390/genes13050727
