1. Introduction
Parkinson’s disease (PD) is a neurodegenerative disorder primarily characterized by motor impairment resulting from the degeneration of dopaminergic neurons in the substantia nigra (SN) located in the midbrain [
1,
2]. It is considered the second most prevalent neurodegenerative disease, following Alzheimer’s disease [
3]. In 2021, approximately 11.77 million people worldwide were affected by PD, with an incidence rate of 15.63 new cases per 100,000 people per year [
4]. Epidemiological projections estimate that the number of cases will rise to approximately 1.93 million by 2030, with an age-standardized incidence rate expected to reach 27 cases per 100,000 people [
5].
Age is considered a major risk factor for the development of PD, as the disease occurs more frequently in individuals over 60 years of age. A positive family history is also recognized as a risk factor, with more than 15% of patients having a familial background [
6]. Furthermore, PD has a higher incidence in men than in women [
7,
8].
Although PD was first described more than 200 years ago by James Parkinson [
9], its pathophysiology remains poorly understood. Both genetic and environmental factors have been implicated in the disease, but none has been identified as a definitive cause [
10,
11].
Microarrays have emerged as a powerful technological tool for studying gene expression, enabling the generation of large-scale datasets to support disease research and biomarker discovery [
12,
13]. They allow the simultaneous analysis of multiple genes, helping to address questions regarding genetic differences between healthy individuals and those with disease, as well as treatment responses [
14]. Various statistical techniques have traditionally been used for microarray analysis [
15,
16,
17]; however, machine learning approaches have recently led to significant improvements in data interpretation and disease prediction [
18,
19,
20].
Machine learning has already been applied to microarray data to classify healthy individuals versus PD patients. For instance, the study by Lai et al. evaluated machine learning algorithms for PD diagnosis using a dataset of 1656 participants. Algorithms such as logistic regression (LASSO-LR), decision trees, random forest, XGBoost, support vector machine (SVM), and k-nearest neighbors were tested. The SVM model was the most accurate (84.40%), highlighting constipation, olfactory decline, and daytime sleepiness as the most relevant features [
21]. In another study, Shamir et al. analyzed gene expression profiles using genetic networks and identified 87 genes related to metabolism, oxidation, and ubiquitination processes, as well as dysregulation of genes in the mitochondria [
22]. Similarly, other researchers have employed logistic regression (LASSO) and SVM-RFE to identify PD-associated biomarkers, revealing nine key genes:
NME7,
PKM,
RRM2,
POLR3C,
POLA1,
PDE6C,
PDE9A,
PDE11A, and
AMPD1 [
23]. Liu et al. focused on identifying genes associated with immune infiltration in PD through microarray-based bioinformatics, identifying
SLC18A2,
CALB1, and
SYNGR3 as candidate biomarkers [
24].
Most prior studies compare healthy individuals and PD patients, often excluding neurological diseases that share similar gene expression patterns. In this study, we analyzed and modeled microarray data derived from peripheral blood samples classified into three groups: PD, other neurological diseases, and healthy controls. We applied machine learning techniques to identify novel gene expression patterns related to PD and to propose potential biomarkers for its diagnosis.
3. Results
We obtained a model with nine genes that J48 selected as the most informative (see
Figure 3). These genes yielded a decision tree that correctly classified 92 of 105 samples (44/50 PD, 30/33 neurological disease controls, and 18/22 healthy controls); the mean accuracy of J48 across the evaluation experiments was 86.71%.
In this model, TMEM104 is the most informative attribute because it is the root node. The tree is evaluated as a sequence of threshold checks on expression levels. If the TMEM104 level is ≤34.384992, the sample is classified as neurological disease; otherwise, TRIM33 is checked. If TRIM33 is >99.453834, the sample is classified as neurological disease; otherwise, GJB3 is checked. If GJB3 is >61.613954, the sample is classified as neurological disease; otherwise, SPON2 is checked. If SPON2 is >271.096149, the sample is classified as neurological disease; otherwise, SNAP25 is checked. If SNAP25 is >8.6876, the sample is classified as neurological disease; otherwise, TRAK2 is checked. If TRAK2 is >82.777175, the sample is classified as healthy; otherwise, SHPK is checked. If SHPK is >−4.234751, the sample is classified as healthy; otherwise, PIEZO1 is checked. If PIEZO1 is >87.643492, the sample is classified as healthy; otherwise, RPL37 is checked. If RPL37 is ≤708.881411, the sample is classified as healthy; if it is >708.881411, the sample is classified as PD. In this model, the TMEM104, TRIM33, GJB3, SPON2 and SNAP25 genes separate patients with neurological diseases from PD patients and healthy controls; the TRAK2, SHPK and PIEZO1 genes separate healthy controls; and the RPL37 gene distinguishes PD patients from healthy controls.
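The decision path described above can be written as a short rule chain. The following Python sketch is only an illustration of that path, not the authors' WEKA model: the names `DECISION_PATH` and `classify` are hypothetical, the TRAK2 cut-off is taken as 82.777175, and the input is assumed to be a mapping from gene symbol to expression level on the same scale as the thresholds.

```python
from typing import Mapping

ND, HC, PD = "neurological disease", "healthy control", "PD"

# Each entry: (gene, threshold, comparison that ends the walk, label assigned).
# "le" means the leaf is reached when value <= threshold, "gt" when value > threshold.
DECISION_PATH = [
    ("TMEM104", 34.384992, "le", ND),
    ("TRIM33", 99.453834, "gt", ND),
    ("GJB3", 61.613954, "gt", ND),
    ("SPON2", 271.096149, "gt", ND),
    ("SNAP25", 8.6876, "gt", ND),
    ("TRAK2", 82.777175, "gt", HC),   # assumed spelling of the cut-off
    ("SHPK", -4.234751, "gt", HC),
    ("PIEZO1", 87.643492, "gt", HC),
    ("RPL37", 708.881411, "le", HC),
]


def classify(expression: Mapping[str, float]) -> str:
    """Walk the nine checks in order and return the first leaf that is reached."""
    for gene, threshold, comparison, label in DECISION_PATH:
        value = expression[gene]
        reached_leaf = value <= threshold if comparison == "le" else value > threshold
        if reached_leaf:
            return label
    # Only samples that pass every check (RPL37 > 708.881411 last) end up here.
    return PD


# Hypothetical sample that passes all nine checks.
sample = {
    "TMEM104": 40.0, "TRIM33": 50.0, "GJB3": 10.0, "SPON2": 100.0, "SNAP25": 5.0,
    "TRAK2": 50.0, "SHPK": -10.0, "PIEZO1": 50.0, "RPL37": 800.0,
}
print(classify(sample))
```

Written this way, it is apparent that the tree is a single chain: a sample is labeled PD only if none of the eight earlier checks fires and RPL37 exceeds its threshold.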
To put the performance of J48 in context,
Table 2 also reports the performance of the Naive Bayes classifier, the SVM and the Multilayer Perceptron (MLP). In this table, an asterisk indicates that the corresponding classifier performs significantly worse than the decision tree according to a
t-test. Standard deviations are shown in parentheses, and the average classification accuracy across these experiments is also reported.
Classifier performance was statistically compared using the paired corrected t-test implemented in WEKA (significance level = 0.05, two-tailed). The J48 classifier (86.71 ± 11.66%) significantly outperformed Naive Bayes (73.19 ± 10.42%), SVM (45.56 ± 14.11%), and MLP (58.10 ± 11.78%), with p < 0.05 for all comparisons. The 95% confidence intervals for the mean accuracy differences were 10.2–14.6 (J48 vs. Naive Bayes), 36.0–43.8 (J48 vs. SVM), and 25.4–33.2 (J48 vs. MLP). These results indicate that the decision tree model achieved statistically superior predictive performance compared to the other classifiers.
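For readers unfamiliar with the corrected test, the sketch below shows the corrected resampled t-statistic (the Nadeau and Bengio variance correction) on which WEKA's paired corrected t-test is based. It is a minimal illustration under assumptions: the function name, the per-run accuracies, and the test/train ratio of 1/9 (as in 10-fold cross-validation) are hypothetical and are not taken from the experiments reported here.

```python
import math
import statistics
from typing import Sequence, Tuple


def corrected_resampled_t(
    acc_a: Sequence[float],
    acc_b: Sequence[float],
    test_train_ratio: float,
) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired per-run accuracies.

    The usual paired t-test divides the variance of the differences by k.
    The corrected version uses (1/k + n_test/n_train) instead, which accounts
    for the overlap between training sets across resampling runs.
    """
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need two equally long sequences with at least 2 runs")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    k = len(diffs)
    mean_diff = statistics.fmean(diffs)
    var_diff = statistics.variance(diffs)  # sample variance (k - 1 denominator)
    if var_diff == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / math.sqrt((1.0 / k + test_train_ratio) * var_diff)
    return t, k - 1


# Hypothetical accuracies from four paired runs of two classifiers.
tree_acc = [0.90, 0.80, 0.85, 0.95]
bayes_acc = [0.70, 0.75, 0.70, 0.80]
t_stat, dof = corrected_resampled_t(tree_acc, bayes_acc, test_train_ratio=1 / 9)
print(f"t = {t_stat:.3f}, df = {dof}")
```

The resulting statistic is compared with a Student t distribution with k − 1 degrees of freedom; because the correction inflates the variance term, it is more conservative than the uncorrected paired test.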
Table 3,
Table 4,
Table 5 and
Table 6 give a more detailed view of the performances shown in
Table 2: each table presents the confusion matrix of one classifier, from which its class-wise accuracy can be read. We used the corresponding implementations in WEKA with default parameters. Moreover,
Table 7 reports, for each classifier, the sensitivity, specificity, positive predictive value (PV+) and negative predictive value (PV−).
As can be seen in
Table 7, all classifiers perform well in sensitivity (detection of cases with the disease), but only the decision tree also performs well on the remaining measures. A margin this large is unexpected, since a single classifier rarely outperforms the others by so much on the same data.
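As a reminder of how these four measures follow from a multi-class confusion matrix, the sketch below computes them one class at a time (one-vs-rest). The matrix and class labels are invented for illustration and are not the values in Tables 3 to 6; the function name is hypothetical.

```python
from typing import Dict, List, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division: undefined measures are reported as NaN."""
    return numerator / denominator if denominator else float("nan")


def one_vs_rest_metrics(
    matrix: Sequence[Sequence[int]], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """Rows are true classes, columns are predicted classes."""
    total = sum(sum(row) for row in matrix)
    results = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # true class i, predicted otherwise
        fp = sum(row[i] for row in matrix) - tp   # predicted i, truly another class
        tn = total - tp - fn - fp
        results[label] = {
            "sensitivity": _ratio(tp, tp + fn),
            "specificity": _ratio(tn, tn + fp),
            "PV+": _ratio(tp, tp + fp),
            "PV-": _ratio(tn, tn + fn),
        }
    return results


# Invented 3-class confusion matrix (NOT taken from the study).
toy_matrix = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
metrics = one_vs_rest_metrics(toy_matrix, ["A", "B", "C"])
for name, values in metrics.items():
    print(name, {k: round(v, 3) for k, v in values.items()})
```

Sensitivity depends only on the row of the class in question, whereas specificity and both predictive values also depend on how the other classes are misclassified, which is why a classifier can score well on the former and poorly on the latter.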
After applying machine learning, nine candidate genes for Parkinson’s disease were obtained: TMEM104, TRIM33, GJB3, SPON2, SNAP25, TRAK2, SHPK, PIEZO1, and RPL37. We performed receiver operating characteristic (ROC) analyses using the pROC package (v1.18.5) in R (v4.3.3) to evaluate the ability of individual genes to distinguish between two classes. An ROC curve illustrates the discriminatory performance of a binary classification method whose score is ordinal (either continuous or discrete) by plotting sensitivity (the proportion of positives correctly identified) against 1 − specificity (the proportion of negatives incorrectly identified as positives) as the decision threshold is systematically varied. In this context, the area under the curve (AUC) provides a quantitative measure of classifier performance, where higher values denote better class discrimination [
40,
41].
Figure 4 shows the ROC curves. In Figure 4a (Parkinson’s disease and Healthy Control), TRIM33 (AUC = 0.6), TRAK2 (AUC = 0.72) and PIEZO1 (AUC = 0.62) show some ability to separate cases of Parkinson’s disease from Healthy Controls, whereas TMEM104, GJB3, SPON2, SNAP25, SHPK and RPL37 have a limited ability to discriminate between these two classes. In Figure 4b (Neurological Disease and Healthy Control), TMEM104 (AUC = 0.64), TRAK2 (AUC = 0.6) and SHPK (AUC = 0.6) may be able to discriminate between Neurological Disease and Healthy Controls, whereas TRIM33, GJB3, SPON2, SNAP25, PIEZO1 and RPL37 exhibit limited discriminative power between the two classes.
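The single-gene AUC values above have a simple interpretation: the AUC equals the probability that a randomly chosen sample from one class has a higher expression level than a randomly chosen sample from the other class, with ties counted as one half. The Python sketch below computes it in that way. It is an illustration only, not the pROC implementation; the function name and expression values are hypothetical, and the `auto_direction` flag merely mimics the idea of choosing the comparison direction automatically so that the result is at least 0.5.

```python
from typing import Sequence


def single_gene_auc(
    case_values: Sequence[float],
    control_values: Sequence[float],
    auto_direction: bool = False,
) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly."""
    if not case_values or not control_values:
        raise ValueError("both groups must contain at least one sample")
    score = 0.0
    for case in case_values:
        for control in control_values:
            if case > control:
                score += 1.0
            elif case == control:
                score += 0.5   # ties contribute half a correctly ranked pair
    auc = score / (len(case_values) * len(control_values))
    if auto_direction:
        # A gene that is lower in cases is as informative as one that is higher.
        auc = max(auc, 1.0 - auc)
    return auc


# Hypothetical expression levels for one gene.
cases = [3.0, 4.0, 5.0]
controls = [1.0, 2.0, 3.0]
print(round(single_gene_auc(cases, controls), 4))
```

An AUC of 0.5 corresponds to no discrimination, so values of about 0.6 indicate only weak separation by a single gene, whereas the decision tree combines several such weak markers.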
Finally, a Gene Ontology (GO) pathway enrichment analysis was conducted on the nine highest-ranked genes (
TMEM104,
TRIM33,
GJB3,
SPON2,
SNAP25,
TRAK2,
SHPK,
PIEZO1 and
RPL37), to explore signaling pathways [
42]. The enrichment results were considered statistically significant when the adjusted
p-value < 0.1 [
43]. The results indicated that these expressed genes are primarily involved in biological processes related to synaptic vesicle docking (
SNAP25), dense core granule exocytosis (
SNAP25), the cellular response to lipopolysaccharide (
SPON2/
SHPK), dendritic transport (
TRAK2), and other signaling pathways.
Figure 5 shows the ten GO terms with the strongest enrichment. These terms were identified through over-representation analysis using the Gene Ontology database and org.Hs.eg.db annotations [
44].
Table 8 presents the GO terms of the main genes, providing detailed information on their involvement in biological processes. It also lists the corresponding
p-value, adjusted
p-value (p-adjust), gene names, and log-transformed adjusted
p-value. The complete table is provided in
Supplementary Materials Table S1.
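Over-representation analysis is commonly computed as a one-sided hypergeometric test per GO term, followed by a multiple-testing adjustment such as Benjamini-Hochberg. The sketch below illustrates that calculation in Python. Whether the tool used in this study applies exactly this procedure is an assumption, and the gene counts and p-values in the example are invented.

```python
from math import comb
from typing import List, Sequence


def hypergeom_upper_tail(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) when n genes are drawn from N background genes, K of them annotated."""
    upper = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Invented example: 2 of 9 selected genes carry a term annotated to
# 50 of 20,000 background genes.
p_term = hypergeom_upper_tail(k=2, K=50, n=9, N=20000)
print(f"p = {p_term:.2e}")

# Invented raw p-values for four terms, then adjusted.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

With only nine input genes, a term can reach significance on the basis of one or two annotated genes, which should be kept in mind when interpreting the enriched terms.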
4. Discussion
In this work, we evaluated four classifiers (a decision tree, an SVM, Naive Bayes and an MLP) on PD gene expression data. Our decision tree used gene expression profiles to identify genes that may be potential biomarkers for PD. From a total of 22,283 genes, we narrowed the set down to 9 genes (TMEM104, TRIM33, GJB3, SPON2, SNAP25, TRAK2, SHPK, PIEZO1 and RPL37) that allowed us to classify individuals into three different classes. Each gene participates in different biological processes.
The TMEM104 gene encodes a transmembrane protein. It belongs to the transmembrane protein (TMEM) family, whose members are components of the cell membrane, lysosomes and the mitochondrial membrane, and it has been associated with some types of cancer. However, the functional roles of most TMEMs have not yet been fully characterized [45]. Few studies have explored a possible relationship between this gene and Parkinson’s disease. Other members of the TMEM family, such as TMEM59, TMEM230, and TMEM108, have been reported to have significant associations with PD [46]. The work of Liu also identifies the TMEM108 gene as having a possible role in cognitive progression, which may influence PD [47].
The TRIM33 gene encodes a protein that functions as a transcriptional corepressor. It is thought to promote the ubiquitination of SMAD4 (a tumor suppressor), leading to its nuclear exclusion and degradation via the ubiquitin–proteasome pathway. It is associated with two types of thyroid cancer: differentiated and non-medullary [48]. The TRIM33 gene is involved in the activation of genes of the inflammatory response in mature myeloid cells; it also interacts with and ubiquitinates DHX33 and contributes to the assembly of the DHX33-NLRP3 inflammasome complex [49]. Our results suggest that TRIM33 may participate indirectly, by promoting NLRP3 activity: eliminating TRIM33 in human macrophages appears to block NLRP3 inflammasome activation [50]. In this way, the gene may contribute to a neuroinflammatory environment that promotes damage to dopaminergic neurons by stimulating the release of proinflammatory cytokines, such as IL-1β and IL-18, thereby contributing to the progression of PD [51].
The
GJB3 gene encodes a protein that is part of the gap junctions, which are made up of arrays of intercellular channels that provide a pathway for the diffusion of low molecular weight materials from cell to cell. It is related to diseases such as autosomal dominant deafness and erythrokeratodermia variabilis, a skin disorder [
52]. To date, few studies have investigated the potential association between this gene and Parkinson’s disease. However, overexpression of the
GJA1 gene, which belongs to the same family, has been found to be associated with neurodegenerative disorders, including PD [
53]. According to Kawasaki et al., elevated
GJA1 expression and activity in glial cells promotes the transmission of stress and inflammatory signals through gap junctions, which may underlie neuronal damage, an important factor associated with Parkinson’s disease [
54].
The
SPON2 gene encodes a protein that promotes the adhesion and growth of embryonic neurons in the hippocampus. It is considered essential at the onset of the innate immune response and represents a unique pattern recognition molecule in the extracellular matrix for microbial pathogens [
48]. To our knowledge, no studies have reported a direct or indirect link between this gene and Parkinson’s disease. This gene is critically involved in the recruitment of macrophages and neutrophils during inflammatory processes. Overexpression of
SPON2 has been shown to promote tumor cell migration in colorectal cancer [
55]. It has also been shown that this gene not only promotes M1-type macrophage infiltration, but also inhibits tumor metastasis, making it a critical factor in mediating the immune response against tumor cell growth and migration in hepatocellular carcinoma [
56].
The SNAP25 gene encodes a t-SNARE protein involved in the molecular regulation of neurotransmitter release. It is considered to be of great importance in the synaptic function of specific neuronal systems and interacts with proteins involved in vesicle docking and membrane fusion. It is also associated with congenital myasthenic syndrome [48]. The study by Agliardi et al. reinforces the importance of this gene: it reports that the concentration of SNAP25 increases in the cerebrospinal fluid of PD patients and that this increase is related to the severity of cognitive and motor symptoms. Furthermore, post-mortem analyses of the brain indicated that presynaptic (including SNAP25) and postsynaptic proteins are depleted in PD dementia [57]. Another study reported a twofold reduction in SNARE complex assembly in brain homogenates from the cerebral cortex of patients with PD [58]. SNAP25 has been shown to serve reliably as a marker of synaptic injury when measured in both cerebrospinal fluid and plasma [59]. SNAP25 protein is increased in the cerebrospinal fluid of patients with Parkinson’s disease, and high concentrations of this protein could induce synaptic deterioration specific to PD [60]. SNAP25 has been identified as a neuropathological marker in neurological diseases and has been proposed as a potential biomarker for the diagnosis and treatment of Alzheimer’s disease and Parkinson’s disease [61], as well as Creutzfeldt-Jakob disease [62]. It has also been shown that inhibiting TNFAIP1-mediated SNAP25 degradation could be a therapeutic approach to mitigate postoperative cognitive impairment [63].
The TRAK2 gene encodes a protein that is thought to regulate the endosome-to-lysosome trafficking of membrane cargo, including the epidermal growth factor receptor (EGFR). It is associated with amyotrophic lateral sclerosis, which causes progressive muscle paralysis [48]. Mitochondria are transported along microtubules by opposing kinesin and dynein motors. Kinesin-1 and dynein–dynactin are anchored to mitochondria through TRAK proteins [64]. TRAK2 has an important role in mitochondrial transport in neurons; in young, maturing neurons, it contributes similarly to mitochondrial transport in both axons and dendrites [65]. PD is closely associated with mitochondrial dysfunction and oxidative stress. Mitochondria are the primary source of reactive oxygen species (ROS) and are essential to the antioxidant defense system [66].
The
SHPK gene is a protein-encoding gene. It acts as a modulator of macrophage activation by controlling glucose metabolism. It is associated with sedoheptulose kinase deficiency and cystinosis [
48,
52]. A correlation has been found with glioblastoma; increased expression of
SHPK is associated with a worse prognosis [
67]. To date, no direct or indirect evidence has been found linking this gene to PD.
The PIEZO1 gene encodes a mechanically activated ion channel that links mechanical forces to biological signals. It is essential for the formation of blood vessels and the regulation of vascular architecture in adult physiology. It is associated with dehydrated hereditary stomatocytosis and lymphedema [48, 52]. An analysis of genetic variants found that PIEZO1 could contribute to susceptibility to PD by modulating calcium-dependent signaling pathways and oxidative stress, thus sharing a genetic architecture with melanoma [68]. PIEZO1 has been associated with negative outcomes in processes such as axonal regeneration, ischemia, and glioma. This suggests that modulating this channel could have beneficial effects, indicating that PIEZO1 represents a promising biomarker for the development of future therapies targeting brain diseases [69].
The RPL37 gene encodes a ribosomal protein that is a component of the 60S subunit and belongs to the L37E family of ribosomal proteins. It is related to pathways such as viral mRNA translation and the transcription and replication of influenza viral RNA [52]. RPL37 may function as an extraribosomal regulator of p53 through the MDM2/MDMX proteins, thereby modulating the cellular stress response in a specific manner and providing new insights into ribosomal functions [70].
PD is a multifactorial disease in which genetic and environmental factors are involved, and it is classified as sporadic or familial. In PD, the loss of dopaminergic neurons occurs mainly due to cellular stress. Lewy bodies are a hallmark of PD, but it is not yet clear whether they damage neurons or result from a protective response. Lewy bodies contain the protein ubiquitin, but the key components of ubiquitin-mediated protein regulation and the processes that control the reverse pathway are not yet known. In the study by Walden and Muqit, mutations in 20 new genes were reported in rare familial forms of PD; the proteins encoded by these genes are involved in various cellular pathways, for example, synaptic function and vesicle release (α-synuclein, Synaptojanin, and TMEM230) [71]. In our model, the TMEM104 gene is the most significant attribute, and its product may be a novel protein that could be part of the PD ubiquitin signaling system. The expression level of the SNAP25 gene could be used as a blood biomarker for predicting cognitive impairment in PD.
One of the contributions of this work is the identification of potential biomarkers in blood for the prediction of PD. Of the list of most significant genes, only the SNAP25 gene has been directly associated with PD while genes such as PIEZO1 and TRIM33 are considered relevant for study in the context of Parkinson’s disease. The TMEM104, GJB3, SPON2, TRAK2, SHPK, and RPL37 genes could be key in PD, as they are related to inflammation, oxidative stress, and mitochondrial mechanisms.
We are fully aware of the limitations imposed by the small size of the dataset. To obtain more reliable results, we believe a future task should be to generate synthetic datasets from our existing data. These datasets could be generated using Monte Carlo methods [72] or data augmentation techniques, such as the Synthetic Minority Over-Sampling Technique (SMOTE) [73]. This would provide more samples, which could improve the performance of the classifiers. We could also use other supervised learning techniques, such as ensemble learning or deep learning models.
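To make the proposed augmentation concrete, the following sketch shows the core idea of SMOTE: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. It is a simplified illustration in plain Python under assumptions: the function name, the parameter defaults, and the two-gene expression vectors are hypothetical and are not part of this study.

```python
import math
import random
from typing import List, Sequence


def smote_sketch(
    minority: Sequence[Sequence[float]],
    n_new: int,
    k: int = 5,
    seed: int = 0,
) -> List[List[float]]:
    """Generate n_new synthetic samples by interpolating between neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are required")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen sample (Euclidean).
        neighbours = sorted(
            (other for other in minority if other is not base),
            key=lambda other: math.dist(base, other),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Hypothetical two-gene expression vectors for an under-represented class.
minority_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote_sketch(minority_class, n_new=10, k=2)
print(len(new_samples), new_samples[0])
```

If such augmentation is used, synthetic samples should be generated from the training portion of each cross-validation fold only; otherwise, information from test samples leaks into training and accuracy estimates become optimistic.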
The integration of omics data (genomics, proteomics, and metabolomics) with machine learning has the potential to enhance our understanding of various neurodegenerative diseases and accelerate the discovery of novel biomarkers [
74,
75].
Our study focused on analyzing gene expression microarrays using machine learning. As future work, we aim to apply machine learning techniques to the analysis of high-dimensional multi-omics data. This approach will allow a more comprehensive exploration of the underlying biological mechanisms and may lead to the discovery of new biomarkers.