A Study of Gene Expression Levels of Parkinson’s Disease Using Machine Learning

Mestizo-Gutiérrez, Sonia Lilia; Jácome-Delgado, Joan Arturo; Cruz-Ramírez, Nicandro; Guerra-Hernández, Alejandro; Torres-Sosa, Jesús Alberto; Rosales-Morales, Viviana Yarel; Aranda-Abreu, Gonzalo Emiliano

doi:10.3390/biomedinformatics5040060


by Sonia Lilia Mestizo-Gutiérrez ¹, Joan Arturo Jácome-Delgado ², Nicandro Cruz-Ramírez ³, Alejandro Guerra-Hernández ³, Jesús Alberto Torres-Sosa ⁴, Viviana Yarel Rosales-Morales ⁵ and Gonzalo Emiliano Aranda-Abreu ⁶,*

¹ Facultad de Ciencias Químicas, Universidad Veracruzana, Xalapa 91000, Veracruz, Mexico

² Laboratorio Nacional de Informática Avanzada, Xalapa 91000, Veracruz, Mexico

³ Instituto de Investigaciones en Inteligencia Artificial, Universidad Veracruzana, Xalapa 91097, Veracruz, Mexico

⁴ Doctorado en Investigaciones Cerebrales, Universidad Veracruzana, Xalapa 91190, Veracruz, Mexico

⁵ Facultad de Estadística e Informática SECIHTI, Universidad Veracruzana, Xalapa 91020, Veracruz, Mexico

⁶ Instituto de Investigaciones Cerebrales, Universidad Veracruzana, Xalapa 91190, Veracruz, Mexico

* Author to whom correspondence should be addressed.

BioMedInformatics 2025, 5(4), 60; https://doi.org/10.3390/biomedinformatics5040060

Submission received: 10 September 2025 / Revised: 18 October 2025 / Accepted: 23 October 2025 / Published: 29 October 2025


Abstract

Parkinson’s disease (PD) is the second most common neurodegenerative disorder, characterized primarily by motor impairments due to the loss of dopaminergic neurons. Despite extensive research, the precise causes of PD remain unknown, and reliable non-invasive biomarkers are still lacking. This study aimed to explore gene expression profiles in peripheral blood to identify potential biomarkers for PD using machine learning approaches. We analyzed microarray-based gene expression data from 105 individuals (50 PD patients, 33 with other neurodegenerative diseases, and 22 healthy controls) obtained from the GEO database (GSE6613). Preprocessing was performed using the “affy” package in R with Expresso normalization. Feature selection and classification were conducted using a decision tree approach (C4.5/J48 algorithm in WEKA), and model performance was evaluated with 10-fold cross-validation. Additional classifiers such as Support Vector Machine (SVM), the Naive Bayes classifier and Multilayer Perceptron Neural Network (MLP) were used for comparison. ROC curve analysis and Gene Ontology (GO) enrichment analysis were applied to the selected genes. A nine-gene decision tree model (TMEM104, TRIM33, GJB3, SPON2, SNAP25, TRAK2, SHPK, PIEZO1, RPL37) achieved 86.71% accuracy, 88% sensitivity, and 87% specificity. The model significantly outperformed other classifiers (SVM, Naive Bayes, MLP) in terms of overall predictive accuracy. ROC analysis showed moderate discrimination for some genes (e.g., TRAK2, TRIM33, PIEZO1), and GO enrichment revealed associations with synaptic processes, inflammation, mitochondrial transport, and stress response pathways. Our decision tree model based on blood gene expression profiles effectively discriminates between PD, other neurodegenerative conditions, and healthy controls, offering a non-invasive method for potential early diagnosis. 
Notably, TMEM104, TRIM33, and SNAP25 emerged as promising candidate biomarkers, warranting further investigation in larger and synthetic datasets to validate their clinical relevance.

Keywords:

Parkinson’s disease; gene expression; microarray; machine learning; decision tree


1. Introduction

Parkinson’s disease (PD) is a neurodegenerative disorder primarily characterized by motor impairment resulting from the degeneration of dopaminergic neurons in the substantia nigra (SN) located in the midbrain [1,2]. It is considered the second most prevalent neurodegenerative disease, following Alzheimer’s disease [3]. In 2021, approximately 11.77 million people worldwide were affected by PD, with an incidence rate of 15.63 new cases per 100,000 people per year [4]. Epidemiological projections estimate that the number of cases will rise to approximately 1.93 million by 2030, with an age-standardized incidence rate expected to reach 27 cases per 100,000 people [5].

Age is considered a major risk factor for the development of PD, as the disease occurs more frequently in individuals over 60 years of age. A positive family history is also recognized as a risk factor, with more than 15% of patients having a familial background [6]. Furthermore, PD has a higher incidence in men than in women [7,8].

Although PD was first described more than 200 years ago by James Parkinson [9], its pathophysiology remains poorly understood. Both genetic and environmental factors have been implicated in the disease, but none has been identified as a definitive cause [10,11].

Microarrays have emerged as a powerful technological tool for studying gene expression, enabling the generation of large-scale datasets to support disease research and biomarker discovery [12,13]. They allow the simultaneous analysis of multiple genes, helping to address questions regarding genetic differences between healthy individuals and those with disease, as well as treatment responses [14]. Various statistical techniques have traditionally been used for microarray analysis [15,16,17]; however, machine learning approaches have recently led to significant improvements in data interpretation and disease prediction [18,19,20].

Machine learning has already been applied to microarray data to classify healthy individuals versus PD patients. For instance, the study by Lai et al. evaluated machine learning algorithms for PD diagnosis using a dataset of 1656 participants. Algorithms such as logistic regression (LASSO-LR), decision trees, random forest, XGBoost, support vector machine (SVM), and k-nearest neighbors were tested. The SVM model was the most accurate (84.40%), highlighting constipation, olfactory decline, and daytime sleepiness as the most relevant features [21]. In another study, Shamir et al. analyzed gene expression profiles using genetic networks and identified 87 genes related to metabolism, oxidation, and ubiquitination processes, as well as dysregulation of genes in the mitochondria [22]. Similarly, other researchers have employed logistic regression (LASSO) and SVM-RFE to identify PD-associated biomarkers, revealing nine key genes: NME7, PKM, RRM2, POLR3C, POLA1, PDE6C, PDE9A, PDE11A, and AMPD1 [23]. Liu et al. focused on identifying genes associated with immune infiltration in PD through microarray-based bioinformatics, identifying SLC18A2, CALB1, and SYNGR3 as candidate biomarkers [24].

Most prior studies compare healthy individuals and PD patients, often excluding neurological diseases that share similar gene expression patterns. In this study, we analyzed and modeled microarray data derived from peripheral blood samples classified into three groups: PD, other neurological diseases, and healthy controls. We applied machine learning techniques to identify novel gene expression patterns related to PD and to propose potential biomarkers for its diagnosis.

2. Materials and Methods

2.1. Decision Trees

A decision tree [25,26] is a set of rules for classifying data. It is made up of nodes, branches and leaves. The node from which classification starts is known as the root node; each internal node corresponds to a test on a particular attribute of the problem. The outgoing branches of each node represent the possible values of that attribute. Finally, the leaf nodes, or final nodes, represent a decision, which corresponds to one of the values of the class variable. With this representation, one follows a path from the root node to one of the leaves, which gives the answer to the problem.

Working with a decision tree involves two stages: tree induction and classification. In the first stage, the training data are used to build the tree: from these data, the attribute assigned to each node is established. In the second stage, each new object is classified with the tree already built; that is, the object is passed from the root node to one of the leaves to determine the class to which it belongs. The path followed within the tree determines this classification. There are several algorithms for generating decision trees, among them ID3 [27] and C4.5 [28].

The ID3 algorithm is recursive and is based on measures of entropy and information gain for the most appropriate selection of the attribute of each node. According to the values of the selected attribute, several subsets are created from the training dataset, which correspond to the outgoing branches of the node. The remaining attributes are evaluated on each subset and the nodes of the following levels of the tree are established. This process is followed recursively on each new node until there are no more attributes to evaluate, there are no more data whose conditions are directed to that node or all the data have the same class. If any of the above situations exist, that node becomes a leaf node. The formulas used by this algorithm are the ones shown below [29]:

H(X) = -\sum_{x \in X} p(x) \log_{2} p(x)

H(X) represents the entropy of a discrete random variable X. In our case, X is the class. The conditional entropy H(X|Y) is defined as follows:

H(X \mid Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_{2} p(x \mid y)

where Y represents each attribute. The information gain I(X; Y) is defined as follows:

I(X; Y) = H(X) - H(X \mid Y)

which is the reduction in the uncertainty of X once Y is known.

Used together, these formulas allow us to establish the root, intermediate and leaf nodes.
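The entropy and information gain definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the labels and the two attributes (`high_expr`, `noise`) are made-up toy data, the attributes are discrete, and the function names are hypothetical. C4.5/J48 additionally handles continuous attributes through thresholds, which this sketch does not attempt.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(X), in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(labels, attribute):
    """H(X | Y): entropy of the labels inside each attribute value, weighted by p(y)."""
    n = len(labels)
    groups = {}
    for x, y in zip(labels, attribute):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(labels, attribute):
    """I(X; Y) = H(X) - H(X | Y)."""
    return entropy(labels) - conditional_entropy(labels, attribute)


# Toy data (hypothetical): one attribute separates the classes perfectly,
# the other carries no information about the class.
labels = ["PD", "PD", "HC", "HC"]
high_expr = ["yes", "yes", "no", "no"]
noise = ["a", "b", "a", "b"]

gain_informative = information_gain(labels, high_expr)
gain_uninformative = information_gain(labels, noise)
```

An ID3-style inducer would place the attribute with the larger gain (here `high_expr`) at the current node and recurse on each resulting subset.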

The C4.5 algorithm was used, which is an extension of ID3 and allows us to work with continuous and unknown values, as well as simplify the decision rules and provide a greater generalization capacity. In addition, it provides models that are easy to interpret and understand thanks to its ability to select and classify attributes according to their relevance to predict a result [30].

2.2. Evaluation Method: k-Fold Cross-Validation

We followed the definition of the cross-validation method given by Kohavi [31]. In k-fold cross-validation, the dataset D is split into k mutually exclusive random subsets (the folds) D_1, D_2, …, D_k of approximately equal size. For each i ∈ {1, 2, …, k}, the classifier is trained on D\D_i and tested on D_i (the symbol ‘\’ denotes set difference). The cross-validation accuracy estimate is the total number of correct classifications divided by the sample size (the total number of instances in D). Thus, the k-fold cross-validation estimate is as follows:

acc_{cv} = \frac{1}{n} \sum_{(v_{i}, y_{i}) \in D} \delta(I(D \setminus D_{(i)}, v_{i}), y_{i})

where I(D \setminus D_{(i)}, v_{i}) is the label assigned to the unlabeled instance v_{i} by the inducer I trained on the set D \setminus D_{(i)}, D_{(i)} is the fold that contains v_{i}, y_{i} is the class of the instance v_{i}, n is the size of the complete dataset, and \delta(i, j) is a function where \delta(i, j) = 1 if i = j and 0 if i \neq j. In our experiments, the value we use for k is 10.
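As a rough illustration of the cross-validation estimate defined above, the following Python sketch counts correct predictions over all held-out folds and divides by n. The inducer used here is a hypothetical stand-in that always predicts the majority class of its training set; it is not J48, and the instance features are left empty because only the class sizes of the dataset matter for this toy.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k mutually exclusive folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def majority_inducer(train):
    """Hypothetical stand-in inducer: always predicts the most frequent training class."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda v: majority


def cross_val_accuracy(data, inducer, k=10, seed=0):
    """acc_cv: correct classifications over all held-out folds, divided by n."""
    correct = 0
    for fold in k_fold_indices(len(data), k, seed):
        held_out = set(fold)
        train = [d for i, d in enumerate(data) if i not in held_out]
        classify = inducer(train)  # trained on D \ D_i
        correct += sum(classify(data[i][0]) == data[i][1] for i in fold)  # tested on D_i
    return correct / len(data)


# Class sizes taken from the dataset description; features omitted (None).
data = [(None, "PD")] * 50 + [(None, "ND")] * 33 + [(None, "HC")] * 22
baseline_accuracy = cross_val_accuracy(data, majority_inducer, k=10)
```

With these class sizes the majority class is PD in every training split, so the sketch returns the majority-class baseline of 50/105, a useful reference point when reading the accuracies reported for the real classifiers.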

The performance of the classifiers was evaluated according to the following [32]:

(a) Sensitivity: the ability to correctly identify patients with PD, where t_pos is the number of true positives and pos is the total number of positive samples.

\mathrm{sensitivity} = \frac{t_{pos}}{pos}

(b) Specificity: the ability to correctly identify those patients who do not have PD, where t_neg is the number of true negatives and neg is the total number of negative samples.

\mathrm{specificity} = \frac{t_{neg}}{neg}

(c) Accuracy: the number of correct classifications divided by the size of the test set.

\mathrm{accuracy} = \mathrm{sensitivity} \cdot \frac{pos}{pos + neg} + \mathrm{specificity} \cdot \frac{neg}{pos + neg}

(d) Predictive value for a positive result (PV+): PV+ asks “If the test result is positive, what is the probability that the patient actually has the disease?”

PV^{+} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}

(e) Predictive value for a negative result (PV−): PV− asks “If the test result is negative, what is the probability that the patient does not have the disease?”

PV^{-} = \frac{\text{true negatives}}{\text{true negatives} + \text{false negatives}}
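The five measures above can be computed directly from the four cells of a binary confusion matrix. The sketch below is a minimal illustration; the function name and the example counts are hypothetical and are not taken from the study's tables.

```python
def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, PV+ and PV- from confusion-matrix counts."""
    pos, neg = tp + fn, tn + fp  # total positive and negative samples
    sensitivity = tp / pos
    specificity = tn / neg
    # Accuracy written as the weighted sum of sensitivity and specificity;
    # algebraically this equals (tp + tn) / (pos + neg).
    accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "pv_pos": tp / (tp + fp),
        "pv_neg": tn / (tn + fn),
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=8, fn=2, tn=15, fp=5)
```

Writing accuracy as a weighted sum makes explicit that a classifier with high sensitivity can still have low accuracy when the negative class is large and specificity is poor.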

2.3. Implementation Details

2.3.1. Dataset

The dataset used in this work is available from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) [33] under accession number GSE6613. It consists of peripheral blood gene expression profiles of 105 individuals: 50 with Parkinson’s disease, 33 with neurodegenerative diseases other than Parkinson’s disease, and 22 healthy controls, each with 22,283 genes.

2.3.2. Data Preprocessing

An exploratory analysis was conducted using a boxplot before applying normalization techniques. This boxplot shows the intensity levels of each unnormalized experiment. Figure 1 illustrates this. The y-axis indicates the intensity level of each sample, and the x-axis represents each sample in the dataset. Each box represents the interquartile range of a sample. The black line indicates the median of a sample, and the dotted line represents the range of intensity values for a sample. Figure 1 shows that the samples have different intensity distributions, indicating the need for normalization.

This study used the Expresso normalization technique from the “affy” package in R to preprocess the data used to build the model. The parameters used are shown in Table 1.

For normalization, the “loess” parameter was applied, which uses a nonlinear regression method that adjusts for and eliminates systematic biases in chip intensity data, facilitating comparison between chips by placing them on the same scale [34]. For probe-specific correction, the “subtractmm” parameter was used, referring to the Affymetrix MAS 4.0 method [35], which corrects perfect match (PM) intensities by subtracting the mismatch (MM) probe signal, thereby producing more accurate expression values. Lastly, the summarization parameter was set to “liwong”, which applies a technique used in Affymetrix microarray analysis to estimate gene expression levels by calculating a weighted average of probe intensities. This approach enhances the accuracy and robustness of gene expression values derived from raw probe data [36,37]. Figure 2 shows the results of the normalization process using the Expresso technique.

2.3.3. Gene Selection

For gene selection, we used decision trees, since they can also determine which attributes in a dataset are the most significant for solving a problem [30,38].

2.3.4. WEKA

We used Waikato Environment for Knowledge Analysis (WEKA 3.8.6, University of Waikato, Hamilton, New Zealand) software to generate the decision tree model with the J48 algorithm, the Naive Bayes classifier, the Support Vector Machine (SVM) and the Multilayer Perceptron Neural Network (MLP) with 10-fold cross-validation [39].

3. Results

We obtained a model with nine genes, which, according to J48, are the most significant (see Figure 3). The obtained genes allowed the generation of a decision tree that correctly classified 92 (86.71%) of 105 samples: 44/50 of PD, 30/33 of neurological disease control and 18/22 of healthy control.

In this model, TMEM104 is the most significant attribute, since it is the root node. The tree is a chain of threshold tests on expression levels, evaluated in order; a sample is classified at the first test that yields a class and otherwise moves on to the next gene:

(a) TMEM104: if ≤34.384992, the sample is classified as neurological disease; if >34.384992, TRIM33 is checked.

(b) TRIM33: if >99.453834, neurological disease; if ≤99.453834, GJB3 is checked.

(c) GJB3: if >61.613954, neurological disease; if ≤61.613954, SPON2 is checked.

(d) SPON2: if >271.096149, neurological disease; if ≤271.096149, SNAP25 is checked.

(e) SNAP25: if >8.6876, neurological disease; if ≤8.6876, TRAK2 is checked.

(f) TRAK2: if >82.777175, healthy; if ≤82.777175, SHPK is checked.

(g) SHPK: if >−4.234751, healthy; if ≤−4.234751, PIEZO1 is checked.

(h) PIEZO1: if >87.643492, healthy; if ≤87.643492, RPL37 is checked.

(i) RPL37: if ≤708.881411, healthy; if >708.881411, PD.

Thus, the TMEM104, TRIM33, GJB3, SPON2 and SNAP25 genes separate patients with neurological diseases from PD patients and healthy individuals; the TRAK2, SHPK and PIEZO1 genes identify healthy individuals; and the RPL37 gene distinguishes between PD patients and healthy individuals.
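The rule chain described above can be written as an ordered list of threshold tests. The sketch below is only a transcription of the cut-offs in the text into Python, not the WEKA model itself; the function and variable names are hypothetical, the TRAK2 cut-off is taken as 82.777175 (an assumption about the intended value), and the example expression values are invented.

```python
# Each rule: (gene, cut-off, comparison that ends the chain, class assigned).
# If the comparison is false, the next gene in the list is checked.
RULES = [
    ("TMEM104", 34.384992, "<=", "neurological disease"),
    ("TRIM33", 99.453834, ">", "neurological disease"),
    ("GJB3", 61.613954, ">", "neurological disease"),
    ("SPON2", 271.096149, ">", "neurological disease"),
    ("SNAP25", 8.6876, ">", "neurological disease"),
    ("TRAK2", 82.777175, ">", "healthy"),  # assumed cut-off, see lead-in
    ("SHPK", -4.234751, ">", "healthy"),
    ("PIEZO1", 87.643492, ">", "healthy"),
    ("RPL37", 708.881411, "<=", "healthy"),
]


def classify(sample):
    """Walk the rule chain; a sample that passes every test is labelled PD."""
    for gene, cut, op, label in RULES:
        value = sample[gene]
        fires = value <= cut if op == "<=" else value > cut
        if fires:
            return label
    return "PD"


# Invented expression values that fall through every test.
example = {
    "TMEM104": 40.0, "TRIM33": 50.0, "GJB3": 10.0, "SPON2": 100.0,
    "SNAP25": 5.0, "TRAK2": 50.0, "SHPK": -10.0, "PIEZO1": 50.0,
    "RPL37": 800.0,
}
predicted = classify(example)
```

Written this way, it is easy to see that a PD label is only reached when all nine tests fail to assign another class, which is why every gene in the chain contributes to the PD decision.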

To put the performance of J48 in context, Table 2 shows it alongside that of the Naive Bayes classifier, the SVM and the Multilayer Perceptron (MLP). In this table, an asterisk means that the corresponding classifier performs significantly worse than the decision tree according to a t-test. Standard deviations are shown in parentheses. The average classification accuracy for these experiments is also reported.

Classifier performance was statistically compared using the paired corrected t-test implemented in WEKA (significance level = 0.05, two-tailed). The J48 classifier (86.71 ± 11.66%) significantly outperformed Naive Bayes (73.19 ± 10.42%), SVM (45.56 ± 14.11%), and MLP (58.10 ± 11.78%), with p < 0.05 for all comparisons. The 95% confidence intervals for the mean accuracy differences were 10.2–14.6 (J48 vs. Naive Bayes), 36.0–43.8 (J48 vs. SVM), and 25.4–33.2 (J48 vs. MLP). These results confirm that the decision tree model achieved statistically superior predictive performance compared to the other classifiers.
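For readers unfamiliar with the corrected test, the sketch below shows the variance correction usually attributed to Nadeau and Bengio, which is the idea behind corrected resampled t-tests: the 1/k term of the ordinary paired t-statistic is inflated by the test-to-train size ratio because the resampled training sets overlap. This is a hedged illustration, not WEKA's code; the function names and the accuracy differences are hypothetical, and the p-value lookup is omitted.

```python
from math import sqrt
from statistics import mean, variance


def standard_paired_t(diffs):
    """Ordinary paired t-statistic over k per-run accuracy differences."""
    k = len(diffs)
    return mean(diffs) / sqrt(variance(diffs) / k)


def corrected_resampled_t(diffs, n_train, n_test):
    """Paired t-statistic with the variance term inflated by n_test / n_train."""
    k = len(diffs)
    return mean(diffs) / sqrt((1 / k + n_test / n_train) * variance(diffs))


# Hypothetical accuracy differences between two classifiers over four runs,
# with a 9:1 train/test split as in 10-fold cross-validation.
diffs = [0.1, 0.2, 0.3, 0.2]
t_standard = standard_paired_t(diffs)
t_corrected = corrected_resampled_t(diffs, n_train=9, n_test=1)
```

The corrected statistic is always smaller in magnitude than the ordinary one, so it is the more conservative of the two when deciding whether one classifier is significantly better.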

Tables 3–6 give a more detailed view of the performances shown in Table 2: for each classifier, the class-wise accuracy is given by its confusion matrix. We used the corresponding implementations in WEKA with default parameters. Moreover, Table 7 shows the sensitivity, specificity, PV+ and PV− of each classifier.

As can be seen from Table 7, all classifiers perform well in sensitivity (detection of cases with the disease), but only the decision tree performs well in the remaining measures. The size of this difference is surprising, since it is unusual for one classifier to outperform the others by such a wide margin.

After applying machine learning, nine candidate genes for Parkinson’s disease were obtained: TMEM104, TRIM33, GJB3, SPON2, SNAP25, TRAK2, SHPK, PIEZO1, and RPL37. We performed ROC (receiver operating characteristic) analyses using the pROC package (v1.18.5) in R (v4.3.3) to evaluate the ability of individual genes to distinguish between pairs of classes. An ROC curve illustrates the discriminatory performance of a binary classification method for ordinal outcomes, which may be either continuous or discrete, by plotting sensitivity (the proportion of positives correctly identified) against 1 − specificity (the proportion of negatives incorrectly classified as positive) as the decision threshold is systematically varied. In this context, the area under the curve (AUC) provides a quantitative measure of classifier performance, where higher values denote superior class discrimination [40,41].
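The AUC of a single gene can be read as the probability that a randomly chosen positive sample has a higher expression value than a randomly chosen negative one, with ties counted as one half. The Python sketch below computes it that way; it is a stand-in for illustration rather than the pROC routine used in the study, and the expression values are invented.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


# Invented expression values for one gene in two groups.
pd_values = [3.1, 4.0, 2.2, 5.3]
control_values = [1.9, 2.5, 3.0]
gene_auc = auc(pd_values, control_values)
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation, which is the scale on which the per-gene values reported below should be read.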

Figure 4 shows the ROC curves. In Figure 4a (Parkinson’s Disease and Healthy Control), TRIM33 (AUC = 0.6), TRAK2 (AUC = 0.72) and PIEZO1 (AUC = 0.62) show some ability to separate cases of Parkinson’s disease from healthy controls, whereas TMEM104, GJB3, SPON2, SNAP25, SHPK and RPL37 have a limited ability to discriminate between these two classes. In Figure 4b (Neurological Disease and Healthy Control), TMEM104 (AUC = 0.64), TRAK2 (AUC = 0.6) and SHPK (AUC = 0.6) may be able to discriminate between neurological disease and healthy controls, whereas TRIM33, GJB3, SPON2, SNAP25, PIEZO1 and RPL37 exhibit limited discriminative power between the two classes.

Finally, a Gene Ontology (GO) pathway enrichment analysis was conducted on the nine highest-ranked genes (TMEM104, TRIM33, GJB3, SPON2, SNAP25, TRAK2, SHPK, PIEZO1 and RPL37), to explore signaling pathways [42]. The enrichment results were considered statistically significant when the adjusted p-value < 0.1 [43]. The results indicated that these expressed genes are primarily involved in biological processes related to synaptic vesicle docking (SNAP25), dense core granule exocytosis (SNAP25), the cellular response to lipopolysaccharide (SPON2/SHPK), dendritic transport (TRAK2), and other signaling pathways. Figure 5 shows the ten GO terms with the strongest enrichment. These terms were identified through over-representation analysis using the Gene Ontology database and org.Hs.eg.db annotations [44].
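Over-representation analysis is commonly scored with a one-sided hypergeometric test: given a background of annotated genes, it asks how likely it is to see at least the observed number of query genes carrying a GO term by chance. The source does not state which test or settings were used, so the sketch below is an assumption about the general technique, with a hypothetical function name and toy counts; multiple-testing adjustment of the resulting p-values is not shown.

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """Upper-tail hypergeometric probability P(X >= k).

    N: annotated background genes
    K: background genes carrying the GO term
    n: genes in the query list
    k: query genes carrying the GO term
    """
    upper = min(K, n)
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    ) / comb(N, n)


# Toy counts (hypothetical): 20 background genes, 5 with the term,
# a query list of 4 genes of which 1 carries the term.
p_toy = enrichment_p_value(N=20, K=5, n=4, k=1)
```

With a short query list, such as nine genes, a single annotated gene can yield a small p-value for a rare term, which is worth keeping in mind when interpreting terms supported by one gene.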

Table 8 presents the GO terms of the main genes, providing detailed information on their involvement in biological processes. It also includes the corresponding adjusted p-value (p-adjust), p-value, gene names, and log-transformed adjusted p-value. See the complete table in Supplementary Materials Table S1.

4. Discussion

In this work, we evaluated four classifiers (a decision tree, an SVM, Naive Bayes and an MLP) on PD gene expression levels. Our decision tree used gene expression profiles to identify genes that may be potential biomarkers for PD. From a total of 22,283 genes, we narrowed the set down to 9 genes (TMEM104, TRIM33, GJB3, SPON2, SNAP25, TRAK2, SHPK, PIEZO1 and RPL37) that allowed us to classify individuals belonging to three different classes. Each gene participates in different processes within the human body.

The TMEM104 gene encodes a transmembrane protein. It belongs to the transmembrane protein (TMEM) family, whose members are components of the cell membrane, lysosomes and the mitochondrial membrane, and it is associated with some types of cancer. However, the functional roles of most TMEMs have not yet been fully characterized [45]. Currently, few studies have explored a possible relationship between this gene and Parkinson’s disease. Other members of the TMEM family, such as TMEM59, TMEM230, and TMEM108, have been reported to have significant associations with PD [46]. The work of Liu also identifies TMEM108 as having a possible role in cognitive progression and suggests that it may influence PD [47].

The TRIM33 gene encodes a protein that functions as a transcriptional corepressor. It is thought to promote the ubiquitination of SMAD4 (a tumor suppressor), its nuclear exclusion and its degradation via the ubiquitin–proteasome pathway. It is associated with two types of thyroid cancer: differentiated and non-medullary [48]. TRIM33 is involved in the activation of genes that participate in the inflammatory response of mature myeloid cells, and it also interacts with and ubiquitinates DHX33. In addition, it contributes to the assembly of the DHX33-NLRP3 inflammasome complex [49]. Our results suggest that TRIM33 may participate indirectly by promoting NLRP3 activity, because eliminating TRIM33 in human macrophages appears to block NLRP3 inflammasome activation [50]. This gene contributes to a neuroinflammatory environment that promotes damage to dopaminergic neurons by stimulating the release of proinflammatory cytokines, such as IL-1β and IL-18, thereby contributing to the progression of PD [51].

The GJB3 gene encodes a protein that is part of the gap junctions, which are made up of arrays of intercellular channels that provide a pathway for the diffusion of low molecular weight materials from cell to cell. It is related to diseases such as autosomal dominant deafness and erythrokeratodermia variabilis, a skin disorder [52]. To date, few studies have investigated the potential association between this gene and Parkinson’s disease. However, overexpression of the GJA1 gene, which belongs to the same family, has been found to be associated with neurodegenerative disorders, including PD [53]. According to Kawasaki et al., elevated GJA1 expression and activity in glial cells promotes the transmission of stress and inflammatory signals through gap junctions, which may underlie neuronal damage, an important factor associated with Parkinson’s disease [54].

The SPON2 gene encodes a protein that promotes the adhesion and growth of embryonic neurons in the hippocampus. It is considered essential at the onset of the innate immune response and represents a unique pattern recognition molecule in the extracellular matrix for microbial pathogens [48]. To our knowledge, no studies have reported a direct or indirect link between this gene and Parkinson’s disease. This gene is critically involved in the recruitment of macrophages and neutrophils during inflammatory processes. Overexpression of SPON2 has been shown to promote tumor cell migration in colorectal cancer [55]. It has also been shown that this gene not only promotes M1-type macrophage infiltration, but also inhibits tumor metastasis, making it a critical factor in mediating the immune response against tumor cell growth and migration in hepatocellular carcinoma [56].

The SNAP25 gene encodes a t-SNARE protein involved in the molecular regulation of neurotransmitter release. It is considered to be of great importance in the synaptic function of specific neuronal systems. It is related to proteins involved in vesicle docking and membrane fusion. It is also associated with congenital myasthenic syndrome [48]. The study by Agliardi et al. reinforces the importance of this gene, as it reports that the concentration of SNAP25 increases in the cerebrospinal fluid of PD patients and that this increase is related to the severity of cognitive and motor symptoms. Furthermore, post-mortem analyses of the brain indicated that presynaptic (including SNAP25) and postsynaptic proteins are depleted in PD dementia [57]. Another study reported a twofold reduction in SNARE complex assembly in brain homogenates from the cerebral cortex of patients with PD [58]. SNAP25 has been shown to serve reliably as a marker of synaptic injury when measured in both cerebrospinal fluid and plasma [59]. SNAP25 protein is increased in the cerebrospinal fluid of patients with Parkinson’s disease, and high concentrations of this protein could induce synaptic deterioration specific to PD [60]. SNAP25 has been identified as a neuropathological marker in neurological diseases and has been proposed as a potential biomarker for the diagnosis and treatment of Alzheimer’s disease, Parkinson’s disease [61] and Creutzfeldt-Jakob disease [62]. It has also been shown that inhibiting TNFAIP1-mediated SNAP25 degradation could be a therapeutic approach to mitigate postoperative cognitive impairment [63].

The TRAK2 gene encodes a protein that is thought to regulate the endosome-to-lysosome trafficking of membrane cargo, including the epidermal growth factor receptor (EGFR). It is associated with amyotrophic lateral sclerosis, which causes progressive muscle paralysis [48]. Mitochondria are transported along microtubules by opposing kinesin and dynein motors. Kinesin-1 and dynein–dynactin are anchored to mitochondria through TRAK proteins [64]. TRAK2 has an important role in mitochondrial transport in neurons, and in young, maturing neurons this gene has been observed to contribute similarly to mitochondrial transport in both axons and dendrites [65]. PD is closely associated with mitochondrial dysfunction and oxidative stress. Mitochondria are the primary source of reactive oxygen species (ROS) and are essential to the antioxidant defense system [66].

The SHPK gene is a protein-encoding gene. It acts as a modulator of macrophage activation by controlling glucose metabolism. It is associated with sedoheptulose kinase deficiency and cystinosis [48,52]. A correlation has been found with glioblastoma; increased expression of SHPK is associated with a worse prognosis [67]. To date, no direct or indirect evidence has been found linking this gene to PD.

The PIEZO1 gene encodes a protein that functions as a mechanically activated ion channel, linking mechanical forces to biological signals. It is essential for the formation of blood vessels and the regulation of vascular architecture in adult physiology. It is associated with dehydrated hereditary stomatocytosis and lymphedema [48,52]. An analysis of genetic variants found that PIEZO1 could contribute to susceptibility to PD by modulating calcium-dependent signaling pathways and oxidative stress, thus sharing a genetic architecture with melanoma [68]. PIEZO1 has been associated with negative outcomes in processes such as axonal regeneration, ischemia, and glioma. This suggests that modulation of this receptor could generate beneficial effects, indicating that PIEZO1 represents a promising biomarker for the development of future therapies targeting brain diseases [69].

The RPL37 gene encodes a ribosomal protein that is part of the 60S subunit. It belongs to the L37E family of ribosomal proteins. It is related to viral mRNA translation and transcription and replication of viral RNA of influenza [52]. RPL37 may function as an extraribosomal regulator of p53 through MDM2/MDMX proteins, thereby modulating the cellular stress response in a specific manner and providing new insights into ribosomal functions [70].

PD is a multifactorial disease in which genetic and environmental factors are involved. PD is classified as sporadic or familial. In PD, loss of dopaminergic neurons occurs mainly due to cellular stress. Lewy bodies have been found to be a feature of PD, but it is not yet clear whether they are damaging to neurons or the result of a protective response. Lewy bodies contain a protein called ubiquitin, but the key components of ubiquitin mediation in protein regulation and the processes that control the reverse pathway are not yet known. The study by Walden and Muqit reported mutations in 20 new genes in the rare familial forms of PD; the proteins encoded by these genes are involved in various cellular pathways, for example, synaptic function and vesicle release (α-synuclein, Synaptojanin, and TMEM230) [71]. In our model, the TMEM104 gene is the most significant attribute, and its product may be a novel protein that could be part of the PD ubiquitin signaling system. The level of expression of the SNAP25 gene could be used as a biomarker in blood for predicting cognitive impairment in PD.

One of the contributions of this work is the identification of potential biomarkers in blood for the prediction of PD. Of the list of most significant genes, only the SNAP25 gene has been directly associated with PD while genes such as PIEZO1 and TRIM33 are considered relevant for study in the context of Parkinson’s disease. The TMEM104, GJB3, SPON2, TRAK2, SHPK, and RPL37 genes could be key in PD, as they are related to inflammation, oxidative stress, and mitochondrial mechanisms.

We are fully aware of the limitations imposed by the small size of the dataset. To obtain more reliable results, we believe a future task should be to generate synthetic datasets from our existing data. These could be generated using Monte Carlo methods [72] or data augmentation techniques such as the Synthetic Minority Over-Sampling Technique (SMOTE) [73]. This would provide more samples and could improve the performance of the classifiers. We could also use other supervised learning techniques, such as ensemble learning or deep learning models.
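
As an illustration of the oversampling idea mentioned above, the following is a minimal sketch of SMOTE-style interpolation, assuming plain Euclidean distance and a small neighbourhood size. The function name `smote_like`, the parameter `k`, and the expression values are hypothetical and are not taken from the study; a real analysis would more likely rely on an established implementation.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority-class
    row and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by Euclidean distance to `base`.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical expression values for three genes in four control samples.
controls = [[7.1, 5.2, 9.0], [6.8, 5.5, 8.7], [7.4, 5.0, 9.3], [7.0, 5.3, 8.9]]
new_rows = smote_like(controls, n_new=5)
```

Each synthetic row lies on a line segment between two existing samples of the same class, so no value falls outside the observed range of that class. If such oversampling were combined with cross-validation, it would usually be applied to the training folds only, so that synthetic copies of a test sample do not leak into training.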

The integration of omics data (genetics, proteomics, and metabolomics) with machine learning has the potential to enhance our comprehension of various neurodegenerative diseases and accelerate the discovery of novel biomarkers [74,75].

Our study focused on analyzing gene expression microarrays using machine learning. As future work, we aim to integrate machine learning techniques for the analysis of high-dimensional multi-omics data. This approach would allow a more comprehensive exploration of the underlying biological mechanisms and could lead to the discovery of new biomarkers.

5. Conclusions

Our decision tree model outperformed the other classification techniques evaluated (SVM, Naive Bayes, and MLP). The tree models gene expression levels from non-invasive samples (peripheral blood) in Parkinson’s disease, which makes it possible to generate predictive models that are more accessible to the population, given that a blood sample is low cost and minimally invasive. TMEM104, TRIM33, GJB3, SPON2, SNAP25, TRAK2, SHPK, PIEZO1, and RPL37 could therefore be promising biomarkers for the early detection of Parkinson’s disease.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedinformatics5040060/s1, Table S1: The terms significantly enriched.

Author Contributions

S.L.M.-G. contributed to the conception and design of the study, the analysis and discussion of results, and approval of the article. G.E.A.-A. contributed to the analysis and discussion of results and approval of the article. J.A.J.-D. studied the decision trees and the preprocessing methods. N.C.-R. contributed to the analysis of results and approval of the article. A.G.-H. contributed to the analysis of results and approval of the article. J.A.T.-S. contributed to the discussion and editing of the article. V.Y.R.-M. contributed to the analysis of results and editing of the article. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in the Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (NCBI) at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6613 (accessed on 22 August 2025).

Acknowledgments

The authors gratefully acknowledge the computer resources, technical expertise, and support provided by the National Supercomputing Laboratory of Southeastern Mexico and the Ministry of Science, Humanities, Technology and Innovation.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Rietdijk, C.D.; Perez-Pardo, P.; Garssen, J.; Van Wezel, R.J.A.; Kraneveld, A.D. Exploring Braak’s Hypothesis of Parkinson’s Disease. Front. Neurol. 2017, 8, 37.
2. Kulcsarova, K.; Skorvanek, M.; Postuma, R.B.; Berg, D. Defining Parkinson’s Disease: Past and Future. J. Park. Dis. 2024, 14, S257–S271.
3. Tolosa, E.; Garrido, A.; Scholz, S.W.; Poewe, W. Challenges in the Diagnosis of Parkinson’s Disease. Lancet Neurol. 2021, 20, 385–397.
4. Luo, Y.; Qiao, L.; Li, M.; Wen, X.; Zhang, W.; Li, X. Global, Regional, National Epidemiology and Trends of Parkinson’s Disease from 1990 to 2021: Findings from the Global Burden of Disease Study 2021. Front. Aging Neurosci. 2025, 16, 1498756.
5. Xu, L.; Wang, Z.; Li, Q. Global Trends and Projections of Parkinson’s Disease Incidence: A 30-Year Analysis Using GBD 2021 Data. J. Neurol. 2025, 272, 286.
6. Diagnóstico y Tratamiento de La Enfermedad de Parkinson Inicial y Avanzada en el Tercer Nivel de Atención; México Secretaría de Salud: Mexico City, Mexico, 2010.
7. Castro Toro, A.; Freddy Buriticá, O. Enfermedad de Parkinson: Criterios Diagnósticos, Factores de Riesgo y de Progresión, y Escalas de Valoración Del Estadio Clínico. Acta Neurol. Colomb. 2014, 30, 300–306.
8. Cerri, S.; Mus, L.; Blandini, F. Parkinson’s Disease in Women and Men: What’s the Difference? J. Park. Dis. 2019, 9, 501–515.
9. Parkinson, J. An Essay on the Shaking Palsy. J. Neuropsychiatry Clin. Neurosci. 2002, 14, 223–236.
10. Bellou, V.; Belbasis, L.; Tzoulaki, I.; Evangelou, E.; Ioannidis, J.P.A. Environmental Risk Factors and Parkinson’s Disease: An Umbrella Review of Meta-Analyses. Park. Relat. Disord. 2016, 23, 1–9.
11. Gómez-Chavarín, M.; Torres-Ortiz, M.C.; Perez-Soto, G. Interacción Entre Factores Genéticos-Ambientales y La Epigénesis de La Enfermedad de Parkinson. Arch. Neurocien. 2016, 21, 32–44.
12. Pashaei, E.; Ozen, M.; Aydin, N. Biomarker Discovery Based on BBHA and AdaboostM1 on Microarray Data for Cancer Classification. In Proceedings of the 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Orlando, FL, USA, 16–20 August 2016; IEEE: New York, NY, USA, 2016; pp. 3080–3083.
13. Wu, M.-Y.; Dai, D.-Q.; Shi, Y.; Yan, H.; Zhang, X.-F. Biomarker Identification and Cancer Classification Based on Microarray Data Using Laplace Naive Bayes Model with Mean Shrinkage. IEEE/ACM Trans. Comput. Biol. Bioinform. 2012, 9, 1649–1662.
14. Dang, K.; Zhang, W.; Jiang, S.; Lin, X.; Qian, A. Application of Lectin Microarrays for Biomarker Discovery. ChemistryOpen 2020, 9, 285–300.
15. Kumar, G.; Lahiri, T.; Kumar, R. Statistical Discrimination of Breast Cancer Microarray Data. In Proceedings of the 2016 International Conference on Bioinformatics and Systems Biology (BSB), Allahabad, India, 21–23 March 2016; IEEE: New York, NY, USA, 2016; pp. 1–4.
16. Shashirekha, H.L.; Wani, A.H. A Comparative Study of Statistical and Clustering Techniques Based Meta-Analysis to Identify Differentially Expressed Genes. In Proceedings of the 2016 IEEE International Conference on Advances in Computer Applications (ICACA), Coimbatore, India, 24 October 2016; IEEE: New York, NY, USA, 2016; pp. 87–93.
17. Sheela, T.; Rangarajan, L. Statistical Class Prediction Method for Efficient Microarray Gene Expression Data Sample Classification. In Proceedings of the 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Udupi, India, 13–16 September 2017; IEEE: New York, NY, USA, 2017; pp. 73–78.
18. Anakal, S.; Sandhya, P. Clinical Decision Support System for Chronic Obstructive Pulmonary Disease Using Machine Learning Techniques. In Proceedings of the 2017 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT), Mysuru, India, 15–16 December 2017; IEEE: New York, NY, USA, 2017; pp. 1–5.
19. Poreva, A.; Karplyuk, Y.; Vaityshyn, V. Machine Learning Techniques Application for Lung Diseases Diagnosis. In Proceedings of the 2017 5th IEEE Workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE), Riga, Latvia, 24–25 November 2017; IEEE: New York, NY, USA, 2017; pp. 1–5.
20. Raut, A.; Dalal, V. A Machine Learning Based Approach for Detection of Alzheimer’s Disease Using Analysis of Hippocampus Region from MRI Scan. In Proceedings of the 2017 International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 18–19 July 2017; IEEE: New York, NY, USA, 2017; pp. 236–242.
21. Lai, H.; Li, X.-Y.; Xu, F.; Zhu, J.; Li, X.; Song, Y.; Wang, X.; Wang, Z.; Wang, C. Applications of Machine Learning to Diagnosis of Parkinson’s Disease. Brain Sci. 2023, 13, 1546.
22. Shamir, R.; Klein, C.; Amar, D.; Vollstedt, E.-J.; Bonin, M.; Usenovic, M.; Wong, Y.C.; Maver, A.; Poths, S.; Safer, H.; et al. Analysis of Blood-Based Gene Expression in Idiopathic Parkinson Disease. Neurology 2017, 89, 1676–1683.
23. Wang, Y.; Wu, D.; Zheng, M.; Yang, T. An Integrated Bioinformatics and Machine Learning Approach to Identifying Biomarkers Connecting Parkinson’s Disease with Purine Metabolism-Related Genes. BMC Neurol. 2025, 25, 161.
24. Liu, S.-H.; Wang, Y.-L.; Jiang, S.-M.; Wan, X.-J.; Yan, J.-H.; Liu, C.-F. Identifying the Hub Gene and Immune Infiltration of Parkinson’s Disease Using Bioinformatical Methods. Brain Res. 2022, 1785, 147879.
25. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees, 1st ed.; Routledge: Abingdon, UK, 2017; ISBN 978-1-315-13947-0.
26. Rokach, L.; Maimon, O. Data Mining with Decision Trees: Theory and Applications, 2nd ed.; Series in Machine Perception and Artificial Intelligence; World Scientific: Singapore, 2014; Volume 81, ISBN 978-981-4590-07-5.
27. Quinlan, J.R. Induction of Decision Trees. Mach. Learn. 1986, 1, 81–106.
28. Salzberg, S.L. C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Mach. Learn. 1994, 16, 235–240.
29. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 1st ed.; Wiley: Hoboken, NJ, USA, 2001; ISBN 978-0-471-06259-2.
30. Geurts, P.; Irrthum, A.; Wehenkel, L. Supervised Learning with Decision Tree-Based Methods in Computational and Systems Biology. Mol. Biosyst. 2009, 5, 1593.
31. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the IJCAI’95: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 20–25 August 1995; Volume 2, pp. 1137–1143.
32. Han, J.; Kamber, M. Data Mining: Concepts and Techniques, 3rd ed.; Elsevier: Burlington, MA, USA, 2012; ISBN 978-0-12-381479-1.
33. Scherzer, C.R.; Eklund, A.C.; Morse, L.J.; Liao, Z.; Locascio, J.J.; Fefer, D.; Schwarzschild, M.A.; Schlossmacher, M.G.; Hauser, M.A.; Vance, J.M.; et al. Molecular Markers of Early Parkinson’s Disease Based on Gene Expression in Blood. Proc. Natl. Acad. Sci. USA 2007, 104, 955–960.
34. Bolstad, B.M.; Irizarry, R.A.; Åstrand, M.; Speed, T.P. A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias. Bioinformatics 2003, 19, 185–193.
35. Affymetrix, I. Affymetrix Microarray Suite User Guide; Thermo Fisher Scientific: Santa Clara, CA, USA, 2000; pp. 295–316.
36. Li, C.; Wong, W.H. Model-Based Analysis of Oligonucleotide Arrays: Expression Index Computation and Outlier Detection. Proc. Natl. Acad. Sci. USA 2001, 98, 31–36.
37. Li, C.; Hung Wong, W. Model-Based Analysis of Oligonucleotide Arrays: Model Validation, Design Issues and Standard Error Application. Genome Biol. 2001, 2, research0032.1.
38. Awad, M.; Fraihat, S. Recursive Feature Elimination with Cross-Validation with Decision Tree: Feature Selection Method for Machine Learning-Based Intrusion Detection Systems. J. Sens. Actuator Netw. 2023, 12, 67.
39. Waikato Environment for Knowledge Analysis the Weka Workbench. Available online: https://ml.cms.waikato.ac.nz/weka/index.html (accessed on 12 January 2025).
40. Robin, X.; Turck, N.; Hainard, A.; Tiberti, N.; Lisacek, F.; Sanchez, J.-C.; Müller, M. pROC: An Open-Source Package for R and S+ to Analyze and Compare ROC Curves. BMC Bioinform. 2011, 12, 77.
41. Yang, Z.; Xu, Q.; Bao, S.; Wen, P.; He, Y.; Cao, X.; Huang, Q. AUC-Oriented Domain Adaptation: From Theory to Algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 14161–14174.
42. Huang, D.W.; Sherman, B.T.; Lempicki, R.A. Bioinformatics Enrichment Tools: Paths toward the Comprehensive Functional Analysis of Large Gene Lists. Nucleic Acids Res. 2009, 37, 1–13.
43. Lin, C.-Y.; Lu, M.-Y.J.; Yue, J.-X.; Li, K.-L.; Le Pétillon, Y.; Yong, L.W.; Chen, Y.-H.; Tsai, F.-Y.; Lyu, Y.-F.; Chen, C.-Y.; et al. Molecular Asymmetry in the Cephalochordate Embryo Revealed by Single-Blastomere Transcriptome Profiling. PLoS Genet. 2020, 16, e1009294.
44. Yu, G.; Wang, L.-G.; Han, Y.; He, Q.-Y. clusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters. OMICS J. Integr. Biol. 2012, 16, 284–287.
45. Wrzesiński, T.; Szelag, M.; Cieślikowski, W.A.; Ida, A.; Giles, R.; Zodro, E.; Szumska, J.; Poźniak, J.; Kwias, Z.; Bluyssen, H.A.R.; et al. Expression of Pre-Selected TMEMs with Predicted ER Localization as Potential Classifiers of ccRCC Tumors. BMC Cancer 2015, 15, 518.
46. Zhao, Y.; Zhang, K.; Pan, H.; Wang, Y.; Zhou, X.; Xiang, Y.; Xu, Q.; Sun, Q.; Tan, J.; Yan, X.; et al. Genetic Analysis of Six Transmembrane Protein Family Genes in Parkinson’s Disease in a Large Chinese Cohort. Front. Aging Neurosci. 2022, 14, 889057.
47. Liu, G.; Peng, J.; Liao, Z.; Locascio, J.J.; Corvol, J.-C.; Zhu, F.; Dong, X.; Maple-Grødem, J.; Campbell, M.C.; Elbaz, A.; et al. Genome-Wide Survival Study Identifies a Novel Synaptic Locus and Polygenic Score for Cognitive Progression in Parkinson’s Disease. Nat. Genet. 2021, 53, 787–793.
48. Apweiler, R. UniProt: The Universal Protein Knowledgebase. Nucleic Acids Res. 2004, 32, D115–D119.
49. Gallouet, A.-S.; Ferri, F.; Petit, V.; Parcelier, A.; Lewandowski, D.; Gault, N.; Barroca, V.; Le Gras, S.; Soler, E.; Grosveld, F.; et al. Macrophage Production and Activation Are Dependent on TRIM33. Oncotarget 2017, 8, 5111–5122.
50. Weng, L.; Mitoma, H.; Tricot, C.; Bao, M.; Liu, Y.; Zhang, Z.; Liu, Y.-J. The E3 Ubiquitin Ligase Tripartite Motif 33 Is Essential for Cytosolic RNA–Induced NLRP3 Inflammasome Activation. J. Immunol. 2014, 193, 3676–3682.
51. Yan, Y.-Q.; Fang, Y.; Zheng, R.; Pu, J.-L.; Zhang, B.-R. NLRP3 Inflammasomes in Parkinson’s Disease and Their Regulation by Parkin. Neuroscience 2020, 446, 323–334.
52. O’Leary, N.A.; Wright, M.W.; Brister, J.R.; Ciufo, S.; Haddad, D.; McVeigh, R.; Rajput, B.; Robbertse, B.; Smith-White, B.; Ako-Adjei, D.; et al. Reference Sequence (RefSeq) Database at NCBI: Current Status, Taxonomic Expansion, and Functional Annotation. Nucleic Acids Res. 2016, 44, D733–D745.
53. Denaro, S.; D’Aprile, S.; Vicario, N.; Parenti, R. Mechanistic Insights into Connexin-Mediated Neuroglia Crosstalk in Neurodegenerative Diseases. Front. Cell. Neurosci. 2025, 19, 1532960.
54. Kawasaki, A.; Hayashi, T.; Nakachi, K.; Trosko, J.E.; Sugihara, K.; Kotake, Y.; Ohta, S. Modulation of Connexin 43 in Rotenone-Induced Model of Parkinson’s Disease. Neuroscience 2009, 160, 61–68.
55. Huang, C.; Ou, R.; Chen, X.; Zhang, Y.; Li, J.; Liang, Y.; Zhu, X.; Liu, L.; Li, M.; Lin, D.; et al. Tumor Cell-Derived SPON2 Promotes M2-Polarized Tumor-Associated Macrophage Infiltration and Cancer Progression by Activating PYK2 in CRC. J. Exp. Clin. Cancer Res. 2021, 40, 304.
56. Zhang, Y.-L.; Li, Q.; Yang, X.-M.; Fang, F.; Li, J.; Wang, Y.-H.; Yang, Q.; Zhu, L.; Nie, H.-Z.; Zhang, X.-L.; et al. SPON2 Promotes M1-like Macrophage Recruitment and Inhibits Hepatocellular Carcinoma Metastasis by Distinct Integrin–Rho GTPase–Hippo Pathways. Cancer Res. 2018, 78, 2305–2317.
57. Agliardi, C.; Guerini, F.R.; Zanzottera, M.; Riboldazzi, G.; Zangaglia, R.; Sturchio, A.; Casali, C.; Di Lorenzo, C.; Minafra, B.; Nemni, R.; et al. SNAP25 Gene Polymorphisms Protect Against Parkinson’s Disease and Modulate Disease Severity in Patients. Mol. Neurobiol. 2019, 56, 4455–4463.
58. Gao, V.; Briano, J.A.; Komer, L.E.; Burré, J. Functional and Pathological Effects of α-Synuclein on Synaptic SNARE Complexes. J. Mol. Biol. 2023, 435, 167714.
59. Gaetani, L.; Bellomo, G.; Chiasserini, D.; De Rocker, C.; Goossens, J.; Paolini Paoletti, F.; Vanmechelen, E.; Parnetti, L. Influence of Co-Pathology on CSF and Plasma Synaptic Markers SNAP25 and VAMP2 in Alzheimer’s Disease and Parkinson’s Disease. Alzheimer’s Res. Ther. 2025, 17, 115.
60. Bereczki, E.; Bogstedt, A.; Höglund, K.; Tsitsi, P.; Brodin, L.; Ballard, C.; Svenningsson, P.; Aarsland, D. Synaptic Proteins in CSF Relate to Parkinson’s Disease Stage Markers. npj Park. Dis. 2017, 3, 7.
61. Shu, J.; Peng, F.; Li, J.; Liu, Y.; Li, X.; Yuan, C. The Relationship between SNAP25 and Some Common Human Neurological Syndromes. Curr. Pharm. Des. 2024, 30, 2378–2386.
62. Wood, H. SNAP25—An Early Biomarker in AD and CJD. Nat. Rev. Neurol. 2022, 18, 575.
63. Wang, W.; Gao, W.; Gong, P.; Song, W.; Bu, X.; Hou, J.; Zhang, L.; Zhao, B. Neuronal-Specific TNFAIP1 Ablation Attenuates Postoperative Cognitive Dysfunction via Targeting SNAP25 for K48-Linked Ubiquitination. Cell Commun. Signal. 2023, 21, 356.
64. Fenton, A.R.; Jongens, T.A.; Holzbaur, E.L.F. Mitochondrial Adaptor TRAK2 Activates and Functionally Links Opposing Kinesin and Dynein Motors. Nat. Commun. 2021, 12, 4578.
65. Loss, O.; Stephenson, F.A. Developmental Changes in Trak-Mediated Mitochondrial Transport in Neurons. Mol. Cell. Neurosci. 2017, 80, 134–147.
66. Bose, A.; Beal, M.F. Mitochondrial Dysfunction in Parkinson’s Disease. J. Neurochem. 2016, 139, 216–231.
67. Franceschi, S.; Lessi, F.; Morelli, M.; Menicagli, M.; Pasqualetti, F.; Aretini, P.; Mazzanti, C. Sedoheptulose Kinase SHPK Expression in Glioblastoma: Emerging Role of the Nonoxidative Pentose Phosphate Pathway in Tumor Proliferation. Int. J. Mol. Sci. 2022, 23, 5978.
68. Dube, U.; Ibanez, L.; Budde, J.P.; Benitez, B.A.; Davis, A.A.; Harari, O.; Iles, M.M.; Law, M.H.; Brown, K.M.; Cruchaga, C. Overlapping Genetic Architecture between Parkinson Disease and Melanoma. Acta Neuropathol. 2020, 139, 347–364.
69. Bryniarska-Kubiak, N.; Kubiak, A.; Basta-Kaim, A. Mechanotransductive Receptor Piezo1 as a Promising Target in the Treatment of Neurological Diseases. Curr. Neuropharmacol. 2023, 21, 2030–2035.
70. Daftuar, L.; Zhu, Y.; Jacq, X.; Prives, C. Ribosomal Proteins RPL37, RPS15 and RPS20 Regulate the Mdm2-P53-MdmX Network. PLoS ONE 2013, 8, e68667.
71. Walden, H.; Muqit, M.M.K. Ubiquitin and Parkinson’s Disease through the Looking Glass of Genetics. Biochem. J. 2017, 474, 1439–1451.
72. Miok, K.; Nguyen-Doan, D.; Zaharie, D.; Robnik-Šikonja, M. Generating Data Using Monte Carlo Dropout. In Proceedings of the 2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 5–7 September 2019; IEEE: New York, NY, USA, 2019; pp. 509–515.
73. Blagus, R.; Lusa, L. SMOTE for High-Dimensional Class-Imbalanced Data. BMC Bioinform. 2013, 14, 106.
74. Reel, P.S.; Reel, S.; Pearson, E.; Trucco, E.; Jefferson, E. Using Machine Learning Approaches for Multi-Omics Data Analysis: A Review. Biotechnol. Adv. 2021, 49, 107739.
75. Augustine, J.; Jereesh, A.S. Marker Genes Identification and Prediction of Parkinson’s Disease by Integrating Blood-Based Multi-Omics Data. Chemom. Intell. Lab. Syst. 2025, 265, 105478.

Figure 1. Intensity levels of raw data.

Figure 2. Intensity levels of normalized data using Expresso.

Figure 3. Decision tree obtained by J48 from the expression levels of the main genes normalized with the Expresso technique.

Figure 4. Receiver operating characteristic (ROC) curves for gene expression. (a) Parkinson’s Disease vs. Healthy Control. (b) Neurological Disease vs. Healthy Control.

Figure 5. The GO enrichment analysis.

Table 1. Expresso parameters used for preprocessing the dataset.

Parameter	Value
normalize.method	loess
bgcorrect.method	none
pmcorrect.method	subtractmm
summary.method	liwong

Table 2. Accuracy performance of classifiers. All classifiers use the default parameters in Weka. The asterisk (*) in the table indicates that the difference with respect to J48 is statistically significant (corrected paired t-test, α = 0.05).

Dataset	J48	Naive Bayes	SVM	MLP
expr_217_LNS_9	86.71 (11.66)	73.19 (10.42) *	45.56 (14.11) *	58.10 (11.78) *
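
The caption of Table 2 refers to a corrected paired t-test. A minimal sketch of one common form of that statistic is given below, assuming the variance correction usually attributed to Nadeau and Bengio for resampled estimates; whether Weka was configured exactly this way is not stated here. The function name `corrected_paired_t` and the per-fold differences are hypothetical and are not study data.

```python
import math
from statistics import mean, variance

def corrected_paired_t(diffs, test_train_ratio):
    """Corrected resampled t statistic for paired accuracy differences.

    diffs: per-run accuracy differences between two classifiers.
    test_train_ratio: n_test / n_train in each run (1/9 for 10-fold CV).
    """
    k = len(diffs)
    # The extra test_train_ratio term inflates the variance to account for
    # the overlap between training sets across runs.
    scale = 1.0 / k + test_train_ratio
    return mean(diffs) / math.sqrt(scale * variance(diffs))

# Hypothetical per-fold accuracy differences in percentage points.
diffs = [12.0, 15.5, 9.0, 14.0, 16.5, 11.0, 13.5, 10.0, 17.0, 12.5]
t_stat = corrected_paired_t(diffs, test_train_ratio=1 / 9)
```

The statistic is compared with a Student t distribution with k − 1 degrees of freedom, where k is the number of runs; without the correction term, the test tends to declare differences significant too often.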

Table 3. Confusion matrix for J48.

Classified as →	a	b	c
a. Parkinson’s Disease	44	2	4
b. Healthy Control	2	18	2
c. Neurological Disease Control	1	2	30

Table 4. Confusion matrix for Naive Bayes.

Classified as →	a	b	c
a. Parkinson’s Disease	42	5	3
b. Healthy Control	11	9	2
c. Neurological Disease Control	12	7	14

Table 5. Confusion matrix for SVM.

Classified as →	a	b	c
a. Parkinson’s Disease	36	1	13
b. Healthy Control	11	2	9
c. Neurological Disease Control	15	2	16

Table 6. Confusion matrix for MLP.

Classified as →	a	b	c
a. Parkinson’s Disease	43	1	6
b. Healthy Control	12	3	7
c. Neurological Disease Control	15	2	16

Table 7. Sensitivity, specificity, PV+ and PV− for all classifiers. Numbers between parentheses represent the lower and upper bounds at 95% confidence interval, respectively.

Metric	J48	SVM	Naive Bayes	MLP
Sensitivity	88% (79–97)	43% (29–57)	84% (74–94)	86% (76–96)
Specificity	87% (78–96)	73% (59–88)	42% (29–55)	35% (22–47)
PV+	86% (77–96)	47% (33–61)	57% (45–68)	54% (43–65)
PV−	89% (81–97)	74% (60–88)	74% (59–90)	73% (56–90)
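
The J48, Naive Bayes, and MLP columns of Table 7 can be reproduced from the confusion matrices in Tables 3, 4, and 6 under one reading of the definitions. This reading is inferred from the numbers and is not stated in the text: Parkinson’s disease is the positive class, and a non-PD sample counts as a correct negative only when it is assigned to its own class. The sketch below uses hypothetical names (`pd_vs_rest_metrics`, `wald_ci`) and a normal-approximation (Wald) interval, which is also an assumption.

```python
import math

def pd_vs_rest_metrics(cm, pos=0):
    """cm[i][j] = number of class-i samples predicted as class j."""
    n = len(cm)
    tp = cm[pos][pos]
    fn = sum(cm[pos]) - tp
    # Assumption: negatives are correct only on the diagonal; every other
    # prediction for a non-PD sample is counted as an error.
    tn = sum(cm[i][i] for i in range(n) if i != pos)
    neg_errors = sum(sum(cm[i]) for i in range(n) if i != pos) - tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + neg_errors),
        "pv_pos": tp / (tp + neg_errors),
        "pv_neg": tn / (tn + fn),
    }

def wald_ci(p, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# Rows and columns: PD, healthy control, neurological disease control.
J48 = [[44, 2, 4], [2, 18, 2], [1, 2, 30]]          # Table 3
NAIVE_BAYES = [[42, 5, 3], [11, 9, 2], [12, 7, 14]]  # Table 4

j48 = pd_vs_rest_metrics(J48)
low, high = wald_ci(j48["sensitivity"], n=50)
```

For J48 this gives 44/50, 48/55, 44/51, and 48/54, which round to the 88%, 87%, 86%, and 89% reported in Table 7, and the interval for sensitivity rounds to 79–97. The SVM column does not follow from Table 5 under this reading.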

Table 8. GO terms that are significantly enriched.

Description	p-Adjust	p-Value	Gene ID	−log10(p-Adjust)
Synaptic vesicle docking	0.08605387	0.00476038	SNAP25	1.06522959
Dense core granule exocytosis	0.08605387	0.00476038	SNAP25	1.06522959
Cellular response to lipopolysaccharide	0.08605387	0.00486391	SPON2/SHPK	1.06522959
Positive regulation of integrin activation	0.08605387	0.00523531	PIEZO1	1.06522959
Regulation of cell–cell adhesion mediated by integrin	0.08605387	0.00523531	PIEZO1	1.06522959
Cellular response to molecule of bacterial origin	0.08605387	0.00542339	SPON2/SHPK	1.06522959
Dendritic transport	0.08605387	0.00571004	TRAK2	1.06522959
Cellular response to biotic stimulus	0.08605387	0.00662745	SPON2/SHPK	1.06522959
Neurotransmitter receptor transport to postsynaptic membrane	0.08605387	0.00760694	SNAP25	1.06522959
Opsonization	0.08605387	0.00808066	SPON2	1.06522959
Positive regulation of myotube differentiation	0.08605387	0.00808066	PIEZO1	1.06522959
Regulation of integrin activation	0.08605387	0.00808066	PIEZO1	1.06522959
Cell–cell adhesion mediated by integrin	0.08605387	0.00808066	PIEZO1	1.06522959
Mitochondrion distribution	0.08605387	0.00808066	TRAK2	1.06522959
Neurotransmitter receptor transport to plasma membrane	0.08605387	0.00808066	SNAP25	1.06522959
Establishment of protein localization to postsynaptic membrane	0.08605387	0.00808066	SNAP25	1.06522959
Synaptic vesicle priming	0.08605387	0.00855418	SNAP25	1.06522959
Axonal transport of mitochondrion	0.08605387	0.00855418	TRAK2	1.06522959
Glycerol metabolic process	0.08605387	0.00950062	SHPK	1.06522959
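
The last column of Table 8 appears to be the negative base-10 logarithm of the adjusted p-value, a transformation often used when plotting enrichment results. This is an inference from the values rather than a stated definition, and it can be checked with a short calculation:

```python
import math

p_adjust = 0.08605387  # adjusted p-value shared by every row of Table 8
neg_log10 = -math.log10(p_adjust)

# Value of the same transformation at an adjusted p-value of 0.05.
reference = -math.log10(0.05)
```

Here `neg_log10` agrees with the tabulated 1.06522959 to within rounding. For orientation, an adjusted p-value of 0.05 corresponds to about 1.30 on this scale.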

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
