Data Complexity-Aware Feature Selection with Symmetric Splitting for Robust Parkinson’s Disease Detection
Abstract
1. Introduction
- This work identifies the experimental errors in the aforementioned studies and introduces a uniform (fair) experimental setup to streamline research and ensure a fair comparison of results.
- This work releases a publicly accessible 5-fold benchmark version of the PD speech dataset introduced in [25] for consistent and reproducible evaluation.
- This work proposes a hybrid data complexity-aware feature selection (HDC) method using data complexity metrics such as F1, F3, and F4 measures along with the ReliefF feature selection algorithm.
- Empirical analysis demonstrates:
  - The impact of feature extraction techniques and the size limit at which features remain informative for PD classification;
  - An analysis of the top-50 features using the proposed HDC algorithm and existing state-of-the-art feature selection algorithms, including Information Gain, Gain Ratio, ReliefF, and mRMR.
- This work employs Naive Bayes, Decision Tree, k-Nearest Neighbors (k-NN), and Support Vector Machine (SVM) classifiers to evaluate the efficiency of the proposed PD telediagnosis system based on four evaluation metrics: accuracy, G-mean, F1 measure, and MCC.
2. Related Work
2.1. Dataset and Speech Feature Categories
2.2. Existing Feature Selection Algorithms
2.3. Data Complexity Measures
3. Proposed Work
3.1. Subject-Wise Dataset Bifurcation: A Fair Common Platform for Future Studies
- Internal test set evaluation (k-fold cross-validation): This work employs subject-wise 5-fold cross-validation to create five train–test pairs from a total of 245 subjects—185 PD (105 males, 80 females) and 60 healthy (20 males, 40 females)—using a symmetric data splitting strategy that ensures the same class and gender ratio as in the original population.
- External test set evaluation: The remaining seven subjects—three PD (two males, one female) and four healthy (three males, one female)—are held out to maintain symmetry in the data split used for subject-wise 5-fold cross-validation. In this configuration, these 7 subjects are used as the external test set, and the 245 subjects from the earlier setup are used as the training set.
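The following minimal sketch (Python/scikit-learn) illustrates this symmetric, subject-wise split by stratifying on the joint class–gender key so that every fold preserves both ratios; the array layout and random seed are illustrative assumptions, not the released benchmark itself.

```python
# Sketch of a subject-wise, class- and gender-stratified 5-fold split.
# The subject ordering below is illustrative; only the counts match the paper.
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels  = np.array([1] * 185 + [0] * 60)            # 185 PD, 60 healthy subjects
genders = np.array(["M"] * 105 + ["F"] * 80         # PD: 105 males, 80 females
                   + ["M"] * 20 + ["F"] * 40)       # healthy: 20 males, 40 females
subject_ids = np.arange(len(labels))                # 245 subjects in total

# Stratify on the joint class-gender key so each fold keeps both ratios.
strata = np.char.add(labels.astype(str), genders)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr, te) in enumerate(skf.split(subject_ids, strata), start=1):
    # All recordings (3 repetitions per subject) must follow their subject,
    # so no speaker's samples appear in both the train and test sides.
    print(f"Fold {fold}: {len(tr)} train / {len(te)} test subjects")
```

Splitting at the subject level, rather than the recording level, is what prevents the speaker-leakage error criticized in earlier studies.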
3.2. Hybrid Data Complexity-Based (HDC) Feature Selection
- Category 1: measures of the overlap of individual feature values;
- Category 2: measures of the separability of classes.
| Algorithm 1 Hybrid data complexity-based Feature Selection |
| Input: $D_k$: subject-wise k-fold training data, $m$: total number of features, $d$: number of features to select via F4 (one per iteration), $C$: number of top features to select (Top-C), $\theta$: correlation threshold for feature removal, $\mu_{c_1}^{j}, \mu_{c_2}^{j}$: mean values of feature $j$ for classes $c_1$ and $c_2$, $(\sigma_{c_1}^{j})^2, (\sigma_{c_2}^{j})^2$: variances of feature $j$ for classes $c_1$ and $c_2$, $f_j$: individual feature, $n$: number of samples in $D_k$, $n_o(f_j)$: number of overlapping samples for feature $j$. Output: $F_k^{*}$ (optimal feature subset for each training set) |
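The correlation threshold $\theta$ in the Input line above is used to discard redundant features; Section 4 evaluates cutoffs of 0.85, 0.90, and 0.95. The sketch below shows one common greedy variant of such pruning, assuming features arrive already ranked; the exact pruning order inside Algorithm 1 is not reproduced here.

```python
# Illustrative greedy correlation pruning: keep a feature only if its absolute
# Pearson correlation with every already-kept feature stays at or below theta.
# The keep-first-in-rank-order policy is an assumption, not the paper's exact rule.
import numpy as np

def prune_correlated(X, ranked_idx, theta=0.90):
    corr = np.abs(np.corrcoef(X, rowvar=False))  # (m, m) feature correlations
    kept = []
    for j in ranked_idx:                         # walk features best-first
        if all(corr[j, k] <= theta for k in kept):
            kept.append(j)
    return kept

# Example: 100 samples x 5 features, where feature 1 duplicates feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)
print(prune_correlated(X, ranked_idx=[0, 1, 2, 3, 4]))  # feature 1 is dropped
```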
- This work introduces F1F3, a novel metric that combines the weights of F1 and F3 (lines 10–11) to unify different perspectives of feature overlap for feature ranking, providing a holistic view of class separability (a minimal computational sketch follows this list):
  - The F1 measure (Fisher’s discriminant ratio) evaluates the extent of the overlapping region (i.e., how much the classes overlap) in each feature (line 3), where $\mu$ and $\sigma^2$ denote the mean and variance of the feature within each class, respectively. Lower F1 values indicate better class separability (minimal feature overlap) [55].
  - The F3 measure considers how many data points lie in the overlapping region (line 4) and evaluates a feature (say, $j$) based on whether the overlap is densely populated with samples (high complexity) or contains only a few samples (low complexity). A low F3 value indicates better feature efficiency and implies that the feature can separate more samples.
- The F4 measure builds on the F3 measure and iteratively reduces class overlap (lines 5–6) by focusing on local class separability to select a set of features. It selects the feature that eliminates the most samples from the overlapping region, continuing until no samples remain in the overlap.
- ReliefF applies a different strategy to address class overlap by employing k-Nearest Neighbors (k-NN) to rank features based on their discriminative power for borderline points, considering both intra-class and inter-class nearest neighbors. It updates the feature weight vector by evaluating the distances between a sample $x_i$ and its near-hit as well as between $x_i$ and its near-miss, as defined in line 12 of Algorithm 1.
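The sketch below illustrates the per-feature F1 and F3 computations described above for a binary-labeled feature matrix; the F1F3 combination shown is an illustrative weighted sum, since the exact weighting used in lines 10–11 of Algorithm 1 is not reproduced here.

```python
# Per-feature complexity scores in the convention used above: lower F1 and
# lower F3 both indicate better class separability.
import numpy as np

def f1_complexity(x, y):
    """F1 in its complexity form, 1 / (1 + Fisher ratio): lower = more separable."""
    x0, x1 = x[y == 0], x[y == 1]
    fisher = (x0.mean() - x1.mean()) ** 2 / (x0.var() + x1.var() + 1e-12)
    return 1.0 / (1.0 + fisher)

def f3_overlap(x, y):
    """F3: fraction of samples inside the class-overlap interval of this feature."""
    x0, x1 = x[y == 0], x[y == 1]
    lo, hi = max(x0.min(), x1.min()), min(x0.max(), x1.max())
    return 0.0 if lo > hi else float(np.mean((x >= lo) & (x <= hi)))

def f1f3_ranking(X, y, w=0.5):
    # Illustrative combination (assumed): weighted sum of the two scores,
    # ranked ascending so the least complex features come first.
    scores = np.array([w * f1_complexity(X[:, j], y) + (1 - w) * f3_overlap(X[:, j], y)
                       for j in range(X.shape[1])])
    return np.argsort(scores)
```

Ranking in ascending order keeps the convention stated above, where lower F1 and F3 values indicate better class separability.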
4. Experimental Study and Results
- Accuracy: Measures the proportion of correctly classified PD and healthy cases, and is given by $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where TP (true positives) represents correctly detected PD cases, TN (true negatives) indicates correctly identified healthy cases, FP (false positives) is misdiagnosed healthy individuals, and FN (false negatives) denotes missed PD cases.
- Geometric Mean (G-Mean): Measures the balance between sensitivity (correct PD detection) and specificity (correct identification of healthy individuals), and is given by $\text{G-Mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}}$, where $\text{Sensitivity} = \frac{TP}{TP + FN}$ and $\text{Specificity} = \frac{TN}{TN + FP}$.
- F1-Score: A measure such as the F1-score is crucial for false detection evaluation, as it balances precision (reduces false positives) and recall (reduces false negatives), and is given by $F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$, where $\text{Precision} = \frac{TP}{TP + FP}$ and $\text{Recall} = \frac{TP}{TP + FN}$.
- Matthews Correlation Coefficient (MCC): A measure such as MCC plays a vital role in assessing model reliability in clinical applications such as PD detection, where a higher MCC value (closer to +1) is desirable, as it indicates accurate and reliable classification [57,58]. Values near 0 suggest random performance, while values near −1 imply poor or misleading predictions.
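As a worked example, the snippet below evaluates all four metrics from a single hypothetical confusion matrix using the standard confusion-matrix definitions; the counts are invented for illustration.

```python
# Worked example of Accuracy, G-Mean, F1-score, and MCC from one confusion matrix.
import math

TP, TN, FP, FN = 90, 20, 10, 5   # hypothetical PD / healthy counts

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)     # recall on the PD class
specificity = TN / (TN + FP)
precision   = TP / (TP + FP)

g_mean = math.sqrt(sensitivity * specificity)
f1     = 2 * precision * sensitivity / (precision + sensitivity)
mcc    = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(f"Accuracy={accuracy:.3f} G-Mean={g_mean:.3f} F1={f1:.3f} MCC={mcc:.3f}")
```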
4.1. Empirical Analysis of Existing Speech Feature Categories
4.1.1. Size Limit or Threshold Analysis for Informative Features and Category-Specific Contributions
4.1.2. Feature Redundancy and Correlation Analysis
4.2. Design and Evaluation of the Final Feature Set for the Proposed HDC
4.3. Results and Discussion
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- De Rijk, M.C.; Launer, L.; Berger, K.; Breteler, M.; Dartigues, J.; Baldereschi, M.; Fratiglioni, L.; Lobo, A.; Martinez-Lage, J.; Trenkwalder, C.; et al. Prevalence of Parkinson’s disease in Europe: A collaborative study of population-based cohorts. Neurologic Diseases in the Elderly Research Group. Neurology 2000, 54, S21–S23.
- Han, C.X.; Wang, J.; Yi, G.S.; Che, Y.Q. Investigation of EEG abnormalities in the early stage of Parkinson’s disease. Cogn. Neurodyn. 2013, 7, 351–359.
- Yuvaraj, R.; Murugappan, M.; Acharya, U.R.; Adeli, H.; Ibrahim, N.M.; Mesquita, E. Brain functional connectivity patterns for emotional state classification in Parkinson’s disease patients without dementia. Behav. Brain Res. 2016, 298, 248–260.
- Yuvaraj, R.; Acharya, U.R.; Hagiwara, Y. A novel Parkinson’s Disease Diagnosis Index using higher-order spectra features in EEG signals. Neural Comput. Appl. 2018, 30, 1225–1235.
- Oh, S.L.; Hagiwara, Y.; Raghavendra, U.; Yuvaraj, R.; Arunkumar, N.; Murugappan, M.; Acharya, U.R. A deep learning approach for Parkinson’s disease diagnosis from EEG signals. Neural Comput. Appl. 2018, 32, 10927–10933.
- Bhat, S.; Acharya, U.R.; Hagiwara, Y.; Dadmehr, N.; Adeli, H. Parkinson’s disease: Cause factors, measurable indicators, and early diagnosis. Comput. Biol. Med. 2018, 102, 234–241.
- Lacy, S.E.; Smith, S.L.; Lones, M.A. Using echo state networks for classification: A case study in Parkinson’s disease diagnosis. Artif. Intell. Med. 2018, 86, 53–59.
- Loconsole, C.; Cascarano, G.D.; Brunetti, A.; Trotta, G.F.; Losavio, G.; Bevilacqua, V.; Di Sciascio, E. A model-free technique based on computer vision and sEMG for classification in Parkinson’s disease by using computer-assisted handwriting analysis. Pattern Recognit. Lett. 2019, 121, 28–36.
- Zeng, W.; Liu, F.; Wang, Q.; Wang, Y.; Ma, L.; Zhang, Y. Parkinson’s disease classification using gait analysis via deterministic learning. Neurosci. Lett. 2016, 633, 268–278.
- Joshi, D.; Khajuria, A.; Joshi, P. An automatic non-invasive method for Parkinson’s disease classification. Comput. Methods Programs Biomed. 2017, 145, 135–145.
- Sharma, P.; Sundaram, S.; Sharma, M.; Sharma, A.; Gupta, D. Diagnosis of Parkinson’s disease using modified grey wolf optimization. Cogn. Syst. Res. 2019, 54, 100–115.
- Afonso, L.C.; Rosa, G.H.; Pereira, C.R.; Weber, S.A.; Hook, C.; Albuquerque, V.H.C.; Papa, J.P. A recurrence plot-based approach for Parkinson’s disease identification. Future Gener. Comput. Syst. 2019, 94, 282–292.
- Rios-Urrego, C.D.; Vásquez-Correa, J.C.; Vargas-Bonilla, J.F.; Nöth, E.; Lopera, F.; Orozco-Arroyave, J.R. Analysis and evaluation of handwriting in patients with Parkinson’s disease using kinematic, geometrical, and non-linear features. Comput. Methods Programs Biomed. 2019, 173, 43–52.
- Tsanas, A.; Little, M.; McSharry, P.; Ramig, L. Accurate telemonitoring of Parkinson’s disease progression by non-invasive speech tests. Nat. Preced. 2009.
- Mostafa, S.A.; Mustapha, A.; Mohammed, M.A.; Hamed, R.I.; Arunkumar, N.; Abd Ghani, M.K.; Jaber, M.M.; Khaleefah, S.H. Examining multiple feature evaluation and classification methods for improving the diagnosis of Parkinson’s disease. Cogn. Syst. Res. 2019, 54, 90–99.
- Hartelius, L.; Svensson, P. Speech and swallowing symptoms associated with Parkinson’s disease and multiple sclerosis: A survey. Folia Phoniatr. Logop. 1994, 46, 9–17.
- Ho, A.K.; Iansek, R.; Marigliani, C.; Bradshaw, J.L.; Gates, S. Speech impairment in a large sample of patients with Parkinson’s disease. Behav. Neurol. 1998, 11, 131–137.
- Erdogdu Sakar, B.; Serbes, G.; Sakar, C.O. Analyzing the effectiveness of vocal features in early telediagnosis of Parkinson’s disease. PLoS ONE 2017, 12, e0182428.
- Tsanas, A.; Little, M.A.; McSharry, P.E.; Ramig, L.O. Enhanced classical dysphonia measures and sparse regression for telemonitoring of Parkinson’s disease progression. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; IEEE: New York, NY, USA, 2010; pp. 594–597.
- Braak, H.; Ghebremedhin, E.; Rüb, U.; Bratzke, H.; Del Tredici, K. Stages in the development of Parkinson’s disease-related pathology. Cell Tissue Res. 2004, 318, 121–134.
- Logemann, J.A.; Fisher, H.B.; Boshes, B.; Blonsky, E.R. Frequency and cooccurrence of vocal tract dysfunctions in the speech of a large sample of Parkinson patients. J. Speech Hear. Disord. 1978, 43, 47–57.
- Little, M.; McSharry, P.; Roberts, S.; Costello, D.; Moroz, I. Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. Nat. Preced. 2007.
- Tsanas, A.; Little, M.A.; McSharry, P.E.; Ramig, L.O. New nonlinear markers and insights into speech signal degradation for effective tracking of Parkinson’s disease symptom severity. IEICE Proc. Ser. 2010, 457–460.
- Tsanas, A.; Little, M.A.; McSharry, P.E.; Spielman, J.; Ramig, L.O. Novel speech signal processing algorithms for high-accuracy classification of Parkinson’s disease. IEEE Trans. Biomed. Eng. 2012, 59, 1264–1271.
- Sakar, C.O.; Serbes, G.; Gunduz, A.; Tunc, H.C.; Nizam, H.; Sakar, B.E.; Tutuncu, M.; Aydin, T.; Isenkul, M.E.; Apaydin, H. A comparative analysis of speech signal processing algorithms for Parkinson’s disease classification and the use of the tunable Q-factor wavelet transform. Appl. Soft Comput. 2019, 74, 255–263.
- Sharma, G.; Umapathy, K.; Krishnan, S. Trends in audio signal feature extraction methods. Appl. Acoust. 2020, 158, 107020.
- Tuncer, T.; Dogan, S. Novel dynamic center based binary and ternary pattern network using M4 pooling for real world voice recognition. Appl. Acoust. 2019, 156, 176–185.
- Korkmaz, Y.; Boyacı, A. A comprehensive Turkish accent/dialect recognition system using acoustic perceptual formants. Appl. Acoust. 2022, 193, 108761.
- Soumaya, Z.; Drissi Taoufiq, B.; Benayad, N.; Yunus, K.; Abdelkrim, A. The detection of Parkinson disease using the genetic algorithm and SVM classifier. Appl. Acoust. 2021, 171, 107528.
- Tuncer, T.; Dogan, S. A novel octopus based Parkinson’s disease and gender recognition method using vowels. Appl. Acoust. 2019, 155, 75–83.
- Tuncer, T.; Dogan, S.; Acharya, U.R. Automated detection of Parkinson’s disease using minimum average maximum tree and singular value decomposition method with vowels. Biocybern. Biomed. Eng. 2020, 40, 211–220.
- Solana-Lavalle, G.; Galán-Hernández, J.C.; Rosas-Romero, R. Automatic Parkinson disease detection at early stages as a pre-diagnosis tool by using classifiers and a small set of vocal features. Biocybern. Biomed. Eng. 2020, 40, 505–516.
- Xiong, Y.; Lu, Y. Deep feature extraction from the vocal vectors using sparse autoencoders for Parkinson’s classification. IEEE Access 2020, 8, 27821–27830.
- Masud, M.; Singh, P.; Gaba, G.S.; Kaur, A.; Alroobaea, R.; Alrashoud, M.; Alqahtani, S.A. CROWD: Crow search and deep learning based feature extractor for classification of Parkinson’s disease. ACM Trans. Internet Technol. (TOIT) 2021, 21, 1–18.
- Gunduz, H. An efficient dimensionality reduction method using filter-based feature selection and variational autoencoders on Parkinson’s disease classification. Biomed. Signal Process. Control 2021, 66, 102452.
- Sakar, C.O.; Kursun, O. Telediagnosis of Parkinson’s disease using measurements of dysphonia. J. Med. Syst. 2010, 34, 591–599.
- Little, M.; McSharry, P.; Hunter, E.; Spielman, J.; Ramig, L. Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease. Nat. Preced. 2008.
- Smialowski, P.; Frishman, D.; Kramer, S. Pitfalls of supervised feature selection. Bioinformatics 2010, 26, 440–443.
- Sakar, C.; Sakar, B. Parkinson’s Disease Classification; UCI Machine Learning Repository: Irvine, CA, USA, 2018.
- Liu, H.; Li, J.; Wong, L. A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. Genome Inform. 2002, 13, 51–60.
- Hilario, M.; Kalousis, A. Approaches to dimensionality reduction in proteomic biomarker studies. Briefings Bioinform. 2008, 9, 102–118.
- Zheng, C.H.; Huang, D.S.; Zhang, L.; Kong, X.Z. Tumor clustering using nonnegative matrix factorization with gene selection. IEEE Trans. Inf. Technol. Biomed. 2009, 13, 599–607.
- Choi, J.Y.; Ro, Y.M.; Plataniotis, K.N. Boosting color feature selection for color face recognition. IEEE Trans. Image Process. 2010, 20, 1425–1434.
- Goltsev, A.; Gritsenko, V. Investigation of efficient features for image recognition by neural networks. Neural Netw. 2012, 28, 15–23.
- Forman, G. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 2003, 3, 1289–1305.
- Ambusaidi, M.A.; He, X.; Nanda, P.; Tan, Z. Building an intrusion detection system using a filter-based feature selection algorithm. IEEE Trans. Comput. 2016, 65, 2986–2998.
- Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
- Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
- Quinlan, J.R. C4.5: Programs for Machine Learning. In The Morgan Kaufmann Series in Machine Learning; Elsevier: Amsterdam, The Netherlands, 1993.
- Quinlan, J.R. Improved use of continuous attributes in C4.5. J. Artif. Intell. Res. 1996, 4, 77–90.
- Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the European Conference on Machine Learning; Springer: Berlin/Heidelberg, Germany, 1994; pp. 171–182.
- Kira, K.; Rendell, L.A. A practical approach to feature selection. In Machine Learning Proceedings 1992; Elsevier: Amsterdam, The Netherlands, 1992; pp. 249–256.
- Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
- Ho, T.K.; Basu, M. Complexity measures of supervised classification problems. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 289–300.
- Lorena, A.C.; Garcia, L.P.; Lehmann, J.; Souto, M.C.; Ho, T.K. How Complex is your classification problem? A survey on measuring classification complexity. ACM Comput. Surv. (CSUR) 2019, 52, 1–34.
- Orriols-Puig, A.; Macia, N.; Ho, T.K. Documentation for the data complexity library in C++. Univ. Ramon Llull Salle 2010, 196, 12.
- Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.
- Chicco, D.; Tötsch, N.; Jurman, G. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Min. 2021, 14, 13.
- Li, H.; Pun, C.M.; Xu, F.; Pan, L.; Zong, R.; Gao, H.; Lu, H. A hybrid feature selection algorithm based on a discrete artificial bee colony for Parkinson’s diagnosis. ACM Trans. Internet Technol. 2021, 21, 1–22.
- Celik, G.; Başaran, E. Proposing a new approach based on convolutional neural networks and random forest for the diagnosis of Parkinson’s disease from speech signals. Appl. Acoust. 2023, 211, 109476.
- Hasanzadeh, M.; Mahmoodian, H. A novel hybrid method for feature selection based on gender analysis for early Parkinson’s disease diagnosis using speech analysis. Appl. Acoust. 2023, 211, 109561.
- Gunduz, H. Deep learning-based Parkinson’s disease classification using vocal feature sets. IEEE Access 2019, 7, 115540–115551.
- Polat, K.; Nour, M. Parkinson disease classification using one against all based data sampling with the acoustic features from the speech signals. Med. Hypotheses 2020, 140, 109678.
- Xavier, D.; Felizardo, V.; Ferreira, B.; Zacarias, H.; Pourvahab, M.; Souza-Pereira, L.; Garcia, N.M. Voice analysis in Parkinson’s disease: A systematic literature review. Artif. Intell. Med. 2025, 163, 103109.
| Study Period | Feature Extraction Techniques | Advantages |
|---|---|---|
| Earlier studies | Baseline, Time–Frequency, Vocal | Capture general and traditional acoustic features (pitch, jitter, shimmer, formants) |
| Later studies | Wavelet, MFCC, TQWT | Capture localized, nonstationary, and fine-grained speech patterns; better at detecting subtle impairments |
| Name | Values | Description |
|---|---|---|
| # of Vowel /a/ Samples | 756 | Each subject provided 3 repetitions; speech recordings were collected using a microphone at a sampling rate of 44.1 kHz |
| # of Subjects | 252 (188 PD, 64 healthy) | Class imbalance ratio of roughly 3:1 |
| # of PD Patients | 188 (107 men, 81 women) | Age range: [33, 87] years; Mean ± Std: 65.1 ± 10.9 |
| # of Healthy Subjects | 64 (23 men, 41 women) | Age range: [41, 82] years; Mean ± Std: 61.1 ± 8.9 |
| Feature Category | Description | Features |
|---|---|---|
| Baseline Features | Jitter and shimmer capture variations in fundamental frequency ($F_0$) and amplitude; statistical measures of perturbation (Min, Max, Mean, Median, Std. dev., and Avg.); noise-to-tonal component ratios (noise-to-harmonics (NHR), harmonics-to-noise (HNR)); nonlinear measures, including Recurrence Period Density Entropy (RPDE), Pitch Period Entropy (PPE), and Detrended Fluctuation Analysis (DFA) | 23 |
| Time–Frequency Features | Intensity parameters (Min, Max, and Mean); formant frequencies (F1, F2, F3, and F4); bandwidths (b1, b2, b3, and b4) | 11 |
| Vocal Fold Features | Vocal fold excitation ratio (VFER); glottal-to-noise excitation (GNE); glottis quotient (GQ); empirical mode decomposition (EMD) | 22 |
| Wavelet Transform Features | Original and log transform of the $F_0$ contour; 10-level discrete wavelet transform; approximation and detail coefficients of Teager–Kaiser energy (TKEO) and entropy of Shannon and log energy | 182 |
| MFCC Features | Mean and Std. dev. of the original 13 MFCCs and their first and second derivatives; log energy of the signal | 84 |
| TQWT Features | 36-level tunable Q-factor wavelet transform; Min, Max, Mean, Median, Std. dev., Skewness, Kurtosis, entropy of log energy and Shannon, and Teager–Kaiser energy (TKEO) | 432 |
| Feature | Equation/Definition with Parameters |
|---|---|
| Jitter (JitterF0,abs) | $\mathrm{Jitter}_{F0,\mathrm{abs}} = \frac{1}{N-1}\sum_{i=1}^{N-1}\lvert F_0^{(i)} - F_0^{(i+1)}\rvert$, where $F_0^{(i)}$: fundamental frequency of the i-th speech cycle; $N$: total number of periods |
| Shimmer (ShimmerdB) | $\mathrm{Shimmer}_{\mathrm{dB}} = \frac{1}{N-1}\sum_{i=1}^{N-1}\lvert 20\log_{10}(A_{i+1}/A_i)\rvert$, where $A_i$: peak amplitude of the i-th cycle; $N$: total number of cycles |
| HNR (dB) & NHR (dB) | HNR: $10\log_{10}\!\frac{r_{\max}}{1-r_{\max}}$; NHR: $10\log_{10}\!\frac{1-r_{\max}}{r_{\max}}$, where $r_{\max}$: maximum autocorrelation value of the signal |
| RPDE | $\mathrm{RPDE} = -\frac{\sum_{i=1}^{T_{\max}} p_i \ln p_i}{\ln T_{\max}}$, where $p_i$: probability of recurrence period i; $T_{\max}$: maximum recurrence period |
| PPE | $\mathrm{PPE} = -\sum_{i=1}^{M} p_i \log_2 p_i$, where $p_i$: probability of pitch period i; $M$: total number of pitch periods |
| DFA | $F(n) \propto n^{\alpha}$, where $\alpha$: slope of the fluctuation function $\log F(n)$ versus $\log n$ |
| MFCC (MFCCn) | $\mathrm{MFCC}_n = \sum_{k=1}^{K} S_k \cos\!\left[n\left(k-\tfrac{1}{2}\right)\tfrac{\pi}{K}\right]$, where $S_k$: log energy of the k-th Mel-filtered frequency band; $K$: total number of Mel filters; $n$: index of cepstral coefficient (0 to L); $L$: total number of MFCC coefficients |
| DWT | $\mathrm{DWT}(r,s) = \frac{1}{\sqrt{a_0^{r}}}\sum_{n=0}^{N-1} x(n)\,\psi\!\left(\frac{n - s\,b_0\,a_0^{r}}{a_0^{r}}\right)$, where $x(n)$: discrete-time speech signal; $\psi$: mother wavelet function; $r$: scale index (controls dilation of the wavelet); $s$: translation index (controls shift of the wavelet); $a_0$: scale step size (usually > 1); $b_0$: translation step size; $N$: total number of signal samples |
| TQWT | Defined by an iterated two-channel filter bank with low-pass and high-pass filters $H_0(w)$ and $H_1(w)$, where $Q$: Q-factor scaling parameter (controls bandwidth of the filters); $r$: redundancy parameter (controls overlap of frequency bands); $k$: decomposition level (number of wavelet subbands); $w$: frequency variable in radians per sample |
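To make the first two rows of the table concrete, the toy snippet below applies the jitter and shimmer definitions to synthetic per-cycle F0 and amplitude tracks; all values are invented for illustration.

```python
# Toy check of the jitter and shimmer formulas above on synthetic cycle tracks.
import numpy as np

f0  = np.array([118.2, 119.0, 117.5, 118.8, 118.1])  # Hz, per speech cycle
amp = np.array([0.81, 0.84, 0.79, 0.82, 0.80])        # peak amplitude per cycle

jitter_abs = np.mean(np.abs(np.diff(f0)))                        # absolute jitter
shimmer_db = np.mean(np.abs(20 * np.log10(amp[1:] / amp[:-1])))  # shimmer in dB

print(f"Jitter(abs) = {jitter_abs:.3f} Hz, Shimmer(dB) = {shimmer_db:.3f} dB")
```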
| Feature Selection | Fold(s) | Baseline (23) | Time (4) | Frequency (7) | MFCC (84) | Wavelet (182) | Vocal (22) | TQWT (432) |
|---|---|---|---|---|---|---|---|---|
| Information Gain | Fold 1 (249) | 15 | 3 | 1 | 25 | 79 | 2 | 124 |
| | Fold 2 (267) | 11 | 3 | 1 | 20 | 78 | 1 | 153 |
| | Fold 3 (322) | 18 | 3 | 2 | 25 | 88 | 7 | 179 |
| | Fold 4 (253) | 10 | 3 | 1 | 24 | 79 | 3 | 133 |
| | Fold 5 (295) | 14 | 3 | 1 | 27 | 80 | 5 | 165 |
| Gain Ratio | Fold 1 (327) | 20 | 3 | 0 | 32 | 86 | 6 | 180 |
| | Fold 2 (383) | 16 | 3 | 2 | 33 | 88 | 4 | 237 |
| | Fold 3 (391) | 20 | 3 | 2 | 32 | 88 | 9 | 237 |
| | Fold 4 (331) | 15 | 3 | 2 | 31 | 87 | 5 | 188 |
| | Fold 5 (366) | 17 | 3 | 1 | 35 | 81 | 7 | 222 |
| ReliefF | Fold 1 (329) | 15 | 3 | 3 | 53 | 84 | 17 | 154 |
| | Fold 2 (328) | 14 | 3 | 5 | 53 | 74 | 15 | 164 |
| | Fold 3 (364) | 18 | 3 | 6 | 54 | 81 | 16 | 186 |
| | Fold 4 (323) | 15 | 3 | 4 | 53 | 77 | 16 | 155 |
| | Fold 5 (350) | 18 | 3 | 5 | 56 | 86 | 15 | 167 |
| | Corr. > 0.85 | | | | Corr. > 0.90 | | | | Corr. > 0.95 | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Classifier | Accuracy | G-Mean | F1-Score | MCC | Accuracy | G-Mean | F1-Score | MCC | Accuracy | G-Mean | F1-Score | MCC |
| F4: ‘All Features’ | | | | | | | | | | | | |
| Naive Bayes | 0.70 | 0.70 | 0.78 | 0.37 | 0.71 | 0.70 | 0.78 | 0.37 | 0.70 | 0.69 | 0.77 | 0.36 |
| k-NN | 0.81 | 0.68 | 0.88 | 0.46 | 0.81 | 0.66 | 0.88 | 0.45 | 0.81 | 0.68 | 0.88 | 0.46 |
| Decision Tree | 0.72 | 0.63 | 0.81 | 0.29 | 0.72 | 0.61 | 0.81 | 0.28 | 0.75 | 0.60 | 0.84 | 0.30 |
| SVM (Linear) | 0.85 | 0.73 | 0.91 | 0.58 | 0.86 | 0.73 | 0.91 | 0.58 | 0.85 | 0.73 | 0.91 | 0.57 |
| SVM (Polynomial) | 0.82 | 0.72 | 0.88 | 0.51 | 0.84 | 0.74 | 0.90 | 0.56 | 0.82 | 0.71 | 0.88 | 0.49 |
| SVM (RBF) | 0.85 | 0.69 | 0.90 | 0.55 | 0.85 | 0.69 | 0.91 | 0.55 | 0.84 | 0.68 | 0.90 | 0.53 |
| F4: ‘all’ − f(=1) | | | | | | | | | | | | |
| Naive Bayes | 0.70 | 0.70 | 0.77 | 0.37 | 0.71 | 0.70 | 0.78 | 0.38 | 0.69 | 0.69 | 0.77 | 0.35 |
| k-NN | 0.82 | 0.68 | 0.88 | 0.47 | 0.82 | 0.67 | 0.89 | 0.47 | 0.81 | 0.67 | 0.88 | 0.45 |
| Decision Tree | 0.72 | 0.63 | 0.81 | 0.29 | 0.72 | 0.63 | 0.81 | 0.29 | 0.76 | 0.63 | 0.84 | 0.34 |
| SVM (Linear) | 0.85 | 0.72 | 0.90 | 0.56 | 0.85 | 0.72 | 0.91 | 0.57 | 0.85 | 0.73 | 0.91 | 0.57 |
| SVM (Polynomial) | 0.83 | 0.73 | 0.89 | 0.52 | 0.83 | 0.73 | 0.89 | 0.53 | 0.81 | 0.71 | 0.88 | 0.48 |
| SVM (RBF) | 0.85 | 0.69 | 0.91 | 0.56 | 0.85 | 0.68 | 0.91 | 0.56 | 0.84 | 0.69 | 0.90 | 0.53 |
| F4: ‘all’ − f(=2) | | | | | | | | | | | | |
| Naive Bayes | 0.70 | 0.70 | 0.77 | 0.37 | 0.70 | 0.70 | 0.77 | 0.37 | 0.69 | 0.69 | 0.77 | 0.36 |
| k-NN | 0.83 | 0.69 | 0.89 | 0.50 | 0.82 | 0.68 | 0.89 | 0.48 | 0.81 | 0.66 | 0.88 | 0.45 |
| Decision Tree | 0.72 | 0.63 | 0.81 | 0.30 | 0.73 | 0.65 | 0.82 | 0.33 | 0.72 | 0.61 | 0.81 | 0.28 |
| SVM (Linear) | 0.85 | 0.72 | 0.90 | 0.56 | 0.85 | 0.72 | 0.91 | 0.57 | 0.85 | 0.72 | 0.90 | 0.55 |
| SVM (Polynomial) | 0.82 | 0.71 | 0.89 | 0.50 | 0.83 | 0.72 | 0.89 | 0.52 | 0.81 | 0.70 | 0.88 | 0.47 |
| SVM (RBF) | 0.85 | 0.69 | 0.91 | 0.55 | 0.85 | 0.69 | 0.91 | 0.55 | 0.85 | 0.69 | 0.91 | 0.55 |
| F4: ‘all’ − f(=3) | | | | | | | | | | | | |
| Naive Bayes | 0.73 | 0.72 | 0.81 | 0.40 | 0.73 | 0.71 | 0.80 | 0.38 | 0.72 | 0.71 | 0.80 | 0.38 |
| k-NN | 0.83 | 0.69 | 0.89 | 0.50 | 0.81 | 0.67 | 0.88 | 0.45 | 0.82 | 0.68 | 0.89 | 0.48 |
| Decision Tree | 0.73 | 0.63 | 0.82 | 0.30 | 0.74 | 0.66 | 0.83 | 0.34 | 0.73 | 0.62 | 0.82 | 0.30 |
| SVM (Linear) | 0.84 | 0.71 | 0.90 | 0.54 | 0.85 | 0.73 | 0.91 | 0.57 | 0.85 | 0.73 | 0.91 | 0.57 |
| SVM (Polynomial) | 0.81 | 0.70 | 0.88 | 0.48 | 0.83 | 0.73 | 0.89 | 0.53 | 0.82 | 0.72 | 0.88 | 0.51 |
| SVM (RBF) | 0.85 | 0.69 | 0.91 | 0.56 | 0.85 | 0.68 | 0.91 | 0.55 | 0.85 | 0.70 | 0.91 | 0.55 |
| F4: ‘all’ − f(=4) | | | | | | | | | | | | |
| Naive Bayes | 0.73 | 0.71 | 0.80 | 0.38 | 0.74 | 0.72 | 0.81 | 0.41 | 0.72 | 0.71 | 0.80 | 0.39 |
| k-NN | 0.82 | 0.68 | 0.88 | 0.46 | 0.82 | 0.68 | 0.89 | 0.48 | 0.83 | 0.70 | 0.89 | 0.51 |
| Decision Tree | 0.74 | 0.63 | 0.82 | 0.31 | 0.74 | 0.65 | 0.83 | 0.34 | 0.74 | 0.65 | 0.83 | 0.34 |
| SVM (Linear) | 0.84 | 0.69 | 0.90 | 0.52 | 0.85 | 0.71 | 0.91 | 0.57 | 0.85 | 0.72 | 0.91 | 0.57 |
| SVM (Polynomial) | 0.81 | 0.71 | 0.88 | 0.48 | 0.82 | 0.72 | 0.88 | 0.50 | 0.83 | 0.73 | 0.89 | 0.52 |
| SVM (RBF) | 0.85 | 0.69 | 0.91 | 0.56 | 0.85 | 0.68 | 0.91 | 0.55 | 0.86 | 0.70 | 0.91 | 0.58 |
| F4: ‘all’ − f(=5) | | | | | | | | | | | | |
| Naive Bayes | 0.73 | 0.71 | 0.81 | 0.40 | 0.75 | 0.72 | 0.82 | 0.42 | 0.73 | 0.72 | 0.80 | 0.40 |
| k-NN | 0.81 | 0.67 | 0.88 | 0.45 | 0.81 | 0.67 | 0.88 | 0.46 | 0.82 | 0.68 | 0.89 | 0.49 |
| Decision Tree | 0.74 | 0.64 | 0.82 | 0.32 | 0.74 | 0.66 | 0.82 | 0.34 | 0.74 | 0.65 | 0.82 | 0.34 |
| SVM (Linear) | 0.84 | 0.69 | 0.90 | 0.53 | 0.85 | 0.70 | 0.90 | 0.55 | 0.86 | 0.73 | 0.91 | 0.59 |
| SVM (Polynomial) | 0.81 | 0.70 | 0.88 | 0.47 | 0.82 | 0.71 | 0.88 | 0.50 | 0.83 | 0.73 | 0.89 | 0.53 |
| SVM (RBF) | 0.84 | 0.69 | 0.90 | 0.53 | 0.85 | 0.68 | 0.91 | 0.55 | 0.85 | 0.69 | 0.91 | 0.56 |
| F4: ‘all’ − f(=6) | | | | | | | | | | | | |
| Naive Bayes | 0.73 | 0.71 | 0.81 | 0.39 | 0.74 | 0.72 | 0.82 | 0.41 | 0.72 | 0.71 | 0.80 | 0.38 |
| k-NN | 0.82 | 0.68 | 0.88 | 0.47 | 0.82 | 0.68 | 0.88 | 0.47 | 0.82 | 0.68 | 0.88 | 0.48 |
| Decision Tree | 0.74 | 0.63 | 0.82 | 0.31 | 0.74 | 0.67 | 0.82 | 0.35 | 0.74 | 0.65 | 0.83 | 0.34 |
| SVM (Linear) | 0.84 | 0.69 | 0.90 | 0.52 | 0.85 | 0.71 | 0.91 | 0.56 | 0.86 | 0.73 | 0.91 | 0.58 |
| SVM (Polynomial) | 0.81 | 0.71 | 0.88 | 0.47 | 0.82 | 0.72 | 0.88 | 0.50 | 0.83 | 0.73 | 0.89 | 0.52 |
| SVM (RBF) | 0.85 | 0.69 | 0.90 | 0.55 | 0.85 | 0.69 | 0.91 | 0.55 | 0.85 | 0.69 | 0.91 | 0.56 |
| F4: ‘all’ − f(=7) | | | | | | | | | | | | |
| Naive Bayes | 0.73 | 0.71 | 0.80 | 0.39 | 0.74 | 0.72 | 0.81 | 0.40 | 0.72 | 0.71 | 0.80 | 0.38 |
| k-NN | 0.82 | 0.68 | 0.88 | 0.47 | 0.81 | 0.66 | 0.88 | 0.44 | 0.82 | 0.67 | 0.88 | 0.47 |
| Decision Tree | 0.73 | 0.63 | 0.82 | 0.31 | 0.74 | 0.65 | 0.82 | 0.32 | 0.74 | 0.66 | 0.83 | 0.35 |
| SVM (Linear) | 0.84 | 0.69 | 0.90 | 0.53 | 0.85 | 0.72 | 0.91 | 0.57 | 0.86 | 0.73 | 0.91 | 0.59 |
| SVM (Polynomial) | 0.82 | 0.71 | 0.88 | 0.49 | 0.82 | 0.71 | 0.88 | 0.49 | 0.83 | 0.73 | 0.89 | 0.52 |
| SVM (RBF) | 0.84 | 0.69 | 0.90 | 0.54 | 0.85 | 0.68 | 0.91 | 0.56 | 0.85 | 0.69 | 0.91 | 0.56 |
| Classifier | Information Gain | | | | | | Gain Ratio | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall |
| Naive Bayes | 0.75 | 0.70 | 0.83 | 0.39 | 0.88 | 0.79 | 0.69 | 0.65 | 0.75 | 0.35 | 0.87 | 0.71 |
| k-NN | 0.78 | 0.62 | 0.86 | 0.37 | 0.83 | 0.89 | 0.78 | 0.61 | 0.86 | 0.37 | 0.83 | 0.90 |
| Decision Tree | 0.64 | 0.59 | 0.74 | 0.19 | 0.81 | 0.68 | 0.57 | 0.52 | 0.62 | 0.14 | 0.81 | 0.57 |
| SVM (Linear) | 0.83 | 0.67 | 0.89 | 0.49 | 0.85 | 0.94 | 0.82 | 0.66 | 0.89 | 0.48 | 0.85 | 0.94 |
| SVM (Polynomial) | 0.78 | 0.67 | 0.86 | 0.40 | 0.85 | 0.87 | 0.78 | 0.67 | 0.86 | 0.40 | 0.85 | 0.86 |
| SVM (RBF) | 0.81 | 0.55 | 0.89 | 0.43 | 0.81 | 0.98 | 0.79 | 0.55 | 0.87 | 0.36 | 0.81 | 0.94 |
| | ReliefF | | | | | | mRMR [25] | | | | | |
| | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall |
| Naive Bayes | 0.77 | 0.70 | 0.84 | 0.41 | 0.87 | 0.82 | 0.73 | 0.72 | 0.81 | 0.40 | 0.89 | 0.74 |
| k-NN | 0.80 | 0.63 | 0.87 | 0.40 | 0.83 | 0.91 | 0.82 | 0.71 | 0.88 | 0.49 | 0.86 | 0.90 |
| Decision Tree | 0.74 | 0.62 | 0.83 | 0.31 | 0.83 | 0.83 | 0.70 | 0.63 | 0.79 | 0.29 | 0.83 | 0.76 |
| SVM (Linear) | 0.85 | 0.71 | 0.90 | 0.55 | 0.86 | 0.95 | 0.84 | 0.72 | 0.90 | 0.55 | 0.87 | 0.94 |
| SVM (Polynomial) | 0.82 | 0.71 | 0.89 | 0.50 | 0.86 | 0.91 | 0.82 | 0.72 | 0.88 | 0.50 | 0.87 | 0.90 |
| SVM (RBF) | 0.84 | 0.68 | 0.90 | 0.52 | 0.85 | 0.95 | 0.85 | 0.66 | 0.90 | 0.54 | 0.85 | 0.97 |
| | Proposed HDC (Corr. > 0.90) | | | | | | Proposed HDC (Corr. > 0.95) | | | | | |
| | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall |
| Naive Bayes | 0.74 | 0.72 | 0.81 | 0.41 | 0.89 | 0.75 | 0.72 | 0.71 | 0.80 | 0.39 | 0.89 | 0.73 |
| k-NN | 0.82 | 0.68 | 0.89 | 0.48 | 0.85 | 0.92 | 0.83 | 0.70 | 0.89 | 0.51 | 0.86 | 0.93 |
| Decision Tree | 0.74 | 0.65 | 0.83 | 0.34 | 0.84 | 0.81 | 0.74 | 0.65 | 0.83 | 0.34 | 0.84 | 0.81 |
| SVM (Linear) | 0.85 | 0.71 | 0.90 | 0.55 | 0.86 | 0.95 | 0.85 | 0.72 | 0.91 | 0.57 | 0.86 | 0.95 |
| SVM (Polynomial) | 0.82 | 0.72 | 0.89 | 0.51 | 0.87 | 0.91 | 0.83 | 0.73 | 0.89 | 0.52 | 0.87 | 0.91 |
| SVM (RBF) | 0.85 | 0.68 | 0.91 | 0.55 | 0.85 | 0.97 | 0.86 | 0.70 | 0.91 | 0.58 | 0.86 | 0.97 |
| Classifier | Information Gain | | | | | | Gain Ratio | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall |
| Naive Bayes | 0.67 | 0.65 | 0.72 | 0.48 | 0.56 | 1.00 | 0.57 | 0.57 | 0.53 | 0.14 | 0.50 | 0.56 |
| k-NN | 0.52 | 0.41 | 0.64 | 0.28 | 0.47 | 1.00 | 0.67 | 0.65 | 0.72 | 0.48 | 0.56 | 1.00 |
| Decision Tree | 0.48 | 0.48 | 0.48 | −0.03 | 0.42 | 0.56 | 0.62 | 0.53 | 0.43 | 0.19 | 0.60 | 0.33 |
| SVM (Linear) | 0.57 | 0.50 | 0.67 | 0.35 | 0.50 | 1.00 | 0.57 | 0.50 | 0.67 | 0.35 | 0.50 | 1.00 |
| SVM (Polynomial) | 0.57 | 0.50 | 0.67 | 0.35 | 0.50 | 1.00 | 0.67 | 0.65 | 0.72 | 0.48 | 0.56 | 1.00 |
| SVM (RBF) | 0.52 | 0.41 | 0.64 | 0.28 | 0.47 | 1.00 | 0.67 | 0.65 | 0.72 | 0.48 | 0.56 | 1.00 |
| | ReliefF | | | | | | mRMR [25] | | | | | |
| | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall |
| Naive Bayes | 0.62 | 0.58 | 0.69 | 0.42 | 0.53 | 1.00 | 0.71 | 0.71 | 0.75 | 0.55 | 0.60 | 1.00 |
| k-NN | 0.52 | 0.41 | 0.64 | 0.28 | 0.47 | 1.00 | 0.57 | 0.54 | 0.64 | 0.26 | 0.50 | 0.89 |
| Decision Tree | 0.62 | 0.58 | 0.69 | 0.42 | 0.53 | 1.00 | 0.52 | 0.51 | 0.58 | 0.12 | 0.47 | 0.78 |
| SVM (Linear) | 0.57 | 0.50 | 0.67 | 0.35 | 0.50 | 1.00 | 0.62 | 0.58 | 0.69 | 0.42 | 0.53 | 1.00 |
| SVM (Polynomial) | 0.62 | 0.58 | 0.69 | 0.42 | 0.53 | 1.00 | 0.67 | 0.65 | 0.72 | 0.48 | 0.56 | 1.00 |
| SVM (RBF) | 0.57 | 0.50 | 0.67 | 0.35 | 0.50 | 1.00 | 0.57 | 0.50 | 0.67 | 0.35 | 0.50 | 1.00 |
| | Proposed HDC (Corr. > 0.90) | | | | | | Proposed HDC (Corr. > 0.95) | | | | | |
| | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall |
| Naive Bayes | 0.67 | 0.65 | 0.72 | 0.48 | 0.56 | 1.00 | 0.76 | 0.76 | 0.78 | 0.61 | 0.64 | 1.00 |
| k-NN | 0.67 | 0.65 | 0.72 | 0.48 | 0.56 | 1.00 | 0.62 | 0.58 | 0.69 | 0.42 | 0.53 | 1.00 |
| Decision Tree | 0.76 | 0.76 | 0.78 | 0.61 | 0.64 | 1.00 | 0.52 | 0.51 | 0.44 | 0.03 | 0.44 | 0.44 |
| SVM (Linear) | 0.57 | 0.50 | 0.67 | 0.35 | 0.50 | 1.00 | 0.57 | 0.50 | 0.67 | 0.35 | 0.50 | 1.00 |
| SVM (Polynomial) | 0.57 | 0.50 | 0.67 | 0.35 | 0.50 | 1.00 | 0.67 | 0.65 | 0.72 | 0.48 | 0.56 | 1.00 |
| SVM (RBF) | 0.57 | 0.50 | 0.67 | 0.35 | 0.50 | 1.00 | 0.57 | 0.50 | 0.67 | 0.35 | 0.50 | 1.00 |
| Study | Model | Accuracy | G-Mean | F1-Score | MCC | Precision | Recall |
|---|---|---|---|---|---|---|---|
| Polat [63] | OGA Sampling, wkNN | 0.89 | - | - | - | - | - |
| Xiong [33] | SAE (mRMR), LDA | 0.91 | - | - | - | 0.94 | - |
| Gunduz [35] | VAE (Relief), SVM | 0.96 | - | 0.97 | 0.88 | - | - |
| Solana-Lavalle [32] | Wrapper, kNN | 0.95 | - | 0.96 | 0.87 | 0.97 | 0.96 |
| Tuncer [31] | MAMa Tree, kNN | 0.97 | - | 0.96 | - | 0.97 | 0.95 |
| Proposed method | HDC, kNN | 0.96 | 0.94 | 0.97 | 0.89 | 0.97 | 0.98 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.