Fuzzy Graded Preprocessing for Robust Machine Learning: A Three-Stage Mamdani Framework with Interpretable Audit Trails
Featured Application
Abstract
1. Introduction
- (1) We introduce GDEDC, a Mamdani-type fuzzy inference framework for graded data error detection and correction. Its nine-rule base produces human-readable explanations for every decision, allowing domain experts to inspect and, if necessary, override the preprocessing logic.
- (2) The framework integrates three stages (fuzzy anomaly scoring with RMS aggregation, rule-based error classification, and sigmoid-based fuzzy-weighted imputation) into a single pipeline that corrects values in proportion to their error severity rather than applying binary accept/reject decisions. All observations are retained.
- (3) We evaluate the framework on five benchmark datasets and one real-world medical dataset (Pima Indians diabetes) with five ML classifiers across six noise levels (5–30%), using a split-first protocol that prevents data leakage, and compare against five baselines including MICE (Multiple Imputation by Chained Equations). The results confirm three findings: correction-based methods (GDEDC, KNN Imputation, and MICE) consistently outperform raw data, whereas deletion-based methods (Z-Score and IQR) degrade performance under leakage-free conditions; GDEDC matches both KNN Imputation and MICE at low noise and surpasses them at high noise (≥20%); and the graded approach generalizes to naturally noisy medical data. Statistical validation through paired t-tests, Wilcoxon signed-rank tests, Cohen’s d effect sizes, and Friedman rank analysis supports these conclusions.
2. Literature Review
2.1. Data Quality and Its Impact on Machine Learning
2.2. Traditional Data Preprocessing Approaches
2.3. Fuzzy Logic in Data Preprocessing and Quality Assessment
2.4. Summary and Research Gap
3. Preliminaries
3.1. Fuzzy Sets and Membership Functions
3.2. Fuzzy Set Operations
3.3. α-Cuts, Support, and Core of Fuzzy Sets
3.4. Fuzzy Inference Systems
3.5. Defuzzification
3.6. Machine Learning Classifiers
3.7. Evaluation Metrics
4. Proposed Methodology: GDEDC Framework
4.1. Stage 1: Fuzzy Anomaly Scoring
4.1.1. Membership Function Construction
4.1.2. Feature-Level Anomaly Score
4.1.3. RMS Weighted Aggregation
4.2. Stage 2: Rule-Based Error Classification
4.2.1. Rule Firing and Aggregation
4.2.2. Classification Thresholds
4.3. Stage 3: Sigmoid-Based Fuzzy-Weighted Imputation
4.4. Algorithm Summary
| Algorithm 1: GDEDC Framework |
|---|
| Input: Raw dataset D = {x1, …, xn} with p features |
| Params: λ (correction strength), k (neighbors), α (cut level) |
| Output: Corrected dataset |
| // STAGE 1: Fuzzy Anomaly Scoring |
| 1–4: Compute Mj, IQRj, σN and construct μN for each feature j |
| 5–11: For each observation: compute ASj, RMS aggregation AS(xi), and FCS(xi) |
| // STAGE 2: Rule-Based Error Classification |
| 12–19: For each observation: fuzzify, evaluate rules R1–R9, defuzzify to ES*, classify |
| // STAGE 3: Sigmoid-Based Fuzzy-Weighted Imputation |
| 20–28: For each flagged observation: find k clean neighbors, compute sigmoid weights, apply proportional correction |
| 29: RETURN |
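
Stages 1 and 3 of Algorithm 1 can be illustrated with a minimal sketch. The algorithm summary does not give closed forms, so every concrete choice below is an assumption made for illustration: a Gaussian normal-range membership μN centred on the median with σN = IQR/1.349, a feature score ASj = 1 − μN, uniform feature weights in the RMS aggregation, a fixed score threshold standing in for the Stage 2 rule base, and a logistic weight with an arbitrary midpoint and slope. The function names are hypothetical, and the α-cut parameter is not used.

```python
import numpy as np

def fit_reference(X_train):
    """Per-feature median M_j and robust spread sigma_N (assumed to be IQR / 1.349)."""
    median = np.median(X_train, axis=0)
    q1, q3 = np.percentile(X_train, [25, 75], axis=0)
    sigma = np.maximum((q3 - q1) / 1.349, 1e-12)  # guard against a zero IQR
    return median, sigma

def anomaly_scores(X, median, sigma, weights=None):
    """Feature-level scores AS_j = 1 - mu_N(x_j) and their weighted RMS per row."""
    mu_normal = np.exp(-0.5 * ((X - median) / sigma) ** 2)  # assumed Gaussian membership
    as_feat = 1.0 - mu_normal
    w = np.ones(X.shape[1]) if weights is None else np.asarray(weights, dtype=float)
    as_row = np.sqrt((as_feat ** 2 @ w) / w.sum())
    return as_feat, as_row

def graded_correction(X, as_feat, as_row, lam=0.8, k=5, clean_thr=0.5,
                      midpoint=0.5, slope=10.0):
    """Pull each flagged row toward the mean of its k nearest clean rows.

    The pull is scaled per feature by lam times a sigmoid of the feature score,
    so strongly deviating values move far and mild ones barely move.
    clean_thr is a stand-in for the Stage 2 classification (assumption).
    """
    X = np.asarray(X, dtype=float)
    X_out = X.copy()
    clean = X[as_row < clean_thr]
    if len(clean) == 0:
        return X_out  # nothing to borrow from; leave the data untouched
    for i in np.flatnonzero(as_row >= clean_thr):
        dist = np.linalg.norm(clean - X[i], axis=1)
        target = clean[np.argsort(dist)[:k]].mean(axis=0)
        weight = 1.0 / (1.0 + np.exp(-slope * (as_feat[i] - midpoint)))
        X_out[i] = X[i] + lam * weight * (target - X[i])
    return X_out

# Synthetic demonstration: one corrupted cell in otherwise Gaussian data.
rng = np.random.default_rng(0)
X_demo = rng.normal(0.0, 1.0, size=(200, 3))
X_demo[0, 1] = 25.0
median, sigma = fit_reference(X_demo)
as_feat, as_row = anomaly_scores(X_demo, median, sigma)
X_fixed = graded_correction(X_demo, as_feat, as_row)
```

Every row is kept: rows below the threshold pass through unchanged, and flagged rows are moved rather than dropped, which is the behaviour the 100% retention figures rely on.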
4.5. Computational Complexity Analysis
5. Experimental Results and Discussion
5.1. Experimental Setup
5.1.1. Datasets
5.1.2. Noise Injection Protocol
5.1.3. Baseline Methods
- (B1) No Preprocessing (Raw): ML models trained on the noisy data without any correction.
- (B2) Z-Score Filtering: Observations with |z| > 3 for any feature are removed.
- (B3) IQR-Based Outlier Removal: Observations outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] for any feature are removed.
- (B4) KNN Imputation: Outlier values (detected via IQR) are replaced using 5-nearest-neighbor averaging.
- (B5) MICE Imputation: Outlier values (detected via IQR, same threshold as B4) are set to missing and imputed using Multiple Imputation by Chained Equations (scikit-learn’s IterativeImputer with 10 iterations) [34]. MICE fits a sequence of Bayesian Ridge regression models, imputing each feature conditioned on the others, and iterates until convergence or the 10-iteration limit is reached. Like B4, MICE is a correction-based method that retains all observations. The same IQR-based outlier detection is used for both B4 and B5 to ensure that the only difference is the imputation strategy (neighbor averaging vs. chained equations), enabling a fair comparison.
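
The baselines above can be sketched with scikit-learn, assuming the split-first protocol means that IQR fences, z-score statistics, and imputers are all fitted on the training split only and then applied to the test split. That reading, the helper names, and the choice to correct the test split with the training fences are assumptions; `KNNImputer` and `IterativeImputer` (whose default estimator is `BayesianRidge`) are the library's own classes.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

def iqr_fences(X_train, factor=1.5):
    """Per-feature [Q1 - 1.5*IQR, Q3 + 1.5*IQR] computed on training data only."""
    q1, q3 = np.percentile(X_train, [25, 75], axis=0)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def mask_outliers(X, lower, upper):
    """Set cells outside the fences to NaN so an imputer can fill them (B4, B5)."""
    X_masked = np.array(X, dtype=float, copy=True)
    X_masked[(X_masked < lower) | (X_masked > upper)] = np.nan
    return X_masked

def zscore_keep_rows(X, mean, std, threshold=3.0):
    """B2: boolean mask of rows with |z| <= 3 on every feature."""
    return (np.abs((X - mean) / std) <= threshold).all(axis=1)

def iqr_keep_rows(X, lower, upper):
    """B3: boolean mask of rows that lie inside the fences on every feature."""
    return ((X >= lower) & (X <= upper)).all(axis=1)

def correct_with(imputer, X_train, X_test):
    """B4/B5: identical IQR detection, differing only in the imputer passed in."""
    lower, upper = iqr_fences(X_train)
    X_train_corr = imputer.fit_transform(mask_outliers(X_train, lower, upper))
    X_test_corr = imputer.transform(mask_outliers(X_test, lower, upper))
    return X_train_corr, X_test_corr

knn_imputer = KNNImputer(n_neighbors=5)                       # B4
mice_imputer = IterativeImputer(max_iter=10, random_state=0)  # B5
```

Passing the imputer as an argument keeps detection identical between B4 and B5, so the imputation strategy is the only thing that varies.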
5.1.4. Implementation Details
5.2. Classification Performance Results
5.3. Robustness Analysis Across Noise Levels
5.4. Data Retention Analysis
5.5. Statistical Significance Testing
5.6. Statistical Significance by Noise Level
5.7. Friedman Rank Analysis
5.8. Sensitivity Analysis
5.9. Error Detection Analysis
5.10. Computational Cost Analysis
5.11. Classifier-Specific Impact Analysis
5.12. Ablation Study
5.13. Real-World Validation: Pima Indians Diabetes
5.14. Detection–Correction Coupling: Same-Detected-Cells Comparison
5.15. Robustness to Alternative Noise Patterns
5.16. Discussion
5.16.1. Correction vs. Deletion: A Fundamental Distinction
5.16.2. GDEDC vs. KNN Imputation: Comparable Accuracy, Superior Interpretability
5.16.3. GDEDC vs. MICE: Proportional Correction vs. Model-Based Imputation
5.16.4. Data Retention and Practical Implications
5.16.5. RMS Aggregation and Sigmoid Correction
5.16.6. Symmetric Membership Functions and Feature Skewness
5.16.7. Limitations and Practical Recommendations
6. Conclusions and Future Work
- (1) GDEDC improves accuracy over raw data by 0.7–2.3% at all noise levels (p < 0.001, d = 0.25–0.45).
- (2) Under leakage-free conditions, deletion-based methods consistently underperform raw data. GDEDC outperforms Z-Score by +1.2–2.4% and IQR by +2.1–6.1%.
- (3) GDEDC matches KNN Imputation and MICE at 5–15% noise, then surpasses both at ≥20% noise on noise-sensitive classifiers (best Friedman rank at 20–30%).
- (4) GDEDC retains 100% of observations; IQR discards >39% at high noise.
- (5) Friedman tests confirm significant differences among methods at all noise levels (p < 0.001).
- (6) Noise-sensitive classifiers (SVM, KNN, LR) gain +1.9–2.8%; tree-based ensembles change only marginally (RF −0.6%, XGBoost −0.4%).
- (7) Performance is stable across λ (0.2–1.0) and k (3–15), with runtime under 4.2 s for up to 10,992 instances.
- (8) Ablation confirms that sigmoid-based proportional correction is the primary contributor (+2.02 pp), followed by RMS aggregation (+0.49 pp), Mamdani FIS (+0.38 pp), and FCS (+0.11 pp).
- (9) On the Pima dataset with naturally occurring missing values, GDEDC achieves the highest accuracy among the preprocessing methods (75.2%), statistically indistinguishable from raw data (75.4%, p = 0.270), and outperforms IQR by +2.97% (p < 0.001, d = 0.684).
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| GDEDC | Graded Data Error Detection and Correction |
| FIS | Fuzzy Inference System |
| ANFIS | Adaptive Neuro-Fuzzy Inference System |
| RMS | Root-Mean-Square |
| FCS | Feature Consistency Score |
| AS | Aggregated Anomaly Score |
| ES | Error Severity |
| ML | Machine Learning |
| AI | Artificial Intelligence |
| RF | Random Forest |
| SVM | Support Vector Machine |
| KNN | K-Nearest Neighbors |
| LR | Logistic Regression |
| GAN | Generative Adversarial Network |
| MIWAE | Missing Data Importance-Weighted Autoencoder |
| IQR | Interquartile Range |
| MAD | Median Absolute Deviation |
| MCD | Minimum Covariance Determinant |
| MICE | Multiple Imputation by Chained Equations |
| MAR | Missing-At-Random |
| SVD | Singular Value Decomposition |
| TP | True Positive |
| TN | True Negative |
| FP | False Positive |
| FN | False Negative |
| F1 | F1 Score |
| UCI | University of California, Irvine |
References
- Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260.
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
- Gudivada, V.; Apon, A.; Ding, J. Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. Int. J. Adv. Softw. 2017, 10, 1–20.
- Jain, S.; Shukla, S.; Wadhvani, R. Dynamic selection of normalization techniques using data complexity measures. Expert Syst. Appl. 2018, 106, 252–262.
- Ilyas, I.F.; Chu, X. Data Cleaning; ACM Books: New York, NY, USA, 2019.
- Domingos, P. A few useful things to know about machine learning. Commun. ACM 2012, 55, 78–87.
- García, S.; Luengo, J.; Herrera, F. Data Preprocessing in Data Mining; Springer: Berlin/Heidelberg, Germany, 2015.
- Donders, A.R.T.; van der Heijden, G.J.M.G.; Stijnen, T.; Reitsma, J.B. Review: A gentle introduction to imputation of missing values. J. Clin. Epidemiol. 2006, 59, 1087–1091.
- Rousseeuw, P.J.; Hubert, M. Robust statistics for outlier detection. WIREs Data Min. Knowl. Discov. 2011, 1, 73–79.
- Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data, 3rd ed.; Wiley: Hoboken, NJ, USA, 2019.
- Schafer, J.L.; Graham, J.W. Missing data: Our view of the state of the art. Psychol. Methods 2002, 7, 147–177.
- Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353.
- Zimmermann, H.-J. Fuzzy Set Theory and Its Applications, 4th ed.; Springer: Dordrecht, The Netherlands, 2001.
- Tsoukalas, L.H. Fuzzy Logic: Applications in Artificial Intelligence, Big Data, and Machine Learning; McGraw Hill: New York, NY, USA, 2023.
- Mendel, J.M. Uncertain Rule-Based Fuzzy Systems: Introduction and New Directions, 2nd ed.; Springer: Cham, Switzerland, 2017.
- Pedrycz, W.; Gomide, F. Fuzzy Systems Engineering: Toward Human-Centric Computing; Wiley: Hoboken, NJ, USA, 2007.
- Klir, G.J.; Yuan, B. Fuzzy Sets and Fuzzy Logic: Theory and Applications; Prentice Hall: Upper Saddle River, NJ, USA, 1995.
- Nettleton, D.F.; Orriols-Puig, A.; Fornells, A. A study of the effect of different types of noise on the precision of supervised learning techniques. Artif. Intell. Rev. 2010, 33, 275–306.
- Mohammed, S.; Budach, L.; Feuerpfeil, M.; Ihde, N.; Nathansen, A.; Noack, N.; Patzlaff, H.; Naumann, F.; Harmouch, H. The effects of data quality on machine learning performance on tabular data. Inf. Syst. 2025, 132, 102549.
- Wanyonyi, E.N.; Masinde, N.W. The impact of data preprocessing on machine learning model performance: A comprehensive examination. Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol. 2025, 11, 3814–3827.
- Abedjan, Z.; Chu, X.; Deng, D.; Fernandez, R.C.; Ilyas, I.F.; Ouzzani, M.; Papotti, P.; Stonebraker, M.; Tang, N. Detecting data errors: Where are we and what needs to be done? Proc. VLDB Endow. 2016, 9, 993–1004.
- Frénay, B.; Kabán, A. A comprehensive introduction to label noise. In Proceedings of the ESANN, Bruges, Belgium, 23–25 April 2014; pp. 667–676.
- Rahm, E.; Do, H.H. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 2000, 23, 3–13.
- Zha, D.; Bhat, Z.P.; Lai, K.-H.; Yang, F.; Jiang, Z.; Zhong, S.; Hu, X. Data-centric artificial intelligence: A survey. ACM Comput. Surv. 2025, 57, 1–42.
- Frénay, B.; Verleysen, M. Classification in the presence of label noise: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 845–869.
- Ng, A. Data-centric AI competition. In Proceedings of the NeurIPS Data-Centric AI Workshop, Virtual, 14 December 2021.
- Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann: Waltham, MA, USA, 2012.
- García, S.; Ramírez-Gallego, S.; Luengo, J.; Benítez, J.M.; Herrera, F. Big data preprocessing: Methods and prospects. Big Data Anal. 2016, 1, 9.
- Leys, C.; Ley, C.; Klein, O.; Bernard, P.; Licata, L. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. J. Exp. Soc. Psychol. 2013, 49, 764–766.
- Hodge, V.; Austin, J. A survey of outlier detection methodologies. Artif. Intell. Rev. 2004, 22, 85–126.
- Rousseeuw, P.J.; van Driessen, K. A fast algorithm for the minimum covariance determinant estimator. Technometrics 1999, 41, 212–223.
- Batista, G.E.A.P.A.; Monard, M.C. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell. 2003, 17, 519–533.
- Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17, 520–525.
- Van Buuren, S.; Groothuis-Oudshoorn, K. mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. 2011, 45, 67.
- Cai, J.-F.; Candès, E.J.; Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 2010, 20, 1956–1982.
- Yoon, J.; Jordon, J.; van der Schaar, M. GAIN: Missing data imputation using generative adversarial nets. In Proceedings of the ICML, Stockholm, Sweden, 10–15 July 2018; pp. 5689–5698.
- Mattei, P.-A.; Frellsen, J. MIWAE: Deep generative modelling and imputation of incomplete data sets. In Proceedings of the ICML, Long Beach, CA, USA, 9–15 June 2019; pp. 4413–4423.
- Hasan, M.F.; Sobhan, M.A. Describing fuzzy membership function and detecting the outlier by using five number summary of data. Am. J. Comput. Math. 2020, 10, 410–424.
- Naik, N.; Diao, R.; Shen, Q. Dynamic fuzzy rule interpolation and its application to intrusion detection. IEEE Trans. Fuzzy Syst. 2018, 26, 1878–1892.
- Amiri, M.; Jensen, R. Missing data imputation using fuzzy-rough methods. Neurocomputing 2016, 205, 152–164.
- Li, D.; Gu, H.; Zhang, L. A fuzzy c-means clustering algorithm based on nearest-neighbor intervals for incomplete data. Expert Syst. Appl. 2010, 37, 6942–6947.
- Jang, J.-S.R. ANFIS: Adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 1993, 23, 665–685.
- Manimurugan, S.; Majdi, A.; Mohmmed, M.A.; Narmatha, C.; Varatharajan, R. Intrusion detection in networks using crow search optimization algorithm with adaptive neuro-fuzzy inference system. Microprocess. Microsyst. 2020, 79, 103261.
- Rafique, Y.; Wu, J.; Muzaffar, A.W.; Rafique, B. An enhanced integrated fuzzy logic-based deep learning technique (EIFL-DL) for the recommendation system. PeerJ Comput. Sci. 2024, 10, e2529.
- Hazarika, B.B.; Gupta, D. Density-weighted twin SVM for binary class imbalance learning. Neural Process. Lett. 2022, 54, 1091–1130.
- Prasad, S.C.; Anagha, P.; Balasundaram, S. Robust Pinball Twin Bounded Support Vector Machine for Data Classification. Neural Process. Lett. 2023, 55, 1131–1153.
- Khushal, R.; Fatima, U. Fuzzy machine learning logic utilization on hormonal imbalance dataset. Comput. Biol. Med. 2024, 174, 108429.
- Saatchi, R. Fuzzy logic concepts, developments and implementation. Information 2024, 15, 656.
- Klement, E.P.; Mesiar, R.; Pap, E. Triangular Norms; Springer: Dordrecht, The Netherlands, 2000.
- Dubois, D.; Prade, H. New results about properties and semantics of fuzzy set-theoretic operators. In Fuzzy Sets: Theory and Applications to Policy Analysis and Information Systems; Springer: New York, NY, USA, 1980; pp. 59–75.
- Mamdani, E.H.; Assilian, S. An experiment in linguistic synthesis with a fuzzy logic controller. Int. J. Man-Mach. Stud. 1975, 7, 1–13.
- Ross, T.J. Fuzzy Logic with Engineering Applications, 4th ed.; Wiley: Hoboken, NJ, USA, 2016.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the ACM SIGKDD, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
- Cover, T.M.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
- Cox, D.R. The regression analysis of binary sequences. J. R. Stat. Soc. Ser. B 1958, 20, 215–232.
- Dua, D.; Graff, C. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. Available online: https://archive.ics.uci.edu/ml (accessed on 10 January 2026).
- Smith, J.W.; Everhart, J.E.; Dickson, W.C.; Knowler, W.C.; Johannes, R.S. Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care; American Medical Informatics Association: Washington, DC, USA, 1988; pp. 261–265.
- Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
- Jarrahi, M.H.; Memariani, A.; Guha, S. The Principles of Data-Centric AI. Commun. ACM 2023, 66, 84–92.




| Reference | Year | Fuzzy Method | Detection | Correction | ML Eval. | Proportional |
|---|---|---|---|---|---|---|
| Hasan & Sobhan [38] | 2020 | Membership Func. | Yes | No | Limited | No |
| Naik et al. [39] | 2018 | Fuzzy Rule Interp. | Partial | No | Yes | No |
| Amiri & Jensen [40] | 2016 | Fuzzy-Rough NN | No | Yes | Limited | No |
| Li et al. [41] | 2010 | Fuzzy C-Means | No | Yes | Yes | No |
| Manimurugan et al. [43] | 2020 | ANFIS | Yes | No | Limited | No |
| Rafique et al. [44] | 2024 | Fuzzy + DL | Partial | Partial | Yes | No |
| Khushal & Fatima [47] | 2024 | Fuzzy Transform | No | Partial | Yes | No |
| Proposed GDEDC | 2026 | Mamdani FIS | Yes | Yes | Yes (6 datasets, 5 baselines) | Yes |
| Rule | AS (Input 1) | FCS (Input 2) | ES (Output) | Rationale |
|---|---|---|---|---|
| R1 | Low | Consistent | Clean | Low deviation, mostly within normal range |
| R2 | Low | Partially Cons. | Clean | Low deviation overall, minor inconsistency |
| R3 | Low | Inconsistent | Suspicious | Low aggregate but scattered deviations |
| R4 | Medium | Consistent | Suspicious | Moderate deviation but concentrated |
| R5 | Medium | Partially Cons. | Suspicious | Moderate deviation, some spread |
| R6 | Medium | Inconsistent | Erroneous | Moderate deviation across many features |
| R7 | High | Consistent | Suspicious | High but focused deviation (possible outlier) |
| R8 | High | Partially Cons. | Erroneous | High deviation with spread |
| R9 | High | Inconsistent | Erroneous | High deviation across most features |
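
The rule table above can be encoded directly as a lookup and evaluated in the Mamdani style. The sketch below is illustrative only: the triangular membership functions on [0, 1], the orientation of FCS (higher means more consistent), the min operator for AND, max aggregation, and the 101-point discretised centroid are all assumptions, and `tri`, `fuzzify`, and `infer` are hypothetical names. Only the nine rule consequents are taken from the table.

```python
def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b (a shoulder when a == b or b == c)."""
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Assumed partitions of [0, 1]; the paper's actual parameters are not given here.
AS_SETS = {"Low": (0.0, 0.0, 0.5), "Medium": (0.0, 0.5, 1.0), "High": (0.5, 1.0, 1.0)}
FCS_SETS = {"Inconsistent": (0.0, 0.0, 0.5), "Partial": (0.0, 0.5, 1.0),
            "Consistent": (0.5, 1.0, 1.0)}
ES_SETS = {"Clean": (0.0, 0.0, 0.5), "Suspicious": (0.0, 0.5, 1.0),
           "Erroneous": (0.5, 1.0, 1.0)}

# Rule consequents R1-R9, copied from the rule table.
RULES = {
    "R1": ("Low", "Consistent", "Clean"),
    "R2": ("Low", "Partial", "Clean"),
    "R3": ("Low", "Inconsistent", "Suspicious"),
    "R4": ("Medium", "Consistent", "Suspicious"),
    "R5": ("Medium", "Partial", "Suspicious"),
    "R6": ("Medium", "Inconsistent", "Erroneous"),
    "R7": ("High", "Consistent", "Suspicious"),
    "R8": ("High", "Partial", "Erroneous"),
    "R9": ("High", "Inconsistent", "Erroneous"),
}

def fuzzify(value, sets):
    return {label: tri(value, *params) for label, params in sets.items()}

def infer(as_score, fcs):
    """Return the defuzzified severity ES* and the fired rules as an audit trail."""
    mu_as, mu_fcs = fuzzify(as_score, AS_SETS), fuzzify(fcs, FCS_SETS)
    strength = {label: 0.0 for label in ES_SETS}
    fired = []
    for rule_id, (as_label, fcs_label, es_label) in RULES.items():
        s = min(mu_as[as_label], mu_fcs[fcs_label])        # AND as min
        if s > 0.0:
            fired.append((rule_id, es_label, round(s, 3)))
            strength[es_label] = max(strength[es_label], s)  # aggregate by max
    grid = [i / 100 for i in range(101)]
    clipped = [max(min(strength[l], tri(y, *ES_SETS[l])) for l in ES_SETS) for y in grid]
    total = sum(clipped)
    es_star = sum(y * m for y, m in zip(grid, clipped)) / total if total else 0.0
    return es_star, fired

es_star, fired = infer(0.8, 0.3)  # high-ish deviation, fairly inconsistent
```

The `fired` list is what makes each decision inspectable: it names every rule that contributed and how strongly, which a domain expert can read against the rationale column.
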
| Dataset | Instances | Features | Classes | Class Distribution |
|---|---|---|---|---|
| Iris | 150 | 4 | 3 | 50/50/50 |
| Wine | 178 | 13 | 3 | 59/71/48 |
| Breast Cancer | 569 | 30 | 2 | 212/357 |
| Seeds | 210 | 7 | 3 | 70/70/70 |
| Pendigits | 10,992 | 16 | 10 | ~1055–1144 per class |
| Pima Diabetes | 768 | 8 | 2 | 500/268 |
| Method | RF | SVM | KNN | LR | XGBoost |
|---|---|---|---|---|---|
| Raw (B1) | 93.0 ± 4.9 | 87.3 ± 5.6 | 86.9 ± 5.4 | 81.6 ± 9.0 | 92.5 ± 5.1 |
| Z-Score (B2) | 92.5 ± 4.8 | 86.1 ± 5.0 | 86.0 ± 5.7 | 81.2 ± 8.8 | 92.2 ± 5.1 |
| IQR (B3) | 92.1 ± 4.8 | 81.9 ± 6.2 | 84.1 ± 5.6 | 79.6 ± 8.2 | 91.4 ± 5.1 |
| KNN Imp. (B4) | 93.0 ± 5.0 | 89.3 ± 5.7 | 88.4 ± 6.3 | 84.4 ± 9.8 | 92.8 ± 4.9 |
| MICE (B5) | 93.1 ± 4.8 | 90.0 ± 5.5 | 89.4 ± 5.9 | 85.3 ± 9.1 | 92.9 ± 5.1 |
| GDEDC | 92.4 ± 5.1 | 89.1 ± 5.2 | 88.7 ± 5.1 | 84.3 ± 8.5 | 92.1 ± 5.0 |
| Method | 5% | 10% | 15% | 20% | 25% | 30% |
|---|---|---|---|---|---|---|
| Raw (B1) | 91.6 ± 5.7 | 88.3 ± 7.5 | 85.3 ± 8.6 | 82.1 ± 10.0 | 79.1 ± 11.3 | 75.9 ± 12.5 |
| Z-Score (B2) | 91.1 ± 5.8 | 87.6 ± 7.4 | 84.6 ± 8.4 | 81.6 ± 9.9 | 78.8 ± 11.1 | 75.8 ± 12.4 |
| IQR (B3) | 90.2 ± 6.2 | 85.8 ± 7.9 | 82.2 ± 9.2 | 78.8 ± 10.3 | 75.2 ± 11.2 | 72.1 ± 12.1 |
| KNN Imp. (B4) | 92.5 ± 5.7 | 89.6 ± 7.3 | 86.8 ± 8.5 | 83.8 ± 9.9 | 80.5 ± 11.4 | 77.4 ± 12.6 |
| MICE (B5) | 92.8 ± 5.5 | 90.1 ± 6.9 | 87.4 ± 8.0 | 84.4 ± 9.5 | 81.1 ± 10.9 | 77.7 ± 12.3 |
| GDEDC | 92.3 ± 5.3 | 89.4 ± 6.6 | 86.7 ± 7.6 | 84.0 ± 8.7 | 81.1 ± 9.7 | 78.2 ± 10.9 |
| Method | 5% | 10% | 15% | 20% | 25% | 30% |
|---|---|---|---|---|---|---|
| Z-Score (B2) | 84.5 | 81.3 | 82.1 | 85.0 | 89.0 | 93.7 |
| IQR (B3) | 76.2 | 68.1 | 63.8 | 61.4 | 60.9 | 61.6 |
| KNN Imp. (B4) | 100 | 100 | 100 | 100 | 100 | 100 |
| MICE (B5) | 100 | 100 | 100 | 100 | 100 | 100 |
| GDEDC | 100 | 100 | 100 | 100 | 100 | 100 |
| Comparison | Paired t-Test | Wilcoxon | Cohen’s d | Significant? |
|---|---|---|---|---|
| GDEDC vs. Raw | <0.001 | <0.001 | 0.336 | Yes |
| GDEDC vs. Z-Score | <0.001 | <0.001 | 0.474 | Yes |
| GDEDC vs. IQR | <0.001 | <0.001 | 0.637 | Yes |
| GDEDC vs. KNN Imp. | 0.060 | 0.003 | −0.069 | No |
| GDEDC vs. MICE | <0.001 | <0.001 | −0.229 | Yes (MICE better) |
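
The three quantities reported per comparison can be reproduced on paired accuracy vectors with SciPy. This is a sketch under assumptions: the pairing unit (for example, one accuracy per dataset, classifier, noise level, and repetition) is not specified here, the paired form of Cohen's d (mean difference divided by the standard deviation of the differences) may differ from the variant used in the study, and the numbers are synthetic. `compare_paired` is a hypothetical helper.

```python
import numpy as np
from scipy import stats

def compare_paired(acc_a, acc_b):
    """Paired t-test, Wilcoxon signed-rank test, and a paired Cohen's d for A vs. B."""
    a = np.asarray(acc_a, dtype=float)
    b = np.asarray(acc_b, dtype=float)
    diff = a - b
    _, t_p = stats.ttest_rel(a, b)
    _, w_p = stats.wilcoxon(a, b)
    cohen_d = diff.mean() / diff.std(ddof=1)  # assumed variant: d for paired samples
    return {"gain": diff.mean(), "t_p": t_p, "wilcoxon_p": w_p, "cohen_d": cohen_d}

# Synthetic accuracies for illustration only (not the paper's results).
rng = np.random.default_rng(0)
acc_baseline = rng.normal(0.85, 0.05, size=50)
acc_method = acc_baseline + rng.normal(0.01, 0.01, size=50)
result = compare_paired(acc_method, acc_baseline)
```

Reporting both tests is useful because the t-test assumes roughly normal differences while the Wilcoxon test does not, so the two can disagree on borderline comparisons.
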
| Dataset | GDEDC% | Raw% | Gain | t-Test p | Cohen’s d | Sig? |
|---|---|---|---|---|---|---|
| Iris | 85.9 | 85.1 | +0.9 | 0.008 | 0.219 | Yes |
| Wine | 92.1 | 91.0 | +1.2 | <0.001 | 0.291 | Yes |
| Breast Cancer | 94.2 | 93.2 | +0.9 | <0.001 | 0.423 | Yes |
| Seeds | 86.4 | 84.9 | +1.6 | <0.001 | 0.423 | Yes |
| Pendigits | 88.1 | 87.1 | +1.0 | <0.001 | 0.561 | Yes |
| Comparison | Metric | 5% | 10% | 15% | 20% | 25% | 30% |
|---|---|---|---|---|---|---|---|
| GDEDC vs. Raw | Gain% | +0.70 | +1.10 | +1.47 | +1.91 | +2.07 | +2.32 |
| | p-value | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| | Cohen’s d | 0.25 | 0.34 | 0.37 | 0.41 | 0.42 | 0.45 |
| GDEDC vs. Z-Score | Gain% | +1.21 | +1.76 | +2.09 | +2.39 | +2.33 | +2.43 |
| | p-value | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| | Cohen’s d | 0.39 | 0.47 | 0.51 | 0.48 | 0.47 | 0.45 |
| GDEDC vs. IQR | Gain% | +2.12 | +3.53 | +4.55 | +5.23 | +5.92 | +6.14 |
| | p-value | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| | Cohen’s d | 0.52 | 0.64 | 0.68 | 0.66 | 0.70 | 0.73 |
| GDEDC vs. KNN Imp. | Gain% | −0.19 | −0.22 | −0.02 | +0.29 | +0.64 | +0.85 |
| | p-value | 0.059 | 0.060 | 0.902 | 0.050 * | <0.001 ** | <0.001 ** |
| | Cohen’s d | −0.07 | −0.07 | −0.00 | 0.07 | 0.14 | 0.19 |
| GDEDC vs. MICE | Gain% | −0.50 | −0.79 | −0.66 | −0.38 | +0.08 | +0.49 |
| | p-value | <0.001 | <0.001 | <0.001 | 0.020 | 0.674 | 0.006 ** |
| | Cohen’s d | −0.17 | −0.23 | −0.17 | −0.08 | 0.02 | 0.10 |
| Method | 5% | 10% | 15% | 20% | 25% | 30% | Combined |
|---|---|---|---|---|---|---|---|
| Raw (B1) | 2.94 | 3.11 | 3.12 | 3.17 | 3.13 | 3.15 | 3.10 |
| Z-Score (B2) | 3.79 | 3.89 | 3.93 | 3.79 | 3.72 | 3.72 | 3.81 |
| IQR (B3) | 4.47 | 4.73 | 4.80 | 4.73 | 4.80 | 4.84 | 4.73 |
| KNN Imp. (B4) | 2.85 | 2.82 | 2.91 | 2.98 | 3.10 | 2.97 | 2.94 |
| MICE (B5) | 3.11 | 2.81 | 2.83 | 2.92 | 3.02 | 3.15 | 2.97 |
| GDEDC | 3.84 | 3.64 | 3.41 | 3.40 | 3.24 | 3.16 | 3.45 |
| Method | 5% | 10% | 15% | 20% | 25% | 30% | Combined |
|---|---|---|---|---|---|---|---|
| Raw (B1) | 3.51 | 3.68 | 3.66 | 3.74 | 3.60 | 3.62 | 3.63 |
| Z-Score (B2) | 4.09 | 4.24 | 4.28 | 4.34 | 4.17 | 4.16 | 4.21 |
| IQR (B3) | 4.79 | 5.08 | 5.12 | 5.00 | 5.10 | 5.10 | 5.03 |
| KNN Imp. (B4) | 2.60 | 2.60 | 2.72 | 2.79 | 3.02 | 2.95 | 2.78 |
| MICE (B5) | 2.72 | 2.42 | 2.48 | 2.57 | 2.74 | 2.91 | 2.64 |
| GDEDC | 3.29 | 2.98 | 2.75 | 2.55 | 2.38 | 2.26 | 2.70 |
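
Average Friedman ranks of the kind tabulated above can be computed by ranking the methods within each evaluation block and averaging per method. The sketch assumes that rank 1 goes to the highest accuracy in a block and that ties share the average rank; the block definition and the data are hypothetical. With six methods the average ranks always sum to 21, which is a quick consistency check on any such table.

```python
import numpy as np
from scipy import stats

def friedman_summary(acc):
    """acc has shape (n_blocks, n_methods); returns mean ranks, statistic, p-value."""
    acc = np.asarray(acc, dtype=float)
    ranks = stats.rankdata(-acc, axis=1)  # rank 1 = best accuracy within each block
    statistic, p_value = stats.friedmanchisquare(*acc.T)
    return ranks.mean(axis=0), statistic, p_value

# Synthetic accuracies for illustration only (not the paper's results).
rng = np.random.default_rng(0)
block_level = rng.uniform(0.75, 0.95, size=(30, 1))
method_offset = np.array([0.00, -0.01, -0.03, 0.015, 0.02, 0.01])
acc_demo = block_level + method_offset + rng.normal(0.0, 0.005, size=(30, 6))
avg_ranks, statistic, p_value = friedman_summary(acc_demo)
```

Ranking within blocks removes the large accuracy differences between datasets and noise levels, so the comparison reflects only the ordering of the methods.
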
| λ | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
|---|---|---|---|---|---|
| Accuracy (%) | 88.7 | 89.1 | 89.2 | 89.4 | 89.2 |
| k | 3 | 5 | 7 | 10 | 15 |
|---|---|---|---|---|---|
| Accuracy (%) | 89.3 | 89.4 | 89.4 | 89.4 | 89.5 |
| Metric | Iris | Wine | Breast Cancer | Seeds | Pendigits | Average |
|---|---|---|---|---|---|---|
| Detection Precision | 0.41 | 0.82 | 0.98 | 0.61 | 0.87 | 0.74 |
| Detection Recall | 0.83 | 0.78 | 0.72 | 0.78 | 0.71 | 0.76 |
| Detection F1 | 0.55 | 0.80 | 0.83 | 0.68 | 0.79 | 0.73 |
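
Detection precision, recall, and F1 follow from the TP, FP, and FN counts listed in the abbreviations. The sketch below compares a boolean mask of flagged entries with a boolean mask of injected errors; whether the study counts cells or whole observations is not stated here, so the function accepts masks of any shape, and `detection_metrics` is a hypothetical name.

```python
import numpy as np

def detection_metrics(flagged, corrupted):
    """Precision, recall, and F1 of a detection mask against the ground-truth mask."""
    flagged = np.asarray(flagged, dtype=bool)
    corrupted = np.asarray(corrupted, dtype=bool)
    tp = int(np.sum(flagged & corrupted))    # flagged and truly corrupted
    fp = int(np.sum(flagged & ~corrupted))   # flagged but actually clean
    fn = int(np.sum(~flagged & corrupted))   # corrupted but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy masks for illustration only.
precision, recall, f1 = detection_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Because F1 is the harmonic mean of precision and recall, a low precision with high recall (as in the Iris column) pulls F1 well below the recall value.
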
| Method | Iris (150 × 4) | Wine (178 × 13) | BC (569 × 30) | Seeds (210 × 7) | Pendigits (10,992 × 16) |
|---|---|---|---|---|---|
| Raw (B1) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Z-Score (B2) | 0.03 | 0.04 | 0.09 | 0.03 | 1.59 |
| IQR (B3) | 0.10 | 0.13 | 0.35 | 0.11 | 3.41 |
| KNN Imp. (B4) | 2.13 | 9.26 | 9015.28 | 10.06 | 3153.35 |
| MICE (B5) | 12.00 | 28.74 | 890.76 | 22.83 | 427.22 |
| GDEDC | 35.37 | 44.94 | 210.93 | 47.09 | 3085.23 |
| Classifier | Type | Raw Avg (%) | GDEDC Avg (%) | Gain (%) |
|---|---|---|---|---|
| RF | Tree ensemble | 93.0 | 92.4 | −0.6 |
| XGBoost | Tree ensemble | 92.5 | 92.1 | −0.4 |
| SVM | Distance-based | 87.3 | 89.1 | +1.9 |
| KNN | Distance-based | 86.9 | 88.7 | +1.9 |
| LR | Coefficient-based | 81.6 | 84.3 | +2.8 |
| Variant | Ablated Component | 10% | 20% | 30% | Avg Drop |
|---|---|---|---|---|---|
| GDEDC-Full | — (control) | 89.36 | 84.04 | 78.21 | — |
| A1-MeanAgg | RMS → weighted mean | 88.90 | 83.56 | 77.68 | −0.49 |
| A2-BinaryCorr | Sigmoid → binary replace | 87.24 | 81.84 | 76.47 | −2.02 |
| A3-Threshold | Mamdani FIS → AS threshold | 89.00 | 83.68 | 77.79 | −0.38 |
| A4-NoFCS | FCS removed (AS-only FIS) | 89.08 | 83.94 | 78.27 | −0.11 |
| Component | 10% | 20% | 30% | Average |
|---|---|---|---|---|
| Sigmoid proportional correction (vs. binary) | +2.12 | +2.20 | +1.74 | +2.02 |
| RMS aggregation (vs. weighted mean) | +0.46 | +0.48 | +0.53 | +0.49 |
| Mamdani FIS (vs. simple threshold) | +0.36 | +0.36 | +0.42 | +0.38 |
| Feature Consistency Score (vs. AS-only) | +0.28 | +0.10 | −0.06 | +0.11 |
| Comparison | Gain (%) | p-Value | Cohen’s d | Significant? |
|---|---|---|---|---|
| GDEDC vs. Raw | −0.19 | 0.270 | −0.090 | No |
| GDEDC vs. Z-Score | +0.29 | 0.129 | +0.125 | No |
| GDEDC vs. IQR | +2.97 | < 0.001 | +0.684 | Yes |
| GDEDC vs. KNN Imp. | +0.45 | 0.032 | +0.177 | Yes |
| GDEDC vs. MICE | +0.43 | 0.052 | +0.160 | No (borderline) |
| Method | RF | SVM | KNN | LR | XGBoost | Average |
|---|---|---|---|---|---|---|
| Raw (B1) | 76.6 ± 2.9 | 76.1 ± 2.6 | 73.9 ± 3.6 | 76.5 ± 2.4 | 73.9 ± 2.8 | 75.4 |
| Z-Score (B2) | 76.3 ± 2.9 | 75.1 ± 2.3 | 72.9 ± 3.0 | 76.4 ± 2.6 | 73.9 ± 2.3 | 74.9 |
| IQR (B3) | 75.4 ± 3.3 | 70.5 ± 2.1 | 70.1 ± 3.9 | 72.4 ± 6.0 | 72.9 ± 3.1 | 72.2 |
| KNN Imp. (B4) | 76.5 ± 2.8 | 75.6 ± 2.4 | 72.4 ± 2.2 | 76.2 ± 2.7 | 73.1 ± 2.9 | 74.8 |
| MICE (B5) | 75.8 ± 2.4 | 75.9 ± 2.4 | 72.6 ± 3.2 | 76.3 ± 2.6 | 73.4 ± 3.0 | 74.8 |
| GDEDC | 76.4 ± 2.7 | 76.0 ± 2.8 | 73.9 ± 3.5 | 76.6 ± 2.4 | 73.2 ± 2.8 | 75.2 |
| Method | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|
| Raw | 75.41 | 68.00 | 58.25 | 62.28 | 81.29 |
| Z-Score | 74.93 | 67.02 | 57.86 | 61.68 | 80.77 |
| IQR | 72.25 | 63.31 | 53.22 | 56.27 | 77.71 |
| KNN Imp. | 74.77 | 67.07 | 56.57 | 61.01 | 80.85 |
| MICE | 74.86 | 67.31 | 56.44 | 61.04 | 80.47 |
| GDEDC | 75.22 | 67.49 | 58.18 | 62.10 | 81.32 |
| Noise | IQR + Mean | IQR + KNN | IQR + MICE | IQR + GDEDC |
|---|---|---|---|---|
| 10% | 90.03 | 89.58 | 90.15 | 89.14 |
| 20% | 84.27 | 83.75 | 84.42 | 83.34 |
| 30% | 77.74 | 77.36 | 77.72 | 77.03 |
| Noise | Raw | Z-Score | IQR | KNN Imp. | MICE | GDEDC |
|---|---|---|---|---|---|---|
| 10% | 88.85 | 88.19 | 87.14 | 89.86 | 89.88 | 89.70 |
| 20% | 83.84 | 83.36 | 81.08 | 85.13 | 85.10 | 85.18 |
| Noise | Raw | Z-Score | IQR | KNN Imp. | MICE | GDEDC |
|---|---|---|---|---|---|---|
| 10% | 95.18 | 95.10 | 94.74 | 95.11 | 95.02 | 95.09 |
| 20% | 94.70 | 94.51 | 94.14 | 94.57 | 94.55 | 94.56 |
| Noise | Raw | Z-Score | IQR | KNN Imp. | MICE | GDEDC |
|---|---|---|---|---|---|---|
| 10% | 94.34 | 94.09 | 93.72 | 94.14 | 94.04 | 94.20 |
| 20% | 93.12 | 92.83 | 92.33 | 93.04 | 92.76 | 93.16 |
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Tekin, A.T. Fuzzy Graded Preprocessing for Robust Machine Learning: A Three-Stage Mamdani Framework with Interpretable Audit Trails. Appl. Sci. 2026, 16, 5072. https://doi.org/10.3390/app16105072

