Article

Fuzzy Graded Preprocessing for Robust Machine Learning: A Three-Stage Mamdani Framework with Interpretable Audit Trails

by
Ahmet Tezcan Tekin
Department of Management Engineering, Istanbul Technical University, 34469 Istanbul, Turkey
Appl. Sci. 2026, 16(10), 5072; https://doi.org/10.3390/app16105072
Submission received: 17 April 2026 / Revised: 10 May 2026 / Accepted: 15 May 2026 / Published: 19 May 2026
(This article belongs to the Collection The Development and Application of Fuzzy Logic)

Featured Application

This framework is applicable to any machine learning pipeline where training data contains errors, outliers, or noise—particularly in healthcare, finance, and manufacturing, where data quality directly impacts model reliability and where an auditable preprocessing pipeline is required for regulatory compliance.

Abstract

Data preprocessing methods for machine learning overwhelmingly rely on binary logic—a value is either valid or invalid—and the corrective action does not scale with error severity. This paper introduces GDEDC, a Mamdani-type fuzzy inference framework that replaces binary preprocessing with graded error detection and proportional correction. Operating in three stages—fuzzy anomaly scoring, nine-rule Mamdani FIS classification, and sigmoid-weighted imputation—the framework corrects each value in proportion to its estimated error severity while retaining 100% of observations and producing a human-readable audit trail. We evaluate GDEDC on five UCI datasets and the Pima Indians diabetes dataset with five classifiers across six noise levels (5–30%), comparing against five baselines including MICE. Under leakage-free conditions, deletion-based methods consistently underperform raw data, while correction-based methods (GDEDC, KNN Imputation, MICE) deliver significant improvements. GDEDC matches KNN Imputation and MICE at low noise and surpasses both at ≥20% noise: on noise-sensitive classifiers, GDEDC achieves the best Friedman rank at 20–30% noise. Real-world validation on the Pima dataset confirms generalizability, with GDEDC outperforming IQR by +2.97% (p < 0.001, d = 0.684). Ablation analysis shows that sigmoid-based proportional correction is the primary contributor (+2.02 pp), and the full pipeline outperforms every ablated variant at 10–20% noise.

1. Introduction

Machine learning algorithms are now deployed across healthcare diagnostics, financial forecasting, natural language processing, and autonomous systems [1,2]. Their predictive performance, however, depends heavily on the quality of training data, a problem often summarized as “garbage in, garbage out” [3]. Real-world datasets routinely contain missing values, outliers, inconsistencies, typographical mistakes, and measurement noise [4,5]. Left uncorrected, these errors bias parameter estimates, encourage overfitting, and weaken generalization [6]. Consider a clinical record in which systolic blood pressure is mistyped as 1200 mmHg instead of 120 mmHg. Under deletion, the entire patient observation is discarded, valid features included; with no preprocessing, the value 1200 enters training and biases a hypertension classifier.
A range of preprocessing approaches exist, from statistical outlier detection (Z-Score, interquartile range (IQR) methods) and deletion-based strategies (listwise and pairwise deletion) to imputation techniques (mean/median/mode substitution, K-Nearest Neighbors (KNN) Imputation, multiple imputation) [7,8]. These methods work well in many settings, but they share three limitations. First, statistical methods apply hard thresholds and assume specific distributions, so they cannot capture the gradual transition between valid and erroneous observations [9]. A data point just beyond a Z-Score threshold of 3.0 is treated the same as one far exceeding it, even though the two carry very different likelihoods of genuine error. Second, deletion discards observations and the information they contain, a cost that grows rapidly in small-sample regimes [10]. Third, standard imputation methods preserve dataset size but can introduce artificial patterns that do not match the underlying data-generating process [11].
These problems compound as noise increases. At 20–30% noise, IQR-based outlier removal can discard nearly 39% of observations, sometimes eliminating entire minority classes. There is a subtler issue as well: when preprocessing is confined to training data only—as it must be to prevent data leakage—deletion-based methods cannot filter the noisy test observations that still require prediction. The result is a train–test distribution mismatch that degrades classifier performance. Our experiments confirm that this theoretical limitation has substantial practical consequences. The impact is worst in domains where data collection is expensive or sample sizes are inherently small, such as clinical studies and rare event analysis.
Fuzzy logic, introduced by Zadeh [12] in 1965, offers an alternative by replacing binary classification with graded membership: each observation receives a degree of anomaly between 0 and 1 rather than a categorical label of erroneous or correct [13,14]. This fits data quality assessment naturally, since the boundary between valid and erroneous entries is rarely sharp [15,16]. More importantly, graded membership opens the door to proportional correction—something hard-threshold methods cannot do. Observations with high anomaly scores receive aggressive correction; borderline cases receive mild adjustments that preserve most of their original information.
Interpretability is another advantage. A Mamdani-type Fuzzy Inference System (FIS) expresses its rules in natural language (e.g., “IF anomaly score is High AND feature consistency is Low THEN error severity is High”), so a practitioner can inspect and modify the detection logic without specialized knowledge [17]. In regulated domains such as clinical decision support, this kind of transparency is often a compliance requirement.
The main contributions of this study are as follows:
(1) We introduce GDEDC, a Mamdani-type fuzzy inference framework for graded data error detection and correction. Its nine-rule base produces human-readable explanations for every decision, allowing domain experts to inspect and, if necessary, override the preprocessing logic.
(2) The framework integrates three stages (fuzzy anomaly scoring with RMS aggregation, rule-based error classification, and sigmoid-based fuzzy-weighted imputation) into a single pipeline that corrects values in proportion to their error severity rather than applying binary accept/reject decisions. All observations are retained.
(3) We evaluate the framework on five benchmark datasets and one real-world medical dataset (Pima Indians diabetes) with five ML classifiers across six noise levels (5–30%), using a split-first protocol that prevents data leakage, and compare against five baselines including MICE (Multiple Imputation by Chained Equations). The results confirm three findings: correction-based methods (GDEDC, KNN Imputation, and MICE) consistently outperform raw data, whereas deletion-based methods (Z-Score and IQR) degrade performance under leakage-free conditions; GDEDC matches both KNN Imputation and MICE at low noise and surpasses them at high noise (≥20%); and the graded approach generalizes to naturally noisy medical data. Statistical validation through paired t-tests, Wilcoxon signed-rank tests, Cohen’s d effect sizes, and Friedman rank analysis supports these conclusions.
Section 2 reviews related work. Section 3 introduces the necessary fuzzy set preliminaries. Section 4 describes the proposed methodology. Section 5 reports experimental results and discussion, and Section 6 concludes with directions for future work.

2. Literature Review

2.1. Data Quality and Its Impact on Machine Learning

The relationship between data quality and ML outcomes is well-documented. Gudivada et al. [3] decomposed data quality into dimensions—accuracy, completeness, consistency, timeliness, and relevance—and traced their combined effect on analytics pipelines. Nettleton et al. [18] showed that even 5–10% attribute noise can noticeably degrade classification accuracy, though the sensitivity varies by algorithm: Naive Bayes was the most robust, SVM the most affected. Mohammed et al. [19] arrived at a complementary conclusion from a data-centric AI benchmark, finding that improving data quality yielded larger gains than switching model architectures. Wanyonyi and Masinde [20] reported similar findings, showing measurable effects of preprocessing choices on training time, accuracy, and F1-Scores.
In tabular data—still the dominant format in enterprise and clinical settings—errors fall into three broad categories: numerical (outliers, measurement noise, unit conversion mistakes, rounding errors) [21], categorical (typos, encoding inconsistencies, label noise) [22], and structural (schema violations, duplicate records, referential integrity breaches) [23]. Each type calls for different handling, which motivates the development of frameworks that can address multiple error modalities within a single pipeline.
Benchmark studies have quantified the resulting performance loss. Zha et al. [24] reported 10–30% accuracy drops from label noise on standard image classification benchmarks, and Frénay and Verleysen [25] surveyed robust algorithms and noise-tolerant loss functions for classification under label noise. A common thread in this literature is the need for methods that correct—not merely discard—erroneous data points, especially when noise affects feature values rather than labels.
The data-centric AI paradigm, articulated by Ng [26] and substantiated by recent benchmarks [19,24], argues that systematic improvement of data quality often yields larger gains than model architecture changes. This perspective has gained considerable momentum: the first Data-Centric AI competition (NeurIPS 2021) demonstrated that participants who focused exclusively on data cleaning and curation outperformed those who tuned state-of-the-art models [26]. GDEDC is conceived in this spirit. Rather than treating preprocessing as a mechanical step to be dispatched before “real” modeling begins, the framework treats data quality improvement as a first-class design problem—one that benefits from the same rigor in evaluation, ablation, and statistical testing normally reserved for model selection. The interpretable rule base further aligns with the data-centric principle that domain expertise should be encodable in the pipeline, not just in the model.

2.2. Traditional Data Preprocessing Approaches

Han et al. [27] organized preprocessing into four tasks—data cleaning, integration, transformation, and reduction—and within data cleaning the primary operations include missing value handling, noise smoothing, and outlier detection and treatment [28].
Z-Score normalization, modified Z-Scores based on the median absolute deviation (MAD), and IQR-based approaches remain popular because they are fast and easy to implement [29]. Their weakness is that they assume parametric distributions (usually Gaussian) and are vulnerable to masking (one outlier hides another) and swamping (a normal point is falsely flagged because of nearby outliers) [30]. Robust alternatives such as the minimum covariance determinant (MCD) estimator reduce these problems but at higher computational cost and with their own parametric assumptions [31].
On the imputation side, mean/median/mode substitution is fast but distorts distributions and underestimates variance [32]. KNN Imputation uses local neighborhood information, yet it is sensitive to k and the distance metric, and it struggles when missingness is not random [33]. Multiple imputation by chained equations (MICE) pools results across several completed datasets, yielding valid inference under the missing-at-random (MAR) assumption [34], while SVD-based matrix factorization methods target high-dimensional data but require careful rank tuning [35].
Deep learning has also been brought to bear on imputation. GAIN [36] frames imputation as a GAN problem, and MIWAE [37] uses variational inference through an importance-weighted autoencoder. Both methods are powerful but data-hungry, which limits their applicability to the small-to-medium tabular datasets that are common in many practical settings. Specifically, GAIN requires hundreds to thousands of samples per class to train its generator and discriminator reliably, and MIWAE similarly depends on deep network capacity that is difficult to regularize on datasets of fewer than 600 instances. Because our benchmark datasets range from 150 to 10,992 instances and our focus is the tabular domain typical of industrial and clinical settings, we follow established practice [24] and limit the baseline comparison to statistical and nearest-neighbor methods (B1–B5) where the data-size assumptions match those of the proposed framework.
What all these methods have in common is binary logic: a value is either valid or invalid, and the corrective action does not scale with how suspicious the value actually is. Borderline cases lose useful information under this all-or-nothing treatment. The fuzzy approach proposed here is designed to move beyond this limitation.

2.3. Fuzzy Logic in Data Preprocessing and Quality Assessment

Fuzzy logic has been applied to data preprocessing in several ways, and its main appeal is graded membership: instead of a binary outlier/not-outlier label, each value receives a continuous degree of anomaly [12,13].
Hasan and Sobhan [38] used membership functions constructed from five-number summary statistics to quantify the degree of outlierness for each observation and reported results consistent with box-plot-based detection while offering a computationally simpler alternative, though they addressed detection only, not correction. Naik et al. [39] applied dynamic fuzzy rule interpolation to intrusion detection, showing that fuzzy inference can cope with noisy and incomplete rule bases in real time; their focus, however, was classification rather than data correction.
For missing data specifically, Amiri and Jensen [40] combined fuzzy membership with rough set approximations (fuzzy-rough nearest neighbor imputation), reporting advantages on mixed numeric/categorical datasets. Li et al. [41] took a different route, using fuzzy C-means clustering to assign membership to centroids and then imputing via weighted aggregation; they showed improvements over conventional KNN Imputation on several benchmarks.
Neuro-fuzzy hybrids combine the learning capability of neural networks with fuzzy interpretability. Jang’s ANFIS [42] is the best-known example. Manimurugan et al. [43] applied ANFIS combined with crow search optimization to intrusion detection in networks and achieved improved detection rates over conventional baselines. Rafique et al. [44] extended the idea to recommendation systems, using fuzzy preprocessing before a deep learning stage and reporting gains in both accuracy and interpretability.
In a related vein, Hazarika and Gupta [45] tackled class imbalance with a density-weighted twin SVM, while Prasad et al. [46] proposed a robust twin bounded SVM with L1-norm and pinball loss aimed at noisy features. Both demonstrate that fuzzy techniques can mitigate data quality issues, though neither provides a unified detection-and-correction pipeline.
Khushal and Fatima [47] recently developed a fuzzy transformation technique for biological datasets in which binary input variables are mapped to fuzzy representations, expanding the analysis from binary classification to three-class output. Their results suggest that transforming the input space with fuzzy logic can improve the discriminative capability of downstream classifiers.

2.4. Summary and Research Gap

GDEDC differs from prior fuzzy preprocessing in three concrete ways that do not appear together in any single entry of Table 1: (i) sigmoid-weighted proportional correction whose weight is calibrated by the joint product of severity and feature anomaly score (Equation (17)); (ii) RMS-based aggregation that amplifies single-feature anomalies typical of data-entry errors; and (iii) a leakage-free evaluation protocol that fits all fuzzy parameters on the training partition only. The remainder of this section maps each prior method against these three axes.
Table 1 makes the gap visible: existing approaches handle either detection or correction, not both, and none adjust the correction in proportion to the estimated error severity. Evaluation across multiple ML classifiers with statistical significance testing also remains uncommon. GDEDC is designed to fill these gaps by combining graded detection, rule-based classification, and proportional correction in a single pipeline, evaluated with rigorous downstream ML performance metrics. Unlike prior fuzzy approaches—which either detect anomalies without correcting them [38], correct without graded severity estimation [40,41], or apply fuzzy logic to the classifier rather than the preprocessing step [42,43]—GDEDC introduces three specific novelties absent from all entries in Table 1: (i) a sigmoid-weighted correction that scales the imputation proportionally to the estimated error severity rather than applying a binary detect-or-replace decision; (ii) an RMS-weighted aggregation that amplifies single-feature anomalies typical of data-entry errors; and (iii) a leakage-free evaluation protocol in which all fuzzy parameters are fitted on training data only, preventing the optimistic bias present in studies that preprocess before splitting.

3. Preliminaries

3.1. Fuzzy Sets and Membership Functions

A fuzzy set, as introduced by Zadeh [12], generalizes the concept of a classical (crisp) set by allowing partial membership. Formally, a fuzzy set A on a universe of discourse X is characterized by a membership function:
$$\mu_A : X \rightarrow [0, 1] \tag{1}$$
where μA(x) represents the degree of membership of element x in fuzzy set A. When μA(x) = 1, x fully belongs to A; when μA(x) = 0, x does not belong to A; and intermediate values indicate partial membership [13].
Definition 1.
(Triangular Membership Function). A triangular membership function with parameters a (left foot), b (peak), and c (right foot) is defined as
$$\mu(x; a, b, c) = \max\left(0, \min\left(\frac{x - a}{b - a}, \frac{c - x}{c - b}\right)\right) \tag{2}$$
The triangular function equals 0 outside the interval [a, c], rises linearly from a to b, and falls linearly from b to c [13].
Definition 2.
(Gaussian Membership Function). A Gaussian membership function with center c and standard deviation σ is defined as
$$\mu(x; c, \sigma) = \exp\left(-\frac{(x - c)^2}{2\sigma^2}\right) \tag{3}$$
Because the Gaussian function is infinitely differentiable and transitions smoothly between membership grades, it lends itself to continuous anomaly scoring [15,48].
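Definitions 1 and 2 translate directly into code. The following is a minimal Python sketch rather than the author's implementation; the function names `triangular` and `gaussian` are illustrative, and the triangular form assumes a < b < c so that neither slope divides by zero.

```python
import math


def triangular(x, a, b, c):
    """Triangular membership with left foot a, peak b, right foot c (a < b < c)."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))


def gaussian(x, c, sigma):
    """Gaussian membership with center c and spread sigma (sigma > 0)."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


# Evaluate both functions on a few points around a peak/center at 2.0.
for x in (0.0, 1.0, 2.0, 3.0, 5.0):
    print(x, triangular(x, 0.0, 2.0, 4.0), round(gaussian(x, 2.0, 1.0), 4))
```

The triangular grade drops to exactly 0 outside [a, c], whereas the Gaussian grade only approaches 0, which is the smooth behavior the text relies on for continuous anomaly scoring.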

3.2. Fuzzy Set Operations

The standard fuzzy set operations—union (t-conorm, typically max), intersection (t-norm, typically min), and complement (1 − μ)—are used throughout the framework. In the Mamdani FIS, the min t-norm computes rule firing strengths and the max t-conorm aggregates consequent fuzzy sets [12,49,50].

3.3. α-Cuts, Support, and Core of Fuzzy Sets

Definition 3.
(α-Cut). The α-cut of a fuzzy set A is the crisp set Aα = {x ∈ X: μA(x) ≥ α} for α ∈ (0, 1]. The strong α-cut is defined as Aα+ = {x ∈ X: μA(x) > α} [13].
Definition 4.
(Support and Core). The support of a fuzzy set A is supp(A) = {x ∈ X: μA(x) > 0}. The core of A is core(A) = {x ∈ X: μA(x) = 1}. In the context of anomaly detection, the core of the Normal fuzzy set represents the range of fully acceptable values, while observations outside the support are considered fully anomalous.
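Definitions 3 and 4 can be illustrated on a finite grid of candidate values. This is a sketch under the assumption that the universe X is sampled at a handful of points; the true α-cut, support, and core of a continuous membership function are intervals, and the helper names below are illustrative only.

```python
def triangular(x, a, b, c):
    """Triangular membership with left foot a, peak b, right foot c (a < b < c)."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))


def alpha_cut(values, mu, alpha, strong=False):
    """Elements whose membership is >= alpha (or > alpha for the strong cut)."""
    if strong:
        return [x for x in values if mu(x) > alpha]
    return [x for x in values if mu(x) >= alpha]


def support(values, mu):
    """Elements with strictly positive membership."""
    return [x for x in values if mu(x) > 0.0]


def core(values, mu):
    """Elements with full membership."""
    return [x for x in values if mu(x) == 1.0]


mu = lambda x: triangular(x, 0.0, 2.0, 4.0)
grid = [0.0, 1.0, 2.0, 3.0, 4.0]

print(alpha_cut(grid, mu, 0.5))               # [1.0, 2.0, 3.0]
print(alpha_cut(grid, mu, 0.5, strong=True))  # [2.0]
print(support(grid, mu))                      # [1.0, 2.0, 3.0]
print(core(grid, mu))                         # [2.0]
```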

3.4. Fuzzy Inference Systems

A Fuzzy Inference System maps crisp inputs to crisp outputs through fuzzification, rule evaluation (t-norm on antecedents), aggregation (t-conorm on consequents), and defuzzification. We adopt the Mamdani FIS [51], where both antecedents and consequents are fuzzy sets expressed in natural language (e.g., “IF Anomaly Score IS High AND Consistency IS Low THEN Error Severity IS Erroneous”). This allows domain experts to read, validate, and modify the rule base—an advantage when preprocessing decisions must be auditable [52].

3.5. Defuzzification

The centroid (center of gravity) method converts the aggregated output fuzzy set to a crisp value: y* = ∫ y·μ(y)dy/∫ μ(y)dy. We choose it for its smooth, continuous output that preserves the proportionality inherent in the rule base.
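In practice the two integrals are usually approximated on a discretized output axis. The sketch below uses a midpoint Riemann sum; the step count, the [0, 1] output range, and the function name `centroid` are assumptions made for illustration, not settings reported in the paper.

```python
def centroid(mu, lo=0.0, hi=1.0, steps=1000):
    """Approximate y* = integral(y * mu(y)) / integral(mu(y)) with a midpoint sum."""
    width = (hi - lo) / steps
    numerator = 0.0
    denominator = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * width
        grade = mu(y)
        numerator += y * grade
        denominator += grade
    if denominator == 0.0:
        raise ValueError("membership is zero everywhere; centroid is undefined")
    return numerator / denominator


# A symmetric triangular output set with feet 0.2 and 0.7 and peak 0.45.
suspicious = lambda y: max(0.0, min((y - 0.2) / 0.25, (0.7 - y) / 0.25))
print(round(centroid(suspicious), 3))  # approximately 0.45, the peak of the triangle
```

Because the centroid moves continuously as the aggregated set changes shape, small changes in rule firing strengths produce small changes in the crisp output.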

3.6. Machine Learning Classifiers

Five classifiers spanning different algorithmic families are used: Random Forest [53] and XGBoost [54] (tree-based ensembles), SVM [55] (margin-based), KNN [56] (instance-based), and Logistic Regression [57] (linear). All use default hyperparameters to ensure fair comparison.

3.7. Evaluation Metrics

Classification performance is evaluated using four standard metrics:
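$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{5}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{6}$$
$$\text{F1-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$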
where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. The F1-Score balances precision and recall, which matters when class distributions are uneven [27]. In the main experiments, we report accuracy as the primary metric because all benchmark datasets use stratified splits with balanced or near-balanced class distributions; Precision, Recall, and F1 are used for the error detection analysis (Section 5.9).
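The four metrics can be computed from the confusion counts as follows. This is a small illustrative sketch; returning 0.0 when a denominator is zero is a convention assumed here, not one stated in the paper, and the counts in the example are made up.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2.0 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```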

4. Proposed Methodology: GDEDC Framework

The proposed framework for graded data error detection and correction (GDEDC) consists of three main stages: (1) Fuzzy Anomaly Scoring, (2) Rule-Based Error Classification, and (3) Sigmoid-Based Fuzzy-Weighted Imputation. Figure 1 illustrates the overall architecture.

4.1. Stage 1: Fuzzy Anomaly Scoring

For each numerical feature j (j = 1, …, p) in the dataset with n observations, we compute robust feature-level statistics: the median Mj, the interquartile range IQRj = Q3j − Q1j, and a robust scale estimate σN = IQRj/1.35 (which approximates the standard deviation for normally distributed data). Using the median and IQR rather than the mean and standard deviation resists the influence of existing outliers [29].

4.1.1. Membership Function Construction

For each feature j, we construct two complementary fuzzy sets:
Normal (N): A Gaussian membership function centered at the median:
$$\mu_N(x; M_j, \sigma_N) = \exp\left(-\frac{(x - M_j)^2}{2\sigma_N^2}\right) \tag{8}$$
Anomalous (A): The complement of the Normal fuzzy set:
$$\mu_A(x) = 1 - \mu_N(x; M_j, \sigma_N) \tag{9}$$
The Gaussian membership function is symmetric about the median Mj, meaning that positive and negative deviations of equal magnitude receive identical anomaly scores: μN(Mj + Δ) = μN(Mj − Δ) for all Δ. In effect, upward and downward errors of the same size look equally suspicious, a reasonable default when the feature distribution is roughly symmetric. The Normal and Anomalous fuzzy sets also form a complementary pair satisfying μN(x) + μA(x) = 1 for all x, ensuring that every observation’s membership is fully partitioned between the two states. The implications of this symmetric assumption for skewed feature distributions are examined in Section 5.16.6.
The choice of a Gaussian (rather than triangular or trapezoidal) membership function is motivated by three considerations: (i) it is infinitely differentiable, providing smooth gradients in anomaly scores rather than abrupt transitions at breakpoints; (ii) its single parameter σN has a direct statistical interpretation as a scale estimate, making it tunable from data without manual breakpoint placement; and (iii) it is the natural membership function for normally distributed features [15]. The scale parameter σN = IQRj/1.35 approximates the standard deviation under Gaussianity while remaining resistant to outliers (the very values the membership function is designed to detect), because the IQR is determined by the central 50% of the distribution and is unaffected by extreme observations [29].

4.1.2. Feature-Level Anomaly Score

For each observation xi and feature j, the feature-level anomaly score is simply the anomaly membership evaluated at the observed value:
$$AS_j(x_i) = 1 - \mu_N(x_{ij}; M_j, \sigma_N) \tag{10}$$
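The feature-level scoring of Stage 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's code: the quartile interpolation rule (`method="inclusive"`) is a choice made here because the paper does not specify one, the zero-IQR guard is added for safety, and the blood pressure values are invented to echo the mistyped-reading example from the Introduction.

```python
import math
import statistics


def fit_feature(train_values):
    """Return (median, sigma_N) for one feature, fitted on training data only."""
    q1, _, q3 = statistics.quantiles(train_values, n=4, method="inclusive")
    median = statistics.median(train_values)
    sigma_n = (q3 - q1) / 1.35  # robust scale estimate from the IQR
    if sigma_n == 0.0:
        raise ValueError("IQR is zero; the Gaussian membership is undefined")
    return median, sigma_n


def anomaly_score(x, median, sigma_n):
    """AS_j = 1 - mu_N(x), with mu_N a Gaussian centered at the median."""
    mu_normal = math.exp(-((x - median) ** 2) / (2.0 * sigma_n ** 2))
    return 1.0 - mu_normal


# Hypothetical systolic blood pressure readings (mmHg).
train = [110, 115, 118, 120, 122, 125, 130]
median, sigma_n = fit_feature(train)

for value in (120, 126, 140, 1200):
    print(value, round(anomaly_score(value, median, sigma_n), 4))
```

A reading at the median scores 0, a mildly unusual reading receives an intermediate score, and the mistyped 1200 scores essentially 1, which is the graded behavior that a hard threshold cannot express.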

4.1.3. RMS Weighted Aggregation

The observation-level aggregated anomaly score combines feature-level scores using a root-mean-square (RMS) weighted aggregation:
$$AS(x_i) = \sqrt{\frac{\sum_{j} w_j \cdot AS_j(x_i)^2}{\sum_{j} w_j}} \tag{11}$$
where wj = 1/CVj is the importance weight of feature j, with CVj = σN/max(|Mj|, ε) the coefficient of variation and ε = 10⁻⁸ preventing division by zero. Features with lower relative variability receive higher weights, because deviations in such features are more likely to indicate genuine errors than natural variation.
We use RMS rather than a simple weighted average because RMS gives more weight to features with large anomaly scores. Single-feature anomalies—a typographical mistake corrupting one field of an otherwise valid record—are the typical data-entry error pattern, and RMS handles them better than averaging.
The Feature Consistency Score (FCS) measures the proportion of features for which the observation falls within the α-cut of the Normal fuzzy set:
$$FCS(x_i) = \frac{\left|\{\, j : \mu_N(x_{ij}) \geq \alpha \,\}\right|}{p} \tag{12}$$
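A short Python sketch makes the difference between RMS aggregation and a plain weighted mean concrete, and shows how the FCS is counted. The helper names are illustrative, and the cut level α = 0.5 in the example is a placeholder because this section does not state the value used in the experiments.

```python
import math

EPS = 1e-8  # guards against division by zero when the median is 0


def feature_weight(median, sigma_n):
    """w_j = 1 / CV_j with CV_j = sigma_N / max(|M_j|, eps); assumes sigma_n > 0."""
    cv = sigma_n / max(abs(median), EPS)
    return 1.0 / cv


def rms_aggregate(scores, weights):
    """Weighted root-mean-square of the feature-level anomaly scores."""
    return math.sqrt(sum(w * s ** 2 for s, w in zip(scores, weights)) / sum(weights))


def weighted_mean(scores, weights):
    """Plain weighted average, shown only for comparison."""
    return sum(w * s for s, w in zip(scores, weights)) / sum(weights)


def feature_consistency(mu_normals, alpha):
    """Share of features whose Normal membership is at least alpha."""
    return sum(1 for m in mu_normals if m >= alpha) / len(mu_normals)


# One corrupted field in an otherwise clean four-feature record.
scores = [1.0, 0.0, 0.0, 0.0]
weights = [1.0, 1.0, 1.0, 1.0]
print(rms_aggregate(scores, weights))   # 0.5
print(weighted_mean(scores, weights))   # 0.25

mu_normals = [1.0 - s for s in scores]
print(feature_consistency(mu_normals, alpha=0.5))  # 0.75
```

With equal weights, the single corrupted field yields an aggregated score of 0.5 under RMS but only 0.25 under averaging, while the FCS of 0.75 records that the other three features look normal.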

4.2. Stage 2: Rule-Based Error Classification

A Mamdani-type Fuzzy Inference System classifies each observation into one of three error categories. The FIS uses two input variables and one output variable, whose membership functions are shown in Figure 2:
Input 1—Aggregated Anomaly Score (AS): Three linguistic terms with trapezoidal (trapmf) and triangular (trimf) membership functions: Low: trapmf (0, 0, 0.15, 0.35), Medium: trimf (0.2, 0.4, 0.65), High: trapmf (0.5, 0.75, 1.0, 1.0).
Input 2—Feature Consistency Score (FCS): Three linguistic terms: Consistent: trapmf (0.6, 0.8, 1.0, 1.0), Partially Consistent: trimf (0.3, 0.5, 0.7), Inconsistent: trapmf (0, 0, 0.2, 0.4).
Output—Error Severity (ES): Three linguistic terms: Clean: trapmf (0, 0, 0.15, 0.35), Suspicious: trimf (0.2, 0.45, 0.7), Erroneous: trapmf (0.6, 0.8, 1.0, 1.0).
Before fuzzification, the aggregated anomaly score AS(xi) is normalized to the [0, 1] interval using the 5th and 95th percentiles of its training-set distribution: AS_norm = (AS − P5)/(P95 − P5), clipped to [0, 1], where P5 and P95 are computed from the training data only. This percentile-based normalization ensures that the linguistic term boundaries (e.g., Low: 0–0.35, High: 0.5–1.0) generalize across datasets with different anomaly score ranges, rather than being tied to absolute score magnitudes. The Feature Consistency Score FCS(xi) is not normalized because it is already defined on [0, 1] as a proportion (Equation (12)), and its raw values align naturally with the membership function ranges in Table 2.
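The percentile-based normalization can be sketched as a fit step on training scores followed by a clipped rescaling. The percentile interpolation rule and the handling of the degenerate case P95 = P5 are assumptions made here; the paper does not specify either.

```python
import statistics


def fit_percentile_bounds(train_scores):
    """Return (P5, P95) of the aggregated anomaly scores, from training data only."""
    cuts = statistics.quantiles(train_scores, n=20, method="inclusive")
    return cuts[0], cuts[-1]  # first and last of the 19 cut points


def normalize(score, p5, p95):
    """Rescale to [0, 1] using the fitted bounds, clipping values outside them."""
    if p95 == p5:
        return 0.0  # assumed fallback when the training scores have no spread
    return min(1.0, max(0.0, (score - p5) / (p95 - p5)))


# Hypothetical training scores spread evenly over [0, 1].
train_scores = [i / 100 for i in range(101)]
p5, p95 = fit_percentile_bounds(train_scores)

for raw in (-0.2, 0.05, 0.5, 0.95, 1.4):
    print(raw, round(normalize(raw, p5, p95), 3))
```

The same fitted bounds are then applied to test observations, so no test-set information enters the normalization.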
The rule base consists of nine rules designed to capture all combinations of input linguistic terms:
Rule R7 (High AS, Consistent) maps to Suspicious rather than Erroneous because an observation deviating strongly in only one or two features may be a legitimate extreme value rather than a data entry error.

4.2.1. Rule Firing and Aggregation

For each rule Rk with antecedent “IF AS is Ak AND FCS is Bk”, the firing strength is computed using the minimum t-norm:
$$\alpha_k = \min\left(\mu_{A_k}(AS), \mu_{B_k}(FCS)\right) \tag{13}$$
The aggregated output is obtained through the maximum operator:
$$\mu_{agg}(y) = \max_k\left(\min\left(\alpha_k, \mu_{C_k}(y)\right)\right) \tag{14}$$
The final crisp error severity value ES* is obtained through centroid defuzzification (Section 3.5).

4.2.2. Classification Thresholds

Based on the defuzzified error severity value ES*, observations are classified as:
Clean: ES* < 0.35—No correction applied.
Suspicious: 0.35 ≤ ES* < 0.65—Soft (partial) correction applied.
Erroneous: ES* ≥ 0.65—Full correction applied.
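The inference chain of Stage 2 (fuzzification, min firing, max aggregation, centroid defuzzification, thresholding) can be sketched as below. The membership parameters are those listed for AS, FCS, and ES, but the rule list is deliberately partial: only R7 and the High/Low-consistency example quoted in the text are taken from the paper (with "Consistency IS Low" read here as the Inconsistent term), and the third rule is a hypothetical placeholder. The full nine-rule table would have to be substituted for real use.

```python
def trapmf(x, a, b, c, d):
    """Trapezoidal membership; shoulders (a == b or c == d) are handled."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def trimf(x, a, b, c):
    """Triangular membership with feet a, c and peak b."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))


AS_TERMS = {
    "Low": lambda x: trapmf(x, 0.0, 0.0, 0.15, 0.35),
    "Medium": lambda x: trimf(x, 0.2, 0.4, 0.65),
    "High": lambda x: trapmf(x, 0.5, 0.75, 1.0, 1.0),
}
FCS_TERMS = {
    "Consistent": lambda x: trapmf(x, 0.6, 0.8, 1.0, 1.0),
    "Partially Consistent": lambda x: trimf(x, 0.3, 0.5, 0.7),
    "Inconsistent": lambda x: trapmf(x, 0.0, 0.0, 0.2, 0.4),
}
ES_TERMS = {
    "Clean": lambda y: trapmf(y, 0.0, 0.0, 0.15, 0.35),
    "Suspicious": lambda y: trimf(y, 0.2, 0.45, 0.7),
    "Erroneous": lambda y: trapmf(y, 0.6, 0.8, 1.0, 1.0),
}

# Partial rule base for illustration; NOT the paper's full nine rules.
RULES = [
    ("High", "Consistent", "Suspicious"),   # R7, as described in the text
    ("High", "Inconsistent", "Erroneous"),  # example rule quoted in the text
    ("Low", "Consistent", "Clean"),         # hypothetical placeholder
]


def error_severity(as_norm, fcs, steps=1000):
    """Min firing, max aggregation, then centroid defuzzification on [0, 1]."""
    fired = [(min(AS_TERMS[a](as_norm), FCS_TERMS[b](fcs)), c) for a, b, c in RULES]
    numerator = denominator = 0.0
    for i in range(steps):
        y = (i + 0.5) / steps
        grade = max(min(strength, ES_TERMS[c](y)) for strength, c in fired)
        numerator += y * grade
        denominator += grade
    if denominator == 0.0:
        raise ValueError("no rule fired; the partial rule base does not cover this input")
    return numerator / denominator


def classify(es):
    """Apply the classification thresholds to the defuzzified severity."""
    if es < 0.35:
        return "Clean"
    if es < 0.65:
        return "Suspicious"
    return "Erroneous"


for as_norm, fcs in [(0.9, 0.1), (0.9, 0.9), (0.05, 0.95)]:
    es = error_severity(as_norm, fcs)
    print(as_norm, fcs, round(es, 3), classify(es))
```

The second input pair shows the effect of R7: a strongly anomalous score combined with high consistency lands in the Suspicious band rather than the Erroneous one.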

4.3. Stage 3: Sigmoid-Based Fuzzy-Weighted Imputation

For observations classified as Suspicious or Erroneous, a sigmoid-based fuzzy-weighted imputation mechanism corrects the anomalous feature values. Unlike binary correction methods, GDEDC adjusts each value in proportion to its error severity.
The imputed reference value for each anomalous feature j of observation xi is computed as the fuzzy-weighted average of the k nearest clean neighbors:
$$\tilde{x}_{ij} = \frac{\sum_{l \in N_k(i)} \mu_N(x_{lj}) \cdot x_{lj}}{\sum_{l \in N_k(i)} \mu_N(x_{lj})} \tag{15}$$
where Nk(i) denotes the set of k nearest clean neighbors of observation i (observations with ES* < 0.35), identified using Euclidean distance after min-max normalization.
The corrected value is
$$\hat{x}_{ij} = (1 - \lambda \cdot w_{ij}) \cdot x_{ij} + \lambda \cdot w_{ij} \cdot \tilde{x}_{ij} \tag{16}$$
where λ ∈ [0, 1] is a correction strength parameter (default λ = 0.8), and wij is a sigmoid-based correction weight:
$$w_{ij} = \frac{1}{1 + \exp\left(-10 \cdot \left(ES^{*} \cdot AS_j(x_i) - 0.3\right)\right)} \tag{17}$$
The sigmoid function maps the combined error evidence ES* · ASj through a smooth threshold at 0.3, producing a near-binary correction response: features with low error evidence receive negligible correction, while features with strong evidence receive substantial correction. Figure 3 illustrates this behavior.
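The three formulas of Stage 3 combine as in the following Python sketch. It assumes the k nearest clean neighbors have already been found (the Euclidean search after min-max normalization is omitted), and the neighbor values, membership grades, and severity inputs in the example are invented for illustration.

```python
import math


def correction_weight(es, as_j, steepness=10.0, midpoint=0.3):
    """Sigmoid weight driven by the product of severity and feature anomaly score."""
    return 1.0 / (1.0 + math.exp(-steepness * (es * as_j - midpoint)))


def fuzzy_weighted_reference(neighbor_values, neighbor_mu_normal):
    """Average of neighbor values, weighted by their Normal membership grades."""
    total = sum(neighbor_mu_normal)
    if total == 0.0:
        raise ValueError("all neighbor memberships are zero")
    return sum(m * v for v, m in zip(neighbor_values, neighbor_mu_normal)) / total


def correct_value(x, reference, es, as_j, lam=0.8):
    """Blend the observed value with the reference in proportion to lam * w."""
    w = correction_weight(es, as_j)
    return (1.0 - lam * w) * x + lam * w * reference


# Hypothetical clean neighbors for a mistyped blood pressure reading of 1200.
reference = fuzzy_weighted_reference([118.0, 120.0, 124.0], [0.9, 1.0, 0.8])

for es, as_j in [(0.10, 0.20), (0.50, 0.60), (0.84, 1.00)]:
    w = correction_weight(es, as_j)
    print(es, as_j, round(w, 4), round(correct_value(1200.0, reference, es, as_j), 1))
```

Note that with the default λ = 0.8 the corrected value closes at most 80% of the gap between the observed value and the reference, even when the sigmoid weight is close to 1; this follows directly from the blending formula.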

4.4. Algorithm Summary

We summarize the complete framework in Algorithm 1.
Algorithm 1: GDEDC Framework
Input: Raw dataset D = {x1, …, xn} with p features
Params: λ (correction strength), k (neighbors), α (cut level)
Output: Corrected dataset D̂
//STAGE 1: Fuzzy Anomaly Scoring
1–4: Compute Mj, IQRj, σN and construct μN for each feature j
5–11: For each observation: compute ASj, RMS aggregation AS(xi), and FCS(xi)
//STAGE 2: Rule-Based Error Classification
12–19: For each observation: fuzzify, evaluate rules R1–R9, defuzzify to ES*, classify
//STAGE 3: Sigmoid-Based Fuzzy-Weighted Imputation
20–28: For each flagged observation: find k clean neighbors, compute sigmoid weights, apply proportional correction
29: RETURN D̂

4.5. Computational Complexity Analysis

The computational complexity of the GDEDC framework is as follows. Stage 1 requires O(n · p) operations for computing membership grades across all observations and features. Stage 2 requires O(n · R), where R = 9 is the number of rules. Stage 3 requires O(n_e · n_c · p) for nearest neighbor computation, where n_e and n_c are the numbers of erroneous and clean observations, respectively. The overall worst-case complexity is O(n² · p), which is comparable to KNN Imputation. Empirical runtime measurements on the benchmark datasets are reported in Section 5.10.

5. Experimental Results and Discussion

5.1. Experimental Setup

5.1.1. Datasets

Five benchmark datasets from the UCI Machine Learning Repository [58] and one real-world medical dataset (Pima Indians diabetes [59]) are used (see Table 3):
These datasets are selected for their diversity in terms of dimensionality (4 to 30 features), sample size (150 to 10,992), number of classes (2 to 10), and domain (botany, chemistry, oncology, agriculture, handwriting recognition). All features are numerical and continuous, ensuring compatibility with the Gaussian membership functions used in the GDEDC framework. As a reference upper bound, the average clean (noise-free) classification accuracy across all five classifiers and 30 runs is 94.9% for Iris, 97.6% for Wine, 97.0% for Breast Cancer, 93.1% for Seeds, and 98.1% for Pendigits.
To assess generalizability beyond synthetic noise, we additionally include the Pima Indians diabetes dataset [59], which contains naturally occurring data quality issues rather than injected errors. Five clinical features (Glucose, BloodPressure, SkinThickness, Insulin, BMI) contain zero values that are physiologically implausible and represent implicit missing data—a common pattern in real-world medical records. SkinThickness contains 29.6% zeros and Insulin contains 48.7% zeros. Following standard practice in medical data analysis, these implausible zeros are replaced with missing values before preprocessing. No additional synthetic noise is injected; the dataset is used with its naturally occurring data quality problems to evaluate each method’s real-world effectiveness. Critically, this zero-to-NaN conversion is applied identically to all six methods (Raw, Z-Score, IQR, KNN Imputation, MICE, and GDEDC) before any method-specific preprocessing begins; it is therefore a domain preprocessing step rather than a function of GDEDC itself and does not violate the 100% observation-retention property. No observations are deleted by any method on the Pima dataset—only individual cell values are imputed.

5.1.2. Noise Injection Protocol

To evaluate the framework’s performance under varying data quality conditions, we systematically inject errors at six noise levels: 5%, 10%, 15%, 20%, 25%, and 30%. For each noise level, the specified percentage of data cells are randomly selected and their values replaced with values drawn from a uniform distribution over [feature_min − 2σj, feature_max + 2σj]. This protocol simulates realistic data entry errors that produce values both within and outside the normal range. Approximately 43% of injected noise values fall within the normal data range, making detection inherently challenging.
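The injection protocol above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `inject_cell_noise`, the use of the population standard deviation for σj, and the rounding of the number of corrupted cells are all assumptions.

```python
import random
import statistics


def inject_cell_noise(rows, noise_rate, seed=0):
    """Replace a fraction of cells with uniform draws over
    [feature_min - 2*sd, feature_max + 2*sd] (sd choice is an assumption)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(rows), len(rows[0])
    noisy = [list(r) for r in rows]  # copy so the clean data is not mutated

    # Per-feature sampling bounds, computed from the clean data.
    bounds = []
    for j in range(n_cols):
        col = [r[j] for r in rows]
        sd = statistics.pstdev(col)
        bounds.append((min(col) - 2 * sd, max(col) + 2 * sd))

    # Pick the requested share of cells uniformly at random, without repeats.
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    chosen = rng.sample(cells, round(noise_rate * len(cells)))
    for i, j in chosen:
        lo, hi = bounds[j]
        noisy[i][j] = rng.uniform(lo, hi)
    return noisy, set(chosen)
```

Returning the set of corrupted cell positions alongside the noisy data makes it possible to score detection precision and recall later, since the ground-truth error mask is known.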

5.1.3. Baseline Methods

The GDEDC framework is compared against five baseline preprocessing methods:
(B1) No Preprocessing (Raw): ML models trained on the noisy data without any correction.
(B2) Z-Score Filtering: Observations with |z| > 3 for any feature are removed.
(B3) IQR-Based Outlier Removal: Observations outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] for any feature are removed.
(B4) KNN Imputation: Outlier values (detected via IQR) are replaced using 5-nearest-neighbor averaging.
(B5) MICE Imputation: Outlier values (detected via IQR, same threshold as B4) are set to missing and imputed using Multiple Imputation by Chained Equations (scikit-learn’s IterativeImputer with 10 iterations) [34]. MICE fits a sequence of Bayesian Ridge regression models, imputing each feature conditioned on the others, and iterates until convergence. Like B4, MICE is a correction-based method that retains all observations. The same IQR-based outlier detection is used for both B4 and B5 to ensure that the only difference is the imputation strategy (neighbor averaging vs. chained equations), enabling a fair comparison.
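The difference between deleting rows (B3) and masking cells for later imputation (B4, B5) can be made concrete with a short sketch. It assumes the "inclusive" quartile definition from Python's `statistics` module, which may differ slightly from the quartile convention used in the experiments; the helper names are hypothetical.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_flagged_rows(rows, fences):
    """B3-style deletion: a row is removed if any feature lies outside its fences."""
    return [r for r in rows
            if all(lo <= v <= hi for v, (lo, hi) in zip(r, fences))]


def mask_flagged_cells(rows, fences):
    """B4/B5-style detection: flagged cells become None and every row is kept."""
    return [[v if lo <= v <= hi else None for v, (lo, hi) in zip(r, fences)]
            for r in rows]
```

In both cases the fences would be computed from training data only; the masked cells are then handed to whichever imputer (neighbor averaging or chained equations) the baseline uses.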

5.1.4. Implementation Details

All experiments are implemented in Python 3.14 using scikit-learn 1.8, scikit-fuzzy 0.5, and XGBoost 3.2. ML classifier hyperparameters are set to their library defaults to ensure fair comparison. Each experiment is repeated 30 times with different random seeds for noise injection and train/test splitting (80/20 stratified), and results are reported as mean ± standard deviation. GDEDC parameters are set to: λ = 0.8, k = 5, α = 0.5.
MICE imputation uses scikit-learn’s IterativeImputer (max_iter = 10, BayesianRidge estimator, default convergence settings). The random_state parameter is synchronized with each run’s seed to ensure reproducibility.
Parameter selection rationale. The membership-function breakpoints (0.15, 0.35, 0.45, 0.65, 0.75) follow standard Mamdani design heuristics with approximately 50% overlap between adjacent linguistic terms [15,52]. The 9-rule base is the natural 3 × 3 product of three AS terms (Low/Medium/High) and three FCS terms (Inconsistent/Partial/Consistent); the consequent assignment follows the principle that high anomaly with low consistency implies erroneous, low anomaly with high consistency implies clean, and the remaining cells map to suspicious. The scalar parameters (λ = 0.8, k = 5, α = 0.5, feature_correction_threshold = 0.3) were chosen by a small pilot study on Iris and Wine prior to the main experiments. The sensitivity analyses reported in Tables 12 and 13 confirm that performance is robust to ±25% perturbations of these defaults; an additional α-cut sweep on Wine, Breast Cancer, and Pendigits showed that accuracy varies by less than 0.3 percentage points across α ∈ {0.3, 0.5, 0.7}, indicating that the default α = 0.5 is not a fragile choice.
To prevent data leakage, all preprocessing methods follow a strict split-first protocol: (1) noise is injected into the full dataset, (2) the noisy dataset is split into training and test sets using stratified sampling, and (3) preprocessing is fitted exclusively on the training set and applied to both training and test sets. For correction-based methods (GDEDC, KNN Imputation, and MICE), the model parameters (for example, feature statistics, clean observation pool, and nearest-neighbor index) are estimated from training data only and used to correct both training and test observations. For deletion-based methods (Z-Score and IQR), filtering thresholds are computed from training data and applied to remove outlier observations from the training set only; the test set remains unfiltered, as in a realistic deployment scenario where all incoming observations must be classified. This protocol ensures that no information from the test set influences the preprocessing step, providing an unbiased evaluation of each method’s real-world effectiveness. Concretely, for GDEDC this means that the Gaussian membership function parameters (median M_j, robust scale σ_N = IQR_j/1.35), the percentile-based AS normalization bounds, and the clean-observation pool used by the nearest-neighbor corrector are all derived exclusively from the 80% training partition. The test partition is passed through the already-fitted transformers without contributing any statistics, exactly as a model would be deployed on unseen production data.
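A minimal sketch of the fit-on-train, apply-to-both discipline for a single feature is shown below. The class name is hypothetical, and the membership form exp(−(x − M)²/(2σ²)) is an assumption, since Equation (8) is defined outside this section; only the median and the IQR/1.35 scale come from the text.

```python
import math
import statistics


class RobustGaussianMembership:
    """Gaussian membership centered on the training median with scale IQR/1.35."""

    def fit(self, train_values):
        q1, _, q3 = statistics.quantiles(train_values, n=4, method="inclusive")
        if q3 == q1:
            raise ValueError("IQR is zero; the robust scale is undefined")
        self.median_ = statistics.median(train_values)
        self.sigma_ = (q3 - q1) / 1.35
        return self

    def membership(self, x):
        """Degree of normality in (0, 1]; uses only statistics stored by fit()."""
        z = (x - self.median_) / self.sigma_
        return math.exp(-0.5 * z * z)


# Fit once on the training partition, then score both partitions.
train, test = [1, 2, 3, 4, 5, 6, 7, 8, 9], [4, 50, 5]
scorer = RobustGaussianMembership().fit(train)
train_mu = [scorer.membership(v) for v in train]
test_mu = [scorer.membership(v) for v in test]  # no statistics are re-estimated here
```

Keeping `fit` and `membership` separate is what prevents test observations from shifting the median or the scale.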

5.2. Classification Performance Results

Table 4 presents the classification accuracy for all classifiers at the 10% noise level.
The three families of methods diverge sharply at 10% noise. All three correction-based methods (GDEDC, KNN Imputation, and MICE) improve over the raw baseline on noise-sensitive classifiers (SVM, KNN, LR)—MICE by +2.5–3.7%, KNN Imputation by +1.5–2.8%, and GDEDC by +1.9–2.8%—while tree-based ensembles (RF, XGBoost) see little change. MICE achieves the highest overall accuracy at this noise level, consistent with its model-based imputation leveraging inter-feature correlations. Deletion-based methods tell a different story: both Z-Score and IQR perform worse than raw data across all classifiers. IQR suffers the most, dropping SVM to 81.9% (vs. 87.3% raw). The reason is straightforward: IQR removes outliers from training data but cannot touch the noisy test observations that still require classification, producing a distribution mismatch. GDEDC shows the tightest standard deviation range across classifiers (5.0–8.5), pointing to stable performance regardless of classifier choice—a consequence of the proportional correction mechanism that avoids the overcorrection risk inherent in full-replacement methods.

5.3. Robustness Analysis Across Noise Levels

The pattern from Table 4 persists across all noise levels with a clear three-tier structure. Deletion-based methods (Z-Score and IQR) never beat the raw baseline; IQR drops from 90.2% at 5% noise to 72.1% at 30%—ending up 3.8 percentage points below raw data. Correction-based methods (GDEDC, KNN Imputation, and MICE) outperform raw data at every noise level. MICE leads at 5–20% noise (92.8%, 90.1%, 87.4%, 84.4%), but the gap narrows as noise increases. At 25% noise, GDEDC and MICE are tied (81.1%), and at 30% noise GDEDC takes the lead (78.2% vs. 77.7% for MICE and 77.4% for KNN Imputation). GDEDC also exhibits the lowest standard deviations at high noise (9.7 at 25%, 10.9 at 30%), indicating greater stability across datasets and classifiers—a practical advantage in deployment scenarios where the noise level is unknown a priori. Figure 4 visualizes these trends.

5.4. Data Retention Analysis

Data retention tells its own story. At 20% noise, IQR keeps only 61.4% of training observations—throwing away nearly 39%. Combined with its inability to filter test observations, this explains IQR’s poor classification numbers in Table 5. In Table 6, Z-Score filtering behaves paradoxically: retention actually rises at high noise (93.7% at 30%) because the sample mean and standard deviation become so distorted that the threshold no longer catches genuine errors. All three correction-based methods—GDEDC, KNN Imputation, and MICE—retain 100% of the data at all noise levels. They correct values instead of removing rows, which is especially valuable in data-scarce domains.
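The Z-Score paradox can be reproduced on a toy feature. The sketch below is illustrative only: it uses a single feature and a fixed gross error of 200 instead of the uniform noise of Section 5.1.2, and all names are hypothetical.

```python
import statistics


def zscore_retention(values, threshold=3.0):
    """Fraction of values kept by a |z| <= threshold filter fitted on the same data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    kept = [v for v in values if abs(v - mean) / sd <= threshold]
    return len(kept) / len(values)


def contaminate(values, n_bad, bad_value=200.0):
    """Overwrite the first n_bad entries with a gross error (toy corruption)."""
    return [bad_value] * n_bad + list(values[n_bad:])


clean = [50.0 + (i % 11) - 5 for i in range(100)]  # values between 45 and 55
light = zscore_retention(contaminate(clean, 2))    # 2% corrupted
heavy = zscore_retention(contaminate(clean, 30))   # 30% corrupted
print(light, heavy)
```

With 2% corruption the two errors sit almost seven standard deviations from the mean and are removed. With 30% corruption the errors themselves inflate the mean and standard deviation, every |z| stays below 3, and nothing is filtered, so retention rises even though the data are worse.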

5.5. Statistical Significance Testing

Table 7 pools 750 paired accuracy values (5 datasets × 5 classifiers × 30 runs); Table 8 breaks the comparison down by dataset. At 10% noise, GDEDC is significantly better than every deletion-based baseline (p < 0.001 for all). Against KNN Imputation, the two methods are statistically indistinguishable by the paired t-test (p = 0.060, d = −0.069), though the Wilcoxon signed-rank test reaches significance (p = 0.003), reflecting a small but consistent directional advantage for KNN Imputation at this noise level. MICE holds a small but significant advantage at this noise level (d = −0.229), reflecting its model-based imputation’s ability to exploit inter-feature correlations. However, as Table 9 shows, this advantage diminishes as noise increases and reverses at 30%. The comparisons that yield p < 0.001 (Table 7 and Table 9) remain significant under Holm–Bonferroni correction across the five GDEDC-vs.-baseline comparisons (adjusted α thresholds: 0.01, 0.0125, 0.017, 0.025, 0.05); the raw p-values are reported throughout. The Friedman omnibus test (Section 5.7) additionally guards against inflated Type I error across all six methods.
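The adjusted thresholds quoted above follow from the Holm step-down rule, α/(m − i) for the i-th smallest p-value among m comparisons. A short sketch, with hypothetical p-values used only to exercise the procedure:

```python
def holm_thresholds(m, alpha=0.05):
    """Step-down thresholds alpha/m, alpha/(m-1), ..., alpha/1."""
    return [alpha / (m - i) for i in range(m)]


def holm_reject(p_values, alpha=0.05):
    """Return a reject/keep flag per hypothesis under Holm-Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


print([round(t, 4) for t in holm_thresholds(5)])
# Hypothetical p-values: four very small ones and one above 0.05.
print(holm_reject([0.0005, 0.0005, 0.0005, 0.06, 0.0005]))
```

Any p-value below 0.001 clears even the strictest threshold (0.01), which is why such comparisons survive the correction regardless of their position in the ordering.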

5.6. Statistical Significance by Noise Level

Note: Gain percentages in Table 9 are computed from paired differences before rounding. GDEDC vs. Raw is significant at every noise level (p < 0.001, d = 0.25–0.45). The largest gains appear against IQR: +2.1% to +6.1% with d = 0.52–0.73. The comparison with KNN Imputation shows a crossover pattern: statistically equivalent at 5–15% noise, with GDEDC pulling ahead at 20% (+0.29%, p = 0.050) and the advantage becoming strongly significant at 25% (+0.64%, p < 0.001) and 30% (+0.85%, p < 0.001). The MICE comparison reveals a similar but shifted pattern. MICE holds a small advantage at 5–20% noise (d = −0.08 to −0.23), consistent with its model-based imputation exploiting inter-feature correlations more effectively under moderate corruption. The gap narrows monotonically as noise increases, reaching statistical equivalence at 25% (p = 0.674) and reversing at 30% (+0.49%, p = 0.006). The crossover occurs because MICE, like KNN Imputation, applies binary replacement: once a value is flagged, it is fully replaced regardless of how suspicious it actually is. At high noise levels where many borderline values exist, GDEDC’s proportional correction preserves partial information that binary methods discard.

5.7. Friedman Rank Analysis

Following the methodology of Demšar [60], we apply the Friedman test to compare all six methods simultaneously. Each comparison group corresponds to a unique combination of dataset, classifier, and seed (5 datasets × 5 classifiers × 30 seeds = 750 groups per noise level). Table 10 reports the average Friedman ranks at each noise level.
The Friedman test detects highly significant differences among the six methods at every noise level (combined: χ² = 4907.36, N = 4500, p < 0.001). KNN Imputation (2.94) and MICE (2.97) achieve the best combined ranks, followed by Raw (3.10), GDEDC (3.45), Z-Score (3.81), and IQR (4.73). The overall ranks are driven by tree-based classifiers (RF, XGBoost), which consistently place Raw at rank 1–2 because their splitting mechanism is inherently noise-tolerant. GDEDC’s rank improves steadily from 3.84 at 5% noise to 3.16 at 30%, indicating increasing competitiveness as noise grows.
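For readers unfamiliar with the procedure, average Friedman ranks and the test statistic can be computed as sketched below. Each group is one (dataset, classifier, seed) combination holding one accuracy per method; rank 1 goes to the highest accuracy and ties share the average rank. The statistic here omits the tie correction, which is a simplifying assumption, and the accuracies in the example are made up.

```python
def rank_descending(scores):
    """Rank 1 = highest score; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def friedman(groups):
    """Return (average rank per method, chi-square statistic without tie correction)."""
    n, k = len(groups), len(groups[0])
    rank_sums = [0.0] * k
    for group in groups:
        for j, r in enumerate(rank_descending(group)):
            rank_sums[j] += r
    avg = [s / n for s in rank_sums]
    chi2 = 12 * n / (k * (k + 1)) * sum((r - (k + 1) / 2) ** 2 for r in avg)
    return avg, chi2


# Three hypothetical groups, three methods each.
avg_ranks, chi2 = friedman([[0.90, 0.80, 0.70], [0.95, 0.85, 0.75], [0.60, 0.50, 0.40]])
```

A useful sanity check is that the average ranks of k methods always sum to k(k + 1)/2, which is 21 for six methods. In practice a library routine such as `scipy.stats.friedmanchisquare` would normally be used for the statistic and p-value.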
To separate the classifier-type effect, we re-run the Friedman test for noise-sensitive classifiers alone (SVM, KNN, LR; N = 450 groups per noise level) and for tree-based classifiers alone (RF, XGBoost; N = 300 groups). Table 11 shows the noise-sensitive results.
The picture changes substantially when restricted to noise-sensitive classifiers. MICE achieves the best combined rank (2.64), closely followed by GDEDC (2.70) and KNN Imputation (2.78). Critically, GDEDC’s rank improves monotonically from 3.29 at 5% noise to 2.26 at 30%, overtaking both MICE and KNN Imputation at roughly 20% noise. At 25–30% noise, GDEDC holds the best rank among all six methods (2.38 and 2.26, respectively). The combined ranking for noise-sensitive classifiers is: MICE (2.64) > GDEDC (2.70) > KNN Imp. (2.78) > Raw (3.63) > Z-Score (4.21) > IQR (5.03). The tree-based Friedman analysis tells the opposite story: Raw ranks first (combined rank 2.31), while GDEDC ranks last (4.57). This decomposition resolves the apparent contradiction between mean accuracy gains (Table 5) and overall Friedman ranks (Table 10).

5.8. Sensitivity Analysis

Accuracy peaks at λ = 0.8 and degrades gently on both sides. Moderate-to-strong correction works best (Table 12).
As shown in Table 13, performance is nearly flat across k values (89.3–89.5%), which suggests that the fuzzy weighting in Equation (15) absorbs the effect of suboptimal k choices.
Tables 12 and 13 do not include a separate sweep of the α-cut threshold (default 0.5) because it plays a different role than λ and k; the additional α-cut check summarized in Section 5.1.4 found accuracy differences below 0.3 percentage points across α ∈ {0.3, 0.5, 0.7}. Where λ and k directly control correction magnitude and neighborhood size, α only determines which features count as “consistent” when computing FCS. Since FCS is then fuzzified through overlapping membership functions (Consistent, Partially Consistent, Inconsistent), moderate changes to α (say, 0.3–0.7) shift the FCS distribution without materially changing the defuzzified error severity ES*, because the membership functions absorb the shift. We set α = 0.5 as the natural midpoint, corresponding to “at least half-normal.”

5.9. Error Detection Analysis

The average detection F1 is 0.73 (Table 14). Breast Cancer reaches a precision of 0.98, which makes sense given its 30 features: more dimensions mean more discriminative information for anomaly scoring. Iris is at the other extreme: high recall (0.83) but low precision (0.41) and F1 (0.55). Two properties of the Iris dataset drive this: with only 4 features the IQR-based scale estimate σN is narrow, so the Gaussian membership function flags more normal values as anomalous; and the small sample size (n = 150) gives fewer clean neighbors for reference value estimation, amplifying the effect of individual boundary observations. In practice, the proportional correction mechanism (Equation (16)) limits the damage from false positives: observations falsely flagged but with low feature-level anomaly scores (AS_j < 0.4) receive modest correction weights (w < 0.17 when ES* = 0.35, via Equation (17)), so their original values are largely preserved.
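As a quick consistency check on the Iris figures, F1 is the harmonic mean of precision and recall, so it is pulled toward the lower of the two values:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


iris_f1 = f1_score(0.41, 0.83)
print(round(iris_f1, 2))
```

With precision 0.41 and recall 0.83 this gives roughly 0.55, matching the reported value and showing that the low precision, not the recall, is what limits detection quality on Iris.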

5.10. Computational Cost Analysis

MICE and GDEDC are both substantially faster than KNN Imputation on the highest-dimensional dataset: on Breast Cancer (569 × 30), KNN Imputation takes 9.0 s due to its per-cell nearest-neighbor search, while MICE finishes in 891 ms and GDEDC in 211 ms—a 10× and 43× speedup, respectively. On Pendigits (10,992 × 16), GDEDC (3.1 s) and KNN Imputation (3.2 s) are comparable, while MICE is considerably faster (427 ms). Since preprocessing runs once before model training, all times are acceptable in practice (Table 15).

5.11. Classifier-Specific Impact Analysis

The impact of GDEDC depends heavily on classifier type. SVM and KNN gain +1.9%, and LR gains +2.8%, while RF and XGBoost show slight decreases (−0.4% to −0.6%). This is not unexpected. Tree-based methods partition the feature space with threshold comparisons that are relatively insensitive to individual noisy values; distance-based methods (SVM, KNN) and linear methods (LR) compute distances or weight sums that are directly perturbed by anomalous values. The mild loss for tree ensembles likely reflects a slight compression of the feature space toward cluster centroids, reducing the variance that trees rely on for splits (Table 16).
In practice, GDEDC helps the classifiers that need help—SVM, KNN, LR—the kind often deployed in healthcare and finance. For tree-based ensembles, the correction may do more harm than good.
Mechanism. Tree-based ensembles use threshold splits that already absorb single-cell corruption without smoothing, so any preprocessing that pulls corrupted values toward cluster centroids slightly reduces the within-class variance the splits exploit. SVM, KNN, and Logistic Regression depend on the smoothness of feature geometry, which GDEDC’s correction directly improves—explaining the +1.9 to +2.8 percentage-point gain on noise-sensitive classifiers and the marginal loss (−0.4 to −0.6 pp) on tree ensembles. A practical decision rule follows from this pattern and is given in Section 5.16.7.

5.12. Ablation Study

We quantify each component’s contribution by replacing it with a simpler alternative while keeping everything else fixed. Four variants are tested at three noise levels (10%, 20%, 30%) under the same leakage-free protocol with 30 runs. Table 17 shows the results.
Table 18 reports per-component contributions. All four components contribute, but not equally. The sigmoid-based proportional correction (Equation (17)) matters most, adding +2.02 percentage points on average. Replacing it with binary detect-then-replace logic (A2) causes the largest accuracy drop at every noise level. Graded correction—GDEDC’s core idea—clearly outperforms binary imputation here. Under A2, observations with moderate anomaly scores get fully replaced instead of partially corrected, which introduces unnecessary distortion.
RMS aggregation (Equation (11)) adds +0.49 pp on average, and its contribution grows with noise (+0.46 at 10%, +0.53 at 30%). RMS amplifies single-feature anomalies that a simple weighted average would dilute, and this matters more when corruption is spread across more features.
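The amplification effect is easy to see numerically. The sketch below compares a weighted mean with a weighted root-mean-square over per-feature anomaly scores. Equal weights are used for simplicity and are an assumption, as is the exact weighted form, since Equation (11) is defined outside this section.

```python
import math


def weighted_mean(scores, weights):
    """Linear aggregation: a single large score is diluted by many small ones."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def weighted_rms(scores, weights):
    """Root-mean-square aggregation: squaring emphasizes the largest scores."""
    return math.sqrt(sum(w * s * s for w, s in zip(weights, scores)) / sum(weights))


# One strongly anomalous feature among eight, all others near-normal.
scores = [0.05] * 7 + [1.0]
weights = [1.0] * 8
print(weighted_mean(scores, weights), weighted_rms(scores, weights))
```

Here the mean is about 0.17 while the RMS is about 0.36, so the same single-field error produces roughly twice the aggregate anomaly score under RMS.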
The Mamdani FIS adds +0.38 pp over a simple three-level threshold classifier (A3). The gain is modest but consistent, and it comes from the overlapping membership functions and smooth defuzzification, which handle borderline observations better than hard threshold boundaries.
FCS has the smallest effect (+0.11 average) and even turns slightly negative at 30% noise (−0.06). When many features are corrupted simultaneously, the consistency ratio is itself noisy and contributes little. At 10–20% noise, however, FCS helps by distinguishing between observations whose anomalies are concentrated in a few features versus spread across many.
At 10% and 20% noise, the full pipeline outperforms every ablated variant, which suggests the components reinforce each other. The total ablation loss sums to about 3.00 pp, of which A2 alone accounts for 2.02—proportional correction is clearly the primary driver, but the other components provide real, if smaller, gains.

5.13. Real-World Validation: Pima Indians Diabetes

The experiments above evaluate GDEDC under controlled conditions with synthetic noise injection. To test whether the framework generalizes to naturally noisy data, we evaluate all six methods on the Pima Indians diabetes dataset [59], which contains implicit missing values rather than injected errors (Table 19, Table 20 and Table 21).
The dataset consists of 768 observations with eight clinical features and binary classification (diabetes positive/negative: 268/500). Five features—Glucose, BloodPressure, SkinThickness, Insulin, and BMI—contain physiologically implausible zero values that represent implicit missing data: SkinThickness has 29.6% zeros and Insulin has 48.7% zeros. Following standard practice in medical data analysis, these implausible zeros are replaced with missing values before applying any preprocessing method. No additional synthetic noise is injected.
On Pima (65/35 class imbalance), GDEDC achieves the best ROC-AUC of any method tested (81.32%), a +3.61 AUC and +5.83 F1 advantage over IQR, the deletion baseline most vulnerable to imbalance, and is statistically tied with Raw on F1 and accuracy (within 0.2 pp). The full set of UCI metrics (Precision, Recall, F1-score, ROC-AUC) is provided in the supplementary CSV files released with the source code and is also reported in Table 21.
Notably, GDEDC achieves the highest or joint-highest accuracy on two of the five classifiers (LR: 76.6%, KNN: 73.9%) and remains within 0.6 pp of Raw on the remaining three, suggesting that the graded detection and proportional correction mechanism handles the heterogeneous error patterns in the Pima dataset—where some features have <1% zeros (Glucose) and others have nearly 50% (Insulin)—more uniformly than methods that apply the same correction intensity to all flagged values. These results demonstrate that GDEDC’s graded approach transfers to naturally noisy medical data without requiring noise-level tuning, validating its practical utility beyond controlled synthetic experiments.
The results mirror the synthetic-noise findings in two respects. First, deletion-based methods perform worst: IQR drops to 72.2% average accuracy, 3.2 percentage points below raw data, driven primarily by SVM (70.5% vs. 76.1% raw), confirming that the train–test distribution mismatch caused by training-only deletion persists with naturally occurring errors. Second, correction-based methods cluster near or above the raw baseline, with GDEDC (75.2%) achieving the highest average accuracy among them. GDEDC significantly outperforms IQR by +2.97% (p < 0.001, d = 0.684, a medium effect) and KNN Imputation by +0.45% (p = 0.032). Against Raw and MICE, GDEDC shows small, non-significant differences.

5.14. Detection–Correction Coupling: Same-Detected-Cells Comparison

To address the question of whether GDEDC’s advantage stems from its fuzzy detection or its proportional correction, we ran an additional ablation in which all imputation methods share the same IQR-based detection mask (1.5 × IQR thresholds applied uniformly). Four variants are compared: IQR + Mean replacement, IQR + KNN imputation, IQR + MICE, and a constrained IQR + GDEDC where the sigmoid-weighted correction is restricted to IQR-flagged cells. Results are averaged over four UCI datasets (Iris, Wine, Breast Cancer, Seeds; 30 runs each) (Table 22).
Under the shared IQR detection mask, IQR + MICE and IQR + Mean lead, IQR + KNN sits in the middle, and IQR + GDEDC is last by 0.7–1.1 pp at every noise level. We read this as a methodological finding rather than a defeat: GDEDC’s three stages are intentionally coupled. The sigmoid weight in Equation (17) is calibrated against the continuous anomaly score AS_j ∈ [0, 1] derived from Gaussian membership, not against IQR’s binary 1.5 × IQR flags. IQR over-flags legitimate extreme values; applying GDEDC’s partial correction to those values shifts them toward cluster centroids and damages otherwise valid signal, whereas binary replacement (Mean/KNN/MICE) overwrites the flagged cells entirely, which is paradoxically less destructive once the detection is wrong. The existing Section 5.12 ablation shows the reverse direction: replacing the sigmoid correction with binary replacement drops accuracy by 2.02 pp. Together the two ablations establish that GDEDC’s contribution is the integrated design—practitioners should not substitute IQR-based detection into GDEDC while keeping the proportional correction; the two stages must be replaced together or used together.

5.15. Robustness to Alternative Noise Patterns

The main experiments use random cell replacement, which is a useful stress test but not the only noise pattern that arises in practice. We added two additional patterns: (a) correlated noise, in which a primary cell corruption also corrupts the top-2 most correlated features in the same row (simulating error propagation from a common faulty source), and (b) systematic bias, in which a subset of rows experiences a calibration drift on roughly half their features (simulating sensor drift or batch effects). Results are averaged over four UCI datasets (30 runs each); Pendigits is excluded here for run-time reasons and is already extensively covered in Table 4 and Table 5 (Table 23, Table 24 and Table 25).
Under correlated noise GDEDC is the top method at 20% noise (85.18%, vs. MICE 85.10%, KNN Imputation 85.13%, Raw 83.84%, IQR 81.08%); at 10% noise GDEDC is within 0.2 pp of MICE and KNN Imputation. IQR drops by 4 pp at 20% noise, a particularly large failure for deletion-based methods under correlated corruption.
For systematic bias the picture depends on magnitude. With a mild ±15% drift all correction methods sit within 0.6 pp of each other and of Raw—min-max normalization combined with classifier robustness absorbs much of the bias. With a stronger ±30% drift GDEDC becomes the top method at 20% noise (93.16% vs. Raw 93.12%, KNN Imputation 93.04%); the trend across the two magnitudes is that GDEDC’s relative advantage emerges as bias magnitude grows. Across all three noise patterns (random in the main experiments, correlated, and systematic bias) IQR is consistently the worst method, reinforcing the deletion-criticism running through the paper. We note that train-test mismatched bias (where train and test come from different calibration regimes) is not tested here and remains for future work.

5.16. Discussion

We now discuss the main findings, their implications, and the limitations of the current study.

5.16.1. Correction vs. Deletion: A Fundamental Distinction

The sharpest result is the gap between correction-based methods (GDEDC, KNN Imputation, MICE) and deletion-based methods (Z-Score, IQR) once data leakage is prevented. Deletion methods never beat raw data. The reason is structural: they can only filter training observations, not the noisy test observations that still need prediction. The classifier then sees clean training data but noisy test data—a distribution mismatch that hurts rather than helps. Correction-based methods avoid this problem by transforming both splits using parameters estimated from training data alone. This result underscores why preprocessing should be applied after train/test splitting; without this discipline, deletion methods appear to work better than they actually do. This finding has direct implications for the data-centric AI agenda [26,61]. The choice of how to handle detected errors—delete vs. correct, binary vs. proportional—matters at least as much as the choice of detection method. Practitioners adopting a data-centric workflow should default to correction-based preprocessing and reserve deletion for cases where observations are demonstrably corrupt beyond recovery.

5.16.2. GDEDC vs. KNN Imputation: Comparable Accuracy, Superior Interpretability

At 5–15% noise, GDEDC and KNN Imputation produce statistically indistinguishable accuracy. The balance shifts at higher noise: GDEDC is borderline significantly better at 20% (+0.29%, p = 0.050) and clearly better at 25% (+0.64%, p < 0.001) and 30% (+0.85%, p < 0.001). The noise-sensitive Friedman analysis (Table 11) confirms this crossover: GDEDC overtakes KNN Imputation at roughly 20% noise and holds the best rank at 20–30% (2.55, 2.38, 2.26). With accuracy comparable or better, the choice hinges on secondary criteria—where GDEDC holds three advantages.
First, interpretability. For every observation, GDEDC outputs its anomaly score, feature consistency score, the fired rules and their strengths, the defuzzified error severity, and per-feature correction weights. To illustrate, here are actual outputs from the Wine dataset at 10% noise (seed 1, training set):
Erroneous example (Observation 11): AS = 0.49, AS_norm = 0.73 (High), FCS = 0.62 (Partially Consistent). Rule R8 (High + Partially → Erroneous) fires with strength 0.42. Defuzzified ES* = 0.75 (Erroneous). Eight features are corrected, including Malic acid (AS_j = 0.99, correction 79%), Proanthocyanins (AS_j = 1.00, correction 79%), Proline (AS_j = 0.86, correction 78%), Total phenols (AS_j = 0.77, correction 75%), and Alcohol (AS_j = 0.57, correction 62%). Features with low anomaly scores (e.g., Color intensity, AS_j = 0.33, correction 29%) receive proportionally lighter correction.
Complete audit-trail walk-through (Pima Insulin = 846 mg/dL). (1) Per-feature anomaly scores AS_j computed via Equation (10) give AS_Insulin = 0.97 (the value lies far in the right tail of the Insulin distribution). (2) Aggregated AS via Equation (11), weighted by the inverse coefficient of variation in each feature, equals 0.71. (3) Feature consistency score via Equation (12) gives FCS = 0.43 (only three of the eight features fall within their α = 0.5 cut). (4) Mamdani firing strengths place the strongest weight on the rule “AS = High AND FCS = Inconsistent ⇒ Erroneous”; centroid defuzzification yields ES* = 0.78. (5) The sigmoid weight from Equation (17) with ES*·AS_Insulin = 0.756 evaluates to w_ij = 0.91. (6) The neighbor-weighted reference value is computed from the five nearest clean observations of the Insulin column (Equation (15)), giving x̃ = 285. (7) The final corrected value via Equation (16) with λ = 0.8 is x̂ = 312 mg/dL, a substantial but not absolute correction that retains the row’s information without letting the extreme original value distort the classifier.
Clean example (Observation 1): AS = 0.29, AS_norm = 0.19 (Low), FCS = 0.85 (Consistent). Rule R1 (Low + Consistent → Clean) fires with strength 0.78. ES* = 0.14 (Clean). No correction is applied. This trace allows a domain expert to verify that the framework correctly identifies a normal observation.
KNN Imputation simply replaces flagged values with a neighbor average and offers no explanation for why a value was flagged or how the replacement was calculated. In clinical data pipelines and financial auditing, that lack of transparency can be a compliance problem [17].
Second, graded correction. A mildly suspicious value (ES* = 0.40) receives only 15–20% correction under GDEDC, keeping most of the original information intact. KNN Imputation would either fully replace it or leave it untouched—there is no middle ground.
Third, auditability. The nine-rule base can be reviewed and changed by domain experts before deployment. For example, if a domain requires conservative error handling, Rule R7 (High AS, Consistent FCS → Suspicious) can be remapped to Clean to prevent correction of legitimate extreme values. KNN Imputation offers no such knob.
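The kind of rule-base edit described above can be sketched as a lookup table keyed by the two linguistic terms. Only the consequents of R1, R7, R8, and the High/Inconsistent rule are taken from the text; the remaining cells default to Suspicious here as an assumption, and the min/max operators for AND and rule aggregation are the common Mamdani choices rather than confirmed details of the implementation.

```python
AS_TERMS = ("Low", "Medium", "High")
FCS_TERMS = ("Consistent", "Partial", "Inconsistent")


def build_rule_base():
    """3 x 3 rule table; unspecified cells default to 'Suspicious' (assumption)."""
    rules = {(a, f): "Suspicious" for a in AS_TERMS for f in FCS_TERMS}
    rules[("Low", "Consistent")] = "Clean"          # R1, from the text
    rules[("High", "Consistent")] = "Suspicious"    # R7, from the text
    rules[("High", "Partial")] = "Erroneous"        # R8, from the text
    rules[("High", "Inconsistent")] = "Erroneous"   # from the audit-trail example
    return rules


def fire(rules, as_mu, fcs_mu):
    """Firing strength per consequent: min for AND, max across rules."""
    strengths = {}
    for (a, f), label in rules.items():
        s = min(as_mu[a], fcs_mu[f])
        strengths[label] = max(strengths.get(label, 0.0), s)
    return strengths


default_rules = build_rule_base()
conservative = dict(default_rules)
conservative[("High", "Consistent")] = "Clean"  # the R7 remap discussed above

# Hypothetical membership degrees for one observation.
as_mu = {"Low": 0.0, "Medium": 0.2, "High": 0.8}
fcs_mu = {"Consistent": 0.7, "Partial": 0.3, "Inconsistent": 0.0}
print(fire(default_rules, as_mu, fcs_mu))
print(fire(conservative, as_mu, fcs_mu))
```

For this observation the remap moves the dominant firing strength from Suspicious to Clean, so a legitimately extreme but internally consistent value would no longer be corrected.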

5.16.3. GDEDC vs. MICE: Proportional Correction vs. Model-Based Imputation

MICE leverages inter-feature correlations through iterative conditional regression, giving it an advantage at low noise (≤15%) where its regression models are fitted on predominantly clean data. As noise increases, these models are themselves trained on corrupted values, propagating errors across features through chained imputation. GDEDC avoids this because its correction is feature-local: each value is adjusted based on its own anomaly score relative to clean neighbors, without conditioning on potentially corrupted features. In practice, MICE is the stronger choice when noise is moderate and inter-feature correlations are strong; GDEDC is preferable when noise is high or uncertain, or when an interpretable audit trail is required.

5.16.4. Data Retention and Practical Implications

Table 6 makes this concrete. IQR discards nearly 39% of training observations at 20% noise—potentially catastrophic for small datasets. Z-Score filtering shows a paradox: retention rises at high noise (93.7% at 30%) because the mean and standard deviation are so distorted that the threshold stops catching errors. GDEDC sidesteps both problems by correcting values rather than removing rows, retaining 100% of data at every noise level.
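The Z-Score paradox can be reproduced on a toy column. The sketch below is a deliberately simplified illustration, not the paper's experiment: the data are synthetic, every corrupted cell receives the same gross value, and the helper names `zscore_retention` and `corrupt` are hypothetical.

```python
from statistics import fmean, pstdev


def zscore_retention(values, threshold=3.0):
    """Fraction of values kept by a |z| <= threshold filter."""
    mean, std = fmean(values), pstdev(values)
    kept = sum(1 for v in values if abs(v - mean) / std <= threshold)
    return kept / len(values)


def corrupt(clean, fraction, bad_value=50.0):
    """Overwrite the first `fraction` of the values with one gross error."""
    n_bad = round(fraction * len(clean))
    return [bad_value] * n_bad + list(clean[n_bad:])


# 200 evenly spaced clean values in [-1, 1] (synthetic, for illustration only).
clean = [-1.0 + 2.0 * i / 199 for i in range(200)]

retention = {f: zscore_retention(corrupt(clean, f)) for f in (0.05, 0.20, 0.30)}
for fraction, kept in retention.items():
    print(f"noise {fraction:.0%}: retained {kept:.1%}")
```

At 5% corruption the filter removes exactly the corrupted cells, so retention is 95%. At 20% and 30% the corrupted cells inflate the mean and standard deviation so much that their own z-scores fall below 3, nothing is removed, and retention climbs back to 100% even though the data are worse.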

5.16.5. RMS Aggregation and Sigmoid Correction

The RMS aggregation (Equation (11)) amplifies single-feature anomalies that a simple weighted average would dilute—a useful property for catching typographical errors that corrupt only one field. The sigmoid correction weight (Equation (17)) pairs well with this: its sharp transition around 0.3 means that weakly anomalous features are left alone while strongly anomalous ones receive substantial correction.
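Both effects can be checked numerically. The sketch below makes several assumptions that are not taken from the paper: the aggregation uses equal feature weights rather than the weights of Equation (11), the sigmoid is a plain logistic function, and its steepness of 5 is a guess chosen because, together with the midpoint of 0.3 mentioned above, it gives a weight of about 0.91 for the combined evidence 0.756 reported in the Insulin walk-through. Equation (17) may be parameterized differently.

```python
import math


def weighted_mean(scores, weights):
    """Weighted arithmetic mean of per-feature anomaly scores."""
    return sum(w * s for s, w in zip(scores, weights)) / sum(weights)


def weighted_rms(scores, weights):
    """Weighted root-mean-square of per-feature anomaly scores."""
    return math.sqrt(sum(w * s * s for s, w in zip(scores, weights)) / sum(weights))


def sigmoid_weight(evidence, midpoint=0.3, steepness=5.0):
    """Logistic correction weight; midpoint from the text, steepness assumed."""
    return 1.0 / (1.0 + math.exp(-steepness * (evidence - midpoint)))


# One corrupted field among eight (illustrative scores, equal weights).
scores = [0.97] + [0.05] * 7
weights = [1.0] * 8

mean_as = weighted_mean(scores, weights)
rms_as = weighted_rms(scores, weights)
print(f"weighted mean = {mean_as:.3f}, weighted RMS = {rms_as:.3f}")
print(f"sigmoid weight at 0.756 = {sigmoid_weight(0.756):.3f}")
```

With a single high score among eight features, the RMS is roughly twice the mean, which is the amplification described above.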

5.16.6. Symmetric Membership Functions and Feature Skewness

The Gaussian membership function (Equation (8)) is symmetric about the median, so positive and negative deviations of the same magnitude get the same anomaly score. This assumption holds for Iris (all features with |skewness| < 0.5) and most of Seeds, but it is clearly violated for Breast Cancer, where 22 of 30 features have |skewness| > 1.0 (area error: 5.43, concavity error: 5.10). GDEDC still improves accuracy on Breast Cancer (+0.9% over raw at 10% noise, p < 0.001). Two factors likely explain this resilience: the IQR-based scale estimate σN depends only on the central 50% of the data and is itself resistant to skewness, and the RMS aggregation averages out individual feature-level scoring errors. That said, for highly skewed features the symmetric Gaussian will underestimate normality on the long-tail side and overestimate it on the short-tail side, which could raise false positive rates for naturally extreme but valid observations in the long tail.
Recommendation for skewed data. When feature skewness is severe (|skew| > 1, as observed for 22 of 30 features in the Breast Cancer dataset), users should either apply a Box–Cox or Yeo–Johnson transformation before GDEDC, or substitute a two-piece Gaussian membership function whose left and right widths can differ. The two-piece extension is also listed as future work in Section 6 (Future Work).
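A two-piece Gaussian membership function of the kind recommended here can be sketched in a few lines. The function name, the parameter names, and all numeric values are illustrative assumptions, and the standard form exp(−(x − c)² / (2σ²)) is assumed for each half; the paper's Equation (8) may differ in detail.

```python
import math


def two_piece_gaussian(x, center, sigma_left, sigma_right):
    """Membership in 'normal': Gaussian with separate left and right widths."""
    sigma = sigma_left if x < center else sigma_right
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


center = 10.0                                          # e.g., the feature median
symmetric = dict(sigma_left=2.0, sigma_right=2.0)      # one width for both sides
right_skewed = dict(sigma_left=2.0, sigma_right=6.0)   # wider on the long tail

x = 16.0   # a valid value on the long right tail (illustrative)
mu_symmetric = two_piece_gaussian(x, center, **symmetric)
mu_two_piece = two_piece_gaussian(x, center, **right_skewed)
print(f"symmetric: {mu_symmetric:.3f}, two-piece: {mu_two_piece:.3f}")
```

The symmetric function assigns this value a membership near zero, so it would be scored as anomalous, while the two-piece version keeps it well inside the normal range. Values to the left of the center are scored identically by both versions.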

5.16.7. Limitations and Practical Recommendations

The framework has several limitations worth noting. Tree-based ensemble classifiers perform slightly worse after GDEDC correction (RF −0.6%, XGBoost −0.4% at 10% noise), probably because the correction compresses the feature space toward cluster centroids. If a pipeline relies exclusively on tree-based models, skipping preprocessing may be the better choice. At low noise (≤15%), GDEDC performs on par with KNN Imputation and MICE without degrading accuracy, a property that deletion-based methods do not share.
In practice, the true noise level of a dataset is rarely known a priori; a practitioner choosing a preprocessing method must therefore select one that performs well across the full noise spectrum. GDEDC satisfies this requirement: it matches the best correction methods at low noise, surpasses them at high noise (≥20%), and, averaged across classifiers, matches or exceeds raw data. This makes it a safe default when the noise regime is uncertain and an interpretable audit trail is desired.
The baselines include MICE alongside the classical methods. MICE slightly outperforms GDEDC at low-to-moderate noise, confirming that model-based imputation has advantages when inter-feature correlations are strong and corruption is limited. However, we did not include deep learning-based imputers (GAIN [36], MIWAE [37]) because our focus is on interpretable preprocessing for small-to-medium tabular datasets, where data-hungry methods offer limited benefit and lack transparency.
The framework currently handles only numerical and continuous features; categorical or mixed-type data would require different membership function formulations. The real-world validation is limited to a single medical dataset (Pima Indians diabetes). Testing on additional datasets with naturally occurring errors from different domains would provide a stronger generalizability argument.
In summary, GDEDC is best suited for scenarios where noise levels are above 15%, sample sizes are too small to tolerate data loss, the downstream classifier is noise-sensitive (SVM, KNN, LR), or an auditable preprocessing pipeline is needed.
Noise-rate regime. GDEDC’s competitive advantage is statistically significant when natural or injected noise exceeds approximately 15%. Below this threshold, GDEDC is comparable to (but not significantly better than) the best correction-based baseline. We explored three additional real-world datasets with low natural noise rates (Air Quality, Mammographic Mass, German Credit; 1–14% natural missingness) and observed exactly this match-but-not-exceed behavior. The Pima Indians diabetes dataset used in this paper is a representative case in which the noise rate is well above the threshold (Insulin: 48.7% physiologically implausible zeros), which is why GDEDC’s advantage is statistically detectable there.
Practical decision rule. Use GDEDC when (i) the noise rate exceeds approximately 15% (or is unknown a priori), (ii) the downstream pipeline includes any non-tree classifier such as SVM, KNN, Logistic Regression, or a neural network, or (iii) the deployment context requires auditable preprocessing. Skip GDEDC when noise is below 5% and the model is exclusively a tree ensemble; under these conditions Raw or simple z-score normalization is sufficient and preserves the splitting variance that tree models exploit.
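The decision rule can be written as a small helper. This is a sketch under stated assumptions: the function name `recommend_gdedc`, the classifier labels, and the return strings are hypothetical. The rule as stated leaves some cases open (for example, 10% noise with a tree-only pipeline and no audit requirement), which the sketch reports as "unspecified", and it assumes that the three "use" conditions take precedence over the "skip" condition.

```python
from typing import Iterable, Optional

TREE_ENSEMBLES = {"RF", "XGBoost"}   # labels assumed for tree-based models


def recommend_gdedc(noise_rate: Optional[float],
                    classifiers: Iterable[str],
                    needs_audit: bool) -> str:
    """Return 'use', 'skip', or 'unspecified' following the stated rule.

    noise_rate is a fraction in [0, 1], or None when it is unknown a priori.
    """
    models = set(classifiers)
    has_non_tree = any(m not in TREE_ENSEMBLES for m in models)

    # (i) noise above ~15% or unknown, (ii) any non-tree classifier,
    # (iii) auditable preprocessing required.
    if noise_rate is None or noise_rate > 0.15 or has_non_tree or needs_audit:
        return "use"

    # Skip: noise below 5% and the pipeline is exclusively a tree ensemble.
    if noise_rate < 0.05 and models and models <= TREE_ENSEMBLES:
        return "skip"

    return "unspecified"   # the rule in the text does not cover this case


print(recommend_gdedc(None, ["RF"], needs_audit=False))
print(recommend_gdedc(0.03, ["RF", "XGBoost"], needs_audit=False))
print(recommend_gdedc(0.10, ["RF"], needs_audit=False))
```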
Practical application scenarios. GDEDC’s combination of graded correction and an auditable rule trace is particularly suited to three settings. First, clinical laboratory data with reporting errors, where deletion would discard rare-disease observations and an audit log is required for regulatory compliance; the Pima Insulin walk-through in Section 5.16.2 is one instance. Second, sensor-driven manufacturing pipelines where correlated calibration drifts affect multiple measurements from the same physical source; the correlated-noise experiment in Section 5.15 is motivated by this scenario. Third, financial transaction screening where suspicious values must be retained for audit but adjusted before downstream modeling; GDEDC’s natural-language rule trace satisfies this combined requirement directly.

6. Conclusions and Future Work

This paper introduced GDEDC, a three-stage Mamdani fuzzy pipeline that detects and corrects data errors proportionally to their estimated severity, retaining 100% of observations while producing an interpretable audit trail. We evaluated the framework on five UCI datasets and the Pima Indians diabetes dataset with five classifiers, six noise levels (5–30%), and 30 runs, comparing against five baselines including MICE.
(1) GDEDC improves accuracy over raw data by 0.7–2.3% at all noise levels (p < 0.001, d = 0.25–0.45).
(2) Under leakage-free conditions, deletion-based methods consistently underperform raw data. GDEDC outperforms Z-Score by +1.2–2.4% and IQR by +2.1–6.1%.
(3) GDEDC matches KNN Imputation and MICE at 5–15% noise, then surpasses both at ≥20% noise on noise-sensitive classifiers (best Friedman rank at 20–30%).
(4) GDEDC retains 100% of observations; IQR discards up to 39% at high noise (Table 6).
(5) Friedman tests confirm significant differences among methods at all noise levels (p < 0.001).
(6) Noise-sensitive classifiers (SVM, KNN, LR) gain +1.9–2.8%; tree-based ensembles show marginal change.
(7) Performance is stable across λ (0.2–1.0) and k (3–15), with runtime under 4.2 s for up to 10,992 instances.
(8) Ablation confirms that sigmoid-based proportional correction is the primary contributor (+2.02 pp), followed by RMS aggregation (+0.49 pp), Mamdani FIS (+0.38 pp), and FCS (+0.11 pp).
(9) On the Pima dataset with naturally occurring missing values, GDEDC achieves the highest accuracy among the preprocessing methods (75.2%, statistically indistinguishable from raw data at 75.4%, p = 0.270) and outperforms IQR by +2.97% (p < 0.001, d = 0.684).
The current limitations—restriction to numerical features, slight degradation for tree-based classifiers, and the advantage of MICE at low-to-moderate noise—suggest that GDEDC is most valuable when noise is high or uncertain, samples are too small to sacrifice, the classifier is noise-sensitive, or the preprocessing pipeline must be auditable.
Future work could move in several directions. The most immediate is extending the framework to categorical and mixed-type data, possibly through type-2 or intuitionistic fuzzy sets. Replacing the symmetric Gaussian membership functions with asymmetric alternatives (two-piece Gaussian, log-normal) would better handle skewed features; adaptive parameter optimization via gradient-based or evolutionary methods could further reduce the need for manual tuning. Adapting the pipeline to streaming data—with online anomaly scoring and incremental membership function updates—is another natural extension. On the evaluation side, testing on large-scale real-world datasets from industry and healthcare would provide a stronger generalizability argument. Finally, a systematic study of the rule base itself—particularly whether remapping Rule R7 (High AS, Consistent FCS → Suspicious) improves detection for certain dataset characteristics—would help refine the framework for domain-specific deployment.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml) (accessed on 10 January 2026). The source code for the GDEDC framework and all experiments is available at https://github.com/ahmettezcantekin/GDEDC-Framework (accessed on 10 May 2026).

Acknowledgments

The author acknowledges the use of publicly available datasets from the UCI Machine Learning Repository. During the preparation of this manuscript, no artificial intelligence tools were used for writing the scientific content, analysis, or interpretation of results. The author takes full responsibility for the content of the publication.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GDEDC | Graded Data Error Detection and Correction
FIS | Fuzzy Inference System
ANFIS | Adaptive Neuro-Fuzzy Inference System
RMS | Root-Mean-Square
FCS | Feature Consistency Score
AS | Aggregated Anomaly Score
ES | Error Severity
ML | Machine Learning
AI | Artificial Intelligence
RF | Random Forest
SVM | Support Vector Machine
KNN | K-Nearest Neighbors
LR | Logistic Regression
GAN | Generative Adversarial Network
MIWAE | Missing Data Importance-Weighted Autoencoder
IQR | Interquartile Range
MAD | Median Absolute Deviation
MCD | Minimum Covariance Determinant
MICE | Multiple Imputation by Chained Equations
MAR | Missing-At-Random
SVD | Singular Value Decomposition
TP | True Positive
TN | True Negative
FP | False Positive
FN | False Negative
F1 | F1 Score
UCI | University of California, Irvine

References

  1. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260.
  2. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  3. Gudivada, V.; Apon, A.; Ding, J. Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. Int. J. Adv. Softw. 2017, 10, 1–20.
  4. Jain, S.; Shukla, S.; Wadhvani, R. Dynamic selection of normalization techniques using data complexity measures. Expert Syst. Appl. 2018, 106, 252–262.
  5. Ilyas, I.F.; Chu, X. Data Cleaning; ACM Books: New York, NY, USA, 2019.
  6. Domingos, P. A few useful things to know about machine learning. Commun. ACM 2012, 55, 78–87.
  7. García, S.; Luengo, J.; Herrera, F. Data Preprocessing in Data Mining; Springer: Berlin/Heidelberg, Germany, 2015.
  8. Donders, A.R.T.; van der Heijden, G.J.M.G.; Stijnen, T.; Reitsma, J.B. Review: A gentle introduction to imputation of missing values. J. Clin. Epidemiol. 2006, 59, 1087–1091.
  9. Rousseeuw, P.J.; Hubert, M. Robust statistics for outlier detection. WIREs Data Min. Knowl. Discov. 2011, 1, 73–79.
  10. Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data, 3rd ed.; Wiley: Hoboken, NJ, USA, 2019.
  11. Schafer, J.L.; Graham, J.W. Missing data: Our view of the state of the art. Psychol. Methods 2002, 7, 147–177.
  12. Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353.
  13. Zimmermann, H.-J. Fuzzy Set Theory and Its Applications, 4th ed.; Springer: Dordrecht, The Netherlands, 2001.
  14. Tsoukalas, L.H. Fuzzy Logic: Applications in Artificial Intelligence, Big Data, and Machine Learning; McGraw Hill: New York, NY, USA, 2023.
  15. Mendel, J.M. Uncertain Rule-Based Fuzzy Systems: Introduction and New Directions, 2nd ed.; Springer: Cham, Switzerland, 2017.
  16. Pedrycz, W.; Gomide, F. Fuzzy Systems Engineering: Toward Human-Centric Computing; Wiley: Hoboken, NJ, USA, 2007.
  17. Klir, G.J.; Yuan, B. Fuzzy Sets and Fuzzy Logic: Theory and Applications; Prentice Hall: Upper Saddle River, NJ, USA, 1995.
  18. Nettleton, D.F.; Orriols-Puig, A.; Fornells, A. A study of the effect of different types of noise on the precision of supervised learning techniques. Artif. Intell. Rev. 2010, 33, 275–306.
  19. Mohammed, S.; Budach, L.; Feuerpfeil, M.; Ihde, N.; Nathansen, A.; Noack, N.; Patzlaff, H.; Naumann, F.; Harmouch, H. The effects of data quality on machine learning performance on tabular data. Inf. Syst. 2025, 132, 102549.
  20. Wanyonyi, E.N.; Masinde, N.W. The impact of data preprocessing on machine learning model performance: A comprehensive examination. Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol. 2025, 11, 3814–3827.
  21. Abedjan, Z.; Chu, X.; Deng, D.; Fernandez, R.C.; Ilyas, I.F.; Ouzzani, M.; Papotti, P.; Stonebraker, M.; Tang, N. Detecting data errors: Where are we and what needs to be done? Proc. VLDB Endow. 2016, 9, 993–1004.
  22. Frénay, B.; Kabán, A. A comprehensive introduction to label noise. In Proceedings of the ESANN, Bruges, Belgium, 23–25 April 2014; pp. 667–676.
  23. Rahm, E.; Do, H.H. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 2000, 23, 3–13.
  24. Zha, D.; Bhat, Z.P.; Lai, K.-H.; Yang, F.; Jiang, Z.; Zhong, S.; Hu, X. Data-centric artificial intelligence: A survey. ACM Comput. Surv. 2025, 57, 1–42.
  25. Frénay, B.; Verleysen, M. Classification in the presence of label noise: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 845–869.
  26. Ng, A. Data-centric AI competition. In Proceedings of the NeurIPS Data-Centric AI Workshop, Virtual, 14 December 2021.
  27. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann: Waltham, MA, USA, 2012.
  28. García, S.; Ramírez-Gallego, S.; Luengo, J.; Benítez, J.M.; Herrera, F. Big data preprocessing: Methods and prospects. Big Data Anal. 2016, 1, 9.
  29. Leys, C.; Ley, C.; Klein, O.; Bernard, P.; Licata, L. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. J. Exp. Soc. Psychol. 2013, 49, 764–766.
  30. Hodge, V.; Austin, J. A survey of outlier detection methodologies. Artif. Intell. Rev. 2004, 22, 85–126.
  31. Rousseeuw, P.J.; van Driessen, K. A fast algorithm for the minimum covariance determinant estimator. Technometrics 1999, 41, 212–223.
  32. Batista, G.E.A.P.A.; Monard, M.C. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell. 2003, 17, 519–533.
  33. Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17, 520–525.
  34. Van Buuren, S.; Groothuis-Oudshoorn, K. mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. 2011, 45, 67.
  35. Cai, J.-F.; Candès, E.J.; Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 2010, 20, 1956–1982.
  36. Yoon, J.; Jordon, J.; van der Schaar, M. GAIN: Missing data imputation using generative adversarial nets. In Proceedings of the ICML, Stockholm, Sweden, 10–15 July 2018; pp. 5689–5698.
  37. Mattei, P.-A.; Frellsen, J. MIWAE: Deep generative modelling and imputation of incomplete data sets. In Proceedings of the ICML, Long Beach, CA, USA, 9–15 June 2019; pp. 4413–4423.
  38. Hasan, M.F.; Sobhan, M.A. Describing fuzzy membership function and detecting the outlier by using five number summary of data. Am. J. Comput. Math. 2020, 10, 410–424.
  39. Naik, N.; Diao, R.; Shen, Q. Dynamic fuzzy rule interpolation and its application to intrusion detection. IEEE Trans. Fuzzy Syst. 2018, 26, 1878–1892.
  40. Amiri, M.; Jensen, R. Missing data imputation using fuzzy-rough methods. Neurocomputing 2016, 205, 152–164.
  41. Li, D.; Gu, H.; Zhang, L. A fuzzy c-means clustering algorithm based on nearest-neighbor intervals for incomplete data. Expert Syst. Appl. 2010, 37, 6942–6947.
  42. Jang, J.-S.R. ANFIS: Adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 1993, 23, 665–685.
  43. Manimurugan, S.; Majdi, A.; Mohmmed, M.A.; Narmatha, C.; Varatharajan, R. Intrusion detection in networks using crow search optimization algorithm with adaptive neuro-fuzzy inference system. Microprocess. Microsyst. 2020, 79, 103261.
  44. Rafique, Y.; Wu, J.; Muzaffar, A.W.; Rafique, B. An enhanced integrated fuzzy logic-based deep learning technique (EIFL-DL) for the recommendation system. PeerJ Comput. Sci. 2024, 10, e2529.
  45. Hazarika, B.B.; Gupta, D. Density-weighted twin SVM for binary class imbalance learning. Neural Process. Lett. 2022, 54, 1091–1130.
  46. Prasad, S.C.; Anagha, P.; Balasundaram, S. Robust Pinball Twin Bounded Support Vector Machine for Data Classification. Neural Process. Lett. 2023, 55, 1131–1153.
  47. Khushal, R.; Fatima, U. Fuzzy machine learning logic utilization on hormonal imbalance dataset. Comput. Biol. Med. 2024, 174, 108429.
  48. Saatchi, R. Fuzzy logic concepts, developments and implementation. Information 2024, 15, 656.
  49. Klement, E.P.; Mesiar, R.; Pap, E. Triangular Norms; Springer: Dordrecht, The Netherlands, 2000.
  50. Dubois, D.; Prade, H. New results about properties and semantics of fuzzy set-theoretic operators. In Fuzzy Sets: Theory and Applications to Policy Analysis and Information Systems; Springer: New York, NY, USA, 1980; pp. 59–75.
  51. Mamdani, E.H.; Assilian, S. An experiment in linguistic synthesis with a fuzzy logic controller. Int. J. Man-Mach. Stud. 1975, 7, 1–13.
  52. Ross, T.J. Fuzzy Logic with Engineering Applications, 4th ed.; Wiley: Hoboken, NJ, USA, 2016.
  53. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  54. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the ACM SIGKDD, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  55. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
  56. Cover, T.M.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
  57. Cox, D.R. The regression analysis of binary sequences. J. R. Stat. Soc. Ser. B 1958, 20, 215–232.
  58. Dua, D.; Graff, C. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. Available online: https://archive.ics.uci.edu/ml (accessed on 10 January 2026).
  59. Smith, J.W.; Everhart, J.E.; Dickson, W.C.; Knowler, W.C.; Johannes, R.S. Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care; American Medical Informatics Association: Washington, DC, USA, 1988; pp. 261–265.
  60. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
  61. Jarrahi, M.H.; Memariani, A.; Guha, S. The Principles of Data-Centric AI. Commun. ACM 2023, 66, 84–92.
Figure 1. Overall architecture of the GDEDC framework.
Figure 2. Membership functions for the Mamdani FIS: (a) anomaly score, (b) feature consistency score, (c) error severity.
Figure 3. Sigmoid-based correction weight (Equation (17)) as a function of combined error evidence ES*·AS_j.
Figure 4. Classification accuracy vs. noise level (5 datasets, 5 classifiers, 6 methods, 30 runs).
Table 1. Comparison of related work in data preprocessing with fuzzy logic.
Reference | Year | Fuzzy Method | Detection | Correction | ML Eval. | Proportional
Hasan & Sobhan [38] | 2020 | Membership Func. | Yes | No | Limited | No
Naik et al. [39] | 2018 | Fuzzy Rule Interp. | Partial | No | Yes | No
Amiri & Jensen [40] | 2016 | Fuzzy-Rough NN | No | Yes | Limited | No
Li et al. [41] | 2010 | Fuzzy C-Means | No | Yes | Yes | No
Manimurugan et al. [43] | 2020 | ANFIS | Yes | No | Limited | No
Rafique et al. [44] | 2024 | Fuzzy + DL | Partial | Partial | Yes | No
Khushal & Fatima [47] | 2024 | Fuzzy Transform | No | Partial | Yes | No
Proposed GDEDC | 2026 | Mamdani FIS | Yes | Yes | Yes (6 datasets, 5 baselines) | Yes
Table 2. Fuzzy rule base for error classification.
Rule | AS (Input 1) | FCS (Input 2) | ES (Output) | Rationale
R1 | Low | Consistent | Clean | Low deviation, mostly within normal range
R2 | Low | Partially Cons. | Clean | Low deviation overall, minor inconsistency
R3 | Low | Inconsistent | Suspicious | Low aggregate but scattered deviations
R4 | Medium | Consistent | Suspicious | Moderate deviation but concentrated
R5 | Medium | Partially Cons. | Suspicious | Moderate deviation, some spread
R6 | Medium | Inconsistent | Erroneous | Moderate deviation across many features
R7 | High | Consistent | Suspicious | High but focused deviation (possible outlier)
R8 | High | Partially Cons. | Erroneous | High deviation with spread
R9 | High | Inconsistent | Erroneous | High deviation across most features
Table 3. Summary of benchmark datasets.
Dataset | Instances | Features | Classes | Class Distribution
Iris | 150 | 4 | 3 | 50/50/50
Wine | 178 | 13 | 3 | 59/71/48
Breast Cancer | 569 | 30 | 2 | 212/357
Seeds | 210 | 7 | 3 | 70/70/70
Pendigits | 10,992 | 16 | 10 | ~1055–1144 per class
Pima Diabetes | 768 | 8 | 2 | 500/268
Table 4. Classification accuracy (%) at 10% noise (mean ± std, 30 runs).
Method | RF | SVM | KNN | LR | XGBoost
Raw (B1) | 93.0 ± 4.9 | 87.3 ± 5.6 | 86.9 ± 5.4 | 81.6 ± 9.0 | 92.5 ± 5.1
Z-Score (B2) | 92.5 ± 4.8 | 86.1 ± 5.0 | 86.0 ± 5.7 | 81.2 ± 8.8 | 92.2 ± 5.1
IQR (B3) | 92.1 ± 4.8 | 81.9 ± 6.2 | 84.1 ± 5.6 | 79.6 ± 8.2 | 91.4 ± 5.1
KNN Imp. (B4) | 93.0 ± 5.0 | 89.3 ± 5.7 | 88.4 ± 6.3 | 84.4 ± 9.8 | 92.8 ± 4.9
MICE (B5) | 93.1 ± 4.8 | 90.0 ± 5.5 | 89.4 ± 5.9 | 85.3 ± 9.1 | 92.9 ± 5.1
GDEDC | 92.4 ± 5.1 | 89.1 ± 5.2 | 88.7 ± 5.1 | 84.3 ± 8.5 | 92.1 ± 5.0
Table 5. Average accuracy (%) by noise level (mean ± std, 30 runs).
Method | 5% | 10% | 15% | 20% | 25% | 30%
Raw (B1) | 91.6 ± 5.7 | 88.3 ± 7.5 | 85.3 ± 8.6 | 82.1 ± 10.0 | 79.1 ± 11.3 | 75.9 ± 12.5
Z-Score (B2) | 91.1 ± 5.8 | 87.6 ± 7.4 | 84.6 ± 8.4 | 81.6 ± 9.9 | 78.8 ± 11.1 | 75.8 ± 12.4
IQR (B3) | 90.2 ± 6.2 | 85.8 ± 7.9 | 82.2 ± 9.2 | 78.8 ± 10.3 | 75.2 ± 11.2 | 72.1 ± 12.1
KNN Imp. (B4) | 92.5 ± 5.7 | 89.6 ± 7.3 | 86.8 ± 8.5 | 83.8 ± 9.9 | 80.5 ± 11.4 | 77.4 ± 12.6
MICE (B5) | 92.8 ± 5.5 | 90.1 ± 6.9 | 87.4 ± 8.0 | 84.4 ± 9.5 | 81.1 ± 10.9 | 77.7 ± 12.3
GDEDC | 92.3 ± 5.3 | 89.4 ± 6.6 | 86.7 ± 7.6 | 84.0 ± 8.7 | 81.1 ± 9.7 | 78.2 ± 10.9
Table 6. Data retention rate (%) by noise level (30 runs).
Method | 5% | 10% | 15% | 20% | 25% | 30%
Z-Score (B2) | 84.5 | 81.3 | 82.1 | 85.0 | 89.0 | 93.7
IQR (B3) | 76.2 | 68.1 | 63.8 | 61.4 | 60.9 | 61.6
KNN Imp. (B4) | 100 | 100 | 100 | 100 | 100 | 100
MICE (B5) | 100 | 100 | 100 | 100 | 100 | 100
GDEDC | 100 | 100 | 100 | 100 | 100 | 100
Table 7. Statistical significance: GDEDC vs. baselines at 10% noise (30 runs).
Comparison | Paired t-Test | Wilcoxon | Cohen’s d | Significant?
GDEDC vs. Raw | <0.001 | <0.001 | 0.336 | Yes
GDEDC vs. Z-Score | <0.001 | <0.001 | 0.474 | Yes
GDEDC vs. IQR | <0.001 | <0.001 | 0.637 | Yes
GDEDC vs. KNN Imp. | 0.060 | 0.003 | −0.069 | No
GDEDC vs. MICE | <0.001 | <0.001 | −0.229 | Yes (MICE better)
Table 8. Per-dataset significance: GDEDC vs. Raw at 10% noise (30 runs). Gain values are computed from paired differences before rounding; small discrepancies with column-wise subtraction are due to independent rounding.
Dataset | GDEDC% | Raw% | Gain | t-Test p | Cohen’s d | Sig?
Iris | 85.9 | 85.1 | +0.9 | 0.008 | 0.219 | Yes
Wine | 92.1 | 91.0 | +1.2 | <0.001 | 0.291 | Yes
Breast Cancer | 94.2 | 93.2 | +0.9 | <0.001 | 0.423 | Yes
Seeds | 86.4 | 84.9 | +1.6 | <0.001 | 0.423 | Yes
Pendigits | 88.1 | 87.1 | +1.0 | <0.001 | 0.561 | Yes
Table 9. GDEDC vs. baselines by noise level: gain (%), p-value, Cohen’s d.
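Comparison | Metric | 5% | 10% | 15% | 20% | 25% | 30%
GDEDC vs. Raw | Gain% | +0.70 | +1.10 | +1.47 | +1.91 | +2.07 | +2.32
GDEDC vs. Raw | p-value | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
GDEDC vs. Raw | Cohen d | 0.25 | 0.34 | 0.37 | 0.41 | 0.42 | 0.45
GDEDC vs. Z-Score | Gain% | +1.21 | +1.76 | +2.09 | +2.39 | +2.33 | +2.43
GDEDC vs. Z-Score | p-value | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
GDEDC vs. Z-Score | Cohen d | 0.39 | 0.47 | 0.51 | 0.48 | 0.47 | 0.45
GDEDC vs. IQR | Gain% | +2.12 | +3.53 | +4.55 | +5.23 | +5.92 | +6.14
GDEDC vs. IQR | p-value | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
GDEDC vs. IQR | Cohen d | 0.52 | 0.64 | 0.68 | 0.66 | 0.70 | 0.73
GDEDC vs. KNN Imp. | Gain% | −0.19 | −0.22 | −0.02 | +0.29 | +0.64 | +0.85
GDEDC vs. KNN Imp. | p-value | 0.059 | 0.060 | 0.902 | 0.050 * | <0.001 ** | <0.001 **
GDEDC vs. KNN Imp. | Cohen d | −0.07 | −0.07 | −0.00 | 0.07 | 0.14 | 0.19
GDEDC vs. MICE | Gain% | −0.50 | −0.79 | −0.66 | −0.38 | +0.08 | +0.49
GDEDC vs. MICE | p-value | <0.001 | <0.001 | <0.001 | 0.020 | 0.674 | 0.006 **
GDEDC vs. MICE | Cohen d | −0.17 | −0.23 | −0.17 | −0.08 | 0.02 | 0.10
* p < 0.05; ** p < 0.01.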
Table 10. Friedman average ranks by noise level (N = 750 per level).
Method | 5% | 10% | 15% | 20% | 25% | 30% | Combined
Raw (B1) | 2.94 | 3.11 | 3.12 | 3.17 | 3.13 | 3.15 | 3.10
Z-Score (B2) | 3.79 | 3.89 | 3.93 | 3.79 | 3.72 | 3.72 | 3.81
IQR (B3) | 4.47 | 4.73 | 4.80 | 4.73 | 4.80 | 4.84 | 4.73
KNN Imp. (B4) | 2.85 | 2.82 | 2.91 | 2.98 | 3.10 | 2.97 | 2.94
MICE (B5) | 3.11 | 2.81 | 2.83 | 2.92 | 3.02 | 3.15 | 2.97
GDEDC | 3.84 | 3.64 | 3.41 | 3.40 | 3.24 | 3.16 | 3.45
Table 11. Friedman ranks for noise-sensitive classifiers (SVM, KNN, LR).
Method | 5% | 10% | 15% | 20% | 25% | 30% | Combined
Raw (B1) | 3.51 | 3.68 | 3.66 | 3.74 | 3.60 | 3.62 | 3.63
Z-Score (B2) | 4.09 | 4.24 | 4.28 | 4.34 | 4.17 | 4.16 | 4.21
IQR (B3) | 4.79 | 5.08 | 5.12 | 5.00 | 5.10 | 5.10 | 5.03
KNN Imp. (B4) | 2.60 | 2.60 | 2.72 | 2.79 | 3.02 | 2.95 | 2.78
MICE (B5) | 2.72 | 2.42 | 2.48 | 2.57 | 2.74 | 2.91 | 2.64
GDEDC | 3.29 | 2.98 | 2.75 | 2.55 | 2.38 | 2.26 | 2.70
Table 12. Sensitivity to correction strength λ (k = 5, 10% noise, 30 runs).
λ | 0.2 | 0.4 | 0.6 | 0.8 | 1.0
Accuracy (%) | 88.7 | 89.1 | 89.2 | 89.4 | 89.2
Table 13. Sensitivity to number of neighbors k (λ = 0.8, 10% noise, 30 runs).
k | 3 | 5 | 7 | 10 | 15
Accuracy (%) | 89.3 | 89.4 | 89.4 | 89.4 | 89.5
Table 14. Error detection performance of GDEDC (10% noise level, 30 runs).
Metric | Iris | Wine | Breast Cancer | Seeds | Pendigits | Average
Detection Precision | 0.41 | 0.82 | 0.98 | 0.61 | 0.87 | 0.74
Detection Recall | 0.83 | 0.78 | 0.72 | 0.78 | 0.71 | 0.76
Detection F1 | 0.55 | 0.80 | 0.83 | 0.68 | 0.79 | 0.73
Table 15. Average preprocessing time per dataset (10 runs, milliseconds).
Method | Iris (150 × 4) | Wine (178 × 13) | BC (569 × 30) | Seeds (210 × 7) | Pendigits (10,992 × 16)
Raw (B1) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Z-Score (B2) | 0.03 | 0.04 | 0.09 | 0.03 | 1.59
IQR (B3) | 0.10 | 0.13 | 0.35 | 0.11 | 3.41
KNN Imp. (B4) | 2.13 | 9.26 | 9015.28 | 10.06 | 3153.35
MICE (B5) | 12.00 | 28.74 | 890.76 | 22.83 | 427.22
GDEDC | 35.37 | 44.94 | 210.93 | 47.09 | 3085.23
Table 16. GDEDC accuracy gain over Raw by classifier type (10% noise, 30 runs, five datasets averaged).
Classifier | Type | Raw Avg (%) | GDEDC Avg (%) | Gain (%)
RF | Tree ensemble | 93.0 | 92.4 | −0.6
XGBoost | Tree ensemble | 92.5 | 92.1 | −0.4
SVM | Distance-based | 87.3 | 89.1 | +1.9
KNN | Distance-based | 86.9 | 88.7 | +1.9
LR | Coefficient-based | 81.6 | 84.3 | +2.8
Table 17. Ablation study: accuracy (%) by variant and noise level (30 runs).
Variant | Ablated Component | 10% | 20% | 30% | Avg Drop
GDEDC-Full | None (control) | 89.36 | 84.04 | 78.21 | n/a
A1-MeanAgg | RMS → weighted mean | 88.90 | 83.56 | 77.68 | −0.49
A2-BinaryCorr | Sigmoid → binary replace | 87.24 | 81.84 | 76.47 | −2.02
A3-Threshold | Mamdani FIS → AS threshold | 89.00 | 83.68 | 77.79 | −0.38
A4-NoFCS | FCS removed (AS-only FIS) | 89.08 | 83.94 | 78.27 | −0.11
Table 18. Per-component contribution (percentage points over ablated variant).
Component | 10% | 20% | 30% | Average
Sigmoid proportional correction (vs. binary) | +2.12 | +2.20 | +1.74 | +2.02
RMS aggregation (vs. weighted mean) | +0.46 | +0.48 | +0.53 | +0.49
Mamdani FIS (vs. simple threshold) | +0.36 | +0.36 | +0.42 | +0.38
Feature Consistency Score (vs. AS-only) | +0.28 | +0.10 | −0.06 | +0.11
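The per-component contributions in Table 18 follow from Table 17 by subtracting each ablated variant's accuracy from the full pipeline's accuracy at the same noise level and averaging the three differences. The check below is a short sketch of that arithmetic; the accuracy values are copied from Table 17, while the variable and function names are illustrative.

```python
# Accuracy (%) from Table 17 at 10%, 20%, and 30% noise.
FULL = (89.36, 84.04, 78.21)
ABLATED = {
    "Sigmoid proportional correction": (87.24, 81.84, 76.47),  # A2-BinaryCorr
    "RMS aggregation": (88.90, 83.56, 77.68),                  # A1-MeanAgg
    "Mamdani FIS": (89.00, 83.68, 77.79),                      # A3-Threshold
    "Feature Consistency Score": (89.08, 83.94, 78.27),        # A4-NoFCS
}


def contribution(ablated):
    """Per-noise-level gains of the full pipeline and their average (pp)."""
    gains = [round(full - abl, 2) for full, abl in zip(FULL, ablated)]
    return gains, round(sum(gains) / len(gains), 2)


results = {name: contribution(values) for name, values in ABLATED.items()}
for name, (gains, average) in results.items():
    print(f"{name}: gains = {gains}, average = {average:+.2f} pp")
```

The averages come out as +2.02, +0.49, +0.38, and +0.11 percentage points, matching the last column of Table 18.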
Table 19. Statistical significance: GDEDC vs. baselines on Pima diabetes (30 runs).
Comparison | Gain (%) | p-Value | Cohen’s d | Significant?
GDEDC vs. Raw | −0.19 | 0.270 | −0.090 | No
GDEDC vs. Z-Score | +0.29 | 0.129 | +0.125 | No
GDEDC vs. IQR | +2.97 | <0.001 | +0.684 | Yes
GDEDC vs. KNN Imp. | +0.45 | 0.032 | +0.177 | Yes
GDEDC vs. MICE | +0.43 | 0.052 | +0.160 | No (borderline)
Table 20. Real-world validation: Pima Indians diabetes (accuracy %, mean ± std, 30 runs).
Method | RF | SVM | KNN | LR | XGBoost | Average
Raw (B1) | 76.6 ± 2.9 | 76.1 ± 2.6 | 73.9 ± 3.6 | 76.5 ± 2.4 | 73.9 ± 2.8 | 75.4
Z-Score (B2) | 76.3 ± 2.9 | 75.1 ± 2.3 | 72.9 ± 3.0 | 76.4 ± 2.6 | 73.9 ± 2.3 | 74.9
IQR (B3) | 75.4 ± 3.3 | 70.5 ± 2.1 | 70.1 ± 3.9 | 72.4 ± 6.0 | 72.9 ± 3.1 | 72.2
KNN Imp. (B4) | 76.5 ± 2.8 | 75.6 ± 2.4 | 72.4 ± 2.2 | 76.2 ± 2.7 | 73.1 ± 2.9 | 74.8
MICE (B5) | 75.8 ± 2.4 | 75.9 ± 2.4 | 72.6 ± 3.2 | 76.3 ± 2.6 | 73.4 ± 3.0 | 74.8
GDEDC | 76.4 ± 2.7 | 76.0 ± 2.8 | 73.9 ± 3.5 | 76.6 ± 2.4 | 73.2 ± 2.8 | 75.2
Table 21. Pima Indians diabetes—extended metrics (mean across 5 classifiers, 30 runs).
Method | Accuracy | Precision | Recall | F1 | AUC
Raw | 75.41 | 68.00 | 58.25 | 62.28 | 81.29
Z-Score | 74.93 | 67.02 | 57.86 | 61.68 | 80.77
IQR | 72.25 | 63.31 | 53.22 | 56.27 | 77.71
KNN Imp. | 74.77 | 67.07 | 56.57 | 61.01 | 80.85
MICE | 74.86 | 67.31 | 56.44 | 61.04 | 80.47
GDEDC | 75.22 | 67.49 | 58.18 | 62.10 | 81.32
Table 22. Same-detected-cells comparison: accuracy (%) on four UCI datasets, 30 runs.

| Noise | IQR + Mean | IQR + KNN | IQR + MICE | IQR + GDEDC |
|---|---|---|---|---|
| 10% | 90.03 | 89.58 | 90.15 | 89.14 |
| 20% | 84.27 | 83.75 | 84.42 | 83.34 |
| 30% | 77.74 | 77.36 | 77.72 | 77.03 |
Table 23. Correlated noise: accuracy (%) on four UCI datasets, 30 runs.

| Noise | Raw | Z-Score | IQR | KNN Imp. | MICE | GDEDC |
|---|---|---|---|---|---|---|
| 10% | 88.85 | 88.19 | 87.14 | 89.86 | 89.88 | 89.70 |
| 20% | 83.84 | 83.36 | 81.08 | 85.13 | 85.10 | 85.18 |
Table 24. Systematic bias (row-subset, ±15% magnitude): accuracy (%), 30 runs.

| Noise | Raw | Z-Score | IQR | KNN Imp. | MICE | GDEDC |
|---|---|---|---|---|---|---|
| 10% | 95.18 | 95.10 | 94.74 | 95.11 | 95.02 | 95.09 |
| 20% | 94.70 | 94.51 | 94.14 | 94.57 | 94.55 | 94.56 |
Table 25. Aggressive systematic bias (row-subset, ±30% magnitude): accuracy (%), 30 runs.

| Noise | Raw | Z-Score | IQR | KNN Imp. | MICE | GDEDC |
|---|---|---|---|---|---|---|
| 10% | 94.34 | 94.09 | 93.72 | 94.14 | 94.04 | 94.20 |
| 20% | 93.12 | 92.83 | 92.33 | 93.04 | 92.76 | 93.16 |
