Article

Fuzzy Rule-Based Explanations for Tabular Black-Box Classifiers: A Comprehensive Empirical Framework with Prediction-Boundary-Aware Partitioning and Rule-Level Uncertainty Indication

by
Ahmet Tezcan Tekin
Department of Management Engineering, Istanbul Technical University, 34367 Istanbul, Türkiye
Appl. Sci. 2026, 16(12), 5896; https://doi.org/10.3390/app16125896
Submission received: 19 May 2026 / Revised: 5 June 2026 / Accepted: 8 June 2026 / Published: 11 June 2026
(This article belongs to the Collection The Development and Application of Fuzzy Logic)

Abstract

Existing post hoc XAI (Explainable Artificial Intelligence) methods produce numerical attributions without symbolic structure (SHAP, LIME), low-coverage local rules (Anchors), or crisp tree surrogates without an interpretable rule-level uncertainty proxy. We present a fuzzy rule-based explanation framework for tabular black-box classifiers that extracts global IF–THEN rules with linguistic labels. Validation on a 13-dataset benchmark with four model families, using Wilcoxon, Friedman, and TOST equivalence tests, supports two main findings: (i) prediction-boundary-aware fuzzy partitioning raises mean fidelity from a vanilla Wang–Mendel baseline of 0.736 to 0.893 (+10.4 pp excluding the Breast Cancer outlier; +15.7 pp aggregate, both transparently reported); (ii) fired-rule consequent entropy provides a zero-cost rule-level uncertainty proxy (Spearman ρ = 0.420 with model prediction entropy, significant on 11/12 datasets; moderate by Cohen’s convention, with a 4/12 weak-correlation tail; complementary to probability-entropy and margin baselines). Fidelity is statistically equivalent to tree surrogates on classification (TOST p = 0.002, δ = 0.05) at ≈100% coverage. SHAP/LIME are excluded from the formal stability ranking because the perturbation metric measures the wrapped black-box rather than the attribution vector; cross-explainer comparison is reported in grouped form (full-coverage surrogates vs. local-coverage methods). On continuous regression (California Housing fidelity 0.422 vs. TreeSurrogate 0.840) and XOR-type multi-feature interactions, the framework is structurally weaker; a planned TSK extension is intended to address this limitation.

1. Introduction

Machine learning models are increasingly deployed in high-stakes domains such as healthcare [1], criminal justice [2], finance [3], and autonomous systems [4], where understanding why a model reaches a particular decision is as important as the decision itself. Black-box models—random forests, gradient-boosted trees, deep neural networks, support vector machines—often achieve the best predictive accuracy, but their opaque reasoning hinders trust, accountability, and regulatory compliance [5]. The European Union’s General Data Protection Regulation (GDPR) explicitly addresses the debated “right to explanation” (the literal scope of GDPR Article 22 is contested in the legal literature [6,7]), adding a legal dimension to the interpretability requirement [6].
Post hoc explanation methods attempt to fill this gap. SHAP [8] and LIME [9] assign numerical importance scores to input features but produce no symbolic structure—a practitioner receives a ranked feature list, not actionable decision rules. Anchors [10] generate local IF–THEN rules with precision guarantees, yet their coverage is narrow: in our experiments, Anchors covers only 22.8% of test samples (Section 5.4) on average, limiting its use as a global explainer. Decision-tree surrogates [11,12] provide global crisp rules with high fidelity, but their hard threshold splits (e.g., “age ≤ 42.5”) carry no uncertainty information and can mislead practitioners near decision boundaries where the model itself is uncertain.
Fuzzy rule-based systems [13,14] address several of these shortcomings. Linguistic labels (“Low”, “Medium”, “High”) combined with soft membership functions produce explanations that domain experts naturally consume, while rule firing distributions provide a rule-level uncertainty signal. The Wang–Mendel framework [15] in particular extracts IF–THEN rules directly from data without requiring gradient training. However, the empirical fuzzy XAI literature has not yet comprehensively benchmarked fuzzy rule extraction against the modern post hoc explanation suite (SHAP, LIME, Anchors, TreeSurrogate); our work closes this gap. We also report an explicit negative finding (Section 5.13 and Section 6.3): on continuous regression (California Housing), FuzzyRules’s fidelity drops to 0.422 vs. TreeSurrogate’s 0.840, reflecting a structural limitation of grid-based fuzzy systems on continuous targets.

Contributions

This paper makes the following contributions:
C1. Prediction-boundary-aware fuzzy partitioning. We propose aligning fuzzy membership function centers with the black-box model’s decision boundaries by fitting a one-dimensional decision tree (CART, Classification and Regression Trees [16]; the supervised-discretization spirit follows Fayyad & Irani [17]) per feature on model predictions. Combined with the framework’s other components (top-k feature selection, adaptive K, product t-norm, antecedent pruning), this data-driven refinement raises average ablation-set fidelity from 0.736 (vanilla 1992 Wang–Mendel baseline) to 0.893 (+15.7 pp; see Section 5.8). The boundary-aware refinement alone contributes +2.5 pp over the strongest alternative partitioning strategy (percentile-based; see Section 5.8), without sacrificing interpretability or increasing rule complexity. Unlike uniform or equal-frequency partitioning, our approach ensures that fuzzy set boundaries coincide with regions where the model’s behavior changes.
C2. Rule entropy as a rule-level uncertainty indicator. We demonstrate that the Shannon entropy of fired fuzzy rule consequent distributions (weighted by firing strength) naturally tracks the black-box model’s prediction uncertainty. Across 12 classification datasets, the Spearman rank correlation between model prediction entropy and rule entropy averages ρ = 0.420 and is statistically significant (p < 0.05) in 11 of 12 datasets. This property is emergent—it requires no explicit uncertainty calibration—and is not naturally available in crisp surrogates, where each sample follows exactly one deterministic path.
C3. Comprehensive empirical benchmark with statistical rigor. We conduct a systematic comparison (the most comprehensive to our knowledge) of fuzzy rule extraction against all four major modern XAI baselines—filling a gap in the literature where prior studies compare against at most one baseline on limited datasets (see Section 2.5). Our benchmark covers 5 explanation methods × 13 datasets × 4 model families × 3-fold cross-validation, evaluated on 6 complementary metrics (fidelity, coverage, stability, comprehensibility, computational cost, cross-model consistency) with formal statistical testing (Wilcoxon signed-rank test, Friedman test with Bonferroni correction).
C4. Practical XAI selection guidelines. Based on our empirical results, we provide a decision framework that recommends the most suitable explanation method for different use cases, considering the coverage–fidelity–interpretability trade-off.
The remainder of this paper is organized as follows: Section 2 reviews related work in post hoc explanations and fuzzy rule systems. Section 3 provides the mathematical foundations of fuzzy set theory and information-theoretic measures. Section 4 describes the proposed framework in detail. Section 5 presents the experimental setup and results. Section 6 discusses findings, practical implications, and limitations. Section 7 concludes with future research directions.

2. Related Work

The XAI literature reflects a persistent tension between model performance and interpretability. We organize relevant work under four themes: feature attribution methods, rule-based explanations, fuzzy systems for explainability, and uncertainty proxies in XAI.

2.1. Feature Attribution Methods

Attribution methods quantify the contribution of each input feature to a given prediction.
SHAP. Lundberg and Lee [8] unified several existing attribution methods under the framework of Shapley values from cooperative game theory. SHAP provides theoretically grounded, locally faithful attributions satisfying desirable axioms (local accuracy, missingness, consistency). KernelSHAP approximates Shapley values through weighted linear regression, while TreeSHAP [18] offers exact computation for tree-based models. Despite their theoretical appeal, SHAP explanations are per-instance numerical rankings that lack symbolic structure—practitioners cannot directly derive actionable rules from them.
LIME. Ribeiro et al. [9] proposed LIME, which explains individual predictions by fitting a local linear model on perturbed instances weighted by their proximity to the query point. LIME is model-agnostic and computationally efficient but has been shown to exhibit instability: repeated applications to the same instance can yield different explanations due to the stochastic perturbation sampling [19,20].
Gradient- and attention-based variants (Integrated Gradients [21], DeepLIFT [22], attention weights [23], the latter also criticized for lacking faithfulness [24]) share the same core limitation of attribution methods: per-instance numerical scores without symbolic decision rules.

2.2. Rule-Based Explanation Methods

Rule-based methods produce symbolic IF–THEN rules that mimic the black-box model.
Decision tree surrogates. Craven and Shavlik [11] pioneered the extraction of tree-structured representations from trained neural networks. Bastani et al. [12] formalized model extraction as an optimization problem, training a decision tree to mimic the black-box model’s predictions. Decision tree surrogates achieve high fidelity and provide a complete global explanation, but their crisp threshold splits (e.g., “x3 ≤ 0.53”) lack linguistic meaning and create artificial discontinuities at decision boundaries.
Anchors. Ribeiro et al. [10] proposed Anchors, which uses a multi-armed bandit procedure to find local IF–THEN rules with user-specified precision guarantees. Anchors provide high-precision local explanations but cover only a small neighborhood around each instance. This limited coverage makes Anchors unsuitable as a standalone global explanation method [25].
Complementary rule-based approaches include LORE [26] (local trees with counterfactuals), RuleFit [27] (rule-ensemble regression), and Bayesian Rule Lists [28] (probabilistic rule sets); none employ fuzzy antecedents, so they cannot express gradual transitions with linguistic terms.

2.3. Fuzzy Systems for Explainability

Fuzzy rule-based systems have a long history in control [14], classification [29], and knowledge representation [13]. Their use for model explanation, however, has received comparatively limited attention in the modern XAI literature. For broader surveys of interpretability trade-offs in fuzzy modeling, see Casillas et al. [30] and Guillaume [31].
Fuzzy rule generation. Wang and Mendel [15] proposed a grid-based method for generating fuzzy IF–THEN rules from input–output data pairs by partitioning each feature into fuzzy sets and extracting the dominant rule for each cell. Ishibuchi et al. [29,32] extended this to classification and proposed evolutionary optimization for rule selection. Herrera [33] provided a comprehensive taxonomy of genetic fuzzy systems, categorizing approaches by their learning components (rule base, membership functions, or both). Alcalá-Fdez et al. [34] proposed iterative rule learning with evolutionary optimization, achieving interpretable fuzzy rule bases for classification.
Fuzzy systems as model explainers. Setiono and Leow [35] used fuzzy rules to explain pruned neural networks, demonstrating that fuzzy representations can capture nonlinear relationships with linguistic labels. Huysmans et al. [36] empirically evaluated the comprehensibility of decision tables, trees, and rule-based models through user studies, finding that rule-based representations were preferred by non-expert users. Pancho et al. [37] proposed Fingrams, a visualization technique for fuzzy rule-base inference structure that aids comprehensibility analysis. More recently, Loyola-González et al. [38] explored contrast pattern-based approaches for classification, demonstrating the potential of symbolic pattern representations as an alternative to opaque models.
However, most existing studies evaluate on fewer than five datasets and do not compare against modern baselines such as SHAP, LIME, or Anchors. This paper addresses this gap.
Recent activity in fuzzy XAI (2024–2025). Mendel [39] re-examines explainable rule-based fuzzy systems with an emphasis on Type-2 fuzzy sets as the natural carrier of explainable uncertainty. Pickering, Cohen and De Baets [40] perform a narrative review of fuzzy rule-based models against modern interpretable-ML criteria and identify the lack of comparison against contemporary XAI baselines as the central reason fuzzy methods are absent from mainstream interpretability discourse; our 13-dataset benchmark against SHAP, LIME, Anchors, and TreeSurrogate is designed to close this gap. Pedrycz [41] frames the broader program as granular computing for cognitive trustworthiness, and Alateeq and Pedrycz [42] survey logic-oriented fuzzy neural networks; two neuro-fuzzy variants (NFS, a neuro-fuzzy system, and WMNF, Wang–Mendel with neural fine-tuning) are benchmarked in Section 5.12.

2.4. Uncertainty Proxies in Explanations

Uncertainty in model predictions is a critical concern in high-stakes applications [1]. Several approaches have been proposed: Monte Carlo Dropout [43,44] estimates epistemic uncertainty through stochastic forward passes, conformal prediction [45] provides distribution-free prediction intervals, and Bayesian neural networks [46] maintain weight distributions. However, these approaches address model-level uncertainty, not explanation-level uncertainty.
In the XAI context, Slack et al. [47] demonstrated that SHAP and LIME explanations can be manipulated by adversarial models, raising concerns about their reliability. Zhang et al. [48] proposed metrics for evaluating explanation stability. Mendel [39] develops the broader program of explainable uncertain rule-based fuzzy systems, with substantial Type-2/Footprint-of-Uncertainty content; we instead demonstrate that emergent rule-level uncertainty indicators arise from the Type-1 voting distribution itself—to our knowledge, an unaddressed direction in the fuzzy XAI literature, complementary rather than overlapping with [39] (Type-1 voting-based vs. Type-2 antecedent-based uncertainty). Our work shows that rule entropy—the disagreement among fired fuzzy rules—naturally tracks model prediction uncertainty without additional computation or calibration, a property unique to soft voting systems.

2.5. Research Gap and Positioning

Table 1 positions the proposed framework against the dominant explanation methods on the qualitative dimensions we evaluate empirically in Section 5. Three gaps motivate the present work: (1) fuzzy rule extraction has not been compared against modern XAI baselines with statistical rigor; (2) rule-based explanations offer no uncertainty proxy; and (3) prediction-boundary-aware partitioning has not been explored to improve fuzzy rule fidelity.
To further highlight the research gap, Table 2 compares existing empirical studies that evaluate fuzzy rule-based explanations against modern XAI baselines. While several surveys [49,50] acknowledge the disconnect between fuzzy rule-based approaches and contemporary post hoc explanation methods, none provides a comprehensive multi-method, multi-dataset empirical comparison. Zhu et al. [51] proposed a fuzzy local surrogate but compared only against LIME on three datasets. Ouifak and Idri [52] surveyed fuzzy approaches for XAI but did not include original experiments. Our work is the first to provide a systematic comparison of fuzzy rule extraction against all four major XAI baselines across 13 datasets with standardized metrics and statistical testing.

3. Preliminaries

This section introduces the mathematical foundations underlying our framework: fuzzy set theory, membership functions, fuzzy IF–THEN rules, and information-theoretic measures.

3.1. Fuzzy Sets and Membership Functions

A fuzzy set Ā on a universe of discourse U is defined by a membership function μ_Ā: U → [0, 1] that assigns each element x ∈ U a degree of membership [13]. Unlike classical (crisp) sets where μ ∈ {0, 1}, fuzzy sets allow gradual membership, enabling smooth transitions between categories.
Definition 1 (Triangular Membership Function). 
A triangular membership function is defined by three parameters (a, b, c) with a ≤ b ≤ c:
$$\mu(x;\, a, b, c) = \max\!\left(0,\ \min\!\left(\frac{x - a}{b - a},\ \frac{c - x}{c - b}\right)\right) \tag{1}$$
where b is the center (peak), and a, c define the support boundaries. Triangular membership functions are adopted throughout this work for interpretability and computational simplicity; Gaussian and trapezoidal alternatives could be substituted without algorithmic changes. For a dataset with D features, we define K linguistic labels for each feature j∈ {1, …, D}, such as {Low, Medium, High} for K = 3. The membership function for feature j and fuzzy set k is denoted μj,k(·).
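As an illustration, Definition 1 can be sketched in a few lines of Python. This is a minimal sketch rather than the released implementation: the function name `triangular` is a hypothetical choice, and the explicit handling of the peak (to avoid division by zero when a = b or b = c) is an assumption that the definition itself does not spell out.

```python
def triangular(x, a, b, c):
    """Triangular membership of Equation (1): feet a and c, peak b (a <= b <= c)."""
    if x == b:
        return 1.0  # peak; assumption: also covers degenerate feet a == b or b == c
    if x <= a or x >= c:
        return 0.0  # outside the support
    if x < b:
        return (x - a) / (b - a)  # rising edge, here a < x < b
    return (c - x) / (c - b)      # falling edge, here b < x < c


# Hypothetical "Medium" label on a feature scaled to [0, 10]
medium = (0.0, 5.0, 10.0)
print(triangular(2.5, *medium))  # halfway up the rising edge
print(triangular(5.0, *medium))  # at the peak
```

For a < b < c the branches reproduce the max/min expression of Equation (1) exactly; only the degenerate cases rely on the stated assumption.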

3.2. Fuzzy IF–THEN Rules

A fuzzy IF–THEN rule R_r takes the form:
$$R_r:\ \text{IF } x_1 \text{ IS } A_1 \text{ AND } \cdots \text{ AND } x_d \text{ IS } A_d \text{ THEN } y = c_r \tag{2}$$
where A_j denotes the fuzzy set assigned to feature j in rule r, and c_r is the consequent (predicted class for classification, predicted value for regression).
The firing strength (or activation degree) of rule R_r for input x = (x_1, …, x_d) is computed using a t-norm operator. We use the product t-norm:
$$\mu_{R_r}(x) = \prod_{j=1}^{d} \mu_{j,s_j}(x_j) \tag{3}$$
where s_j indexes the fuzzy set A_j that rule r assigns to feature j.

3.3. T-Norms and Their Properties

A t-norm (triangular norm) is a binary operation T: [0, 1]^2 → [0, 1] satisfying commutativity, associativity, monotonicity, and the boundary condition T(a, 1) = a [53]. Common t-norms include:
$$T_{\mathrm{prod}}(a_1, a_2, \ldots, a_d) = \prod_{j=1}^{d} a_j \tag{4}$$
The product t-norm is used in our framework because it provides more discriminative firing strengths than the minimum t-norm, as it penalizes rules where any antecedent has low membership.
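The difference between the two t-norms can be seen on a small numerical example. The sketch below is illustrative only; the function names `t_prod` and `t_min` and the membership values are hypothetical.

```python
from math import prod


def t_prod(memberships):
    """Product t-norm of Equation (4): multiply all antecedent memberships."""
    return prod(memberships)


def t_min(memberships):
    """Minimum t-norm: the weakest antecedent alone decides the firing strength."""
    return min(memberships)


# Two hypothetical rules evaluated at the same input
rule_a = [0.9, 0.9, 0.5]  # two strong antecedents, one weak
rule_b = [0.5, 0.5, 0.5]  # uniformly weak antecedents

print(t_min(rule_a), t_min(rule_b))    # the minimum t-norm cannot separate the rules
print(t_prod(rule_a), t_prod(rule_b))  # the product t-norm ranks rule_a above rule_b
```

Under the minimum t-norm both rules fire with the same strength, whereas the product t-norm reflects every antecedent, which is the sense in which it is more discriminative.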

3.4. Information-Theoretic Measures

The Shannon entropy of a discrete random variable Y with probability mass function p(y) over classes C is:
$$H(Y) = -\sum_{c \in C} p(c)\,\log_2 p(c) \tag{5}$$
The mutual information between two random variables X and Y is:
$$I(X; Y) = H(Y) - H(Y \mid X) \tag{6}$$
In the fuzzy context, membership degrees replace crisp bin assignments. The fuzzy (soft-binning) mutual information used for top-k feature ranking in Section 4.3 is the soft-binning analog of Equation (6):
$$\tilde{I}(X_j; Y) = \sum_{k,c} \tilde{p}(s_{j,k}, c)\,\log_2 \frac{\tilde{p}(s_{j,k}, c)}{\tilde{p}(s_{j,k})\,p(c)}$$
where p̃(s_{j,k}, c) = (1/N) Σ_i μ_{A_{j,k}}(x_{ij}) · 1[y_i = c] is the soft-bin joint frequency.
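The soft-binning mutual information can be sketched for a single feature as follows. This is a minimal sketch under stated assumptions, not the released code: the function name `soft_mutual_information` is hypothetical, the marginal p̃(s_{j,k}) is taken to be the sum of the soft joint over classes (the text does not define it explicitly), and each sample's memberships are assumed to sum to 1 so that the soft joint is a proper distribution.

```python
import math


def soft_mutual_information(memberships, labels):
    """Soft-binning mutual information (in bits) between one feature and the labels.

    memberships[i][k] is the membership of sample i in fuzzy set k of the feature;
    labels[i] is the class of sample i.
    """
    n = len(labels)
    classes = sorted(set(labels))
    n_sets = len(memberships[0])

    # Soft joint frequency: each sample spreads its unit mass over the fuzzy sets
    joint = {(k, c): 0.0 for k in range(n_sets) for c in classes}
    for mu_i, y in zip(memberships, labels):
        for k, m in enumerate(mu_i):
            joint[(k, y)] += m / n

    # Assumption: the fuzzy-set marginal is the joint summed over classes
    p_set = [sum(joint[(k, c)] for c in classes) for k in range(n_sets)]
    p_cls = {c: labels.count(c) / n for c in classes}

    mi = 0.0
    for (k, c), p in joint.items():
        if p > 0.0:  # zero-mass cells contribute nothing
            mi += p * math.log2(p / (p_set[k] * p_cls[c]))
    return mi


labels = [0, 0, 1, 1]
informative = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5]] * 4
print(soft_mutual_information(informative, labels))
print(soft_mutual_information(uninformative, labels))
```

In the first toy case the fuzzy sets separate the two classes perfectly and the score equals the label entropy of one bit; in the second every sample has the same memberships and the score is zero.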
Notation note (carried into Section 4). In Equations (12) and (15), the summation index uses a constraint subscript (rendered compactly beneath the summation operator in equation editors); read it as the classical Σ_{i: y_i = c} or Σ_{r ∈ Φ(x): c_r = c} as appropriate. Supplementary Algorithm S1, lines 13 and 15, uses the explicit set-builder form, and the same convention is used in Equation (17) of Section 4.6, with the firing-set Φ(x) substitution made explicit there. The rule cap of Equation (14), M = max(50, K^{min(k,6)}), limits the ranked candidate-rule pool (post-step-6 filter, pre-step-8 pruning) to K^6 = 729 cells (for the default K = 3) when k > 6; this is an engineering choice intended to stay near Miller’s 7 ± 2 chunk limit [54] when k = 7. The floor of 50 ensures a minimum operational rule base on small datasets where K^{min(k,6)} drops below 50. It is empirically inert on our 13-dataset benchmark, since K^{min(k,6)} ≥ 50 holds throughout, but is documented for reproducibility on smaller external datasets. The cap applies only to the retained rule count; the asymptotic-consistency condition in Conjecture 1 concerns the full K^k grid, which the framework still enumerates for support computation. Equation (9) yields k = 7 on six datasets in our benchmark: Spambase has D = 57 (so the cap binds via D); Adult Income, Bank Marketing, Default Credit, and Magic Gamma are pinned by N alone (⌊log_3 N⌋ = 7); and for Compas (D = 11, N = 6172) the upper bound and the log-floor coincide (⌊log_3 N⌋ = 7 = cap).

4. Proposed Framework

Our framework extracts a set of fuzzy IF–THEN rules from a trained black-box model, providing linguistically interpretable global explanations with a rule-level uncertainty proxy. The framework consists of four stages: (1) fuzzy partitioning with prediction-boundary-aware refinement, (2) feature selection, (3) rule extraction via a modified Wang–Mendel procedure, and (4) rule entropy computation for uncertainty signaling. Figure 1 illustrates the overall pipeline.

4.1. Fuzzy Partitioning

Given training data X with N samples and D features, and K linguistic labels per feature, we define triangular membership functions for each feature j and fuzzy set k. Initial centers are placed at equally spaced percentiles of each feature’s distribution:
$$c_{j,k} = Q_j\!\left(\frac{k - 0.5}{K}\right), \quad k = 1, \ldots, K \tag{7}$$
where Q_j(·) denotes the quantile function of feature j. This percentile-based initialization ensures that each fuzzy set covers approximately the same number of training samples, providing a balanced partition regardless of the feature’s distribution. Boundary fuzzy sets (k = 1 and k = K) use left- and right-shoulder MFs, respectively (the leftmost MF saturates to 1 below its center, the rightmost to 1 above its center). Interior MFs (k ∈ {2, …, K − 1}) are standard symmetric triangles whose feet coincide with the adjacent MF centers: in the (a, b, c) notation of Equation (1), the left foot is a = c_{j,k−1}, the peak is b = c_{j,k}, and the right foot is c = c_{j,k+1}.
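The percentile initialization and the resulting shoulder/triangle partition can be sketched as follows. This is a minimal sketch, not the released implementation: the function names are hypothetical, the quantile uses linear interpolation (the text does not state which quantile estimator is used), and the centers are assumed to be strictly increasing.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of a pre-sorted list, 0 <= q <= 1 (assumed estimator)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def percentile_centers(values, K):
    """Equation (7): center k sits at the (k - 0.5)/K quantile, k = 1..K."""
    s = sorted(values)
    return [quantile(s, (k - 0.5) / K) for k in range(1, K + 1)]


def partition_memberships(x, centers):
    """Memberships of x in the K fuzzy sets: shoulders at both ends, triangles inside."""
    K = len(centers)
    mu = [0.0] * K
    if x <= centers[0]:
        mu[0] = 1.0        # left shoulder saturates below the first center
    elif x >= centers[-1]:
        mu[-1] = 1.0       # right shoulder saturates above the last center
    else:
        for k in range(K - 1):
            left, right = centers[k], centers[k + 1]
            if left <= x <= right:
                w = (x - left) / (right - left)
                mu[k], mu[k + 1] = 1.0 - w, w  # only the two adjacent sets overlap
                break
    return mu


centers = percentile_centers(list(range(1, 101)), K=3)
print(centers)
print(partition_memberships(30.0, [20.0, 40.0, 60.0]))
```

Because each interior foot coincides with the neighboring center, at most two adjacent fuzzy sets are active at any point and their memberships sum to 1.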

4.2. Prediction-Boundary-Aware Refinement

The initial percentile-based partitioning is model-agnostic and may not align with the black-box model’s decision boundaries. This misalignment can cause a single fuzzy set to straddle regions where the model makes different predictions, reducing rule confidence and ultimately fidelity.
We propose prediction-boundary-aware refinement, which repositions membership function centers to align with the model’s decision boundaries:
  • Obtain black-box model predictions: y ^ = f(X).
  • For each feature j, fit a one-dimensional decision tree Tj on (X[:,j], y ^ ) with at most K leaf nodes (equivalently, K − 1 splits).
  • Extract the split thresholds {tj,1, …, tj,K − 1} from Tj, sorted in ascending order.
  • Reposition membership function centers to the midpoints between consecutive boundaries, writing b_{j,k} ≡ t_{j,k} for the split thresholds in Equation (8):
$$c_{j,k} = \frac{b_{j,k-1} + b_{j,k}}{2}, \quad k = 1, \ldots, K \tag{8}$$
where b_{j,0} = min(X[:, j]) and b_{j,K} = max(X[:, j]) are sentinel boundaries, and b_{j,1}, …, b_{j,K − 1} are the sorted decision-tree split thresholds (≡t_{j,k}). This procedure is applied only for classification models; for regression, we retain percentile-based centers. Per-feature 1D decision trees are used to keep the partitioning interpretable, and feature-local multivariate boundary detection is left to follow-up work.
  • Rebuild the triangular membership functions using the refined centers.
The intuition behind this refinement is illustrated in Figure 2: when fuzzy set centers are positioned at the midpoints of the model’s decision regions (rather than at data percentiles), each fuzzy set predominantly captures samples from a single prediction class, leading to higher rule confidence and improved fidelity.
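The midpoint step of Equation (8) can be sketched as below. The split thresholds are treated as given, for instance as produced by a one-dimensional decision tree fitted on the model predictions; the function name `refined_centers` and the example thresholds are hypothetical, and the tree fitting itself is not shown.

```python
def refined_centers(thresholds, x_min, x_max):
    """Equation (8): centers at the midpoints of consecutive boundaries.

    thresholds: the K - 1 split thresholds of one feature (any order).
    x_min, x_max: sentinel boundaries b_0 and b_K (feature minimum and maximum).
    """
    bounds = [x_min] + sorted(thresholds) + [x_max]
    return [(lo + hi) / 2.0 for lo, hi in zip(bounds, bounds[1:])]


# Hypothetical "age" feature in [18, 90] with two tree splits, hence K = 3 centers
print(refined_centers([60.0, 42.5], x_min=18.0, x_max=90.0))
```

With K − 1 thresholds the function returns K centers, one per decision region of the per-feature tree.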

4.3. Feature Selection

To maintain interpretability and avoid the curse of dimensionality in rule enumeration, we select the top-k most informative features for rule construction. The number of features is determined adaptively:
$$k = \min\!\left(D,\ \max\!\left(\lfloor \log_3 N \rfloor,\ 4\right),\ 7\right) \tag{9}$$
where D is the total number of features and N is the number of training samples. For our 12 classification datasets, Equation (9) is evaluated per training fold under the 3-fold CV protocol of Section 5.1, i.e., on N_train_fold ≈ (2/3)·N_train rather than on the full N reported in Table 3. It yields the following per-dataset k values: Wine 4, Heart Disease 5, Ionosphere 4, Pima Diabetes 6, Breast Cancer 5, German Credit 6, Compas 7, Spambase 7, Magic Gamma 7, Adult Income 7, Bank Marketing 7, Default Credit 7.
The log_3 scaling matches the framework default K = 3, so that the K^k cell count per dataset stays linear in N at the boundary of the regime (the K^k ≪ N condition of Conjecture 1). Using a different log base (e.g., log_K with K varying) would couple feature-selection complexity to the partitioning hyperparameter, complicating sensitivity analysis.
The lower bound k ≥ 4 is an empirically motivated finite-sample safety floor. It prevents degenerate near-single-feature rules at small N and is chosen so that K^k ≥ 81 ≫ K^1 = 3, comfortably above the per-class minimum support. Alternative floors k ∈ {3, 4, 5} were tested on the five-dataset ablation set and gave mean fidelity 0.8930/0.8930/0.8957, respectively (max delta 0.27 pp). For four of the five datasets, ⌊log_3 N⌋ already exceeds the floor (so the floor does not bind); only Heart Disease (D = 4) is sensitive, where floor = 5 yields a +1.4 pp gain. The k = 4 default is retained as a conservative choice; floor = 3 yields identical mean fidelity (0.8930) on the benchmark and would be equally defensible, because the floor binds only when ⌊log_3 N⌋ < 4 (i.e., N < 81), which does not occur on any of our 13 datasets. The specific value 4 vs. 3 is therefore an interpretability preference (≥4 antecedents per rule) rather than an empirical necessity. Sweep details are in ‘k_floor_sweep.csv’ in the released code repository.
The upper bound k ≤ 7 is the Miller [54] cognitive-load ceiling and prevents excessively complex rule bases.
Features are ranked by their fuzzy (soft-binning) mutual information defined in Section 3.4 with respect to the model’s predictions, and the top-k features are selected.
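Equation (9) is easy to check numerically. The following sketch uses hypothetical function names and computes ⌊log_3 N⌋ with integer arithmetic, an implementation choice made here to avoid floating-point error at exact powers of 3.

```python
def floor_log3(n):
    """Largest integer p with 3**p <= n, for n >= 1 (integer arithmetic)."""
    p = 0
    while 3 ** (p + 1) <= n:
        p += 1
    return p


def adaptive_k(D, N, floor_k=4, cap=7):
    """Equation (9): number of selected features for D features and N samples."""
    return min(D, max(floor_log3(N), floor_k), cap)


print(adaptive_k(D=11, N=6172))    # log term equals the cap of 7
print(adaptive_k(D=13, N=100))     # log term equals the floor of 4
print(adaptive_k(D=3, N=1000))     # limited by the number of features
print(adaptive_k(D=20, N=10**6))   # log term exceeds the cap
```

The first call uses the Compas values quoted in Section 3.4 (D = 11, N = 6172); the remaining inputs are illustrative.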

4.4. Rule Extraction via Modified Wang–Mendel Procedure

Given the fuzzy partitioning and selected features, we extract rules using a modified Wang–Mendel procedure [15]:
1. Cell enumeration: Enumerate all K^k cells in the fuzzy grid defined by the k selected features. Each cell corresponds to a unique combination of fuzzy sets across features.
2. Soft assignment: For each cell s = (s_1, …, s_k) and each training sample x_i, compute the membership degree using the product t-norm:
$$\mu_s(x_i) = \prod_{j=1}^{k} \mu_{A_{j,s_j}}(x_{ij}) \tag{10}$$
3. Support computation: The support of cell s is the sum of membership degrees across all training samples:
$$\mathrm{Supp}(s) = \sum_{i=1}^{N} \mu_s(x_i) \tag{11}$$
4. Consequent determination: For classification, the consequent is the class with the highest weighted vote:
$$y_s = \arg\max_{c} \sum_{i:\, \hat{y}_i = c} \mu_s(x_i) \tag{12}$$
where the sum is restricted to training samples whose model prediction ŷ_i equals c (see also Algorithm S1).
5. Confidence computation: The confidence is the ratio of the winning class’s total membership to the cell’s total membership:
$$\mathrm{Conf}(s) = \frac{\sum_{i:\, \hat{y}_i = y_s} \mu_s(x_i)}{\mathrm{Supp}(s)} \tag{13}$$
6. Filtering: Discard rules with confidence below θ_c (0.5 for classification, 0.3 for regression) or support below θ_s = max(2/N, 0.0005). Equation (11) reports the unnormalized sum Σ_i μ_s(x_i) for notational compactness; this sum is divided by N to obtain the normalized fraction that is compared against θ_s, which is on the same N-scale. The absolute floor 0.0005 prevents degenerate near-zero-support rules when 2/N falls below it (N > 4000).
7. Ranking and selection: Sort remaining rules by confidence × support and retain the top M rules:
$$M = \max\!\left(50,\ K^{\min(k,6)}\right) \tag{14}$$
8. Antecedent pruning: Greedily reduce each rule with more than d_min = 3 antecedents, removing antecedents one at a time as long as the consequent is preserved and confidence drops by at most ε = 0.005; rules that become identical after pruning are deduplicated. Both ε and d_min were chosen by an exploratory sweep on the five-dataset ablation set: the grid ε ∈ {0.001, 0.005, 0.01, 0.02} × d_min ∈ {2, 3, 4} yielded mean fidelity from 0.8920 to 0.8972 (max |Δ| = 0.42 pp vs. the default 0.8930). The d_min parameter dominates: d_min = 4 consistently yields +0.2 to +0.4 pp, whereas d_min = 2 loses ≈0.1 pp; ε is empirically inert across the tested range. The default d_min = 3 is retained for interpretability (one fewer antecedent per rule on average than d_min = 4); deployments prioritizing raw fidelity over rule compactness could justify d_min = 4 (full sweep in ‘eps_dmin_sweep.csv’). On the 13-dataset benchmark, this step reduces the mean antecedent count from 5.49 to 3.61 and the total rule count from 329.8 to 175.8 with negligible fidelity impact (<0.003). Supplementary Table S18 summarizes the four rule-count values cited in the main text by pipeline snapshot and aggregation scope.
The complete pseudocode (21 lines, including the boundary-aware partitioning, grid enumeration, soft cell support computation, confidence/support filter, ranking by Conf × Supp, and antecedent pruning steps) is provided as Algorithm S1 in Supplementary Section S.D, together with the helper function midpoints(b).
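Steps 1 to 7 of the procedure above can be sketched on toy data as follows. This is a minimal illustration, not Algorithm S1: the function name `extract_rules`, the dictionary layout of a rule, and the toy memberships are hypothetical, antecedent pruning (step 8) is omitted, and the membership degrees are assumed to be precomputed.

```python
from itertools import product
from math import prod


def extract_rules(mu, preds, K, theta_c=0.5):
    """Grid-based rule extraction (steps 1-7, classification case).

    mu[i][j][s] is the membership of sample i in fuzzy set s of selected feature j;
    preds[i] is the black-box prediction for sample i.
    """
    n = len(preds)
    k = len(mu[0])                             # number of selected features
    theta_s = max(2.0 / n, 0.0005)             # support threshold of step 6
    max_rules = max(50, K ** min(k, 6))        # rule cap M of Equation (14)
    rules = []
    for cell in product(range(K), repeat=k):   # step 1: enumerate the K^k cells
        votes = {}
        for mu_i, y in zip(mu, preds):
            # step 2: product t-norm membership of the sample in this cell
            strength = prod(mu_i[j][cell[j]] for j in range(k))
            if strength > 0.0:
                votes[y] = votes.get(y, 0.0) + strength
        support = sum(votes.values())          # step 3: unnormalized support
        if support == 0.0:
            continue                           # empty cell, no rule
        label = max(votes, key=votes.get)      # step 4: highest weighted vote
        conf = votes[label] / support          # step 5: confidence
        if conf >= theta_c and support / n >= theta_s:  # step 6: filtering
            rules.append({"cell": cell, "label": label,
                          "conf": conf, "support": support / n})
    # step 7: rank by confidence x support and keep the top M rules
    rules.sort(key=lambda r: r["conf"] * r["support"], reverse=True)
    return rules[:max_rules]


# Toy data: two features, K = 2 fuzzy sets (index 0 = Low, 1 = High), six samples
mu = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.8, 0.2], [1.0, 0.0]],
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.2, 0.8], [0.0, 1.0]],
    [[0.0, 1.0], [0.1, 0.9]],
]
preds = [0, 0, 0, 1, 1, 1]
rules = extract_rules(mu, preds, K=2)
for r in rules:
    print(r)
```

On this toy input only the (Low, Low) and (High, High) cells survive: the two mixed cells collect a normalized support of 0.05, below θ_s = 2/6, and are filtered out in step 6.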

4.5. Prediction via Weighted Voting

For a new sample x, prediction proceeds by weighted voting across all rules in the rule base R. The predicted class is the one that maximizes the sum of (firing strength × confidence^2) across all rules with that consequent:
$$\hat{y}(x) = \arg\max_{c} \sum_{r \in \Phi(x):\, c_r = c} \mu_{R_r}(x)\cdot \mathrm{Conf}(R_r)^2 \tag{15}$$
where Φ(x) is the set of rules that fire at x (Section 4.6), μ_{R_r}(x) is rule r’s firing strength at x, and the sum is restricted to fired rules whose consequent equals c.
The squared confidence term amplifies the influence of high-confidence rules relative to low-confidence ones, effectively sharpening the voting distribution. The choice of squared vs. linear confidence is empirically inert on our ablation set (both conf^1 and conf^2 yield 0.893 mean fidelity; see Section 5.8). We retain conf^2 as the default for its semantic symmetry with the product t-norm, since firing strength and rule confidence then both enter the prediction as multiplicative trust weights, but practitioners choosing conf^1 for simplicity should expect identical empirical behavior. Higher-order alternatives (conf^3, conf^4) were also tested and produced no further gain.
Fallback rule. If no rule fires for a given sample x (i.e., Φ(x) = ∅), the framework defaults to the majority class observed in the training data. This is chosen over uniform-random or nearest-neighbor fallbacks because (a) it preserves the deterministic, reproducible nature of the global rule base, and (b) the fallback is invoked for fewer than 0.04% of test samples on average on the 13-dataset benchmark, so the choice is operationally inert (uniform-random would only differ from majority-class at higher fallback rates). This rate corresponds to the 0.9996 mean coverage reported in Table 4 and Section 5.4; Table 4 also gives the bootstrap 95% CI around this coverage value ([0.9990, 1.0000]).
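The voting rule of Equation (15) together with the majority-class fallback can be sketched as follows. The function name `predict`, the tuple layout of a fired rule, and the numerical values are hypothetical, and firing strengths are assumed to be computed beforehand.

```python
def predict(fired_rules, majority_class):
    """Weighted vote of Equation (15) with majority-class fallback.

    fired_rules: list of (firing_strength, confidence, consequent) tuples for one sample.
    majority_class: training-set majority class, returned when no rule fires.
    """
    scores = {}
    for strength, conf, label in fired_rules:
        if strength > 0.0:  # only rules in the firing set contribute
            scores[label] = scores.get(label, 0.0) + strength * conf ** 2
    if not scores:
        return majority_class  # fallback: empty firing set
    return max(scores, key=scores.get)


# Class "A" fires slightly more strongly, but the "B" rule is far more confident
print(predict([(0.6, 0.7, "A"), (0.5, 0.95, "B")], majority_class="A"))
# No rule fires, so the fallback is used
print(predict([(0.0, 0.9, "A")], majority_class="C"))
```

In the first call the scores are 0.6 · 0.7^2 = 0.294 for "A" and 0.5 · 0.95^2 ≈ 0.451 for "B", so "B" wins the vote.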

4.6. Rule Entropy as Uncertainty Indicator

A distinctive property of fuzzy rule systems is that multiple rules with different consequents can fire simultaneously for a single sample. We exploit this by computing, for each sample x, the proportion of total firing strength allocated to each class c—denoted p_c(x) and defined explicitly in Equation (17) below—and then taking the Shannon entropy of this {p_c(x)} distribution as a per-sample uncertainty signal (rule entropy):
$$H_{rules}(x) = -\sum_{c \in C} p_c(x) \log_2 p_c(x) \tag{16}$$
where the firing set Φ(x) = {r ∈ R: μ_{R_r}(x) > 0} (the same definition is restated in Property 1 below for self-containedness), and p_c(x), the per-class firing-strength proportion, is defined explicitly in Equation (17) below as the ratio of the firing-strength mass on rules with consequent c to the total firing-strength mass:
$$p_c(x) = \frac{\sum_{r:\, y_r = c} \mu_{R_r}(x)}{\sum_{r \in R} \mu_{R_r}(x)} \tag{17}$$
Equivalently, in explicit set-builder form, p_c(x) = (Σ_{r ∈ Φ(x): c_r = c} μ_{R_r}(x))/(Σ_{r ∈ Φ(x)} μ_{R_r}(x)), where the numerator restricts the firing-strength sum to rules whose consequent is c.
In Equation (17), the sums in the numerator and denominator are taken over the firing set Φ(x) = {r ∈ R: μ_{R_r}(x) > 0}; non-firing rules contribute zero, so summing over all of R yields the same value, and the numerator further restricts to {r ∈ Φ(x): c_r = c}. Note that p_c is computed from firing strengths alone, without the confidence weighting used in prediction (Equation (15)), because rule entropy aims to measure the spatial disagreement among rules (how many distinct decision regions overlap at a given point) rather than the confidence-weighted prediction outcome. We tested both formulations on the 12-classification benchmark: firing-strength-only and confidence-weighted entropy yielded essentially identical mean Spearman correlation (ρ = 0.4203 vs. 0.4204; per-dataset paired Wilcoxon p = 0.733, i.e., empirically indistinguishable on the entropy correlation task). We retain firing-strength-only as the default for semantic clarity (measuring rule disagreement at x rather than the confidence-weighted prediction outcome); practitioners may use either with no measurable impact on the uncertainty correlation. Per-dataset results are in ‘entropy_weighting_ablation.csv’ in the released code repository.
When the model is confident, the fired rules tend to agree on the same class, producing low H_rules. When the model is uncertain (e.g., prediction probability near 0.5), rules with different consequents fire with comparable strength, producing high H_rules. This uncertainty tracking is emergent: the rule base is trained to maximize fidelity, not to calibrate uncertainty, yet rule entropy naturally correlates with model prediction uncertainty. Crisp surrogates cannot exhibit this property: each sample follows a single root-to-leaf path, yielding one deterministic prediction with no analogous disagreement signal.
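The rule-entropy computation described in this subsection can be sketched in a few lines. The function name `rule_entropy`, the (firing strength, consequent) pair layout, and the choice to return 0 when no rule fires are assumptions made for illustration; the per-class proportions follow the firing-strength-only definition of p_c(x).

```python
import math
from collections import defaultdict


def rule_entropy(fired_rules):
    """Shannon entropy (bits) of the per-class firing-strength proportions.

    fired_rules: iterable of (firing_strength, consequent) pairs.
    Hypothetical helper; confidence is deliberately not used here.
    """
    mass = defaultdict(float)
    for strength, consequent in fired_rules:
        if strength > 0:
            mass[consequent] += strength
    total = sum(mass.values())
    if total == 0:
        return 0.0  # assumption: no fired rule is reported as zero entropy
    # sum of p * log2(1/p) equals -sum of p * log2(p)
    return sum((m / total) * math.log2(total / m) for m in mass.values())


agreeing = rule_entropy([(0.8, "A"), (0.3, "A")])      # one consequent only
conflicting = rule_entropy([(0.5, "A"), (0.5, "B")])   # evenly split mass
```

When all fired rules share a consequent the entropy is zero, and an even split across two classes gives one bit, which matches the low and high H_rules regimes described above.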

4.7. Computational and Asymptotic Properties

We conclude the methodology with two formal results and one supporting asymptotic intuition. Property 1 (a direct application of Shannon entropy bounds) characterizes the rule entropy indicator and certifies the regimes in which it vanishes or saturates. Conjecture 1, presented as an informal asymptotic intuition rather than a theorem, sketches consistency of the rule base with respect to the underlying black-box classifier under standard regularity conditions; the heuristic argument is consistent with the empirical across-dataset behavior reported in Section 5.3 and Section 5.12 and is provided as supporting context, not as a contribution claim. Proposition 1 analyses the asymptotic computational cost and matches the scalability measurements of Section 5.7.
Property 1 (Rule entropy bounds). 
Let x ∈ X be an input and let Φ(x) = {r ∈ R: μ_{R_r}(x) > 0} be the set of rules fired at x with positive firing strength. Denote by C(x) = {c_r: r ∈ Φ(x)} the set of distinct consequents represented in Φ(x), let C be the global label set, and let p_c(x) be the firing-strength-weighted class proportion of Equation (17). Then
$$0 \le H_{rules}(x) \le \log_2 |C(x)| \le \log_2 |C|$$
with equality on the left iff all fired rules share a single consequent (|C(x)| = 1), equality in the middle iff p_c(x) is uniform over C(x), and equality on the right iff C(x) = C (every global label is represented among the fired rules).
Proof sketch and the two practical consequences (per-input log2|C| calibration scale and the dependence of H_rules ceiling on rule-base diversity rather than task complexity) are in Supplementary Section S.B (Property 1 proof and practical-consequences discussion).
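For readers without the supplement at hand, the chain of inequalities can be recovered from standard Shannon-entropy arguments. The following is a sketch under the usual convention 0·log₂0 = 0 and is not a reproduction of the proof in Supplementary Section S.B.

```latex
% Sketch only. p_c(x) = 0 for every c outside C(x), so the entropy sum
% can be restricted to C(x), where 0 < p_c(x) <= 1.
\begin{align*}
H_{rules}(x)
  &= \sum_{c \in C(x)} p_c(x)\,\log_2 \frac{1}{p_c(x)} \;\ge\; 0
  && \text{each term is non-negative}, \\
H_{rules}(x)
  &\le \log_2 \Big( \sum_{c \in C(x)} p_c(x) \cdot \frac{1}{p_c(x)} \Big)
   = \log_2 |C(x)|
  && \text{Jensen's inequality (concavity of } \log_2\text{)}, \\
\log_2 |C(x)| &\le \log_2 |C|
  && \text{since } C(x) \subseteq C .
\end{align*}
```

Equality in the Jensen step holds exactly when 1/p_c(x) is constant on C(x), i.e., when p_c(x) is uniform over C(x), which is the middle equality condition stated in Property 1.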
Conjecture 1 (Asymptotic intuition (informal observation, pending formal proof; heuristic argument in Supplementary Section S.B)). 
Under standard regularity conditions on the data distribution and on the K_N, k_N grid growth rates, the FuzzyRules predictor is heuristically expected to converge in probability to the wrapped black-box classifier as N → ∞. We present this as an asymptotic intuition rather than a contribution claim. The status (conjecture rather than theorem) reflects a pending formal proof of feature-selection consistency for the fuzzy mutual information ranking; the empirical across-dataset K-monotone trend in Supplementary Table S7 (mean fidelity 0.870/0.893/0.898/0.910 for K = 2/3/4/5) is consistent with the heuristic claim at finite N. The condition (K_N)^{k_N}/N → 0 captures the curse-of-dimensionality regime underlying the Ionosphere failure case (see the section “When the Framework Fails and Why”: K^k = 81 vs. N_boundary ≈ 71).
Proposition 1 (Framework complexity). 
On a training set with N samples, D features, K fuzzy sets per feature, k ≤ D selected features, and a retained rule base of size M ≤ K^k, the proposed framework has worst-case running time O(D·N log N + K^k·N·k) for rule construction and O(M·k) per inference query.
Proof Sketch. 
(a) Boundary-aware partitioning: O(N log N) per feature [16]; (b) grid enumeration: O(K^k·N·k); (c) inference: O(M·k) per query. Full step counts in Supplementary Section S.C.
Combining Proposition 1 with K = 3 and k ≤ 7 yields O(D·N log N + 2187·N·k) training and O(M·k) ≤ O(729·7) = O(5103) per query, both linear in N for fixed K, k. The empirical scaling (Section 5.7: 28 s at N = 50 K, 0.02 ms per query) confirms these constants. The M = 729 ceiling is a worst-case bound: the confidence-and-support filter discards 60–90% of candidate cells on our 13 benchmark datasets, and antecedent pruning further compresses the retained set to a mean post-pruning count of 175.8 (Table 4)—≈24% of the theoretical cap. □
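The constants quoted above are plain integer arithmetic and can be re-derived directly. The helper names below are hypothetical; the inputs are the values stated in the text (K, k, the M = 729 ceiling, and the mean post-pruning rule count of 175.8).

```python
def grid_cells(K, k):
    """Number of candidate antecedent cells for K fuzzy sets on k features."""
    return K ** k


def per_query_ops(M, k):
    """Worst-case membership evaluations per query for M rules of k antecedents."""
    return M * k


cells_k7 = grid_cells(3, 7)           # factor multiplying N*k in the training bound
cells_k6 = grid_cells(3, 6)           # equals the M = 729 ceiling quoted in the text
cells_k4 = grid_cells(3, 4)           # grid size cited for the Ionosphere case
query_bound = per_query_ops(729, 7)   # per-query bound quoted in the text
retained_share = 175.8 / 729          # mean post-pruning count relative to the cap
```

The exponential dependence on k is what makes the cap on selected features matter: moving from K = 3 to K = 5 at k = 7 raises the grid from `grid_cells(3, 7)` to `grid_cells(5, 7)` cells.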

5. Experiments

5.1. Experimental Setup

Datasets. We evaluate on 13 benchmark datasets (12 classification + 1 regression—the scikit-learn diabetes benchmark; California Housing and Iris appear as external validation in Section 5.13, Boston Housing was removed for ethical reasons—see Section 6.3); Table 3 summarizes domain, sample size, dimensionality, and source. Datasets are selected to cover diverse domains (healthcare, finance, criminal justice, physics, cybersecurity), sample sizes (178 to 49 K), and dimensionalities (8 to 57). Features are standardized to zero mean and unit variance prior to model training. Note on COMPAS protected attributes: the COMPAS dataset includes race, and we follow the original ProPublica analysis [55] and Dressel & Farid [2] in supplying race as an input to the black-box model. The rule base extracted by FuzzyRules can—and does, as shown in Supplementary Table S2—produce high-confidence rules whose antecedents reference race; we retain this configuration deliberately because the diagnostic value of FuzzyRules in this setting is precisely its ability to surface a model’s reliance on protected attributes in a single human-readable line, rather than to endorse race-conditioned production scoring. A practitioner deploying FuzzyRules in regulated contexts would either (a) train the model with the protected attribute removed, or (b) audit and re-weight offending rules; the fairness-aware extension is listed as future work in Section 7.2. COMPAS [55] uses standard ProPublica features (race included, mirroring deployment); race-removed control rerun is future work.
Black-box models. Four model families serve as the black boxes to be explained: Random Forest (RF; 100 trees, max depth 10), XGBoost (XGB; n_estimators = 100, max_depth = 6, learning_rate = 0.1, subsample = 1.0, colsample_bytree = 1.0, reg_alpha = 0, reg_lambda = 1), Multi-Layer Perceptron (MLP; two hidden layers (100, 50), ReLU activation, Adam optimiser, learning rate 1 × 10⁻³, batch size 32, max 200 epochs with early stopping on a 10% validation split (patience 10)), and Support Vector Machine (SVM for classification, SVR for regression; RBF (Radial Basis Function) kernel, C = 1.0). In cross-model analyses (Section 5.6), SVM and SVR are reported as separate variants since they apply to different task types, yielding five algorithm variants in total (RF, XGB, MLP, SVM, SVR; we report this as “4 model families” throughout because SVM and SVR are the same algorithm family specialized to classification and regression, respectively).
Explanation methods. We compare five methods: (1) FuzzyRules (proposed), with boundary-aware partitioning (K = 3, product t-norm); (2) TreeSurrogate (decision tree trained on model predictions, max_depth tuned via validation-set grid search over {3,4,5,6,7,8,10}); (3) SHAP (KernelSHAP [8] with 100 background samples); (4) LIME [9] (default parameters with 5000 perturbed samples); and (5) Anchors [10] (beam search with precision threshold 0.95). Implementation hyperparameters not stated above use scikit-learn/library defaults: RandomForest (n_estimators = 100, max_depth = 10; min_samples_split = 2, criterion = “gini”); LIME (kernel_width = 0.75 × sqrt(D), feature_selection = “auto”, discretize_continuous = True, num_samples = 5000); Anchors (anchor_baseline.AnchorTabular with beam_width = 4, beam_size = 4, batch_size = 100, coverage_samples = 10,000, max_anchor_size = unbounded, precision_threshold = 0.95, tau = 0.15); KernelSHAP (background_samples = 100, link = “identity”); 1D decision trees for the boundary partitioner (sklearn DecisionTreeClassifier with max_leaf_nodes = K, min_samples_leaf = max(2, ⌈0.02·N⌉), criterion = “gini”). (Note: Table 4 reflects tuned TreeSurrogate vs. untuned-default FuzzyRules—asymmetry favoring TreeSurrogate. The matched-tuning-budget reversal is in Section 5.12.)
Evaluation metrics:
  • Fidelity: Agreement rate between the explainer’s prediction and the black-box model’s prediction on the held-out test fold of the 3-fold CV protocol (i.e., 1/3 of the dataset per fold-evaluation, averaged across the three folds; not an internal training-set agreement). For classification this is argmax-class accuracy; for regression this is the coefficient of determination R² between surrogate output and model output (consistent with Supplementary Table S14). Reported for rule-based methods only.
  • Coverage: Fraction of test samples for which the explainer produces a prediction.
  • Stability: Prediction agreement under small Gaussian input perturbations in standardized feature space. Each feature is first z-score normalized so that σ = 0.01 corresponds to 1% of the feature’s standard deviation, following the small-perturbation convention of Alvarez-Melis & Jaakkola [19] for explanation stability. Perturbations are applied independently per feature and averaged over 100 repetitions. This is a single-magnitude protocol: multi-magnitude sensitivity (σ ∈ {0.01, 0.05, 0.10}) is left to future work, and the single-scale limitation is flagged in Section 6.3.
  • Comprehensibility: Number of rules and average number of antecedents per rule.
  • Computational cost: Wall-clock time for extraction/training.
  • Cross-model consistency: Whether the explainer’s behavior remains consistent across different black-box models trained on the same data.
Metrics scope. We report fidelity, coverage, and stability on the surrogate (the explainer’s output relative to the wrapped black-box), not classification accuracy/F1/AUC of the surrogate against ground-truth labels. This follows the standard XAI evaluation convention: a surrogate is meaningful insofar as it mimics the black-box (fidelity), covers the input space (coverage), and is robust to small perturbations (stability); the ground-truth predictive performance belongs to the underlying black-box, which is reported separately as model accuracy in Supplementary Table S4. Reporting surrogate-level F1 or AUC would conflate the explainer’s job (mimic the model) with the model’s job (predict the label) and is not the recognized XAI evaluation pattern used in SHAP, LIME, Anchors, or TreeSurrogate benchmarks.
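The fidelity, coverage, and stability metrics listed above can be sketched as follows. This is an illustrative reading rather than the evaluation code used in the paper: the function names are hypothetical, an abstaining explainer is represented by `None`, and computing fidelity only over covered samples is an assumption (it corresponds to the within-covered fidelity reported for local methods).

```python
import random


def coverage(surrogate_preds):
    """Fraction of samples for which the explainer returns a prediction."""
    return sum(p is not None for p in surrogate_preds) / len(surrogate_preds)


def fidelity(surrogate_preds, blackbox_preds):
    """Agreement rate with the black box, over covered samples (assumption)."""
    pairs = [(s, b) for s, b in zip(surrogate_preds, blackbox_preds) if s is not None]
    if not pairs:
        raise ValueError("no covered samples")
    return sum(s == b for s, b in pairs) / len(pairs)


def stability(predict, samples, sigma=0.01, repetitions=100, seed=42):
    """Share of Gaussian-perturbed copies whose prediction matches the original.

    samples are assumed to be already standardized feature vectors.
    """
    rng = random.Random(seed)
    agree, total = 0, 0
    for x in samples:
        base = predict(x)
        for _ in range(repetitions):
            noisy = [v + rng.gauss(0.0, sigma) for v in x]
            agree += predict(noisy) == base
            total += 1
    return agree / total


# Toy data (invented): one abstention, one disagreement among covered samples.
toy_surrogate = ["a", "b", None, "a"]
toy_blackbox = ["a", "a", "b", "a"]
toy_cov = coverage(toy_surrogate)
toy_fid = fidelity(toy_surrogate, toy_blackbox)
toy_stab = stability(lambda x: int(x[0] > 0), [[1.0], [-1.0]])
```

In the toy stability call both samples sit far from the threshold relative to σ = 0.01, so every perturbed copy keeps its label; samples close to a decision boundary are the ones that lower the score.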
Protocol. All experiments use 3-fold StratifiedKFold cross-validation (sklearn StratifiedKFold for classification, KFold for regression) with fixed random seed = 42 unless otherwise specified (seed-robustness check across {42, 123, 2024} in Section 5.12/Supplementary Table S16). Statistical comparisons use the Wilcoxon signed-rank test (pairwise) and the Friedman test (omnibus), with Bonferroni correction for multiple comparisons.
Protocol versioning rationale (which protocol underlies which headline number; consolidation of all protocol variants used in the paper) is in Supplementary Section S.N.

5.2. Overall Results

Table 4 aggregates the five explainers along five metrics averaged across all 13 datasets and all four model families; Figure 3 renders the same data as a radar comparison to make the multi-metric profile of each method visible at a glance.
Fidelity definition and cross-explainer asymmetry. Fidelity (Equation (14)) is the agreement rate between the surrogate model’s discrete prediction and the black-box prediction on held-out samples. SHAP and LIME do not produce a discrete global prediction; we follow the standard convention of converting their per-feature attribution vector into a prediction via a linear scoring step (consistent with the LIME formulation of Ribeiro et al. [9]). The comparison is therefore symmetric at the prediction-level but is not symmetric at the attribution-vector level, and we do not claim SHAP/LIME inferiority on the granular per-feature attribution metrics that those methods were designed to optimize. Table 4 reports two grouped subsections to make the asymmetry explicit: full-coverage global surrogates (FuzzyRules, TreeSurrogate, vanilla Wang–Mendel) and local-coverage methods (SHAP, LIME, Anchors). Within each group, fidelity is directly comparable; the coverage-adjusted “Fidelity × Coverage” column enables the cross-group view that brings Anchors’s 0.963 within-covered fidelity to 0.220 once weighted by its 22.8% coverage.

5.3. Fidelity Analysis

Per-dataset fidelity values for all 13 benchmarks (Anchors, FuzzyRules, TreeSurrogate with 3-fold standard deviation) are reported in Supplementary Table S3.
Supplementary Table S17 summarizes the four protocols under which the TreeSurrogate (and matched FuzzyRules) fidelity values cited in the paper are computed; all values reflect the same code path, and the differences are entirely protocol-driven. On classification tasks (12 datasets), the fidelity gap between FuzzyRules (0.889) and TreeSurrogate (0.900) is only 1.1 percentage points and is not statistically significant (Wilcoxon p = 0.733). On all 13 datasets, the difference likewise does not reach significance (p = 0.497). A formal TOST equivalence test (reported in Section 5.4) confirms practical equivalence within ±5 percentage points, providing stronger evidence than the non-significant Wilcoxon test alone.

5.4. Coverage, Stability, and Statistical Significance

FuzzyRules achieves near-complete coverage (average 0.9996 across the 13 datasets, i.e., ≈100% with at most a 0.04% fallback to majority class; see Section 4.5), comparable to TreeSurrogate (1.0000 exactly: every test sample reaches a leaf), SHAP, and LIME. Anchors covers only 22.8% on average.
Stability analysis shows that FuzzyRules (0.986) achieves stability comparable to TreeSurrogate (0.994). The Wilcoxon test does not indicate a statistically significant difference (p = 0.131), and the TOST equivalence test confirms practical equivalence within ±5 percentage points (p_TOST < 0.001), suggesting that fuzzy rule predictions are robust to small input perturbations.
Statistical Tests. Primary evidence: pairwise Wilcoxon–Holm and TOST on the 12- and 13-dataset full-coverage subsets. Exploratory only: Friedman tests on the 7 Anchors-eligible datasets (with 3 rule-based methods, namely FR, TS, and Anchors; SHAP/LIME excluded for stability per the Table 4 footnote) yield omnibus significance for fidelity (χ² = 8.00, p = 0.018) and stability (χ² = 6.08, p = 0.048). The fidelity significance is driven by the FR-vs-Anchors and TS-vs-Anchors gaps (Wilcoxon–Holm p_Holm < 0.001 for both). Mean ranks: fidelity Anchors 1.14/TS 2.29/FR 2.57 (CD = 1.25); stability TS 1.29/FR 2.21/Anchors 2.50 (same CD). Pairwise Wilcoxon (Holm step-down, Holm 1979; protocol per Demšar [56]; Table S10) confirms FR differs from Anchors on fidelity (p_Holm < 0.001) but not from TreeSurrogate on fidelity (p_Holm = 0.733) or stability (p_Holm = 0.262); the Holm-corrected stability comparison is omitted for Anchors because its 22.8% coverage prevents a paired comparison on the same test instances. The TOST (Two One-Sided Tests) equivalence test [57] with practical margin δ = 0.05 (a priori, conventional gap per Demšar [56]) confirms FR and TS are practically equivalent on fidelity (clf: p_TOST = 0.002, mean diff −1.1 pp; all 13: p_TOST = 0.001, mean diff −1.2 pp) and stability (p_TOST < 0.001). Effect sizes are small (Cohen d = −0.29, i.e., |d| ∈ [0.2, 0.5); Cliff δ = −0.12), about five times smaller in magnitude than the Cohen d = 1.47 separating both methods from Anchors. Seed-robustness check (seeds 42/123/2024 on the five-dataset ablation set): per-dataset across-seed std of fidelity is 0.77 pp on average, 2.0 pp at worst (Heart Disease, full in Table S10). Caveat: 7 datasets are below Demšar’s [56] N ≥ 10 recommendation; primary basis for equivalence is the 12-/13-dataset Wilcoxon–Holm.
Figure 4 and Figure 5 present the Friedman–Nemenyi critical-difference diagrams [56] for fidelity and stability, respectively.
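The Holm step-down correction applied to the pairwise Wilcoxon tests above can be sketched as follows. The function name `holm_adjust` and the example p-values are invented for illustration; they are not the p-values reported in this section.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending raw p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three pairwise comparisons.
example_adjusted = holm_adjust([0.01, 0.04, 0.03])
```

Holm is uniformly at least as powerful as a plain Bonferroni correction while controlling the same family-wise error rate, which is one reason it is commonly paired with the Demšar-style protocol cited here.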

5.5. Comprehensibility and Cognitive Load

Table 5 compares the four methods on three comprehensibility proxies: the average number of antecedents per rule (lower = simpler), a 1/antecedents comprehensibility score, and the total rule count.
While FuzzyRules produces a larger global rule base (175.8 rules vs. 17.3 leaves), the average number of rules firing per test instance is only 27.1 (15.5% of the global base), making the local explanation complexity manageable (avg_rules_fired computed from cv_results.csv, averaged across 12 classification datasets and all 4 model families; released with the source code).
Comprehensibility analysis: FuzzyRules averages 3.61 antecedents per rule vs. TreeSurrogate’s 8.6 cognitive chunks per condition (each crisp threshold counted as 2 chunks per Miller [54] and Cowan [58] working-memory literature; the per-numeric-threshold = 2 chunks equivalence is an operational proxy, not a direct empirical study). FuzzyRules stays within Miller’s 7 ± 2 limit on all 12 datasets; TreeSurrogate exceeds it on 5/12. Per-dataset chunk counts and the readability rationale are in Supplementary Table S1 and Section S.K (full Section 5.5 detail).
Caveat on interpretability claims. The interpretability advantage above rests on indirect proxies—antecedent count, cognitive-chunk count following Miller [54] and Cowan [58] working-memory bounds, and linguistic-label vocabulary. A controlled user study with domain experts (e.g., clinicians for COMPAS-style risk tables, financial analysts for credit scoring) directly comparing FuzzyRules’s linguistic rules against TreeSurrogate’s numeric thresholds on prediction-explanation and simulatability tasks would convert these proxies into a direct interpretability measurement; we list this as future work in Section 7.2. The current interpretability claim should accordingly be read as proxy-based evidence supporting (rather than establishing) the linguistic-rule advantage.
Audit-trail walkthrough (one COMPAS sample). To make the auditability concrete, we trace one defendant from input to explanation. Sample: age = 22 (linguistic Low), priors_count = 5 (High), juvenile_misdemeanor_count = 2 (High), charge_degree = M (Low). Step 1: partitioning maps each numeric feature to (Low/Medium/High) memberships using the boundary-aware MFs of Section 4.2. Step 2: three rules fire with positive firing strength: (R1) IF age IS Low AND priors_count IS High THEN Recidivate (firing strength 0.82, confidence 0.74); (R2) IF juvenile_misdemeanor_count IS High AND charge_degree IS Low THEN Recidivate (0.61, 0.69); (R3) IF priors_count IS High THEN Recidivate (0.78, 0.71). Step 3: the weighted vote (Equation (15)) sums (firing strength × confidence²) per consequent: Recidivate = 0.82·0.74² + 0.61·0.69² + 0.78·0.71² = 1.139, No-Recidivate = 0; predicted class = Recidivate. Step 4: the rule entropy (Equation (17)) of the firing distribution = 0 (all fired rules agree on Recidivate), signaling low rule-level uncertainty. Step 5: the human-readable explanation served to the auditor: “This defendant was predicted to recidivate because (a) the defendant is young with a high prior-record count, and (b) the defendant has a high juvenile-misdemeanor count combined with a low-severity current charge. All three triggered rules predict the same outcome, so the model is internally consistent on this case.” The same audit pattern would surface a contradictory result (high rule entropy, mixed consequents) when the model itself is uncertain, a property no crisp tree surrogate can offer because each input follows exactly one deterministic path.
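Steps 3 and 4 of the walkthrough can be checked term by term from the displayed firing strengths and confidences. The arithmetic below uses only those two-decimal values; it gives a Recidivate score of about 1.133, slightly below the 1.139 quoted above, and the small gap may reflect rounding of the displayed inputs.

```latex
\begin{align*}
0.82 \cdot 0.74^{2} &= 0.82 \cdot 0.5476 = 0.449032 \\
0.61 \cdot 0.69^{2} &= 0.61 \cdot 0.4761 = 0.290421 \\
0.78 \cdot 0.71^{2} &= 0.78 \cdot 0.5041 = 0.393198 \\
\text{score}(\text{Recidivate}) &= 0.449032 + 0.290421 + 0.393198 \approx 1.133 \\[4pt]
p_{\text{Recidivate}}(x) &= \frac{0.82 + 0.61 + 0.78}{0.82 + 0.61 + 0.78} = 1,
\qquad
H_{rules}(x) = -1 \cdot \log_2 1 = 0
\end{align*}
```

Either value leaves the outcome unchanged, because No-Recidivate receives no firing-strength mass at all.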

5.6. Cross-Model Consistency

Cross-model stability is reported per-method × per-model in Supplementary Table S11 and analyzed in Supplementary Section S.H. Headline: FuzzyRules (0.978–0.993) and TreeSurrogate (0.993–0.994) are stable across the four model families of Section 5.1; SHAP/LIME perturbation-based stability values measure a different quantity (the wrapped black-box) and are excluded from the formal stability ranking—see Table 4 footnote for the rationale. Random-seed sensitivity of the fidelity estimates is separately characterized in Supplementary Table S9.

5.7. Computational Cost

Aggregation note. Tables 4, 5, 7 and 8 (fidelity, coverage, stability, rule counts) report macro-averages (each dataset weighted equally, then averaged), so that small and large datasets contribute equally to the comparison. Table 5 below reports micro-averages over individual (dataset, model, fold) measurements because wall-clock time scales strongly with sample size and feature dimensionality, and a micro-average more faithfully reflects the cost a practitioner would incur on a representative dataset drawn from the benchmark mix.
FuzzyRules training (4.728 s) is 13× slower than TreeSurrogate (0.366 s, including hyperparameter tuning) but 1.8× faster than Anchors (8.539 s). As a global method, this cost is paid once: the break-even point vs. per-query LIME is N ≈ 134 queries. Supplementary Figure S1 renders the cost comparison on a log scale.
Empirical scalability validation. To verify the O(D·N log N + K^k·N·k) bound of Proposition 1, we generate synthetic binary-classification problems with N ∈ {500, 1 K, 5 K, 10 K, 25 K, 50 K} and D ∈ {10, 20, 40} and time end-to-end FuzzyRules training. Fit time scales almost linearly with N (0.11 s at N = 500; 4.8 s at N = 10 K; 28 s at N = 50 K) and is essentially flat in D once the M cap saturates. Per-sample inference stays at ≈0.02 ms across the grid. Per-(N, D) cell measurements are in Supplementary Table S8.
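One way to read "almost linearly" is to fit a power law t ≈ a·N^b to the three quoted timings and inspect the exponent b. The sketch below does this with an ordinary least-squares fit in log-log space; the helper name `loglog_slope` is hypothetical, and three points give only a rough estimate, not a substitute for the full grid in Supplementary Table S8.

```python
import math


def loglog_slope(ns, times):
    """Least-squares slope of log(time) against log(N)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Timings quoted in the text: 0.11 s, 4.8 s, and 28 s.
exponent = loglog_slope([500, 10_000, 50_000], [0.11, 4.8, 28.0])
```

On these three points the fitted exponent comes out a little above 1, which is the pattern one would expect from an N log N term sitting alongside terms linear in N.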

5.8. Ablation Study

To quantify the contribution of each component, we conduct an ablation study on five datasets using Random Forest as the representative black-box model. We select Random Forest for all ancillary analyses (ablation, readability, simulatability, coverage-based selection) because the cross-model stability analysis confirms that FuzzyRules produces consistent explanations across all four model families; findings obtained with Random Forest are therefore expected to carry over to the other classifiers. Cumulative results are presented in Table 6. Matched-budget tuned FuzzyRules with an inner two-fold grid search over (K, min_confidence) is additionally reported in Supplementary Table S7.
Key findings. (i) Cumulative framework gain: enabling all proposed components lifts mean fidelity from a vanilla 1992 Wang–Mendel baseline of 0.736 to 0.893 (+10.4 pp on the four-dataset subset excluding the Breast Cancer outlier; +15.7 pp aggregate including it, both transparently reported). The headline +15.7 pp gain is therefore inflated by the Breast Cancer outlier, but the gain remains substantial (and statistically significant, Wilcoxon p = 0.043 on the four-dataset subset; n = 4 power-limited but consistent in sign) without it. The component-wise leave-one-in decomposition (Supplementary Table S15) shows that switching from min t-norm to product t-norm is the dominant single step (+11.3 pp); boundary-aware membership function placement and lifting K from 2 to 3 each add +2.1 pp; antecedent pruning and tightening min_confidence to 0.5 contribute marginally on this subset. The transparently reported Breast Cancer outlier (Vanilla = 0.415) is overwhelmingly attributable to the min t-norm × K = 2 combination producing near-zero product memberships, not an implementation bug. (ii) Boundary-aware refinement contributes +2.5 pp over the percentile-based default partitioning and +0.7 pp over the strongest alternative strategy (K-Means; Table 7); a complementary synthetic-data isolation reported in Supplementary Section S.O/Table S21 confirms a +1.4 pp boundary-aware advantage over vanilla Wang–Mendel on a 45° rotated boundary and +1.7 pp on a 5-feature rotated mixture. (iii) The ablation also confirms that K = 3 is the interpretability-fidelity sweet spot: K = 5 attains 0.910 mean but inflates the rule base ≈100× and exceeds Miller’s 7 ± 2 chunk limit per rule. The conclusion that all framework components contribute is unaffected by the Breast Cancer outlier.
Partitioning strategy ablation. To isolate the contribution of prediction-boundary-aware partitioning (Section 4.2) from the rest of the framework, we replace the boundary-aware MF-center placement with three alternative strategies while keeping all other components identical: (i) Percentile, equally spaced percentiles of the feature distribution (the default in our FuzzyPartitioner); (ii) Equal-Width, MF centers at equal intervals of [min, max]; (iii) K-Means, 1D K-means cluster centers. Table 7 reports fold-averaged fidelity on the same five-dataset ablation set.
Boundary-Aware achieves the best mean fidelity (0.893) over Percentile (0.868), Equal-Width (0.869), and K-Means (0.886); the supervised, data-driven nature of the decision-tree split objective explains the gain. Per-dataset gains are largest where the model’s decision boundaries are misaligned with input percentiles (Adult Income +5.3 pp, Magic Gamma +6.1 pp); per-dataset numbers in Table 7.
Hyperparameter sensitivity. We probed the four main hyperparameters one at a time on the same five-dataset ablation set (Random Forest, 3-fold CV; full curves in Supplementary Table S6). Results are summarized in Table 8.
Two protocol notes: (i) K = 5 attains 0.910 mean fidelity but inflates the grid to K^k = 78,125 cells at k = 7; we retain K = 3 as the default for the interpretability–fidelity trade-off (full curves in Supplementary Table S6). (ii) Squared vs. linear confidence weighting is empirically inert on this set (both yield 0.893); the squared form is retained for the product-t-norm semantic consistency reason given in Section 4.5.

5.9. Rule Readability and Cognitive Load

The following table presents side-by-side rule examples extracted from the Heart Disease dataset using Random Forest as the black box. Table 9 juxtaposes the highest-confidence rule for each class across the two methods.
The FuzzyRules output is immediately interpretable by a domain expert: “High chest pain type and High thalassemia predict heart disease.” The TreeSurrogate output requires knowledge of the standardization parameters to interpret the thresholds—“thal ≤ −0.16” on standardized data has no intuitive medical meaning. This qualitative difference underscores the unique value of linguistic labels in domain-expert communication.
The same qualitative gap holds in regulated domains. On the COMPAS recidivism dataset [55], for example, FuzzyRules produces rules such as “IF juvenile misdemeanor count IS High AND charge degree IS Low THEN Recidivism” (confidence 1.000), while equivalent TreeSurrogate paths comprise up to eight standardized threshold conditions (e.g., “priors_count > −0.58 AND age ≤ −0.60 AND race ≤ −0.15 …”). Under algorithmic fairness scrutiny [2], linguistic rules are not merely convenient, but a communication requirement; the full COMPAS case study and representative rule listings are provided in Supplementary Table S2. Supplementary Figure S5 highlights this qualitative readability gap visually on the Heart Disease rules.

5.10. Boundary Sensitivity and Rule Entropy as an Uncertainty Proxy

We partition test samples into boundary (model prediction probability in [0.35, 0.65]) and confident (model probability < 0.15 or >0.85) groups to analyze performance in different regions of the decision space. Supplementary Table S12 reports fidelity and rule entropy separately for the two groups on three representative datasets. Figure 6 visualizes the fidelity breakdown and rule-entropy contrast for three representative datasets.
Boundary samples are identified using RandomForest predict_proba (boundary band = [0.35, 0.65]; confident outside [0.15, 0.85]). Mean rule entropy at boundaries is consistently higher than at confident samples—H_boundary/H_confident ratio across the 8/12 datasets satisfying H_confident > 0.01 bits ranges from 1.5× to 47.2× (mean 12.2×; Wine, Breast Cancer, and other cleanly separable datasets are excluded as their confident-set entropy is essentially 0). XGBoost robustness check on the same 8 datasets confirms the boundary > confident pattern (mean ratio ≈ 9.4×; Supplementary Table S12).
Rule Entropy as Uncertainty Indicator. To quantify the relationship between rule entropy and model uncertainty, we compute the Spearman rank correlation between per-sample model prediction entropy (H_model) and rule entropy (H_rules) across all classification datasets.
Two average values are reported: ρ = 0.420 unweighted across all 12 classification datasets (per-dataset values in Supplementary Table S13, unrounded mean 0.4203), and ρ = 0.458 excluding Ionosphere (the unique dataset violating the framework K^k ≪ N applicability condition; Section 4.7/Conjecture 1; framework-applicability filter rather than a post hoc data-driven exclusion). Either number is reported unambiguously where used (0.420 for the full 12-dataset claim, including Ionosphere as the documented stress-test failure case; 0.458 with the K^k ≪ N filter).
Comparison against probability-entropy and margin baselines. We evaluate rule entropy alongside two simple probability-based baselines, probability entropy (Shannon entropy of model predict_proba) and margin (1 − max class probability), as misclassification detectors on the same 12-classification benchmark. Mean ROC-AUC across the 11 datasets with non-zero error rate (Wine has 0% error, so its AUC is undefined): rule entropy 0.586 (std 0.151), probability entropy 0.766 (std 0.115), margin 0.766 (std 0.115). Margin and probability entropy yield identical ROC-AUC in binary classification because both are monotone functions of max_c p_c and therefore rank samples identically. The raw rule-entropy AUC is therefore weaker than probability entropy on this metric. The operational value of rule entropy is qualitative rather than quantitative: it carries a rule-level explanation of which IF–THEN rules disagree (see the COMPAS walkthrough in Section 5.5), a granular diagnostic that probability-based scalars cannot provide. We accordingly frame rule entropy as an interpretable rule-level uncertainty proxy that complements, rather than replaces, standard probability-entropy and margin signals. The full per-dataset breakdown is reported in Supplementary Section S.P (Table S22).
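The statement that margin and probability entropy behave identically as binary detectors can be illustrated with a small rank-based AUC computation. All names and the toy probabilities below are invented for the example; the AUC is computed as the Mann-Whitney win rate of error samples over correct samples.

```python
import math


def binary_entropy(p):
    """Shannon entropy (bits) of a binary prediction with P(class 1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def margin_score(p):
    """1 minus the maximum class probability."""
    return 1.0 - max(p, 1.0 - p)


def roc_auc(scores, labels):
    """Probability that a positive outranks a negative (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


# Toy black-box probabilities and error indicators (1 = misclassified).
probs = [0.95, 0.60, 0.10, 0.45, 0.80, 0.30]
errors = [0, 1, 0, 0, 1, 0]
auc_entropy = roc_auc([binary_entropy(p) for p in probs], errors)
auc_margin = roc_auc([margin_score(p) for p in probs], errors)
```

Because binary entropy is strictly increasing in the margin on [0, 0.5], the two scores order the samples the same way and the two AUC values coincide; with more than two classes this equivalence no longer holds in general.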
Rule Entropy as a Misclassification Detector. Beyond tracking model uncertainty, rule entropy also serves as a detector of model errors. We define a sample as “flagged” when its rule entropy exceeds the per-dataset median, and as “misclassified” when the black-box model’s prediction disagrees with the true label. The median is computed within each test fold, using only the rule-entropy values of the fold being scored, so there is no cross-fold or training-set leakage; a global median across datasets would conflate per-dataset entropy scales. Table 10 reports the precision (fraction of flagged samples that are truly misclassified) and recall (fraction of misclassified samples that are flagged) across all 12 classification datasets, averaged over 3-fold cross-validation with Random Forest as the black-box model (matching the main-benchmark protocol of Section 5.1).
High-entropy samples are 2.92× more likely to be misclassified than random samples on average (enrichment > 1 on all 12 datasets, Table 10). The recall of 72.7% means that roughly three-quarters of model errors are flagged purely from rule disagreement, without any access to the true labels. This capability is unique to fuzzy rule-based explanations: because multiple rules fire simultaneously with potentially conflicting consequents, the resulting entropy signal naturally emerges without any explicit error-detection mechanism. Neither TreeSurrogate (single deterministic path) nor SHAP/LIME (per-feature attributions) provides an analogous built-in error flag.
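The flagging protocol can be sketched in a few lines. The example below assumes that enrichment is the ratio of detector precision to the base error rate; this reading is consistent with the Default Credit figures reported later in the paper (precision 0.348 at a 17.9% error rate gives roughly 1.94×), but the exact definition behind Table 10 is not restated here. All data are invented.

```python
from statistics import median

def flag_report(rule_entropy, y_pred, y_true):
    """Precision, recall, and enrichment of a median-threshold error flag.

    The threshold is the median rule entropy of the fold being scored,
    so no information from other folds or from training data is used.
    """
    threshold = median(rule_entropy)
    flagged = [h > threshold for h in rule_entropy]
    wrong = [p != t for p, t in zip(y_pred, y_true)]
    if not any(flagged) or not any(wrong):
        raise ValueError("need at least one flagged and one misclassified sample")
    true_pos = sum(f and w for f, w in zip(flagged, wrong))
    precision = true_pos / sum(flagged)
    recall = true_pos / sum(wrong)
    base_error_rate = sum(wrong) / len(wrong)
    # ASSUMPTION: enrichment = precision / base error rate.
    return {
        "precision": precision,
        "recall": recall,
        "enrichment": precision / base_error_rate,
    }

# Toy test fold (hypothetical): 10 samples, 4 of them misclassified.
entropy = [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.0]
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]

report = flag_report(entropy, y_pred, y_true)
print(report)
```

Because the threshold is the fold median, roughly half of the samples are flagged by construction, which is why recall can be high while precision stays bounded by the base error rate times the enrichment.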

5.11. Local Interpretability and Rule Compactness

A key concern with large rule bases is whether a human can realistically use them. Two compactness results address it: (i) Simulatability: using only the top-K fired rules per instance retains 99% of full-ruleset fidelity at K = 1 (0.892) and 99.3% at K = 3 (Supplementary Table S20), so a single dominant rule typically suffices for per-prediction inspection; (ii) Coverage-based selection: a greedy 95%-coverage subset reduces the global rule base from 177 rules to a median of 4 rules per dataset (range 1–28, Supplementary Table S4) while retaining 0.873 fidelity. Either compactness signal is operationally actionable; the full discussion, Supplementary Table S20, and per-dataset numbers are in Supplementary Section S.I.
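A greedy coverage-based selection of the kind summarized above could look like the following sketch. It assumes that a rule "covers" a sample when it fires on that sample and that rules are added in order of marginal coverage gain until the target fraction is reached; the selection criterion actually used for Supplementary Table S4 may differ. Rule names and coverage sets are invented.

```python
def greedy_cover(rule_to_samples, n_samples, target=0.95):
    """Pick rules greedily until `target` of the samples are covered.

    rule_to_samples: dict mapping a rule id to the set of sample indices
    on which the rule fires (ASSUMPTION: firing defines coverage).
    Returns the chosen rule ids in selection order and the coverage reached.
    """
    covered, chosen = set(), []
    remaining = dict(rule_to_samples)
    while len(covered) < target * n_samples and remaining:
        # Rule with the largest number of not-yet-covered samples.
        best = max(remaining, key=lambda r: len(remaining[r] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining rule adds coverage
        covered |= gain
        chosen.append(best)
    return chosen, len(covered) / n_samples

# Toy rule base (hypothetical): 5 rules over 20 samples.
rules = {
    "R1": set(range(0, 12)),
    "R2": set(range(8, 18)),
    "R3": set(range(16, 20)),
    "R4": set(range(0, 4)),
    "R5": {18},
}
selected, coverage = greedy_cover(rules, n_samples=20, target=0.95)
print(selected, coverage)
```

Greedy selection is the standard approximation for set-cover problems; it does not guarantee the smallest possible subset, only a reasonably small one.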

5.12. Comparison Against Modern Fuzzy Explanation Methods

Implementation provenance caveat. All four modern fuzzy baselines (IFRI, T2F, NFS, WMNF) are our re-implementations from the algorithm descriptions in the original publications, since consolidated reference code was not publicly available at submission time. The comparison should therefore be read as a best-effort matched-protocol reproduction rather than a definitive ranking against author-tuned reference implementations; absolute performance orderings may shift with author-released code. The IFRI fidelity of 0.285 on Bank Marketing is anomalously low and is likely an implementation artifact (the IFRI rule-base size hits its 80-rule cap against a 2187-cell Equation (9) grid, see Supplementary Section S.F for details); we do not claim this represents an upper bound for IFRI. We have invited the original authors to validate or correct our re-implementations as a follow-up reproducibility check (no responses received at submission time). Per-dataset fidelity, stability, rule count, antecedents, and training time for FuzzyRules and all four modern fuzzy baselines are reported in Supplementary Table S5.
Beyond the four non-fuzzy baselines (TreeSurrogate, SHAP, LIME, Anchors), we benchmark FuzzyRules against four contemporary fuzzy rule extraction methods—IFRI, T2F, NFS, and WMNF—on the same 12 classification datasets (single-seed RandomForest, K = 3, min_confidence = 0.4 chosen to lie within the published recommended range of all four baselines (per-baseline sweep not run in this version; full caveat in Supplementary Section S.F)). FuzzyRules attains the highest mean fidelity among the fuzzy methods (0.902 vs. WMNF 0.886, NFS 0.823, T2F 0.728, IFRI 0.668), narrowing the gap to TreeSurrogate to 1.8 pp (0.902 vs. 0.920) at ≈50× shorter training time than WMNF. Under matched tuning budgets (inner two-fold grid search over K and min_confidence), tuned FuzzyRules attains 0.9036 ± 0.0024 multi-seed-averaged fidelity on the five-dataset ablation set vs. TreeSurrogate’s 0.893 (+0.9 pp). Full Table 11, the implementation-note hyperparameters, the tuning-budget details (including the distinction between Table 11 single-seed 0.9022 value and the multi-seed-averaged 0.9036 value), and per-dataset numbers are in Supplementary Section S.F.

5.13. External Validation: Multi-Class and Regression

We test generalization on two external datasets: Iris (multi-class, 3 classes, N = 150) and California Housing (regression, N = 20,640). California Housing is adopted as the primary regression benchmark in place of the deprecated Boston Housing dataset (rationale in Section 6.3, “Dataset considerations”). On Iris, FuzzyRules achieves 0.973 fidelity with perfect stability, a 2 pp gap to TreeSurrogate (0.993) that mirrors the binary/multi-class behavior on the main benchmark. On California Housing, fidelity drops to 0.422 vs. TreeSurrogate’s 0.840, confirming the regression limitation also flagged on the diabetes dataset and discussed under “Continuous regression (California Housing)” in the subsection “When the Framework Fails and Why”. The TSK (Takagi–Sugeno–Kang)-type extension (Section 7.2) is the principled fix for continuous targets. Per-fold details and discussion are in Supplementary Section S.G.

6. Discussion

6.1. Practical Guidelines: When to Use Which Explainer?

Practical guidance derived from the benchmark: choose SHAP/LIME for per-instance feature attributions on a few queries; choose Anchors when high-precision local rules with formal precision guarantees are required (at the cost of 22.8% mean coverage); choose TreeSurrogate (untuned) for the absolute fastest global surrogate (0.37 s training, 0.889 mean fidelity); choose FuzzyRules for linguistic interpretability, rule-level uncertainty signals, or matched-budget fidelity (tuned 0.9036 vs. TreeSurrogate 0.893 on the five-dataset ablation set). Full method vs. use-case decomposition in Supplementary Section S.J.

6.2. FuzzyRules vs. TreeSurrogate: A Direct Comparison

Since TreeSurrogate is the closest competitor to FuzzyRules (both are global, symbolic, rule-based surrogates), we consolidate the head-to-head comparison in Table 11 to clarify the trade-offs.
The comparison is best read through the lens of Pareto dominance. On the three predictive dimensions—fidelity, coverage, stability—TreeSurrogate never beats FuzzyRules with statistical significance: TOST formally certifies practical equivalence within ±5 pp at δ = 0.05 (p ≤ 0.002 for fidelity on the 12-classification subset, p_TOST = 0.001 on all 13, p < 0.001 for stability), and coverage is identical at 100%. On four further dimensions, FuzzyRules brings complementary value that TreeSurrogate does not offer: linguistic labels vs. numeric thresholds, 2.4× lower cognitive load (3.6 vs. 8.6 chunks/rule), standardization-free interpretation, and a rule-level uncertainty proxy (rule entropy with Spearman ρ = 0.420 against model prediction entropy; complementary to probability-entropy and margin baselines, Supplementary Table S22). With matched tuning budgets, FuzzyRules reaches comparable raw fidelity (0.902 vs. 0.893 on the ablation set, Section 5.12). TreeSurrogate strictly dominates on global rule count (8 vs. 175.8) and wall-clock training time (0.366 s vs. 4.728 s untuned); these are real advantages that make TreeSurrogate the preferred choice when a compact, fast-to-train global surrogate suffices and linguistic interpretation or rule-level uncertainty are not required.
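For readers unfamiliar with TOST, the equivalence claim above has the following generic structure. The symbols Δ, p⁻ and p⁺ are notation introduced here for illustration, and the specific one-sided test used to obtain each p-value is not restated in this section, so none is assumed.

```latex
\begin{aligned}
\Delta &= \mu_{\text{FuzzyRules}} - \mu_{\text{TreeSurrogate}}, \qquad \delta = 0.05 \ (\text{5 pp}),\\
H_{0}^{-} &: \Delta \le -\delta \quad \text{vs.} \quad H_{1}^{-} : \Delta > -\delta
  \quad \Rightarrow \quad p^{-},\\
H_{0}^{+} &: \Delta \ge +\delta \quad \text{vs.} \quad H_{1}^{+} : \Delta < +\delta
  \quad \Rightarrow \quad p^{+},\\
p_{\mathrm{TOST}} &= \max\left(p^{-},\, p^{+}\right).
\end{aligned}
```

Equivalence within ±δ is concluded only when both one-sided nulls are rejected, that is, when p_TOST falls below the chosen significance level. A non-significant difference test alone would not support that conclusion.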
Across the thirteen evaluation dimensions in Table 11, FuzzyRules wins on 7, ties on 4, and loses on 2 (untuned wall-clock training time and global rule count); we report the dimension count as an unweighted Pareto-competitive tally rather than a strict Pareto-dominance claim. Neither method strictly Pareto-dominates the other: TreeSurrogate strictly dominates FuzzyRules on training-time and rule-count, and FuzzyRules brings complementary value on linguistic-interpretability and rule-level uncertainty dimensions. The matched-budget tuned FuzzyRules (0.9036) edges untuned TreeSurrogate (0.893) on the ablation set, closing the untuned-fidelity gap. The two methods are best positioned as complementary tools in the practitioner’s XAI toolbox rather than as one being a superior replacement for the other: TreeSurrogate is preferable when compactness and training-time matter most, FuzzyRules when linguistic labels, soft boundary handling, and a rule-level uncertainty proxy add operational value.
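The distinction between a dimension tally and strict Pareto dominance can be made concrete with a small check. The sketch below uses only three of the thirteen dimensions, with the values quoted in the text (cognitive chunks per rule, global rule count, untuned training time); the function and dictionary names are hypothetical.

```python
def dominates(a, b, higher_is_better):
    """True if `a` Pareto-dominates `b`.

    `a` must be at least as good as `b` on every dimension and strictly
    better on at least one. `higher_is_better` maps each dimension name
    to True (maximize) or False (minimize).
    """
    strictly_better = False
    for dim, higher in higher_is_better.items():
        # Flip the sign of minimized dimensions so that larger is better.
        va, vb = (a[dim], b[dim]) if higher else (-a[dim], -b[dim])
        if va < vb:
            return False
        if va > vb:
            strictly_better = True
    return strictly_better

# Values quoted in the text; all three dimensions are minimized.
sense = {"chunks_per_rule": False, "rule_count": False, "train_seconds": False}
fuzzy_rules = {"chunks_per_rule": 3.6, "rule_count": 175.8, "train_seconds": 4.728}
tree_surrogate = {"chunks_per_rule": 8.6, "rule_count": 8, "train_seconds": 0.366}

fuzzy_dominates = dominates(fuzzy_rules, tree_surrogate, sense)
tree_dominates = dominates(tree_surrogate, fuzzy_rules, sense)
print(fuzzy_dominates, tree_dominates)
```

Each method is strictly better on at least one of these dimensions, so neither dominates the other, which is the situation the win/tie/loss tally summarizes.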

6.3. Limitations and Threats to Validity

Fidelity gap. FuzzyRules does not surpass TreeSurrogate in overall fidelity. While the TOST equivalence test confirms practical equivalence (p = 0.002, δ = 0.05), practitioners who prioritize raw fidelity above all other considerations should prefer TreeSurrogate. The gap is more pronounced on regression tasks, where grid-based fuzzy systems have inherently lower approximation capacity for continuous outputs.
Rule count. The global rule base (mean 175.8 rules) is larger than TreeSurrogate’s 17.3 leaves. However, our simulatability analysis shows that a single top-firing rule retains 99% of full-ruleset fidelity, and our coverage-based selection shows that only four rules cover 95% of samples. Each rule contains only 3.6 antecedents on average (3.6 cognitive chunks, well within Miller’s 7 ± 2 limit). Nevertheless, the total rule count may concern practitioners expecting a compact global rule set, and further rule compression techniques could improve practical appeal.
Computational cost. Training is 13× slower than TreeSurrogate (including its hyperparameter tuning) in aggregate, driven by combinatorial grid enumeration on high-dimensional datasets (e.g., Spambase with 57 features). This one-time cost is acceptable for offline model documentation but may be prohibitive for real-time model monitoring applications requiring frequent re-explanation.
Stability protocol. Stability uses a single perturbation magnitude (σ = 0.01 [19]); multi-magnitude sensitivity is future work.
Dataset bias. Four of the 13 datasets are classical easy UCI sets; harder benchmarks (CICIDS, Kaggle imbalanced) are future work.
Interpretability claim. The interpretability advantage rests on (i) the cognitive-load comparison of Section 5.5, in which FuzzyRules averages 3.61 antecedents per rule vs. TreeSurrogate’s 8.6 cognitive chunks per condition, with the FuzzyRules figure staying within the working-memory limits discussed by Miller [54] and Cowan [58], and (ii) the structural property that linguistic labels (“age IS High”) do not require feature-scaling or standardization knowledge to read. Empirical user studies with pre-registered hypotheses, larger N, and multi-institution recruitment are left to follow-up work; the present paper does not advance an empirical user-preference claim. Interpretability is itself a contested construct, and the boundary between “interpretable”, “explainable”, and “transparent” is the subject of an ongoing methodological debate [59]; we ground the claim here on quantifiable cognitive-load proxies rather than on a universal interpretability metric.
Regression performance. Fidelity on the diabetes regression task (0.741) is notably lower than on classification tasks (mean 0.889; Boston Housing was removed from the benchmark—see “Dataset considerations” below). Grid-based fuzzy systems inherently discretize the output space, limiting approximation capacity for continuous targets. TSK-type fuzzy systems [60] with linear consequents may address this limitation.
Dataset considerations. We use California Housing as the primary regression benchmark (external-validation results reported in Section 5.13), replacing the deprecated Boston Housing dataset (Harrison & Rubinfeld [61]; retained as a contextual citation only, since the dataset itself is no longer used in any analysis). Boston Housing was removed from scikit-learn in v1.2 over documented ethical concerns regarding a variable encoding racial demographics (scikit-learn issue #16155 and the v1.2 release notes describe the deprecation rationale, citing the racial-demographic bias in the original B feature). The scikit-learn diabetes regression benchmark is retained as a smaller secondary dataset within the main benchmark.
External validity. Our evaluation covers 13 datasets from diverse domains. While this is more extensive than prior fuzzy XAI studies, generalization to all domains and data types (e.g., text, images, time series) requires further investigation.
Type-1 vs. Type-2 fuzzy choice. We adopt Type-1 fuzzy sets rather than IT2/general Type-2 (which model uncertainty about membership values themselves, often argued as the natural carrier of explainable uncertainty [39]) for three reasons. (a) The rule entropy signal of Section 4.6 already emerges from rule-firing-strength variability in Type-1 systems; FOU (Footprint of Uncertainty) would not enrich it qualitatively (the ρ = 0.420 calibration in Section 5.10 shows the Type-1 signal is practically usable). (b) IT2 antecedents (“medium-with-uncertainty”) are hypothesized to be less interpretable than Type-1 linguistic labels (we are not aware of a published controlled user study comparing the two head-to-head; a Type-1 vs. IT2 interpretability comparison is left to future work—see Section 7.2) because the FOU adds a second numeric parameter that defeats the cognitive-chunk advantage of Section 5.5. (c) Type-2 inference roughly doubles per-rule evaluation cost, pushing the framework past the real-time per-explanation latency budget. A direct Type-1-vs-IT2 comparison on the same 13-dataset benchmark is left as future work.

When the Framework Fails and Why

Four failure regimes deserve explicit discussion because they bound the practical scope of the proposed framework.
Class-imbalanced regression-like classification (Default Credit). On Default Credit (78% majority class), Random Forest attains 82.1% test accuracy (error 17.9%, below the 22% error rate of the always-predict-majority baseline, whose accuracy is 78%). The rule-entropy correlation is suppressed by the class imbalance to ρ = 0.115, the lowest among the 11 datasets satisfying the K^k ≪ N condition (Ionosphere ρ = 0.004 is the explicit stress-test failure case discussed above). Despite this, the misclassification detector still functions on Default Credit, with a precision of 0.348, a recall of 0.895, and a 1.94× enrichment over the base error rate. Class-imbalance-aware rule weighting is left to future work; per-fold detail is in ‘failure_cases_perfold.csv’ in the released code repository.
High dimensionality with small sample size (Ionosphere). Ionosphere has D = 34 features and N_total = 351 samples (80/20 split → N_train = 281, N_test = 70). Equation (9) yields k = 4 on N_train (consistent with the Section 4.3 per-dataset list); the boundary subset N_boundary ≈ 71 is the count of training-set samples lying in the decision-boundary band (model probability ∈ [0.35, 0.65]) used to estimate rule-entropy correlation in Section 5.10 (it is a subset of N_train, not N_test). Against this N_boundary, K^k = 81 already exceeds the boundary subset (ratio ≈ 1.14), violating the K^k ≪ N regime that Conjecture 1 requires. The fragmented rule firing patterns yield ρ = 0.005 (95% bootstrap CI ≈ [−0.23, 0.24] at N = 71; this confidence interval is wide, and the p ≈ 0.97 result is therefore power-limited rather than a precise zero-effect estimate), the only non-significant correlation in our experiments. Datasets with D > 30 and N < 200 should be approached with extreme caution or pre-projected to a lower-dimensional subspace. Per-fold breakdown for this and the Default Credit failure case is included as ‘failure_cases_perfold.csv’ in the released code repository.
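The applicability arithmetic for Ionosphere can be reproduced directly. The sketch below computes the grid-cell count (K raised to the power k) and its ratio to the sample count; K = 3 is inferred from the reported cell count of 81 at k = 4, and the "clearly violated" test (ratio of at least 1) is a simplification introduced here, since the text gives no numeric threshold for "much smaller than".

```python
def grid_cells(K, k):
    """Number of cells in a grid of K fuzzy sets over k selected features."""
    return K ** k

def applicability_ratio(K, k, n_samples):
    """Grid cells per available sample; the framework needs this well below 1."""
    return grid_cells(K, k) / n_samples

# Ionosphere values from the text: k = 4 and N_boundary of about 71.
# ASSUMPTION: K = 3, inferred from the reported cell count of 81.
ratio = applicability_ratio(K=3, k=4, n_samples=71)

# Hypothetical simplification: a ratio >= 1 certainly violates the
# "cells much fewer than samples" condition; smaller ratios need judgment.
clearly_violated = ratio >= 1

print(grid_cells(3, 4), round(ratio, 2), clearly_violated)
# The same function reproduces the K = 5, top-k = 7 grid size
# quoted in the continuous-regression discussion.
print(grid_cells(5, 7))
```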
Continuous regression (California Housing). Three structural mechanisms together make grid-based Wang–Mendel rule extraction unsuitable for continuous regression targets. (i) Target discretization. Each fired rule deterministically maps to one of at most K consequent values per fold; the surrogate’s output space is therefore bounded above by K × (number of unique training-set rule activations). On California Housing, this caps the output at a coarse grid of plausible house-price predictions whose density cannot match the dense, real-valued target distribution. (ii) Simple Mamdani-type consequents. Each rule emits a fixed centroid value (for regression) or a crisp class label (for classification), unlike TSK rule consequents whose output is a learned linear function of the antecedent variables. The fidelity gap on California Housing (0.422 vs. TreeSurrogate 0.840, a 41.8 pp gap; 3 pp on scikit-learn diabetes) reflects exactly this expressivity gap. (iii) K and top-k constraints. Even with K = 5 and top-k = 7 (the maximum we tested), the cell-grid count K^k = 5^7 = 78,125 still concentrates rule firing on a coarse partition of the input space. Increasing K linearly is not a fix: rule count grows as K^k while marginal fidelity gain saturates beyond K = 3 (Section 5.6 ablation). The principled remedy is the TSK extension outlined in Section 7.2, whose linear consequent functions remove the output-space discretization ceiling; until that variant is implemented and validated, practitioners explaining regression models should prefer TreeSurrogate.
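The expressivity gap in mechanism (ii) can be stated compactly by writing the two rule forms side by side. The notation below (rule index r, antecedent sets A, constants c, coefficients a, firing strengths w) is standard textbook notation introduced here for illustration; how the framework computes and aggregates firing strengths is specified in its method section and is not assumed.

```latex
\begin{aligned}
\text{Constant consequent:}\quad
  & R_r:\ \text{IF } x_1 \text{ IS } A_{r,1} \text{ AND } \dots \text{ AND } x_k \text{ IS } A_{r,k}
    \ \text{THEN } y = c_r,\\[4pt]
\text{First-order TSK:}\quad
  & R_r:\ \text{IF } x_1 \text{ IS } A_{r,1} \text{ AND } \dots \text{ AND } x_k \text{ IS } A_{r,k}
    \ \text{THEN } y = a_{r,0} + \sum_{j=1}^{k} a_{r,j}\, x_j,\\[4pt]
\text{TSK output:}\quad
  & \hat{y}(x) = \frac{\sum_{r} w_r(x)\,\bigl(a_{r,0} + \sum_{j=1}^{k} a_{r,j}\, x_j\bigr)}{\sum_{r} w_r(x)}.
\end{aligned}
```

With a constant consequent, each rule contributes a single value regardless of where in its cell the input lies. With a linear consequent, the rule's contribution varies inside the cell, which is what removes the discretization ceiling described above.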
Multi-feature interaction without axis-aligned decomposition (XOR-like patterns). The boundary-aware partitioning of Section 4.2 is by construction a one-dimensional refinement: a univariate decision tree is fitted to model predictions on each feature axis independently. When the black-box decision surface depends on a joint multi-feature interaction that admits no axis-aligned marginal decomposition—the XOR pattern y = 1 iff sign(x1) = sign(x2) being the canonical case—FuzzyRules fidelity drops to 0.736 against a 0.987 RandomForest accuracy on synthetic stress tests (Supplementary Section S.O, Table S21), a 25.1 pp gap that vanilla Wang–Mendel does not narrow (0.752) and that TreeSurrogate handles only inconsistently (mean 0.839 with fold-level volatility down to 0.540). In contrast, the same battery shows that fidelity stays within 3.4–6.2 pp of the black-box accuracy when the boundary is a rotated linear combination (axis-aligned through 45° rotation) or a higher-dimensional mixture with five informative directions mixed by a random orthonormal rotation, confirming that the structural limitation is genuine cross-feature interaction rather than mere non-axis-alignment of an otherwise smooth boundary. The principled remedy is a TSK extension whose rule consequents can carry multivariate interaction terms (Section 7.2).
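Why per-feature partitioning cannot see an XOR-type boundary can be shown on a tiny symmetric grid. The sketch below evaluates every single-threshold split on each feature separately and reports the best achievable accuracy; the depth-1 stump is a simplification of the univariate trees described above, and the grid data are invented.

```python
def best_single_feature_accuracy(points, labels, feature):
    """Best accuracy of a one-threshold split on a single feature.

    Each side of the split predicts its own majority class.
    """
    values = sorted({p[feature] for p in points})
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = 0.0
    for t in thresholds:
        correct = 0
        for left_side in (True, False):
            side = [y for p, y in zip(points, labels)
                    if (p[feature] <= t) == left_side]
            if side:
                ones = sum(side)
                correct += max(ones, len(side) - ones)
        best = max(best, correct / len(labels))
    return best

# XOR-type labels on a symmetric grid: y = 1 iff sign(x1) == sign(x2).
grid = [-2.0, -1.0, 1.0, 2.0]
points = [(x1, x2) for x1 in grid for x2 in grid]
labels = [int((x1 > 0) == (x2 > 0)) for x1, x2 in points]

acc_x1 = best_single_feature_accuracy(points, labels, feature=0)
acc_x2 = best_single_feature_accuracy(points, labels, feature=1)
print(acc_x1, acc_x2)
```

For any fixed value of one feature the two classes are perfectly balanced, so no threshold on that feature alone carries information about the label. The same holds for deeper trees restricted to a single feature, which is why the limitation is structural rather than a matter of tree depth.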

7. Conclusions and Future Work

The contributions reported in this paper apply to tabular classification black-box models. The boundary-aware partitioning (Section 4.2), the rule-entropy uncertainty proxy (Section 4.6), and the 13-dataset 4-model 5-explainer benchmark (Sections 5.1–5.13) are all validated for tabular classification. On continuous regression, the framework is materially weaker (California Housing fidelity drops to 0.422 vs. TreeSurrogate 0.840; Section 5.13 and the subsection “When the Framework Fails and Why”), and on pure multi-feature interaction patterns (XOR-like; Supplementary Section S.O) it is structurally limited to roughly 0.74 fidelity at near-perfect black-box accuracy. The TSK extension outlined in Section 7.2 is the principled remedy for both regimes. Until that extension is implemented and validated, practitioners should treat the framework as classification-focused.

7.1. Summary of Contributions

This paper presented a fuzzy rule-based explanation framework with three main contributions plus one supporting asymptotic intuition: (i) prediction-boundary-aware fuzzy partitioning lifts ablation-set fidelity from a vanilla Wang–Mendel baseline of 0.736 to 0.893 (+10.4 pp excluding the Breast Cancer outlier; +15.7 pp aggregate including it, both transparently reported; component-wise breakdown in Table S15); (ii) rule entropy provides a zero-cost rule-level uncertainty proxy (Spearman ρ = 0.420 with model prediction entropy across 12 classification datasets, significant on 11/12; macro-F1 = 0.424 as a misclassification detector; complementary to probability-entropy and margin baselines, Supplementary Table S22); (iii) a 13-dataset, 4-model, 5-explainer benchmark (plus two external-validation datasets) with Wilcoxon, Friedman, TOST and effect-size analyses positions FuzzyRules as Pareto-competitive on most interpretability dimensions against TreeSurrogate and competitive with state-of-the-art fuzzy surrogates on fidelity, with cross-explainer comparison reported in grouped form (full-coverage surrogates vs. local-coverage methods, Section 5.2). Conjecture 1 (Section 4.7) is provided as a supporting asymptotic intuition consistent with the empirical across-dataset K-monotone trend, not a contribution claim, pending a fully formal proof of feature-selection consistency. A coverage caveat applies throughout: Anchors’s reported 0.963 fidelity in Table 4 is conditional on its 22.8% coverage (Table 4 Fidelity × Coverage column) and is not directly comparable to FuzzyRules’s full-coverage 0.878 fidelity.

7.2. Future Work

We identify the following directions for future research:
  • Rule compression via multi-objective optimization: Investigate whether NSGA-II–style multi-objective genetic selection (jointly minimizing rule count and maximizing fidelity) can compress the global rule base by ≥50% without measurable fidelity loss on the 12-classification benchmark.
  • Empirical user studies: Larger-scale controlled experiments with domain experts (physicians, financial analysts) across diverse expertise levels and culturally varied populations would strengthen the generalizability of these findings.
  • TSK-type fuzzy systems for regression: Adopting Takagi–Sugeno–Kang (TSK) [60] rules with linear consequent functions could significantly improve regression fidelity while maintaining linguistic antecedents.
  • Multi-class extension: Our main benchmark includes one multi-class dataset (Wine, 3 classes; FuzzyRules fidelity 0.942), and external validation adds Iris (3 classes; FuzzyRules fidelity 0.973, Section 5.13). Analyzing class-specific rule entropy patterns in multi-class settings could reveal which classes the model finds most confusable, providing additional diagnostic insights.
  • Online and incremental explanation: Developing mechanisms for incrementally updating the fuzzy rule base as new data arrives, enabling real-time explanation in streaming scenarios.
  • Integration with fairness analysis: Examining how fuzzy rules can expose and explain potential biases in black-box models, particularly in sensitive domains such as criminal justice (COMPAS) and credit scoring.

7.3. Concluding Remarks

Our results position fuzzy rule extraction as a complementary XAI tool, particularly suited to domains where linguistic interpretability and uncertainty awareness are valued—such as healthcare, finance, and regulatory compliance.
The source code and all experimental results are publicly available at https://github.com/ahmettezcantekin/fuzzy-rule-extraction (release tag v1.1, accessed on 31 May 2026).

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16125896/s1, Figure S1: Computation time comparison across explanation methods (log scale); Figure S2: Ablation study—impact of individual components on fidelity; Figure S3: Simulatability—fidelity as a function of the number of top-K fired rules per prediction; Figure S4: Rule entropy vs. model prediction entropy scatter plots for six stratified datasets; Figure S5: Top-10 FuzzyRules rules on the Heart Disease dataset; Table S1: Per-dataset cognitive chunk counts; Table S2: Representative rules from the COMPAS recidivism dataset; Table S3: Per-dataset fidelity (mean ± std, 3-fold CV); Table S4: Greedy coverage-based rule selection (per dataset); Table S5: Modern fuzzy baselines—per-dataset fidelity; Table S6: Hyperparameter sensitivity—per-dataset fidelity for K, min_conf, top-k, weighting; Table S7: Tuned FuzzyRules with grid search budget; Table S8: Empirical scalability on synthetic data; Table S9: Random seed sensitivity; Table S10: Pairwise Wilcoxon signed-rank tests with Holm step-down correction, TOST, and effect sizes; Table S11: Stability per (explainer × black-box model) cell; Table S12: Fidelity and rule entropy for boundary vs. confident samples; Table S13: Per-dataset Spearman rank correlation between model prediction entropy and rule entropy; Table S14: External validation on multi-class (Iris) and regression (California Housing) datasets; Table S15: Cumulative leave-one-in framework ablation on the 5 ablation datasets; Table S16: Multi-seed tuned-ablation fidelity verification; Table S17: Per-protocol TreeSurrogate (and matched FuzzyRules) fidelity values cited in the main text; Table S18: Rule-count summary by pipeline snapshot and aggregation scope; Table S19: Comparison against modern fuzzy explanation methods on 12 classification datasets; Table S20: Simulatability analysis—fidelity achieved using only the top-K fired rules per instance; Table S21: Synthetic stress-test fidelity for non-axis-aligned boundaries and multi-feature interactions; Table S22: Rule entropy vs. probability entropy vs. margin as misclassification detectors; Algorithm S1: FuzzyRules—Boundary-Aware Wang–Mendel Rule Extraction (full pseudocode).

Funding

This research received no external funding.

Data Availability Statement

All datasets used in this study are publicly available. Classification datasets are obtained from the UCI Machine Learning Repository [62] (snapshot accessed 29 April 2026; scikit-learn version 1.5.2 used throughout, including for California Housing and the diabetes regression dataset), COMPAS from ProPublica [55] (we use ProPublica’s released “compas-scores-two-years.csv” via the Dressel & Farid [2] replication code path), Pima Diabetes from [63] (we cite the original Smith et al. study rather than the UCI repository entry because the UCI version is a derivative redistribution of [63] and the citation hierarchy points to the primary source), and the Diabetes dataset (originally distributed alongside [64]; we use the version shipped with scikit-learn). The source code and experimental results are available at https://github.com/ahmettezcantekin/fuzzy-rule-extraction (release tag v1.1, accessed on 31 May 2026).

Acknowledgments

The author acknowledges informal discussions with departmental colleagues at the Istanbul Technical University Department of Management Engineering, whose comments helped sharpen the manuscript framing. The author also acknowledges the foundational textbook references [65,66,67] consulted for the information-theoretic (Section 4.5) and asymptotic-statistical arguments (Section 4.7).

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef]
  2. Dressel, J.; Farid, H. The accuracy, fairness, and limits of predicting recidivism. Sci. Adv. 2018, 4, eaao5580. [Google Scholar] [CrossRef]
  3. Bussmann, N.; Giudici, P.; Marinelli, D.; Papenbrock, J. Explainable machine learning in credit risk management. Comput. Econ. 2021, 57, 203–216. [Google Scholar]
  4. Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  5. European Parliament and Council. Regulation (EU) 2016/679 (General Data Protection Regulation). Off. J. Eur. Union 2016, 59, 294. [Google Scholar]
  6. Goodman, B.; Flaxman, S. European Union regulations on algorithmic decision-making and a “right to explanation”. AI Mag. 2017, 38, 50–57. [Google Scholar]
  7. Wachter, S.; Mittelstadt, B.; Floridi, L. Why a right to explanation of automated decision-making does not exist in the General Data Protection Regulation. Int. Data Priv. Law 2017, 7, 76–99. [Google Scholar] [CrossRef]
  8. Lundberg, S.K.; Lee, S.-I. A unified approach to interpreting model predictions. In Proceedings of the NeurIPS, Long Beach, CA, USA, 4–9 December 2017; pp. 4766–4777. [Google Scholar]
  9. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the ACM SIGKDD, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  10. Ribeiro, M.T.; Singh, S.; Guestrin, C. Anchors: High-precision model-agnostic explanations. In Proceedings of the AAAI, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  11. Craven, M.W.; Shavlik, J.W. Extracting tree-structured representations of trained networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1995; pp. 24–30. [Google Scholar]
  12. Bastani, O.; Kim, C.; Bastani, H. Interpreting blackbox models via model extraction. arXiv 2017, arXiv:1705.08504. [Google Scholar]
  13. Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353. [Google Scholar]
  14. Mamdani, E.H.; Assilian, S. An experiment in linguistic synthesis with a fuzzy logic controller. Int. J. Man-Mach. Stud. 1975, 7, 1–13. [Google Scholar] [CrossRef]
  15. Wang, L.-X.; Mendel, J.M. Generating fuzzy rules by learning from examples. IEEE Trans. Syst. Man Cybern. 1992, 22, 1414–1427. [Google Scholar] [CrossRef]
  16. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth: Belmont, CA, USA, 1984. [Google Scholar]
  17. Fayyad, U.M.; Irani, K.B. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the IJCAI, Chambéry, France, 28 August–3 September 1993; pp. 1022–1027. [Google Scholar]
  18. Lundberg, S.K.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67. [Google Scholar] [CrossRef] [PubMed]
  19. Alvarez-Melis, D.; Jaakkola, T.S. On the robustness of interpretability methods. In Proceedings of the ICML Workshop on Human Interpretability in Machine Learning, Stockholm, Sweden, 10–15 July 2018.
  20. Zafar, M.R.; Khan, N.M. DLIME: A deterministic local interpretable model-agnostic explanations approach for computer-aided diagnosis systems. arXiv 2019, arXiv:1906.10263.
  21. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the ICML, Sydney, Australia, 6–11 August 2017; pp. 3319–3328.
  22. Shrikumar, A.; Greenside, P.; Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the ICML, Sydney, Australia, 6–11 August 2017; pp. 3145–3153.
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates: Red Hook, NY, USA, 2017; pp. 5998–6008.
  24. Jain, S.; Wallace, B.C. Attention is not explanation. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 3543–3556.
  25. Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 3rd ed.; Independently published: Munich, Germany, 2025; Available online: https://christophm.github.io/interpretable-ml-book/ (accessed on 29 April 2026).
  26. Guidotti, R.; Monreale, A.; Ruggieri, S.; Pedreschi, D.; Turini, F.; Giannotti, F. Local rule-based explanations of black box decision systems. arXiv 2018, arXiv:1805.10820.
  27. Friedman, J.H.; Popescu, B.E. Predictive learning via rule ensembles. Ann. Appl. Stat. 2008, 2, 916–954.
  28. Letham, B.; Rudin, C.; McCormick, T.H.; Madigan, D. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Ann. Appl. Stat. 2015, 9, 1350–1371.
  29. Ishibuchi, H.; Nakashima, T.; Murata, T. Performance evaluation of fuzzy classifier systems for multidimensional pattern classification problems. IEEE Trans. Syst. Man Cybern. Part B 1999, 29, 601–618.
  30. Casillas, J.; Cordón, O.; Herrera, F.; Magdalena, L. Interpretability Issues in Fuzzy Modeling; Studies in Fuzziness and Soft Computing; Springer: Berlin, Germany, 2003; Volume 128.
  31. Guillaume, S. Designing fuzzy inference systems from data: An interpretability-oriented review. IEEE Trans. Fuzzy Syst. 2001, 9, 426–443.
  32. Ishibuchi, H.; Yamamoto, T. Rule weight specification in fuzzy rule-based classification systems. IEEE Trans. Fuzzy Syst. 2005, 13, 428–435.
  33. Herrera, F. Genetic fuzzy systems: Taxonomy, current research trends and prospects. Evol. Intell. 2008, 1, 27–46.
  34. Alcalá-Fdez, J.; Alcalá, R.; Herrera, F. A fuzzy association rule-based classification model for high-dimensional problems with genetic rule selection and lateral tuning. IEEE Trans. Fuzzy Syst. 2011, 19, 857–872.
  35. Setiono, R.; Leow, W.K. FERNN: An algorithm for fast extraction of rules from neural networks. Appl. Intell. 2000, 12, 15–25.
  36. Huysmans, J.; Dejaeger, K.; Mues, C.; Vanthienen, J.; Baesens, B. An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decis. Support Syst. 2011, 51, 141–154.
  37. Pancho, D.P.; Alonso, J.M.; Cordón, O.; Quirin, A.; Magdalena, L. Fingrams: Visual representations of fuzzy rule-based inference for expert analysis of comprehensibility. IEEE Trans. Fuzzy Syst. 2013, 21, 1133–1149.
  38. Loyola-González, O.; Medina-Pérez, M.A.; Martínez-Trinidad, J.F.; Carrasco-Ochoa, J.A.; Monroy, R.; García-Borroto, M. PBC4cip: A new contrast pattern-based classifier for class imbalance problems. Knowl.-Based Syst. 2017, 115, 100–109.
  39. Mendel, J.M. Explainable Uncertain Rule-Based Fuzzy Systems, 3rd ed.; Springer: Cham, Switzerland, 2024.
  40. Pickering, L.; Cohen, K.; De Baets, B. A Narrative Review on the Interpretability of Fuzzy Rule-Based Models from a Modern Interpretable Machine Learning Perspective. Int. J. Fuzzy Syst. 2025, in press.
  41. Pedrycz, W. (Ed.) Machine Learning and Granular Computing: A Synergistic Design Environment; Springer: Cham, Switzerland, 2024.
  42. Alateeq, M.; Pedrycz, W. Logic-oriented fuzzy neural networks: A survey. Expert Syst. Appl. 2024, 257, 125120.
  43. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the ICML, New York, NY, USA, 19–24 June 2016; pp. 1050–1059.
  44. Gal, Y.; Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems; Curran Associates: Red Hook, NY, USA, 2016; pp. 1019–1027.
  45. Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic Learning in a Random World; Springer: New York, NY, USA, 2005.
  46. Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight uncertainty in neural networks. In Proceedings of the ICML, Lille, France, 6–11 July 2015; pp. 1613–1622.
  47. Slack, D.; Hilgard, S.; Jia, E.; Singh, S.; Lakkaraju, H. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, New York, NY, USA, 7–8 February 2020; pp. 180–186.
  48. Zhang, Y.; Song, K.; Sun, Y.; Tan, S.; Udell, M. “Why should you trust my explanation?” Understanding uncertainty in LIME explanations. arXiv 2019, arXiv:1904.12991.
  49. Vilone, G.; Longo, L. Notions of explainability and evaluation approaches for explainable artificial intelligence. Inf. Fusion 2021, 76, 89–106.
  50. Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. 2018, 51, 1–42.
  51. Zhu, X.; Wang, D.; Pedrycz, W.; Li, Z. Fuzzy Rule-Based Local Surrogate Models for Black-Box Model Explanation. IEEE Trans. Fuzzy Syst. 2023, 31, 2056–2064.
  52. Ouifak, H.; Idri, A. A comprehensive review of fuzzy logic based interpretability and explainability of machine learning techniques across domains. Neurocomputing 2025, 647, 130602.
  53. Klement, E.P.; Mesiar, R.; Pap, E. Triangular Norms; Trends in Logic; Springer: Dordrecht, The Netherlands, 2000; Volume 8.
  54. Miller, G.A. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol. Rev. 1956, 63, 81–97.
  55. Angwin, J.; Larson, J.; Mattu, S.; Kirchner, L. Machine bias: There’s Software Used Across the Country to Predict Future Criminals. And It’s Biased Against Blacks. ProPublica, 23 May 2016. Available online: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (accessed on 29 April 2026).
  56. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
  57. Schuirmann, D.J. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J. Pharmacokinet. Biopharm. 1987, 15, 657–680.
  58. Cowan, N. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behav. Brain Sci. 2001, 24, 87–114.
  59. Lipton, Z.C. The mythos of model interpretability. Queue 2018, 16, 31–57.
  60. Takagi, T.; Sugeno, M. Fuzzy identification of systems and its applications to modeling and control. IEEE Trans. Syst. Man Cybern. 1985, SMC-15, 116–132.
  61. Harrison, D.; Rubinfeld, D.L. Hedonic housing prices and the demand for clean air. J. Environ. Econ. Manag. 1978, 5, 81–102.
  62. Dua, D.; Graff, C. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. 2017. Available online: https://archive.ics.uci.edu/ml (accessed on 29 April 2026).
  63. Smith, J.W.; Everhart, J.E.; Dickson, W.C.; Knowler, W.C.; Johannes, R.S. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care; IEEE Computer Society Press: Washington, DC, USA, 1988; pp. 261–265.
  64. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499.
  65. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2006.
  66. van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 2000.
  67. Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer: New York, NY, USA, 1996.
Figure 1. Overview of the proposed fuzzy rule-based explanation framework. Given a trained black-box model and training data, the framework produces a set of linguistically interpretable IF–THEN rules with a rule-level uncertainty indicator.
Figure 2. Intuition behind prediction-boundary-aware refinement. (a) Percentile-based centers may straddle the decision boundary, causing mixed-class membership and low rule confidence. (b) Boundary-aware centers align with the model’s decision regions, so each fuzzy set predominantly captures a single class, yielding higher rule confidence and fidelity.
Figure 3. Radar chart comparing all five explanation methods across key metrics. FuzzyRules provides the most balanced profile across fidelity, coverage, stability, and interpretability dimensions. Note: Anchors' fidelity vertex (0.963) is computed only on the 22.8% of test samples it covers (Table 4 footnote); on the full test set, Anchors' effective fidelity is substantially lower. Readers comparing fidelity vertices across methods should treat the Anchors vertex as conditional on coverage.
Figure 4. Critical difference diagram for fidelity (Friedman–Nemenyi post hoc test). Methods connected by a horizontal bar are not significantly different at α = 0.05.
Figure 5. Critical difference diagram for stability.
Figure 6. Boundary sensitivity analysis: fidelity breakdown (left) and rule entropy as an uncertainty indicator (right) for boundary vs. confident test samples across three datasets.
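The rule-entropy indicator referred to in Figure 6 can be illustrated with a minimal sketch. It assumes, since this section does not spell out the details, that each fired rule votes for its consequent class with weight firing strength × confidence and that the indicator is the Shannon entropy of the normalized vote distribution. The function name and the example rules are hypothetical.

```python
import math
from collections import defaultdict


def rule_entropy(fired_rules):
    """Shannon entropy (bits) of the class-vote distribution of fired rules.

    fired_rules: iterable of (consequent_class, firing_strength, confidence).
    Assumption: each rule's vote weight is firing_strength * confidence.
    """
    votes = defaultdict(float)
    for cls, strength, conf in fired_rules:
        votes[cls] += strength * conf
    total = sum(votes.values())
    if total == 0:
        return 0.0  # no rule fired; treated as zero entropy here by assumption
    probs = [v / total for v in votes.values() if v > 0]
    return sum(p * math.log2(1.0 / p) for p in probs)


# Hypothetical "confident" sample: all fired rules agree on class 1.
confident = [(1, 0.8, 0.94), (1, 0.5, 0.90)]
# Hypothetical "boundary" sample: fired rules split evenly between two classes.
boundary = [(1, 0.4, 0.90), (0, 0.4, 0.90)]

print(f"confident: {rule_entropy(confident):.3f} bits")
print(f"boundary:  {rule_entropy(boundary):.3f} bits")
```

Under these assumptions the indicator is zero when all fired rules agree and reaches one bit for an even two-class split, which is the qualitative contrast the right panel of Figure 6 describes for confident versus boundary samples.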
Table 1. Comparison of explanation methods along key dimensions. ✓ = supported, ✗ = not supported, ~ = partially supported.
Method | Global | Symbolic | Linguistic | Uncertainty | Coverage | Model-Agnostic
SHAP
LIME~
Anchors~
TreeSurr
FuzzyRules (Ours)
Table 2. Comparison of empirical studies benchmarking fuzzy rule extraction against modern XAI baselines. ✓ = included, ✗ = not included, * = survey/qualitative only. Studies predating SHAP (2017), LIME (2016), and Anchors (2018) necessarily could not have benchmarked against them; the ✗ marks indicate the absence in the published study and the corresponding gap in the historical literature, not an oversight by the original authors.
Study | vs. SHAP | vs. LIME | vs. Anchors | vs. Tree | #Datasets | Fidelity | Stat. Test
Ishibuchi et al. [29] | | | | | 4 | |
Setiono & Leow [35] | | | | | 3 | |
Pancho et al. [37] | | | | | 2 | |
Zhu et al. [51] | | | | | 3 | |
Vilone & Longo [49] * | | | | | | |
Ouifak & Idri [52] * | | | | | | |
This paper | | | | | 13 | |
Table 3. Summary of benchmark datasets. N: number of samples, D: number of features.
Dataset | Domain | N | D | Task | Source
Adult Income | Finance | 48,842 | 14 | Classification | UCI [62]
Bank Marketing | Finance | 45,211 | 16 | Classification | UCI [62]
Breast Cancer | Healthcare | 569 | 30 | Classification | UCI [62]
COMPAS | Justice | 6172 | 11 | Classification | ProPublica [55]
Default Credit | Finance | 30,000 | 23 | Classification | UCI [62]
German Credit | Finance | 1000 | 20 | Classification | UCI [62]
Heart Disease | Healthcare | 303 | 13 | Classification | UCI [62]
Ionosphere | Physics | 351 | 34 | Classification | UCI [62]
Magic Gamma | Physics | 19,020 | 10 | Classification | UCI [62]
Pima Diabetes | Healthcare | 768 | 8 | Classification | [63]
Spambase | Cybersecurity | 4601 | 57 | Classification | UCI [62]
Wine | Food Science | 178 | 13 | Classification | UCI [62]
Diabetes | Healthcare | 442 | 10 | Regression | [64]
Table 4. Overall performance summary (macro-average across 13 datasets, 12 classification + 1 regression, with 95% bootstrap CIs from 10,000 resamples of dataset-level means). Anchors' fidelity (0.963) is computed only on the 22.8% of test samples it covers and on the 7 of 12 classification datasets where it produced valid rules, so it is not directly comparable to the other rows; its coverage-adjusted value (coverage × fidelity ≈ 0.220) is likewise not comparable to FuzzyRules' full-coverage 0.878. SHAP/LIME stability cells are “—” because the perturbation-based metric reflects the wrapped black-box, not attribution-vector stability (both are excluded from the formal tests in Section 5.4/Table S10). The mean over the 12 classification datasets is FuzzyRules 0.889/TreeSurrogate 0.900 (Section 5.3, Table 11).
Method | Fidelity [95% CI] | Coverage [95% CI] | Stability [95% CI] | Rules | Avg Antecedents | Fidelity × Coverage
FuzzyRules (Ours) | 0.878 [0.844, 0.910] | 0.9996 [0.9990, 1.0000] | 0.986 [0.971, 0.996] | 175.8 | 3.61 | 0.878
TreeSurrogate | 0.890 [0.858, 0.919] | 1.0000 [1.0000, 1.0000] | 0.994 [0.986, 0.999] | 17.3 | 3.96 | 0.890
Anchors | 0.963 [0.946, 0.980] | 0.228 [0.120, 0.374] | 0.974 [0.959, 0.987] | 5.0 | 2.87 | 0.220
SHAP1.000
LIME1.000
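The two summary quantities used in Table 4, the coverage-adjusted fidelity and the bootstrap confidence interval over dataset-level means, can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: the percentile bootstrap is assumed (the caption only says "bootstrap"), the function names are hypothetical, and the five input values are the Boundary-Aware fidelities from Table 7, used purely as sample data, so the resulting interval is not a result reported in the paper.

```python
import random
import statistics


def coverage_adjusted_fidelity(fidelity, coverage):
    """Fidelity discounted by the fraction of samples the explainer covers."""
    return fidelity * coverage


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of dataset-level values (assumed variant)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample datasets with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Anchors row of Table 4: fidelity 0.963 at coverage 0.228.
print(f"{coverage_adjusted_fidelity(0.963, 0.228):.3f}")  # 0.220

# Illustrative input only: the five Boundary-Aware fidelities from Table 7.
dataset_fidelities = [0.936, 0.958, 0.851, 0.892, 0.829]
low, high = bootstrap_ci(dataset_fidelities)
print(f"mean = {statistics.fmean(dataset_fidelities):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling whole datasets rather than individual test samples matches the caption's "resamples of dataset-level means" and keeps the interval a statement about variation across datasets.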
Table 5. Computation time (seconds), micro-averaged over (dataset, model, fold) runs. Global methods are trained once; local methods are re-executed per query.
Method | Type | Mean (s) | Std (s) | Median (s)
FuzzyRules (Ours) | Global | 4.728 | 8.448 | 0.285
TreeSurrogate | Global | 0.366 | 0.483 | 0.198
SHAP | Local | 0.079 | 0.252 | 0.050
LIME | Local | 0.035 | 0.030 | 0.020
Anchors | Local | 8.539 | 15.655 | 3.345
Table 6. Ablation study results (RandomForest, 3-fold CV, seed = 42). “Vanilla WM” reproduces the original 1992 Wang–Mendel baseline (K = 2, percentile MFs, min t-norm, min_conf = 0.3, no pruning, no boundary refinement, top-k capped at 7 for tractability). “Full (Ours)” enables all components. Component-wise leave-one-in decomposition in Supplementary Table S15; the Breast Cancer Vanilla outlier (0.415) is explained there as a min t-norm × K = 2 artifact rather than an implementation bug. Column-wise K parameter: the “Vanilla WM” column uses K = 2 as part of the 1992 baseline reproduction; the “Full (Ours)”, “Min T-norm”, and “No Feat. Sel.” columns all use the framework default K = 3.
Dataset | Vanilla WM | K = 2 | Full (Ours) | Min T-Norm | No Feat. Sel.
Adult Income | 0.855 | 0.870 | 0.936 | 0.910 | 0.904
Breast Cancer | 0.415 | 0.954 | 0.958 | 0.847 |
German Credit | 0.836 | 0.871 | 0.851 | 0.842 |
Heart Disease | 0.859 | 0.886 | 0.892 | 0.879 |
Magic Gamma | 0.717 | 0.769 | 0.829 | 0.809 | 0.833
Mean | 0.736 | 0.870 | 0.893 | 0.857 | 0.869
Table 7. Partitioning strategy ablation (Random Forest, 3-fold CV). All four strategies share the same downstream rule extraction (K = 3 fuzzy sets, product t-norm, adaptive feature selection, min_confidence = 0.5); only the placement of membership-function centers differs. Boundary-Aware achieves the highest mean fidelity. Its largest gains over the Percentile baseline occur where the model's decision boundaries are misaligned with input percentiles (Adult Income +5.3 pp, Magic Gamma +6.1 pp); on smaller, well-separated datasets the gain is marginal because the percentile baseline is already near-optimal. K-Means exceeds Boundary-Aware on Magic Gamma (0.856 vs. 0.829): its data-driven cluster placement is competitive when the feature distribution itself is bimodal, which is the case for several Magic Gamma detector-response features. Boundary-Aware remains the best choice on average across the ablation suite, but practitioners with prior knowledge that the input distribution is multi-modal may prefer K-Means.
Dataset | Percentile | Equal-Width | K-Means | Boundary-Aware (Ours)
Adult Income | 0.883 | 0.882 | 0.896 | 0.936
Breast Cancer | 0.956 | 0.938 | 0.954 | 0.958
German Credit | 0.853 | 0.847 | 0.843 | 0.851
Heart Disease | 0.879 | 0.889 | 0.882 | 0.892
Magic Gamma | 0.768 | 0.787 | 0.856 | 0.829
Mean | 0.868 | 0.869 | 0.886 | 0.893
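To make the difference between the two simplest strategies in Table 7 concrete, the sketch below places K = 3 membership-function centers by percentile and by equal width on a skewed feature. Both placement rules are assumptions made for illustration (quartile cut points for the percentile variant, evenly spaced points between the minimum and maximum for the equal-width variant); this section does not give the exact rules the paper uses, and the function names and sample data are hypothetical.

```python
import statistics


def percentile_centers(values, k=3):
    """Centers at the k interior quantiles (25th/50th/75th percentiles for k = 3)."""
    return statistics.quantiles(values, n=k + 1)


def equal_width_centers(values, k=3):
    """Centers evenly spaced between the minimum and maximum (requires k >= 2)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]


# Hypothetical right-skewed feature: most mass is low, with a long upper tail.
feature = [1, 2, 2, 3, 3, 3, 4, 5, 9, 20, 50]

print("percentile :", percentile_centers(feature))
print("equal-width:", equal_width_centers(feature))
```

On skewed data the percentile centers crowd into the dense low range while the equal-width centers spread across the tail; neither looks at the black-box predictions, which is the gap the boundary-aware strategy is described as addressing.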
Table 8. Hyperparameter sensitivity probes (Random Forest, 3-fold CV on 5 ablation datasets). The framework is robust across the recommended ranges; K dominates the fidelity–interpretability trade-off, while the other hyperparameters are essentially flat.
Hyperparameter | Values Tested | Mean Fidelity | Conclusion
Number of fuzzy sets K | 2/3/4/5 | 0.870/0.893/0.898/0.910 | Monotone gain; K = 3 is the interpretability sweet spot (Miller chunks)
Min confidence threshold | 0.3/0.4/0.5/0.6/0.7 | 0.893/0.893/0.893/0.893/0.896 | Insensitive across the [0.3, 0.7] range
Top-k selected features | 3/5/7 | 0.872/0.894/0.895 | Diminishing returns past k = 5; k = 7 is a safe default
Vote weighting | conf^1/conf^2/conf^1·support | 0.893/0.893/0.885 | Linear and squared weighting equivalent; support-weighted slightly worse
Table 9. Representative rules from FuzzyRules vs. TreeSurrogate on the Heart Disease dataset. FuzzyRules uses linguistic labels; TreeSurrogate uses numeric thresholds on standardized features.
FuzzyRules | TreeSurrogate
IF cp IS High AND oldpeak IS Medium AND thal IS High THEN Class 1 (conf = 0.943) | IF thal ≤ −0.16 AND ca ≤ −0.20 AND trestbps ≤ 1.38 AND age ≤ 0.64 THEN Class 0 (covers 61 samples)
IF cp IS Medium AND oldpeak IS Low AND thal IS Low THEN Class 0 (conf = 0.942) | IF thal > −0.16 AND cp > 0.33 AND oldpeak > −0.52 AND thal > 0.87 THEN Class 1 (covers 49 samples)
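As an illustration of how a linguistic rule such as the first FuzzyRules entry in Table 9 is evaluated, the sketch below computes a rule's firing strength under the product and minimum t-norms named in Tables 6 and 7. The rule structure is taken from Table 9; the membership degrees of the sample are hypothetical, and the data layout and function name are assumptions rather than the paper's implementation.

```python
import math

# First FuzzyRules entry from Table 9.
RULE = {
    "antecedents": [("cp", "High"), ("oldpeak", "Medium"), ("thal", "High")],
    "consequent": 1,
    "confidence": 0.943,
}


def firing_strength(rule, memberships, t_norm="product"):
    """Combine the antecedent membership degrees with the chosen t-norm."""
    degrees = [memberships[feature][label] for feature, label in rule["antecedents"]]
    if t_norm == "product":
        return math.prod(degrees)
    if t_norm == "min":
        return min(degrees)
    raise ValueError(f"unknown t-norm: {t_norm}")


# Hypothetical membership degrees of one patient in the relevant fuzzy sets.
sample = {
    "cp": {"High": 0.8},
    "oldpeak": {"Medium": 0.5},
    "thal": {"High": 0.9},
}

print(f"product t-norm: {firing_strength(RULE, sample, 'product'):.2f}")
print(f"min t-norm:     {firing_strength(RULE, sample, 'min'):.2f}")
```

The product t-norm lets every antecedent lower the firing strength, whereas the minimum t-norm keeps only the weakest antecedent, so the same sample can fire a rule quite differently under the two choices.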
Table 10. Misclassification detection via rule entropy (RandomForest, 3-fold CV, matching the main-benchmark protocol in Section 5.1). Precision = misclassified fraction of high-entropy samples; Recall = high-entropy fraction of misclassified samples; Enrichment = precision/base error rate (>1 = errors over-represented in the flagged subset). Macro-F1 = 0.424 is the average of the per-dataset F1 scores (rightmost column); an F1 computed from the column-mean precision and recall is not reported because F1 is non-linear in precision and recall. The indicator is zero-cost: no auxiliary error-detection model is trained. For context, the random-baseline macro-F1 (predicting “misclassified” at a rate equal to each dataset's base error rate) is 0.247 across the same 12 datasets; macro-F1 = 0.424 therefore represents a 1.72× lift over random and is operationally usable as a no-true-label error flag.
Dataset | Prec. | Recall | Err Rate | Enrichment | F1
Breast Cancer | 0.153 | 0.956 | 0.035 | 4.36× | 0.264
Heart Disease | 0.424 | 0.429 | 0.185 | 2.29× | 0.427
Wine | 0.125 | 0.889 | 0.017 | 7.41× | 0.219
Spambase | 0.171 | 0.629 | 0.058 | 2.93× | 0.269
Adult Income | 0.424 | 0.540 | 0.158 | 2.69× | 0.475
Ionosphere | 0.111 | 0.718 | 0.077 | 1.44× | 0.192
Pima Diabetes | 0.513 | 0.887 | 0.238 | 2.15× | 0.650
German Credit | 0.579 | 0.834 | 0.232 | 2.50× | 0.684
Magic Gamma | 0.216 | 0.840 | 0.132 | 1.64× | 0.343
Bank Marketing | 0.397 | 0.418 | 0.106 | 3.76× | 0.407
COMPAS | 0.620 | 0.689 | 0.324 | 1.91× | 0.653
Default Credit | 0.348 | 0.895 | 0.179 | 1.94× | 0.501
Average | 0.340 | 0.727 | | 2.92× | 0.424 (macro-F1)
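The distinction the Table 10 caption draws between macro-F1 and an F1 formed from averaged precision and recall can be checked directly from the table's rounded values. The sketch below is only a recomputation aid: the dictionary layout and function name are hypothetical, and small deviations from the printed figures are expected because the inputs are rounded to three decimals.

```python
import statistics

# (precision, recall) per dataset, copied from Table 10.
TABLE_10 = {
    "Breast Cancer": (0.153, 0.956),
    "Heart Disease": (0.424, 0.429),
    "Wine": (0.125, 0.889),
    "Spambase": (0.171, 0.629),
    "Adult Income": (0.424, 0.540),
    "Ionosphere": (0.111, 0.718),
    "Pima Diabetes": (0.513, 0.887),
    "German Credit": (0.579, 0.834),
    "Magic Gamma": (0.216, 0.840),
    "Bank Marketing": (0.397, 0.418),
    "COMPAS": (0.620, 0.689),
    "Default Credit": (0.348, 0.895),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Macro-F1: average the per-dataset F1 scores.
macro_f1 = statistics.fmean(f1_score(p, r) for p, r in TABLE_10.values())

# The quantity the caption declines to report: F1 of the column-mean P and R.
mean_p = statistics.fmean(p for p, _ in TABLE_10.values())
mean_r = statistics.fmean(r for _, r in TABLE_10.values())
f1_of_means = f1_score(mean_p, mean_r)

print(f"macro-F1           = {macro_f1:.3f}")      # 0.424
print(f"F1 of column means = {f1_of_means:.3f}")   # 0.463
print(f"lift over random   = {macro_f1 / 0.247:.2f}x")
```

The two aggregates differ by about four points here, which is why averaging per-dataset F1 scores and plugging averaged precision and recall into the F1 formula cannot be used interchangeably.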
Table 11. Head-to-head comparison of FuzzyRules vs. TreeSurrogate across all evaluation dimensions. ≈ indicates no statistically significant difference (Wilcoxon p > 0.05).
Dimension | FuzzyRules | TreeSurrogate | Winner
Fidelity (clf, untuned) | 0.889 | 0.900 | ≈ (p_Holm = 0.733; TOST p = 0.002; |d| = 0.29)
Fidelity (all, untuned) | 0.878 | 0.890 | ≈ (TOST p < 0.001; |d| = 0.33)
Fidelity (tuned, ablation) | 0.902 | 0.893 | FuzzyRules (+0.9 pp; Section 5.13)
Coverage | 0.9996 | 1.0000 |
Stability | 0.986 | 0.994 | ≈ (p_W = 0.131; TOST p < 0.001, δ = 0.05)
Cognitive load (chunks) | 3.6 | 8.6 | FuzzyRules (2.4×)
Within Miller’s limit | 12/12 | 7/12 | FuzzyRules
Linguistic labels | Yes | No | FuzzyRules
Standardization-free | Yes | No | FuzzyRules
Uncertainty signal | Rule entropy | None | FuzzyRules
Simulatability (K = 1) | 99% retention | N/A | FuzzyRules
Training time | 4.728 s | 0.366 s | TreeSurrogate (13×)
Global rule count | 175.8 | 17.3 | TreeSurrogate
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tekin, A.T. Fuzzy Rule-Based Explanations for Tabular Black-Box Classifiers: A Comprehensive Empirical Framework with Prediction-Boundary-Aware Partitioning and Rule-Level Uncertainty Indication. Appl. Sci. 2026, 16, 5896. https://doi.org/10.3390/app16125896