Deployment-Oriented Interpretable Fraud Detection via Hybrid Explainable Boosting Machines with Concept–Raw Fusion on the IEEE-CIS Benchmark

Kang, Jeongtae; Kim, Keecheon

doi:10.3390/app16125809

by Jeongtae Kang ¹ and Keecheon Kim ²,*

¹ Intelligent Networks and Security Laboratory, Department of Data Science, Konkuk University, Seoul 05029, Republic of Korea

² Intelligent Networks and Security Laboratory, Department of Computer Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 5809; https://doi.org/10.3390/app16125809 (registering DOI)

Submission received: 22 May 2026 / Revised: 3 June 2026 / Accepted: 5 June 2026 / Published: 9 June 2026

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Fraud detection models often achieve strong ranking performance through black-box ensembles, but operational deployment also requires calibration, low explanation cost, and auditable scoring logic. This study develops an interpretable fraud-detection pipeline for IEEE-CIS by combining a 63-variable causal concept bank with teacher-guided additive Explainable Boosting Machine (EBM) students. The concept bank summarizes temporal state, entity history, novelty/reuse, identity missingness, and aggregate deviation. Experiments use a chronological out-of-time split and a stricter pseudo-entity-disjoint holdout. In the main three-seed evaluation, the CatBoost predictive ceiling and XGBoost teacher achieved PR-AUC 0.489 ± 0.001 and 0.478 ± 0.003, respectively. Among interpretable models, concept-only EBM reached 0.189 ± 0.000, raw-only EBMs reached 0.372 ± 0.005 (top-k = 8) and 0.383 ± 0.002 (top-k = 12), and hybrid EBMs reached 0.407 ± 0.003 (top-k = 8) and 0.407 ± 0.004 (top-k = 12), consistently improving over matched raw-only additive baselines. The final top-k = 8 hybrid reduced input features from 154 to 71, achieved about 9.7× faster inference than XGBoost, remained close to XGBoost in ECE-15 calibration (0.01587 vs. 0.01611) while having a higher Brier score, and produced native local explanations far faster than XGBoost + SHAP. The results position CatBoost as the predictive ceiling and hybrid EBM as a benchmark-supported, deployment-relevant interpretable compromise for applied financial risk-screening workflows, rather than as a production-validated fraud-monitoring system.

Keywords:

fraud detection; interpretable machine learning; explainable boosting machine; RuleFit; SHAP; IEEE-CIS; calibration; low-FPR detection; tabular learning

1. Introduction

Fraud detection is a high-stakes classification problem in which ranking quality matters, but operational usefulness depends just as much on auditability, threshold behavior, and the intelligibility of the scoring logic [1,2]. Tree ensembles often perform well on tabular fraud data, yet their decision surfaces are difficult to inspect directly. Post hoc tools such as SHAP and LIME can help diagnose individual predictions, but they do not turn a black-box system into a model whose decision process is structurally transparent [3,4].

That distinction matters in deployment. Analysts need to know not only why a transaction was flagged but also whether the same rationale is likely to remain stable across thresholds, time slices, and previously unseen entity combinations. Additive interpretable models are appealing in this regard because each predictor contributes through an explicit shape function that can be inspected and audited [5,6]. Their weakness is that they may discard too much signal when the task depends on heterogeneous interactions, historical reuse patterns, and complex missingness structure.

We study that trade-off on the IEEE-CIS Fraud Detection benchmark [7]. Instead of relying solely on raw competition variables, we build a concept bank that summarizes temporal history, entity-level transaction behavior, novelty and reuse patterns, and identity sparsity. Using that representation, we compare interpretable student families against two strong black-box references: XGBoost, which is used as the teacher for feature ranking, soft-target construction, and SHAP-based explanation-cost comparison, and CatBoost, which is retained as an additional predictive ceiling [2,8]. We also benchmark RuleFit as a rule-based interpretable comparator [9].

The central question is therefore not whether an additive model can fully match a strong teacher. The more practical question is whether it can preserve enough discriminatory power to remain operationally plausible while offering a decision structure that is easier to explain, calibrate, and audit. Our results show that a concept-only student is too restrictive, a raw-only additive baseline recovers a substantial share of teacher performance, and a hybrid EBM recovers more while preserving additive transparency. The contribution of this paper is thus a concrete design pattern for interpretable fraud detection on anonymized benchmark data: semantic concept engineering, limited raw-feature carryover for residual signal recovery, and strict time-aware evaluation for realistic assessment.

1.1. Related Work on Fraud Detection, Explainability, and Additive Models

Recent fraud-detection research has been dominated by high-performing black-box learners, including gradient-boosted trees, deep neural networks, and sequence models. On tabular transaction data, tree ensembles remain particularly competitive because they handle heterogeneous feature spaces, missing values, and non-linear interactions well [1,2,7,10,11]. Strong ranking performance alone, however, is not enough in operational fraud screening. Analysts must also understand why a transaction is flagged, whether thresholds remain trustworthy under a shift, and whether the resulting scores can be justified in audit or compliance settings [12,13,14]. Realistic fraud-detection studies further emphasize class imbalance, verification latency, calibration, and streaming deployment constraints [15,16,17].

These deployment concerns are also closely related to concept drift, dataset shift, and out-of-distribution behavior [18,19].

Post hoc explanation methods such as LIME and SHAP are now standard tools for interrogating black-box fraud models [3,4], yet their robustness can be fragile under small input perturbations [20]. They are useful for local diagnosis, but they do not eliminate the fidelity gap between a complex predictor and a retrospective explanation [6,21,22,23,24]. In high-stakes settings, several authors have argued that interpretable models should be preferred whenever feasible, particularly when the decision process must be auditable, stable, and easy to communicate to domain experts [21,22,24]. Counterfactual explanations provide one complementary perspective on actionability, but they do not remove the need for structurally transparent decision models [25]. This argument is especially relevant in fraud detection, where base-rate effects, threshold sensitivity, and severe imbalance make isolated point explanations an incomplete basis for deployment decisions [12,13].

A parallel line of research focuses on intrinsically interpretable additive models, including GAMs, GA2Ms, and modern boosting-based variants such as the Explainable Boosting Machine (EBM) [5,6,26,27], as well as newer neural additive formulations [28]. These models are attractive because they expose shape functions directly, but on difficult real-world benchmarks, they may lose too much predictive signal if the representation is oversimplified. This creates a more interesting design question: rather than enforcing a pure concept bottleneck, can one retain a compact amount of high-value raw information while keeping the final model structurally interpretable?

This study is situated at that intersection. Rather than assuming that one interpretable family must dominate every operational criterion, we compare additive concept-only, raw-only, and concept–raw hybrid students alongside broader baselines such as RuleFit and stronger black-box references such as the XGBoost teacher and CatBoost predictive ceiling. The manuscript is therefore not merely a performance report. It examines how different interpretable families distribute the trade-off among ranking accuracy, calibration, explanation cost, and structural transparency on IEEE-CIS, aligning the work with recent calls for intrinsically interpretable, deployment-aware machine learning [21,22,23,24,26,29,30,31].

1.2. Study Positioning and Contribution Logic

This paper is not intended as a standard leaderboard-style IEEE-CIS comparison. Rather, it is framed as a benchmark-based applied study of interpretable model design for deployment-relevant financial risk screening, asking how one should choose among black-box ceilings, additive interpretable students, and rule-based interpretable baselines when ranking quality, calibration, and explanation efficiency all matter.

(1) We define a time-aware evaluation package that combines a chronological out-of-time split with a stricter pseudo-entity-disjoint holdout, so that additive students are assessed not only for predictive quality but also for leakage-resistant generalization.
(2) We construct a causal 63-variable concept bank that translates anonymized IEEE-CIS fields into interpretable behavioral summaries of temporal state, entity history, novelty and reuse, identity missingness, and aggregate deviation.
(3) We compare sparse linear, concept-only, raw-only, and concept–raw hybrid additive students against the XGBoost teacher and CatBoost predictive ceiling, and we additionally benchmark RuleFit so that the conclusions are not confined to a single black-box reference or a single interpretable family.
(4) We close the evaluation loop by reporting ranking quality, threshold-based F1, low-FPR behavior, calibration, global importance and representative shape functions, computational cost, and explanation latency for XGBoost + SHAP versus native hybrid EBM local explanations, allowing interpretability claims to be judged together with operational plausibility and explanation-side cost.

The novelty of this study does not lie in proposing a new black-box learner, but in integrating causal concept construction, limited raw-feature carryover, teacher-guided additive distillation, strict time-aware validation, and explanation-cost benchmarking into a single deployment-oriented evaluation framework for interpretable fraud detection.

2. Materials and Methods

2.1. Dataset and Prediction Task

The experiments used the IEEE-CIS Fraud Detection benchmark, which provides transaction records and auxiliary identity fields linked by TransactionID [7]. After merging the transaction and identity files, we sorted the merged data by TransactionDT and treated them as a temporally ordered tabular stream. The prediction target was isFraud. Because the benchmark variables are anonymized, many informative predictors are semantically opaque; this motivates the explicit construction of human-readable concepts rather than exclusive reliance on raw variables.

The task is binary fraud classification under severe class imbalance. We therefore emphasize threshold-free ranking metrics and low-false-positive operating characteristics, which more closely reflect practical screening scenarios than accuracy. Core additive-family experiments, CatBoost and RuleFit baselines, and deployment-oriented explanation-cost comparisons were repeated over three fixed seeds under the same chronological split.

2.2. Out-of-Time and Pseudo-Entity-Disjoint Evaluation

We adopted a chronological split based on TransactionDT, using approximately 70% of the earliest observations for training, 15% for validation, and 15% for testing. This protocol blocks future information from entering training and better approximates deployment conditions. In the main chronological out-of-time experiment, this produced 413,378 training rows, 88,581 validation rows, and 88,581 test rows. To probe generalization more aggressively, we also introduced a stricter pseudo-entity-disjoint evaluation. The pseudo-entity key was defined as the concatenation of card1, card2, addr1, and P_emaildomain. These fields were selected because the IEEE-CIS benchmark does not provide a true customer identifier, while this combination captures recurring payment-card, address-region, and payer-domain signals with enough validation/test coverage for a stable robustness test. A weaker key using fewer fields would leave more repeated-entity overlap, whereas a more restrictive key using additional sparse identity fields would remove too many observations. We therefore used this four-field key as a pragmatic balance between overlap removal and sample retention. Validation and test rows whose pseudo-entities had already appeared in earlier splits were removed. Under this stricter setting, the validation and test sets were reduced to 10,393 and 10,376 rows, respectively, making the task materially harder but also more resistant to repeated-entity effects.
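The two evaluation protocols can be sketched in a few lines. The following is a minimal, dependency-free illustration rather than the study's implementation: the function names are hypothetical, rows are plain dictionaries, a tuple stands in for the concatenated four-field key, and treating every validation entity as "already seen" when filtering the test split is an assumption about how "earlier splits" is applied.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]
KEY_FIELDS = ("card1", "card2", "addr1", "P_emaildomain")


def pseudo_entity(row: Row) -> Tuple[object, ...]:
    """Four-field pseudo-entity key (a tuple replaces string concatenation)."""
    return tuple(row.get(field) for field in KEY_FIELDS)


def chronological_split(rows: Sequence[Row], train_frac: float = 0.70,
                        valid_frac: float = 0.15):
    """Sort by TransactionDT and cut into train / valid / test blocks."""
    ordered = sorted(rows, key=lambda r: r["TransactionDT"])
    n_train = round(len(ordered) * train_frac)
    n_valid = round(len(ordered) * valid_frac)
    return (ordered[:n_train],
            ordered[n_train:n_train + n_valid],
            ordered[n_train + n_valid:])


def strict_filter(train: List[Row], valid: List[Row], test: List[Row]):
    """Drop valid/test rows whose pseudo-entity appeared in an earlier split."""
    seen = {pseudo_entity(r) for r in train}
    valid_kept = [r for r in valid if pseudo_entity(r) not in seen]
    # Assumption: all validation entities count as "earlier" for the test split.
    seen |= {pseudo_entity(r) for r in valid}
    test_kept = [r for r in test if pseudo_entity(r) not in seen]
    return valid_kept, test_kept
```

Because the filter only removes rows, the strict validation and test sets are always subsets of their chronological counterparts, which is why the reported strict sets are so much smaller.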

All main comparisons reported in this manuscript were repeated over three seeds under the chronological out-of-time protocol. The pseudo-entity-disjoint protocol is used as a robustness stress test rather than the primary reporting setting; however, to make the leakage-resistant comparison transparent, the strict-holdout table reports the same main model families as the primary comparison, including CatBoost, XGBoost, raw-only EBM, hybrid EBM, and RuleFit.

2.3. Concept Bank

We constructed a 63-variable concept bank to translate raw tabular observations into interpretable behavioral summaries. The bank comprises five groups: (i) temporal-state variables such as day index, hour, weekday, inter-transaction delay, rolling means, rolling standard deviations, and burstiness; (ii) entity-history variables such as prior counts and prior mean transaction amounts for the pseudo-entity and related card/address aggregates; (iii) reuse and novelty variables, including new-device-for-entity, new-email-for-entity, cross-entity reuse of device identifiers, and prior counts for device, email, product, and card/address combinations; (iv) missingness variables summarizing identity sparsity and explicit missing-value indicators for important identity fields; and (v) ratio- and z-score-style deviations from historical baselines. The full 63-variable inventory, including each concept variable, construction rule, raw variables used, concept group, and operational interpretation, is provided in Appendix A Table A1.

All concept variables were computed causally after sorting by TransactionDT so that each row used only information available before the current transaction was incorporated into any history store. For each transaction, rolling, historical, entity-level, device-level, email-level, product-level, card-level, address-level, and combined-identifier features were computed first, and the corresponding history stores were updated only after feature extraction for that row was complete. Entity-level counts were implemented as prior cumulative counts, previous values were obtained through one-step shifts, and historical means and variances were computed by subtracting the current transaction amount from cumulative sums before normalization. Rolling means and standard deviations were computed from shifted histories, so the current transaction was excluded from its own rolling window. The 63 concepts were selected to satisfy three practical criteria: they had to be computable causally from information available before the current transaction, interpretable as an operational fraud-screening signal, and assignable to one of the five predefined behavioral groups rather than being included solely because of downstream test performance.
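The "cumulative sum minus current row" construction described above can be illustrated as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name, the output keys, the window length, and the defaults used when an entity has no history (0.0 or None) are all hypothetical choices.

```python
from collections import defaultdict, deque
from typing import Dict, Hashable, List, Sequence


def causal_history_features(entities: Sequence[Hashable],
                            amounts: Sequence[float],
                            window: int = 3) -> List[Dict[str, object]]:
    """Per-row prior count, previous amount, prior mean, and rolling mean.

    Rows are assumed to be sorted by transaction time already.
    """
    cum_n = defaultdict(int)        # inclusive cumulative count per entity
    cum_sum = defaultdict(float)    # inclusive cumulative amount per entity
    prev_amount = {}                # one-step shifted amount per entity
    recent = defaultdict(lambda: deque(maxlen=window))  # shifted rolling window
    features = []
    for entity, amount in zip(entities, amounts):
        # Inclusive statistics, as a cumulative sum over the stream would give.
        cum_n[entity] += 1
        cum_sum[entity] += amount
        # Subtract the current row to recover strictly prior statistics.
        prior_n = cum_n[entity] - 1
        prior_sum = cum_sum[entity] - amount
        window_vals = recent[entity]  # does not yet contain the current amount
        features.append({
            "prev_count": prior_n,
            "prev_amount": prev_amount.get(entity),
            "prev_mean": prior_sum / max(1, prior_n),
            "rolling_mean": sum(window_vals) / len(window_vals) if window_vals else 0.0,
        })
        # Shifted quantities are updated only after the row's features are stored.
        prev_amount[entity] = amount
        window_vals.append(amount)
    return features
```

The `max(1, prior_n)` guard mirrors the denominator used for the historical mean in the formal definitions, so an entity seen for the first time receives a prior mean of zero rather than a division error.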

2.4. Teacher and Student Models

The final study uses two black-box reference baselines. XGBoost serves as the teacher for raw-feature ranking, soft-target construction, and the deployment-oriented SHAP comparison, whereas CatBoost is used as an additional predictive ceiling. The interpretable candidates consisted of a sparse linear student, a concept-only EBM, raw-only EBMs using teacher-selected top-k raw variables, hybrid EBMs combining the concept bank with those top-k raw variables, and a RuleFit baseline trained on the same compact feature set used for the final additive deployment candidate. Top-k raw variables were selected separately for each seed from the XGBoost teacher trained on the training partition only. We used the teacher’s built-in feature_importances_ values, sorted them in descending order, removed concept-bank variables from the ranking, and retained the highest-ranked raw variables for each k. No validation or test labels were used in raw-feature selection. Because the ranking was recomputed within each seed, the top-k raw set reflects seed-specific teacher fitting while preserving the same training-only selection rule. We also inspected the seed-wise top-k raw-feature lists and report the final selected raw variables in Supplementary File S1; therefore, the main manuscript interprets top-k as a compact teacher-guided raw-feature budget rather than as a fixed universal feature subset.
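The training-only selection rule amounts to a filtered sort. The sketch below assumes the teacher importances are available as a name-to-score mapping; the function name and the toy importance values are hypothetical, and only the column names C1, C14, TransactionAmt, and card1 and the concept names come from the benchmark or the text.

```python
from typing import Dict, List, Set


def select_top_k_raw(importances: Dict[str, float],
                     concept_names: Set[str], k: int) -> List[str]:
    """Rank features by teacher importance, drop concepts, keep the top k raw ones."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    raw_ranked = [name for name, _ in ranked if name not in concept_names]
    return raw_ranked[:k]


# Toy importances (made-up numbers, for illustration only).
toy_importances = {
    "C1": 0.30,
    "c_device_prev_count": 0.25,
    "C14": 0.20,
    "TransactionAmt": 0.10,
    "dt_day": 0.08,
    "card1": 0.07,
}
toy_concepts = {"c_device_prev_count", "dt_day"}
top3 = select_top_k_raw(toy_importances, toy_concepts, k=3)
```

Under this rule the hybrid design matrix is the concept bank plus the k retained raw variables, which is consistent with the 63 + 8 = 71 inputs reported for the final top-k = 8 hybrid.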

All EBM students were implemented as additive regression-style EBMs trained on a soft target that combines the hard fraud label with the XGBoost teacher probability, as formalized in Equation (10), with α_EBM = 0.35 in the submitted baseline experiments. We used a regression-style EBM because the target is a continuous teacher-guided probability-like signal rather than a binary label alone. This formulation allows the additive student to approximate the teacher-guided probability surface directly while preserving univariate shape functions. The fitted additive scores were clipped to [0, 1] only as a bounded-probability projection before probability-based metrics were computed. We did not apply Platt scaling or isotonic calibration in the main pipeline because the goal was to evaluate the native additive student and its direct explanation cost, not to optimize a separate post hoc calibration layer. Classification EBMs and post hoc calibration methods remain useful alternatives and are discussed as future extensions.
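The soft target and the bounded-probability projection are both one-line transformations. The sketch below uses plain Python lists; the function names are hypothetical, and the default `alpha = 0.35` is the value stated above.

```python
from typing import List, Sequence


def soft_targets(labels: Sequence[int], teacher_probs: Sequence[float],
                 alpha: float = 0.35) -> List[float]:
    """Blend hard labels with teacher probabilities: (1 - alpha) * y + alpha * p."""
    return [(1.0 - alpha) * y + alpha * p for y, p in zip(labels, teacher_probs)]


def clip01(scores: Sequence[float]) -> List[float]:
    """Project raw additive regression scores onto [0, 1]."""
    return [min(1.0, max(0.0, s)) for s in scores]
```

With alpha = 0.35, a fraud row that the teacher scores at 0.8 receives a target of 0.65 + 0.28 = 0.93, and a legitimate row that the teacher scores at 0.2 receives 0.07, so the student sees graded targets rather than pure 0/1 labels.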

2.5. Metrics and Computational Profile

We report PR-AUC, ROC-AUC, F1 at the validation-derived threshold, expected calibration error (ECE, 15 bins), the Brier score, recall at false-positive rates of 0.1%, 0.5%, and 1.0%, and precision among the top 1% highest-risk transactions. Given the severe class imbalance, PR-AUC is treated as the primary ranking metric, following prior work on ROC/PR interpretation and imbalanced evaluation [32,33]. Calibration is reported because threshold choice and downstream risk ranking depend on the reliability of predicted probabilities [34,35,36]. The Brier score is added as a complementary strictly proper scoring metric so that probability quality is not judged by binned ECE alone.

In addition to predictive quality, we measure the mean fit time, mean prediction time on the test split, feature count, and local-explanation latency. The runtime table reports the main three-seed out-of-time evaluation. The deployment-oriented explanation-cost benchmark compares XGBoost + SHAP against the native local explanations of the final hybrid EBM.

Implementation details. The main evaluation was run over three seeds (42, 43, and 44). XGBoost used 3000 estimators, learning rate 0.035, maximum depth 7, histogram tree construction, and CUDA device mode. CatBoost used 3000 iterations, learning rate 0.035, depth 7, L2 leaf regularization 3.0, and GPU mode. The EBM students used outer_bags = 8, max_rounds = 2000, learning_rate = 0.03, max_bins = 256, and n_jobs = 1. All experiments were run in Python 3.11 on a workstation equipped with an NVIDIA GeForce RTX 4060 GPU (NVIDIA Corporation, Santa Clara, CA, USA). GPU mode was used for XGBoost and CatBoost when available, with automatic CPU fallback if GPU execution failed; EBM and RuleFit components were executed on CPU. To support reproducibility, Supplementary File S1 includes the execution script, environment requirements with package versions, seed-wise summaries, configuration files, feature-group definitions, strict-holdout outputs, and final figure files used in the manuscript. The original IEEE-CIS data are not redistributed and must be obtained from the official competition source under its terms of use.
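For readers who want to mirror these settings, the reported hyperparameters can be collected in one place. The mapping from the prose to keyword names below is an assumption (the names follow the scikit-learn-style constructors of XGBoost, CatBoost, and InterpretML's EBM); the authoritative configuration is the one distributed in Supplementary File S1.

```python
# Hyperparameters as stated in the text; keyword names are an assumed mapping.
XGBOOST_PARAMS = {
    "n_estimators": 3000,
    "learning_rate": 0.035,
    "max_depth": 7,
    "tree_method": "hist",   # histogram tree construction
    "device": "cuda",        # CUDA device mode
}
CATBOOST_PARAMS = {
    "iterations": 3000,
    "learning_rate": 0.035,
    "depth": 7,
    "l2_leaf_reg": 3.0,
    "task_type": "GPU",
}
EBM_PARAMS = {
    "outer_bags": 8,
    "max_rounds": 2000,
    "learning_rate": 0.03,
    "max_bins": 256,
    "n_jobs": 1,
}
SEEDS = (42, 43, 44)


def configs_for_seed(seed: int) -> dict:
    """Return per-model keyword dictionaries with the seed attached."""
    return {
        "xgboost": {**XGBOOST_PARAMS, "random_state": seed},
        "catboost": {**CATBOOST_PARAMS, "random_seed": seed},
        "ebm": {**EBM_PARAMS, "random_state": seed},
    }
```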

2.6. Formal Problem Definition and Equation Summary

To make the modeling assumptions explicit and the paper easier to reproduce, this subsection formalizes the prediction task, the causal concept-construction rules, the additive student score, the teacher-guided soft target used by the EBM students, and the deployment-oriented evaluation metrics used throughout the study. Let x_i, y_i, and t_i denote the merged transaction–identity representation, binary fraud label, and transaction time index, respectively.

$$ D = \{(x_i, y_i, t_i)\}_{i=1}^{N}, \quad y_i \in \{0, 1\}. $$

(1)

The chronological evaluation used in this study can be expressed through the following split operators.

$$ T_{\mathrm{train}} = \{ i : t_i \leq c_1 \}, \quad T_{\mathrm{valid}} = \{ i : c_1 < t_i \leq c_2 \}, \quad T_{\mathrm{test}} = \{ i : t_i > c_2 \}. $$

(2)

Here, c₁ and c₂ are chronological cut points for the training/validation and validation/test boundaries, respectively. For the stricter robustness setting, a pseudo-entity key is built from a small set of quasi-identifying fields and used to remove overlaps between the training entities and the later splits.

$$ e_i = \mathrm{concat}\left( c_i^{(1)}, c_i^{(2)}, a_i^{(1)}, p_i^{(\mathrm{email})} \right). $$

(3)

The four arguments in Equation (3) correspond to card1, card2, addr1, and P_emaildomain. The final hybrid student uses an additive score that combines interpretable concept terms with a small set of teacher-selected raw variables.

$$ s_{\mathrm{hyb}}(x) = \beta_0 + \sum_{j \in C} f_j(x_j) + \sum_{r \in R_k} g_r(x_r). $$

(4)

Here, C denotes the concept-feature set, R_k denotes the top-k teacher-selected raw-feature set, and f_j and g_r are the learned univariate additive shape functions. The reported fraud probability is obtained by clipping the additive EBM regression score to [0, 1], consistent with the regression-style EBM implementation.

$$ \hat{p}_{\mathrm{hyb}}(x) = \mathrm{clip}\left( s_{\mathrm{hyb}}(x), 0, 1 \right). $$

(5)

Several concepts are history-based summaries defined causally over past observations belonging to the same entity e. The first two examples are the historical mean and a standardized deviation score.

$$ \mu_{e,i}^{-} = \frac{1}{\max(1, n_{e,i}^{-})} \sum_{\tau < i} \mathbb{1}(e_\tau = e_i)\, a_\tau, \quad n_{e,i}^{-} = \sum_{\tau < i} \mathbb{1}(e_\tau = e_i). $$

(6)

$$ z_{e,i} = \frac{a_i - \mu_{e,i}^{-}}{\hat{\sigma}_{e,i}^{-} + \epsilon}. $$

(7)

Cross-entity reuse is designed to quantify how often a device has appeared before outside the current entity history.

$$ \mathrm{reuse}_{d,i} = \sum_{\tau < i} \mathbb{1}(d_\tau = d_i,\ e_\tau \neq e_i). $$

(8)
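The cross-entity reuse count can be computed in a single pass by tracking how often each device has been seen overall and how often it has been seen with the current entity; the difference is the number of prior appearances under other entities. The sketch below is illustrative only: the function name is hypothetical, and returning 0 for a missing device identifier is an assumption.

```python
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence


def cross_entity_device_reuse(entities: Sequence[Hashable],
                              devices: Sequence[Optional[Hashable]]) -> List[int]:
    """For each row, count prior rows with the same device but a different entity."""
    device_seen = defaultdict(int)   # device -> prior appearances
    pair_seen = defaultdict(int)     # (device, entity) -> prior appearances
    reuse = []
    for entity, device in zip(entities, devices):
        if device is None:
            reuse.append(0)          # assumption: missing device means no reuse signal
            continue
        # Read both counters before the current row is recorded.
        reuse.append(device_seen[device] - pair_seen[(device, entity)])
        device_seen[device] += 1
        pair_seen[(device, entity)] += 1
    return reuse
```

For the stream of entities a, b, a sharing one device, the values are 0, 1, 1: the second row sees one earlier use by a different entity, and the third row ignores its own entity's first use but counts the use by b.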

Teacher guidance enters the hybrid model in two ways: through top-k raw feature selection using XGBoost-derived feature importance and through the soft target used to train the additive EBM students.

$$ R_k = \mathrm{TopK}(I^{T}, k). $$

(9)

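Here, I^T denotes the raw-feature importance scores derived from the XGBoost teacher, from which the k highest-ranked raw variables are retained. The soft target combines the hard label y_i with the XGBoost teacher probability p_i^T, and the student prediction is fitted to that target by squared error.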

$$ \tilde{y}_i = (1 - \alpha_{\mathrm{EBM}})\, y_i + \alpha_{\mathrm{EBM}}\, p_i^{T}, \quad L_{\mathrm{EBM}} = \sum_i \left( \hat{p}_i^{S} - \tilde{y}_i \right)^2. $$

(10)

Finally, calibration and low-FPR behavior are summarized through expected calibration error, the Brier score, and a quantile-based low-FPR recall metric, where P and N denote the positive and negative evaluation subsets, respectively. The Brier score is computed as the mean squared probability error, n⁻¹Σ_i(p_i − y_i)², and is used as a complementary strictly proper scoring rule so that calibration is not judged by binned ECE alone. In Equations (11) and (12), B_b denotes the b-th calibration bin, |B_b| is the number of samples in that bin, n is the total number of evaluated samples, acc(B_b) and conf(B_b) are the empirical accuracy and mean confidence of the bin, and the quantile term in Equation (12) denotes the negative-class score quantile corresponding to the target false-positive-rate level γ.

$$ \mathrm{ECE}_B = \sum_{b=1}^{B} \frac{|B_b|}{n} \left| \mathrm{acc}(B_b) - \mathrm{conf}(B_b) \right|. $$

(11)

$$ \mathrm{Recall}_{\mathrm{FPR}}(\gamma) = \frac{1}{|P|} \sum_{i \in P} \mathbb{1}\left[ p_i \geq q_{1-\gamma}\left( \{ p_j : j \in N \} \right) \right]. $$

(12)
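The calibration and low-FPR metrics can be sketched without external libraries. Several conventions below are assumptions rather than details confirmed by the text: equal-width bins over the predicted fraud probability, acc(B_b) taken as the observed fraud rate in the bin, and a conservative empirical quantile that flags a positive only if it scores strictly above the first negative that must not be flagged. All function names are hypothetical.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs: Sequence[float], labels: Sequence[int],
                               n_bins: int = 15) -> float:
    """Weighted mean of |observed fraud rate - mean predicted probability| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    total = 0.0
    for members in bins:
        if members:
            conf = sum(p for p, _ in members) / len(members)
            acc = sum(y for _, y in members) / len(members)
            total += len(members) / len(probs) * abs(acc - conf)
    return total


def recall_at_fpr(probs: Sequence[float], labels: Sequence[int], gamma: float) -> float:
    """Recall among positives when at most a fraction gamma of negatives is flagged."""
    negatives = sorted((p for p, y in zip(probs, labels) if y == 0), reverse=True)
    positives = [p for p, y in zip(probs, labels) if y == 1]
    allowed = int(gamma * len(negatives))          # tolerated false positives
    if allowed >= len(negatives):
        return 1.0
    cutoff = negatives[allowed]                    # first negative that must stay unflagged
    return sum(p > cutoff for p in positives) / len(positives)


def precision_at_top(probs: Sequence[float], labels: Sequence[int],
                     fraction: float = 0.01) -> float:
    """Fraud rate among the highest-scoring `fraction` of transactions."""
    k = max(1, int(len(probs) * fraction))
    top = sorted(zip(probs, labels), key=lambda pair: pair[0], reverse=True)[:k]
    return sum(y for _, y in top) / k
```

Reporting the Brier score next to ECE matters because ECE depends on the binning scheme: a model can have a small binned gap while still making large per-transaction probability errors.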

2.7. Algorithmic Summary

Algorithms 1 and 2 summarize the causal concept-generation pipeline and the final hybrid EBM training and evaluation routine. They are not meant to replace the implementation details; rather, they provide a compact procedural map aligned with the experimental protocol reported in this paper.

Algorithm 1. Causal concept-bank construction for the ordered IEEE-CIS stream.

Input: merged table M sorted by TransactionDT; entity key e_i; amount a_i; device d_i
Output: concept bank C with temporal, history, reuse, missingness, and ratio features
1: initialize empty history stores H_entity, H_device, H_email, H_product, H_card, H_address, H_card_address, and H_device_email
2: for each row i in temporal order do
3:     read current entity e_i and raw fields for row i
4:     compute temporal concepts from TransactionDT and previous entity timestamps
5:     compute entity-history statistics from H_entity[e_i] before inserting row i
6:     compute device-, email-, product-, card-, address-, and combined-identifier reuse terms from their existing history stores
7:     compute missingness flags and deviation ratios using only information already available for row i and prior histories
8:     append all concept values to C for row i
9:     update H_entity, H_device, H_email, H_product, H_card, H_address, H_card_address, and H_device_email with row i only after feature extraction is complete
10: end for
11: return C

The read-before-update rule is applied consistently across entity-, device-, email-, product-, card-, address-, and combined-identifier histories, preventing the current transaction from contributing to its own features.
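One way to guard the read-before-update rule in practice is a small property check: features for the first rows of the stream should not change when later rows are removed. The sketch below is a hypothetical test helper, not part of the published pipeline; `prior_counts` stands in for any concept generator, and `leaky_counts` is a deliberately wrong counter-example.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence


def prior_counts(keys: Sequence[Hashable]) -> List[int]:
    """Read-before-update: emit the stored count, then record the current row."""
    seen = defaultdict(int)
    out = []
    for key in keys:
        out.append(seen[key])   # read history first
        seen[key] += 1          # update only afterwards
    return out


def leaky_counts(keys: Sequence[Hashable]) -> List[int]:
    """Counter-example: counts over the whole stream, including future rows."""
    totals = defaultdict(int)
    for key in keys:
        totals[key] += 1
    return [totals[key] for key in keys]


def ignores_future(feature_fn: Callable[[Sequence[Hashable]], List[int]],
                   keys: Sequence[Hashable], cut: int) -> bool:
    """True if features of the first `cut` rows are unchanged when later rows are dropped."""
    return feature_fn(keys)[:cut] == feature_fn(keys[:cut])
```

This check detects dependence on future rows only. A generator that includes the current row in its own count would still pass, so the first-occurrence value (zero for a prior count) should be asserted separately.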

Algorithm 2. Final hybrid EBM training and evaluation under chronological and strict protocols.

Input: full feature matrix X, concept bank C, labels y, top-k value k
Output: trained hybrid EBM, clipped probability scores, test metrics, calibration plots, shape functions, and explanation-cost summaries
1: split X and y into chronological train/valid/test partitions
2: optionally filter valid/test rows that share pseudo-entity keys with train
3: fit the teacher XGBoost model on the train partition
4: rank raw variables by teacher importance and keep the top-k set R_k
5: build hybrid design matrix Z = [C, R_k] for train/valid/test
6: compute soft training targets y_tilde_train = (1 − alpha_EBM)y_train + alpha_EBM p^T_train
7: fit regression-style Explainable Boosting Machine on Z_train and y_tilde_train
8: obtain validation and test scores by clipping additive EBM outputs to [0, 1]; derive tau*_F1 from validation scores
9: evaluate PR-AUC and ROC-AUC threshold-free; evaluate F1 at tau*_F1; compute Recall@low-FPR and Precision@Top1% on Z_test
10: export global importance, PR/calibration curves, representative shape functions, explanation-cost diagnostics, runtime, feature count, and ablation results

The same protocol is reused for raw-only and concept-only EBM baselines so that every comparison differs only in the input design matrix.
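Step 8 of Algorithm 2 derives the F1 threshold on validation scores and reuses it unchanged on the test split. A minimal sketch of that step follows, assuming that candidate thresholds are the distinct validation scores and that ties in F1 are broken toward the higher threshold; both choices and the function names are hypothetical.

```python
from typing import Sequence


def f1_at(scores: Sequence[float], labels: Sequence[int], tau: float) -> float:
    """F1 when every score >= tau is flagged as fraud."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(valid_scores: Sequence[float],
                      valid_labels: Sequence[int]) -> float:
    """Pick the validation threshold with the highest F1 (ties: higher threshold)."""
    candidates = sorted(set(valid_scores))
    return max(candidates, key=lambda tau: (f1_at(valid_scores, valid_labels, tau), tau))


# Toy usage: the threshold is chosen on validation data only.
tau_star = best_f1_threshold([0.9, 0.8, 0.4, 0.3, 0.1], [1, 1, 0, 1, 0])
test_f1 = f1_at([0.7, 0.2, 0.35], [1, 0, 0], tau_star)
```

Keeping the threshold fixed after validation is what makes the reported test F1 an out-of-time estimate; re-tuning it on test scores would leak label information into the metric.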

3. Results

3.1. Main Comparison on the Out-of-Time Split

Table 1 reports the main three-seed out-of-time comparison across black-box references, additive students, and a rule-based interpretable baseline. CatBoost slightly outperformed XGBoost and therefore became the strongest predictive ceiling in the main evaluation (PR-AUC 0.489 ± 0.001 vs. 0.478 ± 0.003). Within the interpretable family reported in Table 1, the concept-only EBM remained too restrictive, raw-only EBMs recovered a substantial share of teacher signal, and the hybrid EBM variants achieved the strongest additive ranking performance, with top-k = 8 and top-k = 12 both around PR-AUC 0.407. RuleFit did not outperform the hybrid in this three-seed evaluation and showed larger variance across seeds.

These results indicate that the hybrid EBM should be interpreted as the strongest additive and deployment-oriented interpretable model among the evaluated student models, rather than as a universal winner over all possible interpretable approaches. At the same time, the hybrid consistently improved on the corresponding raw-only EBM by about +0.035 PR-AUC at top-k = 8 and about +0.025 PR-AUC at top-k = 12, showing that concept engineering adds non-trivial signal beyond a compact carry-over of top-ranked raw variables.

3.2. Top-k Ablation and Final Model Selection

Table 2 reports the three-seed hybrid top-k ablation. The top-k = 4 variant is clearly underpowered. Performance rises sharply from 4 to 8 and then plateaus across 8–16. The highest mean PR-AUC occurs at top-k = 16 (0.408 ± 0.003), but the margin over top-k = 8 is only about 0.002 and comes with additional raw variables and slightly weaker low-FPR and top-1% precision behavior. The top-k = 8 model is therefore retained as the primary deployment-oriented additive model because it is compact, near the performance plateau, and has the strongest Recall@0.1% FPR and Precision@Top1% among the top-k variants. The larger top-k values are treated as ranking-sensitivity checks rather than the preferred deployment configuration.

3.3. Robustness Under Strict Pseudo-Entity Holdout

Table 3 reports the stricter pseudo-entity-disjoint robustness evaluation across the same main model families as the primary comparison. These values are interpreted separately from the main three-seed chronological out-of-time results reported in Table 1 because the strict protocol removes repeated pseudo-entity overlap and substantially reduces the validation and test sets. Under this stricter protocol, CatBoost remains the strongest predictive ceiling (PR-AUC 0.487 ± 0.005), while XGBoost reaches 0.468 ± 0.002. Among the additive students, the hybrid EBM remains above the matched raw-only additive baseline at top-k = 8 (0.399 ± 0.001 vs. 0.371 ± 0.004 PR-AUC), and the top-k = 12 hybrid gives a similar result (0.400 ± 0.005 PR-AUC). RuleFit also performs strongly under this strict test (0.431 ± 0.010 PR-AUC), but it remains less favorable in the main evaluation because of its larger seed variance and much higher fit-time cost. The Brier-score column in Table 3 further confirms a conservative calibration interpretation: CatBoost and XGBoost have lower Brier scores than the hybrid EBM and RuleFit under this stress test, so the strict calibration evidence is not used to claim probability-quality dominance for the hybrid. The strict-holdout results therefore support the main conclusion that concept–raw fusion provides additional additive signal while making clear that black-box references remain the predictive ceilings.

3.4. Interpretability, Calibration, and Error Analysis

Figure 1 shows that the XGBoost teacher still dominates the full precision–recall curve, although the final top-k = 8 hybrid remains competitive in the high-precision/low-recall portion of the curve. Figure 2 presents the calibration curves for the XGBoost teacher and the final hybrid model. In the main three-seed out-of-time evaluation, the hybrid EBM top-k = 8 achieved ECE-15 0.01587, which is close to the XGBoost teacher (0.01611) and better than RuleFit (0.02669). CatBoost achieved the lowest ECE-15 among the black-box references (0.00989), although the raw-only EBM had the lowest ECE-15 overall despite weaker ranking quality. The Brier-score comparison reported later in the calibration table gives a more conservative view: the hybrid Brier score (0.02656) is higher than both black-box references but lower than RuleFit, so the hybrid should be described as close to XGBoost on ECE but not as matching the black-box probability-quality ceiling across all calibration metrics.

Figure 3 shows the global importance profile of the final hybrid model. The leading terms include two anonymized raw variables (C1 and C14) together with interpretable features such as cross-entity device reuse, prior device frequency, calendar position, product history, and email amount statistics. C1 and C14 remain the dominant residual raw signals, while c_cross_entity_reuse_device, c_device_prev_count, and dt_day are the leading concept terms. Because C1 and C14 remain anonymized in the IEEE-CIS benchmark, the final hybrid EBM should not be interpreted as fully semantically interpretable. Rather, it is additively transparent and partially semantically interpretable: concept variables support behavioral interpretation, while retained anonymized raw variables provide auditable shape functions and score contributions without full domain semantics.

Figure 4 presents representative shape functions of the final hybrid EBM. The two interpretable concept variables, c_cross_entity_reuse_device and c_device_prev_count, expose auditable patterns related to reuse and historical activity, while C1 and C14 show how a compact set of retained anonymized raw variables preserves residual benchmark signal within the additive structure. These raw-variable shapes are transparent at the level of score contribution and monotonic/non-monotonic response, but not at the level of direct financial semantics because the benchmark intentionally anonymizes those fields. The model is therefore best described as additively transparent with partial semantic interpretability, rather than as fully semantically interpretable.

3.5. Raw-Only Versus Hybrid Additive Modeling

To isolate the contribution of the concept layer, we compared raw-only and hybrid additive models directly. In the main three-seed out-of-time evaluation, the hybrid improved on the corresponding raw-only EBM by +0.0345 PR-AUC at top-k = 8 (0.4066 vs. 0.3721) and by +0.0246 PR-AUC at top-k = 12 (0.4074 vs. 0.3828). Under the stricter robustness evaluation reported in Table 3, the gain remained positive at approximately +0.027 PR-AUC at top-k = 8 and +0.026 PR-AUC at top-k = 12. These margins support the claim that concept engineering contributes signal beyond a small carry-over of top-ranked raw variables.

3.6. Hybrid Concept-Group Ablation

Table 4 reports concept-group ablations for the final top-k = 8 hybrid EBM in the main out-of-time experiment (mean ± std over three seeds). Removing the missingness concepts reduced PR-AUC from 0.4066 ± 0.0034 to 0.3982 ± 0.0045, and removing the relation group reduced it to 0.3981 ± 0.0032. By contrast, removing the time concepts slightly increased PR-AUC to 0.4100 ± 0.0024. Within the main out-of-time evaluation, this ablation suggests that missingness and relation concepts provide incremental value, whereas the current time block may contain redundant or weakly aligned signals once the other concept groups and selected raw variables are included.

3.7. Computational Profile

Table 5 summarizes the computational profile for the main out-of-time experiment. CatBoost and XGBoost train quickly on the available GPU but require the widest raw input representation. The sparse linear student and raw-only additive baselines are included for completeness, while the concept-rich EBMs provide the main interpretable additive candidates. The final hybrid EBM uses 71 input variables on average, fits in about 659.3 s, and predicts the test split in about 0.217 s in the three-seed summary. RuleFit, by contrast, required about 5083.5 s to fit and 1.273 s to score the test split, reinforcing its weaker deployment profile despite being structurally interpretable.

A deployment-oriented comparison against the XGBoost teacher makes the trade-off easier to interpret. Averaged over the three-seed deployment summaries, the XGBoost teacher achieved PR-AUC 0.4782, whereas the final hybrid top-k = 8 achieved 0.4066. The hybrid took about 10.47 min longer to fit, but this is an offline training cost rather than an online latency penalty. At inference time, the hybrid was about 9.66× faster than the XGBoost teacher (0.217 s vs. 2.094 s), used 71 rather than 154 input variables (53.9% reduction), and remained close in ECE-15 (0.01587 vs. 0.01611), although its Brier score was higher (0.02656 vs. 0.02418). These numbers support the intended framing: the XGBoost teacher is the black-box performance ceiling, while the hybrid is a smaller, additively transparent, faster-inference candidate with explicit shape functions and native local explanations.
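
The headline ratios in this comparison can be recomputed from the rounded values quoted above. The sketch below is plain arithmetic on those reported numbers; the variable names are illustrative, and small differences from the quoted ratios (for example, about 9.65× here versus 9.66× in the text) presumably reflect rounding of the underlying timings.

```python
# Rounded values as reported in the text for the main out-of-time evaluation.
xgb_pr_auc, hybrid_pr_auc = 0.4782, 0.4066
xgb_predict_s, hybrid_predict_s = 2.094, 0.217
xgb_features, hybrid_features = 154, 71

# Inference speedup of the hybrid EBM relative to the XGBoost teacher.
speedup = xgb_predict_s / hybrid_predict_s

# Relative reduction in the number of input variables.
feature_reduction_pct = 100 * (xgb_features - hybrid_features) / xgb_features

# Share of the teacher's PR-AUC retained by the additive student.
pr_auc_retained_pct = 100 * hybrid_pr_auc / xgb_pr_auc

print(f"inference speedup : {speedup:.2f}x")
print(f"feature reduction : {feature_reduction_pct:.1f}%")
print(f"PR-AUC retained   : {pr_auc_retained_pct:.1f}%")
```

Read this way, the hybrid gives up roughly 15% of the teacher's PR-AUC in exchange for about half the input variables and an order-of-magnitude lower scoring time on the test split.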

The explanation-cost benchmark reinforces that interpretation. Teacher-side local explanations required a separate SHAP step, whereas the hybrid EBM produced native additive explanations through explain_local(). Averaged over three seeds, XGBoost + SHAP required about 439.1 ms for one case, 1.797 ms per row for 100 cases, and 0.4896 ms per row for 500 cases. For the hybrid EBM, the corresponding local-explanation costs were about 3.85 ms for one case, 0.0525 ms per row for 100 cases, and 0.0247 ms per row for 500 cases. Depending on batch size, the additive deployment model therefore reduced explanation latency by roughly 19.8× to 114× relative to the teacher plus post hoc SHAP.
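
A per-row latency measurement of this kind can be sketched with a small timing harness. The code below is an illustrative assumption rather than the benchmark script used in this study: `per_row_latency_ms` and `dummy_additive_explainer` are hypothetical names, and the stand-in explainer only mimics an additive model so that the example runs without external packages. In practice, `explain_fn` would wrap the hybrid model's `explain_local()` call or the teacher's SHAP step.

```python
import time
from typing import Callable, Dict, Sequence


def per_row_latency_ms(
    explain_fn: Callable[[Sequence], object],
    rows: Sequence,
    batch_sizes: Sequence[int] = (1, 100, 500),
    repeats: int = 3,
) -> Dict[int, float]:
    """Mean explanation latency per row (ms) for each batch size."""
    results = {}
    for size in batch_sizes:
        batch = rows[:size]
        elapsed = 0.0
        for _ in range(repeats):
            start = time.perf_counter()
            explain_fn(batch)  # explain the whole batch in one call
            elapsed += time.perf_counter() - start
        # Average over repeats, then normalize by the number of rows explained.
        results[size] = 1000.0 * (elapsed / repeats) / len(batch)
    return results


def dummy_additive_explainer(batch):
    """Stand-in explainer: per-feature contributions of a fixed linear scorer."""
    weights = (0.4, -0.2, 0.1)
    return [[w * x for w, x in zip(weights, row)] for row in batch]


rows = [(float(i), float(i % 7), 1.0) for i in range(500)]
for size, ms in per_row_latency_ms(dummy_additive_explainer, rows).items():
    print(f"batch={size:>3}: {ms:.5f} ms/row")
```

Reporting cost per row at several batch sizes separates fixed per-call overhead, which dominates the single-case figure, from the marginal cost of each additional explanation.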

Main Calibration Metric and Soft-Target Sensitivity Checks

To address calibration beyond binned ECE and to test whether the soft-target weight drives the main conclusion, we report two additional summaries for the main out-of-time evaluation. Table 6 reports ECE-15 together with the Brier score. The hybrid EBM remains close to XGBoost in ECE-15, but its Brier score is higher than both black-box references and lower than RuleFit, supporting a bounded calibration claim rather than a claim of full probability-quality parity.

Table 7 reports the alpha_EBM sensitivity check for the final top-k = 8 hybrid design. PR-AUC varies only from 0.4061 to 0.4072 across alpha values from 0.00 to 0.75, while ECE-15 and the Brier score change only marginally. The alpha_EBM = 0.35 setting used for the final model therefore remains competitive, and the core conclusion does not depend on a narrow alpha choice. The highest mean PR-AUC occurs at alpha_EBM = 0.75, but the margin over 0.35 is approximately 0.0006 and is not large enough to change the model-selection argument.

These calibration and sensitivity results complement the broader runtime and explanation-cost findings. The hybrid EBM is not the strongest ranking model, but it is the only additive student in this evaluation that simultaneously preserves a substantial fraction of teacher discrimination, improves consistently on the raw-only additive baselines, remains competitive across the tested soft-target weights, and provides cheap native local explanations without an external post hoc explainer.

3.8. Expanded Visual Diagnostics and Design Summary

To deepen the interpretability narrative and provide a visually grounded account of the method, this section presents a design-summary block that complements the numerical tables. These figures do not replace the main results; instead, they make the model design, evaluation logic, operating-region behavior, and additive trade-offs easier to inspect at a glance.

3.8.1. Workflow, Concept Taxonomy, and Evaluation Design

Figure 5 summarizes the full pipeline from IEEE-CIS data ingestion to causal concept generation, XGBoost-guided raw-feature selection and soft-target construction, additive student training, and final benchmark-based deployment-relevant evaluation. The figure separates the roles of the two black-box baselines: XGBoost provides raw-feature ranking and the SHAP reference path, whereas CatBoost is retained as the predictive ceiling. Figure 6 organizes the 63-variable concept bank into five semantic groups using representative examples; the complete variable inventory with construction rules and operational interpretations is provided in Appendix A Table A1. Figure 7 then illustrates the two evaluation settings used throughout the paper: the primary out-of-time split and the stricter pseudo-entity-disjoint protocol.

3.8.2. Comparative Views of Performance, Complexity, and Sensitivity

Figure 8 provides a compact visual summary of the core additive-family comparison rather than the full expanded baseline set. Figure 9 places those same additive models on a performance–complexity plane and shows that the hybrid EBM is materially stronger than the concept-only and raw-only additive baselines while requiring fewer input variables than the XGBoost teacher. Figure 10 visualizes the top-k sensitivity analysis and confirms the same pattern seen in Table 2: a sharp improvement from k = 4 followed by a plateau around k = 8–16. Figure 11 isolates the incremental gain from concept–raw fusion over the raw-only additive baseline under both evaluation protocols. The expanded CatBoost and RuleFit baselines are reported numerically in Table 1 and Table 3 and discussed in Section 3.1, Section 3.3 and Section 4.

3.8.3. Low-FPR Operating View and Hybrid Concept-Group Ablation

In practice, fraud analysts often operate in extremely low false-positive-rate regions rather than across the full threshold range. Figure 12 therefore plots recall as a function of FPR for the teacher, the raw-only EBM, and the final hybrid EBM. Although the proposed model does not match the teacher, it remains clearly stronger than the raw-only additive baseline in the low-FPR regime. Figure 13 complements Table 4 by visualizing how concept-group utility can vary across evaluation settings: in the main out-of-time evaluation, missingness and relation concepts remain useful, whereas in the strict pseudo-entity-disjoint holdout, missingness remains important, the time block is comparatively expendable, and the relation block appears regime-dependent because removing it improves strict-holdout PR-AUC. Taken together, the ablations suggest that concept utility is real but partially regime-dependent.
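
The low-FPR operating view can be made concrete with a short sketch that computes the best recall attainable under a false-positive-rate cap. This is a minimal illustration under stated assumptions, not the plotting code behind Figure 12: `recall_at_fpr` is a hypothetical helper, tied scores are admitted together, and both classes are assumed to be present.

```python
from typing import Sequence


def recall_at_fpr(y_true: Sequence[int], scores: Sequence[float], max_fpr: float) -> float:
    """Highest recall reachable by a score threshold whose FPR stays <= max_fpr."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Sweep candidate thresholds from the highest score downward.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    tp = fp = 0
    best_recall = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit every transaction tied at this score before evaluating the cut.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # any lower threshold would exceed the FPR budget
        best_recall = tp / n_pos
    return best_recall


# Toy data (hypothetical): 3 fraud cases among 10 transactions.
labels = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.40, 0.35, 0.30, 0.25, 0.20, 0.10, 0.05]
for cap in (0.0, 0.15, 1.0):
    print(f"recall at FPR <= {cap:.2f}: {recall_at_fpr(labels, scores, cap):.3f}")
```

Evaluating this quantity at a few small caps for each model gives a compact numerical counterpart to a recall-versus-FPR curve.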

3.9. Evidence Synthesis

Taken together, Tables 1–7 and Figures 1–13 support a nuanced but coherent interpretation of the evidence. Black-box references remain the strongest predictive ceilings, with CatBoost slightly ahead of XGBoost in the main three-seed out-of-time evaluation and again remaining the strongest ceiling under the strict pseudo-entity-disjoint stress test. A pure concept bottleneck is too restrictive, compact raw-only additive baselines are already competitive, and the hybrid EBM repeatedly improves on those matched raw-only baselines in both the main and strict evaluations. RuleFit does not provide a more stable interpretable alternative in the main evaluation because its mean ranking score falls below the hybrid and its variance across seeds is substantially larger; however, the strict holdout adds nuance by showing that RuleFit can achieve a higher strict PR-AUC than the hybrid under a pseudo-entity-disjoint stress test. We therefore treat the strict result as a robustness stress test rather than as the primary model-selection setting, while retaining the bounded conclusion that concept–raw fusion adds useful additive signal over matched raw-only EBM baselines.

The evaluation also follows the same operational axes used to motivate the method. It reports not only ranking-oriented comparisons but also calibration behavior, low-FPR operating diagnostics, concept-group ablations, global importance and representative shape-function views, explanation-side latency, and computational profile summaries. The resulting picture is explicitly multidimensional: CatBoost is the strongest predictive ceiling, XGBoost provides a useful teacher and SHAP reference path, and the final hybrid EBM remains the strongest additive and benchmark-supported deployment-relevant model among the students tested here.

4. Discussion

A central limitation of the proposed hybrid EBM is that it neither matches the black-box references nor establishes dominance over every possible interpretable alternative. The main three-seed out-of-time evaluation confirms this limitation: CatBoost and XGBoost remain the predictive ceilings. The strict pseudo-entity-disjoint evaluation also shows that RuleFit can be competitive in that stress-test regime, even though it is less favorable in the main evaluation because of higher variance and substantially higher fit-time cost. The more relevant scientific question, however, is not whether one interpretable family wins on every axis, but whether additive concept–raw fusion occupies a distinct and operationally useful point on the trade-off surface. Under that framing, the hybrid EBM remains important because it clearly outperforms the matched raw-only additive baselines in both the main and strict evaluations, remains more stable than RuleFit across seeds in the main setting, remains close to the XGBoost teacher in ECE-15 while showing a higher Brier score, and exposes directly auditable shape functions. Nevertheless, because part of the final hybrid relies on anonymized raw variables, its interpretability should be described as additive transparency with partial semantic interpretation rather than full semantic interpretability.

The explanation-cost analysis supports that benchmark-based deployment-relevant interpretation. In the explanation-cost benchmark, XGBoost + SHAP required about 439 ms for a single case and about 0.490 ms per row for batches of 500. By contrast, the hybrid EBM generated native additive local explanations in about 3.85 ms for one case and 0.0247 ms per row for 500 cases. These differences do not diminish the value of the black-box references, but they clarify why a natively explainable additive model may be economically attractive in repeated analyst workflows: it reduces both operational scoring cost and explanation-side latency without requiring a separate post hoc explainer. The current evidence is still benchmark-based and should not be read as live production validation.

A further limitation is that the empirical evaluation is based on a single anonymized benchmark. Although IEEE-CIS is large-scale and widely used, its anonymized fields limit direct financial-domain interpretation and do not replace validation on institution-specific transaction streams. The present results should therefore be interpreted as benchmark evidence for an interpretable deployment design pattern rather than as a complete proof of production readiness across all financial environments. We did not evaluate analyst interaction, live monitoring, drift-control policies, cost-sensitive threshold adaptation, or production feedback loops. A further practical concern is concept drift: fraud strategies, merchant behavior, identity availability, and device reuse patterns can change over time in real financial systems. Although the chronological split and pseudo-entity-disjoint holdout provide stronger benchmark evidence than a random split, they do not replace prospective monitoring, drift detection, threshold recalibration, or periodic model updating in a live environment. From an applied engineering perspective, the proposed hybrid EBM is most suitable for analyst-facing risk-screening workflows in which full black-box accuracy is not the only objective and where transparent score decomposition, fast repeated explanations, and compact input requirements are operationally valuable.

5. Conclusions

This study examined interpretable fraud detection on the IEEE-CIS benchmark as a multi-objective trade-off rather than a search for a single universal winner. In the main three-seed out-of-time comparison, CatBoost and XGBoost remained the black-box ceilings, the concept-only EBM proved too restrictive, and the teacher-guided hybrid EBM emerged as the strongest additive student while consistently outperforming matched raw-only additive baselines. RuleFit did not provide a stable alternative in the main evaluation, showing lower mean PR-AUC and larger seed-wise variance than the hybrid, although the expanded strict-holdout results show that it can outperform the hybrid on strict PR-AUC under a pseudo-entity-disjoint stress test. The final top-k = 8 hybrid further reduced the input dimension from 154 to 71, delivered about 9.7× faster inference than the XGBoost teacher, and produced native local explanations that were much cheaper than XGBoost + SHAP. The results therefore support a practical but bounded conclusion: concept–raw fusion can improve an additive interpretable fraud detector relative to matched raw-only additive baselines while preserving auditable local explanations, but it should be regarded as benchmark-supported deployment-relevant evidence rather than live production validation or universal superiority over all interpretable alternatives.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/app16125809/s1, Supplementary File S1: The reproducibility guide, source code, environment requirements, main chronological out-of-time results, alpha_EBM sensitivity and Brier-score calibration outputs, strict pseudo-entity-disjoint holdout results, seed-wise metric summaries, runtime and explanation-cost outputs, full 63-variable concept inventory, and final figure files supporting the manuscript.

Author Contributions

Conceptualization, J.K. and K.K.; methodology, J.K.; software, J.K.; validation, J.K. and K.K.; formal analysis, J.K.; investigation, J.K.; resources, J.K.; data curation, J.K.; writing—original draft preparation, J.K.; writing—review and editing, J.K. and K.K.; visualization, J.K.; supervision, K.K.; project administration, J.K. and K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Global Copyright Issues Rapid Response (R&D) Program of the Ministry of Culture, Sports and Tourism and the Korea Culture Technology Planning and Evaluation Institute (No. RS-2026-2552393; specialized agency: Korea Creative Content Agency).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The IEEE-CIS Fraud Detection benchmark used in this study is publicly available through Kaggle under the competition terms. The code, configuration settings, derived concept definitions, seed-wise summaries, and result files generated during this study are provided in Supplementary File S1. Redistribution of the original benchmark data remains subject to the terms of the source competition.

Acknowledgments

The authors thank the members of the Intelligent Networks and Security Laboratory for discussions on fraud-detection experiment design and evaluation protocol development. During manuscript preparation, the authors used OpenAI’s ChatGPT (GPT-5.5 Thinking, OpenAI, San Francisco, CA, USA) for language editing, formatting support, and figure-layout refinement. All generated suggestions were reviewed, revised, and validated by the authors, who take full responsibility for the final content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

EBM	Explainable Boosting Machine
ECE	Expected Calibration Error
PR-AUC	Area under the Precision–Recall Curve
ROC-AUC	Area under the Receiver Operating Characteristic Curve
XAI	Explainable Artificial Intelligence

Appendix A. Full 63-Variable Concept Bank Inventory

Table A1 provides the complete 63-variable causal concept bank used in the final experiments. Each row reports the concept variable, construction rule, raw variables used, concept group, and operational interpretation. All history-based variables follow the read-before-update rule described in Section 2.3, so the current transaction is excluded before updating the corresponding history store.
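
The read-before-update rule can be illustrated with a minimal sketch for a single history key. The code is an assumption-laden illustration rather than the released implementation: the function name is hypothetical, a prior mean of 0.0 is assumed when no history exists, and "mean plus one" is read as a denominator offset, i.e., amount / (prior mean + 1).

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def prior_amount_features(transactions: Iterable[dict], key: str = "card1") -> List[dict]:
    """Prior count, prior mean, and amount ratio per row, in chronological order."""
    counts: Dict[object, int] = defaultdict(int)
    sums: Dict[object, float] = defaultdict(float)
    features = []
    # Process rows in time order so the history store only ever holds the past.
    for row in sorted(transactions, key=lambda r: r["TransactionDT"]):
        k, amt = row[key], row["TransactionAmt"]
        # READ: features use only transactions seen before the current row.
        prev_count = counts[k]
        prev_mean = sums[k] / prev_count if prev_count else 0.0  # assumed default
        features.append({
            "prev_count": prev_count,
            "prev_mean": prev_mean,
            "amt_ratio": amt / (prev_mean + 1.0),
        })
        # UPDATE: only now does the current row enter the history store.
        counts[k] += 1
        sums[k] += amt
    return features


# Toy stream (hypothetical values).
txs = [
    {"TransactionDT": 100, "card1": "A", "TransactionAmt": 50.0},
    {"TransactionDT": 200, "card1": "A", "TransactionAmt": 150.0},
    {"TransactionDT": 300, "card1": "B", "TransactionAmt": 20.0},
    {"TransactionDT": 400, "card1": "A", "TransactionAmt": 99.0},
]
for feats in prior_amount_features(txs):
    print(feats)
```

Placing the read strictly before the update is what keeps the current transaction, and therefore its label-correlated amount, out of its own history features.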

Table A1. Full inventory of the 63-variable causal concept bank.

Concept Variable	Construction Rule/Formula	Raw Variables Used	Concept Group	Operational Interpretation
log_TransactionAmt	log1p of non-negative TransactionAmt.	TransactionAmt	Aggregate deviation/amount scale	Stabilizes the transaction amount scale for additive modeling.
dt_hour	Hour bucket derived from floor(TransactionDT/3600) mod 24.	TransactionDT	Temporal state	Captures within-day transaction timing.
dt_weekday	Weekday bucket derived from floor(TransactionDT/86400) mod 7.	TransactionDT	Temporal state	Captures weekly timing pattern.
dt_day	Day index derived from floor(TransactionDT/86400).	TransactionDT	Temporal state	Captures coarse temporal position in the benchmark stream.
c_entity_amt_prev_count	Prior cumulative count for the pseudo-entity before current row.	card1, card2, addr1, P_emaildomain, TransactionDT	Entity history	Measures entity transaction history depth.
c_entity_amt_delta_t	Time since previous transaction for the same pseudo-entity.	TransactionDT, pseudo-entity key	Temporal state	Measures entity inactivity or rapid recurrence.
c_entity_amt_delta_t_log1p	log1p-transformed positive inter-arrival time for the pseudo-entity.	TransactionDT, pseudo-entity key	Temporal state	Stabilizes inter-arrival time for sparse histories.
c_entity_amt_jump_ratio	Absolute amount change from the previous pseudo-entity transaction divided by previous amount plus one.	TransactionAmt, pseudo-entity key	Aggregate deviation	Captures abrupt amount jumps for the same entity.
c_entity_amt_prev_mean	Prior mean TransactionAmt for the pseudo-entity.	TransactionAmt, pseudo-entity key	Entity history	Summarizes historical spending level of the entity.
c_entity_amt_prev_std	Prior standard deviation of TransactionAmt for the pseudo-entity.	TransactionAmt, pseudo-entity key	Entity history	Summarizes historical amount variability.
c_entity_amt_z	Current amount standardized by prior pseudo-entity mean and standard deviation.	TransactionAmt, pseudo-entity key	Aggregate deviation	Flags transactions deviating from entity history.
c_entity_amt_ratio	Current amount divided by prior pseudo-entity mean plus one.	TransactionAmt, pseudo-entity key	Aggregate deviation	Measures relative amount inflation versus entity history.
c_entity_amt_burstiness	Prior entity count divided by time since previous entity transaction plus one.	TransactionDT, pseudo-entity key	Temporal state	Measures rapid repeated activity by the same entity.
c_entity_amt_roll3_mean_prev	Rolling mean of the previous three entity amounts, using shifted history.	TransactionAmt, pseudo-entity key	Temporal state	Captures short-term historical amount level.
c_entity_amt_roll3_std_prev	Rolling standard deviation of the previous three entity amounts, using shifted history.	TransactionAmt, pseudo-entity key	Temporal state	Captures short-term amount volatility.
c_entity_amt_roll5_mean_prev	Rolling mean of the previous five entity amounts, using shifted history.	TransactionAmt, pseudo-entity key	Temporal state	Captures medium short-term amount level.
c_entity_amt_roll5_std_prev	Rolling standard deviation of the previous five entity amounts, using shifted history.	TransactionAmt, pseudo-entity key	Temporal state	Captures medium short-term amount volatility.
c_entity_amt_to_roll3_ratio	Current amount divided by prior rolling-3 mean plus one.	TransactionAmt, pseudo-entity key	Aggregate deviation	Measures deviation from short-term entity history.
c_entity_amt_to_roll5_ratio	Current amount divided by prior rolling-5 mean plus one.	TransactionAmt, pseudo-entity key	Aggregate deviation	Measures deviation from medium short-term entity history.
c_card_addr_amt_prev_count	Prior count for card1-address combination.	card1, addr1	Entity history	Measures recurrence of a card-address pair.
c_card_addr_amt_prev_mean	Prior mean amount for card1-address combination.	TransactionAmt, card1, addr1	Entity history	Summarizes historical amount level for card-address pair.
c_card_addr_amt_prev_std	Prior amount standard deviation for card1-address combination.	TransactionAmt, card1, addr1	Entity history	Summarizes variability for card-address pair.
c_card_addr_amt_z	Current amount standardized by prior card-address history.	TransactionAmt, card1, addr1	Aggregate deviation	Detects deviation from card-address baseline.
c_card_addr_amt_ratio	Current amount divided by prior card-address mean plus one.	TransactionAmt, card1, addr1	Aggregate deviation	Measures relative change versus card-address history.
c_card1_amt_prev_count	Prior count for card1.	card1	Entity history	Measures card-level recurrence.
c_card1_amt_prev_mean	Prior mean TransactionAmt for card1.	TransactionAmt, card1	Entity history	Summarizes card-level historical amount.
c_card1_amt_prev_std	Prior TransactionAmt standard deviation for card1.	TransactionAmt, card1	Entity history	Summarizes card-level amount variability.
c_card1_amt_z	Current amount standardized by prior card1 history.	TransactionAmt, card1	Aggregate deviation	Flags amount deviation at card level.
c_card1_amt_ratio	Current amount divided by prior card1 mean plus one.	TransactionAmt, card1	Aggregate deviation	Measures relative amount change at card level.
c_email_amt_prev_count	Prior count for payer email domain.	P_emaildomain	Entity history	Measures email-domain recurrence.
c_email_amt_prev_mean	Prior mean amount for payer email domain.	TransactionAmt, P_emaildomain	Entity history	Summarizes email-domain historical amount.
c_email_amt_prev_std	Prior amount standard deviation for payer email domain.	TransactionAmt, P_emaildomain	Entity history	Summarizes email-domain amount variability.
c_email_amt_z	Current amount standardized by prior email-domain history.	TransactionAmt, P_emaildomain	Aggregate deviation	Flags amount deviation relative to email-domain history.
c_email_amt_ratio	Current amount divided by prior email-domain mean plus one.	TransactionAmt, P_emaildomain	Aggregate deviation	Measures relative amount change for email-domain history.
c_card1_prev_count	Prior occurrence count of card1.	card1	Reuse/novelty	Captures card reuse frequency.
c_addr1_prev_count	Prior occurrence count of addr1.	addr1	Reuse/novelty	Captures address-region reuse frequency.
c_email_prev_count	Prior occurrence count of P_emaildomain.	P_emaildomain	Reuse/novelty	Captures payer-domain reuse frequency.
c_device_prev_count	Prior occurrence count of DeviceInfo.	DeviceInfo	Reuse/novelty	Captures device reuse frequency.
c_product_prev_count	Prior occurrence count of ProductCD.	ProductCD	Reuse/novelty	Captures product-code recurrence.
c_card4_prev_count	Prior occurrence count of card4.	card4	Reuse/novelty	Captures card-network/type recurrence.
c_card6_prev_count	Prior occurrence count of card6.	card6	Reuse/novelty	Captures card-category recurrence.
c_card_addr_prev_count	Prior occurrence count of card1|addr1 combination.	card1, addr1	Reuse/novelty	Captures card–address combination reuse.
c_device_email_prev_count	Prior occurrence count of DeviceInfo|P_emaildomain combination.	DeviceInfo, P_emaildomain	Reuse/novelty	Captures device–email combination reuse.
c_entity_product_prev_count	Prior occurrence count of pseudo-entity|ProductCD combination.	pseudo-entity key, ProductCD	Reuse/novelty	Captures product repetition within entity history.
c_new_device_for_entity	Indicator that DeviceInfo has not previously appeared for the pseudo-entity.	DeviceInfo, pseudo-entity key	Reuse/novelty	Flags new device use for an entity.
c_new_email_for_entity	Indicator that P_emaildomain has not previously appeared for the pseudo-entity.	P_emaildomain, pseudo-entity key	Reuse/novelty	Flags new payer-domain use for an entity.
c_new_card_addr_combo	Indicator that card1|addr1 combination is new.	card1, addr1	Reuse/novelty	Flags novel card–address combination.
c_new_device_email_combo	Indicator that DeviceInfo|P_emaildomain combination is new.	DeviceInfo, P_emaildomain	Reuse/novelty	Flags novel device–email combination.
c_new_hour_for_entity	Indicator that hour bucket is new for the pseudo-entity.	TransactionDT, pseudo-entity key	Reuse/novelty	Flags unusual timing for an entity.
c_new_weekday_for_entity	Indicator that weekday bucket is new for the pseudo-entity.	TransactionDT, pseudo-entity key	Reuse/novelty	Flags unusual weekday pattern for an entity.
c_new_product_for_entity	Indicator that ProductCD is new for the pseudo-entity.	ProductCD, pseudo-entity key	Reuse/novelty	Flags new product category for an entity.
c_cross_entity_reuse_device	Device prior count minus prior count of the same device within current pseudo-entity.	DeviceInfo, pseudo-entity key	Reuse/novelty	Measures whether a device is reused across different entities.
c_cross_entity_reuse_email	Email-domain prior count minus prior count of same payer domain within current pseudo-entity.	P_emaildomain, pseudo-entity key	Reuse/novelty	Measures whether an email domain appears across different entities.
c_identity_missing_ratio	Fraction of identity-like fields missing in the row.	id_*, D*, M*, DeviceInfo, DeviceType, P/R_emaildomain	Missingness	Summarizes identity sparsity intensity.
c_identity_missing_count	Count of missing identity-like fields in the row.	id_*, D*, M*, DeviceInfo, DeviceType, P/R_emaildomain	Missingness	Measures absolute identity-data sparsity.
c_core_missing_count	Count of missing core identity/location/device fields.	DeviceInfo, P/R_emaildomain, addr1, addr2, dist1, DeviceType	Missingness	Captures missingness in operationally important fields.
c_missing_DeviceInfo	Indicator that DeviceInfo is missing.	DeviceInfo	Missingness	Flags absence of device identity.
c_missing_P_emaildomain	Indicator that P_emaildomain is missing.	P_emaildomain	Missingness	Flags absence of payer email domain.
c_missing_R_emaildomain	Indicator that R_emaildomain is missing.	R_emaildomain	Missingness	Flags absence of recipient email domain.
c_missing_addr1	Indicator that addr1 is missing.	addr1	Missingness	Flags absence of primary address-region field.
c_missing_addr2	Indicator that addr2 is missing.	addr2	Missingness	Flags absence of secondary address-region field.
c_missing_dist1	Indicator that dist1 is missing.	dist1	Missingness	Flags absence of distance-related information.
c_missing_DeviceType	Indicator that DeviceType is missing.	DeviceType	Missingness	Flags absence of device-type information.

Note: Shaded rows indicate concept-group headers used to improve readability. The asterisk (*) denotes wildcard prefixes used in anonymized IEEE-CIS variable families, such as id_*, D*, and M*. The table is intentionally detailed for reproducibility: readers can map every concept name appearing in the code and Supplementary File S1 to its construction rule, raw inputs, concept group, and operational interpretation.
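
Two of the construction rules in Table A1 can be sketched directly from their definitions: the temporal buckets derived from TransactionDT and the cross-entity device reuse count. The sketch below is illustrative only. It assumes TransactionDT is expressed in seconds (as implied by the divisors 3600 and 86400), uses a hypothetical `entity` field in place of the pseudo-entity key, and does not handle missing DeviceInfo values.

```python
from collections import defaultdict


def time_buckets(transaction_dt: int) -> dict:
    """dt_hour, dt_weekday, and dt_day following the formulas in Table A1."""
    return {
        "dt_hour": (transaction_dt // 3600) % 24,
        "dt_weekday": (transaction_dt // 86400) % 7,
        "dt_day": transaction_dt // 86400,
    }


def cross_entity_device_reuse(transactions):
    """Prior uses of each row's device by other pseudo-entities (read before update)."""
    device_count = defaultdict(int)         # prior count per device
    device_entity_count = defaultdict(int)  # prior count per (device, entity) pair
    out = []
    for row in sorted(transactions, key=lambda r: r["TransactionDT"]):
        dev, ent = row["DeviceInfo"], row["entity"]
        # Device prior count minus the same device's prior count within this entity.
        out.append(device_count[dev] - device_entity_count[(dev, ent)])
        device_count[dev] += 1
        device_entity_count[(dev, ent)] += 1
    return out


# Toy stream (hypothetical): one device shared by two pseudo-entities.
stream = [
    {"TransactionDT": 1, "DeviceInfo": "X", "entity": "e1"},
    {"TransactionDT": 2, "DeviceInfo": "X", "entity": "e1"},
    {"TransactionDT": 3, "DeviceInfo": "X", "entity": "e2"},
    {"TransactionDT": 4, "DeviceInfo": "X", "entity": "e1"},
]
print(time_buckets(90000))
print(cross_entity_device_reuse(stream))
```

In the toy stream, the count stays at zero while only one pseudo-entity uses the device and becomes positive as soon as a second one appears, which is the behavior the operational interpretation of c_cross_entity_reuse_device describes.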

Appendix B. Visual and Reproducibility Summary

Appendix B complements the main text by mapping each major claim to the specific figure, table, or diagnostic that supports it. The purpose of this appendix is not to introduce new results, but to make the evidentiary structure of the manuscript explicit by showing how interpretability, calibration, sensitivity, and computational practicality are each supported in the final document.

Table A2. Map from manuscript claims to supporting evidence.

Component	Primary Evidence in Manuscript	Interpretive Role
Overall workflow	Figure 5	Summarizes the end-to-end concept–raw fusion pipeline.
Concept taxonomy	Figure 6; Table A1	Explains how 63 concepts were grouped and motivated.
Evaluation protocol	Figure 7	Shows chronological split and strict pseudo-entity holdout.
Main comparison/trade-off	Table 1; Figure 8 and Figure 9	Locates the final hybrid model on the performance–interpretability frontier.
Sensitivity and raw-to-hybrid gain	Table 2; Figure 10 and Figure 11	Demonstrates that concept augmentation adds measurable signal beyond raw-only additive baselines.
Calibration, low-FPR behavior, and soft-target sensitivity	Figure 1, Figure 2 and Figure 12; Table 6 and Table 7	Supports bounded probability-quality claims, soft-target robustness, and strict-threshold behavior.
Global importance/shape-function views	Figure 3 and Figure 4	Shows that the final model is auditable at both the model-wide and per-feature functional levels.
Ablation and runtime	Table 4 and Table 5; Figure 13	Connects concept utility to computational practicality.

References

Kou, Y.; Lu, C.-T.; Sirwongwattana, S.; Huang, Y.-P. Survey of fraud detection techniques. In Proceedings of the IEEE International Conference on Networking, Sensing and Control, Taipei, Taiwan, 21–23 March 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 2, pp. 749–754.
Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794.
Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30; Curran Associates: Red Hook, NY, USA, 2017; pp. 4765–4774.
Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 1135–1144.
Lou, Y.; Caruana, R.; Gehrke, J.; Hooker, G. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; Association for Computing Machinery: New York, NY, USA, 2013; pp. 623–631.
Nori, H.; Caruana, R.; Bu, Z.; Shen, J.H.; Kulkarni, J. Accuracy, interpretability, and differential privacy via explainable boosting. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Proceedings of Machine Learning Research: Cambridge, MA, USA, 2021; pp. 8227–8237. Available online: https://proceedings.mlr.press/v139/nori21a.html (accessed on 4 June 2026).
Grover, P.; Xu, J.; Tittelfitz, J.; Cheng, A.; Li, Z.; Zablocki, J.; Liu, J.; Zhou, H. Fraud dataset benchmark and applications. arXiv 2022, arXiv:2208.14417.
Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems 31; Curran Associates: Red Hook, NY, USA, 2018; pp. 6638–6648.
Friedman, J.H.; Popescu, B.E. Predictive learning via rule ensembles. Ann. Appl. Stat. 2008, 2, 916–954.
Jurgovsky, J.; Granitzer, M.; Ziegler, K.; Calabretto, S.; Portier, P.-E.; He-Guelton, L.; Caelen, O. Sequence classification for credit-card fraud detection. Expert Syst. Appl. 2018, 100, 234–245.
Whitrow, C.; Hand, D.J.; Juszczak, P.; Weston, D.; Adams, N.M. Transaction aggregation as a strategy for credit card fraud detection. Data Min. Knowl. Discov. 2009, 18, 30–55.
Niculescu-Mizil, A.; Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; Association for Computing Machinery: New York, NY, USA, 2005; pp. 625–632.
Axelsson, S. The base-rate fallacy and the difficulty of intrusion detection. ACM Trans. Inf. Syst. Secur. 2000, 3, 186–205.
Hilal, W.; Gadsden, S.A.; Yawney, J. Financial fraud: A review of anomaly detection techniques and recent advances. Expert Syst. Appl. 2022, 193, 116429.
Carcillo, F.; Dal Pozzolo, A.; Le Borgne, Y.-A.; Caelen, O.; Mazzer, Y.; Bontempi, G. SCARFF: A scalable framework for streaming credit card fraud detection with Spark. Inf. Fusion 2018, 41, 182–194.
Dal Pozzolo, A.; Boracchi, G.; Caelen, O.; Alippi, C.; Bontempi, G. Credit card fraud detection: A realistic modeling and a novel learning strategy. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3784–3797.
Bahnsen, A.C.; Stojanovic, A.; Aouada, D.; Ottersten, B. Improving credit card fraud detection with calibrated probabilities. In Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, PA, USA, 24–26 April 2014; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2014; pp. 677–685.
Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. 2014, 46, 44.
Liang, S.; Li, Y.; Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
Slack, D.; Hilgard, S.; Jia, E.; Singh, S.; Lakkaraju, H. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, New York, NY, USA, 7–8 February 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 180–186.
Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215.
Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 2019, 267, 1–38.
Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. 2019, 51, 93.
Rudin, C.; Chen, C.; Chen, Z.; Huang, H.; Semenova, L.; Zhong, C. Interpretable machine learning: Fundamental principles and 10 grand challenges. Stat. Surv. 2022, 16, 1–85.
Wachter, S.; Mittelstadt, B.; Russell, C. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. J. Law Technol. 2018, 31, 841–887.
Caruana, R.; Lou, Y.; Gehrke, J.; Koch, P.; Sturm, M.; Elhadad, N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 1721–1730.
Kraus, M.; Tschernutter, D.; Weinzierl, S.; Zschech, P. Interpretable generalized additive neural networks. Eur. J. Oper. Res. 2024, 317, 303–316.
Agarwal, R.; Melnick, L.; Frosst, N.; Zhang, X.; Lengerich, B.; Caruana, R.; Hinton, G.E. Neural additive models: Interpretable machine learning with neural nets. In Advances in Neural Information Processing Systems 34; Curran Associates: Red Hook, NY, USA, 2021; pp. 4699–4712.
Moreno-Torres, J.G.; Raeder, T.; Alaiz-Rodríguez, R.; Chawla, N.V.; Herrera, F. A unifying view on dataset shift in classification. Pattern Recognit. 2012, 45, 521–530.
Zafar, U.; Wu, F. Methodological challenges in explainable AI for fraud detection: A systematic literature review. Artif. Intell. Rev. 2026, 59, 115.
Zhou, Y.; Li, H.; Xiao, Z.; Qiu, J. A user-centered explainable artificial intelligence approach for financial fraud detection. Finance Res. Lett. 2023, 58, 104309.
Davis, J.; Goadrich, M. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 233–240.
Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432.
Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Proceedings of Machine Learning Research: Cambridge, MA, USA, 2017; pp. 1321–1330.
Zadrozny, B.; Elkan, C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the 18th International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001; Morgan Kaufmann: San Francisco, CA, USA, 2001; pp. 609–616.
Kull, M.; Silva Filho, T.M.; Flach, P. Beta calibration: A well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; Proceedings of Machine Learning Research: Cambridge, MA, USA, 2017; pp. 623–631.

Figure 1. Representative precision–recall curves for the XGBoost teacher and the final hybrid EBM (top-k = 8).

Figure 2. Representative calibration curves for the XGBoost teacher and the final hybrid EBM.

Figure 3. Representative global importance profile of the final hybrid EBM (top-k = 8).

Figure 4. Representative shape functions of the final hybrid EBM (top-k = 8): (a) c_cross_entity_reuse_device, (b) c_device_prev_count, (c) C1, and (d) C14. The plotted terms should be interpreted as additive score contributions for the regression-style EBM. In each panel, the red line is the learned additive shape function, the gray histogram is the empirical feature distribution, and the shaded band is the variability band drawn by the plotting routine.

Figure 5. Overall workflow of the proposed concept–raw fusion pipeline with separated XGBoost-teacher and CatBoost-ceiling roles.

Figure 6. Taxonomy of the 63-variable causal concept bank used in the final experiments. The figure shows representative examples from the five behavioral groups; the complete concept inventory, construction rules, raw variables used, and operational interpretations are provided in Appendix A Table A1.

Figure 7. Evaluation design combining the primary chronological out-of-time split and the strict pseudo-entity holdout protocol. The stricter protocol removes validation/test rows whose pseudo-entity keys appeared in earlier splits, reducing repeated-entity effects and providing a leakage-resistant robustness stress test.

Figure 8. Main comparison of the CatBoost predictive ceiling and additive interpretable student family under the primary out-of-time evaluation. Panels report (a) PR-AUC, (b) ROC-AUC, (c) F1, and (d) Precision@Top1% for the CatBoost ceiling, sparse linear student, concept-only EBM, raw-only EBM (top-k = 8), and final hybrid EBM (top-k = 8).

Figure 9. Performance–complexity trade-off across the XGBoost teacher and the core additive student family.

Figure 10. Sensitivity of the hybrid EBM to the number of teacher-selected raw variables. The curve shows that performance improves sharply from k = 4 to k = 8 and then reaches a plateau across k = 8–16, motivating top-k = 8 as the compact final deployment-relevant additive configuration while treating the larger k values as ranking-sensitivity checks.

Figure 11. PR-AUC gains obtained by augmenting raw-only additive models with concept features. Positive values indicate that the causal concept bank adds ranking signal beyond the matched top-k raw-only additive baseline under both the main out-of-time and strict pseudo-entity-disjoint settings.

Figure 12. Low-FPR operating characteristics of the teacher, raw-only EBM, and final hybrid EBM. The plot emphasizes the screening regime most relevant to fraud analysts, where false-positive budgets are limited and recall must be interpreted under strict FPR constraints.

Figure 13. Hybrid concept-group ablation under strict pseudo-entity holdout. The figure illustrates that concept-group utility is partly regime-dependent; missingness remains important, whereas relation and time features can change utility when repeated pseudo-entity overlap is removed.

Table 1. Main three-seed out-of-time comparison across black-box references, additive students, and a rule-based interpretable baseline (mean ± std).

Model	PR-AUC	ROC-AUC	F1	Recall@0.1% FPR	Precision@Top 1%
CatBoost ceiling	0.489 ± 0.001	0.885 ± 0.002	0.499 ± 0.004	0.204 ± 0.004	0.855 ± 0.002
XGBoost teacher	0.478 ± 0.003	0.878 ± 0.001	0.494 ± 0.002	0.185 ± 0.003	0.837 ± 0.007
Sparse linear student	0.121 ± 0.002	0.746 ± 0.003	0.193 ± 0.000	0.019 ± 0.000	0.253 ± 0.003
Concept-only EBM	0.189 ± 0.000	0.757 ± 0.000	0.231 ± 0.002	0.046 ± 0.001	0.441 ± 0.002
Raw-only EBM (top-k = 8)	0.372 ± 0.005	0.825 ± 0.004	0.398 ± 0.003	0.141 ± 0.003	0.700 ± 0.017
Raw-only EBM (top-k = 12)	0.383 ± 0.002	0.830 ± 0.001	0.400 ± 0.002	0.149 ± 0.001	0.733 ± 0.011
Hybrid EBM (top-k = 8)	0.407 ± 0.003	0.828 ± 0.003	0.423 ± 0.004	0.180 ± 0.002	0.806 ± 0.001
Hybrid EBM (top-k = 12)	0.407 ± 0.004	0.832 ± 0.001	0.416 ± 0.013	0.172 ± 0.004	0.799 ± 0.006
RuleFit baseline (top-k = 8 feature set)	0.387 ± 0.041	0.833 ± 0.009	0.404 ± 0.048	0.179 ± 0.019	0.795 ± 0.065
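The two screening-oriented columns of Table 1 can be made concrete with a minimal sketch. It assumes that Precision@Top 1% is the fraction of frauds among the ceil(1% of n) highest-scored transactions and that Recall@0.1% FPR is the highest recall reachable by a score threshold whose false-positive rate does not exceed 0.1%; tie handling and rounding are assumptions, and the function names are illustrative rather than taken from the study's code.

```python
import math


def precision_at_top_fraction(y_true, scores, fraction=0.01):
    """Share of positives among the top `fraction` of rows ranked by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, math.ceil(fraction * len(scores)))  # assumption: round up
    return sum(y_true[i] for i in order[:k]) / k


def recall_at_fpr(y_true, scores, max_fpr=0.001):
    """Highest recall among score thresholds with FPR <= max_fpr."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(order):
        # Rows with identical scores share a threshold, so add them together.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if y_true[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows as the threshold is lowered further
        best = tp / n_pos
        i = j
    return best


# Toy example with 2 frauds among 10 transactions (not IEEE-CIS data).
y = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
p_top20 = precision_at_top_fraction(y, s, fraction=0.2)
r_zero_fp = recall_at_fpr(y, s, max_fpr=0.0)
r_loose = recall_at_fpr(y, s, max_fpr=0.2)
```

On the toy data the budget is deliberately loose so that the behavior is visible on ten rows; at the 0.1% budget used in Table 1, only a very small number of legitimate transactions may be flagged before the threshold stops moving.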

Table 2. Hybrid EBM top-k ablation in the main out-of-time experiment (mean ± std over three seeds).

k	PR-AUC	ROC-AUC	F1	Recall@0.1% FPR	Precision@Top 1%
4	0.231 ± 0.007	0.775 ± 0.011	0.268 ± 0.010	0.055 ± 0.007	0.496 ± 0.006
8	0.407 ± 0.003	0.828 ± 0.003	0.423 ± 0.004	0.180 ± 0.002	0.806 ± 0.001
12	0.407 ± 0.004	0.832 ± 0.001	0.416 ± 0.013	0.172 ± 0.004	0.799 ± 0.006
16	0.408 ± 0.003	0.834 ± 0.002	0.422 ± 0.003	0.170 ± 0.002	0.796 ± 0.005

Table 3. Strict pseudo-entity-disjoint robustness evaluation across the main model families (mean ± std over three seeds), including Brier score as a probability-quality check. This benchmark removes repeated pseudo-entity overlap from later splits and is interpreted as a leakage-resistant robustness stress test rather than the primary reporting protocol.

Model	PR-AUC	ROC-AUC	F1	Recall@0.1% FPR	Precision@Top 1%	Brier Score
CatBoost ceiling	0.487 ± 0.005	0.886 ± 0.002	0.480 ± 0.003	0.162 ± 0.007	0.851 ± 0.009	0.02316 ± 0.00011
XGBoost teacher	0.468 ± 0.002	0.870 ± 0.002	0.468 ± 0.005	0.176 ± 0.015	0.816 ± 0.008	0.02404 ± 0.00013
Raw-only EBM (top-k = 8)	0.371 ± 0.004	0.829 ± 0.003	0.406 ± 0.003	0.131 ± 0.005	0.741 ± 0.005	0.02724 ± 0.00006
Hybrid EBM (top-k = 8)	0.399 ± 0.001	0.849 ± 0.001	0.398 ± 0.010	0.152 ± 0.003	0.783 ± 0.005	0.02718 ± 0.00002
Hybrid EBM (top-k = 12)	0.400 ± 0.005	0.857 ± 0.002	0.393 ± 0.007	0.147 ± 0.015	0.786 ± 0.014	0.02697 ± 0.00008
RuleFit baseline (top-k = 8)	0.431 ± 0.010	0.867 ± 0.003	0.415 ± 0.011	0.177 ± 0.007	0.796 ± 0.008	0.02606 ± 0.00021
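The filtering step behind the strict protocol can be sketched as follows. The sketch assumes that a pseudo-entity key is a tuple of raw identifier fields; the key composition ("card1", "addr1") and the function names are hypothetical placeholders, since the actual key definition is given in the Methods rather than in this table.

```python
def entity_key(row, key_fields):
    """Build a hashable pseudo-entity key from the chosen raw fields."""
    return tuple(row.get(field) for field in key_fields)


def drop_seen_entities(earlier_rows, later_rows, key_fields):
    """Keep only later rows whose pseudo-entity key never occurs earlier."""
    seen = {entity_key(row, key_fields) for row in earlier_rows}
    return [row for row in later_rows
            if entity_key(row, key_fields) not in seen]


# Hypothetical key composition, used only for illustration.
KEY_FIELDS = ("card1", "addr1")

# Toy chronological splits (not IEEE-CIS data).
train = [{"card1": 1001, "addr1": 299}, {"card1": 1002, "addr1": 310}]
valid = [{"card1": 1001, "addr1": 299}, {"card1": 1003, "addr1": 120}]
test = [{"card1": 1003, "addr1": 120}, {"card1": 1004, "addr1": 87}]

# Validation is filtered against train; test against train and validation.
strict_valid = drop_seen_entities(train, valid, KEY_FIELDS)
strict_test = drop_seen_entities(train + valid, test, KEY_FIELDS)
```

In this toy case one validation row and one test row are removed because their keys appeared in an earlier split, which is the repeated-entity effect that the strict benchmark is intended to suppress.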

Table 4. Concept-group ablation for the final top-k = 8 hybrid EBM in the main out-of-time experiment (mean ± std over three seeds).

Setting	PR-AUC	ROC-AUC	F1	Recall@0.1% FPR	Precision@Top 1%
Full hybrid (top-k = 8)	0.407 ± 0.003	0.828 ± 0.003	0.423 ± 0.004	0.180 ± 0.002	0.806 ± 0.001
Drop time concepts	0.410 ± 0.002	0.830 ± 0.003	0.426 ± 0.002	0.179 ± 0.002	0.804 ± 0.003
Drop relation concepts	0.398 ± 0.003	0.829 ± 0.002	0.406 ± 0.007	0.173 ± 0.003	0.787 ± 0.004
Drop missingness concepts	0.398 ± 0.005	0.826 ± 0.003	0.407 ± 0.007	0.174 ± 0.002	0.795 ± 0.003

Table 5. Computational profile in the main out-of-time experiment (mean ± std over three seeds).

Model	Features	Fit Time (s)	Predict Time (s)
CatBoost ceiling	154	34.1 ± 0.7	0.126 ± 0.004
XGBoost teacher	154	30.9 ± 0.0	2.094 ± 0.023
Raw-only EBM (top-k = 8)	8	80.4 ± 0.8	0.098 ± 0.001
Raw-only EBM (top-k = 12)	12	124.0 ± 2.9	0.104 ± 0.002
Concept-only EBM	63	536.4 ± 5.6	0.200 ± 0.001
Hybrid EBM (top-k = 8)	71	659.3 ± 10.9	0.217 ± 0.015
Hybrid EBM (top-k = 12)	75	704.5 ± 13.2	0.218 ± 0.011
RuleFit baseline (top-k = 8 feature set)	71	5083.5 ± 493.9	1.273 ± 0.053
Sparse linear student	63	624.0 ± 201.8	0.079 ± 0.005
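The headline efficiency ratios can be re-derived from the tabulated means with a short calculation. The sketch below uses only the rounded means of Table 5, so its output may differ slightly from ratios computed on unrounded per-seed timings (an assumption about how the reported speedup was obtained).

```python
# Mean prediction times in seconds, copied from Table 5.
predict_time_s = {
    "CatBoost ceiling": 0.126,
    "XGBoost teacher": 2.094,
    "Raw-only EBM (top-k = 8)": 0.098,
    "Hybrid EBM (top-k = 8)": 0.217,
}

# Input feature counts, copied from Table 5.
n_features = {"XGBoost teacher": 154, "Hybrid EBM (top-k = 8)": 71}

hybrid = "Hybrid EBM (top-k = 8)"

# How many times faster the hybrid EBM predicts than the XGBoost teacher.
speedup_vs_xgboost = predict_time_s["XGBoost teacher"] / predict_time_s[hybrid]

# How many times slower the hybrid EBM predicts than the CatBoost ceiling.
slowdown_vs_catboost = predict_time_s[hybrid] / predict_time_s["CatBoost ceiling"]

# Relative reduction in input features from 154 to 71.
feature_reduction = 1 - n_features[hybrid] / n_features["XGBoost teacher"]

print(f"speedup vs XGBoost:   {speedup_vs_xgboost:.2f}x")
print(f"slowdown vs CatBoost: {slowdown_vs_catboost:.2f}x")
print(f"feature reduction:    {feature_reduction:.1%}")
```

The rounded means give a speedup of roughly 9.65× over XGBoost, in line with the approximately 9.7× stated in the abstract, and a feature reduction of about 54%. The same table also shows that the hybrid EBM is slower than CatBoost at prediction time by a factor of about 1.7, so the runtime advantage is specific to the comparison with the XGBoost teacher.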

Table 6. Main out-of-time calibration comparison with ECE-15 and Brier score (mean ± std over three seeds).

Model	ECE-15	Brier Score	Calibration Interpretation
CatBoost ceiling	0.00989 ± 0.00047	0.02359 ± 0.00009	Best Brier score among reported models; strongest ECE-15 among black-box references, but not the lowest ECE-15 overall.
XGBoost teacher	0.01611 ± 0.00044	0.02418 ± 0.00012	Black-box teacher used for raw-feature selection and soft targets.
Concept-only EBM	0.01217 ± 0.00018	0.03160 ± 0.00001	Low ECE but weak ranking and higher Brier score.
Raw-only EBM (top-k = 8)	0.00735 ± 0.00043	0.02693 ± 0.00012	Compact additive raw baseline.
Hybrid EBM (top-k = 8)	0.01587 ± 0.00012	0.02656 ± 0.00008	Close to XGBoost in ECE; Brier score remains above both black-box references.
RuleFit baseline (top-k = 8 feature set)	0.02669 ± 0.00391	0.02817 ± 0.00266	Higher ECE and Brier than the hybrid with larger seed variance.
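For readers who wish to recompute the two probability-quality columns, a minimal sketch is given below. It reads ECE-15 as the expected calibration error over 15 equal-width probability bins, which is an assumption about the binning scheme, and it uses the standard mean-squared-error definition of the Brier score; the function names are illustrative.

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def expected_calibration_error(y_true, probs, n_bins=15):
    """ECE with equal-width bins on [0, 1] (assumed binning scheme)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        # p == 1.0 would index past the end, so clamp it into the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)   # fraud rate
        predicted = sum(p for _, p in members) / len(members)  # mean score
        ece += len(members) / n * abs(observed - predicted)
    return ece


# Toy check: four legitimate transactions all scored 0.25 (not IEEE-CIS data).
toy_labels = [0, 0, 0, 0]
toy_probs = [0.25, 0.25, 0.25, 0.25]
toy_brier = brier_score(toy_labels, toy_probs)
toy_ece = expected_calibration_error(toy_labels, toy_probs)
```

The two quantities answer different questions, which is why Table 6 can show the hybrid EBM close to XGBoost in ECE-15 while its Brier score stays higher: ECE compares bin-level averages only, whereas the Brier score also penalizes weaker separation between fraudulent and legitimate transactions.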

Table 7. Sensitivity of the final hybrid EBM (top-k = 8) to alpha_EBM under the main out-of-time protocol (mean ± std over three seeds).

alpha_EBM	PR-AUC	ROC-AUC	F1	ECE-15	Brier Score	Recall@0.1% FPR
0.00	0.4061 ± 0.0034	0.8276 ± 0.0027	0.4220 ± 0.0051	0.01600 ± 0.00014	0.02658 ± 0.00008	0.1794 ± 0.0023
0.25	0.4065 ± 0.0034	0.8277 ± 0.0027	0.4230 ± 0.0046	0.01593 ± 0.00012	0.02657 ± 0.00008	0.1797 ± 0.0020
0.35	0.4066 ± 0.0034	0.8278 ± 0.0028	0.4228 ± 0.0044	0.01587 ± 0.00012	0.02656 ± 0.00008	0.1799 ± 0.0018
0.50	0.4069 ± 0.0034	0.8279 ± 0.0028	0.4229 ± 0.0046	0.01582 ± 0.00009	0.02655 ± 0.00008	0.1801 ± 0.0018
0.75	0.4072 ± 0.0034	0.8280 ± 0.0029	0.4227 ± 0.0049	0.01576 ± 0.00008	0.02654 ± 0.00008	0.1800 ± 0.0017

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
