Article

Deployment-Oriented Interpretable Fraud Detection via Hybrid Explainable Boosting Machines with Concept–Raw Fusion on the IEEE-CIS Benchmark

1 Intelligent Networks and Security Laboratory, Department of Data Science, Konkuk University, Seoul 05029, Republic of Korea
2 Intelligent Networks and Security Laboratory, Department of Computer Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(12), 5809; https://doi.org/10.3390/app16125809 (registering DOI)
Submission received: 22 May 2026 / Revised: 3 June 2026 / Accepted: 5 June 2026 / Published: 9 June 2026
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Fraud detection models often achieve strong ranking performance through black-box ensembles, but operational deployment also requires calibration, low explanation cost, and auditable scoring logic. This study develops an interpretable fraud-detection pipeline for IEEE-CIS by combining a 63-variable causal concept bank with teacher-guided additive Explainable Boosting Machine (EBM) students. The concept bank summarizes temporal state, entity history, novelty/reuse, identity missingness, and aggregate deviation. Experiments use a chronological out-of-time split and a stricter pseudo-entity-disjoint holdout. In the main three-seed evaluation, the CatBoost predictive ceiling and XGBoost teacher achieved PR-AUC 0.489 ± 0.001 and 0.478 ± 0.003, respectively. Among interpretable models, the concept-only EBM reached 0.189 ± 0.000, raw-only EBMs reached 0.372 ± 0.005 (top-k = 8) and 0.383 ± 0.002 (top-k = 12), and hybrid EBMs reached 0.407 ± 0.003 (top-k = 8) and 0.407 ± 0.004 (top-k = 12), consistently improving over matched raw-only additive baselines. The final top-k = 8 hybrid reduced input features from 154 to 71, achieved about 9.7× faster inference than XGBoost, remained close to XGBoost in ECE-15 calibration (0.01587 vs. 0.01611) while having a higher Brier score, and produced native local explanations far faster than XGBoost + SHAP. The results position CatBoost as the predictive ceiling and hybrid EBM as a benchmark-supported, deployment-relevant interpretable compromise for applied financial risk-screening workflows, rather than as a production-validated fraud-monitoring system.

1. Introduction

Fraud detection is a high-stakes classification problem in which ranking quality matters, but operational usefulness depends just as much on auditability, threshold behavior, and the intelligibility of the scoring logic [1,2]. Tree ensembles often perform well on tabular fraud data, yet their decision surfaces are difficult to inspect directly. Post hoc tools such as SHAP and LIME can help diagnose individual predictions, but they do not turn a black-box system into a model whose decision process is structurally transparent [3,4].
That distinction matters in deployment. Analysts need to know not only why a transaction was flagged but also whether the same rationale is likely to remain stable across thresholds, time slices, and previously unseen entity combinations. Additive interpretable models are appealing in this regard because each predictor contributes through an explicit shape function that can be inspected and audited [5,6]. Their weakness is that they may discard too much signal when the task depends on heterogeneous interactions, historical reuse patterns, and complex missingness structure.
We study that trade-off on the IEEE-CIS Fraud Detection benchmark [7]. Instead of relying solely on raw competition variables, we build a concept bank that summarizes temporal history, entity-level transaction behavior, novelty and reuse patterns, and identity sparsity. Using that representation, we compare interpretable student families against two strong black-box references: XGBoost, which is used as the teacher for feature ranking, soft-target construction, and SHAP-based explanation-cost comparison, and CatBoost, which is retained as an additional predictive ceiling [2,8]. We also benchmark RuleFit as a rule-based interpretable comparator [9].
The central question is therefore not whether an additive model can fully match a strong teacher. The more practical question is whether it can preserve enough discriminatory power to remain operationally plausible while offering a decision structure that is easier to explain, calibrate, and audit. Our results show that a concept-only student is too restrictive, a raw-only additive baseline recovers a substantial share of teacher performance, and a hybrid EBM recovers more while preserving additive transparency. The contribution of this paper is thus a concrete design pattern for interpretable fraud detection on anonymized benchmark data: semantic concept engineering, limited raw-feature carryover for residual signal recovery, and strict time-aware evaluation for realistic assessment.

1.1. Related Work on Fraud Detection, Explainability, and Additive Models

Recent fraud-detection research has been dominated by high-performing black-box learners, including gradient-boosted trees, deep neural networks, and sequence models. On tabular transaction data, tree ensembles remain particularly competitive because they handle heterogeneous feature spaces, missing values, and non-linear interactions well [1,2,7,10,11]. Strong ranking performance alone, however, is not enough in operational fraud screening. Analysts must also understand why a transaction is flagged, whether thresholds remain trustworthy under distribution shift, and whether the resulting scores can be justified in audit or compliance settings [12,13,14]. Realistic fraud-detection studies further emphasize class imbalance, verification latency, calibration, and streaming deployment constraints [15,16,17].
These deployment concerns are also closely related to concept drift, dataset shift, and out-of-distribution behavior [18,19].
Post hoc explanation methods such as LIME and SHAP are now standard tools for interrogating black-box fraud models [3,4], yet their robustness can be fragile under small input perturbations [20]. They are useful for local diagnosis, but they do not eliminate the fidelity gap between a complex predictor and a retrospective explanation [6,21,22,23,24]. In high-stakes settings, several authors have argued that interpretable models should be preferred whenever feasible, particularly when the decision process must be auditable, stable, and easy to communicate to domain experts [21,22,24]. Counterfactual explanations provide one complementary perspective on actionability, but they do not remove the need for structurally transparent decision models [25]. This argument is especially relevant in fraud detection, where base-rate effects, threshold sensitivity, and severe imbalance make isolated point explanations an incomplete basis for deployment decisions [12,13].
A parallel line of research focuses on intrinsically interpretable additive models, including GAMs, GA2Ms, and modern boosting-based variants such as the Explainable Boosting Machine (EBM) [5,6,26,27], as well as newer neural additive formulations [28]. These models are attractive because they expose shape functions directly, but on difficult real-world benchmarks, they may lose too much predictive signal if the representation is oversimplified. This creates a more interesting design question: rather than enforcing a pure concept bottleneck, can one retain a compact amount of high-value raw information while keeping the final model structurally interpretable?
This study is situated at that intersection. Rather than assuming that one interpretable family must dominate every operational criterion, we compare additive concept-only, raw-only, and concept–raw hybrid students alongside broader baselines such as RuleFit and stronger black-box references such as the XGBoost teacher and CatBoost predictive ceiling. The manuscript is therefore not merely a performance report. It examines how different interpretable families distribute the trade-off among ranking accuracy, calibration, explanation cost, and structural transparency on IEEE-CIS, aligning the work with recent calls for intrinsically interpretable, deployment-aware machine learning [21,22,23,24,26,29,30,31].

1.2. Study Positioning and Contribution Logic

This paper is not intended as a standard leaderboard-style IEEE-CIS comparison. Rather, it is framed as a benchmark-based applied study of interpretable model design for deployment-relevant financial risk screening, asking how one should choose among black-box ceilings, additive interpretable students, and rule-based interpretable baselines when ranking quality, calibration, and explanation efficiency all matter.
(1) We define a time-aware evaluation package that combines a chronological out-of-time split with a stricter pseudo-entity-disjoint holdout, so that additive students are assessed not only for predictive quality but also for leakage-resistant generalization.
(2) We construct a causal 63-variable concept bank that translates anonymized IEEE-CIS fields into interpretable behavioral summaries of temporal state, entity history, novelty and reuse, identity missingness, and aggregate deviation.
(3) We compare sparse linear, concept-only, raw-only, and concept–raw hybrid additive students against the XGBoost teacher and CatBoost predictive ceiling, and we additionally benchmark RuleFit so that the conclusions are not confined to a single black-box reference or a single interpretable family.
(4) We close the evaluation loop by reporting ranking quality, threshold-based F1, low-FPR behavior, calibration, global importance and representative shape functions, computational cost, and explanation latency for XGBoost + SHAP versus native hybrid EBM local explanations, allowing interpretability claims to be judged together with operational plausibility and explanation-side cost.
The novelty of this study does not lie in proposing a new black-box learner, but in integrating causal concept construction, limited raw-feature carryover, teacher-guided additive distillation, strict time-aware validation, and explanation-cost benchmarking into a single deployment-oriented evaluation framework for interpretable fraud detection.

2. Materials and Methods

2.1. Dataset and Prediction Task

The experiments used the IEEE-CIS Fraud Detection benchmark, which provides transaction records and auxiliary identity fields linked by TransactionID [7]. After merging the transaction and identity files, we sorted the merged data by TransactionDT and treated them as a temporally ordered tabular stream. The prediction target was isFraud. Because the benchmark variables are anonymized, many informative predictors are semantically opaque; this motivates the explicit construction of human-readable concepts rather than exclusive reliance on raw variables.
The task is binary fraud classification under severe class imbalance. We therefore emphasize threshold-free ranking metrics and low-false-positive operating characteristics, which more closely reflect practical screening scenarios than accuracy. Core additive-family experiments, CatBoost and RuleFit baselines, and deployment-oriented explanation-cost comparisons were repeated over three fixed seeds under the same chronological split.

2.2. Out-of-Time and Pseudo-Entity-Disjoint Evaluation

We adopted a chronological split based on TransactionDT, using approximately 70% of the earliest observations for training, 15% for validation, and 15% for testing. This protocol blocks future information from entering training and better approximates deployment conditions. In the main chronological out-of-time experiment, this produced 413,378 training rows, 88,581 validation rows, and 88,581 test rows. To probe generalization more aggressively, we also introduced a stricter pseudo-entity-disjoint evaluation. The pseudo-entity key was defined as the concatenation of card1, card2, addr1, and P_emaildomain. These fields were selected because the IEEE-CIS benchmark does not provide a true customer identifier, while this combination captures recurring payment-card, address-region, and payer-domain signals with enough validation/test coverage for a stable robustness test. A weaker key using fewer fields would leave more repeated-entity overlap, whereas a more restrictive key using additional sparse identity fields would remove too many observations. We therefore used this four-field key as a pragmatic balance between overlap removal and sample retention. Validation and test rows whose pseudo-entities had already appeared in earlier splits were removed. Under this stricter setting, the validation and test sets were reduced to 10,393 and 10,376 rows, respectively, making the task materially harder but also more resistant to repeated-entity effects.
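The split-and-filter logic described above can be sketched in a few lines. This is a minimal illustration rather than the authors' script: it assumes rows are plain dictionaries, cuts by row count instead of explicit TransactionDT cut points, and treats all validation rows (kept or not) as part of the "earlier splits" when filtering the test set. The function names `pseudo_entity`, `chronological_split`, and `strict_filter` are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
# Key fields named in the text; everything else in this sketch is illustrative.
KEY_FIELDS = ("card1", "card2", "addr1", "P_emaildomain")


def pseudo_entity(row: Row) -> Tuple[str, ...]:
    """Build the four-field pseudo-entity key (missing values become 'None')."""
    return tuple(str(row.get(field)) for field in KEY_FIELDS)


def chronological_split(rows: List[Row], train_frac: float = 0.70,
                        valid_frac: float = 0.15):
    """Sort by TransactionDT and cut by row count (an assumption of this sketch)."""
    ordered = sorted(rows, key=lambda r: r["TransactionDT"])
    n_train = round(len(ordered) * train_frac)
    n_valid = round(len(ordered) * valid_frac)
    train = ordered[:n_train]
    valid = ordered[n_train:n_train + n_valid]
    test = ordered[n_train + n_valid:]
    return train, valid, test


def strict_filter(train: List[Row], valid: List[Row], test: List[Row]):
    """Drop validation/test rows whose pseudo-entity appeared in an earlier split."""
    seen = {pseudo_entity(r) for r in train}
    valid_kept = [r for r in valid if pseudo_entity(r) not in seen]
    # Assumption: every validation entity counts as "seen" for the test filter.
    seen |= {pseudo_entity(r) for r in valid}
    test_kept = [r for r in test if pseudo_entity(r) not in seen]
    return valid_kept, test_kept
```

Because the filter only ever removes later rows, the training partition is identical under both protocols, which is what lets the strict holdout act as a stress test of the same fitted models.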
All main comparisons reported in this manuscript were repeated over three seeds under the chronological out-of-time protocol. The pseudo-entity-disjoint protocol is used as a robustness stress test rather than the primary reporting setting; however, to make the leakage-resistant comparison transparent, the strict-holdout table reports the same main model families as the primary comparison, including CatBoost, XGBoost, raw-only EBM, hybrid EBM, and RuleFit.

2.3. Concept Bank

We constructed a 63-variable concept bank to translate raw tabular observations into interpretable behavioral summaries. The bank comprises five groups: (i) temporal-state variables such as day index, hour, weekday, inter-transaction delay, rolling means, rolling standard deviations, and burstiness; (ii) entity-history variables such as prior counts and prior mean transaction amounts for the pseudo-entity and related card/address aggregates; (iii) reuse and novelty variables, including new-device-for-entity, new-email-for-entity, cross-entity reuse of device identifiers, and prior counts for device, email, product, and card/address combinations; (iv) missingness variables summarizing identity sparsity and explicit missing-value indicators for important identity fields; and (v) ratio- and z-score-style deviations from historical baselines. The full 63-variable inventory, including each concept variable, construction rule, raw variables used, concept group, and operational interpretation, is provided in Appendix A Table A1.
All concept variables were computed causally after sorting by TransactionDT so that each row used only information available before the current transaction was incorporated into any history store. For each transaction, rolling, historical, entity-level, device-level, email-level, product-level, card-level, address-level, and combined-identifier features were computed first, and the corresponding history stores were updated only after feature extraction for that row was complete. Entity-level counts were implemented as prior cumulative counts, previous values were obtained through one-step shifts, and historical means and variances were computed by subtracting the current transaction amount from cumulative sums before normalization. Rolling means and standard deviations were computed from shifted histories, so the current transaction was excluded from its own rolling window. The 63 concepts were selected to satisfy three practical criteria: they had to be computable causally from information available before the current transaction, interpretable as an operational fraud-screening signal, and assignable to one of the five predefined behavioral groups rather than being included solely because of downstream test performance.
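To make the "subtract the current transaction" and "shifted history" rules concrete, the following sketch computes a prior count, a prior mean, and a shifted rolling mean for a time-sorted sequence. It is an illustration under stated assumptions, not the released implementation: the function names are hypothetical, inputs are plain lists already sorted by TransactionDT, and an empty history is mapped to 0.0, a convention the text does not specify.

```python
from typing import Hashable, List, Sequence, Tuple


def prior_count_and_mean(amounts: Sequence[float],
                         entities: Sequence[Hashable]) -> Tuple[List[int], List[float]]:
    """Prior count and prior mean per row, excluding the row's own amount."""
    running_n, running_sum = {}, {}
    incl_n, incl_sum = [], []
    for amount, entity in zip(amounts, entities):
        # Inclusive cumulative statistics (they contain the current row).
        running_n[entity] = running_n.get(entity, 0) + 1
        running_sum[entity] = running_sum.get(entity, 0.0) + amount
        incl_n.append(running_n[entity])
        incl_sum.append(running_sum[entity])
    # Remove the current row before normalizing, so only the past remains.
    prior_n = [n - 1 for n in incl_n]
    prior_mean = [(s - a) / max(1, n - 1)
                  for s, a, n in zip(incl_sum, amounts, incl_n)]
    return prior_n, prior_mean


def shifted_rolling_mean(amounts: Sequence[float], window: int) -> List[float]:
    """Mean of the previous `window` amounts; the current amount is excluded."""
    out = []
    for i in range(len(amounts)):
        past = amounts[max(0, i - window):i]
        out.append(sum(past) / len(past) if past else 0.0)  # 0.0 is an assumed default
    return out
```

For amounts `[10, 20, 30, 40]` with entities `["a", "b", "a", "a"]`, the first transaction of each entity has a prior count of zero, and the third row's prior mean is 10 because only the earlier amount of entity `a` is visible to it.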

2.4. Teacher and Student Models

The final study uses two black-box reference baselines. XGBoost serves as the teacher for raw-feature ranking, soft-target construction, and the deployment-oriented SHAP comparison, whereas CatBoost is used as an additional predictive ceiling. The interpretable candidates consisted of a sparse linear student, a concept-only EBM, raw-only EBMs using teacher-selected top-k raw variables, hybrid EBMs combining the concept bank with those top-k raw variables, and a RuleFit baseline trained on the same compact feature set used for the final additive deployment candidate. Top-k raw variables were selected separately for each seed from the XGBoost teacher trained on the training partition only. We used the teacher’s built-in feature_importances_ values, sorted them in descending order, removed concept-bank variables from the ranking, and retained the highest-ranked raw variables for each k. No validation or test labels were used in raw-feature selection. Because the ranking was recomputed within each seed, the top-k raw set reflects seed-specific teacher fitting while preserving the same training-only selection rule. We also inspected the seed-wise top-k raw-feature lists and report the final selected raw variables in Supplementary File S1; therefore, the main manuscript interprets top-k as a compact teacher-guided raw-feature budget rather than as a fixed universal feature subset.
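The top-k selection rule can be illustrated as follows. This is a sketch of the ranking logic only: it takes a name-to-importance mapping as input (in practice this would come from the fitted teacher's `feature_importances_`), and both the helper name `select_top_k_raw` and the importance values in the usage example are hypothetical.

```python
from typing import Dict, List, Set


def select_top_k_raw(importances: Dict[str, float],
                     concept_names: Set[str], k: int) -> List[str]:
    """Rank features by teacher importance, drop concept variables, keep top-k raw."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    raw_ranked = [name for name, _ in ranked if name not in concept_names]
    return raw_ranked[:k]


# Hypothetical importance values, for illustration only.
example_importances = {
    "C1": 0.30,
    "c_device_prev_count": 0.25,
    "C14": 0.20,
    "TransactionAmt": 0.10,
    "dt_day": 0.05,
}
example_concepts = {"c_device_prev_count", "dt_day"}
example_top2 = select_top_k_raw(example_importances, example_concepts, k=2)
```

Filtering concept variables out of the ranking before truncation means the raw budget k is spent entirely on residual raw signal, since the concept bank enters the hybrid model in full regardless of its importance rank.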
All EBM students were implemented as additive regression-style EBMs trained on a soft target that combines the hard fraud label with the XGBoost teacher probability, as formalized in Equation (10), with α_EBM = 0.35 in the submitted baseline experiments. We used a regression-style EBM because the target is a continuous teacher-guided probability-like signal rather than a binary label alone. This formulation allows the additive student to approximate the teacher-guided probability surface directly while preserving univariate shape functions. The fitted additive scores were clipped to [0, 1] only as a bounded-probability projection before probability-based metrics were computed. We did not apply Platt scaling or isotonic calibration in the main pipeline because the goal was to evaluate the native additive student and its direct explanation cost, not to optimize a separate post hoc calibration layer. Classification EBMs and post hoc calibration methods remain useful alternatives and are discussed as future extensions.
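A minimal sketch of the soft-target construction and the clipping step is given below. It follows the blend described in the text with α_EBM = 0.35; the function names `soft_target` and `clip01` are hypothetical, and plain Python lists stand in for the arrays a real pipeline would use.

```python
from typing import List, Sequence


def soft_target(y: Sequence[int], p_teacher: Sequence[float],
                alpha: float = 0.35) -> List[float]:
    """Blend the hard label with the teacher probability: (1 - alpha) * y + alpha * p."""
    return [(1.0 - alpha) * yi + alpha * pi for yi, pi in zip(y, p_teacher)]


def clip01(scores: Sequence[float]) -> List[float]:
    """Project raw additive regression scores onto [0, 1]."""
    return [min(1.0, max(0.0, s)) for s in scores]
```

With α = 0.35, a fraud row that the teacher scores at 0.8 receives a target of 0.65 + 0.28 = 0.93, while a legitimate row scored at 0.2 receives 0.07, so the student is pulled toward the teacher's ranking without losing the label signal.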

2.5. Metrics and Computational Profile

We report PR-AUC, ROC-AUC, F1 at the validation-derived threshold, expected calibration error (ECE, 15 bins), the Brier score, recall at false-positive rates of 0.1%, 0.5%, and 1.0%, and precision among the top 1% highest-risk transactions. Given the severe class imbalance, PR-AUC is treated as the primary ranking metric, following prior work on ROC/PR interpretation and imbalanced evaluation [32,33]. Calibration is reported because threshold choice and downstream risk ranking depend on the reliability of predicted probabilities [34,35,36]. The Brier score is added as a complementary strictly proper scoring metric so that probability quality is not judged by binned ECE alone.
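Two of the threshold-dependent metrics can be sketched directly. The code below is illustrative and assumes a simple exhaustive search over observed validation scores for the F1 threshold, which is quadratic in the number of rows and therefore only suitable for small examples; the paper does not state how the threshold search was implemented. The function names are hypothetical.

```python
from typing import Sequence


def best_f1_threshold(y_valid: Sequence[int], p_valid: Sequence[float]) -> float:
    """Return the validation score threshold that maximizes F1 (exhaustive search)."""
    best_tau, best_f1 = 0.5, -1.0
    for tau in sorted(set(p_valid)):
        tp = sum(1 for y, p in zip(y_valid, p_valid) if p >= tau and y == 1)
        fp = sum(1 for y, p in zip(y_valid, p_valid) if p >= tau and y == 0)
        fn = sum(1 for y, p in zip(y_valid, p_valid) if p < tau and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau


def precision_at_top(y: Sequence[int], p: Sequence[float], frac: float = 0.01) -> float:
    """Fraction of true frauds among the top `frac` highest-scored rows."""
    k = max(1, int(len(y) * frac))
    top = sorted(range(len(y)), key=lambda i: p[i], reverse=True)[:k]
    return sum(y[i] for i in top) / k
```

The threshold is chosen on validation scores and then applied unchanged to the test split, so the reported F1 reflects a decision rule that could have been fixed before test data were seen.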
In addition to predictive quality, we measure the mean fit time, mean prediction time on the test split, feature count, and local-explanation latency. The runtime table reports the main three-seed out-of-time evaluation. The deployment-oriented explanation-cost benchmark compares XGBoost + SHAP against the native local explanations of the final hybrid EBM.
Implementation details. The main evaluation was run over three seeds (42, 43, and 44). XGBoost used 3000 estimators, learning rate 0.035, maximum depth 7, histogram tree construction, and CUDA device mode. CatBoost used 3000 iterations, learning rate 0.035, depth 7, L2 leaf regularization 3.0, and GPU mode. The EBM students used outer_bags = 8, max_rounds = 2000, learning_rate = 0.03, max_bins = 256, and n_jobs = 1. All experiments were run in Python 3.11 on a workstation equipped with an NVIDIA GeForce RTX 4060 GPU (NVIDIA Corporation, Santa Clara, CA, USA). GPU mode was used for XGBoost and CatBoost when available, with automatic CPU fallback if GPU execution failed; EBM and RuleFit components were executed on CPU. To support reproducibility, Supplementary File S1 includes the execution script, environment requirements with package versions, seed-wise summaries, configuration files, feature-group definitions, strict-holdout outputs, and final figure files used in the manuscript. The original IEEE-CIS data are not redistributed and must be obtained from the official competition source under its terms of use.

2.6. Formal Problem Definition and Equation Summary

To make the modeling assumptions explicit and the paper easier to reproduce, this subsection formalizes the prediction task, the causal concept-construction rules, the additive student score, the teacher-guided soft target used by the EBM students, and the deployment-oriented evaluation metrics used throughout the study. Let $x_i$, $y_i$, and $t_i$ denote the merged transaction–identity representation, binary fraud label, and transaction time index, respectively.
$$D = \{x_i, y_i, t_i\}_{i=1}^{N}, \qquad y_i \in \{0, 1\}. \tag{1}$$
The chronological evaluation used in this study can be expressed through the following split operators.
$$T_{\mathrm{train}} = \{i : t_i \le c_1\}, \qquad T_{\mathrm{valid}} = \{i : c_1 < t_i \le c_2\}, \qquad T_{\mathrm{test}} = \{i : t_i > c_2\}. \tag{2}$$
Here, $c_1$ and $c_2$ are chronological cut points for the training/validation and validation/test boundaries, respectively. For the stricter robustness setting, a pseudo-entity key is built from a small set of quasi-identifying fields and used to remove overlaps between the training entities and the later splits.
$$e_i = \mathrm{concat}\left(c_{i,1},\, c_{i,2},\, a_{i,1},\, p_i^{\mathrm{email}}\right). \tag{3}$$
The four arguments in Equation (3) correspond to card1, card2, addr1, and P_emaildomain. The final hybrid student uses an additive score that combines interpretable concept terms with a small set of teacher-selected raw variables.
$$s_{\mathrm{hyb}}(x) = \beta_0 + \sum_{j \in C} f_j(x_j) + \sum_{r \in R_k} g_r(x_r). \tag{4}$$
Here, $C$ denotes the concept-feature set, $R_k$ denotes the top-k teacher-selected raw-feature set, and $f_j$ and $g_r$ are the learned univariate additive shape functions. The reported fraud probability is obtained by clipping the additive EBM regression score to [0, 1], consistent with the regression-style EBM implementation.
$$\hat{p}_{\mathrm{hyb}}(x) = \mathrm{clip}\left(s_{\mathrm{hyb}}(x),\, 0,\, 1\right). \tag{5}$$
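The additive score and its clipped probability lend themselves to a direct local explanation, because each term's contribution is simply the value of its shape function at the observed input. The sketch below assumes that each shape function is stored as a piecewise-constant lookup table over bin edges, which is how EBM-style models typically represent them; the bin edges, scores, and intercept shown are hypothetical, and missing-value handling is omitted.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# A term is (bin_edges, bin_scores) with len(bin_scores) == len(bin_edges) + 1.
Term = Tuple[List[float], List[float]]


def term_contribution(bin_edges: List[float], bin_scores: List[float],
                      value: float) -> float:
    """Look up the piecewise-constant shape-function value for one feature."""
    return bin_scores[bisect_right(bin_edges, value)]


def explain(intercept: float, terms: Dict[str, Term],
            x: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Return the clipped probability and the per-term score contributions."""
    contributions = {name: term_contribution(edges, scores, x[name])
                     for name, (edges, scores) in terms.items()}
    score = intercept + sum(contributions.values())
    return min(1.0, max(0.0, score)), contributions


# Hypothetical shape functions for one concept term and one raw term.
example_terms = {
    "c_device_prev_count": ([1.0, 5.0], [0.10, 0.02, -0.01]),
    "C1": ([3.0], [-0.02, 0.20]),
}
example_prob, example_contribs = explain(
    0.03, example_terms, {"c_device_prev_count": 0.0, "C1": 4.0})
```

The explanation is a by-product of scoring: the same table lookups that produce the prediction also produce the contributions, which is why native additive explanations can be much cheaper than a separate post hoc attribution pass.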
Several concepts are history-based summaries defined causally over past observations belonging to the same entity e. The first two examples are the historical mean and a standardized deviation score.
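$$\mu_{e,i} = \frac{1}{\max(1,\, n_{e,i})} \sum_{\tau < i} \mathbf{1}[e_\tau = e_i]\, a_\tau, \qquad n_{e,i} = \sum_{\tau < i} \mathbf{1}[e_\tau = e_i]. \tag{6}$$
$$z_{e,i} = \frac{a_i - \mu_{e,i}}{\hat{\sigma}_{e,i} + \epsilon}. \tag{7}$$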
Cross-entity reuse is designed to quantify how often a device has appeared before outside the current entity history.
$$\mathrm{reuse}_{d,i} = \sum_{\tau < i} \mathbf{1}[d_\tau = d_i,\; e_\tau \ne e_i]. \tag{8}$$
Teacher guidance enters the hybrid model in two ways: through top-k raw feature selection using XGBoost-derived feature importance and through the soft target used to train the additive EBM students.
$$R_k = \mathrm{TopK}\left(I^{T},\, k\right). \tag{9}$$
Here, the teacher-derived raw-feature importance ranking is used for top-k raw-feature selection, and the soft target combines the hard label with the XGBoost teacher probability.
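$$\tilde{y}_i = (1 - \alpha_{\mathrm{EBM}})\, y_i + \alpha_{\mathrm{EBM}}\, p_i^{T}, \qquad L_{\mathrm{EBM}} = \sum_i \left(\hat{p}_i^{S} - \tilde{y}_i\right)^2. \tag{10}$$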
Finally, calibration and low-FPR behavior are summarized through expected calibration error, the Brier score, and a quantile-based low-FPR recall metric, where $P$ and $N$ denote the positive and negative evaluation subsets, respectively. The Brier score is computed as the mean squared probability error, $n^{-1}\sum_i (p_i - y_i)^2$, and is used as a complementary strictly proper scoring rule so that calibration is not judged by binned ECE alone. In Equations (11) and (12), $B_b$ denotes the b-th calibration bin, $n_b = |B_b|$ is the number of samples in that bin, $n$ is the total number of evaluated samples, $\mathrm{acc}(B_b)$ and $\mathrm{conf}(B_b)$ are the empirical accuracy and mean confidence of the bin, and the quantile term $q_{1-\gamma}$ in Equation (12) denotes the negative-class score quantile corresponding to the target false-positive-rate level $\gamma$.
$$\mathrm{ECE}_B = \sum_{b=1}^{B} \frac{|B_b|}{n} \left|\mathrm{acc}(B_b) - \mathrm{conf}(B_b)\right|. \tag{11}$$
$$\mathrm{Recall@FPR}_{\gamma} = \frac{1}{|P|} \sum_{i \in P} \mathbf{1}\left[p_i \ge q_{1-\gamma}\left(\{p_j : j \in N\}\right)\right]. \tag{12}$$
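The three probability-quality metrics can be sketched as below. Several conventions here are assumptions of this illustration rather than statements about the authors' code: equal-width bins on [0, 1], bin accuracy taken as the observed fraud rate in the bin, and an empirical quantile defined as the ⌈(1 − γ)·|N|⌉-th smallest negative score.

```python
import math
from typing import Sequence


def ece(y: Sequence[int], p: Sequence[float], n_bins: int = 15) -> float:
    """Expected calibration error with equal-width bins (assumed binning scheme)."""
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y, p):
        bins[min(int(pi * n_bins), n_bins - 1)].append((yi, pi))
    total = 0.0
    for members in bins:
        if not members:
            continue
        acc = sum(yi for yi, _ in members) / len(members)   # observed fraud rate
        conf = sum(pi for _, pi in members) / len(members)  # mean predicted score
        total += len(members) / len(y) * abs(acc - conf)
    return total


def brier(y: Sequence[int], p: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and label."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def recall_at_fpr(y: Sequence[int], p: Sequence[float], gamma: float) -> float:
    """Recall of positives at the (1 - gamma) quantile of negative scores."""
    negatives = sorted(pi for yi, pi in zip(y, p) if yi == 0)
    positives = [pi for yi, pi in zip(y, p) if yi == 1]
    # Assumed quantile convention: ceil((1 - gamma) * |N|)-th smallest negative score.
    idx = max(0, min(len(negatives) - 1, math.ceil((1 - gamma) * len(negatives)) - 1))
    threshold = negatives[idx]
    return sum(1 for pi in positives if pi >= threshold) / len(positives)
```

Reporting ECE and the Brier score side by side is useful because ECE depends on the binning choice, whereas the Brier score penalizes every prediction individually and also reflects sharpness.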

2.7. Algorithmic Summary

Algorithms 1 and 2 summarize the causal concept-generation pipeline and the final hybrid EBM training and evaluation routine. They are not meant to replace the implementation details; rather, they provide a compact procedural map aligned with the experimental protocol reported in this paper.
Algorithm 1. Causal concept-bank construction for the ordered IEEE-CIS stream.
Input: merged table M sorted by TransactionDT; entity key e_i; amount a_i; device d_i
Output: concept bank C with temporal, history, reuse, missingness, and ratio features
1: initialize empty history stores H_entity, H_device, H_email, H_product, H_card, H_address, H_card_address, and H_device_email
2: for each row i in temporal order do
3:   read current entity e_i and raw fields for row i
4:   compute temporal concepts from TransactionDT and previous entity timestamps
5:   compute entity-history statistics from H_entity[e_i] before inserting row i
6:   compute device-, email-, product-, card-, address-, and combined-identifier reuse terms from their existing history stores
7:   compute missingness flags and deviation ratios using only information already available for row i and prior histories
8:   append all concept values to C for row i
9:   update H_entity, H_device, H_email, H_product, H_card, H_address, H_card_address, and H_device_email with row i only after feature extraction is complete
10: end for
11: return C
The read-before-update rule is applied consistently across entity-, device-, email-, product-, card-, address-, and combined-identifier histories, preventing the current transaction from contributing to its own features.
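The read-before-update rule of Algorithm 1 can be sketched for a reduced set of history stores as follows. This is a simplified illustration, not the full 63-variable pipeline: it tracks only entity and device histories, the dictionary keys `entity` and `device` are hypothetical input names, and the output names other than the device-reuse and device-count concepts mentioned in the paper are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List


def build_concepts(rows: Iterable[Dict[str, Hashable]]) -> List[Dict[str, int]]:
    """Compute causal reuse concepts; rows are assumed sorted by TransactionDT."""
    entity_count = defaultdict(int)        # entity -> number of prior rows
    device_count = defaultdict(int)        # device -> number of prior rows
    device_entity_count = defaultdict(int) # (device, entity) -> prior rows
    entity_devices = defaultdict(set)      # entity -> devices seen so far
    concepts = []
    for row in rows:
        e, d = row["entity"], row["device"]
        # Read phase: every feature uses only the stores as they were before this row.
        concepts.append({
            "entity_prev_count": entity_count[e],
            "c_device_prev_count": device_count[d],
            # Prior uses of this device by other entities (cross-entity reuse).
            "c_cross_entity_reuse_device": device_count[d] - device_entity_count[(d, e)],
            "new_device_for_entity": int(d not in entity_devices[e]),
        })
        # Update phase: the current row enters the stores only after extraction.
        entity_count[e] += 1
        device_count[d] += 1
        device_entity_count[(d, e)] += 1
        entity_devices[e].add(d)
    return concepts
```

Computing cross-entity reuse as the total prior device count minus the prior count within the same entity gives the number of earlier rows that share the device but belong to a different entity, without rescanning the history.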
Algorithm 2. Final hybrid EBM training and evaluation under chronological and strict protocols.
Input: full feature matrix X, concept bank C, labels y, top-k value k
Output: trained hybrid EBM, clipped probability scores, test metrics, calibration plots, shape functions, and explanation-cost summaries
1: split X and y into chronological train/valid/test partitions
2: optionally filter valid/test rows that share pseudo-entity keys with train
3: fit the teacher XGBoost model on the train partition
4: rank raw variables by teacher importance and keep the top-k set R_k
5: build hybrid design matrix Z = [C, R_k] for train/valid/test
6: compute soft training targets y_tilde_train = (1 − alpha_EBM)y_train + alpha_EBM p^T_train
7: fit regression-style Explainable Boosting Machine on Z_train and y_tilde_train
8: obtain validation and test scores by clipping additive EBM outputs to [0, 1]; derive tau*_F1 from validation scores
9: evaluate PR-AUC and ROC-AUC threshold-free; evaluate F1 at tau*_F1; compute Recall@low-FPR and Precision@Top1% on Z_test
10: export global importance, PR/calibration curves, representative shape functions, explanation-cost diagnostics, runtime, feature count, and ablation results
The same protocol is reused for raw-only and concept-only EBM baselines so that every comparison differs only in the input design matrix.

3. Results

3.1. Main Comparison on the Out-of-Time Split

Table 1 reports the main three-seed out-of-time comparison across black-box references, additive students, and a rule-based interpretable baseline. CatBoost slightly outperformed XGBoost and therefore became the strongest predictive ceiling in the main evaluation (PR-AUC 0.489 ± 0.001 vs. 0.478 ± 0.003). Within the interpretable family reported in Table 1, the concept-only EBM remained too restrictive, raw-only EBMs recovered a substantial share of teacher signal, and the hybrid EBM variants achieved the strongest additive ranking performance, with top-k = 8 and top-k = 12 both around PR-AUC 0.407. RuleFit did not outperform the hybrid in this three-seed evaluation and showed larger variance across seeds.
These results indicate that the hybrid EBM should be interpreted as the strongest additive and deployment-oriented interpretable model among the evaluated student models, rather than as a universal winner over all possible interpretable approaches. At the same time, the hybrid consistently improved on the corresponding raw-only EBM by about +0.035 PR-AUC at top-k = 8 and about +0.025 PR-AUC at top-k = 12, showing that concept engineering adds non-trivial signal beyond a compact carry-over of top-ranked raw variables.

3.2. Top-k Ablation and Final Model Selection

Table 2 reports the three-seed hybrid top-k ablation. The top-k = 4 variant is clearly underpowered. Performance rises sharply from 4 to 8 and then plateaus across 8–16. The highest mean PR-AUC occurs at top-k = 16 (0.408 ± 0.003), but the margin over top-k = 8 is only about 0.002 and comes with additional raw variables and slightly weaker low-FPR and top-1% precision behavior. The top-k = 8 model is therefore retained as the primary deployment-oriented additive model because it is compact, near the performance plateau, and has the strongest Recall@0.1% FPR and Precision@Top1% among the top-k variants. The larger top-k values are treated as ranking-sensitivity checks rather than the preferred deployment configuration.

3.3. Robustness Under Strict Pseudo-Entity Holdout

Table 3 reports the stricter pseudo-entity-disjoint robustness evaluation across the same main model families as the primary comparison. These values are interpreted separately from the main three-seed chronological out-of-time results reported in Table 1 because the strict protocol removes repeated pseudo-entity overlap and substantially reduces the validation and test sets. Under this stricter protocol, CatBoost remains the strongest predictive ceiling (PR-AUC 0.487 ± 0.005), while XGBoost reaches 0.468 ± 0.002. Among the additive students, the hybrid EBM remains above the matched raw-only additive baseline at top-k = 8 (0.399 ± 0.001 vs. 0.371 ± 0.004 PR-AUC), and the top-k = 12 hybrid gives a similar result (0.400 ± 0.005 PR-AUC). RuleFit also performs strongly under this strict test (0.431 ± 0.010 PR-AUC), but it remains less favorable in the main evaluation because of its larger seed variance and much higher fit-time cost. The Brier-score column in Table 3 further confirms a conservative calibration interpretation: CatBoost and XGBoost have lower Brier scores than the hybrid EBM and RuleFit under this stress test, so the strict calibration evidence is not used to claim probability-quality dominance for the hybrid. The strict-holdout results therefore support the main conclusion that concept–raw fusion provides additional additive signal while making clear that black-box references remain the predictive ceilings.

3.4. Interpretability, Calibration, and Error Analysis

Figure 1 shows that the XGBoost teacher still dominates the full precision–recall curve, although the final top-k = 8 hybrid remains competitive in the high-precision/low-recall portion of the curve. Figure 2 presents the calibration curves for the XGBoost teacher and the final hybrid model. In the main three-seed out-of-time evaluation, the hybrid EBM top-k = 8 achieved ECE-15 0.01587, which is close to the XGBoost teacher (0.01611) and better than RuleFit (0.02669). CatBoost achieved the strongest ECE-15 among the black-box references (0.00989), although the raw-only EBM had the lowest ECE-15 overall despite weaker ranking quality. The Brier-score comparison reported in the calibration table gives a more conservative view: the hybrid Brier score (0.02656) is higher than both black-box references but lower than RuleFit, so the hybrid should be described as close to XGBoost on ECE but not as matching the black-box probability-quality ceiling across all calibration metrics.
Figure 3 shows the global importance profile of the final hybrid model. The leading terms include two anonymized raw variables (C1 and C14) together with interpretable features such as cross-entity device reuse, prior device frequency, calendar position, product history, and email amount statistics. C1 and C14 remain the dominant residual raw signals, while c_cross_entity_reuse_device, c_device_prev_count, and dt_day are the leading concept terms. Because C1 and C14 remain anonymized in the IEEE-CIS benchmark, the final hybrid EBM should not be interpreted as fully semantically interpretable. Rather, it is additively transparent and partially semantically interpretable: concept variables support behavioral interpretation, while retained anonymized raw variables provide auditable shape functions and score contributions without full domain semantics.
Figure 4 presents representative shape functions of the final hybrid EBM. The two interpretable concept variables, c_cross_entity_reuse_device and c_device_prev_count, expose auditable patterns related to reuse and historical activity, while C1 and C14 show how a compact set of retained anonymized raw variables preserves residual benchmark signal within the additive structure. These raw-variable shapes are transparent at the level of score contribution and monotonic/non-monotonic response, but not at the level of direct financial semantics, because the benchmark intentionally anonymizes those fields.

3.5. Raw-Only Versus Hybrid Additive Modeling

To isolate the contribution of the concept layer, we compared raw-only and hybrid additive models directly. In the main three-seed out-of-time evaluation, the hybrid improved on the corresponding raw-only EBM by +0.0345 PR-AUC at top-k = 8 (0.4066 vs. 0.3721) and by +0.0246 PR-AUC at top-k = 12 (0.4074 vs. 0.3828). Under the stricter robustness evaluation reported in Table 3, the gain remained positive at approximately +0.027 PR-AUC at top-k = 8 and +0.026 PR-AUC at top-k = 12. These margins support the claim that concept engineering contributes signal beyond a small carry-over of top-ranked raw variables.

3.6. Hybrid Concept-Group Ablation

Table 4 reports concept-group ablations for the final top-k = 8 hybrid EBM in the main out-of-time experiment (mean ± std over three seeds). Removing the missingness concepts reduced PR-AUC from 0.4066 ± 0.0034 to 0.3982 ± 0.0045, and removing the relation group reduced it to 0.3981 ± 0.0032. By contrast, removing the time concepts slightly increased PR-AUC to 0.4100 ± 0.0024. Within the main out-of-time evaluation, this ablation therefore suggests that missingness and relation concepts provide incremental value, whereas the current time block may contain redundant or weakly aligned signals once the other concept groups and the selected raw variables are included.

3.7. Computational Profile

Table 5 summarizes the computational profile for the main out-of-time experiment. CatBoost and XGBoost train quickly on the available GPU but require the widest raw input representation. The sparse linear student and raw-only additive baselines are included for completeness, while the concept-rich EBMs provide the main interpretable additive candidates. The final hybrid EBM uses 71 input variables on average, fits in about 659.3 s, and predicts the test split in about 0.217 s in the three-seed summary. RuleFit, by contrast, required about 5083.5 s to fit and 1.273 s to score the test split, reinforcing its weaker deployment profile despite being structurally interpretable.
A deployment-oriented comparison against the XGBoost teacher makes the trade-off easier to interpret. Averaged over the three-seed deployment summaries, the XGBoost teacher achieved PR-AUC 0.4782, whereas the final hybrid top-k = 8 achieved 0.4066. The absolute offline fitting gap was about 10.47 min rather than an online latency penalty. At inference time, the hybrid was about 9.66× faster than the XGBoost teacher (0.217 s vs. 2.094 s), used 71 rather than 154 input variables (53.9% reduction), and remained close in ECE-15 (0.01587 vs. 0.01611), although its Brier score was higher (0.02656 vs. 0.02418). These numbers support the intended framing: the XGBoost teacher is the black-box performance ceiling, while the hybrid is a smaller, additively transparent, faster-inference candidate with explicit shape functions and native local explanations.
The explanation-cost benchmark reinforces that interpretation. Teacher-side local explanations required a separate SHAP step, whereas the hybrid EBM produced native additive explanations through explain_local(). Averaged over three seeds, XGBoost + SHAP required about 439.1 ms for one case, 1.797 ms per row for 100 cases, and 0.4896 ms per row for 500 cases. For the hybrid EBM, the corresponding local-explanation costs were about 3.85 ms for one case, 0.0525 ms per row for 100 cases, and 0.0247 ms per row for 500 cases. Depending on batch size, the additive deployment model therefore reduced explanation latency by roughly 19.8× to 114× relative to the teacher plus post hoc SHAP.
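As a quick arithmetic check, the deployment ratios quoted in the two preceding paragraphs can be re-derived from the reported three-seed averages. The sketch below uses only figures stated in the text; the variable names are illustrative, and because the inputs are already rounded, the recomputed ratios may differ from the quoted ones in the last digit.

```python
# Values copied from the text (three-seed averages). Ratios are recomputed
# from these rounded figures, so the last digit can differ slightly.
xgb_predict_s, ebm_predict_s = 2.094, 0.217
xgb_inputs, ebm_inputs = 154, 71

inference_speedup = xgb_predict_s / ebm_predict_s
input_reduction_pct = 100.0 * (xgb_inputs - ebm_inputs) / xgb_inputs

# Per-row local-explanation latency in milliseconds, keyed by batch size.
shap_ms = {1: 439.1, 100: 1.797, 500: 0.4896}
ebm_ms = {1: 3.85, 100: 0.0525, 500: 0.0247}
explain_speedup = {n: shap_ms[n] / ebm_ms[n] for n in shap_ms}

print(f"inference speedup: {inference_speedup:.2f}x")
print(f"input reduction: {input_reduction_pct:.1f}%")
for n, ratio in explain_speedup.items():
    print(f"explanation speedup at batch size {n}: {ratio:.1f}x")
```

The smallest explanation-side ratio occurs at the largest batch size and the largest at a single case, which matches the stated range of roughly 19.8× to 114×.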

Main Calibration Metric and Soft-Target Sensitivity Checks

To address calibration beyond binned ECE and to test whether the soft-target weight drives the main conclusion, we added two main out-of-time revision-support summaries. Table 6 reports ECE-15 together with the Brier score. The hybrid EBM remains close to XGBoost in ECE-15, but its Brier score is higher than both black-box references and lower than RuleFit, supporting a bounded calibration claim rather than a claim of full probability-quality parity.
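To make the difference between the two calibration metrics concrete, the following minimal sketch computes a binned ECE and the Brier score on a toy example. It assumes equal-width probability bins with 15 bins for ECE-15; this section does not state the binning scheme used in the study, so that choice, the function names, and the toy data are assumptions for illustration only.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=15):
    """Equal-width binned ECE (assumed scheme): weighted mean of
    |observed positive rate - mean predicted probability| over non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        confidence = sum(p for _, p in members) / len(members)
        ece += (len(members) / total) * abs(observed - confidence)
    return ece


# Toy data, not taken from the paper.
labels = [0, 0, 1, 1]
probs = [0.1, 0.2, 0.8, 0.9]
toy_brier = brier_score(labels, probs)
toy_ece = expected_calibration_error(labels, probs)
print(toy_brier, toy_ece)
```

Because ECE averages absolute gaps within bins while the Brier score penalizes every row's squared error, two models can be close on one metric and clearly separated on the other, which is the pattern reported for the hybrid EBM and XGBoost.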
Table 7 reports the alpha_EBM sensitivity check for the final top-k = 8 hybrid design. PR-AUC varies only from 0.4061 to 0.4072 across alpha values from 0.00 to 0.75, while ECE-15 and the Brier score change only marginally. The submitted alpha_EBM = 0.35 setting therefore remains competitive, and the core conclusion does not depend on a narrow alpha choice. The highest mean PR-AUC occurs at alpha_EBM = 0.75, but the margin over 0.35 is approximately 0.0006 and is not large enough to change the model-selection argument.
These calibration and sensitivity results complement the broader runtime and explanation-cost findings. The hybrid EBM is not the strongest ranking model, but it is the only additive student in this evaluation that simultaneously preserves a substantial fraction of teacher discrimination, improves consistently on the raw-only additive baselines, remains competitive across the tested soft-target weights, and provides cheap native local explanations without an external post hoc explainer.

3.8. Expanded Visual Diagnostics and Design Summary

To deepen the interpretability narrative and provide a visually grounded account of the method, this section presents a design-summary block that complements the numerical tables. These figures do not replace the main results; instead, they make the model design, evaluation logic, operating-region behavior, and additive trade-offs easier to inspect at a glance.

3.8.1. Workflow, Concept Taxonomy, and Evaluation Design

Figure 5 summarizes the full pipeline from IEEE-CIS data ingestion to causal concept generation, XGBoost-guided raw-feature selection and soft-target construction, additive student training, and final benchmark-based deployment-relevant evaluation. The figure separates the roles of the two black-box baselines: XGBoost provides raw-feature ranking and the SHAP reference path, whereas CatBoost is retained as the predictive ceiling. Figure 6 organizes the 63-variable concept bank into five semantic groups using representative examples; the complete variable inventory with construction rules and operational interpretations is provided in Appendix A Table A1. Figure 7 then illustrates the two evaluation settings used throughout the paper: the primary out-of-time split and the stricter pseudo-entity-disjoint protocol.
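The fusion step in this pipeline can be sketched as follows, assuming that the hybrid input is the concept bank plus the top-k raw variables ranked by a teacher importance score. The function name, the placeholder column names, and the importance values are hypothetical; only the counts (63 concepts, top-k = 8, 71 inputs in total) come from the text.

```python
def build_hybrid_columns(concept_columns, raw_importance, top_k):
    """Return the concept columns followed by the top_k raw columns,
    ranked by a teacher-derived importance score (higher is better)."""
    concept_set = set(concept_columns)
    ranked_raw = sorted(raw_importance, key=raw_importance.get, reverse=True)
    selected_raw = [name for name in ranked_raw if name not in concept_set][:top_k]
    return list(concept_columns) + selected_raw


# Placeholder names and made-up scores, for illustration only.
concept_columns = [f"concept_{i:02d}" for i in range(63)]
raw_importance = {f"raw_{i:03d}": 1.0 / (i + 1) for i in range(154)}

hybrid_columns = build_hybrid_columns(concept_columns, raw_importance, top_k=8)
print(len(hybrid_columns))
```

With 63 concept variables and top-k = 8, the resulting input width is 71, which is consistent with the input count reported for the final hybrid EBM.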

3.8.2. Comparative Views of Performance, Complexity, and Sensitivity

Figure 8 provides a compact visual summary of the core additive-family comparison rather than the full expanded baseline set. Figure 9 places those same additive models on a performance–complexity plane and shows that the hybrid EBM is materially stronger than the concept-only and raw-only additive baselines while requiring fewer input variables than the XGBoost teacher. Figure 10 visualizes the top-k sensitivity analysis and confirms the same pattern seen in Table 2: a sharp improvement from k = 4 followed by a plateau around k = 8–16. Figure 11 isolates the incremental gain from concept–raw fusion over the raw-only additive baseline under both evaluation protocols. The expanded CatBoost and RuleFit baselines are reported numerically in Tables 1 and 3 and discussed in Sections 3.1, 3.3, and 4.

3.8.3. Low-FPR Operating View and Hybrid Concept-Group Ablation

In practice, fraud analysts often operate in extremely low false-positive-rate regions rather than across the full threshold range. Figure 12 therefore plots recall as a function of FPR for the teacher, the raw-only EBM, and the final hybrid EBM. Although the proposed model does not match the teacher, it remains clearly stronger than the raw-only additive baseline in the low-FPR regime. Figure 13 complements Table 4 by visualizing how concept-group utility can vary across evaluation settings: in the main out-of-time evaluation, missingness and relation concepts remain useful, whereas in the strict pseudo-entity-disjoint holdout, missingness remains important, the time block is comparatively expendable, and the relation block appears regime-dependent because removing it improves strict-holdout PR-AUC. Taken together, the ablations suggest that concept utility is real but partially regime-dependent.
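The low-FPR operating view can be illustrated with a small helper that returns the highest recall achievable under a false-positive-rate budget. This is a minimal sketch with a hypothetical function name and toy data; the FPR grid used in Figure 12 is not stated in this section, so the budgets below are arbitrary.

```python
def recall_at_fpr(y_true, y_score, max_fpr):
    """Highest recall reachable while keeping the false-positive rate <= max_fpr.

    Assumes y_true contains at least one positive and one negative label.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Sweep candidate thresholds from the highest score downward.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    tp = fp = 0
    best_recall = 0.0
    i = 0
    while i < len(ranked):
        # Consume all rows tied at this score so the threshold is well defined.
        score = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows as the threshold is lowered further.
        best_recall = tp / n_pos
    return best_recall


# Toy data, not taken from the paper.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(recall_at_fpr(labels, scores, max_fpr=0.15))
```

Comparing this quantity across models at a fixed, small budget mirrors the kind of comparison described for the teacher, the raw-only EBM, and the hybrid EBM.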

3.9. Evidence Synthesis

Taken together, Tables 1–7 and Figures 1–13 support a nuanced but coherent interpretation of the evidence. Black-box references remain the strongest predictive ceilings, with CatBoost slightly ahead of XGBoost in the main three-seed out-of-time evaluation and again remaining the strongest ceiling under the strict pseudo-entity-disjoint stress test. A pure concept bottleneck is too restrictive, compact raw-only additive baselines are already competitive, and the hybrid EBM repeatedly improves on those matched raw-only baselines in both the main and strict evaluations. RuleFit does not provide a more stable interpretable alternative in the main evaluation because its mean ranking score falls below the hybrid and its variance across seeds is substantially larger; however, the strict holdout adds nuance by showing that RuleFit can achieve a higher strict PR-AUC than the hybrid under a pseudo-entity-disjoint stress test. We therefore treat the strict result as a robustness stress test rather than as the primary model-selection setting, while retaining the bounded conclusion that concept–raw fusion adds useful additive signal over matched raw-only EBM baselines.
The evaluation also follows the same operational axes used to motivate the method. It reports not only ranking-oriented comparisons but also calibration behavior, low-FPR operating diagnostics, concept-group ablations, global importance and representative shape-function views, explanation-side latency, and computational profile summaries. The resulting picture is explicitly multidimensional: CatBoost is the strongest predictive ceiling, XGBoost provides a useful teacher and SHAP reference path, and the final hybrid EBM remains the strongest additive and benchmark-supported deployment-relevant model among the students tested here.

4. Discussion

A central limitation of the proposed hybrid EBM is that it neither matches the black-box references nor establishes dominance over every possible interpretable alternative. The main three-seed out-of-time evaluation confirms this limitation: CatBoost and XGBoost remain the predictive ceilings. The strict pseudo-entity-disjoint evaluation also shows that RuleFit can be competitive in that stress-test regime, even though it is less favorable in the main evaluation because of higher variance and substantially higher fit-time cost. The more relevant scientific question, however, is not whether one interpretable family wins on every axis, but whether additive concept–raw fusion occupies a distinct and operationally useful point on the trade-off surface. Under that framing, the hybrid EBM remains important because it clearly outperforms the matched raw-only additive baselines in both the main and strict evaluations, remains more stable than RuleFit across seeds in the main setting, remains close to the XGBoost teacher in ECE-15 while showing a higher Brier score, and exposes directly auditable shape functions. Nevertheless, because part of the final hybrid relies on anonymized raw variables, its interpretability should be described as additive transparency with partial semantic interpretation rather than full semantic interpretability.
The explanation-cost analysis supports that benchmark-based deployment-relevant interpretation. In the explanation-cost benchmark, XGBoost + SHAP required about 439 ms for a single case and about 0.490 ms per row for batches of 500. By contrast, the hybrid EBM generated native additive local explanations in about 3.85 ms for one case and 0.0247 ms per row for 500 cases. These differences do not diminish the value of the black-box references, but they clarify why a natively explainable additive model may be economically attractive in repeated analyst workflows: it reduces both operational scoring cost and explanation-side latency without requiring a separate post hoc explainer. The current evidence is still benchmark-based and should not be read as live production validation.
A further limitation is that the empirical evaluation is based on a single anonymized benchmark. Although IEEE-CIS is large-scale and widely used, its anonymized fields limit direct financial-domain interpretation and do not replace validation on institution-specific transaction streams. The present results should therefore be interpreted as benchmark evidence for an interpretable deployment design pattern rather than as a complete proof of production readiness across all financial environments. We did not evaluate analyst interaction, live monitoring, drift-control policies, cost-sensitive threshold adaptation, or production feedback loops. A further practical concern is concept drift: fraud strategies, merchant behavior, identity availability, and device reuse patterns can change over time in real financial systems. Although the chronological split and pseudo-entity-disjoint holdout provide stronger benchmark evidence than a random split, they do not replace prospective monitoring, drift detection, threshold recalibration, or periodic model updating in a live environment. From an applied engineering perspective, the proposed hybrid EBM is most suitable for analyst-facing risk-screening workflows in which full black-box accuracy is not the only objective and where transparent score decomposition, fast repeated explanations, and compact input requirements are operationally valuable.

5. Conclusions

This study examined interpretable fraud detection on the IEEE-CIS benchmark as a multi-objective trade-off rather than a search for a single universal winner. In the main three-seed out-of-time comparison, CatBoost and XGBoost remained the black-box ceilings, the concept-only EBM proved too restrictive, and the teacher-guided hybrid EBM emerged as the strongest additive student while consistently outperforming matched raw-only additive baselines. RuleFit did not provide a stable alternative in the main evaluation, showing lower mean PR-AUC and larger seed-wise variance than the hybrid, although the expanded strict-holdout results show that it can outperform the hybrid on strict PR-AUC under a pseudo-entity-disjoint stress test. The final top-k = 8 hybrid further reduced the input dimension from 154 to 71, delivered about 9.7× faster inference than the XGBoost teacher, and produced native local explanations that were much cheaper than XGBoost + SHAP. The results therefore support a practical but bounded conclusion: concept–raw fusion can improve an additive interpretable fraud detector relative to matched raw-only additive baselines while preserving auditable local explanations, but it should be regarded as benchmark-supported deployment-relevant evidence rather than live production validation or universal superiority over all interpretable alternatives.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/app16125809/s1, Supplementary File S1: The reproducibility guide, source code, environment requirements, main chronological out-of-time results, alpha_EBM sensitivity and Brier-score calibration outputs, strict pseudo-entity-disjoint holdout results, seed-wise metric summaries, runtime and explanation-cost outputs, full 63-variable concept inventory, and final figure files supporting the manuscript.

Author Contributions

Conceptualization, J.K. and K.K.; methodology, J.K.; software, J.K.; validation, J.K. and K.K.; formal analysis, J.K.; investigation, J.K.; resources, J.K.; data curation, J.K.; writing—original draft preparation, J.K.; writing—review and editing, J.K. and K.K.; visualization, J.K.; supervision, K.K.; project administration, J.K. and K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Global Copyright Issues Rapid Response (R&D) Program of the Ministry of Culture, Sports and Tourism and the Korea Culture Technology Planning and Evaluation Institute (No. RS-2026-2552393; specialized agency: Korea Creative Content Agency).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The IEEE-CIS Fraud Detection benchmark used in this study is publicly available through Kaggle under the competition terms. The code, configuration settings, derived concept definitions, seed-wise summaries, and result files generated during this study are provided in Supplementary File S1. Redistribution of the original benchmark data remains subject to the terms of the source competition.

Acknowledgments

The authors thank the members of the Intelligent Networks and Security Laboratory for discussions on fraud-detection experiment design and evaluation protocol development. During manuscript preparation, the authors used OpenAI’s ChatGPT (GPT-5.5 Thinking, OpenAI, San Francisco, CA, USA) for language editing, formatting support, and figure-layout refinement. All generated suggestions were reviewed, revised, and validated by the authors, who take full responsibility for the final content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

EBM: Explainable Boosting Machine
ECE: Expected Calibration Error
PR-AUC: Area under the Precision–Recall Curve
ROC-AUC: Area under the Receiver Operating Characteristic Curve
XAI: Explainable Artificial Intelligence

Appendix A. Full 63-Variable Concept Bank Inventory

Table A1 provides the complete 63-variable causal concept bank used in the final experiments. Each row reports the concept variable, construction rule, raw variables used, concept group, and operational interpretation. All history-based variables follow the read-before-update rule described in Section 2.3, so the current transaction is excluded before updating the corresponding history store.
Table A1. Full inventory of the 63-variable causal concept bank.
Concept Variable | Construction Rule/Formula | Raw Variables Used | Concept Group | Operational Interpretation
log_TransactionAmt | log1p of non-negative TransactionAmt. | TransactionAmt | Aggregate deviation/amount scale | Stabilizes the transaction amount scale for additive modeling.
dt_hour | Hour bucket derived from floor(TransactionDT/3600) mod 24. | TransactionDT | Temporal state | Captures within-day transaction timing.
dt_weekday | Weekday bucket derived from floor(TransactionDT/86400) mod 7. | TransactionDT | Temporal state | Captures weekly timing pattern.
dt_day | Day index derived from floor(TransactionDT/86400). | TransactionDT | Temporal state | Captures coarse temporal position in the benchmark stream.
c_entity_amt_prev_count | Prior cumulative count for the pseudo-entity before current row. | card1, card2, addr1, P_emaildomain, TransactionDT | Entity history | Measures entity transaction history depth.
c_entity_amt_delta_t | Time since previous transaction for the same pseudo-entity. | TransactionDT, pseudo-entity key | Temporal state | Measures entity inactivity or rapid recurrence.
c_entity_amt_delta_t_log1p | log1p-transformed positive inter-arrival time for the pseudo-entity. | TransactionDT, pseudo-entity key | Temporal state | Stabilizes inter-arrival time for sparse histories.
c_entity_amt_jump_ratio | Absolute amount change from the previous pseudo-entity transaction divided by previous amount plus one. | TransactionAmt, pseudo-entity key | Aggregate deviation | Captures abrupt amount jumps for the same entity.
c_entity_amt_prev_mean | Prior mean TransactionAmt for the pseudo-entity. | TransactionAmt, pseudo-entity key | Entity history | Summarizes historical spending level of the entity.
c_entity_amt_prev_std | Prior standard deviation of TransactionAmt for the pseudo-entity. | TransactionAmt, pseudo-entity key | Entity history | Summarizes historical amount variability.
c_entity_amt_z | Current amount standardized by prior pseudo-entity mean and standard deviation. | TransactionAmt, pseudo-entity key | Aggregate deviation | Flags transactions deviating from entity history.
c_entity_amt_ratio | Current amount divided by prior pseudo-entity mean plus one. | TransactionAmt, pseudo-entity key | Aggregate deviation | Measures relative amount inflation versus entity history.
c_entity_amt_burstiness | Prior entity count divided by time since previous entity transaction plus one. | TransactionDT, pseudo-entity key | Temporal state | Measures rapid repeated activity by the same entity.
c_entity_amt_roll3_mean_prev | Rolling mean of the previous three entity amounts, using shifted history. | TransactionAmt, pseudo-entity key | Temporal state | Captures short-term historical amount level.
c_entity_amt_roll3_std_prev | Rolling standard deviation of the previous three entity amounts, using shifted history. | TransactionAmt, pseudo-entity key | Temporal state | Captures short-term amount volatility.
c_entity_amt_roll5_mean_prev | Rolling mean of the previous five entity amounts, using shifted history. | TransactionAmt, pseudo-entity key | Temporal state | Captures medium short-term amount level.
c_entity_amt_roll5_std_prev | Rolling standard deviation of the previous five entity amounts, using shifted history. | TransactionAmt, pseudo-entity key | Temporal state | Captures medium short-term amount volatility.
c_entity_amt_to_roll3_ratio | Current amount divided by prior rolling-3 mean plus one. | TransactionAmt, pseudo-entity key | Aggregate deviation | Measures deviation from short-term entity history.
c_entity_amt_to_roll5_ratio | Current amount divided by prior rolling-5 mean plus one. | TransactionAmt, pseudo-entity key | Aggregate deviation | Measures deviation from medium short-term entity history.
c_card_addr_amt_prev_count | Prior count for card1-address combination. | card1, addr1 | Entity history | Measures recurrence of a card-address pair.
c_card_addr_amt_prev_mean | Prior mean amount for card1-address combination. | TransactionAmt, card1, addr1 | Entity history | Summarizes historical amount level for card-address pair.
c_card_addr_amt_prev_std | Prior amount standard deviation for card1-address combination. | TransactionAmt, card1, addr1 | Entity history | Summarizes variability for card-address pair.
c_card_addr_amt_z | Current amount standardized by prior card-address history. | TransactionAmt, card1, addr1 | Aggregate deviation | Detects deviation from card-address baseline.
c_card_addr_amt_ratio | Current amount divided by prior card-address mean plus one. | TransactionAmt, card1, addr1 | Aggregate deviation | Measures relative change versus card-address history.
c_card1_amt_prev_count | Prior count for card1. | card1 | Entity history | Measures card-level recurrence.
c_card1_amt_prev_mean | Prior mean TransactionAmt for card1. | TransactionAmt, card1 | Entity history | Summarizes card-level historical amount.
c_card1_amt_prev_std | Prior TransactionAmt standard deviation for card1. | TransactionAmt, card1 | Entity history | Summarizes card-level amount variability.
c_card1_amt_z | Current amount standardized by prior card1 history. | TransactionAmt, card1 | Aggregate deviation | Flags amount deviation at card level.
c_card1_amt_ratio | Current amount divided by prior card1 mean plus one. | TransactionAmt, card1 | Aggregate deviation | Measures relative amount change at card level.
c_email_amt_prev_count | Prior count for payer email domain. | P_emaildomain | Entity history | Measures email-domain recurrence.
c_email_amt_prev_mean | Prior mean amount for payer email domain. | TransactionAmt, P_emaildomain | Entity history | Summarizes email-domain historical amount.
c_email_amt_prev_std | Prior amount standard deviation for payer email domain. | TransactionAmt, P_emaildomain | Entity history | Summarizes email-domain amount variability.
c_email_amt_z | Current amount standardized by prior email-domain history. | TransactionAmt, P_emaildomain | Aggregate deviation | Flags amount deviation relative to email-domain history.
c_email_amt_ratio | Current amount divided by prior email-domain mean plus one. | TransactionAmt, P_emaildomain | Aggregate deviation | Measures relative amount change for email-domain history.
c_card1_prev_count | Prior occurrence count of card1. | card1 | Reuse/novelty | Captures card reuse frequency.
c_addr1_prev_count | Prior occurrence count of addr1. | addr1 | Reuse/novelty | Captures address-region reuse frequency.
c_email_prev_count | Prior occurrence count of P_emaildomain. | P_emaildomain | Reuse/novelty | Captures payer-domain reuse frequency.
c_device_prev_count | Prior occurrence count of DeviceInfo. | DeviceInfo | Reuse/novelty | Captures device reuse frequency.
c_product_prev_count | Prior occurrence count of ProductCD. | ProductCD | Reuse/novelty | Captures product-code recurrence.
c_card4_prev_count | Prior occurrence count of card4. | card4 | Reuse/novelty | Captures card-network/type recurrence.
c_card6_prev_count | Prior occurrence count of card6. | card6 | Reuse/novelty | Captures card-category recurrence.
c_card_addr_prev_count | Prior occurrence count of card1|addr1 combination. | card1, addr1 | Reuse/novelty | Captures card–address combination reuse.
c_device_email_prev_count | Prior occurrence count of DeviceInfo|P_emaildomain combination. | DeviceInfo, P_emaildomain | Reuse/novelty | Captures device–email combination reuse.
c_entity_product_prev_count | Prior occurrence count of pseudo-entity|ProductCD combination. | pseudo-entity key, ProductCD | Reuse/novelty | Captures product repetition within entity history.
c_new_device_for_entity | Indicator that DeviceInfo has not previously appeared for the pseudo-entity. | DeviceInfo, pseudo-entity key | Reuse/novelty | Flags new device use for an entity.
c_new_email_for_entity | Indicator that P_emaildomain has not previously appeared for the pseudo-entity. | P_emaildomain, pseudo-entity key | Reuse/novelty | Flags new payer-domain use for an entity.
c_new_card_addr_combo | Indicator that card1|addr1 combination is new. | card1, addr1 | Reuse/novelty | Flags novel card–address combination.
c_new_device_email_combo | Indicator that DeviceInfo|P_emaildomain combination is new. | DeviceInfo, P_emaildomain | Reuse/novelty | Flags novel device–email combination.
c_new_hour_for_entity | Indicator that hour bucket is new for the pseudo-entity. | TransactionDT, pseudo-entity key | Reuse/novelty | Flags unusual timing for an entity.
c_new_weekday_for_entity | Indicator that weekday bucket is new for the pseudo-entity. | TransactionDT, pseudo-entity key | Reuse/novelty | Flags unusual weekday pattern for an entity.
c_new_product_for_entity | Indicator that ProductCD is new for the pseudo-entity. | ProductCD, pseudo-entity key | Reuse/novelty | Flags new product category for an entity.
c_cross_entity_reuse_device | Device prior count minus prior count of the same device within current pseudo-entity. | DeviceInfo, pseudo-entity key | Reuse/novelty | Measures whether a device is reused across different entities.
c_cross_entity_reuse_email | Email-domain prior count minus prior count of same payer domain within current pseudo-entity. | P_emaildomain, pseudo-entity key | Reuse/novelty | Measures whether an email domain appears across different entities.
c_identity_missing_ratio | Fraction of identity-like fields missing in the row. | id_*, D*, M*, DeviceInfo, DeviceType, P/R_emaildomain | Missingness | Summarizes identity sparsity intensity.
c_identity_missing_count | Count of missing identity-like fields in the row. | id_*, D*, M*, DeviceInfo, DeviceType, P/R_emaildomain | Missingness | Measures absolute identity-data sparsity.
c_core_missing_count | Count of missing core identity/location/device fields. | DeviceInfo, P/R_emaildomain, addr1, addr2, dist1, DeviceType | Missingness | Captures missingness in operationally important fields.
c_missing_DeviceInfo | Indicator that DeviceInfo is missing. | DeviceInfo | Missingness | Flags absence of device identity.
c_missing_P_emaildomain | Indicator that P_emaildomain is missing. | P_emaildomain | Missingness | Flags absence of payer email domain.
c_missing_R_emaildomain | Indicator that R_emaildomain is missing. | R_emaildomain | Missingness | Flags absence of recipient email domain.
c_missing_addr1 | Indicator that addr1 is missing. | addr1 | Missingness | Flags absence of primary address-region field.
c_missing_addr2 | Indicator that addr2 is missing. | addr2 | Missingness | Flags absence of secondary address-region field.
c_missing_dist1 | Indicator that dist1 is missing. | dist1 | Missingness | Flags absence of distance-related information.
c_missing_DeviceType | Indicator that DeviceType is missing. | DeviceType | Missingness | Flags absence of device-type information.
Note: Shaded rows indicate concept-group headers used to improve readability. The asterisk (*) denotes wildcard prefixes used in anonymized IEEE-CIS variable families, such as id_*, D*, and M*. The table is intentionally detailed for reproducibility: readers can map every concept name appearing in the code and Supplementary File S1 to its construction rule, raw inputs, concept group, and operational interpretation.
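The read-before-update rule that governs the history-based rows of Table A1 can be sketched as a single pass over time-ordered transactions: each feature is read from the history stores first, and only afterwards is the current row written into them. The sketch below covers a handful of Table A1 concepts; the function name, the dictionary row layout, the 'entity' key, and the use of None when no history exists are assumptions for illustration, not the study's implementation.

```python
from collections import defaultdict


def add_history_concepts(rows):
    """Annotate time-ordered rows with prior-only history concepts.

    Each row is a dict with keys 'TransactionDT', 'TransactionAmt',
    'entity' (a hypothetical pseudo-entity key) and 'DeviceInfo'.
    """
    entity_count = defaultdict(int)
    entity_sum = defaultdict(float)
    device_count = defaultdict(int)
    device_entity_count = defaultdict(int)
    out = []
    for row in sorted(rows, key=lambda r: r["TransactionDT"]):
        ent, dev, amt = row["entity"], row["DeviceInfo"], row["TransactionAmt"]
        # READ: features see only transactions processed before this one.
        n_prev = entity_count[ent]
        feats = {
            "dt_hour": (row["TransactionDT"] // 3600) % 24,
            "c_entity_amt_prev_count": n_prev,
            # None marks "no history yet"; the fill value is an assumption.
            "c_entity_amt_prev_mean": entity_sum[ent] / n_prev if n_prev else None,
            "c_device_prev_count": device_count[dev],
            "c_cross_entity_reuse_device": device_count[dev]
            - device_entity_count[(dev, ent)],
        }
        out.append({**row, **feats})
        # UPDATE: only now does the current row enter the history stores.
        entity_count[ent] += 1
        entity_sum[ent] += amt
        device_count[dev] += 1
        device_entity_count[(dev, ent)] += 1
    return out


# Toy transactions, not taken from the benchmark.
toy_rows = [
    {"TransactionDT": 3600, "TransactionAmt": 10.0, "entity": "A", "DeviceInfo": "d1"},
    {"TransactionDT": 7200, "TransactionAmt": 30.0, "entity": "A", "DeviceInfo": "d1"},
    {"TransactionDT": 90000, "TransactionAmt": 5.0, "entity": "B", "DeviceInfo": "d1"},
]
toy_features = add_history_concepts(toy_rows)
```

In the toy data, the third transaction comes from a new entity on a device already seen twice under another entity, so its cross-entity device reuse is 2 while its own entity history is empty.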

Appendix B. Visual and Reproducibility Summary

Appendix B complements the main text by mapping each major claim to the specific figure, table, or diagnostic that supports it. The purpose of this appendix is not to introduce new results, but to make the evidentiary structure of the manuscript explicit by showing how interpretability, calibration, sensitivity, and computational practicality are each supported in the final document.
Table A2. Map from manuscript claims to supporting evidence.
Interpretive Role | Primary Evidence in Manuscript | Component
Summarizes the end-to-end concept–raw fusion pipeline. | Figure 5 | Overall workflow
Explains how 63 concepts were grouped and motivated. | Figure 6; Table A1 | Concept taxonomy
Shows chronological split and strict pseudo-entity holdout. | Figure 7 | Evaluation protocol
Locates the final hybrid model on the performance–interpretability frontier. | Table 1; Figures 8 and 9 | Main comparison/trade-off
Demonstrates that concept augmentation adds measurable signal beyond raw-only additive baselines. | Table 2; Figures 10 and 11 | Sensitivity and raw-to-hybrid gain
Supports bounded probability-quality claims, soft-target robustness, and strict-threshold behavior. | Figures 1, 2, and 12; Tables 6 and 7 | Calibration, low-FPR behavior, and soft-target sensitivity
Shows that the final model is auditable at both the model-wide and per-feature functional levels. | Figures 3 and 4 | Global importance/shape-function views
Connects concept utility to computational practicality. | Tables 4 and 5; Figure 13 | Ablation and runtime

References

  1. Kou, Y.; Lu, C.-T.; Sirwongwattana, S.; Huang, Y.-P. Survey of fraud detection techniques. In Proceedings of the IEEE International Conference on Networking, Sensing and Control, Taipei, Taiwan, 21–23 March 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 2, pp. 749–754. [Google Scholar] [CrossRef]
  2. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 785–794. [Google Scholar] [CrossRef]
  3. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30; Curran Associates: Red Hook, NY, USA, 2017; pp. 4765–4774. [Google Scholar]
  4. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 1135–1144. [Google Scholar] [CrossRef]
  5. Lou, Y.; Caruana, R.; Gehrke, J.; Hooker, G. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 623–631. [Google Scholar] [CrossRef]
  6. Nori, H.; Caruana, R.; Bu, Z.; Shen, J.H.; Kulkarni, J. Accuracy, interpretability, and differential privacy via explainable boosting. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Proceedings of Machine Learning Research: Cambridge, MA, USA, 2021; pp. 8227–8237. Available online: https://proceedings.mlr.press/v139/nori21a.html (accessed on 4 June 2026).
  7. Grover, P.; Xu, J.; Tittelfitz, J.; Cheng, A.; Li, Z.; Zablocki, J.; Liu, J.; Zhou, H. Fraud dataset benchmark and applications. arXiv 2022, arXiv:2208.14417. [Google Scholar] [CrossRef]
  8. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems 31; Curran Associates: Red Hook, NY, USA, 2018; pp. 6638–6648. [Google Scholar]
  9. Friedman, J.H.; Popescu, B.E. Predictive learning via rule ensembles. Ann. Appl. Stat. 2008, 2, 916–954. [Google Scholar] [CrossRef]
  10. Jurgovsky, J.; Granitzer, M.; Ziegler, K.; Calabretto, S.; Portier, P.-E.; He-Guelton, L.; Caelen, O. Sequence classification for credit-card fraud detection. Expert Syst. Appl. 2018, 100, 234–245. [Google Scholar] [CrossRef]
  11. Whitrow, C.; Hand, D.J.; Juszczak, P.; Weston, D.; Adams, N.M. Transaction aggregation as a strategy for credit card fraud detection. Data Min. Knowl. Discov. 2009, 18, 30–55. [Google Scholar] [CrossRef]
  12. Niculescu-Mizil, A.; Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; Association for Computing Machinery: New York, NY, USA, 2005; pp. 625–632. [Google Scholar] [CrossRef]
  13. Axelsson, S. The base-rate fallacy and the difficulty of intrusion detection. ACM Trans. Inf. Syst. Secur. 2000, 3, 186–205. [Google Scholar] [CrossRef]
  14. Hilal, W.; Gadsden, S.A.; Yawney, J. Financial fraud: A review of anomaly detection techniques and recent advances. Expert Syst. Appl. 2022, 193, 116429. [Google Scholar] [CrossRef]
  15. Carcillo, F.; Dal Pozzolo, A.; Le Borgne, Y.-A.; Caelen, O.; Mazzer, Y.; Bontempi, G. SCARFF: A scalable framework for streaming credit card fraud detection with Spark. Inf. Fusion 2018, 41, 182–194. [Google Scholar] [CrossRef]
  16. Dal Pozzolo, A.; Boracchi, G.; Caelen, O.; Alippi, C.; Bontempi, G. Credit card fraud detection: A realistic modeling and a novel learning strategy. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3784–3797. [Google Scholar] [CrossRef] [PubMed]
  17. Bahnsen, A.C.; Stojanovic, A.; Aouada, D.; Ottersten, B. Improving credit card fraud detection with calibrated probabilities. In Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, PA, USA, 24–26 April 2014; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2014; pp. 677–685. [Google Scholar] [CrossRef]
  18. Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. 2014, 46, 44. [Google Scholar] [CrossRef] [PubMed]
  19. Liang, S.; Li, Y.; Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  20. Slack, D.; Hilgard, S.; Jia, E.; Singh, S.; Lakkaraju, H. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, New York, NY, USA, 7–8 February 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 180–186. [Google Scholar] [CrossRef]
  21. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef] [PubMed]
  22. Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 2019, 267, 1–38. [Google Scholar] [CrossRef]
  23. Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. 2019, 51, 93. [Google Scholar] [CrossRef]
  24. Rudin, C.; Chen, C.; Chen, Z.; Huang, H.; Semenova, L.; Zhong, C. Interpretable machine learning: Fundamental principles and 10 grand challenges. Stat. Surv. 2022, 16, 1–85. [Google Scholar] [CrossRef]
  25. Wachter, S.; Mittelstadt, B.; Russell, C. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. J. Law Technol. 2018, 31, 841–887. [Google Scholar] [CrossRef]
  26. Caruana, R.; Lou, Y.; Gehrke, J.; Koch, P.; Sturm, M.; Elhadad, N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 1721–1730. [Google Scholar] [CrossRef]
  27. Kraus, M.; Tschernutter, D.; Weinzierl, S.; Zschech, P. Interpretable generalized additive neural networks. Eur. J. Oper. Res. 2024, 317, 303–316. [Google Scholar] [CrossRef]
  28. Agarwal, R.; Melnick, L.; Frosst, N.; Zhang, X.; Lengerich, B.; Caruana, R.; Hinton, G.E. Neural additive models: Interpretable machine learning with neural nets. In Advances in Neural Information Processing Systems 34; Curran Associates: Red Hook, NY, USA, 2021; pp. 4699–4712. [Google Scholar]
  29. Moreno-Torres, J.G.; Raeder, T.; Alaiz-Rodríguez, R.; Chawla, N.V.; Herrera, F. A unifying view on dataset shift in classification. Pattern Recognit. 2012, 45, 521–530. [Google Scholar] [CrossRef]
  30. Zafar, U.; Wu, F. Methodological challenges in explainable AI for fraud detection: A systematic literature review. Artif. Intell. Rev. 2026, 59, 115. [Google Scholar] [CrossRef]
  31. Zhou, Y.; Li, H.; Xiao, Z.; Qiu, J. A user-centered explainable artificial intelligence approach for financial fraud detection. Finance Res. Lett. 2023, 58, 104309. [Google Scholar] [CrossRef]
  32. Davis, J.; Goadrich, M. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; Association for Computing Machinery: New York, NY, USA, 2006; pp. 233–240. [Google Scholar] [CrossRef]
  33. Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef] [PubMed]
  34. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Proceedings of Machine Learning Research: Cambridge, MA, USA, 2017; pp. 1321–1330. [Google Scholar]
  35. Zadrozny, B.; Elkan, C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the 18th International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001; Proceedings of Machine Learning Research: Cambridge, MA, USA, 2001; pp. 609–616. [Google Scholar]
  36. Kull, M.; Silva Filho, T.M.; Flach, P. Beta calibration: A well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; Proceedings of Machine Learning Research: Cambridge, MA, USA, 2017; pp. 623–631. [Google Scholar]
Figure 1. Representative precision–recall curves for the XGBoost teacher and the final hybrid EBM (top-k = 8).
Figure 2. Representative calibration curves for the XGBoost teacher and the final hybrid EBM.
Figure 3. Representative global importance profile of the final hybrid EBM (top-k = 8).
Figure 4. Representative shape functions of the final hybrid EBM (top-k = 8): (a) c_cross_entity_reuse_device, (b) c_device_prev_count, (c) C1, and (d) C14. The plotted terms should be interpreted as additive score contributions for the regression-style EBM. In each panel, the red line represents the learned additive shape function, the gray histogram indicates the empirical feature distribution, and the shaded band indicates the variation band shown by the plotting routine.
Figure 5. Overall workflow of the proposed concept–raw fusion pipeline with separated XGBoost-teacher and CatBoost-ceiling roles.
Figure 6. Taxonomy of the 63-variable causal concept bank used in the final experiments. The figure shows representative examples from the five behavioral groups; the complete concept inventory, construction rules, raw variables used, and operational interpretations are provided in Appendix A Table A1.
Figure 7. Evaluation design combining the primary chronological out-of-time split and the strict pseudo-entity holdout protocol. The stricter protocol removes validation/test rows whose pseudo-entity keys appeared in earlier splits, reducing repeated-entity effects and providing a leakage-resistant robustness stress test.
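The strict protocol described in the Figure 7 caption can be sketched as a simple key-based filter. This is a minimal illustration, not the authors' implementation: the `(pseudo_entity_key, row_id)` row layout, the function name `strict_entity_filter`, and the toy keys are all assumptions introduced here.

```python
from typing import Hashable, List, Sequence, Tuple

# Hypothetical row layout: (pseudo_entity_key, row_id).
Row = Tuple[Hashable, str]


def strict_entity_filter(splits: Sequence[List[Row]]) -> List[List[Row]]:
    """Drop rows in later splits whose pseudo-entity key occurred in an earlier split.

    `splits` must be ordered chronologically, e.g. [train, validation, test].
    """
    seen = set()
    filtered = []
    for index, rows in enumerate(splits):
        if index == 0:
            kept = list(rows)  # the earliest split is left untouched
        else:
            kept = [row for row in rows if row[0] not in seen]
        filtered.append(kept)
        # Record every key of this split so later splits exclude them.
        seen.update(key for key, _ in rows)
    return filtered


# Toy example with made-up keys.
train = [("A", "t1"), ("B", "t2")]
validation = [("B", "v1"), ("C", "v2")]
test = [("A", "s1"), ("C", "s2"), ("D", "s3")]

strict_train, strict_validation, strict_test = strict_entity_filter(
    [train, validation, test]
)
print(strict_validation)  # [('C', 'v2')]
print(strict_test)        # [('D', 's3')]
```

In the toy data, entity B is removed from validation because it appeared in training, and entities A and C are removed from test because they appeared in training and validation, respectively.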
Figure 8. Main comparison of the CatBoost predictive ceiling and additive interpretable student family under the primary out-of-time evaluation. Panels report (a) PR-AUC, (b) ROC-AUC, (c) F1, and (d) Precision@Top1% for the CatBoost ceiling, sparse linear student, concept-only EBM, raw-only EBM (top-k = 8), and final hybrid EBM (top-k = 8).
Figure 9. Performance–complexity trade-off across the XGBoost teacher and the core additive student family.
Figure 10. Sensitivity of the hybrid EBM to the number of teacher-selected raw variables. The curve shows that performance improves sharply from k = 4 to k = 8 and then reaches a plateau across k = 8–16, motivating top-k = 8 as the compact final deployment-relevant additive configuration while treating the larger k values as ranking-sensitivity checks.
Figure 11. PR-AUC gains obtained by augmenting raw-only additive models with concept features. Positive values indicate that the causal concept bank adds ranking signal beyond the matched top-k raw-only additive baseline under both the main out-of-time and strict pseudo-entity-disjoint settings.
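As a rough cross-check of the gains summarized in Figure 11, the differences can be recomputed from the mean PR-AUC values reported in Table 1 (main out-of-time) and Table 3 (strict pseudo-entity-disjoint). The sketch below only subtracts reported means, so it may differ slightly from per-seed gains; Table 3 reports no raw-only top-k = 12 row, so that strict-setting pair is omitted.

```python
# Mean PR-AUC values copied from Table 1 ("main") and Table 3 ("strict").
pr_auc = {
    ("main", 8): {"raw_only": 0.372, "hybrid": 0.407},
    ("main", 12): {"raw_only": 0.383, "hybrid": 0.407},
    ("strict", 8): {"raw_only": 0.371, "hybrid": 0.399},
}

# Gain = hybrid mean minus matched raw-only mean, rounded to the reported precision.
gains = {
    setting: round(values["hybrid"] - values["raw_only"], 3)
    for setting, values in pr_auc.items()
}

for (protocol, k), gain in gains.items():
    print(f"{protocol:6s} top-k = {k:2d}: +{gain:.3f} PR-AUC")
```

All three differences are positive (0.035, 0.024, and 0.028), which agrees with the direction of the effect described in the caption.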
Figure 12. Low-FPR operating characteristics of the teacher, raw-only EBM, and final hybrid EBM. The plot emphasizes the screening regime most relevant to fraud analysts, where false-positive budgets are limited and recall must be interpreted under strict FPR constraints.
Figure 13. Hybrid concept-group ablation under strict pseudo-entity holdout. The figure illustrates that concept-group utility is partly regime-dependent; missingness remains important, whereas relation and time features can change utility when repeated pseudo-entity overlap is removed.
Table 1. Main three-seed out-of-time comparison across black-box references, additive students, and a rule-based interpretable baseline (mean ± std).
| Model | PR-AUC | ROC-AUC | F1 | Recall@0.1% FPR | Precision@Top 1% |
| --- | --- | --- | --- | --- | --- |
| CatBoost ceiling | 0.489 ± 0.001 | 0.885 ± 0.002 | 0.499 ± 0.004 | 0.204 ± 0.004 | 0.855 ± 0.002 |
| XGBoost teacher | 0.478 ± 0.003 | 0.878 ± 0.001 | 0.494 ± 0.002 | 0.185 ± 0.003 | 0.837 ± 0.007 |
| Sparse linear student | 0.121 ± 0.002 | 0.746 ± 0.003 | 0.193 ± 0.000 | 0.019 ± 0.000 | 0.253 ± 0.003 |
| Concept-only EBM | 0.189 ± 0.000 | 0.757 ± 0.000 | 0.231 ± 0.002 | 0.046 ± 0.001 | 0.441 ± 0.002 |
| Raw-only EBM (top-k = 8) | 0.372 ± 0.005 | 0.825 ± 0.004 | 0.398 ± 0.003 | 0.141 ± 0.003 | 0.700 ± 0.017 |
| Raw-only EBM (top-k = 12) | 0.383 ± 0.002 | 0.830 ± 0.001 | 0.400 ± 0.002 | 0.149 ± 0.001 | 0.733 ± 0.011 |
| Hybrid EBM (top-k = 8) | 0.407 ± 0.003 | 0.828 ± 0.003 | 0.423 ± 0.004 | 0.180 ± 0.002 | 0.806 ± 0.001 |
| Hybrid EBM (top-k = 12) | 0.407 ± 0.004 | 0.832 ± 0.001 | 0.416 ± 0.013 | 0.172 ± 0.004 | 0.799 ± 0.006 |
| RuleFit baseline (top-k = 8 feature set) | 0.387 ± 0.041 | 0.833 ± 0.009 | 0.404 ± 0.048 | 0.179 ± 0.019 | 0.795 ± 0.065 |
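The two operating-point metrics in Table 1 can be made concrete with a small sketch. The definitions below are one common reading and are assumptions here, not the authors' stated implementation: Precision@Top 1% is taken as the fraction of true fraud among the highest-scored 1% of rows, and Recall@0.1% FPR as the largest recall reachable by a score threshold whose false-positive rate does not exceed 0.1%. The function names and toy data are hypothetical.

```python
from typing import Sequence


def precision_at_top_fraction(
    y_true: Sequence[int], scores: Sequence[float], fraction: float = 0.01
) -> float:
    """Precision among the highest-scored `fraction` of rows (at least one row)."""
    n_top = max(1, int(round(fraction * len(scores))))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(y_true[i] for i in order[:n_top]) / n_top


def recall_at_fpr(
    y_true: Sequence[int], scores: Sequence[float], max_fpr: float = 0.001
) -> float:
    """Largest recall of any score threshold whose FPR is at most `max_fpr`.

    Assumes both classes are present in `y_true`.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(order):
        # Consume all rows tied at this score so the threshold is well defined.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            tp += y_true[order[j]]
            fp += 1 - y_true[order[j]]
            j += 1
        if fp / n_neg > max_fpr:
            break
        best = tp / n_pos
        i = j
    return best


# Toy data: 2 fraud rows among 10, scores already in descending order.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(precision_at_top_fraction(labels, scores, fraction=0.2))  # 0.5
print(recall_at_fpr(labels, scores, max_fpr=0.0))               # 0.5
```

In the toy data, the top 20% contains one fraud and one legitimate row, and only the first fraud row can be captured before any false positive is admitted.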
Table 2. Hybrid EBM top-k ablation in the main out-of-time experiment (mean ± std over three seeds).
| k | PR-AUC | ROC-AUC | F1 | Recall@0.1% FPR | Precision@Top 1% |
| --- | --- | --- | --- | --- | --- |
| 4 | 0.231 ± 0.007 | 0.775 ± 0.011 | 0.268 ± 0.010 | 0.055 ± 0.007 | 0.496 ± 0.006 |
| 8 | 0.407 ± 0.003 | 0.828 ± 0.003 | 0.423 ± 0.004 | 0.180 ± 0.002 | 0.806 ± 0.001 |
| 12 | 0.407 ± 0.004 | 0.832 ± 0.001 | 0.416 ± 0.013 | 0.172 ± 0.004 | 0.799 ± 0.006 |
| 16 | 0.408 ± 0.003 | 0.834 ± 0.002 | 0.422 ± 0.003 | 0.170 ± 0.002 | 0.796 ± 0.005 |
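The role of k in Table 2 can be illustrated with a short sketch of how a hybrid feature set might be assembled: the 63 concept variables are kept, and the k raw variables ranked highest by the teacher are appended. The ranking criterion, the function names, the feature names, and the importance values below are all hypothetical placeholders; only the counts (63 concepts, k = 8 or 12) come from the manuscript.

```python
from typing import Dict, List, Sequence


def select_top_k_raw(importances: Dict[str, float], k: int) -> List[str]:
    """Return the k raw feature names with the largest teacher importance."""
    # Sort by descending importance; break ties by name for determinism.
    ranked = sorted(importances.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


def hybrid_feature_set(
    concepts: Sequence[str], importances: Dict[str, float], k: int
) -> List[str]:
    """Concept features followed by the top-k teacher-selected raw features."""
    return list(concepts) + select_top_k_raw(importances, k)


# Hypothetical placeholders: 63 concept names, 20 raw features with made-up importances.
concepts = [f"concept_{i:02d}" for i in range(63)]
importances = {f"raw_{i:02d}": 1.0 / (i + 1) for i in range(20)}

for k in (8, 12):
    print(k, len(hybrid_feature_set(concepts, importances, k)))
# prints "8 71" and "12 75"
```

The resulting sizes, 71 and 75, match the feature counts listed for the two hybrid EBMs in Table 5.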
Table 3. Strict pseudo-entity-disjoint robustness evaluation across the main model families (mean ± std over three seeds), including Brier score as a probability-quality check. This benchmark removes repeated pseudo-entity overlap from later splits and is interpreted as a leakage-resistant robustness stress test rather than the primary reporting protocol.
| Model | PR-AUC | ROC-AUC | F1 | Recall@0.1% FPR | Precision@Top 1% | Brier Score |
| --- | --- | --- | --- | --- | --- | --- |
| CatBoost ceiling | 0.487 ± 0.005 | 0.886 ± 0.002 | 0.480 ± 0.003 | 0.162 ± 0.007 | 0.851 ± 0.009 | 0.02316 ± 0.00011 |
| XGBoost teacher | 0.468 ± 0.002 | 0.870 ± 0.002 | 0.468 ± 0.005 | 0.176 ± 0.015 | 0.816 ± 0.008 | 0.02404 ± 0.00013 |
| Raw-only EBM (top-k = 8) | 0.371 ± 0.004 | 0.829 ± 0.003 | 0.406 ± 0.003 | 0.131 ± 0.005 | 0.741 ± 0.005 | 0.02724 ± 0.00006 |
| Hybrid EBM (top-k = 8) | 0.399 ± 0.001 | 0.849 ± 0.001 | 0.398 ± 0.010 | 0.152 ± 0.003 | 0.783 ± 0.005 | 0.02718 ± 0.00002 |
| Hybrid EBM (top-k = 12) | 0.400 ± 0.005 | 0.857 ± 0.002 | 0.393 ± 0.007 | 0.147 ± 0.015 | 0.786 ± 0.014 | 0.02697 ± 0.00008 |
| RuleFit baseline (top-k = 8) | 0.431 ± 0.010 | 0.867 ± 0.003 | 0.415 ± 0.011 | 0.177 ± 0.007 | 0.796 ± 0.008 | 0.02606 ± 0.00021 |
Table 4. Concept-group ablation for the final top-k = 8 hybrid EBM in the main out-of-time experiment (mean ± std over three seeds).
| Setting | PR-AUC | ROC-AUC | F1 | Recall@0.1% FPR | Precision@Top 1% |
| --- | --- | --- | --- | --- | --- |
| Full hybrid (top-k = 8) | 0.407 ± 0.003 | 0.828 ± 0.003 | 0.423 ± 0.004 | 0.180 ± 0.002 | 0.806 ± 0.001 |
| Drop time concepts | 0.410 ± 0.002 | 0.830 ± 0.003 | 0.426 ± 0.002 | 0.179 ± 0.002 | 0.804 ± 0.003 |
| Drop relation concepts | 0.398 ± 0.003 | 0.829 ± 0.002 | 0.406 ± 0.007 | 0.173 ± 0.003 | 0.787 ± 0.004 |
| Drop missingness concepts | 0.398 ± 0.005 | 0.826 ± 0.003 | 0.407 ± 0.007 | 0.174 ± 0.002 | 0.795 ± 0.003 |
Table 5. Computational profile in the main out-of-time experiment (mean ± std over three seeds).
| Model | Features | Fit Time (s) | Predict Time (s) |
| --- | --- | --- | --- |
| CatBoost ceiling | 154 | 34.1 ± 0.7 | 0.126 ± 0.004 |
| XGBoost teacher | 154 | 30.9 ± 0.0 | 2.094 ± 0.023 |
| Raw-only EBM (top-k = 8) | 8 | 80.4 ± 0.8 | 0.098 ± 0.001 |
| Raw-only EBM (top-k = 12) | 12 | 124.0 ± 2.9 | 0.104 ± 0.002 |
| Concept-only EBM | 63 | 536.4 ± 5.6 | 0.200 ± 0.001 |
| Hybrid EBM (top-k = 8) | 71 | 659.3 ± 10.9 | 0.217 ± 0.015 |
| Hybrid EBM (top-k = 12) | 75 | 704.5 ± 13.2 | 0.218 ± 0.011 |
| RuleFit baseline (top-k = 8 feature set) | 71 | 5083.5 ± 493.9 | 1.273 ± 0.053 |
| Sparse linear student | 63 | 624.0 ± 201.8 | 0.079 ± 0.005 |
Table 6. Main out-of-time calibration comparison with ECE-15 and Brier score (mean ± std over three seeds).
| Model | ECE-15 | Brier Score | Calibration Interpretation |
| --- | --- | --- | --- |
| CatBoost ceiling | 0.00989 ± 0.00047 | 0.02359 ± 0.00009 | Best Brier score among reported models; strongest ECE-15 among black-box references, but not the lowest ECE-15 overall. |
| XGBoost teacher | 0.01611 ± 0.00044 | 0.02418 ± 0.00012 | Black-box teacher used for raw-feature selection and soft targets. |
| Concept-only EBM | 0.01217 ± 0.00018 | 0.03160 ± 0.00001 | Low ECE but weak ranking and higher Brier score. |
| Raw-only EBM (top-k = 8) | 0.00735 ± 0.00043 | 0.02693 ± 0.00012 | Compact additive raw baseline. |
| Hybrid EBM (top-k = 8) | 0.01587 ± 0.00012 | 0.02656 ± 0.00008 | Close to XGBoost in ECE; Brier score remains above both black-box references. |
| RuleFit baseline (top-k = 8 feature set) | 0.02669 ± 0.00391 | 0.02817 ± 0.00266 | Higher ECE and Brier than the hybrid with larger seed variance. |
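For readers less familiar with the two probability-quality metrics in Table 6, the following sketch shows how they are typically computed. It assumes that ECE-15 denotes expected calibration error with 15 equal-width probability bins, which is the usual convention but is not confirmed by this part of the manuscript; the function names and toy data are illustrative.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], probs: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def expected_calibration_error(
    y_true: Sequence[int], probs: Sequence[float], n_bins: int = 15
) -> float:
    """ECE with equal-width bins: sum over bins of (n_b / n) * |fraud rate_b - mean prob_b|."""
    # Per bin: [row count, sum of labels, sum of predicted probabilities].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[b][0] += 1
        bins[b][1] += y
        bins[b][2] += p
    n = len(y_true)
    ece = 0.0
    for count, label_sum, prob_sum in bins:
        if count:  # empty bins contribute nothing
            ece += (count / n) * abs(label_sum / count - prob_sum / count)
    return ece


# Toy data: slightly under-confident predictions.
labels = [0, 0, 1, 1]
probs = [0.1, 0.1, 0.9, 0.9]
print(round(brier_score(labels, probs), 4))                 # 0.01
print(round(expected_calibration_error(labels, probs), 4))  # 0.1
```

The two metrics can disagree, as Table 6 illustrates: ECE only measures the bin-level gap between predicted and observed rates, whereas the Brier score also rewards separation between fraud and non-fraud rows.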
Table 7. Sensitivity of the final hybrid EBM (top-k = 8) to alpha_EBM under the main out-of-time protocol (mean ± std over three seeds).
| Alpha_EBM | PR-AUC | ROC-AUC | F1 | ECE-15 | Brier Score | Recall@0.1% FPR |
| --- | --- | --- | --- | --- | --- | --- |
| 0.00 | 0.4061 ± 0.0034 | 0.8276 ± 0.0027 | 0.4220 ± 0.0051 | 0.01600 ± 0.00014 | 0.02658 ± 0.00008 | 0.1794 ± 0.0023 |
| 0.25 | 0.4065 ± 0.0034 | 0.8277 ± 0.0027 | 0.4230 ± 0.0046 | 0.01593 ± 0.00012 | 0.02657 ± 0.00008 | 0.1797 ± 0.0020 |
| 0.35 | 0.4066 ± 0.0034 | 0.8278 ± 0.0028 | 0.4228 ± 0.0044 | 0.01587 ± 0.00012 | 0.02656 ± 0.00008 | 0.1799 ± 0.0018 |
| 0.50 | 0.4069 ± 0.0034 | 0.8279 ± 0.0028 | 0.4229 ± 0.0046 | 0.01582 ± 0.00009 | 0.02655 ± 0.00008 | 0.1801 ± 0.0018 |
| 0.75 | 0.4072 ± 0.0034 | 0.8280 ± 0.0029 | 0.4227 ± 0.0049 | 0.01576 ± 0.00008 | 0.02654 ± 0.00008 | 0.1800 ± 0.0017 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kang, J.; Kim, K. Deployment-Oriented Interpretable Fraud Detection via Hybrid Explainable Boosting Machines with Concept–Raw Fusion on the IEEE-CIS Benchmark. Appl. Sci. 2026, 16, 5809. https://doi.org/10.3390/app16125809