Deployment-Oriented Interpretable Fraud Detection via Hybrid Explainable Boosting Machines with Concept–Raw Fusion on the IEEE-CIS Benchmark
Abstract
1. Introduction
1.1. Related Work on Fraud Detection, Explainability, and Additive Models
1.2. Study Positioning and Contribution Logic
- (1) We define a time-aware evaluation package that combines a chronological out-of-time split with a stricter pseudo-entity-disjoint holdout, so that additive students are assessed not only for predictive quality but also for leakage-resistant generalization.
- (2) We construct a causal 63-variable concept bank that translates anonymized IEEE-CIS fields into interpretable behavioral summaries of temporal state, entity history, novelty and reuse, identity missingness, and aggregate deviation.
- (3) We compare sparse linear, concept-only, raw-only, and concept–raw hybrid additive students against the XGBoost teacher and CatBoost predictive ceiling, and we additionally benchmark RuleFit so that the conclusions are not confined to a single black-box reference or a single interpretable family.
- (4) We close the evaluation loop by reporting ranking quality, threshold-based F1, low-FPR behavior, calibration, global importance and representative shape functions, computational cost, and explanation latency for XGBoost + SHAP versus native hybrid EBM local explanations, allowing interpretability claims to be judged together with operational plausibility and explanation-side cost.
2. Materials and Methods
2.1. Dataset and Prediction Task
2.2. Out-of-Time and Pseudo-Entity-Disjoint Evaluation
2.3. Concept Bank
2.4. Teacher and Student Models
2.5. Metrics and Computational Profile
2.6. Formal Problem Definition and Equation Summary
2.7. Algorithmic Summary
| Algorithm 1. Causal concept-bank construction for the ordered IEEE-CIS stream. |
| Input: merged table M sorted by TransactionDT; entity key e_i; amount a_i; device d_i |
| Output: concept bank C with temporal, history, reuse, missingness, and ratio features |
| 1: initialize empty history stores H_entity, H_device, H_email, H_product, H_card, H_address, H_card_address, and H_device_email |
| 2: for each row i in temporal order do |
| 3: read current entity e_i and raw fields for row i |
| 4: compute temporal concepts from TransactionDT and previous entity timestamps |
| 5: compute entity-history statistics from H_entity[e_i] before inserting row i |
| 6: compute device-, email-, product-, card-, address-, and combined-identifier reuse terms from their existing history stores |
| 7: compute missingness flags and deviation ratios using only information already available for row i and prior histories |
| 8: append all concept values to C for row i |
| 9: update H_entity, H_device, H_email, H_product, H_card, H_address, H_card_address, and H_device_email with row i only after feature extraction is complete |
| 10: end for |
| 11: return C |
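The ordering constraint in Algorithm 1 (read from the history stores first, update them afterwards) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `build_concept_bank`, the dictionary keys, the two history stores shown, and the choice to treat a missing device as "not new" are all assumptions made for illustration, and only a handful of the 63 concepts are computed.

```python
import math
from collections import defaultdict


def build_concept_bank(rows):
    """Compute a few causal concepts for rows already sorted by TransactionDT.

    Each row is a dict with the hypothetical keys 'TransactionDT', 'entity',
    'TransactionAmt', and 'DeviceInfo' (None when the device is missing).
    """
    # Per-entity history: running count, amount sum, last timestamp, devices seen.
    entity_hist = defaultdict(
        lambda: {"count": 0, "amt_sum": 0.0, "last_dt": None, "devices": set()}
    )
    device_count = defaultdict(int)  # global prior occurrences of each device
    concepts = []

    for row in rows:
        hist = entity_hist[row["entity"]]
        device = row["DeviceInfo"]
        has_history = hist["count"] > 0

        # Steps 4-7: read features from history BEFORE the current row is stored.
        concepts.append({
            "prev_count": hist["count"],
            "prev_mean": hist["amt_sum"] / hist["count"] if has_history else math.nan,
            "delta_t": row["TransactionDT"] - hist["last_dt"] if has_history else math.nan,
            "device_prev_count": device_count[device] if device is not None else 0,
            # Assumption: a missing device is never flagged as a new device.
            "new_device_for_entity": int(device is not None and device not in hist["devices"]),
            "missing_DeviceInfo": int(device is None),
        })

        # Step 9: update the stores only after feature extraction is complete.
        hist["count"] += 1
        hist["amt_sum"] += row["TransactionAmt"]
        hist["last_dt"] = row["TransactionDT"]
        if device is not None:
            hist["devices"].add(device)
            device_count[device] += 1

    return concepts


# Toy ordered stream (values are illustrative, not taken from IEEE-CIS).
rows = [
    {"TransactionDT": 100, "entity": "A", "TransactionAmt": 10.0, "DeviceInfo": "d1"},
    {"TransactionDT": 200, "entity": "A", "TransactionAmt": 30.0, "DeviceInfo": "d2"},
    {"TransactionDT": 300, "entity": "B", "TransactionAmt": 5.0, "DeviceInfo": "d1"},
    {"TransactionDT": 400, "entity": "A", "TransactionAmt": 50.0, "DeviceInfo": None},
]
bank = build_concept_bank(rows)
```

Because every feature is read before the update step, the first row of an entity sees an empty history (count 0, undefined mean), and no row ever sees its own amount in its prior statistics.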
| Algorithm 2. Final hybrid EBM training and evaluation under chronological and strict protocols. |
| Input: full feature matrix X, concept bank C, labels y, top-k value k |
| Output: trained hybrid EBM, clipped probability scores, test metrics, calibration plots, shape functions, and explanation-cost summaries |
| 1: split X and y into chronological train/valid/test partitions |
| 2: optionally filter valid/test rows that share pseudo-entity keys with train |
| 3: fit the teacher XGBoost model on the train partition |
| 4: rank raw variables by teacher importance and keep the top-k set R_k |
| 5: build hybrid design matrix Z = [C, R_k] for train/valid/test |
| 6: compute soft training targets y_tilde_train = (1 − alpha_EBM)y_train + alpha_EBM p^T_train |
| 7: fit regression-style Explainable Boosting Machine on Z_train and y_tilde_train |
| 8: obtain validation and test scores by clipping additive EBM outputs to [0, 1]; derive tau*_F1 from validation scores |
| 9: evaluate PR-AUC and ROC-AUC threshold-free; evaluate F1 at tau*_F1; compute Recall@low-FPR and Precision@Top1% on Z_test |
| 10: export global importance, PR/calibration curves, representative shape functions, explanation-cost diagnostics, runtime, feature count, and ablation results |
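The data-handling steps of Algorithm 2 (steps 1, 2, 6, and 8) can be sketched with NumPy alone. The function names, the split fractions passed in the example, and the toy arrays are assumptions for illustration; the teacher and EBM fits (steps 3 to 5 and 7) are deliberately omitted, so the array passed to `np.clip` merely stands in for additive EBM outputs. The value alpha = 0.35 is one of the settings listed in the soft-target sensitivity table.

```python
import numpy as np


def chronological_split(dt, train_frac, valid_frac):
    """Step 1: order rows by time and cut into train/valid/test index arrays."""
    order = np.argsort(dt, kind="stable")
    n_train = int(len(order) * train_frac)
    n_valid = int(len(order) * valid_frac)
    return order[:n_train], order[n_train:n_train + n_valid], order[n_train + n_valid:]


def drop_seen_entities(idx, entity, train_idx):
    """Step 2: keep only rows whose pseudo-entity key never occurs in train."""
    return idx[~np.isin(entity[idx], entity[train_idx])]


def soft_targets(y, p_teacher, alpha):
    """Step 6: y_tilde = (1 - alpha) * y + alpha * p_teacher."""
    return (1.0 - alpha) * y + alpha * p_teacher


def best_f1_threshold(y_true, scores):
    """Step 8: scan observed scores and return the threshold maximizing F1."""
    best_tau, best_f1 = 0.5, -1.0
    for tau in np.unique(scores):
        pred = scores >= tau
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = float(tau), float(f1)
    return best_tau, best_f1


# Toy stream: eight rows already in time order, with hypothetical entity keys.
dt = np.arange(8)
entity = np.array(["a", "b", "a", "c", "b", "d", "e", "a"])
train_idx, valid_idx, test_idx = chronological_split(dt, 0.5, 0.25)
strict_valid = drop_seen_entities(valid_idx, entity, train_idx)
strict_test = drop_seen_entities(test_idx, entity, train_idx)

y_soft = soft_targets(np.array([0.0, 1.0]), np.array([0.2, 0.6]), alpha=0.35)
clipped = np.clip(np.array([-0.03, 0.42, 1.08]), 0.0, 1.0)  # stand-in EBM outputs
tau, f1 = best_f1_threshold(np.array([0, 0, 1, 1]), np.array([0.1, 0.4, 0.35, 0.8]))
```

In this sketch the threshold is chosen on one set of scores and would then be applied unchanged to test scores, mirroring the validation-then-test order in steps 8 and 9.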
3. Results
3.1. Main Comparison on the Out-of-Time Split
3.2. Top-k Ablation and Final Model Selection
3.3. Robustness Under Strict Pseudo-Entity Holdout
3.4. Interpretability, Calibration, and Error Analysis
3.5. Raw-Only Versus Hybrid Additive Modeling
3.6. Hybrid Concept-Group Ablation
3.7. Computational Profile
Main Calibration Metric and Soft-Target Sensitivity Checks
3.8. Expanded Visual Diagnostics and Design Summary
3.8.1. Workflow, Concept Taxonomy, and Evaluation Design
3.8.2. Comparative Views of Performance, Complexity, and Sensitivity
3.8.3. Low-FPR Operating View and Hybrid Concept-Group Ablation
3.9. Evidence Synthesis
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| EBM | Explainable Boosting Machine |
| ECE | Expected Calibration Error |
| PR-AUC | Area under the Precision–Recall Curve |
| ROC-AUC | Area under the Receiver Operating Characteristic Curve |
| XAI | Explainable Artificial Intelligence |
Appendix A. Full 63-Variable Concept Bank Inventory
| Concept Variable | Construction Rule/Formula | Raw Variables Used | Concept Group | Operational Interpretation |
|---|---|---|---|---|
| log_TransactionAmt | log1p of non-negative TransactionAmt. | TransactionAmt | Aggregate deviation/amount scale | Stabilizes the transaction amount scale for additive modeling. |
| dt_hour | Hour bucket derived from floor(TransactionDT/3600) mod 24. | TransactionDT | Temporal state | Captures within-day transaction timing. |
| dt_weekday | Weekday bucket derived from floor(TransactionDT/86400) mod 7. | TransactionDT | Temporal state | Captures weekly timing pattern. |
| dt_day | Day index derived from floor(TransactionDT/86400). | TransactionDT | Temporal state | Captures coarse temporal position in the benchmark stream. |
| c_entity_amt_prev_count | Prior cumulative count for the pseudo-entity before current row. | card1, card2, addr1, P_emaildomain, TransactionDT | Entity history | Measures entity transaction history depth. |
| c_entity_amt_delta_t | Time since previous transaction for the same pseudo-entity. | TransactionDT, pseudo-entity key | Temporal state | Measures entity inactivity or rapid recurrence. |
| c_entity_amt_delta_t_log1p | log1p-transformed positive inter-arrival time for the pseudo-entity. | TransactionDT, pseudo-entity key | Temporal state | Stabilizes inter-arrival time for sparse histories. |
| c_entity_amt_jump_ratio | Absolute amount change from the previous pseudo-entity transaction divided by previous amount plus one. | TransactionAmt, pseudo-entity key | Aggregate deviation | Captures abrupt amount jumps for the same entity. |
| c_entity_amt_prev_mean | Prior mean TransactionAmt for the pseudo-entity. | TransactionAmt, pseudo-entity key | Entity history | Summarizes historical spending level of the entity. |
| c_entity_amt_prev_std | Prior standard deviation of TransactionAmt for the pseudo-entity. | TransactionAmt, pseudo-entity key | Entity history | Summarizes historical amount variability. |
| c_entity_amt_z | Current amount standardized by prior pseudo-entity mean and standard deviation. | TransactionAmt, pseudo-entity key | Aggregate deviation | Flags transactions deviating from entity history. |
| c_entity_amt_ratio | Current amount divided by prior pseudo-entity mean plus one. | TransactionAmt, pseudo-entity key | Aggregate deviation | Measures relative amount inflation versus entity history. |
| c_entity_amt_burstiness | Prior entity count divided by time since previous entity transaction plus one. | TransactionDT, pseudo-entity key | Temporal state | Measures rapid repeated activity by the same entity. |
| c_entity_amt_roll3_mean_prev | Rolling mean of the previous three entity amounts, using shifted history. | TransactionAmt, pseudo-entity key | Temporal state | Captures short-term historical amount level. |
| c_entity_amt_roll3_std_prev | Rolling standard deviation of the previous three entity amounts, using shifted history. | TransactionAmt, pseudo-entity key | Temporal state | Captures short-term amount volatility. |
| c_entity_amt_roll5_mean_prev | Rolling mean of the previous five entity amounts, using shifted history. | TransactionAmt, pseudo-entity key | Temporal state | Captures medium short-term amount level. |
| c_entity_amt_roll5_std_prev | Rolling standard deviation of the previous five entity amounts, using shifted history. | TransactionAmt, pseudo-entity key | Temporal state | Captures medium short-term amount volatility. |
| c_entity_amt_to_roll3_ratio | Current amount divided by prior rolling-3 mean plus one. | TransactionAmt, pseudo-entity key | Aggregate deviation | Measures deviation from short-term entity history. |
| c_entity_amt_to_roll5_ratio | Current amount divided by prior rolling-5 mean plus one. | TransactionAmt, pseudo-entity key | Aggregate deviation | Measures deviation from medium short-term entity history. |
| c_card_addr_amt_prev_count | Prior count for card1-address combination. | card1, addr1 | Entity history | Measures recurrence of a card-address pair. |
| c_card_addr_amt_prev_mean | Prior mean amount for card1-address combination. | TransactionAmt, card1, addr1 | Entity history | Summarizes historical amount level for card-address pair. |
| c_card_addr_amt_prev_std | Prior amount standard deviation for card1-address combination. | TransactionAmt, card1, addr1 | Entity history | Summarizes variability for card-address pair. |
| c_card_addr_amt_z | Current amount standardized by prior card-address history. | TransactionAmt, card1, addr1 | Aggregate deviation | Detects deviation from card-address baseline. |
| c_card_addr_amt_ratio | Current amount divided by prior card-address mean plus one. | TransactionAmt, card1, addr1 | Aggregate deviation | Measures relative change versus card-address history. |
| c_card1_amt_prev_count | Prior count for card1. | card1 | Entity history | Measures card-level recurrence. |
| c_card1_amt_prev_mean | Prior mean TransactionAmt for card1. | TransactionAmt, card1 | Entity history | Summarizes card-level historical amount. |
| c_card1_amt_prev_std | Prior TransactionAmt standard deviation for card1. | TransactionAmt, card1 | Entity history | Summarizes card-level amount variability. |
| c_card1_amt_z | Current amount standardized by prior card1 history. | TransactionAmt, card1 | Aggregate deviation | Flags amount deviation at card level. |
| c_card1_amt_ratio | Current amount divided by prior card1 mean plus one. | TransactionAmt, card1 | Aggregate deviation | Measures relative amount change at card level. |
| c_email_amt_prev_count | Prior count for payer email domain. | P_emaildomain | Entity history | Measures email-domain recurrence. |
| c_email_amt_prev_mean | Prior mean amount for payer email domain. | TransactionAmt, P_emaildomain | Entity history | Summarizes email-domain historical amount. |
| c_email_amt_prev_std | Prior amount standard deviation for payer email domain. | TransactionAmt, P_emaildomain | Entity history | Summarizes email-domain amount variability. |
| c_email_amt_z | Current amount standardized by prior email-domain history. | TransactionAmt, P_emaildomain | Aggregate deviation | Flags amount deviation relative to email-domain history. |
| c_email_amt_ratio | Current amount divided by prior email-domain mean plus one. | TransactionAmt, P_emaildomain | Aggregate deviation | Measures relative amount change for email-domain history. |
| c_card1_prev_count | Prior occurrence count of card1. | card1 | Reuse/novelty | Captures card reuse frequency. |
| c_addr1_prev_count | Prior occurrence count of addr1. | addr1 | Reuse/novelty | Captures address-region reuse frequency. |
| c_email_prev_count | Prior occurrence count of P_emaildomain. | P_emaildomain | Reuse/novelty | Captures payer-domain reuse frequency. |
| c_device_prev_count | Prior occurrence count of DeviceInfo. | DeviceInfo | Reuse/novelty | Captures device reuse frequency. |
| c_product_prev_count | Prior occurrence count of ProductCD. | ProductCD | Reuse/novelty | Captures product-code recurrence. |
| c_card4_prev_count | Prior occurrence count of card4. | card4 | Reuse/novelty | Captures card-network/type recurrence. |
| c_card6_prev_count | Prior occurrence count of card6. | card6 | Reuse/novelty | Captures card-category recurrence. |
| c_card_addr_prev_count | Prior occurrence count of card1\|addr1 combination. | card1, addr1 | Reuse/novelty | Captures card–address combination reuse. |
| c_device_email_prev_count | Prior occurrence count of DeviceInfo\|P_emaildomain combination. | DeviceInfo, P_emaildomain | Reuse/novelty | Captures device–email combination reuse. |
| c_entity_product_prev_count | Prior occurrence count of pseudo-entity\|ProductCD combination. | pseudo-entity key, ProductCD | Reuse/novelty | Captures product repetition within entity history. |
| c_new_device_for_entity | Indicator that DeviceInfo has not previously appeared for the pseudo-entity. | DeviceInfo, pseudo-entity key | Reuse/novelty | Flags new device use for an entity. |
| c_new_email_for_entity | Indicator that P_emaildomain has not previously appeared for the pseudo-entity. | P_emaildomain, pseudo-entity key | Reuse/novelty | Flags new payer-domain use for an entity. |
| c_new_card_addr_combo | Indicator that card1\|addr1 combination is new. | card1, addr1 | Reuse/novelty | Flags novel card–address combination. |
| c_new_device_email_combo | Indicator that DeviceInfo\|P_emaildomain combination is new. | DeviceInfo, P_emaildomain | Reuse/novelty | Flags novel device–email combination. |
| c_new_hour_for_entity | Indicator that hour bucket is new for the pseudo-entity. | TransactionDT, pseudo-entity key | Reuse/novelty | Flags unusual timing for an entity. |
| c_new_weekday_for_entity | Indicator that weekday bucket is new for the pseudo-entity. | TransactionDT, pseudo-entity key | Reuse/novelty | Flags unusual weekday pattern for an entity. |
| c_new_product_for_entity | Indicator that ProductCD is new for the pseudo-entity. | ProductCD, pseudo-entity key | Reuse/novelty | Flags new product category for an entity. |
| c_cross_entity_reuse_device | Device prior count minus prior count of the same device within current pseudo-entity. | DeviceInfo, pseudo-entity key | Reuse/novelty | Measures whether a device is reused across different entities. |
| c_cross_entity_reuse_email | Email-domain prior count minus prior count of same payer domain within current pseudo-entity. | P_emaildomain, pseudo-entity key | Reuse/novelty | Measures whether an email domain appears across different entities. |
| c_identity_missing_ratio | Fraction of identity-like fields missing in the row. | id_*, D*, M*, DeviceInfo, DeviceType, P/R_emaildomain | Missingness | Summarizes identity sparsity intensity. |
| c_identity_missing_count | Count of missing identity-like fields in the row. | id_*, D*, M*, DeviceInfo, DeviceType, P/R_emaildomain | Missingness | Measures absolute identity-data sparsity. |
| c_core_missing_count | Count of missing core identity/location/device fields. | DeviceInfo, P/R_emaildomain, addr1, addr2, dist1, DeviceType | Missingness | Captures missingness in operationally important fields. |
| c_missing_DeviceInfo | Indicator that DeviceInfo is missing. | DeviceInfo | Missingness | Flags absence of device identity. |
| c_missing_P_emaildomain | Indicator that P_emaildomain is missing. | P_emaildomain | Missingness | Flags absence of payer email domain. |
| c_missing_R_emaildomain | Indicator that R_emaildomain is missing. | R_emaildomain | Missingness | Flags absence of recipient email domain. |
| c_missing_addr1 | Indicator that addr1 is missing. | addr1 | Missingness | Flags absence of primary address-region field. |
| c_missing_addr2 | Indicator that addr2 is missing. | addr2 | Missingness | Flags absence of secondary address-region field. |
| c_missing_dist1 | Indicator that dist1 is missing. | dist1 | Missingness | Flags absence of distance-related information. |
| c_missing_DeviceType | Indicator that DeviceType is missing. | DeviceType | Missingness | Flags absence of device-type information. |
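Several construction rules in the inventory above are simple enough to work through numerically. The sketch below applies the temporal bucket formulas and three of the amount-deviation rules to one toy transaction. It assumes that TransactionDT counts seconds from an unspecified reference point (as the 3600 and 86400 divisors imply), that the prior standard deviation is the sample standard deviation, and that an undefined z-score is returned as NaN; the table does not state the last two choices, and the function names are hypothetical.

```python
import math
import statistics


def temporal_concepts(transaction_dt):
    """dt_hour, dt_weekday, and dt_day from an integer second offset."""
    return {
        "dt_hour": (transaction_dt // 3600) % 24,
        "dt_weekday": (transaction_dt // 86400) % 7,
        "dt_day": transaction_dt // 86400,
    }


def amount_concepts(amount, prev_amounts):
    """Amount-scale and deviation concepts from strictly prior amounts."""
    out = {"log_TransactionAmt": math.log1p(max(amount, 0.0))}
    if not prev_amounts:
        # No history yet: deviation concepts are undefined.
        out.update(ratio=math.nan, jump_ratio=math.nan, z=math.nan)
        return out
    prev_mean = statistics.mean(prev_amounts)
    # "Current amount divided by prior mean plus one."
    out["ratio"] = amount / (prev_mean + 1.0)
    # "Absolute amount change from the previous transaction divided by previous amount plus one."
    out["jump_ratio"] = abs(amount - prev_amounts[-1]) / (prev_amounts[-1] + 1.0)
    # Assumption: sample standard deviation; NaN when it is zero or undefined.
    prev_std = statistics.stdev(prev_amounts) if len(prev_amounts) > 1 else 0.0
    out["z"] = (amount - prev_mean) / prev_std if prev_std > 0 else math.nan
    return out


t = temporal_concepts(90000)                  # 25 hours after the reference point
a = amount_concepts(120.0, [50.0, 100.0])     # prior mean 75, last amount 100
```

For the toy values, 90000 seconds falls in hour bucket 1 of day index 1, and an amount of 120 against prior amounts of 50 and 100 gives a ratio of 120/76 and a jump ratio of 20/101. Note that dt_weekday is a bucket relative to the reference point rather than a calendar weekday.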
Appendix B. Visual and Reproducibility Summary
| Component | Primary Evidence in Manuscript | Interpretive Role |
|---|---|---|
| Overall workflow | Figure 5 | Summarizes the end-to-end concept–raw fusion pipeline. |
| Concept taxonomy | Figure 6; Table A1 | Explains how 63 concepts were grouped and motivated. |
| Evaluation protocol | Figure 7 | Shows chronological split and strict pseudo-entity holdout. |
| Main comparison/trade-off | Table 1; Figure 8 and Figure 9 | Locates the final hybrid model on the performance–interpretability frontier. |
| Sensitivity and raw-to-hybrid gain | Table 2; Figure 10 and Figure 11 | Demonstrates that concept augmentation adds measurable signal beyond raw-only additive baselines. |
| Calibration, low-FPR behavior, and soft-target sensitivity | Figure 1, Figure 2 and Figure 12; Table 6 and Table 7 | Supports bounded probability-quality claims, soft-target robustness, and strict-threshold behavior. |
| Global importance/shape-function views | Figure 3 and Figure 4 | Shows that the final model is auditable at both the model-wide and per-feature functional levels. |
| Ablation and runtime | Table 4 and Table 5; Figure 13 | Connects concept utility to computational practicality. |

| Model | PR-AUC | ROC-AUC | F1 | Recall@0.1% FPR | Precision@Top 1% |
|---|---|---|---|---|---|
| CatBoost ceiling | 0.489 ± 0.001 | 0.885 ± 0.002 | 0.499 ± 0.004 | 0.204 ± 0.004 | 0.855 ± 0.002 |
| XGBoost teacher | 0.478 ± 0.003 | 0.878 ± 0.001 | 0.494 ± 0.002 | 0.185 ± 0.003 | 0.837 ± 0.007 |
| Sparse linear student | 0.121 ± 0.002 | 0.746 ± 0.003 | 0.193 ± 0.000 | 0.019 ± 0.000 | 0.253 ± 0.003 |
| Concept-only EBM | 0.189 ± 0.000 | 0.757 ± 0.000 | 0.231 ± 0.002 | 0.046 ± 0.001 | 0.441 ± 0.002 |
| Raw-only EBM (top-k = 8) | 0.372 ± 0.005 | 0.825 ± 0.004 | 0.398 ± 0.003 | 0.141 ± 0.003 | 0.700 ± 0.017 |
| Raw-only EBM (top-k = 12) | 0.383 ± 0.002 | 0.830 ± 0.001 | 0.400 ± 0.002 | 0.149 ± 0.001 | 0.733 ± 0.011 |
| Hybrid EBM (top-k = 8) | 0.407 ± 0.003 | 0.828 ± 0.003 | 0.423 ± 0.004 | 0.180 ± 0.002 | 0.806 ± 0.001 |
| Hybrid EBM (top-k = 12) | 0.407 ± 0.004 | 0.832 ± 0.001 | 0.416 ± 0.013 | 0.172 ± 0.004 | 0.799 ± 0.006 |
| RuleFit baseline (top-k = 8 feature set) | 0.387 ± 0.041 | 0.833 ± 0.009 | 0.404 ± 0.048 | 0.179 ± 0.019 | 0.795 ± 0.065 |
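The two operating-point metrics in this comparison, Recall@0.1% FPR and Precision@Top 1%, can be computed as in the following minimal sketch. The exact tie handling and the rounding of the top-1% cutoff used in the study are not stated, so the choices here (tied scores stay on the same side of a threshold, and the cutoff is rounded up to at least one row) are assumptions, as are the function names.

```python
import numpy as np


def recall_at_fpr(y_true, scores, max_fpr=0.001):
    """Highest recall at any score threshold whose false-positive rate <= max_fpr.

    Assumes y_true contains at least one positive and one negative label.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")      # descending by score
    y_sorted, s_sorted = y_true[order], scores[order]
    tp = np.cumsum(y_sorted == 1)
    fp = np.cumsum(y_sorted == 0)
    # A cut is only a valid threshold where the score changes (ties stay together).
    is_cut = np.r_[s_sorted[:-1] != s_sorted[1:], True]
    ok = is_cut & (fp / fp[-1] <= max_fpr)
    return float(tp[ok].max() / tp[-1]) if ok.any() else 0.0


def precision_at_top(y_true, scores, frac=0.01):
    """Fraction of true positives among the top `frac` highest-scored rows."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(np.ceil(frac * len(scores))))
    top = np.argsort(-scores, kind="stable")[:k]
    return float(np.mean(y_true[top] == 1))


# Toy example: 2 positives, 6 negatives, scores already distinct.
y = np.array([1, 0, 1, 0, 0, 0, 0, 0])
s = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
r_strict = recall_at_fpr(y, s, max_fpr=0.0)    # no false positives allowed
r_loose = recall_at_fpr(y, s, max_fpr=0.2)     # one of six negatives allowed
p_top = precision_at_top(y, s, frac=0.25)      # top 2 of 8 rows
```

With no false positives allowed, only the top-scored positive is recovered, so recall is 0.5; allowing one false positive reaches both positives. This is why the low-FPR column separates models much more sharply than ROC-AUC does.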

| Top-k | PR-AUC | ROC-AUC | F1 | Recall@0.1% FPR | Precision@Top 1% |
|---|---|---|---|---|---|
| 4 | 0.231 ± 0.007 | 0.775 ± 0.011 | 0.268 ± 0.010 | 0.055 ± 0.007 | 0.496 ± 0.006 |
| 8 | 0.407 ± 0.003 | 0.828 ± 0.003 | 0.423 ± 0.004 | 0.180 ± 0.002 | 0.806 ± 0.001 |
| 12 | 0.407 ± 0.004 | 0.832 ± 0.001 | 0.416 ± 0.013 | 0.172 ± 0.004 | 0.799 ± 0.006 |
| 16 | 0.408 ± 0.003 | 0.834 ± 0.002 | 0.422 ± 0.003 | 0.170 ± 0.002 | 0.796 ± 0.005 |

| Model | PR-AUC | ROC-AUC | F1 | Recall@0.1% FPR | Precision@Top 1% | Brier Score |
|---|---|---|---|---|---|---|
| CatBoost ceiling | 0.487 ± 0.005 | 0.886 ± 0.002 | 0.480 ± 0.003 | 0.162 ± 0.007 | 0.851 ± 0.009 | 0.02316 ± 0.00011 |
| XGBoost teacher | 0.468 ± 0.002 | 0.870 ± 0.002 | 0.468 ± 0.005 | 0.176 ± 0.015 | 0.816 ± 0.008 | 0.02404 ± 0.00013 |
| Raw-only EBM (top-k = 8) | 0.371 ± 0.004 | 0.829 ± 0.003 | 0.406 ± 0.003 | 0.131 ± 0.005 | 0.741 ± 0.005 | 0.02724 ± 0.00006 |
| Hybrid EBM (top-k = 8) | 0.399 ± 0.001 | 0.849 ± 0.001 | 0.398 ± 0.010 | 0.152 ± 0.003 | 0.783 ± 0.005 | 0.02718 ± 0.00002 |
| Hybrid EBM (top-k = 12) | 0.400 ± 0.005 | 0.857 ± 0.002 | 0.393 ± 0.007 | 0.147 ± 0.015 | 0.786 ± 0.014 | 0.02697 ± 0.00008 |
| RuleFit baseline (top-k = 8) | 0.431 ± 0.010 | 0.867 ± 0.003 | 0.415 ± 0.011 | 0.177 ± 0.007 | 0.796 ± 0.008 | 0.02606 ± 0.00021 |

| Setting | PR-AUC | ROC-AUC | F1 | Recall@0.1% FPR | Precision@Top 1% |
|---|---|---|---|---|---|
| Full hybrid (top-k = 8) | 0.407 ± 0.003 | 0.828 ± 0.003 | 0.423 ± 0.004 | 0.180 ± 0.002 | 0.806 ± 0.001 |
| Drop time concepts | 0.410 ± 0.002 | 0.830 ± 0.003 | 0.426 ± 0.002 | 0.179 ± 0.002 | 0.804 ± 0.003 |
| Drop relation concepts | 0.398 ± 0.003 | 0.829 ± 0.002 | 0.406 ± 0.007 | 0.173 ± 0.003 | 0.787 ± 0.004 |
| Drop missingness concepts | 0.398 ± 0.005 | 0.826 ± 0.003 | 0.407 ± 0.007 | 0.174 ± 0.002 | 0.795 ± 0.003 |

| Model | Features | Fit Time (s) | Predict Time (s) |
|---|---|---|---|
| CatBoost ceiling | 154 | 34.1 ± 0.7 | 0.126 ± 0.004 |
| XGBoost teacher | 154 | 30.9 ± 0.0 | 2.094 ± 0.023 |
| Raw-only EBM (top-k = 8) | 8 | 80.4 ± 0.8 | 0.098 ± 0.001 |
| Raw-only EBM (top-k = 12) | 12 | 124.0 ± 2.9 | 0.104 ± 0.002 |
| Concept-only EBM | 63 | 536.4 ± 5.6 | 0.200 ± 0.001 |
| Hybrid EBM (top-k = 8) | 71 | 659.3 ± 10.9 | 0.217 ± 0.015 |
| Hybrid EBM (top-k = 12) | 75 | 704.5 ± 13.2 | 0.218 ± 0.011 |
| RuleFit baseline (top-k = 8 feature set) | 71 | 5083.5 ± 493.9 | 1.273 ± 0.053 |
| Sparse linear student | 63 | 624.0 ± 201.8 | 0.079 ± 0.005 |

| Model | ECE-15 | Brier Score | Calibration Interpretation |
|---|---|---|---|
| CatBoost ceiling | 0.00989 ± 0.00047 | 0.02359 ± 0.00009 | Best Brier score among reported models; strongest ECE-15 among black-box references, but not the lowest ECE-15 overall. |
| XGBoost teacher | 0.01611 ± 0.00044 | 0.02418 ± 0.00012 | Black-box teacher used for raw-feature selection and soft targets. |
| Concept-only EBM | 0.01217 ± 0.00018 | 0.03160 ± 0.00001 | Low ECE but weak ranking and higher Brier score. |
| Raw-only EBM (top-k = 8) | 0.00735 ± 0.00043 | 0.02693 ± 0.00012 | Compact additive raw baseline. |
| Hybrid EBM (top-k = 8) | 0.01587 ± 0.00012 | 0.02656 ± 0.00008 | Close to XGBoost in ECE; Brier score remains above both black-box references. |
| RuleFit baseline (top-k = 8 feature set) | 0.02669 ± 0.00391 | 0.02817 ± 0.00266 | Higher ECE and Brier than the hybrid with larger seed variance. |
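For readers who want to reproduce the two calibration columns, the sketch below computes a Brier score and a 15-bin Expected Calibration Error. It assumes equal-width probability bins on [0, 1], which is one common reading of "ECE-15"; the binning scheme actually used in the study is not stated here, and the function names are hypothetical.

```python
import numpy as np


def brier_score(y_true, prob):
    """Mean squared difference between predicted probability and the 0/1 label."""
    y_true = np.asarray(y_true, dtype=float)
    prob = np.asarray(prob, dtype=float)
    return float(np.mean((prob - y_true) ** 2))


def expected_calibration_error(y_true, prob, n_bins=15):
    """Weighted mean |observed rate - mean predicted probability| over bins.

    Assumption: equal-width bins on [0, 1]; a probability of exactly 1.0
    falls in the last bin.
    """
    y_true = np.asarray(y_true, dtype=float)
    prob = np.asarray(prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(prob, edges[1:-1])   # values in 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(y_true[mask].mean() - prob[mask].mean())
            ece += mask.mean() * gap           # mask.mean() is the bin's share of rows
    return float(ece)


# Toy example: two confident negatives and two confident positives.
y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.1, 0.9, 0.9])
bs = brier_score(y, p)
ece = expected_calibration_error(y, p)
```

In the toy example every prediction is off by 0.1, so the Brier score is about 0.01 and the ECE about 0.1. The two metrics can rank models differently, as the table shows: the Brier score also rewards sharp, well-ranked scores, whereas ECE measures only the gap between predicted and observed rates within bins.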

| Alpha_EBM | PR-AUC | ROC-AUC | F1 | ECE-15 | Brier Score | Recall@0.1% FPR |
|---|---|---|---|---|---|---|
| 0.00 | 0.4061 ± 0.0034 | 0.8276 ± 0.0027 | 0.4220 ± 0.0051 | 0.01600 ± 0.00014 | 0.02658 ± 0.00008 | 0.1794 ± 0.0023 |
| 0.25 | 0.4065 ± 0.0034 | 0.8277 ± 0.0027 | 0.4230 ± 0.0046 | 0.01593 ± 0.00012 | 0.02657 ± 0.00008 | 0.1797 ± 0.0020 |
| 0.35 | 0.4066 ± 0.0034 | 0.8278 ± 0.0028 | 0.4228 ± 0.0044 | 0.01587 ± 0.00012 | 0.02656 ± 0.00008 | 0.1799 ± 0.0018 |
| 0.50 | 0.4069 ± 0.0034 | 0.8279 ± 0.0028 | 0.4229 ± 0.0046 | 0.01582 ± 0.00009 | 0.02655 ± 0.00008 | 0.1801 ± 0.0018 |
| 0.75 | 0.4072 ± 0.0034 | 0.8280 ± 0.0029 | 0.4227 ± 0.0049 | 0.01576 ± 0.00008 | 0.02654 ± 0.00008 | 0.1800 ± 0.0017 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Kang, J.; Kim, K. Deployment-Oriented Interpretable Fraud Detection via Hybrid Explainable Boosting Machines with Concept–Raw Fusion on the IEEE-CIS Benchmark. Appl. Sci. 2026, 16, 5809. https://doi.org/10.3390/app16125809
