Trust Triangle: A Reliability-Validity-Generation Framework for Explainable Credit Card Fraud Detection with RAG-Enhanced LLMs Reasoning

Shen, Jin-Ching; Su, Nai-Ching; Lin, Yi-Bing

doi:10.3390/ai7030114


by Jin-Ching Shen ¹, Nai-Ching Su ²,* and Yi-Bing Lin ¹

¹ College of Artificial Intelligence, National Yang Ming Chiao Tung University, No. 301, Sec. 2, Gaofa 3rd Rd., Guiren Dist., Tainan City 711, Taiwan

² Department of Project Management and Industrial Engineering, Shandong University, 27 Shanda Nanlu, Jinan 250100, China

* Author to whom correspondence should be addressed.

AI 2026, 7(3), 114; https://doi.org/10.3390/ai7030114

Submission received: 8 February 2026 / Revised: 4 March 2026 / Accepted: 5 March 2026 / Published: 19 March 2026

(This article belongs to the Section AI Systems: Theory and Applications)


Abstract

We propose Trust Triangle, a Bridging Methodology that establishes evidential reliability through multi-attribution consensus, ensures external validity via statistical hypothesis testing, and enables controlled generation with RAG-anchored LLMs to transform black-box predictions into trustworthy, auditable explanations. This framework is instantiated for credit card fraud detection by integrating multi-method feature attributions with rigorous statistical validation. The resulting reliability-validity-verified insights are synthesized with high-relevance domain knowledge (relevance score > 0.7) retrieved from a real-world corpus via Retrieval-Augmented Generation (RAG). A structured Chain-of-Thought (CoT) prompt then guides an LLM to produce coherent, audit-ready case reports. Our contributions are threefold: (1) a verifiable framework for quantifying attribution reliability and validity, (2) a demonstrated end-to-end pipeline from robust prediction to semantically grounded explanation, and (3) a generalizable paradigm for Trustworthy ML in high-stakes domains. Experiments on a highly imbalanced dataset (fraud rate: 8.74%) demonstrate robust performance (PR-AUC = 0.7867), successfully identify statistically significant predictive features, and generate audit-ready reports, thereby advancing a rigorous, evidence-based pathway from model output to decision-ready support.

Keywords:

trustworthy AI; explainable machine learning; reliability and validity verification; feature attribution; credit card fraud detection; retrieval-augmented generation

1. Introduction

Machine learning (ML) models have demonstrated strong efficacy in critical tasks such as credit card fraud detection. However, their “prediction black box” nature obscures how decisions are made, eroding user trust and hindering deployment in practice [1]. Existing approaches to explainability face a dual challenge. First, mainstream anomaly detection models, such as Bayesian Autoencoders (BAEs), focus on predictive performance but often fail to provide quantitative uncertainty estimates and statistically validated reasoning for their decisions [2,3]. Second, post hoc feature attribution frequently relies on a single method, yielding results that lack reliability (internal consistency across methods) and validity (empirical association with ground-truth labels), effectively creating an “explanation black box” [4].

To systematically address these limitations, we propose a novel “Trust Triangle” framework and demonstrate its application in credit card fraud detection. The framework, illustrated in Figure 1, establishes a trustworthy explanation pipeline by structurally bridging three pillars: evidential reliability, external validity, and controlled generation.

The pipeline begins at the Machine Learning Module (ML), which enhances a BAE for robust, uncertainty-aware fraud identification based on reconstruction errors (REs), using Monte Carlo Dropout [5] and Huber loss. The module outputs the feature-level RE, the instance-level RE, and the optimal threshold. This is followed by the reliability–validity Bridging Module (B), whose core innovation is a parallel processing stage that operationalizes two pillars: (1) establishing evidential reliability through multi-attribution consensus on feature importance, integrating three theoretically distinct methods (Integrated Gradients [6], SHAP [7], and Feature Perturbation [1]); and (2) establishing external validity through statistical validation (e.g., the Mann–Whitney U test with FDR correction [8]) against actual fraud outcomes. After aggregating the reliability and validity evaluations, the Bridging Module outputs two primary indicators: the Feature Importance Score and the Feature Confidence Score. These indicators are combined with the Optimized Threshold and the feature-level and instance-level reconstruction errors produced by the Machine Learning Module, and are then fed into the LLM module to enable controllable generation. We adopt Retrieval-Augmented Generation (RAG) [9] with domain knowledge to explicitly restrict the generation context of the LLM, enhancing controllability and mitigating hallucination. The synthesized evidence from these streams is then structured via a Chain-of-Thought (CoT) prompt [10] to guide a Large Language Model (LLM) [11] in producing an auditable, semantic report. This end-to-end process transforms an opaque anomaly probability into a transparent, evidence-anchored insight.

Crucially, the Trust Triangle is designed as a generalizable architectural blueprint. To adapt it to a new domain (e.g., healthcare diagnostics or judicial risk assessment), the core process remains unchanged: practitioners would (1) supply the relevant quantitative data and train a suitable predictive model; (2) curate a domain-specific knowledge base (e.g., medical journals, legal statutes); (3) define corresponding rule templates and semantic descriptors (e.g., symptom triggers, legal precedent descriptions); and (4) customize the prompt template by substituting domain-specific variables (e.g., feature names, reconstruction errors, impact scores, and retrieved knowledge) to ensure the final explanation remains coherent and factually anchored [10]. The RAG pipeline [9] then retrieves high-quality context from this new knowledge base using the defined keywords, and a structured CoT prompt [10] integrates this with the model’s validated evidence to guide the LLM [11] in generating rigorous, domain-appropriate reports. This demonstrates the framework’s extensibility beyond the fraud detection exemplar detailed herein. Notably, the modular design acknowledges the importance of cost-efficient LLM deployment strategies [12] and is compatible with principles of interpretable representation learning from other modalities [13].

Our contributions are threefold: (1) We propose the Trust Triangle, a verifiable bridging framework that sets quantitative standards for assessing the reliability and validity of feature attributions. (2) We construct and demonstrate an end-to-end instantiation for credit card fraud detection, from robust prediction to trustworthy explanation, anchoring LLMs [11] in rigorous statistical evidence. (3) We provide a generalizable methodological paradigm that enhances transparency and auditability for Trustworthy ML in high-risk domains, complete with a clear pathway for adaptation to other fields.

2. Related Works

2.1. Challenges in Trustworthy Anomaly Explanation

Building trustworthy machine learning for high-stakes domains like fraud detection requires reconciling three pillars, external validity, evidential reliability, and controlled generation, each with significant limitations. First, while Bayesian Autoencoders (BAEs) provide principled uncertainty estimation for anomaly detection by combining the generative modeling capabilities of variational autoencoders [2] with the flexibility of deep Gaussian mixture models for capturing complex data distributions [3], and further enhanced by Monte Carlo Dropout to approximate Bayesian inference in deep neural networks [5], they remain “black-box predictors”, outputting a scalar score without transparent justification for why a transaction is flagged. Second, post hoc attribution methods—such as gradient-based (Integrated Gradients [6]), game-theoretic (SHAP [7]), and perturbation-based approaches [1]—offer insights but produce inconsistent results when used in isolation. Reliance on any single method lacks both reliability (cross-method agreement) and validity (statistical grounding), creating an “explanation black box” [4,14]. Third, while LLMs [11,15] excel at fluent generation, deploying them for untethered explanation risks “hallucinations” and misalignment with the underlying model’s logic [16], undermining trust in critical applications.

2.2. The Trust Triangle Framework

To address these gaps, we introduce the Trust Triangle Framework. It integrates a robust BAE detector, a multi-attribution consensus module with statistical validation, and a RAG [9] system. This framework systematically bridges the quantitative evidence from the ML model with domain knowledge, ensuring that the final explanations are reliable (via method consensus), valid (via statistical testing), and semantically grounded (via controlled generation). It transforms opaque predictions into auditable, decision-ready insights.

High-stakes applications like fraud detection demand AI systems that are not only accurate but also trustworthy—their predictions must come with comprehensible, evidence-based explanations. A fundamental tension exists between the dominant paradigms capable of delivering these components. On one hand, complex ML models excel at deriving quantitative predictions from raw data but often function as inscrutable “black boxes,” providing limited intuitive justification for their outputs [1,17]. On the other hand, LLMs demonstrate powerful semantic reasoning and fluent generation [11,18], capabilities fundamentally enabled by attention mechanisms [15], yet they lack grounded, quantitative judgment and are prone to “hallucinating” plausible but unsubstantiated assertions [16].

A naive integration—feeding an ML model’s raw output to an LLM for post hoc narration—fails to resolve this tension and can exacerbate it. The uncertainty inherent in the ML prediction can be unwittingly amplified by the LLM’s generation process, risking explanations that are semantically fluent but statistically ungrounded, thereby eroding trust [4].

To achieve truly trustworthy and actionable interpretations, we must construct a principled bridge that transforms the ML model’s evidence into a verified, structured form before it is articulated by the LLM. This is the purpose of our Trust Triangle, built upon three interdependent pillars:

1. Evidential Reliability: To combat the opacity and method-specific bias of post hoc explanations [4,19], we transform the ML model’s internal state into consensus-verified feature attributions. This involves aggregating multiple theoretically distinct attribution methods, each with its own foundation: Integrated Gradients [6] satisfies the sensitivity and implementation invariance axioms; SHAP [7] provides a unified framework based on cooperative game theory with desirable fairness properties; and Feature Perturbation [1] offers an intuitive, model-agnostic approach that directly measures the impact of input changes on model output. We then rigorously measure the agreement among these methods, ensuring that the derived importance is robust and not an artifact of a single technique [20].

2. External Validity: To ensure explanations are relevant to the real-world task, attributed feature importance must be statistically anchored to actual outcomes. This requires rigorous hypothesis testing against ground truth [21] and the evaluation of practical effect sizes, moving beyond visual saliency to establish a defensible link between model evidence and business impact [8].

3. Controlled, Grounded Generation: To mitigate LLM hallucination [16], the generative process must be constrained and enriched. This is achieved by providing the LLM with structured prompts that integrate the validated quantitative scores with retrieved context from authoritative, domain-specific knowledge bases [9], guided by chain-of-thought reasoning [10].

Our framework instantiates this Trust Triangle: The Bridging Module (B) fulfills the first two pillars (evidential reliability and external validity), producing rigorously verified quantitative evidence. This evidence then directs the LLM module (LLM) to execute the third pillar (controlled generation). The final explanation is thus a traceable synthesis of reliability-tested evidence and validity-anchored knowledge—a closed loop from quantitative verification to trustworthy semantic articulation.

2.3. Dataset Characteristics

We employ a real-world credit card fraud dataset [22] of 1 M transactions with severe class imbalance (labels include “normal” and “fraud”, fraud rate: 8.74%). It provides seven domain-relevant raw features: two geographical (distance from home, from last transaction), three behavioral (transaction amount ratio, same merchant flag, online order flag), and two security-related (chip card usage, PIN verification). The large sample ensures statistical robustness for reliability-validity analysis, while the semantically rich feature space facilitates the RAG process, enabling the precise grounding of generated explanations in relevant domain knowledge [9].

2.4. Comparison with Existing Approaches: The Need for a Bridging Framework

While the methods reviewed above have advanced the field of explainable AI, each exhibits inherent limitations that our Trust Triangle framework systematically addresses. Table 1 summarizes the key comparisons.

Key Insights from Comparison:

Beyond Single-Method Attribution: While LIME [1], Integrated Gradients [6], and SHAP [7] each provide valuable perspectives, reliance on any single method risks method-specific bias [4,19]. Our framework transforms attribution from a point estimate into a consensus-verified distribution across methods, with explicit consistency metrics.
From Statistical Significance to Practical Validity: Traditional hypothesis testing [21,23] identifies statistically significant differences, but without effect size [8] or multiple testing correction, features may be declared “important” despite negligible practical impact. Our external validity pillar integrates p-values, FDR correction [8], and effect sizes into a composite validity weight.
Uncertainty-Aware Detection: Variational autoencoders [2] and their extensions [3,26] excel at learning normal patterns, but their point estimates ignore epistemic uncertainty. By incorporating Monte Carlo Dropout [5] and bootstrap validation [24], our BAE backbone provides calibrated uncertainty estimates essential for reliable threshold selection.
Grounded Generation with RAG: While LLMs demonstrate remarkable fluency [11,18], their tendency to hallucinate [16] makes them unsuitable for direct explanation of high-stakes predictions. Our RAG pipeline [9,25] retrieves authoritative context, and CoT prompting [10] ensures that generated narratives remain faithful to the verified evidence.
The Missing Bridge: Existing work either stops at explanation (post hoc methods) or generation (LLMs), but none systematically bridges quantitative verification with semantic articulation. The Trust Triangle fills this gap by introducing a dedicated Bridging Module that transforms raw model outputs into validated evidence before generation—a distinction that is both novel and essential for trustworthy AI in high-risk domains.

This comparative analysis demonstrates that our framework is not merely an aggregation of existing techniques but a principled integration that addresses their individual weaknesses while preserving their strengths. The result is an end-to-end pipeline that meets the rigorous demands of real-world fraud detection: reliable, valid, and auditable explanations.

3. Method

3.1. A Robust Predictive Backbone for Imbalanced Fraud Detection

Our framework begins with a robust, uncertainty-aware predictive backbone. The pipeline operates in two stages to ensure reliability: Unsupervised Pre-training (UPT, Stage 1) followed by Supervised Fine-tuning (SFT, Stage 2). The core detector is a BAE [2], trained on a real-world credit card transaction dataset of 1 million instances (fraud rate: 8.74%) with 7 domain-specific features defined by foundational principles [23]. The encoder $f_{\theta_{e}} : \mathbb{R}^{7} \to \mathbb{R}^{16}$ and decoder $g_{\theta_{d}}$ form symmetric fully connected blocks (64-48-32-24-16-24-32-48-64) with ReLU, batch normalization, and dropout.

3.1.1. Two-Stage BAE Training

Stage 1: Unsupervised Pre-training. We pre-train the BAE using only normal transactions ($X_{N}$), e.g., 90% of normal transactions, while explicitly excluding fraudulent ones ($X_{F}$). A key enhancement is uncertainty quantification via Monte Carlo Dropout at inference [5], which yields a posterior distribution over the reconstruction error for each feature, providing the mean $\mathbb{E}[RE]$ and variance $\mathbb{V}[RE]$. For robust optimization, we employ Huber loss instead of MSE to mitigate the influence of outlier reconstruction errors during training [27]. The model’s fraud detection performance is then evaluated, and its reliability is quantified by applying bootstrap resampling (200 iterations) to estimate confidence intervals (CIs) for performance metrics (e.g., PR-AUC).
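The three ingredients of Stage 1 (Huber loss, a mean and variance over stochastic forward passes, and a bootstrap confidence interval) can be illustrated with a minimal standard-library sketch. The function names, the `delta = 1.0` default, the percentile form of the interval, and the use of population variance are assumptions made for illustration; the paper does not specify them.

```python
import random
import statistics


def huber_loss(residual: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic near zero, linear in the tails (delta is assumed)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def mc_dropout_summary(passes):
    """Mean and variance of one feature's RE over stochastic forward passes."""
    return statistics.fmean(passes), statistics.pvariance(passes)


def bootstrap_ci(values, stat=statistics.fmean, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = estimates[int(alpha / 2 * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

In a real pipeline, `passes` would hold the reconstruction errors of one feature collected from repeated forward passes with dropout left active, and `values` would hold per-instance contributions to the metric being bootstrapped.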

Stage 2: Supervised Fine-tuning. The pre-trained BAE is fine-tuned on a mixed dataset after class imbalance mitigation (e.g., containing 10% of normal transactions and 100% of fraudulent ones). Crucially, we apply the Mann–Whitney U test to check that the training and validation sets show no statistically significant distributional difference, mitigating data leakage risks [28]. From this stage, we obtain feature-wise reconstruction errors for both normal and fraudulent transactions. An instance-level optimal decision threshold is selected via bootstrap resampling (200 iterations), which also validates the reliability of the threshold.
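A minimal sketch of bootstrap-based threshold selection is given below. The text does not state which criterion defines the "optimal" threshold, so maximizing F1 on instance-level reconstruction errors is an assumption here, as are the function names and the use of the median and standard deviation to summarize the resampled thresholds.

```python
import random
import statistics


def f1_at(threshold, errors, labels):
    """F1 for the fraud class when errors >= threshold are flagged as fraud."""
    tp = sum(1 for e, y in zip(errors, labels) if e >= threshold and y == 1)
    fp = sum(1 for e, y in zip(errors, labels) if e >= threshold and y == 0)
    fn = sum(1 for e, y in zip(errors, labels) if e < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(errors, labels):
    """Lowest candidate threshold that attains the maximum F1 (assumed criterion)."""
    candidates = sorted(set(errors))
    return max(candidates, key=lambda t: f1_at(t, errors, labels))


def bootstrap_threshold(errors, labels, n_boot=200, seed=0):
    """Median and spread of the best threshold across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(errors)
    picks = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        picks.append(
            best_threshold([errors[i] for i in idx], [labels[i] for i in idx])
        )
    return statistics.median(picks), statistics.pstdev(picks)
```

A small spread across resamples suggests that the selected threshold is stable; a large spread suggests that it depends heavily on which instances happen to be sampled.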

3.1.2. BAE Performance Evaluation

Model performance focuses on the fraud class. We report Precision–Recall AUC (PR-AUC), which is appropriate for imbalanced data [29], and the maximum F1-score ($F1_{max}$). A composite score $S_{comb} = 0.7 \cdot \text{PR-AUC} + 0.3 \cdot F1_{max}$ [29] balances ranking quality with threshold-specific performance, aligning with the business goal of maximizing fraud discovery while controlling false alarms. This backbone provides not only accurate anomaly scores but also a quantifiable measure of uncertainty, forming the first pillar of our Trust Triangle.
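The composite score can be computed from ranked anomaly scores as sketched below. Average precision is used here as the PR-AUC estimate, which is an assumption (the paper does not say how the area is computed), and ties between scores are ignored for simplicity. The toy scores and labels in the usage note are illustrative, not results from the paper.

```python
def pr_points(scores, labels):
    """(precision, recall) after each ranked prediction, highest score first."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive (fraud) label is required")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = 0
    points = []
    for rank, i in enumerate(order, start=1):
        tp += labels[i]
        points.append((tp / rank, tp / total_pos))
    return points


def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve."""
    ap, prev_recall = 0.0, 0.0
    for precision, recall in pr_points(scores, labels):
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


def f1_max(scores, labels):
    """Best F1 over all cut-offs along the ranking."""
    return max(
        (2 * p * r / (p + r)) if (p + r) else 0.0
        for p, r in pr_points(scores, labels)
    )


def combined_score(pr_auc, f1max):
    """S_comb = 0.7 * PR-AUC + 0.3 * F1_max, as defined in the text."""
    return 0.7 * pr_auc + 0.3 * f1max
```

For the toy input `scores = [0.9, 0.8, 0.7, 0.6]` with `labels = [1, 0, 1, 0]`, the average precision is about 0.833, the maximum F1 is 0.8, and the composite score is about 0.823.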

3.1.3. Workflow Description of Evidential Reliability

Figure 2 illustrates our machine learning training and deployment process, which enhances the reliability of BAE predictions and establishes the reliability of threshold determination based on instance-level reconstruction errors.

BAE Training: The module takes normal instance data as input (90% of $X_{N}$), allowing the BAE to learn the underlying structure of normal transactions in Stage 1 (UPT). Subsequently, mixed data, checked for distributional differences via the Mann–Whitney U test, is introduced to train the BAE to identify fraudulent patterns in Stage 2 (SFT). On the one hand, the performance of the BAE is validated through bootstrap resampling (n = 200 resamples) to assess the stability of its fraud detection capability, providing users with trustworthy performance metrics. On the other hand, the BAE outputs the reconstruction error (RE) for each instance at both the feature level and the instance level. We use bootstrap validation (n = 200 resamples) to examine the stability of the optimal threshold, derived from the error distributions, for distinguishing between normal and fraudulent transactions. Both the instance-level and feature-level reconstruction errors, along with the statistically validated optimal threshold, are then passed to the subsequent Bridging Module to serve as the foundation for feature attribution.

BAE Deployment: We input 20 new unlabeled instances into the pre-trained BAE and compute both feature-level and instance-level reconstruction errors for each instance. Based on the instance-level reconstruction error, a risk score is calculated. For each feature, an impact score is derived from the standardized position (z-score) of its feature-level reconstruction error relative to the distribution of that feature’s reconstruction error across normal transactions. The instance’s risk score and the corresponding feature impact scores are then fed into the LLM Module, where they are mapped, respectively, to the instance’s risk level and to the fraud types triggered by the feature impact scores.

3.2. Bridging Evidence and Validity for Trustworthy Attributions

To transform opaque model predictions into trustworthy explanations, we introduce the Reliability-Validity Bridging Module. This component ingests instance-level and feature-level reconstruction errors and outputs two rigorously verified metrics per feature: a Feature Importance Score and a Feature Confidence Score. The framework performs two core, sequential tasks to establish this trust: first, it quantifies evidential reliability through multi-method consensus; second, it assesses external validity through statistical grounding in real outcomes.

3.2.1. Evidential Reliability via Multi-Method Consensus

We establish reliability by aggregating evidence from three theoretically distinct attribution methods applied in parallel: Integrated Gradients (IG) for its axiomatic foundation [6], SHAP for its game-theoretic fairness [7], and Perturbation for its causal intuitiveness [1]. For feature $j$, the raw score $s_{j}^{(m)}$ from method $m$ is min-max-normalized to $\tilde{s}_{j}^{(m)}$. Reliability is quantified at two levels:

Micro-Consistency (Feature-Level): We compute $\text{consistency}_{j} = 1 - \min(1.0,\, 2 \cdot \text{std}(\{\tilde{s}_{j}^{(m)}\}_{m=1}^{3}))$, measuring the agreement across methods for that specific feature.
Macro-Consensus (System-Level): We calculate the global consistency $C_{global} = (\rho + \tau)/2$, where $\rho$ and $\tau$ are the means of the pairwise Spearman’s $\rho$ and Kendall’s $\tau$ rank correlations between the three methods’ rankings [20,21]; $C_{global}$ dynamically determines the fusion weight $w_{rel}$ for each method, favoring more stable methods (e.g., SHAP) when consensus is low.
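The two reliability measures can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the population standard deviation is used for `std`, rankings are assumed to contain no ties, and the attribution values in the usage note are made up.

```python
import statistics
from itertools import combinations


def min_max(scores):
    """Min-max normalize one method's raw attribution scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ranks(values):
    """Rank 1 = largest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(a, b):
    """Spearman rank correlation (tie-free formula)."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(a, b):
    """Kendall rank correlation from concordant and discordant pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(a) * (len(a) - 1) / 2)


def micro_consistency(normalized_by_method, j):
    """consistency_j = 1 - min(1, 2 * std of feature j's scores across methods)."""
    values = [scores[j] for scores in normalized_by_method]
    return 1 - min(1.0, 2 * statistics.pstdev(values))


def macro_consensus(normalized_by_method):
    """C_global = (mean pairwise rho + mean pairwise tau) / 2."""
    pairs = list(combinations(normalized_by_method, 2))
    rho = statistics.fmean([spearman_rho(a, b) for a, b in pairs])
    tau = statistics.fmean([kendall_tau(a, b) for a, b in pairs])
    return (rho + tau) / 2
```

For example, if three methods give raw scores `[0.9, 0.5, 0.1]`, `[0.8, 0.4, 0.2]`, and `[1.0, 0.6, 0.0]` for three features, all rankings agree, so `macro_consensus` is 1.0, while `micro_consistency` for the middle feature is below 1 because the normalized values differ across methods.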

3.2.2. External Validity via Statistical Association

We ground the attributions in empirical reality by testing their association with the target variable. For each feature $j$, we apply the Mann–Whitney U test [21] to the distributions of reconstruction error for positive (e.g., fraud) versus negative (e.g., normal) instances, obtaining a p-value. The Benjamini–Hochberg procedure [8] controls the False Discovery Rate (FDR), yielding corrected p-values $p_{corr}^{j}$. We also compute the effect size $|d_{j}|$ (Cohen’s d). These metrics are fused into a composite validity weight:

$$w_{val}^{(j)} = 0.6 \cdot I(p_{corr}^{j} < 0.05) + 0.4 \cdot \min(1,\, 2|d_{j}|),$$

where $I(\cdot)$ is the indicator function. This ensures features must be both statistically significant and practically meaningful to receive high validity weighting [8].
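The validity weight can be sketched as follows. Raw p-values are taken as given (for instance from `scipy.stats.mannwhitneyu`); the Benjamini–Hochberg adjustment and Cohen's d with a pooled standard deviation are written out in plain Python. The pooled-variance form of Cohen's d and the function names are assumptions, since the paper does not spell them out.

```python
import math
import statistics


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original feature order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed form)."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled


def validity_weight(p_corr, d):
    """w_val = 0.6 * I(p_corr < 0.05) + 0.4 * min(1, 2|d|)."""
    significant = 1.0 if p_corr < 0.05 else 0.0
    return 0.6 * significant + 0.4 * min(1.0, 2 * abs(d))
```

Note how the two terms interact: a feature with a corrected p-value of 0.04 and $|d| = 0.3$ receives $0.6 + 0.4 \cdot 0.6 = 0.84$, whereas a feature with a large effect size but a non-significant corrected p-value is capped at 0.4.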

3.2.3. Fusion and Output

The final, verified Feature Importance Score for feature $j$ is

$$\text{Importance}_{j} = \frac{\left(\sum_{m=1}^{3} w_{rel} \cdot \tilde{s}_{j}^{(m)}\right) \cdot w_{val}^{(j)}}{\sum_{k=1}^{n} \left[\left(\sum_{m=1}^{3} w_{rel} \cdot \tilde{s}_{k}^{(m)}\right) \cdot w_{val}^{(k)}\right]},$$

where stability is confirmed via 200 bootstrap resamples. The Feature Confidence Score, a meta-evaluation of trustworthiness, is independently computed as

$$\text{Confidence}_{j} = 0.4 \cdot \text{consistency}_{j} + 0.4 \cdot w_{val}^{(j)} + 0.2 \cdot C_{global}.$$
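A minimal sketch of the fusion step follows. The formula writes $w_{rel}$ without a method index, while the text describes a weight for each method; the sketch assumes one weight per method (a list `w_rel` aligned with the methods), which is an interpretation rather than something the paper states. All input values in the usage note are illustrative.

```python
def importance_scores(normalized_by_method, w_rel, w_val):
    """Normalized Feature Importance Scores.

    normalized_by_method: one list of min-max-normalized scores per method.
    w_rel: one reliability weight per method (assumed interpretation).
    w_val: one validity weight per feature.
    """
    n_features = len(w_val)
    raw = []
    for j in range(n_features):
        fused = sum(w * scores[j] for w, scores in zip(w_rel, normalized_by_method))
        raw.append(fused * w_val[j])
    total = sum(raw)
    if total == 0:
        return [0.0] * n_features
    return [r / total for r in raw]


def confidence_score(consistency_j, w_val_j, c_global):
    """Confidence_j = 0.4 * consistency_j + 0.4 * w_val_j + 0.2 * C_global."""
    return 0.4 * consistency_j + 0.4 * w_val_j + 0.2 * c_global
```

With equal method weights of 1/3, normalized scores `[[1.0, 0.5, 0.0], [1.0, 0.25, 0.0], [1.0, 0.75, 0.0]]`, and validity weights `[1.0, 0.5, 0.2]`, the importance scores come out to approximately `[0.8, 0.2, 0.0]` and sum to 1 by construction.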

3.2.4. Workflow Description of External Validity

As shown in Figure 3, the Bridging Module (B) systematically processes the ML model’s output. First, it extracts the feature-level reconstruction errors to assess evidential reliability. It does this by executing the three attribution methods (IG, SHAP, Perturbation) in parallel, calculating both micro-consistency and macro-consensus [21,22], and deriving the adaptive reliability weight $w_{rel}$. This process ensures the internal robustness of the generated attributions.

Second, it extracts the instance-level reconstruction errors and the optimal threshold to assess external validity. This involves statistically testing the association between each feature’s reconstruction error and the ground-truth labels using the Mann–Whitney U test [21], applying FDR correction [8] for rigor, and validating the reliability of feature attributions through bootstrap resampling (200 iterations). This anchors the explanations in observable outcomes.

Finally, the results of these two tasks are consolidated. Reliability and validity weights are fused into feature attribution evidence to produce the final normalized Feature Importance Score. Concurrently, the Feature Confidence Score is computed as a separate, holistic trust indicator and is presented in Figure 3 to the user for reference. The output of the Bridging Module is thus a rigorously validated quantitative evidence set, which, along with the Optimized Threshold and the reconstruction errors at both the feature level and the instance level, is then used to guide an LLM in a controlled generation pipeline (LLM Module) for producing faithful, natural-language explanations.

3.3. Controlled Generation for Actionable Explanations

The Bridging Module provides the verified quantitative evidence: features weighted by both evidential reliability (multi-method consensus) and external validity (statistical association). This section describes the final pillar of the Trust Triangle: controlled generation (LLM Module, LLM). We translate the static, validated scores into dynamic, domain-specific narratives by strictly controlling the LLM’s reasoning with this multi-source evidence.

3.3.1. Risk Quantification and Rule Mapping

We first operationalize the Bridging Module’s output into concrete, instance-specific risk metrics. For a new instance $i$, an anomaly score $e_{i}$ is derived from the detector (e.g., instance-level reconstruction error [2]). This is normalized to produce an instance risk score:

$$\text{Risk}_{i} = e_{i} / T,$$

where $T$ is a threshold from normal samples. Concurrently, we compute a feature impact score for feature $j$:

$$\text{Impact}_{j}^{(i)} = w_{j}^{(i)} \cdot \left( \left| \frac{x_{j}^{(i)} - \mu_{j}}{\sigma_{j}} \right| \cdot \text{Importance}_{j} \right).$$

Here, $\text{Importance}_{j}$ is the reliability- and validity-verified importance from the Bridging Module, and $w_{j}^{(i)}$ is the instance-specific attribution weight (e.g., from SHAP [7]). For the reconstruction error $x_{j}^{(i)}$ of feature $j$ in instance $i$, the standardized deviation $(x_{j}^{(i)} - \mu_{j}) / \sigma_{j}$ captures the feature’s atypicality, where $\mu_{j}$ denotes the mean reconstruction error of feature $j$ for normal transactions and $\sigma_{j}$ denotes the corresponding standard deviation. A high $\text{Impact}_{j}^{(i)}$ indicates a feature that is both globally important and locally anomalous. These scores are mapped to predefined, interpretable fraud rule templates (e.g., “high-value transaction from a new country”), ensuring alerts are grounded in statistically verified feature importance [8].
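The two instance-level scores reduce to a few lines of arithmetic, sketched below. The function and argument names are illustrative, and the numbers in the usage note are made-up values rather than results from the paper.

```python
def risk_score(e_i: float, threshold: float) -> float:
    """Risk_i = e_i / T: instance-level reconstruction error scaled by the threshold."""
    return e_i / threshold


def impact_score(w_ji: float, x_ji: float, mu_j: float, sigma_j: float,
                 importance_j: float) -> float:
    """Impact_j^(i) = w_j^(i) * (|(x_j^(i) - mu_j) / sigma_j| * Importance_j)."""
    z = abs((x_ji - mu_j) / sigma_j)  # atypicality relative to normal transactions
    return w_ji * (z * importance_j)
```

As a worked case, an instance with $e_i = 0.42$ and $T = 0.30$ has a risk score of about 1.4, meaning its reconstruction error exceeds the threshold by roughly 40%. A feature with $w = 0.5$, $x = 0.9$, $\mu = 0.3$, $\sigma = 0.2$, and importance 0.4 has $z = 3$ and an impact score of about $0.5 \cdot 3 \cdot 0.4 = 0.6$.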

3.3.2. Controlled Generation with RAG

To generate trustworthy, actionable narratives from the quantitative evidence ($\text{Risk}_{i}$, $\text{Impact}_{j}^{(i)}$, $\text{Importance}_{j}$), we employ a RAG pipeline [9] for controlled generation. When a fraud rule is triggered, the system uses the rule type and the high-impact features as keys to retrieve relevant explanatory passages from a curated Domain Knowledge Base (DKB). A structured prompt is then constructed for an LLM (e.g., [11,18]), which integrates three core components: (1) the verified quantitative evidence (including FDR-corrected p-values), (2) the retrieved qualitative knowledge, and (3) explicit instructions for causal, evidence-anchored synthesis. This controlled input ensures that every claim in the generated explanation is explicitly constrained by and traceable to the Bridging Module’s validated data and domain literature [1], effectively mitigating hallucination and providing an auditable justification for the alert.
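One way to assemble the three prompt components is sketched below. The section headings, the wording of the instructions, and the `[K#]` passage labels are hypothetical; the paper describes what the prompt integrates but does not reproduce its template here.

```python
def build_prompt(risk, impacts, p_corr, passages):
    """Assemble a structured CoT prompt from evidence, knowledge, and instructions.

    risk: instance risk score.
    impacts: feature name -> impact score.
    p_corr: feature name -> FDR-corrected p-value.
    passages: retrieved knowledge passages, already filtered by relevance.
    """
    evidence = [f"- risk score: {risk:.2f}"]
    for name in sorted(impacts, key=impacts.get, reverse=True):
        evidence.append(
            f"- {name}: impact={impacts[name]:.3f}, "
            f"FDR-corrected p={p_corr[name]:.4f}"
        )
    knowledge = [f"[K{n}] {text}" for n, text in enumerate(passages, start=1)]
    # Hypothetical instruction wording; the constraint is that each claim
    # must cite an evidence line or a retrieved passage.
    instructions = [
        "1. Review the quantitative evidence.",
        "2. Relate each high-impact feature to the retrieved knowledge.",
        "3. State only claims supported by an evidence line or a [K#] passage.",
        "4. Conclude with a risk assessment for the auditor.",
    ]
    sections = [
        "## Verified quantitative evidence", *evidence,
        "## Retrieved domain knowledge", *knowledge,
        "## Instructions (reason step by step)", *instructions,
    ]
    return "\n".join(sections)
```

Labeling each passage and listing features in descending order of impact keeps the generated report traceable: an auditor can check every statement against a numbered evidence line or passage.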

3.3.3. Workflow Description of Controlled, Grounded Generation

This workflow fulfills the ultimate objective of the Trust Triangle: enabling users to trust the ML model’s predictions with confidence, as illustrated in Figure 4 and Table 2. First, the outputs of the Bridging Module (B), including the feature-level and instance-level reconstruction errors (testing set), the optimal threshold, the Feature Importance Score, and the Feature Confidence Score, are combined in the Feature Importance Fusion step (Step ➊). In parallel, the data of a new instance are passed through the pre-trained BAE to compute its risk score and personalized feature impact scores from its reconstruction errors (Steps ➋ and ➌).

The generation process begins by extracting canonical credit card fraud patterns from a domain knowledge base (Step ➍). These patterns are encoded into an LLM Interpretation Guide, detailing seven major fraud types, their typical modus operandi, and five designated keywords per pattern. For a new instance, we match its feature impact scores against these patterns to identify triggered fraud types. The associated keywords are used to retrieve relevant contextual passages from the knowledge base (Step ➎), with the top three results by relevance score retained for quality (Step ➏). All components (triggered patterns, quantitative impact scores, and retrieved text) are integrated into a structured, multi-source evidence set (Step ➐). This set then guides an LLM via a CoT prompt in a controlled generation process to produce a faithful, natural-language report to the user (Step ➑).
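Steps ➍ to ➏ can be sketched as pattern matching followed by a relevance filter. The trigger criterion (every feature listed for a pattern must reach an impact cutoff), the pattern names, and the cutoff value are hypothetical. The relevance floor of 0.7 follows the figure quoted in the abstract, and the relevance scores are assumed to come from the embedding-based retriever.

```python
def triggered_patterns(impact_scores, patterns, cutoff):
    """Names of patterns whose listed features all reach the impact cutoff.

    The all-features-above-cutoff rule is an assumed trigger criterion.
    """
    hits = []
    for name, spec in patterns.items():
        if all(impact_scores.get(f, 0.0) >= cutoff for f in spec["features"]):
            hits.append(name)
    return hits


def select_passages(scored_passages, k=3, min_relevance=0.7):
    """Keep the top-k (relevance, text) pairs with relevance above the floor."""
    kept = [(score, text) for score, text in scored_passages if score > min_relevance]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]
```

If fewer than three passages clear the relevance floor, the function returns only those, so low-quality context is never padded into the prompt.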

This end-to-end process ensures every claim in the final Trustworthy Report is explicitly anchored in statistically verified evidence and domain expertise, fulfilling the promise of a reliable explanatory system.

4. Implementation

Our implementation is rigorously structured to operationalize the three pillars of the Trust Triangle: evidential reliability, external validity, and controlled generation. Each subsection details the methods that precisely align with and fulfill the requirements of its corresponding pillar.

4.1. Evidential Reliability: Building a Statistically Stable Detection Foundation

This phase establishes a trustworthy evidentiary base for detection. We begin with a public credit card fraud dataset [22], performing a sequential split to preserve temporal dynamics. The distributional equivalence between the resulting sets is verified using the Mann–Whitney U test [21], ensuring no data leakage biases our foundation. The core detector is a BAE [2], whose architecture and Monte Carlo Dropout inference [5] are designed to quantify predictive uncertainty intrinsically. This model is trained with the robust Huber loss [27] and the Adam optimizer. To quantify the stability of our primary performance metrics (PR-AUC, F1-score) [29] and all subsequent analyses, we employ bootstrap resampling (n = 200) [24]. Furthermore, the reliability of our feature attribution, a critical form of evidence, is quantified by measuring micro-consistency and macro-consensus ($C_{global}$) across multiple, independent attribution methods (Integrated Gradients [6], SHAP [7], and Perturbation [1]) using rank correlation [20]. This comprehensive approach ensures that the core detection evidence is robust, repeatable, and resistant to methodological variance, directly fulfilling the evidential reliability criterion.

4.2. External Validity: Establishing Causal Plausibility for Detected Features

This phase bridges the reliable detector’s outputs to real-world, generalizable phenomena. To ensure that features identified as important are not mere artifacts of the specific model or dataset, we subject them to rigorous statistical validation. For each attributed feature, we perform a Mann–Whitney U test [21] between the normal and fraudulent transaction populations in the data. We then apply False Discovery Rate (FDR) correction [8] to control for multiple comparisons, obtaining corrected p-values ($p_{j}^{corr}$) and effect sizes ($|d_{j}|$). This process provides a statistically grounded, per-feature validity assessment. The Feature Importance Scores and confidence scores integrate this weighted statistical evidence, ensuring that the attributed importance aligns with statistically significant, real-world differences between the two classes. This step directly operationalizes the external validity pillar by tethering model inferences to empirically verifiable, population-level differences.
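The correction step can be illustrated with a short sketch. It assumes the Benjamini–Hochberg procedure, a common FDR method; the section itself does not name the specific procedure used. The raw p-values are hypothetical, and `benjamini_hochberg` is an illustrative name.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four features (NOT the paper's results).
raw_p = [0.01, 0.04, 0.03, 0.005]
p_corr = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in p_corr]
```

Each adjusted value plays the role of a corrected p-value for one feature; a feature is retained as statistically grounded only if its corrected value falls below the chosen significance level.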

4.3. Controlled Generation: Orchestrating Auditable and Context-Aware Reporting

This phase translates detection evidence into actionable, trustworthy narratives through a constrained generative process. We implement a RAG pipeline [9] to ground the generation in external knowledge. For a new transaction instance, a risk score ($Risk_{i} = e_{i}/T$, the reconstruction error relative to the decision threshold) and feature impact scores ($Impact_{j}^{(i)}$) are computed. These scores trigger predefined, auditable rule templates that map feature subsets to semantic fraud concepts. The triggered rules guide the RAG system [9,25] to retrieve the three most relevant pieces of contextual evidence from an external crime modus operandi database. This retrieved evidence is then structured into a prompt for an LLM [18]. We implement a controlled, three-stage pipeline: qwen3-embedding:8b for retrieval; the lightweight qwen2.5:1.5b-instruct [12] for intermediate description; and qwen2.5vl:7b [10] for final report generation, with temperature = 0.3 [11] for stability. This multi-stage, evidence-anchored pipeline ensures the generation of a coherent, audit-ready report that is directly controlled by and traceable to the initial detection evidence, thereby fulfilling the controlled generation pillar.

5. Results

5.1. Overall Predictive Performance and Robustness

Our BAE establishes a statistically robust and reliable foundation. It achieves a fraud-class F1-score of 0.75 and a PR-AUC of 0.7867 on the validation set. Bootstrap analysis (n = 200) confirms the stability of these estimates (e.g., PR-AUC 95% CI: [0.7812, 0.7925]). These performance metrics faithfully reflect the inherent classification ambiguity present in the data, as visualized by the overlapping reconstruction error distributions in Figure A1, which aligns with our goal of establishing a trustworthy foundation. The model demonstrates strong, stable discriminative power: the average reconstruction error of fraudulent transactions is 5.52 times that of normal transactions, a separation confirmed as statistically significant (Mann–Whitney U test, p < 0.001, Cliff’s δ = 0.612). The training process converged stably (final validation loss = 0.2960). This validates a reliable predictive backbone for subsequent interpretability analysis.
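A percentile bootstrap interval of the kind reported above can be sketched as follows. This is an assumption-laden illustration: the paper does not state which bootstrap interval it uses, the statistic here is a simple mean rather than PR-AUC, and the input values are toy numbers. In practice one would resample transactions and recompute the metric on each resample.

```python
import random

def bootstrap_ci(values, statistic, n_resamples=200, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic(values)`."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * (n_resamples - 1))
    hi_idx = int((1 - alpha / 2) * (n_resamples - 1))
    return estimates[lo_idx], estimates[hi_idx]

def mean(xs):
    return sum(xs) / len(xs)

# Toy values (hypothetical), used only to exercise the function.
scores = [0.78, 0.79, 0.77, 0.80, 0.785, 0.79, 0.775, 0.795]
ci_low, ci_high = bootstrap_ci(scores, mean)
```

A narrow interval relative to the point estimate is what supports a claim of stability.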

5.2. Evidential Reliability (Multi-Method Consensus)

Three attribution methods (IG, SHAP, Perturbation) achieve strong consensus (global consistency = 0.7024). The highest agreement is on ratio_to_median_purchase_price (0.7005). The lowest agreement (0.2605) is for distance_from_last_transaction, where its negative impact under perturbation contrasts with positive rankings from IG/SHAP, revealing complementary signals rather than contradiction (see Table A1 and Figure A2 in Appendix A.1).

5.3. External Validity (Statistical Grounding)

A core contribution is feature-level validation. While overall model discriminative validity is significant (p = 0.003), only the feature ratio_to_median_purchase_price achieves statistical significance after FDR correction (p < 0.05, details in Appendix A.2). This demonstrates our framework’s ability to distinguish statistically grounded evidence from merely important signals.

To further assess the robustness of our statistical tests under the imbalanced setting, we performed a post hoc power analysis for each feature using the Mann–Whitney U test (normal approximation). As shown in Table A2, six out of seven features achieved a power of 1.000 at α = 0.05, indicating that the sample size (N ≈ 1 M, fraud rate 8.74%) provides exceptional sensitivity to detect true differences. Critically, the only feature that remained significant after FDR correction—ratio_to_median_purchase_price—exhibited the largest effect size (Cliff’s δ = −0.7009) and perfect power. Features with small effect sizes (|δ| < 0.2), such as used_chip and used_pin_number, failed to survive correction despite perfect power, underscoring our framework’s distinction between statistical significance and practical validity. The non-significant feature repeat_retailer showed low power (0.1183), consistent with its negligible role in the model. These results confirm that our conclusions are not compromised by insufficient power due to class imbalance, and they reinforce the necessity of combining effect size, significance testing, and power analysis for trustworthy feature selection.
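The Cliff’s δ and AUC columns of Table A2 are linked by a standard identity, shown here as a reading aid. The symbols $n_1$ and $n_2$ (the two group sizes) are introduced for illustration, and $U$ is assumed to be the Mann–Whitney statistic of the group whose AUC is reported; the paper does not state this convention explicitly.

```latex
\mathrm{AUC} = \frac{U}{n_1 n_2},
\qquad
\delta = \frac{2U}{n_1 n_2} - 1 = 2\,\mathrm{AUC} - 1
```

For ratio_to_median_purchase_price, $2 \times 0.1496 - 1 = -0.7008$, which matches the reported δ = −0.7009 up to rounding of the AUC.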

5.4. Controlled Generation (Synthesis of Evidence)

The final outputs are synthesized through a controlled pipeline. Feature importance ($Importance_{j}$) and confidence scores ($Confidence_{j}$), informed by reliability and validity weights, are integrated with predefined fraud rules. This structured evidence guides a RAG-enhanced LLM to generate audit-ready, narrative explanations, completing the bridge from quantitative evidence to actionable insights (see Figure 5). The statistical testing results are provided in Table A3 of Appendix A.2, and the full workflow is detailed in Section 5.4 and demonstrated in Deployment, Section 6.

5.5. Stability of Key Evidence

Bootstrap analysis (n = 200) confirms the exceptional stability of the two top-ranked features, ratio_to_median_purchase_price and distance_from_last_transaction: both show near-perfect rank stability (>0.98) and appear among the top three positions in 98.5% and 88.5% of resamples, respectively (see Table A4 in Appendix A.3). This underscores their reliable and central role in the model’s decision-making process.

6. Deployment

6.1. Case Analysis

We concretely demonstrate the Bridging framework by analyzing an extreme-risk transaction (Instance_13; all 20 new instances are provided in Appendix B.1, Table A5, and their risk score distribution is visualized in Figure A3), showcasing the operationalization of the Trust Triangle’s three pillars: evidential reliability, external validity, and controlled generation.

Instance_13 was assigned a risk score of 7.082 (see Appendix B.1, Table A6), indicating that its reconstruction error was over seven times the statistically derived decision threshold. Figure A3 presents the distribution of risk scores for the 20 new instances, and Figure A4 visualizes the linear relationship between reconstruction error and risk score, with Instance_13 highlighted as an extreme outlier. Our framework identified ratio_to_median_purchase_price (value: 10.124, i.e., over ten times the median; see Appendix B.1, Table A5) as the decisive factor. Crucially, all three attribution methods converged, assigning it aligned, high importance scores (IG: 4.142; SHAP: 1.368; Perturbation: 2.231; see Appendix B.1, Table A6). The feature importance of all features for the 20 new instances is given in Appendix B.1, Table A7. This instance-level multi-method consensus directly demonstrates evidential reliability.

6.2. Grounding the Evidence

The reconstruction error of ratio_to_median_purchase_price deviated from the normal mean by 5.3 standard deviations (see Appendix B.2, Table A8), providing strong instance-level corroboration of the global statistical significance established in Section 5.3. The calculated Feature Impact Score was 1.264 (Figure 6; see Appendix B.2, Table A8), accounting for 95.3% of the total impact for this instance. This overwhelming signal triggered predefined “High-Value Transaction” rules (see Appendix B.3, Figure A5), instantiating external validity. Furthermore, the RAG system retrieved seven distinct fraud patterns from an authoritative database (e.g., Counterfeit Card Fraud; see Appendix B.3, Figure A6), all explicitly linking “abnormally high transaction amounts” to criminal activity (relevance scores > 0.7). This aligns the quantitative attribution with external, verifiable domain knowledge.

6.3. Synthesis and Audit Trail

The final step exemplifies controlled generation. The structured quantitative evidence (risk score, impact scores, triggered rules) and retrieved qualitative context were integrated into a prompt. An LLM, guided by chain-of-thought reasoning, synthesized this into a coherent, audit-ready report (see Appendix B.4). This completes a fully traceable loop from a black-box anomaly score to a semantically grounded, actionable narrative, validating the framework’s core promise of trustworthy explanation.

7. Conclusions, Limitations, and Future Work

7.1. Conclusions

We propose the Trust Triangle framework, a novel paradigm that systematically bridges evidential reliability, external validity, and controlled generation to transform black-box predictions into trustworthy, auditable explanations for high-stakes credit card fraud detection. Empirically, our framework delivers robust performance (PR-AUC = 0.7867) on highly imbalanced data (8.74% fraud rate) and identifies a single statistically grounded predictive driver (ratio_to_median_purchase_price). Through case-based deployment, we demonstrate its ability to generate semantically coherent, evidence-anchored reports, completing a closed-loop verification from model output to decision support. We intend to extend this paradigm to other high-stakes domains (e.g., healthcare, justice) and to investigate more efficient consensus mechanisms. To foster reproducible research, we will release our code and model components, moving trustworthy AI from principle to practice.

7.2. Limitations

Despite its contributions, this study has several limitations that should be acknowledged:

Single Validated Feature: As reported in Section 5.3, only one feature (ratio_to_median_purchase_price) achieved statistical significance after FDR correction. While this finding underscores the rigor of our validity assessment—demonstrating that our framework successfully distinguishes statistically grounded signals from noise—it also raises the question of whether other features carry meaningful predictive signals that are masked by their small individual effect sizes or by correlations with other features.
Univariate Statistical Testing: Our external validity pillar relies on the Mann–Whitney U test, a univariate non-parametric method. This approach does not account for feature interactions or nonlinear relationships, which may be crucial for understanding complex fraud patterns.
Static Knowledge Base: The RAG pipeline currently depends on a fixed, pre-curated domain knowledge base. While we ensured high relevance scores (>0.7) for retrieved content, the knowledge base is not automatically updated as new fraud patterns emerge, limiting the system’s adaptability to concept drift.
LLM Dependence and Computational Cost: The controlled generation module employs a multi-stage LLM pipeline (embedding, intermediate description, final generation). Although we selected lightweight models (e.g., qwen2.5:1.5b-instruct) to mitigate cost [12], the approach still requires significant computational resources and relies on the inherent capabilities of the chosen LLMs, which may introduce biases or inconsistencies [16]. This trade-off between explainability and computational efficiency is a common challenge in deploying LLM-based systems for real-time applications.
Human Feedback Not Yet Implemented: While we propose a human-in-the-loop feedback mechanism, this component remains conceptual and has not been implemented or validated empirically. The effectiveness of expert feedback for model improvement and knowledge base updating requires future investigation.
Single Dataset and Domain: The framework has been demonstrated on a single credit card fraud dataset [22]. Its generalizability to other high-stakes domains (e.g., healthcare diagnostics, financial auditing, or cybersecurity) remains to be tested.

7.3. Future Work

Building on the limitations identified above, we outline several directions for future research:

Alternative Statistical Methods: While our framework successfully identifies one statistically validated feature, future work will explore alternative statistical methods—such as LASSO-regularized logistic regression, permutation importance with significance testing [14,26], Bayesian feature selection, and the Boruta algorithm—to uncover potential predictive signals in features that did not survive FDR correction. These methods offer complementary strengths: LASSO handles multicollinearity and feature selection jointly; permutation importance provides model-specific significance testing; Bayesian approaches incorporate prior knowledge; and the Boruta algorithm explicitly compares features to random probes. Such multivariate approaches may reveal interactions or nonlinear effects masked by univariate tests.
Feature Interaction Modeling: We plan to investigate methods that capture feature interactions, such as tree-based models with built-in interaction detection or neural attention mechanisms [15], to provide a more holistic understanding of fraud patterns.
Dynamic Knowledge Base Updating: Future iterations of the framework will incorporate mechanisms for semi-automated knowledge base updating, potentially using online learning or periodic retraining of the retrieval model [25] to adapt to emerging fraud modus operandi.
Human-in-the-Loop Feedback Mechanism: We envision a human-in-the-loop feedback mechanism to continuously improve system trustworthiness. Domain experts could review generated reports, flag errors, and update the knowledge base; their feedback would be logged and used to recalibrate attribution weights, refine rule templates, and periodically retrain the predictive model. This closed-loop learning from expert input would enable the Trust Triangle to adapt to evolving fraud patterns and reduce mistakes over time, moving toward truly adaptive and auditable AI in high-stakes domains.
Cross-Domain Validation: We aim to apply the Trust Triangle framework to other high-stakes domains, such as healthcare diagnostics (e.g., detecting anomalous patient records) and financial auditing (e.g., identifying irregular transactions), to assess its generalizability and adaptability. We plan to collaborate with domain experts in these fields to adapt the framework’s components—particularly the knowledge base and rule templates—to their specific contexts.
Efficiency Optimization: To address computational costs, we will explore more efficient consensus mechanisms, model distillation techniques, and lighter-weight LLM architectures [12] suitable for real-time deployment in production environments.
User Studies and Explainability Evaluation: Beyond quantitative validation, future work should include user studies with domain experts (e.g., fraud analysts) to evaluate the usefulness, interpretability, and actionability of the generated reports, providing qualitative evidence of the framework’s practical value.

By addressing these limitations and pursuing these future directions, we believe the Trust Triangle can evolve into a mature, deployable solution for trustworthy AI in high-risk applications, bridging the gap between rigorous statistical validation and human-interpretable explanations.

Author Contributions

Conceptualization, J.-C.S. and Y.-B.L.; methodology, J.-C.S. and N.-C.S.; software, J.-C.S.; validation, J.-C.S. and N.-C.S.; formal analysis, J.-C.S.; investigation, J.-C.S.; resources, Y.-B.L.; data curation, J.-C.S.; writing—original draft preparation, J.-C.S.; writing—review and editing, N.-C.S. and Y.-B.L.; visualization, J.-C.S.; supervision, Y.-B.L.; project administration, N.-C.S.; funding acquisition, J.-C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the authors.

Institutional Review Board Statement

Not applicable. The study utilized a publicly available, anonymized credit card fraud dataset and did not involve humans or animals.

Informed Consent Statement

Not applicable. The study did not involve humans.

Data Availability Statement

The credit card fraud dataset used in this study is publicly available on Kaggle and can be accessed via the reference [22] (https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud, accessed on 1 February 2026). The code and model components developed for the Trust Triangle framework are available from the corresponding author upon reasonable request to foster reproducible research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Quantifying Reliability—Raw Multi-Method Attribution Consensus

Figure A1 illustrates the distribution of reconstruction errors (MSE) for normal versus fraudulent transaction samples in an anomaly detection framework. The optimal classification threshold (0.2974, indicated by the vertical dashed line) was determined by optimizing the PR-AUC metric. A key observation from this distribution is the significant overlap between the reconstruction errors of normal and fraudulent samples, leading to a substantial number of False Positive (FP) and False Negative (FN) instances. This indicates an inherent classification ambiguity that cannot be fully resolved by static models relying solely on feature representations learned from raw transactional data (e.g., spending behavior, spatial distance, security features). Many fraudulent transactions are highly similar in low-dimensional feature space to legitimate yet rare normal behaviors (e.g., a large purchase in a distant location, which could be a gift bought during a business trip). This observation explains the theoretical upper bound on any model’s performance on this dataset and clarifies why pursuing near-perfect static classification metrics may lead to overfitting to dataset-specific artifacts and result in unacceptably high false-positive rates in real-world deployment.

Figure A1. Distribution of Transaction Reconstruction Errors with an Optimal Threshold.

Our research directly acknowledges and leverages this challenge. The proposed dynamic modus operandi description framework (Steps 4 and 5 in Table 2) does not aim to forcibly eliminate this overlap—an often-unrealistic goal. Instead, it allows the system to incorporate iteratively updatable business rules and contextual knowledge (Step 6 in Table 2) when such ambiguous samples are detected, thereby enabling adaptation to concept drift (Steps 7 and 8 in Table 2). Consequently, the overlapping region in this figure underscores the core motivation and advantage of our approach: to construct a continuously evolving, robust, and practical fraud detection system that operates effectively within the acknowledged constraints of data ambiguity.

This appendix presents the raw scores and ranking comparisons (Table A1), along with visualizations (Figure A2), for the three feature attribution methods (IG, SHAP, Perturbation). These data are the source for calculating the micro-level method consistency ($method\_consistency_{j}$) and the macro-level consensus coefficient ($C_{global}$). They not only reveal the high degree of consensus across methods (global consistency = 0.7024) but also concretely present complementary perspectives, such as for the feature distance_from_last_transaction across different methods. This serves as the core raw material for assessing evidential reliability.

Table A1. Comparison of Feature Importance Scores and rankings across three attribution methods.

Feature	IG Score	IG Rank	SHAP Score	SHAP Rank	Perturbation Score	Perturbation Rank	Consistency Among Methods
distance_from_last_transaction	0.673093	1	0.747015	1	−0.678799	7	0.2605
online_order	0.653953	2	0.039471	5	0.107362	2	0.5005
ratio_to_median_purchase_price	0.525062	3	0.452624	3	−0.413608	6	0.7005
distance_from_home	0.307944	4	0.531014	2	1.177105	1	0.3805
repeat_retailer	0.121161	5	0.201827	4	−0.058288	5	0.5005
used_chip	0.000641	6	0.014732	6	−0.008623	4	0.5005
used_pin_number	0.000000	7	0.014301	7	0.022428	3	0.5005

Figure A2. Comparison of Feature Importance Scores and rankings across three attribution methods.

Table A2. Power Analysis Results (α = 0.05, Two-Tailed Test).

Feature	U_Statistic	p_Value	Cliff_Delta	AUC	Power
distance_from_home	$3.213396 \times 10^{10}$	0.0000	−0.1943	0.4029	1.0000
distance_from_last_transaction	$3.705598 \times 10^{10}$	0.0000	−0.0709	0.4646	1.0000
ratio_to_median_purchase_price	$1.193061 \times 10^{10}$	0.0000	−0.7009	0.1496	1.0000
repeat_retailer	$3.994380 \times 10^{10}$	0.1746	0.0016	0.5008	0.1183
used_chip	$4.398982 \times 10^{10}$	0.0000	0.1030	0.5515	1.0000
used_pin_number	$4.414208 \times 10^{10}$	0.0000	0.1068	0.5534	1.0000
online_order	$2.695646 \times 10^{10}$	0.0000	−0.3241	0.3380	1.0000

Appendix A.2. Integrated Feature Importance Ranking with Reliability-Validity Verification

Table A3 presents the core output of our Bridging framework. The consensus-verified confidence scores embody evidential reliability, synthesizing multi-method agreement and stability. Statistical significance and effect size columns establish external validity, identifying ratio_to_median_purchase_price as the sole statistically grounded, high-confidence driver. This integrated and validated ranking provides the essential, structured input for controlled generation, enabling the RAG-enhanced LLM to produce an explanation firmly anchored in trustworthy, auditable evidence, thus completing the transformation from a black-box score to a white-box insight.

Table A3. Final Integrated Feature Importance with Statistical Validity.

Rank	Feature	Importance	Confidence	Stat. Sig.	Effect Size	Effect Size Category
1	ratio_to_median_purchase_price	0.3196	0.726	✓	0.5473	Medium effect
2	distance_from_last_transaction	0.2493	0.395	✗	0.4992	Small effect
3	online_order	0.1685	0.465	✗	0.9261	Large effect
4	distance_from_home	0.1564	0.524	✗	0.2764	Small effect
5	repeat_retailer	0.0463	0.622	✗	0.1863	Very small effect
6	used_chip	0.0302	0.637	✗	0.7714	Medium effect
7	used_pin_number	0.0297	0.617	✗	0.4602	Small effect

Appendix A.3. Substantiating Stability—Full Bootstrap Analysis for Key Features

This appendix lists the complete feature importance stability assessment results based on 200 bootstrap resamples (Table A4). It provides all details for the key features mentioned in the main text (e.g., ratio_to_median_purchase_price), including the mean importance score, standard deviation, 95% confidence interval, rank stability, and Top3 frequency. This data powerfully substantiates the reliability of the evidence, demonstrating that the core findings are not accidental results of data fluctuation. Simultaneously, it enhances the generalizability of the validity conclusions through large-scale repeated measurement.

Table A4. Feature importance stability evaluation based on 200 bootstrap resamples.

Rank	Feature	Mean Importance (±Standard Deviation)	95% Confidence Interval	Ranking Stability	Top-3 Frequency	Explanation
1	price	0.2770 ± 0.0942	[0.1179, 0.4753]	0.992	98.5%	Key Feature
2	transaction	0.2589 ± 0.1090	[0.0716, 0.4859]	0.988	88.5%	Key Feature
3	home	0.1746 ± 0.0861	[0.0640, 0.4082]	0.987	71.0%	Important Feature
4	online	0.1025 ± 0.0406	[0.0330, 0.1887]	0.990	30.5%	Important Feature
5	repeat	0.0751 ± 0.0371	[0.0219, 0.1582]	0.988	8.5%	Auxiliary Feature
6	chip	0.0610 ± 0.0332	[0.0151, 0.1183]	0.989	2.5%	Auxiliary Feature
7	pin	0.0509 ± 0.0302	[0.0081, 0.1085]	0.991	0.5%	Auxiliary Feature

Feature: price = ratio_to_median_purchase_price, transaction = distance_from_last_transaction, home = distance_from_home, online = online_order, repeat = repeat_retailer, chip = used_chip, pin = used_pin_number.

Appendix B

Appendix B.1. Twenty New Instances and Their Corresponding Analyses

Table A5. Raw data of 20 new instances.

Instance_ID	Home	Transaction	Price	Repeat	Chip	Pin	Online_Order
Instance_1	57.87785658	0.311140008	1.945939978	1	1	0	0
Instance_2	10.8299427	0.175591502	1.294218811	1	0	0	0
Instance_3	5.091079491	0.805152595	0.427714563	1	0	0	1
Instance_4	2.247564328	5.600043547	0.362662578	1	1	0	1
Instance_5	44.190936	0.566486268	2.222767293	1	1	0	1
Instance_6	5.586407674	13.26107327	0.064768465	1	0	0	0
Instance_7	3.724019125	0.956837928	0.278464933	1	0	0	1
Instance_8	4.848246572	0.320735427	1.273049534	1	0	1	0
Instance_9	0.876632256	2.503608927	1.516999333	0	0	0	0
Instance_10	8.839046704	2.970512276	2.361682533	1	0	0	1
Instance_11	14.26352874	0.158758086	1.136101943	1	1	0	1
Instance_12	13.59238757	0.240539813	1.370329863	1	1	0	1
Instance_13	5.282558261	0.371561962	10.12447336	1	0	0	1
Instance_14	13.95587237	0.271523528	2.798901123	1	0	0	1
Instance_15	179.6651877	0.120919634	0.535640483	1	1	1	1
Instance_16	114.5187894	0.707003353	0.516989925	1	0	0	0
Instance_17	3.589688598	6.247457543	1.846450527	1	0	0	0
Instance_18	11.08585248	34.66135143	2.530758449	1	0	0	1
Instance_19	2.131955666	56.37240053	6.358667334	1	0	0	1
Instance_20	3.803057351	67.24108053	1.872949614	1	0	0	1

Feature: price = ratio_to_median_purchase_price, transaction = distance_from_last_transaction, home = distance_from_home, online = online_order, repeat = repeat_retailer, chip = used_chip, pin = used_pin_number.

Table A6. Risk scores and Feature attribution scores of 20 new instances.

Instance_ID	Reconstruction Error	Risk Score	Risk_Level	IG_Value	SHAP_Value	Perturbation_Value
Instance_1	0.096563	0.324644	Normal	0.332549	0.590415	0.382106
Instance_2	0.017343	0.058308	Normal	0.347724	4.04E−05	0.349023
Instance_3	0.23296	0.783207	Normal	0.277649	0.450088	0.34465
Instance_4	0.312947	1.052123	Low_Risk	1.538347	0.989904	0.240123
Instance_5	0.064874	0.218105	Normal	1.084932	1.214681	0.378436
Instance_6	0.120097	0.403763	Normal	0.45078	0.000568	0.45247
Instance_7	0.262711	0.88323	Normal	0.305198	0.43524	0.359591
Instance_8	0.049949	0.167927	Normal	0.462337	1.435201	0.171437
Instance_9	1.243194	4.179594	Medium_Risk	1.572173	0.000104	1.536995
Instance_10	0.028579	0.09608	Normal	0.197178	0.636449	0.442014
Instance_11	0.160518	0.539658	Normal	0.449272	1.084443	0.184187
Instance_12	0.127472	0.428557	Normal	0.420437	1.114343	0.18204
Instance_13	2.106589	7.08231	Extreme_Risk	4.142073	1.368475	2.2313
Instance_14	0.019202	0.064558	Normal	0.190981	0.679516	0.48421
Instance_15	1.463372	4.919827	Medium_Risk	7.358275	2.510896	1.714236
Instance_16	0.463765	1.559168	Low_Risk	0.794006	0.000273	0.793759
Instance_17	0.032802	0.11028	Normal	0.363011	0.000211	0.363865
Instance_18	0.277533	0.933059	Normal	0.452468	0.668768	0.728295
Instance_19	1.301491	4.375586	Medium_Risk	1.759496	1.036506	1.426202
Instance_20	1.175546	3.952162	Medium_Risk	1.309625	0.617724	1.308714

Figure A3. Risk Score Distribution of new instances.

Figure A4. Correlation between reconstruction error and risk score: 1.0000. Regression equation: Risk Score = 3.3620 × Reconstruction Error − 0.0000.

Table A7. Feature importance of all features for the 20 new instances.

Instance_ID	W-Home	W-Transaction	W-Price	W-Retailer	W-Chip	W-Pin	W-Online
Instance_1	0.237267139	0.010103283	0.007877149	0.000146618	0.007112523	0.01005182	0.727441469
Instance_2	0.028305366	0.018029417	0.006613606	0.0004834	0.050477914	0.011348816	0.884741482
Instance_3	0.080796448	0.017816731	0.758909309	0.005076726	0.047685947	0.010416248	0.07929859
Instance_4	0.00148074	0.003380702	0.196502697	0.004808384	0.06757851	0.002727031	0.723521936
Instance_5	0.054408818	0.012614053	0.341553393	0.00275872	0.460102583	0.014345476	0.114216956
Instance_6	0.034136039	0.085371033	0.307591742	0.000293307	0.030516904	0.006860236	0.535230739
Instance_7	0.078358473	0.014046829	0.785137777	0.004364813	0.040997123	0.008955218	0.068139767
Instance_8	0.034920621	0.023374237	0.040445422	0.001720534	0.046433054	0.083857952	0.769248179
Instance_9	0.03978803	0.00062748	0.000305463	0.540732547	0.022326325	0.005019335	0.39120082
Instance_10	0.110279153	0.005628721	0.563695902	0.0098242	0.094632542	0.020636935	0.195302547
Instance_11	0.02252188	0.034650133	0.391721133	0.002748312	0.393938029	0.023913809	0.130506704
Instance_12	0.038514127	0.048962087	0.09936341	0.004088747	0.580565782	0.035329299	0.193176548
Instance_13	0.000820054	0.00151252	0.745757354	0.001736528	0.0003937	0.000310623	0.249469221
Instance_14	0.057653496	0.048524927	0.562164565	0.010151496	0.097811478	0.021329667	0.202364371
Instance_15	0.555578632	0.02351932	0.002014832	0.002673775	0.003209305	0.346585902	0.066418235
Instance_16	0.547957038	0.004843872	0.087897626	0.000183716	0.019149422	0.004305005	0.335663321
Instance_17	0.065404246	0.009072595	0.037325634	0.000452796	0.047356019	0.010647526	0.829741184
Instance_18	0.024102204	0.719180111	0.165032308	0.002807489	0.027043632	0.005897326	0.055936929
Instance_19	0.009547065	0.46215441	0.401369918	0.001405106	0.004662301	0.001136057	0.119725143
Instance_20	0.015337583	0.925522783	0.031828858	0.000838015	0.008068402	0.001759479	0.016644881

Appendix B.2. Detailed Feature-Level Impact Analysis for Case Study

Table A8. Feature impact analysis for Instance 13. Integrated computation based on reconstruction error deviation and feature attributions.

Feature	Impact_Score	Importance	Deviation_Score	Deviation	Instance_Error	N_Mean	N_Std	N_Importance
price	1.264097	0.745757	1.695051	5.303835	14.49372	2.041977	2.347686	0.31959
online	0.055845	0.249469	0.223857	1.328263	7.33E−07	0.65	0.48936	0.168534
repeat	0.000336	0.001737	0.193629	4.184126	0.014401	0.95	0.223607	0.046277
transaction	0.000186	0.001513	0.122953	0.493284	0.033636	9.693189	19.58212	0.249255
home	7.11E−05	0.00082	0.086688	0.554225	0.204363	25.30003	45.28064	0.156414
chip	7.59E−06	0.000394	0.01929	0.638075	1.38E−06	0.3	0.470162	0.030231
pin	3E−06	0.000311	0.009649	0.324882	3.33E−06	0.1	0.307794	0.0297

N_Importance: Attribution-based importance of each feature with respect to the original dataset (see Table A3). N_Mean: Mean reconstruction error of each feature for normal transactions in the original dataset. N_Std: Standard deviation of reconstruction error of each feature for normal transactions in the original dataset. Instance_Error: Reconstruction error of each feature for Instance_13. Deviation: Feature-wise deviation for Instance_13, defined as Deviation = abs((Instance_Error − N_Mean)/N_Std). Deviation_Score: Feature deviation score for Instance_13, computed as Deviation_Score = Deviation × N_Importance. Importance: Feature importance for Instance_13 (see Appendix B.1, Table A7). Impact_Score: Feature impact score for Instance_13, computed as Impact_Score = Importance × Deviation_Score.
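The three definitions above chain together, as the following sketch shows for the "price" row of Table A8. The input numbers are taken from that row; the function name `impact_score` and its argument names are illustrative assumptions.

```python
def impact_score(instance_error, n_mean, n_std, n_importance, importance):
    """Chain the three Table A8 definitions for one feature of one instance."""
    deviation = abs((instance_error - n_mean) / n_std)
    deviation_score = deviation * n_importance
    return deviation, deviation_score, importance * deviation_score

# Row "price" (ratio_to_median_purchase_price) of Table A8, Instance_13.
deviation, deviation_score, impact = impact_score(
    instance_error=14.49372,
    n_mean=2.041977,
    n_std=2.347686,
    n_importance=0.31959,
    importance=0.745757,
)
```

The computed values reproduce the tabulated Deviation (5.303835), Deviation_Score (1.695051), and Impact_Score (1.264097) up to rounding of the inputs.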

Appendix B.3. Implementation Templates and Retrieved Knowledge for Case Study

Figure A5. An Example of RAG-Retrieved Domain Knowledge for Fraud Type 1 (“Counterfeit Card Fraud”).

These two resources illustrate the closed-loop workflow from model attribution to auditable explanation. Figure A6 maps a validated key feature to a predefined fraud rule template, converting quantitative evidence into business semantics. Figure A5 displays the specific modus operandi descriptions that the RAG system retrieves from an authoritative knowledge base for the triggered rule, providing a verifiable external knowledge anchor for the generation step. Together, they support the controllability and factuality of the final explanation.

Figure A6. An Example of Multi-Fraud Rule Mapping via Key Feature (ratio_to_median_purchase_price).

Appendix B.4. Complete Generated Report for the Instance_13 Case Study

Trustworthy AI Deep-Dive Fraud Detection Analysis Report—Instance_13

A Comprehensive Evaluation Based on the Bridging Methodology and NeurIPS Trustworthy AI Standards

Report ID: AI-FD-20260103-001

Analysis Target: Instance_13 (Extreme-Risk Transaction)

Generation Time: 3 January 2026, 01:02:39

Analysis Model: qwen2.5:7b

Data Sources: Five Feature Analysis Files

Report Type: In-depth Insight Analysis

Table of Contents

1. Executive Summary
2. Research Background and Methodology
3. Feature Analysis of Instance_13
4. Feature Importance Evaluation
5. Crime Pattern Matching Analysis
6. Trustworthiness Assessment (Based on NeurIPS Standards)
7. Risk Management Recommendations
8. Conclusions and Outlook
9. Appendix

1. Executive Summary

This report presents a comprehensive trustworthy AI analysis of Instance_13, an extreme-risk transaction. The transaction was identified as “extreme risk” with a risk score of 7.082, primarily driven by an abnormally high ratio of transaction amount to historical median purchase price (ratio_to_median_purchase_price = 10.124), which alone contributed 32.0% to the overall risk.

Instance_13 simultaneously triggered seven distinct crime patterns, all classified at the “strong alert” level.

The analysis adopts a Bridging Methodology, constructing a “Trust Triangle” through:

Multi-attribution consistency (reliability),
Statistical significance testing (validity), and
RAG-enhanced real-world crime knowledge (semantic grounding).

Beyond risk classification, this report delivers deep insights across four dimensions: feature anomalies, crime pattern matching, trustworthiness standards, and actionable risk management strategies.

2. Research Background and Methodology

2.1. Research Background

With the widespread adoption of AI in financial risk management, the “black-box” problem has become a major regulatory concern. Traditional machine learning models often lack interpretability, making it difficult to meet compliance requirements. The Bridging Methodology proposed in this report addresses this challenge by transforming algorithmic predictions into verifiable and trustworthy business insights through a quantitative validation framework.

2.2. Analytical Methodology

Reliability Assessment: Integration of SHAP, perturbation-based methods, and Integrated Gradients to compute an attribution consistency score (0.702).
Validity Verification: Stability evaluation using Bootstrap confidence intervals and statistical significance testing (adjusted p-values).
Semantic Enhancement: Matching against real-world crime patterns using Retrieval-Augmented Generation (RAG).
Instance-Level Analysis: Multi-layer analysis of Instance_13 at the feature, pattern, and system levels.

3. Feature Analysis of Instance_13

Detailed Feature Analysis and Risk Insights

1. Interpretation of Feature Values

ratio_to_median_purchase_price = 10.124

This value is significantly higher than the historical median purchase price, indicating that the transaction amount far exceeds normal levels. From a business perspective, this suggests a heightened risk of fraud or anomalous high-value transactions. Such deviations may stem from one-time large purchases, special promotions, or malicious credit card misuse.

online_order = 1 and used_chip = 0

This combination indicates an online transaction without chip-based verification. While online transactions are not inherently risky, the absence of chip authentication increases the likelihood of card misuse or fraud, warranting further investigation.

distance_from_home = 5.282 and distance_from_last_transaction = 0.372

These values indicate that the transaction occurred approximately 5 km from the cardholder’s residence but very close to the location of the previous transaction. This geographic pattern may suggest multiple transactions within a short time frame and limited spatial movement, potentially indicating repeated use at the same merchant or abnormal purchasing behavior.

repeat_retailer = 1 with other anomalous features

Although the retailer is repeatedly used, other features (e.g., distance and timing) exhibit abnormal behavior. This inconsistency may imply altered purchasing habits under special circumstances or potential fraudulent activity.

2. Feature Interactions and Combined Risk

High purchase price ratio

When combined with offline transactions (online_order = 0), an unusually high ratio_to_median_purchase_price may signal large cash-based purchases. In the absence of chip verification, this pattern further elevates fraud risk.

Geographical behavior analysis

Fluctuations between distance_from_home and distance_from_last_transaction may reflect unusual behavioral or environmental changes, indicating potential risk factors.

3. Anomaly Comparison

Compared with normal transaction patterns, Instance_13 deviates across multiple dimensions. Typically, users rely on chip authentication and exhibit stable geographic behavior. The combination of a high purchase ratio, lack of chip usage, and anomalous geographic patterns collectively signals elevated risk.

4. Business Implications

In operational settings, these feature combinations can serve as key indicators of potential fraud. For example:

A significantly elevated ratio_to_median_purchase_price should automatically trigger enhanced verification or manual review.
Online transactions without chip usage should prompt strengthened security checks.

5. High-Risk Signal Patterns

Identified high-risk patterns include:

Abnormally large transactions
Non-standard payment methods (online_order = 1 and used_chip = 0)
Geographical pattern anomalies
Repeated retailer usage with other abnormal features

Recognizing these patterns can substantially improve fraud detection accuracy and efficiency.

4. Feature Importance Evaluation

Global Feature Importance Analysis

1. Business Implications of Global Feature Ranking

In model training, ratio_to_median_purchase_price was identified as the most influential feature, with an importance score of 0.3196 and strong statistical significance (p = 0.0010). This indicates high explanatory power for the target variable and suggests that spending deviations effectively capture abnormal financial behavior.

2. Relationship Between Statistical Significance and Stability

Although distance_from_last_transaction shows a relatively high importance score (0.2493), its statistical significance is weak (p = 0.2457) and stability is low, suggesting susceptibility to noise. In contrast, features such as repeat_retailer and chip usage exhibit both low importance and low stability and may be safely deprioritized.

3. Global vs. Instance-Level Importance

Global importance reflects aggregate contributions across all samples, potentially masking features that are critical in specific cases. Instance-level importance, by contrast, highlights features that are decisive for individual predictions, as demonstrated in Instance_13.

4. Interpretation of p-values and Effect Sizes

The online_order feature exhibits high statistical stability but moderate importance (0.1685), indicating consistent yet limited predictive contribution.

5. Risk Management Challenges of Low-Stability, High-Importance Features

distance_from_home exemplifies this issue, with moderate importance (0.1564) but low stability (CI width = 0.344), increasing predictive uncertainty and necessitating further validation.

6. Limitations of Feature Importance Analysis

Feature importance does not capture feature interactions and may undervalue features critical to real-world decision-making. Thus, it should be complemented with expert knowledge.

Business Recommendations

Further validate key low-stability features through additional data and improved preprocessing.
Do not fully disregard statistically weak features, as they may be crucial in specific scenarios.
Incorporate domain expertise to enhance interpretability and robustness.

5. Crime Pattern Matching Analysis

Crime Pattern Matching

1. Reasons for Triggering Seven Crime Patterns

Instance_13 exhibits feature combinations that activate multiple crime patterns. For example, counterfeit card fraud relies on geographic distance and purchase ratio features, while phishing-based fraud emphasizes online transactions and repeated merchant usage.

2. Feature Weights and Priority Across Crime Patterns

Counterfeit Card Fraud (Physical Forged Cards): Driven primarily by high purchase ratios, followed by geographic distance features.
Intercepted New Card Fraud: Emphasizes purchase ratio and cardholder residence information.
Phishing-Based Fraud (Card-Not-Present): Focuses on online transactions and repeated merchant usage.

3. Alignment Between RAG-Extracted Crime Knowledge and Observed Features

RAG-extracted real-world crime techniques align closely with Instance_13’s feature patterns, strengthening the realism and credibility of the explanation.

4. Potential Organized Crime Operation Modes

Credit card forgery
Phishing attacks and malware-based data theft
Cross-channel fraud across online and offline transactions

5. Trends in Crime Pattern Evolution

Future crime trends may increasingly leverage advanced technologies (e.g., AI-generated phishing sites, cryptocurrency platforms), underscoring the need for enhanced online security, real-time monitoring, and cross-agency collaboration.

Investigation Recommendations

Strengthen identity verification, especially for high-risk transactions.
Improve user education regarding phishing and counterfeit card fraud.
Conduct regular reviews of anomalous transaction patterns.

6. Trustworthiness Assessment (Based on NeurIPS Standards)

Trustworthiness Evaluation

1. Reliability (NeurIPS Standard)

The feature ratio_to_median_purchase_price demonstrates high importance, strong statistical significance, and moderate CI width (0.357), indicating robustness. Other features exhibit wider confidence intervals and lower robustness.

2. Validity (NeurIPS Standard)

While the primary feature shows strong predictive relevance, others (e.g., online_order, distance_from_home) have limited contribution and weak statistical significance.

3. Interpretability and Transparency

Although feature importance results are clear, detailed explanatory narratives should be expanded to enhance transparency and acceptance.

4. Fairness and Bias

Potential sample selection bias cannot be ruled out, particularly for statistically insignificant features, highlighting the need for representative training data.

5. Recommendations for Improving Trustworthiness

Increase data volume and diversity
Optimize feature engineering
Enhance explainability using advanced attribution techniques
Establish continuous model monitoring mechanisms

7. Risk Management Recommendations

Comprehensive Risk Management Strategy

1. Immediate Risk Response

Immediately freeze high-risk transactions.
Promptly contact affected customers for identity verification.

2. Mid-Term Monitoring Adjustments

Dynamically adjust feature thresholds.
Introduce advanced machine learning models (e.g., Random Forest, XGBoost).

3. Long-Term Risk Governance

Implement multi-layer defense systems.
Develop real-time monitoring and alert platforms.

4. Technical System Optimization

Visualize feature importance.
Integrate explainability tools such as SHAP.

5. Personnel Training and Process Redesign

Conduct regular professional training.
Perform simulated incident response drills.

Expected Outcomes

Reduced fraud incidence
Improved customer satisfaction
Optimized resource allocation

8. Conclusions and Outlook

Key Findings and Value

This study identifies ratio_to_median_purchase_price as a critical driver of fraud detection accuracy and validates its importance through crime pattern matching and real-world case alignment.

Limitations and Challenges

Data representativeness, feature dependency, and limited interpretability remain challenges, particularly for regulatory compliance.

Future Research Directions

Future work should incorporate richer data sources, develop highly interpretable AI models, and explore privacy-preserving techniques such as federated learning.

Industry and Regulatory Impact

This research provides valuable guidance for financial institutions and supports the development of standardized AI governance frameworks.

Outlook for Trustworthy AI in Financial Risk Management

Trustworthy AI will play a central role in enhancing transparency, accountability, and customer trust in financial risk management systems.

9. Appendix

9.1 Data Sources

Global feature importance analysis
Instance_13 raw feature values
Feature impact scores
Triggered fraud rules
RAG-based real crime knowledge base

9.2 Method Overview

Integrated attribution-based feature importance
Bonferroni-corrected significance testing
Impact score formulation
Rule-based crime pattern triggering

9.3 Model Configuration

Model: qwen2.5:7b
Temperature: 0.3
Analysis Depth: Feature → Instance → Pattern → System

Report generation completed.

References

1. Samek, W.; Montavon, G.; Vedaldi, A.; Hansen, L.K.; Müller, K.R. (Eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Lecture Notes in Artificial Intelligence 11700; Springer: Cham, Switzerland, 2019.
2. Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. Found. Trends Mach. Learn. 2019, 12, 307–392.
3. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
4. Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; Kim, B. Sanity checks for saliency maps. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018; Volume 31.
5. Gal, Y.; Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 19–24 June 2016; pp. 1050–1059.
6. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 6–11 July 2017; pp. 3319–3328.
7. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. Nat. Mach. Intell. 2020, 2, 56–67.
8. Wasserstein, R.L.; Schirm, A.L.; Lazar, N.A. Moving to a World Beyond “p < 0.05”. Am. Stat. 2019, 73, 1–19.
9. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2020), Virtual, 6–12 December 2020; Volume 33, pp. 9459–9474.
10. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 24824–24837.
11. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2020), Virtual, 6–12 December 2020; Volume 33, pp. 1877–1901.
12. Chen, L.; Zaharia, M.; Zou, J. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv 2023, arXiv:2305.05176.
13. Chen, Z.; Bei, Y.; Rudin, C. Concept Whitening for Interpretable Image Recognition. Nat. Mach. Intell. 2020, 2, 772–782.
14. Molnar, C. Interpretable Machine Learning. 2020. Lulu.com. Available online: https://www.academia.edu/103808014/Interpretable_Machine_Learning (accessed on 4 March 2026).
15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
16. Mao, R.; Liu, Q.; He, K.; Li, W.; Cambria, E. The Biases of Pre-Trained Language Models: An Empirical Study on Prompt-Based Sentiment Analysis and Emotion Detection. IEEE Trans. Affect. Comput. 2023, 14, 1743–1753.
17. Lakkaraju, H.; Kamar, E.; Caruana, R.; Leskovec, J. Faithful and customizable explanations of black box models. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES ’19), Honolulu, HI, USA, 27–28 January 2019; pp. 131–138.
18. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
19. Kumar, I.E.; Venkatasubramanian, S.; Scheidegger, C.; Friedler, S. Problems with Shapley-value-based explanations as feature importance measures. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Virtual, 13–18 July 2020; pp. 5491–5500.
20. Wang, S.; Deng, Q.; Feng, S.; Zhang, H.; Liang, C. A Survey on Rank Aggregation. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024), Jeju, South Korea, 3–9 August 2024; pp. 8281–8289.
21. Conover, W.J. Practical Nonparametric Statistics, 4th ed.; John Wiley & Sons: New York, NY, USA, 2024.
22. Kaggle. Dhanush Narayanan, R. Credit Card Fraud Dataset. 2021. Available online: https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud (accessed on 1 February 2026).
23. Guyon, I.; Gunn, S.; Nikravesh, M.; Zadeh, L.A. (Eds.) Feature Extraction: Foundations and Applications, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2020; Volume 207.
24. Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 1995), Montreal, QC, Canada, 20–25 August 1995; Volume 14, pp. 1137–1145.
25. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.-t. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Virtual, 16–20 November 2020; pp. 6769–6781.
26. Mentch, L.; Hooker, G. Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests. J. Mach. Learn. Res. 2016, 17, 1–41.
27. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation forest. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), Pisa, Italy, 15–19 December 2008; pp. 413–422.
28. Kaufman, S.; Rosset, S.; Perlich, C.; Stitelman, O. Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM Trans. Knowl. Discov. Data (TKDD) 2012, 6, 15.
29. Davis, J.; Goadrich, M. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), Pittsburgh, PA, USA, 25–29 June 2006; pp. 233–240.
30. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30.

Figure 1. The Trust Triangle framework: A three-module architecture for trustworthy and explainable fraud detection.

Figure 2. Two-stage BAE pipeline for reliable fraud detection: building a statistically robust predictive foundation.

Figure 3. Evidential reliability and external validity pipeline: transforming raw model outputs into verified quantitative evidence.

Figure 4. Controlled generation pipeline: Transforming verified evidence into trustworthy, audit-ready natural language explanations.

Figure 5. Visualization of the final integrated feature importance ranking (color-coded by confidence level).

Figure 6. Distribution of feature impact scores for Instance 13: identification of key features based on impact scores (Risk Level: Extreme Risk).

Table 1. Comparison of existing approaches with the proposed Trust Triangle framework.

Aspect	Existing Approaches	Limitations	Trust Triangle Advantage
Feature Attribution	Single-method approaches: LIME [1], Integrated Gradients [6], SHAP [7]	Each method has distinct biases; results can be inconsistent and method-dependent [4,19]	Multi-method consensus (Section 3.2.1) aggregates three theoretically distinct methods, quantifying agreement via micro-consistency and macro-consensus ($C_{\mathrm{global}}$) [20]
Attribution Reliability	Sanity checks reveal that many saliency maps are insensitive to model randomization [4]; Shapley-value-based explanations may not reflect true feature importance [19]	No quantitative standard for assessing attribution reliability	The evidential reliability pillar provides quantitative consistency metrics and adaptive fusion weights ($w_{\mathrm{rel}}$) based on cross-method agreement
Statistical Grounding	Post hoc explanations rarely validate attributions against ground-truth outcomes	Attributions may highlight features that are statistically insignificant or lack real-world relevance	The external validity pillar applies the Mann–Whitney U test with FDR correction [8,21] and effect size analysis, ensuring only statistically grounded features receive high importance
Uncertainty Quantification	Standard autoencoders [2,23] provide point estimates without confidence intervals; DAGMM [3] improves density estimation but lacks inference-time uncertainty	Predictions lack calibrated uncertainty, undermining trust in high-stakes decisions	Bayesian Autoencoder with Monte Carlo Dropout [5] provides distributional estimates of reconstruction errors; bootstrap resampling [24] validates stability of thresholds and metrics
Explanation Generation	LLMs alone [11,18] risk hallucination [16] when generating explanations from unvalidated inputs	Fluency without faithfulness; explanations may be plausible but ungrounded	Controlled generation with RAG [9,25] and CoT prompting [10] constrains LLM reasoning to verified quantitative evidence and authoritative domain knowledge
End-to-End Trustworthiness	Existing frameworks lack integrated verification of both reliability and validity before explanation	Trust is assumed post hoc rather than built systematically	Trust Triangle establishes a closed loop: reliability-validity verification → multi-source evidence integration → controlled generation → auditable report

Table 2. The Eight-Step Workflow of Controlled Generation: From Verification to Trustworthy Explanation.

Step	Core Work (Essence)	Alignment with Trust Triangle
➊	Feature Importance Fusion: Aggregates scores from three theoretically distinct attribution methods (IG, SHAP, Perturbation) to generate a consensus-verified composite importance ranking.	Achieves Evidential Reliability. By establishing multi-method consensus, it mitigates the bias and instability inherent in any single post hoc explanation method [4,19]. This transforms the ML model’s internal reasoning into robust, reproducible quantitative evidence, forming a credible foundation for all subsequent steps [20].
➋	New-Instance Risk Scoring: Computes a normalized risk score ($R_{i} = e_{i} / T$) by comparing the instance’s reconstruction error $e_{i}$ to a threshold $T$ statistically derived from normal behavior.	Initiates the Instantiation of External Validity. It converts the model’s raw, absolute anomaly score into a statistically grounded, interpretable relative risk measure [8,24], enabling the transition from global model assessment to individualized risk evaluation.
➌	New-Instance Feature Impact Analysis: Calculates the personalized contribution ($\mathrm{Impact}_{j}^{(i)}$) of each feature by fusing global importance, instance-specific attribution, and feature-value anomaly.	Deeply Integrates Evidential Reliability and External Validity. This step dynamically combines consensus-verified importance (reliability) with instance-level statistical anomalies (validity), creating a traceable, quantitative anchor for individualized explanations [1,7].
➍	Fraud Rule Template: Predefines structured crime patterns, their associated features, and semantic descriptions based on domain knowledge.	Establishes the Semantic Anchor for External Validity and Controlled Generation. Encoding expert knowledge into computable rules provides the necessary structure for aligning statistical evidence with actionable business logic, ensuring explanations possess inherent relevance [30].
➎	Rule Triggering and Alerting: Dynamically matches predefined fraud rules against the aggregated feature impact scores of the new instance and outputs tiered alerts (based on the LLM Interpretation Guideline and the Impact trigger).	Realizes the Business Closure of External Validity. It systematically maps quantitative evidence to comprehensible business semantics (fraud rules), generating actionable, prioritized insights and ensuring the practical utility of the explanation [21].
➏	RAG Knowledge Retrieval: Retrieves authoritative crime modus operandi details from an external knowledge base, strictly keyed by triggered Rule IDs.	Implements the Foundational Constraint for Controlled Generation. By tethering retrieval to quantifiably verified triggers, it restricts the LLM’s context to high-quality, relevant evidence, directly mitigating hallucinations [9,16].
➐	Multi-Source Evidence Integration: Aggregates all quantitative evidence (from ➊➋➌) and qualitative knowledge (from ➍➎➏) into a unified, structured input schema.	Completes the Bridge for Controlled Generation. It constructs a structured interface that forces the subsequent LLM to reason upon an integrated, auditable evidence set, enabling end-to-end traceability [10,11].
➑	Report Generation: Generates the final natural language report via an LLM driven by structured prompts and the integrated evidence.	Executes Controlled, Evidence-Based Semantic Articulation. Guided by chain-of-thought prompting [10] within the evidence-rich context, the LLM produces a coherent, audit-ready narrative that is a direct synthesis of the validated inputs, fulfilling the promise of a trustworthy explanatory system.

Summary. The bridging framework progresses through three stages to fulfill its core principles: 1. Foundation in Verified Evidence: Establishes reliable quantitative attributions (➊) via multi-method consensus, converting the model’s internal state into robust Feature Importance Scores. 2. Anchoring to Real-World Validity: Grounds this evidence in statistical and domain reality by calibrating instance risk (➋), personalizing feature impact (➌), mapping to Fraud Rule Template (➍) and business rules (➎), ensuring external validity and actionability. 3. Controlled Synthesis into Insights: Converts the anchored evidence into a semantic explanation through governed generation—retrieving authoritative context (➏), integrating all sources (➐), and producing an auditable report (➑). This completes the loop from a black-box prediction to a white-box explanation.
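Step ➋ reduces to a single ratio once the threshold is fixed. The sketch below is illustrative only: Table 2 states that T is statistically derived from normal behavior but does not give the rule here, so the nearest-rank percentile, the 95% level, the toy error values, and all function names are assumptions rather than the authors' implementation.

```python
import math


def percentile_threshold(normal_errors, q=0.95):
    """Nearest-rank q-quantile of reconstruction errors on normal transactions.

    ASSUMPTION: a high percentile of normal errors stands in for the
    statistically derived threshold T; the actual rule may differ.
    """
    if not normal_errors:
        raise ValueError("normal_errors must not be empty")
    ordered = sorted(normal_errors)
    rank = max(1, math.ceil(q * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]


def risk_score(instance_error, threshold):
    """Normalized risk R_i = e_i / T; values above 1 exceed the threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return instance_error / threshold


# Toy reconstruction errors for normal transactions (not data from the paper).
normal_errors = [0.5, 0.8, 1.0, 1.2, 2.0]
T = percentile_threshold(normal_errors)
R = risk_score(14.0, T)
print(T, R)
```

Expressing risk as a ratio to T makes scores comparable across instances: a value near 1 sits at the edge of normal behavior, while larger values indicate errors several times the threshold.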

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
