This section presents and discusses the findings from evaluating Random Forest (RF) and XGBoost (XGB) within the Integrated Transparency and Confidence Framework (ITCF). The analysis addresses predictive performance under extreme class imbalance, uncertainty quantification using split conformal prediction, interpretability through LIME, and operational latency. The discussion prioritises practical decision support, auditability, and deployment considerations.
3.1. Predictive Performance Under Extreme Class Imbalance
Table 2 summarises classification performance on the PaySim test set, which exhibits severe class imbalance (1643 fraud cases, with non-fraud transactions making up the remainder). Both models achieve near-perfect accuracy; however, this metric is uninformative here because the majority class dominates. The analysis therefore emphasises Recall, F1-score, and Matthews Correlation Coefficient (MCC), which more accurately reflect minority-class detection and balanced error rates.
XGBoost achieves higher Recall, F1-score, and MCC, indicating improved sensitivity to fraudulent transactions and a better balance across error types. Random Forest achieves slightly higher Precision and AUC-ROC, reflecting marginally stronger suppression of false positives and threshold-independent separability. In fraud detection, where false negatives correspond to missed fraud events, higher Recall is often prioritised even at the cost of a modest increase in false positives. Under this operational framing, XGBoost is selected as the primary base model for subsequent ITCF analyses, while Random Forest is retained for comparative assessment.
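To make the argument about accuracy concrete, the following minimal sketch computes the metrics discussed above from confusion-matrix counts. The counts are hypothetical and are not the Table 2 results; the function name `imbalance_metrics` is introduced here purely for illustration.

```python
import math


def imbalance_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so the majority class alone cannot inflate it.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical counts (NOT the Table 2 results): 1,000 fraud vs 999,000 non-fraud.
# A classifier that never flags fraud still reaches 99.9% accuracy.
always_negative = imbalance_metrics(tp=0, fp=0, fn=1000, tn=999000)
# A detector that finds 800 of the 1,000 fraud cases with 100 false alarms.
detector = imbalance_metrics(tp=800, fp=100, fn=200, tn=998900)

for name, m in (("always_negative", always_negative), ("detector", detector)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

In this toy setting the two classifiers differ by less than 0.1 percentage points in accuracy, whereas Recall, F1-score, and MCC separate them clearly.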
3.2. Probability Calibration and Reliability
Table 3 shows that XGBoost exhibits lower ECE and substantially lower MCE than Random Forest, the latter indicating smaller worst-case deviations from perfect calibration. These values are nevertheless reported as descriptive diagnostics rather than as evidence of decision-calibrated probabilities. Under extreme imbalance, ECE can be dominated by bins populated almost entirely by the majority class, yielding very small averages even when minority-relevant regions exhibit substantial local miscalibration, which MCE is better placed to expose. Sensitivity to binning choices and class-conditional calibration diagnostics are deferred to future work.
Reliability diagrams (Figure 1) illustrate the relationship between predicted probabilities and observed frequencies across bins. Although the global ECE values in Table 3 are numerically small, they should be interpreted cautiously under severe class imbalance, where the dominance of the majority class and binning effects can mask localised miscalibration. For this reason, MCE is reported as a complementary worst-case diagnostic, and calibration plots are used only to describe probability–reliability behaviour. Importantly, ITCF does not rely on calibrated point probabilities for routing decisions; operational decision logic is driven by conformal prediction sets, abstention behaviour, and coverage properties, while calibration diagnostics are included as supplementary evidence of where point probabilities may be locally unreliable.
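The masking effect described above can be reproduced with a small sketch. It assumes ten equal-width bins over the predicted fraud probability, which may differ from the binning used for Table 3, and the data are synthetic rather than PaySim outputs.

```python
def calibration_errors(probs, labels, n_bins=10):
    """Return (ECE, MCE) for positive-class probabilities, equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        gap = abs(mean_prob - observed)
        ece += len(members) / n * gap  # weighted by bin population
        mce = max(mce, gap)            # worst bin, regardless of population
    return ece, mce


# Synthetic data (an assumption for illustration only):
# 990 confident non-fraud predictions, plus 10 mid-range predictions
# of which only 2 are actually fraud.
probs = [0.005] * 990 + [0.55] * 10
labels = [0] * 990 + [1, 1] + [0] * 8
ece, mce = calibration_errors(probs, labels)
print(f"ECE={ece:.5f}  MCE={mce:.2f}")
```

Here the sparsely populated mid-range bin is off by 0.35, yet it carries only 1% of the weight in ECE, so the average remains below 0.01.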
3.3. Uncertainty Quantification via Split Conformal Prediction
Uncertainty was quantified using split conformal prediction, as described in Section 2, at a pre-specified target coverage level. We report empirical coverage, abstention behaviour, and prediction-region characteristics observed on the PaySim test set; Table 4 reports the global properties of the resulting prediction regions.
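As a sketch of how such region-level properties can be tabulated, the snippet below computes empirical coverage and the share of empty, singleton, and two-label regions from a list of prediction sets. The function name, label strings, and the five example cases are hypothetical and do not reproduce Table 4.

```python
from collections import Counter


def summarise_regions(prediction_sets, true_labels):
    """Empirical coverage and set-size shares for binary prediction regions."""
    n = len(true_labels)
    # An empty region can never contain the true label, so it counts as a miss.
    covered = sum(1 for s, y in zip(prediction_sets, true_labels) if y in s)
    sizes = Counter(len(s) for s in prediction_sets)
    return {
        "coverage": covered / n,
        "empty_rate": sizes[0] / n,      # abstentions
        "singleton_rate": sizes[1] / n,  # one confident label
        "two_label_rate": sizes[2] / n,  # both labels retained
    }


# Hypothetical regions and ground truth for five transactions.
sets = [{"non_fraud"}, {"non_fraud"}, set(), {"fraud"}, {"non_fraud"}]
labels = ["non_fraud", "non_fraud", "fraud", "fraud", "fraud"]
summary = summarise_regions(sets, labels)
print(summary)
```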
In this binary setting, the chosen nonconformity score, combined with a single global quantile threshold, tends to yield either (i) a singleton set when one class probability comfortably exceeds the inclusion threshold or (ii) an empty set when neither class meets the criterion. A two-label set would require both classes to satisfy the inclusion inequality simultaneously, which is uncommon under sharply peaked posteriors and extreme imbalance. This behaviour is therefore a property of the current conformal configuration (score choice, binning-free quantile thresholding, and probability sharpness), not a general property of conformal prediction in binary classification.
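The mechanism can be illustrated with a minimal split-conformal sketch. It assumes the nonconformity score is one minus the predicted probability of the candidate label, a common choice that may differ from the score defined in Section 2; the calibration scores and the miscoverage level `alpha = 0.1` are likewise hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")  # too few calibration points: include everything
    return sorted(cal_scores)[k - 1]


def prediction_set(p_fraud, q_hat):
    """Labels whose ASSUMED score 1 - p(label) does not exceed q_hat."""
    probs = {"non_fraud": 1.0 - p_fraud, "fraud": p_fraud}
    return {label for label, p in probs.items() if 1.0 - p <= q_hat}


# Hypothetical calibration scores, i.e. 1 - p(true label) on held-out data.
cal_scores = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.05, 0.06, 0.08,
              0.10, 0.12, 0.15, 0.18, 0.20, 0.25, 0.28, 0.30, 0.62]
q_hat = conformal_threshold(cal_scores, alpha=0.1)

for p in (0.02, 0.50, 0.95):
    print(p, prediction_set(p, q_hat))
```

With this score, a label is included only if its probability is at least `1 - q_hat`. Whenever `q_hat` is below 0.5, the two labels cannot both qualify, so the output is either a singleton or an empty set, consistent with the behaviour described above.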
3.3.1. Class-Conditional Uncertainty Characteristics for XGBoost
To illustrate how uncertainty differs across classes, Table 5 presents entropy and maximum-probability statistics for XGBoost. Fraud cases exhibited higher mean entropy and lower mean maximum probability than non-fraud cases, indicating that detecting the minority class remains inherently more uncertain in highly imbalanced settings. In this study, uncertainty primarily manifests through abstentions (empty regions) rather than multi-class ambiguity.
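For reference, class-conditional statistics of this kind can be computed as in the sketch below. Entropy is taken in nats, which is an assumption since the logarithm base is not stated here, and the four probabilities are invented for illustration rather than drawn from Table 5.

```python
import math
from statistics import fmean


def binary_entropy(p_fraud):
    """Shannon entropy (in nats) of a two-class predictive distribution."""
    if p_fraud <= 0.0 or p_fraud >= 1.0:
        return 0.0  # a degenerate distribution carries no uncertainty
    q = 1.0 - p_fraud
    return -(p_fraud * math.log(p_fraud) + q * math.log(q))


def class_conditional_uncertainty(p_fraud, labels):
    """Mean entropy and mean maximum probability per true class (0/1)."""
    summary = {}
    for cls in (0, 1):
        ps = [p for p, y in zip(p_fraud, labels) if y == cls]
        summary[cls] = {
            "mean_entropy": fmean([binary_entropy(p) for p in ps]),
            "mean_max_prob": fmean([max(p, 1.0 - p) for p in ps]),
        }
    return summary


# Hypothetical predicted fraud probabilities; label 1 denotes fraud.
p_fraud = [0.01, 0.02, 0.60, 0.90]
labels = [0, 0, 1, 1]
summary = class_conditional_uncertainty(p_fraud, labels)
print(summary)
```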
3.3.2. Illustrative High- and Low-Uncertainty Examples
Table 6 presents representative transactions spanning both high- and low-uncertainty regimes within the Integrated Transparency and Confidence Framework (ITCF), clarifying the instance-level behaviour underlying the reported prediction-region statistics and coverage results. These examples are illustrative rather than exhaustive and are intended to provide operational insight into how entropy thresholds and conformal prediction regions inform abstention, automation, and analyst triage during inference.
For instance, case 17,257 presents near-balanced predicted class probabilities and an entropy value exceeding the operational threshold. Because confidence in the prediction is insufficient, the case is escalated to an analyst for further review. In a deployment setting, each abstention could be accompanied by a structured note documenting the rationale for escalation (e.g., balanced probabilities and entropy exceeding the policy threshold) to support audit trails and evidence of human oversight.
In contrast, low-uncertainty instances are characterised by highly skewed predicted probabilities, near-zero entropy, and single-class conformal regions. Such cases support confident automated decision-making, thereby preserving operational efficiency while maintaining formal coverage guarantees.
The following examples are illustrative and operational in nature and do not constitute additional statistical evidence beyond the aggregate uncertainty results already reported.
For the same instances shown in Table 6, LIME explanations are generated only for high-uncertainty cases and are used to contextualise analyst review by highlighting the features most responsible for predictive ambiguity, as discussed in Section 3.4.
3.4. Interpretability with LIME Explanations
LIME generated instance-level explanations and aggregated attribution summaries to enhance transparency. Across sampled instances, the most frequently identified features were transaction amount, origin/destination balances (oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest), and transaction type. Table 7 and Figure 2 provide an aggregated summary of the most influential features identified in the sampled explanations.
LIME explanations are interpreted as descriptive indicators of local model behaviour rather than as causal attributions. A key next step for deployment-oriented validation is benchmarking explanation patterns against analyst reason codes or internal fraud typologies, to assess whether the explanations are consistent with domain expectations.
Within ITCF, LIME is applied selectively to cases routed for review (i.e., abstentions or high-entropy predictions). Explanations are used as decision-support artefacts for analysts rather than as definitive causal statements. This selective use mitigates interpretability overhead while acknowledging known variability in LIME explanations due to stochastic perturbation sampling.
The LIME explanations were not validated against expert-labelled reason codes or rule-based typologies in this study. They should therefore be interpreted as model-consistent descriptors rather than regulatory justifications.
3.5. Effectiveness of the Proposed Integrated Transparency and Confidence Framework
The ITCF operationalises uncertainty and explainability within a unified decision workflow. Low-uncertainty singleton predictions are eligible for low-touch handling, while abstentions and high-entropy cases are escalated for human review. For escalated cases, local explanations provide contextual evidence to support the analyst’s judgement and documentation.
This applied integration aligns established explainability and uncertainty quantification techniques within a coherent operational workflow. The framework focuses on operational decision support, traceability, and governance, which are critical considerations in regulated financial environments. Its traceability features aim to align with common model risk management expectations (e.g., documented validation, monitoring, and audit trails) and to support supervisory review.
A consolidated comparison of model performance, explainability, and probability-diagnostic metrics is presented in Table 8.
The overall score is computed using an illustrative weighted aggregation of performance, explainability, and probability diagnostics, intended to demonstrate comparative trade-offs rather than to prescribe a fixed ranking, since any such ranking is policy-dependent. The weighting scheme was selected by the authors solely for illustrative purposes to reflect a common governance trade-off in fraud operations, where detection effectiveness is typically prioritised while explainability and reliability remain material for auditability and oversight. The weights (Performance 40%, Explainability 35%, Probability diagnostics 25%) are not normative and do not reflect regulatory prescriptions. Institutions may reweight these dimensions based on policy objectives, analyst capacity, or risk appetite. A sensitivity check can be performed by shifting the weights by a fixed number of percentage points to assess ranking stability under alternative governance priorities.
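The aggregation and the suggested sensitivity check can be sketched as follows. The weights are those stated above; the per-dimension sub-scores, the model names, and the perturbation scheme (moving a fixed amount of weight from one dimension to another) are hypothetical choices made for illustration and do not reproduce Table 8.

```python
BASE_WEIGHTS = {"performance": 0.40, "explainability": 0.35, "probability": 0.25}


def overall_score(sub_scores, weights):
    """Weighted average of per-dimension scores (weights are renormalised)."""
    total = sum(weights.values())
    return sum(sub_scores[dim] * w for dim, w in weights.items()) / total


def top_model(models, weights):
    """Name of the model with the highest overall score."""
    return max(models, key=lambda name: overall_score(models[name], weights))


def ranking_is_stable(models, base_weights, delta):
    """True if moving `delta` of weight between any two dimensions
    leaves the top-ranked model unchanged."""
    baseline = top_model(models, base_weights)
    for src in base_weights:
        for dst in base_weights:
            if src == dst:
                continue
            shifted = dict(base_weights)
            shifted[src] -= delta
            shifted[dst] += delta
            if min(shifted.values()) < 0:
                continue  # skip perturbations that yield negative weights
            if top_model(models, shifted) != baseline:
                return False
    return True


# Hypothetical sub-scores in [0, 1]; NOT the values reported in Table 8.
models = {
    "model_a": {"performance": 0.90, "explainability": 0.80, "probability": 0.85},
    "model_b": {"performance": 0.85, "explainability": 0.86, "probability": 0.80},
}
print(top_model(models, BASE_WEIGHTS))
print(ranking_is_stable(models, BASE_WEIGHTS, delta=0.10))
print(ranking_is_stable(models, BASE_WEIGHTS, delta=0.15))
```

In this toy example the ranking survives a shift of 10 percentage points but flips at 15, which is the kind of boundary a sensitivity check is meant to reveal.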
3.5.1. Targeted Human Review and Efficient Analyst Resource Allocation
The ITCF employs a triage-oriented decision workflow that connects predictive uncertainty with human review, directing analyst attention toward cases where automated decisions are least reliable and human oversight is most warranted. Instead of applying uniform manual review or fixed alert thresholds, the framework routes transactions based on uncertainty signals derived from split-conformal prediction and supplementary probability-based measures.
At the chosen confidence level, a fraction of transactions yields empty conformal prediction regions, and the model conservatively abstains on them. These abstentions serve as the primary trigger for analyst review and indicate cases where the model cannot assign a label with the required coverage guarantees. In principle, abstention-based routing can reduce review volume relative to indiscriminate escalation, but operational impact is not measured in this offline evaluation.
The ITCF decision routing logic and corresponding responsibility allocation are summarised in Table 9.
Following [33], predictive entropy, calculated from point probabilities, provides an additional signal for prioritising cases in the analyst queue. However, entropy does not supersede conformal abstentions. Instead, it provides additional information for ordering and managing the cases under review. This distinction preserves the statistical validity of conformal prediction while enabling operational flexibility in workload management for analysts.
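One way to express this routing in code is sketched below. It is an illustrative reading of the workflow rather than the exact logic of Table 9: the `Case` structure, the entropy threshold of 0.5, and the descending-entropy queue order are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    prediction_set: frozenset  # conformal region: empty, one label, or both
    entropy: float             # predictive entropy from point probabilities


ENTROPY_THRESHOLD = 0.5  # hypothetical policy threshold, not from the paper


def route(case, entropy_threshold=ENTROPY_THRESHOLD):
    """Return 'review' or 'automated' for a single case."""
    # Any non-singleton region (abstention or two labels) is always reviewed;
    # entropy cannot override this conformal decision.
    if len(case.prediction_set) != 1:
        return "review"
    # Singleton regions are escalated only if entropy exceeds the threshold.
    if case.entropy > entropy_threshold:
        return "review"
    return "automated"


def build_review_queue(cases, entropy_threshold=ENTROPY_THRESHOLD):
    """Cases routed to review, most uncertain (highest entropy) first."""
    flagged = [c for c in cases if route(c, entropy_threshold) == "review"]
    return sorted(flagged, key=lambda c: c.entropy, reverse=True)


# Hypothetical cases for illustration.
cases = [
    Case("c1", frozenset(), 0.69),
    Case("c2", frozenset({"non_fraud"}), 0.05),
    Case("c3", frozenset({"fraud"}), 0.62),
    Case("c4", frozenset(), 0.66),
]
queue = build_review_queue(cases)
print([c.case_id for c in queue])
```

In this sketch entropy changes only the order of the review queue and the handling of singleton regions, so abstentions are never downgraded to automated handling.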
The applied triage design is intended to enhance operational efficiency and decision quality by directing expert attention to cases characterised by genuine uncertainty or policy sensitivity. By integrating uncertainty-aware routing and selective explainability, the ITCF facilitates defensible human-in-the-loop decision-making while mitigating the risks of overconfident automation and reducing unnecessary manual workload.
3.5.2. Governance-Oriented Transparency
The ITCF generates auditable artefacts for each decision, such as uncertainty objects (prediction regions) and traceable explanations for escalated cases. Although these artefacts can enhance transparency and documentation in regulated environments, successful deployment necessitates further validation using real-world datasets, including drift-aware evaluation and explicit calibration analysis.
3.5.3. Operational Latency and Deployability
Operational feasibility was evaluated through latency measurements across three inference scenarios: single-instance (real-time), small-batch (32 instances), and full-batch (offline processing).
Table 10 reports summary statistics.
The results suggest that XGBoost is better suited to real-time and near-real-time fraud detection workflows, where rapid decision-making is essential. Although both models demonstrate similar conformal coverage and predictive accuracy, XGBoost’s lower latency offers a distinct operational advantage, justifying its selection as the preferred model within the ITCF. Notably, latency is considered a secondary deployment factor rather than a primary optimisation objective, thereby maintaining performance, uncertainty, reliability, and interpretability as the core criteria for model evaluation.
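A latency measurement of this kind can be sketched with a simple timing harness. The summary statistics chosen here (mean, median, 95th percentile), the warm-up and repeat counts, and the stand-in `dummy_predict` function are assumptions; the actual protocol behind Table 10 may differ, and a fitted model's predict method would replace the stand-in in practice.

```python
import math
import statistics
import time


def measure_latency_ms(predict, batch, repeats=50, warmup=5):
    """Wall-clock latency of predict(batch) in milliseconds."""
    for _ in range(warmup):
        predict(batch)  # warm-up calls are excluded from the statistics
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(batch)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile on the sorted samples.
    p95_index = min(len(samples) - 1, math.ceil(0.95 * len(samples)) - 1)
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p95": samples[p95_index],
    }


def dummy_predict(rows):
    """Stand-in for a fitted model's predict method (hypothetical)."""
    return [sum(row) > 1.0 for row in rows]


single = [[0.2, 0.3, 0.9]]      # single-instance (real-time) scenario
small_batch = single * 32       # small-batch scenario (32 instances)
single_stats = measure_latency_ms(dummy_predict, single)
batch_stats = measure_latency_ms(dummy_predict, small_batch)
print("single:", single_stats)
print("batch32:", batch_stats)
```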
3.5.4. Operational Impact
The following points outline potential operational implications, as opposed to empirically validated outcomes:
Proactive Risk Management: By routing uncertain cases for review, the framework may support earlier intervention in operational settings; however, operational impact is not measured in this offline evaluation.
Explainability for Human Analysts: Aligning model explanations with established fraud behaviour patterns may foster trust and assist analysts in understanding the rationale for each flagged case, although this alignment was not validated in the present study.
Efficient Resource Allocation: Conformal prediction regions with target coverage facilitate prioritisation, allowing analysts to focus on high-uncertainty cases rather than reviewing all model alerts. This approach may improve both efficiency and decision quality.
Governance Readiness: The integrated interpretability and uncertainty artefacts support transparency, traceability, and audit documentation practices commonly used in regulated financial environments.
3.6. Comparison of the ITCF with Prior Research Work
The proposed framework (ITCF) is qualitatively compared with prior research on explainability and uncertainty quantification, while accounting for differences in datasets, imbalance ratios, and operational objectives. The results of this comparison are presented in Table 11.
Table 11 compares our Integrated Transparency and Confidence Framework (ITCF) with prior work and illustrates one practical integration pattern of CP-based triage with selective local explanations under extreme imbalance, specifically at a ratio of 773:1. This applied integration is relevant for regulatory technology discussions, as it illustrates how uncertainty and explainability can be aligned within governance workflows. By offering a structured framework for transparency and uncertainty awareness, the proposed approach supports policy discussions and responsible automation, without asserting direct regulatory compliance or deployment outcomes.
The comparison in Table 11 highlights the distinguishing feature of our approach: it integrates Conformal Prediction (CP) with LIME.
Comparing Fraud vs. Non-Fraud Uncertainty for XGBoost
Fraud Cases: Showed a higher mean entropy and a lower mean maximum probability than non-fraud cases (values in Table 5). This indicates that fraud predictions are more uncertain than non-fraud predictions, a common difficulty in highly imbalanced datasets, where the minority class is harder to predict with high confidence.
Non-Fraud Cases: Exhibited very low mean entropy and very high mean maximum probability (values in Table 5), confirming the model’s high confidence when predicting the majority non-fraud class.
The comparisons with prior work are indicative rather than strictly like-for-like, as prior studies differ in their dataset structures, class imbalances, and operational objectives.
3.7. Key Insights
The experimental evaluation of the Integrated Transparency and Confidence Framework (ITCF) reveals several key insights:
1. Robust performance under extreme class imbalance: Both Random Forest and XGBoost demonstrate strong predictive performance despite the severe class imbalance present in the PaySim dataset (773:1). In this context, the Matthews Correlation Coefficient (MCC) offers a more informative and balanced assessment than accuracy or AUC-ROC, highlighting material differences in minority-class detection capability.
2. XGBoost as the preferred base model for operational deployment: XGBoost outperforms Random Forest in Recall, F1-score, and MCC, and demonstrates substantially lower inference latency in both single-instance and batch scenarios. This combination of enhanced minority-class sensitivity and computational efficiency makes XGBoost more suitable for operational fraud-detection pipelines.
3. Meaningful uncertainty stratification via conformal prediction: Split conformal prediction achieves near-target marginal coverage and expresses uncertainty primarily through conservative abstention, represented by empty prediction regions, rather than ambiguous multi-class outputs. The model abstains on a fraction of transactions, offering a principled mechanism to identify cases where automated decisions are least reliable.
4. Fraud-specific uncertainty characteristics: Fraudulent transactions exhibit higher predictive entropy and lower maximum class probabilities than non-fraud cases, suggesting that minority-class predictions are inherently more uncertain. These characteristics account for most abstentions and support the need for targeted human review in high-risk scenarios.
5. Applied integration of uncertainty and explainability: The ITCF demonstrates how conformal uncertainty can be operationalised as a control signal for human-in-the-loop decision-making, while LIME explanations provide contextual, instance-level support for analyst judgement in uncertain cases. This integration is oriented toward deployment and emphasises workflow design over algorithmic novelty.
6. Governance- and audit-oriented transparency: By coupling uncertainty objects, such as prediction regions and entropy, with selective, case-specific explanations, the framework supports traceability and accountability at the point of decision-making. Notably, these benefits are achieved without overstating probability calibration, causal interpretability, or regulatory compliance guarantees.
Overall, these findings suggest that the ITCF offers a practical, risk-aware approach to integrating predictive performance, uncertainty quantification, and interpretability within a unified operational workflow. Conformal prediction facilitates conservative management of uncertain cases, while explainability enables informed human oversight. Within this framework, XGBoost is identified as the more suitable base model owing to its higher Recall, F1-score, and MCC and its lower inference latency. Future work will test the ITCF on live data as a step toward real-world deployment, with the aim of assessing its applicability and reliability in dynamic environments and determining whether it delivers tangible improvements to operational workflows.