Article

Robust Financial Fraud Detection via Causal Intervention and Multi-View Contrastive Learning on Dynamic Hypergraphs

Xiong Luo
Department of Information Technology, Uppsala University, 752 37 Uppsala, Sweden
Mathematics 2025, 13(24), 4018; https://doi.org/10.3390/math13244018
Submission received: 2 December 2025 / Revised: 13 December 2025 / Accepted: 14 December 2025 / Published: 17 December 2025

Abstract

Financial fraud detection is critical to modern economic security, yet remains challenging due to collusive group behavior, temporal drift, and severe class imbalance. Most existing graph neural network (GNN) detectors rely on pairwise edges and correlation-driven learning, which limits their ability to represent high-order group interactions and makes them vulnerable to spurious environmental cues (e.g., hubs or temporal bursts) that correlate with labels but are not necessarily causal. We propose Causal-DHG, a dynamic hypergraph framework that integrates hypergraph modeling, causal intervention, and multi-view contrastive learning. First, we construct label-agnostic hyperedges from publicly available metadata to capture high-order group structures. Second, a Multi-Head Spatio-Temporal Hypergraph Attention encoder models group-wise dependencies and their temporal evolution. Third, a Causal Disentanglement Module decomposes representations into causal and environment-related factors using HSIC regularization, and a dictionary-based backdoor adjustment approximates the interventional prediction P(Y | do(C)) to suppress spurious correlations. Finally, we employ self-supervised multi-view contrastive learning with mild hypergraph augmentations to leverage unlabeled data and stabilize training. Experiments on YelpChi, Amazon, and DGraph-Fin show consistent gains in AUC/F1 over strong baselines such as CARE-GNN and PC-GNN, together with improved robustness under feature and structural perturbations.

1. Introduction

The rapid growth of digital financial services has reshaped global markets, enabling instant credit approvals, online lending, and cross-border payments, but also enlarging the attack surface for fraud. Cases now range from synthetic identity lending and collusive rating manipulation to contact-circle abuse and credit default schemes, with annual losses estimated in the hundreds of billions of dollars [1]. Modern financial fraud typically involves three coupled characteristics: group collusion, where coordinated rings operate across accounts, reviews, or contact profiles; dynamic drift, as fraud patterns respond to deployed models and induce temporal distribution shifts [2]; and camouflage, where fraudulent accounts mimic benign behavior statistics to evade detection [3]. These factors make purely rule-based and i.i.d. learning approaches inadequate.
Graph neural networks (GNNs) are a natural tool for fraud detection because transaction logs, review graphs, and emergency-contact networks are inherently relational. By modeling entities (e.g., users or reviews) as nodes and interactions as edges, GNN-based methods can exploit both local and global structures. Fraud-oriented architectures such as SemiGNN, GeniePath, CARE-GNN, and PC-GNN [3,4,5,6] address heterogeneity, class imbalance, and camouflage and have shown strong performance on public benchmarks and industrial platforms.
However, GNN-based fraud detectors still face two core limitations in realistic deployments. First, they are built on simple graphs, where edges represent pairwise relations. Real fraud often appears as high-order group patterns: multiple suspicious reviews tied to the same user and product, clusters of users sharing contact behavior within short time windows, or dense circles of mutually referencing accounts. Naively decomposing such structures into pairwise edges causes the “clique expansion” problem in hypergraph theory [7], inflating edges and obscuring the semantics of joint co-participation. Second, most models are correlation-driven and suffer from a causal gap. Deep networks tend to overfit environmental context—anonymous profile statistics, temporal bursts, or hub nodes—that correlate with labels in the training environment but do not reflect genuine fraudulent intent [8,9,10]. When the environment changes, performance can drop sharply.
Figure 1 illustrates this causal gap. In practice, fraud labels are often statistically associated with environmental properties that are not causal mechanisms. For example, users attached to a particular hub or concentrated in certain time windows may be disproportionately fraudulent in the training set due to historical enforcement or platform policies. A correlation-driven GNN treats such patterns as strong predictors; once benign users exhibit similar structures, the model degrades. Bridging this gap requires representations that separate intrinsic signals of fraudulent intent from an environment-dependent context.
We propose Causal-DHG, a dynamic hypergraph neural network that integrates causal intervention and contrastive learning for robust financial fraud detection. The framework is instantiated on three widely used public datasets (YelpChi, Amazon, DGraph-Fin). Our main contributions are as follows:
  • A causality-aware dynamic hypergraph framework that captures high-order group interactions in realistic financial graphs (review networks and contact networks), with hypergraphs constructed directly from observed structures and attributes in YelpChi, Amazon, and DGraph-Fin.
  • A Causal Disentanglement Module that decomposes node representations into intrinsic causal and environment-related factors, enforces their independence via HSIC, and applies a dictionary-based backdoor adjustment mechanism to approximate the interventional distribution P(Y | do(C)) and suppress spurious environmental correlations.
  • A multi-view hypergraph contrastive learning scheme that exploits unlabeled data under label scarcity, using edge dropping and feature masking together with an InfoNCE-style objective defined on disentangled representations.
  • A theoretical characterization of the causal graph underlying Causal-DHG, explaining how HSIC-based disentanglement and dictionary-based backdoor adjustment approximate environment-marginalized prediction, together with an analysis of robustness and complexity under environment shifts.
  • Extensive experiments on YelpChi, Amazon, and DGraph-Fin show consistent improvements over strong baselines such as CARE-GNN and PC-GNN (up to about four percentage points in AUC) and demonstrate that the causal module substantially mitigates performance degradation under feature perturbations and structural noise.

2. Related Work

2.1. Graph Neural Networks for Fraud Detection

The relational nature of financial data makes GNNs a natural tool for fraud detection. Early work applied general-purpose graph models, such as GraphSAGE [11], to transaction and review graphs by treating entities as nodes and interactions as edges. These methods aggregate neighbor information through message passing and provide strong baselines, but they are not specifically tailored to severe class imbalance, structured camouflage, or heterogeneous relations.
Several GNN variants have been developed for fraud detection. SemiGNN [4] uses hierarchical attention to integrate user, product, and review views; GeniePath [5] learns adaptive receptive fields to filter noisy neighbors; CARE-GNN [3] introduces camouflage-resistant neighbor selection driven by reinforcement learning; and PC-GNN [6] combines pick-and-choose sampling with graph augmentations to handle imbalanced, camouflaged fraud. These models demonstrate the benefit of domain-specific GNN design but remain pairwise and correlation-driven, without explicit treatment of high-order interactions or causal mechanisms, making them sensitive to distribution shifts [10].
Beyond GNN-based architectures, matrix factorization and subspace learning techniques have also been explored for detecting abnormal behavior in structured data. For example, graph-regularized orthogonal non-negative matrix factorization with Itakura–Saito divergence has been applied to fault detection problems on graph-structured signals [12], providing a complementary perspective to graph neural models.

2.2. Hypergraph Learning and Dynamic Modeling

Hypergraphs generalize graphs by allowing a hyperedge to connect an arbitrary number of nodes, providing a natural representation for group-level interactions [13]. HGNN [7] extends spectral convolutions to hypergraphs via a hypergraph Laplacian, and subsequent work has improved efficiency and flexibility through low-rank approximations and attention mechanisms. Dynamic hypergraph models, such as DHGNN [14], further incorporate temporal evolution by updating node and hyperedge states over time. In parallel, temporal graph models like EvolveGCN and TGN [15,16] have shown that explicitly modeling time improves prediction on evolving networks.
These ideas have been explored in adjacent areas such as session-based recommendation, where hyperedges group items within a session [17]. However, applications of dynamic hypergraphs to financial fraud remain limited, and existing models typically treat learning as purely correlation-based. They capture high-order and temporal structures but do not aim to separate causal and environment-related factors or to implement principled intervention.

2.3. Causal Inference on Graphs

Causal inference provides a framework for distinguishing correlation from causation [9] and has recently attracted interest in representation learning and graph modeling [18]. On graphs, several works propose stable or invariant objectives to mitigate biases introduced by structural properties and environment shifts [10,19]. Disentangled graph representation methods, such as DisenGCN and related models, seek to factorize node embeddings into relatively independent components, improving interpretability and, in some cases, robustness.
Most existing approaches, however, either operate on static pairwise graphs or target graph-level tasks. They do not jointly model dynamic high-order interactions and node-level fraud labels, and they rarely provide an explicit mechanism to approximate the interventional distribution P(Y | do(C)). In contrast, Causal-DHG is built on dynamic hypergraphs, explicitly disentangles intrinsic causal and environment-related factors, and combines dictionary-based backdoor adjustment with contrastive learning to improve robustness under temporal and structural shifts in fraud detection.

3. Methodology: The Causal-DHG Framework

In this section, we present the proposed Causal-DHG framework in detail. We first describe the unified dynamic hypergraph construction that is carefully instantiated for the three public datasets. We then introduce the Multi-Head Spatio-Temporal Hypergraph Attention (MH-DHA) encoder, followed by the Causal Disentanglement and Backdoor Adjustment module and the multi-view contrastive objective (Figure 2).

3.1. Dynamic Hypergraph Construction

We start from the released graphs 𝒢 = (𝒱, E) with node features X ∈ ℝ^{|𝒱|×d_x}. In YelpChi, nodes are review instances and edges encode shared user/product and similarity relations; in Amazon, nodes are user accounts with edges induced by overlapping reviewing behaviors; in DGraph-Fin, nodes are users linked by directed emergency-contact edges annotated with timestamps and contact types. Node features are the provided handcrafted statistics (32/25/17 dimensions for YelpChi/Amazon/DGraph-Fin), and labels indicate fraudulent vs. normal entities.
To capture temporal evolution, we discretize interactions into T time windows. For YelpChi and Amazon, which are commonly used as static benchmarks, we treat the provided graphs as single snapshots (T = 1) unless otherwise stated. For DGraph-Fin, edge timestamps are grouped into T non-overlapping intervals (e.g., by quantiles or fixed-size buckets), yielding a sequence of time-indexed graphs {𝒢_1, …, 𝒢_T}.
For each window t, we build a hypergraph H_t = (𝒱, E_t^H) whose hyperedges connect nodes that participate in observed group patterns, using only publicly available structures and attributes:
  • YelpChi (review nodes).
    (i) User-centric: For each user u, a hyperedge e u connects all reviews written by u. (ii) Product-centric: For each product p, a hyperedge e p connects all reviews of p in the window. (iii) Similarity-based: For each review, we connect it with its top-k structurally or textually similar neighbors (given by the released similarity edges) into a hyperedge.
  • Amazon (user nodes). (i) Co-item: For each item p, users who reviewed p in a window form a hyperedge. (ii) Temporal co-activity: For each ( p , t ) , users reviewing p within the same temporal bin form a hyperedge. (iii) Pattern-based: Users with similar rating or activity statistics (computed from released features) are grouped into hyperedges.
  • DGraph-Fin (user nodes with contact edges u → v). (i) Shared-contact: For each contact user v and window t, a hyperedge connects all u such that u → v occurs in t. (ii) Contact-pattern: For each contact type c and window t, users with at least one outgoing edge of type c in t form a hyperedge. (iii) Local structure: For each user, we build hyperedges over densely connected 1-hop ego neighborhoods to summarize tight contact circles.
The resulting dynamic hypergraphs {H_t}_{t=1}^{T} are then fed into the hypergraph encoder to obtain node representations H_t ∈ ℝ^{|𝒱|×d_h}.
To avoid any label-dependent bias or artificial homophily, all hypergraph construction rules are strictly label-agnostic: hyperedges are defined only from publicly released identifiers and metadata that are also available at inference time (such as user and item IDs, emergency-contact edges, contact-type codes, and discretized timestamps), without using fraud labels or manually engineered “suspicious” patterns. Although the concrete grouping rules differ across YelpChi, Amazon, and DGraph-Fin due to their heterogeneous graph schemas (review-centric graphs versus user–contact networks), they follow the same principle of connecting nodes that share observable context in the original benchmarks. Structural statistics of the three benchmarks are summarized in Figure 3.
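To make the construction concrete, the following Python sketch builds user-centric hyperedges and the corresponding sparse incidence matrix under stated assumptions: the input is a list of (review index, user ID) pairs, and the field name `user_id` and the minimum group size of two are illustrative choices rather than details from a released pipeline.

```python
from collections import defaultdict
import numpy as np
import scipy.sparse as sp

def user_centric_hyperedges(reviews):
    """Group review node indices by the user who wrote them.

    `reviews` is a list of (review_idx, user_id) pairs; only observable
    metadata is used, never fraud labels (label-agnostic by construction).
    """
    groups = defaultdict(list)
    for review_idx, user_id in reviews:
        groups[user_id].append(review_idx)
    # Keep only genuine group interactions (at least two members).
    return [nodes for nodes in groups.values() if len(nodes) >= 2]

def incidence_matrix(hyperedges, num_nodes):
    """Build the |V| x |E^H| binary incidence matrix B_t."""
    rows, cols = [], []
    for e_idx, nodes in enumerate(hyperedges):
        for v in nodes:
            rows.append(v)
            cols.append(e_idx)
    data = np.ones(len(rows), dtype=np.float32)
    return sp.csr_matrix((data, (rows, cols)),
                         shape=(num_nodes, len(hyperedges)))

# Toy usage: four reviews from two users.
edges = user_centric_hyperedges([(0, "u1"), (1, "u1"), (2, "u2"), (3, "u2")])
B = incidence_matrix(edges, num_nodes=4)
print(B.toarray())  # each column is one user-centric hyperedge
```

Product-centric, similarity-based, and contact-based hyperedges follow the same pattern with the grouping key swapped for the relevant metadata field.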

3.2. Multi-Head Spatio-Temporal Hypergraph Attention

We now describe the MH-DHA encoder, which processes the sequence of dynamic hypergraphs {H_t}_{t=1}^{T} to produce spatio-temporal node representations.

3.2.1. Spatial Hypergraph Attention

For each time window t, we consider the hypergraph H_t = (𝒱, E_t^H) with incidence matrix B_t ∈ {0,1}^{|𝒱|×|E_t^H|}. Let h_v^{(l,t)} and h_e^{(l,t)} denote node and hyperedge embeddings at layer l and time t.
We adopt a two-stage node–hyperedge attention mechanism with K heads. For clarity, we omit the time index t when not ambiguous.
Node-to-Hyperedge Aggregation
For each hyperedge e and node v ∈ e, we compute attention scores:
α_{ve}^{(k)} = exp( σ( a_1^{(k)⊤} [ W_1^{(k)} h_v^{(l)} ‖ W_1^{(k)} h_e^{(l−1)} ] ) ) / Σ_{u∈e} exp( σ( a_1^{(k)⊤} [ W_1^{(k)} h_u^{(l)} ‖ W_1^{(k)} h_e^{(l−1)} ] ) ),
where ‖ denotes concatenation, σ(·) is a nonlinearity, and (W_1^{(k)}, a_1^{(k)}) are trainable parameters for head k. The hyperedge embedding is
h_e^{(l)} = ‖_{k=1}^{K} ELU( Σ_{v∈e} α_{ve}^{(k)} W_1^{(k)} h_v^{(l)} ),
where ‖_{k=1}^{K} denotes concatenation across heads.
Hyperedge-to-Node Aggregation
Similarly, for each node v and incident hyperedge e ∈ E_v, we compute
β_{ev}^{(k)} = exp( σ( a_2^{(k)⊤} [ W_2^{(k)} h_e^{(l)} ‖ W_2^{(k)} h_v^{(l)} ] ) ) / Σ_{f∈E_v} exp( σ( a_2^{(k)⊤} [ W_2^{(k)} h_f^{(l)} ‖ W_2^{(k)} h_v^{(l)} ] ) ),
and update the node embedding as
h_v^{(l+1)} = ‖_{k=1}^{K} ELU( Σ_{e∈E_v} β_{ev}^{(k)} W_2^{(k)} h_e^{(l)} ).
After L layers of hypergraph attention, we obtain the spatial representations S_t ∈ ℝ^{|𝒱|×d_s} for each time t.
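As an illustration of the two-stage attention above, here is a minimal single-head PyTorch sketch (time index omitted); it is not the released implementation, and the multi-head version in the paper would concatenate K such heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypergraphAttentionLayer(nn.Module):
    """Single-head sketch of the node-to-hyperedge (alpha) and
    hyperedge-to-node (beta) attention stages described above."""

    def __init__(self, dim):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)   # node -> hyperedge
        self.W2 = nn.Linear(dim, dim, bias=False)   # hyperedge -> node
        self.a1 = nn.Linear(2 * dim, 1, bias=False)
        self.a2 = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h_v, h_e, hyperedges):
        """h_v: [N, d] node states; h_e: [M, d] previous hyperedge states
        (h_e^{(l-1)}); hyperedges: list of LongTensors of member nodes."""
        # Stage 1: node-to-hyperedge aggregation with attention alpha.
        new_e = []
        for e_idx, members in enumerate(hyperedges):
            hv = self.W1(h_v[members])                        # [|e|, d]
            he = self.W1(h_e[e_idx]).unsqueeze(0).expand_as(hv)
            logits = F.leaky_relu(self.a1(torch.cat([hv, he], dim=-1)))
            alpha = torch.softmax(logits, dim=0)              # over members of e
            new_e.append(F.elu((alpha * hv).sum(dim=0)))
        h_e_new = torch.stack(new_e)                          # [M, d]

        # Stage 2: hyperedge-to-node aggregation with attention beta.
        incident = [[] for _ in range(h_v.size(0))]
        for e_idx, members in enumerate(hyperedges):
            for v in members.tolist():
                incident[v].append(e_idx)
        out = []
        for v, e_list in enumerate(incident):
            if not e_list:
                out.append(h_v[v])                            # isolated node
                continue
            he = self.W2(h_e_new[e_list])                     # [|E_v|, d]
            hv = self.W2(h_v[v]).unsqueeze(0).expand_as(he)
            logits = F.leaky_relu(self.a2(torch.cat([he, hv], dim=-1)))
            beta = torch.softmax(logits, dim=0)               # over incident edges
            out.append(F.elu((beta * he).sum(dim=0)))
        return torch.stack(out), h_e_new

# Toy usage: 5 nodes, 2 hyperedges.
layer = HypergraphAttentionLayer(dim=8)
h_v, h_e = torch.randn(5, 8), torch.zeros(2, 8)
edges = [torch.tensor([0, 1, 2]), torch.tensor([2, 3, 4])]
h_v_new, h_e_new = layer(h_v, h_e, edges)
```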

3.2.2. Temporal Aggregation

To capture temporal evolution, we feed the sequence {S_1, …, S_T} into a Gated Recurrent Unit (GRU) at the node level. For each node v, its embedding at time t is denoted by s_v^{(t)}, and the GRU updates
h_v^{(t)} = GRU( s_v^{(t)}, h_v^{(t−1)} ),
H_init = H^{(T)} = [ h_v^{(T)} ]_{v∈𝒱}.
When T = 1, the GRU reduces to a simple transformation, and H_init coincides with S_1 up to a linear mapping.
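A minimal sketch of this node-level temporal aggregation, assuming the hidden size of 64 reported in Section 5 and toy tensor shapes:

```python
import torch
import torch.nn as nn

# Node-level temporal aggregation over T snapshots.
# S is [T, N, d]: the per-window spatial representations from the
# hypergraph encoder; the GRUCell shares parameters across nodes and time.
gru = nn.GRUCell(input_size=64, hidden_size=64)
T, N, d = 7, 1000, 64
S = torch.randn(T, N, d)
h = torch.zeros(N, d)                # h_v^{(0)}
for t in range(T):
    h = gru(S[t], h)                 # h_v^{(t)} = GRU(s_v^{(t)}, h_v^{(t-1)})
H_init = h                           # [N, d], input to the causal module
```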

3.3. Causal Disentanglement and Backdoor Adjustment

We now describe the causal module that operates on H init to obtain robust node representations (Figure 4).

3.3.1. Disentangled Representations

We assume that H_init encodes both intrinsic causal factors C (e.g., stable behavioral patterns associated with fraudulent intent) and environment-related factors E (e.g., structural hubs, temporal bursts, or global usage shifts). We use two MLP projectors:
Z_c = MLP_c(H_init),  Z_e = MLP_e(H_init).
To encourage Z_c and Z_e to capture distinct information, we penalize their statistical dependence via the Hilbert–Schmidt Independence Criterion (HSIC) [20]:
L_dis = HSIC(Z_c, Z_e) = (1/(n − 1)²) Tr( K_c H K_e H ),
where K_c and K_e are kernel matrices computed from Z_c and Z_e, H is the centering matrix, and n is the batch size.
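The penalty can be estimated per mini-batch as in the following sketch; the RBF kernel and its bandwidth are illustrative assumptions, since the formulation above only requires kernel matrices K_c and K_e.

```python
import torch

def hsic(Zc, Ze, sigma=1.0):
    """Biased empirical HSIC with RBF kernels: a minimal sketch of the
    disentanglement penalty L_dis = (n-1)^{-2} Tr(K_c H K_e H)."""
    n = Zc.size(0)

    def rbf(Z):
        d2 = torch.cdist(Z, Z).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))

    Kc, Ke = rbf(Zc), rbf(Ze)
    # Centering matrix H = I - (1/n) 11^T.
    H = torch.eye(n, device=Zc.device) - torch.ones(n, n, device=Zc.device) / n
    return torch.trace(Kc @ H @ Ke @ H) / (n - 1) ** 2

# Usage on a mini-batch of disentangled representations.
Zc, Ze = torch.randn(256, 64), torch.randn(256, 64)
loss_dis = hsic(Zc, Ze)
```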

3.3.2. Dictionary-Based Backdoor Adjustment

We model the environment-related factors via a learnable dictionary M = [m_1, …, m_K] ∈ ℝ^{K×d_e}. Intuitively, each dictionary atom m_k corresponds to a prototypical environment pattern, such as a certain combination of node degree, temporal activity, and hyperedge participation statistics derived from the publicly available data.
Following Pearl's backdoor adjustment [9], the interventional distribution P(Y | do(C)) can be approximated by averaging over environment patterns:
P(Y | do(C)) ≈ Σ_{k=1}^{K} P(Y | C, E = m_k) P(E = m_k).
In practice, we implement this via a scaled dot-product attention over the dictionary. Let Q, K_M, V_M be linear projections applied to Z_c and M:
Q = Z_c W_Q,  K_M = M W_K,  V_M = M W_V.
The attention weights over the dictionary are
A = Softmax( Q K_M^⊤ / √d ),
and the aggregated environmental contribution is
E_adj = A V_M.
We then form the robust representation
Z* = [ Z_c ‖ E_adj ],
which is used for both supervised fraud classification and contrastive learning.
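A compact PyTorch sketch of this dictionary attention follows, with the dictionary size of 32 from Section 5; the module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class BackdoorAdjustment(nn.Module):
    """Sketch of the dictionary-based backdoor adjustment: attention of the
    causal representation Z_c over K learnable environment atoms M,
    producing E_adj and the concatenated robust representation Z*."""

    def __init__(self, d_c, d_e, num_atoms=32):
        super().__init__()
        self.M = nn.Parameter(torch.randn(num_atoms, d_e))  # environment dictionary
        self.W_Q = nn.Linear(d_c, d_e, bias=False)
        self.W_K = nn.Linear(d_e, d_e, bias=False)
        self.W_V = nn.Linear(d_e, d_e, bias=False)

    def forward(self, Zc):
        Q = self.W_Q(Zc)                                    # [N, d_e]
        K = self.W_K(self.M)                                # [K, d_e]
        V = self.W_V(self.M)                                # [K, d_e]
        # A approximates P(E = m_k | Z_c); averaging over atoms mimics
        # the sum over environment strata in the backdoor formula.
        A = torch.softmax(Q @ K.t() / K.size(-1) ** 0.5, dim=-1)  # [N, K]
        E_adj = A @ V                                       # [N, d_e]
        return torch.cat([Zc, E_adj], dim=-1)               # Z* = [Z_c || E_adj]

# Usage: Z* feeds both the classifier and the contrastive head.
module = BackdoorAdjustment(d_c=64, d_e=64, num_atoms=32)
Z_star = module(torch.randn(256, 64))                       # [256, 128]
```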

3.4. Multi-View Contrastive Learning

To exploit unlabeled nodes and enhance robustness under feature/structure perturbations, we adopt a self-supervised multi-view contrastive learning strategy.

3.4.1. Hypergraph Augmentations

We generate two augmented hypergraph views H^{(1)} and H^{(2)} by applying stochastic perturbations that are independent of fraud labels:
  • Hyperedge dropping. Each hyperedge is removed with probability p_edge, simulating missing or noisy group interactions.
  • Feature masking. Each feature dimension is masked with probability p_feat, simulating incomplete or corrupted features (e.g., missing profile entries).
Passing each view through the MH-DHA encoder and causal module yields robust representations Z*^{(1)} and Z*^{(2)}. A small MLP projection head g(·) maps them into z^{(1)} and z^{(2)} for contrastive learning.
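Both augmentations take only a few lines each; the following sketch assumes hyperedges stored as index tensors and applies dimension-level feature masking as described above.

```python
import torch

def augment(hyperedges, X, p_edge=0.2, p_feat=0.1):
    """Minimal sketch of the two label-agnostic augmentations:
    hyperedge dropping and feature-dimension masking."""
    # Drop each hyperedge independently with probability p_edge.
    keep = torch.rand(len(hyperedges)) >= p_edge
    edges_aug = [e for e, k in zip(hyperedges, keep) if k]
    # Mask whole feature dimensions with probability p_feat.
    mask = (torch.rand(X.size(1)) >= p_feat).float()
    X_aug = X * mask                                  # broadcasts over nodes
    return edges_aug, X_aug

# Two independently sampled views for contrastive learning.
X = torch.randn(100, 32)
edges = [torch.tensor([0, 1, 2]), torch.tensor([3, 4])]
view1, view2 = augment(edges, X), augment(edges, X)
```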

3.4.2. Contrastive Objective

For a mini-batch of N nodes, we define the InfoNCE loss for each node i:
ℓ(i) = −log [ exp( sim(z_i^{(1)}, z_i^{(2)})/τ ) / ( Σ_{j=1}^{N} 𝟙[j ≠ i] exp( sim(z_i^{(1)}, z_j^{(1)})/τ ) + Σ_{j=1}^{N} exp( sim(z_i^{(1)}, z_j^{(2)})/τ ) ) ],
where sim(u, v) = u^⊤v / (‖u‖ ‖v‖) is cosine similarity, 𝟙[·] is the indicator function, and τ is a temperature hyperparameter. The total contrastive loss is
L_cl = (1/(2N)) Σ_{i=1}^{N} [ ℓ(i) + ℓ′(i) ],
where ℓ′(i) is defined symmetrically by swapping the two views.
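For reference, a vectorized PyTorch sketch of this symmetric objective, matching the denominator above (intra-view negatives exclude j = i; cross-view terms include the positive):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.5):
    """One direction of the loss: anchors from view 1, positives z_i^{(2)},
    negatives drawn from both views as in the equation above."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    intra = z1 @ z1.t() / tau                 # same-view similarities
    cross = z1 @ z2.t() / tau                 # cross-view similarities
    n = z1.size(0)
    pos = torch.diag(cross)                   # sim(z_i^{(1)}, z_i^{(2)}) / tau
    # Exclude j = i from the intra-view sum only.
    eye = torch.eye(n, dtype=torch.bool, device=z1.device)
    intra = intra.masked_fill(eye, float('-inf'))
    denom = torch.logsumexp(torch.cat([intra, cross], dim=1), dim=1)
    return (denom - pos).mean()

def contrastive_loss(z1, z2, tau=0.5):
    # Symmetric total loss L_cl, averaging both view orderings.
    return 0.5 * (info_nce(z1, z2, tau) + info_nce(z2, z1, tau))

loss_cl = contrastive_loss(torch.randn(256, 64), torch.randn(256, 64))
```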

3.5. Supervised Objective and Joint Training

For labeled nodes 𝒱_L, we apply a linear classifier on top of Z* and use the binary cross-entropy loss as the supervised objective:
L_sup = −(1/|𝒱_L|) Σ_{v∈𝒱_L} [ y_v log ŷ_v + (1 − y_v) log(1 − ŷ_v) ].
The total training loss combines all components:
L_total = L_sup + λ_1 L_dis + λ_2 L_cl,
where λ_1 and λ_2 control the contributions of the disentanglement and contrastive terms.
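Wiring the objectives together then amounts to a weighted sum; in this sketch the HSIC and contrastive terms are stubbed with placeholder tensors standing in for the outputs of the earlier sketches, and the weights match the settings reported in Section 5.

```python
import torch
import torch.nn.functional as F

lambda1, lambda2 = 0.1, 0.5                       # values used in Section 5
y_hat = torch.rand(64, requires_grad=True)        # classifier outputs on V_L
y = torch.randint(0, 2, (64,)).float()            # binary fraud labels
loss_dis = torch.tensor(0.05)                     # placeholder HSIC value
loss_cl = torch.tensor(1.2)                       # placeholder InfoNCE value
loss_sup = F.binary_cross_entropy(y_hat, y)
loss_total = loss_sup + lambda1 * loss_dis + lambda2 * loss_cl
loss_total.backward()
```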

4. Theoretical Analysis

This section provides a theoretical characterization of Causal-DHG. We first formalize a structural causal model (SCM) on dynamic hypergraphs and the associated disentangled representations, then analyze the dictionary-based backdoor approximation and robustness under environment shift, and finally summarize the computational complexity.

4.1. Causal Formulation and Disentangled Representations

We consider node features X ∈ ℝ^{d_x} derived from publicly available statistics (e.g., review behavior and profile attributes), a sequence of hypergraph incidence structures A^H = {A_t^H}_{t=1}^{T} constructed from observed interactions, intrinsic causal factors C encoding user intent and stable behavioral mechanisms related to fraud, environment-related factors E capturing structural and temporal context (e.g., local degree patterns, hyperedge participation, time windows), and the fraud label Y ∈ {0, 1}. We model these variables via the SCM
C = f_c(X, A^H, U_c),  E = f_e(X, A^H, U_e),  Y = g(C, E, U_y),
where U_c, U_e, U_y are mutually independent exogenous noise variables. The associated causal graph contains the directed edges
X → C,  A^H → C,  X → E,  A^H → E,  C → Y,  E → Y.
Our goal is to approximate the interventional distribution
P(Y | do(C = c)) = Σ_e P(Y | C = c, E = e) P(E = e),
which blocks spurious environment-driven pathways. We adopt the following assumptions.
Assumption 1
(Environment shift and invariance). Between training and deployment, the conditional mechanism P(Y | C, E) remains invariant, while the environment distribution P(E) and its dependence on (X, A^H) may change:
P_train(Y | C, E) = P_test(Y | C, E),  P_train(E | X, A^H) ≠ P_test(E | X, A^H) in general.
Assumption 2
(Representation sufficiency). There exist encoders h_c and h_e such that the learned representations
Z_c = h_c(X, A^H),  Z_e = h_e(X, A^H)
are sufficient statistics for C and E in the sense that
P(Y | C, E) = P(Y | Z_c, Z_e),  P(E | C) = P(Z_e | Z_c),
up to invertible transformations in the representation space.
Causal-DHG instantiates these representations as
Z_c = MLP_c(H_init),  Z_e = MLP_e(H_init),
where H_init is the output of the dynamic hypergraph encoder. Under Assumptions 1 and 2, approximating P(Y | do(C)) reduces to learning a Z_c that is predictive for Y but insensitive to shifts in P(E), while Z_e captures environment-specific variability that can be marginalized out.

4.2. HSIC-Based Disentanglement

Causal-DHG enforces independence between Z_c and Z_e via an HSIC-based regularizer. Let k_c and k_e be characteristic kernels on the domains of Z_c and Z_e, respectively, and let HSIC_{k_c,k_e}(Z_c, Z_e) denote the corresponding empirical HSIC [20]. The disentanglement loss is
L_dis = HSIC_{k_c,k_e}(Z_c, Z_e).
Proposition 1
(HSIC and independence). If k_c and k_e are characteristic kernels, then HSIC_{k_c,k_e}(Z_c, Z_e) = 0 if and only if Z_c and Z_e are statistically independent. Minimizing the disentanglement loss above therefore drives the learned causal and environmental representations toward independence.
Proof sketch. 
For characteristic kernels, the kernel mean embedding of a distribution is injective. HSIC can be written as the squared Hilbert–Schmidt norm of the cross-covariance operator between the embeddings of Z c and Z e [20]. This norm is zero if and only if the cross-covariance operator is zero, which is equivalent to independence. The empirical HSIC is a consistent estimator of the population quantity, so minimizing it encourages independence in the large-sample limit.    □
Combined with Assumption 2, Proposition 1 implies that Z_c can align with intrinsic causal factors, while Z_e captures orthogonal environment-related patterns. In practice, we use mini-batch estimates of HSIC; although exact independence is unattainable in finite samples, prior work [10,19,21,22] shows that HSIC penalties are effective in suppressing spurious dependencies.

4.3. Dictionary-Based Backdoor Approximation and Robustness

The dictionary-based module aims to approximate, at the representation level, the backdoor-adjusted predictor associated with the interventional distribution P(Y | do(C)) above. Let M = [m_1, …, m_K] be the environmental dictionary, and let α_k(Z_c) be the attention weight assigned to atom m_k given Z_c. Let v_k denote the value projection of m_k. The adjusted environment representation is
E_adj(Z_c) = Σ_{k=1}^{K} α_k(Z_c) v_k.
We require mild coverage and calibration conditions on the dictionary.
Assumption 3
(Dictionary coverage and calibration). The atoms {m_k}_{k=1}^{K} form an ε-cover of the support of E; i.e., for any environment realization e there exists some k with ‖e − m_k‖ ≤ ε, and the attention weights satisfy α_k(Z_c) ≈ P(E = m_k | Z_c) for all k.
Denote by h_θ the classifier acting on [ Z_c ‖ E_adj(Z_c) ], and assume it is Lipschitz in its environment-related argument.
Proposition 2
(Variational backdoor approximation). Under Assumptions 1–3 and Lipschitz continuity of h_θ in its second argument, the dictionary-based predictor
Ŷ(Z_c) = h_θ( Z_c, E_adj(Z_c) )
provides a finite-dimensional variational approximation to the interventional predictor based on P(Y | do(C)). The approximation error is bounded by a constant that scales with the dictionary resolution ε.
Proof sketch. 
By Assumption 2, P(Y | C, E) = P(Y | Z_c, E), and the ideal backdoor-adjusted predictor based on Z_c is
f*(Z_c) = Σ_e P(Y | Z_c, E = e) P(E = e).
Assumption 3 allows approximating this sum by a finite mixture over {m_k}:
f*(Z_c) ≈ Σ_{k=1}^{K} P(Y | Z_c, E = m_k) P(E = m_k),
with an error controlled by ε and the Lipschitz constant of P(Y | Z_c, E) in E. If α_k(Z_c) ≈ P(E = m_k | Z_c), then E_adj(Z_c) approximates a suitable conditional expectation of an environment embedding, and h_θ can approximate the mapping from (Z_c, E[g_E(E) | Z_c]) to f*(Z_c). Lipschitz continuity ensures that the induced error is bounded by a function of ε.    □
Beyond approximating P(Y | do(C)), this construction improves robustness under environment shift. Let ℓ(ŷ, y) be a bounded loss and define
R(h; P) = E_{(X, A^H, Y)∼P}[ ℓ( h(X, A^H), Y ) ],
the expected risk of predictor h under distribution P. Consider a correlation-based baseline h_corr(X, A^H) and the Causal-DHG predictor h_causal(X, A^H) based on backdoor-adjusted representations. Let d(·,·) be an integral probability metric (e.g., a Wasserstein distance) on environment distributions.
Assumption 4
(Lipschitz dependence on environment). There exist constants L_corr and L_causal such that for any fixed C and any e_1, e_2,
| ℓ( h_corr(C, e_1), Y ) − ℓ( h_corr(C, e_2), Y ) | ≤ L_corr ‖e_1 − e_2‖,
| ℓ( h_causal(C, e_1), Y ) − ℓ( h_causal(C, e_2), Y ) | ≤ L_causal ‖e_1 − e_2‖.
Due to HSIC-based disentanglement and backdoor adjustment, we expect L_causal ≤ L_corr.
Lemma 1
(Risk sensitivity to environment shift). Let P_train and P_test satisfy Assumption 1. Under Assumption 4, there exist constants K_corr and K_causal such that
| R(h_corr; P_test) − R(h_corr; P_train) | ≤ K_corr d( P_train(E), P_test(E) ),
| R(h_causal; P_test) − R(h_causal; P_train) | ≤ K_causal d( P_train(E), P_test(E) ),
with K_causal/K_corr ≤ L_causal/L_corr.
Proof sketch. 
For a fixed predictor, the change in risk when moving between P_train and P_test that share the same P(Y | C, E) is controlled by the change in P(E). Under Lipschitz dependence on E, this difference can be bounded by a constant times d(P_train(E), P_test(E)). The constants K_corr and K_causal scale with L_corr and L_causal, respectively. Since Causal-DHG explicitly suppresses spurious dependence on E, one obtains L_causal ≤ L_corr and thus a tighter bound.    □
Lemma 1 formalizes the robustness improvement observed in our feature perturbation experiments: for a fixed magnitude of environment shift d(P_train(E), P_test(E)), the risk variation of Causal-DHG is controlled by a smaller constant factor than that of correlation-based baselines.

4.4. Complexity Guarantees

We finally summarize the computational complexity of Causal-DHG. Let |𝒱| be the number of nodes, |E^H| the number of hyperedges, and |I| the number of node–hyperedge incidences (i.e., non-zero entries in the incidence matrices). Denote by L the number of MH-DHA layers, T the number of time windows, d the hidden dimension, K the dictionary size, and B the batch size.
Theorem 1
(Time and space complexity). Assume that (i) the maximum number of hyperedges incident to any node is bounded by C_h, and (ii) the maximum size of any hyperedge is bounded by C_s. Then a single training epoch of Causal-DHG has time complexity
O( L T |I| d + B² + B K d ),
and space complexity
O( |𝒱| d + |E^H| d + K d ).
In particular, since |I| = O(|𝒱| C_h) = O(|E^H| C_s) with C_h and C_s treated as constants, the time complexity is linear in |𝒱| + |E^H| up to logarithmic and constant factors.
Proof sketch. 
Each MH-DHA layer performs node-to-hyperedge and hyperedge-to-node attention. Both directions cost O(|I| d) operations, leading to O(L T |I| d) over L layers and T time windows. The GRU-based temporal aggregation scales as O(L T |𝒱| d²) and is dominated by the incidence term in sparse hypergraphs. The HSIC term involves batch-wise kernel matrices with cost O(B²) per batch, and the dictionary-based module performs attention over K atoms per node, giving O(B K d) per batch. Space is dominated by storing node and hyperedge embeddings, O(|𝒱| d + |E^H| d), and the dictionary, O(K d). Under bounded C_h and C_s, the total number of incidences |I| scales linearly with |𝒱| and |E^H|, completing the argument.    □
Theorem 1 shows that Causal-DHG retains linear-time scalability in the size of the dynamic hypergraph, making it suitable for large-scale fraud detection scenarios such as DGraph-Fin, while providing theoretically grounded robustness benefits through causal modeling.

5. Experimental Section

5.1. Experimental Setup

5.1.1. Datasets

We evaluate Causal-DHG on three widely used public fraud detection datasets (Table 1).
YelpChi
This dataset contains review nodes with manually crafted 32-dimensional features capturing textual and behavioral statistics. The binary label indicates whether a review is fraudulent. The released graph includes multiple review–review relations capturing common user and item context. We construct hyperedges as described in Section 3, using user-centric, product-centric, and similarity-based groups. In our experiments, we use the benchmark version provided by the FraudYelpDataset interface in the Deep Graph Library (DGL). (See the DGL documentation at FraudYelpDataset (https://www.dgl.ai/dgl_docs/generated/dgl.data.FraudYelpDataset.html (accessed on 1 December 2025))).
Amazon
FraudAmazonDataset represents users as nodes with 25-dimensional features summarizing their reviewing behaviors, including activity patterns and rating statistics. Labels denote fraudulent or benign users. We build hyperedges based on co-reviewing the same items, temporal co-activity, and behavior-pattern similarity, all derived from available metadata and edges. We use the processed benchmark provided by the FraudAmazonDataset interface in DGL. (See the DGL documentation at FraudAmazonDataset (https://www.dgl.ai/dgl_docs/generated/dgl.data.FraudAmazonDataset.html (accessed on 1 December 2025))).
DGraph-Fin
DGraph-Fin is a large credit-related user contact network from Finvolution, with 17-dimensional anonymous node features. Directed edges represent emergency-contact relations with discrete timestamps and contact types. We construct hyperedges that group users sharing the same contact in a time window, users with similar contact-type patterns, and dense local contact circles. The dataset and leaderboard are publicly available through the DGraph platform (https://dgraph.xinye.com (accessed on 1 December 2025)).
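For YelpChi and Amazon, the DGL interfaces cited above can be loaded as follows (a minimal sketch assuming DGL is installed; field names follow the DGL documentation):

```python
from dgl.data import FraudYelpDataset, FraudAmazonDataset

yelp = FraudYelpDataset()
g = yelp[0]                          # multi-relational review-review graph
feats = g.ndata['feature']           # 32-dim handcrafted review features
labels = g.ndata['label']            # 1 = fraudulent review, 0 = benign

amazon = FraudAmazonDataset()        # user nodes with 25-dim features
g_amz = amazon[0]
```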

5.1.2. Baselines

We compare Causal-DHG with a diverse set of baselines:
  • GCN [23]: A standard spectral graph convolutional network.
  • GAT [24]: A graph attention network.
  • GraphSAGE [11]: A neighborhood sampling-based GNN.
  • H-GNN [7]: A hypergraph neural network applied to static hypergraphs constructed from the same grouping strategy as our method but without temporal or causal modules.
  • EvolveGCN [15]: Temporal GNN modeling evolving graphs.
  • CARE-GNN [3]: A fraud-specialized GNN with camouflage-resistant neighbor selection.
  • PC-GNN [6]: A GNN for imbalanced and camouflaged fraud detection with pick-and-choose sampling.

5.1.3. Evaluation Protocol

We follow common practice and report Area Under the ROC Curve (AUC) and macro-averaged F1 (F1-Macro). For each dataset, we randomly split nodes into training, validation, and test sets with ratios of 60%/20%/20%, preserving the label distribution. All results are averaged over five runs with different random seeds. We additionally report precision–recall AUC (PR-AUC/Average Precision) in Appendix A.1 Table A1 to complement ROC-AUC under class imbalance.
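The stratified 60/20/20 split can be reproduced along these lines (a sketch using scikit-learn on a toy label vector; the actual labels come from the loaded benchmarks):

```python
import numpy as np
from sklearn.model_selection import train_test_split

labels = np.random.binomial(1, 0.1, size=1000)   # toy imbalanced labels
idx = np.arange(len(labels))
# 60% train, then split the remaining 40% evenly into validation and test.
train_idx, rest = train_test_split(idx, train_size=0.6,
                                   stratify=labels, random_state=0)
val_idx, test_idx = train_test_split(rest, train_size=0.5,
                                     stratify=labels[rest], random_state=0)
```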

5.1.4. Implementation Details

We implement Causal-DHG in PyTorch 2.6.0 with standard graph-learning libraries. The node embedding dimension is set to 64, with L = 2 hypergraph attention layers and K = 4 attention heads by default. The GRU hidden dimension equals 64. The environmental dictionary size is K = 32. We use the Adam optimizer with a learning rate of 10⁻³, weight decay 5 × 10⁻⁵, and batch sizes of 256 (YelpChi), 256 (Amazon), and 1024 (DGraph-Fin). For temporal modeling, YelpChi and Amazon are treated as single-snapshot benchmarks (T = 1) due to the released benchmark setting, while for DGraph-Fin, we discretize timestamps into T = 7 non-overlapping windows (see Appendix A.3 Table A3 for a window-granularity sweep). The loss weights are λ_1 = 0.1 and λ_2 = 0.5. The hyperedge dropping probability p_edge and feature masking probability p_feat are set to 0.2 and 0.1, respectively, and the contrastive temperature τ is fixed at 0.5. These values lie in the commonly used 0.1–0.2 range for graph contrastive perturbations, in line with prior work such as GraphCL [25]. HSIC is estimated on each mini-batch; across five runs, we did not observe training divergence or instability attributable to HSIC estimation. Experiments were conducted on an NVIDIA A100 GPU (40 GB) with 64 GB RAM and an Intel Xeon CPU.

5.2. Overall Performance

Table 2 reports the main results on the three datasets. Causal-DHG consistently outperforms all baselines in terms of both AUC and F1-Macro, with absolute gains over the strongest methods generally within four percentage points.
On YelpChi, Causal-DHG achieves an AUC of 0.9045 and F1-Macro of 0.7820, surpassing PC-GNN by approximately 2.3 and 3.7 percentage points, respectively. On Amazon, the gains are around 1.9 and 3.3 percentage points in AUC and F1-Macro. On the more challenging DGraph-Fin dataset, Causal-DHG improves AUC from 0.7620 to 0.7985 and F1-Macro from 0.6350 to 0.6870. These improvements are consistent yet moderate, aligning with practical expectations for incremental gains on strong baselines. We further report PR-AUC (Average Precision) in Appendix A.1 Table A1. Sensitivity analyses on hyperedge size and temporal window granularity, as well as robustness to structural perturbations, are provided in Appendix A Table A2, Table A3 and Table A4.

5.3. Robustness to Feature Perturbations

To evaluate robustness against environment-related noise, we inject controlled perturbations into node features that serve as proxies for environmental statistics. Concretely, for each dataset, we first compute simple per-dimension statistics (such as variance and correlation with local context indicators like node degree and temporal activity counts). We then select a small subset of feature dimensions whose values exhibit strong dependence on these context indicators and treat them as environment-sensitive proxies. For each node, we add Gaussian noise ϵ N ( 0 , σ 2 ) to this subset of dimensions and measure the AUC degradation as σ increases.
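The following sketch illustrates this perturbation protocol; the use of node degree as the context indicator and the number of selected dimensions are illustrative assumptions.

```python
import numpy as np

def perturb_environment_dims(X, degrees, sigma, top_m=5, rng=None):
    """Select the feature dimensions most correlated with a context
    indicator (here, node degree) and add Gaussian noise of scale sigma."""
    rng = rng or np.random.default_rng(0)
    corr = np.array([abs(np.corrcoef(X[:, j], degrees)[0, 1])
                     for j in range(X.shape[1])])
    env_dims = np.argsort(-corr)[:top_m]    # environment-sensitive proxies
    X_noisy = X.copy()
    X_noisy[:, env_dims] += rng.normal(0.0, sigma, size=(X.shape[0], top_m))
    return X_noisy

# Toy usage: measure AUC degradation as sigma grows.
X = np.random.randn(1000, 32)
degrees = np.random.poisson(5, size=1000).astype(float)
X_sigma2 = perturb_environment_dims(X, degrees, sigma=2.0)
```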
As shown in Table 3, PC-GNN suffers significant AUC drops as noise increases, while Causal-DHG degrades much more slowly. At the highest noise level ( σ = 2.0 ), Causal-DHG retains an AUC of 0.8520, compared to 0.7150 for PC-GNN. This suggests that disentangling causal and environment-related factors and performing backdoor adjustment indeed enhances robustness against perturbations that mimic environmental shifts.

5.4. Ablation Study

We conduct ablation studies on YelpChi to assess the contribution of each component: hypergraph modeling, causal disentanglement, and contrastive learning (Table 4).
Removing the causal module (disentanglement and backdoor adjustment) causes the largest drop in performance, with AUC decreasing from 0.9045 to 0.8720 and F1-Macro from 0.7820 to 0.7310. This confirms that explicitly modeling and marginalizing out environment-related factors is critical for robust fraud detection. Replacing hypergraphs with pairwise graphs also reduces performance, validating the importance of high-order interaction modeling. Finally, discarding contrastive learning yields a moderate performance loss, indicating that leveraging unlabeled data improves representation quality under label scarcity.

5.5. Representation Analysis

To further understand the effect of the causal module, we visualize node embeddings via t-SNE on YelpChi.
As shown in Figure 5, the embeddings produced by Causal-DHG exhibit clearer separation between fraudulent and benign nodes compared to PC-GNN, consistent with the quantitative improvements in AUC and F1-Macro.

5.6. Hyperparameter Sensitivity and Efficiency

We study the sensitivity of Causal-DHG to the dictionary size K and analyze its training and inference efficiency (Figure 6).
Table 5 summarizes the efficiency comparison on DGraph-Fin. Causal-DHG requires slightly more training time per epoch than PC-GNN due to hypergraph attention and the causal module, but the inference time remains comparable and suitable for large-scale deployment.

6. Conclusions

This paper proposed Causal-DHG, a dynamic hypergraph neural network with causal intervention for robust financial fraud detection. We constructed dynamic hypergraphs that explicitly capture high-order interaction patterns using only observed entities and relations. On top of these structures, we designed a Multi-Head Spatio-Temporal Hypergraph Attention encoder and a Causal Disentanglement Module with dictionary-based backdoor adjustment to separate and marginalize environment-related factors. A multi-view contrastive objective was further introduced to leverage unlabeled nodes and enhance robustness under feature and structural perturbations.
Extensive experiments demonstrated that Causal-DHG achieves consistent performance gains over strong baselines while providing substantially improved robustness against environment-related noise. Ablation and representation analyses confirmed that both hypergraph modeling and causal intervention play essential roles. These findings suggest that integrating causal reasoning into hypergraph-based fraud detection is a promising step toward trustworthy AI in financial applications.
Future work includes extending the framework to multi-task settings (e.g., jointly detecting fraud and predicting credit risk), incorporating more fine-grained temporal encoders, and exploring causal discovery techniques to learn structural causal relations directly from data.

Funding

The author received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Data Availability Statement

The data presented in this study are openly available: DGraph-Fin at https://dgraph.xinye.com (accessed on 1 December 2025), and YelpChi and Amazon through the Deep Graph Library benchmark interfaces referenced in Section 5.1.1.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. Additional Experiments and Analyses

Appendix A.1. PR-AUC (Average Precision)

Table A1. PR-AUC (Average Precision) on YelpChi, Amazon, and DGraph-Fin (mean ± std over five runs). Note: DGraph-Fin values are lower due to extreme class imbalance (approx. 1%).
Method        YelpChi          Amazon           DGraph-Fin
CARE-GNN      0.7124 ± 0.011   0.7850 ± 0.009   0.2215 ± 0.013
PC-GNN        0.7560 ± 0.009   0.8120 ± 0.008   0.2450 ± 0.011
Causal-DHG    0.7950 ± 0.008   0.8410 ± 0.006   0.2940 ± 0.009

Appendix A.2. Sensitivity to Hyperedge Size (Top-k_sim in Similarity Hyperedges)

Table A2. Sensitivity of Causal-DHG to the hyperedge size controlled by the top-k_sim neighbors in the similarity-based hyperedge construction on YelpChi. We report AUC (mean over five runs).
k_sim    5        10       15       20       30
AUC      0.8950   0.9020   0.9045   0.9035   0.9030

Appendix A.3. Sensitivity to the Number of Time Windows T (DGraph-Fin)

Table A3. Effect of time-window granularity T on DGraph-Fin. Edge timestamps are discretized into T non-overlapping windows; we report AUC (mean over five runs).
T      1        3        5        7        10
AUC    0.7610   0.7850   0.7920   0.7985   0.7820

Appendix A.4. Robustness to Structural Perturbations

Table A4. Robustness to structural perturbations on YelpChi. We randomly drop a fraction r of edges for PC-GNN and a fraction r of hypergraph incidences for Causal-DHG at test time, with r ∈ {0, 0.1, 0.2}. We report AUC (mean over five runs).
Method        r = 0    r = 0.1   r = 0.2
PC-GNN        0.8810   0.8540    0.8210
Causal-DHG    0.9045   0.8950    0.8820

References

  1. The Nilson Report. Card Fraud Losses Dip to $28.58 Billion in 2020; The Nilson Report: Santa Barbara, CA, USA, 2020.
  2. Xu, D.; Ruan, C.; Korpeoglu, E.; Kumar, S.; Achan, K. Inductive representation learning on temporal graphs. arXiv 2020, arXiv:2002.07962.
  3. Dou, Y.; Liu, Z.; Sun, L.; Deng, Y.; Peng, H.; Yu, P.S. Enhancing graph neural network-based fraud detectors against camouflaged fraudsters. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual Event, 19–23 October 2020; pp. 315–324.
  4. Wang, D.; Lin, J.; Cui, P.; Jia, Q.; Wang, Z.; Fang, Y.; Yu, Q.; Zhou, J.; Yang, S.; Qi, Y. A semi-supervised graph attentive network for financial fraud detection. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; IEEE: New York, NY, USA, 2019; pp. 598–607.
  5. Liu, Z.; Chen, C.; Li, L.; Zhou, J.; Li, X.; Song, L.; Qi, Y. GeniePath: Graph neural networks with adaptive receptive paths. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 4424–4431.
  6. Liu, Y.; Ao, X.; Qin, Z.; Chi, J.; Feng, J.; Yang, H.; He, Q. Pick and choose: A GNN-based imbalanced learning approach for fraud detection. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 12–23 April 2021; pp. 3168–3177.
  7. Feng, Y.; You, H.; Zhang, Z.; Ji, R.; Gao, Y. Hypergraph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 3558–3565.
  8. Bengio, Y.; Deleu, T.; Rahaman, N.; Ke, R.; Lachapelle, S.; Bilaniuk, O.; Goyal, A.; Pal, C. A meta-transfer objective for learning to disentangle causal mechanisms. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 30 April 2020.
  9. Pearl, J. Causality, 2nd ed.; Cambridge University Press: Cambridge, UK, 2009.
  10. Wu, Q.; Zhang, H.; Yan, J.; Wipf, D. Handling distribution shifts on graphs: An invariance perspective. arXiv 2022, arXiv:2202.02466.
  11. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  12. Liu, Y.; Wu, J.; Zhang, J.; Leung, M.F. Graph-regularized orthogonal non-negative matrix factorization with Itakura–Saito (IS) divergence for fault detection. Mathematics 2025, 13, 2343.
  13. Bretto, A. Hypergraph Theory: An Introduction; Springer: Cham, Switzerland, 2013; Volume 1, pp. 209–216.
  14. Jiang, J.; Wei, Y.; Feng, Y.; Cao, J.; Gao, Y. Dynamic hypergraph neural networks. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; pp. 2635–2641.
  15. Pareja, A.; Domeniconi, G.; Chen, J.; Ma, T.; Suzumura, T.; Kanezashi, H.; Kaler, T.; Schardl, T.; Leiserson, C. EvolveGCN: Evolving graph convolutional networks for dynamic graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 5363–5370.
  16. Rossi, E.; Chamberlain, B.; Frasca, F.; Eynard, D.; Monti, F.; Bronstein, M. Temporal graph networks for deep learning on dynamic graphs. arXiv 2020, arXiv:2006.10637.
  17. Xia, X.; Yin, H.; Yu, J.; Wang, Q.; Cui, L.; Zhang, X. Self-supervised hypergraph convolutional networks for session-based recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 4503–4511.
  18. Schölkopf, B.; Locatello, F.; Bauer, S.; Ke, N.R.; Kalchbrenner, N.; Goyal, A.; Bengio, Y. Toward causal representation learning. Proc. IEEE 2021, 109, 612–634.
  19. Kuang, K.; Xiong, R.; Cui, P.; Athey, S.; Li, B. Stable prediction with model misspecification and agnostic distribution shift. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020.
  20. Gretton, A.; Bousquet, O.; Smola, A.; Schölkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the Algorithmic Learning Theory (ALT), Singapore, 8–11 October 2005.
  21. Bae, I.; Jeon, H.G. Disentangled multi-relational graph convolutional network for pedestrian trajectory prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 911–919.
  22. Yang, D.; Zha, D.; Kurokawa, R.; Zhao, T.; Wang, H.; Zou, N. Factorizable graph convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual Event, 6–12 December 2020.
  23. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
  24. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
  25. You, Y.; Chen, T.; Sui, Y.; Chen, T.; Wang, Z.; Shen, Y. Graph contrastive learning with augmentations. Adv. Neural Inf. Process. Syst. 2020, 33, 5812–5823.
Figure 1. Motivating comparison between correlation-based GNNs and Causal-DHG. (Left): Standard GNNs aggregate neighbor information indiscriminately and tend to rely on environment-driven structural patterns, such as high-degree hubs or temporally concentrated bursts, which may be spuriously correlated with fraud labels. (Right): The proposed Causal-DHG disentangles intrinsic causal factors from environment-related factors and performs backdoor-style intervention to block spurious environmental paths, yielding representations that are more robust under distribution shifts.
Figure 2. Overall Causal-DHG architecture. Raw graphs are converted into dynamic hypergraphs and encoded by the MH-DHA module to obtain initial node representations H init . The Causal Disentanglement Module projects H init into causal factors Z c and environmental factors Z e , enforces their independence (HSIC), and performs dictionary-based backdoor adjustment to produce an environment-marginalized representation Z * . The model is optimized jointly with supervised loss L sup and contrastive loss L cl .
Figure 3. Structural statistics of YelpChi, Amazon, and DGraph-Fin. (Top row): Node-degree CDFs P ( K k ) versus degree k (log–log scale), showing a long-tail with a small number of super-nodes. (Bottom row): Hyperedge-size CDFs P ( S s ) versus hyperedge size s (log–log scale), indicating the existence of dense groups. These heavy-tailed patterns motivate explicit hypergraph modeling of high-order relations rather than relying only on pairwise edges.
Figure 4. Causal Disentanglement Module. Node representations H init are projected into a causal subspace Z c and an environment-related subspace Z e via two MLPs. HSIC regularization encourages independence between Z c and Z e . A learnable environmental dictionary aggregates global environment prototypes, and a backdoor-style adjustment uses attention over the dictionary to marginalize out spurious environmental factors, yielding robust representations Z * for classification and contrastive learning.
Figure 5. t-SNE visualization of node embeddings on YelpChi. (a) PC-GNN embeddings show considerable overlap between normal and fraudulent nodes. (b) Causal-DHG embeddings result in more clearly separated clusters, illustrating that the causal module helps to suppress environment-driven noise and emphasize intrinsic fraudulent patterns.
Figure 6. Model analysis. (a) AUC on YelpChi as a function of dictionary size K; performance improves as K increases from 8 to 32 and stabilizes thereafter. (b) Training time per epoch on DGraph-Fin for PC-GNN and Causal-DHG, showing that Causal-DHG incurs only a moderate overhead relative to PC-GNN while providing stronger robustness. The arrow indicates performance trend only and does not encode additional information.
Table 1. Statistics of the datasets used in the experiments. Hyperedge counts are induced by our construction and scale with the number of users/products/contacts and time windows.
Dataset       Nodes                Edges       Node Feature Dim   Fraud Ratio
YelpChi       45,954 (reviews)     3,846,979   32                 14.5%
Amazon        11,944 (users)       4,398,392   25                 6.8%
DGraph-Fin    3,700,550 (users)    4,300,999   17                 1.04%
Table 2. Overall performance on YelpChi, Amazon, and DGraph-Fin. We report AUC and F1-Macro (mean ± standard deviation over five runs). The best result is in bold, and the second best is underlined.
Method       YelpChi                           Amazon                            DGraph-Fin
             AUC              F1-Macro         AUC              F1-Macro         AUC              F1-Macro
GCN          0.7420 ± 0.010   0.5830 ± 0.018   0.8120 ± 0.012   0.6340 ± 0.019   0.6810 ± 0.017   0.5120 ± 0.021
GAT          0.7640 ± 0.011   0.6120 ± 0.020   0.8250 ± 0.013   0.6510 ± 0.018   0.6950 ± 0.015   0.5340 ± 0.020
GraphSAGE    0.7510 ± 0.012   0.5940 ± 0.017   0.8200 ± 0.011   0.6420 ± 0.016   0.7020 ± 0.016   0.5280 ± 0.019
H-GNN        0.8520 ± 0.010   0.7100 ± 0.019   0.9010 ± 0.009   0.8300 ± 0.017   0.7310 ± 0.018   0.5980 ± 0.022
EvolveGCN    0.8150 ± 0.015   0.6720 ± 0.021   0.8840 ± 0.015   0.8010 ± 0.020   0.7250 ± 0.014   0.5820 ± 0.021
CARE-GNN     0.8650 ± 0.009   0.7210 ± 0.016   0.9150 ± 0.008   0.8450 ± 0.014   0.7450 ± 0.017   0.6120 ± 0.019
PC-GNN       0.8810 ± 0.008   0.7450 ± 0.015   0.9320 ± 0.007   0.8610 ± 0.013   0.7620 ± 0.010   0.6350 ± 0.017
Causal-DHG   0.9045 ± 0.007   0.7820 ± 0.012   0.9510 ± 0.006   0.8940 ± 0.010   0.7985 ± 0.009   0.6870 ± 0.014
Table 3. Robustness evaluation on YelpChi under increasing feature noise levels σ. We report AUC; higher is better.
Method        σ = 0    σ = 0.5   σ = 1.0   σ = 2.0
PC-GNN        0.8810   0.8450    0.7920    0.7150
Causal-DHG    0.9045   0.8911    0.8750    0.8520
Table 4. Ablation study on YelpChi. We remove one component at a time and report AUC and F1-Macro.
Variant                             AUC      F1-Macro
Causal-DHG (full)                   0.9045   0.7820
  w/o Causal Module                 0.8720   0.7310
  w/o Hypergraph (pairwise only)    0.8850   0.7540
  w/o Contrastive Learning          0.8910   0.7620
Table 5. Training and inference efficiency on DGraph-Fin. Inference time is measured per 10,000 nodes.
Method        Train Time/Epoch (s)   Inference Time (ms)
GraphSAGE     15.2                   45
CARE-GNN      128.5                  210
PC-GNN        145.0                  235
Causal-DHG    168.2                  260


