1. Introduction
The rapid expansion of the digital economy has fostered financial innovation while simultaneously opening novel avenues for fraudulent activities. Consumers and financial institutions worldwide have suffered substantial losses as fraud schemes become increasingly diverse and concealed [1]. A 2022 study estimated that financial fraud costs billions of dollars worldwide every year, with losses in the US alone exceeding $400 billion [2]. As fraud tactics grow more complex, rule-based detection systems become less effective [3].
Owing to its strong capabilities in pattern recognition and predictive analytics, artificial intelligence (AI), particularly machine learning (ML) and deep learning (DL), has become a key enabling technology in modern financial fraud detection systems [4,5]. Many studies have demonstrated the effectiveness of AI across fraud detection settings, including credit card transactions and insurance claims.
While recent systematic reviews have examined AI in financial fraud detection [6,7,8,9], they predominantly focus on general machine learning efficacy or on traditional banking and accounting sectors. Specifically, Ali et al. [10] and Kamuangu [11] provide comprehensive analyses of standard ML classifiers, while Bao et al. [12] concentrate on financial statement auditing. However, a holistic analysis that explicitly examines the technical characteristics of DeFi-native attacks (e.g., rug pulls and flash loan exploits) and discusses systematic approaches to dataset construction remains limited. Moreover, systematic reviews and comparative analyses are still rare, even though problems such as dataset imbalance and accessibility have long been recognized [11]. These gaps underline the need for a thorough and current review spanning methodological developments, dataset characteristics, traditional and emerging domains, and future directions. Such a review is particularly valuable for guiding the design and deployment of practical fraud detection systems in real-world financial environments.
This review provides a comprehensive overview of AI applications in financial fraud detection, covering traditional fraud cases alongside emerging areas such as cryptocurrency scams, DeFi attacks, and NFT rug pulls. It synthesizes recent methodological advancements, including natural language processing, deep learning, and graph neural networks. In addition, it systematically analyzes commonly used datasets and evaluation metrics, and discusses emerging research directions such as explainable AI, federated learning, and blockchain integration.
To clearly distinguish this survey from existing reviews on financial fraud detection, the main contributions of this work are summarized as follows:
A structured and task-driven taxonomy of AI-based fraud detection methods, organizing prior studies according to fraud types, data modalities, and modeling paradigms, rather than providing a purely algorithm-centric overview.
A comprehensive analysis of datasets and data-related challenges, including a roadmap of commonly used public datasets, labeling limitations, and data scarcity issues, with particular emphasis on emerging and decentralized financial environments.
Dedicated coverage of emerging fraud scenarios, such as cryptocurrency-related fraud, rug pulls, and flash loan attacks, highlighting their unique characteristics, data requirements, and modeling challenges that are not adequately addressed in traditional fraud detection surveys.
A critical discussion of evaluation challenges and open research issues, extending beyond predictive accuracy to include robustness, interpretability, scalability, and adversarial considerations in real-world financial systems.
To make these contributions clear at a glance, Table 1 contrasts this review with representative existing surveys.
As summarized in Table 1, this review offers broader coverage and a more integrated methodological perspective than existing surveys. Figure 1 presents a conceptual synthesis framework of this review, illustrating how findings from the surveyed literature are systematically organized across fraud categories, datasets, methodologies, evaluation metrics, and future research directions.
2. Related Research
Both industry and academia have shown increasing interest in applying artificial intelligence (AI) techniques to financial fraud detection in recent years. This section reviews representative studies that use machine learning, deep learning, graph-based modeling, and natural language processing (NLP) across traditional and emerging fraud scenarios. Rather than only listing reported results, we also comment on why particular methods tend to work (or break) in specific fraud settings, and we relate these findings to practical constraints that shape real deployments, such as extreme class imbalance, delayed or selective supervision, temporal drift, scalability, and auditability.
2.1. Machine Learning Methods
For classical machine learning, Awoyemi et al. [13] compared several algorithms for credit card fraud detection and found that k-nearest neighbors outperformed Naïve Bayes and logistic regression. One reason k-nearest neighbors can appear strong is that it does not impose a restrictive functional form: if fraudulent transactions form small and irregular pockets in feature space, an instance-based decision rule can capture such local structure without assuming linear separability. At the same time, its performance is tightly coupled to representation choices. In real transaction data, features are often high-dimensional, sparse, and partly categorical, and the induced distance can become less informative, making “nearest” neighbors noisy. The method is also sensitive to temporal drift: neighbors drawn from historical behavior may stop being relevant once fraud strategies shift. Beyond these modeling aspects, kNN also raises practical concerns for large-scale deployment because exact neighbor search can be expensive at high throughput, often requiring approximate indexing and careful system engineering.
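The representation sensitivity discussed above is easy to demonstrate. The following minimal sketch uses hand-crafted toy data (not the dataset of [13]) in which fraud forms a small pocket along a low-variance feature: without standardization, the large-scale amount feature dominates the Euclidean distance and kNN misses the pocket.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hand-crafted toy data (illustrative only): feature 0 is a transaction
# amount on a large numeric scale, feature 1 a behavioral score on a
# small one. Fraud forms a local pocket at score ~5, regardless of amount.
X = np.array([[a, 0.0] for a in range(0, 100, 10)]      # 10 legitimate
             + [[2.0, 5.0], [52.0, 5.0], [82.0, 5.0]])  # 3 fraudulent
y = np.array([0] * 10 + [1] * 3)

query = np.array([[51.0, 5.0]])  # unseen transaction inside the pocket

raw = KNeighborsClassifier(n_neighbors=3).fit(X, y)
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=3)).fit(X, y)

# On raw features the amount axis dominates the distance, so two of the
# query's three neighbors are legitimate; after standardization the
# low-variance score axis contributes equally and all three are fraudulent.
print(raw.predict(query))     # -> [0]  missed
print(scaled.predict(query))  # -> [1]  detected
```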
While [13] provides an early comparative analysis in this domain, its evaluation setting still departs from operational conditions in several technical respects. First, the benchmark covers a limited time span, which makes it difficult to assess robustness under temporal drift. Second, class rebalancing through resampling (e.g., under- and/or over-sampling) can make training feasible, but it also alters the effective class prior; as a result, metrics and model rankings observed under resampled distributions may not translate directly to production alerting behavior, where base rates are extremely low and calibration matters. Finally, the PCA-anonymized feature space limits interpretability and provides limited insight into which signals would remain stable and actionable in real deployments.
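The effect of resampling on the class prior can be made concrete with the standard Bayes correction for uniform undersampling. The sketch below is illustrative only: it uses synthetic data, a logistic regression, and an assumed 1% base rate, and the correction formula presumes that negatives were subsampled uniformly at random with fraction beta.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic setup (illustrative): 1% fraud base rate, one Gaussian
# feature shifted by +1 for fraud.
n = 200_000
y = (rng.random(n) < 0.01).astype(int)
x = rng.normal(loc=y.astype(float), scale=1.0).reshape(-1, 1)

# Balanced training set: keep all positives plus an equal-size random
# subset of negatives. beta is the fraction of negatives retained.
pos = np.flatnonzero(y == 1)
neg = rng.choice(np.flatnonzero(y == 0), size=pos.size, replace=False)
idx = np.concatenate([pos, neg])
clf = LogisticRegression().fit(x[idx], y[idx])

p_s = clf.predict_proba(x)[:, 1]           # posterior under resampled prior
beta = pos.size / (y == 0).sum()
p = beta * p_s / (1.0 - p_s + beta * p_s)  # Bayes correction to true prior

print(p_s.mean())  # inflated far above the 1% base rate
print(p.mean())    # pulled back near the 1% base rate
```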
Dal Pozzolo et al. [14] moved closer to operational reality by proposing a learning strategy that treats feedback data and delayed supervision samples separately. This is important because confirmed labels in finance are commonly delayed (e.g., chargebacks), and the observed labels are shaped by review processes. Handling delayed supervision explicitly helps reduce the bias that arises when “not yet confirmed” cases are implicitly treated as legitimate during training. The main caveat is that the method assumes feedback is continuously available and sufficiently representative. In practice, feedback can be sporadic and policy-driven (e.g., limited review capacity concentrates attention on a subset of alerts). Under such selection effects, a model may end up fitting the institution’s investigation policy rather than fraud itself, and performance can shift when thresholds or review processes change.
In the insurance domain, Aslam et al. [15] proposed an auto insurance fraud detection framework that performs feature selection using the Boruta algorithm and then develops predictive models using logistic regression, support vector machines, and Naïve Bayes. Their results suggest that, even under the same feature set, different classical learners can behave quite differently across evaluation metrics, underscoring the sensitivity of claim fraud detection to model choice and metric selection. Nevertheless, the study is built on a publicly available dataset with a specific feature schema and label definition, and such design choices are often tied to local claim processes and investigation standards. This raises a practical question about how reliably learned decision rules would transfer across insurers, product lines, or jurisdictions where coding practices, fraud definitions, and review workflows differ. Finally, while classical models are generally viewed as more transparent than deep architectures, transparency becomes operationally meaningful only when explanations are stable at the case level and can be mapped to auditable claim factors; in practice, the combined effects of feature selection, preprocessing, and model-specific decision boundaries can still make regulatory justification non-trivial.
Overall, traditional ML methods remain attractive because they are relatively simple to implement, easier to control, and useful as baselines. Their weaknesses, however, align with two persistent properties of fraud data: severe class imbalance and concept drift. In addition, most tabular classifiers do not naturally exploit relational structure (shared devices, counterparties, coordinated rings) unless those dependencies are manually engineered into features.
2.2. Deep Learning Approaches
Deep learning approaches are often adopted to reduce manual feature engineering and to model nonlinear interactions and temporal dependencies more directly. Ghosh Dastidar et al. [16] proposed the Neural Aggregate Generator (NAG), which learns transaction-history representations end-to-end by combining feature embeddings with lightweight 1-D convolutional components that implement learnable aggregation over past transactions. They reported improved performance compared with generic CNN and LSTM baselines. Embeddings are particularly useful for high-cardinality categorical variables (e.g., merchant or channel identifiers), since dense representations make interaction learning feasible. The flip side is that such representations can become fragile when the categorical space evolves: new merchants, devices, or channels create cold-start behavior, and shifting semantics can degrade embeddings unless the model is updated regularly. In addition, performance gains from convolution-style components remain sensitive to how transaction histories are structured and aligned in the input; if the imposed structure does not match the underlying behavioral process, improvements can be dataset-specific and harder to transfer across institutions or products.
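The cold-start issue for high-cardinality categoricals is commonly handled with a reserved out-of-vocabulary index. The minimal sketch below (hypothetical merchant ids and embedding size; not the NAG implementation) shows the idea: an unseen merchant falls back to a shared vector instead of breaking the pipeline.

```python
import numpy as np

# Minimal sketch (hypothetical ids and sizes): dense embeddings for a
# high-cardinality categorical such as merchant id, with row 0 reserved
# as a shared out-of-vocabulary bucket for unseen ("cold start") values.
rng = np.random.default_rng(0)

vocab = {"m_1001": 1, "m_1002": 2, "m_1003": 3}        # built at training time
emb = rng.normal(scale=0.1, size=(len(vocab) + 1, 8))  # row 0 = OOV bucket

def lookup(merchant_id: str) -> np.ndarray:
    # Unknown ids map to index 0 rather than raising a KeyError.
    return emb[vocab.get(merchant_id, 0)]

known = lookup("m_1002")   # maps to its trained row
unseen = lookup("m_9999")  # cold-start merchant falls back to row 0
print(np.array_equal(unseen, emb[0]))  # True: shared fallback vector
```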
Forough et al. [17] proposed an ensemble of deep sequential models with a voting mechanism for detecting fraud in transaction sequences. Sequence models are a natural choice when fraud is expressed as a tactic over time: bursts of activity, probing transactions, or abrupt shifts in spending context. The cost is operational: long histories, multiple base models, and the voting stage increase implementation complexity and can pressure latency budgets in authorization-time settings. While [17] reports favorable time analysis, whether such ensembles can consistently meet strict sub-second constraints still depends on production-specific factors such as feature availability, batching strategy, and hardware limits.
More broadly, deep models can be effective when supervision is sufficiently rich and stable, but fraud labels are often delayed, noisy, and shaped by intervention policies. Without careful temporal evaluation and monitoring, deep models may overfit dataset artifacts and then degrade quietly in production. Interpretability remains another practical sticking point: in high-stakes financial environments, explanations typically need to be grounded in traceable, case-level evidence, which deep representations do not provide by default and often require additional design to support.
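A basic safeguard against such silent degradation is out-of-time evaluation. The sketch below uses synthetic timestamps and placeholder features to illustrate the key invariant of a temporal split, which a random shuffle would violate by leaking future behavior into training.

```python
import numpy as np

# Sketch of an out-of-time validation split on synthetic data: the model
# trains only on transactions up to a cut-off day and is evaluated on the
# later period, so drift between periods is actually measured.
rng = np.random.default_rng(7)
t = np.sort(rng.integers(0, 365, size=1_000))  # day-of-year per transaction
X = rng.normal(size=(1_000, 5))                # placeholder features
y = (rng.random(1_000) < 0.02).astype(int)     # ~2% synthetic fraud labels

cutoff = 300                                   # hold out the last ~2 months
train_mask = t <= cutoff
X_tr, y_tr = X[train_mask], y[train_mask]
X_te, y_te = X[~train_mask], y[~train_mask]

# Invariant of a temporal split: every training timestamp precedes every
# test timestamp.
print(t[train_mask].max() <= t[~train_mask].min())  # True
```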
2.3. Graph-Based Methods for Relational and Collective Fraud
Graph-based modeling is increasingly used for fraud types where relationships are the main signal, including collusion rings, mule networks, and multi-entity laundering patterns. Shi [18] proposed a Hierarchical Graph Attention Network (HGAT) that encodes both local and global structural information via multi-head self-attention, achieving improved results on public datasets. Attention-based aggregation is appealing in financial graphs because connectivity is often extremely uneven: prioritizing informative neighbors can matter more than averaging over large, noisy neighborhoods. The hierarchical design is also consistent with how fraud evidence can appear at multiple scales, from immediate counterparties to broader communities.
However, bringing GNN-style models into production is not straightforward. First, real transaction graphs are large and dynamic: edges arrive continuously, and multi-hop aggregation can quickly trigger neighbor explosion, pushing both training and inference costs upward. Second, graph construction is itself a modeling decision—what becomes a node, what becomes an edge, and how time is represented—and small choices here can change outcomes substantially. Time handling is especially delicate: if future-linked information leaks into training or evaluation, offline results can be inflated without any real operational value. Third, graph supervision is typically sparse and delayed, and labels tend to be concentrated in investigated regions of the graph. This makes training sensitive to sampling strategies and to selection bias, and it can reduce robustness when review policies or alerting thresholds shift. Finally, while attention weights can be informative, they do not automatically translate into audit-ready explanations; additional constraints or analysis is usually needed to turn learned structure into human-verifiable evidence. These issues, together with the need for large annotated graphs and higher computational complexity, remain major barriers to broad deployment.
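The time-handling pitfall can be illustrated with a minimal snapshotting sketch (hypothetical node and edge schema): any feature used for training must be computed on a graph restricted to edges observed before the training cut-off, never on the full graph.

```python
from dataclasses import dataclass

# Sketch (hypothetical schema): snapshot the transaction graph at the
# training cut-off BEFORE any neighborhood aggregation, so features never
# see edges that arrive after the labels being predicted.

@dataclass(frozen=True)
class Edge:
    src: str  # e.g., account or wallet id
    dst: str
    t: int    # transaction timestamp

edges = [
    Edge("a", "b", 10), Edge("b", "c", 20),
    Edge("a", "d", 35), Edge("d", "e", 50),  # arrives after the cut-off
]

def snapshot(edges, cutoff):
    """Undirected adjacency built only from edges observed up to cutoff."""
    adj = {}
    for e in edges:
        if e.t <= cutoff:
            adj.setdefault(e.src, set()).add(e.dst)
            adj.setdefault(e.dst, set()).add(e.src)
    return adj

train_graph = snapshot(edges, cutoff=40)  # view used to compute features
full_graph = snapshot(edges, cutoff=60)   # evaluation-time view

# Node "e" is reachable only through a future edge, so it must be absent
# from every neighborhood computed on the training snapshot.
print("e" in train_graph)  # False
print("e" in full_graph)   # True
```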
2.4. Natural Language Processing (NLP) Techniques
NLP becomes relevant when unstructured text is central to the fraud signal, such as financial disclosures, narrative reports, or social media coordination. Li et al. [19] used NLP for financial statement fraud detection and reported notable gains when narrative disclosures were incorporated alongside traditional features. A clear advantage of NLP in this context is that it can capture linguistic cues that are awkward to hand-code (inconsistent phrasing, unusual hedging, selective disclosure). Yet financial narratives are also heavily templated. Models can end up learning issuer style, sector conventions, or formatting artifacts rather than fraud intent. Long documents add another practical limitation: truncation, uneven distribution of signal across sections, and the frequent mixing of text with numeric statements all make it difficult for standard text encoders to reliably preserve the subtle cues that matter.
In cryptocurrency settings, Mirtaheri et al. [20] examined pump-and-dump schemes coordinated on social platforms by combining NLP with social network analysis. This is technically well motivated because coordination can show up in language and group structure before market anomalies fully develop. Even so, much of the NLP-focused literature still concentrates on a narrow set of scenarios and often evaluates models in modality-specific pipelines [20]. In practice, text-only detectors can be brittle under platform migration, rapid slang shifts, and adversarial phrasing. Robust detection typically requires aligning text with transactional and relational evidence, but multimodal alignment is not trivial: the modalities differ in timing, granularity, and noise, and naïve fusion can produce unstable behavior across domains.
2.5. Dataset Characteristics and Research Implications
Dataset characteristics and evaluation design often decide whether improvements are meaningful beyond a controlled benchmark. Choi et al. [21] surveyed public corpora and found that many datasets are small, feature-limited, and lack temporal depth. These constraints help explain why models can look strong offline yet degrade under real-world conditions where attackers adapt and labels arrive late. To mitigate scarcity, later studies explored synthetic data generation with generative adversarial networks [22] and cross-domain transfer learning [23,24]. These directions are promising, but they also introduce new risks. Synthetic data must be checked for fidelity (rare but operationally important fraud modes must not be washed out), and privacy concerns must be addressed [25]. Transfer learning can reduce labeling effort, but differences in label definitions, feature semantics, and operational policies across organizations can cause negative transfer if not handled carefully.
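One simple fidelity check compares the marginal distributions of real and synthetic features, for example with a two-sample Kolmogorov-Smirnov statistic. The sketch below uses illustrative stand-in distributions; a real audit would also examine joint structure and specifically the rare fraud modes mentioned above.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)

# Illustrative stand-ins: "real" transaction amounts are heavy-tailed; a
# poor generator produces a thin-tailed approximation that washes out the
# rare large-amount region where much of the fraud signal lives.
real = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)
synthetic_good = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)
synthetic_bad = rng.normal(loc=real.mean(), scale=real.std() / 3, size=5_000)

# Two-sample Kolmogorov-Smirnov distance per marginal feature: a small
# statistic means the synthetic marginal tracks the real one closely.
stat_good = ks_2samp(real, synthetic_good).statistic
stat_bad = ks_2samp(real, synthetic_bad).statistic
print(stat_good < stat_bad)  # True: the thin-tailed generator is flagged
```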
In conclusion, although AI techniques have shown considerable promise in financial fraud detection, much of the literature still relies on limited datasets, narrow validations, or model choices that emphasize offline accuracy without fully confronting deployment realities such as drift, delayed supervision, scalability, and auditability. These gaps point to the need for fraud detection approaches that remain reliable and transparent under evolving fraud patterns and real operational constraints.
3. Research Methods
This study examines the current state of artificial intelligence (AI) applications in financial fraud detection using a systematic literature review (SLR) methodology. The research process is guided by the PRISMA [26] framework, with a focus on study identification, screening, eligibility assessment, and reporting transparency. The review procedure consists of the following key steps:
- (1)
Research Question Formulation: This review is structured around the following five core questions:
Q1: What are the traditional and emerging types of financial fraud?
Q2: How is artificial intelligence currently being used in conventional fraud detection, such as credit card and insurance fraud, and how do various techniques compare in performance?
Q3: What are the potential applications, key challenges, and emerging research opportunities of AI for identifying novel fraud types, such as cryptocurrency-related fraud and flash loan attacks?
Q4: What are the boundaries of the public datasets that are currently available in this field? How can we create better datasets to propel new advancements?
Q5: What are the primary limitations and gaps in the existing research? Which avenues for further research are recommended?
- (2)
Search strategy: Relevant studies were retrieved from major academic databases, including Web of Science, IEEE Xplore, ScienceDirect, and Scopus. Search terms included: (“financial fraud” or “financial fraud detection” or “credit card fraud” or “insurance fraud” or “loan fraud” or “money laundering” or “cryptocurrency fraud” or “online loan fraud” or “rug pulls” or “flash loan attacks”) and (“artificial intelligence” or “machine learning” or “deep learning” or “graph neural network” or “GNN” or “natural language processing” or “federated learning” or “explainable AI” or “XAI”). The search was limited to English-language journal and conference papers published since 2015. Because the large-scale application of deep learning, graph neural networks, and other advanced AI techniques in fraud detection emerged after 2015, this review takes 2015 as the starting year to capture the most relevant and technologically significant advances. The search and screening process was independently reviewed to ensure consistency and completeness.
- (3)
Eligibility criteria:
The following inclusion and exclusion criteria were applied:
Inclusion criteria:
- (i)
Studies explicitly applying artificial intelligence techniques (including ML, DL, GNNs, NLP, or hybrid approaches) to financial fraud detection, such as credit card fraud, loan fraud, insurance fraud, anti-money laundering, cryptocurrency fraud, and DeFi-specific attacks (e.g., rug pulls or flash loan exploits);
- (ii)
Studies presenting experimental, benchmark, or real-world evaluation results, including datasets and evaluation metrics;
- (iii)
Publications in peer-reviewed journals or conferences published between 2015 and 2025;
- (iv)
Full-text available in English.
Exclusion criteria:
- (i)
Editorials, commentaries, or opinion pieces without methodological contribution;
- (ii)
Studies unrelated to financial fraud (e.g., general intrusion detection without a financial context);
- (iii)
Duplicate records;
- (iv)
Studies with inaccessible full text or insufficient methodological details for quality assessment;
- (v)
Non-English publications.
- (4)
Study selection and screening:
After initial screening, potentially eligible studies were assessed at the full-text level based on the predefined inclusion and exclusion criteria. The study selection process was documented in accordance with PRISMA and is summarized in the flow diagram (Figure 2).
- (5)
Quality assessment criteria:
Using the quality evaluation criteria listed in Table 2, we screened the chosen studies and assessed their methodological quality. The final set of included studies consists of 122 papers covering a wide range of financial fraud scenarios, including credit card fraud, insurance fraud, loan fraud, anti-money laundering, and cryptocurrency-related fraud.
The quality assessment criteria adopted in this review (Table 2) are specifically designed to reflect the methodological characteristics and practical constraints of financial fraud detection research. Fraud detection datasets are typically highly imbalanced, with fraudulent cases representing only a small fraction of observed instances. Consequently, criteria related to data description, evaluation metrics, and validation strategies are essential for distinguishing studies that adequately address class imbalance from those relying on overly optimistic performance reporting.
In addition, fraud patterns are known to evolve over time due to regulatory changes, user behavior shifts, and adversarial adaptation. Quality criteria emphasizing temporal validation, dataset partitioning strategies, and experimental transparency therefore help identify studies that consider temporal drift and realistic deployment settings.
Finally, data accessibility and reproducibility remain persistent challenges in fraud detection research, as many datasets are proprietary or subject to strict privacy constraints. Including criteria related to dataset availability, experimental reproducibility, and methodological clarity enables a more consistent and fair assessment of the existing literature, particularly when comparing studies across different fraud domains and data sources.
In decentralized financial environments such as cryptocurrency and DeFi ecosystems, ensuring data integrity, transparency, and trustworthiness is particularly critical for reliable model evaluation. Prior research has demonstrated that blockchain-assisted learning frameworks can enhance the reliability and auditability of intelligent financial decision systems by leveraging tamper-resistant data storage and decentralized verification mechanisms. From a methodological perspective, this further motivates the emphasis on data transparency, validation rigor, and reproducibility when assessing AI-based fraud detection studies operating in decentralized and adversarial financial settings [27].
These criteria help ensure that the reviewed studies provide sufficient methodological detail and comparable evaluation results.
4. Research Findings
The findings section of this study unfolds across two dimensions: overall trends and subject distribution. First, as illustrated in Figure 3, the publication volume exhibits a dynamic evolution over the observed period. An initial accumulation can be observed from 2015 to 2018, followed by a pronounced increase starting in 2019 and reaching a peak in 2021. After 2021, the trajectory shows non-monotonic variation rather than a continuous decline: publication counts decreased in 2022 and 2023, rebounded modestly in 2024, and declined again in 2025. Overall, the figure highlights a transition from rapid growth to a phase characterized by year-to-year variability.
Second, regarding the distribution of research subjects, the included papers cover a broad range of financial fraud types, with substantial variation in research intensity across categories (Figure 4). Within traditional fraud domains, credit card fraud is the most frequently studied topic. AML and insurance fraud follow at comparable levels, while financial statement fraud also attracts notable attention; loan-related fraud is less represented. For emerging fraud domains, crypto scams, DeFi/flash-loan attacks, and cryptocurrency-related fraud form a smaller yet meaningful body of work compared with the most studied traditional categories, whereas mobile/online payment fraud is relatively underexplored. Overall, the distribution indicates diversified but uneven research efforts across traditional and emerging fraud domains.
Q1: What are the traditional and emerging types of financial fraud?
4.1. Major Fraud Types in Detail
In general, financial fraud can be broadly categorized into established forms—such as credit card, insurance, and loan fraud—and emerging forms, including cryptocurrency-related scams and online lending fraud. Because of variations in data availability, fraud patterns, and adversarial behaviors, each type presents different difficulties for AI-based detection.
The landscape of financial fraud detection is marked by a distinct bifurcation. Established domains, such as credit card and insurance fraud, benefit from mature methodologies but remain hindered by privacy regulations and data silos that limit external validation. Conversely, emerging threats in digital finance—spanning cryptocurrency, internet lending, and DeFi exploits like rug pulls—present a more volatile challenge. These scenarios are defined by rapid adversarial adaptation and data scarcity, necessitating models that can synthesize on-chain behaviors with smart contract semantics. This contrast highlights the need to bridge the maturity gap between traditional and emerging fraud domains by developing adaptable and deployment-oriented AI frameworks.
Q2: How is artificial intelligence currently being used in conventional fraud detection, such as credit card and insurance fraud, and how do various techniques compare in performance?
4.2. Application of Artificial Intelligence Techniques in Traditional Fraud Detection
Traditional financial fraud detection, particularly in credit card fraud, insurance fraud, and loan fraud, has been extensively studied in recent years. Many AI techniques have been applied, from classic machine learning classifiers to more advanced deep learning and ensemble methods [37,60,68,69,70,71]. Despite their empirical success, these approaches continue to face persistent challenges in practical deployment, including extreme data imbalance, real-time detection requirements, and the limited interpretability of complex AI models [18,60,62,72].
As illustrated in Figure 5, the reported percentages reflect the proportion of reviewed studies (N = 122) adopting each dominant AI paradigm, as determined through a structured evidence matrix. The methodological landscape remains heavily anchored in Machine Learning (45.4%), reflecting its maturity and interpretability in regulated sectors. This dominance suggests that regulatory requirements and deployment constraints continue to shape model selection in real-world financial institutions. Nevertheless, a transition is evident: Deep Learning (27.6%) and Ensemble methods (11.2%) now command a substantial share, signaling a shift toward higher-capacity models. Newer paradigms such as graph-based learning and hybrid approaches are also emerging, aiming to better capture relational dependencies that are difficult to model using purely tabular features.
Table 4 summarizes representative AI methods used in traditional fraud detection, highlighting their typical application scenarios, strengths, and practical limitations.
To complement the qualitative analysis in Table 4, we further provide a quantitative benchmarking of these methods: Table 5 lists the specific performance metrics (Accuracy, F1, AUC) achieved by state-of-the-art models across different datasets. As summarized in Table 4, AI methodologies in traditional fraud detection have evolved from standard machine learning to more complex deep learning and graph-based approaches, and Table 5 offers a quantitative snapshot of representative studies.
A recurring pattern in credit card fraud studies is that supervised models often report very high accuracy, frequently exceeding 98% (e.g., [5]). However, the gap between Accuracy and F1-score reported in Table 5 (e.g., [5,15]) illustrates the core technical issue of extreme class imbalance: a model can achieve excellent overall accuracy while still missing a substantial portion of minority fraud cases or producing a precision level that is operationally unacceptable. This observation also implies that accuracy alone is not an adequate proxy for detection quality in real-world fraud settings, where the practical objective is usually to maximize fraud capture under a constrained false-positive/alert budget [96].
From a metric-design standpoint, performance interpretation should be tied to the operating regime. For highly imbalanced problems, AUC-ROC can remain optimistic because false-positive rate may look small in absolute terms even when the absolute number of false alarms is large. In contrast, PR-oriented measures (e.g., AUC-PR, precision at a target recall, or recall at a tolerable precision) better reflect the minority-class trade-off that drives manual review workload and customer friction. However, even metrics such as the F1-score implicitly assume equal misclassification costs across error types. In operational fraud detection, this assumption rarely holds. A false negative typically results in direct financial loss and regulatory exposure, whereas a false positive primarily generates investigation overhead and potential customer dissatisfaction. Consequently, evaluation protocols should incorporate cost-aware metrics or savings-based indicators derived from domain-specific cost matrices. These measures more accurately capture the economic and operational consequences of model decisions rather than relying solely on statistical discrimination performance.
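These considerations can be illustrated with a small sketch on synthetic scores: ROC and PR summaries diverge sharply under heavy imbalance, and a domain cost matrix (the unit costs below are hypothetical) turns threshold selection into an explicit cost-minimization step rather than a cost-blind F1 maximization.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)

# Synthetic, heavily imbalanced scores (~0.5% positives) for illustration.
n = 20_000
y = (rng.random(n) < 0.005).astype(int)
scores = rng.normal(loc=2.0 * y, scale=1.0)  # positives score higher

auc_roc = roc_auc_score(y, scores)           # can look comfortably high
auc_pr = average_precision_score(y, scores)  # tells a much harsher story
print(auc_roc, auc_pr)

# Hypothetical unit costs: a missed fraud costs 100, a false alarm costs 1
# (analyst review). Choose the alert threshold that minimizes expected cost.
def expected_cost(threshold, c_fn=100.0, c_fp=1.0):
    pred = scores >= threshold
    fn = np.sum((~pred) & (y == 1))  # missed fraud
    fp = np.sum(pred & (y == 0))     # false alarms
    return c_fn * fn + c_fp * fp

thresholds = np.quantile(scores, np.linspace(0.50, 0.999, 200))
best = min(thresholds, key=expected_cost)
```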
From a modeling perspective, mitigating class imbalance also requires moving beyond data-level resampling strategies. Although techniques such as SMOTE remain widely adopted, algorithm-level approaches—including focal loss and cost-sensitive learning—have demonstrated improved stability under extreme imbalance conditions. Unlike resampling, which may distort underlying data distributions or introduce synthetic artifacts, focal loss dynamically down-weights well-classified majority-class samples and encourages the model to focus on difficult minority-class instances. Such mechanisms help improve robustness while preserving the integrity of original transaction distributions. In addition, it is important to recognize that the F1-score remains threshold-dependent rather than purely model-dependent, reinforcing the necessity of combining threshold analysis with business-driven cost calibration when evaluating deployment readiness.
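For concreteness, the focal loss mentioned above can be sketched in a few lines of NumPy. This is an illustration with commonly used default values for gamma and alpha, not the implementation of any particular reviewed study.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss on predicted probabilities p and labels y in {0,1}.
    Minimal NumPy sketch; in practice this lives inside the training loop
    of the chosen framework."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    p_t = np.where(y == 1, p, 1.0 - p)              # prob. of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# A confidently correct majority-class example is down-weighted by the
# (1 - p_t)^gamma factor, while a hard minority-class example keeps most
# of its cross-entropy weight, focusing gradients on rare fraud cases.
easy_majority = focal_loss(np.array([0.01]), np.array([0]))[0]  # y=0, p~0
hard_minority = focal_loss(np.array([0.30]), np.array([1]))[0]  # y=1, p low
print(hard_minority / easy_majority)  # several orders of magnitude
```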
While ensemble and deep learning methods are frequently reported as robust in Table 5 (e.g., [17,22]), these gains must be interpreted together with deployment constraints. In latency-sensitive settings (such as authorization-time card fraud), the feasibility of sub-second inference can limit the practical use of sequence ensembles, large models, or complex feature pipelines, even when offline discrimination metrics improve. Similarly, graph-based models may improve relational signal capture but can introduce additional overhead through graph construction and neighborhood aggregation, which creates a non-trivial engineering burden at scale. Finally, interpretability remains a practical bottleneck: as model complexity increases, translating model outputs into stable, case-level evidence that satisfies audit and compliance workflows becomes more difficult, which can slow or prevent adoption even when headline metrics are strong [41,60,62].
Recent studies further indicate that, beyond network architecture, effective optimization strategies play a critical role in improving the stability and practical performance of sequential models such as LSTM in real banking environments, where transaction patterns are non-stationary and operational constraints are strict [97].
It is also important to note that many of the high-performing results in Table 5 are obtained in settings where historical labels are available and data collection is relatively centralized. As fraud detection moves toward more decentralized and pseudonymous environments (e.g., blockchain and DeFi), supervision becomes weaker, behaviors shift faster, and feature availability differs substantially from conventional banking systems. These factors make direct transfer of conventional supervised pipelines non-trivial, even when offline metrics on traditional benchmarks appear strong. The following section (Q3) therefore examines recent efforts to adapt AI techniques to emerging fraud threats in decentralized financial environments.
Q3: What are the potential applications, key challenges, and emerging research opportunities of AI for identifying novel fraud types, such as cryptocurrency-related fraud and flash loan attacks?
4.3. Application of AI Technologies in Emerging Fraud Scenarios
Compared with traditional fraud detection tasks, emerging fraud types—such as cryptocurrency fraud, online loan scams, and DeFi-specific attacks including rug pulls and flash loan exploits—exhibit fundamentally different characteristics that challenge conventional AI-based detection pipelines. These schemes evolve rapidly, frequently lack large and consistently annotated datasets, and rely on heterogeneous data sources, including blockchain transaction graphs, project documentation, and social media content [48,90,95,98,99]. Moreover, participant pseudonymity and highly adversarial settings further complicate reliable fraud identification and limit the direct transferability of models developed for centralized financial systems.
To address these complexities, recent work has explored a range of methods. Sequential models have been applied to capture temporal patterns in transaction sequences [85,87]; network-based techniques have been used to represent relationships in transaction graphs [62,93,100,101]; and approaches combining text analysis with structured data have been used to leverage unstructured information [80,90]. Despite these advances, this literature is still in a formative stage, and many issues—such as scalability, interpretability, and robustness in adversarial environments—remain unresolved. This reflects the fact that most existing approaches remain problem-specific and have yet to demonstrate robust generalization across different emerging fraud scenarios.
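As a toy illustration of the relational signals these network-based techniques exploit, the sketch below computes one-hop neighborhood aggregates on a synthetic transaction graph. Addresses, amounts, and feature choices are invented, and the sketch is far simpler than the graph models cited above; it only shows why relational features can expose patterns (such as circular flows) that per-transaction features miss.

```python
from collections import defaultdict

edges = [  # (sender, receiver, amount) -- synthetic transactions
    ("a1", "a2", 5.0), ("a2", "a3", 4.9), ("a3", "a1", 4.8),  # circular flow
    ("b1", "b2", 1.0), ("b3", "b2", 2.0),
]

out_total = defaultdict(float)
in_total = defaultdict(float)
neighbors = defaultdict(set)
for s, r, amt in edges:
    out_total[s] += amt
    in_total[r] += amt
    neighbors[s].add(r)
    neighbors[r].add(s)

def node_features(node):
    """Own flow statistics plus the mean in-flow of one-hop neighbors."""
    nbrs = neighbors[node]
    mean_nbr_in = sum(in_total[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
    return {"out": out_total[node], "in": in_total[node],
            "mean_neighbor_in": mean_nbr_in, "degree": len(nbrs)}

feats = {n: node_features(n) for n in neighbors}
# Nodes in the circular flow have both high in- and out-flow, a pattern
# invisible to per-transaction features alone.
print(feats["a1"])
```

Even this single aggregation round requires touching every edge, which hints at the graph-construction and aggregation overhead discussed above when the edge count reaches production scale.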
To synthesize these fragmented efforts, Table 6 provides a structured overview of representative emerging fraud scenarios, highlighting their core modeling challenges, commonly adopted AI techniques, and open research directions. For example, in the context of cryptocurrency fraud, user pseudonymity and rapidly shifting strategies have led researchers to examine both sequential patterns and graph structures. In online lending fraud, the scarcity of publicly available labeled data and inconsistent labeling across platforms have prompted the use of hybrid text and tabular methods. For DeFi-specific threats such as rug pulls, integrating on-chain market behavior with off-chain social signals appears promising. In the case of flash loan attacks, which unfold over extremely short timescales, researchers have begun investigating methods capable of capturing both temporal dynamics and transaction relationships.
Overall, existing AI-based approaches show promise in addressing the diverse characteristics of emerging fraud types; however, their level of maturity remains noticeably lower than that observed in traditional fraud detection domains. Multimodal approaches that integrate on-chain behaviors with textual or social signals have demonstrated potential, yet their practical effectiveness is constrained by limited domain-specific annotations and the inherent difficulty of aligning heterogeneous data sources [64,80,90]. Moving forward, meaningful progress will depend not only on methodological refinement, but also on the construction of higher-quality datasets, improved cross-scenario generalization, and greater emphasis on model transparency and robustness in adversarial financial environments [53,65,66,67,94,95].
In parallel with these model-centric advances, recent progress in generative artificial intelligence—particularly large language models (LLMs)—has introduced a complementary research direction for addressing several unresolved challenges in emerging fraud detection.
Role of Generative AI and Large Language Models in Emerging Fraud Detection
Recent advances in generative artificial intelligence, particularly large language models (LLMs), have expanded the scope of AI applications beyond conventional prediction-oriented fraud detection [105]. Unlike traditional models that primarily operate on structured transaction records or graph representations, LLMs are capable of processing and generating unstructured textual information, which is increasingly relevant in emerging fraud scenarios involving project documentation, investigative narratives, and social media content.
One potential application of LLMs lies in synthetic data generation to alleviate class imbalance and data scarcity. By generating semantically coherent fraud-related narratives—such as scam descriptions or summaries of illicit schemes—LLMs may support weak supervision and scenario-based testing in domains where labeled data are limited [106]. However, generated samples may overrepresent frequent patterns and fail to capture rare but critical fraud variants, requiring careful validation before downstream use.
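One hedged way to turn such narratives (generated or collected) into weak supervision is to combine simple labeling functions by majority vote, in the spirit of data-programming approaches. The keyword rules and example texts below are purely illustrative assumptions; real labeling functions would be written and validated with domain experts.

```python
import re

FRAUD, LEGIT, ABSTAIN = 1, 0, -1

# Illustrative labeling functions over scam-narrative text.
def lf_guaranteed_returns(text):
    return FRAUD if re.search(r"guaranteed\s+\d+%", text, re.I) else ABSTAIN

def lf_urgency(text):
    return FRAUD if re.search(r"act now|last chance", text, re.I) else ABSTAIN

def lf_disclosure(text):
    return LEGIT if "prospectus" in text.lower() else ABSTAIN

LFS = [lf_guaranteed_returns, lf_urgency, lf_disclosure]

def weak_label(text):
    """Majority vote over non-abstaining labeling functions."""
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return FRAUD if votes.count(FRAUD) >= votes.count(LEGIT) else LEGIT

texts = [
    "Guaranteed 40% weekly returns, act now before the pool closes!",
    "Please review the fund prospectus before investing.",
]
print([weak_label(t) for t in texts])
```

The abstain option matters: weak labels should only be emitted where rules actually fire, and the resulting noisy labels still require the validation against real cases emphasized above.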
LLMs also offer new possibilities for anomaly explanation and investigator-oriented decision support. By synthesizing structured signals, including transaction features and graph-based risk indicators, into natural language summaries, LLMs may help translate complex model outputs into more accessible explanations for analysts. Nevertheless, these explanations are not guaranteed to be factually accurate, as LLMs are prone to hallucination, which poses non-trivial risks in regulatory and compliance-sensitive environments [107]. As a result, explanation generation should be constrained and embedded within human-in-the-loop workflows.
Beyond classification-oriented NLP tasks, LLMs enable higher-level semantic analysis of fraud-related narratives, such as summarizing fraud schemes and identifying recurring behavioral patterns across cases [108]. These capabilities are particularly relevant for emerging fraud types, including rug pulls and Ponzi schemes, where textual communication plays a central role. At the same time, LLM-based approaches remain vulnerable to adversarial language manipulation and domain shifts, limiting their robustness when deployed in isolation.
Overall, generative AI and LLMs represent a promising yet still complementary direction in emerging fraud detection. Their strengths lie in semantic understanding, explanation support, and unstructured data analysis rather than in standalone fraud prediction. Future research should therefore focus on integrating LLMs with established sequential, graph-based, and multimodal models in a controlled and auditable manner, while carefully addressing reliability, adversarial robustness, and data privacy concerns [109,110].
Q4: What are the boundaries of the public datasets that are currently available in this field? How can we create better datasets to propel new advancements?
4.4. Dataset Limitations and Development
The availability and design of datasets fundamentally influence both model performance and the fairness, reproducibility, and deployment relevance of evaluation results. However, different fraud types rely on different data sources and labeling processes, and these differences often dominate model outcomes. In what follows, we first summarize commonly used datasets and their accessibility, then discuss the technical limitations of existing public benchmarks, and finally outline directions for building more realistic and reusable datasets.
Beyond their intrinsic characteristics, dataset accessibility strongly shapes the direction of research. For each fraud type, Table 7 lists frequently used datasets and categorizes accessibility from fully public sources to proprietary logs. While researchers can easily access fully public datasets such as Enron, other resources involve partial or restricted access—for example, LendingClub provides only a loan-listing portion, and CSMAR typically requires a paid subscription. These access differences partially explain why some fraud types (e.g., credit card fraud) have become benchmark-driven research areas, while other settings remain dominated by institution-specific case studies.
- b. Limitations of Existing Datasets.
High-quality datasets are essential for developing reliable and generalizable fraud detection models. However, as reflected in Table 8, available datasets are unevenly distributed across fraud types, and even widely used public benchmarks carry technical constraints that affect what can be concluded from reported results.
- (i) Technical implications of anonymized features in public benchmarks.
In traditional domains, public datasets exist for credit card fraud (e.g., the European Credit Card dataset on Kaggle). Yet these benchmarks are often limited to short time windows and anonymized/transformed features (e.g., PCA components) [28]. Importantly, anonymization is not only an interpretability concern. First, the loss of feature semantics makes it difficult to assess feature stability and actionability under deployment: a model may rely on a particular PCA direction that is predictive in the benchmark but corresponds to an unstable mixture of raw signals that shifts across institutions, time periods, or preprocessing pipelines. Second, anonymization constrains error analysis and debugging. When false positives increase, it becomes difficult to trace model behavior back to business-relevant factors (channels, merchant characteristics, devices, geography), which complicates operational remediation and audit justification. Third, anonymized benchmarks reduce the value of domain-informed robustness design (e.g., targeted monitoring of semantically meaningful features, or rule-based constraints that align with operational controls). As a result, strong performance on anonymized benchmarks may overstate how easily the approach transfers to real institutional data, where feature definitions, availability, and preprocessing differ.
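The instability of PCA directions under preprocessing changes is easy to demonstrate on synthetic data. In this hedged sketch, the same two correlated raw features under a different hypothetical scaling (mimicking a second institution's pipeline) yield a visibly different leading component, so a model keyed to one PCA direction need not transfer.

```python
import numpy as np

rng = np.random.default_rng(42)

def first_pc(X):
    """Leading principal component direction, sign-fixed for comparability."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    v = vt[0]
    return v if v[0] >= 0 else -v

# Two correlated raw features, e.g. "amount" and "velocity" (names assumed).
z = rng.normal(size=(2000, 2))
a = z @ np.array([[1.0, 0.8], [0.0, 1.0]])   # institution A's preprocessing
b = a * np.array([3.0, 1.0])                 # institution B rescales feature 0

pc_a, pc_b = first_pc(a), first_pc(b)
# The "same" leading component now mixes the raw features very differently.
print(pc_a.round(2), pc_b.round(2))
```

With named features, such drift could be detected and remediated; with anonymized components, it silently undermines transfer.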
- (ii) Limited temporal depth and the under-evaluation of drift.
A second recurring limitation is insufficient temporal coverage. Short time spans make it difficult to meaningfully assess concept drift, even though drift is one of the defining challenges in operational fraud detection [14,23,109]. In practice, drift may appear as (a) prior drift (fraud prevalence changes after interventions, campaigns, or seasonal cycles), (b) covariate shift (new products, channels, and user segments change the feature distribution), and (c) behavioral/adversarial drift (attackers adjust to deployed controls). These shifts can break thresholds and calibration even when rank-based metrics appear stable. Consequently, evaluations that rely on random splits or that do not respect time ordering can underestimate degradation and overstate long-term stability.
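Covariate shift of the kind described in (b) is commonly monitored with the Population Stability Index (PSI), computed over quantile bins of a reference sample. The sketch below uses synthetic score distributions; the alert thresholds often quoted for PSI (e.g., 0.25) are industry rules of thumb rather than a standard.

```python
import numpy as np

def psi(expected, observed, bins=10):
    """Population Stability Index between a reference and a new sample,
    binned on the reference sample's quantiles."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf          # open-ended outer bins
    e, _ = np.histogram(expected, bins=cuts)
    o, _ = np.histogram(observed, bins=cuts)
    e = np.clip(e / e.sum(), 1e-6, None)
    o = np.clip(o / o.sum(), 1e-6, None)
    return float(np.sum((o - e) * np.log(o / e)))

rng = np.random.default_rng(7)
train_scores = rng.beta(2, 8, 50_000)   # score distribution at training time
stable = rng.beta(2, 8, 50_000)         # production sample, no drift
drifted = rng.beta(3, 6, 50_000)        # production sample after covariate shift

print(psi(train_scores, stable), psi(train_scores, drifted))
```

Because PSI is computed on distributions rather than labels, it can flag degradation before delayed fraud labels arrive, which is exactly the gap that short-window benchmarks cannot expose.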
- (iii) Emerging domains: public raw data does not automatically yield usable benchmarks.
In contrast to credit card benchmarks, insurance, P2P lending, and emerging cryptocurrency-related fraud still lack unified, standardized datasets [5,49,56,64]. For DeFi-related threats such as rug pulls and flash-loan attacks, raw on-chain transaction traces are publicly available, yet constructing representative labeled datasets remains technically challenging [52,56,57,63,64,65,66]. Two issues are particularly limiting. First, “ground truth” labels are often incident-driven and curated from heterogeneous sources (security reports, exploit lists, post-mortems), which leads to inconsistent labeling criteria and incomplete coverage across studies. Second, defining negative samples is non-trivial: many benign on-chain behaviors can look superficially similar to attacks (e.g., arbitrage, liquidation, MEV-driven bursts). Naïve negative sampling therefore introduces label noise and can inflate offline metrics without improving true detection capability. Moreover, DeFi ecosystems evolve rapidly (protocol upgrades, migrations, multi-chain fragmentation), so benchmarks built around fixed incident sets can become stale quickly. These constraints push many studies toward fragmented case-based evaluation instead of standardized training and testing pipelines, which limits meaningful cross-study comparisons (Table 8).
Overall, Table 8 highlights that while established datasets exist for certain traditional fraud types, many other domains—especially insurance fraud, online lending, and novel cryptocurrency-related frauds—still lack standardized, large-scale, and multimodal benchmarks. This disparity helps explain why reported advances are often difficult to compare across studies and why performance may not translate smoothly across deployment settings [5,54,55,56,60,61,63].
- c. Toward High-Quality Datasets.
To build more reliable benchmarks, dataset development should not only increase sample size but also improve technical realism along three dimensions: time, labels, and feature/graph construction. First, temporal design should reflect deployment: benchmarks should adopt chronological splits, support sliding-window evaluation, and report how performance and calibration drift over time, rather than relying on a single static split [14,23,109]. Second, labeling pipelines should be treated as part of the dataset specification: label delay, investigation selection bias, and incident definition criteria (especially for emerging crypto/DeFi fraud) should be documented explicitly so that reported metrics can be interpreted in context [48,64]. Third, for relational and DeFi-style fraud, benchmark construction should state entity resolution assumptions (address clustering, contract identity, cross-platform mapping) and define event boundaries (start/end time of an incident), because these choices directly affect both features and leakage risks during evaluation [20,51,54,63,66].
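The chronological, sliding-window evaluation recommended here can be sketched as follows. The window sizes are arbitrary, the labels are synthetic with an injected upward prevalence drift, and no model is actually fitted; the point is only the split design, which guarantees that every test window lies strictly after its training window.

```python
import numpy as np

def sliding_windows(n, train_size, test_size, step):
    """Yield chronological (train_idx, test_idx) pairs over n time-ordered rows."""
    start = 0
    while start + train_size + test_size <= n:
        yield (np.arange(start, start + train_size),
               np.arange(start + train_size, start + train_size + test_size))
        start += step

rng = np.random.default_rng(1)
n = 5000
t = np.arange(n)
# Synthetic labels whose fraud prevalence drifts upward over time (prior drift).
y = (rng.random(n) < 0.01 + 0.04 * t / n).astype(int)

rows = []
for tr, te in sliding_windows(n, train_size=2000, test_size=500, step=500):
    # A real pipeline would fit a model on rows tr here; this sketch only
    # tracks how the label distribution drifts between train and test windows.
    rows.append((int(te[0]), int(te[-1]), float(y[tr].mean()), float(y[te].mean())))

for start, end, train_prev, test_prev in rows:
    print(f"test [{start},{end}]: train prevalence {train_prev:.3f}, "
          f"test prevalence {test_prev:.3f}")
```

Reporting one result per window, rather than one number for a random split, is what makes calibration and prevalence drift visible to readers of a benchmark.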
Building on the limitations identified above, Table 9 outlines a structured and actionable roadmap for improving fraud detection datasets by linking common data issues to corresponding technical solutions and evaluation strategies. This framework is intended to move dataset discussion beyond high-level statements (e.g., “imbalanced” or “private”) and toward reproducible remedies and transparent reporting practices. To further strengthen reproducibility across studies, the research community would benefit from standardized data-sharing infrastructures. Establishing open repositories with unified feature definitions, annotation protocols, and preprocessing guidelines could significantly reduce inconsistencies across case-specific datasets. Comparable efforts in other AI domains, such as standardized dataset hubs for natural language processing, demonstrate how coordinated data governance can accelerate benchmarking and cross-study comparability. Future financial fraud datasets should therefore prioritize structured metadata documentation, multi-source integration capability, and version-controlled update mechanisms to better support longitudinal fraud research and model validation.
This problem–solution–evaluation framework provides a systematic roadmap for guiding future dataset development in financial fraud detection. By explicitly linking dataset issues with technical remedies and robust evaluation metrics, it not only addresses the gaps identified in Section 4.4 but also emphasizes practical strategies to ensure realism, interpretability, and generalizability in future benchmarks.
Although the reviewed studies collectively address the core research questions concerning fraud types, modeling paradigms, and data characteristics, the synthesis of findings also exposes a number of persistent gaps that remain insufficiently explored.
Q5: What are the primary limitations and gaps in the existing research? Which avenues for further research are recommended?
4.5. Research Gaps and Future Research Directions
Despite the progress highlighted in Section 4.1, Section 4.2, Section 4.3 and Section 4.4, several persistent challenges remain, particularly in emerging domains such as decentralized finance (DeFi) and multimodal fraud scenarios. These challenges arise from both modeling limitations and structural issues in data and evaluation design.
Interpretability beyond feature importance.
Interpretability has been investigated from multiple perspectives, including feature attribution methods such as SHAP and LIME [118,119], post hoc explanation frameworks for black-box models [119], and regulatory-oriented explainability requirements in financial systems [120,121]. While these approaches provide useful insights into feature relevance, they offer limited support for understanding the underlying mechanisms of complex fraud behaviors. This limitation becomes especially evident in scenarios involving sequential decision processes or programmatic execution, such as money laundering chains or smart contract exploits. In such contexts, feature-level explanations alone are insufficient to capture the temporal and causal structure of fraudulent activities. A deeper notion of interpretability—one that connects model outputs to interpretable representations of event sequences and decision logic—remains largely underexplored. Future research may benefit from explanation frameworks that can be evaluated against expert judgments in realistic audit and investigation settings.
In addition, existing interpretability studies rarely consider how explanation outputs are operationalized within real-world fraud detection systems, where explanations must support regulatory audits, analyst review workflows, and time-critical decision making. Bridging the gap between explanation methods and their system-level integration therefore represents an important yet underexplored research direction.
Joint modeling of on-chain and off-chain signals.
Emerging fraud patterns increasingly span heterogeneous data modalities, including transaction graphs, market indicators, project documentation, and social media or community discussions. However, most existing approaches analyze these information sources in isolation, which limits their ability to detect fraud behaviors that only emerge through cross-modal interactions. Although several studies have begun exploring hybrid or multimodal representations, systematic frameworks for jointly modeling semantic text, temporal transaction data, and structural graph features remain limited. Moreover, the generalization of such multimodal models across different fraud types and platforms has not been sufficiently examined [10,120].
Synthetic and simulation-driven data for rare events.
As identified in Section 4.4, a major bottleneck in emerging fraud detection is the scarcity of labeled data for rare or novel attack patterns. Traditional oversampling techniques, such as SMOTE, provide limited benefits when new fraud strategies lack close historical analogues [14]. Under current data constraints, there is increasing interest in simulation-based and synthetic data generation approaches that incorporate domain-informed assumptions about adversarial behavior. However, the effectiveness of such synthetic benchmarks depends critically on their realism and their ability to transfer to real-world detection tasks. Rigorous evaluation against real fraud cases therefore remains essential.
Collaborative and cross-domain learning.
Fraud activities frequently transcend institutional, platform, and even jurisdictional boundaries, while available datasets are typically confined to isolated organizations or individual platforms. As discussed in Section 4.4, this fragmentation limits model generalization and hampers the detection of coordinated or cross-platform fraud schemes. Although privacy-preserving learning paradigms such as federated learning have been explored in traditional financial settings, extending these approaches to heterogeneous and decentralized environments introduces additional challenges. Future research may explore cross-domain learning protocols that balance privacy preservation with the need to capture shared fraud patterns across institutions and platforms.
Standardized evaluation and benchmarking frameworks.
Across both traditional and emerging fraud domains, the absence of widely adopted benchmarks remains a fundamental obstacle to reproducibility and fair comparison. Current studies employ diverse evaluation metrics, dataset splits, and validation strategies, making it difficult to assess relative performance under realistic conditions. In particular, there is no consensus on how to evaluate models in the presence of severe class imbalance, temporal concept drift, and adversarial adaptation [122]. More comprehensive and standardized evaluation frameworks are therefore needed to reflect real operational constraints and deployment requirements.
5. Discussion
The studies reviewed in this work collectively demonstrate both substantial progress and persistent structural challenges in AI-based financial fraud detection. Across methodological choices, data availability, and application contexts, several cross-cutting patterns emerge that help explain the current research landscape and the factors shaping research priorities.
One notable observation is the uneven distribution of research attention across fraud types. Domains such as credit card fraud and cryptocurrency-related schemes dominate the literature, largely due to the availability of public or semi-public datasets that facilitate benchmarking and comparative evaluation. In contrast, insurance fraud, loan fraud, and several decentralized finance (DeFi)-related threats remain comparatively underexplored despite their significant economic impact. As discussed in Section 4.4, data accessibility continues to shape benchmark-driven research progress while also constraining investigation into institution-specific or emerging fraud scenarios.
Fraud detection in blockchain-based ecosystems introduces additional technical complexities beyond those encountered in traditional financial systems. Pseudonymity of wallet addresses obscures direct identity linkage, while transaction transparency paradoxically coexists with limited contextual metadata. Furthermore, effective analysis often requires integrating on-chain transaction graphs with off-chain behavioral or exchange-level information. These characteristics have motivated the development of hybrid detection frameworks that combine graph neural networks with advanced anomaly detection algorithms and blockchain-specific feature engineering. Such approaches leverage relational transaction patterns, temporal interaction dynamics, and smart contract behavior to improve detection capability in decentralized financial environments.
From a methodological perspective, the field has gradually transitioned from traditional machine learning toward ensemble learning, deep neural networks, and graph-based approaches capable of capturing nonlinear dependencies, temporal transaction dynamics, and relational structures. However, these advances introduce important trade-offs. Increased model complexity often reduces interpretability and operational transparency, both of which are critical requirements in regulated financial environments such as banking and anti-money laundering compliance. Consequently, improvements in predictive performance do not necessarily translate into deployment readiness or regulatory acceptance.
Dataset characteristics continue to exert a strong influence on research outcomes and evaluation practices. Widely adopted benchmarks, including the Kaggle Credit Card and Elliptic Bitcoin datasets, have enabled reproducible experimentation and standardized comparison across studies. Nevertheless, these datasets remain constrained by class imbalance, feature anonymization, and limited temporal coverage. For emerging fraud domains, datasets are frequently fragmented, case-specific, or platform-dependent, restricting cross-study comparability and weakening the external validity of reported results. These structural limitations help explain why models achieving strong performance in controlled experimental settings often face robustness and generalization challenges in real-world deployments.
Different imbalance mitigation strategies also present distinct trade-offs in fairness and robustness. Oversampling methods can enhance minority representation but may introduce overfitting or unrealistic synthetic patterns, particularly in highly sparse fraud scenarios. Undersampling, while computationally efficient, risks discarding informative legitimate transaction behavior that is critical for reducing false positives. In contrast, algorithm-level approaches such as focal loss and cost-sensitive optimization preserve the original data distribution while emphasizing high-risk samples during training. Emerging comparative studies suggest that hybrid strategies combining moderate resampling with adaptive loss functions often achieve more stable performance across varying fraud prevalence rates. Establishing standardized evaluation protocols for imbalance mitigation remains an important direction for improving cross-study comparability.
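The cost-sensitive alternative mentioned above can operate purely at the decision layer. Assuming calibrated probabilities and purely illustrative costs, the cost-minimizing decision threshold is C_fp / (C_fp + C_fn), which leaves the training distribution untouched, in contrast to resampling:

```python
import numpy as np

# Illustrative, asymmetric misclassification costs (assumed, not empirical).
C_FN, C_FP = 500.0, 5.0
# With calibrated probabilities, the cost-minimizing decision threshold is
# C_fp / (C_fp + C_fn); the data distribution itself is never resampled.
threshold = C_FP / (C_FP + C_FN)   # ~0.0099, far below the default 0.5

rng = np.random.default_rng(3)
y = (rng.random(20_000) < 0.005).astype(int)          # ~0.5% fraud
# Roughly calibrated synthetic scores: fraud sits at higher probabilities.
p = np.where(y == 1, rng.beta(2, 10, y.size), rng.beta(1, 200, y.size))

def total_cost(y_true, scores, t):
    pred = scores >= t
    fn = int(((y_true == 1) & ~pred).sum())
    fp = int(((y_true == 0) & pred).sum())
    return fn * C_FN + fp * C_FP

print(total_cost(y, p, 0.5), total_cost(y, p, threshold))
```

Because the decision rule, not the data, carries the cost asymmetry, this approach sidesteps the synthetic-pattern and information-loss risks of over- and undersampling, though it does depend on probability calibration holding under drift.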
Interpretability remains a critical dimension influencing both regulatory acceptance and operational usability. Existing studies predominantly focus on feature-level explanations that attribute model predictions to individual transaction or account attributes. While effective for relatively static fraud scenarios such as credit card or consumer loan fraud, feature-level explanations provide limited insight into complex fraud schemes involving sequential behaviors and multi-entity interactions. For fraud types such as money laundering networks, coordinated cryptocurrency scams, and DeFi exploits, behavior-level or sequence-based explanations become essential for understanding why transaction patterns are classified as suspicious. Developing scalable and reliable behavioral explanation mechanisms therefore remains an open challenge, particularly for graph-based and sequential models operating in adversarial and rapidly evolving environments.
Beyond interpretability, fraud detection research is increasingly shaped by institutional, ethical, and regulatory constraints. Requirements related to fairness, accountability, and auditability are now widely recognized as prerequisites for deployment in regulated financial sectors. Data-sharing restrictions have also motivated exploration of privacy-preserving techniques, including federated learning and secure multi-party computation, as potential solutions to data silos. However, empirical evidence demonstrating their effectiveness in heterogeneous and non-independent and identically distributed financial environments remains limited. The emergence of generative artificial intelligence and large language models further illustrates this tension. While these techniques enable analysis of unstructured narratives, contractual documents, and online communications, their limited transparency and susceptibility to adversarial manipulation introduce additional governance challenges.
Practical deployment constraints represent another critical yet frequently underemphasized dimension in academic fraud detection research. Real-world fraud detection systems must operate under strict latency requirements, regulatory oversight, and computational resource limitations. In real-time transaction screening scenarios, decisions often need to be made within millisecond-level latency budgets, creating a fundamental tension between predictive complexity and inference efficiency. Although deep neural networks and graph-based models effectively capture complex temporal and relational patterns, their substantial computational overhead often limits direct deployment without architectural simplification. To address these computational and latency-related operational constraints, recent studies increasingly emphasize model compression techniques, including structured pruning, parameter quantization, and knowledge distillation. These approaches aim to reduce computational complexity and memory footprint while preserving predictive accuracy. In addition to compression-based optimization, lightweight architectural designs, such as shallow ensemble cascades and early-exit neural networks, have also been investigated to further reduce inference latency in real-time fraud detection scenarios. When carefully implemented, these techniques enable real-time fraud detection pipelines to meet sub-second response requirements while maintaining acceptable predictive performance and operational transparency.
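Knowledge distillation, one of the compression techniques mentioned above, can be reduced to its loss function for illustration. The logits, temperature, and mixing weight below are invented, and a real pipeline would embed this loss in a training loop for the student model.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, y_true, T=4.0, alpha=0.7):
    """alpha * temperature-softened KL to the teacher (scaled by T^2)
    plus (1 - alpha) * ordinary cross-entropy on the hard labels."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1))) * T * T
    hard = float(-np.mean(np.log(
        softmax(student_logits)[np.arange(len(y_true)), y_true])))
    return alpha * soft + (1 - alpha) * hard

teacher = np.array([[4.0, -2.0], [-3.0, 3.5]])   # confident large model (invented logits)
student = np.array([[1.0, -0.5], [-0.8, 1.0]])   # smaller, less confident model
y = np.array([0, 1])
print(distill_loss(student, teacher, y))
```

The temperature softens the teacher's near-one-hot outputs so the student also learns the relative plausibility of the non-target class, which is where much of the teacher's learned structure resides.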
Beyond computational efficiency, the dynamic and adversarial nature of fraud behavior introduces persistent concept drift that further challenges static deployment strategies. Fraud patterns continuously evolve as attackers adapt to detection mechanisms, rendering periodic retraining insufficient for maintaining long-term model effectiveness. Consequently, research is increasingly shifting toward adaptive learning paradigms. Continual learning frameworks enable models to incrementally update decision boundaries while attempting to mitigate catastrophic forgetting of historical fraud patterns. Similarly, online ensemble techniques dynamically adjust model composition based on recent transaction distributions, thereby improving resilience to temporal behavioral shifts. Despite their advantages, adaptive learning approaches introduce additional complexity in model monitoring, stability assurance, and regulatory validation, highlighting the necessity for robust drift detection and model governance mechanisms.
Closely related to both latency and model adaptability considerations is the architectural design of operational fraud detection systems. In real-world financial environments, fraud detection is rarely implemented as a single-model solution. Instead, multi-stage detection architectures are widely adopted to balance efficiency and analytical depth. Lightweight classifiers or ensemble models are typically deployed for real-time risk scoring and transaction blocking, whereas computationally intensive sequential or graph-based models are reserved for risk escalation, forensic analysis, or post-transaction investigation. Additionally, managing false positives represents a critical operational challenge, as excessive false alarms can generate customer friction and increase manual review costs. Financial institutions therefore often employ conservative decision thresholds combined with downstream human verification processes. These practical requirements explain the increasing adoption of hybrid detection frameworks integrating real-time screening modules with advanced analytical subsystems, although such system-level architectures remain insufficiently examined in academic literature.
Interpretability further intersects with deployment feasibility and regulatory compliance. Widely adopted post hoc explanation techniques, including SHAP and LIME, have substantially improved transparency in complex fraud detection models. However, these explanation approaches may exhibit instability under adversarial perturbations or distributional shifts, potentially limiting their reliability in high-stakes financial environments subject to regulatory scrutiny. In response, recent research increasingly advocates integrating inherently interpretable modeling approaches, such as Explainable Boosting Machines and rule-based ensemble models, which provide more consistent and auditable decision rationales. Moreover, counterfactual explanation frameworks are gaining attention for their ability to generate actionable insights by identifying minimal feature changes required to alter model predictions. The integration of stable and actionable interpretability mechanisms is therefore becoming a critical component in strengthening auditability and enhancing stakeholder trust in AI-driven fraud detection systems.
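The counterfactual idea mentioned above can be made concrete with a toy sketch: given a simple linear fraud scorer, find the smallest single-feature change that pushes the score just below the decision threshold. The scorer, weights, and feature names are invented for illustration; real counterfactual frameworks additionally enforce plausibility, feature-scale normalization, and actionability constraints omitted here.

```python
FEATURES = ["amount", "n_recent_txs", "account_age_days"]
WEIGHTS = {"amount": 0.004, "n_recent_txs": 0.05, "account_age_days": -0.001}
BIAS, THRESHOLD = -0.5, 0.5

def score(x):
    """Hypothetical linear fraud scorer; higher means more suspicious."""
    return BIAS + sum(WEIGHTS[f] * x[f] for f in FEATURES)

def counterfactual(x, threshold=THRESHOLD):
    """Smallest single-feature change (by raw magnitude) that drops the
    score just below the decision threshold."""
    gap = score(x) - threshold
    if gap <= 0:
        return {}  # already classified as legitimate
    deltas = {f: -(gap + 1e-6) / w for f, w in WEIGHTS.items() if w != 0}
    best = min(deltas, key=lambda f: abs(deltas[f]))
    return {best: deltas[best]}

flagged = {"amount": 300, "n_recent_txs": 2, "account_age_days": 30}
cf = counterfactual(flagged)
print("flagged score:", round(score(flagged), 3), "| counterfactual:", cf)
```

Here the explanation is actionable by construction: it names one feature and the exact change to it that would alter the prediction, which is the kind of decision rationale the paragraph argues is easier to audit than unstable post hoc attributions.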
Taken together, these findings suggest that meaningful progress in AI-based financial fraud detection cannot rely solely on predictive performance improvements. Future advancements will require coordinated attention to computational efficiency, adaptive learning capability, system-level architectural design, interpretability robustness, regulatory compliance, and dataset quality. Addressing these interconnected challenges is essential for developing fraud detection systems that are not only accurate but also stable, transparent, and deployable in high-stakes financial environments.
6. Conclusions
This review has provided a comprehensive synthesis of recent advances in AI-based financial fraud detection, covering both traditional and emerging fraud types, methodological developments, and dataset characteristics. By systematically examining prior studies, this work highlights how artificial intelligence—particularly deep learning, graph neural networks, and hybrid approaches—has reshaped fraud detection research while also revealing persistent challenges that continue to constrain real-world deployment.
Several key observations emerge from this analysis. Research attention has been disproportionately concentrated in domains supported by publicly available datasets, such as credit card fraud and certain cryptocurrency-related schemes, leaving economically significant areas like DeFi-specific attacks relatively underexplored. Methodological progress has led to substantial gains in predictive performance, yet these gains are often accompanied by reduced interpretability, limited robustness to distributional shifts, and constraints on deployment in regulated financial environments. Dataset-related issues, including severe class imbalance, inconsistent labeling, and fragmented benchmarks, continue to restrict reproducibility and hinder the detection of rare or emerging fraud patterns.
To better bridge the gap between academic innovation and industrial deployment, this review highlights three concrete directions for future research. First, the absence of standardized multimodal benchmarks continues to limit comparability; future datasets must jointly reflect transactional, relational, and unstructured information (e.g., smart contract code). Second, the practical deployment of graph-based and sequential models requires greater attention to real-time operation, including lightweight architectures for scalability and latency-aware system design. Finally, evaluation practices must evolve toward regulatory-aligned frameworks that explicitly account for domain-aware interpretability, accountability, and operational risk.
Beyond methodological advancements, sustainable progress in financial fraud detection requires stronger cross-sector collaboration. Establishing open benchmarking platforms and community-driven evaluation challenges involving academia, industry practitioners, and regulatory bodies could significantly accelerate dataset sharing and methodological standardization. Such initiatives would facilitate transparent performance comparison, encourage reproducibility, and promote the development of regulatory-aligned evaluation frameworks. Building collaborative ecosystems around standardized fraud benchmarks may ultimately enhance both scientific innovation and real-world deployment effectiveness.
In summary, artificial intelligence has demonstrated substantial potential in enhancing financial fraud detection across a wide range of scenarios. However, the findings of this review suggest that long-term progress in this field will depend not only on methodological innovation but also on how effectively such advances are aligned with data realism, ethical considerations, and practical deployment constraints. Collectively addressing these challenges is essential for advancing fraud detection systems that are not only accurate but also reliable and suitable for high-stakes financial environments.