1. Introduction
The rapid expansion of the digital economy has fostered financial innovation while simultaneously opening novel avenues for fraudulent activities. Consumers and financial institutions worldwide have suffered substantial losses as fraud schemes become increasingly diverse and concealed [1]. A 2022 study estimated that financial fraud costs billions of dollars worldwide every year, with losses in the US alone exceeding $400 billion [2]. As fraud tactics grow more complex, rule-based detection systems become less effective [3].
Owing to its strong capabilities in pattern recognition and predictive analytics, artificial intelligence (AI), particularly machine learning (ML) and deep learning (DL), has become a key enabling technology in modern financial fraud detection systems [4,5]. Many studies have demonstrated the effectiveness of AI across fraud detection settings, including credit card transactions and insurance claims.
While recent systematic reviews have examined AI in financial fraud detection [6,7,8,9], they predominantly focus on general machine learning efficacy or on traditional banking and accounting sectors. Specifically, Ali et al. [10] and Kamuangu [11] provide comprehensive analyses of standard ML classifiers, while Bao et al. [12] concentrate on financial statement auditing. However, a holistic analysis that explicitly examines the technical characteristics of DeFi-native attacks (e.g., rug pulls and flash loan exploits) and discusses systematic approaches to dataset construction remains limited. Moreover, systematic reviews and comparative analyses are still rare, even though problems such as dataset imbalance and accessibility have long been recognized [11]. These gaps underline the need for a thorough and current review spanning methodological developments, dataset characteristics, traditional and emerging domains, and future directions. Such a review is particularly valuable for guiding the design and deployment of practical fraud detection systems in real-world financial environments.
This review provides a comprehensive overview of AI applications in financial fraud detection, covering traditional fraud cases alongside emerging areas such as cryptocurrency scams, DeFi attacks, and NFT rug pulls. It synthesizes recent methodological advancements, including natural language processing, deep learning, and graph neural networks. In addition, it systematically analyzes commonly used datasets and evaluation metrics, and discusses emerging research directions such as explainable AI, federated learning, and blockchain integration.
To clearly distinguish this survey from existing reviews on financial fraud detection, the main contributions of this work are summarized as follows:
A structured and task-driven taxonomy of AI-based fraud detection methods, organizing prior studies according to fraud types, data modalities, and modeling paradigms, rather than providing a purely algorithm-centric overview.
A comprehensive analysis of datasets and data-related challenges, including a roadmap of commonly used public datasets, labeling limitations, and data scarcity issues, with particular emphasis on emerging and decentralized financial environments.
Dedicated coverage of emerging fraud scenarios, such as cryptocurrency-related fraud, rug pulls, and flash loan attacks, highlighting their unique characteristics, data requirements, and modeling challenges that are not adequately addressed in traditional fraud detection surveys.
A critical discussion of evaluation challenges and open research issues, extending beyond predictive accuracy to include robustness, interpretability, scalability, and adversarial considerations in real-world financial systems.
To make these contributions clear at a glance, Table 1 contrasts this review with representative existing surveys.
As summarized in Table 1, this review offers broader coverage and a more integrated methodological perspective than existing surveys. Figure 1 presents a conceptual synthesis framework of this review, illustrating how findings from the surveyed literature are systematically organized across fraud categories, datasets, methodologies, evaluation metrics, and future research directions.
2. Related Research
Both industry and academia have shown increasing interest in applying artificial intelligence (AI) techniques to financial fraud detection in recent years. This section reviews representative studies that use machine learning, deep learning, graph-based modeling, and natural language processing (NLP) across traditional and emerging fraud scenarios. Rather than only listing reported results, we also comment on why particular methods tend to work (or break) in specific fraud settings, and we relate these findings to practical constraints that shape real deployments, such as extreme class imbalance, delayed or selective supervision, temporal drift, scalability, and auditability.
2.1. Machine Learning Methods
For classical machine learning, Awoyemi et al. [13] compared several algorithms for credit card fraud detection and found that k-nearest neighbors outperformed Naïve Bayes and logistic regression. One reason k-nearest neighbors can appear strong is that it does not impose a restrictive functional form: if fraudulent transactions form small and irregular pockets in feature space, an instance-based decision rule can capture such local structure without assuming linear separability. At the same time, its performance is tightly coupled to representation choices. In real transaction data, features are often high-dimensional, sparse, and partly categorical, and the induced distance can become less informative, making “nearest” neighbors noisy. The method is also sensitive to temporal drift: neighbors drawn from historical behavior may stop being relevant once fraud strategies shift. Beyond these modeling aspects, kNN also raises practical concerns for large-scale deployment because exact neighbor search can be expensive at high throughput, often requiring approximate indexing and careful system engineering.
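The representation sensitivity discussed above is easy to demonstrate. The following minimal sketch uses hand-crafted toy data (not the dataset of [13]) in which fraud forms a small pocket along a low-variance feature: without standardization, the large-scale amount feature dominates the Euclidean distance and kNN misses the pocket.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hand-crafted toy data (illustrative only): feature 0 is a transaction
# amount on a large numeric scale, feature 1 a behavioral score on a
# small one. Fraud forms a local pocket at score ~5, regardless of amount.
X = np.array([[a, 0.0] for a in range(0, 100, 10)]      # 10 legitimate
             + [[2.0, 5.0], [52.0, 5.0], [82.0, 5.0]])  # 3 fraudulent
y = np.array([0] * 10 + [1] * 3)

query = np.array([[51.0, 5.0]])  # unseen transaction inside the pocket

raw = KNeighborsClassifier(n_neighbors=3).fit(X, y)
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=3)).fit(X, y)

# On raw features the amount axis dominates the distance, so two of the
# query's three neighbors are legitimate; after standardization the
# low-variance score axis contributes equally and all three are fraudulent.
print(raw.predict(query))     # -> [0]  missed
print(scaled.predict(query))  # -> [1]  detected
```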
While [13] provides an early comparative analysis in this domain, its evaluation setting still departs from operational conditions in several technical respects. First, the benchmark covers a limited time span, which makes it difficult to assess robustness under temporal drift. Second, class rebalancing through resampling (e.g., under- and/or over-sampling) can make training feasible, but it also alters the effective class prior; as a result, metrics and model rankings observed under resampled distributions may not translate directly to production alerting behavior, where base rates are extremely low and calibration matters. Finally, the PCA-anonymized feature space limits interpretability and provides limited insight into which signals would remain stable and actionable in real deployments.
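The effect of resampling on the class prior can be made concrete with the standard Bayes correction for uniform undersampling. The sketch below is illustrative only: it uses synthetic data, a logistic regression, and an assumed 1% base rate, and the correction formula presumes that negatives were subsampled uniformly at random with fraction beta.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic setup (illustrative): 1% fraud base rate, one Gaussian
# feature shifted by +1 for fraud.
n = 200_000
y = (rng.random(n) < 0.01).astype(int)
x = rng.normal(loc=y.astype(float), scale=1.0).reshape(-1, 1)

# Balanced training set: keep all positives plus an equal-size random
# subset of negatives. beta is the fraction of negatives retained.
pos = np.flatnonzero(y == 1)
neg = rng.choice(np.flatnonzero(y == 0), size=pos.size, replace=False)
idx = np.concatenate([pos, neg])
clf = LogisticRegression().fit(x[idx], y[idx])

p_s = clf.predict_proba(x)[:, 1]           # posterior under resampled prior
beta = pos.size / (y == 0).sum()
p = beta * p_s / (1.0 - p_s + beta * p_s)  # Bayes correction to true prior

print(p_s.mean())  # inflated far above the 1% base rate
print(p.mean())    # pulled back near the 1% base rate
```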
Dal Pozzolo et al. [14] moved closer to operational reality by proposing a learning strategy that treats feedback data and delayed supervision samples separately. This is important because confirmed labels in finance are commonly delayed (e.g., chargebacks), and the observed labels are shaped by review processes. Handling delayed supervision explicitly helps reduce the bias that arises when “not yet confirmed” cases are implicitly treated as legitimate during training. The main caveat is that the method assumes feedback is continuously available and sufficiently representative. In practice, feedback can be sporadic and policy-driven (e.g., limited review capacity concentrates attention on a subset of alerts). Under such selection effects, a model may end up fitting the institution’s investigation policy rather than fraud itself, and performance can shift when thresholds or review processes change.
In the insurance domain, Aslam et al. [15] proposed an auto insurance fraud detection framework that performs feature selection using the Boruta algorithm and then develops predictive models using logistic regression, support vector machines, and Naïve Bayes. Their results suggest that, even under the same feature set, different classical learners can behave quite differently across evaluation metrics, underscoring the sensitivity of claim fraud detection to model choice and metric selection. Nevertheless, the study is built on a publicly available dataset with a specific feature schema and label definition, and such design choices are often tied to local claim processes and investigation standards. This raises a practical question about how reliably learned decision rules would transfer across insurers, product lines, or jurisdictions where coding practices, fraud definitions, and review workflows differ. Finally, while classical models are generally viewed as more transparent than deep architectures, transparency becomes operationally meaningful only when explanations are stable at the case level and can be mapped to auditable claim factors; in practice, the combined effects of feature selection, preprocessing, and model-specific decision boundaries can still make regulatory justification non-trivial.
Overall, traditional ML methods remain attractive because they are relatively simple to implement, easier to control, and useful as baselines. Their weaknesses, however, align with two persistent properties of fraud data: severe class imbalance and concept drift. In addition, most tabular classifiers do not naturally exploit relational structure (shared devices, counterparties, coordinated rings) unless those dependencies are manually engineered into features.
2.2. Deep Learning Approaches
Deep learning approaches are often adopted to reduce manual feature engineering and to model nonlinear interactions and temporal dependencies more directly. Ghosh Dastidar et al. [16] proposed the Neural Aggregate Generator (NAG), which learns transaction-history representations end-to-end by combining feature embeddings with lightweight 1-D convolutional components that implement learnable aggregation over past transactions. They reported improved performance compared with generic CNN and LSTM baselines. Embeddings are particularly useful for high-cardinality categorical variables (e.g., merchant or channel identifiers), since dense representations make interaction learning feasible. The flip side is that such representations can become fragile when the categorical space evolves: new merchants, devices, or channels create cold-start behavior, and shifting semantics can degrade embeddings unless the model is updated regularly. In addition, performance gains from convolution-style components remain sensitive to how transaction histories are structured and aligned in the input; if the imposed structure does not match the underlying behavioral process, improvements can be dataset-specific and harder to transfer across institutions or products.
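The cold-start issue for high-cardinality categoricals is commonly handled with a reserved out-of-vocabulary index. The minimal sketch below (hypothetical merchant ids and embedding size; not the NAG implementation) shows the idea: an unseen merchant falls back to a shared vector instead of breaking the pipeline.

```python
import numpy as np

# Minimal sketch (hypothetical ids and sizes): dense embeddings for a
# high-cardinality categorical such as merchant id, with row 0 reserved
# as a shared out-of-vocabulary bucket for unseen ("cold start") values.
rng = np.random.default_rng(0)

vocab = {"m_1001": 1, "m_1002": 2, "m_1003": 3}        # built at training time
emb = rng.normal(scale=0.1, size=(len(vocab) + 1, 8))  # row 0 = OOV bucket

def lookup(merchant_id: str) -> np.ndarray:
    # Unknown ids map to index 0 rather than raising a KeyError.
    return emb[vocab.get(merchant_id, 0)]

known = lookup("m_1002")   # maps to its trained row
unseen = lookup("m_9999")  # cold-start merchant falls back to row 0
print(np.array_equal(unseen, emb[0]))  # True: shared fallback vector
```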
Forough et al. [17] proposed an ensemble of deep sequential models with a voting mechanism for detecting fraud in transaction sequences. Sequence models are a natural choice when fraud is expressed as a tactic over time: bursts of activity, probing transactions, or abrupt shifts in spending context. The cost is operational: long histories, multiple base models, and the voting stage increase implementation complexity and can pressure latency budgets in authorization-time settings. While [17] reports favorable time analysis, whether such ensembles can consistently meet strict sub-second constraints still depends on production-specific factors such as feature availability, batching strategy, and hardware limits.
More broadly, deep models can be effective when supervision is sufficiently rich and stable, but fraud labels are often delayed, noisy, and shaped by intervention policies. Without careful temporal evaluation and monitoring, deep models may overfit dataset artifacts and then degrade quietly in production. Interpretability remains another practical sticking point: in high-stakes financial environments, explanations typically need to be grounded in traceable, case-level evidence, which deep representations do not provide by default and often require additional design to support.
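A basic safeguard against such silent degradation is out-of-time evaluation. The sketch below uses synthetic timestamps and placeholder features to illustrate the key invariant of a temporal split, which a random shuffle would violate by leaking future behavior into training.

```python
import numpy as np

# Sketch of an out-of-time validation split on synthetic data: the model
# trains only on transactions up to a cut-off day and is evaluated on the
# later period, so drift between periods is actually measured.
rng = np.random.default_rng(7)
t = np.sort(rng.integers(0, 365, size=1_000))  # day-of-year per transaction
X = rng.normal(size=(1_000, 5))                # placeholder features
y = (rng.random(1_000) < 0.02).astype(int)     # ~2% synthetic fraud labels

cutoff = 300                                   # hold out the last ~2 months
train_mask = t <= cutoff
X_tr, y_tr = X[train_mask], y[train_mask]
X_te, y_te = X[~train_mask], y[~train_mask]

# Invariant of a temporal split: every training timestamp precedes every
# test timestamp.
print(t[train_mask].max() <= t[~train_mask].min())  # True
```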
2.3. Graph-Based Methods for Relational and Collective Fraud
Graph-based modeling is increasingly used for fraud types where relationships are the main signal, including collusion rings, mule networks, and multi-entity laundering patterns. Shi [18] proposed a Hierarchical Graph Attention Network (HGAT) that encodes both local and global structural information via multi-head self-attention, achieving improved results on public datasets. Attention-based aggregation is appealing in financial graphs because connectivity is often extremely uneven: prioritizing informative neighbors can matter more than averaging over large, noisy neighborhoods. The hierarchical design is also consistent with how fraud evidence can appear at multiple scales, from immediate counterparties to broader communities.
However, bringing GNN-style models into production is not straightforward. First, real transaction graphs are large and dynamic: edges arrive continuously, and multi-hop aggregation can quickly trigger neighbor explosion, pushing both training and inference costs upward. Second, graph construction is itself a modeling decision—what becomes a node, what becomes an edge, and how time is represented—and small choices here can change outcomes substantially. Time handling is especially delicate: if future-linked information leaks into training or evaluation, offline results can be inflated without any real operational value. Third, graph supervision is typically sparse and delayed, and labels tend to be concentrated in investigated regions of the graph. This makes training sensitive to sampling strategies and to selection bias, and it can reduce robustness when review policies or alerting thresholds shift. Finally, while attention weights can be informative, they do not automatically translate into audit-ready explanations; additional constraints or analysis is usually needed to turn learned structure into human-verifiable evidence. These issues, together with the need for large annotated graphs and higher computational complexity, remain major barriers to broad deployment.
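The time-handling pitfall can be illustrated with a minimal snapshotting sketch (hypothetical node and edge schema): any feature used for training must be computed on a graph restricted to edges observed before the training cut-off, never on the full graph.

```python
from dataclasses import dataclass

# Sketch (hypothetical schema): snapshot the transaction graph at the
# training cut-off BEFORE any neighborhood aggregation, so features never
# see edges that arrive after the labels being predicted.

@dataclass(frozen=True)
class Edge:
    src: str  # e.g., account or wallet id
    dst: str
    t: int    # transaction timestamp

edges = [
    Edge("a", "b", 10), Edge("b", "c", 20),
    Edge("a", "d", 35), Edge("d", "e", 50),  # arrives after the cut-off
]

def snapshot(edges, cutoff):
    """Undirected adjacency built only from edges observed up to cutoff."""
    adj = {}
    for e in edges:
        if e.t <= cutoff:
            adj.setdefault(e.src, set()).add(e.dst)
            adj.setdefault(e.dst, set()).add(e.src)
    return adj

train_graph = snapshot(edges, cutoff=40)  # view used to compute features
full_graph = snapshot(edges, cutoff=60)   # evaluation-time view

# Node "e" is reachable only through a future edge, so it must be absent
# from every neighborhood computed on the training snapshot.
print("e" in train_graph)  # False
print("e" in full_graph)   # True
```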
2.4. Natural Language Processing (NLP) Techniques
NLP becomes relevant when unstructured text is central to the fraud signal, such as financial disclosures, narrative reports, or social media coordination. Li et al. [19] used NLP for financial statement fraud detection and reported notable gains when narrative disclosures were incorporated alongside traditional features. A clear advantage of NLP in this context is that it can capture linguistic cues that are awkward to hand-code (inconsistent phrasing, unusual hedging, selective disclosure). Yet financial narratives are also heavily templated. Models can end up learning issuer style, sector conventions, or formatting artifacts rather than fraud intent. Long documents add another practical limitation: truncation, uneven distribution of signal across sections, and the frequent mixing of text with numeric statements all make it difficult for standard text encoders to reliably preserve the subtle cues that matter.
In cryptocurrency settings, Mirtaheri et al. [20] examined pump-and-dump schemes coordinated on social platforms by combining NLP with social network analysis. This is technically well motivated because coordination can show up in language and group structure before market anomalies fully develop. Even so, much of the NLP-focused literature still concentrates on a narrow set of scenarios and often evaluates models in modality-specific pipelines [20]. In practice, text-only detectors can be brittle under platform migration, rapid slang shifts, and adversarial phrasing. Robust detection typically requires aligning text with transactional and relational evidence, but multimodal alignment is not trivial: the modalities differ in timing, granularity, and noise, and naïve fusion can produce unstable behavior across domains.
2.5. Dataset Characteristics and Research Implications
Dataset characteristics and evaluation design often decide whether improvements are meaningful beyond a controlled benchmark. Choi et al. [21] surveyed public corpora and found that many datasets are small, feature-limited, and lack temporal depth. These constraints help explain why models can look strong offline yet degrade under real-world conditions where attackers adapt and labels arrive late. To mitigate scarcity, later studies explored synthetic data generation with generative adversarial networks [22] and cross-domain transfer learning [23,24]. These directions are promising, but they also introduce new risks. Synthetic data must be checked for fidelity (rare but operationally important fraud modes must not be washed out), and privacy concerns must be addressed [25]. Transfer learning can reduce labeling effort, but differences in label definitions, feature semantics, and operational policies across organizations can cause negative transfer if not handled carefully.
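One simple fidelity check compares the marginal distributions of real and synthetic features, for example with a two-sample Kolmogorov-Smirnov statistic. The sketch below uses illustrative stand-in distributions; a real audit would also examine joint structure and specifically the rare fraud modes mentioned above.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)

# Illustrative stand-ins: "real" transaction amounts are heavy-tailed; a
# poor generator produces a thin-tailed approximation that washes out the
# rare large-amount region where much of the fraud signal lives.
real = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)
synthetic_good = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)
synthetic_bad = rng.normal(loc=real.mean(), scale=real.std() / 3, size=5_000)

# Two-sample Kolmogorov-Smirnov distance per marginal feature: a small
# statistic means the synthetic marginal tracks the real one closely.
stat_good = ks_2samp(real, synthetic_good).statistic
stat_bad = ks_2samp(real, synthetic_bad).statistic
print(stat_good < stat_bad)  # True: the thin-tailed generator is flagged
```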
In conclusion, although AI techniques have shown considerable promise in financial fraud detection, much of the literature still relies on limited datasets, narrow validations, or model choices that emphasize offline accuracy without fully confronting deployment realities such as drift, delayed supervision, scalability, and auditability. These gaps point to the need for fraud detection approaches that remain reliable and transparent under evolving fraud patterns and real operational constraints.
3. Research Methods
This study examines the current state of artificial intelligence (AI) applications in financial fraud detection using a systematic literature review (SLR) methodology. The research process is guided by the PRISMA [26] framework, with a focus on study identification, screening, eligibility assessment, and reporting transparency. The review procedure consists of the following key steps:
- (1)
Research Question Formulation: This review is structured around the following five core questions:
Q1: What are the traditional and emerging types of financial fraud?
Q2: How is artificial intelligence currently being used in conventional fraud detection, such as credit card and insurance fraud, and how do various techniques compare in performance?
Q3: What are the potential applications, key challenges, and emerging research opportunities of AI for identifying novel fraud types, such as cryptocurrency-related fraud and flash loan attacks?
Q4: What are the boundaries of the public datasets that are currently available in this field? How can we create better datasets to propel new advancements?
Q5: What are the primary limitations and gaps in the existing research? Which avenues for further research are recommended?
- (2)
Search strategy: Relevant studies were retrieved from major academic databases, including Web of Science, IEEE Xplore, ScienceDirect, and Scopus. Search terms included: (“financial fraud” or “financial fraud detection” or “credit card fraud” or “insurance fraud” or “loan fraud” or “money laundering” or “cryptocurrency fraud” or “online loan fraud” or “rug pulls” or “flash loan attacks”) and (“artificial intelligence” or “machine learning” or “deep learning” or “graph neural network” or “GNN” or “natural language processing” or “federated learning” or “explainable AI” or “XAI”). The search was limited to English-language journal and conference papers published since 2015. Because the large-scale application of deep learning, graph neural networks, and other advanced AI techniques in fraud detection emerged after 2015, this review takes 2015 as the starting year to capture the most relevant and technologically significant advances. The search and screening process was independently reviewed to ensure consistency and completeness.
- (3)
Eligibility criteria:
The following inclusion and exclusion criteria were applied:
Inclusion criteria:
- (i)
Studies explicitly applying artificial intelligence techniques (including ML, DL, GNNs, NLP, or hybrid approaches) to financial fraud detection, such as credit card fraud, loan fraud, insurance fraud, anti-money laundering, cryptocurrency fraud, and DeFi-specific attacks (e.g., rug pulls or flash loan exploits);
- (ii)
Studies presenting experimental, benchmark, or real-world evaluation results, including datasets and evaluation metrics;
- (iii)
Publications in peer-reviewed journals or conferences published between 2015 and 2025;
- (iv)
Full-text available in English.
Exclusion criteria:
- (i)
Editorials, commentaries, or opinion pieces without methodological contribution;
- (ii)
Studies unrelated to financial fraud (e.g., general intrusion detection without a financial context);
- (iii)
Duplicate records;
- (iv)
Studies with inaccessible full text or insufficient methodological details for quality assessment;
- (v)
Non-English publications.
- (4)
Study selection and screening:
After initial screening, potentially eligible studies were assessed at the full-text level based on the predefined inclusion and exclusion criteria. The study selection process was documented in accordance with PRISMA and is summarized in the flow diagram (Figure 2).
- (5)
Quality assessment criteria:
Using the quality evaluation criteria listed in Table 2, we screened the chosen studies and assessed their methodological quality. The final set of included studies consists of 122 papers covering a wide range of financial fraud scenarios, including credit card fraud, insurance fraud, loan fraud, anti-money laundering, and cryptocurrency-related fraud.
The quality assessment criteria adopted in this review (Table 2) are specifically designed to reflect the methodological characteristics and practical constraints of financial fraud detection research. Fraud detection datasets are typically highly imbalanced, with fraudulent cases representing only a small fraction of observed instances. Consequently, criteria related to data description, evaluation metrics, and validation strategies are essential for distinguishing studies that adequately address class imbalance from those relying on overly optimistic performance reporting.
In addition, fraud patterns are known to evolve over time due to regulatory changes, user behavior shifts, and adversarial adaptation. Quality criteria emphasizing temporal validation, dataset partitioning strategies, and experimental transparency therefore help identify studies that consider temporal drift and realistic deployment settings.
Finally, data accessibility and reproducibility remain persistent challenges in fraud detection research, as many datasets are proprietary or subject to strict privacy constraints. Including criteria related to dataset availability, experimental reproducibility, and methodological clarity enables a more consistent and fair assessment of the existing literature, particularly when comparing studies across different fraud domains and data sources.
In decentralized financial environments such as cryptocurrency and DeFi ecosystems, ensuring data integrity, transparency, and trustworthiness is particularly critical for reliable model evaluation. Prior research has demonstrated that blockchain-assisted learning frameworks can enhance the reliability and auditability of intelligent financial decision systems by leveraging tamper-resistant data storage and decentralized verification mechanisms. From a methodological perspective, this further motivates the emphasis on data transparency, validation rigor, and reproducibility when assessing AI-based fraud detection studies operating in decentralized and adversarial financial settings [27].
These criteria help ensure that the reviewed studies provide sufficient methodological detail and comparable evaluation results.
4. Research Findings
The findings section of this study unfolds across two dimensions: overall trends and subject distribution. First, as illustrated in Figure 3, the publication volume exhibits a dynamic evolution over the observed period. An initial accumulation can be observed from 2015 to 2018, followed by a pronounced increase starting in 2019 and reaching a peak in 2021. After 2021, the trajectory shows non-monotonic variation rather than a continuous decline: publication counts decreased in 2022 and 2023, rebounded modestly in 2024, and declined again in 2025. Overall, the figure highlights a transition from rapid growth to a phase characterized by year-to-year variability.
Second, regarding the distribution of research subjects, the included papers cover a broad range of financial fraud types, with substantial variation in research intensity across categories (Figure 4). Within traditional fraud domains, credit card fraud is the most frequently studied topic. AML and insurance fraud follow at comparable levels, while financial statement fraud also attracts notable attention; loan-related fraud is less represented. For emerging fraud domains, crypto scams, DeFi/flash-loan attacks, and cryptocurrency-related fraud form a smaller yet meaningful body of work compared with the most studied traditional categories, whereas mobile/online payment fraud is relatively underexplored. Overall, the distribution indicates diversified but uneven research efforts across traditional and emerging fraud domains.
Q1: What are the traditional and emerging types of financial fraud?
4.1. Major Fraud Types in Detail
In general, financial fraud can be broadly categorized into established forms—such as credit card, insurance, and loan fraud—and emerging forms, including cryptocurrency-related scams and online lending fraud. Because of variations in data availability, fraud patterns, and adversarial behaviors, each type presents different difficulties for AI-based detection.
The landscape of financial fraud detection is marked by a distinct bifurcation. Established domains, such as credit card and insurance fraud, benefit from mature methodologies but remain hindered by privacy regulations and data silos that limit external validation. Conversely, emerging threats in digital finance—spanning cryptocurrency, internet lending, and DeFi exploits like rug pulls—present a more volatile challenge. These scenarios are defined by rapid adversarial adaptation and data scarcity, necessitating models that can synthesize on-chain behaviors with smart contract semantics. This contrast highlights the need to bridge the maturity gap between traditional and emerging fraud domains by developing adaptable and deployment-oriented AI frameworks.
Q2: How is artificial intelligence currently being used in conventional fraud detection, such as credit card and insurance fraud, and how do various techniques compare in performance?
4.2. Application of Artificial Intelligence Techniques in Traditional Fraud Detection
Traditional financial fraud detection, particularly in credit card fraud, insurance fraud, and loan fraud, has been extensively studied in recent years. Many AI techniques have been applied, from classic machine learning classifiers to more advanced deep learning and ensemble methods [37,60,68,69,70,71]. Despite their empirical success, these approaches continue to face persistent challenges in practical deployment, including extreme data imbalance, real-time detection requirements, and the limited interpretability of complex AI models [18,60,62,72].
As illustrated in Figure 5, the reported percentages reflect the proportion of reviewed studies (N = 122) adopting each dominant AI paradigm, as determined through a structured evidence matrix. The methodological landscape remains heavily anchored in Machine Learning (45.4%), reflecting its maturity and interpretability in regulated sectors. This dominance suggests that regulatory requirements and deployment constraints continue to shape model selection in real-world financial institutions. Nevertheless, a transition is evident: Deep Learning (27.6%) and Ensemble methods (11.2%) now command a substantial share, signaling a shift toward higher-capacity models. Newer paradigms such as graph-based learning and hybrid approaches are also emerging, aiming to better capture relational dependencies that are difficult to model using purely tabular features.
Table 4 summarizes representative AI methods used in traditional fraud detection, highlighting their typical application scenarios, strengths, and practical limitations.
To complement the qualitative analysis in Table 4, we further provide a quantitative benchmarking of these methods: Table 5 lists the specific performance metrics (Accuracy, F1, AUC) achieved by state-of-the-art models across different datasets. As summarized in Table 4, AI methodologies in traditional fraud detection have evolved from standard machine learning to more complex deep learning and graph-based approaches, and Table 5 offers a quantitative snapshot of representative studies.
A recurring pattern in credit card fraud studies is that supervised models often report very high accuracy, frequently exceeding 98% (e.g., [5]). However, the gap between Accuracy and F1-score reported in Table 5 (e.g., [5,15]) illustrates the core technical issue of extreme class imbalance: a model can achieve excellent overall accuracy while still missing a substantial portion of minority fraud cases or producing a precision level that is operationally unacceptable. This observation also implies that accuracy alone is not an adequate proxy for detection quality in real-world fraud settings, where the practical objective is usually to maximize fraud capture under a constrained false-positive/alert budget [96].
From a metric-design standpoint, performance interpretation should be tied to the operating regime. For highly imbalanced problems, AUC-ROC can remain optimistic because false-positive rate may look small in absolute terms even when the absolute number of false alarms is large. In contrast, PR-oriented measures (e.g., AUC-PR, precision at a target recall, or recall at a tolerable precision) better reflect the minority-class trade-off that drives manual review workload and customer friction. However, even metrics such as the F1-score implicitly assume equal misclassification costs across error types. In operational fraud detection, this assumption rarely holds. A false negative typically results in direct financial loss and regulatory exposure, whereas a false positive primarily generates investigation overhead and potential customer dissatisfaction. Consequently, evaluation protocols should incorporate cost-aware metrics or savings-based indicators derived from domain-specific cost matrices. These measures more accurately capture the economic and operational consequences of model decisions rather than relying solely on statistical discrimination performance.
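These considerations can be illustrated with a small sketch on synthetic scores: ROC and PR summaries diverge sharply under heavy imbalance, and a domain cost matrix (the unit costs below are hypothetical) turns threshold selection into an explicit cost-minimization step rather than a cost-blind F1 maximization.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)

# Synthetic, heavily imbalanced scores (~0.5% positives) for illustration.
n = 20_000
y = (rng.random(n) < 0.005).astype(int)
scores = rng.normal(loc=2.0 * y, scale=1.0)  # positives score higher

auc_roc = roc_auc_score(y, scores)           # can look comfortably high
auc_pr = average_precision_score(y, scores)  # tells a much harsher story
print(auc_roc, auc_pr)

# Hypothetical unit costs: a missed fraud costs 100, a false alarm costs 1
# (analyst review). Choose the alert threshold that minimizes expected cost.
def expected_cost(threshold, c_fn=100.0, c_fp=1.0):
    pred = scores >= threshold
    fn = np.sum((~pred) & (y == 1))  # missed fraud
    fp = np.sum(pred & (y == 0))     # false alarms
    return c_fn * fn + c_fp * fp

thresholds = np.quantile(scores, np.linspace(0.50, 0.999, 200))
best = min(thresholds, key=expected_cost)
```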
From a modeling perspective, mitigating class imbalance also requires moving beyond data-level resampling strategies. Although techniques such as SMOTE remain widely adopted, algorithm-level approaches—including focal loss and cost-sensitive learning—have demonstrated improved stability under extreme imbalance conditions. Unlike resampling, which may distort underlying data distributions or introduce synthetic artifacts, focal loss dynamically down-weights well-classified majority-class samples and encourages the model to focus on difficult minority-class instances. Such mechanisms help improve robustness while preserving the integrity of original transaction distributions. In addition, it is important to recognize that the F1-score remains threshold-dependent rather than purely model-dependent, reinforcing the necessity of combining threshold analysis with business-driven cost calibration when evaluating deployment readiness.
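For concreteness, the focal loss mentioned above can be sketched in a few lines of NumPy. This is an illustration with commonly used default values for gamma and alpha, not the implementation of any particular reviewed study.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss on predicted probabilities p and labels y in {0,1}.
    Minimal NumPy sketch; in practice this lives inside the training loop
    of the chosen framework."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    p_t = np.where(y == 1, p, 1.0 - p)              # prob. of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# A confidently correct majority-class example is down-weighted by the
# (1 - p_t)^gamma factor, while a hard minority-class example keeps most
# of its cross-entropy weight, focusing gradients on rare fraud cases.
easy_majority = focal_loss(np.array([0.01]), np.array([0]))[0]  # y=0, p~0
hard_minority = focal_loss(np.array([0.30]), np.array([1]))[0]  # y=1, p low
print(hard_minority / easy_majority)  # several orders of magnitude
```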
While ensemble and deep learning methods are frequently reported as robust in Table 5 (e.g., [17,22]), these gains must be interpreted together with deployment constraints. In latency-sensitive settings (such as authorization-time card fraud), the feasibility of sub-second inference can limit the practical use of sequence ensembles, large models, or complex feature pipelines, even when offline discrimination metrics improve. Similarly, graph-based models may improve relational signal capture but can introduce additional overhead through graph construction and neighborhood aggregation, which creates a non-trivial engineering burden at scale. Finally, interpretability remains a practical bottleneck: as model complexity increases, translating model outputs into stable, case-level evidence that satisfies audit and compliance workflows becomes more difficult, which can slow or prevent adoption even when headline metrics are strong [41,60,62].
Recent studies further indicate that, beyond network architecture, effective optimization strategies play a critical role in improving the stability and practical performance of sequential models such as LSTM in real banking environments, where transaction patterns are non-stationary and operational constraints are strict [97].
It is also important to note that many of the high-performing results in Table 5 are obtained in settings where historical labels are available and data collection is relatively centralized. As fraud detection moves toward more decentralized and pseudonymous environments (e.g., blockchain and DeFi), supervision becomes weaker, behaviors shift faster, and feature availability differs substantially from conventional banking systems. These factors make direct transfer of conventional supervised pipelines non-trivial, even when offline metrics on traditional benchmarks appear strong. The following section (Q3) therefore examines recent efforts to adapt AI techniques to emerging fraud threats in decentralized financial environments.
Q3: What are the potential applications, key challenges, and emerging research opportunities of AI for identifying novel fraud types, such as cryptocurrency-related fraud and flash loan attacks?
4.3. Application of AI Technologies in Emerging Fraud Scenarios
Compared with traditional fraud detection tasks, emerging fraud types—such as cryptocurrency fraud, online loan scams, and DeFi-specific attacks including rug pulls and flash loan exploits—exhibit fundamentally different characteristics that challenge conventional AI-based detection pipelines. These schemes evolve rapidly, frequently lack large and consistently annotated datasets, and rely on heterogeneous data sources, including blockchain transaction graphs, project documentation, and social media content [48,90,95,98,99]. Moreover, participant pseudonymity and highly adversarial settings further complicate reliable fraud identification and limit the direct transferability of models developed for centralized financial systems.
To address these complexities, recent work has explored a range of methods. Sequential models have been applied to capture temporal patterns in transaction sequences [85,87]; network-based techniques have been used to represent relationships in transaction graphs [62,93,100,101]; and approaches combining text analysis with structured data have been used to leverage unstructured information [80,90]. Despite these advances, this literature is still in a formative stage, and many issues—such as scalability, interpretability, and robustness in adversarial environments—remain unresolved. This reflects the fact that most existing approaches remain problem-specific and have yet to demonstrate robust generalization across different emerging fraud scenarios.
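As a toy illustration of the relational signals these network-based techniques exploit, the sketch below computes one-hop neighborhood aggregates on a synthetic transaction graph. Addresses, amounts, and feature choices are invented, and the sketch is far simpler than the graph models cited above; it only shows why relational features can expose patterns (such as circular flows) that per-transaction features miss.

```python
from collections import defaultdict

edges = [  # (sender, receiver, amount) -- synthetic transactions
    ("a1", "a2", 5.0), ("a2", "a3", 4.9), ("a3", "a1", 4.8),  # circular flow
    ("b1", "b2", 1.0), ("b3", "b2", 2.0),
]

out_total = defaultdict(float)
in_total = defaultdict(float)
neighbors = defaultdict(set)
for s, r, amt in edges:
    out_total[s] += amt
    in_total[r] += amt
    neighbors[s].add(r)
    neighbors[r].add(s)

def node_features(node):
    """Own flow statistics plus the mean in-flow of one-hop neighbors."""
    nbrs = neighbors[node]
    mean_nbr_in = sum(in_total[n] for n in nbrs) / len(nbrs) if nbrs else 0.0
    return {"out": out_total[node], "in": in_total[node],
            "mean_neighbor_in": mean_nbr_in, "degree": len(nbrs)}

feats = {n: node_features(n) for n in neighbors}
# Nodes in the circular flow have both high in- and out-flow, a pattern
# invisible to per-transaction features alone.
print(feats["a1"])
```

Even this single aggregation round requires touching every edge, which hints at the graph-construction and aggregation overhead discussed above when the edge count reaches production scale.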
To synthesize these fragmented efforts, Table 6 provides a structured overview of representative emerging fraud scenarios, highlighting their core modeling challenges, commonly adopted AI techniques, and open research directions. For example, in the context of cryptocurrency fraud, user pseudonymity and rapidly shifting strategies have led researchers to examine both sequential patterns and graph structures. In online lending fraud, the scarcity of publicly available labeled data and inconsistent labeling across platforms have prompted the use of hybrid text and tabular methods. For DeFi-specific threats such as rug pulls, integrating on-chain market behavior with off-chain social signals appears promising. In the case of flash loan attacks, which unfold over extremely short timescales, researchers have begun investigating methods capable of capturing both temporal dynamics and transaction relationships.
Overall, existing AI-based approaches show promise in addressing the diverse characteristics of emerging fraud types; however, their level of maturity remains noticeably lower than that observed in traditional fraud detection domains. Multimodal approaches that integrate on-chain behaviors with textual or social signals have demonstrated potential, yet their practical effectiveness is constrained by limited domain-specific annotations and the inherent difficulty of aligning heterogeneous data sources [64,80,90]. Moving forward, meaningful progress will depend not only on methodological refinement, but also on the construction of higher-quality datasets, improved cross-scenario generalization, and greater emphasis on model transparency and robustness in adversarial financial environments [53,65,66,67,94,95].
In parallel with these model-centric advances, recent progress in generative artificial intelligence—particularly large language models (LLMs)—has introduced a complementary research direction for addressing several unresolved challenges in emerging fraud detection.
Role of Generative AI and Large Language Models in Emerging Fraud Detection
Recent advances in generative artificial intelligence, particularly large language models (LLMs), have expanded the scope of AI applications beyond conventional prediction-oriented fraud detection [105]. Unlike traditional models that primarily operate on structured transaction records or graph representations, LLMs are capable of processing and generating unstructured textual information, which is increasingly relevant in emerging fraud scenarios involving project documentation, investigative narratives, and social media content.
One potential application of LLMs lies in synthetic data generation to alleviate class imbalance and data scarcity. By generating semantically coherent fraud-related narratives—such as scam descriptions or summaries of illicit schemes—LLMs may support weak supervision and scenario-based testing in domains where labeled data are limited [106]. However, generated samples may overrepresent frequent patterns and fail to capture rare but critical fraud variants, requiring careful validation before downstream use.
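One hedged way to turn such narratives (generated or collected) into weak supervision is to combine simple labeling functions by majority vote, in the spirit of data-programming approaches. The keyword rules and example texts below are purely illustrative assumptions; real labeling functions would be written and validated with domain experts.

```python
import re

FRAUD, LEGIT, ABSTAIN = 1, 0, -1

# Illustrative labeling functions over scam-narrative text.
def lf_guaranteed_returns(text):
    return FRAUD if re.search(r"guaranteed\s+\d+%", text, re.I) else ABSTAIN

def lf_urgency(text):
    return FRAUD if re.search(r"act now|last chance", text, re.I) else ABSTAIN

def lf_disclosure(text):
    return LEGIT if "prospectus" in text.lower() else ABSTAIN

LFS = [lf_guaranteed_returns, lf_urgency, lf_disclosure]

def weak_label(text):
    """Majority vote over non-abstaining labeling functions."""
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return FRAUD if votes.count(FRAUD) >= votes.count(LEGIT) else LEGIT

texts = [
    "Guaranteed 40% weekly returns, act now before the pool closes!",
    "Please review the fund prospectus before investing.",
]
print([weak_label(t) for t in texts])
```

The abstain option matters: weak labels should only be emitted where rules actually fire, and the resulting noisy labels still require the validation against real cases emphasized above.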
LLMs also offer new possibilities for anomaly explanation and investigator-oriented decision support. By synthesizing structured signals, including transaction features and graph-based risk indicators, into natural language summaries, LLMs may help translate complex model outputs into more accessible explanations for analysts. Nevertheless, these explanations are not guaranteed to be factually accurate, as LLMs are prone to hallucination, which poses non-trivial risks in regulatory and compliance-sensitive environments [107]. As a result, explanation generation should be constrained and embedded within human-in-the-loop workflows.
Beyond classification-oriented NLP tasks, LLMs enable higher-level semantic analysis of fraud-related narratives, such as summarizing fraud schemes and identifying recurring behavioral patterns across cases [108]. These capabilities are particularly relevant for emerging fraud types, including rug pulls and Ponzi schemes, where textual communication plays a central role. At the same time, LLM-based approaches remain vulnerable to adversarial language manipulation and domain shifts, limiting their robustness when deployed in isolation.
Overall, generative AI and LLMs represent a promising yet still complementary direction in emerging fraud detection. Their strengths lie in semantic understanding, explanation support, and unstructured data analysis rather than in standalone fraud prediction. Future research should therefore focus on integrating LLMs with established sequential, graph-based, and multimodal models in a controlled and auditable manner, while carefully addressing reliability, adversarial robustness, and data privacy concerns [109,110].
Q4: What are the boundaries of the public datasets that are currently available in this field? How can we create better datasets to propel new advancements?
4.4. Dataset Limitations and Development
The availability and design of datasets fundamentally influence both model performance and the fairness, reproducibility, and deployment relevance of evaluation results. However, different fraud types rely on different data sources and labeling processes, and these differences often dominate model outcomes. In what follows, we first summarize commonly used datasets and their accessibility, then discuss the technical limitations of existing public benchmarks, and finally outline directions for building more realistic and reusable datasets.
Beyond their intrinsic characteristics, dataset accessibility strongly shapes the direction of research. For each fraud type, Table 7 lists frequently used datasets and categorizes accessibility from fully public sources to proprietary logs. While researchers can easily access fully public datasets such as Enron, other resources involve partial or restricted access—for example, LendingClub provides only a loan-listing portion, and CSMAR typically requires a paid subscription. These access differences partially explain why some fraud types (e.g., credit card fraud) have become benchmark-driven research areas, while other settings remain dominated by institution-specific case studies.
- b. Limitations of Existing Datasets.
High-quality datasets are essential for developing reliable and generalizable fraud detection models. However, as reflected in Table 8, available datasets are unevenly distributed across fraud types, and even widely used public benchmarks carry technical constraints that affect what can be concluded from reported results.
- (i) Technical implications of anonymized features in public benchmarks.
In traditional domains, public datasets exist for credit card fraud (e.g., the European Credit Card dataset on Kaggle). Yet these benchmarks are often limited to short time windows and anonymized/transformed features (e.g., PCA components) [28]. Importantly, anonymization is not only an interpretability concern. First, the loss of feature semantics makes it difficult to assess feature stability and actionability under deployment: a model may rely on a particular PCA direction that is predictive in the benchmark but corresponds to an unstable mixture of raw signals that shifts across institutions, time periods, or preprocessing pipelines. Second, anonymization constrains error analysis and debugging. When false positives increase, it becomes difficult to trace model behavior back to business-relevant factors (channels, merchant characteristics, devices, geography), which complicates operational remediation and audit justification. Third, anonymized benchmarks reduce the value of domain-informed robustness design (e.g., targeted monitoring of semantically meaningful features, or rule-based constraints that align with operational controls). As a result, strong performance on anonymized benchmarks may overstate how easily the approach transfers to real institutional data, where feature definitions, availability, and preprocessing differ.
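The instability of PCA directions under preprocessing changes is easy to demonstrate on synthetic data. In this hedged sketch, the same two correlated raw features under a different hypothetical scaling (mimicking a second institution's pipeline) yield a visibly different leading component, so a model keyed to one PCA direction need not transfer.

```python
import numpy as np

rng = np.random.default_rng(42)

def first_pc(X):
    """Leading principal component direction, sign-fixed for comparability."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    v = vt[0]
    return v if v[0] >= 0 else -v

# Two correlated raw features, e.g. "amount" and "velocity" (names assumed).
z = rng.normal(size=(2000, 2))
a = z @ np.array([[1.0, 0.8], [0.0, 1.0]])   # institution A's preprocessing
b = a * np.array([3.0, 1.0])                 # institution B rescales feature 0

pc_a, pc_b = first_pc(a), first_pc(b)
# The "same" leading component now mixes the raw features very differently.
print(pc_a.round(2), pc_b.round(2))
```

With named features, such drift could be detected and remediated; with anonymized components, it silently undermines transfer.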
- (ii) Limited temporal depth and the under-evaluation of drift.
A second recurring limitation is insufficient temporal coverage. Short time spans make it difficult to meaningfully assess concept drift, even though drift is one of the defining challenges in operational fraud detection [14,23,109]. In practice, drift may appear as (a) prior drift (fraud prevalence changes after interventions, campaigns, or seasonal cycles), (b) covariate shift (new products, channels, and user segments change the feature distribution), and (c) behavioral/adversarial drift (attackers adjust to deployed controls). These shifts can break thresholds and calibration even when rank-based metrics appear stable. Consequently, evaluations that rely on random splits or that do not respect time ordering can underestimate degradation and overstate long-term stability.
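Covariate shift of the kind described in (b) is commonly monitored with the Population Stability Index (PSI), computed over quantile bins of a reference sample. The sketch below uses synthetic score distributions; the alert thresholds often quoted for PSI (e.g., 0.25) are industry rules of thumb rather than a standard.

```python
import numpy as np

def psi(expected, observed, bins=10):
    """Population Stability Index between a reference and a new sample,
    binned on the reference sample's quantiles."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf          # open-ended outer bins
    e, _ = np.histogram(expected, bins=cuts)
    o, _ = np.histogram(observed, bins=cuts)
    e = np.clip(e / e.sum(), 1e-6, None)
    o = np.clip(o / o.sum(), 1e-6, None)
    return float(np.sum((o - e) * np.log(o / e)))

rng = np.random.default_rng(7)
train_scores = rng.beta(2, 8, 50_000)   # score distribution at training time
stable = rng.beta(2, 8, 50_000)         # production sample, no drift
drifted = rng.beta(3, 6, 50_000)        # production sample after covariate shift

print(psi(train_scores, stable), psi(train_scores, drifted))
```

Because PSI is computed on distributions rather than labels, it can flag degradation before delayed fraud labels arrive, which is exactly the gap that short-window benchmarks cannot expose.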
- (iii) Emerging domains: public raw data does not automatically yield usable benchmarks.
In contrast to credit card benchmarks, insurance, P2P lending, and emerging cryptocurrency-related fraud still lack unified, standardized datasets [5,49,56,64]. For DeFi-related threats such as rug pulls and flash-loan attacks, raw on-chain transaction traces are publicly available, yet constructing representative labeled datasets remains technically challenging [52,56,57,63,64,65,66]. Two issues are particularly limiting. First, “ground truth” labels are often incident-driven and curated from heterogeneous sources (security reports, exploit lists, post-mortems), which leads to inconsistent labeling criteria and incomplete coverage across studies. Second, defining negative samples is non-trivial: many benign on-chain behaviors can look superficially similar to attacks (e.g., arbitrage, liquidation, MEV-driven bursts). Naïve negative sampling therefore introduces label noise and can inflate offline metrics without improving true detection capability. Moreover, DeFi ecosystems evolve rapidly (protocol upgrades, migrations, multi-chain fragmentation), so benchmarks built around fixed incident sets can become stale quickly. These constraints push many studies toward fragmented case-based evaluation instead of standardized training and testing pipelines, which limits meaningful cross-study comparisons (Table 8).
Overall, Table 8 highlights that while established datasets exist for certain traditional fraud types, many other domains—especially insurance fraud, online lending, and novel cryptocurrency-related frauds—still lack standardized, large-scale, and multimodal benchmarks. This disparity helps explain why reported advances are often difficult to compare across studies and why performance may not translate smoothly across deployment settings [5,54,55,56,60,61,63].
- c. Toward High-Quality Datasets.
To build more reliable benchmarks, dataset development should not only increase sample size but also improve technical realism along three dimensions: time, labels, and feature/graph construction. First, temporal design should reflect deployment: benchmarks should adopt chronological splits, support sliding-window evaluation, and report how performance and calibration drift over time, rather than relying on a single static split [14,23,109]. Second, labeling pipelines should be treated as part of the dataset specification: label delay, investigation selection bias, and incident definition criteria (especially for emerging crypto/DeFi fraud) should be documented explicitly so that reported metrics can be interpreted in context [48,64]. Third, for relational and DeFi-style fraud, benchmark construction should state entity resolution assumptions (address clustering, contract identity, cross-platform mapping) and define event boundaries (start/end time of an incident), because these choices directly affect both features and leakage risks during evaluation [20,51,54,63,66].
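The chronological, sliding-window evaluation recommended here can be sketched as follows. The window sizes are arbitrary, the labels are synthetic with an injected upward prevalence drift, and no model is actually fitted; the point is only the split design, which guarantees that every test window lies strictly after its training window.

```python
import numpy as np

def sliding_windows(n, train_size, test_size, step):
    """Yield chronological (train_idx, test_idx) pairs over n time-ordered rows."""
    start = 0
    while start + train_size + test_size <= n:
        yield (np.arange(start, start + train_size),
               np.arange(start + train_size, start + train_size + test_size))
        start += step

rng = np.random.default_rng(1)
n = 5000
t = np.arange(n)
# Synthetic labels whose fraud prevalence drifts upward over time (prior drift).
y = (rng.random(n) < 0.01 + 0.04 * t / n).astype(int)

rows = []
for tr, te in sliding_windows(n, train_size=2000, test_size=500, step=500):
    # A real pipeline would fit a model on rows tr here; this sketch only
    # tracks how the label distribution drifts between train and test windows.
    rows.append((int(te[0]), int(te[-1]), float(y[tr].mean()), float(y[te].mean())))

for start, end, train_prev, test_prev in rows:
    print(f"test [{start},{end}]: train prevalence {train_prev:.3f}, "
          f"test prevalence {test_prev:.3f}")
```

Reporting one result per window, rather than one number for a random split, is what makes calibration and prevalence drift visible to readers of a benchmark.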
Building on the limitations identified above, Table 9 outlines a structured and actionable roadmap for improving fraud detection datasets by linking common data issues to corresponding technical solutions and evaluation strategies. This framework is intended to move dataset discussion beyond high-level statements (e.g., “imbalanced” or “private”) and toward reproducible remedies and transparent reporting practices. To further strengthen reproducibility across studies, the research community would benefit from standardized data-sharing infrastructures. Establishing open repositories with unified feature definitions, annotation protocols, and preprocessing guidelines could significantly reduce inconsistencies across case-specific datasets. Comparable efforts in other AI domains, such as standardized dataset hubs for natural language processing, demonstrate how coordinated data governance can accelerate benchmarking and cross-study comparability. Future financial fraud datasets should therefore prioritize structured metadata documentation, multi-source integration capability, and version-controlled update mechanisms to better support longitudinal fraud research and model validation.
This problem–solution–evaluation framework provides a systematic roadmap for guiding future dataset development in financial fraud detection. By explicitly linking dataset issues with technical remedies and robust evaluation metrics, it not only addresses the gaps identified in Section 4.4 but also emphasizes practical strategies to ensure realism, interpretability, and generalizability in future benchmarks.
Although the reviewed studies collectively address the core research questions concerning fraud types, modeling paradigms, and data characteristics, the synthesis of findings also exposes a number of persistent gaps that remain insufficiently explored.
Q5: What are the primary limitations and gaps in the existing research? Which avenues for further research are recommended?
4.5. Research Gaps and Future Research Directions
Despite the progress highlighted in Section 4.1, Section 4.2, Section 4.3 and Section 4.4, several persistent challenges remain, particularly in emerging domains such as decentralized finance (DeFi) and multimodal fraud scenarios. These challenges arise from both modeling limitations and structural issues in data and evaluation design.
Interpretability beyond feature importance.
Interpretability has been investigated from multiple perspectives, including feature attribution methods such as SHAP and LIME [118,119], post hoc explanation frameworks for black-box models [119], and regulatory-oriented explainability requirements in financial systems [120,121]. While these approaches provide useful insights into feature relevance, they offer limited support for understanding the underlying mechanisms of complex fraud behaviors. This limitation becomes especially evident in scenarios involving sequential decision processes or programmatic execution, such as money laundering chains or smart contract exploits. In such contexts, feature-level explanations alone are insufficient to capture the temporal and causal structure of fraudulent activities. A deeper notion of interpretability—one that connects model outputs to interpretable representations of event sequences and decision logic—remains largely underexplored. Future research may benefit from explanation frameworks that can be evaluated against expert judgments in realistic audit and investigation settings.
In addition, existing interpretability studies rarely consider how explanation outputs are operationalized within real-world fraud detection systems, where explanations must support regulatory audits, analyst review workflows, and time-critical decision making. Bridging the gap between explanation methods and their system-level integration therefore represents an important yet underexplored research direction.
Joint modeling of on-chain and off-chain signals.
Emerging fraud patterns increasingly span heterogeneous data modalities, including transaction graphs, market indicators, project documentation, and social media or community discussions. However, most existing approaches analyze these information sources in isolation, which limits their ability to detect fraud behaviors that only emerge through cross-modal interactions. Although several studies have begun exploring hybrid or multimodal representations, systematic frameworks for jointly modeling semantic text, temporal transaction data, and structural graph features remain limited. Moreover, the generalization of such multimodal models across different fraud types and platforms has not been sufficiently examined [10,120].
Synthetic and simulation-driven data for rare events.
As identified in Section 4.4, a major bottleneck in emerging fraud detection is the scarcity of labeled data for rare or novel attack patterns. Traditional oversampling techniques, such as SMOTE, provide limited benefits when new fraud strategies lack close historical analogues [14]. Under current data constraints, there is increasing interest in simulation-based and synthetic data generation approaches that incorporate domain-informed assumptions about adversarial behavior. However, the effectiveness of such synthetic benchmarks depends critically on their realism and their ability to transfer to real-world detection tasks. Rigorous evaluation against real fraud cases therefore remains essential.
Collaborative and cross-domain learning.
Fraud activities frequently transcend institutional, platform, and even jurisdictional boundaries, while available datasets are typically confined to isolated organizations or individual platforms. As discussed in Section 4.4, this fragmentation limits model generalization and hampers the detection of coordinated or cross-platform fraud schemes. Although privacy-preserving learning paradigms such as federated learning have been explored in traditional financial settings, extending these approaches to heterogeneous and decentralized environments introduces additional challenges. Future research may explore cross-domain learning protocols that balance privacy preservation with the need to capture shared fraud patterns across institutions and platforms.
Standardized evaluation and benchmarking frameworks.
Across both traditional and emerging fraud domains, the absence of widely adopted benchmarks remains a fundamental obstacle to reproducibility and fair comparison. Current studies employ diverse evaluation metrics, dataset splits, and validation strategies, making it difficult to assess relative performance under realistic conditions. In particular, there is no consensus on how to evaluate models in the presence of severe class imbalance, temporal concept drift, and adversarial adaptation [122]. More comprehensive and standardized evaluation frameworks are therefore needed to reflect real operational constraints and deployment requirements.
5. Discussion
The studies reviewed in this work collectively demonstrate both substantial progress and persistent structural challenges in AI-based financial fraud detection. Across methodological choices, data availability, and application contexts, several cross-cutting patterns emerge that help explain the current research landscape and the factors shaping research priorities.
One notable observation is the uneven distribution of research attention across fraud types. Domains such as credit card fraud and cryptocurrency-related schemes dominate the literature, largely due to the availability of public or semi-public datasets that facilitate benchmarking and comparative evaluation. In contrast, insurance fraud, loan fraud, and several decentralized finance (DeFi)-related threats remain comparatively underexplored despite their significant economic impact. As discussed in Section 4.4, data accessibility continues to shape benchmark-driven research progress while also constraining investigation into institution-specific or emerging fraud scenarios.
Fraud detection in blockchain-based ecosystems introduces additional technical complexities beyond those encountered in traditional financial systems. Pseudonymity of wallet addresses obscures direct identity linkage, while transaction transparency paradoxically coexists with limited contextual metadata. Furthermore, effective analysis often requires integrating on-chain transaction graphs with off-chain behavioral or exchange-level information. These characteristics have motivated the development of hybrid detection frameworks that combine graph neural networks with advanced anomaly detection algorithms and blockchain-specific feature engineering. Such approaches leverage relational transaction patterns, temporal interaction dynamics, and smart contract behavior to improve detection capability in decentralized financial environments.
From a methodological perspective, the field has gradually transitioned from traditional machine learning toward ensemble learning, deep neural networks, and graph-based approaches capable of capturing nonlinear dependencies, temporal transaction dynamics, and relational structures. However, these advances introduce important trade-offs. Increased model complexity often reduces interpretability and operational transparency, both of which are critical requirements in regulated financial environments such as banking and anti-money laundering compliance. Consequently, improvements in predictive performance do not necessarily translate into deployment readiness or regulatory acceptance.
Dataset characteristics continue to exert a strong influence on research outcomes and evaluation practices. Widely adopted benchmarks, including the Kaggle Credit Card and Elliptic Bitcoin datasets, have enabled reproducible experimentation and standardized comparison across studies. Nevertheless, these datasets remain constrained by class imbalance, feature anonymization, and limited temporal coverage. For emerging fraud domains, datasets are frequently fragmented, case-specific, or platform-dependent, restricting cross-study comparability and weakening the external validity of reported results. These structural limitations help explain why models achieving strong performance in controlled experimental settings often face robustness and generalization challenges in real-world deployments.
Different imbalance mitigation strategies also present distinct trade-offs in fairness and robustness. Oversampling methods can enhance minority representation but may introduce overfitting or unrealistic synthetic patterns, particularly in highly sparse fraud scenarios. Undersampling, while computationally efficient, risks discarding informative legitimate transaction behavior that is critical for reducing false positives. In contrast, algorithm-level approaches such as focal loss and cost-sensitive optimization preserve the original data distribution while emphasizing high-risk samples during training. Emerging comparative studies suggest that hybrid strategies combining moderate resampling with adaptive loss functions often achieve more stable performance across varying fraud prevalence rates. Establishing standardized evaluation protocols for imbalance mitigation remains an important direction for improving cross-study comparability.
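The cost-sensitive alternative mentioned above can operate purely at the decision layer. Assuming calibrated probabilities and purely illustrative costs, the cost-minimizing decision threshold is C_fp / (C_fp + C_fn), which leaves the training distribution untouched, in contrast to resampling:

```python
import numpy as np

# Illustrative, asymmetric misclassification costs (assumed, not empirical).
C_FN, C_FP = 500.0, 5.0
# With calibrated probabilities, the cost-minimizing decision threshold is
# C_fp / (C_fp + C_fn); the data distribution itself is never resampled.
threshold = C_FP / (C_FP + C_FN)   # ~0.0099, far below the default 0.5

rng = np.random.default_rng(3)
y = (rng.random(20_000) < 0.005).astype(int)          # ~0.5% fraud
# Roughly calibrated synthetic scores: fraud sits at higher probabilities.
p = np.where(y == 1, rng.beta(2, 10, y.size), rng.beta(1, 200, y.size))

def total_cost(y_true, scores, t):
    pred = scores >= t
    fn = int(((y_true == 1) & ~pred).sum())
    fp = int(((y_true == 0) & pred).sum())
    return fn * C_FN + fp * C_FP

print(total_cost(y, p, 0.5), total_cost(y, p, threshold))
```

Because the decision rule, not the data, carries the cost asymmetry, this approach sidesteps the synthetic-pattern and information-loss risks of over- and undersampling, though it does depend on probability calibration holding under drift.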
Interpretability remains a critical dimension influencing both regulatory acceptance and operational usability. Existing studies predominantly focus on feature-level explanations that attribute model predictions to individual transaction or account attributes. While effective for relatively static fraud scenarios such as credit card or consumer loan fraud, feature-level explanations provide limited insight into complex fraud schemes involving sequential behaviors and multi-entity interactions. For fraud types such as money laundering networks, coordinated cryptocurrency scams, and DeFi exploits, behavior-level or sequence-based explanations become essential for understanding why transaction patterns are classified as suspicious. Developing scalable and reliable behavioral explanation mechanisms therefore remains an open challenge, particularly for graph-based and sequential models operating in adversarial and rapidly evolving environments.
Beyond interpretability, fraud detection research is increasingly shaped by institutional, ethical, and regulatory constraints. Requirements related to fairness, accountability, and auditability are now widely recognized as prerequisites for deployment in regulated financial sectors. Data-sharing restrictions have also motivated exploration of privacy-preserving techniques, including federated learning and secure multi-party computation, as potential solutions to data silos. However, empirical evidence demonstrating their effectiveness in heterogeneous and non-independent and identically distributed financial environments remains limited. The emergence of generative artificial intelligence and large language models further illustrates this tension. While these techniques enable analysis of unstructured narratives, contractual documents, and online communications, their limited transparency and susceptibility to adversarial manipulation introduce additional governance challenges.
Practical deployment constraints represent another critical yet frequently underemphasized dimension in academic fraud detection research. Real-world fraud detection systems must operate under strict latency requirements, regulatory oversight, and computational resource limitations. In real-time transaction screening scenarios, decisions often need to be made within millisecond-level latency budgets, creating a fundamental tension between predictive complexity and inference efficiency. Although deep neural networks and graph-based models effectively capture complex temporal and relational patterns, their substantial computational overhead often limits direct deployment without architectural simplification. To address these computational and latency-related operational constraints, recent studies increasingly emphasize model compression techniques, including structured pruning, parameter quantization, and knowledge distillation. These approaches aim to reduce computational complexity and memory footprint while preserving predictive accuracy. In addition to compression-based optimization, lightweight architectural designs, such as shallow ensemble cascades and early-exit neural networks, have also been investigated to further reduce inference latency in real-time fraud detection scenarios. When carefully implemented, these techniques enable real-time fraud detection pipelines to meet sub-second response requirements while maintaining acceptable predictive performance and operational transparency.
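Knowledge distillation, one of the compression techniques mentioned above, can be reduced to its loss function for illustration. The logits, temperature, and mixing weight below are invented, and a real pipeline would embed this loss in a training loop for the student model.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, y_true, T=4.0, alpha=0.7):
    """alpha * temperature-softened KL to the teacher (scaled by T^2)
    plus (1 - alpha) * ordinary cross-entropy on the hard labels."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1))) * T * T
    hard = float(-np.mean(np.log(
        softmax(student_logits)[np.arange(len(y_true)), y_true])))
    return alpha * soft + (1 - alpha) * hard

teacher = np.array([[4.0, -2.0], [-3.0, 3.5]])   # confident large model (invented logits)
student = np.array([[1.0, -0.5], [-0.8, 1.0]])   # smaller, less confident model
y = np.array([0, 1])
print(distill_loss(student, teacher, y))
```

The temperature softens the teacher's near-one-hot outputs so the student also learns the relative plausibility of the non-target class, which is where much of the teacher's learned structure resides.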
Beyond computational efficiency, the dynamic and adversarial nature of fraud behavior introduces persistent concept drift that further challenges static deployment strategies. Fraud patterns continuously evolve as attackers adapt to detection mechanisms, rendering periodic retraining insufficient for maintaining long-term model effectiveness. Consequently, research is increasingly shifting toward adaptive learning paradigms. Continual learning frameworks enable models to incrementally update decision boundaries while attempting to mitigate catastrophic forgetting of historical fraud patterns. Similarly, online ensemble techniques dynamically adjust model composition based on recent transaction distributions, thereby improving resilience to temporal behavioral shifts. Despite their advantages, adaptive learning approaches introduce additional complexity in model monitoring, stability assurance, and regulatory validation, highlighting the necessity for robust drift detection and model governance mechanisms.
Closely related to both latency and model adaptability considerations is the architectural design of operational fraud detection systems. In real-world financial environments, fraud detection is rarely implemented as a single-model solution. Instead, multi-stage detection architectures are widely adopted to balance efficiency and analytical depth. Lightweight classifiers or ensemble models are typically deployed for real-time risk scoring and transaction blocking, whereas computationally intensive sequential or graph-based models are reserved for risk escalation, forensic analysis, or post-transaction investigation. Additionally, managing false positives represents a critical operational challenge, as excessive false alarms can generate customer friction and increase manual review costs. Financial institutions therefore often employ conservative decision thresholds combined with downstream human verification processes. These practical requirements explain the increasing adoption of hybrid detection frameworks integrating real-time screening modules with advanced analytical subsystems, although such system-level architectures remain insufficiently examined in academic literature.
Interpretability further intersects with deployment feasibility and regulatory compliance. Widely adopted post hoc explanation techniques, including SHAP and LIME, have substantially improved transparency in complex fraud detection models. However, these explanation approaches may exhibit instability under adversarial perturbations or distributional shifts, potentially limiting their reliability in high-stakes financial environments subject to regulatory scrutiny. In response, recent research increasingly advocates integrating inherently interpretable modeling approaches, such as Explainable Boosting Machines and rule-based ensemble models, which provide more consistent and auditable decision rationales. Moreover, counterfactual explanation frameworks are gaining attention for their ability to generate actionable insights by identifying minimal feature changes required to alter model predictions. The integration of stable and actionable interpretability mechanisms is therefore becoming a critical component in strengthening auditability and enhancing stakeholder trust in AI-driven fraud detection systems.
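The counterfactual idea mentioned above can be made concrete with a toy sketch: given a simple linear fraud scorer, find the smallest single-feature change that pushes the score just below the decision threshold. The scorer, weights, and feature names are invented for illustration; real counterfactual frameworks additionally enforce plausibility, feature-scale normalization, and actionability constraints omitted here.

```python
FEATURES = ["amount", "n_recent_txs", "account_age_days"]
WEIGHTS = {"amount": 0.004, "n_recent_txs": 0.05, "account_age_days": -0.001}
BIAS, THRESHOLD = -0.5, 0.5

def score(x):
    """Hypothetical linear fraud scorer; higher means more suspicious."""
    return BIAS + sum(WEIGHTS[f] * x[f] for f in FEATURES)

def counterfactual(x, threshold=THRESHOLD):
    """Smallest single-feature change (by raw magnitude) that drops the
    score just below the decision threshold."""
    gap = score(x) - threshold
    if gap <= 0:
        return {}  # already classified as legitimate
    deltas = {f: -(gap + 1e-6) / w for f, w in WEIGHTS.items() if w != 0}
    best = min(deltas, key=lambda f: abs(deltas[f]))
    return {best: deltas[best]}

flagged = {"amount": 300, "n_recent_txs": 2, "account_age_days": 30}
cf = counterfactual(flagged)
print("flagged score:", round(score(flagged), 3), "| counterfactual:", cf)
```

Here the explanation is actionable by construction: it names one feature and the exact change to it that would alter the prediction, which is the kind of decision rationale the paragraph argues is easier to audit than unstable post hoc attributions.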
Taken together, these findings suggest that meaningful progress in AI-based financial fraud detection cannot rely solely on predictive performance improvements. Future advancements will require coordinated attention to computational efficiency, adaptive learning capability, system-level architectural design, interpretability robustness, regulatory compliance, and dataset quality. Addressing these interconnected challenges is essential for developing fraud detection systems that are not only accurate but also stable, transparent, and deployable in high-stakes financial environments.
6. Conclusions
This review has provided a comprehensive synthesis of recent advances in AI-based financial fraud detection, covering both traditional and emerging fraud types, methodological developments, and dataset characteristics. By systematically examining prior studies, this work highlights how artificial intelligence—particularly deep learning, graph neural networks, and hybrid approaches—has reshaped fraud detection research while also revealing persistent challenges that continue to constrain real-world deployment.
Several key observations emerge from this analysis. Research attention has been disproportionately concentrated in domains supported by publicly available datasets, such as credit card fraud and certain cryptocurrency-related schemes, leaving economically significant areas like DeFi-specific attacks relatively underexplored. Methodological progress has led to substantial gains in predictive performance, yet these gains are often accompanied by reduced interpretability, limited robustness to distributional shifts, and constraints on deployment in regulated financial environments. Dataset-related issues, including severe class imbalance, inconsistent labeling, and fragmented benchmarks, continue to restrict reproducibility and hinder the detection of rare or emerging fraud patterns.
To better bridge the gap between academic innovation and industrial deployment, this review highlights three concrete directions for future research. First, the absence of standardized multimodal benchmarks continues to limit comparability; future datasets must jointly reflect transactional, relational, and unstructured information (e.g., smart contract code). Second, the practical deployment of graph-based and sequential models requires greater attention to real-time operation, including lightweight architectures for scalability and latency-aware system design. Finally, evaluation practices must evolve toward regulatory-aligned frameworks that explicitly account for domain-aware interpretability, accountability, and operational risk.
Beyond methodological advancements, sustainable progress in financial fraud detection requires stronger cross-sector collaboration. Establishing open benchmarking platforms and community-driven evaluation challenges involving academia, industry practitioners, and regulatory bodies could significantly accelerate dataset sharing and methodological standardization. Such initiatives would facilitate transparent performance comparison, encourage reproducibility, and promote the development of regulatory-aligned evaluation frameworks. Building collaborative ecosystems around standardized fraud benchmarks may ultimately enhance both scientific innovation and real-world deployment effectiveness.
In summary, artificial intelligence has demonstrated substantial potential in enhancing financial fraud detection across a wide range of scenarios. However, the findings of this review suggest that long-term progress in this field will depend not only on methodological innovation but also on how effectively such advances are aligned with data realism, ethical considerations, and practical deployment constraints. Collectively addressing these challenges is essential for advancing fraud detection systems that are not only accurate but also reliable and suitable for high-stakes financial environments.