1. Introduction
Financial fraud has emerged as one of the most critical challenges facing modern financial systems. The rapid digitalization of financial services, the growth of online transactions, and the increasing complexity of global financial networks have significantly expanded the opportunities for fraudulent activities. Financial institutions are therefore investing heavily in advanced analytical techniques and intelligent systems to detect and prevent fraudulent transactions in real time.
Traditional fraud detection systems relied primarily on rule-based approaches and manual auditing processes. While these methods provided an initial line of defense, they often lacked adaptability and scalability when faced with large volumes of transactions and evolving fraudulent strategies. Early research in this area introduced Data Mining (DM) and Machine Learning (ML) techniques to automate fraud detection processes and improve detection accuracy [1,2,3]. These approaches demonstrated the potential of statistical learning models for identifying suspicious transaction patterns.
In recent years, the rapid development of Artificial Intelligence (AI) technologies has significantly transformed financial fraud detection systems. Deep Learning (DL) models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) architectures have shown strong performance in modeling complex transaction sequences and behavioral patterns [4,5,6]. These models are particularly effective in capturing temporal dependencies in transaction streams and identifying anomalies that may indicate fraudulent activity.
More recently, graph-based approaches have gained increasing attention in fraud detection research. Financial transactions naturally form networks that capture relationships among accounts, devices, merchants, and users. Graph Neural Networks (GNNs) enable the modeling of these complex relationships and have been successfully applied to problems such as anti-money laundering and cryptocurrency fraud detection [7,8,9]. Advanced graph-based architectures, including metapath-guided neural networks and temporal graph attention models, have further improved the ability to detect coordinated fraudulent schemes [10,11,12]. At the same time, recent work indicates that many fraud graphs are substantially heterophilic: fraudsters often connect to legitimate users, merchants, or devices to camouflage malicious behavior rather than forming purely homophilic clusters [13]. This observation is technically important because standard message-passing GNNs often assume that neighboring nodes provide compatible label information, an assumption that can break down in fraud networks and lead to representation mixing across dissimilar neighborhoods [14].
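To make the heterophily point concrete, the following minimal sketch computes the edge homophily ratio, that is, the fraction of edges whose two endpoints share a label. The toy graph, the node names, and the function name `edge_homophily` are illustrative assumptions and are not taken from the cited works.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Hypothetical toy graph: two fraud accounts (label 1) that mostly
# connect to legitimate accounts (label 0) to camouflage themselves.
labels = {"f1": 1, "f2": 1, "a": 0, "b": 0, "c": 0, "d": 0}
edges = [
    ("f1", "a"), ("f1", "b"),
    ("f2", "c"), ("f2", "d"),
    ("f1", "f2"), ("a", "b"),
]

# Homophily over all edges, and over edges that touch a fraud node.
overall = edge_homophily(edges, labels)
fraud_edges = [(u, v) for u, v in edges if labels[u] == 1 or labels[v] == 1]
fraud_view = edge_homophily(fraud_edges, labels)

print(f"overall edge homophily: {overall:.3f}")
print(f"homophily on fraud-incident edges: {fraud_view:.3f}")
```

In this toy example only one of the five edges incident to a fraud node links two fraud nodes, so neighborhood averaging around a fraudster would be dominated by features of legitimate accounts.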
In addition to graph-based models, several emerging research directions have recently been explored. Multimodal learning frameworks integrate heterogeneous sources of information such as transaction logs, behavioral data, and textual financial disclosures [15,16]. Transformer-based architectures and hybrid deep learning models have also demonstrated promising results in large-scale fraud detection environments [17,18,19]. Furthermore, Reinforcement Learning (RL) and adaptive learning frameworks have been proposed to enable real-time and context-aware fraud detection systems [20,21].
Another rapidly growing area of research focuses on explainable and trustworthy AI in financial fraud detection. Since financial decisions often require regulatory transparency, explainable AI techniques have been proposed to improve model interpretability and decision traceability [22,23,24]. In parallel, privacy-preserving approaches such as federated learning are being explored to enable collaborative fraud detection across financial institutions without exposing sensitive customer data [25,26].
Recent studies have also explored novel paradigms such as causal learning, contrastive learning, and dynamic hypergraph modeling to improve fraud detection robustness and generalization capabilities [27,28]. Other innovative directions include quantum-assisted ML models [29,30], large language model–assisted detection systems [31], and hybrid artificial–quantum intelligence frameworks [32].
Despite the significant progress achieved in this domain, several challenges remain unresolved. These include extreme class imbalance in financial datasets, evolving fraudulent strategies, limited availability of labeled fraud data, and the need for real-time detection in high-throughput financial systems. Moreover, the rapid emergence of new AI architectures and learning paradigms has created a fragmented research landscape that makes it difficult to systematically compare different approaches.
Motivated by these challenges, this survey provides a comprehensive review of recent advances in financial fraud detection using ML/DL techniques. The main contributions are summarized as follows:
A cross-paradigm survey that unifies classical ML, deep learning, graph learning, multimodal architectures, cost-sensitive learning, reinforcement learning, federated learning, and LLM-assisted fraud analysis within a single financial-fraud-focused taxonomy.
A comparative synthesis of datasets, data modalities, and evaluation practices, with particular emphasis on class imbalance, delayed labels, operational metrics, and the gap between benchmark reporting and deployment reality.
An explicit analysis of emerging technical themes that are only lightly treated or absent in earlier surveys, including heterophilic graph learning, multimodal fusion, privacy-preserving collaboration, and LLM-based analyst support.
A positioning of this survey relative to prior review papers, clarifying that the present manuscript aims not merely to summarize fraud detection methods but to integrate recent advances from 2022–2026 into a deployment-aware perspective on modern financial fraud detection.
To clarify the novelty of the present survey relative to prior review papers, Table 1 contrasts our scope with representative earlier surveys. Prior reviews such as Abdallah et al. and West and Bhattacharya provide important early overviews of fraud detection and computational intelligence methods, but they predate recent advances in graph neural networks, multimodal fraud modeling, federated learning, and LLM-assisted analysis [33,34]. More recent surveys, such as Hilal et al., focus strongly on anomaly detection, while newer review papers emphasize broad ML/DL coverage but devote less space to deployment constraints, heterophily in transaction graphs, cost-aware evaluation, and emerging LLM-based workflows [35,36]. The contribution of the present survey is therefore to provide a more up-to-date and integrated synthesis across methodological families, data modalities, and real-world deployment challenges.
By synthesizing the rapidly expanding body of research in this area, this survey aims to provide researchers and practitioners with a structured overview of current methodologies and future research opportunities in financial fraud detection.
The remainder of this manuscript is organized as follows. Section 2 presents the review methodology, while Section 3 provides a bibliometric analysis of publication trends, methodological evolution, and emerging themes. Section 4 then defines the financial fraud detection problem and reviews the main datasets and data modalities used in the literature. Sections 5, 6, 7, and 8 examine classical machine learning methods, deep learning approaches, graph-based models, and multimodal architectures, respectively. Section 9 discusses cost-sensitive learning and reinforcement learning, and Section 10 addresses evaluation under extreme class imbalance. Section 11 focuses on explainability, governance, and operational deployment issues, whereas Section 12 summarizes the main open challenges in the field. Section 13 outlines promising directions for future research, and Section 14 concludes the survey.
2. Research Methodology
This survey follows a structured literature review methodology aimed at identifying, categorizing, and analyzing recent research contributions in the area of financial fraud detection using ML/DL techniques.
2.1. Literature Search Strategy
Relevant studies were identified through systematic searches in major scientific databases, including IEEE Xplore, Scopus, Web of Science, SpringerLink, and ACM Digital Library. Additional papers were identified through backward and forward citation analysis of highly cited articles in the field. The search queries included combinations of the following keywords: fraud detection, credit card detection, machine learning, deep learning, graph neural networks, anomaly detection, financial transactions. The literature search focused primarily on publications from 2015 to 2025, a period during which significant advances in DL and graph-based modeling have been applied to fraud detection problems. To improve reproducibility, the bibliometric corpus used in this survey was compiled from database searches run in IEEE Xplore, Scopus, Web of Science, SpringerLink, and ACM Digital Library on 10 February 2026. A representative Boolean query template was: (“fraud detection” OR “financial fraud” OR “credit card fraud”) AND (“machine learning” OR “deep learning” OR “graph neural network” OR “anomaly detection” OR “artificial intelligence”). Database-specific syntax was adjusted where necessary, but the semantic structure of the query was kept consistent across sources. Foundational papers that were not necessarily retrieved directly by this Boolean query structure were captured, where relevant, through backward and forward citation tracking from screened articles. Moreover, when a foundational or contextual source was cited only to frame the historical development of the field and was not part of the screened corpus itself, it was retained in the reference list as supplementary background rather than counted among the 101 PRISMA studies.
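As an illustration of how the Boolean template above can be assembled consistently before database-specific syntax is applied, the following minimal sketch joins the terms of each group with OR and the groups with AND. The helper name `boolean_query` is a hypothetical convenience, and the sketch uses straight quotes; real database interfaces may require different quoting or field tags.

```python
def boolean_query(*groups):
    """Join the terms of each group with OR, then join the groups with AND."""
    clauses = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


# Term groups copied from the representative query template in the text.
fraud_terms = ["fraud detection", "financial fraud", "credit card fraud"]
method_terms = [
    "machine learning",
    "deep learning",
    "graph neural network",
    "anomaly detection",
    "artificial intelligence",
]

query = boolean_query(fraud_terms, method_terms)
print(query)
```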
2.2. Inclusion and Exclusion Criteria
To ensure the relevance and quality of the selected studies, the following inclusion criteria were applied:
Peer-reviewed journal articles and conference papers
Studies proposing Machine Learning and/or Deep Learning methods for fraud detection
Papers presenting empirical evaluation on real-world or benchmark datasets
The following exclusion criteria were applied:
Studies not related to financial fraud detection
Papers lacking experimental validation
Non-peer-reviewed reports and technical documents
2.3. Study Selection Process
The study selection process followed a multi-stage screening approach according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) framework [37]. Initially, titles and abstracts were reviewed to identify potentially relevant publications. Full-text analysis was subsequently performed to confirm eligibility. The PRISMA count reported below refers specifically to the screened review corpus retained for qualitative synthesis and for the descriptive bibliometric summaries presented in this paper. In contrast, the complete manuscript bibliography also contains a limited number of supplementary foundational or contextual references used to frame early developments, definitions, or historical background. Those supplementary references were not counted in the PRISMA total unless they independently satisfied the inclusion criteria through the database-search and citation-tracking protocol described above.
The selected studies were categorized according to the methodological approach used for fraud detection. The main categories identified in this survey include:
Classical Machine Learning models
Deep Learning architectures
Graph-based fraud detection methods
Multimodal and hybrid learning systems
Cost-sensitive and decision-aware learning approaches
For example, classical ML models such as decision trees, Support Vector Machines (SVMs), and ensemble learning techniques have been widely applied to financial fraud detection problems [2,38,39]. DL architectures, including CNNs, RNNs, and hybrid neural models, have demonstrated strong performance in modeling transaction sequences and behavioral patterns [4,5,40]. Graph-based learning approaches have also gained significant attention due to their ability to model relational structures between financial entities. GNNs and graph attention mechanisms have been successfully applied to financial transaction networks and blockchain-based fraud detection [7,10,12,41]. Recent studies have further explored advanced modeling paradigms such as multimodal learning [15], causal representation learning [27], RL-enhanced detection systems [20], and large language model–assisted fraud analysis [31]. These emerging approaches demonstrate the growing diversity of AI techniques applied to fraud detection. The overall study selection process is illustrated in Figure 1.
2.4. Data Extraction and Analysis
For each selected study, key information was extracted and systematically analyzed. The extracted attributes include: detection methodology, dataset characteristics, feature engineering techniques, evaluation metrics, and experimental performance. The collected information was subsequently used to construct comparative tables, taxonomy frameworks, and analytical discussions presented in the following sections of the paper. This structured methodological approach ensures a comprehensive and systematic overview of the rapidly evolving research landscape in financial fraud detection.
3. Bibliometric Analysis
In order to better understand the evolution of research in financial fraud detection, a bibliometric analysis was conducted on the set of publications identified during the literature review process. Bibliometric analysis provides quantitative insights into the development of a research field by examining publication trends, methodological evolution, and emerging research themes.
The publication-trend statistics shown in Figure 2 were derived from the final screened corpus used in this review rather than from an external bibliometric mapping package. Specifically, publication years were manually extracted from the eligible papers and then aggregated into yearly counts; the resulting chart was plotted from these counts for visualization. Thus, the analysis reported here should be interpreted as a review-level descriptive bibliometric summary based on the final included records, not as a VOSviewer- or Bibliometrix-based science-mapping exercise.
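The aggregation step described above amounts to counting records per publication year. A minimal sketch is shown below; the function name `yearly_counts` is hypothetical and the list of years is placeholder data for illustration only, not the corpus used in this survey.

```python
from collections import Counter


def yearly_counts(years):
    """Return (year, count) pairs for every year in the observed range,
    including years with zero publications."""
    counts = Counter(years)
    first, last = min(counts), max(counts)
    return [(year, counts.get(year, 0)) for year in range(first, last + 1)]


# Placeholder publication years; NOT the survey's actual corpus.
publication_years = [2019, 2021, 2021, 2023, 2023, 2023, 2024]

yearly = yearly_counts(publication_years)
for year, n in yearly:
    print(year, n)
```

Filling in zero-count years keeps the time axis continuous when the counts are plotted.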
3.1. Publication Trends
The number of research publications on financial fraud detection has increased significantly over the past decade. This growth is largely driven by the expansion of digital financial services, the increasing availability of large-scale transactional datasets, and recent advances in ML/DL technologies. Earlier studies in the field primarily focused on traditional DM techniques and statistical methods. However, starting around 2015, the literature began to shift toward ML-based approaches, particularly ensemble methods, such as Random Forests (RFs) and Gradient Boosting (GB). More recently, DL architectures, including RNNs, GNNs, and transformer-based models, have gained increasing attention.
Figure 2 illustrates the growth in the number of publications related to financial fraud detection using ML/DL techniques.
3.2. Methodological Evolution
The methodological landscape of fraud detection research has evolved considerably over time. Early approaches largely relied on rule-based systems and traditional statistical models. These were gradually replaced by data-driven ML methods capable of capturing complex nonlinear relationships in transaction data. In recent years, DL models have emerged as powerful tools for modeling complex behavioral patterns and temporal dependencies in financial transactions. In particular, three major methodological trends can be observed:
Ensemble Machine Learning Methods: Algorithms such as RFs and GB have been widely adopted due to their strong performance on structured financial datasets.
Sequential Deep Learning Models: RNNs and LSTM networks have been used to model sequences of transactions and capture temporal patterns associated with fraudulent behavior.
Graph-Based Learning: GNNs have recently gained popularity for modeling relational structures between accounts, transactions, and entities in financial networks.
These methodological developments reflect a broader shift toward more complex and data-driven modeling frameworks capable of capturing sophisticated fraud patterns.
3.3. Emerging Research Themes
The bibliometric analysis also highlights several emerging research themes within the fraud detection literature. These themes reflect the evolving challenges faced by financial institutions and the growing complexity of modern financial ecosystems. Key emerging research directions include:
Graph-based fraud detection: modeling transactional networks and relationships between entities.
Explainable Artificial Intelligence (XAI): improving the interpretability of ML/DL models used in financial decision-making.
Federated learning: enabling collaborative fraud detection while preserving data privacy across institutions.
Adversarial robustness: developing models that remain effective against adaptive fraudulent strategies.
Multimodal learning: combining multiple data sources such as transaction logs, user behavior data, and device information.
These emerging themes suggest that future fraud detection systems will increasingly rely on hybrid modeling approaches that integrate multiple data representations and learning paradigms.
3.4. Summary of Bibliometric Insights
Overall, the bibliometric analysis reveals a rapidly growing research field characterized by increasing methodological diversity and technological sophistication. While traditional computational approaches remain widely used in practical applications, recent advances in DL and graph-based modeling are opening new opportunities for improving fraud detection accuracy and scalability. At the same time, several challenges remain unresolved, including data imbalance, concept drift, privacy constraints, and the need for interpretable decision-making models. Addressing these challenges will require continued interdisciplinary research at the intersection of ML/DL, financial analytics, and cybersecurity.
Figure 3 presents a detailed timeline of the methodological evolution in fraud detection, highlighting the paradigm shifts from manual rule-based systems to modern multimodal and quantum-assisted architectures.
4. Financial Fraud Detection Problem
Financial fraud detection is a complex analytical task that involves identifying malicious or deceptive activities within large volumes of financial transactions and organizational records. Fraudulent behavior may occur in a variety of contexts, including payment transactions, banking operations, corporate financial reporting, insurance claims, and cryptocurrency networks. The increasing digitalization of financial services has significantly expanded the scale and complexity of financial data, making automated detection systems based on machine learning and artificial intelligence essential for modern financial institutions.
From a computational perspective, fraud detection presents several challenges, including severe class imbalance, evolving fraudulent strategies, delayed labeling of fraudulent events, and the presence of heterogeneous data sources. Therefore, modern research in fraud detection explores a wide range of modeling paradigms, including supervised classification, anomaly detection, graph-based learning, multimodal fusion, and reinforcement learning frameworks.
4.1. Fraud Types and Detection Formulations
Financial fraud detection problems are typically formulated using several ML paradigms, depending on the availability of labeled data and the operational objectives of the system. The most common formulation is binary classification at the transaction or account level, where models predict whether an activity is fraudulent or legitimate. This formulation is widely used in payment systems, credit-card fraud detection, and online banking security.
Beyond binary classification, fraud detection problems are frequently modeled as anomaly detection tasks. In these settings, models attempt to identify unusual patterns that deviate from normal transactional behavior, often using unsupervised or semi-supervised learning techniques. Clustering-based approaches and multivariate anomaly scoring methods are commonly employed when labeled fraudulent examples are scarce or delayed [42]. In financial reporting contexts, anomaly detection pipelines may also incorporate financial risk indicators and distribution-based distance metrics to detect irregular accounting behavior [43].
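As a simple illustration of unsupervised anomaly scoring, the sketch below flags transaction amounts using a robust z-score based on the median and the median absolute deviation (MAD). This is a deliberately simple univariate stand-in, not the multivariate or distribution-based methods of the cited studies; the sample amounts, the function name `robust_z_scores`, and the 3.5 cutoff (a commonly quoted rule of thumb) are assumptions.

```python
from statistics import median


def robust_z_scores(values):
    """Robust z-score per value: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        raise ValueError("MAD is zero; scores are undefined for this sample")
    return [0.6745 * (x - med) / mad for x in values]


# Hypothetical transaction amounts for a single account.
amounts = [12.0, 15.5, 9.9, 14.2, 11.3, 13.8, 980.0, 10.7]

scores = robust_z_scores(amounts)

# Flag amounts whose absolute score exceeds the assumed cutoff.
CUTOFF = 3.5
flagged = [a for a, s in zip(amounts, scores) if abs(s) > CUTOFF]
print(flagged)
```

Because the median and MAD are barely affected by a single extreme value, the outlier does not mask itself, which is the main reason for preferring them over the mean and standard deviation in this setting.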
More advanced formulations treat fraud detection as a multi-task learning problem. For example, Wang and Kang propose a framework that jointly performs fraud detection and capital flow prediction by modeling asymmetric interactions between transaction participants [44]. Such formulations reflect real-world financial monitoring scenarios where understanding transaction dynamics can improve the identification of suspicious activities.
Another challenge arises in financial statement fraud detection, where fraudulent behavior often develops gradually over multiple reporting periods. In these cases, the labeling of fraud may correspond to the year in which enforcement actions occur, while relevant signals may accumulate across several years of financial reports. To address this issue, some architectures explicitly separate long-term behavioral signals from short-term events, allowing models to distinguish between chronic motives and acute fraudulent actions [15].
Financial fraud detection is also increasingly studied in streaming environments. In many financial markets, transaction data arrive continuously and must be analyzed in real time to prevent financial losses. Online learning and streaming analytics approaches therefore play an important role in modern fraud detection systems. For instance, dynamic graph-stream analysis methods have been proposed to detect suspicious traders and evolving behavioral patterns in real-time transaction networks [45].
Additionally, concept drift poses a significant challenge in fraud detection because fraudulent strategies continuously evolve as attackers adapt to defensive mechanisms. Drift-aware learning frameworks have therefore been developed to detect distributional changes and update detection models accordingly [46]. Recent real-time architectures combine temporal sequence modeling with graph-based feature extraction to support low-latency fraud detection at scale [40,47].
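One simple way to monitor distributional change in an input feature is the Population Stability Index (PSI), sketched below. It is offered as a generic monitoring heuristic, not as the drift-aware frameworks of the cited works; the bin edges, sample values, and the 0.25 alert level (a commonly quoted rule of thumb) are assumptions.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples.

    `edges` are ascending bin boundaries; values below the first edge
    fall in bin 0 and values at or above the last edge in the final bin.
    Empty bins are floored at `eps` to keep the logarithm finite.
    """
    def shares(sample):
        counts = [0] * (len(edges) + 1)
        for x in sample:
            counts[sum(1 for e in edges if x >= e)] += 1
        return [max(c / len(sample), eps) for c in counts]

    exp_share, act_share = shares(expected), shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_share, act_share))


# Hypothetical transaction amounts: a reference window and a shifted window.
reference = [10, 20, 30, 40, 50, 60, 70, 80]
shifted = [60, 70, 80, 90, 100, 110, 120, 130]
bin_edges = [25, 50, 75]

psi_same = psi(reference, reference, bin_edges)
psi_shift = psi(reference, shifted, bin_edges)

ALERT_LEVEL = 0.25  # assumed rule-of-thumb threshold
print(psi_same, psi_shift, psi_shift > ALERT_LEVEL)
```

A statistic of this kind only signals that the input distribution has moved; deciding whether to retrain or recalibrate the model remains a separate step.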
Recent datasets and empirical studies increasingly focus on digital financial ecosystems such as online payment gateways, mobile banking platforms, and cryptocurrency transactions. For example, scalable boosting pipelines and real-time analytics frameworks have been explored for detecting fraudulent activity in digital payment networks and phishing-related transaction graphs [48]. These approaches complement the more established credit-card and e-commerce fraud detection benchmarks that are commonly used in machine learning research [29,49].
To systematically illustrate the end-to-end process of identifying anomalous transactions, Figure 4 presents a generalized Fraud Detection Pipeline, encompassing data ingestion, feature engineering, model inference, and operational decision-making.
4.2. Dataset Usage in Fraud Detection Research
The availability of benchmark datasets plays a crucial role in the development and evaluation of fraud detection algorithms. Due to the highly sensitive nature of financial data, access to real-world transaction datasets is often restricted, which has led researchers to rely on a limited number of publicly available datasets or synthetic transaction simulations. These datasets differ substantially in terms of size, data modality, fraudulent prevalence, and feature representation, which can significantly influence the evaluation of learning models.
Several datasets have emerged as widely used benchmarks in fraud detection research. These include transactional datasets derived from credit card activity, synthetic transaction simulators, large-scale e-commerce datasets, and graph-based cryptocurrency transaction networks. Each dataset reflects different fraud scenarios and therefore supports different modeling approaches, ranging from tabular classification to graph-based fraud detection.
One of the most comprehensive benchmarking frameworks is the Fraud Dataset Benchmark (FDB), introduced by Grover et al. [50]. FDB provides a standardized collection of fraud-related datasets along with unified data loaders and evaluation protocols, allowing researchers to systematically compare machine learning models across multiple fraud detection tasks. The benchmark includes datasets covering credit card fraud, malicious web traffic, bot detection, and loan default prediction, thereby offering a diverse evaluation environment for fraud detection algorithms.
Another widely used dataset is the ULB Credit Card Fraud dataset, which contains 284,807 credit card transactions recorded over a two-day period in September 2013, of which only 492 (0.172%) are labeled as fraudulent. To protect confidentiality, most features were transformed using Principal Component Analysis (PCA), leaving only the Time and Amount variables in their original form [51]. Due to its extreme class imbalance, this dataset has become a standard benchmark for evaluating anomaly detection and imbalanced classification methods in fraud detection research.
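The practical consequence of this imbalance can be checked with a few lines of arithmetic. The sketch below assumes the commonly reported total of 284,807 transactions for this dataset and evaluates a trivial classifier that labels every transaction as legitimate; the variable names are illustrative.

```python
# Counts for the ULB dataset (total size is the commonly reported figure).
n_total = 284_807
n_fraud = 492

# Fraud prevalence, i.e., the positive-class share.
prevalence = n_fraud / n_total

# Confusion-matrix cells for a classifier that never predicts fraud.
tp, fp = 0, 0
fn, tn = n_fraud, n_total - n_fraud

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print(f"prevalence: {prevalence:.5f}")
print(f"accuracy of 'always legitimate': {accuracy:.5f}")
print(f"recall of 'always legitimate': {recall:.1f}")
```

The trivial classifier exceeds 99.8% accuracy while detecting no fraud at all, which is why accuracy is uninformative on this benchmark and why precision-recall-based metrics, whose chance-level baseline is close to the prevalence, are usually preferred.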
The IEEE-CIS Fraud Detection dataset, released through a collaboration between Vesta Corporation and Kaggle, represents a large-scale e-commerce fraud dataset containing hundreds of thousands of online transaction records with both numerical and categorical features. The dataset includes detailed transactional information and identity attributes across multiple files, enabling research on feature engineering, representation learning, and hybrid fraud detection models [52]. Compared with smaller credit card datasets, IEEE-CIS provides richer contextual information and is therefore frequently used to evaluate deep learning and ensemble-based fraud detection methods.
To address privacy limitations in financial data sharing, several studies employ synthetic datasets such as the PaySim mobile money simulator. PaySim simulates financial transactions based on real transaction logs from a mobile money service operating in developing economies. The dataset contains more than six million transactions and includes attributes such as transaction type, sender and receiver balances, and timestamps. Approximately 0.13% of the transactions are labeled as fraudulent, making the dataset highly imbalanced and suitable for evaluating large-scale fraud detection models [53]. Synthetic datasets like PaySim are particularly valuable for testing scalable machine learning architectures without exposing sensitive financial data.
More recently, the Credit Card Fraud Detection 2023 dataset released on Kaggle has been used to evaluate modern machine learning pipelines under more contemporary transaction patterns. This dataset contains hundreds of thousands of anonymized credit card transactions and is designed to support research on advanced machine learning models, including ensemble learning and deep neural networks [54]. Unlike earlier datasets, it is often curated to reduce extreme class imbalance or to support balanced experimental setups.
Finally, graph-based fraud detection research frequently relies on the Elliptic Bitcoin dataset, which models cryptocurrency transactions as a temporal graph where nodes represent transactions and edges represent the flow of Bitcoin between them. The dataset includes labeled illicit and licit transactions and provides structural features describing transaction behavior within the blockchain network. Because of its graph structure, the Elliptic dataset has become a standard benchmark for evaluating graph neural networks and network-based anomaly detection approaches in financial crime detection [7].
Table 2 summarizes the key characteristics of these benchmark datasets, including their scale, fraud prevalence, data modality, and typical research applications.
4.3. Data Modalities
Financial fraud detection systems increasingly rely on heterogeneous data sources that capture different aspects of financial activity. Traditional fraud detection models primarily operate on structured tabular transaction data, including attributes such as transaction amount, timestamp, merchant category, and account identifiers. While such features remain important, modern fraud detection architectures often incorporate additional modalities to capture richer behavioral and contextual information.
Behavioral data represent an important modality. User interaction patterns, transaction sequences, and device usage histories provide valuable signals of behavioral anomalies that may indicate fraudulent activity. Temporal sequence modeling approaches therefore analyze transaction histories to detect deviations from typical behavioral patterns.
Another important modality involves relational or network-based information. Financial transactions naturally form graphs where nodes represent entities such as accounts or users and edges represent financial interactions. Graph-based fraud detection methods leverage these structures to detect coordinated fraudulent schemes and suspicious communities of interacting accounts.
Unstructured textual information also plays an increasing role in certain fraud detection scenarios. In corporate fraud detection, textual disclosures in annual reports, management statements, and financial filings may contain semantic signals associated with financial manipulation or misreporting. Natural language processing techniques can therefore extract latent semantic representations from such documents to complement traditional accounting indicators.
Finally, contextual and device-level signals are increasingly used in modern fraud detection systems. Examples include device fingerprints, IP addresses, geolocation data, and authentication metadata. These signals help detect suspicious access patterns and account takeover attempts in online financial platforms.
The integration of these heterogeneous modalities has motivated the development of multimodal fraud detection architectures capable of jointly modeling structured, relational, behavioral, and textual information. Such multimodal approaches have been shown to improve detection performance while providing richer contextual insights into fraudulent behavior.
5. Machine Learning Methods for Fraud Detection
Traditional supervised methods remain widely used in operational fraud detection systems due to their simplicity, computational efficiency, and ease of deployment. In contrast to more complex DL architectures, classical ML models often require fewer computational resources and can be effectively trained on structured tabular datasets that dominate financial transaction records. Early studies demonstrated the effectiveness of statistical learning and data mining techniques for fraud detection tasks [1,55,56].
Among the most commonly used models are tree-based ensembles, linear classifiers such as Logistic Regression (LR), and instance-based approaches including k-Nearest Neighbors (kNN). These models often serve as strong baselines in fraud detection benchmarks. For example, Purushu et al. compared Azure-based and Apache Spark ML pipelines and found that decision forest and Random Forest variants achieved competitive performance in distributed processing environments [49]. Similarly, Binsawad revisited perceptron-style learning through an adaptive voted perceptron model and reported improvements in classification error and recall compared with several traditional baselines [57].
Ensemble pipelines with systematically tuned hyperparameters frequently achieve higher sensitivity and specificity than their untuned counterparts, highlighting the importance of hyperparameter optimization even for classical models [
58]. Cost-sensitive learning strategies and calibrated probability outputs have also been explored to address the extreme class imbalance typically observed in financial fraud datasets [
2,
3].
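The cost-sensitive idea can be illustrated with a decision threshold derived from misclassification costs. The following is a minimal sketch, not the procedure of any cited study: it assumes the classifier outputs calibrated fraud probabilities, and the function names and cost values are hypothetical.

```python
def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Threshold on a calibrated fraud probability p that minimizes expected cost.

    Flagging a transaction costs cost_fp * (1 - p) in expectation, while
    letting it pass costs cost_fn * p. Flagging is therefore cheaper
    whenever p > cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def flag_transactions(probabilities, cost_fp, cost_fn):
    """Return True for every transaction whose probability exceeds the threshold."""
    threshold = cost_optimal_threshold(cost_fp, cost_fn)
    return [p > threshold for p in probabilities]


# Hypothetical costs: a false alarm costs 1 unit, a missed fraud costs 99 units.
decisions = flag_transactions([0.005, 0.02, 0.5], cost_fp=1, cost_fn=99)
```

With these illustrative costs the threshold falls to 0.01, far below the conventional 0.5, which is why calibration matters: an uncalibrated score would make such a low threshold meaningless.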
Despite their advantages, classical ML models often rely heavily on feature engineering to capture complex behavioral patterns. While they can perform well on carefully engineered tabular features, they may struggle to directly model temporal transaction sequences or relational structures without additional preprocessing [
4,
5]. As a result, many fraud detection pipelines incorporate extensive feature engineering processes to encode behavioral signals such as transaction frequency, spending patterns, and account-level aggregation statistics.
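As a rough illustration of such behavioral encodings, the sketch below derives two account-level features from a time-ordered transaction list: the number of recent transactions by the same account and the ratio of the current amount to the account's historical mean. The record layout, the one-hour window, and all names are assumptions made for this example.

```python
from collections import defaultdict, deque


def add_behavioral_features(transactions, window_seconds=3600):
    """Compute simple per-transaction behavioral features.

    transactions: list of (account_id, timestamp_seconds, amount),
    sorted by timestamp. Only information available before each
    transaction is used, so the features do not leak the future.
    """
    recent = defaultdict(deque)             # account -> timestamps inside the window
    totals = defaultdict(lambda: [0, 0.0])  # account -> [count, sum of amounts]
    rows = []
    for account, ts, amount in transactions:
        window = recent[account]
        # Drop timestamps that have fallen out of the look-back window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        count, total = totals[account]
        mean_prior = total / count if count else None
        rows.append({
            "account": account,
            "txn_count_in_window": len(window),
            "amount_to_mean_ratio": amount / mean_prior if mean_prior else None,
        })
        window.append(ts)
        totals[account][0] += 1
        totals[account][1] += amount
    return rows


example = [("a", 0, 10.0), ("a", 100, 10.0), ("a", 200, 100.0),
           ("b", 250, 5.0), ("a", 5000, 20.0)]
features = add_behavioral_features(example)
```

The third transaction of account "a" is ten times its prior mean, the kind of spending-pattern deviation that a tabular classifier can exploit once it is exposed as a feature.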
Across diverse financial domains, including payment systems, banking operations, insurance claims, and enterprise reporting, many studies follow a similar experimental workflow. Typical pipelines include data preprocessing and normalization, optional feature selection or dimensionality reduction, and evaluation of multiple baseline classifiers such as LR, SVMs, decision trees, RFs, and GB. For example, comparative evaluations for financial statement fraud detection have examined the predictive power of various classical algorithms applied to accounting indicators [
38]. Other works investigate optimization-driven boosting pipelines designed for real-time financial analytics [
48], while stacking-style ensemble architectures combine multiple base learners to improve predictive performance [
59].
When combined with XAI techniques such as SHAP or other feature attribution methods, classical ML models can also provide analyst-friendly explanations for fraud decisions. This interpretability is particularly important in financial institutions where regulatory compliance and auditability are critical requirements [
22,
24,
39].
In addition to supervised classification models, unsupervised approaches are often used as complementary baselines in fraud detection pipelines. Clustering algorithms such as k-means, density-based clustering, and hierarchical clustering can help identify anomalous transaction groups in settings where labeled fraud data is scarce. Comparative studies have examined the effectiveness of such clustering-based exploratory tools for identifying suspicious patterns prior to supervised scoring [
42,
60,
61].
Even when the final predictive model is a standard classifier, many fraud detection studies emphasize the importance of domain-driven feature design. Transaction attributes such as payment type, transaction amount, and temporal patterns often provide valuable signals for fraud detection. Behavioral analyses can help identify transaction categories associated with elevated fraud risk and inform feature engineering for downstream ML models [
62]. In financial statement fraud detection, studies frequently integrate accounting-based indicators and compare multiple machine learning algorithms to identify robust predictors [
38,
39].
Several survey-level analyses also highlight the need for improved experimental transparency, standardized datasets, and more realistic evaluation protocols in fraud detection research [
63]. These recommendations reflect broader concerns regarding reproducibility and the difficulty of comparing results across heterogeneous datasets and experimental settings.
Beyond conventional feature engineering, some studies explore alternative decision frameworks designed to address uncertainty and data quality issues. For example, neutrosophic and soft-set-based feature selection methods have been proposed to handle incomplete or inconsistent tabular signals, illustrating efforts to strengthen classical pipelines under imperfect data conditions [
64,
65].
Risk-oriented financial indicators also remain useful as input features for machine learning models. In consumer banking contexts, value-at-risk style metrics can be used to model the distributional properties of transaction values and align fraud detection predictions with potential financial losses [
66]. In insurance analytics, anomaly detection approaches based on volatility measures and distributional risk statistics have been proposed as interpretable baselines that complement machine learning classifiers [
43].
5.1. Ensemble Learning Approaches
Ensemble learning is widely adopted in fraud detection systems due to its ability to combine multiple weak or heterogeneous classifiers, thereby enhancing overall predictive performance. Methods such as RFs and GB are particularly effective in processing noisy, high-dimensional tabular data and reducing model variance when dealing with severely imbalanced fraud datasets [
3]. Recent studies heavily emphasize the robustness of these frameworks, particularly stacking architectures [
67], which allow practitioners to integrate strong tabular classifiers with specialized feature extraction modules without requiring complex end-to-end models. For instance, combining distinct ensemble classifiers and Multi-Layer Perceptrons (MLPs) within a stacking framework has been reported to improve detection accuracy while reducing sensitivity to class imbalance [
68,
69,
70]. These stacking approaches are especially relevant in enterprise fraud scenarios, such as financial statement analysis, where they can integrate diverse feature groups, including accounting indicators, governance variables, and textual disclosures. Liu et al. showed that stacking-based models significantly outperform individual learners in identifying fraudulent financial reporting [
59]. Furthermore, complementing these supervised ensemble techniques with unsupervised learning methodologies offers significant advantages, creating hybrid systems that better adapt to dynamic and continuously evolving financial environments [
71].
5.2. Domain Risk Indicators and Scorecard Features
In many regulatory and auditing workflows, interpretable risk indices remain an important component of fraud detection pipelines. Financial analysts often rely on established scoring models that summarize company financial health or earnings manipulation risk. These domain-driven indicators can be incorporated as structured input features for ML classifiers. For example, Ozari et al. investigated the integration of classical financial risk metrics such as the Altman Z-score [
72] and Beneish M-score [
73] with a Random Forest classifier. Their analysis showed that combining traditional financial risk indicators with ML models can improve fraud detection performance while maintaining interpretability for auditors and regulators [
39]. The use of scorecard-style features illustrates a broader trend in fraud detection systems: combining domain knowledge with data-driven ML models to achieve robust and explainable predictions in real-world financial environments.
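To make the scorecard idea concrete, the sketch below computes the Altman Z-score using the coefficients of Altman's original 1968 formulation for publicly traded manufacturing firms and returns it alongside its component ratios as classifier inputs. Which variant of the score the cited study used is not stated here, so the choice of the original formulation, the function name, and the feature names are assumptions.

```python
def altman_z_features(working_capital, retained_earnings, ebit,
                      market_value_equity, sales,
                      total_assets, total_liabilities):
    """Return the Altman (1968) Z-score and its five ratios as a feature dict.

    Assumption: original coefficients for public manufacturing firms;
    later variants for private or non-manufacturing firms differ.
    """
    if total_assets <= 0 or total_liabilities <= 0:
        raise ValueError("total assets and total liabilities must be positive")
    x1 = working_capital / total_assets
    x2 = retained_earnings / total_assets
    x3 = ebit / total_assets
    x4 = market_value_equity / total_liabilities
    x5 = sales / total_assets
    z = 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5
    return {"x1": x1, "x2": x2, "x3": x3, "x4": x4, "x5": x5, "altman_z": z}


# Hypothetical firm; the figures are illustrative only.
firm = altman_z_features(working_capital=200, retained_earnings=300, ebit=100,
                         market_value_equity=600, sales=1000,
                         total_assets=1000, total_liabilities=400)
```

Passing both the composite score and its components to a tree-based model lets the classifier use the established index while remaining free to reweight the underlying ratios.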
While traditional ML models remain widely used due to their interpretability and efficiency, recent advances in computational power and data availability have enabled the adoption of DL architectures for fraud detection tasks.
6. Deep Learning for Tabular and Sequential Fraud Data
Deep Learning techniques have become increasingly important in financial fraud detection due to their ability to learn complex nonlinear relationships and temporal patterns directly from large volumes of transactional data. In contrast to traditional ML models that often rely heavily on manual feature engineering, deep architectures can automatically extract hierarchical representations from raw or minimally processed data. This capability is particularly valuable in fraud detection, where transaction streams exhibit temporal dependencies, behavioral patterns, and high-dimensional feature spaces [
4,
5].
Recent research therefore focuses on architectures capable of modeling both tabular transaction attributes and sequential behavioral information. These approaches include CNNs and RNNs, representation-learning frameworks based on autoencoders and capsules, generative models for anomaly detection, and transformer-based architectures that capture long-range dependencies in transaction sequences.
6.1. CNNs, RNNs, and Hybrid Temporal Models
CNNs and RNNs represent some of the earliest DL approaches applied to financial fraud detection. CNNs are typically used to capture local feature interactions within transaction attributes, whereas RNN-based models, such as LSTM networks, are designed to capture temporal dependencies across sequences of transactions. When combined with generative components, such models can also synthesize high-quality minority-class samples, which has been reported to significantly improve detection metrics on highly imbalanced datasets [
74].
Deep models are often introduced to reduce manual feature engineering and to model nonlinear temporal dependencies in transaction streams. Li et al. operationalize a theory of causal-temporal asynchrony by using parallel LSTMs for chronic signals and CNNs/FNN components for acute signals, subsequently fusing the resulting probabilities at the decision level [
15]. Such decoupling-fusion architectures reflect a broader design trend in fraud detection systems: distinct evidence streams (e.g., short-term behavioral signals versus long-term historical patterns) are encoded independently and combined at a later stage to produce final risk scores.
Operational fraud detection systems must also satisfy strict latency constraints, often requiring millisecond-scale scoring. Abd-Ellatif et al. propose an adaptive CNN-LSTM framework that combines convolutional feature extraction with sequential modeling, achieving low-latency inference while emphasizing the need to balance fraud detection accuracy with false-positive control in real-time environments [
40]. Similar hybrid CNN-LSTM pipelines designed for large-scale transaction streams report strong precision and recall while maintaining high throughput under big-data conditions [
75].
Another practical challenge concerns the portability of fraud detection models across different institutions, financial products, or geographic regions. Domain shifts between datasets can significantly degrade performance if models are trained on a single environment. Dubey et al. address this problem by proposing a unified transformer-based architecture augmented with knowledge distillation and symbolic reasoning components, enabling knowledge transfer across heterogeneous benchmark datasets while preserving interpretable decision support [
17]. More generally, recent surveys emphasize that cross-domain generalization should be explicitly evaluated in fraud detection research rather than assumed based on results from a single dataset [
36,
63].
6.2. Representation Learning with Autoencoders and Capsule Networks
Beyond standard convolutional and recurrent architectures, several studies focus on robust representation learning to address the challenges of class imbalance, noisy labels, and high-dimensional feature spaces. Autoencoders are frequently used as unsupervised or semi-supervised feature extractors that compress transaction data into informative latent representations. These learned representations can then be used as inputs to classical ML classifiers or ensemble models. Mia et al. compare traditional ML models, deep learning architectures, and variational quantum approaches under extreme class imbalance, demonstrating that autoencoder-based feature extraction combined with ensemble learning techniques such as bagging and boosting can significantly improve detection performance when appropriate resampling strategies are applied [
29].
Other studies extend representation learning with more specialized architectures. Sathe and Shinde combine SMOTE-based resampling, metaheuristic feature selection, and a deep convolutional capsule autoencoder for fraud classification [
76]. Capsule networks are designed to capture hierarchical relationships between features, which may improve robustness in highly imbalanced or noisy datasets.
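The SMOTE step mentioned above can be sketched in a few lines. The version below is a simplified illustration of the basic interpolation idea only (a synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbors); it is not the pipeline of the cited work, and the function name and parameters are hypothetical.

```python
import random


def smote_samples(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats), at least two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


fraud_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_samples(fraud_points, n_new=5)
```

Because every synthetic point lies between two existing fraud samples, resampling should be applied to the training split only; applying it before the train/test split would leak information into evaluation.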
More complex architectures integrate multiple representation-learning mechanisms. Shi et al. propose a regularized memory graph attention capsule network that combines bidirectional LSTM with graph attention layers and capsule components, demonstrating strong performance on the Credit Card Fraud 2023 and IEEE-CIS datasets [
47]. In related financial fraud contexts, stacked autoencoder frameworks combined with dynamic ensemble selection have also demonstrated improvements over baseline classifiers by providing richer latent feature representations for downstream decision models [
41].
6.3. Generative Models and Anomaly Detection
Generative DL models have also been explored as tools for addressing data imbalance and detecting anomalous transaction patterns. In fraud detection, generative models are often used either to model the distribution of legitimate transactions or to synthesize additional minority-class samples that improve classifier training.
Generative Adversarial Networks (GANs) are particularly popular in this context. Zhu et al. propose a GAN-based anomaly detection framework that combines dilated convolutional layers with adversarial learning to detect abnormal transaction behaviors [
77]. By modeling the distribution of normal transactions, the system can identify deviations that may indicate fraudulent activity.
Other studies focus on improving computational efficiency while maintaining detection performance. Yang et al. introduce a lightweight symmetrical GAN-CNN fusion architecture designed for high-volume transaction environments, emphasizing reduced computational cost and improved recognition of complex imbalanced patterns [
78]. These generative approaches are closely related to broader representation-learning strategies such as contrastive learning and graph-temporal modeling, which aim to stabilize training and improve generalization in highly imbalanced datasets [
19].
6.4. Transformer and Graph-Temporal Architectures
Transformer-based architectures have recently emerged as powerful models for sequential financial data. Originally developed for natural language processing, transformers use self-attention mechanisms to capture long-range dependencies within sequences. This capability makes them well suited for modeling transaction histories and behavioral patterns over extended time horizons.
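The self-attention mechanism can be illustrated with a deliberately stripped-down sketch. It uses scaled dot-product attention over a short sequence of transaction embeddings, with identity projections (queries, keys, and values all equal the inputs) in place of the learned projection matrices, multiple heads, and positional encodings of a real transformer. The embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Single-head scaled dot-product self-attention with Q = K = V = embeddings.

    Returns (outputs, weights); weights[i][j] is how strongly position i
    attends to position j.
    """
    dim = len(embeddings[0])
    outputs, weights = [], []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in embeddings]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * value[i] for w, value in zip(row, embeddings))
                        for i in range(dim)])
    return outputs, weights


# Three hypothetical transaction embeddings; the first two are similar.
sequence = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
attended, attention = self_attention(sequence)
```

Each output mixes information from every position in the sequence in a single step, which is what allows dependencies between distant transactions to be captured without recurrence.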
Julius et al. propose a graph-temporal contrastive transformer that jointly models transaction behavior over time and the relational structure of financial networks [
19]. By integrating graph information with sequential modeling, the approach combines the advantages of GNNs with transformer-based sequence encoders. Contrastive learning objectives further improve representation quality under highly imbalanced conditions.
Recent work also emphasizes practical deployment considerations such as model efficiency and inference latency. Lightweight architectures, including GAN-CNN fusion models, seek to maintain competitive fraud detection performance while reducing parameter counts and computational requirements [
78]. In operational environments where millions of transactions must be evaluated in real time, reporting inference latency and throughput has become as important as traditional evaluation metrics such as precision, recall, or Area Under the Curve (AUC) [
20,
40].
Finally, transformer-based encoders are increasingly being integrated with interpretable reasoning components to balance predictive performance with transparency and regulatory compliance. For instance, knowledge-distilled transformer frameworks combined with symbolic reasoning modules aim to provide both accurate sequence modeling and explainable decision support across heterogeneous financial datasets [
17].
7. Graph Learning and Network-Centric Fraud Detection
Many forms of financial fraud involve coordinated activity across multiple entities, such as accounts, devices, merchants, and payment channels. These interactions naturally form relational networks that cannot be fully captured using independent feature vectors. Graph-based fraud detection therefore aims to leverage relational evidence, including shared devices, counterparties, transaction paths, and behavioral connectivity, by representing financial ecosystems as graphs. In such representations, nodes correspond to entities (e.g., accounts or users) and edges represent transactional or behavioral relationships. Graph learning methods have become increasingly important in fraud detection because they can exploit neighborhood structure, higher-order connectivity patterns, and community-level behavior. These properties are particularly useful for detecting organized fraudulent rings, money laundering schemes, and coordinated attack campaigns where individual transactions may appear legitimate but collective behavior reveals suspicious patterns.
7.1. GNNs for Relational Fraud Signals
GNNs extend DL to graph-structured data by propagating information across neighboring nodes. In fraud detection settings, GNNs allow a model to incorporate signals from related entities, enabling the detection of suspicious communities and relational anomalies.
Recent work has explored interpretable and imbalance-aware graph learning architectures. Lu et al. propose an interpretable spectral graph learning framework that integrates adaptive filters, dynamic neighborhood sampling, and contrastive learning objectives to strengthen minority-class representations and improve robustness under extreme class imbalance [
28]. Such approaches aim to stabilize learning when fraudulent nodes are rare and relational signals may be noisy.
Graph-based approaches are particularly well suited for fraud scenarios involving coordinated actors. By aggregating neighborhood information, GNNs can detect patterns such as shared infrastructure, circular transaction flows, or anomalous connectivity structures that are difficult to identify using purely tabular models.
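The neighborhood aggregation described in this subsection can be sketched as a single round of mean message passing. This is a minimal illustration under simplifying assumptions: it omits the learned weight matrices and nonlinearities of an actual GNN layer, and the mixing coefficient `self_weight` is a hypothetical parameter.

```python
def mean_aggregate(features, edges, self_weight=0.5):
    """One round of neighbor-mean message passing on an undirected graph.

    features: dict mapping node -> list of floats.
    edges: iterable of (u, v) pairs between nodes present in features.
    Each node's new vector blends its own features with the mean of
    its neighbors' features.
    """
    neighbors = {node: set() for node in features}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    updated = {}
    for node, own in features.items():
        nbrs = neighbors[node]
        if nbrs:
            mean = [sum(features[n][i] for n in nbrs) / len(nbrs)
                    for i in range(len(own))]
        else:
            mean = own  # isolated nodes keep their own features
        updated[node] = [self_weight * o + (1 - self_weight) * m
                         for o, m in zip(own, mean)]
    return updated


# Hypothetical risk scores: account "a" is suspicious and touches "b" and "c".
scores = {"a": [1.0], "b": [0.0], "c": [0.0]}
propagated = mean_aggregate(scores, [("a", "b"), ("a", "c")])
```

After one round the risk signal of "a" has spread to its neighbors, while its own score has been diluted by them. The same averaging that exposes coordinated groups can therefore also wash out an isolated fraudulent node surrounded by legitimate ones.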
7.2. Robust and Imbalance-Aware Graph Learning
Operational financial graphs are often noisy and incomplete. Spurious edges may arise from benign shared infrastructure (e.g., common IP addresses or payment gateways), while missing links can obscure true fraudulent relationships. In addition, graph topology may evolve rapidly as fraudsters adapt their strategies. To address these challenges, recent studies have proposed robust sampling strategies, contrastive objectives, and self-supervised learning techniques. For example, imbalance-aware spectral filtering and contrastive representation learning have been used to improve model stability under noisy and imbalanced conditions [
28]. Such methods encourage models to learn representations that better separate fraudulent and legitimate nodes even when the minority class is extremely sparse.
Beyond standard pairwise graphs, several works explore hypergraph formulations that represent higher-order interactions among multiple entities simultaneously. Hypergraphs allow modeling of group-level relations such as shared devices among multiple accounts or coordinated merchant activity. Luo et al. introduce causal hypergraph representation learning to disentangle spurious correlations from meaningful relational signals, improving robustness in complex fraud environments [
27].
In addition to deep graph models, explicit graph-typology features can also be extracted and used as inputs to classical classifiers. Structural descriptors such as fan-in/fan-out ratios, scatter–gather patterns, and cyclic transaction paths often provide strong signals for detecting money laundering or coordinated fraudulent schemes. Rakhmetulayeva et al. demonstrate that incorporating such structural features into ML pipelines can significantly improve recall for laundering-style activity [
79].
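Structural descriptors of this kind are straightforward to compute from a transfer list. The sketch below derives fan-in and fan-out counts (distinct counterparties) and checks whether funds leaving an account can return to it within a bounded number of hops. It is an illustrative example rather than the feature set of the cited study, and the function names and the hop limit are assumptions.

```python
from collections import defaultdict


def fan_features(transfers):
    """Map each account to (fan_in, fan_out) over distinct counterparties.

    transfers: iterable of (sender, receiver) pairs.
    """
    senders_of = defaultdict(set)
    receivers_of = defaultdict(set)
    for sender, receiver in transfers:
        receivers_of[sender].add(receiver)
        senders_of[receiver].add(sender)
    nodes = set(senders_of) | set(receivers_of)
    return {n: (len(senders_of[n]), len(receivers_of[n])) for n in nodes}


def on_cycle(transfers, start, max_hops=4):
    """True if a directed path of at most max_hops transfers leads back to start."""
    outgoing = defaultdict(set)
    for sender, receiver in transfers:
        outgoing[sender].add(receiver)
    frontier = {start}
    for _ in range(max_hops):
        frontier = {nxt for node in frontier for nxt in outgoing.get(node, ())}
        if start in frontier:
            return True
        if not frontier:
            return False
    return False


# Hypothetical transfers: a -> b -> c -> a forms a cycle; d and e pay into a.
transfers = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a"), ("e", "a")]
fans = fan_features(transfers)
```

Here account "a" has a fan-in of 3 and a fan-out of 1 and also lies on a three-hop cycle, a combination that could be passed to a classical classifier as three numeric features.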
An additional issue that deserves explicit attention is heterophily. In operational fraud networks, suspicious nodes often connect predominantly to benign neighbors because fraudsters deliberately route transactions through legitimate-looking accounts, merchants, or devices. Under these conditions, naive neighborhood averaging can blur discriminative signals and degrade predictive performance. Xu et al. argue that graph-based fraud detection should be revisited from the perspective of heterophily and graph spectrum, showing that the relational patterns encountered in fraud graphs differ from the homophilic assumptions behind many conventional GNN pipelines [
13]. More broadly, the heterophilic graph learning literature emphasizes that effective architectures should preserve ego-node information, combine information from multiple propagation regimes, and reduce oversmoothing across dissimilar neighborhoods [
14]. For fraud detection, this suggests that commonly used architectures such as vanilla GCN- or GraphSAGE-style models may be fragile when local neighborhoods are dominated by legitimate nodes, whereas attention-based, spectral, adaptive-propagation, or relation-specific aggregation schemes are more promising because they can selectively weight neighbors and retain class-discriminative information in heterophilous settings [
10,
12,
13,
14].
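The degree of heterophily in a labeled fraud graph can be checked before choosing an architecture. The sketch below computes the edge homophily ratio (the fraction of edges joining same-label nodes) and the average share of fraudulent neighbors around fraud nodes. Both functions and the toy graph are illustrative assumptions, not measures taken from the cited works.

```python
from collections import defaultdict


def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    edges = list(edges)
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


def fraud_neighbor_share(edges, labels, fraud_label=1):
    """Average share of fraud neighbors, taken over fraud nodes with neighbors."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    shares = [sum(labels[n] == fraud_label for n in nbrs) / len(nbrs)
              for node, nbrs in neighbors.items()
              if labels[node] == fraud_label]
    return sum(shares) / len(shares) if shares else None


# Hypothetical graph: fraud nodes f1 and f2 mostly touch benign accounts.
labels = {"f1": 1, "f2": 1, "a": 0, "b": 0, "c": 0}
edges = [("f1", "a"), ("f1", "b"), ("f2", "c"), ("a", "b"), ("f1", "f2")]
homophily = edge_homophily(edges, labels)
share = fraud_neighbor_share(edges, labels)
```

A low fraud-neighbor share indicates that plain neighborhood averaging would pull fraud-node representations toward the benign majority, which is the failure mode discussed above.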
7.3. Temporal and Streaming Transaction Graphs
Financial transaction networks are inherently dynamic. Fraudulent campaigns often occur over short time windows, while long-term behavioral patterns may evolve gradually. Capturing these temporal dynamics is therefore crucial for effective detection. Recent research has focused on graph models that explicitly incorporate temporal information. Guang et al. propose a multi-temporal partitioned graph attention network that constructs graphs at multiple temporal granularities and uses attention-based neighborhood aggregation to improve robustness to irrelevant neighbors [
12]. By analyzing transaction relationships across different time scales, the model can capture both short-lived attack campaigns and longer-term behavioral trends.
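The graph construction step behind such multi-granularity models can be sketched as follows. The example only partitions transactions into snapshot graphs at several window sizes; it does not reproduce the attention network of the cited work, and the window lengths and names are hypothetical.

```python
from collections import defaultdict


def partition_edges(transactions, window_seconds):
    """Group transactions into consecutive, non-overlapping time windows.

    transactions: iterable of (sender, receiver, timestamp_seconds).
    Returns a dict mapping window index -> list of (sender, receiver) edges.
    """
    snapshots = defaultdict(list)
    for sender, receiver, ts in transactions:
        snapshots[int(ts // window_seconds)].append((sender, receiver))
    return dict(snapshots)


def multi_granularity_snapshots(transactions, windows=(3600, 86400)):
    """Build one snapshot series per window size (here hourly and daily)."""
    transactions = list(transactions)
    return {w: partition_edges(transactions, w) for w in windows}


# Hypothetical transfers with timestamps in seconds.
txns = [("a", "b", 10), ("b", "c", 4000), ("c", "a", 90000)]
graphs = multi_granularity_snapshots(txns)
```

The hourly series separates the first two transfers into different snapshots, whereas the daily series places them in the same graph, so a downstream model sees both the burst-level and the longer-term connectivity.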
Hybrid approaches that combine temporal sequence modeling with graph learning are also emerging. For instance, RL frameworks have been integrated with graph attention networks and community mining techniques to improve the detection of clustered fraud patterns under class imbalance [
20]. These systems dynamically adapt to evolving fraudulent strategies while leveraging relational structure within transaction networks.
Operational deployments must also satisfy strict latency requirements. Real-time graph-based systems therefore focus on efficient neighborhood sampling, incremental graph updates, and streaming inference pipelines. Wang et al. demonstrate that graph architectures can be optimized for low-latency transaction flow analysis while maintaining strong detection performance in high-throughput environments [
41].
7.4. Heterogeneous and Multi-Relational Graph Models
Real-world financial networks are rarely homogeneous. Instead, they typically involve multiple types of entities and relationships, including users, merchants, devices, IP addresses, and geographic locations. Modeling these heterogeneous interactions requires graph representations that explicitly capture typed nodes and edges. Metapath-based GNNs have been proposed to address this challenge. Qian and Tong introduce metapath-guided GNNs that construct subgraphs along predefined semantic paths and use attention mechanisms to aggregate relational information while mitigating category imbalance [
10]. These metapaths capture meaningful relational patterns such as user–device–merchant interactions or account–transaction–account flows.
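Following a metapath amounts to composing typed edge sets. The sketch below traces the user, device, merchant path mentioned above to find which merchants each user can reach through the devices it has used. It illustrates only the metapath traversal, not the attention-based aggregation of the cited model; the function name and identifiers are hypothetical.

```python
from collections import defaultdict


def follow_metapath(first_hop, second_hop):
    """Compose two typed edge lists along a metapath A -> B -> C.

    first_hop: iterable of (a, b) edges, e.g. user -> device.
    second_hop: iterable of (b, c) edges, e.g. device -> merchant.
    Returns a dict mapping each A node to the set of reachable C nodes.
    """
    step = defaultdict(set)
    for b, c in second_hop:
        step[b].add(c)
    reach = defaultdict(set)
    for a, b in first_hop:
        reach[a] |= step.get(b, set())
    return dict(reach)


# Hypothetical typed edges.
user_device = [("u1", "d1"), ("u2", "d1"), ("u3", "d2")]
device_merchant = [("d1", "m1"), ("d2", "m2"), ("d1", "m3")]
user_merchants = follow_metapath(user_device, device_merchant)
```

Users "u1" and "u2" end up with identical merchant sets because they share device "d1", which is the kind of indirect link that a metapath-induced subgraph exposes to the aggregation layer.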
Building on this idea, Tong et al. propose adaptive metagraph NNs that automatically discover useful relational structures through metagraph search rather than relying on manually defined patterns [
11]. Such approaches aim to improve generalization across different financial ecosystems by learning task-relevant relational schemas directly from data.
Overall, recent graph-based fraud detection research spans a rich design space defined by three key dimensions. First, graph construction strategies may vary from static transaction graphs to temporal or multi-layer graph representations. Second, neighborhood aggregation mechanisms may rely on spectral filtering, attention-based propagation, or global–local representation learning. Third, imbalance handling strategies include sampling, reweighting, and self-supervised objectives such as contrastive learning. These design choices collectively shape the performance and robustness of modern graph-based fraud detection systems [
19,
20].
8. Multimodal and Hybrid Models
Modern fraud detection systems increasingly rely on multimodal learning frameworks that integrate heterogeneous sources of information. Financial fraud rarely manifests through a single data modality; instead, it emerges from the interaction of multiple signals such as transaction metadata, user identity attributes, device fingerprints, location information, behavioral histories, and textual disclosures. As a result, multimodal architectures have become an important research direction in fraud detection, enabling models to capture richer contextual information and complex relationships between entities [
1,
4,
5].
Traditional ML models often operate on structured tabular features, but recent advances in representation learning allow the integration of multiple modalities such as text, behavioral sequences, and relational signals. Multimodal architectures can combine structured financial indicators with unstructured information sources, including corporate disclosures, transaction narratives, and behavioral logs. These approaches are particularly relevant in large-scale financial ecosystems where fraud risk emerges from interactions across accounts, devices, and institutional entities.
8.1. Intent-Aware Relational Modeling
An important direction in multimodal fraud detection focuses on modeling relational interactions between entities involved in a transaction. Rather than analyzing accounts in isolation, modern architectures explicitly capture relationships between senders, receivers, and counterparties. For example, Wang and Kang propose an intent-aware multi-source hybrid attention framework that integrates sender and receiver profiles with transaction-level metadata. Their model applies attention mechanisms to capture asymmetric interactions between entities while jointly predicting fraud risk and capital flow dynamics [
44]. Such relational modeling reflects operational fraud scenarios where suspicious behavior often arises from coordinated interactions rather than isolated transactions. These approaches are conceptually related to graph-based fraud detection systems, which model transaction networks and relational dependencies between accounts [
7]. Intent-aware attention architectures can therefore be viewed as a complementary paradigm that combines relational reasoning with multimodal feature integration.
8.2. Text–Financial Feature Fusion
Fraud detection in enterprise and corporate reporting contexts often requires combining numerical financial indicators with unstructured textual information. Corporate disclosures, annual reports, and management statements contain semantic signals that may reveal manipulation patterns or inconsistencies. Liu et al. demonstrate this approach by integrating latent semantic representations extracted from annual report text with traditional accounting indicators. Their framework employs ensemble methods, including stacking architectures, to improve enterprise-level fraud identification [
59]. Similar approaches combine textual embeddings with financial ratios to enhance predictive performance while enabling downstream econometric analysis of fraud risk factors [
38,
39]. Text–financial feature fusion illustrates a broader multimodal paradigm in fraud detection: combining structured numerical indicators with high-dimensional semantic representations derived from natural language processing pipelines.
8.3. Large Language Models for Fraud Analysis
Large Language Models (LLMs) are emerging as a distinct research direction in fraud detection because they can process unstructured evidence such as transaction narratives, customer communications, policy documents, suspicious activity reports, and investigator notes within a unified reasoning framework. In contrast to conventional tabular classifiers, LLM-based systems can support case summarization, alert triage, explanation generation, and retrieval-assisted analyst workflows, making them particularly attractive in settings where fraud decisions depend on both structured transaction attributes and textual context [
31,
80].
At the same time, direct use of LLMs for core transaction classification remains challenging. Financial fraud datasets are typically highly structured, high-dimensional, and extremely imbalanced, whereas LLMs are pretrained primarily on natural language and may struggle with dense numerical features and long tabular inputs. Recent work therefore suggests that LLMs are most effective when used in hybrid pipelines rather than as standalone replacements for specialized fraud classifiers. Hacini et al. combine an LLM encoder with reinforcement learning to optimize operational fraud decisions under asymmetric costs [
31], while more recent work on structured financial data shows that retrieval-augmented and feature-reduced prompting strategies can substantially improve LLM-based fraud analysis, although specialized tabular models often remain stronger for pure classification accuracy [
81]. Overall, the current literature suggests that LLMs are best viewed as assistive and multimodal components that complement graph, tabular, and cost-sensitive learning systems rather than replace them outright.
8.4. Temporal and Causal Multimodal Fusion
Beyond simple feature concatenation, recent research explores more principled strategies for multimodal fusion that reflect the temporal and causal structure of fraudulent processes. In particular, fraud motives and actions may occur at different temporal scales. Li et al. [
15] introduce a framework that models this phenomenon as causal–temporal asynchrony. Their system separates long-term behavioral signals (representing latent intent) from short-term transactional events (representing concrete actions) and integrates asynchronous probability outputs from LSTM, CNNs, and feed-forward neural networks. This architecture provides a useful design pattern for multimodal fraud detection pipelines: heterogeneous evidence streams can be processed by specialized models and fused at the decision level, enabling more faithful representations of the underlying data-generating process.
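Decision-level fusion of this kind can be sketched as a weighted combination of per-stream probabilities. The fusion rule used in the cited framework is not specified here, so the weighted average below, the stream names, and the weights are all assumptions chosen for illustration.

```python
def fuse_probabilities(probabilities, weights):
    """Weighted average of fraud probabilities from the streams that reported.

    probabilities: dict mapping stream name -> probability in [0, 1];
    streams that have not produced an output yet are simply absent.
    weights: dict mapping stream name -> non-negative weight.
    Weights are renormalized over the available streams.
    """
    if not probabilities:
        raise ValueError("at least one stream must report a probability")
    total = sum(weights[name] for name in probabilities)
    if total <= 0:
        raise ValueError("available streams must have positive total weight")
    return sum(weights[name] * p for name, p in probabilities.items()) / total


# Hypothetical streams: a long-term behavioral model and a short-term event model.
stream_weights = {"long_term": 1.0, "short_term": 3.0}
both = fuse_probabilities({"long_term": 0.2, "short_term": 0.8}, stream_weights)
only_short = fuse_probabilities({"short_term": 0.8}, stream_weights)
```

Renormalizing over the available streams lets the fused score be produced even when the slower, long-horizon model has not yet updated, which is one simple way to accommodate asynchronous outputs.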
8.5. Multimodal Systems with Interpretability and Governance Signals
Recent studies also explore the integration of organizational and governance-related information into fraud detection systems [
82]. In corporate fraud detection, governance indicators such as board composition, executive incentives, and ownership structure can provide additional predictive signals beyond financial metrics. Li et al. demonstrate that incorporating governance indicators alongside financial ratios and textual signals can improve corporate fraud identification performance [
82]. These results highlight the importance of contextual information when analyzing complex organizational fraud patterns.
In corporate and enterprise environments, qualitative factors such as managers’ abnormal tone and disclosure change trajectories provide critical semantic signals for financial fraud detection [
83,
84]. Moreover, incorporating corporate governance indicators and analyzing fraud learning cycles enhances the predictive power and contextual awareness of machine learning models deployed for listed firms [
85,
86].
Another emerging direction involves combining high-capacity deep encoders with interpretable reasoning layers. For instance, hybrid transformer-based architectures augmented with symbolic reasoning modules have been proposed to improve interpretability and decision transparency in multimodal fraud detection pipelines [
17]. Such hybrid models attempt to balance predictive power with explainability, which is essential for regulatory compliance and financial auditing.
In operational settings, multimodal fraud detection systems are frequently combined with explainability and privacy-preserving mechanisms. XAI techniques enable analysts to understand how different signals contribute to fraud predictions, while privacy-aware frameworks restrict the sharing of sensitive financial data across institutions [
24,
26,
87]. These considerations emphasize that multimodal architectures should be viewed not only as modeling techniques but also as system-level design choices that must satisfy constraints related to interpretability, privacy, and deployment latency.
8.6. Cross-Domain and Relational Expansions
Recent advances in multimodal deep learning and relational graph integration have further strengthened fraud detection across diverse digital economy platforms, enabling holistic analysis of complex transactional networks [
88,
89]. Visual analytics combined with these models has also shown promise in money laundering domains [
90]. Additionally, the expansion of AI architectures into cross-domain sectors, such as healthcare fraud, highlights the versatility and adaptability of these data-driven paradigms [
91].
Table 3 provides a comparative analysis of the primary machine learning and deep learning architectures utilized in financial fraud detection, summarizing their respective advantages, limitations, and optimal use cases.
9. Cost-Sensitive Learning and Reinforcement Learning
In real-world fraud detection systems, predictive performance cannot be evaluated solely with traditional metrics such as accuracy. Financial institutions typically operate under asymmetric cost structures, where the cost of failing to detect fraudulent activity (false negatives) is significantly higher than the cost of incorrectly flagging legitimate transactions (false positives). As a result, modern fraud detection research increasingly focuses on cost-sensitive learning frameworks that explicitly incorporate financial loss functions and operational constraints into model design. Cost-sensitive approaches adjust training objectives or decision thresholds to reflect the economic impact of different prediction errors. These methods are particularly important in highly imbalanced fraud datasets, where fraudulent transactions constitute only a small fraction of total activity but carry disproportionately large financial consequences.
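The effect of asymmetric costs on the decision threshold can be made explicit with a short derivation. It assumes a calibrated fraud probability p, a cost c_FP for a false positive, a cost c_FN for a false negative, and zero cost for correct decisions; these symbols are introduced here for illustration only.

```latex
\begin{aligned}
\mathbb{E}[\text{cost} \mid \text{flag}]  &= (1 - p)\, c_{\mathrm{FP}}, \\
\mathbb{E}[\text{cost} \mid \text{allow}] &= p\, c_{\mathrm{FN}}, \\
\text{flag} \iff p\, c_{\mathrm{FN}} > (1 - p)\, c_{\mathrm{FP}}
  &\iff p > \frac{c_{\mathrm{FP}}}{c_{\mathrm{FP}} + c_{\mathrm{FN}}}.
\end{aligned}
```

For example, if a missed fraud is assumed to cost 50 times as much as a false alarm, the cost-minimizing threshold is 1/51, or roughly 0.02, far below the conventional 0.5.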
9.1. Cost-Sensitive Learning in Supervised Fraud Detection
A common strategy for handling asymmetric costs involves modifying supervised learning algorithms to penalize false negatives more heavily than false positives. This can be achieved through weighted loss functions, cost-sensitive sampling strategies, or specialized learning architectures. For example, Huang et al. propose a cost-sensitive cascade forest model designed specifically for fraud detection tasks with severe class imbalance. Their approach dynamically adjusts model depth while incorporating penalty terms that prioritize fraud recall. Experimental results demonstrate that appropriate cost-sensitive optimization can significantly improve recall while maintaining acceptable levels of precision [
92]. Their study also highlights the importance of preprocessing decisions, showing that missing-value handling strategies can materially influence AUC and recall outcomes.
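One simple form of a weighted loss is a class-weighted log loss, sketched below. This is a generic illustration rather than the cascade forest objective of Huang et al., and the weights `fn_weight` and `fp_weight` are hypothetical values.

```python
import math
from typing import Sequence


def weighted_log_loss(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    fn_weight: float = 10.0,  # hypothetical penalty on errors for fraud cases
    fp_weight: float = 1.0,   # hypothetical penalty on errors for legitimate cases
) -> float:
    """Mean log loss in which errors on fraud cases (y = 1) are weighted more heavily."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -fn_weight * math.log(p)
        else:
            total += -fp_weight * math.log(1.0 - p)
    return total / len(y_true)


# A confidently missed fraud is penalized about ten times more than an
# equally confident false alarm under the default weights.
missed_fraud = weighted_log_loss([1], [0.1])
false_alarm = weighted_log_loss([0], [0.9])
```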
Cost-aware learning can also be integrated into feature selection and model optimization processes. Metaheuristic feature selection pipelines frequently incorporate cost-based objective functions or penalty terms to guide the search toward features that maximize fraud detection effectiveness under imbalance conditions [
76]. Such optimization-guided pipelines demonstrate how cost-sensitive objectives can be embedded throughout the machine learning workflow rather than only at the classification stage.
Another related line of research focuses on incorporating financial risk indicators directly into fraud detection models. For instance, value-at-risk style metrics and distributional anomaly measures have been used to capture extreme financial events and tail risks in banking datasets [
66]. Similarly, volatility-based anomaly detection techniques have been explored in insurance contexts to detect rare but financially significant events [
43]. These approaches align predictive modeling with financial risk exposure and therefore support more economically meaningful fraud detection strategies.
9.2. Reinforcement Learning for Fraud Detection Policies
RL provides an alternative paradigm in which fraud detection is framed as a sequential decision-making problem rather than a static classification task. In this setting, models learn policies that optimize long-term reward functions reflecting operational objectives such as fraud recovery, investigation costs, and alert management capacity. Hacini et al. propose an RL-based fraud detection framework that combines Large Language Model (LLM) encoders with policy-gradient learning [
31]. Their system optimizes business-aligned reward functions that emphasize reducing false negatives while controlling false-positive rates. Experimental results suggest that RL-based policy optimization can improve recall without significantly increasing alert volume on benchmark datasets. RL is particularly appealing in fraud detection scenarios where decisions must be made dynamically under uncertainty. Instead of predicting fraud probabilities alone, RL models can learn adaptive policies that consider downstream consequences such as transaction blocking, manual review, or escalation procedures.
9.3. Hybrid RL–Graph and Context-Aware Detection
Recent research also explores hybrid architectures that combine reinforcement learning with relational or graph-based fraud detection models. These approaches are motivated by the networked nature of financial transactions, where fraud patterns often emerge from coordinated activity across multiple accounts. Renuga Devi et al. propose a framework that integrates graph attention mechanisms with an RL-based controller for context-aware community mining in transaction networks. Their approach dynamically adjusts detection strategies based on structural properties of transaction graphs and reports improvements in recall while reducing false positives in large-scale financial datasets [
20]. Hybrid RL–graph models illustrate a broader trend toward combining relational learning with adaptive decision policies in fraud detection pipelines. Such architectures can capture both network-level fraudulent structures and operational trade-offs between detection quality and computational cost.
9.4. Operational Constraints and Robust Fraud Detection Systems
Operational fraud detection systems must address additional challenges beyond class imbalance and cost asymmetry. These include concept drift, adversarial manipulation, evolving fraudulent strategies, and strict latency constraints in real-time transaction monitoring. Recent work emphasizes the importance of designing robust and auditable detection pipelines that incorporate multiple safeguards. For example, Al-Daoud and Abu-AlSondos propose a hybrid framework that integrates class imbalance handling, concept drift detection, adversarial training mechanisms, and explainable AI tools for deployment in live financial transaction streams [
46].
Similarly, practical fraud detection platforms must balance detection recall with alert management capacity. Excessive false positives can overwhelm human investigators and reduce system usability. Studies examining operational deployments therefore report trade-offs between recall, precision, and alert volume when evaluating real-time fraud detection systems [
40,
46].
Overall, these findings suggest that fraud detection should increasingly be viewed as an operational decision optimization problem rather than a purely predictive modeling task. Cost-sensitive objectives, RL policies, and system-level robustness mechanisms together form an emerging design paradigm for practical fraud detection architectures.
10. Evaluation Under Extreme Class Imbalance
Financial fraud detection presents a distinctive evaluation challenge due to the extremely low prevalence of fraudulent events. In many real-world financial datasets, fraudulent transactions represent significantly less than 1% of all observations. Under such conditions, conventional evaluation metrics such as classification accuracy become misleading because models can achieve high accuracy simply by predicting the majority (non-fraudulent) class.
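A small arithmetic example makes the problem concrete. The counts below are hypothetical and assume a 0.5% fraud rate with a trivial classifier that labels every transaction as legitimate.

```python
n_total = 100_000  # hypothetical number of transactions
n_fraud = 500      # hypothetical fraud count (0.5% prevalence)

# A trivial classifier that always predicts "non-fraud".
true_positives = 0
true_negatives = n_total - n_fraud

accuracy = (true_positives + true_negatives) / n_total  # 99.5% despite detecting nothing
recall = true_positives / n_fraud                        # no fraud case is caught
```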
To address this issue, the literature increasingly emphasizes evaluation metrics that better capture model performance under severe class imbalance. Commonly reported metrics include the Area Under the Receiver Operating Characteristic Curve (AUROC), the Area Under the Precision–Recall Curve (AUPRC), precision, recall, and F1-score. Many recent studies in fraud detection, particularly those involving multimodal architectures and deep learning models, report AUROC and F1-score as primary evaluation metrics [
15,
44]. However, relying exclusively on ROC-based metrics can be misleading in highly imbalanced settings.
Figure 5 conceptually illustrates why ROC-based evaluation alone may mask meaningful differences between models when fraud prevalence is extremely low. While two models may appear to perform similarly in ROC space, their precision–recall behavior can differ substantially. Because precision directly reflects the proportion of detected cases that are truly fraudulent, precision–recall curves and AUPRC are often more informative for fraud detection scenarios.
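The relationship can be checked numerically: precision at a given ROC operating point follows from the true-positive rate, the false-positive rate, and the prevalence. The two operating points and the 0.1% prevalence below are hypothetical.

```python
def precision_at(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision implied by an ROC operating point at a given fraud prevalence."""
    tp = tpr * prevalence          # expected fraction of all cases that are true positives
    fp = fpr * (1.0 - prevalence)  # expected fraction of all cases that are false positives
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


prevalence = 0.001  # hypothetical: 1 fraud per 1000 transactions

# Two models with the same recall whose false-positive rates differ by under one point.
model_a = precision_at(tpr=0.90, fpr=0.010, prevalence=prevalence)  # roughly 0.08
model_b = precision_at(tpr=0.90, fpr=0.001, prevalence=prevalence)  # roughly 0.47
```

The two points are nearly indistinguishable on an ROC plot, yet an analyst reviewing alerts from the first model would find a confirmed fraud in about one alert out of twelve, compared with almost one in two for the second.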
For reproducible and reliable evaluation, several methodological practices are increasingly recommended in the literature:
Use time-based data splits when concept drift is expected, avoiding random splits that may leak future information into the training set.
Report precision–recall curves or AUPRC in addition to AUROC to better capture performance under severe class imbalance.
Tune decision thresholds according to operational review budgets and the asymmetric costs of false positives and false negatives.
Validate score calibration when model outputs are used for downstream decision-making processes such as risk scoring or alert prioritization.
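Two of these practices, time-based splitting and budget-driven threshold selection, can be sketched as follows. The record layout (a `ts` field), the cutoff, and the review budget are hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def time_based_split(records: List[Dict], cutoff: int) -> Tuple[List[Dict], List[Dict]]:
    """Train on records strictly before the cutoff, test on records at or after it."""
    train = [r for r in records if r["ts"] < cutoff]
    test = [r for r in records if r["ts"] >= cutoff]
    return train, test


def threshold_for_budget(scores: Sequence[float], budget: int) -> float:
    """Score of the lowest-ranked alert that fits in a review budget of `budget` cases.

    Ties at the threshold can push the number of flagged cases above the budget.
    """
    if budget <= 0 or not scores:
        return float("inf")  # nothing can be reviewed, so nothing is flagged
    ranked = sorted(scores, reverse=True)
    return ranked[min(budget, len(ranked)) - 1]


records = [{"ts": t, "score": s} for t, s in
           [(1, 0.2), (2, 0.7), (3, 0.1), (4, 0.5), (5, 0.9), (6, 0.8), (7, 0.3)]]
train, test = time_based_split(records, cutoff=5)

# The threshold is chosen on past (training-period) scores, then applied to later data.
threshold = threshold_for_budget([r["score"] for r in train], budget=2)
flagged = [r for r in test if r["score"] >= threshold]
```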
Several studies in the fraud detection literature explicitly address the limitations of accuracy-based evaluation under extreme imbalance. For example, multiple works report AUROC and F1-score while incorporating imbalance mitigation strategies such as Synthetic Minority Oversampling Technique (SMOTE), synthetic sample generation, or data augmentation methods [
29,
76,
77,
78]. Other studies emphasize the trade-offs between precision and recall in real-time fraud detection environments, where operational latency and response time are critical considerations [
20,
40,
46].
To improve cross-study comparability, researchers increasingly report precision–recall-based metrics such as average precision alongside cost-aware performance objectives that account for the financial impact of missed fraud cases [
31]. In addition, some studies incorporate alternative error metrics such as mean absolute error (MAE) and root mean square error (RMSE) when evaluating probabilistic predictions or regression-based risk scoring systems [
57].
In practical deployment scenarios, evaluation metrics must also reflect operational constraints and investigation workflows. For example, studies using mobile money simulations or enterprise transaction datasets often combine predictive accuracy metrics with interpretability or auditability requirements, highlighting the need to report precision and recall at specific decision thresholds rather than accuracy alone [
24,
38].
Other research introduces alternative detection-rate metrics aligned with financial risk measures, such as value-at-risk-inspired indicators designed to capture tail-risk behavior under heavily skewed label distributions [
66]. Parameter-tuned ensemble pipelines also emphasize sensitivity–specificity trade-offs, demonstrating how hyperparameter choices can significantly alter model operating points in imbalanced environments [
58].
Table 4 summarizes the standard and cost-aware evaluation metrics employed in fraud detection research, detailing their focus and practical suitability under extreme class imbalance.
To address the need for a quantitative cross-paper synthesis, Table 5 aggregates representative benchmark results reported on widely used public datasets, including ULB, IEEE-CIS, PaySim, and Elliptic. The table is intentionally selective rather than exhaustive. Studies were included when they satisfied four criteria: they evaluated a recurrent public benchmark used across multiple fraud-detection papers; provided sufficient methodological detail to identify the model family and preprocessing setting; reported at least one commonly used discrimination or class-imbalance-aware metric or, in a small number of cases, illustrated a reporting practice that is itself methodologically important to discuss; and contributed coverage across different benchmark datasets and methodological paradigms rather than merely maximizing the number of entries. At the same time, these values should not be interpreted as a perfectly controlled head-to-head ranking because published studies often differ in preprocessing pipelines, feature engineering, class-balancing strategies, threshold selection, train/test or temporal splits, and even the subset of metrics reported.
Several patterns emerge from this synthesis. First, ULB studies often report very strong AUROC and recall values, but those gains are frequently tied to aggressive resampling or threshold tuning and therefore do not automatically transfer to richer real-world settings. In particular, the 100% accuracy reported by Mia et al. should be interpreted cautiously because it is presented after SMOTE-based preprocessing and hyperparameter tuning, while threshold-sensitive quantities such as precision, recall, F1-score, and AUPRC are not reported [
29]. In a severely imbalanced fraud setting, such an accuracy figure is therefore insufficient on its own to establish state-of-the-art fraud-detection effectiveness. Second, IEEE-CIS and PaySim studies increasingly complement AUROC with AUPRC, F1-score, or decision-aware metrics, reflecting the need to evaluate fraud detection systems under operationally relevant class imbalance. The near-perfect precision and recall reported by Hacini et al. are also best read in context: they arise from an RL-based policy formulation on the PaySim benchmark and may depend strongly on the reward structure, action policy, and evaluation environment rather than reflecting a universally reproducible ceiling for real-world fraud detection [
31]. Third, on graph benchmarks such as Elliptic, improvements in F1-score are often more informative than accuracy because the class distribution and temporal split make simple majority-class success less meaningful. Overall, the extreme values in Table 5 are retained not as definitive headline evidence of the state of the art, but as examples of how different reporting choices and experimental protocols can produce performance claims that are not directly comparable.
10.1. Interpretive Assessment of Evidence Strength
Beyond descriptive synthesis, an evaluative reading of the literature suggests that the most trustworthy experimental results are generally those reported on recurrent public benchmarks, using metrics appropriate for severe class imbalance, and in settings that have been reused by multiple independent studies. By this standard, results such as the ULB benchmark evaluation of Alfaiz and Fati, the IEEE-CIS results of Moradi et al., and the Elliptic results reported by Weber et al. and Alarab and Prakoonwit are comparatively more informative than accuracy-only claims because they report threshold-sensitive metrics on datasets that support at least partial cross-study triangulation [
7,
68,
93,
94]. Even these results should be interpreted cautiously, but they provide a stronger evidential basis than isolated headline scores reported without precision–recall-aware context.
At the level of methodological families, ensemble and tabular classifiers currently have the broadest empirical support across recurrent transactional benchmarks such as ULB, IEEE-CIS, and PaySim [
24,
68,
93]. Graph-based methods also appear reasonably trustworthy when the underlying problem is intrinsically relational, especially on Elliptic and related transaction-network settings, although their gains often depend on graph construction choices and temporal split design [
7,
10,
12,
41,
94]. By contrast, reinforcement-learning, LLM-assisted, and some multimodal or causal architectures remain promising but are still validated on comparatively narrow evidence bases, often a single simulator, benchmark, or application-specific dataset [
15,
27,
31].
Accordingly, the performance gains most likely to generalize are those that recur across multiple datasets or benchmark families, rely on precision–recall-aware or cost-aware evaluation, and do not depend primarily on aggressive oversampling, synthetic environments, or accuracy-only reporting. Claims are less likely to generalize when they are derived from a single synthetic benchmark such as PaySim, from single-institution data, or from protocols dominated by SMOTE-heavy preprocessing and post hoc threshold optimization [
29,
53,
63]. In practical terms, the current evidence base most strongly supports robust ensemble baselines and carefully evaluated graph models, whereas the strongest claims for emerging RL-, multimodal-, and LLM-assisted systems should still be regarded as provisional until broader cross-dataset validation becomes available.
10.2. Evaluation Protocols and Resampling Effects
Imbalance mitigation techniques such as oversampling, undersampling, reweighting, and generative data synthesis can significantly influence model performance and evaluation outcomes. Oversampling methods, including SMOTE and related variants, are widely used in credit-card fraud detection research to address class imbalance. However, their effectiveness depends strongly on how synthetic examples are generated and incorporated into the training process [
95].
More recently, GANs have been used to synthesize minority-class examples in highly imbalanced fraud datasets. While these approaches may improve model learning, they can also introduce artificial patterns that inflate reported performance if evaluation procedures are not carefully designed [
77,
78].
As a general methodological guideline, resampling procedures and feature selection steps should be applied strictly within training folds during cross-validation. Applying resampling before dataset splitting may inadvertently leak information from held-out observations into the training data, for example when synthetic samples are interpolated from points that later land in the test set, resulting in overly optimistic performance estimates. When temporal drift is expected, time-ordered validation strategies are preferable to random cross-validation schemes [
46].
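The fold-wise guideline can be sketched as follows. For brevity the example uses plain random oversampling (duplicating minority rows) in place of SMOTE, and a shuffled k-fold split in place of a time-ordered one; both substitutions and all function names are assumptions made for illustration.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def oversample_minority(rows: Sequence, labels: Sequence[int],
                        rng: random.Random) -> Tuple[List, List[int]]:
    """Random oversampling: duplicate fraud rows until both classes are the same size."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    if not minority or len(minority) >= len(majority):
        return list(rows), list(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = majority + minority + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


def folds_with_train_only_resampling(rows: Sequence, labels: Sequence[int], k: int,
                                     seed: int = 0) -> Iterator[Tuple[List, List[int], List, List[int]]]:
    """Yield (x_train, y_train, x_test, y_test); only the training part is resampled."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    for f in range(k):
        test_idx = order[f::k]
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        # Resampling happens after the split, so test rows never enter training.
        x_tr, y_tr = oversample_minority(
            [rows[i] for i in train_idx], [labels[i] for i in train_idx], rng)
        yield x_tr, y_tr, [rows[i] for i in test_idx], [labels[i] for i in test_idx]


rows = list(range(20))        # hypothetical transaction IDs
labels = [1, 1] + [0] * 18    # two fraud cases among twenty transactions
folds = list(folds_with_train_only_resampling(rows, labels, k=4))
```

Each test fold keeps its original class distribution, so the reported metrics reflect the prevalence the model would face in deployment.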
10.3. Operational Metrics and Cost-Aware Evaluation
In real-world financial systems, fraud detection models do not merely produce predictions; they trigger operational actions such as transaction blocking, step-up authentication, or manual investigation. Each of these actions involves different operational costs and capacity constraints.
To better reflect these operational realities, some recent studies formulate fraud detection as a sequential decision-making problem using reinforcement learning. In such frameworks, reward functions explicitly encode the asymmetric costs associated with missed fraud cases and false alarms, allowing models to optimize recall while maintaining acceptable false-positive rates [
31].
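A reward function of this kind can be written as a small lookup table. The actions and reward values below are hypothetical and serve only to show how asymmetric costs enter the objective; they are not the reward design used in the cited work.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical rewards indexed by (action, true label); label 1 denotes fraud.
REWARDS: Dict[Tuple[str, int], float] = {
    ("flag", 1): 10.0,    # fraud caught
    ("flag", 0): -1.0,    # false alarm: investigation cost and customer friction
    ("allow", 1): -50.0,  # missed fraud: direct financial loss
    ("allow", 0): 0.0,    # legitimate transaction passes through
}


def reward(action: str, is_fraud: int) -> float:
    """Immediate reward for taking `action` on a transaction with the given label."""
    return REWARDS[(action, is_fraud)]


def episode_return(actions: Sequence[str], labels: Sequence[int]) -> float:
    """Undiscounted sum of rewards over a sequence of transactions."""
    return sum(reward(a, y) for a, y in zip(actions, labels))


# One caught fraud, one correct pass, one missed fraud, one false alarm.
total = episode_return(["flag", "allow", "allow", "flag"], [1, 0, 1, 0])
```

Because a single missed fraud outweighs many false alarms under these values, a policy trained on this reward is pushed toward higher recall, which is the behavior the reward asymmetry is meant to induce.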
Graph-based community detection and graph reinforcement learning approaches similarly emphasize improvements in false-positive control and real-time latency, highlighting that evaluation should consider both detection performance and operational throughput [
20]. Cost-sensitive cascade forest models provide another perspective by incorporating penalty-weighted objectives that prioritize fraud detection recall under severe class imbalance [
92].
These developments illustrate that effective evaluation of fraud detection systems requires a broader perspective that integrates statistical performance metrics with operational constraints, financial costs, and investigative workflows.
11. Explainability, Governance, and Operational Deployment
As ML models become increasingly central to financial fraud detection systems, issues related to interpretability, governance, and operational deployment have gained growing attention in the literature. Fraud detection models are not deployed in isolation but operate within complex socio-technical environments involving analysts, auditors, compliance officers, and regulatory authorities. Thus, beyond predictive performance, models must also satisfy requirements related to transparency, accountability, privacy, and operational reliability.
Interpretability plays a crucial role in enabling auditability and regulatory compliance. Financial institutions must often justify automated decisions to regulators and investigators, especially when detection results trigger costly investigations or regulatory actions. Recent research emphasizes that explainability mechanisms should be embedded into fraud detection architectures rather than treated solely as post hoc analysis tools. For instance, graph-based fraud detection models can incorporate interpretability mechanisms such as spectral response analysis or attention-based interaction summaries to reveal how relational patterns influence model predictions [
28].
Similarly, in financial statement fraud detection, explainability analysis can connect model signals to conceptual frameworks such as the fraud triangle, thereby linking machine learning outputs to established auditing theory [
15]. These approaches illustrate how interpretable modeling can bridge the gap between statistical prediction and domain-specific reasoning.
From a deployment perspective, fraud detection systems must also address practical constraints, including latency requirements, human-in-the-loop feedback loops, privacy protection, and dataset shift. Detection models typically operate within investigative workflows where analysts validate alerts and generate new labels, creating feedback mechanisms that influence future model training. Designing models that remain stable and interpretable under such conditions remains an active research area [
46].
Recent studies also emphasize user-centered explanation design. Instead of generic explanation scores, explanations should be tailored to the needs of different stakeholders, such as fraud analysts, auditors, and regulatory authorities. For example, stakeholder-oriented explanation frameworks propose generating customized explanation reports aligned with audit procedures and compliance requirements [
22]. Explainable ensemble models have also been successfully applied in enterprise and mobile-money environments, demonstrating that predictive accuracy and interpretability can be jointly achieved in operational fraud detection systems [
24,
38].
11.1. Structured Interpretability and Financial Feature Groups
Traditional explainability techniques often focus on feature-level importance scores, which may be difficult for analysts to interpret in complex financial datasets. More recent research has therefore explored structured explanation approaches that operate at the level of financial feature groups or conceptual indicators.
For instance, Lin and Gao propose the group SHAP framework, which aggregates feature attributions across financial variable groups such as liquidity, profitability, and leverage [
96]. This approach reduces computational cost while producing explanations that are more aligned with financial due diligence practices. Group-based explanations also allow analysts to identify systematic patterns across industries and institutions, improving the practical usability of machine learning models in fraud investigations.
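The idea of reporting attributions per financial group can be illustrated by summing additive per-feature attributions within each group. This is a simplification: the group SHAP framework may compute attributions directly at the group level, which is where a computational saving would come from. The feature names, group assignments, and attribution values below are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def group_attributions(attributions: Dict[str, float],
                       feature_groups: Dict[str, str]) -> Dict[str, float]:
    """Sum additive per-feature attributions within each named feature group."""
    totals: Dict[str, float] = defaultdict(float)
    for feature, value in attributions.items():
        totals[feature_groups.get(feature, "other")] += value
    return dict(totals)


# Hypothetical per-feature attributions for one firm-year observation.
attributions = {
    "current_ratio": 0.12,
    "quick_ratio": 0.05,
    "return_on_assets": -0.20,
    "debt_to_equity": 0.30,
}
feature_groups = {
    "current_ratio": "liquidity",
    "quick_ratio": "liquidity",
    "return_on_assets": "profitability",
    "debt_to_equity": "leverage",
}

by_group = group_attributions(attributions, feature_groups)
```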
Such structured interpretability approaches complement commonly used post hoc techniques such as SHAP and LIME, which are frequently employed in ensemble learning pipelines and explainable federated learning frameworks [
24,
38]. In regulated financial environments, these tools contribute to transparency by enabling auditors to trace predictions back to meaningful financial indicators.
11.2. Explainable AI for Operational Fraud Analysis
XAI techniques are increasingly integrated into operational fraud detection systems to support analyst decision-making and regulatory oversight. In practice, explanations serve multiple purposes beyond regulatory reporting. They assist analysts in prioritizing suspicious transactions, identifying potential false positives, and diagnosing model behavior.
For example, Vijayanand and Smrithy demonstrate that combining ensemble learning methods with SHAP explanations can provide attribute-level transparency for mobile money transactions using the PaySim dataset, enabling investigators to understand the key drivers behind fraud predictions [
24]. Such approaches highlight that interpretability can be enhanced without sacrificing predictive accuracy.
Explainability can also contribute to model monitoring and debugging. Changes in feature attribution patterns may signal concept drift or emerging fraudulent strategies, allowing institutions to update detection models proactively. Therefore, explainability mechanisms increasingly play a role not only in reporting but also in the lifecycle management of fraud detection systems.
11.3. Privacy-Preserving and Distributed Fraud Detection
The growing scale of digital financial ecosystems has intensified the need for privacy-preserving fraud detection frameworks. Financial transaction data is highly sensitive, and strict regulatory frameworks limit the extent to which such data can be shared across organizations.
To address these challenges, several studies explore distributed learning architectures that enable collaborative fraud detection without exposing raw transaction data. Federated learning has emerged as a particularly promising approach, allowing multiple institutions to jointly train machine learning models while keeping data locally stored [
38]. However, deploying federated learning in high-frequency financial environments remains challenging because repeated model synchronization, secure aggregation, and client coordination can introduce substantial computational overhead and communication bottlenecks. These costs may be especially problematic when participating institutions must exchange updates under strict latency constraints or across heterogeneous infrastructure. When combined with explainability techniques such as SHAP or LIME, federated learning systems can maintain both privacy and transparency in fraud detection workflows.
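The aggregation step at the core of many federated learning systems is a size-weighted average of client parameters, in the style of federated averaging. The sketch below shows only that step, with hypothetical parameter vectors; it omits secure aggregation, communication, and local training.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by each client's local sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(n_params)
    ]


# Two hypothetical institutions; only parameters leave each site, never transactions.
bank_a_params = [1.0, 2.0]
bank_b_params = [3.0, 4.0]
global_params = federated_average([bank_a_params, bank_b_params], client_sizes=[1, 3])
```

Every training round repeats this exchange, which is the source of the synchronization and communication overhead noted above.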
In parallel, blockchain-based logging and distributed ledger technologies have been proposed to enhance governance and auditability in fraud detection systems. Blockchain infrastructures can provide tamper-resistant records of model predictions, data access events, and investigative actions, thereby creating transparent and verifiable audit trails [
87].
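The tamper-resistance property rests on hash chaining, which can be sketched without a full ledger. The example below is a single-process illustration with hypothetical event fields; it has no consensus protocol or replication and is not a blockchain implementation.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(event: Dict, prev_hash: str) -> str:
    """SHA-256 over the event together with the previous entry's hash."""
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: List[Dict], event: Dict) -> None:
    """Append an event whose hash commits to the whole preceding log."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(event, prev_hash)})


def verify(log: List[Dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(entry["event"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: List[Dict] = []
append_entry(audit_log, {"type": "prediction", "txn": "T1", "score": 0.97})
append_entry(audit_log, {"type": "analyst_action", "txn": "T1", "action": "block"})
intact = verify(audit_log)

audit_log[0]["event"]["score"] = 0.01  # simulate after-the-fact tampering
tampered_ok = verify(audit_log)
```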
Additional privacy-enhancing techniques, such as differential privacy mechanisms, can further control information disclosure while preserving analytical utility. These approaches are increasingly viewed not merely as technical enhancements but as governance mechanisms that support compliance with regulatory frameworks and cross-institution collaboration [
25,
26].
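As a minimal illustration of a differential privacy mechanism, the sketch below releases a noisy count using the Laplace mechanism. It assumes a counting query with sensitivity 1 and a hypothetical privacy budget `epsilon`; production systems would also need hardened noise sampling and budget accounting, which are omitted here.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale 1/epsilon (query sensitivity is 1)."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)
# Hypothetical query: number of confirmed fraud cases an institution shares with peers.
noisy = private_count(true_count=100, epsilon=1.0, rng=rng)
```

Smaller values of `epsilon` add more noise and therefore give stronger privacy at the expense of analytical utility, which is the trade-off described above.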
11.4. Balancing Predictive Performance and Accountability
Ultimately, effective fraud detection systems must balance predictive performance with accountability and stakeholder trust. Highly complex machine learning models may achieve strong predictive performance but can be difficult to interpret and validate in regulated financial environments.
To address this trade-off, researchers increasingly advocate layered interpretability strategies that combine model-intrinsic interpretability with post hoc explanation techniques. For example, relational models may incorporate attention mechanisms or spectral analysis tools that reveal cross-account interaction patterns, while post hoc explanation methods provide feature-level insights into individual predictions [
28,
44].
Such layered approaches enable institutions to maintain high predictive accuracy while ensuring that fraud detection systems remain transparent, auditable, and aligned with regulatory expectations. As financial fraud detection continues to evolve toward large-scale, data-driven systems, the integration of explainability, privacy protection, and governance frameworks will remain a central requirement for trustworthy deployment.
12. Open Challenges
Despite significant progress in machine learning and artificial intelligence techniques for fraud detection, several open challenges remain. The literature indicates that improving model capacity alone is rarely sufficient for reliable deployment in real-world financial systems. Fraud detection operates within complex socio-technical ecosystems that involve human investigators, audit procedures, regulatory oversight, and adversarial actors attempting to bypass detection systems. For this reason, system performance depends not only on model architecture but also on data quality, operational workflows, and feedback loops between detection systems and fraud investigators.
This section summarizes the major open challenges identified in the literature and highlights potential research directions. These challenges span multiple dimensions, including data availability, modeling strategies, operational deployment constraints, and regulatory considerations.
12.1. Data Challenges
Data-related issues remain one of the most critical barriers to reliable fraud detection systems. Financial datasets are often incomplete, noisy, and subject to delayed labeling processes. In many cases, fraud labels are only assigned after lengthy investigations, which may introduce temporal biases in both training and evaluation datasets.
Another major issue is the extreme class imbalance present in most financial fraud datasets. Fraudulent transactions typically represent a very small fraction of all transactions, often below 1% of the total data. This imbalance can significantly degrade model performance and requires specialized techniques such as cost-sensitive learning, resampling strategies, and anomaly detection approaches [
92].
Label quality and delayed confirmation of fraudulent events further complicate the learning process. Fraud labels often depend on investigation workflows rather than objective ground truth, which can bias both training and testing datasets. Dataset representativeness also remains uneven, as many studies rely on synthetic datasets or data from a single financial institution, potentially overstating cross-domain generalization capabilities [
63].
Concept drift presents another critical data challenge. Fraud patterns continuously evolve as attackers adapt their strategies to circumvent detection systems. Drift-aware frameworks therefore emphasize continuous monitoring and adaptive model updating to maintain performance over time [
46].
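Continuous monitoring of this kind is often implemented with a simple distribution-shift statistic. The sketch below computes the Population Stability Index (PSI) between a reference window and a current window of a single feature or model score. It is a generic monitoring heuristic and not the method of the cited framework; the bin count, the smoothing constant, and the toy data are assumptions.

```python
import math

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one variable.

    Bins are equal-width over the reference range; current values that fall
    outside that range are assigned to the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

# Hypothetical score windows: the current window is shifted upward.
reference = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in reference]

stable_psi = psi(reference, reference)
drift_psi = psi(reference, shifted)
```

A widely used rule of thumb treats PSI values above roughly 0.25 as a signal of substantial shift, although any alerting threshold should be calibrated on the institution's own data.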
In addition, modern fraud detection systems increasingly rely on heterogeneous data modalities, including transactional, behavioral, relational, and textual information. Integrating these heterogeneous sources remains an open research challenge that requires careful data preprocessing and feature engineering.
12.2. Modeling Challenges
From a modeling perspective, fraud detection poses several unique challenges that distinguish it from conventional machine learning tasks. One of the most prominent issues is the presence of adaptive adversaries. Fraudsters continuously modify their behavior in response to detection systems, creating adversarial environments where static models may quickly become ineffective. Evaluation protocols should therefore include stress-testing scenarios that simulate adversarial behaviors such as transaction splitting, coordinated fraud rings, and distributional shifts [
28].
Another important modeling challenge involves the integration of heterogeneous data modalities. Modern financial systems generate multiple types of signals, including temporal transaction histories, relational transaction networks, and contextual metadata. Developing unified architectures capable of effectively fusing these heterogeneous signals remains an active area of research [
15,
44].
Recent work has also explored advanced relational modeling approaches, including multi-temporal graph neural networks, meta-path-based graph representations, and knowledge graph integration. However, the selection of relational schemas and graph construction strategies remains largely under-specified in the literature, particularly under conditions of concept drift and delayed labels [
11,
97].
Furthermore, causal modeling and higher-order relational representations such as hypergraphs have been proposed as promising directions for improving robustness and interpretability. Nevertheless, validating causal claims in non-stationary environments remains a difficult challenge [
27].
12.3. Deployment Challenges
Even when high-performing models are developed in experimental settings, deploying fraud detection systems in operational environments introduces additional challenges. Real-world financial systems require models capable of handling large-scale data streams and performing inference under strict latency constraints.
As transaction networks grow in size and complexity, scalability becomes a central issue. Dynamic graph-stream approaches have been proposed to detect suspicious behavior in evolving transaction networks without relying on complete supervision [
45]. Similarly, distributed big-data pipelines and scalable graph embedding techniques are increasingly used to support large-scale fraud analytics [
98,
99].
Operational fraud detection systems must also consider the trade-off between detection accuracy and alert volume. Excessive false positives can overwhelm human investigators and reduce the effectiveness of fraud monitoring systems. As a result, researchers have begun to explore cost-sensitive learning and reinforcement learning frameworks that explicitly incorporate operational constraints into model training [
31].
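The trade-off between detection quality and alert volume can be made concrete by evaluating a model at a fixed investigation budget rather than at a fixed score threshold. The following sketch ranks transactions by score, flags only as many as investigators can review, and reports precision and recall within that budget. The function name and the toy scores are illustrative assumptions.

```python
def precision_recall_at_budget(scores, labels, budget):
    """Flag the `budget` highest-scoring transactions and evaluate them.

    Returns (precision, recall), where labels use 1 for fraud and 0 otherwise.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = order[:budget]
    true_positives = sum(labels[i] for i in flagged)
    total_fraud = sum(labels)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_fraud if total_fraud else 0.0
    return precision, recall

# Hypothetical model scores and confirmed labels for five transactions.
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
labels = [1, 0, 1, 0, 1]

tight = precision_recall_at_budget(scores, labels, budget=1)
wide = precision_recall_at_budget(scores, labels, budget=3)
```

Widening the budget from one to three alerts raises recall but lowers precision in this example, which is the operational tension that cost-sensitive and reinforcement learning formulations attempt to optimize directly.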
Cross-institution generalization represents another important deployment challenge. Fraud detection models trained on data from one financial institution may not generalize well to other institutions due to differences in customer behavior, transaction patterns, and regulatory environments. Distributed training approaches and knowledge distillation frameworks have been proposed to improve model portability across heterogeneous datasets [
18].
12.4. Regulatory and Ethical Issues
Beyond technical considerations, fraud detection systems must also address regulatory and ethical requirements. Financial institutions operate under strict regulatory frameworks that impose constraints on data usage, model transparency, and accountability.
Privacy preservation is a particularly important issue in financial fraud detection. Sensitive financial data cannot easily be shared across organizations, which limits collaborative fraud detection efforts. Federated learning and privacy-preserving ML techniques have therefore emerged as promising solutions for enabling collaborative detection without exposing raw transaction data [
38,
87].
Interpretability and explainability are also essential for regulatory compliance. Fraud detection decisions often require justification to auditors, regulators, and affected customers. Explainable AI techniques, such as grouped attribution methods and feature importance analysis, can help investigators understand the factors contributing to model predictions and support transparent audit trails [
96].
Finally, emerging technologies such as Quantum Machine Learning (QML), Quantum Graph Neural Networks (QGNN), and privacy-preserving AI introduce new opportunities and challenges for fraud detection systems. Quantum approaches in particular have been proposed as a possible route to computational speedups and enhanced representational capacity for modeling complex, highly imbalanced financial networks [
100,
101]. While quantum-based approaches remain largely exploratory, they suggest potential new modeling paradigms for handling highly imbalanced datasets and complex relational structures [
25,
30]. At the same time, ensuring fairness, transparency, and governance in AI-driven financial systems remains a critical area for future research.
Table 6 summarizes the key challenges identified in the literature and outlines potential research directions for addressing these issues. In addition, Figure 6 presents a conceptual taxonomy of these challenges, highlighting the relationships between data-related, modeling, and operational constraints.
13. Future Research Directions
Despite the substantial progress achieved in recent years, financial fraud detection remains an evolving research field. The increasing complexity of digital financial ecosystems, combined with adversarial behavior and strict regulatory requirements, continues to create new challenges for ML-based detection systems. Building on the open challenges discussed in the previous section, several promising research directions emerge that may significantly advance the state of the art in fraud detection.
13.1. Multimodal Fraud Detection
Modern financial systems generate heterogeneous data sources, including transaction logs, relational interaction networks, behavioral histories, device fingerprints, and textual financial disclosures. While many existing studies focus on a single data modality, future fraud detection systems are expected to rely increasingly on multimodal learning frameworks that integrate multiple information sources.
Recent work has explored the joint modeling of temporal transaction sequences and graph-based relationships between financial entities [
44]. Similarly, hybrid architectures that combine structured financial indicators with textual disclosures have demonstrated promising results in financial statement fraud detection [
15]. Future research should focus on developing principled multimodal fusion architectures capable of effectively combining heterogeneous data while maintaining interpretability and computational efficiency.
13.2. Graph-Based and Relational Learning
Financial fraud often involves coordinated activities among multiple entities, making relational learning approaches particularly suitable for fraud detection. GNNs, heterogeneous graph models, and knowledge graph approaches have therefore attracted increasing research attention.
However, several open issues remain. Graph construction strategies, temporal aggregation mechanisms, and relational schema design are often application-specific and lack standardized evaluation protocols. A particularly important but still under-discussed issue is heterophily: in many real transaction networks, fraudulent entities intentionally connect to normal nodes to evade detection, producing neighborhoods in which label similarity is weak or even misleading [
13]. This can reduce the effectiveness of standard homophily-oriented message-passing GNNs and motivates heterophily-aware architectures that preserve ego features, combine multiple propagation scales, or use relation-sensitive aggregation mechanisms [
14]. Recent work on metapath-based representations, temporal graph neural networks, and knowledge graph integration highlights the need for systematic approaches to modeling evolving financial interaction networks under precisely these heterogeneous and heterophilous conditions [
10,
12].
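The degree of heterophily in a labeled transaction graph can be quantified before choosing an architecture. The sketch below computes the overall edge homophily ratio and a fraud-specific variant restricted to edges that touch at least one fraudulent node. The toy graph, node names, and function names are hypothetical, and the fraud-specific ratio is only one of several possible class-wise definitions.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    if not edges:
        return float("nan")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

def fraud_edge_homophily(edges, labels):
    """Among edges touching a fraud node, fraction linking two fraud nodes."""
    incident = [(u, v) for u, v in edges if labels[u] == 1 or labels[v] == 1]
    if not incident:
        return float("nan")
    both = sum(1 for u, v in incident if labels[u] == 1 and labels[v] == 1)
    return both / len(incident)

# Hypothetical graph: fraud nodes f1 and f2 mostly attach to legitimate nodes.
labels = {"a": 0, "b": 0, "c": 0, "f1": 1, "f2": 1}
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("f1", "a"), ("f1", "b"), ("f2", "c"), ("f1", "f2")]

overall = edge_homophily(edges, labels)
fraud_only = fraud_edge_homophily(edges, labels)
```

In this example the overall ratio is moderate, while the fraud-specific ratio is low, illustrating how a graph can appear reasonably homophilous in aggregate even when the neighborhoods of fraudulent nodes are dominated by legitimate ones.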
Future research should explore robust graph construction techniques, dynamic graph representation learning, heterophily-aware neighborhood aggregation, and scalable graph analytics capable of handling large transaction networks in real time.
13.3. Adaptive and Drift-Aware Learning Systems
Concept drift remains a fundamental challenge in fraud detection because fraudulent strategies continuously evolve. Static machine learning models trained on historical data may quickly become obsolete as fraud patterns change over time.
Drift-aware learning frameworks and adaptive model updating strategies are therefore essential for maintaining reliable fraud detection systems [
46]. Future research should investigate automated drift detection mechanisms, incremental learning algorithms, and continuous training pipelines capable of adapting to evolving fraud patterns without extensive manual intervention.
Reinforcement learning approaches may also play an important role in adaptive fraud detection systems by enabling models to optimize long-term operational outcomes rather than short-term classification accuracy [
31].
In parallel, future work should more systematically study how LLMs can be integrated into fraud detection pipelines for analyst assistance, document-grounded reasoning, and retrieval-augmented decision support, while carefully evaluating hallucination risk, latency, privacy, and domain adaptation requirements in regulated financial environments [
80,
81].
13.4. Privacy-Preserving and Federated Fraud Detection
Privacy concerns represent a major barrier to collaborative fraud detection across financial institutions. Regulatory frameworks such as GDPR impose strict limitations on data sharing, which restricts the availability of large-scale multi-institution datasets.
Federated learning has emerged as a promising paradigm for addressing these limitations by enabling multiple organizations to jointly train machine learning models without exchanging raw data. Recent studies demonstrate the feasibility of federated fraud detection frameworks that preserve data privacy while improving detection performance [
38,
87]. At the same time, practical deployment must account for the computational cost of local training and secure aggregation, as well as the communication burden created by frequent parameter exchange across institutions. In high-throughput financial settings, these overheads can become a major barrier unless communication-efficient optimization, asynchronous updates, or model-compression strategies are adopted.
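The aggregation step at the core of many federated schemes, together with a rough view of the communication burden, can be sketched as follows. The code implements plain federated averaging of flat parameter vectors weighted by local dataset size and a back-of-the-envelope estimate of per-round traffic. Secure aggregation, compression, and asynchronous updates are deliberately omitted, and the function names, the 4-byte parameter size, and all numeric values are assumptions.

```python
def federated_average(client_params, client_sizes):
    """Average flat parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[j] * n for params, n in zip(client_params, client_sizes)) / total
        for j in range(dim)
    ]

def bytes_per_round(n_params, n_clients, bytes_per_param=4):
    """Rough traffic estimate: each client downloads and uploads the full model."""
    return 2 * n_params * bytes_per_param * n_clients

# Hypothetical local models from two institutions with different data volumes.
client_params = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [100, 300]
global_params = federated_average(client_params, client_sizes)

# Hypothetical deployment: a one-million-parameter model shared by 10 clients.
traffic = bytes_per_round(n_params=1_000_000, n_clients=10)
```

Under these assumptions a single round moves about 80 MB in total, which indicates why frequent parameter exchange becomes costly as model size, client count, or round frequency grows.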
Future studies should focus on scalable federated learning architectures, secure aggregation mechanisms, and privacy-preserving model evaluation techniques that can support collaborative fraud detection across financial institutions.
13.5. Explainable and Trustworthy Fraud Detection
Interpretability is an essential requirement for fraud detection systems because predictions often trigger costly investigations and regulatory actions. Financial institutions must be able to justify automated decisions to auditors, regulators, and customers.
Explainable AI (XAI) methods such as feature attribution techniques, rule extraction, and grouped explanation methods have therefore become increasingly important in fraud analytics [
96]. However, providing reliable explanations for complex models such as deep neural networks and graph neural networks remains challenging.
Future research should focus on developing interpretable fraud detection models that provide actionable insights for investigators while maintaining high predictive performance. In addition, fairness and bias mitigation should be considered when deploying automated fraud detection systems in real-world financial environments.
13.6. Scalable Real-Time Fraud Detection
The growth of digital payment systems and online financial platforms has substantially increased transaction volumes, requiring fraud detection systems to operate at massive scale under strict latency constraints. Real-time detection systems must sustain very high transaction throughput, which can reach thousands of transactions per second in large payment networks, while maintaining high detection accuracy.
Recent studies highlight the importance of scalable graph analytics pipelines, distributed machine learning architectures, and streaming-based anomaly detection methods [
45,
98,
99]. Future research should explore efficient model architectures, hardware-aware implementations, and edge-based detection systems capable of supporting real-time fraud prevention in large-scale financial networks.
13.7. Emerging Technologies
Finally, emerging technologies such as quantum machine learning and advanced causal modeling may open new avenues for fraud detection research. Although still largely exploratory, quantum-based learning models have been investigated for classification tasks involving highly imbalanced financial datasets [
25,
30].
In parallel, causal machine learning approaches may provide deeper insights into the mechanisms underlying fraudulent behavior by distinguishing correlation from causation in complex financial systems [
27]. These emerging paradigms are still at an early stage but represent promising directions for future research.
Overall, advancing financial fraud detection will require interdisciplinary collaboration between machine learning researchers, financial institutions, regulatory bodies, and cybersecurity experts. Such collaboration will be essential for developing robust, interpretable, and scalable detection systems capable of addressing the rapidly evolving landscape of financial fraud.
14. Conclusions
Financial fraud detection is a critical research area due to the rapid expansion of digital financial systems and online transactions. Traditional rule-based systems are no longer sufficient to address the complexity and dynamic nature of modern fraudulent schemes. As a result, ML/DL approaches have emerged as essential tools for identifying fraudulent behavior in large-scale financial datasets.
This paper presented a comprehensive survey of ML/DL techniques for financial fraud detection. The survey reviewed a wide range of methodological approaches, including classical ML models, deep neural network architectures, graph-based learning methods, multimodal architectures, and cost-sensitive decision frameworks. In addition to analyzing model architectures, this work examined the characteristics of financial fraud datasets, evaluation methodologies, and practical deployment challenges.
Our analysis highlighted several key open challenges. Fraud detection systems must operate under extreme class imbalance, continuously evolving fraud patterns, and strict real-time constraints. Furthermore, financial institutions face significant regulatory and privacy requirements that limit the availability and sharing of transaction data. Addressing these challenges requires the development of adaptive learning models, robust anomaly detection techniques, and privacy-preserving collaborative learning frameworks.
Recent advances in GNNs, transformer architectures, and multimodal learning systems demonstrate strong potential for modeling complex transactional relationships and behavioral patterns. In addition, emerging approaches such as federated learning and explainable artificial intelligence offer new opportunities for building transparent and privacy-aware fraud detection systems.
Future research should focus on developing scalable and interpretable models capable of adapting to evolving fraud patterns in real-world environments. Integrating behavioral analytics, network-based representations, and cost-aware decision mechanisms will be essential for improving the effectiveness and operational impact of fraud detection systems.