Recent Progress on Financial Risk Detection in the Context of Transaction Fraud Based on Machine Learning Algorithms

Chen, Teli; Sun, Ruili; Ma, Tiefeng; Sergeev, Sergey

doi:10.3390/jrfm19010014

Open Access Review

¹ Faculty of Science and Technology, University of Canberra, Canberra 2617, Australia

² College of Mathematics and Information Science, Zhengzhou University of Light Industry, Zhengzhou 450001, China

³ School of Statistics, Southwestern University of Finance and Economics, Chengdu 611130, China

* Author to whom correspondence should be addressed.

J. Risk Financial Manag. 2026, 19(1), 14; https://doi.org/10.3390/jrfm19010014

Submission received: 28 October 2025 / Revised: 11 December 2025 / Accepted: 19 December 2025 / Published: 24 December 2025

(This article belongs to the Special Issue Featured Papers in Finance and Society Wellbeing—in Honor of Professors Joe Gani and Chris Heyde)

Abstract

Transaction Fraud, a type of financial operational risk, remains a major threat to financial sectors and continuously imposes devastating financial impacts. This study comprehensively reviews 41 cutting-edge publications on financial transaction fraud detection using Machine Learning from January 2018 to October 2025. We establish a taxonomy to categorize the selected work into four themes: Traditional Machine Learning, Deep Learning, Ensemble Method, and Hybrid Method. Each theme is evaluated in-depth, from strengths to weaknesses. Ensemble exhibits better performance over other methods with a recall of 92.7%, a precision of 96% and an F1-score of 92.66% on average, while Traditional ML ranks last in terms of average F1-score. Preprocessing strategies, like data balancing, can enhance performance, while feature engineering requires careful evaluation before implementation. Significantly, we assess financial implications, suggesting it is essential to integrate financial metric design, feature explanation, time series patterns, and data privacy considerations into financial fraud detection—a focus that aligns with risk management frameworks and regulations. By revealing current research gaps and suggesting future directions, our study provides practical guidance for researchers and practitioners to advance financial fraud detection strategies within a highly intricate financial ecosystem.

Keywords: card transaction fraud/scam detection; machine learning; deep learning; financial risk; financial fraud

1. Introduction

Background: Technological advances have made daily life more convenient over the past decades, leading to the prevalence of cashless, electronic, and online payments in the financial sector. However, these advances have been accompanied by a rise in various forms of fraud and scams, such as identity theft, account takeover, card skimming, phishing, money laundering, and cyberattacks (Vashistha & Tiwari, 2024). Most fraudulent activities result in financial losses, causing substantial harm to financial institutions, individuals, and organizations. These losses also escalate operational risk, a key category defined by the Basel Framework, a comprehensive set of measures to strengthen the regulation, supervision, and risk management of the global banking sector (Basel Committee on Banking Supervision, 2024). In 2023, consumers in the United States lost more than USD 10 billion to fraud, a 14% increase from 2022, according to the Federal Trade Commission (Federal Trade Commission, 2024). In the same year, AUD 2.7 billion was stolen by scammers from Australian consumers (The Treasury, 2024). Although these forms of external financial fraud emerged more than a decade ago, they continue to pose a serious threat to the world’s economy and society’s well-being, as people have become more dependent on the digital world during the COVID and post-COVID era. Inadequate regulatory oversight of this risk could even lead to severe failures at banks and financial institutions. Capital requirements, such as those based on Risk-Weighted Assets (RWA) within the Basel Framework, manage risk by ensuring banks maintain sufficient capital buffers. Complementing this, effective fraud detection strategies are imperative to tackle operational risk directly, minimizing financial losses and protecting customers from harm.

Motivation and Purpose: Machine Learning (ML), along with its subset Deep Learning (DL), is widely leveraged by researchers and practitioners to build fraud detection systems because of its effectiveness in prediction and forecasting by identifying intricate patterns and learning historical behaviors from large volumes of data (Dal Pozzolo et al., 2014). However, the limited interpretability of “black box” models, data imbalance, data inaccessibility, and misclassification are common challenges in ML-based financial fraud detection (Abdul Salam et al., 2024; Ahmed et al., 2025; Baisholan et al., 2025b; Tayebi & El Kafhali, 2025). This study aims to survey recent ML methods applied to financial transaction fraud detection, which work by forecasting the probability of fraud to mitigate financial risk, and to evaluate their progress and weaknesses. Although several similar reviews (Baisholan et al., 2025a; Chen et al., 2025; Hafez et al., 2025; Moradi et al., 2025) were released recently and made substantial progress in this field, they still suffer from certain limitations. These include a lack of focus on transaction fraud detection specifically, insufficient discussion of preprocessing strategies, and a failure to address financial implications or link them systematically to financial risk. Our work distinguishes itself from existing reviews by addressing these research gaps (Table 1).

Specifically, we are going to provide a critical analysis of ML techniques for transaction fraud detection, structured thematically to address advances, challenges, and opportunities. This analysis encompasses their classification methods, preprocessing strategies, highlights, results, and limitations. Additionally, we will conduct a deeper evaluation of preprocessing strategies. Moreover, we intend to emphasize the financial implications and significance of these methods from the perspective of the financial sector. This will involve aspects such as financial metric design, feature explainability, time-series considerations, and financial data privacy. By linking our analysis to risk management and regulatory compliance, we aim to bridge a critical research gap: few studies have systematically addressed the financial impact when applying ML to detect transaction fraud, despite it being a problem rooted in the financial sector itself. The overall goal is to provide a comprehensive review of recent ML methods in handling transaction fraud detection and to offer insights into future directions for both researchers and practitioners in this field.

Scope: This research primarily focuses on peer-reviewed journal articles and conference papers from the past seven years (January 2018–October 2025) that employ ML algorithms to detect card transaction fraud, a specific form of external fraud in operational risk under the Basel Framework. Since financial transaction data is usually structured, the datasets used among selected studies are tabular. They include features such as transaction time, amount, receiving account, type of transaction, age group, monthly salary, etc. The target is a binary feature (typically labeled 0 or 1) that indicates whether a transaction is fraudulent, defining the classification task.

This study contributes to the literature in the following key ways:

We assess recent ML progress in transaction fraud detection for financial risk, using a structured taxonomy that covers advances and limitations to provide a comprehensive domain overview.
Simultaneously, we specifically evaluate preprocessing methods, including data balancing, feature engineering, and hyperparameter optimization techniques, which are critical for enhancing ML performance in classification tasks, emphasizing a potential future focus.
We aggregate the reported results across studies to compare the reviewed methods critically and present key findings.
Additionally, we address the financial implications and significance of these methods, linking them to risk management.
Finally, the review concludes with a comprehensive discussion of the limitations and future scope of current studies, highlighting directions for further work by researchers and practitioners in this domain.

The rest of the study is arranged in the following way: Section 2 briefly explains the review methodology employed to conduct this research. In Section 3, we review and analyze these approaches in-depth based on four subthemes: Traditional ML, DL, Ensemble method, and Hybrid method. Key topics such as dataset descriptions, preprocessing strategies, and cross-validation are considered. Discussions on performance comparisons across our reviews are carried out. Section 4 discusses the financial implications, closely connected to risk management. Section 5 focuses on gaps and limitations in this domain and provides future directions. In Section 6, we conclude this study.

2. Review Methodology

In this study, we perform an exhaustive and rigorous evaluation of the most recent publications on financial risk detection in the context of card transaction fraud using state-of-the-art ML. To ensure the quality of our evaluations, we referred to some of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) (Page et al., 2021) and Kitchenham Systematic Review Process guidelines (Kitchenham & Charters, 2007). Detailed steps are described in the following.

Initially, a comprehensive search was conducted using a combination of keywords: (“credit card fraud detection” OR “transaction fraud detection” OR “financial fraud detection”) AND (“machine learning” OR “deep learning”). These search queries were applied to databases including Google Scholar, Springer Nature Link, ScienceDirect, Scopus, and IEEE Xplore to find publications from January 2018 to October 2025. We then screened the publications appearing on the first five to ten pages of the search results in each database. In addition, we ran a second search with the same keywords on Google Scholar, restricted to January 2024 to October 2025, and browsed the first ten pages of results to ensure that the review covers the most recent work. Only studies written in English and published in peer-reviewed journals or conferences were considered.

After screening 200 publications by their abstracts, experimental results, and summaries, we selected 41 publications for in-depth review. Our selection criteria include relevance (whether the method is a Machine Learning based algorithm applied to financial transaction/credit card fraud datasets), citation counts according to Google Scholar, and journal quality. Specifically, studies that did not employ ML/DL techniques or that were not focused on transaction fraud detection with tabular data were excluded. In addition to satisfying the relevance criterion, publications must also meet the following: (citations > 50) OR (journal indexed in SCI with IF > 1.5) OR (journal indexed in ESCI with IF > 1.5). First, a minimum of 50 citations, a benchmark for influential work, ensured the credibility and impact of the selected work. Second, to mitigate the recency bias in citation counts, we included papers from journals with an Impact Factor (IF) > 1.5, a threshold that captures quality emerging research while maintaining a baseline of editorial rigor. We also conducted snowballing through Research Rabbit to scan the reference lists of the selected publications for other related studies, but no additional publications met our selection criteria and publication-year window. To ensure the quality of included papers, the following components were assessed and satisfied by all selected works:

Data & Reproducibility: Datasets are clearly described, and a source link is provided if publicly accessible.
Experimental Rigor: The methodology and experimental steps are explicitly stated.
Model Credibility: The proposed model is compared with baselines or prior work.
Analysis: The results are discussed and analyzed.
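The inclusion rule stated above (relevance combined with a citation or journal-quality threshold) can be written as a small predicate. This is only an illustrative sketch: the `Publication` record and its field names are hypothetical and are not part of the review protocol itself.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    """Hypothetical record for one screened paper (field names are assumptions)."""
    uses_ml: bool          # employs an ML/DL technique
    tabular_fraud: bool    # targets transaction fraud on tabular data
    citations: int         # Google Scholar citation count
    index: str             # "SCI", "ESCI", or "" if neither
    impact_factor: float   # journal Impact Factor


def is_eligible(p: Publication) -> bool:
    """Relevance AND (citations > 50 OR SCI/ESCI journal with IF > 1.5)."""
    relevant = p.uses_ml and p.tabular_fraud
    influential = p.citations > 50
    quality_journal = p.index in {"SCI", "ESCI"} and p.impact_factor > 1.5
    return relevant and (influential or quality_journal)


# A recent, lightly cited paper in an ESCI journal with IF 2.0 still qualifies
print(is_eligible(Publication(True, True, 10, "ESCI", 2.0)))
```

The second branch of the rule is what lets recent papers with few citations enter the review, which is the recency-bias mitigation described above.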

Through a structured review, we gathered information from the 41 selected publications, including their years, journals, citation counts (Google Scholar), methods used/developed, data balancing techniques, feature engineering techniques, datasets used, highlights, results, limitations/challenges, and financial significance, and we extracted and categorized all of this information in an Excel spreadsheet. ML is conventionally categorized into two groups: Supervised Learning and Unsupervised Learning. But in financial fraud detection, no matter what preprocessing strategies (Supervised, Unsupervised, Statistical, or others) are employed, the last step is usually classification, which is a type of Supervised Learning. Thus, the conventional split is uninformative here, since all these fraud detection techniques can be categorized as Supervised Learning (Chhabra et al., 2023). Instead, we established a taxonomy based on the specific ML strategies employed and divided them into four subgroups: Traditional ML, DL, Ensemble, and Hybrid. The fundamental distinction among them lies in their core mechanisms: from human-driven feature engineering (Traditional ML) to automatic feature learning (DL), to collective prediction (Ensemble), and finally to integrative system design (Hybrid). Each subgroup was reviewed in depth, and its strengths and limitations were thoroughly described to inform future studies.

A full review process diagram can be seen in Figure 1. The workflow of identification, screening, and inclusion ties closely to the framework of PRISMA and Kitchenham to ensure the trustworthiness of our research. Furthermore, an analysis part following the inclusion is presented in the workflow, providing readers with a clear understanding of the overall evaluation process. Through the proposed methodology, a comprehensive study is performed to present various ML techniques employed in financial risk prevention, with a particular focus on transaction fraud detection.

3. Analysis of Approaches

ML has been widely adopted in the financial domain to solve practical problems such as financial fraud and, in particular, transaction fraud, which is the focus of this research. Overall, the number of papers has increased over time. In Figure 2, the four methods are almost evenly distributed, with DL, Ensemble, and Hybrid slightly more frequent than Traditional ML in our reviewed papers. This suggests that DL, Ensemble, and Hybrid methods attract more attention than Traditional ML in transaction fraud detection.

3.1. Traditional Machine Learning (ML)

Traditional ML refers to algorithms such as Logistic Regression (LR), Naïve Bayes (NB), K-Nearest Neighbors (KNN), Decision Tree (DT), etc., which are generally less complex, more interpretable, and perform well with smaller datasets. LR provides a linear, probabilistic foundation by modeling outcomes with a logistic curve, while NB relies on probability and the assumption of feature independence to perform quick probabilistic classification. In contrast, KNN is an instance-based, non-parametric method that classifies data points based on the majority vote of their k closest neighbors in the feature space. DT uses a hierarchical, rule-based structure to split data recursively, creating an interpretable model for classifications or predictions. In this subsection, we incorporate publications if they include any of the traditional ML methods mentioned above. Thus, the traditional ML subgroup comprises publications that may include both traditional ML and other types of methods.
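To make the instance-based idea behind KNN concrete, the following is a minimal sketch of majority-vote classification. The toy features (scaled amount and hour) and the function name `knn_predict` are assumptions made for illustration and are not taken from any reviewed study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data: (scaled amount, scaled hour); 1 = fraud, 0 = legitimate
X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.9, 0.8), (0.85, 0.9), (0.95, 0.85)]
y = [0, 0, 0, 1, 1, 1]
print(knn_predict(X, y, (0.88, 0.82)))
```

Because every prediction scans the whole training set, this brute-force form is only practical for small datasets, which is consistent with the remark above that traditional methods suit smaller data.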

Three traditional ML algorithms, LR, NB, and KNN, were combined with Random Under Sampling (RUS) by Itoo et al. (2021) to detect card transaction fraud. In this study, LR outperformed NB and KNN with a maximum accuracy of 95%. This result suggests that LR performs well on the smaller sample produced by RUS. Another study (Tanouz et al., 2021) compared Random Forest (RF), DT, LR, and NB with RUS resampling. RF, which aggregates the predictions of multiple decision trees, achieved the best result with a recall of 91.11% due to its inherent ability to reduce overfitting. Although these two studies demonstrate excellent performance of Traditional ML models on transaction fraud detection, the reduced training data resulting from RUS may negatively impact the model’s stability and generalizability.
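The data loss caused by RUS can be seen in a minimal sketch. The helper `random_undersample` below is a hypothetical illustration that assumes a binary label with the majority class encoded as 0. It is not the implementation used in the cited studies.

```python
import random


def random_undersample(X, y, majority_label=0, seed=42):
    """Randomly drop majority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label != majority_label]
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    # Keep only as many majority rows as there are minority rows
    kept_majority = rng.sample(majority_idx, k=len(minority_idx))
    keep = sorted(minority_idx + kept_majority)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 95 legitimate rows and 5 fraudulent rows
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5
X_bal, y_bal = random_undersample(X, y)
print(len(y_bal), sum(y_bal))
```

In this toy case 90 of the 95 legitimate rows are discarded, which illustrates why a model trained after RUS may be less stable than one trained on the full data.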

Both studies (Ileberi et al., 2022; Mienye & Sun, 2023b) introduced Genetic Algorithm (GA) with Traditional ML models on the same dataset. IG-GAW (Mienye & Sun, 2023b), which uses Extreme Learning Machine (ELM) as its classification method and achieves a sensitivity of 99.7%, outperforms GA-RF (Ileberi et al., 2022), which uses RF with a sensitivity of 72.56%. Due to its simpler learning process and robust generalizability, the ELM—a feedforward neural network that uses randomly fixed hidden weights and analytically solves output weights—demonstrates superior capability over RF in this case.
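The ELM training step described above is commonly written in two parts. The symbols below are introduced here for illustration and do not appear in the reviewed papers: X is the input matrix, W and b are the randomly drawn and then fixed hidden weights and biases, g is the activation function, T is the target matrix, and H† is the Moore–Penrose pseudoinverse of the hidden-layer output H.

```latex
H = g\left(XW + \mathbf{1}b^{\top}\right), \qquad
\hat{\beta} = H^{\dagger} T = \arg\min_{\beta} \lVert H\beta - T \rVert_2^2
```

Only the output weights β are fitted, and in closed form, which is why the learning process is described as simpler than iterative training.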

Afriyie et al. (2023) revealed a weakness of RF in their study, where it achieved an F1-score of only 17% and a precision of only 9%. Such low precision means that most flagged transactions are legitimate, which would inconvenience customers and, in turn, raise operational costs and risks for financial institutions if the model were deployed in a real-time system. All reviewed publications associated with Traditional ML are comprehensively summarized in Table 2.

3.2. Deep Learning (DL)

DL is a subset of ML but is more complex and usually adopts multilayer neural network architectures to learn the intricate patterns from vast amounts of data. A detailed review is listed in Table 3 at the end of this subsection.

A Convolutional Neural Network (CNN) with 20 layers achieved an accuracy of 99.9%, an F1-score of 85.71%, a precision of 93%, and an AUC of 98% (Alarfaj et al., 2022), while a Continuous-Coupled Neural Network (CCNN) with the Synthetic Minority Over-sampling Technique (SMOTE) achieved an accuracy of 99.98%, a precision of 99.96%, a recall of 100%, and an F1-score of 99.98% (Wu et al., 2025). CNN was designed to automatically extract spatial features from image data via convolutional layers with learnable filters, while CCNN improved the representation of intricate spatiotemporal patterns through continuous neuron activation and dynamic coupling. These strong results demonstrate their efficacy in processing financial transaction data. Yu et al. (2024) applied the Transformer model, a deep learning architecture originally developed for natural language processing tasks, to financial transaction datasets. The Transformer model with a multi-head attention mechanism successfully captured the intricate correlations between attributes of financial transaction datasets and obtained a recall of 99.8% and an F1-score of 99.8% under cross-validation. These state-of-the-art DL algorithms, particularly CCNN and Transformer, demonstrated impressive performance due to their inherently robust architectures. However, the substantial computational resources required for training and inference, along with challenges in interpretability, hinder these methods from becoming the most effective solution for mitigating financial risks associated with transaction fraud in real-world systems.

Graph Neural Networks (GNNs) captured researchers’ interests in transaction fraud detection (Cherif et al., 2024; Harish et al., 2024; Khaled Alarfaj & Shahzadi, 2025). In the financial fraud context, unlike other techniques that only consider single transactions, GNNs usually transform tabular datasets into graphs with customer nodes and merchant nodes and link nodes to uncover the intricate relations and behavior patterns between customer (transaction initiator), merchant (transaction receiver), and transactions before classification. GNNs with lambda architecture—which processes large-scale data through a combined batch and real-time structure—achieved an F1-score of 80.78% and a recall of 79.68% (Khaled Alarfaj & Shahzadi, 2025). In contrast, GNNs with Relational Graph Convolutional Network (RGCN), used to learn node representations, achieved a lower F1-score of 61% and a recall of 46% (Harish et al., 2024). Cherif et al. (2024) proposed an encoder–decoder-based GNN model to detect transaction fraud. The encoder–decoder architecture was applied to represent the nodes, and the proposed model yielded better results than the previous two studies, achieving a recall of 92%, an F1-score of 86%, and an AUC of 92%. Overall, GNNs demonstrate lower performance relative to other DL algorithms. The effectiveness of GNNs in transaction fraud detection needs further investigation, as poor performance may lead to increased operational risk.

3.3. Ensemble Method

Ensemble methods are learning algorithms that construct two or more classifiers (also called weak learners) and then classify new data points by averaging, stacking, or taking a (weighted) vote of their predictions (Dietterich, 2000). They balance bias and variance of weak learners and usually yield a more robust result, improving overall performance compared to any single constituent model. Ensemble could be a powerful tool to combat financial risk in the context of transaction fraud.

The voting ensemble, which outputs classification results by aggregating (hard or soft) votes from multiple weak learners, has drawn attention in transaction fraud detection. All the work (Ahmed et al., 2025; Chhabra et al., 2023; Khalid et al., 2024) achieved promising accuracy (exceeding 99.9%) on the same dataset by employing a voting ensemble composed of different base classifiers. Another study (Talukder et al., 2024) proposed a voting-based multistage ensemble ML classifier (EIBMC), leveraging the diversity of the fundamental ML models and combining the finest aspects of multiple multistage ensemble models into a more robust and trustworthy detection technique. EIBMC achieved an accuracy score of 99.94% and an AUC score of 100% under stratified 5-fold cross-validation. However, EIBMC used accuracy—a metric known to be biased in imbalanced datasets—to assign weights to its ensemble classifiers. For future work, metrics like recall could be adopted to improve model robustness.
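The difference between hard and soft voting can be shown with a minimal sketch. The probabilities and weights below are made-up values, and the function names are assumptions for illustration rather than the design of any cited ensemble.

```python
from collections import Counter


def hard_vote(labels):
    """Majority vote over the predicted class labels of the base classifiers."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(fraud_probs, weights=None, threshold=0.5):
    """Weighted average of predicted fraud probabilities, then a threshold."""
    weights = weights or [1.0] * len(fraud_probs)
    avg = sum(w * p for w, p in zip(weights, fraud_probs)) / sum(weights)
    return int(avg >= threshold)


# Hypothetical fraud probabilities from three base classifiers for one transaction
probs = [0.45, 0.40, 0.95]
labels = [int(p >= 0.5) for p in probs]  # two classifiers say 0, one says 1
print(hard_vote(labels))  # hard voting follows the two "legitimate" votes
print(soft_vote(probs))   # soft voting uses the mean probability (about 0.60)
```

Here the two schemes disagree: one very confident classifier outweighs two mildly doubtful ones under soft voting. The optional `weights` argument is also where a per-classifier score enters, so weights derived from accuracy on imbalanced data would carry the bias noted above into the final vote.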

A cost-sensitive and ensemble deep forest (CE-gcForest) method, inspired by Zhou and Feng (2019), was employed to detect transaction fraud (Zhao et al., 2023). The ensemble-gcForest model enhances diversity and improves performance by selecting the best-performing base classifiers in each round based on their Type II error rate. This process results in a more robust and efficient ensemble. The model outperformed the baseline ML models, with AUC values of 98.01% and 98.25% on the real-world datasets used.

Mienye and Sun (2023a) implemented a stacking ensemble approach that combines two deep learning base classifiers, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), with a Multilayer Perceptron (MLP) meta-classifier to address the problem of dynamic patterns in card transactions. In a stacking ensemble, the base classifiers are trained on the original features, and the meta-classifier is trained on the outputs of the base classifiers to produce the final prediction. The proposed model obtained promising results, with a recall of 100%, a precision of 99.7%, and an AUC of 100% under 10-fold cross-validation.

The examples discussed above highlight the outstanding proficiency of Ensemble in detecting transaction fraud. Leveraging the strengths and diversities of multiple base learners, Ensemble successfully enhances models’ performance, making them particularly effective in handling complex and unbalanced datasets commonly found in financial transactions. A comprehensive summary regarding the reviewed Ensemble methods can be found in Table 4 below.

3.4. Hybrid Method

Unlike Ensemble methods, which combine the predictions of multiple base learners, Hybrid methods integrate two or more ML algorithms sequentially or in parallel to exploit their individual strengths. They are used for tasks such as parameter optimization, feature engineering, data balancing, and classification. This architecture allows a more robust, accurate, and reliable model to be developed.

Du et al. (2024) employed an Autoencoder (AE) for feature learning and Extreme Gradient Boosting (XGBoost) for classification, achieving a recall of 89.29%. Here, the AE compresses input data to extract features, while XGBoost serves as the predictive ensemble model. Another hybrid deep learning architecture, the Zeiler and Fergus Network integrated with Dwarf Mongoose–Shuffled Shepherd Political Optimization (DMSSPO_ZFNet), was proposed (Ganji & Chaparala, 2024). In this model, feature fusion is performed using Wave Hedge distance and a Deep Neuro-Fuzzy Network (DNFN), hyperparameters are optimized via DMSSPO, and classification is handled by ZFNet. It achieved an accuracy of 96.1%, a sensitivity of 96.1%, and a specificity of 95.1%.

Meta-heuristic algorithms inspired by animal behaviors have recently gained popularity for optimizing hyperparameters in transaction fraud detection. Jovanovic et al. (2022) introduced the Group Search Firefly Algorithm (GSFA) coupled with XGBoost to detect transaction fraud. The GSFA utilizes a fitness function based on firefly brightness and attraction. It also employs a disputation operator, which searches for solutions within a selected subgroup of the population to optimize the hyperparameters of ML algorithms. XGBoost-GSFA achieved a recall of 99.97%, an F1-score of 99.97%, and an AUC of 100%. Reddy et al. (2025) introduced XGBoost with Elephant Herd Optimization (EHO) to tune hyperparameters. EHO finds an optimal hyperparameter set by mimicking elephant herding behavior, guided by specific evaluation metrics. The results showed that XGBoost with EHO had an accuracy of 98% and an AUC of 99.7% under 20-fold cross-validation.

These statistics indicate that the capability of Hybrid techniques on transaction fraud detection is exceptional. Meta-heuristic algorithms, in particular, which are more flexible and more efficient at finding near-optimal solutions and less prone to becoming stuck in local optima, show a superior ability to enhance ML models’ performance through hyperparameter optimization. Nevertheless, most Hybrid methods incorporate deep neural network structures, resulting in the same issues as DL, like interpretability and requiring intensive computing resources. Table 5 presents the review summary of the Hybrid method.

3.5. Analysis

3.5.1. Datasets Description

Datasets for detecting transaction fraud are usually collected in two ways: online open sources or private data from banks or financial institutions. Most researchers use online open-source data to build their ML models for card transaction fraud detection, since private bank data is inaccessible in most cases because of its confidential nature. In some reviewed studies, however, researchers cooperated with banks and financial institutions to access real-world data for building detection systems.

Online open-source data:

The European credit card dataset is widely used to train ML models for transaction fraud detection. There are two versions, covering transactions made in 2013 and 2023, respectively. The first version contains 284,807 transactions made by European cardholders in September 2013, of which 492 (0.172%) are fraudulent (ULB Machine Learning Group, 2018). The second version contains 550,000 credit card transactions made by European cardholders in 2023, with evenly distributed classes (Elgiriyewithana, 2023). Most features in these datasets are transformed by Principal Component Analysis (PCA) and anonymized due to privacy concerns.

The Sparkov dataset is used by several studies. It was simulated using the Sparkov Data Generation tool and covers credit card transactions between 1000 customers and a pool of 800 merchants over the period from 1 January 2019 to 31 December 2020 (Shenoy, 2022). This dataset is highly unbalanced.

The BankSim dataset is a synthetic dataset of bank payments based on a Spanish bank. It contains 594,643 records in total and is highly unbalanced, with normal payments of 587,443 and fraudulent transactions of 7200 (Lopez-Rojas, 2017a).

The PaySim dataset is another synthetic dataset, generated from a private dataset of mobile money transactions in an African country to address the lack of publicly available datasets for financial fraud research (Lopez-Rojas, 2017b). It is very large, containing more than 6 million instances, and is highly unbalanced.
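The class counts quoted above for the 2013 European credit card dataset give a quick, hedged illustration of why accuracy alone says little on such unbalanced data. The "trivial model" below is a thought experiment, not a method from the reviewed papers.

```python
# Class counts quoted above for the 2013 European credit card dataset
total = 284_807
fraud = 492

fraud_rate = fraud / total
# A trivial model that labels every transaction "legitimate"
trivial_accuracy = (total - fraud) / total
trivial_recall = 0 / fraud  # it catches no fraud at all

print(f"fraud rate:       {fraud_rate:.3%}")
print(f"trivial accuracy: {trivial_accuracy:.3%}")
print(f"trivial recall:   {trivial_recall:.0%}")
```

A model that never flags fraud would still exceed 99.8% accuracy on this dataset, so reported accuracies above 99.9% are best read alongside recall, precision, and F1-score.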

Private data:

Private data from banks or financial institutions often contains real-world transactions. ML models developed with real-world data are usually more efficient and accurate because real data preserves all the authentic information of transactions. However, private data is hard to access because of the confidentiality and security requirements of the financial sector.

Table 6 provides a summary of the datasets used in the reviewed papers. As shown, the European credit card dataset is the most popular, and it is used substantially more often than any other dataset. This is because the European credit card dataset is easily accessible and its attributes have already been processed by PCA, providing researchers with clean and standardized data to develop their models (Chen et al., 2025).

Although synthetic datasets offer valuable utility, they might introduce significant risks in transaction fraud detection (Tayebi & El Kafhali, 2025). Firstly, synthetic data may fail to capture the complex, non-linear interactions and subtle behavioral cues present in real-world transactions. Moreover, it often lacks the rare and evolving attack patterns that define emerging fraud. Consequently, models that perform well in training or testing might fail when deployed on different, more complex real-world datasets. Access to rich, up-to-date real transaction data, therefore, remains essential. Real data provides the ground truth needed to validate models, capture live threat intelligence, and ensure that detection systems adapt to novel fraud schemes in real time, thereby maintaining both robustness and regulatory credibility.

3.5.2. Preprocessing

Preprocessing plays a crucial part in ML-based applications before training and classification, especially in financial fraud detection, which typically involves datasets with unbalanced classes and redundant features. Regardless of the ML algorithms implemented, most of the reviewed papers adopted at least one of the three preprocessing methods: data balancing, feature engineering, or hyperparameter tuning/optimization to enhance the performance of their ML models. This subsection provides an analysis of these methods applied by selected publications.

Data Balancing

Data resampling is the most popular choice for researchers to address issues of highly unbalanced datasets in transaction fraud detection. It can create a more balanced data distribution, consequently leading to better model performance. Traditional resampling techniques include undersampling, oversampling, and a combination of undersampling and oversampling, which usually generate synthetic data instances of the minority class or reduce the instances of the majority class by statistical methods (Alfaiz & Fati, 2022). RUS, Random Over Sampling (ROS), and SMOTE are widely adopted traditional resampling methods. RUS reduces the majority class by randomly removing instances, while ROS randomly duplicates existing minority class instances. SMOTE can generate new synthetic examples of the minority class by interpolating between existing minority instances. Dang et al. (2021) compared the model performance with and without SMOTE, concluding that ML models achieve high performance when resampling methods are applied to both of training and testing datasets, but poor results are obtained by applying resampling methods only to the training dataset. However, oversampling can introduce noise and result in overfitting, while undersampling may result in the loss of information (Cherif et al., 2024). ML methods are alternatives for data resampling, and their generations are usually more robust and credible. Zheng et al. (2018) constructed a Hybrid method, Generative Adversarial Network (GAN) with Gaussian Mixture Models (GMMs) to generate transaction samples. In this architecture, generator networks integrated with GMMs create and validate synthetic fraud samples, resulting in a more robust generation process. The result in a real-world system showed strong financial significance. Similarly, Tayebi and El Kafhali (2025) leveraged an Autoencoder (AE) coupled with a Support Vector Machine (SVM) in their Hybrid model for data creation and verification. 
SVM is a traditional supervised ML method that finds the optimal separating hyperplane by maximizing the margin between classes for linear classification and efficiently handles non-linear problems using the kernel trick. In this combined approach, newly generated fraud data points from AE were added to the dataset only if they were correctly classified by the SVM model. This approach yielded promising results, with 99.99% accuracy, 98% precision, 99.99% recall, and a 97.77% F1-score. The validation process in these two examples demonstrated robust data generation.
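The three traditional resampling schemes described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the procedure of any reviewed study: the function names, the tuple-based row format, and the neighbour count `k` are assumptions made for the example, and the SMOTE-style routine implements only the core interpolation step.

```python
import random


def random_undersample(majority, minority, rng):
    """RUS: keep a random subset of the majority class, same size as the minority."""
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, rng):
    """ROS: duplicate random minority rows until both classes have equal size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def smote_like(minority, n_new, k, rng):
    """SMOTE-style step: new point = x + u * (neighbour - x), with u in [0, 1).

    Assumes the minority class holds at least two distinct rows.
    """
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance
        others = [p for p in minority if p is not x]
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p))
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because each synthetic point lies on a segment between two existing fraud rows, it stays inside the region already covered by the minority class, which is also why noisy minority rows can propagate into the generated data.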

Assigning class weights that are inversely related to the class frequencies is another alternative to solve the class imbalance (Alharbi et al., 2022; Baisholan et al., 2025b; Cherif et al., 2024; Zhao et al., 2023). This strategy is preferred because it maintains a realistic distribution to prevent overfitting and avoids biased results caused by artificially generated samples of the minority class (Cherif et al., 2024). Zhao et al. (2023) employed cost-sensitive gcForest to address the data imbalance issue by assigning higher classification costs to fraud cases, while Baisholan et al. (2025b) tuned the threshold and adjusted minority class weight. In these examples, a higher penalty was assigned to the model for misclassifying fraud samples. This cost-sensitive approach increases the model’s sensitivity to fraud during training, leading to better performance on imbalanced classification tasks.
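As a rough illustration of inverse-frequency class weighting, the sketch below computes one weight per class and plugs it into a weighted log loss. The specific formula, `n_samples / (n_classes * class_count)`, is one common heuristic and is an assumption here; the reviewed studies may define their weights or costs differently.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_log_loss(labels, fraud_probs, weights):
    """Log loss in which each sample's error is scaled by its class weight.

    labels: 1 = fraud, 0 = legitimate; fraud_probs: predicted P(fraud).
    """
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, fraud_probs):
        w = weights[y]
        total += -w * math.log(p if y == 1 else 1.0 - p)
        weight_sum += w
    return total / weight_sum
```

With 990 legitimate and 10 fraudulent samples, the fraud class receives a weight of 50 against roughly 0.5 for the legitimate class, so a missed fraud case contributes about 99 times more to the loss than a false alarm of the same confidence.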

Feature Engineering

Feature engineering is another hot topic in detecting financial fraud by ML algorithms because it enhances models’ performance by removing redundant information and extracting useful information. This is achieved by cleaning and restructuring the data through steps such as selecting the optimal set of features and creating new features through aggregation, decomposition, or interaction of existing variables. The goal is to highlight the underlying patterns and relationships within the data, making it easier for the model to learn effectively, rather than simply feeding it raw or unstructured information. Zhang et al. (2021) proposed a feature engineering method, Homogeneity-Oriented Behaviour Analysis (HOBA), inspired and developed from a marketing technique called recency-frequency-monetary (RFM) (Van Vlasselaer et al., 2015), to extract feature variables by aggregation. RFM analyzes and aggregates variables related to three key customer behaviors: recency (how recently they purchased), frequency (how often they purchase), and monetary value (how much they spend). HOBA is an enhanced version of RFM that formalizes this approach in the financial fraud detection context through a structured aggregation strategy, extracting features based on four components: the aggregation characteristic, aggregation period, transaction behavior measure, and aggregation statistic. This strategy helps the model capture fraudulent transaction patterns. GA, a type of Evolutionary-inspired Algorithm (EA), was introduced to find the optimized feature subsets by computing model fitness (Ileberi et al., 2022; Mienye & Sun, 2023b). GNNs were applied to represent the graph nodes (Cherif et al., 2024; Harish et al., 2024; Khaled Alarfaj & Shahzadi, 2025), while an Autoencoder (AE) was implemented for feature extraction and representation (Du et al., 2024; Zioviris et al., 2024). All the work mentioned improved their models’ performance by integrating feature engineering. 
However, Nguyen et al. (2020) failed to achieve optimal results without implementing feature engineering, as the features in the datasets had little correlation.
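To make the aggregation idea concrete, the following sketch derives RFM-style features for one card over a look-back window. It is not the HOBA method itself: the record fields (`card`, `time`, `amount`), the window length, and the chosen statistics are hypothetical choices for illustration only.

```python
def rfm_features(history, card, now, window_hours):
    """Aggregate a card's earlier transactions into recency/frequency/monetary features.

    history: list of dicts with keys "card", "time" (hours), and "amount".
    Only transactions strictly before `now` are used, so no future data leaks in.
    """
    past = [t for t in history if t["card"] == card and t["time"] < now]
    recent = [t for t in past if now - t["time"] <= window_hours]
    last_time = max((t["time"] for t in past), default=None)
    total = sum(t["amount"] for t in recent)
    return {
        "recency_hours": None if last_time is None else now - last_time,
        "frequency": len(recent),
        "monetary_sum": total,
        "monetary_mean": total / len(recent) if recent else 0.0,
    }
```

In the vocabulary used above, the card identifier plays the role of the aggregation characteristic, `window_hours` the aggregation period, the amount the behavior measure, and the count, sum, and mean the aggregation statistics.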

Hyperparameter Tuning/Optimization

Hyperparameter tuning and optimization are the processes of finding the set of hyperparameters that achieves the best possible model performance, and they play an important role in ML algorithms. Several researchers (Ganji & Chaparala, 2024; Jovanovic et al., 2022; Reddy et al., 2025) applied Hybrid methods that incorporate hyperparameter tuning to enhance performance in transaction fraud detection (see Section 3.4). Since these methods require substantial computing resources, the feasibility of implementing them in a real-world system remains to be tested.

3.5.3. Cross-Validation

Cross-validation is a fundamental statistical technique in ML used to evaluate the stability and reliability of predictive models and prevent overfitting (Wu et al., 2025). As mentioned before, a few studies employed cross-validation techniques to ensure the robustness of their models. The most common implementation is k-fold cross-validation. In this approach, the dataset is first randomly partitioned into k equal, non-overlapping subsets (folds). The model is then trained and evaluated k times; in each iteration, one subset is held out as the test set while the remaining k-1 subsets are used for training. This process ensures every data point is used for testing exactly once. Finally, the k performance scores are averaged to produce a robust and generalized estimate of model performance (Zheng et al., 2018). Stratified k-fold cross-validation is a crucial variant. This method ensures each fold (subset) maintains the same proportion of class labels as the original dataset, making it essential for imbalanced classification tasks (Alfaiz & Fati, 2022). Figure 3 gives an example of 5-fold cross-validation.
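The stratified variant can be sketched as follows. This is a simplified, assumption-laden version (indices of each class are shuffled and dealt round-robin into the folds); library implementations differ in detail, and the function name and `seed` parameter are hypothetical.

```python
import random


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_indices, test_indices) pairs that preserve class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share of that class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test
```

With 95 legitimate and 5 fraudulent labels and k = 5, every test fold holds 19 legitimate samples and exactly one fraud sample, whereas a purely random partition could leave some folds with no fraud at all. The k per-fold scores would then be averaged as described above.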

3.6. Discussion

We selected several traditional metrics that are prevalent for measuring the performance of ML models, including accuracy, recall, precision, AUC, F1-score, and ROC-AUC, for comparison across studies. The results are presented in Table 7. As shown, many studies report high accuracy. However, this can be misleading due to severe class imbalance in fraud datasets, as a classifier that always predicts “not fraud” will achieve high accuracy while failing to detect any fraud. Several studies show low performance on metrics like recall, precision, and F1-score. This variation in performance is influenced by factors such as the datasets used, preprocessing methods, and the ML models employed.
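The accuracy pitfall can be checked directly from the confusion-matrix definitions. The sketch below uses a made-up label vector; the function name and the convention of returning 0.0 for undefined ratios are assumptions for the example.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 2 fraud cases among 1000 transactions, and a model that never flags fraud
y_true = [1] * 2 + [0] * 998
always_legit = [0] * 1000
report = classification_metrics(y_true, always_legit)
```

Here `report["accuracy"]` is 0.998 while recall and F1 are both 0.0, which is the situation the paragraph above warns about.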

To better understand the performance among various groups, box plots (Figure 4, Figure 5 and Figure 6) with average values are drawn to compare recall, precision, and F1-score across different taxonomies, with and without data balancing and feature engineering (Note: due to insufficient sample sizes within individual datasets, comparisons were conducted across multiple datasets). We chose these three metrics for comparison because more than half of the reviewed work adopted them. Accuracy is excluded since it is unreliable for highly unbalanced binary classification tasks. Figure 4 presents the results of the comparison across subgroups. As shown, Ensemble outperforms the other three methods according to the average recall, precision, and F1-score. Traditional ML has two outliers while Hybrid has one. F1-score, the harmonic mean of precision and recall, provides a balanced view of an ML model’s performance. The graph indicates that Ensemble obtains the highest average F1-score at 92.66%, DL ranks second at 89.38%, Hybrid is slightly lower at 88.38%, and Traditional ML has the lowest value at 83.66%. DL, Ensemble, and Hybrid outperform Traditional ML on F1-score, plausibly because their greater capacity and more robust architectures help them identify intricate patterns. This is consistent with the observation that more studies employ DL, Ensemble, and Hybrid methods for transaction fraud detection than Traditional ML approaches.

In Figure 5, studies that applied data balancing techniques demonstrate higher performance overall than studies without data balancing, although there are several outliers. The average recall, precision, and F1-score of studies using data balancing techniques are 91.02%, 93%, and 91.57%, respectively, markedly greater than those of studies without data balancing, which are 86.69%, 74%, and 69.38%. This aligns with the previous observation that data balancing techniques are widely adopted by researchers to enhance the performance of transaction fraud detection systems.

In Figure 6, studies with feature engineering have a higher average precision than studies without feature engineering, which also display some outliers in precision and F1-score. However, studies without feature engineering show higher average recall and F1-score, contradicting the assumption that feature engineering improves models’ performance by removing unrelated information and extracting valuable information. One reason might be that inappropriate adoption of feature engineering techniques leads to poor performance. For example, Alharbi et al. (2022) implemented a text2IMG conversion technique, which converted transactional data with 30 features into 5 × 6 images for modelling, but the output yielded only 51.22% recall and a 57.8% F1-score. The necessity of converting tabular data into images is questionable. Another study (Malik et al., 2022) employed SVM-based recursive feature elimination (RFE) for feature dimension reduction. The process iteratively removes the least important features to identify the optimal subset. However, this approach may discard valuable information, resulting in a recall of 64% and an F1-score of 77%. Low recall allows more fraudulent transactions to be approved, causing substantial financial losses and escalating operational risk. The results of these two studies significantly bias the overall average performance of studies with feature engineering. Moreover, several studies (Khaled Alarfaj & Shahzadi, 2025; Kim et al., 2019; Zhang et al., 2021) with feature engineering applied their models to real-world datasets from banks. Their recalls are 80.68%, 91.5%, and 75%, respectively. Real-world data, although of better quality, is usually more complex than synthetic transaction fraud datasets, and ML models often perform worse on real-world data than on synthetic data. This could also pull down the box plots and overall averages.

Finally, after comprehensive analysis, a general workflow (Figure 7) for applying ML to mitigate financial risk in the context of transaction fraud detection is developed. A single study may not contain all the depicted steps, but the figure provides a general picture that helps readers understand the overall process and the key procedures that may be incorporated in this domain. The last three steps deserve particular attention, as preprocessing techniques are essential to enhance the model’s performance, and classification is the final step that predicts the possibility of fraud.

4. Financial Implication

Addressing financial implications is critical for gaining insight into financial risk in transaction fraud, as the goal is to protect stakeholders in the financial domain. Generally, advanced automated fraud detection systems, like ML-based methods that improve their detection performance through training, are more efficient for financial institutions and their customers in fighting external fraud than manual and traditional fraud detection systems. These approaches provide alternatives for dealing with operational risk under the Basel Framework. Implementing ML also supports compliance with regulations, such as the Scam Prevention Framework recently introduced in Australia, which designates financial institutions like banks as one of the initial sectors required to introduce robust systems to combat fraud and scams and to manage the risks that might cause financial losses (The Treasury, 2024). However, most reviewed studies focus on developing ML models and rely mainly on traditional ML evaluation metrics, like accuracy, ignoring the specific significance in financial domains. This section highlights a selection of reviewed publications that emphasize financial significance, addressing key aspects such as financial metric design, feature interpretation, time series considerations, and financial data privacy.

4.1. Financial Metric Design and Direct Impact

The primary role of card transaction fraud detection is to reduce financial losses and preserve the reputation of financial institutions. Two studies adopted or designed financial metrics to evaluate ML models’ performance in this context. The Cost Reduction Rate (CRR) was introduced to quantify the cost savings achieved by a new model compared to an old one in a financial transaction fraud detection system (Kim et al., 2019). The metric is calculated by first aggregating the total cost for each model, which comprises financial losses from falsely approved fraudulent transactions and operational costs of incorrectly rejected legitimate transactions. The difference between the total costs of the old and new models is then divided by the old model’s total cost. The result, expressed as a percentage, represents the relative cost reduction achieved by the new model. Similarly, the Cost Savings (CS) measure, inspired by profit-based systems in loan default prediction, quantifies the net financial benefit an ML algorithm provides on a fraud dataset (Hajek et al., 2023). Its value is derived from the difference between the savings generated by correctly identified fraudulent transactions and the total costs incurred due to misclassifications (both false positives and false negatives). This metric provides a direct and absolute basis for comparing algorithms, where a higher CS score signifies greater overall cost-effectiveness. In real-world applications, high performance on traditional metrics such as accuracy and recall does not guarantee that a model will effectively mitigate financial risk. For instance, a high accuracy rate indicates that most samples are classified correctly. However, if the misclassified samples consist of approved fraudulent transactions, substantial financial losses can still occur, thereby escalating risk. 
Similarly, a model with high recall successfully identifies many fraudulent transactions but may misclassify numerous legitimate ones, leading to high operational costs from manual reviews, which also increases risk. In contrast, CRR and CS directly quantify financial costs, offering a more meaningful approach to assessing the true financial impact of ML models in fraud detection and providing significant value for practical risk management (The Treasury, 2024). Another study (Zheng et al., 2018) showed the direct financial impact of ML in a real-world implementation of transaction fraud detection. The system successfully detected 321 of 367 true fraudulent cases (87%) and reduced customer losses by nearly 10 million RMB for two commercial banks in China over 12 weeks. These figures demonstrate how ML can mitigate external fraud, a component of operational risk, by detecting anomalies and reducing financial losses. Consequently, advanced ML for fraud detection could even affect banks’ capital requirement strategies against financial risk, such as risk-weighted assets (RWA), allowing them to allocate more assets to other investments and potentially supporting economic growth. Zhang et al. (2021) highlighted the importance of setting a tolerance for the False Positive Rate (FPR) to balance the capture of fraudulent transactions against the misclassification of legitimate ones. Financial institutions typically prioritize a lower FPR to protect customer relationships and reputations. This strategy, however, often leads to a higher False Negative Rate (FNR), indirectly exposing customers to greater financial losses from undetected fraud. This preference can be understood through a risk-adjusted return framework. From the institution’s perspective, models are tuned to accept a certain level of risk (financial losses from false negatives) in exchange for a greater “return”: lower operational costs and preserved customer trust. 
Consequently, while this calibration may maximize the institution’s operational efficiency and profitability, it does so by effectively transferring a portion of the financial risk to the customer.
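The verbal definitions of CRR and CS given in this subsection can be turned into a small worked example. The sketch follows only the descriptions in the text; the exact cost terms used by Kim et al. (2019) and Hajek et al. (2023) may differ, and the flat per-case `review_cost` and all function names are assumptions introduced for illustration.

```python
def total_cost(missed_fraud_amounts, n_false_positives, review_cost):
    """Losses from approved fraud plus the cost of wrongly rejected legitimate transactions."""
    return sum(missed_fraud_amounts) + n_false_positives * review_cost


def cost_reduction_rate(cost_old, cost_new):
    """CRR: share of the old model's total cost that the new model removes."""
    return (cost_old - cost_new) / cost_old


def cost_savings(caught_fraud_amounts, missed_fraud_amounts, n_false_positives, review_cost):
    """CS: savings from caught fraud minus the costs of both kinds of misclassification."""
    misclassification_cost = total_cost(missed_fraud_amounts, n_false_positives, review_cost)
    return sum(caught_fraud_amounts) - misclassification_cost


# Hypothetical figures: the new model catches two more frauds but raises more false alarms.
old_cost = total_cost([500.0, 300.0, 200.0], n_false_positives=40, review_cost=5.0)
new_cost = total_cost([300.0], n_false_positives=60, review_cost=5.0)
crr = cost_reduction_rate(old_cost, new_cost)
cs_new = cost_savings([500.0, 200.0], [300.0], n_false_positives=60, review_cost=5.0)
```

In this made-up case the old model costs 1200 and the new one 600, so the CRR is 0.5 (a 50% reduction), and the CS of the new model is 700 − 600 = 100, even though the new model produces more false positives.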

4.2. Feature Explanation

Interpreting models and explaining the contributions of features to the results are persistently hard tasks for ML methods, especially for DL, which is often described as a “Black Box Model”. Nevertheless, it is important for financial institutions to understand the attributes of people who are prone to be deceived and to set up effective protection measures, as mentioned in the Scam Prevention Framework (The Treasury, 2024). Afriyie et al. (2023) analyzed the characteristics of fraudulent transactions on the Sparkov dataset. Their analysis reveals that fraudulent transactions tend to involve higher amounts than legitimate ones and disproportionately target older adults. Furthermore, fraudulent transactions occur most frequently on Sundays, a day of higher consumer activity, and during nighttime hours when monitoring may be less vigilant. These insights can help financial institutions strengthen their protective measures and mitigate this risk. Baisholan et al. (2025b) employed the Interpretable Machine Learning (IML) method, Shapley additive explanations (SHAP), to explain features’ contributions to the results. Although the feature names in the dataset are anonymized, this is a good attempt to refine their models by understanding the attributes’ contributions.

Interpretable Machine Learning (IML) Overview

SHAP, Local Interpretable Model-agnostic Explanations (LIME), and Partial Dependence Plot (PDP) are widely adopted IML methods for explaining feature influence in tabular datasets (Wang et al., 2025).

SHAP, a method inspired by cooperative game theory, explains ML model outputs by attributing importance to each input feature. The framework treats each feature as a contributor and calculates its average marginal contribution across all possible feature combinations for a given prediction. The results are typically presented as a ranked list of feature importance, allowing users to easily identify the most influential factors (Wang et al., 2025). A key limitation is its potential to be misleading when features are highly correlated, as SHAP may distribute importance unevenly among them. Additionally, the method can be computationally expensive for large datasets, as the exact calculation grows exponentially with the number of features.
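The "average marginal contribution across all possible feature combinations" can be computed exactly for a toy model, which also makes the exponential cost visible. The sketch below is a simplified illustration, not the SHAP library: it assumes that an absent feature is replaced by a fixed `baseline` value, whereas SHAP implementations typically average over background data, and the function name is hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their actual value; the rest take the baseline value.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding feature i
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi
```

For a linear model the attribution of each feature reduces to its coefficient times its deviation from the baseline, and the attributions always sum to the difference between the prediction and the baseline prediction. Each feature requires 2^(n−1) coalition evaluations, which is the exponential growth mentioned above.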

LIME explains individual predictions by ranking the contribution of each feature to that specific outcome. It does so by constructing a simple, interpretable local surrogate model, such as a linear regression, which explains how the complex “Black Box Model” behaves around one specific instance (Wang et al., 2025). While LIME is computationally efficient, it is designed only for local interpretation and does not provide a global view of feature importance.

In contrast to methods that explain individual predictions, PDP provides a global perspective by visualizing the average marginal effect of a feature across the entire dataset. It calculates how the model’s prediction changes on average when one specific feature is varied over a grid of values while all other features keep their observed values. The result is a curve that shows the overall trend between the feature and the predicted outcome, along with an indication of the uncertainty around that average effect (Wang et al., 2025). However, this approach can produce misleading results when features are highly correlated, as it may extrapolate to unrealistic combinations of feature values.
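A partial dependence curve is commonly computed by overwriting the chosen feature in every row, leaving the remaining columns as observed, and averaging the predictions. The sketch below assumes a model that is a plain Python callable on a list of feature values; the function name and grid are illustrative.

```python
def partial_dependence(model, rows, feature_index, grid):
    """Average prediction over all rows for each grid value of one feature."""
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)          # copy so the dataset itself is not altered
            modified[feature_index] = value
            predictions.append(model(modified))
        curve.append(sum(predictions) / len(predictions))
    return curve
```

Because every grid value is paired with every observed row, some of the evaluated combinations may never occur in practice (for example, a very large amount combined with a profile that only makes small purchases), which is the source of the correlation caveat noted above.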

All these methods are valuable for transaction fraud detection and financial risk prevention. However, selecting the appropriate method requires careful consideration of the specific task (whether to explain local or global influence), the application context, dataset characteristics, and the inherent strengths and limitations of each technique. A summary can be found in Figure 8.

4.3. Time Series Consideration

Applications in financial contexts usually involve time series data. Most studies on ML-based transaction fraud detection remove the “time” attribute before training, treating it as redundant and uninformative. This ignores concept drift, the common phenomenon in which the statistical properties of time series data change over time. Neglecting “time” can harm a model’s performance by limiting its generalizability to new or real-world data. A few studies (Mienye & Sun, 2023a; Tayebi & El Kafhali, 2025; Zioviris et al., 2024) consider the importance of temporal and sequential patterns during training. These studies involve Recurrent Neural Networks (RNNs), including LSTM and GRU, which are well-suited for sequential data like time series, and this inherent suitability enables them to generate more robust results. The RNN introduces the basic concept of memory for sequences, which allows it to learn temporal dependencies among instances. LSTM and GRU are enhanced versions that mitigate the vanishing gradient problem and capture long-term dependencies through their gated architectures. This allows them to capture relationships between instances that are widely separated in time. These approaches offer examples of how concept drift is considered and handled, and could inspire work on other financial time series-related problems, such as stock market prediction, loan default, and insurance fraud.
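The notion of "memory" in a recurrent network can be illustrated with a single scalar recurrent unit. This is a deliberately tiny sketch of a basic (Elman-style) recurrence with hypothetical scalar weights; it is not an LSTM or GRU and does not reproduce any reviewed model.

```python
import math


def rnn_final_state(sequence, w_x, w_h, b, h0=0.0):
    """Run h_t = tanh(w_x * x_t + w_h * h_{t-1} + b) over a sequence and return the last h."""
    h = h0
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h + b)
    return h


# Same final transaction, different history: the hidden state differs.
after_quiet_history = rnn_final_state([0.0, 1.0], w_x=1.0, w_h=0.5, b=0.0)
after_large_spend = rnn_final_state([5.0, 1.0], w_x=1.0, w_h=0.5, b=0.0)
```

With a non-zero recurrent weight `w_h`, the state after the final transaction depends on what came before it; setting `w_h` to zero removes the memory, and both histories then produce the same state.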

4.4. Financial Data Privacy

Data privacy concerns can cause underperformance in transaction fraud detection for financial institutions. Without sufficient data for training, ML-based detection systems are more likely to yield poor results and fail to classify transactions correctly. The Federated Learning (FL) framework was proposed to address financial data privacy issues (Abdul Salam et al., 2024). It enables multiple financial institutions to cooperatively train ML models on their dispersed data without sharing the raw data itself, thereby alleviating privacy concerns. This framework helps mitigate the risk of financial data leakage while ensuring compliance with stringent data protection laws, such as the EU’s General Data Protection Regulation (GDPR), a comprehensive data privacy and security standard (Sgantzos et al., 2025; Yang et al., 2019). Table 8 summarizes the financial significance addressed by relevant papers.
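One way such cooperative training can work is for each institution to train locally and send only model parameters to a coordinator, which combines them. The sketch below shows a sample-size-weighted average of parameter vectors, a rule usually called federated averaging; treating it as the aggregation step is an assumption for illustration, since the text does not specify which rule the cited framework uses.

```python
def federated_average(client_params, client_sizes):
    """Combine per-client parameter vectors, weighting each by its local sample count.

    Only parameters and counts are exchanged; raw transactions never leave a client.
    """
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[j] * size for params, size in zip(client_params, client_sizes)) / total
        for j in range(n_params)
    ]


# Two hypothetical banks: bank B holds three times as many transactions as bank A.
global_params = federated_average([[1.0, 2.0], [3.0, 6.0]], client_sizes=[1, 3])
```

In this example the combined parameters are [2.5, 5.0], closer to the larger bank's values. In a full system the averaged parameters would be sent back to the clients for the next local training round.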

5. Limitations/Challenges and Future Scopes

To wrap up, this section extracts limitations and challenges from previous analysis and develops research gaps and future scopes in managing risks associated with detecting financial transaction fraud.

5.1. Limitations/Challenges

An appropriate performance metric might be the most urgent concern in detecting financial fraud by ML for financial institutions. Some studies focus on improving accuracy, a metric that often takes a misleadingly high value in unbalanced data classification tasks. Although a higher accuracy may marginally reduce losses, it is an indirect and weaker measure than metrics like the F1-score or financial indicators (e.g., CRR, CS), which correlate more directly with actual risk and cost outcomes. Furthermore, although this is fundamentally a financial risk problem, few studies adopted financially significant metrics that assess the cost of fraud. This is inconsistent with the aim of risk management frameworks to mitigate financial losses.

Interpretability is another important limitation confronted by all types of ML methods, which are often called “Black Box Models”. This is especially true for DL, Ensemble, and Hybrid methods, which employ more complicated architectures that are difficult to understand. Under the Scam Prevention Framework, identifying individuals who could be impacted by scams is required (The Treasury, 2024). So, it is important for financial institutions to set up prevention measures by understanding the intricate relationship between features and their contributions to the results. However, only one publication in the review used IML techniques, representing a significant gap in the literature.

Data privacy/availability is one of the biggest obstacles faced by financial fraud detection. Open-source datasets are insufficient and might exclude important information as they are usually synthetically generated, while private-source or real-world datasets are inaccessible to most researchers, except for those who cooperate with financial institutions or industries. In addition, financial institutions are less likely to collaborate to build a more robust detection system due to data privacy regulations, like GDPR. For example, a bank in country A cannot simply share its customers’ transaction records with a partner bank in country B, even if the goal is to protect those same customers from cross-border fraud. Without sufficient and accurate data, ML algorithms cannot leverage their full potential to learn historical behaviors from vast amounts of data, which leads to poor performance.

Model replicability and concept drift are challenges for financial fraud detection. Most studies involve only one dataset, and promising results on a specific dataset do not guarantee that a model performs well on other datasets. Moreover, many studies ignore concept drift, which is common in time series data, and exclude the time attribute during training. This is likely to harm the models’ performance.

Data resampling is one of the techniques to address data imbalance problems, faced by all studies and methods in the field of financial fraud detection, as fraudulent transactions are usually much fewer than legitimate transactions. Although data resampling techniques help improve models’ performances, they, especially the traditional ways like undersampling and oversampling, also increase the risk of introducing bias, potential loss of valuable information, and the creation of noisy or inaccurate synthetic samples that can lead to overfitting.

The intensive computing resources required are a disadvantage of DL and Hybrid methods, which usually involve more complex structures and therefore require more training and processing time. However, real-time detection systems allow very little response time to approve or reject a transaction. For instance, when a customer makes an online purchase, the fraud detection system must analyze the transaction, check historical patterns, and decide whether to block or approve it, often within a few hundred milliseconds. Within that brief window, high latency causes system inefficiency and leads to customer complaints.

5.2. Future Scopes

Involving financial metrics and addressing more of the financial impact are pressing issues for future studies seeking to mitigate this financial risk. Instead of relying on traditional ML model evaluation metrics like accuracy, future research can prioritize CS (Hajek et al., 2023) or CRR (Kim et al., 2019) to estimate the model’s performance financially, aligning with operational risk management. Moreover, future studies are encouraged to consider the trade-off between FPR and True Positive Rate (TPR) (Zhang et al., 2021).
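The FPR/TPR trade-off can be made explicit by choosing the decision threshold under an FPR tolerance. The sketch below is an assumed, simplified procedure (flag a transaction when its score is at or above the threshold, and pick the lowest threshold whose FPR stays within the tolerance); it is not the method of Zhang et al. (2021), and the function name and return format are hypothetical.

```python
def threshold_for_fpr(scores, labels, max_fpr):
    """Return (threshold, fpr, tpr) for the lowest threshold with FPR <= max_fpr.

    Assumes both classes are present; labels use 1 = fraud, 0 = legitimate.
    Returns None if even the strictest threshold exceeds the tolerance.
    """
    negatives = sum(1 for y in labels if y == 0)
    positives = len(labels) - negatives
    best = None
    # Lowering the threshold can only add flags, so FPR never decreases along this loop.
    for thr in sorted(set(scores), reverse=True):
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fpr = fp / negatives
        if fpr > max_fpr:
            break
        best = (thr, fpr, tp / positives)
    return best
```

Tightening the tolerance moves the threshold up and lowers the TPR, which is the transfer of risk from false alarms to missed fraud discussed in Section 4.1.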

Incorporating IML strategies to explain models, features, and results in future studies of financial fraud detection is imperative. IML methods, including SHAP, feature permutation, feature occlusion, PDP, and LIME (Allen et al., 2024), are good ways to address the interpretability of “Black Box Models” and understand the models’ decisions. Moreover, they help detect unfair bias. For example, auditors can use IML techniques to check whether a model’s decisions are overly influenced by sensitive attributes like age or location.

Integrating novel technologies like GNNs, metaheuristics, time series forecasting models, blockchain, and FL to emphasize different sides of financial fraud detection could be considered. GNNs and metaheuristic algorithms could be used for feature representation and hyperparameter optimization to enhance performance, while time series forecasting models could be included to address concept drift. Blockchain, known for its distributed ledger and secured blocks, together with FL, can enhance the performance of ML in real-time financial transaction fraud detection without sharing sensitive information between financial institutions. This combination could effectively mitigate the data privacy constraints that prevent banks from sharing sensitive information with each other. Moreover, the FL framework allows organizations across sectors, such as financial institutions and telecommunication companies, to work together to integrate multimodal data, such as financial transaction data (tabular), customers’ portraits (image), message contents (text), and phone call contents (audio), developing more robust and resilient detection systems while complying with regulations like GDPR.

Assigning class weights or implementing ML algorithms for combating data imbalance lacks investigation. Assigning class weights (Alharbi et al., 2022; Baisholan et al., 2025b; Cherif et al., 2024; Zhao et al., 2023) can maintain a realistic distribution in the dataset while ML algorithms (Tayebi & El Kafhali, 2025; Zheng et al., 2018) make the generated data points more reliable.

Various advanced feature engineering and hyperparameter tuning strategies are expected to enhance model performance. More preprocessing strategies could be focused on in future studies of financial fraud detection. For example, our findings suggest that methods like the statistical HOBA framework and complex neural networks, such as Autoencoder (AE), demonstrate superior power for model improvement. Future work could leverage such approaches, including statistical feature selection that aggregates fraud-related components and advanced neural networks for effective feature representation.

6. Conclusions

This research exhaustively analyzes 41 recent papers on ML for detecting transaction fraud—a key financial operational risk. It provides a comprehensive review of their advantages and disadvantages. Traditional ML methods like Logistic Regression (LR) and Random Forest (RF) demonstrate strong classification performance on small, randomly undersampled transaction fraud datasets. However, they may show instability when classifying new data points. DL approaches, particularly Transformer models and Continuous-Coupled Neural Networks (CCNNs)—which were originally designed for unstructured data—have shown superior power on transactional tabular data. However, DL suffers from difficulties such as poor interpretability and intensive computational demands. Ensemble and Hybrid methods, which strategically combine multiple ML techniques, tend to deliver more robust performance in fraud detection. Nonetheless, they face limitations, such as issues with generalizability, interpretability, or high resource consumption, depending on the specific base algorithms used. A comparison across the taxonomy indicates that Ensemble methods demonstrate superior average performance over the other three categories. Data balancing generally improves model performance, while inappropriate feature engineering can negatively impact results. Employing financial metrics, applying feature explanation techniques, considering temporal patterns, and addressing financial data privacy are all crucial for assessing financial significance in external fraud prevention. These elements are also key to aligning with risk management frameworks, such as the Basel operational risk standards (Basel Committee on Banking Supervision, 2024), and regulations like GDPR and Scam Prevention Framework (The Treasury, 2024).

Future work in this domain could employ more preprocessing alternatives, focus on metrics such as the False Positive Rate (FPR), True Positive Rate (TPR), and other financial metrics like Cost Reduction Rate (CRR), Cost Savings (CS), to evaluate model performance and financial impact, and implement Interpretable ML (IML) strategies to explain the results. Advanced technologies like Federated Learning (FL), which allows collaborations across multiple organizations, could be considered to mitigate data privacy concerns. The use of the Hybrid Method, which integrates multiple ML techniques for tasks such as data balancing, feature engineering, hyperparameter optimization, and classification to enhance model performance, and the Ensemble Method, which leverages the diversity of the various ML models to yield a more robust result, might be a trend for future-generation financial fraud detection systems.

Author Contributions

Conceptualization, T.C. and T.M.; methodology, T.C. and R.S.; formal analysis, T.C., R.S., T.M. and S.S.; writing—original draft preparation, T.C.; writing—review and editing, T.C., R.S., T.M. and S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

We would like to thank the editors and reviewers for their constructive comments and insightful advice, which helped us to significantly improve the presentation of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Abdul Salam, M., Fouad, K. M., Elbably, D. L., & Elsayed, S. M. (2024). Federated learning model for credit card fraud detection with data balancing techniques. Neural Computing and Applications, 36(11), 6231–6256.
Afriyie, J. K., Tawiah, K., Pels, W. A., Addai-Henne, S., Dwamena, H. A., Owiredu, E. O., Ayeh, S. A., & Eshun, J. (2023). A supervised machine learning algorithm for detecting and predicting fraud in credit card transactions. Decision Analytics Journal, 6, 100163.
Ahmed, K. H., Axelsson, S., Li, Y., & Sagheer, A. M. (2025). A credit card fraud detection approach based on ensemble machine learning classifier with hybrid data sampling. Machine Learning with Applications, 20, 100675.
Alarfaj, F. K., Malik, I., Khan, H. U., Almusallam, N., Ramzan, M., & Ahmed, M. (2022). Credit card fraud detection using state-of-the-art machine learning and deep learning algorithms. IEEE Access, 10, 39700–39715.
Alfaiz, N. S., & Fati, S. M. (2022). Enhanced credit card fraud detection model using machine learning. Electronics, 11(4), 662.
Alharbi, A., Alshammari, M., Okon, O. D., Alabrah, A., Rauf, H. T., Alyami, H., & Meraj, T. (2022). A novel text2IMG mechanism of credit card fraud detection: A deep learning approach. Electronics, 11(5), 756.
Allen, G. I., Gan, L., & Zheng, L. (2024). Interpretable machine learning for discovery: Statistical challenges and opportunities. Annual Review of Statistics and Its Application, 11(1), 97–121.
Baisholan, N., Dietz, J. E., Gnatyuk, S., Turdalyuly, M., Matson, E. T., & Baisholanova, K. (2025a). A Systematic review of machine learning in credit card fraud detection under original class imbalance. Computers, 14(10), 437.
Baisholan, N., Dietz, J. E., Gnatyuk, S., Turdalyuly, M., Matson, E. T., & Baisholanova, K. (2025b). FraudX AI: An interpretable machine learning framework for credit card fraud detection on imbalanced datasets. Computers, 14(4), 120.
Basel Committee on Banking Supervision. (2024). Basel framework. Bank for International Settlements. Available online: https://www.bis.org/basel_framework/index.htm?m=97 (accessed on 3 December 2025).
Cheah, P. C. Y., Yang, Y., & Lee, B. G. (2023). Enhancing financial fraud detection through addressing class imbalance using hybrid SMOTE-GAN techniques. International Journal of Financial Studies, 11(3), 110.
Chen, Y., Zhao, C., Xu, Y., Nie, C., & Zhang, Y. (2025). Deep learning in financial fraud detection: Innovations, challenges, and applications. Data Science and Management, S2666764925000372.
Cherif, A., Ammar, H., Kalkatawi, M., Alshehri, S., & Imine, A. (2024). Encoder–decoder graph neural network for credit card fraud detection. Journal of King Saud University - Computer and Information Sciences, 36(3), 102003.
Chhabra, R., Goswami, S., & Ranjan, R. K. (2023). A voting ensemble machine learning based credit card fraud detection using highly imbalance data. Multimedia Tools and Applications, 83(18), 54729–54753.
Dal Pozzolo, A., Caelen, O., Le Borgne, Y.-A., Waterschoot, S., & Bontempi, G. (2014). Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915–4928.
Dang, T. K., Tran, T. C., Tuan, L. M., & Tiep, M. V. (2021). Machine learning based on resampling approaches and deep reinforcement learning for credit card fraud detection systems. Applied Sciences, 11(21), 10004.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In G. Goos, J. Hartmanis, & J. Van Leeuwen (Eds.), Multiple classifier systems (Vol. 1857, pp. 1–15). Springer.
Dong, H., Liu, S., & Tran, D. (2025). Enhanced autoencoder model for robust anomaly detection in financial fraud with imbalanced data. In M. Mahmud, M. Doborjeh, K. Wong, A. C. S. Leung, Z. Doborjeh, & M. Tanveer (Eds.), Neural information processing (Vol. 2293, pp. 184–198). Springer Nature.
Du, H., Lv, L., Wang, H., & Guo, A. (2024). A novel method for detecting credit card fraud problems. PLoS ONE, 19(3), e0294537.
Elgiriyewithana, N. (2023). Credit card fraud detection dataset 2023 [Dataset]. Kaggle. Available online: https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023 (accessed on 18 September 2025).
Federal Trade Commission. (2024). As nationwide fraud losses top $10 billion in 2023, FTC steps up efforts to protect the public. Available online: https://www.ftc.gov/news-events/news/press-releases/2024/02/nationwide-fraud-losses-top-10-billion-2023-ftc-steps-efforts-protect-public (accessed on 23 August 2025).
Ganji, V. R., & Chaparala, A. (2024). Wave Hedges distance-based feature fusion and hybrid optimization-enabled deep learning for cyber credit card fraud detection. Knowledge and Information Systems, 66(11), 7005–7030.
Hafez, I. Y., Hafez, A. Y., Saleh, A., Abd El-Mageed, A. A., & Abohany, A. A. (2025). A systematic review of AI-enhanced techniques in credit card fraud detection. Journal of Big Data, 12(1), 6.
Hajek, P., Abedin, M. Z., & Sivarajah, U. (2023). Fraud detection in mobile payment systems using an XGBoost-based framework. Information Systems Frontiers, 25(5), 1985–2003.
Harish, S., Lakhanpal, C., & Jafari, A. H. (2024). Leveraging graph-based learning for credit card fraud detection: A comparative study of classical, deep learning and graph-based approaches. Neural Computing and Applications, 36(34), 21873–21883.
Ileberi, E., & Sun, Y. (2024). Advancing model performance with ADASYN and recurrent feature elimination and cross-validation in machine learning-assisted credit card fraud detection: A comparative analysis. IEEE Access, 12, 133315–133327.
Ileberi, E., Sun, Y., & Wang, Z. (2021). Performance evaluation of machine learning methods for credit card fraud detection using SMOTE and AdaBoost. IEEE Access, 9, 165286–165294.
Ileberi, E., Sun, Y., & Wang, Z. (2022). A machine learning based credit card fraud detection using the GA algorithm for feature selection. Journal of Big Data, 9(1), 24.
Itoo, F., Meenakshi, & Singh, S. (2021). Comparison and analysis of logistic regression, Naïve Bayes and KNN machine learning algorithms for credit card fraud detection. International Journal of Information Technology, 13(4), 1503–1511.
Jovanovic, D., Antonijevic, M., Stankovic, M., Zivkovic, M., Tanaskovic, M., & Bacanin, N. (2022). Tuning machine learning models using a group search firefly algorithm for credit card fraud detection. Mathematics, 10(13), 2272.
Khaled Alarfaj, F., & Shahzadi, S. (2025). Enhancing fraud detection in banking with deep learning: Graph neural networks and autoencoders for real-time credit card fraud prevention. IEEE Access, 13, 20633–20646.
Khalid, A. R., Owoh, N., Uthmani, O., Ashawa, M., Osamor, J., & Adejoh, J. (2024). Enhancing credit card fraud detection: An ensemble machine learning approach. Big Data and Cognitive Computing, 8(1), 6.
Kim, E., Lee, J., Shin, H., Yang, H., Cho, S., Nam, S., Song, Y., Yoon, J., & Kim, J. (2019). Champion-challenger analysis for credit card fraud detection: Hybrid ensemble and deep learning. Expert Systems with Applications, 128, 214–224.
Kitchenham, B., & Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering (Version 2.3, EBSE Technical Report No. EBSE-2007-01). Keele University and University of Durham. Available online: https://www.researchgate.net/publication/302924724_Guidelines_for_performing_Systematic_Literature_Reviews_in_Software_Engineering (accessed on 11 August 2025).
Lopez-Rojas, E. (2017a). Synthetic data from a financial payment system [Dataset]. Kaggle. Available online: https://www.kaggle.com/datasets/ealaxi/banksim1 (accessed on 18 September 2025).
Lopez-Rojas, E. (2017b). Synthetic financial datasets for fraud detection [Dataset]. Kaggle. Available online: https://www.kaggle.com/datasets/ealaxi/paysim1 (accessed on 18 September 2025).
Malik, E. F., Khaw, K. W., Belaton, B., Wong, W. P., & Chew, X. (2022). Credit card fraud detection using a new hybrid machine learning architecture. Mathematics, 10(9), 1480.
Mienye, I. D., & Sun, Y. (2023a). A deep learning ensemble with data resampling for credit card fraud detection. IEEE Access, 11, 30628–30638.
Mienye, I. D., & Sun, Y. (2023b). A machine learning method with hybrid feature selection for improved credit card fraud detection. Applied Sciences, 13(12), 7254.
Moradi, F., Tarif Hokmabadi, M., & Homaei, M. (2025). A systematic review of machine learning in credit card fraud detection. Preprints.
Najadat, H., Altiti, O., Aqouleh, A. A., & Younes, M. (2020, April 7–9). Credit card fraud detection based on machine and deep learning. 2020 11th International Conference on Information and Communication Systems (ICICS) (pp. 204–208), Irbid, Jordan.
Nguyen, T. T., Tahir, H., Abdelrazek, M., & Babar, A. (2020). Deep learning methods for credit card fraud detection. arXiv.
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tetzlaff, J. M., Akl, E. A., Brennan, S. E., Chou, R., Glanville, J., Grimshaw, J. M., Hróbjartsson, A., Lalu, M. M., Li, T., Loder, E. W., Mayo-Wilson, E., McDonald, S., … Moher, D. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, n71.
Reddy, S. S., Amrutha, K., Mnssvkr Gupta, V., Vssr Murthy, K., & Rao, V. V. R. M. (2025). Optimizing hyperparameters for credit card fraud detection with nature-inspired metaheuristic algorithms in machine learning. Journal of The Institution of Engineers (India): Series B, 106, 2005–2030.
Sailusha, R., Gnaneswar, V., Ramesh, R., & Rao, G. R. (2020, May 13–15). Credit card fraud detection using machine learning. 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India.
Sgantzos, K., Tzavaras, P., Al Hemairy, M., & Porras, E. R. (2025). Triple-entry accounting and other secure methods to preserve user privacy and mitigate financial risks in AI-empowered lifelong education. Journal of Risk and Financial Management, 18(4), 176.
Shenoy, K. (2022). Credit card transactions fraud detection dataset [Dataset]. Kaggle. Available online: https://www.kaggle.com/datasets/kartik2112/fraud-detection (accessed on 18 September 2025).
Talukder, M. A., Khalid, M., & Uddin, M. A. (2024). An integrated multistage ensemble machine learning model for fraudulent transaction detection. Journal of Big Data, 11(1), 168.
Tanouz, D., Subramanian, R. R., Eswar, D., Reddy, G. V. P., Kumar, A. R., & Praneeth, C. V. N. M. (2021, May 6–8). Credit card fraud detection using machine learning. 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS) (pp. 967–972), Madurai, India.
Tayebi, M., & El Kafhali, S. (2025). Combining autoencoders and deep learning for effective fraud detection in credit card transactions. Operations Research Forum, 6(1), 8.
The Treasury. (2024). Scams prevention framework—Summary of reforms. Available online: https://treasury.gov.au/sites/default/files/2024-09/c2024-573813-summary.pdf (accessed on 16 October 2025).
ULB Machine Learning Group. (2018). Credit card fraud detection [Dataset]. Kaggle. Available online: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud (accessed on 18 September 2025).
Van Vlasselaer, V., Bravo, C., Caelen, O., Eliassi-Rad, T., Akoglu, L., Snoeck, M., & Baesens, B. (2015). APATE: A novel approach for automated credit card transaction fraud detection using network-based extensions. Decision Support Systems, 75, 38–48.
Vashistha, A., & Tiwari, A. K. (2024). Building resilience in banking against fraud with hyper ensemble machine learning and anomaly detection strategies. SN Computer Science, 5(5), 556.
Wang, Z., Chen, X., Wu, Y., Jiang, L., Lin, S., & Qiu, G. (2025). A robust and interpretable ensemble machine learning model for predicting healthcare insurance fraud. Scientific Reports, 15(1), 218.
Wu, Y., Wang, L., Li, H., & Liu, J. (2025). A deep learning method of credit card fraud detection based on continuous-coupled neural networks. Mathematics, 13(5), 819.
Yang, Q., Liu, Y., Chen, T., & Tong, Y. (2019). Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology, 10(2), 1–19.
Yu, C., Xu, Y., Cao, J., Zhang, Y., Jin, Y., & Zhu, M. (2024, August 12–14). Credit card fraud detection using advanced transformer model. 2024 IEEE International Conference on Metaverse Computing, Networking, and Applications (MetaCom) (pp. 343–350), Hong Kong, China.
Zeng, Q., Lin, L., Jiang, R., Huang, W., & Lin, D. (2025). NNEnsLeG: A novel approach for e-commerce payment fraud detection using ensemble learning and neural networks. Information Processing & Management, 62(1), 103916.
Zhang, X., Han, Y., Xu, W., & Wang, Q. (2021). HOBA: A novel feature engineering methodology for credit card fraud detection with a deep learning architecture. Information Sciences, 557, 302–316.
Zhao, F., Li, G., Lyu, Y., Ma, H., & Zhu, X. (2023). A cost-sensitive ensemble deep forest approach for extremely imbalanced credit fraud detection. Quantitative Finance, 23(10), 1397–1409.
Zheng, Y.-J., Zhou, X.-H., Sheng, W.-G., Xue, Y., & Chen, S.-Y. (2018). Generative adversarial network based telecom fraud detection at the receiving bank. Neural Networks, 102, 78–86.
Zhou, Z.-H., & Feng, J. (2019). Deep forest. National Science Review, 6(1), 74–86.
Zioviris, G., Kolomvatsos, K., & Stamoulis, G. (2024). An intelligent sequential fraud detection model based on deep learning. The Journal of Supercomputing, 80(10), 14824–14847.

Figure 1. Review workflow.

Figure 2. Distribution of methods by category.

Figure 3. 5-fold cross-validation workflow.

Figure 4. Comparison of metrics across subgroups.

Figure 5. Comparison of metrics across data balance techniques.

Figure 6. Comparison of metrics across feature engineering techniques.

Figure 7. General workflow.

Figure 8. Interpretable Machine Learning (IML) methods overview.

Table 1. Comparison with recent reviews in 2025.

Citation	Card Transaction Fraud Focus	Various Datasets Description	Preprocessing: Data Balancing	Preprocessing: Feature Engineering	Preprocessing: Hyperparameter Tuning	Pros and Cons	Financial Implications: Financial Metrics	Financial Implications: Financial Risk and Direct Impact	Financial Implications: IML	Financial Implications: Time Series	Financial Implications: Data Privacy with Regulation Compliance	Current Issue and Future Scope
This work	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
(Moradi et al., 2025)	✓									✓		✓
(Hafez et al., 2025)	✓	✓	✓		✓	✓			✓			✓
(Chen et al., 2025)			✓	✓			✓		✓		✓	✓
(Baisholan et al., 2025a)	✓	✓							✓			✓

Table 2. Selected work on Transaction Fraud Detection by Traditional ML.

Citation	Specific Method	Resample Tech	Feature Engineering	Dataset	Highlight	Result	Limitation/Challenge
(Itoo et al., 2021)	LR, NB, KNN	RUS		European credit card dataset		LR achieved an accuracy of 95.9%, NB 91%, and KNN 75%	Training on fewer data points may result in poor generalizability
(Tanouz et al., 2021)	RF, DT, LR, NB	RUS	Removing highly correlated features	European credit card dataset		RF with 96.77% accuracy, 100% precision, 91.11% recall, 95.35% F1-scores, and 95.55% ROC-AUC score	Model generalizability, robustness
(Dang et al., 2021)	RF, KNN, DT, LR, AdaBoost, XGBoost, DNN	SMOTE, Adaptive Synthetic Sampling (ADASYN)		European credit card dataset		High performance when resampling methods are applied to both the train and test datasets. DRL is ineffective when detecting fraud.	Model generalization, data availability, and low performance when the test data is not balanced
(Ileberi et al., 2022)	DT, RF, LR, ANN, NB with GA	SMOTE	GA with fitness function RF and an accuracy measure to reduce feature dimension	European credit card dataset, Synthetic Credit Card Fraud Dataset		GA-RF with an accuracy of 99.98%, GA-DT with an accuracy of 99.92%	Recall and F1-score still need improvements, due to their importance in the context
(Mienye & Sun, 2023b)	IG-GAW (ELM with information gain and GA)		Combination of IG and GA with ELM and a fitness measure of geometric means to select features	European credit card dataset		Sensitivity: 0.997; Specificity: 0.994 (stratified 10-fold cross-validation)	Other state-of-the-art ML models can be combined with the proposed feature selection method to compare performance
(Afriyie et al., 2023)	RF, DT, LR			Sparkov Dataset	Exploratory data analysis is conducted with feature interpretation	RF achieved an AUC value of 98.9% and an accuracy value of 96.0%	Poor F1-score and precision, at only 17% and 9%, respectively
(Abdul Salam et al., 2024)	RF, KNN, DT, NB, CNN	RUS, ROS, SMOTE, ADASYN		European credit card dataset	Data privacy concerns are addressed by proposing a federated learning framework	RF with an accuracy of 99.99% under cross-validation	More advanced algorithms are expected under the federated learning framework
(Ileberi & Sun, 2024)	DT, RF, XGBoost, LightGBM, and LR	ADASYN	Recursive Feature Elimination with Cross-Validation (RFECV) for feature elimination	European credit card dataset		XGBoost and RF with ADASYN and RFECV achieved the best Matthews Correlation Coefficient (MCC) of 0.9994 and 0.9991, respectively	Model generalization on other datasets

Table 3. Selected work on Transaction Fraud Detection by DL.

Citation	Specific Method	Resample Tech	Feature Engineering	Dataset	Highlight	Result	Limitation/Challenge
(Kim et al., 2019)	Challenger (Deep Feed-Forward Network) and Champion		Hand-engineered by highly experienced domain experts	Real-world data from a South Korean bank		The Challenger outperforms the Champion, with +3.8% of recall and +5.5% of cost reduction rate	Model interpretability of features’ contributions to the results
(Najadat et al., 2020)	BiLSTM-MaxPooling-BiGRU-MaxPooling	ROS, RUS, SMOTE	Categorical features are embedded by spatial dropout and a dropout layer	IEEE-CIS Fraud Detection Dataset		AUC of 91.37% with ROS	Model interpretability. The time series feature (date) is removed before modelling; future work should consider the temporal pattern, since LSTM can capture it and this may improve performance
(Nguyen et al., 2020)	LSTM	RUS, NM (Near Miss), SMOTE		European credit card dataset, Small Card Data, Tall Card Data		F1-Score of 84.85% on the European dataset	Model interpretability and poor performance with SCD and TCD datasets
(Zhang et al., 2021)	DBN, CNN, RNN with HOBA		HOBA to extract feature variables	A real-world dataset from a Chinese bank	A novel feature engineering method, HOBA	DBN with HOBA outperforms the others, with an accuracy of 98.25% and a recall of 51.96%, and with an accuracy of 96.51% and a recall of 75% under a 3% FPR tolerance (10-fold cross-validation)	Intensive computing resources, low recall without data resampling
(Alarfaj et al., 2022)	CNN with 20 layers	Removing non-fraudulent transactions		European credit card dataset		Accuracy, F1-score, precision, and AUC are 99.9%, 85.71%, 93%, and 98%, respectively	Model interpretability, data availability
(Yu et al., 2024)	Transformer	Random sampling (not specifically mentioned)	T-SNE, PCA, SVD	European credit card dataset		The transformer model with a multi-head attention mechanism achieved a recall of 0.998, an F1-score of 0.998 (cross-validation)	Model interpretability
(Cherif et al., 2024)	Encoder–decoder-based GNNs		GNN for graph (node) representation and encoder–decoder for feature representation	Sparkov dataset		Precision of 0.82, recall of 0.92, F1-score of 0.86, AUC-ROC of 0.92	Model interpretability, relatively low performance for the graph method
(Harish et al., 2024)	GNNs with RGCN		GNN for graph (node) representation and RGCN for feature representation	IBM TabFormer Dataset, Sparkov dataset		F1-score of 0.78 and recall rate of 0.66 for the IBM dataset, and F1-score of 0.61 and recall rate of 0.46 for the Sparkov dataset	Model interpretability, poor performance
(Khaled Alarfaj & Shahzadi, 2025)	GNNs, Autoencoders (AE)	Min-Max Scaling	GNN for graph (node) representation	Real-world datasets from two Pakistani banks		Autoencoder achieves accuracy, precision, recall, F1-score of 79.01%, 79.98%, 80.68%, 79.78% while GNN achieves accuracy, precision, recall, F1-score of 78.01%, 80.98%, 79.68%, 80.78%	Model interpretability, relatively low performance for the graph method
(Wu et al., 2025)	CCNN	SMOTE	The features are sequentially placed in a 6 × 6 matrix to fit the deep learning method	European credit card dataset		Accuracy of 0.9998, precision of 0.9996, recall of 1.00, and an F1-score of 0.9998 (10-fold cross-validation)	Model interpretability, data availability
(Dong et al., 2025)	Enhanced AE			The Bank Account Fraud (BAF) dataset, The BankSim Dataset		Accuracy of 0.8194, precision of 0.9969, recall of 0.8200, specificity of 0.7664, and F1-score of 0.8999 on BAF, while precision of 0.9996, recall of 0.9520, F1-score of 0.9752, and specificity of 0.9700 on BankSim	Model interpretability

Table 4. Selected work on Transaction Fraud Detection by Ensemble Method.

Citation	Specific Method	Resample Tech	Feature Engineering	Dataset	Highlight	Result	Limitation/Challenge
(Sailusha et al., 2020)	RF, AdaBoost			European credit card dataset		RF and AdaBoost had the same accuracy, but RF performed better according to precision, recall, and F1-score	Low performance without any data resampling or feature engineering
(Ileberi et al., 2021)	DT-AdaBoost, RF-AdaBoost, ET-AdaBoost, XGB-AdaBoost, LR-AdaBoost	SMOTE		European credit card dataset		Both ET-AdaBoost and XGB-AdaBoost achieved an accuracy of 99.98% and an MCC of 0.99	Data availability, model interpretability
(Alfaiz & Fati, 2022)	AllKNN-CatBoost	AllKNN and the other 18 resampling methods		European credit card dataset	Comprehensive comparison study of a combination of 19 different resampling methods and ML algorithms to address the unbalanced issue	AllKNN-CatBoost achieved AUC value of 97.94%, Recall value of 95.91%, and F1-score value of 87.40%	Model generalizability, interpretability
(Hajek et al., 2023)	XGBoost, XGBOD	RUS	Outlier detection algorithms are used to generate new features	PaySim, BankSim	A novel cost savings (CS) metric is proposed to evaluate the financial impact	XGBoost-RUS achieved the best total CS of 4,866.9; XGBOD achieved the best AUC of 99.68% (5-fold cross-validation)	Cost savings (CS) do not apply to test data
(Chhabra et al., 2023)	Voting ensemble with base classifiers KNN, RF, LR	RUS, ROS		European credit card dataset		Accuracy of 100% and 99% for training and testing, when applying ROS and assigning the highest weight to RF	Model generalizability, synthetic dataset
(Zhao et al., 2023)	CE-gcForest			European credit card dataset, a real-world dataset from a Chinese bank	Higher costs are assigned to the fraud class by the cost-sensitive gcForest to address the data imbalance problem	AUC of 98.01% and 98.25% respectively, on the two datasets
(Mienye & Sun, 2023a)	Stacking with base learners of LSTM, GRU, and a meta-learner of MLP	SMOTE-ENN		European credit card dataset	LSTM and GRU consider the sequential patterns	Sensitivity: 1; specificity: 99.7%; AUC: 100% (10-fold cross-validation)	Model interpretability
(Khalid et al., 2024)	Voting ensemble with base classifiers SVM, LR, KNN, RF, Bagging	RUS, SMOTE		European credit card dataset		The proposed model with SMOTE achieved an accuracy of 99.96%, but performed poorly with RUS.	Model generalizability, synthetic dataset
(Talukder et al., 2024)	EIBMC	IHT+EMC		European credit card dataset		AUC score of 100%	Model generalizability; the voting weights were defined by accuracy, which may introduce biases
(Ahmed et al., 2025)	Voting ensembles with base classifiers RF, KNN, AdaBoost	SMOTE-ENN		European credit card dataset		Accuracy, recall, and an F1-score of 99.9%, 100%, and 99.9%	Model generalizability, synthetic dataset
(Baisholan et al., 2025b)	FraudX AI			European credit card dataset	SHAP was conducted to interpret the contributions of features to the result.	Recall value of 95% and AUC-PR of 97%	Model generalization on other datasets and the SHAP method could be tried on datasets with specific feature names

Table 5. Selected work on transaction fraud detection by a hybrid method.

Citation	Specific Method	Resample Tech	Feature Engineering	Dataset	Highlight	Result	Limitation/Challenge
(Zheng et al., 2018)	GAN with Denoising AE and GMMs	GAN with decoders and GMMs		Real-world data from two commercial banks in China	The generator, along with GMMs, makes the resample more robust and credible	The system successfully detects 321 of 367 true fraudulent cases (87%) and reduces customer losses of nearly 10 million RMB over 12 weeks, and the misclassification rate is about 0.68%
(Malik et al., 2022)	AdaBoost + LightGBM	SMOTE-ENN	SVM-RFE	IEEE-CIS Fraud Detection Dataset		AUC-ROC value of 0.82, recall value of 0.64, precision of 0.97, F1-score of 0.77	Relatively low performance
(Jovanovic et al., 2022)	SVM, ELM, and XGBoost with GSFA	SMOTE		European credit card dataset	Nature-inspired meta-heuristic algorithms adopted to optimize hyperparameters	XGBoost-GSFA without SMOTE achieves the best performance with a recall of 0.9997, an F1-score of 0.9997, AUC-ROC of 1	Model generalization, intensive computing resources
(Alharbi et al., 2022)	Novel text2IMG with CNN and Coarse-KNN		Transactional data converted to images by text2IMG	European credit card dataset		An accuracy of 99.87% under 5- and 10-fold cross-validation	Low recall; unclear whether converting tabular data to images is necessary; model interpretability
(Cheah et al., 2023)	FNN + CNN with GANified-SMOTE	GANified-SMOTE		European credit card dataset		F1-score of 0.89	Model interpretability and generalizability
(Zioviris et al., 2024)	LSTM with AE and VAE	SMOTE, K-Means SMOTE, Borderline SMOTE, ADASYN	Dimensionality reduction by AE and VAE	BankSim	The proposed model considers the temporal pattern of the dataset since LSTM is good at dealing with time series data	AUC of 99.61% by K-Means SMOTE-AE-LSTM	Model interpretability and generalizability
(Du et al., 2024)	AE-XGB-SMOTE-CGAN	SMOTE-CGAN	AE for feature extraction and presentation	European credit card dataset		Recall of 0.8929	Model interpretability and generalizability
(Ganji & Chaparala, 2024)	DMSSPO_ZFNet		Wave Hedge distance and DNFN for feature fusion	European credit card dataset		Accuracy, sensitivity, and specificity at 0.961, 0.961, and 0.951	Model interpretability and generalizability
(Tayebi & El Kafhali, 2025)	GB_ALSTM with ASVM	AVSM (AE with Support Vector Machine)		European credit card dataset	AVSM ensures the robustness of the generation. LSTM captures the temporal pattern of the feature.	99.99% accuracy, 98% precision, 99.99% recall, and 97.77% F-measure	Interpretability, generalization to other datasets
(Zeng et al., 2025)	NNEnsLeG (Neural Network Based Ensemble Learning with Generation)	GAN		European credit card dataset		AUC-ROC of 0.9862	Model generalization, intensive computing resources
(Reddy et al., 2025)	XGBoost, CatBoost, and LightGBM with EHO, MSA, and SMA	Borderline SMOTE		PaySim	Nature-inspired meta-heuristic algorithms adopted to optimize hyperparameters	XGBoost with EHO had an accuracy of 98% and an AUC-ROC of 0.997 (20-fold cross-validation)	Model generalization, intensive computing resources

Table 6. Datasets used by the reviewed work.

Datasets Name	Times Used
European credit card dataset	28
Sparkov dataset	3
BankSim dataset	3
PaySim dataset	2
IEEE-CIS Fraud Detection dataset	2
The Bank Account Fraud (BAF) dataset	1
IBM TabFormer dataset	1
Small Card Data	1
Tall Card Data	1
Real-world data from financial institutions	5 out of 41 studies

Table 7. Comparison of reviewed work among widely adopted metrics.

Citation	Accuracy	Recall (Sensitivity)	Precision	AUC	F1-Score	ROC-AUC
(Zheng et al., 2018)		88.33%	4.51%
(Kim et al., 2019)		91.5%
(Sailusha et al., 2020)	100%	77%	95%	94.29%	85%
(Najadat et al., 2020)		94.59%	91.14%	91.37%	92.81%
(Itoo et al., 2021)	95.9%	83.9%	99.1%	91.8%	90.9%
(Nguyen et al., 2020)					84.85%
(Tanouz et al., 2021)	96.77%	91.11%	100%		95.35%	95.55%
(Ileberi et al., 2021)	99.98%	99.96%	99.93%
(Dang et al., 2021)	99.99%	100%	99.98%	100%	99.99%
(Zhang et al., 2021)	96.51%	75%
(Alarfaj et al., 2022)	99.9%		93%	98%	85.71%
(Alfaiz & Fati, 2022)	99.96%	95.91%	80.28%	97.94%	87.4%
(Malik et al., 2022)		64%	97%		77%	82%
(Ileberi et al., 2022)	99.98%	72.56%	95.34%		82.41%
(Jovanovic et al., 2022)	99.97%	99.97%	99.97%		99.97%	100%
(Alharbi et al., 2022)	99.87%	51.22%			57.8%
(Hajek et al., 2023)	99.94%	77.93%	99.42%	99.58%	87.37%
(Chhabra et al., 2023)	99.99%
(Zhao et al., 2023)		82.11%	88.45%	98.01%	85.14%
(Afriyie et al., 2023)	96%	93%	9%	98.9%	17%
(Mienye & Sun, 2023a)		100%		100%
(Mienye & Sun, 2023b)		99.7%		99%
(Cheah et al., 2023)		85%	94%		89%
(Zioviris et al., 2024)		99.89%	98.94%	99.61%	99.14%
(Abdul Salam et al., 2024)	99.99%	100%	99.98%		99.99%
(Khalid et al., 2024)	99.96%	99.96%	99.96%		99.96%	100%
(Ileberi & Sun, 2024)	99.98%	99.99%	99.92%		99.95%
(Yu et al., 2024)		99.8%	99.8%		99.8%	99%
(Du et al., 2024)	99.87%	89.29%
(Talukder et al., 2024)	99.84%	99.14%	99.91%	100%	99.52%
(Ganji & Chaparala, 2024)	96.1%	96.1%
(Cherif et al., 2024)	97%	92%	82%		86%	92%
(Harish et al., 2024)		66%	94%		78%
(Khaled Alarfaj & Shahzadi, 2025)	79.01%	80.68%	79.98%		79.78%
(Tayebi & El Kafhali, 2025)	99.99%	99.99%	98%		97.77%
(Zeng et al., 2025)						98.62%
(Reddy et al., 2025)	98%	97.2%	97.8%		98%	99.7%
(Ahmed et al., 2025)	99.9%	100%	99.9%		99.9%	100%
(Baisholan et al., 2025b)	99%	95%	100%		97%
(Wu et al., 2025)	99.98%	100%	99.96%		99.98%
(Dong et al., 2025)	95.23%	95.2%	99.96%		97.52%	96%

Table 8. Publications that addressed Financial Significance.

Citation	Method Category	Method Used	Financial Significance
(Zheng et al., 2018)	Hybrid	GAN with Denoising AE and GMMs	The experiment was conducted in a real-world scenario and prevented substantial losses (10 million RMB)
(Kim et al., 2019)	DL	Challenger (Deep Feed-Forward Network) and Champion	Cost reduction rate (CRR) was adopted to estimate the financial impact of different models
(Zhang et al., 2021)	DL	DBN, CNN, RNN with HOBA	Financial institutions usually tolerate only a low FPR to preserve their reputation, but a lower FPR lets more fraudulent transactions be approved, which transfers a portion of the financial risk to the customer
(Hajek et al., 2023)	Ensemble	XGBoost, XGBOD	The Cost Savings (CS) metric was proposed and quantified the losses avoided by the proposed model
(Afriyie et al., 2023)	Traditional ML	RF, DT, LR	The characteristics of individuals who are easily deceived were analyzed, which helps financial institutions set up protection measures efficiently
(Mienye & Sun, 2023a)	Ensemble	Stacking with base learners of LSTM, GRU, and a meta-learner of MLP	Sequential and temporal patterns were considered, inspiring other financial contexts
(Zioviris et al., 2024)	Hybrid	LSTM with AE and VAE	Sequential and temporal patterns were considered, inspiring other financial contexts
(Abdul Salam et al., 2024)	Traditional ML	RF, KNN, DT, NB, CNN	The FL framework allows different financial institutions to cooperate while addressing data privacy concerns, supporting compliance with regulations such as GDPR
(Tayebi & El Kafhali, 2025)	Hybrid	GB_ALSTM	Sequential and temporal patterns were considered, inspiring other financial contexts
(Baisholan et al., 2025b)	Ensemble	FraudX AI	The IML method is important in this context as it helps financial institutions understand the characteristics of fraudulent transactions
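
Several rows of Table 8 refer to cost-based metrics (the cost reduction rate and the Cost Savings metric). Their exact definitions are not given in this table, so the sketch below should not be read as the formula used by Kim et al. or Hajek et al. It is a generic illustration under an assumed cost model: every missed fraud costs its full transaction amount, every flagged transaction costs a fixed review fee, and savings are measured against the loss incurred with no detection at all. The names `Transaction`, `cost_savings`, and `review_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float    # transaction value in currency units
    is_fraud: bool   # ground-truth label
    flagged: bool    # model decision (True = blocked or sent for review)


def cost_savings(transactions, review_cost=1.0):
    """Fraction of the no-detection fraud loss avoided under the assumed cost model."""
    # Baseline: with no model, every fraudulent amount is lost.
    baseline = sum(t.amount for t in transactions if t.is_fraud)
    if baseline == 0:
        raise ValueError("no fraud in the sample; savings are undefined")

    model_cost = 0.0
    for t in transactions:
        if t.flagged:
            model_cost += review_cost   # assumed fixed cost per alert
        elif t.is_fraud:
            model_cost += t.amount      # missed fraud: full amount lost
    return (baseline - model_cost) / baseline


sample = [
    Transaction(amount=500.0, is_fraud=True, flagged=True),    # caught fraud
    Transaction(amount=200.0, is_fraud=True, flagged=False),   # missed fraud
    Transaction(amount=50.0, is_fraud=False, flagged=True),    # false alarm
    Transaction(amount=30.0, is_fraud=False, flagged=False),   # correctly approved
]
print(f"Cost savings: {cost_savings(sample):.1%}")
```

Unlike recall or F1-score, a metric of this kind weights each error by its monetary consequence, so missing one large fraud can outweigh many false alarms.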

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

