1. Introduction
The rapid transition to digital transactions has transformed the financial landscape, bringing unprecedented convenience but also escalating the risk of financial fraud. The growth of electronic commerce and the rise of online payment systems have led to a significant increase in fraudulent activity [1]. Many daily activities also depend on technology [2]. Traditional detection methods often fall short because fraudsters use increasingly sophisticated techniques to carry out their schemes, which increases the need for advanced detection techniques to confront these threats [3].
This work presents a machine learning approach that uses a Kaggle dataset containing over one million records. High precision is achieved through careful data cleaning and the application of multiple models. As demonstrated by Vuppula [4], recent advances in machine learning have considerably improved the identification of complex financial fraud patterns. Because traditional detection methods are no longer sufficient for the growing volume of digital transactions, this work explores more effective ways to detect fraud.
Moreover, this work incorporates anomaly detection techniques and generates synthetic behavioral features. This research lays the foundation for a highly efficient fraud detection system that addresses a main concern of financial sectors worldwide. The proposed methods fit into the financial ecosystem, where security and trust are important assets. The novelty of this work lies in its hybrid method, which integrates supervised models such as CatBoost and XGBoost with unsupervised models such as Isolation Forest. Additionally, the synthetic behavioral features improve adaptability to new fraud patterns. Overall, the approach combines resampling, feature engineering, and anomaly detection to improve performance.
2. Literature Survey
Financial fraud has become a growing concern as digital transactions expand across global financial systems. Both individuals and organizations face new risks as fraudsters develop complex methods to exploit weaknesses in payment channels [5]. Reports indicate that in 2022 alone, more than GBP 1.2 billion was lost to authorized and unauthorized scams, equivalent to about GBP 2300 every minute, with most cases originating online and roughly 18% through telecommunication. Dang et al. [6] examined oversampling techniques such as SMOTE and ADASYN combined with several classifiers, including RF, KNN, XGBoost, and DNN. While the traditional ML models reached accuracy levels above 99%, deep reinforcement learning (DRL) achieved only 34.8%. Detecting fraud often involves recognizing unusual transaction patterns, such as sudden spikes in activity or money being transferred across borders [7]. Machine learning has made this process more efficient by improving both the speed and precision of fraud identification. Ahmed [8] emphasized that algorithms are now vital in modern fraud detection due to their flexibility and ability to process massive transaction datasets. Recent studies have also highlighted hybrid machine learning approaches and advanced feature engineering techniques that further enhance fraud detection performance, demonstrating significant improvements in accuracy and robustness [9,10]. A comparative summary of related studies is presented in Table 1.
3. Methodology
Transaction fraud is a critical problem for financial firms and businesses, often causing monetary losses as well as damage to trust and reputation. Such fraud can be mitigated with prediction models built on machine learning algorithms.
The dataset used in this work, outlined in Figure 1, comprises 1,048,575 financial transactions and 10 features, with approximately 0.1% labeled as fraudulent, reflecting the natural imbalance commonly observed in financial data. Each transaction includes features such as the transaction amount, step (time stamp), the balances of the origin and destination accounts, and the transaction type. The data were split into 70% for training and 30% for testing to support a robust evaluation, and all models underwent 5-fold cross-validation. ROC-AUC, F1-score, recall, and precision were used to compare models. In practical deployment, the proposed model can be integrated with online banking systems to flag suspicious transactions in real time. By continuously retraining on new transaction data, the system can adapt to evolving fraud patterns while maintaining low latency and scalability, thereby enhancing its effectiveness in real-world financial operations.
Before working with the collected dataset shown in Figure 2, preprocessing is essential to ensure that there are no null values, which may reduce the model's efficiency in detecting fraud cases. Null or unknown values are replaced by the mean if the skewness of the column is less than 0.5, by the median if the skewness is more than 0.5, and by the mode if the data type is object.
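The imputation rule described above can be sketched as follows. This is a minimal illustration on made-up data, not the code used in this work: the helper name `impute_missing`, the toy column values, and the choice to compare the absolute value of the skewness against 0.5 are all assumptions.

```python
import numpy as np
import pandas as pd


def impute_missing(df: pd.DataFrame, skew_threshold: float = 0.5) -> pd.DataFrame:
    """Fill missing values column by column (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            # Assumption: skewness is compared by absolute value.
            skew = abs(out[col].skew())
            if skew < skew_threshold:
                fill = out[col].mean()    # roughly symmetric column
            else:
                fill = out[col].median()  # skewed column
        else:
            # Object (categorical) columns use the most frequent value.
            fill = out[col].mode(dropna=True).iloc[0]
        out[col] = out[col].fillna(fill)
    return out


# Toy data: the values are invented purely for illustration.
toy = pd.DataFrame({
    "amount": [10.0, 12.0, np.nan, 11.0, 5000.0],  # strongly right-skewed
    "step": [1.0, 2.0, np.nan, 4.0, 5.0],          # symmetric
    "type": ["TRANSFER", None, "TRANSFER", "CASH_OUT", "PAYMENT"],
})
clean = impute_missing(toy)
```

In this toy frame the skewed `amount` column is filled with its median, the symmetric `step` column with its mean, and the `type` column with its most frequent category.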
The pie chart in Figure 3 shows a severe imbalance in the data: 1142 fraudulent records against 1,047,433 non-fraudulent records. This class imbalance is addressed in later stages to ensure that it does not affect the ML models.
Figure 4 helps to identify whether certain transaction types are more prone to fraud. For example, “TRANSFER” might have more fraudulent cases than other transaction types. Paying closer attention to such transaction types helps with catching fraud cases.
3.1. Feature Selection
Feature selection was performed using Pearson correlation and Chi-square tests to remove redundant features. The dataset includes two features, “nameDest” and “nameOrig”, which are less important than the other features [5].
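As an illustration of how a Chi-square test can screen a categorical feature against the fraud label, the sketch below builds a contingency table and tests for independence. The data are synthetic, and the target column name `isFraud`, the unrelated `noise` feature, and the helper `chi2_pvalue` are assumptions introduced for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data; fraud is made more likely for TRANSFER on purpose.
tx_type = rng.choice(["TRANSFER", "CASH_OUT", "PAYMENT"], size=n)
p_fraud = np.where(tx_type == "TRANSFER", 0.20, 0.02)
df = pd.DataFrame({
    "type": tx_type,
    "noise": rng.choice(["a", "b"], size=n),  # categorical feature unrelated to fraud
    "isFraud": rng.random(n) < p_fraud,
})


def chi2_pvalue(frame: pd.DataFrame, feature: str, target: str = "isFraud") -> float:
    """p-value of the Chi-square test of independence between feature and target."""
    table = pd.crosstab(frame[feature], frame[target])
    _, p_value, _, _ = chi2_contingency(table)
    return float(p_value)


p_type = chi2_pvalue(df, "type")    # expected to be very small (dependent)
p_noise = chi2_pvalue(df, "noise")  # no dependence was built into this feature
```

A small p-value suggests that the feature carries information about the label, while features with large p-values are candidates for removal.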
3.2. Correlation Analysis
As shown in Figure 5, the highly correlated attributes are oldbalanceOrg and newbalanceOrig (1), oldbalanceDest and newbalanceDest (0.98), and oldbalanceOrg and newbalanceDest (1), and the moderately correlated attributes are type and nameDest (0.59).
All the attributes have a weak correlation with fraud, but amount has a correlation of 0.13, which makes it slightly useful for fraud detection.
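A correlation analysis of this kind can be reproduced with a Pearson correlation matrix. The following sketch uses synthetic balances rather than the real dataset; the 0.9 threshold for calling a pair "highly correlated" and the helper `correlated_pairs` are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Synthetic balances: the new origin balance is the old one minus the amount.
old_org = rng.uniform(0, 10_000, n)
amount = rng.uniform(0, 500, n)
toy = pd.DataFrame({
    "amount": amount,
    "oldbalanceOrg": old_org,
    "newbalanceOrig": old_org - amount,
    "step": rng.integers(1, 100, n).astype(float),
})

corr = toy.corr(method="pearson")


def correlated_pairs(corr: pd.DataFrame, threshold: float = 0.9):
    """List feature pairs whose absolute Pearson correlation reaches the threshold."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return pairs


redundant = correlated_pairs(corr)
```

Pairs returned by `correlated_pairs` carry nearly the same information, so one member of each pair is a candidate for removal.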
3.3. Splitting into Training and Testing Datasets
Figure 6 shows the splitting of the dataset into training and testing sets. The model is trained on the training set, and its performance is then checked on unseen data, the testing set.
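A 70/30 split followed by 5-fold cross-validation on the training portion might look like the sketch below. The data are synthetic, and stratifying on the label (so that the rare fraud class keeps the same proportion in every subset) is an assumption rather than a detail stated in this work.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 4))  # synthetic features
y = np.zeros(n, dtype=int)
y[:100] = 1                  # 1% positives, for illustration only

# 70% training, 30% testing; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# 5-fold cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_positives = [
    int(y_train[val_idx].sum()) for _, val_idx in cv.split(X_train, y_train)
]
```

With a rare positive class, an unstratified split can leave a fold with very few fraud cases, which makes fold-level recall and F1-score unstable.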
3.4. Model Selection and Training
To compare the effectiveness of various supervised machine learning techniques for detecting fraud cases, six classification models were applied. All the algorithms were trained on the derived dataset. Hyperparameters for all models were optimized using GridSearchCV and RandomizedSearchCV.
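The two search strategies can be sketched with scikit-learn as shown below. The parameter names n_neighbors, n_estimators, and max_depth are the ones tuned in this work, but the candidate values, the F1 scoring choice, and the synthetic data are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Small imbalanced synthetic problem so that the search runs quickly.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Exhaustive search over a small grid of n_neighbors values.
knn_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    scoring="f1",
    cv=cv,
)
knn_search.fit(X, y)

# Randomized search samples a fixed number of combinations from the grid.
rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [25, 50, 100], "max_depth": [3, 5, None]},
    n_iter=4,
    scoring="f1",
    cv=cv,
    random_state=0,
)
rf_search.fit(X, y)

best_knn = knn_search.best_params_
best_rf = rf_search.best_params_
```

Grid search evaluates every combination and suits small grids, whereas randomized search caps the number of fits and is usually preferred when the grid is large.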
3.4.1. KNN
KNN stands for K-Nearest Neighbors and is often called a lazy learner because it does not build a model during the training phase. Instead, KNN simply stores the dataset and makes classifications only when needed. For KNN, n_neighbors was tuned.
KNN relies on a distance metric to find the nearest neighbors, which are then used for classification and regression tasks. The confusion matrix (Table 2) illustrates the relationship between the observed and predicted classifications of the KNN model. The performance metrics (Table 3) show how well the model performs.
3.4.2. Logistic Regression
Logistic Regression is a classification model that assigns data to categorical values by estimating class probabilities. The outcome of the dependent variable is predicted from multiple independent variables, based on the relationships learned during training, as illustrated in Figure 7.
3.4.3. Random Forest
Random Forest is a tree ensemble model made up of many decision trees that are combined to solve classification problems, although it is suitable for both classification and regression. Using many decision trees improves accuracy and robustness, and this ensemble learning approach enables the model to capture diverse patterns and generalize well to unseen data. When a new transaction is made, it passes through each tree, and the collective decision of the trees determines its classification [6]. For Random Forest, n_estimators and max_depth were tuned, as illustrated in Figure 8.
3.4.4. XGBoost
XGBoost, short for Extreme Gradient Boosting, is a powerful machine learning method that builds many decision trees one after another. Each new tree tries to correct the errors of the trees before it, thus improving the model with every step. It is well regarded for its speed, accuracy, and effectiveness, especially on large datasets, as illustrated in Figure 9. This is one of the reasons why it is widely used on datasets where performance matters. For XGBoost, learning_rate, max_depth, and num_leaves were tuned. Table 4 shows the confusion matrix, while Table 5 shows the model’s performance after tuning.
3.4.5. LightGBM
LightGBM is a tree-based gradient boosting algorithm used for regression, classification, and ranking tasks. It is widely used because of its fast training speed, low memory usage, and high accuracy. At its core, decision trees are built in stages, with each stage correcting the error of the previous one. For LightGBM, learning_rate, max_depth, and num_leaves were tuned, as illustrated in Figure 10. Table 6 shows the confusion matrix, while Table 7 shows the model’s performance after tuning.
3.4.6. CatBoost
CatBoost is a predictive modeling technique designed especially for categorical data. It is a tree-based algorithm used for classification, regression, and ranking problems, mostly when the dataset is large and primarily comprises categorical data. For CatBoost, iterations, learning_rate, and depth were tuned, as illustrated in Figure 11. Table 8 shows the confusion matrix, while Table 9 shows the model’s performance after tuning.
4. Results and Discussion
The dataset comprises approximately 1 million transactions across 10 columns, consistent with the described structure. The hardware specifications include 16 GB of RAM and four CPU cores; models such as XGBoost and CatBoost can process the data without memory constraints. While training on the full dataset requires a considerable amount of time, inference on all 1 million rows can be completed within tens of seconds. With optimized feature computation, even real-time scoring can be performed efficiently.
This work seeks to evaluate whether the supervised machine learning approach outperforms existing methods. It comprises two key experiments, one using all dataset features and another excluding nameOrig and nameDest. The metrics used are F1-score, AUC-ROC, recall, precision, and the geometric mean of recall and precision. Given the class imbalance, accuracy alone is not a reliable metric. All performance metrics were computed from the false negative (FN), true negative (TN), false positive (FP), and true positive (TP) values of each model. The confusion matrix, a commonly used tool for characterizing the performance of classification models, contains these values [1]. The confusion matrix is explained in Table 10 below.
True Negative (TN): Legitimate transaction properly recognized as a valid transaction;
False Positive (FP): Incorrectly flagged non-fraud transactions as fraud (false alarm);
False Negative (FN): Fraud transaction missed by the model (very critical);
True Positive (TP): Correctly identified fraudulent transactions as fraud.
In the first model in Table 8, the classifier predicted 314,224 true negatives and 283 true positives, with 6 false positives and 60 false negatives. In the second model in Table 4, there were 314,222 true negatives and 287 true positives, with false positives rising slightly to 8 and false negatives falling to 56.
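The counts quoted above can be converted into the evaluation metrics with a few lines of arithmetic. The sketch below assumes the standard definitions of precision, recall, F1-score, and the geometric mean of precision and recall; the function name `fraud_metrics` and the "flag nothing" baseline are illustrative additions, not part of the original experiments.

```python
from math import sqrt


def fraud_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive evaluation metrics from confusion-matrix counts (hypothetical helper)."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)  # share of flagged transactions that are fraud
    recall = tp / (tp + fn)     # share of fraud cases that are caught
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "g_mean": sqrt(precision * recall),
        "accuracy": (tp + tn) / total,
        # Accuracy of a trivial model that labels every transaction as legitimate.
        "baseline_accuracy": (tn + fp) / total,
    }


# Counts taken from the two confusion matrices discussed in the text.
first = fraud_metrics(tn=314_224, fp=6, fn=60, tp=283)
second = fraud_metrics(tn=314_222, fp=8, fn=56, tp=287)

for name, m in (("first", first), ("second", second)):
    print(name, {k: round(v, 4) for k, v in m.items()})
```

Rounded to three decimals, the first model has precision 0.979, recall 0.825, and F1-score 0.896, while the second has precision 0.973, recall 0.837, and F1-score 0.900. The trivial baseline already reaches an accuracy of about 0.9989, which is why accuracy says little about fraud detection quality on this data.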
As reported in Table 11 and illustrated in Figure 12, several of the supervised machine learning models reach 100% accuracy. In an imbalanced setting, however, accuracy is not an informative metric: it cannot by itself determine a model’s performance, and 100% accuracy does not mean that the model detects fraud well.
The confusion matrix in
Table 12 shows that the system correctly identified 35 legitimate records as not fraud (TN) but misclassified 308 genuine transactions as fraud (FP). To evaluate model performance, metrics like precision and recall can be derived, focusing on how well the model detects actual fraud versus falsely identifying legitimate transactions. This analysis helps to assess the balance between minimizing false alarms and accurately detecting fraud.
SMOTE (Synthetic Minority Over-sampling Technique) is a statistical method for increasing the number of minority-class instances in a dataset in a balanced manner [1].
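The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors, can be sketched as follows. This is a simplified illustration on random data, not the implementation used in this work or in any particular library; the function name `smote_like_oversample` and its parameters are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate n_new synthetic minority samples by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors: for distinct points, the nearest one is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)         # random minority points
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # one of their k neighbors
    gap = rng.random((n_new, 1))                           # position along the segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])


rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 3))  # synthetic stand-in for fraud records
synthetic = smote_like_oversample(X_minority, n_new=80)
```

Each synthetic sample lies on the line segment between two real minority samples. Oversampling of this kind is normally applied to the training split only, so that the test set keeps its original class distribution.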
Figure 13 and Table 13 compare the performance of the CatBoost machine learning model before and after applying SMOTE.
The novelty of this paper is the combination of Isolation Forest (an unsupervised machine learning method) with the previously proposed model, as shown in Table 14.
The classification report shows excellent performance, with an accuracy of 1.00. Precision is 1.00 for the non-fraud class (0) and 0.98 for the fraud class (1), showing that few legitimate transactions are flagged as fraud. Recall for fraud is 0.83, meaning that some fraud cases are missed, which is typical of highly imbalanced datasets. The F1-score for fraud (0.90) suggests that precision and recall are reasonably balanced. Finally, the ROC-AUC score of 0.996 indicates near-perfect separation between fraud and non-fraud.
The feature importance graph (Figure 14) plays an important role in understanding the impact of each feature on the target variable.
First Case (small sample):
Second Case (full dataset):
Typing speed and mouse speed were added as new synthetic features to mimic real human behavior. Isolation Forest was also used to detect anomalies first. However, because a powerful model (CatBoost) was trained on a large dataset, the model already performed extremely well. The extra anomaly features supported the model slightly, but the major strength came from CatBoost itself.
The SHAP analysis shown in Figure 15 is summarized below:
Top Influencers: newbalanceOrig, oldbalanceOrg, and newbalanceDest are the key account balance features driving fraud detection;
High-Value Features: TYPE_CASH_OUT and amount indicate that cash-outs and transactions with large values carry a high fraud risk;
Hybrid Success: anomaly_score ranks high, which indicates that Isolation Forest contributes useful information;
Red points (high anomaly) push predictions toward fraud (right side);
anomaly_flag also contributes: even binary outlier signals add value, which supports including both the score and the flag.
SHAP analysis confirmed the value of the hybrid modeling, with anomaly_score emerging as a top feature influencing fraud predictions. This validates the use of unsupervised anomaly detection to guide supervised learning.
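The hybrid idea, in which an unsupervised Isolation Forest supplies an anomaly score and a binary anomaly flag as extra inputs to a supervised classifier, can be sketched as follows. The data are synthetic, the helper `add_anomaly_features` is hypothetical, and scikit-learn's GradientBoostingClassifier is used here only as a stand-in for CatBoost so that the example needs no additional libraries.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_legit, n_fraud = 2000, 40

# Synthetic data: fraud samples are shifted away from the legitimate cluster.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_legit, 4)),
    rng.normal(4.0, 1.0, size=(n_fraud, 4)),
])
y = np.r_[np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Step 1: fit the unsupervised detector on the training features only (no labels).
iso = IsolationForest(n_estimators=100, random_state=0).fit(X_tr)


def add_anomaly_features(detector: IsolationForest, X: np.ndarray) -> np.ndarray:
    """Append anomaly_score and anomaly_flag columns to the feature matrix."""
    anomaly_score = -detector.score_samples(X)  # negated so higher = more anomalous
    anomaly_flag = (detector.predict(X) == -1).astype(float)  # 1.0 for outliers
    return np.column_stack([X, anomaly_score, anomaly_flag])


# Step 2: train the supervised model on the augmented features.
clf = GradientBoostingClassifier(random_state=0)
clf.fit(add_anomaly_features(iso, X_tr), y_tr)

proba = clf.predict_proba(add_anomaly_features(iso, X_te))[:, 1]
auc = roc_auc_score(y_te, proba)
```

Fitting the detector on the training split and reusing it unchanged on the test split keeps information from the test data out of the engineered features.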
Unlike studies that use small or balanced datasets, this study addresses real-world, highly imbalanced data. It combined multiple machine learning models with a soft-voting ensemble and integrated anomaly detection features to better capture patterns of fraudulent activity. Overall, it delivers methods that are deployable and scalable, setting it apart from existing work.