Proceeding Paper

Scalable Machine Learning Solutions for High-Volume Financial Transaction Fraud Detection †

by Sourav Yallur *, Jiya Patil, Tanvi Shikhari, Prajwal Dabbanavar, Rajashri Khanai and Salma Shahpur
Department of Computer Science and Engineering, KLE Technological University, Dr. M. S. Sheshgiri Campus, Udyambag, Belagavi 590008, Karnataka, India
* Author to whom correspondence should be addressed.
† Presented at the First International Conference on Computational Intelligence and Soft Computing (CISCom 2025), Melaka, Malaysia, 26–27 November 2025.
Comput. Sci. Math. Forum 2025, 12(1), 1; https://doi.org/10.3390/cmsf2025012001
Published: 17 December 2025

Abstract

The rising volume of digital financial transactions has been accompanied by an increase in fraudulent activity, which calls for more reliable and intelligent detection systems. This work uses a publicly accessible dataset of more than one million transaction records to investigate a machine learning strategy for identifying hidden patterns in fraudulent transactions. Data preprocessing included Z-score normalization, outlier removal using the IQR method, and handling of missing values according to the skewness of each attribute. The selection of important features was guided by correlation analysis using Chi-square tests and Pearson coefficients. Multiple supervised learning techniques, including Random Forest, Logistic Regression, K-Nearest Neighbors, and Gradient Boosting, were implemented to evaluate and compare their effectiveness in accurately detecting fraudulent transactions.

1. Introduction

The rapid transition to digital transactions has transformed the financial landscape, bringing unprecedented convenience but also escalating the risk of financial fraud. The growth of electronic commerce and the rise of online payment systems have led to a significant increase in fraudulent activity [1]. Many daily activities are also dependent on technology [2]. Traditional detection methods often fall short as fraudsters use increasingly sophisticated techniques to carry out their schemes, which increases the need for advanced detection techniques to confront these threats [3].
This work presents a machine learning approach that uses a Kaggle dataset containing over one million records. High precision is achieved through careful data cleaning and the application of multiple models. As demonstrated by Vuppula [4], recent breakthroughs in machine learning have considerably improved the identification of complex financial fraud patterns. The growth of digital financial transactions has made life easier but has also exposed users to a high risk of fraud. Because traditional detection methods are no longer sufficient, this work explores smarter ways to address the problem.
Moreover, this work incorporates anomaly detection techniques and generates synthetic behavioral features. This research lays the basis for a highly efficient fraud detection system that addresses a main concern of financial sectors worldwide. The proposed methods fit into the financial ecosystem, where security and trust are important assets. The novelty of this work lies in its hybrid method, which integrates supervised models such as CatBoost and XGBoost with unsupervised models such as Isolation Forest. Additionally, the synthetic behavioral features improve adaptability to new fraud patterns. This work combines resampling, feature engineering, and anomaly detection to improve performance.

2. Literature Survey

Financial fraud has become a growing concern as digital transactions expand across global financial systems. Both individuals and organizations face new risks as fraudsters develop complex methods to exploit weaknesses in payment channels [5]. Reports indicate that in 2022 alone, more than GBP 1.2 billion was lost to both authorized and unauthorized scams, equivalent to about GBP 2300 every minute, with most cases originating online and roughly 18% through telecommunication. Dang et al. [6] examined oversampling techniques such as SMOTE and ADASYN combined with several classifiers, including RF, KNN, XGBoost, and DNN. While the traditional ML models reached accuracy levels above 99%, deep reinforcement learning (DRL) achieved only 34.8%. Detecting fraud often involves recognizing unusual transaction patterns, such as sudden spikes in activity or money being transferred across borders [7]. Machine learning has made this process more efficient by improving both the speed and precision of fraud identification. Ahmed [8] emphasized that algorithms are now vital in modern fraud detection due to their flexibility and ability to process massive transaction datasets. Recent studies have also highlighted hybrid machine learning approaches and advanced feature engineering techniques that further enhance fraud detection performance, demonstrating significant improvements in accuracy and robustness [9,10]. A comparative summary of related studies is presented in Table 1.

3. Methodology

Transaction fraud is a crucial problem for financial firms and businesses, often causing monetary losses and damage to trust and reputation. These frauds can be mitigated by using fraud prediction models based on machine learning algorithms.
The data used in this work, outlined in Figure 1, comprises 1,048,575 financial transactions and 10 features, with approximately 0.1% labeled as fraudulent, reflecting the natural imbalance commonly observed in financial data. Each transaction includes features such as transaction amount, step (time stamp), balances of the origin and destination accounts, and transaction type. The data was split into 70% for training and 30% for testing, and all models underwent 5-fold cross-validation to ensure a robust evaluation. ROC-AUC, F1-score, recall, and precision were used to compare models. In practical deployment, the proposed model can be integrated with online banking systems to flag suspicious transactions in real time. By continuously retraining on new transaction data, the system can adapt to evolving fraud patterns while maintaining low latency and scalability, thereby enhancing its effectiveness in real-world financial operations.
Before working with the collected dataset shown in Figure 2, preprocessing is an essential step to ensure that there are no null values, which may reduce the detection efficiency of the model in fraud cases. The null or unknown values in the dataset are replaced by the mean if the skewness is less than 0.5, by the median if the skewness is more than 0.5, and by the mode if the datatype is object.
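The imputation rule above can be sketched in a few lines of pandas. This is a minimal illustration rather than the authors' code: the helper name `impute_by_skewness`, the toy values, and the decision to compare the absolute skewness with the 0.5 threshold (and to use the median when the skewness equals 0.5) are assumptions.

```python
import numpy as np
import pandas as pd


def impute_by_skewness(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Fill missing values column by column, following the rule in the text."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            # Assumption: the absolute skewness is compared with the threshold.
            skew = out[col].skew()
            fill = out[col].mean() if abs(skew) < threshold else out[col].median()
        else:
            # Object (categorical) columns use the most frequent value.
            fill = out[col].mode().iloc[0]
        out[col] = out[col].fillna(fill)
    return out


# Toy frame with hypothetical values (not taken from the dataset).
toy = pd.DataFrame({
    "amount": [10.0, 12.0, 11.0, 5000.0, np.nan],  # strongly right-skewed
    "step": [1.0, 2.0, 3.0, 4.0, np.nan],          # symmetric
    "type": ["TRANSFER", "PAYMENT", "TRANSFER", None, "TRANSFER"],
})
filled = impute_by_skewness(toy)
print(filled)
```

In the toy frame, the skewed `amount` column is filled with its median, the symmetric `step` column with its mean, and the `type` column with its most frequent category.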
The pie chart in Figure 3 shows that the data is highly imbalanced, with 1142 fraudulent records and 1,047,433 non-fraudulent records. This class imbalance is addressed in the later stages to ensure that it does not affect the ML model.
Figure 4 helps to identify whether certain transaction types are more prone to fraud. For example, “TRANSFER” might have more fraudulent cases than other transaction types. Paying closer attention to such transaction types helps with catching fraud cases.

3.1. Feature Selection

Feature selection was performed using Pearson correlation coefficients and Chi-square tests to remove redundant features. The dataset includes two features, “nameDest” and “nameOrig”, which hold less importance in comparison with other features [5].
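As a rough illustration of the Chi-square test mentioned above, the statistic for a categorical feature against the fraud label can be computed from a contingency table. The function name `chi_square_statistic` and the counts are hypothetical; the paper does not state which implementation was used.

```python
import numpy as np


def chi_square_statistic(observed):
    """Return the Pearson chi-square statistic and degrees of freedom."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts under independence of feature and label.
    expected = row_totals @ col_totals / observed.sum()
    statistic = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return float(statistic), dof


# Hypothetical counts: rows are two transaction types,
# columns are (not fraud, fraud).
table = [[90, 10],
         [100, 0]]
stat, dof = chi_square_statistic(table)
print(round(stat, 3), dof)
```

With one degree of freedom, a statistic above 3.841 is significant at the 5% level, so in this toy table the transaction type would be kept as informative.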

3.2. Correlation Analysis

As shown in Figure 5, the highly correlated attribute pairs are oldbalanceOrg and newbalanceOrig (1), oldbalanceDest and newbalanceDest (0.98), and oldbalanceOrg and newbalanceDest (1), and the moderately correlated pair is type and nameDest (0.59).
All the attributes have a weak correlation with the fraud label; amount, with a correlation of 0.13, is slightly useful for fraud detection.
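A correlation screen of this kind can be sketched with pandas. The helper `highly_correlated_pairs`, the 0.9 threshold, and the synthetic columns are assumptions made for illustration only; the column names merely echo those of the dataset.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr(method="pearson").abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 2)))
    return pairs


# Synthetic toy data (not the real dataset).
rng = np.random.default_rng(0)
old_balance = rng.uniform(0, 1000, size=500)
toy = pd.DataFrame({
    "oldbalanceOrg": old_balance,
    "newbalanceOrig": old_balance * 0.9 + 5.0,  # exactly linear in oldbalanceOrg
    "amount": rng.uniform(0, 100, size=500),    # independent of the balances
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

One column of each reported pair is a candidate for removal, since it carries almost no information beyond its partner.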

3.3. Splitting into Training and Testing Datasets

Figure 6 shows the splitting of the dataset into training and testing sets: the model is trained on the training set, and its performance is then checked on unseen data, the testing set.
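A 70/30 split can be sketched with scikit-learn as follows. Stratifying on the label is an assumption (the text does not say whether the split was stratified), although the test-set size reported in the tables, 314,230 legitimate plus 343 fraudulent records, is consistent with roughly 30% of each class. The toy data is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 transactions, 10 of them fraudulent (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:10] = 1

# stratify=y keeps the fraud ratio the same in both parts (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
print(len(X_train), len(X_test), int(y_train.sum()), int(y_test.sum()))
```

Without stratification, a random split of such rare positives could leave the test set with very few fraud cases, which makes recall estimates unstable.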

3.4. Model Selection and Training

To compare the effectiveness of various supervised machine learning techniques in detecting fraud cases, six classification models were applied. All the algorithms were trained on the derived dataset. Hyperparameters for all models were optimized using GridSearchCV and RandomizedSearchCV.
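The tuning step can be sketched with scikit-learn's `GridSearchCV` and 5-fold cross-validation. The synthetic data, the grid values, and the choice of F1 as the selection metric are assumptions for illustration; the paper does not list the grids that were searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for the transaction features.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)

# 5-fold cross-validation that preserves the class ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},  # hypothetical grid
    scoring="f1",                           # assumption: F1 on the minority class
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` has the same interface but samples a fixed number of parameter combinations, which is usually cheaper for the larger grids of the boosting models.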

3.4.1. KNN

KNN stands for K-Nearest Neighbors and is often called a lazy learner because it does not build a model during the training phase. Instead, KNN simply stores the dataset and performs classification only when a prediction is requested. For KNN, n_neighbors was tuned.
KNN relies on a distance metric to find the nearest neighbors, which are then used for classification and regression tasks. The confusion matrix (Table 2) illustrates the relationship between the observed and predicted classes for the KNN model, and the performance metrics (Table 3) show how well the model performs.

3.4.2. Logistic Regression

Logistic Regression is a classification model that assigns data to categorical classes. It estimates the probability of the dependent variable from a weighted combination of the independent variables used to train the model, as illustrated in Figure 7.

3.4.3. Random Forest

Random Forest is a tree ensemble model made up of many decision trees that are combined to achieve improved accuracy and robust outcomes. It is suitable for both classification and regression tasks. This ensemble learning approach enables the model to capture diverse patterns and generalize well to unseen data. When a new transaction is made, it is passed through each tree, and the collective decision of the trees determines its classification [6]. For Random Forest, n_estimators and max_depth were tuned, as illustrated in Figure 8.

3.4.4. XGBoost

XGBoost, short for Extreme Gradient Boosting, is a powerful machine learning method that builds many decision trees one after another. Each new tree tries to correct the errors of the previous trees, thus improving the model with every step. It is well regarded for its speed, accuracy, and effectiveness, especially on large datasets, as illustrated in Figure 9, which is one reason why it is widely used where performance matters. For XGBoost, learning_rate, max_depth, and num_leaves were tuned. Table 4 shows the confusion matrix, while Table 5 shows the model’s performance after tuning.

3.4.5. LightGBM

LightGBM is a tree-based gradient boosting algorithm used for regression, classification, and ranking tasks. It is widely used because of its fast training speed, low memory usage, and high accuracy. At its core, decision trees are built sequentially, with each new tree correcting the error left by the previous ones. For LightGBM, learning_rate, max_depth, and num_leaves were tuned, as illustrated in Figure 10. Table 6 shows the confusion matrix, while Table 7 shows the model’s performance after tuning.

3.4.6. CatBoost

CatBoost is a predictive modeling technique designed especially for categorical data. It is a tree-based algorithm used for classification, regression, and ranking problems, particularly when the dataset is large and primarily comprises categorical data. For CatBoost, iterations, learning_rate, and depth were tuned, as illustrated in Figure 11. Table 8 shows the confusion matrix, while Table 9 shows the model’s performance after tuning.

4. Results and Discussion

The dataset comprises approximately 1 million transactions across 10 columns, consistent with the described structure. The hardware specifications include 16 GB of RAM and four CPU cores; models such as XGBoost and CatBoost can process the data without memory constraints. While training on the full dataset requires a considerable amount of time, inference on all 1 million rows can be completed within tens of seconds. With optimized feature computation, even real-time scoring can be performed efficiently.
This work seeks to evaluate whether the supervised machine learning approach outperforms existing methods. It comprises two key experiments, one using all dataset features and another excluding nameOrig and nameDest. The evaluation uses metrics such as F1-score, AUC-ROC, recall, precision, and the geometric mean of recall and precision. Given the class imbalance, accuracy alone is not a reliable metric. All the performance metrics were computed from the false negative (FN), true negative (TN), false positive (FP), and true positive (TP) values of each model. The confusion matrix, a commonly used tool for characterizing classification performance, contains these values [1]. The confusion matrix is explained in Table 10 below.
  • True Negative (TN): Legitimate transaction properly recognized as a valid transaction;
  • False Positive (FP): Incorrectly flagged non-fraud transactions as fraud (false alarm);
  • False Negative (FN): Fraud transaction missed by the model (very critical);
  • True Positive (TP): Correctly identified fraudulent transactions as fraud.
For CatBoost (Table 8), the classifier produced 314,224 true negatives and 283 true positives, with 6 false positives and 60 false negatives. For XGBoost (Table 4), there were 314,222 true negatives and 287 true positives, with slightly more false positives (8) and fewer false negatives (56).
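The metrics in Tables 5 and 9 follow directly from these four counts. The short sketch below recomputes them from the confusion matrices in Tables 4 and 8; the function name `fraud_metrics` is hypothetical.

```python
import math


def fraud_metrics(tn, fp, fn, tp):
    """Derive fraud-class metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(precision * recall)  # geometric mean of recall and precision
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1,
            "g_mean": g_mean, "accuracy": accuracy}


# Counts taken from Table 8 (CatBoost) and Table 4 (XGBoost).
catboost = fraud_metrics(tn=314_224, fp=6, fn=60, tp=283)
xgboost = fraud_metrics(tn=314_222, fp=8, fn=56, tp=287)

for name, m in [("CatBoost", catboost), ("XGBoost", xgboost)]:
    print(name, {key: round(value, 4) for key, value in m.items()})
```

Rounded to two decimals, the fraud-class precision, recall, and F1-score are 0.98, 0.83, and 0.90 for CatBoost and 0.97, 0.84, and 0.90 for XGBoost, in agreement with Tables 9 and 5.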
As reported in Table 11 and illustrated in Figure 12, several of the supervised machine learning models reach an accuracy of 1.0 after rounding. However, accuracy is not an informative measure in an imbalanced setting and cannot by itself decide a model’s performance; an accuracy close to 100% does not mean that the model detects fraud well.
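A quick calculation shows why. Using the test-set class counts from the support column of the tables, a hypothetical classifier that labels every transaction as legitimate would already score almost perfect accuracy while catching no fraud at all.

```python
# Test-set class counts taken from the support column of Tables 3-9.
n_legit, n_fraud = 314_230, 343

# A trivial baseline that predicts "not fraud" for every transaction.
baseline_accuracy = n_legit / (n_legit + n_fraud)
baseline_recall = 0 / n_fraud  # no fraud case is ever flagged

print(f"accuracy={baseline_accuracy:.4f}, recall={baseline_recall:.2f}")
# prints: accuracy=0.9989, recall=0.00
```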
The confusion matrix in Table 12 shows that the system correctly identified 313,818 legitimate transactions as not fraud (TN) and 308 fraudulent transactions as fraud (TP), while 412 genuine transactions were misclassified as fraud (FP) and 35 fraud cases were missed (FN). To evaluate model performance, metrics like precision and recall can be derived, focusing on how well the model detects actual fraud versus falsely flagging legitimate transactions. This analysis helps to assess the balance between minimizing false alarms and accurately detecting fraud.
The Synthetic Minority Oversampling Technique (SMOTE) is a statistical method for increasing the number of minority class instances in a dataset so that the classes become more balanced [1].
Figure 13 and Table 13 compare the performance of the CatBoost model before and after applying SMOTE.
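The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched in NumPy. This is a simplified illustration with a hypothetical function name and toy data, not the implementation used in the paper; in practice a library implementation such as the `SMOTE` class in imbalanced-learn would normally be used.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)      # random base sample per new point
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Toy minority class: 10 fraud samples with 2 features (hypothetical values).
rng = np.random.default_rng(1)
X_fraud = rng.normal(size=(10, 2))
synthetic = smote_sketch(X_fraud, n_new=20)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real fraud samples, oversampling should be applied to the training split only, so that the test set keeps the original class ratio.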
The novelty of this paper is that we combined Isolation Forest (unsupervised machine learning) with an earlier-proposed model, as shown in Table 14.
The classification report shows an accuracy of 1.00 after rounding. Precision is 1.00 for the non-fraud class (0) and 0.98 for the fraud class (1), showing that very few legitimate transactions are flagged as fraud. Recall for fraud is 0.83, meaning that some fraud cases are missed, which is typical in highly imbalanced datasets. The F1-score for fraud (0.90) reflects the balance between precision and recall. Finally, the ROC-AUC score of 0.996 indicates near-perfect separation between fraud and non-fraud.
The feature importance graph (Figure 14) helps in understanding the impact each feature has on the target variable.
First Case (small sample):
  • Confusion Matrix: Good performance, but recall for fraud (1 class) is around 84%;
  • ROC-AUC: 0.9960.
Second Case (full dataset):
  • Confusion Matrix: Also very good, recall around 83%;
  • ROC-AUC: 0.9963.
Typing speed and mouse speed were added as new synthetic features to mimic real human behavior. Isolation Forest was also used to detect anomalies first. However, because a powerful model (CatBoost) was trained on a large dataset, the model already performed extremely well. The extra anomaly features slightly supported the model, but the major strength came from CatBoost itself.
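The hybrid step, in which Isolation Forest outputs become extra input columns for the supervised model, might look roughly like the sketch below. The helper name, the `contamination` value, fitting on the training split only, and the toy data are assumptions; the two added columns correspond to the anomaly_score and anomaly_flag features discussed in this section.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def add_anomaly_features(X_train, X_test, contamination=0.01, seed=0):
    """Append an anomaly score and a binary outlier flag to both splits."""
    iso = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=seed
    )
    iso.fit(X_train)  # assumption: fitted on the training split only

    def augment(X):
        anomaly_score = -iso.score_samples(X)                # larger = more anomalous
        anomaly_flag = (iso.predict(X) == -1).astype(float)  # 1.0 marks an outlier
        return np.column_stack([X, anomaly_score, anomaly_flag])

    return augment(X_train), augment(X_test)


# Toy data: a dense cluster plus one far-away test point (hypothetical values).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))
X_test = np.vstack([rng.normal(size=(50, 2)), [[50.0, 50.0]]])
X_train_aug, X_test_aug = add_anomaly_features(X_train, X_test)
print(X_train_aug.shape, X_test_aug.shape)
```

The augmented matrices would then be passed to the supervised classifier (CatBoost in this work) in place of the original features.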
The SHAP analysis shown in Figure 15 is summarized below:
  • Top influencers: newbalanceOrig, oldbalanceOrg, and newbalanceDest are the key account balance features driving fraud detection;
  • High-value features: TYPE_CASH_OUT and amount; transactions with large values and cash-outs carry a high fraud risk;
  • Hybrid success: anomaly_score ranks high, which indicates that Isolation Forest contributes a useful signal;
  • Red points (high anomaly) push predictions toward fraud (right side);
  • anomaly_flag also contributes: even a binary outlier signal adds value, which supports including both the score and the flag.
SHAP analysis confirmed the value of the hybrid modeling, with anomaly_score emerging as a top feature influencing fraud predictions. This supports the use of unsupervised anomaly detection to guide supervised learning.
This study addresses real-world, highly imbalanced data, in contrast to studies that use small or balanced datasets. It combined multiple machine learning models with a soft-voting ensemble and integrated anomaly detection features to better capture fraudulent activity patterns. Overall, it delivers methods that are deployable and scalable, setting it apart from existing work.
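Soft voting averages the class probabilities predicted by the individual models. The sketch below uses scikit-learn's `VotingClassifier` with three stand-in estimators on synthetic data; the choice of base models and all parameter values are assumptions, since the paper does not list the members of its ensemble.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the transaction features.
X, y = make_classification(
    n_samples=1500, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",  # average predicted probabilities instead of hard labels
)
ensemble.fit(X_tr, y_tr)
proba = ensemble.predict_proba(X_te)

# The ensemble output equals the plain mean of the members' probabilities.
manual = np.mean([est.predict_proba(X_te) for est in ensemble.estimators_], axis=0)
print(proba.shape, np.allclose(proba, manual))
```

Averaging probabilities rather than counting votes lets a confident model outweigh two uncertain ones, which can help when the decision threshold is tuned for the rare fraud class.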

5. Conclusions

In conclusion, this research work focuses on identifying fraudulent activities in financial transactions with the support of predictive models. The data used in this study was thoroughly cleaned and analyzed to find the correlation between attributes, and the analysis confirmed the class imbalance issue. The data was then split into training and testing sets to apply machine learning models, among which CatBoost demonstrated superior performance compared to the other algorithms. Although the application of SMOTE did not improve the performance as expected, novel techniques such as Isolation Forest anomaly detection and biometric behavior synthesis were incorporated to enhance model efficiency. The capacity to detect fraud can be further enhanced through the use of advanced techniques like deep learning. This study demonstrates the importance of algorithms in detecting fraud in the banking sector.

Author Contributions

Conceptualization, S.Y. and J.P.; methodology, S.Y. and J.P.; software, S.Y.; validation, S.Y., J.P. and T.S.; formal analysis, S.Y. and J.P.; investigation, S.Y.; resources, S.Y.; data curation, S.Y. and J.P.; writing—original draft preparation, J.P.; writing—review and editing, S.Y., J.P., T.S. and P.D.; visualization, S.Y.; supervision, R.K. and S.S.; project administration, R.K.; funding acquisition, R.K. and S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by KLE Technological University’s Dr. M.S. Sheshgiri College of Engineering and Technology, Belagavi, Karnataka, India. The APC was funded by KLE Technological University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are publicly available and were obtained from the Kaggle repository. The dataset can be accessed at: https://www.kaggle.com/code/chandra17iith/online-fraud-detection-classification/notebook (accessed on 1 September 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Khalid, A.R.; Owoh, N.; Uthmani, O.; Ashawa, M.; Osamor, J.; Adejoh, J. Enhancing Credit Card Fraud Detection: An Ensemble Machine Learning Approach. Big Data Cogn. Comput. 2024, 8, 6.
  2. Islam, S.; Haque, M.M.; Karim, A.N.M.R. A Rule-Based Machine Learning Model for Financial Fraud Detection. Int. J. Electr. Comput. Eng. 2024, 14, 759–771.
  3. Innan, N.; Khan, M.A.Z.; Bennai, M. Financial Fraud Detection: A Comparative Study of Quantum Machine Learning Models. arXiv 2023, arXiv:2308.05237.
  4. Vuppula, K. An Advanced Machine Learning Algorithm for Fraud Financial Transaction Detection. J. Innov. Dev. Pharm. Tech. Sci. 2021, 4, 73–78.
  5. Farabi, S.F.; Prabha, M.; Alam, M.; Hossan, M.Z.; Arif, M.; Islam, M.R.; Uddin, A.; Bhuiyan, M.; Biswas, M.Z.A. Enhancing Credit Card Fraud Detection: A Comprehensive Study of Machine Learning Algorithms and Performance Evaluation. J. Bus. Manag. Stud. 2024, 6, 13–21.
  6. Dang, T.K.; Tran, T.C.; Tuan, L.M.; Tiep, M.V. Machine Learning Based on Resampling Approaches and Deep Reinforcement Learning for Credit Card Fraud Detection Systems. Appl. Sci. 2021, 11, 10004.
  7. Bello, H.O.; Ige, A.B.; Ameyaw, M.N. Adaptive Machine Learning Models: Concepts for Real-Time Financial Fraud Prevention in Dynamic Environments. World J. Adv. Eng. Technol. Sci. 2024, 12, 21–34.
  8. Ahmed, M.H. A Review: Credit Card Fraud Detection in Banks using Machine Learning Algorithms. Preprint 2023.
  9. Almazroi, A.A.; Ayub, N. Online Payment Fraud Detection Model Using Machine Learning Techniques. IEEE Access 2023, 11, 137188.
  10. Chawla, T.S. Online Payment Fraud Detection Using Machine Learning Techniques; MSc Research Project; National College of Ireland: Dublin, Ireland, 2022.
Figure 1. Overview of dataset.
Figure 2. Flowchart of proposed implementation.
Figure 3. Investigating target distribution.
Figure 4. Transaction type vs. fraudulent transactions.
Figure 5. Heatmap.
Figure 6. Distribution of data points.
Figure 7. Logistic regression.
Figure 8. Random Forest.
Figure 9. XGBoost.
Figure 10. LightGBM.
Figure 11. CatBoost.
Figure 12. Bar chart comparison of machine learning model performance.
Figure 13. Performance comparison of CatBoost before and after SMOTE.
Figure 14. Top 10 feature importances—CatBoost + Isolation Forest.
Figure 15. Violin plot of feature distributions in fraud detection dataset.
Table 1. Comparative literature review.
| Ref no. | Authors | Methods | ML Models | Accuracy | Precision/Recall |
|---|---|---|---|---|---|
| [1] | Khalid et al. | Ensemble (SMOTE + ML) | SVM, KNN, RF, Boosting | High | High |
| [2] | Islam et al. | Rule generation (no resampling) | RF, DT, MLP, NB | 0.99 | - |
| [3] | Innan et al. | QML models (QSVC, VQC, QNN) | QSVC, VQC, Estimator QNN | F1 up to 0.98 | - |
| [4] | Vuppula | PCA, correlation matrix | DT, ML tools | - | - |
| [5] | Farabi et al. | Feature eng., SMOTE | PM, RF, Boosting | 99.95% | 99.95% |
| [6] | Dang et al. | SMOTE, ADASYN, DRL | RF, KNN, XGBoost, DNN | ML > 99%, DRL 34.8% | Varies |
| [7] | Bello et al. | RL + online learning | - | 0 | - |
| [8] | Ahmed, M.H. | ML models with hybrid sampling | RF, NB, KNN, LR, MLP | RF Accuracy 99.7% | RF Accuracy 99.7% |
| [9] | Almazroi et al. | EARN, PCA, Jaya | RXT-J | +10–18% vs. baseline | - |
| [10] | Chawla, T.S. | Supervised ML models, Undersampling | Logistic Regression, Random Forest, ANN | Accuracy up to 99% | Precision 96.3%, Recall high |
Table 2. Confusion matrix of KNN.
| | Predicted: Not Fraud (0) | Predicted: Fraud (1) |
|---|---|---|
| Actual: Not Fraud (0) | 314,229 | 1 |
| Actual: Fraud (1) | 162 | 181 |
Table 3. Performance metrics of the KNN model.
| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 314,230 |
| 1 | 0.99 | 0.53 | 0.69 | 343 |
| Accuracy | | | 1.00 | 314,573 |
| Macro average | 1.00 | 0.76 | 0.84 | 314,573 |
| Weighted avg. | 1.00 | 1.00 | 1.00 | 314,573 |
Table 4. Confusion matrix of XGBoost.
| | Predicted: Not Fraud (0) | Predicted: Fraud (1) |
|---|---|---|
| Actual: Not Fraud (0) | 314,222 | 8 |
| Actual: Fraud (1) | 56 | 287 |
Table 5. Performance metrics of the XGBoost model.
| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 314,230 |
| 1 | 0.97 | 0.84 | 0.90 | 343 |
| Accuracy | | | 1.00 | 314,573 |
| Macro average | 0.99 | 0.92 | 0.95 | 314,573 |
| Weighted avg. | 1.00 | 1.00 | 1.00 | 314,573 |
Table 6. Confusion matrix of LightGBM.
| | Predicted: Not Fraud (0) | Predicted: Fraud (1) |
|---|---|---|
| Actual: Not Fraud (0) | 313,535 | 695 |
| Actual: Fraud (1) | 21 | 322 |
Table 7. Performance metrics of the LightGBM model.
| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 314,230 |
| 1 | 0.32 | 0.94 | 0.47 | 343 |
| Accuracy | | | 1.00 | 314,573 |
| Macro average | 0.66 | 0.97 | 0.74 | 314,573 |
| Weighted avg. | 1.00 | 1.00 | 1.00 | 314,573 |
Table 8. Confusion matrix of CatBoost.
| | Predicted: Not Fraud (0) | Predicted: Fraud (1) |
|---|---|---|
| Actual: Not Fraud (0) | 314,224 | 6 |
| Actual: Fraud (1) | 60 | 283 |
Table 9. Performance metrics of the CatBoost model.
| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 314,230 |
| 1 | 0.98 | 0.83 | 0.90 | 343 |
| Accuracy | | | 1.00 | 314,573 |
| Macro average | 0.99 | 0.91 | 0.95 | 314,573 |
| Weighted avg. | 1.00 | 1.00 | 1.00 | 314,573 |
Table 10. Confusion matrix.
| | Predicted: Not Fraud (0) | Predicted: Fraud (1) |
|---|---|---|
| Actual: Not Fraud (0) | True Negative (TN) | False Positive (FP) |
| Actual: Fraud (1) | False Negative (FN) | True Positive (TP) |
Table 11. Evaluation of key indicators of machine learning models.
| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.94 | 0.02 | 0.93 | 0.03 | 0.9847 |
| Random Forest | 1.0 | 0.99 | 0.77 | 0.87 | 0.9719 |
| KNN | 1.0 | 0.99 | 0.53 | 0.69 | 0.8482 |
| XGBoost | 1.0 | 0.98 | 0.84 | 0.9 | 0.9978 |
| LightGBM | 1.0 | 0.32 | 0.94 | 0.47 | 0.9977 |
| CatBoost Stack | 1.0 | 0.97 | 0.83 | 0.9 | 0.9964 |
Table 12. Confusion matrix for evaluation of fraud detection model.
| | Predicted: Not Fraud (0) | Predicted: Fraud (1) |
|---|---|---|
| Actual: Not Fraud (0) | 313,818 | 412 |
| Actual: Fraud (1) | 35 | 308 |
Table 13. CatBoost performance before and after SMOTE: tabular comparison.
| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| CatBoost Before SMOTE | 1.0 | 0.97 | 0.83 | 0.90 | 0.9964 |
| CatBoost After SMOTE | 1.0 | 0.43 | 0.90 | 0.58 | 0.9920 |
Table 14. Confusion matrix for classification of model performance.
| | Predicted: Not Fraud (0) | Predicted: Fraud (1) |
|---|---|---|
| Actual: Not Fraud (0) | 314,224 | 6 |
| Actual: Fraud (1) | 60 | 283 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
