Machine Learning-Based Data Generative Techniques for Credit Card Fraud-Detection Systems
Abstract
1. Introduction
- (1) Data and empirical contribution. We develop a novel integration of two datasets with entirely different features. The integrated dataset yields a 6.05% improvement in accuracy compared with the original datasets and achieves 100% accuracy after imputation; this improvement is statistically significant.
- (2) Methodological contribution. We introduce a robust integration pipeline together with a CBPM performance metric that jointly accounts for predictive accuracy and training time, allowing practitioners to select models that balance effectiveness and computational cost in real-world deployments.
- (3) Practical contribution. We provide actionable guidance for financial institutions and regulators on deploying the proposed approach to enhance fraud-detection systems. By improving detection efficiency and reducing false positives, the method helps protect consumers while enhancing the resilience and credibility of financial markets.
2. Preliminaries
2.1. Dataset
2.2. Various Data Generation Techniques for Credit Card Fraud Detection
- Classification and Regression Tree (CART) Imputation builds on CART, a popular modeling method developed in recent decades that also serves as a supporting model for missing-data imputation [33]. It imputes missing data by treating each variable with missing entries as a response variable and using the other variables to build the model. During model construction, only the observed data are used for training; the fitted model is then employed to predict the missing values. When multiple variables have missing data, one can either exclude the incomplete cases or apply iterative methods for stepwise imputation. CART imputation is flexible and efficient, capable of handling complex data structures while generating intuitive visual results that are easy to interpret [34].
- Gradient Boosting Tree (GBT) Imputation enhances the handling of missing data by constructing decision trees sequentially based on GBT principles [35,36]. Each tree corrects the errors of its predecessor, starting with a simple tree known as a weak learner. The iterative procedure focuses on those instances containing missing values that previous trees predicted incorrectly, and proceeds sequentially until either the predetermined number of trees has been constructed or a target level of predictive accuracy has been attained. GBT imputation has proven to be highly effective in accurately filling in missing data in tabular datasets [37,38].
- K-Nearest Neighbour (KNN) Imputation has become one of the most popular approaches in recent years [39]. It fills in missing data by estimating values based on the nearest instances in the dataset. It predicts discrete attributes by identifying the most frequent value among the k-nearest neighbours and continuous attributes by calculating their mean. A key advantage of k-NN imputation is that it does not require explicit predictive models for each attribute with missing data, allowing for easy adaptation to various attributes by simply modifying the distance metric. Additionally, it can effectively handle instances with multiple missing values [30].
- Random Forest (RF) Imputation is based on the random forest method proposed by Breiman in 2001 [40], which makes predictions by constructing multiple decision trees. RF imputation differs from traditional imputation methods by employing a random selection of features at each node for splitting, thus enhancing diversity among the trees while reducing the risk of overfitting [41]. Aggregating many weak learners yields better accuracy, robustness, and generalization than a single decision tree [42], and the added randomness stabilizes the model. RF has only two main parameters: the number of variables in the random subset at each node and the number of trees in the forest [43]. A potential drawback of RF imputation, however, is its computational cost: constructing many trees and refitting them during each training iteration increases resource consumption and computation time.
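The KNN approach above can be tried directly with scikit-learn's `KNNImputer`; a minimal sketch on a toy matrix (the values are illustrative, not the paper's data):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (illustrative only): NaN marks a missing entry.
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0],
              [4.0, 8.0]])

# Each missing entry is replaced by the mean of that feature over the
# k nearest neighbours, with distances computed on the jointly
# observed coordinates.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
filled = imputer.fit_transform(X)
# Row 1's missing value becomes mean(2.0, 6.0) = 4.0, taken from the
# two rows closest in the first feature.
```

Switching the distance metric or `weights` adapts the same call to other attribute types, which is the flexibility the text describes.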
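The tree-based variants (CART, GBT, RF) share one pattern: regress each incomplete variable on the others with a tree model and iterate. A minimal sketch of that pattern using scikit-learn's experimental `IterativeImputer`, with the estimator swapped to pick the flavour; the synthetic data and hyperparameters are illustrative, not the paper's configuration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # correlated column
X[rng.choice(200, 30, replace=False), 3] = np.nan          # knock out 15%

# Swap the estimator to choose the method: a single tree gives
# CART-style imputation, a forest gives missForest-style RF imputation.
for est in (DecisionTreeRegressor(max_depth=5, random_state=0),
            RandomForestRegressor(n_estimators=30, random_state=0)):
    filled = IterativeImputer(estimator=est, max_iter=5,
                              random_state=0).fit_transform(X)
    assert not np.isnan(filled).any()
```

A gradient-boosted flavour follows the same pattern with, e.g., `HistGradientBoostingRegressor` as the estimator; the RF variant is also where the computational cost noted above shows up, since every iteration refits a whole forest.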
Algorithm 1: KNN missing value imputation procedure. (Pseudocode figure not reproduced.)
2.3. Various Machine Learning Models for Credit Card Fraud Detection
- Artificial Neural Network (ANN) [48] constitutes a computational framework inspired by biological neural systems. The architecture consists of numerous interconnected artificial neurons that transform input signals through successive weighted summations followed by nonlinear activation functions. In standard configurations, the network is organized hierarchically into an input layer responsible for receiving raw external information, one or several hidden layers that facilitate complex nonlinear transformations via inter-neuron signal propagation, and an output layer that generates the final computed results. The hidden layers allow the ANN to approximate complex functional relationships and recognize patterns through self-learning, whereas an ANN without hidden layers can model only linear relationships.
- Convolutional Neural Network (CNN) [46] presents a specialized class of deep learning architectures that has achieved widespread adoption in domains such as image processing, natural language processing, audio signal analysis, and time-series modeling [49]. The framework is particularly tailored to grid-structured inputs and automatically extracts hierarchical spatial patterns by means of convolutional filters applied across multiple layers. CNNs provide key benefits in classification tasks by reducing the need for extensive human feature engineering and performing well in various recognition applications [50]. When dealing with high-dimensional trading data, CNNs can effectively capture temporal patterns and local dependencies through their hierarchical structure. For feature heterogeneity, CNNs provide the advantage of automatic feature extraction through convolutional layers. These layers learn to identify relevant patterns and variations in the input data, making them adaptable to the diverse characteristics found in real-world datasets. This capability allows CNNs to effectively process heterogeneous features without the need for extensive manual preprocessing [51].
- Gradient-Boosted Decision Tree (GBDT) [45] constitutes an ensemble learning technique that constructs a powerful predictive model through sequential fitting of decision trees, where each successive tree corrects the residual errors of its predecessors. Prior studies have further demonstrated the utility of GBDT as a base learner when employing fixed-depth decision trees, thereby circumventing difficulties associated with the rapid exponential increase in complexity that accompanies unrestricted tree growth. In the process of building a decision tree, GBDT automatically analyzes and selects the features with the highest statistical information gain. By combining these features, it aims to better fit the training target, thereby effectively handling datasets with dense numerical features [36]. Additionally, its iterative approach to correcting predictions improves performance by focusing on complex patterns, making it highly suitable for a variety of predictive tasks.
- K-Nearest Neighbor (KNN) [46] forms a non-parametric classification model by assigning class labels according to majority vote among the k closest training instances in the feature space [47]. The hyperparameter k, which determines the number of neighbors considered, is user-defined, with initial neighbor selection typically performed randomly and subsequently optimized via iterative performance assessment. A principal strength of the KNN algorithm lies in its conceptual simplicity and minimal implementation complexity, rendering it particularly suitable for individuals new to machine learning applications [52]. Additionally, KNN is highly effective for datasets that are not linearly separable, as it can capture complex decision boundaries through local voting mechanisms [53]. KNN performs well in multi-class classification tasks and provides robust performance even in the presence of noisy data, as the influence of outliers is mitigated by the collective voting of nearby neighbors. This characteristic enhances its effectiveness across various applications, such as recommendation systems, image recognition, and anomaly detection.
- Long Short-Term Memory (LSTM) [54] is an advanced type of recurrent neural network designed to retain sequential data over time. It features gates and a memory cell that capture and store historical trends. Each LSTM consists of multiple cells that function as modules, facilitating the transfer of data along a transport line from one cell to another. The gates within each cell filter, retain, or add data based on sigmoidal activations, enabling selective passage. The Forget Gate regulates data retention, the Memory Gate selects and modifies new data for storage, and the Output Gate determines the final output based on the cell state and processed information. This architecture allows LSTMs to effectively manage long-range dependencies in sequential data.
- Support Vector Machine (SVM) [47] is used for both classification and regression tasks and is well-known for its ability to establish optimal decision boundaries between different class distributions. SVM has strong generalization ability and is designed to maximize the margin between classes. This design allows the model to respond more effectively to unseen data, reducing the risk of overfitting and making it a robust classifier. Its performance in high-dimensional spaces is particularly impressive, especially when the number of features exceeds the number of samples, which makes SVM especially well-suited for analyzing high-dimensional data in areas such as natural language processing and image recognition [55]. However, SVM may perform unsatisfactorily when dealing with datasets that have imbalanced class distributions, include irrelevant data, or feature overlapping class samples.
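Several of the models above are available off the shelf in scikit-learn. The sketch below trains four of them on synthetic stand-in data (not the paper's dataset) and records both test accuracy and training time, the two quantities the CBPM metric combines; the hyperparameters follow the model-parameter table, while everything else is illustrative:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the fraud data (illustrative only).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
    "GBDT": GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf")),
}

results = {}
for name, model in models.items():
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)                 # training time is what CBPM trades off
    elapsed = time.perf_counter() - t0
    results[name] = (model.score(X_te, y_te), elapsed)
```

The `results` dictionary mirrors the accuracy/training-time columns reported in the experiment tables; the CNN and LSTM require a deep-learning framework and are omitted from this sketch.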
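The local pattern extraction that CNNs perform can be seen in miniature with a hand-rolled one-dimensional convolution; this is a didactic sketch of a single filter, not a full network:

```python
import numpy as np

def conv1d(x, kernel):
    """Valid-mode 1-D convolution (deep-learning convention, no kernel
    flip): slide the filter along the sequence, taking local weighted sums."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

# A difference filter responds only where the signal jumps, showing how
# convolutional filters pick out local patterns such as edges.
signal = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
edge_filter = np.array([-1.0, 1.0])
response = conv1d(signal, edge_filter)
print(response.tolist())  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

A convolutional layer learns many such filters jointly, which is how it adapts to heterogeneous features without manual feature engineering.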
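The gate mechanics of the LSTM description above can be written out as a single-cell forward pass; this NumPy sketch uses randomly initialized (untrained) parameters purely to show the data flow through the forget, memory, and output gates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W, U, b stack the parameters of the forget,
    input (memory) and output gates plus the candidate update."""
    n = h.size
    z = W @ x + U @ h + b
    f = sigmoid(z[0:n])          # forget gate: how much of c to retain
    i = sigmoid(z[n:2 * n])      # memory gate: how much new data to admit
    g = np.tanh(z[2 * n:3 * n])  # candidate cell-state update
    o = sigmoid(z[3 * n:4 * n])  # output gate: how much state to expose
    c_new = f * c + i * g        # cell state carries long-range information
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(5):               # push a short random sequence through the cell
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```

Because the hidden state is `o * tanh(c)`, its entries stay bounded while the cell state `c` accumulates information across steps, which is how the architecture manages long-range dependencies.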
2.4. Combined Bivariate Performance Measure
3. Results of the Experiment
3.1. Result Comparisons for Various ML Algorithms
3.2. Result Comparisons for Data Imputation
3.3. Enhanced Credit Card Fraud-Detection System with Generative Dataset
3.4. Statistical Tests
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Banker, S.; Dunfield, D.; Huang, A.; Prelec, D. Neural mechanisms of credit card spending. Sci. Rep. 2021, 11, 4070. [Google Scholar] [CrossRef]
- Borah, L.; Saleena, B.; Prakash, B. Credit Card Fraud Detection Using Data Mining Techniques. Seybold Rep. 2020, 15, 2431–2436. [Google Scholar]
- da Costa, V.G.T.; de Leon Ferreira de Carvalho, A.C.P.; Barbon Junior, S. Strict Very Fast Decision Tree: A memory conservative algorithm for data stream mining. Pattern Recognit. Lett. 2018, 116, 22–28. [Google Scholar] [CrossRef]
- Tang, Q.; Tong, Z.; Yang, Y. Large portfolio losses in a turbulent market. Eur. J. Oper. Res. 2021, 292, 755–769. [Google Scholar] [CrossRef]
- Koralage, R. Data Mining Techniques for Credit Card Fraud Detection. Sustain. Vital Technol. Eng. Inform. 2019, 1–9. [Google Scholar]
- Makki, S.; Assaghir, Z.; Taher, Y.; Haque, R.; Hacid, M.S.; Zeineddine, H. An Experimental Study With Imbalanced Classification Approaches for Credit Card Fraud Detection. IEEE Access 2019, 7, 93010–93022. [Google Scholar] [CrossRef]
- Ghaleb, F.A.; Saeed, F.; Al-Sarem, M.; Qasem, S.N.; Al-Hadhrami, T. Ensemble Synthesized Minority Oversampling-Based Generative Adversarial Networks and Random Forest Algorithm for Credit Card Fraud Detection. IEEE Access 2023, 11, 89694–89710. [Google Scholar] [CrossRef]
- Tingfei, H.; Guangquan, C.; Kuihua, H. Using Variational Auto Encoding in Credit Card Fraud Detection. IEEE Access 2020, 8, 149841–149853. [Google Scholar] [CrossRef]
- Salazar, A.; Safont, G.; Vergara, L. Semi-Supervised Learning for Imbalanced Classification of Credit Card Transaction. In 2018 International Joint Conference on Neural Networks (IJCNN); IEEE: Piscataway, NJ, USA, 2018; pp. 1–7. [Google Scholar] [CrossRef]
- Calvanese, D.; Giacomo, G.D.; Lenzerini, M.; Nardi, D.; Rosati, R. Description Logic Framework for Information Integration; KR’98; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1998; pp. 2–13. [Google Scholar]
- Bleiholder, J.; Naumann, F. Data fusion. ACM Comput. Surv. (CSUR) 2009, 41, 1–41. [Google Scholar] [CrossRef]
- Kuang, S.; Huang, Y.; Song, J. Unsupervised data imputation with multiple importance sampling variational autoencoders. Sci. Rep. 2025, 15, 3409. [Google Scholar] [CrossRef] [PubMed]
- Little, R.; Rubin, D. Statistical Analysis with Missing Data; Wiley: New York, NY, USA, 2019. [Google Scholar]
- Donders, A.R.T.; van der Heijden, G.J.M.G.; Stijnen, T.; Moons, K.G.M. Review: A gentle introduction to imputation of missing values. J. Clin. Epidemiol. 2006, 59, 1087–1091. [Google Scholar] [CrossRef]
- Rubin, D.B. Multiple Imputation for Nonresponse in Surveys; John Wiley & Sons Inc.: New York, NY, USA, 1987. [Google Scholar]
- Murray, J.S. Multiple Imputation: A Review of Practical and Theoretical Findings. Statist. Sci. 2018, 33, 142–159. [Google Scholar] [CrossRef]
- Muslim, M.A.; Nikmah, T.L.; Pertiwi, D.A.; Dasril, Y. New Model Combination Meta-learner to Improve Accuracy Prediction P2P Lending with Stacking Ensemble Learning. Intell. Syst. Appl. 2023, 18, 200–204. [Google Scholar] [CrossRef]
- Liu, H.; Yu, L. Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng. 2005, 17, 491–502. [Google Scholar] [CrossRef]
- Dayal, U.; Castellanos, M.; Simitsis, A.; Wilkinson, K. Data Integration Flows for Business Intelligence; EDBT ’09; Association for Computing Machinery: New York, NY, USA, 2009; pp. 1–11. [Google Scholar] [CrossRef]
- Nofal, M.I.; Yusof, Z.M. Integration of Business Intelligence and Enterprise Resource Planning within Organizations. Procedia Technol. 2013, 11, 658–665. [Google Scholar] [CrossRef]
- Feng, X.; Kim, S.K. Novel Machine Learning Based Credit Card Fraud Detection Systems. Mathematics 2024, 12, 1869. [Google Scholar] [CrossRef]
- Feng, X.; Kim, S.K. Statistical Data-Generative Machine Learning-Based Credit Card Fraud Detection Systems. Mathematics 2025, 13, 2446. [Google Scholar] [CrossRef]
- Rajora, S.; Li, D.L.; Jha, C.; Bharill, N.; Patel, O.P.; Joshi, S.; Puthal, D.; Prasad, M. A Comparative Study of Machine Learning Techniques for Credit Card Fraud Detection Based on Time Variance. In 2018 IEEE Symposium Series on Computational Intelligence; IEEE: Piscataway, NJ, USA, 2018; pp. 1958–1963. [Google Scholar] [CrossRef]
- Tanouz, D.; Subramanian, R.R.; Eswar, D.; Reddy, G.V.P.; Kumar, A.R.; Praneeth, C.V.N.M. Credit Card Fraud Detection Using Machine Learning. In IEEE Proceedings of ICICCS; IEEE: Piscataway, NJ, USA, 2021; pp. 967–972. [Google Scholar]
- El hlouli, F.Z.; Riffi, J.; Mahraz, M.A.; El Yahyaouy, A.; Tairi, H. Credit Card Fraud Detection Based on Multilayer Perceptron and Extreme Learning Machine Architectures. In Proceedings of the 2020 International Conference on Intelligent Systems and Computer Vision (ISCV), Fez, Morocco, 9–11 June 2020. [Google Scholar]
- Batista, G.E.A.P.A.; Monard, M.C. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell. 2003, 17, 519–533. [Google Scholar] [CrossRef]
- Afriyie, J.K.; Tawiah, K.; Pels, W.A.; Addai-Henne, S.; Dwamena, H.A.; Owiredu, E.O.; Ayeh, S.A.; Eshun, J. A supervised machine learning algorithm for detecting and predicting fraud in credit card transactions. Decis. Anal. J. 2023, 6, 100163. [Google Scholar] [CrossRef]
- Scikit-learn Developers. sklearn.preprocessing.StandardScaler. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html (accessed on 31 August 2025).
- Fernandez, A.; Garcia, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer: Cham, Switzerland, 2018. [Google Scholar]
- Batista, G.; Monard, M.C. A Study of K-Nearest Neighbour as an Imputation Method. In Proceedings of the Soft Computing Systems—Design, Management and Applications, HIS 2002, Santiago, Chile, 1–4 December 2002; Volume 30, pp. 251–260. [Google Scholar]
- Pratama, I.; Permanasari, A.E.; Ardiyanto, I.; Indrayani, R. A review of missing values handling methods on time-series data. In Proceedings of the 2016 International Conference on Information Technology Systems and Innovation (ICITSI), Bali, Indonesia, 24–27 October 2016; pp. 1–6. [Google Scholar]
- Bertsimas, D.; Pawlowski, C.; Zhuo, Y.D. From Predictive Methods to Missing Data Imputation: An Optimization Approach. J. Mach. Learn. Res. 2018, 18, 1–39. [Google Scholar]
- Breiman, L.; Friedman, J.; Stone, C.; Olshen, R. Classification and Regression Trees; Chapman and Hall/CRC: New York, NY, USA, 1984. [Google Scholar]
- Chen, C.Y.; Chang, Y.W. Missing data imputation using classification and regression trees. PeerJ Comput. Sci. 2024, 10, e2119. [Google Scholar] [CrossRef]
- Friedman, J.; Hastie, T.; Tibshirani, R. Additive logistic regression: A statistical view of boosting (With discussion and a rejoinder by the authors). Ann. Stat. 2000, 28, 337–407. [Google Scholar] [CrossRef]
- Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Jolicoeur-Martineau, A.; Fatras, K.; Kachman, T. Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 2–4 May 2024; Volume 238, pp. 1288–1296. [Google Scholar]
- Jäger, S.; Allhorn, A.; Bießmann, F. A Benchmark for Data Imputation Methods. Front. Big Data 2021, 4, 693674. [Google Scholar] [CrossRef]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Rebai, S.; Ben Yahia, F.; Essid, H. A graphically based machine learning approach to predict secondary schools performance in Tunisia. Socio-Econ. Plan. Sci. 2020, 70, 100724. [Google Scholar] [CrossRef]
- Kehinde, T.; Oyedele, A.; Kareem, M.; Akpan, J.; Olanrewaju, O. Explainable DEA–Ensemble Approach with Golden Jackal Optimization: Efficiency Evaluation and Prediction for United States Information Technology Firms. Mach. Learn. Appl. 2025, 23, 100798. [Google Scholar] [CrossRef]
- Liaw, A.; Wiener, M. Classification and Regression by randomForest. R News 2002, 2, 18–22. [Google Scholar]
- Ileberi, E.; Sun, Y.; Wang, Z. Performance Evaluation of Machine Learning Methods for Credit Card Fraud Detection Using SMOTE and AdaBoost. IEEE Access 2021, 9, 165286–165294. [Google Scholar] [CrossRef]
- Alam, T.M.; Shaukat, K.; Hameed, I.A.; Luo, S.; Sarwar, M.U.; Shabbir, S.; Li, J.; Khushi, M. An Investigation of Credit Card Default Prediction in the Imbalanced Datasets. IEEE Access 2020, 8, 201173–201198. [Google Scholar] [CrossRef]
- Alarfaj, F.K.; Malik, I.; Khan, H.U.; Almusallam, N.; Ramzan, M.; Ahmed, M. Credit Card Fraud Detection Using State-of-the-Art Machine Learning and Deep Learning Algorithms. IEEE Access 2022, 10, 39700–39715. [Google Scholar] [CrossRef]
- Kalid, S.N.; Ng, K.H.; Tong, G.K.; Khor, K.C. A Multiple Classifiers System for Anomaly Detection in Credit Card Data with Unbalanced and Overlapped Classes. IEEE Access 2020, 8, 28210–28221. [Google Scholar] [CrossRef]
- Nur Ozkan-Gunay, E.; Ozkan, M. Prediction of bank failures in emerging financial markets: An ANN approach. J. Risk Financ. 2007, 8, 465–480. [Google Scholar] [CrossRef]
- Lu, H.; Setiono, R.; Liu, H. Effective data mining using neural networks. IEEE Trans. Knowl. Data Eng. 1996, 8, 957–961. [Google Scholar] [CrossRef]
- Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020, 53, 5455–5516. [Google Scholar] [CrossRef]
- Razali, M.N.; Arbaiy, N.; Lin, P.C.; Ismail, S. Optimizing Multiclass Classification Using Convolutional Neural Networks with Class Weights and Early Stopping for Imbalanced Datasets. Electronics 2025, 14, 705. [Google Scholar] [CrossRef]
- Maimon, O.; Rokach, L. (Eds.) Data Mining and Knowledge Discovery Handbook; Springer: New York, NY, USA, 2010. [Google Scholar]
- Taghizadeh-Mehrjardi, R.; Nabiollahi, K.; Minasny, B.; Triantafilis, J. Comparing data mining classifiers to predict spatial distribution of USDA-family soil groups in Baneh region, Iran. Geoderma 2015, 253–254, 67–77. [Google Scholar] [CrossRef]
- Siami-Namini, S.; Namin, A.S. Forecasting Economics and Financial Time Series: ARIMA vs. LSTM. arXiv 2018, arXiv:1803.06386. [Google Scholar] [CrossRef]
- Karamizadeh, S.; Abdullah, S.M.; Halimi, M.; Shayan, J.; Rajabi, M.J. Advantage and drawback of support vector machine functionality. In Proceedings of the 2014 International Conference on Computer, Communications, and Control Technology (I4CT), Langkawi, Malaysia, 2–4 September 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 63–65. [Google Scholar]
- Kim, S.K. Combined Bivariate Performance Measure. IEEE Trans. Instrum. Meas. 2024, 73, 1–4. [Google Scholar] [CrossRef]
- AbouGrad, H.; Sankuru, L. Online Banking Fraud Detection Model: Decentralized Machine Learning Framework to Enhance Effectiveness and Compliance with Data Privacy Regulations. Mathematics 2025, 13, 2110. [Google Scholar] [CrossRef]
- Yin, C.; Zhang, S.; Wang, J.; Xiong, N.N. Anomaly Detection Based on Convolutional Recurrent Autoencoder for IoT Time Series. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 112–122. [Google Scholar] [CrossRef]
| Serial Number | Feature | Original Value | Encoded Value |
|---|---|---|---|
| 1 | gender | F, M | 0, 1 |
| 2 | owns_car | N, Y | 0, 1 |
| 3 | owns_house | N, Y | 0, 1 |
| 4 | occupation_type | Accountants, Cleaning staff, Cooking staff, Core staff, Drivers, HR staff, High skill tech staff, IT staff, Laborers, Low-skill laborers, Managers, Medicine staff, Private service staff, Realty agents, Sales staff, Secretaries, Security staff, Unknown, Waiters/barmen staff | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 |
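The integer codes in the table above correspond to assigning labels in sorted order, which is what scikit-learn's `LabelEncoder` does; a minimal sketch on sample values (not the full data):

```python
from sklearn.preprocessing import LabelEncoder

# Integer-encode a categorical column; classes are sorted before
# coding, so "F" -> 0 and "M" -> 1, matching the encoding table.
enc = LabelEncoder()
codes = enc.fit_transform(["F", "M", "M", "F"])
print(codes.tolist())  # [0, 1, 1, 0]
```

The same call handles the multi-valued `occupation_type` column, producing codes 0–18 in alphabetical order of the category names.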
| Dataset Name | Dataset Type | Class 0 (Non-Fraud) | Class 1 (Fraud) |
|---|---|---|---|
| Combined datasets | Training | 2787 | 2752 |
| Combined datasets | Testing | 692 | 689 |
| Model | Parameters |
|---|---|
| ANN | Input layer: 128 neurons, hidden layers: 64 and 32 neurons, output layer: 1 neuron |
| CNN | Convolutional layers: 32-64-128 filters, pooling: max pooling, fully connected: 256-128-64-32-16 neurons, output: 1 neuron |
| GBDT | Estimators: 100, max depth: 3 |
| KNN | Neighbors: 5 |
| LSTM | LSTM layer: 64 neurons, output layer: 1 neuron |
| SVM | Penalty parameter: 1.0, kernel: rbf |
| Algorithm | Accuracy | Precision | Recall | F1-Score | Training Time (sec) |
|---|---|---|---|---|---|
| ANN [48] | 90.36 | 90.40 | 90.35 | 0.904 | 3.60 |
| CNN [46] | 91.37 | 91.49 | 91.36 | 0.914 | 9.64 |
| GBDT [45] | 92.39 | 92.61 | 92.37 | 0.924 | 0.94 |
| KNN [46] | 91.88 | 92.17 | 91.86 | 0.919 | 0.10 |
| LSTM [54] | 92.89 | 92.97 | 92.88 | 0.929 | 5.38 |
| SVM [47] | 93.91 | 94.22 | 93.89 | 0.939 | 0.12 |
| Algorithm | Accuracy | Precision | Recall | F1-Score | Training Time (sec) |
|---|---|---|---|---|---|
| ANN [48] | 96.01 | 96.00 | 96.03 | 0.960 | 20.72 |
| CNN [46] | 96.62 | 96.62 | 96.66 | 0.966 | 97.04 |
| GBDT [45] | 97.43 | 97.48 | 97.50 | 0.975 | 1.71 |
| KNN [46] | 90.60 | 90.61 | 90.58 | 0.906 | 0.30 |
| LSTM [54] | 95.61 | 95.60 | 95.62 | 0.960 | 33.79 |
| SVM [47] | 95.40 | 95.41 | 95.45 | 0.954 | 2.85 |
| Imputation Method | Algorithm | Accuracy | Total Time (sec) | Imputation Time (sec) | Training Time (sec) |
|---|---|---|---|---|---|
| CART | ANN | 95.50 | 23.04 | 0.21 | 22.83 |
| CART | CNN | 94.84 | 64.61 | 0.21 | 64.40 |
| CART | GBDT | 96.88 | 2.48 | 0.21 | 2.27 |
| CART | KNN | 91.72 | 0.62 | 0.21 | 0.41 |
| CART | LSTM | 95.35 | 30.02 | 0.21 | 29.81 |
| CART | SVM | 95.50 | 4.28 | 0.21 | 4.07 |
| GBT | ANN | 99.20 | 52.68 | 32.96 | 19.72 |
| GBT | CNN | 98.98 | 94.03 | 32.96 | 61.07 |
| GBT | GBDT | 99.06 | 35.38 | 32.96 | 2.42 |
| GBT | KNN | 98.55 | 33.36 | 32.96 | 0.40 |
| GBT | LSTM | 99.20 | 58.51 | 32.96 | 25.55 |
| GBT | SVM | 98.84 | 35.21 | 32.96 | 2.25 |
| KNN | ANN | 99.85 | 29.03 | 9.60 | 19.43 |
| KNN | CNN | 100.00 | 72.51 | 9.60 | 62.91 |
| KNN | GBDT | 99.85 | 13.12 | 9.60 | 3.52 |
| KNN | KNN | 99.42 | 9.99 | 9.60 | 0.39 |
| KNN | LSTM | 99.85 | 34.57 | 9.60 | 24.97 |
| KNN | SVM | 99.71 | 10.68 | 9.60 | 1.08 |
| RF | ANN | 98.33 | 3662.10 | 3636.74 | 25.36 |
| RF | CNN | 97.75 | 3940.49 | 3636.74 | 303.75 |
| RF | GBDT | 98.47 | 3645.21 | 3636.74 | 8.47 |
| RF | KNN | 95.72 | 3637.15 | 3636.74 | 0.41 |
| RF | LSTM | 98.40 | 3685.99 | 3636.74 | 49.25 |
| RF | SVM | 98.33 | 3639.22 | 3636.74 | 2.48 |
| Algorithm | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| ANN [48] | 99.85 | 99.85 | 99.86 | 0.999 |
| CNN [46] | 100.00 | 100.00 | 100.00 | 1.000 |
| GBDT [45] | 99.85 | 99.85 | 99.86 | 0.999 |
| KNN [46] | 99.42 | 98.41 | 99.44 | 0.994 |
| LSTM [54] | 99.85 | 99.85 | 99.86 | 0.999 |
| SVM [47] | 99.71 | 99.71 | 99.71 | 0.997 |
| Algorithm | Original Accuracy | Imputed Accuracy | p-Value | Significant? | Accepted? |
|---|---|---|---|---|---|
| ANN | 96.01 | 99.85 | <0.001 | Yes | Yes |
| CNN | 96.62 | 100.00 | <0.001 | Yes | Yes |
| GBDT | 97.43 | 99.85 | <0.001 | Yes | Yes |
| KNN | 90.60 | 99.42 | <0.001 | Yes | Yes |
| LSTM | 95.61 | 99.85 | <0.001 | Yes | Yes |
| SVM | 95.40 | 99.71 | <0.001 | Yes | Yes |
| Algorithm | Original Accuracy | Imputed Accuracy | p-Value | Significant? | Accepted? |
|---|---|---|---|---|---|
| ANN | 90.36 | 99.85 | <0.001 | Yes | Yes |
| CNN | 91.37 | 100.00 | <0.001 | Yes | Yes |
| GBDT | 92.39 | 99.85 | <0.001 | Yes | Yes |
| KNN | 91.88 | 99.42 | <0.001 | Yes | Yes |
| LSTM | 92.89 | 99.85 | <0.001 | Yes | Yes |
| SVM | 93.91 | 99.71 | <0.001 | Yes | Yes |
| Algorithm | Accuracy | Training Time (sec) | Accuracy Score | Time Score | CBPM |
|---|---|---|---|---|---|
| ANN | 99.85 | 29.03 | 0.982 | 0.555 | 0.545 |
| CNN | 100.00 | 72.51 | 1.000 | 0.282 | 0.282 |
| GBDT | 99.85 | 13.12 | 0.982 | 1.000 | 0.982 |
| KNN | 99.42 | 9.99 | 0.931 | 0.952 | 0.886 |
| LSTM | 99.85 | 34.57 | 0.982 | 0.268 | 0.264 |
| SVM | 99.71 | 10.68 | 0.965 | 0.968 | 0.934 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Feng, X.; Kim, S.-K. Machine Learning-Based Data Generative Techniques for Credit Card Fraud-Detection Systems. Mathematics 2026, 14, 975. https://doi.org/10.3390/math14060975