Article

Statistical Data-Generative Machine Learning-Based Credit Card Fraud Detection Systems

Faculty of Applied Sciences, Macao Polytechnic University, R. de Luis Gonzaga Gomes, Macao SAR, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(15), 2446; https://doi.org/10.3390/math13152446
Submission received: 26 June 2025 / Revised: 18 July 2025 / Accepted: 27 July 2025 / Published: 29 July 2025

Abstract

This study addresses the challenges of data imbalance and missing values in credit card transaction datasets by employing mode-based imputation and various machine learning models. We analyzed two distinct datasets: one consisting of European cardholders and the other from American Express, applying multiple machine learning algorithms, including Artificial Neural Networks, Convolutional Neural Networks, and Gradient Boosted Decision Trees, as well as others. Notably, the Gradient Boosted Decision Tree demonstrated superior predictive performance, with accuracy increasing by 4.53%, reaching 96.92% on the European cardholders dataset. Mode imputation significantly improved data quality, enabling stable and reliable analysis of merged datasets with up to 50% missing values. Hypothesis testing confirmed that the performance of the merged dataset was statistically significant compared to the original datasets. This study highlights the importance of robust data handling techniques in developing effective fraud detection systems, setting the stage for future research on combining different datasets and improving predictive accuracy in the financial sector.

1. Introduction

Since their emergence in the 1960s, credit cards have gradually supplanted cash and check transactions, producing a considerable and rapid rise in both credit card users and transaction volumes and, consequently, a persistent surge in credit card fraud [1]. As a result, financial institutions are increasingly focused on identifying potential borrowers and evaluating the credit risks associated with defaults [2]. Credit card fraud refers to the unauthorized use of a credit card account by a third party, without the knowledge of the cardholder or the issuing institution. Fraud perpetrators engage in various unlawful activities, such as acquiring goods or services without payment or improperly withdrawing funds from another person's account. These activities span several types of fraud, including offline fraud, application fraud, and bankruptcy fraud [3]. The detection and prevention of credit card fraud are vital components of the financial system because they aim to identify and stop fraudulent transactions that can lead to significant losses for both consumers and institutions. Effective fraud detection measures are crucial for safeguarding customer trust and ensuring the overall integrity of financial operations [4].
Implementing effective fraud monitoring strategies is a crucial step that can significantly reduce the economic losses faced by financial institutions while greatly enhancing customer trust in their services [5]. These strategies also help minimize complaints by addressing fraudulent activities before they escalate. Financial institutions must therefore rapidly develop and deploy robust credit card fraud detection systems that can swiftly identify suspicious transactions and prevent the substantial financial losses such fraud imposes on both consumers and businesses. Data integration, or data generation, is the process of consolidating data from diverse sources to provide users with a coherent, unified view of the available information. Despite extensive research on integrating heterogeneous information systems across multiple domains, most commercial solutions still fall short of full data integration [6]. This issue is compounded by missing data, a pervasive challenge across many contemporary scientific and engineering domains. Missing data can introduce substantial biases into the processed information, producing errors during data processing and analysis, diminishing statistical efficiency, and ultimately undermining the validity and reliability of the resulting analyses.
It is crucial for researchers and engineers to address this challenge effectively in order to maintain the integrity of their findings and to ensure that decision-making processes are based on accurate and complete datasets [7].
Generally, missing data is categorized into three main classifications, which include missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR) [8]. Under the MCAR mechanism, data absence is entirely independent of both observed and unobserved variables; thus, missing values do not depend on any specific data value within the dataset. Conversely, the missing at random (MAR) mechanism dictates that the missing data pattern relies exclusively on observed variables, indicating that the missingness can be explained by the available data. This distinction is critical for researchers to select suitable missing data handling techniques, as these choices can significantly affect analysis outcomes and research conclusions. Alternatively, NMAR suggests that the missing data is influenced not only by observed variables but also by the missing values themselves. This dependency introduces significant complexity into the analysis of data completeness, as the reasons for missingness are intrinsically linked to the unobserved data. Simultaneously, the field of credit card fraud detection has witnessed a surge in the adoption of diverse machine learning models, driven by rapid advancements in machine learning techniques. These models are trained and validated using multiple datasets, often sourced from varied origins, to enhance their generalizability and robustness. The primary objective of these techniques is to improve both the accuracy and efficiency of fraud detection systems, thereby providing enhanced support to financial institutions in their efforts to proactively prevent fraudulent transactions and mitigate financial losses. Furthermore, the deployment of these advanced machine learning systems contributes to a more secure and reliable financial ecosystem. 
Building upon prior studies that have employed extensive credit card datasets for model training, this work proposes integrating two distinct credit card datasets and imputing the resulting missing data to identify potential borrowers and assess their default risk. Specifically, the European Cardholders Dataset and the American Express credit dataset are utilized for data integration, missing data imputation, and subsequent training of machine learning models, as detailed in [3,9]. With the exponential growth of data generation and the proliferation of data sources, processing and integrating data have become increasingly challenging for businesses.
Companies face challenges that include ensuring data accuracy and timeliness, compounded by the fact that traditional data warehouse approaches can lead to data silos, thereby diminishing the overall effectiveness and benefits of data warehouses [10]. Data holds minimal value for organizations if it cannot be effectively accessed, observed, and utilized. The data integration process facilitates the merging of information from disparate databases and sources, thereby presenting business users with a consolidated and comprehensive view of the data landscape. A robust and well-executed data integration strategy empowers companies to effectively leverage information originating from diverse sources, thereby supporting their overarching business objectives and strategic initiatives. Data integration plays a pivotal role in contemporary business intelligence systems, impacting not only the precision and timeliness of information but also directly influencing a company's capacity to make informed decisions in a rapidly evolving market environment. The ability to synthesize and analyze data from various sources enables organizations to respond swiftly to market trends, anticipate customer needs, and maintain a competitive edge. Ultimately, effective data integration is essential for fostering data-driven decision-making and achieving sustainable business success. For efficient business operations, companies must prioritize the design and implementation of robust data integration strategies to better support real-time decision-making processes [11].
Integrating data from diverse business functions, including finance, manufacturing, sales, and marketing, enhances financial transparency and optimizes supply chain operational efficiency. Through comprehensive data integration, businesses can leverage customer behavior information derived from merged datasets to maximize customer satisfaction, foster customer loyalty, and enhance overall profitability [12]. Furthermore, companies can analyze and utilize this integrated data to improve its availability, reliability, and overall quality, thereby ensuring that decision-makers have access to accurate and timely insights. By effectively integrating data assets, businesses are better positioned to leverage their information resources, sustain a competitive advantage in the marketplace, and drive innovation across the organization. This holistic approach to data management and utilization is essential for achieving long-term success and adapting to evolving market dynamics.
This article is further organized into three sections. Section 2 details the preliminaries, focusing on the theoretical underpinnings of this study, including data balancing techniques and various machine learning algorithms previously employed in related research. This section also provides a brief overview of mode-based generative data methods and introduces a novel feature reduction approach utilizing hypothesis testing. The performance of the mode-based generative data method is evaluated using hypothesis testing at different confidence levels. Section 3 presents the experimental results, comparing several machine learning algorithms and feature reduction methods with the mode imputation technique. Finally, Section 4 summarizes the performance comparisons with statistical data generation methods.

2. Preliminaries

The experiments utilize two publicly available credit card datasets selected from Kaggle. The first is the European credit cardholders dataset, which includes 284,807 non-fraud transactions and 492 fraud transactions (i.e., 0.172% of the total samples) and contains 30 features. The second is the American Express credit dataset, which contains 42,579 non-fraud transactions and 2949 fraud transactions (i.e., 6.926% of the total samples) and includes 18 features [13]. These two publicly available datasets have been extensively utilized in related research studies [14,15,16]. During the merging process, all features from the European dataset were retained, while two irrelevant features from the American Express dataset were removed.

2.1. Data Set

The resampling strategy includes both over-sampling and under-sampling techniques, which are commonly employed to address data imbalance issues. Over-sampling through the duplication of minority class samples may lead to overfitting or exacerbate the noise inherent in the dataset [4]. Under-sampling is a technique used to reduce the computational load in order to enhance efficiency. This method can be implemented by either randomly removing samples from the majority class or substituting samples with cluster centroids obtained from a subset of the dataset [4,17]. In this study, we employed random under-sampling techniques on both datasets to mitigate the prevalence of non-fraud samples, with the objective of achieving a nearly equal distribution between fraud and non-fraud classes, each constituting approximately 50%. Utilizing the random under-sampling approach, the European credit cardholders dataset was adjusted to contain 483 non-fraud transactions and 492 fraud transactions (see Figure 1), whereas the American Express credit dataset was modified to include 2966 non-fraud transactions and 2949 fraud transactions (see Figure 2).
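As a rough sketch of the random under-sampling step described above, one might select a random majority-class subset equal in size to the minority class. The label values and counts below are toy stand-ins, not the actual Kaggle data:

```python
import random

def random_undersample(labels, seed=42):
    """Return row indices that keep every minority (fraud) sample and an
    equally sized random subset of the majority (non-fraud) class."""
    rng = random.Random(seed)
    fraud = [i for i, y in enumerate(labels) if y == 1]
    non_fraud = [i for i, y in enumerate(labels) if y == 0]
    kept_non_fraud = rng.sample(non_fraud, len(fraud))
    return sorted(fraud + kept_non_fraud)

# toy example: 1000 non-fraud rows, 20 fraud rows
labels = [0] * 1000 + [1] * 20
kept = random_undersample(labels)
counts = {c: sum(1 for i in kept if labels[i] == c) for c in (0, 1)}
print(counts)  # both classes end up with 20 samples each
```

In practice the selected indices would be used to slice the feature matrix as well, yielding the roughly 50/50 class split described above.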
These two selected datasets have been divided into training and testing subsets to train and evaluate various machine learning models. Table 1 presents detailed information on the training and testing datasets for European credit cardholders, including the respective counts for non-fraud and fraud cases. Additionally, it provides the values for both the training and testing datasets of the American Express credit dataset.
Although reducing data samples may negatively impact key performance indicators such as accuracy, precision, and recall, it is essential to recognize that imbalanced training datasets can bias machine learning algorithms. Therefore, it is imperative to address imbalance in training datasets to develop robust machine learning systems. In the subsequent phase of our research, we will utilize various machine learning models to train on the categorized dataset. These algorithms will be trained on a balanced classification dataset to ensure adequate representation of all classes, an approach crucial for enhancing the performance and reliability of the models in predictive tasks.

2.2. Hypothesis Testing

The fundamental principle of a statistical hypothesis test is to determine whether a given data sample is typical or atypical in relation to a specified population, under the assumption that a formulated hypothesis about that population is valid [19]. In scenarios where the sample size is sufficiently large (typically $n \geq 30$), the population standard deviation $\sigma$ is known, and the underlying data distribution can be assumed to be normal, a Z-test is appropriate. The Z-test evaluates whether the sample mean significantly differs from the hypothesized population mean, thereby providing insights into the validity of the null hypothesis within the context of the study.
In this study, we employed a two-tailed Z-test to assess whether the sample mean significantly differs from the hypothesized population mean. This approach allows us to evaluate deviations in both directions, providing a comprehensive analysis of potential differences. The two-tailed Z-test is used to compare the means of two independent samples. The procedure commences with the formulation of hypotheses. The null hypothesis $H_0$ asserts that there is no significant difference between the two population means, which is mathematically expressed as follows [20]:
$$H_0: \mu_1 = \mu_2,$$
and, alternatively, the hypothesis $H_1$ indicates the presence of a significant difference between the two population means:
$$H_1: \mu_1 \neq \mu_2 \quad (\text{for a two-tailed test}),$$
and the test statistic $Z$ can be calculated as
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}},$$
where $\bar{x}_1$ and $\bar{x}_2$ denote the sample means of the first and second groups, respectively; $\sigma_1$ and $\sigma_2$ are the population standard deviations; and $n_1$ and $n_2$ represent the sample sizes. The numerator reflects the difference between the sample means, and the denominator is the standard error of this difference, incorporating the variances of both populations and their sample sizes. Following this computation, the critical value is established based on a predetermined significance level $\alpha$, typically set at 0.01, 0.05, or 0.10. This critical value is derived from the standard normal distribution table corresponding to the chosen $\alpha$; for two-tailed tests, it is denoted $Z_{\alpha/2}$. If the absolute value of the computed statistic exceeds the critical value (i.e., $|Z| > Z_{\alpha/2}$), the null hypothesis is rejected. Conversely, if $|Z|$ is less than or equal to the critical value, the null hypothesis is not rejected.
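The statistic and decision rule above can be computed directly; the sketch below obtains the p-value from the standard normal CDF via `math.erf`. The sample statistics passed in are purely illustrative values, not results from the paper's datasets:

```python
import math

def two_tailed_z_test(mean1, mean2, sd1, sd2, n1, n2, alpha=0.05):
    """Two-sample two-tailed Z-test: returns (Z, p-value, reject H0?)."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # standard error of the difference
    z = (mean1 - mean2) / se
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2 is the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p, p < alpha

# illustrative accuracies (in percent) for two hypothetical model runs
z, p, reject = two_tailed_z_test(96.9, 92.4, sd1=1.5, sd2=1.5,
                                 n1=100, n2=100, alpha=0.01)
print(f"Z = {z:.2f}, p = {p:.4g}, reject H0: {reject}")
```

Equivalently, one could compare $|Z|$ against the tabulated critical value $Z_{\alpha/2}$; the p-value formulation used here leads to the same decision.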

2.3. Mode-Based Generative Data

Mode-based generative data, also known as mode-based data imputation, replaces missing values with the mode, defined as the most frequent value in a variable. This method is a commonly employed technique for addressing missing data [21]. Identifying the mode in this process typically involves utilizing a histogram distribution. A systematic approach to mode imputation, where missing values are replaced with the mode of variables, is presented as a method for handling missing data resulting from merging two datasets. This method proves particularly effective for categorical data by preserving distribution characteristics and maintaining consistency, especially when repeated values are prominent. The process of merging two datasets and performing imputation to generate a complete dataset is illustrated in Figure 3.
The process of mode imputation is formalized here through a series of mathematical formulas. Let $D = \{x_1, x_2, \ldots, x_n\}$ be the dataset with missing values. Construct a histogram to visualize the frequency distribution of the observed values in $D$; the histogram categorizes the data into bins. For each bin $b \in B$, calculate the frequency $f_b$, the count of values falling in that bin. The bin $b^*$ with the highest frequency is found as follows:
$$b^* = \operatorname*{arg\,max}_{b \in B} f_b,$$
where $B$ is the set of histogram bins and $f_b$ is the frequency of bin $b$. The mean $\nu$ of the value set $V$ falling within bin $b^*$ is
$$\nu = \frac{1}{|V|} \sum_{x_i \in V} x_i.$$
The mode-based generative set $G$, which imputes the missing values of dataset $D$, is given by
$$G = \{\, g_i \mid g_i = x_i \cdot \mathbf{1}_{x_i \in V} + \nu \cdot \mathbf{1}_{x_i \notin V} \,\},$$
and the complete dataset $\tilde{D}$ is
$$\tilde{D} = D \cup G,$$
which includes all original and imputed values. This process imputes missing values using the mode determined from the histogram of the dataset, thereby enhancing the integrity of the data for further analysis. For each feature within the merged dataset, imputed values were derived from the same data source, ensuring consistent units post-imputation. Non-numeric data then underwent label encoding during preprocessing. The European dataset comprises exclusively numeric data, whereas the American Express dataset includes four non-numeric features; label encoding was applied to these, as detailed in Table 2. Prior to model training, StandardScaler was applied for data standardization [22]. This preprocessing step normalized the features, mitigating the influence of varying data scales and thereby improving the predictive capability of the models [23].
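The histogram-based steps above can be sketched in a few lines. In this toy version, missing entries are represented by `None` and the bin count is an arbitrary choice, not a parameter taken from the study:

```python
def mode_impute(values, n_bins=10):
    """Replace missing entries (None) with the mean of the most populated
    histogram bin of the observed values (a mode-based generative step)."""
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    width = (hi - lo) / n_bins or 1.0          # guard against a zero-width range
    bins = [[] for _ in range(n_bins)]         # values grouped per histogram bin
    for v in observed:
        b = min(int((v - lo) / width), n_bins - 1)
        bins[b].append(v)
    b_star = max(range(n_bins), key=lambda b: len(bins[b]))  # most frequent bin
    nu = sum(bins[b_star]) / len(bins[b_star])               # imputation value
    return [v if v is not None else nu for v in values]

feature = [1.0, 1.1, 1.2, 5.0, None, 1.05, None, 9.0]
print(mode_impute(feature, n_bins=4))
```

Here the densest bin holds the values near 1, so both missing entries are filled with their mean, while all observed values pass through unchanged.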

2.4. Preliminary Performance Comparison for Various ML Models

For model training and analysis, six notable machine learning models, including both traditional and deep learning approaches, have been chosen. The traditional models incorporate ensemble-based learning techniques. Notably, four of these selected machine learning algorithms have been previously applied to the same dataset, as documented in prior research [24,25,26,27]. The following machine learning models are employed in this research:
  • Artificial Neural Networks (ANNs) [28] are mathematical models that emulate the functioning of the biological brain. They consist of interconnected artificial neurons that process input data through a series of weighted sums and nonlinear decision functions. ANNs are typically organized into layers: an input layer that receives external data, one or more hidden layers that facilitate nonlinear modeling by processing signals between neurons, and an output layer that generates final responses.
  • Convolutional Neural Networks (CNNs) [26] represent a deep learning methodology extensively utilized across various domains, including image processing, natural language processing, audio analysis, and time series data. The CNN architecture comprises six distinct layers: the input layer, convolutional layer, pooling layer, fully connected layer, SoftMax/Logic layer, and output layer. Notably, the hidden layers, which share a common structural framework, can accommodate varying numbers of channels within each layer.
  • Gradient Boosted Decision Tree (GBDT) [25] is an ensemble learning algorithm that constructs a powerful predictive model by iteratively training a series of decision trees. In prior research, GBDT has also been employed as a base learner for fixed-size decision trees, effectively addressing the challenges associated with limited tree depth caused by exponential growth.
  • K-Nearest Neighbor (KNN) [26] constructs the classifier function by performing voting among its local neighboring data points [27,29,30]. The user specifies the number of neighbors, denoted as k, and the initial selection of neighboring points is made randomly. However, this selection can be refined through iterative evaluation.
  • Long Short-Term Memory (LSTM) [31] networks represent an advanced class of recurrent neural networks designed to retain sequential data over time. They incorporate gates and a memory cell that capture and store historical trends. Each LSTM consists of multiple cells functioning as modules, where data is conveyed along a transport line from one cell to another. This architecture enables LSTMs to effectively manage long-range dependencies in sequential data.
  • Support Vector Machines (SVMs) [27] are employed for both classification and regression tasks and are widely recognized for their capacity to delineate optimal decision boundaries between distinct class distributions. However, SVMs tend to demonstrate suboptimal performance when confronted with datasets characterized by imbalanced class distributions, the presence of noise, and the overlap of class samples.
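To make the neighbor-voting idea behind KNN concrete, here is a minimal pure-Python sketch using squared Euclidean distance on toy 2-D points; the data and class meanings are invented for illustration, not drawn from the study's feature space:

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    points under squared Euclidean distance."""
    ranked = sorted(
        (sum((a - q) ** 2 for a, q in zip(x, query)), y)
        for x, y in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in ranked[:k])
    return votes.most_common(1)[0][0]

# toy 2-D points: class 1 clusters near (1, 1), class 0 near the origin
X = [(0.1, 0.2), (0.0, 0.1), (0.9, 0.8), (1.0, 1.0), (0.2, 0.0)]
y = [0, 0, 1, 1, 0]
print(knn_predict(X, y, (0.95, 0.9)))  # nearest neighbors are mostly class 1
```

Library implementations add distance weighting and fast neighbor search, but the voting principle is exactly this.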
The analysis allows for a detailed comparison of the four evaluation metrics—accuracy, precision, recall, and F1-score—in relation to their respective training times. The following tables present the performance and records of two separate datasets across six algorithms, enabling a better observation and comparison of the effectiveness of merged imputed datasets in further experiments. Table 3 illustrates the performance results of the European cardholders dataset, utilizing the six methodologies previously discussed. The results show that the accuracy ranged from 90% to 94%, with SVM achieving the highest accuracy of 93.91%. All methods executed their processes in under 10 s.
The American Express credit dataset performance results are detailed in Table 4. GBDT achieved the highest accuracy at 97.43%, with most other methods ranging from 95% to 98%. KNN showed the lowest accuracy at 90.60%, alongside the shortest execution time, approaching 0 s. In contrast, CNN exhibited the longest average execution time, approximately 100 s.

3. Experiment Results

This section employs various statistical methods for imputation and compares their performance. Among the machine learning algorithms evaluated, mode imputation yielded the highest accuracy, with GBDT performing best across all results. Consequently, mode imputation was selected as the method for subsequent data imputation in this research. Given that this study combined two credit card datasets with entirely distinct features, the proportion of missing values reached 50% after merging. The imputed values for different features were calculated using the corresponding histograms to determine the mode values, and imputation was then performed on the relevant features. The entire data imputation technique is depicted in Figure 4.
Initially, the two datasets intended for merging are input. Next, the chosen imputation method is applied to fill in the missing values in the merged dataset, resulting in a complete dataset without any gaps. Following this, model training and prediction are conducted, with the predicted results analyzed through performance evaluation and hypothesis testing. Finally, if the analysis indicates that the performance of the merged dataset is similar to or significantly improved compared to the original dataset, the results and model will be output. Otherwise, the imputation method will be reselected, and relevant parameters will be adjusted to retrain the model.
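The merge, impute, train, and test loop described above might be organized as follows. Every stage here is a hypothetical stand-in: `impute`, `train_and_score`, and `is_acceptable` are placeholder callables, not the study's actual components:

```python
def fraud_detection_pipeline(dataset_a, dataset_b, imputers,
                             train_and_score, is_acceptable):
    """Try imputation methods in order until hypothesis testing accepts
    the merged dataset's performance (a sketch of the loop in Figure 4)."""
    merged = dataset_a + dataset_b          # stand-in for feature-wise merging
    for impute in imputers:
        complete = impute(merged)           # fill the gaps left by merging
        score = train_and_score(complete)   # train a model, measure accuracy
        if is_acceptable(score):            # e.g., Z-test against the originals
            return score
    return None                             # no imputation method passed

# toy run with trivial stand-ins for each stage
result = fraud_detection_pipeline(
    [1.0, None], [3.0, 4.0],
    imputers=[lambda d: [v if v is not None else 0.0 for v in d]],
    train_and_score=lambda d: sum(d) / len(d),
    is_acceptable=lambda s: s >= 0,
)
print(result)
```

In the study's workflow the acceptance check is the hypothesis test of Section 2.2, and a failed check triggers reselection of the imputation method and retraining.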

3.1. Result of the Mode-Based Imputation

A test dataset comprising 46 features and 1316 data samples has been used to perform a performance analysis, and the resulting data was subsequently applied in the hypothesis testing phase. The performance results of the mode-based imputation method are presented in Table 5.
It is evident that all key evaluation metrics—including accuracy, precision, and recall—exceed 90%, with F1-score values spanning from 0.90 to 0.97. KNN achieves the lowest metrics at approximately 90.90%, while the other algorithms all exceed 94%. Importantly, GBDT achieved the highest accuracy, precision, and recall, with values of 96.92%, 96.95%, and 96.93%, respectively, and an F1-score of 0.97. Overall, the results from the merged dataset demonstrate stability and reliability, indicating that the imputation process was successful in enhancing data quality. In terms of runtime, although CNN took twice as long to run compared to the time required for the American Express credit dataset, this increase can be attributed to the larger number of parameters involved. In contrast, the runtimes for all other algorithms showed no substantial change. This suggests that despite a 50% increase in data volume, our mode-based imputation model maintained efficient runtime performance, reflecting its effectiveness in handling larger datasets. The performance evaluation of the merged dataset revealed outstanding results from various models in both Receiver Operating Characteristic Area Under Curve (ROC-AUC) and Precision–Recall Curve (PR-AUC) metrics [32,33]. Specifically, GBDT, SVM, CNN, and LSTM consistently achieved ROC-AUC values of 0.99 (see Figure 5), evidencing their high discriminatory capability in differentiating positive from negative samples and capturing complex data patterns. KNN also performed effectively, attaining a ROC-AUC value of 0.97, reflecting strong classification performance.
Regarding PR-AUC, the assessed models (e.g., GBDT, SVM, CNN, and LSTM) consistently demonstrated robust performance, with values spanning from 0.96 to 0.99 (as depicted in Figure 6). This range definitively reinforces their notable efficacy in adeptly managing and predicting outcomes within inherently imbalanced datasets, a critical consideration in real-world applications. While the PR-AUC for KNN registered a slightly reduced value of 0.93, this metric remains within an acceptable and practically viable operational spectrum for classification tasks.
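The ROC-AUC values reported above can be understood through the rank-statistic identity: AUC equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A small sketch with toy labels and scores (not the study's predictions):

```python
def roc_auc(y_true, scores):
    """ROC-AUC via the Mann-Whitney statistic: the fraction of
    positive-negative pairs ranked correctly (ties count as half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]
print(roc_auc(y_true, scores))  # 8 of 9 positive-negative pairs ordered correctly
```

A value of 0.99, as achieved by GBDT, SVM, CNN, and LSTM, means almost every fraud transaction is scored above almost every legitimate one.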
The outcomes demonstrate that models from the merged dataset exhibit strong overall accuracy and offer considerable benefits in addressing imbalanced datasets.

3.2. Hypothesis Testing Results

To evaluate the effectiveness of the merged dataset, we employed hypothesis testing. This method systematically evaluates whether the data support a specific hypothesis, allowing for informed conclusions. We used varying significance levels to compare the merged dataset with the original datasets. Compared to the American Express dataset, there is a slight decline in performance. To confirm that the merged dataset is not significantly different from the American Express dataset, we set a significance level of 0.1, which makes the null hypothesis $H_0$ easier to reject and thus makes a non-rejection stronger evidence of similarity. We reject $H_0$ when the p-value is less than 0.1. Table 6 presents the accuracy results of the American Express dataset compared to the merged and imputed dataset. The accuracies of KNN and SVM showed slight improvements, whereas GBDT, ANN, and LSTM experienced minor declines, all of less than 1%. However, CNN exhibited a decrease of 1.17%. In the hypothesis testing conducted at $\alpha = 0.1$, corresponding to a 90% confidence level, all models showed non-significant changes except for CNN, which demonstrated a significant decline. Therefore, the null hypothesis $H_0$ is rejected for CNN but retained for the remaining algorithms.
To validate the comparison between the merged dataset and the European dataset, we employed a significance level of 0.01 during hypothesis testing, in order to confirm that the overall performance of the merged dataset significantly improved relative to the European dataset. This strict $\alpha$ value ensures that the observed effect is declared statistically significant only when strong evidence supports it. Specifically, we reject the null hypothesis $H_0$ when the p-value is less than 0.01, thereby accepting the alternative hypothesis $H_1$, which indicates that the performance of the merged dataset is significantly better than that of the European dataset. Table 7 presents the accuracy of the European dataset in comparison to the merged and imputed dataset. Hypothesis testing revealed a significant overall improvement in accuracy for the merged dataset. Specifically, the accuracies of GBDT, SVM, ANN, CNN, and LSTM increased by 4.53%, 1.61%, 5.83%, 4.08%, and 1.75%, respectively. At $\alpha = 0.01$, corresponding to a 99% confidence level, all these models demonstrated significant improvements, leading to the acceptance of the alternative hypothesis $H_1$. KNN, however, experienced a slight decline that was not statistically significant, so the null hypothesis $H_0$ was retained for KNN. Overall, the observed data show that the merged dataset achieves a statistically significant improvement in accuracy over the European dataset.
The adoption of a stricter significance level (i.e., α = 0.01 ) and a stricter similarity threshold (i.e., α = 0.1 ) for hypothesis testing in this study is supported by several considerations. A primary reason is the critical need to minimize Type I errors, especially pertinent in credit risk assessments where falsely rejecting a valid null hypothesis could lead to unwarranted risks and flawed conclusions. A more stringent significance level thus bolsters the findings’ robustness. Furthermore, employing the stricter significance enhances the research outcomes’ credibility, implying that the null hypothesis is rejected only with substantial supporting evidence. The use of the stricter significance is also common in diverse financial evaluations, notably in models demanding high precision to curtail erroneous conclusions [34,35].

3.3. Advanced Data-Generative Machine Learning System

In comparison with the original datasets, the merged dataset performed consistently against the American Express dataset while showing notable improvement relative to the European dataset. This observation not only confirms the efficacy of mode imputation in our experiments but also underscores the success of merging two datasets with distinctly different feature sets. The merged dataset comprises 1316 transactions (i.e., x_i) integrating a total of 46 features (i.e., b_i). This comprehensive approach improves data quality, thereby supporting more robust analysis and model training outcomes. The process of mode imputation when merging two datasets to enhance data integrity is illustrated in Figure 7.
It shows how missing values arising from the integration of disparate datasets are systematically filled with the mode, the most frequently occurring value in each column. This technique is especially well suited to categorical data, as it preserves the underlying distribution characteristics and maintains consistency within the dataset. A histogram of observed values can be used to visualize the frequency distribution and identify the mode. By employing this method, the research obtains a complete dataset that can be used for further analysis, including training models to detect credit card fraud. Ultimately, mode imputation improves the performance and reliability of machine learning algorithms in fraud detection while ensuring a robust dataset for accurate predictions. All missing values resulting from the merging process were imputed using mode-based data generation. Following the evaluation of model training performance, the GBDT model, which achieved the highest accuracy, was chosen for training and prediction. It successfully identified high-risk transactions among users and produced binary classification outputs, enabling more accurate risk assessments in financial contexts.
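The merge-then-impute step can be sketched as follows using only the standard library; the two small row lists merely stand in for the European and American Express datasets, and every feature name and value is illustrative rather than drawn from the actual 46-feature merged dataset.

```python
from statistics import mode

# Toy stand-ins for two datasets with partially disjoint feature sets.
european = [
    {"amount": 10.0, "v1": 0.1},
    {"amount": 25.0, "v1": -0.2},
]
amex = [
    {"amount": 40.0, "owns_car": 1},
    {"amount": 15.0, "owns_car": 0},
    {"amount": 40.0, "owns_car": 1},
]

# The union of feature names; merging leaves gaps (None) wherever a
# feature exists in only one source dataset.
features = sorted({k for row in european + amex for k in row})
merged = [{f: row.get(f) for f in features} for row in european + amex]

def impute_mode(rows, features):
    """Fill each feature's missing entries with its most frequent observed value."""
    filled = [dict(r) for r in rows]
    for f in features:
        observed = [r[f] for r in rows if r[f] is not None]
        m = mode(observed)  # most frequently occurring observed value
        for r in filled:
            if r[f] is None:
                r[f] = m
    return filled

complete = impute_mode(merged, features)
```

After imputation, every row carries a value for every feature, so the completed rows can be fed to any downstream classifier (GBDT in this study). Note that `statistics.mode` on Python 3.8+ returns the first mode encountered when there are ties.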

4. Conclusions

This study integrated two distinct credit card datasets and demonstrated the efficacy of mode imputation in addressing missing data, enhancing both data quality and analytical outcomes. The main contributions are the validation of the GBDT model, which achieved superior predictive performance in identifying high-risk user transactions, and the establishment of a robust methodology for dataset integration that can benefit financial institutions. A limitation is that the method may apply primarily to specific contexts in the financial sector, making broader generalization difficult. Future work could examine how integrating datasets from other industries influences customer behavior, yielding new insights into fraud detection beyond conventional frameworks; investigating alternative imputation methods and their combinations might further enhance data integrity and predictive accuracy. Ultimately, this research paves the way for advances in fraud detection mechanisms and demonstrates the critical importance of high-quality data in effective financial analysis.

Author Contributions

Writing—draft, X.F.; conceptualization, S.-K.K.; software, X.F.; visualization, X.F.; writing—revision, S.-K.K.; experiments, X.F.; review, S.-K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Macao Polytechnic University (MPU), under Grant RP/FCA-05/2024.

Data Availability Statement

The datasets for this paper are available in the Kaggle repository: Credit Card Fraud Detection (https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud, accessed on 30 June 2025) and American Express CodeLab 2021 (https://www.kaggle.com/datasets/pradip11/amexpert-codelab-2021, accessed on 30 June 2025); no DOI is available. The merged, imputed datasets are available on GitHub (https://github.com/AliceFeng0417/Merged-credit-datasets, accessed on 30 June 2025).

Acknowledgments

This paper was revised using AI/ML-assisted tools. The authors are much indebted to the anonymous referees for their careful review of the manuscript; their insightful suggestions led to a notable improvement.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Banker, S.; Dunfield, D.; Huang, A. Neural mechanisms of credit card spending. Sci. Rep. 2021, 11, 4070. [Google Scholar] [CrossRef] [PubMed]
  2. Tang, Q.; Tong, Z.; Yang, Y. Large portfolio losses in a turbulent market. Eur. J. Oper. Res. 2021, 292, 755–769. [Google Scholar] [CrossRef]
  3. Makki, S.; Assaghir, Z.; Taher, Y.; Haque, R.; Hacid, M.S.; Zeineddine, H. An Experimental Study with Imbalanced Classification Approaches for Credit Card Fraud Detection. IEEE Access 2019, 7, 93010–93022. [Google Scholar] [CrossRef]
  4. Ghaleb, F.A.; Saeed, F.; Al-Sarem, M.; Qasem, S.N.; Al-Hadhrami, T. Ensemble Synthesized Minority Oversampling-Based Generative Adversarial Networks and Random Forest Algorithm for Credit Card Fraud Detection. IEEE Access 2023, 11, 89694–89710. [Google Scholar] [CrossRef]
  5. Tingfei, H.; Guangquan, C.; Kuihua, H. Using Variational Auto Encoding in Credit Card Fraud Detection. IEEE Access 2020, 8, 149841–149853. [Google Scholar] [CrossRef]
  6. Bleiholder, J.; Naumann, F. Data fusion. ACM Comput. Surv. 2009, 41, 1–41. [Google Scholar]
  7. Huang, S.K.Y.; Song, J. Unsupervised data imputation with multiple importance sampling variational autoencoders. Sci. Rep. 2025, 15, 3409. [Google Scholar]
  8. Little, R.; Rubin, D. Statistical Analysis with Missing Data; Wiley: New York, NY, USA, 2019. [Google Scholar]
  9. Muslim, M.A.; Nikmah, T.L.; Pertiwi, D.A.; Dasril, Y. New Model Combination Meta-learner to Improve Accuracy Prediction P2P Lending with Stacking Ensemble Learning. Intell. Syst. Appl. 2023, 18, 200–204. [Google Scholar] [CrossRef]
  10. Perez Martinez, J.M.; Berlanga, R.; Aramburu, M.J.; Pedersen, T.B. Integrating Data Warehouses with Web Data: A Survey. IEEE Trans. Knowl. Data Eng. 2008, 20, 940–955. [Google Scholar] [CrossRef]
  11. Dayal, U.; Castellanos, M.; Simitsis, A.; Wilkinson, K. Data Integration Flows for Business Intelligence; EDBT ’09; Association for Computing Machinery: New York, NY, USA, 2009; pp. 1–11. [Google Scholar]
  12. Nofal, M.I.; Yusof, Z.M. Integration of Business Intelligence and Enterprise Resource Planning within Organizations. Procedia Technol. 2013, 11, 658–665. [Google Scholar] [CrossRef]
  13. Feng, X.; Kim, S.K. Novel Machine Learning Based Credit Card Fraud Detection Systems. Mathematics 2024, 12, 1869. [Google Scholar] [CrossRef]
  14. Rajora, S.; Li, D.L.; Jha, C.; Bharill, N.; Patel, O.P.; Joshi, S.; Puthal, D.; Prasad, M. A Comparative Study of Machine Learning Techniques for Credit Card Fraud Detection Based on Time Variance. In Proceedings of the IEEE Proceedings of SSCI, Bangalore, India, 18–21 November 2018; pp. 1958–1963. [Google Scholar]
  15. Tanouz, D.; Subramanian, R.R.; Eswar, D.; Reddy, G.V.P.; Kumar, A.R.; Praneeth, C.V.N.M. Credit Card Fraud Detection Using Machine Learning. In Proceedings of the IEEE Proceedings of ICICCS, Madurai, India, 6–8 May 2021; pp. 967–972. [Google Scholar]
  16. El hlouli, F.Z.; Riffi, J.; Mahraz, M.A.; El Yahyaouy, A.; Tairi, H. Credit Card Fraud Detection Based on Multilayer Perceptron and Extreme Learning Machine Architectures. In Proceedings of the 2020 International Conference on Intelligent Systems and Computer Vision (ISCV), Fez, Morocco, 9–11 June 2020. [Google Scholar]
  17. Fernandez, A.; Garcia, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer: Cham, Switzerland, 2018. [Google Scholar]
  18. Basak, P. AmExpert CodeLab 2021: Credit Card Default Risk; Kaggle: San Francisco, CA, USA, 2021. [Google Scholar]
  19. Emmert-Streib, F.; Dehmer, M. Understanding Statistical Hypothesis Testing: The Logic of Statistical Inference. Mach. Learn. Knowl. Extr. 2019, 1, 945–961. [Google Scholar] [CrossRef]
  20. Baron, M. Probability and Statistics for Computer Scientists; CRC Press: New York, NY, USA, 2019. [Google Scholar]
  21. Batista, G.E.A.P.A.; Monard, M.C. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell. 2003, 17, 519–533. [Google Scholar] [CrossRef]
  22. Afriyie, J.; Tawiah, K.; Pels, W.; Addai-Henne, S.; Dwamena, H.; Emmanuel, O.; Ayeh, S.; Eshun, J. A supervised machine learning algorithm for detecting and predicting fraud in credit card transactions. Decis. Anal. J. 2023, 6, 100–163. [Google Scholar] [CrossRef]
  23. Scikit-Learn Developers. Sklearn.Preprocessing.StandardScaler. 2024. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html (accessed on 30 June 2025).
  24. Ileberi, E.; Sun, Y.; Wang, Z. Performance Evaluation of Machine Learning Methods for Credit Card Fraud Detection Using SMOTE and AdaBoost. IEEE Access 2021, 9, 165286–165294. [Google Scholar] [CrossRef]
  25. Alam, T.M.; Shaukat, K.; Hameed, I.A.; Luo, S.; Sarwar, M.U.; Shabbir, S.; Li, J.; Khushi, M. An Investigation of Credit Card Default Prediction in the Imbalanced Datasets. IEEE Access 2020, 8, 201173–201198. [Google Scholar] [CrossRef]
  26. Alarfaj, F.K.; Malik, I.; Khan, H.U.; Almusallam, N.; Ramzan, M.; Ahmed, M. Credit Card Fraud Detection Using State-of-the-Art Machine Learning and Deep Learning Algorithms. IEEE Access 2022, 10, 39700–39715. [Google Scholar] [CrossRef]
  27. Kalid, S.N.; Ng, K.H.; Tong, G.K.; Khor, K.C. A Multiple Classifiers System for Anomaly Detection in Credit Card Data with Unbalanced and Overlapped Classes. IEEE Access 2020, 8, 28210–28221. [Google Scholar] [CrossRef]
  28. Nur Ozkan-Gunay, E.; Ozkan, M. Prediction of bank failures in emerging financial markets: An ANN approach. J. Risk Finan. 2007, 8, 465–480. [Google Scholar] [CrossRef]
  29. Maimon, O.; Rokach, L. Data Mining and Knowledge Discovery Handbook; Springer: New York, NY, USA, 2010. [Google Scholar]
  30. Taghizadeh-Mehrjardi, R.; Nabiollahi, K.; Minasny, B.; Triantafilis, J. Comparing data mining classifiers to predict spatial distribution of USDA-family soil groups in Baneh region, Iran. Geoderma 2015, 253–254, 67–77. [Google Scholar] [CrossRef]
  31. Siami-Namini, S.; Namin, A.S. Forecasting Economics and Financial Time Series: ARIMA vs. LSTM. arXiv 2018, arXiv:1803.06386. [Google Scholar]
  32. AbouGrad, H.; Sankuru, L. Online Banking Fraud Detection Model: Decentralized Machine Learning Framework to Enhance Effectiveness and Compliance with Data Privacy Regulations. Mathematics 2025, 13, 2110. [Google Scholar] [CrossRef]
  33. Yin, C.; Zhang, S.; Wang, J.; Xiong, N.N. Anomaly Detection Based on Convolutional Recurrent Autoencoder for IoT Time Series. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 112–122. [Google Scholar] [CrossRef]
  34. Le, T.D.; Ho, V.; Ngoc Anh, N.; Man, L. The Relationship between Working Capital Management and Profitability: Evidence in Viet Nam. Int. J. Bus. Manag. 2017, 12, 175. [Google Scholar] [CrossRef]
  35. Eling, M.; Schuhmacher, F. Does the choice of performance measure influence the evaluation of hedge funds? J. Bank. Financ. 2007, 31, 2632–2647. [Google Scholar] [CrossRef]
Figure 1. Data balancing for European cardholders dataset [9].
Figure 2. Data balancing for American Express credit dataset [18].
Figure 3. Description of the process for reconstructing missing data sets with mode imputation.
Figure 4. Experimental step diagram.
Figure 5. ROC-AUC analysis for mode imputation techniques.
Figure 6. Precision–recall analysis for mode imputation techniques.
Figure 7. Automated credit card risk detection system.
Table 1. Transaction sample distribution in European and American Express credit datasets.
Dataset            Type       Class 0 (Non-Fraud)   Class 1 (Fraud)
European           Training   393                   402
European           Testing    90                    90
American Express   Training   2341                  2391
American Express   Testing    625                   558
Table 2. Encoded features for American Express credit dataset.
Feature           Original                 Encoded
gender            F                        0
                  M                        1
owns_car          N                        0
                  Y                        1
owns_house        N                        0
                  Y                        1
occupation_type   Accountants              0
                  Cleaning staff           1
                  Cooking staff            2
                  Core staff               3
                  Drivers                  4
                  HR staff                 5
                  High skill tech staff    6
                  IT staff                 7
                  Laborers                 8
                  Low-skill laborers       9
                  Managers                 10
                  Medicine staff           11
                  Private service staff    12
                  Realty agents            13
                  Sales staff              14
                  Secretaries              15
                  Security staff           16
                  Unknown                  17
                  Waiters/barmen staff     18
Table 3. Performance results of various machine learning algorithms on the European cardholders dataset [9].
Algorithm    Accuracy (%)   Precision (%)   Recall (%)   F1-Score   Training Time [s]
ANN [28]     90.36          90.40           90.35        0.9035     3.595
CNN [26]     91.37          91.49           91.36        0.9136     9.642
GBDT [25]    92.39          92.61           92.37        0.9237     0.941
KNN [26]     91.88          92.17           91.86        0.9186     0.104
LSTM [31]    92.89          92.97           92.88        0.9289     5.377
SVM [27]     93.91          94.22           93.89        0.9390     0.118
Table 4. Performance results of various machine learning algorithms on the American Express credit dataset [18].
Algorithm    Accuracy (%)   Precision (%)   Recall (%)   F1-Score   Training Time [s]
ANN [28]     96.01          96.00           96.03        0.9601     20.723
CNN [26]     96.62          96.62           96.66        0.9662     97.043
GBDT [25]    97.43          97.48           97.50        0.9750     1.705
KNN [26]     90.60          90.61           90.58        0.9060     0.295
LSTM [31]    95.61          95.60           95.62        0.9600     33.785
SVM [27]     95.40          95.41           95.45        0.9540     2.847
Table 5. Performance results of mode-based generative data.
Algorithm    Accuracy (%)   Precision (%)   Recall (%)   F1-Score   Training Time [s]
ANN [28]     95.74          95.74           95.74        0.9574     19.41
CNN [26]     95.45          95.45           95.45        0.9545     216.80
GBDT [25]    96.92          96.95           96.93        0.961      2.23
KNN [26]     90.90          90.91           90.90        0.909      0.41
LSTM [31]    94.64          94.65           94.64        0.946      30.04
SVM [27]     95.52          95.55           95.53        0.955      4.00
Table 6. Evaluation of mode-based generative data compared to the American Express credit dataset.
Algorithm   Original Accuracy (%)   Merged Accuracy (%)   p-Value   Significant? (α = 0.1)   H0 Accepted?
GBDT        97.43                   96.92                 0.276     No                       Yes
KNN         90.60                   90.90                 0.701     No                       Yes
SVM         95.40                   95.52                 0.831     No                       Yes
ANN         96.01                   95.74                 0.620     No                       Yes
CNN         96.62                   95.45                 0.038     Yes                      No
LSTM        95.61                   94.64                 0.111     No                       Yes
Table 7. Evaluation of mode-based generative data compared to the European cardholders dataset.
Algorithm   Original Accuracy (%)   Merged Accuracy (%)   p-Value   Significant? (α = 0.01)   H1 Accepted?
GBDT        92.39                   96.92                 <0.001    Yes                       Yes
KNN         91.88                   90.90                 0.208     No                        No
SVM         93.91                   95.52                 0.004     Yes                       Yes
ANN         90.36                   95.74                 <0.001    Yes                       Yes
CNN         91.37                   95.45                 <0.001    Yes                       Yes
LSTM        92.89                   94.64                 0.004     Yes                       Yes

Share and Cite

MDPI and ACS Style

Feng, X.; Kim, S.-K. Statistical Data-Generative Machine Learning-Based Credit Card Fraud Detection Systems. Mathematics 2025, 13, 2446. https://doi.org/10.3390/math13152446

