1. Introduction
Machine learning has become a widely adopted tool across numerous domains, enabling automated decision-making systems that can process large amounts of data and extract meaningful patterns. However, the effectiveness of these systems depends on the quality and characteristics of the training data. One of the common challenges encountered in real-world machine learning applications is the class imbalance problem, which occurs when the distribution of target classes in a dataset is skewed [1]. This phenomenon is particularly prevalent in domains such as fraud detection, medical diagnosis, network intrusion detection, and customer behavior prediction, where the events of interest are inherently rare but carry practical importance [2]. Recent advances in deep learning have enabled sophisticated classification and domain adaptation techniques across these domains—including adaptive generative adversarial networks for fault diagnosis under data scarcity [3], multimodal detection methods for safety-critical systems [4], and dynamic adversarial domain adaptation networks for unsupervised fault diagnosis [5]—yet the underlying class imbalance challenge remains a common bottleneck that requires effective data-level solutions.
When the target variable distribution is heavily skewed, traditional machine learning algorithms tend to exhibit bias toward the majority class. This behavior stems from the optimization objectives of most learning algorithms, which aim to minimize overall error rates or maximize accuracy metrics that do not adequately account for class distribution disparities [6]. As a consequence, minority class instances—often representing important cases requiring accurate prediction—are frequently misclassified, leading to reduced predictive performance for the class of interest [7].
The Bank Marketing dataset from the UCI Machine Learning Repository serves as a representative case study for investigating class imbalance challenges in practical applications. This dataset contains records of direct marketing campaigns conducted by a Portuguese banking institution, with the classification objective being the prediction of whether a client will subscribe to a term deposit product [8]. With a class imbalance ratio of approximately 7.88:1 between non-subscribers and subscribers, building predictive models for identifying potential term deposit subscribers presents methodological challenges that reflect scenarios commonly encountered in marketing analytics, customer relationship management, and financial services applications.
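As a concrete illustration, the imbalance ratio is simply the majority-to-minority count ratio over the target column. The sketch below assumes the class sizes of the bank-additional-full variant of the dataset (36,548 "no" vs. 4640 "yes"), which are consistent with the 7.88:1 ratio reported here:

```python
from collections import Counter

# Hypothetical label list standing in for the dataset's target column;
# the counts below are the bank-additional-full class sizes assumed here.
labels = ["no"] * 36548 + ["yes"] * 4640

counts = Counter(labels)
majority = counts.most_common(1)[0][1]  # size of the largest class
minority = min(counts.values())         # size of the smallest class
imbalance_ratio = majority / minority
print(f"{imbalance_ratio:.2f}:1")  # prints 7.88:1
```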
The research community has proposed numerous approaches to address the class imbalance problem, which can be broadly categorized into algorithm-level methods and data-level methods [7]. Algorithm-level approaches modify the learning algorithm itself to account for class distribution disparities, including techniques such as cost-sensitive learning, threshold adjustment, and ensemble methods specifically designed for imbalanced data. Data-level approaches, on the other hand, focus on modifying the training data distribution through various resampling strategies, including undersampling of the majority class, oversampling of the minority class, and hybrid combinations of both techniques [9].
Among data-level approaches, synthetic data generation has emerged as a promising strategy for addressing class imbalance. Unlike simple duplication of minority class instances, which can lead to overfitting, synthetic data generation creates new, artificial samples that augment the minority class while preserving important statistical properties and relationships present in the original data. This approach offers the dual advantage of increasing the representation of minority class instances without discarding valuable information from the majority class, potentially improving model generalization and robustness [10].
The landscape of synthetic data generation methods has evolved over the past two decades. Early approaches such as the Synthetic Minority Over-sampling Technique (SMOTE) relied on interpolation-based strategies to generate new samples in the feature space. More recently, the advent of deep generative models, including Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), has opened new possibilities for generating high-fidelity synthetic data that can capture complex, non-linear relationships in the original data distribution [11]. However, the relative performance of these different approaches, particularly in the context of tabular data augmentation for imbalanced classification tasks, remains an active area of investigation.
This study aims to provide a comparative analysis of synthetic data generation methods for imbalanced tabular data. Specifically, this research addresses three research questions. First, how do different synthetic data generation methods compare in terms of statistical similarity to the original data distribution? Understanding the fidelity of generated samples is relevant for applications where maintaining data authenticity is important. Second, which augmentation strategy yields better machine learning classification performance when the augmented data is used to train predictive models? This question addresses the practical utility of different generation methods. Third, what is the trade-off between data fidelity and predictive utility across different methods? Characterizing this trade-off may provide guidance for practitioners who must balance multiple objectives when selecting augmentation strategies.
This paper makes three main contributions. First, we present a systematic comparison of traditional interpolation-based methods (SMOTE) and modern deep learning-based approaches (Gaussian Copula, TVAE, CTGAN) for synthetic tabular data generation in the context of class imbalance. Second, we employ a multi-dimensional evaluation framework that assesses synthetic data quality across three complementary dimensions: statistical similarity to the original distribution, classification performance on downstream tasks, and minority class detection capability. Third, based on our experimental findings, we provide practical guidelines for selecting appropriate augmentation strategies based on specific application requirements and constraints, offering reference points for practitioners working with imbalanced datasets.
5. Discussion
5.1. Trade-Off Between Fidelity and Utility
Our results reveal a notable trade-off between statistical fidelity and classification utility. To explicitly examine this relationship, Table 8 presents a direct comparison of the composite fidelity scores and classification performance metrics for each augmentation method, along with the Spearman rank correlation coefficients.
A key finding emerges from this multi-dimensional analysis: higher composite fidelity does not guarantee better classification performance. Gaussian Copula achieved the highest composite fidelity (0.930) but ranked second in F1-Score and recall. CTGAN ranked second in fidelity (0.884), outperforming SMOTE (0.878) due to stronger categorical and correlation preservation, yet SMOTE achieved the best classification performance. The Spearman correlation coefficients between composite fidelity and F1-Score () and recall () suggest a weak negative relationship, while AUC-ROC showed a strong positive correlation (), indicating that higher-fidelity methods better preserve discriminative ranking capability.
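As a concrete illustration of how such rank correlations are computed, the following minimal pure-Python sketch applies a rank transform followed by Pearson correlation (ignoring ties); the inputs are illustrative, not the per-run scores from our experiments:

```python
def _ranks(values):
    # Rank positions starting at 1; assumes no ties for simplicity.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = float(rank)
    return ranks

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the rank-transformed data.
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Perfectly reversed rankings give rho close to -1; a weak negative
# relationship, as reported for fidelity vs. F1-Score, would sit near 0.
print(spearman([1.0, 2.0, 3.0, 4.0], [10.0, 9.0, 5.0, 1.0]))
```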
The divergence between the individual fidelity dimensions further enriches this finding. SMOTE showed the weakest categorical distribution similarity () and correlation preservation () among augmentation methods, yet achieved the highest F1-Score and recall. This can be explained by the different optimization objectives of each method. SMOTE’s interpolation-based approach creates synthetic samples along line segments connecting existing minority class instances, effectively generating data points in decision boundary regions. This behavior relates to the concept of “informative oversampling” discussed by Fernández et al. [9], where synthetic samples that populate the decision boundary can be more beneficial for classifier learning than samples that merely replicate the original distribution.
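To make this interpolation mechanism concrete, the following is a minimal pure-Python sketch of SMOTE-style sample generation (illustrative only, not the library implementation used in the experiments):

```python
import random

def smote_like(minority, k=2, n_new=4, seed=0):
    """Generate synthetic points on segments between minority samples
    and one of their k nearest minority neighbours (Euclidean)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D minority class instances.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.5), (3.0, 2.0)]
new_points = smote_like(minority)
# Each synthetic point is a convex combination of two minority samples,
# so it stays inside the minority class's bounding box.
```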
Conversely, distribution matching approaches like Gaussian Copula excel at preserving all aspects of the original data characteristics—including marginal distributions, categorical frequencies, and correlation structure—but this faithfulness may not directly benefit classification. The strong correlation preservation of Gaussian Copula () confirms its ability to maintain inter-variable dependencies, which is valuable for applications requiring high data authenticity.
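A minimal sketch of the copula mechanism for two numeric columns is shown below, using empirical marginals and a single correlation parameter; this is a simplification of the full Gaussian Copula model evaluated in our experiments, with hypothetical input columns:

```python
import random
from statistics import NormalDist

def gaussian_copula_sample(col_a, col_b, n, seed=0):
    """Sketch: (1) map each column to normal scores via empirical ranks,
    (2) estimate their correlation, (3) sample correlated normals,
    (4) map back through the empirical quantiles."""
    rng = random.Random(seed)
    nd = NormalDist()
    m = len(col_a)

    def normal_scores(col):
        order = sorted(range(len(col)), key=lambda i: col[i])
        scores = [0.0] * len(col)
        for rank, idx in enumerate(order, start=1):
            scores[idx] = nd.inv_cdf(rank / (len(col) + 1))
        return scores

    za, zb = normal_scores(col_a), normal_scores(col_b)
    # Normal scores are symmetric around zero, so no centering is needed.
    rho = sum(a * b for a, b in zip(za, zb)) / (
        sum(a * a for a in za) ** 0.5 * sum(b * b for b in zb) ** 0.5
    )
    sa, sb = sorted(col_a), sorted(col_b)

    def quantile(sorted_col, u):
        # Empirical quantile: index into the sorted original values.
        return sorted_col[min(m - 1, int(u * m))]

    samples = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # max(...) guards against tiny negative values from rounding.
        z2 = rho * z1 + max(0.0, 1.0 - rho * rho) ** 0.5 * rng.gauss(0, 1)
        samples.append((quantile(sa, nd.cdf(z1)), quantile(sb, nd.cdf(z2))))
    return samples

# Hypothetical numeric columns (e.g., age and account balance).
ages = [23.0, 31.0, 35.0, 40.0, 42.0, 47.0, 51.0, 58.0]
balances = [100.0, 900.0, 400.0, 1500.0, 1200.0, 2500.0, 1800.0, 3000.0]
synthetic = gaussian_copula_sample(ages, balances, n=5)
# Each sampled coordinate is drawn from the empirical quantiles,
# so marginal value ranges are preserved by construction.
```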
This finding aligns with the evaluation framework proposed by Dankar and Ibrahim [25], who emphasized that synthetic data quality should be assessed relative to intended downstream tasks rather than purely statistical measures. Our multi-dimensional fidelity evaluation provides stronger quantitative evidence for this perspective: even when assessed across marginal distributions, categorical frequencies, inter-variable dependencies, and distributional shapes, methods with higher overall fidelity did not consistently yield better classification performance.
5.2. Deep Learning Methods Performance
TVAE and CTGAN showed lower performance compared to simpler methods under the default configurations used in our experiments. The minority class contained only 3712 samples, which appears insufficient for deep generative models to learn effective latent representations. For context, Xu et al. [11] evaluated CTGAN and TVAE on datasets ranging from 7043 to 581,012 samples. Shwartz-Ziv and Armon [27] demonstrated that deep learning models for tabular data typically require datasets exceeding 10,000 samples.
As noted in Section 3.2.1, our comparison employed default hyperparameter configurations with a fixed computational budget, which represents a practical but constrained evaluation setting. Table 9 summarizes the key hyperparameters of TVAE and CTGAN, their default values used in our experiments, and the potential impact of tuning based on the existing literature.
The default configurations represent a common practical scenario: practitioners adopting these methods without dedicated tuning resources. While comprehensive hyperparameter optimization (estimated at 50–100× the base training time) could potentially improve deep model performance, such optimization was beyond the scope of this comparison. Importantly, this practical constraint itself is a relevant finding—it highlights that simpler methods like SMOTE offer competitive or superior performance “out of the box,” which is a meaningful advantage for resource-constrained applications. These findings align with the recent literature suggesting that deep generative models require careful tuning and sufficient data [20].
5.3. Evaluation Metrics and Operational Considerations
Our evaluation employed AUC-ROC as the primary ranking metric, which measures the classifier’s ability to discriminate between classes across all possible decision thresholds. While AUC-ROC is widely used and enables comparison with prior studies, it has a known limitation in the class imbalance context: the false positive rate (FPR) component can be dominated by the large number of true negatives, potentially painting an overly optimistic picture of classifier performance [1].
The precision–recall AUC (PR-AUC) has been recommended as a more informative alternative for imbalanced datasets, as it focuses exclusively on the positive (minority) class and is not influenced by the number of true negatives. In our experimental setting, the emphasis on recall and F1-Score partially addresses this concern, since both metrics are precision–recall-based and do not involve true negatives. The precision–recall trade-off analysis (Figure 6) further provides a visual assessment analogous to PR-AUC by plotting the precision–recall operating points for each method–classifier combination. Nevertheless, formal PR-AUC computation and threshold optimization strategies (e.g., cost-sensitive thresholding, F-beta optimization) would provide additional operational insights and are identified as future work directions.
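For reference, PR-AUC in its average-precision form can be computed from ranked scores alone; the sketch below uses illustrative labels and scores, not values from our experiments:

```python
def average_precision(labels, scores):
    """PR-AUC approximated as average precision: the mean of precision
    measured at the rank of each true positive. True negatives never
    enter the computation, which is why the metric suits imbalance."""
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions)

# Adding correctly rejected negatives below the last positive
# leaves the score unchanged.
print(average_precision([1, 0, 1], [0.9, 0.8, 0.7]))  # equals 5/6
```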
5.4. Model-Specific Observations
The three classification models exhibited distinct response patterns to data augmentation. Logistic Regression showed notable improvement, with recall increasing from 0.206 in the baseline to 0.654 when trained on Gaussian Copula-augmented data—a more than threefold increase.
Random Forest demonstrated robust and consistent performance across all augmentation methods. Gradient Boosting achieved the highest AUC-ROC scores in most configurations, indicating better ranking capability.
5.5. Practical Implications and Decision Framework
Based on the experimental findings and the method taxonomy presented in Section 2.6, we propose a structured decision framework for practitioners selecting synthetic data augmentation strategies. Table 10 maps common application scenarios to recommended methods with supporting evidence from our experiments.
Several cross-cutting considerations inform method selection beyond the primary application scenario. First, regarding feature composition: when the dataset contains predominantly categorical features, SMOTE’s interpolation in encoded space can produce invalid intermediate values, making CTGAN or Gaussian Copula preferable. Our categorical distribution analysis confirmed this limitation (SMOTE = 0.858 vs. CTGAN = 0.896). Second, regarding inter-variable dependencies: if preserving the correlation structure is critical (e.g., for downstream causal analysis), Gaussian Copula’s explicit copula-based modeling () offers a clear advantage over interpolation-based approaches (). Third, regarding scalability: SMOTE’s computational cost grows linearly with sample size, while deep generative models incur fixed training overhead but offer constant-time sampling, making them more efficient for generating very large synthetic datasets once trained.
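The invalid-intermediate-value issue for categorical features can be seen directly on a hypothetical one-hot encoding:

```python
# One-hot encodings of two levels of a 3-level categorical feature.
red, green = [1, 0, 0], [0, 1, 0]

# SMOTE-style interpolation halfway between the two encoded categories:
gap = 0.5
interpolated = [a + gap * (b - a) for a, b in zip(red, green)]
print(interpolated)  # [0.5, 0.5, 0.0], not a valid one-hot vector
```

Distribution-modeling generators avoid this by sampling categories directly rather than interpolating in the encoded space.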
5.6. Limitations
This study has several limitations that should be considered when interpreting the results.
First, the evaluation was conducted on a single dataset from the banking domain. However, the Bank Marketing dataset exhibits properties representative of many real-world imbalanced classification problems: a moderate imbalance ratio (7.88:1), which falls within the commonly encountered range of 5:1 to 100:1 [1], and a mixture of numerical (9) and categorical (10) features typical of business datasets. Validation on datasets from healthcare, cybersecurity, and manufacturing domains would strengthen the generalizability of the findings.
Second, all methods were evaluated under default hyperparameter configurations with a fixed computational budget, as detailed in Section 3.2.1. While this reflects a practical scenario for resource-constrained practitioners, it means that the reported performance of deep generative models (TVAE, CTGAN) represents a lower bound. Comprehensive hyperparameter optimization—including tuning of latent dimensions, learning rates, network architectures, and training schedules—could potentially improve their performance, though at significant computational cost (estimated 50–100× the base training time). The sensitivity analysis in Section 5.2 identifies key hyperparameters and their expected impact.
Third, recent generative approaches—including diffusion-based models (TabDDPM [21], TabSyn [22]) and LLM-based generation (GReaT [23])—were not included in the experimental comparison, as discussed in Section 3.3. Their exclusion limits the timeliness of the benchmark; however, these methods’ data requirements and computational demands exceed the constraints of our experimental setting. As these methods mature and become available through stable frameworks, their inclusion in comparative benchmarks will be essential.
Fourth, the classification evaluation employed three traditional machine learning classifiers (Logistic Regression, Random Forest, Gradient Boosting). While these represent distinct learning paradigms (linear, bagging, boosting), the inclusion of deep neural network classifiers (e.g., multi-layer perceptrons, Tabular Network (TabNet)) and other modern approaches would broaden the assessment of how augmentation methods interact with different classifier architectures.
Fifth, our evaluation metrics focused on classification accuracy, F1-Score, recall, and AUC-ROC. While the precision–recall trade-off analysis (Figure 6) provides visual assessment of operating points, formal PR-AUC computation and threshold optimization strategies (e.g., cost-sensitive thresholding) were not performed. PR-AUC is particularly informative for imbalanced datasets as it is not affected by true negatives, and its inclusion would strengthen the operational analysis of augmentation benefits.
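A simple form of the threshold optimization mentioned here is a sweep over observed scores that maximizes an F-beta objective (beta > 1 weights recall more heavily); the sketch below uses illustrative data only:

```python
def f_beta(precision, recall, beta):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_threshold(labels, scores, beta=1.0):
    """Scan every observed score as a candidate threshold and return
    (best F-beta, threshold) for predictions score >= threshold."""
    best = (0.0, None)
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = f_beta(precision, recall, beta)
        if score > best[0]:
            best = (score, t)
    return best

labels = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.1]
print(best_threshold(labels, scores))  # (1.0, 0.6)
```

A cost-sensitive variant would replace `f_beta` with an expected-cost objective over false positives and false negatives.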
Sixth, all experiments were conducted at a single imbalance ratio (7.88:1). At higher imbalance ratios (e.g., >20:1), the conclusions of this study may shift in several ways. The minority class sample size would decrease substantially (e.g., at 50:1, fewer than 600 minority samples), which would further constrain deep generative models that already struggled with 3712 samples, likely widening the performance gap in favor of SMOTE. However, extreme imbalance also increases the risk of SMOTE generating synthetic samples in overlapping class boundary regions, potentially degrading precision. Gaussian Copula’s correlation matrix estimation would become less reliable with fewer samples, potentially reducing its fidelity advantage. Conversely, at milder imbalance ratios (e.g., 3:1), deep generative models would have access to substantially more minority samples, potentially reversing the performance rankings observed in this study.
Seventh, our evaluation focused on classification performance and statistical fidelity. Additional considerations such as privacy preservation (resistance to membership inference attacks), computational cost profiling, and sample diversity were not systematically evaluated.
6. Conclusions
Class imbalance remains a common challenge in machine learning applications, particularly in domains where minority class instances represent important events requiring accurate detection. This study addressed the question of how different synthetic data generation approaches compare in their ability to improve classification performance on imbalanced tabular data, providing empirical observations and practical reference points for method selection.
We conducted a comparative analysis of four synthetic data generation methods—SMOTE, Gaussian Copula, TVAE, and CTGAN—using the UCI Bank Marketing dataset as a case of moderate class imbalance (7.88:1 ratio). Our evaluation framework assessed each method across three complementary dimensions.
The experimental results, validated through statistical significance testing with 10 repeated experiments, yielded several findings. First, all augmentation methods showed statistically significant improvements () in minority class detection compared to the baseline, with recall improvements ranging from 58% (TVAE) to 118% (SMOTE). Second, SMOTE achieved the highest average F1-Score (0.437 ± 0.013) and recall (0.542 ± 0.022), suggesting it as a suitable choice when minority class detection is the primary objective. Third, Gaussian Copula achieved the highest composite fidelity score (0.930), demonstrating strong preservation of marginal distributions, categorical frequencies, and inter-variable dependencies, making it a suitable choice when data authenticity is important. Fourth, deep learning-based methods showed modest but statistically significant improvements, though they underperformed compared to simpler approaches in our experimental setting. Notably, CTGAN demonstrated stronger categorical distribution and correlation preservation than SMOTE, despite lower classification performance.
Beyond these empirical findings, we highlight several insights that may inform future research and practice. Our multi-dimensional fidelity evaluation—encompassing marginal numerical similarity, categorical distribution similarity, correlation structure preservation, and KS test scores—revealed a weak negative correlation () between composite fidelity and classification performance, providing stronger evidence that high-fidelity distribution matching does not necessarily translate to better downstream task performance. This trade-off has important implications for method selection based on application requirements.
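For completeness, the two-sample statistic underlying the KS test scores used in the fidelity evaluation can be sketched as follows (illustrative inputs; in practice a library routine such as SciPy's `ks_2samp` would be used):

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    points = sorted(set(sample_a) | set(sample_b))
    na, nb = len(sample_a), len(sample_b)
    d = 0.0
    for x in points:
        cdf_a = sum(1 for v in sample_a if v <= x) / na
        cdf_b = sum(1 for v in sample_b if v <= x) / nb
        d = max(d, abs(cdf_a - cdf_b))
    return d

print(ks_statistic([1, 2, 3], [1, 2, 3]))     # 0.0
print(ks_statistic([1, 2, 3], [10, 11, 12]))  # 1.0
```

A fidelity score derived from this statistic is typically reported as 1 − D, so higher values indicate closer marginal distributions.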
The findings carry implications primarily for practitioners working with moderately imbalanced tabular datasets (imbalance ratios in the range of approximately 5:1 to 10:1) where the minority class contains several thousand samples. Under these conditions, for applications prioritizing minority class detection, SMOTE combined with Gradient Boosting may offer a practical and computationally efficient option under default configurations. For applications requiring high data fidelity, Gaussian Copula may provide a suitable balance. Deep learning-based methods (TVAE, CTGAN), while showing lower performance under default settings in our experiments, may achieve improved results with comprehensive hyperparameter optimization and larger training sets, and should not be dismissed based solely on default configuration comparisons. For datasets with extreme imbalance ratios (>20:1), very small minority classes (<500 samples), or fundamentally different feature compositions, the relative effectiveness of these methods may differ from the patterns observed in this study.
Future work should extend this comparative framework in several directions. First, validation on multiple datasets spanning diverse domains (e.g., healthcare, cybersecurity, manufacturing) with varying feature compositions, sample sizes, and class distributions would strengthen the generalizability of the findings and enable the identification of domain-specific patterns in augmentation effectiveness. Second, systematic evaluation across a range of imbalance ratios (e.g., 3:1, 10:1, 50:1, 100:1) would reveal how the relative effectiveness of augmentation methods changes with imbalance severity, providing more nuanced guidance for practitioners. Third, the inclusion of recent generative approaches—particularly diffusion-based models (TabDDPM, TabSyn) and LLM-based generators (GReaT)—would enhance the timeliness and comprehensiveness of the comparison as these methods become available through stable frameworks. Fourth, broadening the classifier pool to include deep neural networks (e.g., multi-layer perceptrons, TabNet) and incorporating precision–recall AUC (PR-AUC) as an evaluation metric alongside threshold optimization strategies would provide a more complete picture of augmentation benefits across different operational settings. Fifth, systematic hyperparameter optimization studies for deep generative models would help establish the performance ceiling of these methods on small minority class samples. Finally, the integration of privacy-preserving mechanisms with synthetic data generation represents an increasingly important research direction.