Article

Synthetic Data Augmentation for Imbalanced Tabular Data: A Comparative Study of Generation Methods

1 Office of Liberal Arts Education, Wonkwang University, Iksan 54538, Republic of Korea
2 Department of Computer Engineering, Sunchon National University, Suncheon 57992, Republic of Korea
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(4), 883; https://doi.org/10.3390/electronics15040883
Submission received: 23 January 2026 / Revised: 15 February 2026 / Accepted: 19 February 2026 / Published: 20 February 2026
(This article belongs to the Special Issue Data-Related Challenges in Machine Learning: Theory and Application)

Abstract

Class imbalance in tabular datasets poses a challenge for machine learning classification tasks, often leading to biased models that underperform in predicting minority class instances. This study presents a comparative analysis of synthetic data generation methods for addressing class imbalance in tabular data. We evaluate four augmentation approaches—Synthetic Minority Over-sampling Technique (SMOTE), Gaussian Copula, Tabular Variational Autoencoder (TVAE), and Conditional Tabular Generative Adversarial Network (CTGAN)—using the University of California Irvine (UCI) Bank Marketing dataset, which exhibits a class imbalance ratio of approximately 7.88:1. Our experimental framework assesses each method across three dimensions: statistical fidelity to the original data distribution evaluated through four complementary metrics (marginal numerical similarity, categorical distribution similarity, correlation structure preservation, and Kolmogorov–Smirnov test), machine learning utility measured through classification performance, and minority class detection capability. Results indicate that all augmentation methods achieved statistically significant improvements over the baseline (p < 0.05). SMOTE achieved the highest recall (54.2%, a 117.6% relative improvement over the baseline) and F1-Score (0.437, +22.4% over the baseline) for minority class detection, while Gaussian Copula provided the highest composite fidelity score (0.930) with competitive predictive performance. A weak negative correlation (ρ = −0.30) between composite fidelity and classification performance was observed, suggesting that higher statistical fidelity does not necessarily translate to better downstream task performance.
Deep learning-based methods (TVAE, CTGAN) showed statistically significant improvements over the baseline (recall: +58% to +63%) but underperformed compared to simpler methods under default configurations, suggesting the need for larger training samples or more extensive hyperparameter tuning. These findings offer reference points for practitioners working with moderately imbalanced tabular data with limited minority class samples, supporting the selection of generation strategies based on specific requirements regarding data fidelity and classification objectives.

1. Introduction

Machine learning has become a widely adopted tool across numerous domains, enabling automated decision-making systems that can process large amounts of data and extract meaningful patterns. However, the effectiveness of these systems depends on the quality and characteristics of the training data. One of the common challenges encountered in real-world machine learning applications is the class imbalance problem, which occurs when the distribution of target classes in a dataset is skewed [1]. This phenomenon is particularly prevalent in domains such as fraud detection, medical diagnosis, network intrusion detection, and customer behavior prediction, where the events of interest are inherently rare but carry practical importance [2]. Recent advances in deep learning have enabled sophisticated classification and domain adaptation techniques across these domains—including adaptive generative adversarial networks for fault diagnosis under data scarcity [3], multimodal detection methods for safety-critical systems [4], and dynamic adversarial domain adaptation networks for unsupervised fault diagnosis [5]—yet the underlying class imbalance challenge remains a common bottleneck that requires effective data-level solutions.
When the target variable distribution is heavily skewed, traditional machine learning algorithms tend to exhibit bias toward the majority class. This behavior stems from the optimization objectives of most learning algorithms, which aim to minimize overall error rates or maximize accuracy metrics that do not adequately account for class distribution disparities [6]. As a consequence, minority class instances—often representing important cases requiring accurate prediction—are frequently misclassified, leading to reduced predictive performance for the class of interest [7].
The Bank Marketing dataset from the UCI Machine Learning Repository serves as a representative case study for investigating class imbalance challenges in practical applications. This dataset contains records of direct marketing campaigns conducted by a Portuguese banking institution, with the classification objective being the prediction of whether a client will subscribe to a term deposit product [8]. With a class imbalance ratio of approximately 7.88:1 between non-subscribers and subscribers, building predictive models for identifying potential term deposit subscribers presents methodological challenges that reflect scenarios commonly encountered in marketing analytics, customer relationship management, and financial services applications.
The research community has proposed numerous approaches to address the class imbalance problem, which can be broadly categorized into algorithm-level methods and data-level methods [7]. Algorithm-level approaches modify the learning algorithm itself to account for class distribution disparities, including techniques such as cost-sensitive learning, threshold adjustment, and ensemble methods specifically designed for imbalanced data. Data-level approaches, on the other hand, focus on modifying the training data distribution through various resampling strategies, including undersampling of the majority class, oversampling of the minority class, and hybrid combinations of both techniques [9].
Among data-level approaches, synthetic data generation has emerged as a promising strategy for addressing class imbalance. Unlike simple duplication of minority class instances, which can lead to overfitting, synthetic data generation creates new, artificial samples that augment the minority class while preserving important statistical properties and relationships present in the original data. This approach offers the dual advantage of increasing the representation of minority class instances without discarding valuable information from the majority class, potentially improving model generalization and robustness [10].
The landscape of synthetic data generation methods has evolved over the past two decades. Early approaches such as the Synthetic Minority Over-sampling Technique (SMOTE) relied on interpolation-based strategies to generate new samples in the feature space. More recently, the advent of deep generative models, including Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), has opened new possibilities for generating high-fidelity synthetic data that can capture complex, non-linear relationships in the original data distribution [11]. However, the relative performance of these different approaches, particularly in the context of tabular data augmentation for imbalanced classification tasks, remains an active area of investigation.
This study aims to provide a comparative analysis of synthetic data generation methods for imbalanced tabular data. Specifically, this research addresses three research questions. First, how do different synthetic data generation methods compare in terms of statistical similarity to the original data distribution? Understanding the fidelity of generated samples is relevant for applications where maintaining data authenticity is important. Second, which augmentation strategy yields better machine learning classification performance when the augmented data is used to train predictive models? This question addresses the practical utility of different generation methods. Third, what is the trade-off between data fidelity and predictive utility across different methods? Characterizing this trade-off may provide guidance for practitioners who must balance multiple objectives when selecting augmentation strategies.
This paper offers three main elements. First, we present a systematic comparison of traditional interpolation-based methods (SMOTE) and modern deep learning-based approaches (Gaussian Copula, TVAE, CTGAN) for synthetic tabular data generation in the context of class imbalance. Second, we employ a multi-dimensional evaluation framework that assesses synthetic data quality across three complementary dimensions: statistical similarity to the original distribution, classification performance on downstream tasks, and minority class detection capability. Third, based on our experimental findings, we provide practical guidelines for selecting appropriate augmentation strategies based on specific application requirements and constraints, offering reference points for practitioners working with imbalanced datasets.

2. Related Work

2.1. The Class Imbalance Problem

The class imbalance problem represents one of the most extensively studied challenges in machine learning, with research spanning more than two decades and encompassing theoretical analysis, algorithmic development, and practical applications [1]. At its core, the problem arises when the prior probabilities of different classes in a classification task differ substantially, leading to datasets where one or more classes are significantly underrepresented relative to others [2].
The fundamental difficulty posed by class imbalance stems from the optimization objectives employed by most machine learning algorithms. Standard learning procedures typically aim to minimize overall error rates or maximize accuracy metrics, which implicitly assume that all classes are equally important and that misclassification costs are uniform across classes [6]. Under these assumptions, a classifier can achieve high overall accuracy simply by predicting the majority class for all instances, while completely failing to identify minority class examples.
Japkowicz and Stephen [2] conducted a systematic study demonstrating that class imbalance interacts with other dataset characteristics, including the complexity of the underlying concept, the presence of small disjuncts, and the overall dataset size. Chawla et al. [12] provided a comprehensive editorial overview of the class imbalance problem, identifying key research directions and challenges. More recently, Krawczyk [7] presented an extensive survey of learning from imbalanced data, identifying open challenges and future research directions.

2.2. Traditional Oversampling Techniques

Data-level approaches to class imbalance address the problem by modifying the training data distribution to achieve a more balanced representation of classes. Among these approaches, oversampling techniques that generate new minority class instances have received considerable attention [9].
The Synthetic Minority Over-sampling Technique (SMOTE), introduced by Chawla et al. [10], represents a landmark contribution that fundamentally changed the approach to minority class augmentation. Unlike simple random oversampling, which duplicates existing minority class instances and can lead to overfitting, SMOTE generates synthetic samples by interpolating between existing minority class instances and their k-nearest neighbors in the feature space.
The effectiveness of SMOTE has been demonstrated across numerous application domains, establishing it as a de facto standard for addressing class imbalance. Fernández et al. [9] conducted a comprehensive review of SMOTE and its extensions, documenting over 85 variants. These include Borderline-SMOTE [13] and Adaptive Synthetic Sampling (ADASYN) [14]. A comparative study by Batista et al. [15] evaluated several resampling methods, finding that combinations of oversampling and undersampling often yielded better results than either approach alone.

2.3. Statistical Generative Models

Statistical approaches to synthetic data generation aim to model the underlying probability distribution of the data and generate new samples by drawing from this learned distribution. Among these methods, copula-based approaches have gained attention for their ability to model complex dependencies between variables [16].
The Gaussian Copula method, as implemented in the Synthetic Data Vault (SDV) framework, represents a principled approach to tabular data synthesis [16]. The method operates by first estimating the marginal distribution of each feature independently, then modeling the dependency structure between features using a Gaussian copula.

2.4. Deep Generative Models for Tabular Data

The success of deep generative models in domains such as image synthesis and natural language generation has motivated their application to tabular data synthesis. Two primary architectures have been adapted for this purpose: Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) [17,18].
Variational Autoencoders, introduced by Kingma and Welling [17], provide a principled probabilistic framework for learning latent representations of data. The Tabular VAE (TVAE), developed by Xu et al. [11], adapts the VAE architecture for tabular data by incorporating specialized handling of different variable types.
Generative Adversarial Networks, introduced by Goodfellow et al. [18], take a fundamentally different approach based on adversarial training. CTGAN (Conditional Tabular GAN), also developed by Xu et al. [11], extends the GAN framework for tabular data synthesis with several key innovations. More recently, Conditional Tabular GAN+ (CTAB-GAN+) [19] improved upon earlier GAN-based methods by incorporating downstream losses and Wasserstein distance with gradient penalty for enhanced training stability and higher utility synthetic data.
Despite their sophistication, deep generative models face several challenges when applied to tabular data synthesis. Borisov et al. [20] conducted a comprehensive survey, noting that deep models often struggle to outperform traditional methods on tabular tasks.
Recent advances have introduced novel architectures for tabular data generation. Diffusion models, which have achieved remarkable success in image generation, have been adapted for tabular data. The Tabular Denoising Diffusion Probabilistic Model (TabDDPM) [21] applies denoising diffusion probabilistic models to tabular data, using Gaussian diffusion for continuous features and multinomial diffusion for categorical features, demonstrating state-of-the-art performance on various benchmarks. Tabular Synthesis via Score-based Diffusion (TabSyn) [22] further advances this direction by performing score-based diffusion in a learned latent space, enabling more effective handling of mixed-type tabular data.
Large language models (LLMs) have also emerged as promising tools for tabular data synthesis. Generation of Realistic Tabular data (GReaT) [23] leverages pre-trained Generative Pre-trained Transformer 2 (GPT-2) models to generate realistic tabular data by treating rows as sequences of feature–value pairs, demonstrating that language models can effectively capture complex feature dependencies without task-specific architectural modifications.

2.5. Synthetic Data for Machine Learning Augmentation

The use of synthetic data for augmenting machine learning training sets has gained increasing attention as a strategy for addressing various data challenges [24]. Dankar and Ibrahim [25] provided practical guidelines for effective synthetic data generation, emphasizing the importance of evaluating synthetic data quality across multiple dimensions.

2.6. Comparative Taxonomy of Generation Methods

To provide a structured basis for comparison, Table 1 presents a taxonomy of the four synthetic data generation methods evaluated in this study, classified along consistent criteria including underlying approach, core assumptions, strengths, limitations, data requirements, and known failure cases. This taxonomy enables readers to systematically assess each method’s suitability for different application contexts.
Several observations emerge from this taxonomy. First, a clear trade-off exists between model complexity and data requirements: interpolation-based methods (SMOTE) require minimal data but make strong assumptions about feature space geometry, while deep generative models (TVAE, CTGAN) can capture complex distributions but require substantially more training data. Second, the methods differ fundamentally in their treatment of categorical features—SMOTE operates in encoded numerical space and may produce invalid intermediate values, whereas CTGAN’s conditional generator is explicitly designed for categorical handling. Third, the failure modes are method-specific: SMOTE fails when the convexity assumption is violated (multimodal distributions), statistical models fail with complex non-linear dependencies, and deep models fail with insufficient data. These distinctions have direct implications for method selection in practice, which we revisit in the Discussion (Section 5.4).

2.7. Summary and Research Gap

The existing literature provides a foundation for understanding class imbalance and synthetic data generation, yet several gaps remain that motivate the present study. First, comparisons across traditional and deep learning approaches on standardized benchmarks remain limited, particularly for tabular data in classification contexts. Second, most existing studies focus on either statistical fidelity or downstream task performance, but few provide integrated evaluation frameworks. Third, practical guidelines for method selection based on structured analysis of method characteristics, assumptions, and constraints are often lacking.
This study addresses these gaps by providing a comparison of four representative synthetic data generation methods—spanning traditional interpolation-based, statistical, and deep learning approaches—evaluated through a multi-dimensional framework that assesses statistical fidelity, classification performance, and minority class detection capability. Beyond empirical comparison, we provide a structured decision framework to guide practitioners in selecting appropriate methods based on their specific application requirements.

3. Materials and Methods

3.1. Dataset Description

We utilize the Bank Marketing dataset from the UCI Machine Learning Repository [26], which contains 41,188 instances related to direct marketing campaigns of a Portuguese banking institution. The original dataset includes 20 input features; however, following UCI guidelines, we exclude the ‘duration’ feature as it is not available before a call is made and would lead to data leakage, resulting in 19 features for our experiments. The classification task is to predict whether a client will subscribe to a term deposit. Table 2 summarizes the dataset characteristics.
The Bank Marketing dataset was selected as the evaluation benchmark for several reasons. First, it is a widely used benchmark in imbalanced classification research [8,9], enabling comparison with prior studies. Second, it contains a representative mixture of numerical (9) and categorical (10) features, reflecting the mixed-type composition typical of real-world business and administrative datasets. Third, its moderate imbalance ratio (7.88:1) falls within the commonly encountered range of 5:1 to 100:1 [1], making it neither a trivial nor an extreme case for augmentation methods. Fourth, the dataset is publicly available from the UCI Machine Learning Repository with well-documented provenance [26], ensuring reproducibility. While these properties make the Bank Marketing dataset a suitable testbed for our comparison, we acknowledge that validation on additional datasets from diverse domains would strengthen generalizability (see Section 5.5).

3.2. Experimental Setup

3.2.1. Implementation Environment

All experiments were conducted using Python 3.8 with the following primary libraries: scikit-learn (v1.0.2) for machine learning models and preprocessing, imbalanced-learn (v0.9.0) for SMOTE implementation, SDV (Synthetic Data Vault, v0.17.1) for Gaussian Copula, TVAE and CTGAN implementations, and SciPy (v1.7.3) for statistical testing. Experiments were executed on a system with an Intel Core i7 processor and 16 GB RAM (Intel, Santa Clara, CA, USA).
Scope of comparison and experimental fairness. To ensure a fair comparison across methods with inherently different tuning complexities, this study adopts a controlled experimental design in which all methods are evaluated under their default (library-provided) hyperparameter configurations with a fixed computational budget. Traditional methods such as SMOTE require minimal configuration (only k-neighbors), whereas deep generative models (TVAE, CTGAN) involve numerous hyperparameters (optimizer, learning rate, network architecture, activation functions). To avoid introducing bias through differential tuning effort, we uniformly applied the SDV library defaults for TVAE and CTGAN—including Adam optimizer, learning rates of 1 × 10⁻³ and 2 × 10⁻⁴, respectively, and network architectures detailed in Appendix A (Table A1). These defaults were established by Xu et al. [11] based on extensive experimentation across multiple datasets and represent well-tested starting configurations. Additionally, all methods share the same data splits, preprocessing pipeline, and evaluation protocol, ensuring that observed performance differences are attributable to the generation methods themselves rather than experimental conditions. This design choice reflects a practical scenario in which practitioners adopt augmentation methods without extensive tuning resources. Consequently, the reported performance of deep generative models represents a lower bound of their potential capability; with comprehensive hyperparameter optimization, their performance could improve. Performance rankings reported in this study should therefore be interpreted within this default configuration context rather than as definitive assessments of each method’s maximum achievable performance.

3.2.2. Repeated Experiments and Statistical Analysis

To ensure the reliability of our findings and account for variability due to data partitioning, all experiments were repeated 10 times with different random seeds (seeds 0–9). Each repetition involved a completely independent pipeline: (1) a new stratified train/test split (80%/20%) generated with the corresponding random seed, (2) fresh synthetic data generation from the resulting training set, and (3) independent model training and evaluation. This design ensures that results are not contingent on any single data partition, effectively averaging over the randomness in both data splitting and model training. The 10-repetition hold-out strategy was chosen over k-fold cross-validation because the synthetic data generation step—particularly for deep generative models requiring iterative training—incurs substantial computational cost per fold, making full cross-validation prohibitively expensive.
We report the mean and standard deviation of performance metrics across the 10 runs. To assess whether performance differences between augmentation methods are statistically significant, we employed the Wilcoxon signed-rank test, a non-parametric alternative to the paired t-test that does not assume normal distribution of differences. Statistical significance was determined at α = 0.05 . We specifically tested whether each augmentation method significantly outperforms the baseline condition for each classifier and metric combination.
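The significance test described above can be sketched as follows. The per-seed recall values below are illustrative stand-ins (drawn from a random generator), not the reported experimental results; only the testing procedure mirrors the protocol.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# Hypothetical recall scores from 10 repeated runs (seeds 0-9);
# illustrative values only, centered near the reported means.
baseline_recall = rng.normal(loc=0.249, scale=0.01, size=10)
smote_recall = rng.normal(loc=0.542, scale=0.01, size=10)

# Paired, one-sided Wilcoxon signed-rank test: does the augmentation
# method exceed the baseline across the 10 paired runs?
stat, p_value = wilcoxon(smote_recall, baseline_recall, alternative="greater")
print(f"W = {stat}, p = {p_value:.4f}")
if p_value < 0.05:
    print("significant at alpha = 0.05")
```

Because the test is paired, each run's baseline and augmented scores must come from the same random seed and data split.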

3.2.3. Data Preprocessing

The preprocessing pipeline consisted of three main stages:
Stage 1: Feature Removal. We removed the ‘duration’ feature following UCI repository guidelines. This variable represents the duration of the last contact in seconds and is only known after a call is completed. Including this feature would constitute data leakage since it is unavailable at prediction time.
Stage 2: Categorical Encoding. All categorical features were transformed using label encoding, which maps each unique category to a single integer value. Ten categorical features (job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome) were encoded independently. Unlike one-hot encoding, label encoding does not expand the feature dimensionality; therefore, the final input representation retains all 19 features (9 numerical + 10 label-encoded categorical), yielding a 19-dimensional feature vector x ∈ ℝ^19 for each instance. This encoding was fit on the training set and applied consistently to both training and test sets to prevent information leakage.
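Stage 2 can be sketched with scikit-learn's OrdinalEncoder, the column-wise counterpart of label encoding for feature matrices. The two features and their values below are illustrative toy data, not rows from the Bank Marketing dataset.

```python
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for two of the ten categorical features.
train = [["admin.", "married"],
         ["technician", "single"],
         ["admin.", "married"]]
test = [["technician", "single"]]

# Fit the category-to-integer mapping on the training set only, then
# reuse it on the test set, mirroring the leakage-free protocol above.
encoder = OrdinalEncoder()
train_enc = encoder.fit_transform(train)
test_enc = encoder.transform(test)
print(train_enc.tolist())  # each category mapped to an integer per column
```

Each column keeps its own mapping, so the feature dimensionality is unchanged, in contrast to one-hot encoding.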
Stage 3: Feature Scaling. Numerical features were standardized using StandardScaler, which transforms each feature to have zero mean and unit variance:
x_scaled = (x − μ) / σ
where μ and σ are the mean and standard deviation computed from the training set. Scaling was performed after the train–test split and fit only on training data to prevent data leakage.
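A minimal sketch of the leakage-free scaling in Stage 3, using toy numerical values: μ and σ are estimated from the training data only and then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numerical feature matrix (illustrative); rows are instances.
X_train = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 600.0]])
X_test = np.array([[3.0, 300.0]])

# Fit mu and sigma on the training set, then apply the same
# transformation to the test set (no leakage of test statistics).
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # ~[0, 0]
print(X_train_scaled.std(axis=0))   # ~[1, 1]
```

Note that the test set is generally not exactly zero-mean after transformation, since it is scaled with the training set's statistics.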

3.2.4. Data Splitting Strategy

The dataset was split into training (80%, n = 32,950) and test (20%, n = 8238) sets using stratified sampling to maintain the original class proportions in both subsets. Stratification ensures that the 7.88:1 imbalance ratio is preserved in both training and test sets, providing a realistic evaluation scenario. The training set contained 29,238 majority class (No) and 3712 minority class (Yes) instances, while the test set contained 7310 majority class and 928 minority class instances.
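The stratified split can be sketched as follows, with a toy label vector at roughly the same 8:1 imbalance standing in for the real 41,188-instance dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (~8:1), illustrative only.
y = np.array([0] * 800 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)

# Stratified 80/20 split preserves the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(np.bincount(y_tr))  # [640  80] -> 8:1 preserved
print(np.bincount(y_te))  # [160  20] -> 8:1 preserved
```

Without `stratify=y`, a random split could leave the small minority class over- or under-represented in the test set, distorting recall estimates.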

3.2.5. Synthetic Data Generation Protocol

For each augmentation method (excluding Baseline), synthetic samples were generated exclusively from the minority class to achieve class balance. The number of synthetic samples required was computed as:
n_synthetic = |D_majority| − |D_minority| = 29,238 − 3712 = 25,526
This ensures that after augmentation, both classes have equal representation (29,238 instances each), resulting in a balanced training set of 58,476 instances. Importantly, synthetic data generation was performed exclusively on the training set; the test set remained unchanged throughout all experiments to ensure fair evaluation.
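The balancing arithmetic above can be checked directly; the label-assignment step at the end reflects the protocol of tagging every generated row with the minority class.

```python
import numpy as np

n_majority, n_minority = 29238, 3712

# Synthetic minority samples needed to equalize the two classes.
n_synthetic = n_majority - n_minority
print(n_synthetic)  # 25526

# Size of the balanced training set after augmentation.
balanced_total = n_majority + (n_minority + n_synthetic)
print(balanced_total)  # 58476

# All synthetic rows (however generated) receive the minority label (1).
y_synthetic = np.ones(n_synthetic, dtype=int)
```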

3.2.6. Experimental Workflow

The overall experimental procedure is formalized in Algorithm 1, which outlines the systematic workflow for comparing synthetic data augmentation methods.
Algorithm 1 Experimental workflow for synthetic data augmentation comparison.
Require: Dataset D, Methods M = {Baseline, SMOTE, GaussianCopula, TVAE, CTGAN}
Require: Classifiers C = {LogisticRegression, RandomForest, GradientBoosting}
Ensure: Performance metrics P, Similarity scores S
1:  // Phase 0: Data Preparation
2:  Preprocess D: label encode categorical features, remove ‘duration’ column
3:  Split D → D_train (80%), D_test (20%) using stratified sampling
4:  Extract D_minority ← D_train[target = minority]
5:  Extract D_majority ← D_train[target = majority]
6:  n_synthetic ← |D_majority| − |D_minority|
7:  for each method m ∈ M do
8:      // Phase 1: Synthetic Data Generation
9:      if m = Baseline then
10:         D_aug ← D_train                              ▹ No augmentation
11:     else
12:         Initialize generative model G_m with method-specific parameters
13:         Fit G_m on D_minority
14:         D_synthetic ← G_m.sample(n_synthetic)
15:         Assign minority class label to all synthetic samples
16:         D_aug ← D_train ∪ D_synthetic
17:     end if
18:     // Phase 2a: Statistical Similarity Evaluation
19:     D_aug_minority ← D_aug[target = minority]
20:     S[m] ← ComputeFidelity(D_minority, D_aug_minority)    ▹ Equations (10)–(16)
21:     // Phase 2b: Classification Performance Evaluation
22:     for each classifier c ∈ C do
23:         Initialize StandardScaler and fit on D_aug
24:         D_aug_scaled ← StandardScaler.transform(D_aug)
25:         D_test_scaled ← StandardScaler.transform(D_test)
26:         Initialize classifier c with default parameters
27:         Fit c on D_aug_scaled
28:         y_pred ← c.predict(D_test_scaled)
29:         y_prob ← c.predict_proba(D_test_scaled)
30:         P[m][c] ← EvaluateMetrics(y_test, y_pred, y_prob)
31:     end for
32: end for
33: return P, S
Figure 1 provides a visual representation of this workflow.

3.2.7. Evaluation Protocol

The evaluation protocol encompasses two complementary dimensions. Statistical fidelity (Section 3.4.1) assesses how well synthetic data preserves the distributional properties of the original minority class data through four complementary metrics. Machine learning utility (Section 3.4.2) evaluates classification performance when models are trained on augmented data and tested on the held-out test set. This dual evaluation approach enables the assessment of both data fidelity and practical utility of each generation method.

3.3. Synthetic Data Generation Methods

We selected four methods representing three distinct paradigms of synthetic data generation: interpolation-based (SMOTE), statistical modeling (Gaussian Copula), and deep generative modeling (TVAE, CTGAN). This selection was guided by three criteria. First, paradigm coverage: the methods span the major methodological categories identified in our taxonomy (Table 1), enabling cross-paradigm comparison. Second, software maturity: all methods are available through established, well-maintained libraries—imbalanced-learn for SMOTE and the Synthetic Data Vault (SDV) for Gaussian Copula, TVAE, and CTGAN—ensuring reproducibility and practical accessibility. Third, data feasibility: the methods were selected based on their documented ability to operate with training set sizes comparable to our minority class (3712 samples).
Recent advances such as diffusion-based models (TabDDPM [21], TabSyn [22]) and LLM-based generation (GReaT [23]) were discussed in Section 2.4 but not included in the experimental comparison. This exclusion reflects three practical considerations: (1) these methods were evaluated in their original publications on datasets with substantially larger sample sizes (typically >30,000 training instances), and their performance on minority classes with <4000 samples remains unvalidated; (2) they require significant GPU resources and extended training times that exceed the computational constraints of this study; and (3) their implementations remain at the research prototype stage without stable, production-ready APIs, which limits reproducibility. Given that TVAE and CTGAN—which share the deep generative paradigm—already exhibited limited performance with 3712 minority samples in our experiments, more data-intensive architectures would face similar or greater challenges. We identify the inclusion of these emerging methods as an important direction for future work (Section 6).
To clarify the dimensionality transformations at each pipeline stage, Table 3 summarizes the input, internal representation, latent/noise space, and output dimensions for each method.

3.3.1. SMOTE (Synthetic Minority Over-Sampling Technique)

SMOTE generates synthetic samples by linear interpolation between minority class instances and their k-nearest neighbors [10]. For a given minority class instance $x_i$, SMOTE first identifies its k-nearest neighbors within the minority class. A synthetic sample $x_{\text{new}}$ is then generated by randomly selecting one of those neighbors, $x_{nn}$, and interpolating:

$$x_{\text{new}} = x_i + \lambda \cdot (x_{nn} - x_i)$$

where $\lambda$ is a random number drawn uniformly from $[0, 1]$. This construction ensures that each synthetic sample lies on the line segment connecting $x_i$ and $x_{nn}$ in the feature space.
In our experiments, we set k = 5 and applied SMOTE to fully balance the classes, generating 25,526 synthetic minority class samples until the minority class size matched that of the majority class. SMOTE operates directly on the 19-dimensional label-encoded feature space without any internal dimensionality transformation.
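The interpolation rule above can be sketched directly with scikit-learn's NearestNeighbors on toy data (a minimal illustration of the mechanism; the helper name `smote_sample` is ours, and our experiments used the imbalanced-learn implementation instead):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_synthetic, k=5, seed=None):
    """Minimal SMOTE: interpolate between minority samples and their k-NN."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: nearest neighbor is the point itself
    _, idx = nn.kneighbors(X_min)
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(len(X_min))           # random minority instance x_i
        j = idx[i, rng.integers(1, k + 1)]     # random neighbor x_nn (skip self at column 0)
        lam = rng.uniform()                    # lambda ~ U[0, 1]
        synthetic[s] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic

rng = np.random.default_rng(0)
X_min = rng.normal(size=(50, 19))              # toy minority class with 19 features, as in our encoding
X_syn = smote_sample(X_min, n_synthetic=100, k=5, seed=1)
```

Because each synthetic point is a convex combination of two existing minority instances, it always lies within the per-feature range of the minority class.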

3.3.2. Gaussian Copula

The Gaussian Copula method models the joint probability distribution of features by modeling the marginal distributions and their dependency structure separately, using a Gaussian copula [16]. By Sklar's theorem, any multivariate joint distribution $F$ can be expressed as:

$$F(x_1, x_2, \ldots, x_d) = C\big(F_1(x_1), F_2(x_2), \ldots, F_d(x_d)\big)$$

where $F_i$ is the marginal cumulative distribution function (CDF) of feature $i$, and $C$ is the copula function that captures the dependency structure. The Gaussian copula is defined as:

$$C_\Sigma(u_1, u_2, \ldots, u_d) = \Phi_\Sigma\big(\Phi^{-1}(u_1), \Phi^{-1}(u_2), \ldots, \Phi^{-1}(u_d)\big)$$

where $\Phi_\Sigma$ is the joint CDF of a multivariate normal distribution with correlation matrix $\Sigma$, $\Phi^{-1}$ is the inverse CDF of the standard normal distribution, and $u_i = F_i(x_i)$.
For synthetic data generation, the process involves (1) estimating the marginal distributions F i for each feature, (2) estimating the correlation matrix Σ from the transformed data, and (3) sampling from the Gaussian Copula and back-transforming through inverse marginal CDFs. The Gaussian Copula model accepts the 19-dimensional input and internally distinguishes between numerical and categorical features: numerical features are modeled with parametric marginals (default: beta distribution), while categorical features are modeled with discrete frequency distributions. The copula operates on the d = 19 transformed uniform marginals without dimensionality expansion.
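The three-step generation process can be sketched with NumPy and SciPy on toy numerical data (an illustrative sketch using empirical marginals; the SDV implementation fits parametric marginals such as the beta distribution by default and also handles categorical columns):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy data with two correlated columns, standing in for the minority class.
X = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=1000)
X[:, 1] = np.exp(X[:, 1])                          # make one marginal non-Gaussian

# (1) Estimate marginals: here, empirical CDF values u_i = F_i(x_i) in (0, 1).
def ecdf(col):
    return stats.rankdata(col) / (len(col) + 1)

U = np.column_stack([ecdf(X[:, j]) for j in range(X.shape[1])])

# (2) Estimate the correlation matrix of the normal scores z = Phi^{-1}(u).
Z = stats.norm.ppf(U)
Sigma = np.corrcoef(Z, rowvar=False)

# (3) Sample from the Gaussian copula and back-transform through the
#     inverse marginal CDFs (empirical quantiles here).
Z_new = rng.multivariate_normal(np.zeros(2), Sigma, size=500)
U_new = stats.norm.cdf(Z_new)
X_syn = np.column_stack([np.quantile(X[:, j], U_new[:, j]) for j in range(2)])
```

In practice, the SDV library wraps these steps (including categorical handling) behind its Gaussian Copula synthesizer, which is what our experiments used.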

3.3.3. TVAE (Tabular Variational Autoencoder)

TVAE adapts the Variational Autoencoder framework to tabular data [11]. A VAE consists of an encoder $q_\phi(z|x)$ that maps input data to a latent distribution and a decoder $p_\theta(x|z)$ that reconstructs data from the latent representation. The model is trained by maximizing the Evidence Lower Bound (ELBO):

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{\text{KL}}\big(q_\phi(z|x) \,\|\, p(z)\big)$$

The first term is the reconstruction loss, measuring how well the decoder reconstructs the original input. The second term is the Kullback–Leibler divergence between the approximate posterior $q_\phi(z|x)$ and the prior $p(z)$, typically a standard normal distribution $\mathcal{N}(0, I)$.
For mixed-type tabular data, TVAE employs mode-specific normalization for continuous variables. Given a continuous feature with multiple modes, each value is normalized relative to its closest mode:

$$\alpha_i = \frac{x_i - \eta_k}{\sigma_k}$$

where $\eta_k$ and $\sigma_k$ are the mean and standard deviation of the $k$-th mode, respectively. Internally, TVAE transforms the 19 input features into an expanded representation: each categorical feature is one-hot encoded, and each numerical feature is represented by a mode-normalized value plus mode indicator bits. This internal representation is mapped to a latent space of dimension $d_z = 128$ (the SDV default). The decoder reconstructs the expanded representation from $z \in \mathbb{R}^{128}$, and reverse transformations produce the final 19-dimensional output. The model was trained for 100 epochs on the minority class instances.
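Mode-specific normalization can be illustrated with a simple Gaussian mixture fit (a simplified sketch; the SDV transformer differs in details such as the variational mixture fit and component weighting):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# A bimodal continuous feature, e.g., two distinct behavioral regimes.
x = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 2, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
modes = gmm.predict(x)                               # index k of the closest mode
eta = gmm.means_.ravel()[modes]                      # eta_k: mean of that mode
sigma = np.sqrt(gmm.covariances_.ravel())[modes]     # sigma_k: std of that mode

alpha = (x.ravel() - eta) / sigma                    # alpha_i = (x_i - eta_k) / sigma_k
# Each value is now expressed relative to its own mode; in TVAE/CTGAN the mode
# index k is carried along as one-hot indicator bits in the expanded encoding.
```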

3.3.4. CTGAN (Conditional Tabular GAN)

CTGAN extends Generative Adversarial Networks to tabular data synthesis [11]. A GAN consists of a generator $G$ and a discriminator $D$ trained through adversarial optimization with the objective:

$$\min_G \max_D \; \mathcal{L}_{\text{GAN}} = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

where $p_{\text{data}}$ is the real data distribution and $p_z$ is the prior distribution over latent variables (typically Gaussian noise).
CTGAN incorporates several adaptations for tabular data. First, it uses mode-specific normalization for continuous variables, as in TVAE. Second, it employs a conditional generator that produces samples conditioned on specific categories:

$$\hat{x} = G(z, c)$$

where $c$ is a one-hot encoded condition vector representing the categorical value and $z \in \mathbb{R}^{128}$ is a noise vector sampled from a standard normal distribution (SDV default dimension $d_z = 128$). Third, CTGAN uses training-by-sampling, which draws training batches uniformly across category values to counteract imbalanced categorical features. As in TVAE, CTGAN internally transforms the 19 input features using one-hot encoding for categorical variables and mode-specific normalization for numerical variables; the generator produces this expanded representation, which is then reverse-transformed to yield the final 19-dimensional output. The model was trained for 100 epochs on the minority class instances.
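The construction of the conditional generator input can be sketched in NumPy (illustrative only: the real CTGAN feeds this concatenated vector into a neural network and samples the condition $c$ via training-by-sampling rather than fixing a single category):

```python
import numpy as np

rng = np.random.default_rng(0)
d_z = 128              # noise dimension (SDV default)
n_categories = 4       # size of one categorical column chosen as the condition

def generator_input(category_index, batch_size):
    """Concatenate Gaussian noise z with a one-hot condition vector c."""
    z = rng.standard_normal((batch_size, d_z))
    c = np.zeros((batch_size, n_categories))
    c[:, category_index] = 1.0             # condition every sample on this category
    return np.hstack([z, c])               # generator sees [z; c], shape (batch, d_z + n_categories)

batch = generator_input(category_index=2, batch_size=16)
```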

3.4. Evaluation Metrics

3.4.1. Statistical Fidelity

We evaluate the fidelity of synthetic data through four complementary metrics that assess different aspects of distributional preservation: marginal numerical similarity, categorical distribution similarity, correlation structure preservation, and distributional shape similarity.
Marginal Numerical Similarity. For each numerical feature $f$, we compute a score based on mean and variance matching:

$$S_{\text{num}}^f = 0.5 \times \left(1 - \min\!\left(\frac{|\mu_{\text{orig}}^f - \mu_{\text{syn}}^f|}{\sigma_{\text{orig}}^f + \epsilon},\, 1\right)\right) + 0.5 \times \frac{\min(\sigma_{\text{orig}}^f, \sigma_{\text{syn}}^f)}{\max(\sigma_{\text{orig}}^f, \sigma_{\text{syn}}^f) + \epsilon}$$

where $\mu_{\text{orig}}^f$ and $\mu_{\text{syn}}^f$ are the means of feature $f$ in the original and synthetic data, $\sigma_{\text{orig}}^f$ and $\sigma_{\text{syn}}^f$ are the corresponding standard deviations, and $\epsilon$ is a small constant that prevents division by zero. The overall marginal numerical similarity is:

$$S_{\text{num}} = \frac{1}{|F_{\text{num}}|} \sum_{f \in F_{\text{num}}} S_{\text{num}}^f$$

where $F_{\text{num}}$ is the set of numerical features.
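A minimal NumPy implementation of the per-feature score (a sketch; the function name is ours):

```python
import numpy as np

def marginal_numerical_similarity(orig, syn, eps=1e-8):
    """S_num^f for one numerical feature: mean matching plus variance matching."""
    mu_o, mu_s = orig.mean(), syn.mean()
    sd_o, sd_s = orig.std(), syn.std()
    mean_term = 1.0 - min(abs(mu_o - mu_s) / (sd_o + eps), 1.0)
    var_term = min(sd_o, sd_s) / (max(sd_o, sd_s) + eps)
    return 0.5 * mean_term + 0.5 * var_term

rng = np.random.default_rng(0)
x = rng.normal(10, 2, 5000)
s_same = marginal_numerical_similarity(x, x)         # identical data: close to 1.0
s_shift = marginal_numerical_similarity(x, x + 5.0)  # mean term saturates, variance term remains
```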
Categorical Distribution Similarity. For each categorical feature $c$, we measure the distributional match using the Jensen–Shannon Divergence (JSD), a symmetric measure derived from the Kullback–Leibler divergence:

$$\mathrm{JSD}(P \,\|\, Q) = \frac{1}{2} D_{\text{KL}}(P \,\|\, M) + \frac{1}{2} D_{\text{KL}}(Q \,\|\, M), \qquad M = \frac{1}{2}(P + Q)$$

where $P$ and $Q$ are the category frequency distributions of the original and synthetic data, respectively. The categorical similarity score is:

$$S_{\text{cat}} = \frac{1}{|F_{\text{cat}}|} \sum_{c \in F_{\text{cat}}} \big(1 - \mathrm{JSD}(P_c \,\|\, Q_c)\big)$$

where $F_{\text{cat}}$ is the set of categorical features. A score of 1.0 indicates identical category distributions.
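The per-feature term can be computed as follows (a sketch assuming base-2 logarithms, so that the JSD, and hence the score, is bounded in $[0, 1]$; the logarithm base is an implementation detail not fixed by the definition above):

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs, so JSD lies in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                      # 0 * log(0) terms contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Category frequencies of one feature in original vs. synthetic data.
p = [0.50, 0.30, 0.20]
q = [0.48, 0.32, 0.20]
score = 1.0 - jsd(p, q)                   # near 1.0 for closely matching frequencies
```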
Correlation Structure Preservation. To assess whether inter-variable dependencies are preserved, we compare the correlation matrices of the original and synthetic data using the normalized Frobenius distance:

$$S_{\text{corr}} = 1 - \frac{\|R_{\text{orig}} - R_{\text{syn}}\|_F}{\|R_{\text{orig}}\|_F + \|R_{\text{syn}}\|_F}$$

where $R_{\text{orig}}$ and $R_{\text{syn}}$ are the correlation matrices computed from the original and synthetic data, respectively, and $\|\cdot\|_F$ denotes the Frobenius norm. A score of 1.0 indicates perfect preservation of the correlation structure.
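A direct NumPy sketch of this score (function name ours):

```python
import numpy as np

def correlation_preservation(X_orig, X_syn):
    """S_corr: one minus the normalized Frobenius distance between correlation matrices."""
    R_o = np.corrcoef(X_orig, rowvar=False)
    R_s = np.corrcoef(X_syn, rowvar=False)
    num = np.linalg.norm(R_o - R_s, "fro")
    den = np.linalg.norm(R_o, "fro") + np.linalg.norm(R_s, "fro")
    return 1.0 - num / den

rng = np.random.default_rng(0)
cov = [[1.0, 0.6, 0.2],
       [0.6, 1.0, 0.4],
       [0.2, 0.4, 1.0]]
X = rng.multivariate_normal([0, 0, 0], cov, size=2000)
```

Identical data yields a score of exactly 1.0; destroying one column's dependencies (e.g., by permuting it) lowers the score.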
Kolmogorov–Smirnov (KS) Test Score. To evaluate overall distributional shape beyond summary statistics, we apply the two-sample KS test to each numerical feature. The KS statistic measures the maximum absolute difference between empirical cumulative distribution functions:

$$S_{\text{KS}} = \frac{1}{|F_{\text{num}}|} \sum_{f \in F_{\text{num}}} \big(1 - D_{\text{KS}}^f\big)$$

where $D_{\text{KS}}^f = \sup_x |F_{\text{orig}}^f(x) - F_{\text{syn}}^f(x)|$ is the KS statistic for feature $f$. A score of 1.0 indicates identical distributions.
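Using SciPy's two-sample KS test, the score can be sketched as (function name ours):

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_score(X_orig, X_syn):
    """S_KS: mean of (1 - KS statistic) over numerical features."""
    scores = [1.0 - ks_2samp(X_orig[:, f], X_syn[:, f]).statistic
              for f in range(X_orig.shape[1])]
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X_orig = rng.normal(size=(1000, 3))
X_syn = rng.normal(size=(1000, 3))   # drawn from the same distribution: score near 1.0
```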
Composite Fidelity Score. The overall fidelity is the arithmetic mean of the four individual metrics:

$$S_{\text{composite}} = \frac{1}{4}\big(S_{\text{num}} + S_{\text{cat}} + S_{\text{corr}} + S_{\text{KS}}\big)$$
This multi-dimensional evaluation addresses the key aspects of tabular data fidelity: marginal distributions of numerical features, categorical frequency preservation, inter-variable dependency structure, and overall distributional shape.

3.4.2. Machine Learning Utility

We evaluate classification performance using three models selected to represent distinct learning paradigms: Logistic Regression (LR; a linear, parametric model serving as a baseline), Random Forest (RF; a bagging-based ensemble of decision trees), and Gradient Boosting (GB; a sequential boosting ensemble). This selection was guided by two considerations. First, paradigm diversity: the three classifiers span linear, bagging, and boosting paradigms, which differ in their sensitivity to class distribution and decision boundary geometry, enabling the assessment of whether augmentation effects are consistent across fundamentally different learning strategies. Second, practical prevalence: these classifiers remain among the most widely used in applied machine learning for tabular data [27], making our findings directly applicable to common workflows. While deep neural network classifiers were not included—since our focus is on comparing data augmentation methods rather than classification architectures—we discuss their inclusion as a future direction (Section 6).
Performance is assessed using five standard metrics: accuracy, precision, recall, F1-Score, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC). Given the class imbalance context, we particularly emphasize recall (minority class detection rate) and F1-Score (harmonic mean of precision and recall) as primary evaluation criteria, since overall accuracy can be misleading when classes are imbalanced. The hyperparameter configurations for all methods are detailed in Appendix A, and the formal definitions of evaluation metrics are provided in Appendix B.
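The evaluation loop for one method–classifier pair can be sketched with scikit-learn (toy data standing in for the augmented training sets; classifier hyperparameters mirror Table A2):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Toy imbalanced binary problem standing in for the Bank Marketing task.
X, y = make_classification(n_samples=3000, n_features=19,
                           weights=[0.88, 0.12], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.3, random_state=42)

clf = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]

metrics = {
    "accuracy":  accuracy_score(y_te, y_pred),
    "precision": precision_score(y_te, y_pred),
    "recall":    recall_score(y_te, y_pred),   # minority-class detection rate
    "f1":        f1_score(y_te, y_pred),
    "auc_roc":   roc_auc_score(y_te, y_prob),
}
```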

4. Results

4.1. Class Distribution After Augmentation

Before examining performance metrics, we first analyze the effect of each augmentation method on class distribution. As shown in Figure 2, the original training data (Baseline) exhibits a severe class imbalance with 29,238 majority class instances and only 3712 minority class instances. All augmentation methods successfully balanced the class distribution by generating 25,526 synthetic minority class samples.

4.2. Statistical Fidelity Analysis

Table 4 presents the fidelity evaluation results across four complementary metrics. Figure 3 visualizes the composite fidelity scores.
Gaussian Copula achieved the highest composite fidelity (0.930), demonstrating strong performance across all four dimensions. Its explicit modeling of the correlation structure through the copula function yielded the highest correlation preservation score ($S_{\text{corr}} = 0.958$). Notably, CTGAN ranked second in composite fidelity (0.884), outperforming SMOTE (0.878), primarily due to stronger categorical distribution similarity ($S_{\text{cat}} = 0.896$ vs. 0.858) and correlation preservation ($S_{\text{corr}} = 0.921$ vs. 0.879). SMOTE's interpolation-based approach in the encoded feature space can distort categorical distributions, as intermediate values between encoded categories may not correspond to valid category assignments. TVAE showed the lowest fidelity across most dimensions, consistent with the limited minority class sample size available for deep generative model training.

4.3. Classification Performance

Table 5 presents the classification results as mean ± standard deviation across 10 experimental runs.
Figure 4 visualizes these results through three panels focusing on F1-Score, AUC-ROC, and recall.

4.4. Method-Wise Average Performance

Table 6 presents the average performance metrics across all three classification models, aggregated over 10 experimental runs.
The results indicate that SMOTE achieved the highest average F1-Score (0.437) and recall (0.542) among the tested methods.

4.5. Statistical Significance Analysis

Table 7 presents the results of Wilcoxon signed-rank tests comparing each augmentation method against the baseline condition for the key metrics (recall and F1-Score). The tests were performed using the paired results from 10 experimental runs.
All augmentation methods showed statistically significant improvements over the baseline for both recall and F1-Score ( p < 0.05 ). SMOTE and Gaussian Copula demonstrated the strongest statistical significance ( p < 0.01 ), indicating robust and consistent improvements across experimental runs. These results confirm that the observed performance differences are not due to random variation but represent genuine improvements in minority class detection capability.
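The paired test can be reproduced with SciPy (the recall values below are illustrative placeholders, not the actual experimental results):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired recall values from 10 repeated runs.
baseline_recall  = np.array([0.24, 0.25, 0.26, 0.24, 0.25,
                             0.23, 0.26, 0.25, 0.24, 0.25])
augmented_recall = np.array([0.53, 0.55, 0.52, 0.56, 0.54,
                             0.53, 0.55, 0.52, 0.54, 0.56])

# One-sided Wilcoxon signed-rank test: does augmentation yield higher recall?
stat, p_value = wilcoxon(augmented_recall, baseline_recall, alternative="greater")
```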
Figure 5 quantifies the percentage improvement over the baseline for each augmentation method. SMOTE achieved the largest improvement in recall (+117.6%), more than doubling the minority class detection rate compared to no augmentation. Gaussian Copula followed with +73.4% recall improvement.

4.6. Precision–Recall Trade-Off Analysis

Figure 6 visualizes the precision–recall trade-off by plotting each method–model combination.

5. Discussion

5.1. Trade-Off Between Fidelity and Utility

Our results reveal a notable trade-off between statistical fidelity and classification utility. To explicitly examine this relationship, Table 8 presents a direct comparison of the composite fidelity scores and classification performance metrics for each augmentation method, along with the Spearman rank correlation coefficients.
A key finding emerges from this multi-dimensional analysis: higher composite fidelity does not guarantee better classification performance. Gaussian Copula achieved the highest composite fidelity (0.930) but ranked second in F1-Score and recall. CTGAN ranked second in fidelity (0.884), outperforming SMOTE (0.878) due to stronger categorical and correlation preservation, yet SMOTE achieved the best classification performance. The Spearman correlation coefficients between composite fidelity and F1-Score ($\rho = -0.30$) and recall ($\rho = -0.30$) indicate a weak negative relationship, while AUC-ROC showed a strong positive correlation ($\rho = 1.00$), suggesting that higher-fidelity methods better preserve discriminative ranking capability.
The divergence between the individual fidelity dimensions further enriches this finding. SMOTE showed the weakest categorical distribution similarity ($S_{\text{cat}} = 0.858$) and correlation preservation ($S_{\text{corr}} = 0.879$) among the augmentation methods, yet achieved the highest F1-Score and recall. This can be explained by the methods' differing optimization objectives. SMOTE's interpolation-based approach creates synthetic samples along line segments connecting existing minority class instances, effectively generating data points in decision boundary regions. This behavior relates to the concept of "informative oversampling" discussed by Fernández et al. [9], where synthetic samples that populate the decision boundary can be more beneficial for classifier learning than samples that merely replicate the original distribution.
Conversely, distribution-matching approaches like Gaussian Copula excel at preserving all aspects of the original data characteristics—including marginal distributions, categorical frequencies, and correlation structure—but this faithfulness may not directly benefit classification. The strong correlation preservation of Gaussian Copula ($S_{\text{corr}} = 0.958$) confirms its ability to maintain inter-variable dependencies, which is valuable for applications requiring high data authenticity.
This finding aligns with the evaluation framework proposed by Dankar and Ibrahim [25], who emphasized that synthetic data quality should be assessed relative to intended downstream tasks rather than purely statistical measures. Our multi-dimensional fidelity evaluation provides stronger quantitative evidence for this perspective: even when assessed across marginal distributions, categorical frequencies, inter-variable dependencies, and distributional shapes, methods with higher overall fidelity did not consistently yield better classification performance.

5.2. Deep Learning Methods Performance

TVAE and CTGAN showed lower performance compared to simpler methods under the default configurations used in our experiments. The minority class contained only 3712 samples, which appears insufficient for deep generative models to learn effective latent representations. For context, Xu et al. [11] evaluated CTGAN and TVAE on datasets ranging from 7043 to 581,012 samples. Shwartz-Ziv and Armon [27] demonstrated that deep learning models for tabular data typically require datasets exceeding 10,000 samples.
As noted in Section 3.2.1, our comparison employed default hyperparameter configurations with a fixed computational budget, which represents a practical but constrained evaluation setting. Table 9 summarizes the key hyperparameters of TVAE and CTGAN, their default values used in our experiments, and the potential impact of tuning based on the existing literature.
The default configurations represent a common practical scenario: practitioners adopting these methods without dedicated tuning resources. While comprehensive hyperparameter optimization (estimated at 50–100× the base training time) could potentially improve deep model performance, such optimization was beyond the scope of this comparison. Importantly, this practical constraint itself is a relevant finding—it highlights that simpler methods like SMOTE offer competitive or superior performance “out of the box,” which is a meaningful advantage for resource-constrained applications. These findings align with the recent literature suggesting that deep generative models require careful tuning and sufficient data [20].

5.3. Evaluation Metrics and Operational Considerations

Our evaluation employed AUC-ROC as the primary ranking metric, which measures the classifier’s ability to discriminate between classes across all possible decision thresholds. While AUC-ROC is widely used and enables comparison with prior studies, it has a known limitation in the class imbalance context: the false positive rate (FPR) component can be dominated by the large number of true negatives, potentially painting an overly optimistic picture of classifier performance [1].
The precision–recall AUC (PR-AUC) has been recommended as a more informative alternative for imbalanced datasets, as it focuses exclusively on the positive (minority) class and is not influenced by the number of true negatives. In our experimental setting, the emphasis on recall and F1-Score partially addresses this concern, since both metrics are precision–recall-based and do not involve true negatives. The precision–recall trade-off analysis (Figure 6) further provides a visual assessment analogous to PR-AUC by plotting the precision–recall operating points for each method–classifier combination. Nevertheless, formal PR-AUC computation and threshold optimization strategies (e.g., cost-sensitive thresholding, F-beta optimization) would provide additional operational insights and are identified as future work directions.
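As a sketch of the proposed extension, PR-AUC (average precision) is straightforward to compute with scikit-learn (synthetic scores below, not our experimental outputs):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
# Imbalanced labels (~10% positives) and scores with moderate separation.
y_true = (rng.random(5000) < 0.1).astype(int)
scores = rng.normal(0, 1, 5000) + 1.5 * y_true

pr_auc = average_precision_score(y_true, scores)   # PR-AUC (average precision)
roc_auc = roc_auc_score(y_true, scores)

# Under imbalance, PR-AUC typically sits well below ROC-AUC for the same scores,
# because its chance level is the positive-class prevalence rather than 0.5.
```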

5.4. Model-Specific Observations

The three classification models exhibited distinct response patterns to data augmentation. Logistic Regression showed notable improvement, with recall increasing from 0.206 in the baseline to 0.654 when trained on Gaussian Copula-augmented data—a more than threefold increase.
Random Forest demonstrated robust and consistent performance across all augmentation methods. Gradient Boosting achieved the highest AUC-ROC scores in most configurations, indicating better ranking capability.

5.5. Practical Implications and Decision Framework

Based on the experimental findings and the method taxonomy presented in Section 2.6, we propose a structured decision framework for practitioners selecting synthetic data augmentation strategies. Table 10 maps common application scenarios to recommended methods with supporting evidence from our experiments.
Several cross-cutting considerations inform method selection beyond the primary application scenario. First, regarding feature composition: when the dataset contains predominantly categorical features, SMOTE's interpolation in encoded space can produce invalid intermediate values, making CTGAN or Gaussian Copula preferable. Our categorical distribution analysis confirmed this limitation ($S_{\text{cat}}$: SMOTE = 0.858 vs. CTGAN = 0.896). Second, regarding inter-variable dependencies: if preserving the correlation structure is critical (e.g., for downstream causal analysis), Gaussian Copula's explicit copula-based modeling ($S_{\text{corr}} = 0.958$) offers a clear advantage over interpolation-based approaches ($S_{\text{corr}} = 0.879$). Third, regarding scalability: SMOTE's computational cost grows with sample size, while deep generative models incur a fixed training overhead but offer constant-time sampling, making them more efficient for generating very large synthetic datasets once trained.

5.6. Limitations

This study has several limitations that should be considered when interpreting the results.
First, the evaluation was conducted on a single dataset from the banking domain. However, the Bank Marketing dataset exhibits properties representative of many real-world imbalanced classification problems: a moderate imbalance ratio (7.88:1), which falls within the commonly encountered range of 5:1 to 100:1 [1], and a mixture of nine numerical and ten categorical features typical of business datasets. Validation on datasets from the healthcare, cybersecurity, and manufacturing domains would strengthen the generalizability of the findings.
Second, all methods were evaluated under default hyperparameter configurations with a fixed computational budget, as detailed in Section 3.2.1. While this reflects a practical scenario for resource-constrained practitioners, it means that the reported performance of deep generative models (TVAE, CTGAN) represents a lower bound. Comprehensive hyperparameter optimization—including tuning of latent dimensions, learning rates, network architectures, and training schedules—could potentially improve their performance, though at significant computational cost (estimated 50–100× the base training time). The sensitivity analysis in Section 5.2 identifies key hyperparameters and their expected impact.
Third, recent generative approaches—including diffusion-based models (TabDDPM [21], TabSyn [22]) and LLM-based generation (GReaT [23])—were not included in the experimental comparison, as discussed in Section 3.3. Their exclusion limits the timeliness of the benchmark; however, these methods’ data requirements and computational demands exceed the constraints of our experimental setting. As these methods mature and become available through stable frameworks, their inclusion in comparative benchmarks will be essential.
Fourth, the classification evaluation employed three traditional machine learning classifiers (Logistic Regression, Random Forest, Gradient Boosting). While these represent distinct learning paradigms (linear, bagging, boosting), the inclusion of deep neural network classifiers (e.g., multi-layer perceptrons, Tabular Network (TabNet)) and other modern approaches would broaden the assessment of how augmentation methods interact with different classifier architectures.
Fifth, our evaluation metrics focused on classification accuracy, F1-Score, recall, and AUC-ROC. While the precision–recall trade-off analysis (Figure 6) provides visual assessment of operating points, formal PR-AUC computation and threshold optimization strategies (e.g., cost-sensitive thresholding) were not performed. PR-AUC is particularly informative for imbalanced datasets as it is not affected by true negatives, and its inclusion would strengthen the operational analysis of augmentation benefits.
Sixth, all experiments were conducted at a single imbalance ratio (7.88:1). At higher imbalance ratios (e.g., >20:1), the conclusions of this study may shift in several ways. The minority class sample size would decrease substantially (e.g., at 50:1, fewer than 600 minority samples), which would further constrain deep generative models that already struggled with 3712 samples, likely widening the performance gap in favor of SMOTE. However, extreme imbalance also increases the risk of SMOTE generating synthetic samples in overlapping class boundary regions, potentially degrading precision. Gaussian Copula’s correlation matrix estimation would become less reliable with fewer samples, potentially reducing its fidelity advantage. Conversely, at milder imbalance ratios (e.g., 3:1), deep generative models would have access to substantially more minority samples, potentially reversing the performance rankings observed in this study.
Seventh, our evaluation focused on classification performance and statistical fidelity. Additional considerations such as privacy preservation (resistance to membership inference attacks), computational cost profiling, and sample diversity were not systematically evaluated.

6. Conclusions

Class imbalance remains a common challenge in machine learning applications, particularly in domains where minority class instances represent important events requiring accurate detection. This study addressed the question of how different synthetic data generation approaches compare in their ability to improve classification performance on imbalanced tabular data, providing empirical observations and practical reference points for method selection.
We conducted a comparative analysis of four synthetic data generation methods—SMOTE, Gaussian Copula, TVAE, and CTGAN—using the UCI Bank Marketing dataset as a case of moderate class imbalance (7.88:1 ratio). Our evaluation framework assessed each method across three complementary dimensions.
The experimental results, validated through statistical significance testing with 10 repeated experiments, yielded several findings. First, all augmentation methods showed statistically significant improvements ( p < 0.05 ) in minority class detection compared to the baseline, with recall improvements ranging from 58% (TVAE) to 118% (SMOTE). Second, SMOTE achieved the highest average F1-Score (0.437 ± 0.013) and recall (0.542 ± 0.022), suggesting it as a suitable choice when minority class detection is the primary objective. Third, Gaussian Copula achieved the highest composite fidelity score (0.930), demonstrating strong preservation of marginal distributions, categorical frequencies, and inter-variable dependencies, making it a suitable choice when data authenticity is important. Fourth, deep learning-based methods showed modest but statistically significant improvements, though they underperformed compared to simpler approaches in our experimental setting. Notably, CTGAN demonstrated stronger categorical distribution and correlation preservation than SMOTE, despite lower classification performance.
Beyond these empirical findings, we highlight several insights that may inform future research and practice. Our multi-dimensional fidelity evaluation—encompassing marginal numerical similarity, categorical distribution similarity, correlation structure preservation, and KS test scores—revealed a weak negative correlation ($\rho = -0.30$) between composite fidelity and classification performance, providing stronger evidence that high-fidelity distribution matching does not necessarily translate to better downstream task performance. This trade-off has important implications for method selection based on application requirements.
The findings carry implications primarily for practitioners working with moderately imbalanced tabular datasets (imbalance ratios in the range of approximately 5:1 to 10:1) where the minority class contains several thousand samples. Under these conditions, for applications prioritizing minority class detection, SMOTE combined with Gradient Boosting may offer a practical and computationally efficient option under default configurations. For applications requiring high data fidelity, Gaussian Copula may provide a suitable balance. Deep learning-based methods (TVAE, CTGAN), while showing lower performance under default settings in our experiments, may achieve improved results with comprehensive hyperparameter optimization and larger training sets, and should not be dismissed based solely on default configuration comparisons. For datasets with extreme imbalance ratios (>20:1), very small minority classes (<500 samples), or fundamentally different feature compositions, the relative effectiveness of these methods may differ from the patterns observed in this study.
Future work should extend this comparative framework in several directions. First, validation on multiple datasets spanning diverse domains (e.g., healthcare, cybersecurity, manufacturing) with varying feature compositions, sample sizes, and class distributions would strengthen the generalizability of the findings and enable the identification of domain-specific patterns in augmentation effectiveness. Second, systematic evaluation across a range of imbalance ratios (e.g., 3:1, 10:1, 50:1, 100:1) would reveal how the relative effectiveness of augmentation methods changes with imbalance severity, providing more nuanced guidance for practitioners. Third, the inclusion of recent generative approaches—particularly diffusion-based models (TabDDPM, TabSyn) and LLM-based generators (GReaT)—would enhance the timeliness and comprehensiveness of the comparison as these methods become available through stable frameworks. Fourth, broadening the classifier pool to include deep neural networks (e.g., multi-layer perceptrons, TabNet) and incorporating precision–recall AUC (PR-AUC) as an evaluation metric alongside threshold optimization strategies would provide a more complete picture of augmentation benefits across different operational settings. Fifth, systematic hyperparameter optimization studies for deep generative models would help establish the performance ceiling of these methods on small minority class samples. Finally, the integration of privacy-preserving mechanisms with synthetic data generation represents an increasingly important research direction.

Author Contributions

Conceptualization, K.-S.S. and D.-H.W.; methodology, K.-S.S. and S.Y.; software, S.Y.; validation, K.-S.S. and D.-H.W.; formal analysis, K.-S.S. and D.-H.W.; investigation, K.-S.S. and D.-H.W.; resources, K.-S.S.; data curation, K.-S.S.; writing—original draft preparation, D.-H.W.; writing—review and editing, K.-S.S. and D.-H.W.; visualization, S.Y.; supervision, K.-S.S.; project administration, S.Y.; funding acquisition, D.-H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by Wonkwang University in 2023.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Bank Marketing dataset is publicly available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/222/bank+marketing (accessed on 10 January 2026). The experimental code is available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADASYN: Adaptive Synthetic Sampling
Adam: Adaptive Moment Estimation
AUC-ROC: Area Under the Receiver Operating Characteristic Curve
CDF: Cumulative Distribution Function
CTGAN: Conditional Tabular Generative Adversarial Network
ELBO: Evidence Lower Bound
FPR: False Positive Rate
GAN: Generative Adversarial Network
GB: Gradient Boosting
GPT-2: Generative Pre-trained Transformer 2
JSD: Jensen–Shannon Divergence
KS: Kolmogorov–Smirnov
LeakyReLU: Leaky Rectified Linear Unit
LLM: Large Language Model
LR: Logistic Regression
PR-AUC: Precision–Recall Area Under Curve
ReLU: Rectified Linear Unit
RF: Random Forest
SDV: Synthetic Data Vault
SMOTE: Synthetic Minority Over-sampling Technique
TPR: True Positive Rate
TVAE: Tabular Variational Autoencoder
UCI: University of California Irvine
VAE: Variational Autoencoder

Appendix A. Experimental Configuration

Table A1 lists the hyperparameters for each synthetic data generation method. Activation functions include Rectified Linear Unit (ReLU) and Leaky Rectified Linear Unit (LeakyReLU). All deep generative models use the Adaptive Moment Estimation (Adam) optimizer.
Table A1. Hyperparameters for synthetic data generation methods.

| Method | Parameter | Value |
|---|---|---|
| SMOTE | k_neighbors | 5 |
| SMOTE | sampling_strategy | 'auto' (balance classes) |
| SMOTE | input_dimension | 19 (label-encoded) |
| Gaussian Copula | default_distribution | 'beta' |
| Gaussian Copula | input_dimension | 19 (no expansion) |
| TVAE | epochs | 100 |
| TVAE | batch_size | 500 |
| TVAE | embedding_dim (latent) | 128 |
| TVAE | compress_dims | (128, 128) |
| TVAE | decompress_dims | (128, 128) |
| TVAE | optimizer | Adam |
| TVAE | learning_rate | 1 × 10⁻³ |
| TVAE | activation | ReLU |
| CTGAN | epochs | 100 |
| CTGAN | batch_size | 500 |
| CTGAN | embedding_dim (noise) | 128 |
| CTGAN | generator_dim | (256, 256) |
| CTGAN | discriminator_dim | (256, 256) |
| CTGAN | optimizer | Adam (β1 = 0.5, β2 = 0.9) |
| CTGAN | learning_rate | 2 × 10⁻⁴ |
| CTGAN | generator activation | ReLU |
| CTGAN | discriminator activation | LeakyReLU (slope = 0.2) |
| CTGAN | discriminator_steps | 1 |
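The SMOTE configuration (k_neighbors = 5) amounts to linear interpolation between a minority point and one of its five nearest neighbors. The following is a didactic pure-Python sketch of that core mechanism, not the imbalanced-learn implementation used in the experiments; the function and variable names are ours.

```python
import math
import random

def smote_sample(minority, k=5, n_new=1, seed=42):
    """Generate synthetic minority-class points by linearly interpolating
    between a random minority point and one of its k nearest neighbours
    (the core SMOTE idea; a simplified sketch, not imbalanced-learn)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by Euclidean distance, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because each synthetic point lies on a segment between two real minority points, the output always stays inside the convex hull of the minority class, which is exactly the "feature space convexity" assumption listed in Table 1.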
Table A2. Machine learning model hyperparameters.

| Model | Parameter | Value |
|---|---|---|
| Logistic Regression | max_iter | 1000 |
| Logistic Regression | random_state | 42 |
| Random Forest | n_estimators | 100 |
| Random Forest | random_state | 42 |
| Gradient Boosting | n_estimators | 100 |
| Gradient Boosting | random_state | 42 |

Appendix B. Evaluation Metrics Definitions

This appendix provides the formal definitions of the classification performance metrics. Let T P , T N , F P , and F N denote true positives, true negatives, false positives, and false negatives, respectively.
Accuracy measures the overall proportion of correct predictions:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Precision measures the proportion of predicted positive instances that are actually positive:
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Recall measures the proportion of actual positive instances that are correctly identified:
\[ \text{Recall} = \frac{TP}{TP + FN} \]
F1-Score is the harmonic mean of precision and recall:
\[ F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN} \]
AUC-ROC quantifies the model’s ability to discriminate between classes. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR):
\[ TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN} \]
The AUC is computed as:
\[ AUC = \int_0^1 TPR\big(FPR^{-1}(t)\big)\, dt \]
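The definitions above translate directly into code. Below is a minimal sketch with our own helper names (not from the paper's codebase) that computes the count-based metrics from a confusion matrix, plus ROC AUC via its equivalent rank-statistic (Mann–Whitney) form: the probability that a randomly chosen positive is scored above a randomly chosen negative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Metrics from confusion-matrix counts, as defined in Appendix B."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to TPR
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
    }

def auc_score(labels, scores):
    """ROC AUC as P(score of random positive > score of random negative),
    counting ties as one half; equivalent to the integral form for
    finite samples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The harmonic-mean and count-based forms of F1 agree by construction, which makes a convenient self-check when implementing these metrics by hand.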

References

1. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
2. Japkowicz, N.; Stephen, S. The Class Imbalance Problem: A Systematic Study. Intell. Data Anal. 2002, 6, 429–449.
3. Wang, X.; Jiang, H.; Zeng, T.; Dong, Y. An Adaptive Fused Domain-Cycling Variational Generative Adversarial Network for Machine Fault Diagnosis under Data Scarcity. Inf. Fusion 2025, 126, 103616.
4. Yan, J.; Cheng, Y.; Zhang, F.; Li, M.; Zhou, N.; Jin, B.; Wang, H.; Yang, H.; Zhang, W. Research on Multimodal Techniques for Arc Detection in Railway Systems with Limited Data. Struct. Health Monit. 2025, in press.
5. Wang, X.; Jiang, H.; Mu, M.; Dong, Y. A Dynamic Collaborative Adversarial Domain Adaptation Network for Unsupervised Rotating Machinery Fault Diagnosis. Reliab. Eng. Syst. Saf. 2024, 255, 110662.
6. Provost, F. Machine Learning from Imbalanced Data Sets 101. In Proceedings of the AAAI Workshop on Imbalanced Data Sets, Austin, TX, USA, 31 July 2000; Volume 68, pp. 1–3.
7. Krawczyk, B. Learning from Imbalanced Data: Open Challenges and Future Directions. Prog. Artif. Intell. 2016, 5, 221–232.
8. Moro, S.; Cortez, P.; Rita, P. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decis. Support Syst. 2014, 62, 22–31.
9. Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. J. Artif. Intell. Res. 2018, 61, 863–905.
10. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
11. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling Tabular Data Using Conditional GAN. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32, pp. 7335–7345.
12. Chawla, N.V.; Japkowicz, N.; Kotcz, A. Editorial: Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explor. Newsl. 2004, 6, 1–6.
13. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887.
14. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the IEEE International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1322–1328.
15. Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29.
16. Patki, N.; Wedge, R.; Veeramachaneni, K. The Synthetic Data Vault. In Proceedings of the IEEE International Conference on Data Science and Advanced Analytics (DSAA), Montreal, QC, Canada, 17–19 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 399–410.
17. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014.
18. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27, pp. 2672–2680.
19. Zhao, Z.; Kunar, A.; Birke, R.; Van der Scheer, H.; Chen, L.Y. CTAB-GAN+: Enhancing Tabular Data Synthesis. Front. Big Data 2024, 6, 1296508.
20. Borisov, V.; Leemann, T.; Seßler, K.; Haug, J.; Pawelczyk, M.; Kasneci, G. Deep Neural Networks and Tabular Data: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, in press.
21. Kotelnikov, A.; Baranchuk, D.; Rubachev, I.; Babenko, A. TabDDPM: Modelling Tabular Data with Diffusion Models. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; PMLR: Cambridge, MA, USA, 2023; Volume 202, pp. 17564–17579.
22. Zhang, H.; Zhang, J.; Srinivasan, B.; Shen, Z.; Qin, X.; Faloutsos, C.; Rangwala, H.; Karypis, G. Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
23. Borisov, V.; Seßler, K.; Leemann, T.; Pawelczyk, M.; Kasneci, G. Language Models are Realistic Tabular Data Generators. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
24. Jordon, J.; Yoon, J.; Van Der Schaar, M. PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
25. Dankar, F.K.; Ibrahim, M. Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation. Appl. Sci. 2021, 11, 2158.
26. Moro, S.; Rita, P.; Cortez, P. Bank Marketing [Dataset]. UCI Machine Learning Repository, 2014. Available online: https://www.archive.ics.uci.edu/dataset/222/bank+marketing (accessed on 22 January 2026).
27. Shwartz-Ziv, R.; Armon, A. Tabular Data: Deep Learning is Not All You Need. Inf. Fusion 2022, 81, 84–90.
Figure 1. Experimental workflow for synthetic data augmentation comparison.
Figure 2. Class distribution comparison across augmentation methods.
Figure 3. Statistical fidelity scores measuring the distributional quality of synthetic data.
Figure 4. Performance comparison across augmentation methods and classification models.
Figure 5. Percentage improvement over baseline for each augmentation method.
Figure 6. Precision–recall trade-off visualization across all augmentation methods.
Table 1. Comparative taxonomy of synthetic data generation methods.

| Criterion | SMOTE | Gaussian Copula | TVAE | CTGAN |
|---|---|---|---|---|
| Category | Interpolation-based | Statistical model | Deep generative (VAE) | Deep generative (GAN) |
| Core assumption | Feature space convexity; linear interpolation produces valid samples | Gaussian dependency structure; Sklar's decomposition of joint distribution | Continuous latent space captures data manifold | Adversarial equilibrium; generator learns data distribution |
| Strengths | Simple, fast, effective with small data; no distributional assumptions | Interpretable parameters; explicit correlation modeling; handles mixed types | Learns non-linear relationships; probabilistic framework | Strong categorical handling; conditional generation; captures complex dependencies |
| Limitations | Distorts categorical distributions; does not preserve global correlation structure | Limited to linear dependencies; cannot capture complex non-linear interactions | Sensitive to latent dimension; potential posterior collapse | Training instability; mode collapse risk; hyperparameter sensitive |
| Min. data requirement | Hundreds of samples | Thousands of samples | >10,000 samples | >10,000 samples |
| Failure cases | Multimodal distributions; overlapping class boundaries; high-dimensional sparse data | Complex non-linear feature interactions; highly skewed marginals | Insufficient training data; high-cardinality categorical features | Small datasets; mode collapse; imbalanced categorical features |
| Computational cost | Low (seconds) | Moderate (minutes) | High (hours, GPU) | High (hours, GPU) |
| Feature type support | Primarily numerical (categorical via encoding) | Native mixed-type support | Mixed-type via mode-specific normalization | Mixed-type via conditional generator |
Table 2. Dataset characteristics.

| Characteristic | Value |
|---|---|
| Total instances | 41,188 |
| Features | 19 (after removing 'duration') |
| Numerical features | 9 |
| Categorical features | 10 |
| Target variable | Binary (yes/no) |
| Class distribution | No: 36,548 (88.7%), Yes: 4640 (11.3%) |
| Imbalance ratio | 7.88:1 |
Table 3. Feature dimensionality at each pipeline stage.

| Method | Input | Internal Representation | Latent/Noise | Output |
|---|---|---|---|---|
| SMOTE | 19 (label-encoded) | 19 (no transformation) | N/A | 19 |
| Gaussian Copula | 19 (label-encoded) | 19 (uniform marginals via CDF transform) | N/A | 19 |
| TVAE | 19 (label-encoded) | Expanded (one-hot + mode-specific normalization) | 128 | 19 |
| CTGAN | 19 (label-encoded) | Expanded (one-hot + mode-specific normalization) | 128 (noise) + conditional vector | 19 |
Table 4. Multi-dimensional fidelity evaluation of synthetic data.

| Method | S_num | S_cat | S_corr | S_KS | Composite |
|---|---|---|---|---|---|
| Baseline (Original) | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Gaussian Copula | 0.943 | 0.915 | 0.958 | 0.905 | 0.930 |
| CTGAN | 0.881 | 0.896 | 0.921 | 0.838 | 0.884 |
| SMOTE | 0.906 | 0.858 | 0.879 | 0.867 | 0.878 |
| TVAE | 0.789 | 0.847 | 0.897 | 0.811 | 0.836 |
Table 5. Classification performance across augmentation methods and models (mean ± std over 10 runs).

| Method | Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|---|
| Baseline | LR | 0.901 ± 0.002 | 0.702 ± 0.015 | 0.206 ± 0.012 | 0.318 ± 0.011 | 0.796 ± 0.008 |
| Baseline | RF | 0.898 ± 0.003 | 0.588 ± 0.021 | 0.307 ± 0.018 | 0.403 ± 0.014 | 0.787 ± 0.009 |
| Baseline | GB | 0.902 ± 0.002 | 0.696 ± 0.018 | 0.234 ± 0.014 | 0.350 ± 0.012 | 0.809 ± 0.007 |
| SMOTE | LR | 0.796 ± 0.008 | 0.299 ± 0.012 | 0.601 ± 0.021 | 0.399 ± 0.010 | 0.759 ± 0.011 |
| SMOTE | RF | 0.880 ± 0.005 | 0.463 ± 0.019 | 0.426 ± 0.024 | 0.444 ± 0.016 | 0.772 ± 0.010 |
| SMOTE | GB | 0.847 ± 0.007 | 0.385 ± 0.016 | 0.598 ± 0.022 | 0.468 ± 0.013 | 0.784 ± 0.009 |
| GaussianCopula | LR | 0.807 ± 0.011 | 0.324 ± 0.018 | 0.654 ± 0.028 | 0.433 ± 0.015 | 0.786 ± 0.012 |
| GaussianCopula | RF | 0.895 ± 0.004 | 0.566 ± 0.022 | 0.306 ± 0.021 | 0.397 ± 0.017 | 0.780 ± 0.011 |
| GaussianCopula | GB | 0.897 ± 0.004 | 0.574 ± 0.020 | 0.335 ± 0.019 | 0.423 ± 0.015 | 0.803 ± 0.008 |
| TVAE | LR | 0.799 ± 0.015 | 0.278 ± 0.021 | 0.491 ± 0.035 | 0.355 ± 0.019 | 0.726 ± 0.018 |
| TVAE | RF | 0.896 ± 0.005 | 0.565 ± 0.024 | 0.314 ± 0.026 | 0.403 ± 0.020 | 0.780 ± 0.013 |
| TVAE | GB | 0.889 ± 0.006 | 0.507 ± 0.023 | 0.376 ± 0.028 | 0.432 ± 0.018 | 0.784 ± 0.011 |
| CTGAN | LR | 0.808 ± 0.013 | 0.320 ± 0.019 | 0.623 ± 0.031 | 0.423 ± 0.016 | 0.774 ± 0.014 |
| CTGAN | RF | 0.898 ± 0.004 | 0.586 ± 0.023 | 0.313 ± 0.024 | 0.408 ± 0.018 | 0.785 ± 0.012 |
| CTGAN | GB | 0.901 ± 0.003 | 0.629 ± 0.021 | 0.287 ± 0.020 | 0.394 ± 0.016 | 0.804 ± 0.009 |
Table 6. Average performance metrics by augmentation method (mean ± std).

| Method | F1-Score | AUC-ROC | Recall |
|---|---|---|---|
| Baseline | 0.357 ± 0.012 | 0.797 ± 0.008 | 0.249 ± 0.015 |
| SMOTE | 0.437 ± 0.013 | 0.772 ± 0.010 | 0.542 ± 0.022 |
| GaussianCopula | 0.418 ± 0.016 | 0.790 ± 0.010 | 0.432 ± 0.023 |
| CTGAN | 0.408 ± 0.017 | 0.788 ± 0.012 | 0.407 ± 0.025 |
| TVAE | 0.397 ± 0.019 | 0.763 ± 0.014 | 0.394 ± 0.030 |
Table 7. Statistical significance of improvements over Baseline (Wilcoxon signed-rank test).

| Method | Metric | Δ Mean | p-Value | Significant |
|---|---|---|---|---|
| SMOTE | Recall | +0.293 | 0.002 | Yes (p < 0.01) |
| SMOTE | F1-Score | +0.080 | 0.004 | Yes (p < 0.01) |
| GaussianCopula | Recall | +0.183 | 0.004 | Yes (p < 0.01) |
| GaussianCopula | F1-Score | +0.061 | 0.006 | Yes (p < 0.01) |
| CTGAN | Recall | +0.158 | 0.006 | Yes (p < 0.01) |
| CTGAN | F1-Score | +0.051 | 0.012 | Yes (p < 0.05) |
| TVAE | Recall | +0.145 | 0.010 | Yes (p < 0.05) |
| TVAE | F1-Score | +0.040 | 0.027 | Yes (p < 0.05) |
Table 8. Relationship between composite fidelity and classification performance. Ranks in parentheses include the Baseline; the Spearman ρ row reports the rank correlation between composite fidelity and each performance metric.

| Method | Composite Fidelity | Avg F1 | Avg Recall | Avg AUC-ROC |
|---|---|---|---|---|
| Gaussian Copula | 0.930 (1st) | 0.418 (2nd) | 0.432 (2nd) | 0.790 (2nd) |
| CTGAN | 0.884 (2nd) | 0.408 (3rd) | 0.407 (3rd) | 0.788 (3rd) |
| SMOTE | 0.878 (3rd) | 0.437 (1st) | 0.542 (1st) | 0.772 (4th) |
| TVAE | 0.836 (4th) | 0.397 (4th) | 0.394 (4th) | 0.763 (5th) |
| Spearman ρ | | −0.30 | −0.30 | 1.00 |
Table 9. Hyperparameter sensitivity analysis for deep generative models.

| Hyperparameter | Default | Tuning Range | Expected Impact |
|---|---|---|---|
| Epochs | 100 | 300–1000 | Extended training may improve convergence; however, our loss monitoring indicated plateau before epoch 100 |
| Latent dimension | 128 | 32–256 | Smaller dimensions may reduce overfitting with limited data; larger dimensions increase expressiveness |
| Batch size | 500 | 100–1000 | Smaller batches may improve generalization but increase training time and variance |
| Learning rate | 2 × 10⁻⁴ | 10⁻⁵–10⁻³ | Lower rates may enable finer convergence; higher rates risk training instability |
| Generator layers | 2 × 256 | 1–4 layers, 128–512 units | Deeper/wider networks increase model capacity but require more data to avoid overfitting |
| Discriminator steps (CTGAN) | 1 | 1–5 | More discriminator updates per generator step may improve training stability [11] |
Table 10. Decision framework for selecting synthetic data augmentation methods.

| Application Scenario | Recommended Method | Rationale and Evidence |
|---|---|---|
| Minority class detection is the primary objective | SMOTE + Gradient Boosting | Highest F1-Score (0.468) and recall (0.598); statistically significant (p < 0.01) |
| High data fidelity required (e.g., regulatory compliance, exploratory analysis) | Gaussian Copula | Highest composite fidelity (0.930); strongest correlation preservation (S_corr = 0.958) |
| Dataset with many categorical features | CTGAN or Gaussian Copula | CTGAN: S_cat = 0.896; Gaussian Copula: S_cat = 0.915. SMOTE distorts categorical distributions (S_cat = 0.858) |
| Very small minority class (<1000 samples) | SMOTE | No minimum data threshold; operates with as few as k + 1 samples per class |
| Large minority class (>10,000 samples) with GPU resources | TVAE or CTGAN | Deep models can learn complex non-linear relationships with sufficient data; potential for higher fidelity |
| Rapid prototyping or resource-constrained deployment | SMOTE | Computational cost in seconds; no GPU required; no hyperparameter tuning needed |
| Balanced fidelity and utility needed | Gaussian Copula | Competitive F1-Score (0.418) with highest fidelity (0.930); interpretable generation parameters |
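Table 10's scenarios can be distilled into a simple rule function. The sketch below is our own illustrative encoding (the paper does not publish such a helper); the flag names and tie-breaking order are assumptions, with rules checked in roughly the order the scenarios appear in the table.

```python
def recommend_method(minority_size, mostly_categorical=False,
                     fidelity_critical=False, gpu_available=False):
    """Rule-of-thumb generator selection distilled from Table 10.
    Illustrative only; thresholds come from the table's scenarios."""
    if minority_size < 1000:
        return "SMOTE"                      # works with as few as k + 1 samples
    if fidelity_critical:
        return "Gaussian Copula"            # highest composite fidelity (0.930)
    if minority_size > 10000 and gpu_available:
        return "TVAE or CTGAN"              # deep models need data and compute
    if mostly_categorical:
        return "CTGAN or Gaussian Copula"   # SMOTE distorts categorical columns
    return "SMOTE"                          # best minority recall / F1 overall
```

A real deployment would weigh these factors jointly rather than short-circuiting on the first match, but the function makes the table's priorities explicit and easy to audit.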
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Won, D.-H.; Shin, K.-S.; Youm, S. Synthetic Data Augmentation for Imbalanced Tabular Data: A Comparative Study of Generation Methods. Electronics 2026, 15, 883. https://doi.org/10.3390/electronics15040883

