Article

Privacy-Aware Table Data Generation by Adversarial Gradient Boosting Decision Tree

1
Graduate School of Advanced Science and Engineering, Hiroshima University, Kagamiyama 1-7-1, Higashi-Hiroshima 739-8521, Japan
2
Department of Computer Science and Engineering, Khulna University of Engineering and Technology, Khulna 9203, Bangladesh
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(15), 2509; https://doi.org/10.3390/math13152509
Submission received: 7 July 2025 / Revised: 23 July 2025 / Accepted: 30 July 2025 / Published: 4 August 2025

Abstract

Privacy preservation poses significant challenges in third-party data sharing, particularly when handling table data containing personal information such as demographic and behavioral records. Synthetic table data generation has emerged as a promising solution to enable data analysis while mitigating privacy risks. While Generative Adversarial Networks (GANs) are widely used for this purpose, they exhibit limitations in modeling table data due to challenges in handling mixed data types (numerical/categorical), non-Gaussian distributions, and imbalanced variables. To address these limitations, this study proposes a novel adversarial learning framework integrating gradient boosting trees for synthesizing table data, called Adversarial Gradient Boosting Decision Tree (AGBDT). Experimental evaluations on several datasets demonstrate that our method outperforms representative baseline models regarding statistical similarity and machine learning utility. Furthermore, we introduce a privacy-aware adaptation of the framework by incorporating k-anonymization constraints, effectively reducing overfitting to source data while maintaining practical usability. The results validate the balance between data utility and privacy preservation achieved by our approach.

1. Introduction

The rapid advancement of information technology has enabled large-scale data collection across sectors such as healthcare, finance, and public administration. While the rise of deep learning has significantly improved data-driven decision-making, most applications remain confined within organizational boundaries. Expanding data sharing with external parties holds great promise for value creation but introduces serious privacy risks.
Many real-world datasets contain Personally Identifiable Information (PII) such as names, genders, ages, and locations [1]. Even after removing direct identifiers, quasi-identifiers (e.g., age, gender, zip code) can still enable re-identification. For instance, Sweeney demonstrated that anonymized medical records could be cross-linked with voter registrations to re-identify individuals [2], while Narayanan and Shmatikov showed that movie ratings can be deanonymized via statistical matching [3]. These privacy risks have prompted strict legal regulations such as the EU’s General Data Protection Regulation (GDPR) (https://gdpr-info.eu/, accessed on 20 July 2025) and the California Consumer Privacy Act (CCPA) (https://oag.ca.gov/privacy/ccpa, accessed on 20 July 2025).
Traditional anonymization techniques like k-anonymity [2], l-diversity [4], and t-closeness [5] offer some protection but often sacrifice data utility [6]. Synthetic data generation has emerged as an alternative, aiming to generate artificial datasets that retain statistical fidelity without linking to real individuals. Recent research [7] extended and evaluated synthetic data generation for longitudinal cohorts using statistical methods grounded in k-anonymity and l-diversity, ensuring privacy protection and analytical reproducibility. Generative Adversarial Networks (GANs) [8] have shown promise in synthetic data generation by training a generator–discriminator pair adversarially [9,10]. However, applying GANs to tabular data introduces several challenges, such as mixed data types, imbalanced variables, and non-Gaussian distributions.
To address these challenges, deep learning-based methods such as Table-GAN [11], CTGAN [12], and TVAE [12] have been proposed. These models incorporate innovations like mode-specific normalization and stratified sampling to better capture tabular structures. CTAB-GAN+ [13] further introduces downstream task alignment and differential privacy, while federated methods like FedEqGAN [14] integrate encryption to address cross-source heterogeneity. In addition, for privacy-aware synthesis, DP-CTGAN [15] embeds rigorous $(\epsilon, \delta)$-differential privacy into Conditional Tabular GANs through gradient clipping and Gaussian noise injection, preserving higher statistical fidelity and downstream predictive accuracy than prior private baselines.
In parallel, tree-based machine learning approaches have also gained traction for tabular data synthesis. ARF [16] uses ensembles of random forests to model local feature distributions, while Generative Trees (GTs) [17] mimic the partition logic of decision trees to synthesize data along learned splits. A recent study by Emam et al. [18] utilized a sequential synthesis framework based on gradient-boosted decision trees (GBDT) to assess the replicability of health data analyses, showing high consistency between synthetic and real datasets. While efficient, these models struggle to capture global feature dependencies or provide substantial privacy enhancement.
Building on this line of work, we propose a novel method called Adversarial Gradient Boosting Decision Tree (AGBDT). It integrates Gradient Boosting Decision Tree ensembles with adversarial training to generate high-quality tabular data while ensuring privacy. Our method iteratively updates data generation domains through discriminator feedback, then employs a k-anonymity Mondrian sampling strategy to generate privacy-aware synthetic records. To further enhance protection, we extend our framework to support l-diversity and t-closeness constraints.
Our main contributions include the following:
  • A novel hybrid framework combining gradient boosting and adversarial learning for tabular data synthesis.
  • A domain update mechanism guided by decision path logic, enabling privacy-preserving generation with controlled fidelity.
  • Empirical evaluation on five datasets against several baselines, with ablation studies on k-anonymity constraints.

2. Methods

2.1. Adversarial Gradient Boosting Framework

The proposed methodology, Adversarial Gradient Boosting Decision Tree (AGBDT), integrates gradient boosting trees into an adversarial learning framework to address the limitations of existing table data generation approaches. As illustrated in Figure 1, the system comprises three core components: (1) a generator that synthesizes privacy-preserving synthetic data, (2) a discriminator that distinguishes real from synthetic records using gradient boosting decision trees, and (3) a range table that dynamically constrains attribute value ranges to prevent overfitting.

2.1.1. Core Mechanism

The workflow begins with initial synthetic data generation, where a range table (Table (b) in Figure 2) is constructed by extracting minimum and maximum values for each attribute from the original dataset (Table (a) in Figure 2). Synthetic records are generated by uniformly sampling values within these ranges (Table (c) in Figure 2). For example, an integer-type attribute $x_1$ with a range of [0, 100] would produce synthetic values randomly drawn from this interval.
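The initialization step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_range_table` and `sample_synthetic`, the toy records, and the restriction to numerical attributes are all assumptions made for illustration.

```python
import random

def build_range_table(records):
    """Return one (min, max) pair per attribute over all real records."""
    columns = list(zip(*records))
    return [(min(col), max(col)) for col in columns]

def sample_synthetic(range_table, n_rows, integer_columns=(), seed=0):
    """Draw every attribute uniformly from its [min, max] interval."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = []
        for j, (lo, hi) in enumerate(range_table):
            if j in integer_columns:
                row.append(rng.randint(int(lo), int(hi)))  # inclusive bounds
            else:
                row.append(rng.uniform(lo, hi))
        rows.append(row)
    return rows

# Hypothetical toy data: column 0 is an integer attribute with range [0, 100].
real = [[25, 0.0, 1.2], [50, 3.0, 3.4], [100, 1.0, 2.0], [0, 2.0, 0.5]]
range_table = build_range_table(real)
synthetic = sample_synthetic(range_table, n_rows=5, integer_columns={0})
```

Because every attribute is sampled independently, these initial records carry no dependency structure; that structure is only introduced later, when the ranges are narrowed per record.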
During discriminator training, a gradient boosting tree classifier learns to distinguish real from synthetic data through iterative 5-fold cross-validation. The discriminator's decision paths are then used to guide the generator adaptation. Inspired by GTs [17], the proposed method augments each boosting round with a "range table" that records split conditions across all features. The generator retrains its gradient boosting trees using discriminator-predicted labels, updating the range table by narrowing attribute value boundaries according to the discriminator's splitting criteria. For example, as illustrated in Figure 3, record $ID = 1$ in Figure 2 traverses eight split predicates to reach a leaf node, at which point the generator uses these predicates to tighten its intervals to $20 < x_1 \le 50$, $x_2 \le 0$, and $x_3 < 3.5$; the updated range table is shown in Table (d) in Figure 2. This training of the discriminator and generator classifiers is iterated until the discriminator's classification accuracy falls below a predefined threshold, so that the synthetic data converge toward the real data distribution.
For details, in AGBDT, each real record $x_i \in R$ is associated with a feature-wise domain $D_i$, which represents the feasible sampling interval for generating synthetic data. This domain is iteratively refined using the decision paths of the gradient boosting classifier. Let $D_{i,f}^{(j)}$ denote the domain of feature $f$ for record $x_i$ after the $j$-th decision tree in the ensemble. Each internal node of the decision tree defines a binary split of the form $x_f \le \theta$, where $\theta$ is the split threshold. We update the domain $D_{i,f}^{(j)}$ based on whether the path of $x_i$ goes left or right at each node:
$$
D_{i,f}^{(j)} =
\begin{cases}
\left[ D_{i,f}^{(j-1)}[0],\ \min\left( D_{i,f}^{(j-1)}[1],\ \theta \right) \right] & \text{if } x_i^{f} \le \theta, \\
\left[ \max\left( D_{i,f}^{(j-1)}[0],\ \theta \right),\ D_{i,f}^{(j-1)}[1] \right] & \text{if } x_i^{f} > \theta.
\end{cases}
$$
The final domain after traversing all $M$ trees is the intersection of all intermediate updates:
$$
D_i^{(*)} = \bigcap_{j=1}^{M} D_i^{(j)}.
$$
This iterative domain shrinking ensures that synthetic samples generated from $D_i^{(*)}$ follow the same logical constraints as the real data under the ensemble classifier, thereby improving fidelity while maintaining diversity. See Lines 4–14 in Algorithm 1.
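The domain update rule can be sketched in a few lines of Python. This is an illustrative reading of the rule, not the authors' code: each tree's decision path is represented here as a plain list of `(feature, threshold)` pairs rather than extracted from a trained classifier, and the record, thresholds, and initial ranges are hypothetical.

```python
def narrow_domain(domain, record, path):
    """Tighten per-feature intervals along one tree's decision path.

    domain: dict feature -> (low, high)
    record: dict feature -> value of the real record
    path:   list of (feature, threshold) split nodes visited by the record
    """
    new_domain = dict(domain)
    for feature, theta in path:
        low, high = new_domain[feature]
        if record[feature] <= theta:
            # Left branch: cap the upper bound at the threshold.
            new_domain[feature] = (low, min(high, theta))
        else:
            # Right branch: raise the lower bound to the threshold.
            new_domain[feature] = (max(low, theta), high)
    return new_domain

def final_domain(initial, record, paths):
    """Apply the update for every tree; the result is the intersection."""
    domain = initial
    for path in paths:
        domain = narrow_domain(domain, record, path)
    return domain

# Hypothetical record, initial ranges, and two decision paths.
initial = {"x1": (0, 100), "x2": (-5, 5), "x3": (0, 10)}
record = {"x1": 30, "x2": -1, "x3": 2}
paths = [
    [("x1", 20), ("x1", 50), ("x2", 0)],  # path through tree 1
    [("x3", 3.5), ("x1", 60)],            # path through tree 2
]
result = final_domain(initial, record, paths)
```

Since each step only ever lowers an upper bound or raises a lower bound, applying the updates sequentially gives the same interval as intersecting the per-tree domains.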

2.1.2. Parameter Tuning

Key hyperparameters include the tree depth to limit model complexity, the number of trees to balance computational efficiency and learning capacity, and a range update ratio to control the granularity of attribute range adjustments. These settings prevent excessive overfitting while maintaining synthesis fidelity.

2.2. Privacy Enhancement via k-Anonymity

To address the risk of privacy leakage caused by overfitting during synthetic data generation, the proposed method incorporates k-anonymization into the range table using the Mondrian k-anonymity algorithm [19]. Although alternative privacy-preserving methods such as k-concealment [20] or differential privacy [21] could also be considered, we select the Mondrian k-anonymity approach due to its natural compatibility with the decision path partitioning logic of gradient boosting trees. The range table stores the minimum and maximum values of each attribute for all records, and is treated as a $2d$-dimensional space (where $d$ is the number of attributes).
For example, in Figure 4a, the original range table contains age ranges such as $[20, 40]$ for $ID = 1$ and $[30, 50]$ for $ID = 2$. The Mondrian algorithm recursively partitions this space by selecting a random dimension (e.g., the minimum value of age (Age_min) or the maximum value of age (Age_max)) and splitting it at the median value until each partition contains at least k records. As shown in Figure 4b, IDs $\{1, 2\}$ are merged into a single partition after splitting the Age_min dimension at 35 and the Age_max dimension at 50. The generalized range for these IDs becomes $[20, 50]$, replacing their original ranges $[20, 40]$ and $[30, 50]$. This process ensures that any synthetic record generated from the anonymized range table cannot be uniquely linked to fewer than k original records, thereby satisfying k-anonymity. The final anonymized range table (Figure 4a) guarantees that each generalized range corresponds to at least k original records. Synthetic data is then regenerated within these generalized ranges, balancing privacy protection with statistical utility.
To formally define k-anonymity, let $\mathcal{D} = \{r_1, r_2, \ldots, r_n\}$ denote the original dataset, where each record $r_i$ contains a set of quasi-identifiers $\mathrm{QI}(r_i) = \{a_{i1}, a_{i2}, \ldots, a_{id}\}$. A dataset $\mathcal{D}$ satisfies k-anonymity if every combination of quasi-identifier values in $\mathcal{D}$ appears in at least k distinct records. Mathematically,
$$
\forall r_i \in \mathcal{D}, \quad \left| \{ r_j \in \mathcal{D} : \mathrm{QI}(r_j) = \mathrm{QI}(r_i) \} \right| \ge k.
$$
In our framework, each record $r_i$ is associated with an interval-based domain $D_i = \{[l_{i1}, u_{i1}], [l_{i2}, u_{i2}], \ldots, [l_{id}, u_{id}]\}$, where $[l_{ij}, u_{ij}]$ denotes the valid range of attribute $j$ for record $i$. After applying Mondrian partitioning, records are grouped into equivalence classes $P_1, P_2, \ldots, P_m$, each satisfying $|P_k| \ge k$. Within each class, all records are generalized to share the same domain:
$$
\forall r_i, r_j \in P_k, \quad D_i = D_j = \left\{ \left[ \min_{r \in P_k} a_{rj},\ \max_{r \in P_k} a_{rj} \right] \right\}_{j=1}^{d}.
$$
This ensures that any synthetic record generated from the generalized domain cannot be uniquely traced back to fewer than k original records, thereby mitigating re-identification risks. See Lines 15–26 in Algorithm 1.
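A compact sketch of Mondrian-style partitioning on the range table follows. It is a simplified illustration under stated assumptions, not the authors' implementation: a cut is accepted only when both sides keep at least k records, the generalized interval of a class is taken as the smallest lower bound and largest upper bound among its members (matching the Figure 4 example, where $[20, 40]$ and $[30, 50]$ become $[20, 50]$), and the function names and the records with IDs 3 and 4 are hypothetical.

```python
import random
import statistics

def mondrian(rows, k, rng):
    """Split rows at the median of a randomly chosen dimension, recursively.

    rows: list of (record_id, point), point = (l_1, u_1, ..., l_d, u_d)
    Returns a list of equivalence classes, each with at least k rows
    (provided the input itself has at least k rows).
    """
    dims = list(range(len(rows[0][1])))
    rng.shuffle(dims)
    for dim in dims:
        median = statistics.median(p[dim] for _, p in rows)
        left = [r for r in rows if r[1][dim] <= median]
        right = [r for r in rows if r[1][dim] > median]
        if len(left) >= k and len(right) >= k:  # allowable cut
            return mondrian(left, k, rng) + mondrian(right, k, rng)
    return [rows]  # no allowable cut: this is a final class

def generalize(partition):
    """Shared domain: per attribute, (min lower bound, max upper bound)."""
    points = [p for _, p in partition]
    d = len(points[0]) // 2
    return [(min(p[2 * j] for p in points), max(p[2 * j + 1] for p in points))
            for j in range(d)]

# Range table with a single attribute (age); IDs 3 and 4 are made up.
range_table = {1: [(20, 40)], 2: [(30, 50)], 3: [(60, 70)], 4: [(65, 80)]}
rows = [(rid, tuple(v for interval in intervals for v in interval))
        for rid, intervals in range_table.items()]
classes = mondrian(rows, k=2, rng=random.Random(0))
anonymized = {rid: generalize(cls) for cls in classes for rid, _ in cls}
```

In this example IDs 1 and 2 end up sharing the range (20, 50), so a synthetic age drawn from that range is consistent with both records rather than with a single one.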
Algorithm 1: AGBDT: Adversarial Gradient Boosting for Tabular Data Synthesis

3. Experiments

3.1. Datasets

The experiments utilized five table datasets: Adult, Census, Credit, Covtype, and Intrusion. The Adult, Census, Credit, and Covtype datasets were sourced from SDV [22], and the Intrusion dataset was sourced from the UCI KDD Archive [23]. As summarized in Table 1, the datasets represent classification tasks with varying characteristics. To address computational constraints, Census, Credit, Covtype, and Intrusion were subsampled to 50,000 records each, stratified by their target variables.

3.2. Synthetic Data Generation

The proposed method AGBDT was compared against five baselines: CTGAN [12], TVAE [12], CTAB-GAN+ [13], DP-CTGAN [15,24], and ARF [16]. All models were configured using parameters specified in their original publications. For AGBDT, three variants were tested: baseline (without k-anonymization), $k = 2$, and $k = 5$ (with Mondrian-based k-anonymization applied to the range table). The gradient boosting trees in AGBDT were trained using scikit-learn's GradientBoostingClassifier with fixed hyperparameters: the maximum tree depth was set to 3; the number of trees was set to 50; the discriminator's classification accuracy threshold for early stopping was set to 0.52; and the maximum number of rounds was set to 100. For a discussion on parameter selection and computational cost, see the Discussion section. Each synthesis experiment was repeated five times per dataset to ensure statistical reliability.

3.3. Results

3.3.1. Statistical Similarity

Synthetic data fidelity was evaluated using Wasserstein Distance (WD) for numerical attributes, Jensen–Shannon Divergence (JSD) for categorical attributes, and Difference in Correlation Matrices (Diff. Corr). For all three metrics, lower values are better, indicating that the synthetic data is closer to the original data. The best results are marked in bold in Table 2 and Table A1.
Table 2 shows that AGBDT achieves the best overall performance across all datasets in terms of average results. Specifically, AGBDT attains the best values in the WD and Diff. Corr metrics, while its JSD value ranks second and remains within the same order of magnitude as the leading method. After applying k-anonymity, JSD improves to the best result, whereas WD increases noticeably. This observation suggests that synthetic records that are highly similar to real data are excluded by privacy enforcement. Moreover, Diff. Corr rises significantly to a level comparable with DP-CTGAN, likely because too few records meet the privacy-preservation criteria, leading to substantial duplication of certain records. In detail, as shown in Table A1 in Appendix A, the models applying k-anonymity exhibit a substantial increase in Diff. Corr for the Census and Intrusion datasets, which contain numerous categorical columns. This suggests that after applying k-anonymity, only a few decision trees satisfy the required conditions, significantly reducing the amount of synthetic data produced per iteration. Consequently, repeated iterations of data synthesis lead to a notable increase in redundancy. However, because our mechanism controls the anonymity parameter k, the extent of this increase remains lower than for DP-CTGAN, which supports the effectiveness of our method.
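For readers unfamiliar with the two distribution metrics, the following sketch computes them for a single column with the standard library only. The article does not state the logarithm base, any feature scaling, or the library used, so the base-2 JSD (bounded in [0, 1]) and the equal-sample-size shortcut for the one-dimensional Wasserstein distance are assumptions of this sketch.

```python
import math
from collections import Counter

def jensen_shannon_divergence(real_values, synth_values):
    """Base-2 JSD between two empirical categorical distributions."""
    p_counts, q_counts = Counter(real_values), Counter(synth_values)
    n_p, n_q = len(real_values), len(synth_values)
    jsd = 0.0
    for category in set(p_counts) | set(q_counts):
        p = p_counts[category] / n_p
        q = q_counts[category] / n_q
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd

def wasserstein_1d(real_values, synth_values):
    """Wasserstein-1 distance between two equally sized 1-D samples.

    For equal sizes it reduces to the mean absolute difference
    between the sorted samples.
    """
    if len(real_values) != len(synth_values):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(real_values), sorted(synth_values))
    return sum(abs(a - b) for a, b in pairs) / len(real_values)

jsd_same = jensen_shannon_divergence(["a", "b"], ["b", "a"])
jsd_disjoint = jensen_shannon_divergence(["a", "a"], ["b", "b"])
wd_shift = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
```

Identical distributions give a JSD of 0 and disjoint ones give 1; shifting a sample by a constant gives a Wasserstein distance equal to that constant.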

3.3.2. Machine Learning Utility

We evaluated the utility of synthetic data in machine learning by comparing two scenarios: (1) results from training and testing entirely on original data, and (2) results from training on synthetic data and testing on original data. Closer alignment between these outcomes indicates higher substitutability of original training data with synthetic data, i.e., greater machine learning utility. The best results are marked in bold in Table 3 and Table A2.
Here, four classifiers—Decision Tree Classifier (DTC), Logistic Regression (LR), Multilayer Perceptron (MLP), and Random Forest (RF)—were trained on synthetic data and evaluated using AUC, F1-score, and Accuracy. Label encoding was applied for DTC and RF for categorical variable preprocessing, while one-hot encoding was used for LR and MLP. Numerical variables were normalized exclusively for MLP. Machine learning models were implemented using scikit-learn without hyperparameter tuning. The average evaluation metrics for each combination of generative model and dataset are shown in Table 3. Each value represents the average of 20 trials (5 synthetic datasets × 4 machine learning models). The Oracle row represents results from training and testing entirely on original data, where closer alignment indicates higher synthetic data utility.
In detail, as shown in Table A2, for the Credit and Covtype datasets, TVAE failed to generate the minority class of the target variable, which is marked as '-'. For the Intrusion dataset, stratified sampling resulted in the absence of the least frequent class in the test data, rendering AUC calculation infeasible (also marked as '-'). AGBDT achieved the best results on three out of five datasets, indicating that it generates synthetic data with high utility for machine learning tasks. After applying k-anonymization to the range table, the model struggles to accurately capture data distribution characteristics due to privacy protection constraints, resulting in a decline in performance metrics similar to DP-CTGAN, which also prioritizes privacy. However, the proposed method exhibits smaller reductions in the F1-score and Accuracy compared to DP-CTGAN, demonstrating that our model provides greater utility relative to the state-of-the-art privacy-preserving method.

3.3.3. Privacy Evaluation

We adopt the NewRowSynthesis metric [22], which measures the proportion of synthetic records that do not exactly match any record in the original data (i.e., $1 - m/t$, where $m$ is the number of synthetic records that match a record in the original data and $t$ is the total number of synthetic records). The metric ranges from 0 (all synthetic records exist in the original data) to 1 (all synthetic records are novel). Table A3 reports the average and worst-case values across five trials, rounded to four decimal places. The best results are marked in bold. Without k-anonymization, AGBDT generated the lowest proportion of novel records (highest privacy risk). However, applying $k = 2$ anonymization reduced the overlap with original records, achieving parity with baseline models on almost all datasets.
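The formula $1 - m/t$ translates directly into code. The sketch below uses exact row matching, as in the formula; library implementations may instead apply a tolerance when comparing numerical columns, so results can differ slightly. The function name and toy rows are illustrative.

```python
def new_row_synthesis(real_rows, synthetic_rows):
    """Return 1 - m/t, the share of synthetic rows not found in the real data."""
    real_set = {tuple(row) for row in real_rows}
    matches = sum(1 for row in synthetic_rows if tuple(row) in real_set)  # m
    return 1 - matches / len(synthetic_rows)                             # t = len(...)

real_rows = [(1, "a"), (2, "b")]
synthetic_rows = [(1, "a"), (3, "c"), (2, "b"), (4, "d")]
score = new_row_synthesis(real_rows, synthetic_rows)
```

Here two of the four synthetic rows are copies of real rows, so the score is 0.5.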
In addition, following [25,26,27], we use Distance to Closest Record (DCR) to measure the Euclidean distance between each synthetic record and its closest real neighbour; the higher the DCR, the lower the risk of a privacy breach. We also use the Nearest Neighbor Distance Ratio (NNDR), the ratio of the distance from each synthetic record to its closest real record to the distance to its second-closest real record. Higher values indicate better privacy. The 5th percentile of each metric is computed to provide a robust estimate of the privacy risk. We calculate these Euclidean distance-based metrics only on numeric columns to avoid inaccuracies in distance calculations. The results are shown in Table A3 and Table 4. The best results are marked in bold.
On three datasets (Census, Credit, and Intrusion), the proposed AGBDT method without k-anonymity yielded DCR and NNDR values of 0, suggesting that some synthetic records are overly similar to real ones, thus risking privacy leakage. In contrast, incorporating k-anonymity (with $k = 2$ or $k = 5$) significantly improved both metrics, with DCR scores exceeding 1.0 and NNDR values approaching or exceeding 0.8 on most datasets. These results demonstrate that AGBDT with k-anonymity provides enhanced privacy compared to baseline generative models such as CTGAN, TVAE, and CTAB-GAN+, highlighting the effectiveness of anonymity-aware sampling in mitigating memorization and improving synthetic data privacy. Compared with the differentially private version of CTGAN (DP-CTGAN), AGBDT has a lower DCR value; even in the most extreme case, the data generated by DP-CTGAN lie farther from the real data, which confirms the effectiveness of differential privacy for privacy protection. However, on the NNDR metric, AGBDT with k-anonymization is better than DP-CTGAN, which also supports the effectiveness of our method.
In addition, Figure 5 illustrates the per-record density distributions of DCR and NNDR across different generative models and datasets. These metrics assess privacy risks in synthetic data by quantifying proximity to, and indistinguishability from, real records. In the DCR plots (Figure 5a), a higher concentration of synthetic samples with large DCR values indicates improved privacy, as the generated records are farther from the closest real instances. Most of the DCR mass of ARF falls around 1.0, while the DCR of the proposed model is more evenly distributed, with peaks above 1.0 for all proposed variants. This indicates that the data synthesized by the proposed model are far from the real data and offer good privacy protection. In the NNDR plots (Figure 5b), distributions with mass closer to 1.0 reflect better privacy preservation, as the synthetic samples have similar distances to multiple real neighbors, making linkage attacks more difficult. The NNDR of most models falls around 1.0, indicating that both the proposed model and the baselines can synthesize data that differ from the real data. On the Adult and Covtype datasets, the proposed model with $k = 5$ surpasses the state-of-the-art model DP-CTGAN.
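Both distance-based metrics can be sketched with the standard library. This sketch assumes the common definition of NNDR as the distance from a synthetic record to its nearest real record divided by the distance to its second-nearest real record; the nearest-rank percentile and the value assigned when both distances are zero are conventions chosen here, not taken from the article. It also assumes numeric columns that are already on comparable scales.

```python
import math

def dcr_and_nndr(synthetic_rows, real_rows):
    """Per synthetic record: distance to closest real record and NNDR."""
    dcr, nndr = [], []
    for s in synthetic_rows:
        dists = sorted(math.dist(s, r) for r in real_rows)
        nearest, second = dists[0], dists[1]
        dcr.append(nearest)
        # Convention of this sketch: report 0.0 when both distances are zero.
        nndr.append(nearest / second if second > 0 else 0.0)
    return dcr, nndr

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

real_rows = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
synthetic_rows = [(0.0, 0.0), (3.0, 0.0)]
dcr, nndr = dcr_and_nndr(synthetic_rows, real_rows)
dcr_p5 = percentile(dcr, 5)
nndr_p5 = percentile(nndr, 5)
```

The first synthetic row copies a real row, so its DCR and NNDR are both 0, and the 5th percentile picks up exactly this worst case. This is why a single memorized record can drive the reported value to 0.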

4. Discussion

Based on our experimental results and observations, several critical points merit discussion to guide future research and practical applications clearly:
  • Utility vs. Privacy Tradeoff: The proposed AGBDT model demonstrated superior or competitive performance across multiple datasets regarding statistical similarity and machine learning utility, confirming its strong capability for generating analytically valuable synthetic data. However, without explicit privacy constraints, the method showed relatively low novelty and potential privacy risks. Introducing k-anonymity effectively mitigated these risks while maintaining acceptable utility degradation.
  • Handling Severe Class Imbalance: Our method showed limitations in replicating severely imbalanced datasets, notably the Credit dataset. In such cases, alternative generative models (e.g., CTAB-GAN+ or ARF), which better manage minority class distributions, may offer improved synthetic data quality. Enhancing our framework to better capture imbalanced class structures will be a priority in future work.
  • Alternative Privacy Protection Approaches: Although Mondrian k-anonymity aligns naturally with gradient boosting tree logic, it can lead to information loss and reduced data diversity. Exploring alternatives such as k-concealment or differential privacy for more rigorous privacy guarantees, possibly through hybrid methods, will be valuable in future developments.
  • Missing Data Handling: All datasets in this study were complete, but real-world data frequently contain missing values. Integrating advanced imputation techniques (e.g., median/mode imputation, predictive models, or surrogate splits in tree structures) into the AGBDT framework would significantly enhance its applicability and robustness to common practical scenarios.
  • Sensitivity to Hyperparameters: The parameter analysis (see Table A4) demonstrated that moderate complexity (tree depth = 3, estimators = 50) provided optimal balance between data quality and computational efficiency in most datasets. With higher depths and estimators, data quality deteriorated significantly, indicating increased risks of overfitting and decreased generalizability. Future research should investigate automated hyperparameter tuning strategies, such as Bayesian optimization or validation-driven dynamic adjustment, to improve model adaptability across diverse datasets.
In summary, AGBDT effectively balances data utility and privacy under general scenarios. However, further improvements addressing its limitations—including handling severe class imbalance, integrating formal privacy guarantees, dealing with missing data, and optimizing parameter sensitivity—will significantly enhance its generalizability and practical applicability.

5. Conclusions

In this study, we proposed an Adversarial Gradient Boosting Decision Tree (AGBDT) framework for synthesizing privacy-aware tabular data by integrating adversarial training with gradient boosting trees. Empirical evaluations on five diverse datasets demonstrated that AGBDT achieved superior or highly competitive results in statistical similarity and machine learning utility compared to state-of-the-art baseline models.
Key findings indicated that our approach effectively balanced data fidelity and privacy risks, particularly in datasets with moderate diversity and class balance. Nevertheless, we observed performance degradation in datasets characterized by severe class imbalance, highlighting a limitation regarding the current model's sensitivity to minority class distributions. Additionally, our application of k-anonymity provided intuitive privacy protection but lacks the formal probabilistic privacy guarantees available through differential privacy.
Practical implications of our study suggest that while AGBDT effectively synthesizes highly usable tabular data suitable for typical analytical tasks, careful consideration should be given when dealing with severely imbalanced datasets or stringent formal privacy constraints. Future research directions include exploring differential privacy integration for formal guarantees, improving methods for handling missing values common in real-world data, and conducting comprehensive sensitivity analyses to enhance the robustness and generalizability of the proposed framework.

Author Contributions

Conceptualization, S.J. and N.I.; software, S.J. and N.I.; validation, S.J.; formal analysis, S.J. and N.I.; investigation, N.I.; data curation, S.J.; security writing—original draft, S.J.; writing—review and supervision, S.K., K.M.R.A. and Y.M.; funding acquisition, S.J., K.M.R.A. and Y.M.; project administration, Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JST SPRING, Grant Number JPMJSP2132. This research was supported by KAKENHI (25K15130) Japan.

Data Availability Statement

Publicly available datasets were analyzed in this study. The Adult, Census, Credit, and Covtype datasets are from the SDV project at https://sdv.dev/ (accessed on 11 November 2023). The Intrusion dataset is from the KDD Cup 1999 Data at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (accessed on 11 November 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

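Table A1. Average statistical similarity evaluation index.

| Dataset | Model | WD | JSD | Diff. Corr |
|---|---|---|---|---|
| Adult | CTGAN | 0.066 | 0.084 | 0.386 |
| | TVAE | 0.307 | 0.260 | 0.251 |
| | CTAB-GAN+ | 0.047 | 0.078 | 0.267 |
| | DP-CTGAN | 0.103 | 0.272 | 0.828 |
| | ARF | 0.203 | 0.007 | 0.046 |
| | AGBDT | 0.002 | 0.014 | 0.042 |
| | AGBDT (k = 2) | 0.155 | 0.006 | 0.571 |
| | AGBDT (k = 5) | 0.158 | 0.006 | 0.494 |
| Census | CTGAN | 0.087 | 0.109 | 2.947 |
| | TVAE | 0.205 | 0.183 | 1.133 |
| | CTAB-GAN+ | 0.143 | 0.166 | 2.198 |
| | DP-CTGAN | 0.264 | 0.294 | 3.875 |
| | ARF | 0.195 | 0.005 | 0.303 |
| | AGBDT | 0.001 | 0.015 | 0.066 |
| | AGBDT (k = 2) | 0.150 | 0.005 | 3.003 |
| | AGBDT (k = 5) | 0.150 | 0.005 | 2.982 |
| Credit | CTGAN | 0.214 | 0.327 | 2.508 |
| | TVAE | 0.513 | 0.024 | 1.019 |
| | CTAB-GAN+ | 0.137 | 0.005 | 3.353 |
| | DP-CTGAN | 0.065 | 0.020 | 2.491 |
| | ARF | 0.117 | 0.001 | 1.481 |
| | AGBDT | 0.018 | 0.003 | 1.376 |
| | AGBDT (k = 2) | 0.242 | 0.002 | 3.151 |
| | AGBDT (k = 5) | 0.227 | 0.002 | 3.031 |
| Covtype | CTGAN | 0.050 | 0.044 | 2.210 |
| | TVAE | 0.065 | 0.063 | 1.052 |
| | CTAB-GAN+ | 0.023 | 0.021 | 1.259 |
| | DP-CTGAN | 0.152 | 0.077 | 3.066 |
| | ARF | 0.150 | 0.001 | 0.445 |
| | AGBDT | 0.002 | 0.002 | 0.052 |
| | AGBDT (k = 2) | 0.067 | 0.001 | 1.926 |
| | AGBDT (k = 5) | 0.076 | 0.001 | 2.087 |
| Intrusion | CTGAN | 0.096 | 0.097 | 7.956 |
| | TVAE | 0.115 | 0.077 | 4.604 |
| | CTAB-GAN+ | - | - | - |
| | DP-CTGAN | 0.156 | 0.115 | 9.969 |
| | ARF | 0.310 | 0.003 | 0.713 |
| | AGBDT | 0.002 | 0.006 | 0.887 |
| | AGBDT (k = 2) | 0.149 | 0.003 | 7.786 |
| | AGBDT (k = 5) | 0.150 | 0.004 | 7.830 |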
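Table A2. Average machine learning usefulness evaluation index.

| Dataset | Model | AUC | F1-Score | Accuracy |
|---|---|---|---|---|
| Adult | Oracle | 0.891 | 0.769 | 0.845 |
| | CTGAN | 0.563 | 0.439 | 0.748 |
| | TVAE | 0.757 | 0.561 | 0.754 |
| | CTAB-GAN+ | 0.851 | 0.723 | 0.814 |
| | DP-CTGAN | 0.532 | 0.466 | 0.692 |
| | ARF | 0.878 | 0.749 | 0.836 |
| | AGBDT | 0.884 | 0.767 | 0.841 |
| | AGBDT (k = 2) | 0.488 | 0.440 | 0.748 |
| | AGBDT (k = 5) | 0.499 | 0.440 | 0.754 |
| Census | Oracle | 0.920 | 0.714 | 0.948 |
| | CTGAN | 0.630 | 0.490 | 0.937 |
| | TVAE | 0.718 | 0.498 | 0.935 |
| | CTAB-GAN+ | 0.677 | 0.670 | 0.934 |
| | DP-CTGAN | 0.544 | 0.476 | 0.902 |
| | ARF | 0.889 | 0.615 | 0.944 |
| | AGBDT | 0.914 | 0.708 | 0.947 |
| | AGBDT (k = 2) | 0.514 | 0.486 | 0.936 |
| | AGBDT (k = 5) | 0.479 | 0.491 | 0.929 |
| Credit | Oracle | 0.882 | 0.799 | 0.999 |
| | CTGAN | 0.987 | 0.612 | 0.977 |
| | TVAE | - | - | - |
| | CTAB-GAN+ | 0.886 | 0.726 | 0.997 |
| | DP-CTGAN | 0.550 | 0.502 | 0.998 |
| | ARF | 0.886 | 0.742 | 0.999 |
| | AGBDT | 0.861 | 0.750 | 0.999 |
| | AGBDT (k = 2) | 0.462 | 0.497 | 0.989 |
| | AGBDT (k = 5) | 0.540 | 0.496 | 0.984 |
| Covtype | Oracle | 0.943 | 0.768 | 0.635 |
| | CTGAN | 0.693 | 0.500 | 0.189 |
| | TVAE | - | 0.637 | 0.273 |
| | CTAB-GAN+ | 0.824 | 0.614 | 0.322 |
| | DP-CTGAN | 0.485 | 0.420 | 0.102 |
| | ARF | 0.909 | 0.683 | 0.434 |
| | AGBDT | 0.934 | 0.739 | 0.583 |
| | AGBDT (k = 2) | 0.518 | 0.471 | 0.108 |
| | AGBDT (k = 5) | 0.496 | 0.469 | 0.107 |
| Intrusion | Oracle | - | 0.995 | 0.870 |
| | CTGAN | - | 0.977 | 0.738 |
| | TVAE | 0.877 | 0.984 | 0.629 |
| | CTAB-GAN+ | - | - | - |
| | DP-CTGAN | - | 0.759 | 0.235 |
| | ARF | - | 0.994 | 0.813 |
| | AGBDT | - | 0.995 | 0.862 |
| | AGBDT (k = 2) | - | 0.728 | 0.216 |
| | AGBDT (k = 5) | - | 0.766 | 0.226 |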
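Table A3. Privacy evaluation metrics.

| Dataset | Model | NewRowSynthesis (Average) | NewRowSynthesis (Worst) | DCR | NNDR |
|---|---|---|---|---|---|
| Adult | CTGAN | 1.0 | 1.0 | 0.340 | 0.320 |
| | TVAE | 1.0 | 1.0 | 0.513 | 0.101 |
| | CTAB-GAN+ | 1.0 | 1.0 | 0.199 | 0.203 |
| | DP-CTGAN | 1.0 | 1.0 | 0.975 | 0.438 |
| | ARF | 1.0 | 1.0 | 0.352 | 0.434 |
| | AGBDT | 0.999+ | 0.999+ | 0.085 | 0.062 |
| | AGBDT (k = 2) | 1.0 | 1.0 | 0.958 | 0.679 |
| | AGBDT (k = 5) | 1.0 | 1.0 | 1.013 | 0.701 |
| Census | CTGAN | 1.0 | 1.0 | 0.791 | 0.561 |
| | TVAE | 0.986 | 0.978 | 1.225 | 0.416 |
| | CTAB-GAN+ | 0.997 | 0.990 | 0.480 | 0.368 |
| | DP-CTGAN | 1.0 | 1.0 | 9.200 | 0.760 |
| | ARF | 1.0 | 1.0 | 0.700 | 0.757 |
| | AGBDT | 0.642 | 0.595 | 0 | 0 |
| | AGBDT (k = 2) | 1.0 | 1.0 | 1.158 | 0.659 |
| | AGBDT (k = 5) | 1.0 | 1.0 | 1.165 | 0.661 |
| Credit | CTGAN | 0.999 | 0.998 | 0.886 | 0.878 |
| | TVAE | 0.002 | 0.001 | 0 | 0 |
| | CTAB-GAN+ | 0.999+ | 0.999 | 0.992 | 0.902 |
| | DP-CTGAN | 0.999+ | 0.999+ | 1.545 | 0.849 |
| | ARF | 0.999 | 0.998 | 0.696 | 0.838 |
| | AGBDT | 0.622 | 0.577 | 0 | 0 |
| | AGBDT (k = 2) | 0.999+ | 0.999+ | 1.165 | 0.926 |
| | AGBDT (k = 5) | 1.0 | 1.0 | 1.162 | 0.926 |
| Covtype | CTGAN | 1.0 | 1.0 | 0.883 | 0.886 |
| | TVAE | 1.0 | 1.0 | 0.862 | 0.821 |
| | CTAB-GAN+ | 1.0 | 1.0 | 0.794 | 0.845 |
| | DP-CTGAN | 1.0 | 1.0 | 1.700 | 0.905 |
| | ARF | 1.0 | 1.0 | 0.665 | 0.783 |
| | AGBDT | 1.0 | 1.0 | 0.283 | 0.234 |
| | AGBDT (k = 2) | 1.0 | 1.0 | 0.915 | 0.878 |
| | AGBDT (k = 5) | 1.0 | 1.0 | 0.918 | 0.889 |
| Intrusion | CTGAN | 1.0 | 1.0 | 1.568 | 0.870 |
| | TVAE | 1.0 | 1.0 | 1.263 | 0.715 |
| | CTAB-GAN+ | 0.992 | 0.971 | - | - |
| | DP-CTGAN | 1.0 | 1.0 | 11.719 | 0.942 |
| | ARF | 0.995 | 0.994 | 0.712 | 0.808 |
| | AGBDT | 0.423 | 0.411 | 0 | 0 |
| | AGBDT (k = 2) | 0.999 | 0.995 | 1.597 | 0.731 |
| | AGBDT (k = 5) | 0.999 | 0.996 | 1.601 | 0.731 |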
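Table A4. Fitting time and distance indicators under different parameter designs. The best results for WD and JSD are in bold.

| Dataset | Max Depth | Estimators | Fitting Time (s) | WD | JSD | Diff. Corr |
|---|---|---|---|---|---|---|
| Census | 2 | 30 | 0.698 | 0.003 | 0.014 | 0.042 |
| | 2 | 50 | 1.198 | 0.001 | 0.014 | 0.030 |
| | 2 | 70 | 1.586 | 0.002 | 0.009 | 0.040 |
| | 3 | 30 | 0.963 | 0.001 | 0.017 | 0.057 |
| | 3 | 50 | 1.578 | 0.001 | 0.014 | 0.086 |
| | 3 | 70 | 2.281 | 0.003 | 0.001 | 0.054 |
| | 4 | 30 | 1.266 | 0.003 | 0.002 | 0.047 |
| | 4 | 50 | 2.051 | 0.025 | 0.010 | 2.090 |
| | 4 | 70 | 2.880 | 0.021 | 0.002 | 1.676 |
| | 5 | 30 | 1.505 | 0.016 | 0.003 | 0.830 |
| | 5 | 50 | 2.402 | 0.002 | 0.005 | 1.681 |
| | 5 | 70 | 3.333 | 0.001 | 0.004 | 0.588 |
| Credit | 2 | 30 | 0.505 | 0.002 | 0.017 | 0.032 |
| | 2 | 50 | 0.817 | 0.002 | 0.009 | 0.031 |
| | 2 | 70 | 1.100 | 0.002 | 0.024 | 0.073 |
| | 3 | 30 | 0.652 | 0.001 | 0.017 | 0.083 |
| | 3 | 50 | 1.071 | 0.001 | 0.012 | 0.058 |
| | 3 | 70 | 1.453 | 0.003 | 0.001 | 0.060 |
| | 4 | 30 | 0.830 | 0.003 | 0.001 | 0.038 |
| | 4 | 50 | 1.380 | 0.026 | 0.003 | 2.198 |
| | 4 | 70 | 1.885 | 0.021 | 0.000 | 0.906 |
| | 5 | 30 | 1.015 | 0.002 | 0.007 | 1.370 |
| | 5 | 50 | 1.754 | 0.002 | 0.004 | 0.724 |
| | 5 | 70 | 2.304 | 0.001 | 0.003 | 0.575 |
| Covtype | 2 | 30 | 0.864 | 0.003 | 0.017 | 0.032 |
| | 2 | 50 | 1.391 | 0.002 | 0.011 | 0.049 |
| | 2 | 70 | 1.949 | 0.003 | 0.025 | 0.057 |
| | 3 | 30 | 1.314 | 0.001 | 0.015 | 0.083 |
| | 3 | 50 | 1.984 | 0.001 | 0.009 | 0.053 |
| | 3 | 70 | 2.847 | 0.003 | 0.001 | 0.069 |
| | 4 | 30 | 1.626 | 0.002 | 0.001 | 0.043 |
| | 4 | 50 | 2.532 | 0.026 | 0.005 | 2.052 |
| | 4 | 70 | 3.578 | 0.022 | 0.002 | 1.486 |
| | 5 | 30 | 1.863 | 0.014 | 0.002 | 1.085 |
| | 5 | 50 | 3.190 | 0.002 | 0.006 | 0.683 |
| | 5 | 70 | 4.298 | 0.002 | 0.004 | 0.821 |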

References

  1. Schwartz, P.M.; Solove, D.J. The PII problem: Privacy and a new concept of personally identifiable information. NYUL Rev. 2011, 86, 1814. [Google Scholar]
  2. Sweeney, L. k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2002, 10, 557–570. [Google Scholar] [CrossRef]
  3. Narayanan, A.; Shmatikov, V. Robust de-anonymization of large sparse datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy, Oakland, CA, USA, 18–22 May 2008; pp. 111–125. [Google Scholar]
  4. Machanavajjhala, A.; Kifer, D.; Gehrke, J.; Venkitasubramaniam, M. l-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data (TKDD) 2007, 1, 3-es. [Google Scholar] [CrossRef]
  5. Li, N.; Li, T.; Venkatasubramanian, S. t-closeness: Privacy beyond k-anonymity and l-diversity. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering, Istanbul, Turkey, 15–20 April 2007; pp. 106–115. [Google Scholar]
  6. Rajendran, K.; Jayabalan, M.; Rana, M.E. A study on k-anonymity, l-diversity, and t-closeness techniques. IJCSNS 2017, 17, 172. [Google Scholar]
  7. Kühnel, L.; Schneider, J.; Perrar, I.; Adams, T.; Moazemi, S.; Prasser, F.; Nöthlings, U.; Fröhlich, H.; Fluck, J. Synthetic data generation for a longitudinal cohort study–evaluation, method extension and reproduction of published data analysis results. Sci. Rep. 2024, 14, 14412. [Google Scholar] [CrossRef] [PubMed]
  8. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  9. Tang, H.; Li, C.; Jiang, S.; Yu, H.; Kamei, S.; Yamanishi, Y.; Morimoto, Y. EarlGAN: An enhanced actor–critic reinforcement learning agent-driven GAN for de novo drug design. Pattern Recognit. Lett. 2023, 175, 45–51. [Google Scholar] [CrossRef]
  10. Tang, H.; Li, C.; Jiang, S.; Yu, H.; Kamei, S.; Yamanishi, Y.; Morimoto, Y. MacGAN: A Moment-Actor-Critic Reinforcement Learning-Based Generative Adversarial Network for Molecular Generation. In Proceedings of the Web and Big Data, Singapore, 9–10 October 2024; pp. 127–141. [Google Scholar]
  11. Park, N.; Mohammadi, M.; Gorde, K.; Jajodia, S.; Park, H.; Kim, Y. Data synthesis based on generative adversarial networks. In Proceedings of the 44th International Conference on Very Large Data Bases, Rio de Janeiro, Brazil, 27–31 August 2018; Volume 11, pp. 1071–1083. [Google Scholar]
  12. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional GAN. Adv. Neural Inf. Process. Syst. 2019, 32, 659. [Google Scholar]
  13. Zhao, Z.; Kunar, A.; Birke, R.; Van der Scheer, H.; Chen, L.Y. CTAB-GAN+: Enhancing tabular data synthesis. Front. Big Data 2024, 6, 1296508. [Google Scholar] [CrossRef] [PubMed]
  14. Xiao, Y.; Zhao, D.; Li, X.; Li, T.; Wang, R.; Wang, G. A Federated Learning-based Data Augmentation Method for Privacy Preservation under Heterogeneous Data. IEEE Trans. Mob. Comput. 2025, 1–14. [Google Scholar] [CrossRef]
  15. Fang, M.L.; Dhami, D.S.; Kersting, K. Dp-ctgan: Differentially private medical data generation using ctgans. In Proceedings of the International Conference on Artificial Intelligence in Medicine, Halifax, NS, Canada, 14–17 June 2022; pp. 178–188. [Google Scholar]
  16. Watson, D.S.; Blesch, K.; Kapar, J.; Wright, M.N. Adversarial random forests for density estimation and generative modeling. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 25–27 April 2023; pp. 5357–5375. [Google Scholar]
  17. Nock, R.; Guillame-Bert, M. Generative Trees: Adversarial and Copycat. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 16906–16951. [Google Scholar]
  18. El Emam, K.; Mosquera, L.; Fang, X.; El-Hussuna, A. An evaluation of the replicability of analyses using synthetic health data. Sci. Rep. 2024, 14, 6978. [Google Scholar] [CrossRef] [PubMed]
  19. LeFevre, K.; DeWitt, D.J.; Ramakrishnan, R. Mondrian multidimensional k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE’06), Atlanta, GA, USA, 3–7 April 2006; p. 25. [Google Scholar]
  20. Tassa, T.; Mazza, A.; Gionis, A. k-concealment: An alternative model of k-type anonymity. Trans. Data Priv. 2012, 5, 189–222. [Google Scholar]
  21. Dwork, C. Differential privacy. In Proceedings of the International Colloquium on Automata, Languages, and Programming, Venice, Italy, 10–14 July 2006; pp. 1–12. [Google Scholar]
  22. Patki, N.; Wedge, R.; Veeramachaneni, K. The synthetic data vault. In Proceedings of the 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Montreal, QC, Canada, 17–19 October 2016; pp. 399–410. [Google Scholar]
  23. Stolfo, S.; Fan, W.; Lee, W.; Prodromidis, A.; Chan, P. KDD Cup 1999 Data; UCI Machine Learning Repository: Irvine, CA, USA, 1999. [Google Scholar] [CrossRef]
  24. Alabdulwahab, S.; Kim, Y.T.; Son, Y. Privacy-Preserving Synthetic Data Generation Method for IoT-Sensor Network IDS Using CTGAN. Sensors 2024, 24, 7389. [Google Scholar] [CrossRef] [PubMed]
  25. Karst, F.S.; Chong, S.Y.; Antenor, A.A.; Lin, E.; Li, M.M.; Leimeister, J.M. Generative AI for Banks: Benchmarks and Algorithms for Synthetic Financial Transaction Data Submission Type: Completed Full Research Paper. In Proceedings of the Workshop on Information Technologies and Systems (WITS), Bangkok, Thailand, 18–20 December 2024. [Google Scholar]
  26. Zhao, Z.; Kunar, A.; Birke, R.; Chen, L.Y. Ctab-gan: Effective table data synthesizing. In Proceedings of the Asian Conference on Machine Learning, Virtual, 17–19 November 2021; pp. 97–112. [Google Scholar]
  27. Qi, H.; Zou, W.; Fu, S.; Deng, L. A Self-Attention Synthesizing Model with Privacy-Preserving (ACCT-GAN) for Medical Tabular Data. In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisboa, Portugal, 3–6 December 2024; pp. 1129–1134. [Google Scholar]
Figure 1. Overview of the model.
Figure 2. Illustrative example of the iterative domain refinement process. Real data and synthetic samples are used to update the domain boundaries via decision rules.
Figure 3. Overview of the trees used to update the range table for the record with ID = 1 in Figure 2. The red line is the decision path for this record.
Figure 4. Example of Mondrian application to a range table. (a) Range table before and after applying 2-anonymity; (b) division of 2D space in Mondrian.
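The kind of space division shown in panel (b) can be sketched with a small, strict Mondrian-style partitioner. This is an illustrative simplification: it operates on numeric point records rather than on the range table used in the paper, and the names `mondrian` and `generalize` are hypothetical.

```python
def mondrian(rows, k):
    """Recursively split numeric rows at the median of one dimension.

    A split is accepted only if both halves keep at least k rows, so every
    returned group satisfies k-anonymity with respect to its bounding box.
    """
    dims = range(len(rows[0]))
    # Try the dimension with the widest value range first.
    by_span = sorted(
        dims,
        key=lambda d: max(r[d] for r in rows) - min(r[d] for r in rows),
        reverse=True,
    )
    for d in by_span:
        values = sorted(r[d] for r in rows)
        median = values[len(values) // 2]
        left = [r for r in rows if r[d] < median]
        right = [r for r in rows if r[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [rows]  # no allowable cut remains


def generalize(group):
    """Replace a group by its per-dimension (min, max) range."""
    dims = range(len(group[0]))
    return [(min(r[d] for r in group), max(r[d] for r in group)) for d in dims]


# Hypothetical 2D records, anonymized with k = 2.
records = [(1, 10), (2, 20), (8, 15), (9, 30)]
groups = mondrian(records, k=2)
ranges = [generalize(g) for g in groups]
```

Each group is published as a range per attribute, so any record is indistinguishable from at least k - 1 others within its group.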
Figure 5. Privacy metrics density of each record by dataset and model.
Table 1. Overview of benchmark datasets used for evaluation.

| Dataset | Classes | Train Size | Test Size | Numeric Cols | Categorical Cols |
|---|---|---|---|---|---|
| Adult | 2 | 22,792 | 9769 | 6 | 9 |
| Census | 2 | 45,000 | 5000 | 7 | 34 |
| Credit | 2 | 40,000 | 10,000 | 29 | 1 |
| Covtype | 7 | 45,000 | 5000 | 10 | 45 |
| Intrusion | 5 | 45,000 | 5000 | 32 | 10 |
Table 2. Average statistical similarity metrics across all datasets.

| Model | WD | JSD | Diff. Corr |
|---|---|---|---|
| CTGAN | 0.103 | 0.132 | 3.201 |
| TVAE | 0.241 | 0.121 | 1.612 |
| CTAB-GAN+ | 0.088 | 0.068 | 1.769 |
| DP-CTGAN | 0.148 | 0.156 | 4.046 |
| ARF | 0.195 | 0.003 | 0.598 |
| AGBDT | 0.005 | 0.008 | 0.485 |
| AGBDT (k = 2) | 0.153 | 0.003 | 3.287 |
| AGBDT (k = 5) | 0.152 | 0.004 | 3.285 |
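For readers unfamiliar with the two distribution distances, the sketch below shows one common way to compute them for a single column: the first-order Wasserstein distance (WD) between two numeric samples and the Jensen-Shannon divergence (JSD, base 2) between two categorical samples. The function names are hypothetical, and any preprocessing the paper applies (for example, scaling numeric columns or averaging over columns) is not reproduced here.

```python
import math
from collections import Counter


def wasserstein_1d(xs, ys):
    """First-order Wasserstein distance between two empirical 1-D samples,
    computed as the area between their empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    i = j = 0
    total = 0.0
    for left, right in zip(points, points[1:]):
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total


def jensen_shannon_divergence(a, b):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between the
    category frequencies of two samples."""
    count_a, count_b = Counter(a), Counter(b)
    jsd = 0.0
    for category in set(count_a) | set(count_b):
        p = count_a[category] / len(a)
        q = count_b[category] / len(b)
        m = (p + q) / 2
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd


# Hypothetical toy columns.
wd = wasserstein_1d([0, 1, 2], [1, 2, 3])
jsd_same = jensen_shannon_divergence(["a", "b"], ["a", "b"])
jsd_disjoint = jensen_shannon_divergence(["x"], ["y"])
```

Both quantities are zero when the real and synthetic columns have identical empirical distributions, which is why lower values in the table indicate closer statistical similarity.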
Table 3. Model-averaged machine learning usefulness evaluation.

| Model | AUC | F1-Score | Accuracy |
|---|---|---|---|
| Oracle | 0.909 | 0.809 | 0.859 |
| CTGAN | 0.718 | 0.604 | 0.718 |
| TVAE | 0.784 | 0.670 | 0.648 |
| CTAB-GAN+ | 0.810 | 0.683 | 0.767 |
| DP-CTGAN | 0.528 | 0.525 | 0.586 |
| ARF | 0.891 | 0.757 | 0.805 |
| AGBDT | 0.898 | 0.792 | 0.846 |
| AGBDT (k = 2) | 0.496 | 0.524 | 0.599 |
| AGBDT (k = 5) | 0.504 | 0.532 | 0.600 |
Table 4. Average privacy evaluation metrics.

| Model | DCR | NNDR |
|---|---|---|
| CTGAN | 0.894 | 0.703 |
| TVAE | 0.773 | 0.411 |
| CTAB-GAN+ | 0.616 | 0.580 |
| DP-CTGAN | 5.028 | 0.779 |
| ARF | 0.625 | 0.724 |
| AGBDT | 0.074 | 0.059 |
| AGBDT (k = 2) | 1.159 | 0.775 |
| AGBDT (k = 5) | 1.172 | 0.782 |
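The two metrics in this table can be illustrated with a minimal sketch, assuming the usual definitions: the distance to closest record (DCR) is the distance from a synthetic record to its nearest real record, and the nearest neighbor distance ratio (NNDR) is the ratio of the distances to the nearest and second-nearest real records. Euclidean distance on already-encoded numeric vectors, the helper name `dcr_and_nndr`, and the handling of the zero-distance case are assumptions of this example, not details confirmed by the paper.

```python
import math
from statistics import mean


def dcr_and_nndr(synthetic_rows, real_rows):
    """Return (mean DCR, mean NNDR) of synthetic rows against real rows.

    Requires at least two real rows. Rows are numeric tuples of equal length.
    """
    dcr_values, nndr_values = [], []
    for s in synthetic_rows:
        distances = sorted(math.dist(s, r) for r in real_rows)
        nearest, second = distances[0], distances[1]
        dcr_values.append(nearest)
        # Assumption: when both distances are zero the ratio is reported as 0.
        nndr_values.append(nearest / second if second > 0 else 0.0)
    return mean(dcr_values), mean(nndr_values)


# Hypothetical toy data: the first synthetic row copies a real record.
real = [(0, 0), (3, 4), (10, 0)]
synthetic = [(0, 0), (3, 0)]
dcr, nndr = dcr_and_nndr(synthetic, real)
```

Values near zero for both metrics mean that synthetic records sit on or very close to individual real records, whereas larger values indicate more distance from the source data.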

MDPI and ACS Style

Jiang, S.; Iwata, N.; Kamei, S.; Alam, K.M.R.; Morimoto, Y. Privacy-Aware Table Data Generation by Adversarial Gradient Boosting Decision Tree. Mathematics 2025, 13, 2509. https://doi.org/10.3390/math13152509
