HT-Fed-GAN: Federated Generative Model for Decentralized Tabular Data Synthesis

In this paper, we study the problem of privacy-preserving data synthesis (PPDS) for tabular data in a distributed multi-party environment. In a decentralized setting, for PPDS, federated generative models with differential privacy are used by the existing methods. Unfortunately, the existing models apply only to images or text data and not to tabular data. Unlike images, tabular data usually consist of mixed data types (discrete and continuous attributes) and real-world datasets with highly imbalanced data distributions. Existing methods hardly model such scenarios due to the multimodal distributions in the decentralized continuous columns and highly imbalanced categorical attributes of the clients. To solve these problems, we propose a federated generative model for decentralized tabular data synthesis (HT-Fed-GAN). There are three important parts of HT-Fed-GAN: the federated variational Bayesian Gaussian mixture model (Fed-VB-GMM), which is designed to solve the problem of multimodal distributions; federated conditional one-hot encoding with conditional sampling for global categorical attribute representation and rebalancing; and a privacy consumption-based federated conditional GAN for privacy-preserving decentralized data modeling. The experimental results on five real-world datasets show that HT-Fed-GAN obtains the best trade-off between the data utility and privacy level. For the data utility, the tables generated by HT-Fed-GAN are the most statistically similar to the original tables and the evaluation scores show that HT-Fed-GAN outperforms the state-of-the-art model in terms of machine learning tasks.


Introduction
With the development of digital technology, tabular data, such as electronic health records (EHR) or financial data, have been increasingly collected and used. The data types are each owned by different parties and horizontally partitioned between multiple parties. Data holders sometimes intend to publish their data for various purposes such as academic research [1], service improvements [2], or public competitions [3]. Consider a scenario of data publishing as follows: There are several hospitals that have some data on different individuals with the same disease. A single hospital has a small amount of data, which it hopes to jointly release to a third party for machine learning tasks such as predicting heart disease. However, considering the privacy and security concerns, the hospital does not want to jeopardize the privacy of the individuals contained in the released data. In the era of big data, similar scenarios have become more and more common. Thus, there is a strong motivation to develop privacy-preserving data-publishing (PPDP) methods for distributed tabular data. To address these challenges, in this paper, a novel solution for privacy-preserving decentralized tabular data synthesis is proposed called HT-Fed-GAN. In HT-Fed-GAN, a federated variational Bayesian Gaussian mixture model (Fed-VB-GMM) is designed to learn the global multimodal distributions from decentralized continuous columns and then augment the representation according to the learned modes, which can effectively eliminate the mode collapse problem. Then, federated conditional one-hot encoding is used to encode the categorical columns and the conditional sampling method is proposed to rebalance the imbalanced categorical attributes across the clients. Finally, a privacy consumption-based federated conditional GAN is used to model the decentralized tables.
For our experiments, we evaluate HT-Fed-GAN with the state-of-the-art federated generative model on five real-world datasets. The experimental results show that the data generated by HT-Fed-GAN are statically similar to real-world data. For the machine learning tests, our method outperforms the state-of-the-art model. Furthermore, we customize the membership inference attack method presented in [16] to attack HT-Fed-GAN.
The experimental results show that HT-Fed-GAN achieves the best trade-off between the privacy level and data utility.
The main contributions of our paper are as follows: 1.
We propose HT-Fed-GAN, a novel, practical federated conditional generative approach for synthesizing decentralized tabular data. To the best of our knowledge, this is the first work on privacy-preserving data synthesis on decentralized tabular datasets.

2.
We propose the federated variational Bayesian Gaussian mixture model for extracting the multimodal distributions from decentralized tables without disclosing data, which can effectively eliminate mode collapse. 3.
We propose the federated conditional one-hot encoding and conditional sampling method for highly imbalanced categorical columns, which preserves the real distributions of the categorical attributes. 4.
We conduct experiments on five real-world datasets and the results demonstrate that HT-Fed-GAN offers desirable data with a high privacy level.
The rest of this paper is organized as follows. In Section 2, we discuss the related works. Then, in Section 3, we describe the preliminaries of our HT-Fed-GAN. Then, we introduce the proposed HT-Fed-GAN in Section 4 and evaluate the proposed HT-Fed-GAN in Section 5. Finally, we conclude this paper in Section 6.

Related Works
Research on privacy-preserving data synthesis has received widespread attention recently. A promising method is using GAN to synthesize fake data similar to real-world data [4,8,17]. For tabular data, medGAN [17] uses a pre-trained auto-encoder to convert discrete values into a continuous representation. The decoder of the auto-encoder translates the continuous output of the generator into the format of the real-world tabular data. TableGAN [4] adds a classifier based on DCGAN to improve the quality of the synthetic table and the privacy penalty in the loss function for data privacy. CTGAN [12] is a stateof-the-art work for synthesizing tabular data, which proposes mode-specific normalization, a conditional generator, and training using sampling strategies to solve the problems of multiple modes in continuous columns and categorical imbalances in discrete columns of tabular data. These studies have been successfully applied to the synthesis of tabular data in centralized settings but failed in decentralized scenarios.
In a decentralized setting, there are no GAN-based methods for tabular data. Federated GANs, such as MD-GAN [18] and FeGAN [19,20], train a federated GAN on the distributed data. However, these methods are vulnerable to privacy attacks [16]. To protect data privacy, DP-FedAvg-GAN [8] and FL-GAN [21] introduce differential privacy [22] into the training process. Unfortunately, these methods cannot be applied to tabular data directly because the problems of multiple modes in continuous columns and categorical imbalances in discrete columns lead to model collapse.

Preliminaries
In this section, we describe the background of our proposed methods. First, we describe the variational Bayesian Gaussian mixture model (VB-GMM). This algorithm is the basis of the Fed-VB-GMM. Second, we introduce the differential privacy (DP) technique. This technique is the basic technology of the proposed privacy consumption-based federated conditional GAN. Finally, we present the membership inference attack. This attack method is used to verify the privacy level of our HT-Fed-GAN. Table 1 summarizes the frequently used notations in this paper. Notation Description The mixture weight of the k-th Gaussian distribution.
The k-th Gaussian distribution with mean µ k and variance Σ k . p(x|π, µ, Σ) A Gaussian mixture model.

Dir(α)
A Dirichlet distribution with parameter α. Λ The inverse matrix of Σ.

W(W, ν)
A Wishart distribution with parameters W and ν. D The neighboring dataset of D. , δ Parameters of a differential privacy algorithm.
The representation of the j-th record after normalization. ⊕ The concatenate operation.

Variational Bayesian Gaussian Mixture Model
The Gaussian mixture model (GMM) is a probabilistic model, as shown in Equation (1).
where π k represents the mixture weight of the k-th Gaussian distribution and N(x|µ k , Σ k ) represents the k-th Gaussian distribution with mean µ k and variance Σ k . The GMM is represented as the sum of the number of K Gaussian distributions multiplied by the corresponding weight π k . Figure 2 shows an example of a GMM with three clusters [23]. There are four parameters π, µ, Σ, and K to be optimized when the GMM is applied to a dataset. The variational Bayesian Gaussian mixture model (VB-GMM) [24] is a model that applies the variational Bayesian method to estimate the parameters of the GMM [23]. Three distributions, Dirichlet, Gaussian, and Wishart, are introduced to be the prior distributions for π, µ, and Λ (inverse matrix of Σ), in the VB-GMM algorithm, respectively. These three distributions are shown in Equations (2)-(4).
where α is the parameter of the Dirichlet distribution, which represents the prior distribution of π; m and β are the parameters of the Gaussian distribution, which represent the prior distribution of µ; and W and ν are the parameters of the Wishart distribution, which represent the prior distribution of Λ. The parameters of the GMM can be calculated using the variational EM algorithm [23][24][25].

Differential Privacy
Differential privacy [22,26] is a mathematical algorithm that indicates the level of privacy protection of individuals in a dataset.
Definition 1 (Neighboring Dataset). Two datasets D and D are called neighboring datasets if one can be obtained from the other by either adding or removing one record. This is denoted as Definition 2 (Differential Privacy [26]). A randomized algorithm A has ( , δ)-differential privacy if for any output subset S and two neighboring datasets D and D where A(D) and A(D ) are the outputs of the algorithm for the neighboring datasets D and D , respectively, and P is the randomness of the noise in the algorithm.
In practice, we employ a randomized mechanism A that ensures DP for any deterministic vector-valued computation function h: D → R n . More specifically, mechanism A covers the sensitivity of h by injecting noise into the output of h. The sensitivity measures the maximum degree of variation in any pair of inputs d and d .

Definition 3 (Sensitivity). The sensitivity of a function h for any pair of inputs
where . 2 represents the l 2 -norm. Based on the sensitivity of h, we can design the degree of noise to achieve local differential privacy.
The post-processing of differential privacy allows us to impose differential privacy on the discriminator and the generator will have the same level of privacy as the discriminator. The proof of the post-processing can be found in [26].

Membership Inference Attack
A membership inference attack [27] is a new attack method against machine learning models, which mainly infers the target samples in the training datasets according to the output of a target machine learning algorithm. It is assumed that an attacker has black-box access to the target model and knows the details of the target model. However, in PPDP, the attacker can only obtain the shared data and knows nothing about the target model. Therefore, we adopt the membership inference attack method presented in [16] to attack our HT-Fed-GAN in this paper. The attacker uses the fake data generated by the target generation model to train a GAN model. The discriminator of the trained GAN is used as an attack model to infer whether the data held by the attacker is the training sample of the target model.

HT-Fed-GAN
As shown in Figure 3, the pipeline of the HT-Fed-GAN mainly consists of two steps: data encoding and federated GAN training.

1.
Data Encoding. To eliminate the multiple-mode synthesis problem, the federated variational Bayesian Gaussian mixture model (Fed-VB-GMM) (Algorithm 1) is de-signed to learn the multimodal distributions in continuous columns from each client. Then, the continuous columns are encoded to become input representations using Equation (24), which uses the extracted multimodal distributions as prior knowledge. In addition, a federated conditional one-hot encoding method is proposed to encode the discrete columns using Equation (25). Finally, each sample in the decentralized tables is represented as the representation shown in Equation (26).

2.
Federated GAN Training. To alleviate the problem of categorical imbalances across the clients, a conditional sampling method is proposed to rebalance the categorical distributions during federated training. In addition, to prevent the privacy leakage caused by the federated conditional GAN, a privacy consumption-based federated conditional GAN training algorithm (Algorithm 2) is presented to flexibly control the privacy level of GAN.

Variational E step Variational M step
Global update and convergence check  In the data encoding process, each sample in the decentralized tables is encoded as r j by the proposed Fed-VB-GMM and the federated conditional one-hot encoding method. Then, in the federated GAN training process, the conditional sampling method and a privacy consumption-based federated conditional GAN training algorithm are designed to model the decentralized tabular data.

Federated Variational Bayesian Gaussian Mixture Model
In a centralized setting, the state-of-the-art model [12] performs the VB-GMM [24] to learn a GMM in a continuous column and then perform a mode-specific normalization in this column, which can effectively eliminate mode collapse. However, there is no global overview of data statistics in federated learning and the multimodal distributions cannot be learned further.
Therefore, we propose Fed-VB-GMM to extract multimodal distributions in continuous columns from clients. As shown in Algorithm 1, we initialize all parameters randomly on the server and then send them to each client (Step 1 of Algorithm 1). Fed-VB-GMM consists of two steps: the variational E step and the variational M step. The purpose of the variational E step is to estimate the responsibilities r nk , which indicates the probability that data belong to the k-th mode. First, we estimate the mixture weight π and precision Λ, as shown in Equations (8) and (9). ψ(.) in Equation (8) represents the digamma function [28]. Then, we calculate the responsibilities r nk (Equation (10)) and the mean of the k-th mode, which is marked asx k (Equation (11)), locally because they need private data. Finally, each client encrypts r nk andx k using homomorphic encryption [29] and then sends them to the server.

Algorithm 1: Fed-VB-GMM Algorithm
Step 2 (Variational E step): Client • Estimate r nk by π and Λ • Encrypt r nk andx k using homomorphic encryption and send to server Step 3 (Variational M step): Server Step 4 (Variational M step): Client • Encrypt S k , L , W k and send to server Step 5 (Convergence Check): Server • Compute lower bound L by L and W k • Compare the lower bound L in this iteration with the last L last . If the difference between them is less than a predefined value , the process ends. Otherwise, return to Step 2.
In the variational M step, we compute some parameters according to the updated parameters in the variational E step. We update N k ,x k , α k , β k , ν k , and m k , as shown in Equations (12)- (17), where N k andx k represent the global responsibility and the mean of the k-th mode, respectively. We found that the covariance matrix of the k-th mode S k requires private data and W k and L are hardly computed under homomorphic encryption, where L represents the lower bound of the Gaussian mixture model (GMM). To update these three parameters, we compute the complex part on the client side (Equations (18)- (20) and then merge them on the server (Equations (21)-(23)). Γ(.) in Equation (19) represents the gamma function [30]. We compare the lower bound of this iteration with that of the previous round; if the difference between them is less than a predefined value , the process ends. Otherwise, return to Step 2.

Algorithm 2:
Training Algorithm of HT-Fed-GAN 1 Server training loop: 2 parameters: total number of clients N, total number of rounds T, initialize generator θ 0 G , discriminator θ 0 D , client set C, clip norm bound S, noise scale σ, target epsilon , target delta δ When the algorithm converges, we use the extracted global Gaussian mixture model to encode the continuous columns as follows: where , c i,j is the continuous data; η k and φ k are the mean and standard deviation of the k-th Gaussian mode, respectively; b i,j represents the one-hot representation of Gaussian mode, which c i,j follows; N c is the number of continuous columns; and ⊕ refers to the concatenate operation.

Federated Conditional One-Hot Encoding and Conditional Sampling
Federated conditional one-hot encoding. In general, discrete columns are commonly normalized using the one-hot encoding method. In federated learning, when each client does not expose its data to others, one-hot encoding can only be performed locally. However, since different clients collect data independently, some categories in one client may not be present in other clients. As shown in Algorithm 3, after local one-hot encoding, the local feature header H c and its frequency F c are collected and sent to the server. The server then aligns the features and sends the intersection of the feature header H global back to each client. Meanwhile, the global frequency of each feature F global is computed and sent back to each client. Upon receiving the aligned features, each client creates additional features if necessary and uses zeros to represent the non-presence of the related features. After federated one-hot encoding, the representation of the categorical columns is where d i,j is a one-hot representation of a categorical value and N d is the number of categorical columns. ⊕ refers to the concatenate operation. The final encoding of the j-th record in the tabular data with N c continuous columns and N d categorical columns is Conditional sampling. In the training phase, we propose conditional sampling to rebalance the decentralized categorical attributes. First, we randomly select a categorical column m with equal probability. Then, we calculate the probability distribution of each feature f in the selected column m according to its global frequency F global [m][ f ]. Finally, we sample a feature based on the logarithm of its probability. Compared to random sampling, conditional sampling enables the conditional generator to learn the true global distribution. To represent the above condition for the conditional generator and preserve it in the output, we use a conditional vector [12] to encode the selected category during conditional sampling. Each conditional vector represents a single category.

Privacy Consumption-Based Federated Conditional GAN
In HT-Fed-GAN, we customize a federated conditional GAN to model the decentralized tabular data. Existing architectures of decentralized GANs can be classified into two categories: (1) multi-discriminator architectures [8,9], and (2) federated GAN architectures. In the multi-discriminator architecture, multiple discriminators are distributed across the clients and a single generator is trained on the server following the aggregated discriminator. In contrast to the multi-discriminator architecture, both the discriminator and generator are trained locally and aggregated on the server in the federated GAN architecture. The federated GAN architecture has multiple benefits, ranging from better communication overhead, scalability, and training stability compared to the multi-discriminator architecture. Therefore, as shown in Algorithm 2, we adopt the federated GAN architecture as the base architecture of our HT-Fed-GAN.
To prevent the discriminator from memorizing private data during training, we apply user-level differential privacy [26] during the aggregation of the discriminator. Unlike the existing methods [8,9] that use fixed-size federated rounds, where it is difficult to control the privacy level of the generative model, our HT-Fed-GAN introduces privacy consumption as a condition to control the privacy level. We calculate the privacy cost of each federated training round and stop the learning when it meets the pre-defined privacy consumption.
Since tabular data do not have a local structure, we use the fully-connected network in both the generator and discriminator. In the discriminator, the input to the first layer is a vector with a length of 256 that represents one real or fake record. The input of the generator is a latent vector with a length of 128 that is randomly sampled from the unit space. Multiple hidden layers convert the input to a vector with a length of 256, which represents a synthetic record. The dimensions of the input and output of our model are configurable and able to learn from enough attributes. The architecture of the GAN model follows the architecture of CTGAN [12]. We train the model using the objective of WGAN [31] with a gradient penalty.

Privacy Analysis
In this paper, we follow the typical assumption of federated learning, that is, the clients and the server are honest but curious. As shown in Figure 3, the overall pipeline of our HT-Fed-GAN mainly includes two steps: data encoding and federated GAN training. First, during the data encoding process, the Fed-VB-GMM is designed to compute the multimodal distributions of the continuous columns from each client, as shown in Algorithm 1. In addition, the federated conditional one-hot encoding algorithm is proposed to compute the global frequency of the discrete columns from each client, as shown in Algorithm 3. Therefore, the main risk of privacy leakage in the data encoding process is that the statistical information of the clients is leaked during the global distribution calculation. To protect the local private information from being disclosed, the homomorphic encryption [29] technique is used to compute the global statistics under the coordination of the server. Since there are only addition operations on the server, as shown in Algorithm 1 and Algorithm 3, the additively homomorphic encryption scheme in [29] is applied to the HT-Fed-GAN. In this scheme, the participants jointly set up the publish key pk and the secret key sk. The secret key sk is held by each client and kept secret from the server. Each client establishes a different TLS/SSL secure channel to communicate with the server to ensure the integrity of the homomorphic ciphertext.
Second, during the federated GAN training process, HT-Fed-GAN strictly follows the standard federated learning training protocol [11]. Compared with other deep learning models, GAN tends to memorize local training data, which may lead to private information leakage during data synthesis. Thus, we propose a privacy consumption-based federated training method to build a privacy-preserving GAN model. As shown in Algorithm 2, Gaussian noise is injected into the parameters of the discriminator from each client to achieve user-level differential privacy before model aggregation. Following [26], we set the standard deviation of Gaussian noise as σ = Σ √ 2ln(1.25/δ) to satisfy ( , δ)-differential privacy, where Σ represents the l 2 sensitivity of the target parameters and and δ are the privacy budget of the differential privacy. Since the generator's loss is a function of the discriminator, according to the post-processing property of differential privacy, the generator has the same privacy level as the discriminator. Meanwhile, to control the privacy level of the federated GAN flexibly, the privacy consumption is calculated after each round of federated training and the federated training is stopped until the privacy consumption reaches the privacy budget.
Overall, there is no information leakage during the data encoding process and federated GAN training process in HT-Fed-GAN. Note that the research on the vulnerabilities of federated learning is a hot topic including gradient leakage [32], compromised clients/servers [33], adversarial attacks [34], statistical heterogeneity [35], etc., but these are beyond the scope of this paper.

Membership Inference Attack for HT-Fed-GAN
To attack the proposed HT-Fed-GAN, we customize the membership inference attack method presented in [16]. In a scenario of releasing data to the public, we attack the entire HT-Fed-GAN model rather than the discriminator [4] because there is no knowledge other than the synthetic data. As shown in Figure 4, the overall procedure of the customized membership inference attack is as follows: 1.
Let M target be a target HT-Fed-GAN that an attacker wants to attack. Note that access to the target model M target is not allowed because it is not shared with the public after being trained.

2.
Obtain the released synthetic table generated by M target , which is the positive training sample for the attack model, as shown in Figure 4. Each record of the synthetic  Figure 4. The overall procedure of the customized membership attack method.

Experiments
In this section, we describe the experimental environments, datasets, and results. We evaluate HT-Fed-GAN on five commonly used machine learning datasets from five domains: personal information, network intrusion, credit cards, forest cover types, and clinical.

Environment Setup
We set up our experiment in a distributed computing network equipped with GPUs. There are four nodes in total, one representing the server, and the other three representing clients (Client 1-Client 3) from different organizations in the real world. Each node was configured with an NVIDIA Tesla P100 GPU card. Our code implementation was based on PyTorch [36] 1.4.0, PySyft 0.2.5, Python 3.7.6. For simplicity, we assumed that all clients participated in learning.

Dataset
We used five datasets, as shown in Table 2. Each dataset contained continuous and discrete columns that can be used for machine learning tasks, as follows:

•
The Adult [1] dataset comes from the 1994 Census database and contains a lot of personal information (such as income, work hours per week, education, and so on).
The income attribute represents the salary for each individual, which is greater than 50 K or less than 50 K. Thus, we performed a binary classification test using this attribute. For the regression tasks, we chose the hours_per_week attribute as the target label, which represents the number of work hours per week.

•
The Intrusion [15] dataset comes from a network intrusion detection competition and consists of a wide variety of intrusions simulated in a military network environment. The label attribute represents the type of intrusion, which contains seven classes. Thus, we performed multi-class classification tests using the label attribute. For the regression tasks, we used the count attribute.

•
The Credit [37] dataset comes from a Kaggle competition named Credit Fraud Detection and contains transactions made using credit cards in September 2013 by European cardholders. The label attribute indicates whether the record is fraudulent. The amount attribute has the information on the transaction amount. Thus, we performed binary classification and regression tests using the two attributes, respectively. • The Cover-Type [38] dataset is derived from the US Geological Survey (USGS) and US Forest Service (USFS). The label attribute represents the forest cover type and contains seven classes. The elevation attribute is the elevation of the forest. Thus, we performed multi-class classification and regression tests using this dataset.

•
The Health [39] dataset is about cardiovascular diseases (CVDs),and contains 12 features used to predict mortality due to heart failure. The DEATH_EVENT attribute represents whether the patient died during the follow-up period. Thus, we performed a binary classification test using this attribute and used the age attribute for the regression tests. Table 2 shows the statistical information of the experimental dataset including (from left to right) the dataset, number of records, number of attributes, classification type, attribute for the classification tests, and attribute for the regression tests. We split the original dataset into two parts: a training set (70%) and test set (30%). To construct a biased training subset, we split the training set into three parts as the client dataset based on the split attribute, as shown in Table 2.

Evaluation Method
We evaluated the data synthesis method by comparing the privacy and utility of the synthesized data. The evaluation metrics are detailed below.

•
The data-utility-related evaluation metrics are as follows: -Cumulative distributions: We compared the cumulative distributions for the same attribute between the original data and synthesized data [4]. This mainly compared the statistical similarity between the original data and the synthesized data. -Machine Learning Score: SDGym [12] is a framework to benchmark the performance of synthetic data generators. We used SDGym to train the machine learning models on the synthetic data and test the trained model on the original dataset. In the training process, the machine learning models and their parameters were fixed in each dataset. We evaluated the performance of the classification tasks using the F1 score (binary classification tasks) or the macroand micro-F1 scores (multi-class classification tasks) and the regression tasks using the mean absolute error (MAE). For each dataset, we performed four classifiers or regressors to evaluate the performance of the machine learning task.

•
The privacy-related evaluation metrics are as follows: -Membership Inference Attack: We customized the black-box membership inference attack presented in [16] to evaluate the privacy of our HT-Fed-GAN without any auxiliary information other than the synthetic data. The detailed procedure is described in Section 4.5.

Baseline Model
To the best of our knowledge, there is no federated generative model for decentralized tabular data synthesis. Thus, we used DP-FedAvg-GAN [8] as the baseline model, which is the state-of-the-art work for synthesizing decentralized images. To generate the tabular data, we customized DP-FedAvg-GAN to model the structured data.

Parameter Setup
There are two parameters and δ to control the level of privacy in HT-Fed-GAN. Following [4], there are two types of privacy levels in HT-Fed-GAN: the low-privacy setting and high-privacy setting. The low-privacy setting indicates that differential privacy was not applied to the federated GAN training process, whereas the high-privacy setting denotes that differential privacy was used in the federated GAN training process. In this paper, in the high-privacy setting, we set the privacy budget as = 0.5 and δ = 9.8. The values of the privacy budgets in DP-FedAvg-GAN [8] were the same as those used in HT-Fed-GAN. Figure 5 presents the cumulative distributions of several attributes from the four datasets: Adult (work class), Health (platelets), Cover-Type (elevation), and Intrusion (label). For the Credit dataset, no matter how hard we tried, DP-FedAvg-GAN experienced mode collapse. Therefore, we did not show the statistical comparison of the Credit dataset. In Figure 5, the orange lines represent the synthetic data and the blue lines are the real values in the original data. Figure 5. Cumulative distributions. Cumulative distributions of attributes (workclass, platelets, elevation, and label for each dataset, respectively) by HT-Fed-GAN with low-privacy setting, highprivacy setting, and DP-FedAvg-GAN. Blue lines represent the original attributes and orange lines represent the synthetic data. The low-privacy setting indicates that differential privacy was not used in HT-Fed-GAN, whereas the high-privacy setting denotes that differential privacy was used in HT-Fed-GAN and the privacy budgets were = 0.5 and δ = 9.8. Figure 5a-d,m-p are the cumulative distributions of the categorical columns. For the categorical columns, the cumulative distributions of HT-Fed-GAN were mostly close to the original tables, which means that our HT-Fed-GAN can capture the real distributions of the highly imbalanced categorical attributes among the clients. Figure 5e-l show the cumulative distributions of the continuous columns. HT-Fed-GAN shows a more realistic statistical distribution than DP-FedAvg-GAN, which demonstrates that HT-Fed-GAN is superior to DP-FedAvg-GAN in terms of synthesizing the multimodal distributions in decentralized continuous columns.

Cumulative Distributions
To summarize, HT-Fed-GAN with the low-privacy setting showed the best synthesis performance. In almost all the datasets, the synthetic data of HT-Fed-GAN were statistically similar to the real data. DP-FedAvg-GAN showed the worst performance because it was not designed for decentralized tabular data.

Machine Learning Scores
In this section, we use several classification and regression algorithms to evaluate the quality of the synthetic data.

Binary Classification
We compared HT-Fed-GAN and DP-FedAvg-GAN on the Adult, Credit, and Health datasets, respectively. We used the AdaBoost, decision tree, logistic regression, and multilayer perceptron classifiers and their recommended parameter settings from the scikit-learn web pages to test the binary classification tasks. The experimental results are shown in Figure 6, where the x-axis represents the F1 score of the classifier trained using the original tables and the y-axis is the F1 score of the classifier trained using the synthetic tables. The diagonal line indicates that the performance of the classifiers trained on the synthetic data is the same as that on the real-world data. Table 3 shows the average F1 scores of the classifiers trained on each synthetic table and the original table. The y-axis is the F1 score of the classifier trained using the synthesized table. The low-privacy setting indicates that differential privacy was not used in HT-Fed-GAN, whereas the high-privacy setting denotes that differential privacy was used in HT-Fed-GAN and the privacy budgets were = 0.5 and δ = 9.8. Table 3. Average F1 scores of binary classification tests on three datasets, where "-" represents mode collapse. The low-privacy setting indicates that differential privacy was not used in HT-Fed-GAN, whereas the high-privacy setting denotes that differential privacy was used in HT-Fed-GAN and the privacy budgets were = 0.5 and δ = 9.8.

Dataset
Origin Table  HT From Figure 6 and Table 3, we can see that HT-Fed-GAN outperformed DP-FedAvg-GAN in terms of the binary classification tests. Figure 6a shows the performance on the Adult dataset and the average F1 score of HT-Fed-GAN exceeds that of DP-FedAvg-GAN by 25% with the low-privacy setting and 23% with the high-privacy setting. Figure 6b shows the results on the Credit dataset, where DP-FedAvg-GAN experienced mode collapse no matter how hard we tried. Because on the Credit experiment the data of each client were non-IID, as shown in Table 2, this resulted in mode collapse during the training process of DP-FedAvg-GAN. Surprisingly, there were some classifiers trained on the data generated by HT-Fed-GAN that performed better than those trained on the original data, as shown in Figure 6b,c. This means that our HT-Fed-GAN not only maintained a similar distribution to the real data but also extracted some prominent features that were helpful for machine learning tasks.

Multi-Classification
We performed multi-class classification tests on the Intrusion and Cover-Type datasets. Figure 7 shows the experimental results of HT-Fed-GAN compared to DP-FedAvg-GAN. We followed the same plotting method as that shown in Figure 6, which shows the results. Table 4 shows the average macro/micro-F1 scores of the multi-class classification tests. The low-privacy setting indicates that differential privacy was not used in HT-Fed-GAN, whereas the high-privacy setting denotes that differential privacy was used in HT-Fed-GAN and the privacy budgets were = 0.5 and δ = 9.8. Table 4. Average macro-and micro-F1 scores of multi-classification tests. The low-privacy setting indicates that differential privacy was not used in HT-Fed-GAN, whereas the high-privacy setting denotes that differential privacy was used in HT-Fed-GAN and the privacy budgets were = 0.5 and δ = 9.8.

Dataset
Origin As shown in Figure 6 and Table 4, in almost all cases, HT-Fed-GAN with the lowprivacy setting achieved the best performance. In the two datasets, The average macro-F1 scores of HT-Fed-GAN exceeded those of DP-FedAvg-GAN by 23-28% under the lowprivacy setting. This shows that our method outperformed DP-FedAvg-GAN in terms of categorical attribute modeling.

Regression
The regression tests followed the same training methods as the classification tests. We used the following four regression algorithms for each dataset: Lasso regression, multi-layer perceptron regression, gradient-boosting regression, and random forest regression. Recall that we used the MAE as the base metric to evaluate the regression models. Figure 8 shows the experimental results of the regression tests. We followed the same plotting method as that used in the classification tests. The diagonal line indicates that the performance of the regression algorithm trained on the synthetic data was equal to that of the real-world data. Table 5 shows the average MAE of the regression tests.
In Figure 8a-d, it can be seen that HT-Fed-GAN outperformed DP-FedAvg-GAN in terms of the MAE score. However, in Figure 8e, it can be seen that HT-Fed-GAN was inferior to DP-FedAvg-GAN both in the low-privacy setting and high-privacy setting. This shows that multimodal distribution augmentation was not dominant in the small datasets. Moreover, Table 5 shows that HT-Fed-GAN with the low-privacy setting had better performance than HT-Fed-GAN with the high-privacy setting, except for the Intrusion dataset. Table 5. Average MAE scores of regression tests, where "-" represents mode collapse. The low-privacy setting indicates that differential privacy was not used in HT-Fed-GAN, whereas the high-privacy setting denotes that differential privacy was used in HT-Fed-GAN and the privacy budgets were = 0.5 and δ = 9.8.

Dataset
Origin Table  HT  The y-axis is the MAE score of the regression algorithm trained using the synthesized table. The low-privacy setting indicates that differential privacy was not used in HT-Fed-GAN, whereas the high-privacy setting denotes that differential privacy was used in HT-Fed-GAN and the privacy budgets were = 0.5 and δ = 9.8.

Results for Privacy
We evaluated our HT-Fed-GAN and the baseline model for privacy using the membership inference attack. For each dataset, we trained four attack models following the procedure described in [16]. Each testing dataset comprised 50% original data labeled as 1 and the rest were synthetic data labeled as 0. We used the F1 score as the metric to evaluate the performance of the attack. If the F1 score was lower than 0.5, it meant that the attack was successful. Table 6 shows the results of the membership inference attack. HT-Fed-GAN with the high-privacy setting achieved the best performance in defending membership attacks. As the level of privacy increased, the attack performance decreased. We found that the privacy of DP-FedAvg-GAN was similar to that of HT-Fed-GAN in the low-privacy configuration. However, the quality of the synthesized data was far lower than that of HT-Fed-GAN, as described in Sections 5.2 and 5.3. Therefore, our method showed the best trade-off between the privacy level and utility. Table 6. F1 scores of membership inference attack, where "-" represents mode collapse. The lowprivacy setting indicates that differential privacy was not used in HT-Fed-GAN, whereas the highprivacy setting denotes that differential privacy was used in HT-Fed-GAN and the privacy budgets were = 0.5 and δ = 9.8.

Multimodal Distribution Study
To evaluate the effectiveness of the Fed-VB-GMM in multimodal distribution extractions, we compared the average log-likelihood of the Fed-VB-GMM and VB-GMM on the Adult dataset (age attribute). The average log-likelihood of the data was compared by varying the number of clusters. The average log-likelihood is the degree of likelihood for the probability distribution of the data and is represented by Equation (27). Figure 9 shows the evaluation results. The number of clusters varied from 2 to 32. The blue line represents the VB-GMM and the red line represents the Fed-VB-GMM. Figure 9 shows that the performance of the Fed-VB-GMM was similar to that of the VB-GMM. Of course, there were still some minor differences when the number of clusters increased. These differences were because the homomorphic encryption led to a precision loss during the float-point number computation.  Table 7 shows the fake data generated by HT-Fed-GAN with the low-privacy setting on the Adult dataset. Some records of three clients from the original dataset are shown in Table 8 after choosing a subset of columns for space consideration. In Table 8, rows 1-2 are from Client 1, rows 3-4 from Client 2, and the last two rows are from Client 3. As shown, there is no one-to-one relationship between Table 8 and Table 7, and the real records have very different values from the synthesized records. It is almost impossible to re-identify the original information from the synthesized data.

Conclusions
In this paper, we studied the problem where existing solutions suffer from mode collapse and private leakage from membership inference attacks on privacy-preserving data synthesis (PPDS) for tabular data in a distributed multi-party environment. We propose HT-Fed-GAN, a novel federated generative method for horizontally partitioned tabular data synthesis in multiple parties. In HT-Fed-GAN, the multimodal distribution in the decentralized tables extracted by the proposed Fed-VB-GMM can effectively eliminate mode collapse. To prevent private leakage, HT-Fed-GAN employs differential privacy during federated training rounds. To the best of our knowledge, this is the first work for privacy-preserving data synthesis on horizontally partitioned decentralized tabular datasets. We experimentally evaluated our HT-Fed-GAN on five real-world datasets from five different domains and showed were most statistically similar to the original tables without revealing private information. In terms of the performance of the machine learning tasks, HT-Fed-GAN outperformed the state-of-the-art model.
In the future, we plan to extend our method to vertically partitioned decentralized tables and further improve the data quality of the federated generative model.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

GAN
Generative Adversarial Network GMM Gaussian Mixture Model VB-GMM Variational Bayesian Gaussian Mixture Model MAE Mean Absolute Error DP Differential Privacy