MaWGAN: A Generative Adversarial Network to Create Synthetic Data from Datasets with Missing Data

: The creation of synthetic data are important for a range of applications, for example, to anonymise sensitive datasets or to increase the volume of data in a dataset. When the target dataset has missing data, then it is common to just discard incomplete observations, even though this necessarily means some loss of information. However, when the proportion of missing data are large, discarding incomplete observations may not leave enough data to accurately estimate their joint distribution. Thus, there is a need for data synthesis methods capable of using datasets with missing data, to improve accuracy and, in more extreme cases, to make data synthesis possible. To achieve this, we propose a novel generative adversarial network (GAN) called MaWGAN (for masked Wasserstein GAN), which creates synthetic data directly from datasets with missing values. As with existing GAN approaches, the MaWGAN synthetic data generator generates samples from the full joint distribution. We introduce a novel methodology for comparing the generator output with the original data that does not require us to discard incomplete observations, based on a modiﬁcation of the Wasserstein distance and easily implemented using masks generated from the pattern of missing data in the original dataset. Numerical experiments are used to demonstrate the superior performance of MaWGAN compared to (a) discarding incomplete observations before using a GAN, and (b) imputing missing values (using the GAIN algorithm) before using a GAN.


Introduction
Missing data is a common problem and can arise due to a variety of reasons. Rubin [1] defines three types of missing data: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). Suppose that we have independent observations x i = (x i1 , . . . , x id ) T and put m ij = 0 if x ij is missing and 1 if it is present (we call m i = (m i1 , . . . , m id ) T the mask corresponding to x i ). The data are MCAR if for any j, m ij is independent of x i , it is MAR if m ij is independent of x ij but dependent on some x ik for k = j, and NMAR if it is dependent on x ij . We will assume that our dataset is MCAR.
A range of imputation methods exist to fill in missing values. Suppose that m ij = 0 (so variable j is missing from observation i), different methods for imputing x ij include • Using the mean of the non-missing x hj , h = i [2]. • Using a neighbourhood of x i to impute x ij . KNN uses the mean of non-missing x hj in the neighbourhood [3]. Hot deck imputation samples randomly from the non-missing x hj in the neighbourhood [4]. • Using a (parametric) regression model for x ij given x ik , k = j, built using complete observations. If the regression model includes a distribution for the error term, then we can use it to randomly impute x ij (see stochastic regression imputation [5]). • Using a (non-parametric) estimate of the conditional distribution of x ij given x ik , k = j, to sample from. The GAIN methodology (generative adversarial imputation nets [6]) is an example of this approach using a GAN architecture. While there have been recent advances in synthetic data generation due to the application of machine-learning models, missing data have not received much attention. Synthetic data generation is increasingly important in a range of applications, for example, to increase dataset volume or to anonymise sensitive datasets [8,9], and in practice often has to deal with missing data. A promising development for data synthesis has been the advent of generative adversarial networks (GANs) [10]. GANs use two neural nets, one to generate synthetic data, and the other to build a critic, which is used to train the generator (also called a discriminator). The generator and critic are trained iteratively, so that as the generator improves the critic becomes more discerning, allowing further refinement of the generator. Until now, missing data have been a problem for GANs as existing algorithms require complete observations, so users have to either first impute the missing data or just discard incomplete observations. In this paper, we propose a novel GAN algorithm that can directly train a synthetic data generator from datasets with missing values; to our knowledge this is the first such attempt. We called it MaWGAN (for Masked Wasserstein GAN). As with existing GAN approaches, the MaWGAN synthetic data generator generates samples from the full joint distribution. The novelty of our approach is a methodology for comparing the generator output with the original data that does not require us to discard incomplete observations, based on a modification of the Wasserstein distance. Moreover, our approach is easily implemented by incorporating into the critic masks generated from the pattern of missing data in the original dataset.

Theoretical Basis
MaWGAN builds on the WGAN-GP algorithm [11,12]. Let x 1 , . . . , x n ∈ R d be an i.i.d. sample from some (unknown) distribution P, and let G : (0, 1) d → R d be our generator. G takes a vector of i.i.d. U(0, 1) random variates and returns a vector with distribution Q say. The WGAN-GP critic calculates an estimate of the Wasserstein distance, so that the generator is trained to minimise the distance between P and Q as measured by the Wasserstein distance.
Let Π(P, Q) be the set of measures on R d × R d with marginals P and Q, then the Wasserstein distance is where f L is the Lipschitz constant of f . Let C : R d → R + be our critic. Let y 1 , . . . , y n be a sample from the generator G, and for i ∼ U(0, 1) put z i = i x i + (1 − i )y i , then we train the critic to maximise The key idea here is that the regularisation term will restrict the critic C to be close to a Lipschitz function with Lipschitz constant 1. Here, λ > 0 controls the degree of regularisation and can be tuned to improve the convergence of the critic.
We introduce a variation of the Wasserstein distance that incorporates a random mask to capture the effect of MCAR missing data. For our purposes a mask m = (m 1 , . . . , m d ) T is an element of {0, 1} d and a random mask is just a measure M on {0, 1} d . Given a data point x = (x 1 , . . . , x d ) T and a mask m, x j is treated as missing if and only if m j = 0. We define the M-Wasserstein distance as where represents pointwise multiplication. The following lemma shows that W M is equivalent to W in the topological sense (meaning they generate the same topology on the space of measures on R d ). The practical consequence of the lemma is that a sequence of measures Q i (representing a sequence of improving generators) will converge to P w.r.t. the Wasserstein distance if and only if they converge to P w.r.t. the M-Wasserstein distance.
and, thus, integrating M w.r.t. M, we get Here, the second line follows because we can view X M as a realisation of P projected onto the subspace corresponding to the non-zero co-ordinates of M, and similarly for Y M.
We approximate the M-Wasserstein distance analogously to the WGAN-GP approach (1). Let m i be the mask corresponding to data point x i , then using our previous notation, we train the critic to maximise Here, we interpret x i m i as replacing the missing values in x i with zeros, and y i m i replaces the corresponding values of y i with zeros.

Implementation
In this section, we explain the details of our MaWGAN implementation. Figures 1 and 2 illustrate the flow of information in a single training step for the generator and critic respectively. In both cases, we calculate a loss that measures the performance of the generator/critic. Given the loss we calculate its gradient w.r.t. the weights (parameters) of the generator/critic, then update the weights in the direction of the gradient. Note that the generator is minimising its loss, so takes steps in the direction of the negative gradient, while the critic is maximising its loss, so take steps in the direction of the gradient.
Given the current critic, updating the generator is straightforward. We feed an array of random numbers into the generator one row at a time, to obtain an array of synthetic data (each row represents an independent realisation). We then feed the synthetic data into the critic, one row at a time, to obtain a vector of performance evaluations, which we average to obtain our loss.  To train the critic we need two sets of inputs: a sample (or batch) from the original dataset and a synthetic dataset of the same size produced by the generator. From the original dataset, we generate a mask indicating which data are missing, which we use to both replace the missing data with zeros, and replace the corresponding entries in the synthetic data array with zeros. We also generate an interpolated data array, which is just a linear combination of the masked original and masked synthetic data. The relative weights given to the original and synthetic data are chosen independently for each row. Each row of the original and synthetic data are fed into the critic, each row of the interpolated data array is fed into the gradient of the critic, and these are averaged as per Equation (2) to give the loss.
The pseudocode (Algorithm 1) shows how the generator and critic steps are interwoven. Note that for each update step of the gradient, we perform several updates of the critic, as we wish to keep the critic as a good approximation of the M-Wasserstein distance.

17:
L i G ← C(G(u)) 18: end for 19: update θ G using negative gradient of L G (decreasing L G ) 21: end for We have observations x i ∈ R d for i = 1, . . . , n, which we collect into an n × d matrix X, where the i-th row of X is x T i . Let m i be the mask corresponding to x i and let M be the n × d matrix whose i-th row is m T i . G : (0, 1) d → R d is our generator and C : R d → R + our critic. Write θ G for the weights that parameterise the generator G, and θ C for the critic weights. It is θ G and θ C that we update when training G and C. The update steps require a learning rate α, which we don't explicitly include in our pseudocode.
In our algorithm we update the generator t G times, which we call epochs. For each epoch, the critic is updated t C times, and we use a batch of data size k. We will write σ ⊂ {1, . . . , n} for the batch and σ(i) for its i-th element. λ > 0 is the regularisation parameter for the critic loss, which also needs to be set before hand.

Numerical Testing
Datasets. To test the performance of MaWGAN, we used three datasets of varying sizes and complexities. The Iris and Letter datasets are well known and can be found, for example, in the UCI Machine Learning Repository [13]. The Welsh Index of Multiple Deprivation is less well known, but was used because it has a flavour of the sort of official data that users want to synthesise for data-privacy reasons: • The Iris dataset records the length and width of the sepals and petals of the flowers of three different iris species [14,15]. There were 150 observations of 4 numerical and 1 categorical variable (not used in this study).

• The Welsh Index of Multiple Deprivation (WIMD)
is the Welsh Government's official measure of relative deprivation in Wales (UK); we used the 2014 figures [16]. For 1904 separate regions, the WIMD has measures of income, employment, education, and health. One region had a missing value and was removed from the dataset, leaving 1903 observations of 11 numerical variables. • The Letter dataset was generated by Frey and Slate [17] and records 16 measured characteristics of images of the capital letters in the English alphabet. Letters were selected from 20 different fonts and randomly distorted a number of times; there were 20,000 observations of 16 numerical variables.
Simulated MCAR datasets. We generated eight additional versions of each dataset with 10%, 20%, . . . , 90% missing data. Points were removed at random with equal probability until the required percentage was reached. The additional datasets are nested in the sense that if an element is missing from one then it is missing from all versions with higher levels of missing data. By artificially removing data, we are able to compare the performance of our synthetic data generator with the complete dataset, even when it is trained with missing data.
Competing methodologies. MaWGAN was compared to two other approaches. The first is a two-step process where we apply the GAIN imputation method and then use WGAN-GP to train a generator on the completed data. The second alternative was to discard incomplete observations then use a WGAN-GP to train a generator on what remained. The number of remaining observations at the different levels of missingness are given in Table 2. Performance metrics. To assess the performance of the three methods, we used two metrics, the Fréchet distance F and the likeness score L introduced by Guan and Loew [18]. To evaluate the performance of a data synthesis method, we need a metric that compares the distributions of the real and synthetic data, rather than single observations. There is no single best way of doing this and a number of approaches have been suggested in the literature (see for example the reviews of Borji [19,20]). Most of these are tailored to image data; however, the two we chose are very general in application. We found that metrics for comparing distributions need a lot of data to give really consistent results, though the likeness score has proved better in this regard than the others we have looked at.
Suppose we have observations x 1 , . . . , x n from some distribution and observations y 1 , . . . , y m from a second distribution, then to calculate L, we first generate three auxiliary sets of information 1] be the Kolmogorov-Smirnov distance between A and B, namely the maximum absolute difference between the empirical cumulative distribution functions of A and B. The likeness score for our two sets of observations is then Note that L ∈ [0, 1] and the two sets of observations have likeness one if and only if they are identical, with lower scores indicating greater dissimilarity.
The Fréchet distance F is given by where µ x and Σ x are the sample mean and sample covariance matrix of x 1 , . . . , x n , and similarly for µ y and Σ y . Smaller values indicate greater similarity, with F = 0 if and only if the means and covariances are the same (which does not imply the samples are identical). It is common to calculate F not using the x i and y i directly but instead by first applying a feature extracting transform; in particular if the inception network is used then the resulting metric is called the Fréchet inception distance [21]. We do not do this in our case. In our application the x 1 , . . . , x n will always be one of the original three datasets, and the y 1 , . . . , y m will be synthetic data generated by one of our three methods-subject to varying degrees of missing data-with m = n. To reduce the variation due to sampling from the generator, we calculate F and L 100 times using different sets of synthetic data, then take the average for each.

Algorithmic Details
The MaWGAN, GAIN, and WGAN-GP algorithms were implemented in Python using the PyTorch library [22]. The MaWGAN and WGAN-GP implementations incorporated code publicly available on GitHub [23], and the GAIN implementation used the code provided by the original authors [6]. For both MaWGAN and WGAN-GP, the neural network architecture of both the generator and critic had five layers. For the generators, the input and output layers had nodes equal to the number of variables, and we used 150 nodes per hidden layer. For the critic, the input layer had nodes equal to the number of variables, output layer size 1, and we used 150 nodes per hidden layer. For training, we used t G = 15,000 epochs with t C = 5 training steps for the critic each time. We used a batch size of k = 30, a learning rate of α = 0.0001, and critic regularisation λ = 10.
An important practical observation is that when training a MaWGAN, the optimal tuning depends on the level of missing data. We found that as the level of missing data increases, the number of training steps for the critic in each epoch needs to increase (t C in the pseudo-code above). Formally, considering L i C (the component of the critic loss corresponding to observation i), we see that the variables that are masked do not contribute to the gradient of L i C w.r.t. the critic weights θ C . That is, the masking means that when updating θ C , observation i only contributes information about the distribution of its nonmissing variables. Thus, as is intuitively clear, the level of information available in each observation reduces as the level of missingness increases, and so we need to do more work to train the critic. If the critic does not get sufficient training in each epoch then the generator can converge too quickly to a lower-dimensional projection of the target distribution (so-called mode collapse).

Results
Because GAN training is stochastic, the performance of the resulting generator can vary. Accordingly for each combination of method, dataset and missingness we fitted the generator 20 times, calculating the likeness score and the Fréchet distance each time. The results are summarised in Figures 3 and 4. For each combination of method, dataset, and missingness, we give the average performance and a 95% confidence interval for the mean.  Looking at the likeness score, the results show that for these datasets MaWGAN performs consistently well with levels of missing data up to 50%. MaWGAN also performs significantly better than both the two-step method and the complete observations method with moderate to high levels of missing data, and never performed any worse than either alternative.
With respect to the Fréchet distance, the picture is not as one-sided, though overall, MaWGAN still performs best. All three methods give similar levels of performance with up to 30% missing data. For higher levels of missing data the complete observations method is poor, while MaWGAN usually outperforms the two-step method, but not always.
To get a better feel for the behaviour of each method, it is useful to directly compare the original data with a synthetic sample. In Figure 5, we feature the Iris dataset and use methods trained with 50% missing data. On the left, we have output from the MaWGAN, and on the right-from the two-step method. For each plot, we overlay the original data with a synthetic sample of the same size. We have four variables, and on the diagonal, we have for each a marginal density plot using a kernel smoother, and off the diagonal, we have pairs plots. Both methods have captured the location and scale of the data; however, the MaWGAN is noticeably better at picking up the bimodality. Original data compared with synthetic data from methods trained with 50% missing data: MaWGAN on the left and the two-step method on the right. In both cases, we overlay the original data with a synthetic sample of the same size. Marginal densities are given on the diagonals and pairs plots off the diagonal.

Discussion
MaWGAN is a proper generalisation of WGAN-GP, since in the absence of missing data it is exactly a WGAN-GP, yet it requires no more parameter tuning than a WGAN-GP. Moreover the masking step that implements MaWGAN is simple to add to existing code, and has a marginal impact on the running time (calculating the weight-gradient for the generator and critic remains the most expensive steps). In particular, MaWGAN can use existing GPU-optimised code, such as the Torch library. We note that our theory and implementation apply equally well to the original WGAN formulation as the WGAN-GP approach, though we would always recommend the latter, as we have found its approach to training the critic much more stable.
Our experimental results indicate that compared to WGAN-GP, for dealing with data missing completely at random (MCAR), MaWGAN has a superior performance to the alternatives of separately imputing missing data or discarding incomplete observations, particularly with high levels of missing data. The two-step method of using GAIN to impute missing data, then WGAN-GP to synthesise data, performed essentially the same as MaWGAN with low levels of missing data. However the two-step method requires the fitting and tuning of two models, so it is slower, more prone to fitting error, and inherently more variable due to the additional variability introduced in the training ofand subsequent sampling from-the GAIN.

Conflicts of Interest:
The authors declare no conflict of interest.