GAN-Based Tabular Data Generator for Constructing Synopsis in Approximate Query Processing: Challenges and Solutions



Introduction
Research and business today rely heavily on big data and its analysis. However, big data is stored in massive databases that are difficult to retrieve, analyze, share, and visualize using standard database query tools [1]. For data-driven systems, data exploration is imperative for making real-time decisions and understanding the knowledge contained in the data. However, supporting these systems can be costly, especially for big data. One of the most critical challenges posed by big data is the high computational cost associated with data exploration and real-time query processing [2]. To assist with the analysis of big data, several systems have been developed, such as Apache Hive, which typically takes a considerable amount of time to respond to analytical queries [3]. However, approximate results can sometimes be provided for a query in a fraction of the execution time, particularly for aggregation queries. This is because aggregation queries are typically designed to provide a big picture of a large amount of information without having to compute an exact answer [4]. The majority of analytical queries require aggregate answers (such as sum(), avg(), count(), min(), and max()) for a given set of queries (joined or nested queries) over one or more categories (group-by columns) on a subset (where and having clauses) of big data. Approximate Query Processing (AQP) comes to the rescue by identifying a summary of the population (a.k.a. a synopsis) for discovering trends and aggregate function results [5]. Existing AQP approaches can be classified into two primary categories: online aggregation and offline precomputed synopses. Offline techniques summarize the data distribution and return approximate results by running queries on these synopses. In contrast, online aggregation techniques progressively generate synopses and return approximate results while the data is being processed. The traditional approach in both categories uses the data distribution to
generate a subset of data with statistical methods such as sampling [2]. One novel technique for AQP is to take advantage of machine learning to further reduce execution time, improve accuracy, and support all types of aggregate functions. For instance, the DBEst query processing engine [6] trains models, notably regression models and density estimators, that provide accurate, efficient, and cost-effective responses to different types of aggregate queries. The Learning-based AQP (LAQP) [7] and ML-AQP [8] methods build machine learning models based on previously executed queries. The former builds an error model to predict each incoming query's sampling-based estimation error, whereas the latter trains models that learn patterns to predict future query results with a bounded error by applying prediction intervals constructed using Quantile Regression models. Deep Generative Models (DGMs) are employed to approximate complex, high-dimensional probability distributions of the population. Therefore, DGMs can be used to estimate each observation's probability and generate data synopses from the underlying distribution [9]. Thirumuruganathan et al.
[10] utilized Deep Generative Models in AQP with Variational Autoencoders (VAEs) to generate as many samples as required without accessing the underlying dataset. VAEs are a type of generative model in which auto-encoders produce new data from an interpretable latent space by encoding the input distribution. The Generative Adversarial Network (GAN) is another state-of-the-art algorithm that follows a direct implicit density model, sampling directly from the model-provided distribution [11] without estimating the data distribution [12]. Consequently, GANs appear to be a suitable option for AQP. The remainder of this paper is structured to methodically explore the intersection of synopsis creation in AQP and Generative Adversarial Networks (GANs). Section 2 explores the theoretical underpinnings of database synopses and AQP, alongside a technical exposition on tabular GANs, setting the stage for understanding their relevance and application. Section 3 identifies and discusses the inherent challenges in constructing synopses from relational databases, underscoring the need for innovative approaches. Section 4 proposes a GAN-based solution, demonstrating how tabular GAN-based generators can effectively meet these synopsis creation challenges. Section 5 details the evaluative metrics and methodologies for assessing the fidelity and utility of the generated synopses, including error estimation techniques. The paper concludes with Section 6, which synthesizes our findings and reflections on the potential of tabular GANs to enhance real-time decision-making in data-intensive environments.

Background
This section provides the necessary foundation for comprehending the fundamental principles underlying data synopses in AQP and the novel utilization of GANs in the context of tabular data.

Data Synopsis in Database
Query processing refers to the compilation and execution of a database query using a specific query language, such as SQL, in order to obtain the result of the requested query. Initially, the query parser validates the query to ensure that it has been properly stated. Afterward, the query optimizer adjusts the plan to provide a more effective query execution plan. Finally, the query evaluation and execution engine executes the query on the database and returns the results [13]. A traditional database system performs aggregate operations in batch mode: a query is submitted, the system slowly processes a huge amount of data, and then it returns the final result [4]. As a result, the primary concern of query processing is how to process queries efficiently given the available computational resources and time. Occasionally, it is impossible to provide exact results in a reasonable amount of time, and an approximate answer with some error guarantee would greatly assist users. In order to approximate a query plan outcome for complex join queries, the optimizer requires accurate estimates of the sizes of intermediate results, which in turn depend on accurate selectivity estimates. Consequently, data synopses can be used to estimate the number of results generated by a query by estimating the underlying data distribution [14].

Approximate Query Processing (AQP)
Approximate Query Processing (AQP) is a method that returns approximations of aggregate query answers using a data synopsis that closely replicates the actual data's behavior [15]. At a higher level of abstraction, AQP aims to calculate an answer that is close to the actual query result based on a data synopsis, a highly compressed and lossy version of the database [16]. Figure 1 shows the different phases of query processing; in AQP, the query is executed on a data synopsis rather than the actual data. In a cost-effective approach, approximation accuracy (and consequently completion time) is determined by the size of the data synopsis, that is, how much smaller the synopsis is than the original database [15]. This synopsis can be created using either offline or online techniques. An offline synopsis is built using existing data statistics and helps answer queries quickly, but it can involve more complex and resource-intensive methods. With offline methods, database optimization techniques like replication and indexing can be employed to refine the synopsis when the database changes [17]. On the other hand, an online synopsis allows for real-time query monitoring, giving users preliminary results that are refined as more data is processed and stopping once the results reach a satisfactory level of accuracy and confidence [4]. The online approach requires no a priori assumptions; however, in contrast to the offline approach, creating good data synopses online is much more difficult [17]. The Online Analytical Processing (OLAP) system is an example of such systems, and one of its key issues is the regular updating of aggregates to ensure that approximated answers are smooth and continuously improving. By constructing a concise and accurate synopsis of the underlying data distribution, the system consistently strives to reduce the time it takes to complete the task [2].

Synopsis Construction
There may be considerable differences in the structure of the synopsis, and it should be tailored to the problem being addressed. As an example, the AQP synopsis structure is likely to differ from that of data mining tasks such as change detection and classification [18]. AQP systems should generate an effective synopsis that can be applied to various data distributions and data types within different databases. It is common for big data to produce massive amounts of complex data in a streaming manner. Traditionally, streaming algorithms are evaluated based on three factors: running time, memory complexity, and approximation ratio [19]. Synopsis construction in data streams can be achieved using a variety of techniques:

Sampling methods: It has been demonstrated that sampling is a simple and effective method of obtaining approximate results with an error guarantee when compared with other approximate query processing techniques. A sampling estimation can be divided roughly into two stages. Initially, a suitable sampling method must be identified to construct a sampling synopsis from the original data set, and then a sampling estimator must be analyzed in order to determine its distribution characteristics [20].
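As a concrete illustration of the sampling approach, the sketch below builds a uniform-sample synopsis of a single numeric column and answers an approximate SUM() with a CLT-based confidence interval as the error guarantee. The column contents and all sizes are hypothetical, not taken from any system discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "sales" column of a large table (1M rows).
population = rng.gamma(shape=2.0, scale=50.0, size=1_000_000)

# Offline synopsis: a uniform random sample of the column.
n = 10_000
sample = rng.choice(population, size=n, replace=False)

# Approximate SUM() by scaling the sample mean, with a CLT-based
# 95% confidence half-width as the error guarantee.
N = population.size
est_sum = N * sample.mean()
half_width = 1.96 * N * sample.std(ddof=1) / np.sqrt(n)

true_sum = population.sum()
rel_err = abs(est_sum - true_sum) / true_sum
```

With a 1% sample, the relative error of the estimate is typically well under a percent, while only a fraction of the data is touched at query time.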
Histograms: In a histogram, the value range of an attribute is divided into K buckets of equal width, and the numbers of values falling within each bucket are counted [21]. Based on these statistics, the histogram can then be used to reconstruct the values of the entire dataset within each bucket using the most representative statistics for each bucket [2]. In real-world applications, multiple passes over a data stream could improve accuracy and performance, but this is not realistic. For this reason, one-pass, high-accuracy algorithms are required in order to generate data synopses [20]. A histogram is cheap to compute, since only one pass through the relation is required, but its precision is not always satisfactory [21].

Wavelets: In synopsis construction, wavelets are derived from wavelet transformations in signal processing, which decompose a function into a set of wavelets using a wavelet decomposition tree. In order to generate a synopsis of the data, the original data is decomposed n times using the approximation coefficient at each level of the tree to reach an abstraction of the data [22]. Wavelets are conceptually similar to histograms, but a wavelet transforms the data and attempts to compress its most expressive features, requiring more computation time, as opposed to a histogram, which generates buckets by analyzing a subset of the original data in a short amount of time [2].

Sketches: Sketches are a type of probabilistic data structure based on the frequencies of unique items in a dataset [23]. To construct these synopses, k random vectors can be selected and the data transformed by taking its dot product with those vectors [18].

Although this section introduced the basic methods for constructing synopses, many other techniques, such as clustering [18] and materialized views [24], can also be used to generate them. Traditional methods face many challenges relating to data type, structure, distribution, and query aggregation functions. Furthermore, synopses provide the most accurate summary when built over the entire data stream, yet retrieving the entire data set is inconvenient in real-time databases, as it changes over time. A discussion of the challenges associated with generating data synopses in relational databases is presented in the following subsection.
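To make the histogram technique concrete, the sketch below builds a one-pass equi-width histogram over a hypothetical "age" column and uses it to answer an approximate range COUNT (assuming uniformity inside each bucket) and an approximate AVG from bucket midpoints. Column name, bucket count, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
ages = rng.integers(18, 90, size=100_000).astype(float)  # hypothetical column

# Build a K-bucket equi-width histogram in one pass over the data.
K = 16
counts, edges = np.histogram(ages, bins=K)
midpoints = (edges[:-1] + edges[1:]) / 2

# Approximate COUNT(*) WHERE age < 40, assuming values are uniform
# inside each bucket (a partially covered bucket contributes a fraction).
threshold = 40.0
frac = np.clip((threshold - edges[:-1]) / np.diff(edges), 0.0, 1.0)
approx_count = float((counts * frac).sum())
exact_count = float((ages < threshold).sum())

# Approximate AVG(age) from bucket midpoints weighted by bucket counts.
approx_avg = float((counts * midpoints).sum() / counts.sum())
exact_avg = float(ages.mean())
```

The synopsis here is just `counts` and `edges` (a few dozen numbers standing in for 100,000 rows), which is the compression/precision trade-off the text describes.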

GAN-based Tabular Generator
GANs were introduced in computer vision, where they are commonly used to process image data via Convolutional Neural Networks (CNNs). However, they are capable of generating tabular data as well. The GAN architecture has undergone numerous enhancements by the research community in recent years [25]. To determine whether GAN is an appropriate option for synopsis generation, we first provide a detailed description of the GAN method and its architecture. Generative Adversarial Networks (GANs) are characterized by two neural networks: the generator, which creates data intended to mimic the true data distribution, and the discriminator, which evaluates the data, distinguishing between the generator's fake data and real data from the actual distribution [26]. The generator draws a random vector z from the latent space with the distribution p_z(z). The generator G(z; θ_g) then uses the parameters θ_g to map z from the latent space to the data space. Therefore, p_g(x), the probability density function over the generated data, is used by G(z) to generate x_g. The discriminator neural network D(x; θ_d) then randomly receives either x_g, a generated sample, or x_data, an actual sample from the probability density function over the data space, p_data(x). The discriminator D(x; θ_d) is a binary classification model in which D(x) returns the probability that x is derived from real data. Therefore, the output of this function is a single scalar that indicates whether the passed sample is real or fake. Figure 2 depicts the described process and the GAN architecture. θ_g and θ_d are the weights of the generator and discriminator, learned through the optimization procedure during training. The goal of the discriminator during training is to maximize the probability that a given training example or generated sample is assigned the proper label, whereas the goal of the generator is to minimize the probability that the discriminator detects its output as fake. Therefore, the objective function can be expressed as a minimax value function V(G, D), which is jointly dependent on the generator and the discriminator:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

The discriminator performs binary classification, assigning a value of 1 to real samples (x ∼ p_data(x)) and a value of 0 to generated samples (z ∼ p_z(z)). In the optimal adversarial network, p_g converges to p_data, and the algorithm stops at D(x) = 1/2, which means the global optimum occurs when p_g = p_data [26].
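The convergence condition D(x) = 1/2 can be checked numerically. For a fixed generator, the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)); the toy densities below are illustrative assumptions, chosen only to show that D* deviates from 1/2 exactly where the two distributions disagree.

```python
import numpy as np

# Evaluate the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x))
# on a grid, for (a) a generator that has not matched the data distribution
# and (b) one that matches it exactly.
x = np.linspace(-5, 5, 201)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p_data = gaussian_pdf(x, 0.0, 1.0)

# (a) imperfect generator: D* is far from 1/2 wherever densities differ.
p_g_bad = gaussian_pdf(x, 1.5, 1.0)
d_star_bad = p_data / (p_data + p_g_bad)

# (b) p_g = p_data: D*(x) = 1/2 everywhere, i.e. the global optimum.
d_star_opt = p_data / (p_data + p_data)
```

At the optimum the discriminator is maximally uncertain everywhere, which is exactly the training signal that forces p_g toward p_data.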
The generated data in an unconditioned GAN is completely unmanageable for a multimodal distribution. Mirza et al. [27] introduced a conditional version of GAN that can provide generators with prior information so that they can control the generation process for different modes. Achieving this objective requires conditioning the generator and discriminator on some additional information y, where y can be anything from class labels to information on the distribution of data (modes). This can be done by giving the discriminator and the generator y as an extra input layer in the form of a one-hot vector. In fact, the input noise p_z(z) to the generator is not truly random once the information y is added to it, and the discriminator regulates not only the similarity between real and generated data but also the correlation between the generated data and the input information y. Therefore, the objective function in Eq. 1 can be rewritten as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))] \tag{2}$$

Figure 3 illustrates the structure of a CGAN and how the input information y is applied during the process. The majority of applications of conditional GANs have been concerned with synthesizing images by giving the label of the image that should be generated. Nonetheless, in the case of tabular data, this information could be the shape of the data in a multimodal distribution and can be used to inject prior knowledge into the generator. To date, most proposed solutions have been published with the aim of adhering to real-data privacy regulations and preventing data leakage during data sharing, or of generating synthetic data for data imputation and augmentation. In contrast, AQP applications require generating realistic data that is as close to the real data as possible, rather than deliberately perturbed synthetic data. The challenges associated with generating tabular data using GANs have been addressed in a few publications since 2017. The purpose of this section is to introduce promising variants of GANs for tabular data generation, followed by a
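The conditioning mechanism itself is simple: concatenate the one-hot vector y to the generator's noise input (and, analogously, to the discriminator's input). A minimal sketch, with hypothetical dimensions and function names:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_generator_input(z_dim, labels, num_classes):
    """Concatenate noise z with a one-hot condition vector y,
    as in a conditional GAN's generator input layer."""
    batch = len(labels)
    z = rng.standard_normal((batch, z_dim))
    y = np.zeros((batch, num_classes))
    y[np.arange(batch), labels] = 1.0          # one-hot encode y
    return np.concatenate([z, y], axis=1)      # generator sees [z | y]

# A batch of three samples conditioned on classes 0, 2, 1.
batch_input = make_generator_input(z_dim=8, labels=[0, 2, 1], num_classes=3)
```

For tabular data, `labels` would index categories or distribution modes rather than image classes, but the wiring is the same.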
classification of the proposed solutions based on the previously discussed synopsis construction challenges. Choi et al. [28] proposed the medical Generative Adversarial Network (medGAN) to generate realistic synthetic patient records based on real data as inputs, protecting patient confidentiality to a significant extent. The medGAN generates high-dimensional, multi-label discrete variables by combining an autoencoder with a feedforward network, batch normalization, and shortcut connections. With an autoencoder, gradients can flow end-to-end, fine-tuning the system from the discriminator to the decoder for discrete patient records. The medGAN architecture uses MSE loss for numerical columns, Cross-Entropy loss for binary columns, and the ReLU activation function for both the encoder and decoder networks. The medGAN uses the pre-trained autoencoder to generate distributed representations of patient records rather than generating patient records directly. In addition, it provides a simple and efficient method of dealing with mode collapse when generating discrete outputs: minibatch averaging. Figure 4 shows the medGAN architecture and the autoencoder's role in the training process.

Figure 4: medGAN architecture: the discriminator utilizes the autoencoder (learned on real data) to receive the decoded random noise variable.

The generator cannot generate discrete data because it must be differentiable. To address this issue, Mottini et al. [29] proposed a method for generating realistic synthetic Passenger Name Records (PNRs) using Cramer GANs, categorical feature embedding, and a Cross-Net architecture to handle PNR data (categorical or numerical with null values). As opposed to simply embedding the most probable category, they used the weighted average of the embedded representations of each discrete category. The embedding layer is shared by the generator and discriminator, resulting in a fully differentiable process thanks to this continuous relaxation. For handling null values, they are substituted with a new category in categorical columns; for continuous columns, null values are filled with a random value from the same column, and a new binary column is inserted with 1 for filled rows and 0 otherwise. These additional binary columns are encoded like categorical columns. It should be noted that in this architecture, both the generator and discriminator consist of fully connected layers and cross-layers. Also, except for the last layer (Sigmoid), all layers of the generator use leaky ReLU activations for numerical features and Softmax for categorical features, whereas the discriminator uses leaky ReLU activations in all but the last layer (linear). Neither batch normalization nor dropout is used in this architecture, as in Wasserstein and Cramer GANs [30]. Data pre-processing in this algorithm is depicted in Figure 5: discrete values are embedded using the embedding matrix and then concatenated with the continuous columns of the input data. Table-GAN [31] uses GANs to create fake tables that are statistically similar to the original tables but are resistant to re-identification attacks and can be shared without exposing private information. Table-GAN supports both discrete and continuous columns and is based on Deep Convolutional GANs (DCGANs) [32]. Besides the generator and discriminator with multilayer convolutional and deconvolutional layers,
the table-GAN architecture also includes a classifier neural network with the same architecture as the discriminator. However, it is trained using ground-truth labels from the original table to increase the semantic integrity of the generated records. Information loss and classification loss are two additional loss terms introduced during the backpropagation process. The purpose of these functions is to maintain a balance between privacy and usability while ensuring the semantic integrity of the real and generated data. The information loss compares the mean and standard deviation of real and generated data to measure the discrepancy between them and determine whether they have statistically the same features from the perspective of the discriminator, while the classification loss measures the difference between how a record is labeled and how the classifier predicts it should be labeled. Figure 6 is a representation of the loss functions in the table-GAN architecture. Xu et al. developed TGAN [33], a synthetic tabular data generator for data augmentation that can take into account mixed data types (continuous and categorical). TGAN generates tabular data, column by column, using a Long Short-Term Memory (LSTM) network with attention. The LSTM generates each continuous column from the input noise in two steps: first, it generates a probability that the value comes from mode m, and then it normalizes the column value based on this probability.

Figure 5: Pre-processing input data before feeding the discriminator in PNR-GAN.

TGAN penalizes GAN's original loss function by adding two KL-divergence terms between generated and real data for continuous and categorical columns separately. Therefore, the generator is optimized as follows:

$$L_G = -\mathbb{E}_{z \sim p_z(z)}[\log D(G(z))] + \sum_{i=1}^{N_c} KL(u'_i, u_i) + \sum_{i=1}^{N_d} KL(d'_i, d_i) \tag{3}$$

where u'_i and u_i are the probability distributions over continuous column c_i for generated and real data, respectively; d'_i and d_i are the probabilities over categorical column D_i, computed using the softmax function, for generated and real data, respectively; N_c is the number of continuous columns; and N_d is the number of categorical columns. They also proposed a conditional version of TGAN, named CTGAN [34], to address data imbalance and multimodal distribution problems by designing a conditional generator with a training-by-sampling strategy that validates the generator output by estimating the distance between the conditional distributions over generated and real data.

It is difficult for GANs to control the generation process in data-driven systems; therefore, integrating prior knowledge about data relationships and constraints can assist the generator in producing synopses that are realistic and meaningful. To implement this, DATGAN [36] incorporates expert knowledge into the GAN generator by matching the generator structure to the underlying data structure using a Directed Acyclic Graph (DAG). In the DAG, nodes represent the columns of a data table, while the directed links between them allow the generator to determine the relationships between variables, so that the generation of one column influences another. This means that if two variables have no common ancestors, they will not be correlated in the generated dataset. In relational databases, there is no particular order in which columns appear in data tables. Nevertheless, the DAG gives data tables a specific column order based on their semantic relationships.
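The DAG idea implies a concrete generation schedule: columns are produced in a topological order of the expert-supplied graph, so every column is generated only after its ancestors. A minimal sketch with a hypothetical column DAG (the column names and edges are invented for illustration, not taken from DATGAN):

```python
from graphlib import TopologicalSorter

# Hypothetical column DAG: the mapping is column -> list of columns it
# depends on (its predecessors), DATGAN-style expert knowledge.
dag = {
    "age": [],
    "occupation": ["age"],            # occupation depends on age
    "income": ["age", "occupation"],
    "loan_limit": ["income"],
}

# A topological order gives a valid column-by-column generation schedule:
# every column appears only after all of its ancestors.
order = list(TopologicalSorter(dag).static_order())
```

Two columns without a common ancestor (none here) would sit in independent branches of the DAG and, as the text notes, end up uncorrelated in the generated table.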

Tabular GAN Evolution
GANs have made significant progress in recent years, leading to novel variants that improve on earlier versions. Table 1 provides a summary of the GAN variants discussed in this paper. Figure 7 shows the tabular GAN evolution, along with the year each variant was introduced and its ancestors. As shown in this figure, table-GAN and CTAB-GAN utilize convolutional layers as part of their generators; however, CTAB-GAN makes use of a conditional generator built on CGAN and AC-GAN. CTGAN also utilizes the conditional version of GAN. With conditioned generators, realistic data can be generated based on the constraints on the data table. On the other hand, TGAN and DATGAN use LSTMs to memorize data relationships and correlations. Indeed, both conditional generators and LSTMs attempt to generate data based on prior knowledge about the relationships between the columns of a data table.

Synopsis Construction Challenges
Traditional methods of synopsis construction face many challenges relating to data type, structure, distribution, and query aggregation functions. Furthermore, synopses provide the most accurate summary when built over the entire data stream, yet retrieving the entire data set is inconvenient in real-time databases, as it changes over time. According to the data structure of relational databases, the challenges associated with generating synopses can be categorized into several significant groups [33].

Data type
It is challenging to construct a data synopsis that is representative of the entire data table due to differences in data types. For instance, different activation functions on the output are required for generative models, since relational database tables include numerical, categorical, ordinal, and mixed data types. As an example of a mixed data type, a financial database may contain a column for loan debts, where a loan holder may have no debt or a debt with a positive value [35]. In data analysis, such a column can be treated as categorical data using a step function, but in reality, it is continuous data. In this regard, a data generator must be able to detect these types of data in order to avoid adverse effects on the interpretation of the data. The several types of data used in creating an AQP data synopsis are broken down in Table 2. Note that textual data types are not typically utilized in AQP queries and are therefore ignored here.
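One simple heuristic for detecting such mixed columns is to look for point masses: values whose relative frequency is far too high for a genuinely continuous column. The sketch below is an illustrative detector built on the loan-debt example; the data, threshold, and function name are assumptions, not a method from the cited works.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical loan-debt column: many exact zeros (no debt) mixed
# with a continuous positive part -- a "mixed" type column.
debts = np.concatenate([
    np.zeros(4000),                      # categorical point mass at 0
    rng.gamma(2.0, 1000.0, size=6000),   # continuous positive part
])
rng.shuffle(debts)

def detect_point_masses(col, threshold=0.05):
    """Flag values whose relative frequency exceeds `threshold`;
    a continuous column with such spikes is treated as mixed."""
    values, counts = np.unique(col, return_counts=True)
    freq = counts / col.size
    return values[freq > threshold].tolist()

spikes = detect_point_masses(debts)
```

A generator that knows about the spike at 0 can model it as a categorical part and reserve the continuous machinery for the positive values, as the mixed-type encoders discussed later do.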

Bounded Continuous Columns
A continuous column C_i is bounded if there exist two numbers a and b such that a ≤ x ≤ b for all values x in the column. The intricacy of synopsis construction for bounded continuous columns arises from the necessity to sample from a probability distribution that not only mirrors the true underlying statistical characteristics of the original data but also adheres strictly to these boundary constraints. For instance, a credit card's expenditure column might be restricted from zero to an upper limit reflecting the credit limit; hence, the synthetic data generation process must respect these limits to produce meaningful and applicable synthetic transactions. The challenge intensifies because the synthesis process must be sensitive to the distribution's tails, avoiding the generation of outliers that fall outside the established bounds, which would otherwise lead to unrealistic and operationally irrelevant data points.
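One straightforward way to enforce such bounds on a learned sampler is rejection sampling: keep drawing and discard anything outside [a, b]. The sketch below uses a hypothetical lognormal spending model with a credit limit of 5000; the distribution and limits are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_bounded(sampler, low, high, n):
    """Rejection-sample n values from `sampler`, keeping only those
    inside [low, high] so the synopsis never contains out-of-bound
    outliers (e.g. spending beyond a credit limit)."""
    out = []
    while len(out) < n:
        draws = sampler(n)
        out.extend(draws[(draws >= low) & (draws <= high)].tolist())
    return np.array(out[:n])

# Hypothetical model of card spending, bounded by the credit limit.
spend = sample_bounded(lambda n: rng.lognormal(6.0, 1.0, n), 0.0, 5000.0, 1000)
```

Rejection is cheap when the model already places most mass inside the bounds; if it does not, the model itself is misfit to the bounded column, which is the deeper problem this subsection describes.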

Non-Gaussian Distribution
When dealing with the non-Gaussian distributions that are common in real-world datasets, the assumption of normality often fails in the field of synopsis construction. Such distributions may be multimodal, containing several peaks or modes, which reflects the complexity of the underlying data-generating processes. For instance, the distribution of incomes in a socio-economic dataset could exhibit multiple modes, corresponding to different socio-economic classes.
Traditional synopsis generation techniques may inadequately capture the multimodal structure of such distributions, leading to entire modes being missed. This results in a generated synopsis that fails to represent segments of the population within the original dataset [33]. Moreover, the presence of long-tailed distributions poses additional challenges [35]. These distributions are characterized by a proliferation of infrequent events, such as a customer purchase history where the vast majority of customers make infrequent purchases, while a minor fraction exhibits high purchase frequencies. Synthesizing data from such a distribution requires not only capturing the frequent low-occurrence events but also accurately representing the rare high-occurrence instances. Conventional methods may struggle with this, often either over-representing the tail and creating too many rare events or under-representing it, thus failing to capture the true nature of the underlying data. This misrepresentation can skew the synopsis, rendering it less effective for decision-making processes where an understanding of rare events is critical.
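The failure of the normality assumption can be demonstrated in a few lines. Below, a hypothetical bimodal income column (two socio-economic classes) is summarized by a single Gaussian fitted to its mean and standard deviation; the naive synopsis puts most of its mass between the true modes and under-represents both. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Bimodal "income" column: two socio-economic classes.
real = np.concatenate([rng.normal(20_000, 2_000, 5000),
                       rng.normal(80_000, 5_000, 5000)])

# A single-Gaussian synopsis fitted only to mean/std -- the normality
# assumption that often fails on real data.
naive = rng.normal(real.mean(), real.std(), 10_000)

def mass_near(data, center, width=10_000):
    """Fraction of values within `width` of `center`."""
    return float(np.mean(np.abs(data - center) < width))

# Half the real data sits in the low-income mode; the naive synopsis
# places far less mass there (and fills the empty region between modes).
real_low = mass_near(real, 20_000)
naive_low = mass_near(naive, 20_000)
```

This is the mode-dropping failure in miniature, and it motivates the mode-specific normalization techniques described in the solutions section.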

Imbalanced Categorical Columns
In the construction of data synopses, the handling of imbalanced categorical columns presents a significant challenge [33]. Categorical variables in real-world datasets frequently show a skewed distribution in terms of the frequency of occurrence across categories. Such a disparity means that minority categories make only a small contribution to the overall distribution of the data, which may result in their under-representation in the generated synopsis. The process of creating synopses is influenced by the lack of representation of certain classes, resulting in a bias towards the majority class due to its higher statistical likelihood. For instance, consider a customer gender column in a retail database with a pronounced imbalance, where 'male' customers vastly outnumber 'female' customers. A synopsis generated from this distribution might reflect this skew, resulting in a synthetic dataset dominated by 'male' entries. However, this skew inaccurately portrays the significance of the 'female' category, which, despite its smaller size, may carry substantial weight in consumer behavior analysis.
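A common countermeasure, used in CTGAN's training-by-sampling, is to pick the conditioning category in proportion to the logarithm of its frequency rather than the raw frequency, so minority categories are seen during training far more often than their raw share. The sketch below illustrates the idea on the gender example; the data and sampling sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)

# Heavily imbalanced gender column: 95% 'male', 5% 'female'.
col = np.array(["male"] * 9500 + ["female"] * 500)

values, counts = np.unique(col, return_counts=True)

# CTGAN-style idea: sample the conditioning category in proportion to
# the *log* of its frequency, flattening the imbalance during training.
log_freq = np.log(counts)
probs = log_freq / log_freq.sum()

picks = rng.choice(values, size=10_000, p=probs)
minority_share = float(np.mean(picks == "female"))  # raw share was 0.05
```

The generator conditioned this way still learns the true marginal (the condition is part of its input), but it gets enough gradient signal on the minority category to model it faithfully.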

Semantic Relationships and Constraints
Tabular data often includes complex semantic relationships that are not easily understood through standard statistical analysis [37]. These relationships can exist between categorical and numerical columns alike and are crucial for maintaining the integrity and usefulness of the generated synopsis. Identifying and encoding such relationships is a challenge due to the heterogeneity of domain-specific constraints and the complex nature of inter-column dependencies, which may not be amenable to simple rule-based generalizations. For instance, semantic relationships may determine that certain numerical values are valid only when paired with specific categorical entries, imposing a constraint-based association. Alternatively, a rule-based linkage could suggest a probabilistic co-occurrence pattern between different fields in the data. Hence, a comprehensive synopsis generation process must include mechanisms that can deduce these complex relationships, which can be multi-faceted and deeply embedded within the structure of the data. A failure to do so not only compromises the authenticity of the synthesized data but also limits the operational relevance of the synopsis, as it could lead to the generation of implausible or inconsistent records that do not adhere to the real-world rules and constraints governing the dataset. Figure 8 presents two examples of generated samples from a table that the model should reject semantically: to generate a representative data synopsis in AQP, the city must be properly associated with the state, and the joined column cannot precede the founded column.
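The two rejection rules just described can be expressed as an explicit validity check applied to generated rows. This is a minimal, illustrative sketch (the lookup table, column names, and rows are invented), not a mechanism from any of the cited generators:

```python
# Toy lookup standing in for real city/state reference data.
STATE_CITIES = {"IL": {"Chicago"}, "CA": {"Los Angeles"}}

def is_semantically_valid(row):
    """Reject rows violating the two Figure-8 style constraints:
    city must belong to its state, and an employee cannot have
    joined before the company was founded."""
    if row["city"] not in STATE_CITIES.get(row["state"], set()):
        return False                      # city/state mismatch
    if row["joined"] < row["founded"]:
        return False                      # joined before founded
    return True

ok = is_semantically_valid({"state": "IL", "city": "Chicago",
                            "founded": 1990, "joined": 2005})
bad = is_semantically_valid({"state": "CA", "city": "Chicago",
                             "founded": 1990, "joined": 2005})
```

Post-hoc filtering like this guarantees validity but wastes generated samples; the DAG- and condition-based generators discussed earlier aim to build such knowledge into the generation process instead.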

GAN-based Synopsis Construction Solutions
It is possible to categorize synopsis construction solutions into three different categories: Data Transformation, which addresses data type issues; Distribution Matching, which addresses ranges and distributions of data; and Conditional and Informed Generator, which addresses imbalance classes, semantic relationships, and table constraints.

Data Transformation
Mode normalization is capable of detecting modes of data by assigning samples to different modes and then normalizing each sample based on the corresponding mode estimator [38]. To deal with multimodal distributions for continuous columns, mode-specific normalization was introduced in TGAN [33]. In this algorithm, the number of modes of each continuous column is first calculated using Gaussian kernel density estimation. Then, a Gaussian Mixture Model (GMM) can be employed to efficiently sample values from a distribution with multiple modes by clustering the values of each continuous column (C_i). In other words, a weighted sum of Gaussian distributions over C_i can represent its multimodal distribution. A normalized probability distribution over m Gaussian distributions can then be used to represent each continuous column, so that each column is clustered into m fixed Gaussian distributions. As a result, if a column has fewer than m modes, the probability of each present mode is high, and for the rest, it is close to zero. In CTGAN [34], however, a Variational Gaussian Mixture model (VGM) is first applied to each continuous column (C_i) in order to fit a Gaussian mixture and find the number of modes (m). Then, a one-hot vector (β_{i,j}) indicates to which mode a given value belongs, and a scalar (α_{i,j}) serves as the normalized value within that mode. The learned Gaussian mixture for column C_i with m modes is given by:

$$P(c_{i,j}) = \sum_{k=1}^{m} w_k \, \mathcal{N}(c_{i,j}; \mu_k, \sigma_k) \tag{4}$$

where c_{i,j} is the value of the j-th row of the i-th column, μ_k and σ_k are the mean and standard deviation of the Gaussian distribution for the k-th mode, and w_k is the weight of the k-th mode. For each value, the probability density ρ of the k-th mode is:

$$\rho_k = w_k \, \mathcal{N}(c_{i,j}; \mu_k, \sigma_k) \tag{5}$$

Therefore, each value can be normalized according to the mode with the highest probability. As an example, the values of α and β for value c_{i,j} in the k-th mode are:

$$\alpha_{i,j} = \frac{c_{i,j} - \mu_k}{\delta \sigma_k}, \qquad \beta_{i,j} = [0, \ldots, 1, \ldots, 0] \ (k\text{-th element set to 1}) \tag{6}$$

where δ is a parameter specified by the modeller.
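The encoding step of mode-specific normalization can be sketched as follows, assuming the mixture (weights, means, standard deviations) has already been fitted for one column; all numeric parameters here are illustrative.

```python
import numpy as np

# Illustrative fitted Gaussian mixture for one continuous column C_i:
# weights w_k, means mu_k, standard deviations sigma_k.
w     = np.array([0.6, 0.4])
mu    = np.array([10.0, 100.0])
sigma = np.array([2.0, 15.0])
delta = 4.0   # modeller-specified scaling parameter

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def encode(c):
    """Mode-specific normalization of a single value: pick the most
    probable mode k, emit scalar alpha (normalized value within that
    mode) and one-hot beta (mode indicator)."""
    rho = w * normal_pdf(c, mu, sigma)       # density of each mode
    k = int(np.argmax(rho))
    alpha = (c - mu[k]) / (delta * sigma[k])
    beta = np.eye(len(w))[k]
    return alpha, beta

alpha, beta = encode(12.0)   # a value near the first mode
```

Each continuous value is thus replaced by a small, mode-aware pair (α, β) that the generator can reproduce without having to model the full multimodal density directly.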
For categorical columns D, the situation is different; TGAN [33] converts these columns (d_{i,j}) to a one-hot encoding with added noise (Uniform(0, γ), where γ is an arbitrary number). To achieve this, after creating the one-hot vector, noise is added to each element and the resulting representation is renormalized. Each data row can then be represented by a concatenation of continuous and categorical columns:

row_j = α_{1,j} ⊕ β_{1,j} ⊕ … ⊕ α_{N_c,j} ⊕ β_{N_c,j} ⊕ d_{1,j} ⊕ … ⊕ d_{N_d,j},

where d_{i,j} is the one-hot representation of categorical column D_i, N_c is the number of continuous columns, and N_d is the number of categorical columns.
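The noisy one-hot step can be sketched as follows (a minimal illustration; the helper name and default γ are ours, not TGAN's):

```python
import numpy as np

def noisy_one_hot(category_index, n_categories, gamma=0.2, rng=None):
    """One-hot encode a category, add Uniform(0, gamma) noise, and renormalize."""
    rng = rng or np.random.default_rng(0)
    d = np.zeros(n_categories)
    d[category_index] = 1.0
    d += rng.uniform(0, gamma, size=n_categories)  # noise added to every element
    return d / d.sum()                             # renormalize to a probability vector

v = noisy_one_hot(2, 4)
# v sums to 1 and still peaks at the true category (index 2), since gamma < 1.
```

The added noise prevents the discriminator from trivially separating the hard 0/1 vectors of real data from the soft outputs of the generator.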
As previously discussed, columns can be considered mixed if they contain both categorical and continuous values, or continuous values with nulls. The encoding of continuous and categorical columns in CTAB-GAN [35] is exactly the same as in CTGAN [34], defining α and β. For mixed-type columns, however, the encoder treats each column as a concatenation of value-mode pairs, where the categorical part of the values takes zero for α and is otherwise treated as continuous. Figure 9 shows the distribution over an arbitrary mixed-type column, with two modes for the continuous part (m_2, m_3) and two categorical parts (m_1, m_4), and illustrates how this algorithm transforms one row of mixed-mode data.

Distribution Matching
In order to generate synopses with the same distribution as the underlying data, the training algorithm should penalize the generator. Information loss [31] helps the generator produce synopses that are statistically closer to the real data. It uses the statistical characteristics L_mean (first-order statistics, Eq. 8) and L_sd (second-order statistics, Eq. 9) of the features extracted prior to the classifier in the discriminator to penalize the generator for the discrepancy between real and generated data:

L_mean = || E[f]_real − E[f]_generated ||_2,   (8)

L_sd = || SD[f]_real − SD[f]_generated ||_2.   (9)

This is sensible because the extracted features determine the binary decision of the discriminator.
Where f represents the features, E[f] is the average and SD[f] the standard deviation of the features over all rows in the data table; the Euclidean norm measures the discrepancy between the two terms. As discussed before, table-GAN [31] was developed to protect confidential data when it is shared with the public, so the similarity of the generated data to the real data must be controllable during generation. To this end, the information loss for the generator is

L_info = max(0, L_mean − δ_mean) + max(0, L_sd − δ_sd),   (10)

where δ is a threshold indicating an acceptable quality degradation of the generated data and max(·) represents the hinge loss, which is zero until δ is reached. In AQP, however, it is not necessary to meet this threshold in order to generate realistic data synopses. DATGAN [36] uses the improved Wasserstein loss from WGAN [39] with gradient penalty [40] in addition to the vanilla GAN loss, and also adds the KL-divergence as an extra term to the original loss function. Both terms aim to minimize the difference between the probability distributions of real and generated data. WGAN trains the generator with an alternative objective that better approximates the real data distribution: the discriminator is replaced by a critic that scores the degree to which a sample is real or fake rather than classifying it. WGAN therefore treats the critic output as a scalar score instead of a probability, and the Wasserstein loss encourages a greater difference between the scores for real and generated data.
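The first- and second-order information loss can be computed directly from feature matrices. In the sketch below the "features" are random stand-ins, since the real ones come from the layer before the discriminator's classifier; the hinge thresholds default to zero.

```python
import numpy as np

def information_loss(f_real, f_fake, delta_mean=0.0, delta_sd=0.0):
    """table-GAN-style information loss over feature matrices (rows x features)."""
    l_mean = np.linalg.norm(f_real.mean(axis=0) - f_fake.mean(axis=0))  # first-order
    l_sd   = np.linalg.norm(f_real.std(axis=0)  - f_fake.std(axis=0))   # second-order
    # hinge form used by the generator: zero until the threshold delta is reached
    return max(0.0, l_mean - delta_mean) + max(0.0, l_sd - delta_sd)

rng = np.random.default_rng(0)
f_real = rng.normal(0, 1, size=(256, 8))
f_fake = rng.normal(0, 1, size=(256, 8))
loss = information_loss(f_real, f_fake)
# loss is non-negative, and identical feature matrices give exactly 0.
```

With nonzero thresholds δ_mean and δ_sd, small discrepancies cost nothing, which is the mechanism table-GAN uses to deliberately degrade similarity for privacy.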
As a result, it can prevent vanishing gradients in the generator. The primary problem with WGAN, however, is that it must clip the weights of the critic in order to enforce the Lipschitz constraint. This issue can be addressed by adding a gradient penalty to the critic. Eq. 11 shows the Wasserstein objective function, and Eq. 12 adds a penalty on the gradient norm for random samples x̂ ∼ p_x̂:

L = E_{x̃∼p_g}[D(x̃)] − E_{x∼p_r}[D(x)],   (11)

L = E_{x̃∼p_g}[D(x̃)] − E_{x∼p_r}[D(x)] + λ E_{x̂∼p_x̂}[ ( ||∇_{x̂} D(x̂)||_2 − 1 )^2 ],   (12)
where λ is a penalty coefficient defined by the modeler and x̂ is sampled uniformly along straight lines between pairs of points drawn from the real data and from G(z).
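Real implementations compute the penalty with autodiff; the toy sketch below only illustrates Eq. 12's penalty term, using a linear critic D(x) = w·x whose gradient with respect to x is simply w.

```python
import numpy as np

def gradient_penalty_linear(w, x_real, x_fake, lam=10.0, rng=None):
    """Gradient penalty for a toy linear critic D(x) = w @ x (illustrative only)."""
    rng = rng or np.random.default_rng(0)
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1 - eps) * x_fake   # random points between real and fake
    grad = np.broadcast_to(w, x_hat.shape)      # gradient of w @ x is w everywhere
    norms = np.linalg.norm(grad, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)    # push gradient norms toward 1

x_real = np.random.default_rng(1).normal(size=(64, 2))
x_fake = np.random.default_rng(2).normal(size=(64, 2))

w_unit = np.array([0.6, 0.8])                   # ||w|| = 1: already Lipschitz-1
penalty = gradient_penalty_linear(w_unit, x_real, x_fake)
# penalty is (numerically) zero because the critic already has unit gradient norm.
```

For a critic with ||w|| = 2, every interpolated point contributes (2 − 1)² = 1, so the penalty is exactly λ = 10, showing how steep critics are pushed back toward the unit-norm constraint.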

Conditional and Informed Generator
Imbalance in categorical columns can cause inaccuracies when generating synopses and may prevent the generator from being trained to match the distribution of the real data. CTGAN [34] introduces a conditional generator (trained with training-by-sampling) as a solution to this problem. The generated value is interpreted as a conditional distribution of rows given the value of an imbalanced categorical column, so the original distribution can be reconstructed as

P(row) = Σ_{k ∈ D_i} P(row | D_i = k) P(D_i = k),   (13)

where k is a value in the i-th categorical column D_i. To implement this, a conditional vector is required, consisting of mask vectors that encode the address of a table value (the column and the corresponding value). The conditional vector does not guarantee that the feed-forward pass produces the correct value given the mask vector M; instead, the suggested approach penalizes the conditional generator's loss by averaging the cross-entropy between the generated M̂_i and the expected conditional vector M_i over all instances of the batch. The generator loss can be expressed as

L_G = E[ H(M_i, M̂_i) ],   (14)

where H(·,·) is the cross-entropy between two vectors. As a result, the generator learns to replicate the masked value in the generated row during training. The conditional vector for a data table with N_d categorical columns is the direct sum of the mask vectors M across all columns D_i:

cond = M_1 ⊕ M_2 ⊕ … ⊕ M_{N_d}.   (15)

In effect, the generator loss teaches the generator to produce the same classes as the given conditions. During the conditional generation procedure, the mask vectors M_i are initialized with 0 for each categorical column D_i. A column is then chosen at random, and a Probability Mass Function (PMF) is constructed over that column's range of categories.
According to the PMF, one category is then picked, and its value in the corresponding mask vector is changed to 1. Finally, the conditional vector is formed, and the generator is able to generate a synthetic row for the given categorical column. Figure 10 illustrates this mask-vector generation process.
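The training-by-sampling procedure can be sketched as follows. Following the CTGAN paper, the PMF is built over the logarithm of the category frequencies so that rare categories are conditioned on more often than their raw frequency suggests; the function name and input format are illustrative.

```python
import numpy as np

def sample_condvec(category_counts_per_col, rng=None):
    """Build one conditional vector: all mask vectors zeroed, one entry set to 1."""
    rng = rng or np.random.default_rng(0)
    i = rng.integers(len(category_counts_per_col))    # pick a categorical column uniformly
    counts = np.asarray(category_counts_per_col[i], dtype=float)
    pmf = np.log1p(counts)                            # log-frequency PMF flattens imbalance
    pmf /= pmf.sum()
    j = rng.choice(len(counts), p=pmf)                # pick one category from the PMF
    masks = [np.zeros(len(c)) for c in category_counts_per_col]
    masks[i][j] = 1.0                                 # flip the chosen category to 1
    return np.concatenate(masks)                      # direct sum of all mask vectors

# Two categorical columns: one heavily imbalanced (900 vs 100), one with 3 classes.
cond = sample_condvec([[900, 100], [50, 25, 25]])
# The conditional vector has length 2 + 3 = 5 and exactly one element equal to 1.
```

Conditioning the generator on such vectors, and penalizing it with the cross-entropy of Eq. 14, is what lets minority categories appear in the synopsis at trainable rates.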
CTAB-GAN [35] utilizes a classifier neural network following the auxiliary classifier GAN (AC-GAN) [41], a conditional GAN variant that requires the discriminator to predict the class label c ∼ p_c of generated data in addition to judging realness. In AC-GAN, the generator produces a new sample from noise z and a class label c, while the discriminator outputs both a probability distribution over sources, P(S|X), and a probability distribution over class labels, P(C|X).
The objective function contains the following terms:

L_S = E[log P(S = real | X_real)] + E[log P(S = fake | X_fake)],   (16)

L_C = E[log P(C = c | X_real)] + E[log P(C = c | X_fake)],   (17)

where L_S is the log-likelihood of predicting the correct source, L_C is the log-likelihood of predicting the correct class, and c is a class label. The discriminator is trained to maximize L_C + L_S, and the generator to maximize L_C − L_S. These objectives allow the training procedure to generate data of a specific type, while the discriminator must both predict the class label of the generated data and determine whether or not it is real. Building on this, CTAB-GAN adds a classifier loss (Eq. 18) to the generator to increase the semantic integrity of generated records; it penalizes the generator when the combination of column values in a generated row is semantically incorrect:

L_class = E[ || l(G(z)) − C(fe(G(z))) || ],   (18)
where l(·) returns the target label and fe(·) returns the input features of a given row. As mentioned before, DATGAN [36] uses a DAG to control the generation process based on semantic relationships and correlations between columns. According to the constructed DAG, each column and its sequence are represented by Long Short-Term Memory (LSTM) cells. By providing the generator with prior knowledge, the DAG reduces the GAN's capacity to overfit noise during training and enables it to use that noise more efficiently to produce more accurate data. The inputs and outputs of the LSTM cells must be adapted to the GAN architecture; the input of the cell for node t can be expressed as

i_t = z_t ⊕ a_t ⊕ f_{t−1},   (19)

where z_t is a tensor of Gaussian noise, formed at each node of the DAG by concatenating the noise from the source nodes; f_{t−1} is the transformed output of the previous tensor (h_{t−1}); and a_t is a weighted average of all ancestor LSTM outputs, used to determine which previous cell outputs are relevant to a node's input. Both a_t and z_t are therefore defined based on all ancestors of the current node. Data fed into the DATGAN architecture (generated and real) should be encoded into [-1, 1] or [0, 1] using the techniques described in the "Data Transformation" section. Additionally, for categorical columns the generator produces a probability over each class, which makes it easy for a discriminator to distinguish real from generated values. DATGAN therefore recommends one-sided label smoothing for the default loss: the categorical 0/1 vectors are perturbed with additive uniform noise and then rescaled to [0, 1]-bounded vectors. In this way, relationships between columns can be injected as expert knowledge that could not otherwise be detected by the model. Tabular data, however, cannot be considered sequential, since the order of columns in a data table is generally arbitrary; the DAG is what imposes a specific sequence on the columns.
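The way a DAG fixes a generation order over columns can be sketched with the standard library's topological sorter. The column names and edges below are invented for illustration; in DATGAN each node in this order would be assigned its own LSTM cell.

```python
from graphlib import TopologicalSorter

# Hypothetical column DAG: each key maps a column to the set of its direct ancestors.
dag = {
    "age": set(),                     # source node: receives only noise
    "occupation": {"age"},            # generated conditionally on age
    "income": {"age", "occupation"},  # its LSTM input aggregates both ancestors
}

order = list(TopologicalSorter(dag).static_order())
# Every column appears after all of its ancestors, e.g. age before occupation before income.
```

Processing columns in this order guarantees that when a node's cell runs, the outputs f and attention terms a of all its ancestors are already available to build the input of Eq. 19.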

Synopsis Evaluation and Error Estimation
To avoid performing expensive computations and to support a trial-and-error approach, AQP requires an estimate of the error before running the query. This allows the AQP system to select the optimal synopsis type and resolution based on the user's latency or accuracy requirements. Error Quantification Modules (EQMs) in AQP systems perform this process by measuring the quality of responses, either by prediction or by running queries on synopses [17]. The criteria for evaluating approximate query processing systems can be broadly classified as follows [42]:
• Query type: In terms of aggregation functions and conditions, what types of queries are covered by the method?
• Time Complexity: How long does it take to produce the synopses and return an approximate result?
• Space Complexity: What is the required storage space for data synopses?
• Accuracy: Does the approximate answer meet the error confidence interval?
In this study, however, the focus is on the creation and quality of synopses; therefore, this section discusses synopsis evaluation techniques rather than the EQM process. Different aggregation functions necessitate unique considerations when creating a synopsis. For example, Min() and Max() require that the synopsis include the critical outliers, which carry critical information. As depicted in Figure 4, a single transaction in a real-world database may contain vital information and influence business decisions. In that figure, a food supplier sells its product to both retailers and wholesalers; although retailers account for the majority of sales transactions, wholesalers can place orders that are far more significant in quantity and cost. Likewise, Count() and Sum() may return unacceptable results for minority groups if the query aggregates by those columns, and Avg() can be sensitive when a query filters data on specific conditions or criteria.
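The fragility of max() under sampling is easy to demonstrate numerically. The toy data below mimics the supplier example: many small retail orders and a handful of huge wholesale orders; all numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
retail = rng.integers(1, 50, size=9_990)           # many small retail orders
wholesale = rng.integers(5_000, 10_000, size=10)   # a few huge wholesale orders
quantities = np.concatenate([retail, wholesale])

# A 1% uniform sample: with only 10 outliers in 10,000 rows, a 100-row sample
# misses every wholesale order about 90% of the time, so sample.max() is
# usually far below the true maximum while sample.mean() stays a fair
# estimate of the overall average.
sample = rng.choice(quantities, size=100, replace=False)
true_max = quantities.max()
```

This is why a synopsis intended to answer Min()/Max() queries must retain the critical outliers explicitly rather than rely on uniform coverage.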

In generative models, evaluation methods cannot be generalized across contexts; they must be chosen explicitly for the application. In generative-model optimization, Gaussian distributions are fitted to a mixture of Gaussians by minimizing distance measures such as Maximum Mean Discrepancy (MMD) [43] and Jensen-Shannon divergence (JSD); minimizing MMD or JSD can result in the omission of some modes of a multimodal distribution. In addition, maximizing the average log-likelihood or minimizing the KL-divergence can assign large probabilities to non-data regions. In image synthesis, three common criteria are used to evaluate generative models: log-likelihood, Parzen window estimates, and the visual fidelity of samples [44]. The evaluation of results for tabular data, with its complex data types and distributions, is quite different.
In order to measure accuracy, a generated synopsis must first be shown to be a good representation of the real data. The SDMetrics Python library [45] introduces a set of metrics for measuring the quality and privacy of synthetic data. Data privacy, however, is not the goal of this study and may even reduce the quality of the data. These metrics are summarized and reformulated here to make them suitable for evaluating the generated synopses in AQP. To this end, the comparison of the real and generated datasets can be divided into the following categories: Data Coverage, Data Constraint, Data Similarity, and Data Relationship.

Data Coverage
For discrete columns D_i, we must determine whether all categories in the real data are represented in the synopsis. To accomplish this, a score is calculated by dividing the number of unique categories in the synopsis by the number of unique categories in the corresponding column of the real data:

Coverage(D_i) = N_{D_g} / N_{D_data},   (20)

where i is the column index, N_{D_g} is the number of unique categories in the generated synopsis, and N_{D_data} is the number of unique categories in the real data. A score of 1 means that all of the unique categories in the real data are present in the generated synopsis, while a score of 0 indicates that none are. For continuous columns, the coverage metric measures whether a generated column in the synopsis covers the whole range of values found in the real column:

Coverage(C_i) = 1 − [ max( (min(C_g) − min(C_data)) / (max(C_data) − min(C_data)), 0 ) + max( (max(C_data) − max(C_g)) / (max(C_data) − min(C_data)), 0 ) ],   (21)

where C_g denotes the generated values and C_data the real values of column C_i. The goal of this metric is to determine how closely the minimum and maximum of the generated values match the real minimum and maximum. Equation 21 can become negative if the range covered by the generated synopsis is inadequate; in that case a score of 0, the lowest possible result, is returned.
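Both coverage scores are simple to implement. The sketch below uses illustrative names and, for the categorical score, counts only generated categories that actually appear in the real data:

```python
def category_coverage(real_col, gen_col):
    """Fraction of the real column's unique categories present in the synopsis."""
    real_cats, gen_cats = set(real_col), set(gen_col)
    return len(gen_cats & real_cats) / len(real_cats)

def range_coverage(real_col, gen_col):
    """How much of the real column's [min, max] range the synopsis covers (Eq. 21)."""
    r_min, r_max = min(real_col), max(real_col)
    g_min, g_max = min(gen_col), max(gen_col)
    span = r_max - r_min
    score = 1.0 - (max(g_min - r_min, 0) + max(r_max - g_max, 0)) / span
    return max(score, 0.0)   # clamp negative scores to 0, the lowest possible result

# A synopsis covering only the middle half of the real range scores 0.5.
mid_half = range_coverage([0, 10], [2.5, 7.5])
```

A full-range synopsis scores 1.0, and one entirely outside the real range is clamped to 0.0.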

Data Constraint
Boundary Adherence measures how well a continuous column adheres to the bounds of the real data: it is the fraction of generated values that fall within the minimum-maximum range of the real column values,

B(C_i) = (1 / N_i) Σ_{j=1}^{N_i} 1[ min(C_data) ≤ c_{g,j} ≤ max(C_data) ],   (22)

where N_i is the number of generated records for column C_i. A score of 1 indicates that all values adhere to the boundaries of the real data, while a score of 0 indicates that no values fall between the minimum and maximum of the real data.
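The metric amounts to a single pass over the generated column (a minimal sketch with an illustrative name):

```python
def boundary_adherence(real_col, gen_col):
    """Fraction of generated values inside the [min, max] range of the real column."""
    lo, hi = min(real_col), max(real_col)
    inside = sum(1 for v in gen_col if lo <= v <= hi)
    return inside / len(gen_col)

# Two of four generated values fall outside the real range [0, 100] -> score 0.5.
score = boundary_adherence([0, 100], [-10, 50, 150, 60])
```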

Data Similarity
The Synthetic Data Metrics (SDMetrics) library [45] introduces several metrics for measuring data similarity. To calculate the similarity between the real and generated marginal distributions, two metrics are available: the Kolmogorov-Smirnov (KS) statistic for continuous columns and the Total Variation Distance (TVD) for discrete columns. The KS statistic determines how much the empirical cumulative distribution function (CDF) of the generated data differs from that of the real data; it represents the maximum difference between the generated and real CDFs, as illustrated in Figure 13.
Figure 13: Distances are measured between 0 and 1, but the complement of this metric can also be considered; a higher score then indicates higher quality according to 1 − (KS statistic distance) [45].
It can be calculated using the following expression:

KS = sup_x | F_data(x) − F_g(x) |,   (23)

where F_data and F_g are the empirical CDFs of the real and generated column. Similarly, to compare relationships between pairs of columns, the correlation between two columns can be computed for both datasets and compared, with −1 representing the most negative correlation and 1 representing the most positive correlation between the real and generated column.
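The two-sample KS statistic can be computed directly from the empirical CDFs (scipy.stats.ks_2samp computes the same quantity; the sketch below is a self-contained numpy version with illustrative names):

```python
import numpy as np

def ks_statistic(real, gen):
    """Maximum gap between the empirical CDFs of two samples."""
    real, gen = np.sort(real), np.sort(gen)
    grid = np.concatenate([real, gen])                       # evaluate at all observed points
    cdf_r = np.searchsorted(real, grid, side="right") / len(real)
    cdf_g = np.searchsorted(gen, grid, side="right") / len(gen)
    return float(np.max(np.abs(cdf_r - cdf_g)))

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 1000)
same = ks_statistic(a, a)        # identical samples: distance 0
far = ks_statistic(a, a + 5.0)   # shifted distribution: distance close to 1
```

The quality score reported to the user would then be the complement, 1 − KS, as in Figure 13.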
Traditionally, relational databases are divided into separate tables, one per object; to retrieve information from those tables, related fields must be linked together by defining relationships, and users can then retrieve information from multiple tables simultaneously through queries [47]. To measure the similarity among such tables, the cardinality of related tables can be used: the cardinality of a table relationship is determined by how many rows in each table are related. When generating synopses, therefore, an important metric is whether a table's cardinality is the same in the real and generated data. To measure the similarity of cardinality between related tables, the marginal distribution can be utilized by computing the similarity between the real and generated cardinality distributions using the KS or TVD score. The cardinality complement score returns 1 when the real and generated cardinality values are the same and 0 when they are entirely different.
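The idea can be sketched by counting child rows per parent key and comparing the two resulting cardinality distributions. The sketch below mirrors the metric's intent rather than SDMetrics' exact implementation, and uses TVD as the distance; table contents and names are invented.

```python
from collections import Counter

def cardinality_distribution(child_rows, fk_index):
    """Distribution of 'number of child rows per parent key' for one relationship."""
    per_parent = Counter(row[fk_index] for row in child_rows)
    return Counter(per_parent.values())   # how many parents have k child rows

def tvd(dist_a, dist_b):
    """Total variation distance between two count distributions."""
    keys = set(dist_a) | set(dist_b)
    n_a, n_b = sum(dist_a.values()), sum(dist_b.values())
    return 0.5 * sum(abs(dist_a[k] / n_a - dist_b[k] / n_b) for k in keys)

# Real: customer c1 has 2 orders, c2 has 1. Generated: one parent with 2, one with 1.
real_orders = [("c1",), ("c1",), ("c2",)]
gen_orders  = [("c3",), ("c4",), ("c4",)]
score = 1.0 - tvd(cardinality_distribution(real_orders, 0),
                  cardinality_distribution(gen_orders, 0))
# Identical cardinality patterns -> TVD = 0 -> complement score = 1.0
```

Note that the parent keys themselves need not match; only the shape of the relationship (how many children each parent has) is compared.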

Conclusion
The construction of synopses is essential for data-driven decision-making systems to provide approximate answers to queries. Since traditional statistical approaches are often ineffective, many researchers are exploring how realistic data can be generated with Deep Generative Models in order to revolutionize AQP systems. In this paper, we discussed the challenges associated with generating synopses for relational databases and whether Generative Adversarial Networks can be used to accomplish this task. Furthermore, as part of this study, we summarized and reformulated statistical metrics for evaluating the quality of the generated synopses.
There is no doubt that GANs have an incredible ability to generate realistic images and videos. However, each data point in an image is represented by a pixel, which cannot be interpreted alone but only in relation to the other pixels in the image; consequently, the meaning of the same pixel in one image differs from the meaning of the corresponding pixel in another image [36]. In contrast with images, data tables typically contain columns that have a specific meaning, can be understood by their positions and values, and may also have semantic relationships with one another. We analyzed the challenges associated with synopsis construction in a relational database and categorized them as follows: data types, bounded continuous columns, non-Gaussian distributions, imbalanced categorical columns, and semantic relationships and constraints between columns. Then, by reviewing the promising variants of GANs designed for generating tabular data, we found that the solutions to these challenges revolve around three areas. First, data transformation in the preprocessing phase, especially for handling categorical and null values. Second, data distribution matching, typically by defining specific loss functions that penalize the discriminator and generator for the difference between the generated and real data distributions, in order to learn a multimodal mapping from inputs to outputs. Third, the conditional and informed generator, where the generator is conditioned on auxiliary information (such as class labels or data from other modalities) so that it is fed with contextual information and prior knowledge, enabling it to capture the interactions between columns in a data set. We demonstrated that although the majority of proposed methods are geared towards applications such as data privacy for data sharing, data augmentation for machine learning model training, and data imputation for missing values, rather than generating synopses,
GANs are capable of generating data synopses that closely match the actual data. In summary, Generative Adversarial Networks have demonstrated tremendous potential in machine learning for creating realistic images, videos, audio, and text. However, the complexity of tabular data makes it difficult for algorithms to learn the semantic relationships and constraints of relational databases during training. The field of GANs and adversarial learning is still relatively young; thus, current methods need to be improved before effective synopses can be constructed for AQP in data-driven decision-making systems.

Figure 7: Tabular GAN-based generators' evolution based on their relationships. Yellow boxes are tabular generators, and green boxes are models introduced for non-tabular data.

Figure 8: Each soccer team in the table corresponds to a particular location and has a specific capacity, foundation year, and year of entry into MLS. (a) and (b) show two examples of constraint-based and rule-based sample rejection during the data synopsis generation process.

Figure 9: Distribution over a mixed-type column.m 1 and m 4 represent the categorical part or null values of this column, whereas m 2 and m 3 represent modes for numeric parts.The numeric parts are defined by Variational Gaussian Mixture (VGM) model.[35]

Figure 10: Following vectorization of categorical columns, all vectors are initialized with 0; then the j-th category from the i-th column is selected, and the value of the corresponding element is changed to 1.

Figure 12: In the "invoices" table, there are only two records with outliers for each branch. (a) shows the aggregate query, grouped by branch, that returns the maximum quantity and maximum price sold in a week, and (b) shows a similar query for minimum values. Approximate results for min() and max() are clearly unreliable when dealing with outliers.
CTAB-GAN [35] was introduced with the ability to encode mixed data types and skewed distributions of the input data table, utilizing a conditional generator, the information and classification loss functions derived from table-GAN, as well as CNNs for both the generator and discriminator. Since CNNs are effective at capturing the relationships between pixels within an image, they can be employed to enhance the semantic integrity of the created data. However, in order to prepare data tables for feeding a CNN, each row is transformed into the nearest square d × d matrix, where d = Ceil(√(N_c + N_d)), with N_c and N_d the numbers of continuous and categorical columns in a row of the data table; the extra cells (d × d − (N_c + N_d)) are padded with zeros.

Table 1: Different tabular GAN architectures and their capabilities.

Table 2: The data types that can be used in AQP queries: they can be aggregated (sum, avg, max, min), bounded in WHERE conditions, and used as a group to aggregate other columns.
Figure 10 represents the mask vector generation process for a data table with N_d categorical columns when the generator is conditioned on the j-th category of the i-th categorical column.