A Review of Tabular Data Synthesis Using GANs on an IDS Dataset

: Recent technological innovations along with the vast amount of available data worldwide have led to the rise of cyberattacks against network systems. Intrusion Detection Systems (IDS) play a crucial role as a defense mechanism in networks against adversarial attackers. Machine Learning methods provide various cybersecurity tools. However, these methods require plenty of data to be trained efﬁciently, which may be hard to collect or to use due to privacy reasons. One of the most notable Machine Learning tools is the Generative Adversarial Network (GAN), and it has great potential for tabular data synthesis. In this work, we start by brieﬂy presenting the most popular GAN architectures, VanillaGAN, WGAN, and WGAN-GP. Focusing on tabular data generation, CTGAN, CopulaGAN, and TableGAN models are used for the creation of synthetic IDS data. Speciﬁcally, the models are trained and evaluated on an NSL-KDD dataset, considering the limitations and requirements that this procedure needs. Finally, based on certain quantitative and qualitative methods, we argue and evaluate the most prominent GANs for tabular network data synthesis.


Introduction
Most of the network systems preserve high-quality security characteristics; however, there can be a lot of different adversarial cyberattacks, which due to recent technological innovations change drastically over time. Hence, there is a great need for sophisticated, robust, and adaptive security systems. The Intrusion Detection System (IDS) is a valuable and essential defense mechanism introduced by Anderson in 1980 [1], which is responsible for inspecting network traffic for suspicious and/or malicious activities. The goal of an IDS is to distinguish the malicious network traffic from the normal one. In particular, the IDS monitors the network for harmful activities, raising an alert when suspicious traffic is identified. With the spread of Artificial Intelligence, in the past few years, Machine Learning methods have been applied in Intrusion Detection Systems as a robust and accurate defense mechanism, demonstrating remarkable performance. Specifically, various Machine Learning classifiers, such as Decision Trees Support [2], Artificial Neural Networks (Machine Learning, Deep Learning), such as Convolutional Neural Networks [3], Support Vector Machines (SVMs) [4], and Bayesian Networks [5], have been widely employed in IDSs, because, contrary to conventional methods, they can detect suspicious traffic or discover unseen patterns and abnormalities in the data.
However, these learnable methods require plenty of data to be trained correctly. Data acquisition is often a difficult procedure, since most data are private with strict privacy policies. Therefore, the generation of synthetic tabular data is an important task, which has been thrilling researchers for many years. Relevant past work treats each column of a table differently, creating a joint multivariate probability distribution and then sampling from that distribution, employing Bayesian networks [6][7][8][9] as well as classification and The mathematical expression of the Vanilla GAN derives from the cross-entropy between the real and generated distributions, and it is the following: min G max D V(D, G) = E x∼p data (x) [log D(x)] + E z∼p z (z) [log(1 − D(G(z)))]. (1) A proof of the above equation is out of the scope of this research material, though a brief description will be referenced here, shortly.
In Equation (1), discriminator D(x) tries to maximize the quantity V(G, D) for any given generator G(z), while G(z) is the generator's output when given z. E x∼p data (x) and E z∼p z (z) correspond to expected values over all real data instances and over all generated fake instances, respectively. The global optimum for this task is when p data = p g , and this corresponds to the global minimum of the training criterion.
In order to avoid overfitting when training finite datasets, the Discriminator D must be optimized simultaneously with the Generator G. During the start of the learning process, the Discriminator D may reject the high confidence samples created by the Generator G, because they are different from the training data. To address that, Generator G can be trained to maximize log D(G(z)) instead of minimizing log(1 − D(G(z)).

Wasserstein GAN (WGAN)
The Wasserstein GAN (WGAN) [13] is a modification of the Vanilla GAN, which aims to train the Generator and the Discriminator to better approximate the distribution of the real data. The WGAN introduces some changes over the traditional GAN model, which improve the training process.
Initially, a meaningful loss function, namely Wasserstein-1 distance, is applied, which correlates with the quality of the generated samples. The Wasserstein-1 distance measures the distance between probability distributions.
Moreover, the Discriminator of WGAN does not contain a sigmoid activation at the last layer, resulting to logits. The Wasserstein-1 distance, as it is described in Equation (2), is applied directly to logits, forcing the logit distributions to be similar. The output of the discriminator loss is a score, which indicates the realness or fakeness of the generated sample. The Lipschitz function is used to constrain the optimization problem by clipping the weights of the discriminator function. Lastly, the RMSProp optimizer is used.
The Wasserstein-1 metric, also called Earth Mover's distance, is defined as follows: where ||x − y|| is the cost function, p r , p g are the probability distributions, and Π p r , p g denotes the set of all joint distributions γ(x, y). The infinite number of joint distributions in Π p r , p g makes the Wasserstein-1 distance intractable. Thus, the authors in [13] apply Kantorovich-Rubinstein duality. Hence, Wasserstein-1 distance takes the following form: where the supremum corresponds to all the 1-Lipschitz functions f : X → R . Merging this function with a GAN, the result is as follows: where the functions { f w } w∈W are K-Lipschitz.
The WGAN model performs a more stable training process that is less sensitive to the architecture of the model and the selection of hyperparameters. Moreover, the mode collapse phenomenon that typically occurs in GAN is reduced. The most significant benefit of WGAN is the continuously estimation of Wasserstein-1 distance, by training the Discriminator until optimality. In contrast, weight clipping is not a good way to enforce Lipschitz constraint. If the clipping parameter is large, then the time needed for the weights to reach their limit increase. If the clipping is small, this may result in vanishing gradients. That is the reason why the authors in [14] proposed the use of gradient penalty instead of weight clipping.

WGAN-GP
Wasserstein GAN with Gradient Penalty (WGAN-GP) [14] was introduced shortly after the WGAN algorithm. The improvement of this work lies on the gradient penalty that is used to enforce Lipschitz constraint, instead of the weight clipping of the WGAN. Particularly, the WGAN-GP penalizes the model when the gradient norm moves away from the target norm value of 1. The application of gradient penalty requires one more modification in the architecture. Specifically, the batch normalization is not used in the Discriminator, since the batch normalization introduces a correlation between the samples of the same batch. However, the gradient penalty is calculated for each individual sample and not for the entire batch, making the batch normalization not a suitable technique. Other normalization techniques that do not correlate the samples can be used, such as layer normalization. The WGAN-GP demonstrates strong performance as well as stability in different applications.
The loss function with Wasserstein distance and gradient penalty applied is defined by the equation: where λ is the penalty coefficient.

GANs for Synthesizing Tabular Data
GAN models have shown great possibilities in the generation of synthetic images and text. Over the past few years, potential uses of GANs have been explored for tabular data generation, since they offer great flexibility to model data distributions in contrast with traditional statistical techniques.
Giving the table Treal with real data, the task of synthetic data generation results in a synthetic table T syn. The T is partitioned into training set T train and test set T test . A GAN model is trained at T train . The data generator G learns the data distribution of each column in a table T, and then, it is used to generate the synthetic data of the table T syn . A successful data generator G for tabular data should be able to address the challenges associated with the nature of real-world tabular data.
It is common for T to contain mixed data types of numerical and categorical columns. The numerical columns of table can have either discrete or continues values. Thus, the Generator G should be trained to simultaneously learn and generate a mix of data types. Additionally, the shape distribution of each column may differ, following usually non-Gaussian and multimodal distributions, where the min-max transformation causes vanishing gradient problems. In categorical columns of real-world tabular data, the imbalance problem often occurs, since some classes have more instances than others. Imbalanced data lead to mode collapse as well as to inadequate training of the minor classes. Furthermore, sparse one-hot-encoded vectors can cause issues at the training procedure of the Discriminator D, since it learns to distinguish real from fake data from the distribution's rareness rather to the realness of the value. Different GAN models have been created in order to solve some or all of the abovementioned issues. TGAN [22] is constructed to work on any tabular dataset. MedGAN [27] tries to generate simultaneously discrete and continuous samples. In this work, the most promising GAN models for tabular data generation are analyzed, which are TableGAN [28], CTGAN [23], and CopulaGAN [29]. TableGAN [28] is a GAN-based architecture that synthesizes fake tabular data with similar statistical properties to the original table. Privacy concerns motivate the authors to develop this model in order to prevent information leakage. The discriminator D and the generator G of the TableGAN are convolutional neural networks (CNN). The architecture of TableGAN is based on the deep convolutional GAN (DCGAN) [15], which is one of the most popular models for image synthesis.
Another architectural component of TableGAN is the classifier C, which is involved in the training process and aims to increase the semantic integrity of generated records. The classifier C has the same architecture as the discriminator D, while it is trained based on the ground-truth labels of the real table. C learns the correlation between true labels and the features of the table and predicts the labels of the synthetic data. Thus, C educates the generator G if the generated record is semantically correct.

CTGAN
Conditional Tabular GAN (CTGAN) [23] is a GAN-based architecture that is designed to synthesize tabular data. The key improvements of CTGAN try to overcome the challenges of modeling tabular data using GAN architecture. In particular, the architecture of CTGAN deals with non-Gaussian and multimodal distribution by exploiting a modespecific normalization, which converts continuous values of arbitrary distribution into a bounded vector, which is a representation suitable for neural networks. Previous models, such as TableGAN [28], normalize continuous values to [−1, 1] using min-max normalization techniques. In CTGAN, the variational Gaussian mixture model (VGM) [38] is used for each continuous column independently.
Additionally, a conditional generator and training-by-sampling is implemented to overcome the data imbalance challenge of discrete columns. The data are sampled in a way that all the categories of discrete columns evenly occurred during the training procedure. A cond vector allows the conditioning on a value of a specific column via one-hot encoding. The conditional generator G takes as inputs random noise as well as the cond vector, while it is forced to mimic the desired condition. The training of the model is done using the WGAN loss with gradient penalty [14]. The output of conditional generator is evaluated by the critic, computing the distance between the learned and real conditional distribution.

CopulaGAN
The CopulaGAN model [29] is a variation of the CTGAN, which is introduced in the SDV opensource library. It exploits the Cumulative Distribution Function (CDF)-based transformation, which is applied via GaussianCopula. Particularly, CopulaGAN uses those alternatives of CTGAN in order to learn the data more easily. Based on probability theory, copulas are used to describe the intercorrelation between random variables.
During the training procedure, CopulaGAN tries to learn the data types and the format of the training data. The non-numerical and null data are transformed using a Reversible Data Transformation (RDT). Due to this transformation, a fully numerical representation is occurred from which the model can learn the probability distributions of each table column. Additionally, the CopulaGAN attempts to learn the correlation between the columns of the table.

Evaluation Metrics
Recent advances of generative modelling identified the need for suitable quantitative and qualitative methods to evaluate trainable models. Reliable evaluation metrics are important not only to rate GAN models but also to identify possible errors in the generated data. Specifically in cases where people face difficulties distinguishing the quality of synthetic data, such as medical images, the requirement for trusted metrics is essential [39]. Evaluating a GAN model is not a straightforward procedure, since various metrics can lead to different outcomes. Specifically, a good performance in one evaluation metric cannot guarantee good performance in another metric [40]. Additionally, the metrics should be chosen with respect to the application that are going to be used for. Inception Score [41], Fréchet Inception Distance [42] and Perceptual Path Length [43] are some metrics that are introduced for the evaluation of general GAN models.
The synthetic data would be evaluated against a sufficient number of metrics that are suitable for the task of tabular data generation. A combination of those methods can express a complete picture about the performance of the generator G of different GAN models. The evaluation is performed on the table of real data Treal as well as on the table of synthetic data Tsyn, which are generated from the trained generator G. Metrics can be categorized into three subcategories: Visual, Statistical, and Machine Learning based.

Visual Evaluation
Visual representation of the generated data is a powerful method to evaluate the performance of the generator G, by analyzing if G is able to maintain the properties of the real data. Based on this, humans can easily verify results and recognize similar patterns between real and synthetic data. Additionally, the visual analysis of results provides information that cannot be covered from the quantitative metrics. The visual evaluation can be based on Distribution, Cumulative Sums, and Column Correlation.
The Distribution plot of each column for real and synthetic data can be a quick sanity check, although it does not reveal any hidden relation. This representation can point out if the statistical properties of the generated and real data are similar to each other.
The Cumulative Sum of each column for real and generated data can be visualized to indicate the similarity between the distributions per column. This visualization can present a useful understanding for both categorical and continuous columns. However, this representation cannot provide any insight about the relations between columns.
Another evaluation method can be based on the Correlation table, which shows the association between each column of the table. Comparing the correlation matrix of the real and synthetic data can indicate if the generator manages to appropriately model the relationship between the columns of the table.

Statistical Metrics
Different types of statistical tests can be applied on real and generated tables. Particularly, these metrics compare individual columns of the real table with the corresponding column from the generated data and produce a result of this analysis. Two statistical tests are used for the evaluation of the GAN models, which are trained with an intrusion detection dataset: KSTest and CSTest.
The KSTest compares the distributions for the continuous features between the real and the generated table by using the two-sample Kolmogorov-Smirnov test and the empirical Cumulative Distributed Function (CDF). The result of each column comparison is 1 minus the Test D, expressing the maximum distance between the observed and expected CDF values.
For discrete values, the CSTest metric exploits the Chi-squared test in order to compare the distributions of columns between real and synthesized data. The comparison is performed column by column, and the result is the CSTest p-value, the probability that values from the two columns have been sampled from the same distribution.

Machine Learning-Based Metrics
This family of metrics exploits Machine Learning algorithms to evaluate the quality of the generated data. They are able to provide insight knowledge about the relations that Treal and Tsyn have. Particularly, GANs the for tabular data generation task are evaluated by the detection metrics as well as Machine Learning efficacy metrics.
The Detection Metrics evaluate how difficult is to differentiate the generated from the real data. Specifically, those metrics are based on Machine Learning models, which predict if the input data is synthetic or real. For this reason, a flag is associated to each Information 2021, 12, 375 7 of 14 data record, indicating if it is real or generated. Afterwards, the data with the flags are shuffled and the Machine Learning models are cross-validated, attempting to predict the flag. Finally, the result of those metrics equal to 1 minus the average ROC AUC score of all cross-validation splits. The Machine Learning models that can be used in those metrics are Logistic Regression or SVD classifier.
The Machine Learning Efficacy Metrics indicate if it is possible to replace the real with generated data to solve problems using Machine Learning models. In particular, a model is trained on Tsyn, and then, it is tested on Treal. In case of classification problems, Decision Tree, AdaBoost, or MLP classifier can be used, while the performance of those models is evaluated based on accuracy and F1 score. For regression tasks, Linear Regression or MLP regression may be utilized as machine learning models, and the evaluation is performed by R2. The average performance of different models can be used as metric for the evaluation of G.
As it is mentioned above, those metrics are occurred by solving Machine Learning problems. Therefore, they can only be applied on datasets that contain a target column, which should be predicted based on the rest of the data. The target column could contain true labels or ground truth values for the classification and regression task, respectively.

NSL-KDD Dataset
The NSL-KDD [32] is an Intrusion Detection System dataset, which is an improved version of its predecessor, the KDD'99 [44] dataset, which suffers from a large number of duplicate records. Specifically, the NSL-KDD dataset does not contain redundant records in the train and test sets. Therefore, classifiers that would be trained and tested at those datasets would not be biased based on the most frequent records. Another advantage of NSL-KDD over the KDD'99 is the reasonable amount of data in the train and test sets, providing the opportunity to perform experiments on the complete sets. The NSL-KDD is one of the most well-known publicly available datasets, since it is used by many researchers to develop efficient and accurate Intrusion Detection Systems. Studies have been performed to analyze the properties of the NSL-KDD dataset [45][46][47].
The NSL-KDD dataset contains 148,514 traffic records of normal activities and attacks. The dataset includes the train set KDDTrain+ and the test set KDDTest+. The train and the test datasets do not have the same distribution, while the test dataset contains some attacks that are not included in the train dataset. The labeled attacks of the NSL-KDD dataset can be grouped in four main types, which are Denial of Service (DoS), Probe, User to Root (U2R), and Remote to Local (R2L).
DoS [48] is a type of attack in which the malicious actor makes the network or the device unavailable for the users. This is achieved by flooding the computing or memory resources of the targeted machine until normal traffic cannot be processed; therefore, the user access to the machine is denied. DoS attack is one of the most frequently occurred cases in network intrusion. Probe is another type of attack, in which the hacker scans a machine or a device and gathers network information to identify vulnerabilities or weak points that can be used to harm the system [49]. Remote to User (R2L) is a type of attack, in which a remote user sends packets to a machine or a device through the internet, in which she/he does not have access [50]. The attacker aims to reveal the network vulnerabilities and take advantages of the privileges that a local user would have. The User to Root (U2R) type refers to an attack in which the attacker imitates a normal user, while he/she attempts to gain root access and super user privileges to the system [51]. Table 1 illustrates the type and the labeled attacks of the NSL-KDD dataset. From the statistical analysis of the NSL-KDD dataset, it is observed that there are 77,052 normal traffics, which is more than half of the total records. Regarding the attack records, DoS type is the most frequent with 53,386 instances, Probe type has 14,077 instances, while there are a few traffic records for R2L and U2R attacks, 3880 and 119 instances, respectively. Figure 1 illustrates the distribution of records, normal and attack types, in the NSL-KDD dataset.
he/she attempts to gain root access and super user privileges to the system [51]. Table 1 illustrates the type and the labeled attacks of the NSL-KDD dataset. From the statistical analysis of the NSL-KDD dataset, it is observed that there are 77,052 normal traffics, which is more than half of the total records. Regarding the attack records, DoS type is the most frequent with 53,386 instances, Probe type has 14,077 instances, while there are a few traffic records for R2L and U2R attacks, 3880 and 119 instances, respectively. Figure 1 illustrates the distribution of records, normal and attack types, in the NSL-KDD dataset.
DoS is one of the most common types of attack, and it frequently occurs in everyday life. This characteristic of DoS attacks is reflected in the NSL-KDD dataset, in which this category contains the majority of attack records. Additionally, the detection of DoS attacks is a crucial challenge that needs thorough investigation. Considering all those reasons, the present work is focused on the application of different GAN models for the generation of synthetic DoS attacks. Thus, the traffic records of the NSL-KDD dataset are divided into normal and attack classes based on the 'class' column, while the attacks are separated to four categories: DoS, Probe, R2L, and U2R. The records of DoS attack are selected for the training and evaluation GANs.  Table 2. According to the type  Table 2. According to the type and property of the features, they can be broken down into four categories, which are Intrinsic, Content, Time-based and Host-based.  (9): contains the number of connections to the same destination host or service as the current connection in the past two seconds.

•
Host-based features (10): checks the 100 past connections, which have the same destination host or service with the current connection.

Constraints for the Generating Adversarial Examples
The generated tabular data should represent attacks that have evading IDS as their purpose; however, in order to achieve this, the generation process should take account and maintain the functional characteristic of each attack category [19]. Based on attack principles, each category of attacks has its functional and nonfunctional features. The functional features describe the basic function of the attack, while the nonfunctional represent the secondary characteristics of the attack. The attack properties remain undistributed when the functional features do not change, and only the nonfunctional features are modified. Thus, to achieve reliable and valid generated attack records, the functional features should be unchanged, while the nonfunctional features can be modified. Therefore, the GAN models should be trained and then generate only the nonfunctional features, taking into account the examined attack category. Table 3 illustrates the functional features of each attack category of the NSL-KDD dataset, as it is discussed in [52]. Considering the aforementioned, the features of each DoS record of NSL-KDD are split into functional and nonfunctional, the functional ones remain unmodified, while the nonfunctional are used as the training set of GAN models. Consequently, the trained GAN models will be able to generate the nonfunctional features and coupled with the functional ones will create meaningful DoS records.

Experimental Setup
Experiments are designed to investigate the general properties and performance of the different GAN models for the task of generation of synthetic IDS data. In particular, the CTGAN [23], CopulaGAN [29], and TableGAN [28] are trained at the NSL-KDD dataset, taking into account the restrictions in the generation of DoS attack records, as described in Section 4.1. For our experiments, we use the GAN models, which are provided by the open-source synthetic data generation ecosystem SDV-The Synthetic Data Vault [30]. Each model is trained with a batch size of 500 and for 100 epochs. The learning rates for the generator and discriminator of all models are both 0.0002. Additionally, the discriminator steps of CTGAN and CopulaGAN are set to 5.
Then, the synthetic datasets that occurred from the trained GAN models are evaluated using the metrics, which are described in Section 3.2 of this work. The Distribution and Cumulative Sum plots are created using the Table Evaluator [53] library, while the Statistical and Machine Learning-based metrics are calculated based on the Single Table Metrics of the SDV library [54].

Results
To visually demonstrate the performance of GAN models for the generation of DoS attack records, some indicative features are chosen. Specifically, one discrete and two continuous features can be used to summarize the behavior of synthetic data. Figure 2 shows the cumulative sum and distribution plots of discrete features, namely "num_compromised". It is observed that both CTGAN and CopulaGAN outperform TableGAN in the case of discrete values. scribed in Section 4.1. For our experiments, we use the GAN models, which are provided by the open-source synthetic data generation ecosystem SDV-The Synthetic Data Vault [30]. Each model is trained with a batch size of 500 and for 100 epochs. The learning rates for the generator and discriminator of all models are both 0.0002. Additionally, the discriminator steps of CTGAN and CopulaGAN are set to 5.
Then, the synthetic datasets that occurred from the trained GAN models are evaluated using the metrics, which are described in Section 3.2 of this work. The Distribution and Cumulative Sum plots are created using the Table Evaluator [53] library, while the Statistical and Machine Learning-based metrics are calculated based on the Single Table  Metrics of the SDV library [54].

Results
To visually demonstrate the performance of GAN models for the generation of DoS attack records, some indicative features are chosen. Specifically, one discrete and two continuous features can be used to summarize the behavior of synthetic data. Figure 2 shows the cumulative sum and distribution plots of discrete features, namely "num_compromised". It is observed that both CTGAN and CopulaGAN outperform TableGAN in the case of discrete values.

CTGAN
CopulaGAN TableGAN Figure 2. Cumulative Sums and Distributions for discrete feature "num_compromised". Blue illustrates the real data, while orange illustrates the synthetic. Figures 3 and 4 illustrate the corresponding diagrams for continuous features. In those cases, the TableGAN achieves slightly better performance than CTAGN and Copu-laGAN. However, as it is depicted in Figure 4, all GAN methods suffer when modeling continuous features with sparse data.   Figure 3. Cumulative Sums and Distributions for continuous feature "dst_host_rerror_rate". Blue illustrates the real data, while orange illustrates the synthetic data. Figure 3. Cumulative Sums and Distributions for continuous feature "dst_host_rerror_rate". Blue illustrates the real data, while orange illustrates the synthetic data. Figure 4. Cumulative Sums and Distributions for continuous feature "dst_host_srv_diff_host_rate". Blue illustrates the real data, while orange illustrates the synthetic data. Figure 5 illustrates the column wise correlation of real data as well as the generated one from CTGAN, CopulaGAN, and TableGAN. The columns with features that contain zero values are eliminated from this representation. The correlation matrix of generated data is compared to the correlations of real data. Generally, all the models are able to adequately capture the correlations between features. Although, as it is observed, TableGAN faces difficulties in capturing some of the correlations.  Figure 5 illustrates the column wise correlation of real data as well as the generated one from CTGAN, CopulaGAN, and TableGAN. The columns with features that contain zero values are eliminated from this representation. The correlation matrix of generated data is compared to the correlations of real data. Generally, all the models are able to adequately capture the correlations between features. Although, as it is observed, TableGAN faces difficulties in capturing some of the correlations.    Table 4 shows the Statistical Metrics as well as the Detection Metrics. Considering the CSTest and KSTEst values, it is concluded that CTGAN and CopulaGAN can better model the discrete features rather than the continuous values of the real data. On the other hand, it seems that TableGAN more easily models the discrete features and struggles with the continuous ones. For all the models, the Detection Metric indicates that the Logistic Regression Classifier finds it moderately difficult to distinguish the real from the generated data. Therefore, the real and the generated data are distinguishable to one degree. Finally, the Machine Learning Metrics are visualized in Table 5. All the models indicate similar performance, showing that it is possible to replace the real with synthetic data to solve Machine Learning problems.

Conclusions
In this work, we review the most popular GAN-based models for tabular IDS data synthesis. More specifically, CTGAN, TableGAN, and CopulaGAN are employed on the well-known NSL-KDD dataset to generate synthesized IDS samples. Based on visual representation and statistical metrics, it is observed that TableGAN works sufficiently for continuous data; however, it suffers in situations of discrete values. Additionally, CTGAN and CopulaGAN work fairly well for both continuous and discrete values. From our experiments, all the models can adequately capture the correlations between features. From the Machine Learning efficacy metrics, it is observed that it is possible to replace the real data with synthetic ones to solve data issues for Machine Learning tasks. However, from detection metrics, the classifiers' accuracy in distinguishing the real from synthetic generated data is moderated. Considering all these, we can conclude that the generated IDS dataset from CTGAN, CopulaGAN, and TableGAN can be used as learning data for various Machine Learning tasks.