Generating Synthetic Fermentation Data of Shindari, a Traditional Jeju Beverage, Using Multiple Imputation Ensemble and Generative Adversarial Networks

Abstract: Fermentation is an age-old technique used to preserve food by restoring a proper microbial balance. Boiled barley and nuruk are fermented for a short period to produce Shindari, a traditional beverage of the people of Jeju, South Korea. Shindari has been shown to offer multiple health benefits when fermented for an optimal period. To keep the advantages of the microorganisms produced by fermentation intact and to avoid contamination, it is necessary to predict the ideal fermentation time for each microbial community. Prediction through machine learning requires historical data, but fermentation data of Shindari is time consuming and expensive to obtain and not easily available. There is therefore a need to generate synthetic fermentation data to explore the benefits of the drink and to reduce the risk of overfermentation. In this paper, we propose a model that takes incomplete tabular fermentation data of Shindari as input and uses a multiple imputation ensemble (MIE) and generative adversarial networks (GAN) to generate synthetic fermentation data that can later be used for prediction and microbial spoilage control. For multiple imputation, we used multivariate imputation by chained equations and random forest imputation, and ensembling was done using the bagging and stacking methods. For generating synthetic data, we remodeled the tabular GAN with skip connections and adapted the architecture of Wasserstein GAN with gradient penalty. We compared the performance of our model with other imputation and ensemble models using various evaluation metrics and visual representations. Our GAN model overcomes the mode collapse problem and converges faster than existing GAN models for synthetic data generation. Experimental results show that our proposed model executes with less error, is more accurate, and generates significantly better synthetic fermentation data than other models.


Introduction
Shindari is a traditional Jeju beverage made from rice or barley through a fermentation process that takes 3-4 days in summer and 5-7 days in winter [1]. Shindari is considered a Jeju yogurt and is also known as sweet liquor or low-alcohol wine. Jeju is a volcanic island and, in older days, it was difficult to transport and acquire food from the mainland, so the islanders had to rely on ingredients produced on the island itself and on preserving food through fermentation. To produce Shindari, barley is first cleaned, dried, ground, and kneaded into shape. The barley is then fermented at 40 °C and can be stored for 2 months.
Fermented foods have great consumer acceptability and nutritional value. Fermented food, especially yogurt, helps to reinstate a proper microbial balance in the intestine, helping to eliminate lactose intolerance and thus improving the overall immune system. Research has also shown [2] that fermented food can aid recovery from hepatitis C.

To capture the connection between consecutive diagnostic data, prior work used one-dimensional (1-D) convolutional neural networks, with convolutional autoencoders then used to connect discrete and continuous values. PATE-GAN (Private Aggregation of Teacher Ensembles) is another GAN model for generating synthetic data; it preserves the generator's differential privacy, which is critical for biomedical data [18]. The generator in PATE-GAN is similar to that of a regular GAN model, but the discriminator is fitted with a PATE mechanism, resulting in K teacher discriminators and asymmetrical training. Pedro Narvaez et al. [19] used a GAN to synthesize natural heart sounds. Another line of work applied neuron-based generative models, namely deep convolutional generative adversarial networks and variational autoencoders, to SSVEP classification in EEG data [20]. Recent work using bidirectional recurrent neural networks has also yielded efficient results for producing synthetic biomedical signals [21].
Synthetic data are artificial data generated by replicating the statistical characteristics of the original data. There are many methods of generating synthetic data. When real data is not available, synthetic data can be created according to a distribution known by the data analyst, or the best-fit distribution can be determined for the original data. Variational autoencoders and generative adversarial networks are the latest models, performing better than traditional models in generating synthetic data. The synthetic data generated through deep learning models are processed through classification, clustering, and regression algorithms to validate their correlation with the original data and to evaluate how they affect the performance and accuracy of models compared to using the original data. Often, the original data has missing values that, when imputed, play an important role in enhancing the performance of any model. Predicting the missing values in the original data is one of the most important tasks and the initial task for generating synthetic data. In this paper, we propose a multiple imputation ensemble to predict the missing values in the original data and then apply generative adversarial networks to generate synthetic fermentation data for Shindari from the original tabular data. The main contributions of this paper are summarized below:

1. In the preprocessing stage, we proposed a multiple imputation model, a combination of multivariate imputation by chained equations (MICE) and random forest imputation (RFI), to impute the original missing data.

2. Multiple imputation was then combined with the bagging and stacking ensemble approach to predict and complete the tabular fermentation data.

3. We proposed a tabular GAN with skip connections and adapted the WGAN-GP architecture to generate synthetic tabular data. The resulting TGAN-skip-WGAN-GP (Tabular GAN with skip connections and Wasserstein GAN with gradient penalty) enhances performance by reducing convergence time and eliminating the mode collapse problem.

4. In the evaluation stage, we assessed the performance of the proposed imputation and GAN models both visually and quantitatively.

5. The proposed approach can be utilized for generating synthetic data for any kind of fermentation process that involves tabular data.
We present the outcome of our proposed approach in the Results section. Imputation with ensemble was evaluated using mean dissimilarity, regressor accuracy, and Kappa error plots. The GAN was evaluated through the correlation coefficient, mean absolute error, root mean square error, percent root mean square difference, Fréchet distance, and mirror column association.

Proposed Methodology for Generating Synthetic Fermentation Data of Shindari
In this section, we elaborate on our proposed approach for generating synthetic fermentation data of Shindari using the multiple imputation ensemble method and generative adversarial networks. An overview of the proposed approach is presented in Figure 1. First, we discuss the data we used for the implementation and then proceed to explain the preprocessing, prediction, and GAN synthesizer stages. In the preprocessing phase, the distribution of the observed data is used by the multiple imputation method to fill in the missing values of the fermentation dataset. For the multiple imputation approach, we explored several imputation methods, among which multivariate imputation by chained equations (MICE) and random forest imputation (RFI) produced plausible results. The bagging and stacking ensemble approach was used to aggregate the predictions of various regressors to obtain the best predictions to feed the GAN synthesizer. After the final prediction, the proposed method transforms the data, applies the TGAN-skip-WGAN-GP architecture to synthesize the tabular data, performs the inverse transformation, and produces the final synthetic fermentation data of Shindari. We evaluated our proposed architecture based on the output quality of the imputation and the synthetic data. This section is further divided into four subsections that explain the overall approach used for each module in the proposed architecture. Shindari microbial community data is described in Section 2.1, preprocessing of the data is explained in Section 2.2, Section 2.3 presents the methodology used for prediction, and finally, Section 2.4 defines the GAN architecture for generating synthetic data.

Shindari Microbial Community Data
It is important to understand how our data was obtained, the factors influencing the fermentation process, the variety of raw materials used, and the interactions between microbial communities. The Shindari fermentation extract was prepared by combining cooked rice, barley nuruk, and water, which was left in a 30 °C incubator and sampled at 24 h, 72 h, and 120 h. The samples were then filtered, and 99% ethanol was added to bring the final concentration to 70%. Fermented beverages contain microorganisms that interact with other microorganisms to form communities, and each community exhibits different characteristics. Therefore, it is important to analyze the microbial communities and their interactions rather than focus only on individual microorganisms. At every stage of the fermentation process, the changes in the microbial community were observed. While obtaining the Shindari data for our experiment, it was found that different fermentation times had different effects on the microbial community structures; for instance, lactic acid bacteria accounted for over 97% when the samples were fermented for more than 3 days. Within 24 h of fermentation, Saccharomyces cerevisiae is mainly involved in the Shindari fermentation process, producing ethanol, carbon dioxide, and other compounds from the sugars in the rice, whereas Pediococcus sp., which enhances immunity and prevents the growth of several spoilage and pathogenic microorganisms, becomes rapidly dominant during 48 h to 72 h of fermentation. When we synthesize new data, we consider the combinations of microorganisms and their interactions and learn the distribution of microorganisms for each specific microbial community from the original data. The detailed procedure used to obtain our data can be reviewed in [1]. We present the layout of our data in Figure 2. The microbial community CSV data of Shindari was obtained from the Jeju Inside Agency & Cosmetic Science Center, Republic of Korea [1].
The features of interest among the many in the CSV data were the rank, taxonomy, and name of each microbial community and its total counts. The rank category contained only species, whereas the taxonomy in the CSV data represented the grouping of microorganisms having the same characteristics. The name feature gave the names of the Shindari microbial communities. Each species and microbial community contained several microorganisms. During the fermentation process, the composition of every microbial community changes. From the laboratory experiments, we could obtain total counts for each microbial community for the 1st, 3rd, 5th, and 7th days. For our work on generating synthetic data, we considered the ten microbial communities most relevant and important to the fermentation process of Shindari [1], namely, the Bacillus amyloliquefaciens group, Citrobacter koseri, the Enterobacter asburiae group, the Enterococcus durans group, the Escherichia hermannii group, the Lactococcus lactis group, Pantoea allii, the Pediococcus acidilactici group, Pediococcus_uc, and Ralstonia pickettii. The rank, taxonomy, name, and total counts for the 1st, 3rd, 5th, and 7th days of each microbial community from the Shindari fermentation data are then passed through the preprocessing, prediction, and GAN synthesizer stages to obtain the new data.

Preprocessing
The raw data has total counts for alternate days of the week but lacks hourly information, which would make the generated synthetic data more correlated with the original data. Therefore, in the preprocessing stage, we take the training set with missing data and perform multiple imputation. There are four different approaches to handling missing data [22]: an overall analysis can be performed to deal with the missing data, statistical imputation methods can be applied, imputation can be done using machine learning algorithms, or an algorithm can be designed that itself incorporates a mechanism for dealing with missing data. In our proposed work, we concentrate on two multiple imputation techniques, namely, multivariate imputation by chained equations (MICE) and random forest imputation (RFI).

Multivariate Imputation by Chained Equations
Multivariate Imputation by Chained Equations is based on Fully Conditional Specification (FCS), which defines a multivariate imputation method with conditional densities for each incomplete variable, one after the other. Let I = (I_1, I_2, ..., I_m) be the set of incomplete variables, where I_n (n = 1, 2, ..., m) is one of them. We denote the observed and missing values in I as I^O = (I^O_1, ..., I^O_m) and I^Mis = (I^Mis_1, ..., I^Mis_m), respectively. "Quant" represents the quantity of scientific interest, which is often a multivariate vector [23]. With the above notation, the main steps of MICE can be explained as below:

1.
We start by analyzing the observed incomplete data I^O. To estimate "Quant" from I^O, we need to make assumptions about the unobserved data. From the distribution of the observed data, we select related or plausible values to replace the missing values. For each column or variable with missing values I^Mis, MICE identifies an imputation model and iterates repeatedly based on the number of variables or columns with missing values.

2.
For each imputed dataset, we then compute the "Quant" with an identical model.

3.
The last step involves pooling all the "Quant" estimates into one overall estimate and evaluating its variance. The missing values are replaced either by the predictions of the model or by using mean matching.
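The chained-equations idea above can be sketched in a few lines, with plain ordinary-least-squares column models standing in for the full MICE machinery; the function name and toy iteration count are ours, not part of the original pipeline:

```python
import numpy as np

def mice_impute(X, n_iter=10):
    """Chained-equations sketch: each incomplete column is regressed
    (ordinary least squares) on the remaining columns in turn."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # Step 0: initialise missing entries with column means.
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            # Regress column j on all other (currently imputed) columns.
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            obs = ~miss[:, j]
            coef, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            # Replace missing cells by the model predictions.
            X[miss[:, j], j] = A[miss[:, j]] @ coef
    return X
```

A production MICE implementation would additionally draw multiple imputed datasets and pool the "Quant" estimates as described in step 3.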

Random Forest Imputation
We use random forest imputation since it requires less hyperparameter tuning, is inexpensive in measuring Out of Bag (OOB) performance, and is efficient in handling nonlinear relationships in data. In the random forest approach, we use the proximity imputation method to fill the missing values. The missing values are divided according to continuous and categorical variables. For continuous variables, imputation is done using the median of the nonmissing values; for categorical variables, the most frequent nonmissing values are used. The imputed data is then used to fit a random forest, from which the symmetric proximity matrix is determined. We use the proximity matrix to fill the original missing values: the proximity-weighted mean of the nonmissing data fills the missing values of continuous variables, and the category with the largest mean proximity over the nonmissing data fills the missing values of categorical variables [24]. After every iteration, a new random forest is generated and the process is repeated.
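A simplified, missForest-style sketch of this loop, assuming scikit-learn is available; it keeps the median initialisation and the per-column refitting described above but omits the proximity-matrix weighting for brevity:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_impute(X, n_iter=3):
    """Iterative random-forest imputation sketch for continuous columns:
    initialise missing cells with column medians, then repeatedly refit a
    forest per incomplete column and overwrite the missing cells."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # Median initialisation of the missing continuous values.
    for j in range(X.shape[1]):
        if miss[:, j].any():
            X[miss[:, j], j] = np.nanmedian(X[:, j])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            rf = RandomForestRegressor(n_estimators=30, random_state=0)
            others = np.delete(X, j, axis=1)
            rf.fit(others[~miss[:, j]], X[~miss[:, j], j])
            # Overwrite only the originally missing cells.
            X[miss[:, j], j] = rf.predict(others[miss[:, j]])
    return X
```

Categorical columns would use the most frequent category and a classifier instead; the proximity-based weighting of [24] is a refinement on top of this scheme.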

Ensemble and Prediction
The ensemble technique was introduced by Tukey [25] to combine several base machine learning models into an optimal model for prediction. The ensemble technique can be used for both classification and regression tasks. In our work, instead of using a single model, we induced a set of regressors and aggregated their predictions to obtain a better result. Ensemble techniques can be categorized into homogeneous and heterogeneous types. Homogeneous, as the name suggests, refers to ensembles created from regressors of the same type, whereas heterogeneous refers to ensembles created by combining regressors of different types. Our main goal in using an ensemble model is to achieve higher accuracy and better performance than any single model. In our proposed work, we achieve this by using the bagging and stacking ensembles.

Bagging Ensemble
Bagging is a well-known ensemble approach useful for both classification and regression problems. Bagging is also known as "bootstrap aggregating". Bootstrapping refers to the generation of bootstrap samples of size B_S from a dataset of size D_S by random selection with replacement. These samples approximate draws from the unknown data distribution and are treated as independent of one another; they can further be used to approximate the variance of an estimator. Every model has an input, output, and processing layer; the processing differs according to the input training dataset, and so does the output, according to the performance of each model. Bagging is the process where we fit several independent models and, at the end, average their predictions to obtain the best result with low variance.
As shown in Figure 3, the first step in the bagging approach is to create multiple bootstrap samples B_S; we then fit a base regressor to each sample and aggregate the regressors by averaging their outputs. Let us denote the m bootstrap samples of size B_S as in (1), where O_r^p is the r-th observation of the p-th bootstrap sample. There are therefore m independent base regressors {R_1, R_2, ..., R_m}, as shown in Figure 3. To obtain the ensemble model, we combine the regressors and compute their average, which yields a model with lower variance, as denoted in Equation (2). Each regressor makes a prediction; for classification the final prediction is the one that receives the majority of the votes, whereas for regression, as in our case, the predictions are averaged.
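Equations (1) and (2) amount to drawing m bootstrap samples with replacement and averaging the base regressors' predictions. A pure-NumPy sketch with hypothetical linear base regressors and toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)  # toy target

def fit_linear(Xb, yb):
    """Ordinary-least-squares base regressor (intercept + slope)."""
    A = np.column_stack([np.ones(len(Xb)), Xb])
    coef, *_ = np.linalg.lstsq(A, yb, rcond=None)
    return coef

# m bootstrap samples of size B_S, one base regressor per sample (Eq. (1)).
m, B_S = 25, len(X)
coefs = []
for _ in range(m):
    idx = rng.integers(0, len(X), size=B_S)  # sample with replacement
    coefs.append(fit_linear(X[idx], y[idx]))

# Bagged prediction: average of the base regressors' outputs (Eq. (2)).
A_full = np.column_stack([np.ones(len(X)), X])
pred = np.mean([A_full @ c for c in coefs], axis=0)
```

In practice, a library bagging regressor over decision trees (as used in the paper's experiments) replaces the hand-rolled linear base model here.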

Stacking Ensemble
The next ensemble model is the stacking ensemble, which combines several base learners by training a metamodel and uses the outputs of the regressors, instead of the training input data, to develop the ensemble model [26]. To build a stacking ensemble, we first produced the models by training a set of base learners on the full training input data. The prediction from the first layer of each model is used as an input, or metalevel attribute, to the ensemble model. As we can see from Figure 4, the new set of predictions at level 1 is used to train a meta regressor in the ensemble model. After cross-validation, we obtain the final prediction.

Generative Adversarial Networks
Generative Adversarial Networks were first introduced by Ian Goodfellow et al. [27] in 2014 as a machine learning framework in which two neural networks compete against each other in a zero-sum game. GAN is considered one of the cutting-edge frameworks that can be effectively applied in various fields, producing remarkable results. The two networks in the GAN model are known as the generator network (G) and the discriminator network (D). As shown in Figure 5, a random noise variable z drawn from an input prior p_z(z) is first passed through the generator network, which learns the generator distribution p_g over the data x. The discriminator is fed with the samples generated by the generator and with the real data, and attempts to correctly distinguish between the two. The goal of the GAN architecture is that the generator continuously tries to fool the discriminator until, at some point, it succeeds in generating realistic samples that cannot be distinguished from the real data. The generator thus tries to minimize log(1 − D(G(z))), and the minmax game function can be defined as in Equation (3). In this paper, we propose a combination resulting in an improved version of the well-known GAN model to generate synthetic tabular data related to the fermentation of Shindari.
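The minmax value of Equation (3), V(D, G) = E[log D(x)] + E[log(1 − D(G(z)))], can be illustrated numerically with toy affine stand-ins for the two networks; all parameters below are hypothetical and not the trained models of this paper:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=1000)  # "real" samples
z = rng.normal(size=1000)                      # input noise z ~ p_z(z)

G = lambda z: 2.0 * z + 3.0                    # toy generator G(z)
D = lambda v: sigmoid(1.5 * (v - 4.0))         # toy discriminator score in (0, 1)

# V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
value = np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))
```

During training, D ascends this value while G descends the second term, which is the zero-sum game described above.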
The proposed methodology enhances TGAN-skip with the WGAN-GP (Wasserstein GAN with gradient penalty) architecture; we combine these architectures to form TGAN-skip-WGAN-GP for generating synthetic data. We first preprocess the data by removing unwanted data and perform data transformation to make sense of the continuous and categorical values. After the transformation, we pass the data through our GAN model (TGAN-skip-WGAN-GP). The output of the GAN model goes through an inverse transformation, which finally produces the synthetic data, which is then evaluated.

Data Transformation
A meaningful representation of the original data is required for any machine learning model to extract useful information. Data transformation of continuous values is commonly done through normalization or standardization. Normalization transforms the values into the range 0 to 1, whereas standardization transforms the data to have a mean of 0 and a standard deviation of 1. We performed standardization on the real data x, defined as in (4), where λ is the mean and σ is the standard deviation.
For categorical values, we used embedding for data transformation. Embedding conserves the semantic relationship by transforming large vectors into smaller dimensional space vectors.
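A minimal sketch of the standardization in Equation (4) and its inverse (the counts below are hypothetical):

```python
import numpy as np

x = np.array([12.0, 80.0, 45.0, 160.0, 95.0])  # hypothetical microbial counts

# Equation (4): z = (x - λ) / σ, giving zero mean and unit variance.
lam, sigma = x.mean(), x.std()
z = (x - lam) / sigma

# The inverse transformation used after synthesis recovers the original scale.
x_back = z * sigma + lam
```

The same pair of operations underlies the inverse-transformation stage described later for the GAN output.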

TGAN-Skip-WGAN-GP
TGAN is a general-purpose GAN for generating synthetic data from tabular datasets. In general, the TGAN architecture does not possess skip connections. We use an improved version of TGAN that has skip connections to eliminate the vanishing gradient problem and decrease the convergence time, improving the overall performance. The skip connections in the TGAN architecture preserve the high magnitude of activations by allowing early activations to skip in-between layers. This retains old information, achieves higher gradients, and helps in training deeper models [28]. Figure 6a depicts the layout of the TGAN-skip architecture. For the generator of the GAN architecture, we used long short-term memory (LSTM) cells, as shown in Figure 6b, and the discriminator is built with a multilayer perceptron (MLP), as shown in Figure 6c. The generator is modeled to generate each continuous variable in two steps and each categorical variable in one step. For continuous variables, we first generate the value scalar S_n and then the cluster vector V_n; for categorical variables, we compute the probability distribution PD_n over each label, as shown in Figure 6b. RV is the random variable that is forwarded as input to the LSTM cell at each step st, along with the previous hidden vector HV_n (or embedding vector) and the attention-based context vector CV_n [29]. We can determine the context from the attention weight vector β_st by Equation (5), where Out is the output from the LSTM cell. In TGAN, the output from the LSTM passes through two dense layers to generate the final output; the output of the two dense layers then goes through another dense layer to form the attention vector for the next iteration of the LSTM network. We added the skip connection between the two dense layers. WGAN [30] was built to solve the flaws in vanilla GAN. WGAN-GP, developed by Gulrajani et al.
[31], is an improved version of WGAN that introduces a gradient penalty, which eliminates the need for weight clipping and enforces the 1-Lipschitz constraint. The new objective, or loss, of WGAN-GP is defined by the combination of the original critic loss and the gradient penalty loss, as given in Equations (6) and (7), where R_p and G_p are the data and generator distributions and x̂ ∼ P_x̂ denotes random samples. The value of λ, according to [31], is set to 10, since it was found experimentally to work well across various datasets and architectures. Using the WGAN-GP architecture for TGAN-skip decreases the discriminator loss and improves the performance of the generator. Normally, TGAN uses the architecture of vanilla GAN, in which any imbalance between the generator and discriminator can cause training to stagnate. By using the WGAN-GP architecture in the TGAN-skip model, we avoid both stagnation and mode collapse. TGAN-skip-WGAN-GP also converges smoothly and faster than TGAN with the vanilla GAN architecture. To implement the WGAN-GP architecture in TGAN-skip, we adapted the loss function to the Wasserstein distance and evaluated the gradient penalty. We changed the batch normalization layers of the discriminator in TGAN-skip to layer normalization and eliminated the sigmoid activation from the last layer. Furthermore, following the WGAN-GP architecture, we trained the discriminator for more iterations than the generator. For the discriminator, as shown in Figure 6c, we use n fully connected neural network layers and, for input, we concatenate PD_i, S_i, and V_i. The internal layers of the discriminator are then computed through (8), where LayerNorm is layer normalization in the discriminator, LP is the learning parameter, ⊕ represents concatenation, and diversity denotes the minibatch vector of the discriminator.
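The gradient-penalty term of Equation (7) can be illustrated with a linear critic, whose gradient with respect to its input is known in closed form; real implementations compute the gradient of the interpolates with automatic differentiation, so this is only a sketch under that simplifying assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)                   # toy linear critic f(x) = w @ x
real = rng.normal(size=(32, 4))          # samples from the data distribution
fake = rng.normal(size=(32, 4))          # samples from the generator

# Random interpolates x̂ between real and generated points (x̂ ~ P_x̂).
eps = rng.uniform(size=(32, 1))
x_hat = eps * real + (1 - eps) * fake

# For a linear critic, ∇_x̂ f(x̂) = w everywhere, so the gradient norm is ||w||.
grad_norm = np.linalg.norm(np.tile(w, (32, 1)), axis=1)

# Gradient penalty: λ · E[(||∇_x̂ f(x̂)|| − 1)²], with λ = 10 as in [31].
lam = 10.0
gp = lam * np.mean((grad_norm - 1.0) ** 2)
```

The penalty is added to the critic loss of Equation (6), pushing the critic's gradient norm toward 1 and thereby enforcing the 1-Lipschitz constraint softly.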

Inverse Transformation
Since the data was transformed before synthesis, an inverse transformation with respect to the initial data transformation is required to obtain an output as realistic as the original input data. Continuous data is converted back from a scalar ranging from −1 to 1 together with a multinomial distribution, and discrete variables are likewise recovered from a multinomial distribution. The synthetic data generated for Shindari fermentation is then evaluated to measure its correlation with the original data and to assess our proposed model.

Results and Discussion
We evaluated our model based on the quality of multiple imputation ensemble method results and synthetic data generated by the GAN model. The evaluation metrics we used to judge imputation results are mean dissimilarity, regressor accuracy, and Kappa error plots. For the GAN model, we evaluated mean correlation coefficient, mean absolute error, root mean square error, percent root mean square difference, Fréchet distance, and mirror column association. In Sections 3.1 and 3.2, we show the results of our experimentation for imputation and generation of synthetic data.

Imputation
For evaluating our imputation method, we experimented with and compared the results of different combinations of imputation and ensemble methods, as shown in Table 1. As regressors, we compared the outcomes of using support vector (SV), decision tree (DT), random forest (RF), and simple linear (SL) regressors in the different multiple imputation ensemble models. Values marked in bold in Table 1 performed better than the other regressors; from these results, random forest produces the best outcome, and the combinations of imputation and ensemble proposed in our work (RFI-Bag, RFI-SE, MICE-Bag, and MICE-SE) performed better than the other approaches. In Table 1, "Bag" represents the bagging ensemble and "SE" denotes the stacking ensemble.

Mean Dissimilarity
To measure the mean dissimilarity between the imputed and original data, we used the normalized Euclidean distance. The closer the mean dissimilarity is to zero, the better the quality of the imputed data. As Table 1 shows, random forest imputation, with both the bagging and the stacking ensemble, produces values closest to zero, indicating better performance.
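A sketch of mean dissimilarity as a row-averaged normalized Euclidean distance; the exact normalization used in the paper is not spelled out, so min-max column scaling is assumed here:

```python
import numpy as np

def mean_dissimilarity(original, imputed):
    """Normalized Euclidean distance between two tables, averaged over rows.
    Values near zero indicate imputed data close to the original."""
    original = np.asarray(original, float)
    imputed = np.asarray(imputed, float)
    # Assumed normalization: scale each column by its range in the original.
    scale = original.max(axis=0) - original.min(axis=0)
    scale[scale == 0] = 1.0
    diff = (original - imputed) / scale
    # Divide by sqrt(n_cols) so the per-row distance stays in a comparable range.
    return float(np.mean(np.linalg.norm(diff, axis=1) / np.sqrt(diff.shape[1])))
```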

Kappa Error Plots
Kappa error plots identify a point for every regressor in the ensemble model to measure its diversity-error pattern [32]. The lower the error, the better the model's performance. According to the results shown in Table 1, RFI-SE produces less error than the other models. Figure 7 shows the count distribution of the ten microbial communities that are most relevant to Shindari fermentation. Original data refers to the raw data we obtained from the Cosmetic Science Center. The counts are for the 1st, 3rd, 5th, and 7th days and, as the figure shows, the fermentation process makes the counts of the microbial communities change continuously.

Imputed Data
The original data is then imputed to fill the missing values. We generated six-hour data for every microbial community, as shown in Figure 8. We show the imputation result for different microbial communities in Figure 9. Comparing Figure 9 with the original data in Figure 7, we can see that the imputed value is correlated with the original count distribution of each microbial community.

Synthetic Data
To evaluate the quality of the generated synthetic data, we first find the correlation between the synthetic and the original data. We use Pearson's correlation coefficient, with values ranging from −1 to 1, to measure the relationship between two variables [33]. The interpretation of the correlation values is shown in Table 2, where a positive value represents a direct relationship, zero represents no relationship, and a negative value indicates an inverse relationship. We define Pearson's correlation coefficient (Coef) as in Equation (9), where Ori represents the original data and Syn the generated synthetic data.
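Equation (9) reduces to the standard sample correlation, i.e., the covariance of the two series over the product of their standard deviations; with hypothetical count series:

```python
import numpy as np

ori = np.array([10.0, 30.0, 55.0, 90.0, 140.0])  # hypothetical original counts
syn = np.array([12.0, 28.0, 60.0, 85.0, 150.0])  # hypothetical synthetic counts

# Pearson's correlation coefficient between the original and synthetic series.
coef = np.corrcoef(ori, syn)[0, 1]
```

A coefficient near 1 indicates that the synthetic series tracks the original closely.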

Mean Absolute Error
By Equation (10), we calculate mean absolute error (MAE), which is the measure of absolute error between the original and the generated synthetic data.

Root Mean Square Error
The root mean square error (RMSE) is used to quantify the stability and measure how different the generated data is from the original data, as shown in Equation (11).

Percent Root Mean Square Difference
To measure the distortion between the original and synthetic signals, we defined the Percent Root Mean Square Difference (PRD) as in Equation (12).
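The three error measures of Equations (10)-(12) can be written directly; `ori` and `syn` are assumed to be equal-length NumPy arrays:

```python
import numpy as np

def mae(ori, syn):
    """Equation (10): mean absolute error."""
    return float(np.mean(np.abs(ori - syn)))

def rmse(ori, syn):
    """Equation (11): root mean square error."""
    return float(np.sqrt(np.mean((ori - syn) ** 2)))

def prd(ori, syn):
    """Equation (12): distortion as a percentage of the original signal energy."""
    return float(100.0 * np.sqrt(np.sum((ori - syn) ** 2) / np.sum(ori ** 2)))
```

All three are zero for identical series, and RMSE is always at least as large as MAE.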

Fréchet Distance
To find how similar the ordering and location of points along the curves are, we evaluated the Fréchet distance (FD). Suppose Ori_O = (b_1, b_2, b_3, ..., b_O) is the sequence of points along the original curve and Syn_P = (c_1, c_2, c_3, ..., c_P) is the sequence of points along the synthetic curve. We can then measure the length l of a coupling sequence as in (13) [33], where l refers to the Euclidean distance and b_{s_i} and c_{t_i} indicate the coupled points of the sequence. The Fréchet distance can then be computed as shown in (14) [34].
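The discrete Fréchet distance of Equations (13) and (14) can be computed with the standard dynamic programme over coupling sequences:

```python
import numpy as np

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two curves given as point sequences:
    the minimum over couplings of the maximum Euclidean link length l."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    n, m = len(P), len(Q)
    ca = np.full((n, m), -1.0)

    def d(i, j):
        # Euclidean distance l between coupled points b_i and c_j.
        return np.linalg.norm(P[i] - Q[j])

    ca[0, 0] = d(0, 0)
    for i in range(1, n):                  # first column: only advance along P
        ca[i, 0] = max(ca[i - 1, 0], d(i, 0))
    for j in range(1, m):                  # first row: only advance along Q
        ca[0, j] = max(ca[0, j - 1], d(0, j))
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]),
                           d(i, j))
    return float(ca[-1, -1])
```

For two parallel unit-separated segments the distance is exactly the separation, which makes the measure easy to sanity-check.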

Mirror Column Association
To find the association between each column of the original dataset and the corresponding column of the generated synthetic dataset, we measured the mirror column association. A higher value represents a stronger association, indicating better performance of the model. Table 3 compares the performance of the different models and shows that our proposed TGAN-skip-WGAN-GP performs best. For visual evaluation, we plotted the mean and standard deviation of each column; the closer the plotted values of each column are to the diagonal, the more correlated they are with the original data. We compare the mean and standard deviation results of TGAN, TGAN-skip, TGAN-WGAN-GP, and our proposed TGAN-skip-WGAN-GP for the synthetic data in Figure 10. As the results show, the means and standard deviations of each column using TGAN-skip-WGAN-GP are closest to the diagonal, outperforming the other GAN models.

Cumulative Sum
Figure 11 displays the cumulative sum, i.e., the distribution per column, for the original and generated synthetic data. The figure shows the count distribution of Pediococcus acidilactici for the original data and the synthetic data generated by four different GAN models. The cumulative sum can be used to evaluate both continuous and categorical values. As Figure 11 shows, the distribution of Pediococcus acidilactici using TGAN-skip-WGAN-GP is the most closely related to the original.

Maximum Mean Discrepancy
The maximum mean discrepancy (MMD) measures the distance between two distributions through their mean embeddings in a reproducing kernel Hilbert space. The measured MMD values for our original and synthetic samples were 0.6 and 0.3, respectively, which evaluates our synthetic data favorably.
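A sketch of the (biased) squared-MMD estimate with an RBF kernel; the kernel and bandwidth used in the paper are not stated, so these are assumptions:

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased squared MMD with an RBF kernel:
    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    def k(A, B):
        # Pairwise squared Euclidean distances, then the RBF kernel.
        sq = (np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-gamma * sq)
    return float(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())
```

The estimate is zero when both samples coincide and grows as the two distributions move apart.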

Conclusions
In this paper, we proposed a methodology to generate synthetic tabular data for Shindari fermentation using a multiple imputation ensemble and generative adversarial networks. We first preprocessed the data and used multivariate imputation by chained equations and random forest imputation to fill in the missing data. For prediction, we used the bagging and stacking ensemble methods. We tested our model with different regressors, among which random forest produced the best results, with low values in the Kappa error plots. In the Results section, we presented experimental results for different combinations of the imputation and ensemble methods, which show that our proposed methodologies MICE-Bag, MICE-SE, RFI-Bag, and RFI-SE produce low error rates and achieve higher accuracy with low mean dissimilarity. A visual presentation of the original and imputed data for every microbial community has also been provided in this paper.
After the multiple imputation ensemble, we performed data transformation, which provides more meaning to the data. For generating synthetic data, we proposed TGAN-skip-WGAN-GP, an architectural combination of TGAN-skip and WGAN-GP. We added skip connections to the normal TGAN model and inherited the architecture of WGAN-GP into TGAN, which resulted in better performance, eliminated the mode collapse problem, and reduced the convergence time. After synthesizing, we performed the inverse transformation and finally obtained the tabular synthetic data for Shindari fermentation. We evaluated the quality of the synthetic data based on the correlation coefficient, MAE, RMSE, PRD, FD, and mirror column association. Means and standard deviations were shown for different GAN models on our dataset. The cumulative sum visually represented the distribution of data per column, showing that TGAN-skip-WGAN-GP performs significantly better at generating synthetic tabular data than existing GAN models.

Funding: This research was financially supported by the Ministry of Small and Medium-sized Enterprises (SMEs) and Startups (MSS), Korea, under the "Regional Specialized Industry Development Program (R&D, S2855401)" supervised by the Korea Institute for Advancement of Technology (KIAT).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.