Searching for Optimal Oversampling to Process Imbalanced Data: Generative Adversarial Networks and Synthetic Minority Over-Sampling Technique

Classification problems due to data imbalance occur in many fields and have long been studied in the machine learning field. Many real-world datasets suffer from the issue of class imbalance, which occurs when the sizes of classes are not uniform; thus, data belonging to the minority class are likely to be misclassified. It is particularly important to overcome this issue when dealing with medical data because class imbalance inevitably arises due to incidence rates within medical datasets. This study adjusted the imbalance ratio (IR) within the National Biobank of Korea dataset “Epidemiologic data of Parkinson’s disease dementia patients” to values of 6.8 (raw data), 9, and 19 and compared four traditional oversampling methods with techniques using the conditional generative adversarial network (CGAN) and conditional tabular generative adversarial network (CTGAN). The results showed that when the classes were balanced with CGAN and CTGAN, they showed a better classification performance than the more traditional oversampling techniques based on the AUC and F1-score. We were able to expand the application scope of GAN, widely used in unstructured data, to structured data. We also offer a better solution for the imbalanced data problem and suggest future research directions.


Introduction
Classification issues associated with data imbalance occur in many fields and have long been studied in the machine learning field [1,2]. Many real-world datasets suffer from the class imbalance issue, which occurs because the quantities of data between classes are uneven. This issue occurs frequently in many fields, including fraud detection for credit card users [3], customer churn prediction [4], finding bad data in quality control [5], and diagnosis prediction for rare diseases [6]. In general, when machine learning techniques are used, researchers use training datasets that have similarly distributed categories with a similar sample size. When learning is conducted with imbalanced datasets, data belonging to the minority class are more likely to be misclassified than data belonging to the majority class [7]. Furthermore, even when the accuracy is high, recall or sensitivity may be low [8].
Class imbalance inevitably occurs in medical data as a function of prevalence because the amount of target data tends to be extremely small in medical contexts. For example, in cancer diagnosis, the number of patients with negative symptoms is always far greater than the number with positive symptoms. If machine learning techniques are applied without considering this bias, positive-diagnosis patients cannot be classified with high accuracy, which is problematic because the techniques will learn mainly using the negative-symptom patients who are the majority class. Furthermore, diagnosing a cancer patient [...] evolved from image classification, there are limits to its effectiveness in generating structured data such as tables [21]. As a result, only a few studies have used GAN for the oversampling of structured data [22].
Yang et al. [23] applied CGAN to predicting drug-target interactions (DTI) and were able to balance the ratio between positive and negative samples. Oversampling methods using CGAN produced reliable samples, and these improved performance more than previous sampling methods. Quintana et al. [24] oversampled an imbalanced thermal comfort dataset using Tabular GAN (TGAN), as proposed by Xu et al. [25]. This GAN was designed to generate synthetic samples from a structured dataset, and both continuous and categorical classes were considered. Moreover, CTGAN, proposed by Xu et al. [26], uses the probability density for each condition, which made it possible to generate both continuous and categorical data and to overcome problems associated with the characteristics of tabular data. Wang et al. [27] applied CTGAN to traffic data to synthesize categorical samples and verify their similarity to real data, confirming that CTGAN has a higher performance and practical value than traditional oversampling and undersampling techniques. Recently, additional studies have begun to use GAN to overcome imbalance issues in structured data.

Imbalance Ratio (IR)
The IR can be calculated to expose the class imbalance issue using Equation (1). It indicates how large the sample size of the majority class is compared to that of the minority class:

$$\mathrm{IR} = \frac{n^{+}}{n^{-}} \quad (1)$$

where $n^{+}$ = number of instances in the majority class, and $n^{-}$ = number of instances in the minority class. When the IR is 1 or higher, a higher value indicates a greater degree of class imbalance. In particular, an IR of at least 9 indicates severely imbalanced data, with minority classes being 10% or less of the total [28].
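As a quick check of the definition above, the following minimal sketch computes the IR from class counts and converts an IR into the minority share of the whole dataset. The function names are illustrative assumptions; the counts are the ones this paper reports for its raw data (125 + 223 majority-class patients and 51 minority-class patients).

```python
def imbalance_ratio(n_majority: int, n_minority: int) -> float:
    """Return IR = n_majority / n_minority, as in Equation (1)."""
    if n_minority <= 0:
        raise ValueError("the minority class must contain at least one sample")
    return n_majority / n_minority


def minority_share(ir: float) -> float:
    """Fraction of the whole dataset that belongs to the minority class."""
    return 1.0 / (1.0 + ir)


# Counts reported for the raw data in this study.
raw_ir = imbalance_ratio(125 + 223, 51)
print(round(raw_ir, 1))              # 6.8
print(round(minority_share(9), 2))   # 0.1
print(round(minority_share(19), 2))  # 0.05
```

The second helper makes the 10% threshold explicit: an IR of 9 corresponds to one minority sample for every nine majority samples, i.e., one tenth of the data.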

Random Oversampling (ROS)
Random oversampling (ROS) repeatedly draws samples of the minority class at random, with replacement, until the sample sizes of the minority and majority classes become equal. Although the dataset grows as the minority class grows, the added samples are simply copies of existing ones, so the amount of information does not increase. As a result, ROS is prone to overfitting.
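The behavior described above can be illustrated with a few lines of standard-library Python. This is a minimal sketch, not the imblearn implementation used in the experiments; the function name and the toy data are assumptions made for illustration.

```python
import random


def random_oversample(minority, n_majority, seed=0):
    """Duplicate randomly chosen minority samples until the classes are equal in size."""
    rng = random.Random(seed)
    n_extra = max(0, n_majority - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]  # sampling with replacement
    return list(minority) + extra


minority = [(0.1, 1.0), (0.2, 0.9), (0.3, 1.1)]  # toy minority class
balanced = random_oversample(minority, n_majority=10)
print(len(balanced))       # 10
print(len(set(balanced)))  # 3
```

The minority class now has ten rows but still only three distinct points, which is why duplication adds no new information.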

Synthetic Minority Oversampling Technique (SMOTE)
SMOTE [13] generates synthetic samples on the line segments that connect a randomly chosen minority-class sample to its k nearest neighbors, which are found with the k-NN algorithm. Since, unlike ROS, SMOTE creates new samples, it has the advantage of mitigating the overfitting problem caused by duplicating the same samples. The procedure for generating synthetic samples by the SMOTE method is as follows. First, SMOTE randomly selects minority-class samples for oversampling. If the number of synthetic samples to be generated is greater than the number of samples in the minority class, then all the samples in the minority class are selected. If it is less, then a subset of the samples in the minority class is randomly selected. Second, for each selected sample, its k nearest minority-class neighbors are identified, one of them is chosen at random, the difference between the two is multiplied by a random weight, and a synthetic sample is created at the resulting location. A mathematical representation of this is given in Equation (2):

$$x_{smote} = x_i + (\hat{x}_i - x_i) \times \delta, \quad \delta \in [0, 1] \quad (2)$$

where $x_i$ is a sample belonging to a minority class and $\hat{x}_i$ is a random neighbor among the k-NNs for $x_i$. The process works by identifying the k nearest neighbors near $x_i$, calculating the difference between $x_i$ and one of these neighbors, and multiplying it by a value $\delta$ between 0 and 1 to create a synthetic sample $x_{smote}$ to supplement the original samples. This is repeated until the size of the minority class becomes equal to that of the majority class.
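The interpolation step described above can be sketched in plain Python as follows. This is an illustrative simplification, not the imblearn implementation: seed samples are drawn at random with replacement, neighbors are found by brute force, and all names and data are assumptions.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point`, excluding the point itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating towards random neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x_i = rng.choice(minority)                        # minority seed sample
        x_hat = rng.choice(k_nearest(x_i, minority, k))   # random one of its k-NNs
        delta = rng.random()                              # weight in [0, 1)
        synthetic.append(tuple(a + (b - a) * delta for a, b in zip(x_i, x_hat)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy minority class
new_points = smote(minority, n_new=6, k=2)
print(len(new_points))  # 6
```

Because every synthetic point lies on a segment between two existing minority samples, the new points never leave the region spanned by the minority class.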

Borderline-SMOTE (B-SMOTE)
B-SMOTE [14] is an extension of SMOTE. Whereas SMOTE generates synthetic samples of the minority class without considering the location of neighboring samples, B-SMOTE defines the region where the two classes overlap as the boundary and applies the SMOTE technique only to the minority-class samples on the boundary.
The B-SMOTE procedure is as follows. First, for each individual sample belonging to a minority class, the k closest observations are found, regardless of the class. Second, with $S_{maj}$ denoting the number of majority-class samples among these k neighbors, the sample is assigned to the "Danger" group if $k/2 \le S_{maj} < k$, to the "Safe" group if $0 \le S_{maj} < k/2$, and to the "Noise" group if $S_{maj} = k$. Third, this method generates new samples only for minority-class samples belonging to the "Danger" group.
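The grouping rule in the second step can be sketched as below. This is a minimal illustration under assumed names and toy one-dimensional data, with a brute-force neighbor search; it only performs the Danger/Safe/Noise assignment, not the subsequent SMOTE step.

```python
import math


def borderline_group(x, minority, majority, k=5):
    """Assign a minority sample to "Danger", "Safe", or "Noise" from its k nearest neighbors."""
    labelled = [(math.dist(x, p), "min") for p in minority if p is not x]
    labelled += [(math.dist(x, p), "maj") for p in majority]
    nearest = sorted(labelled)[:k]
    s_maj = sum(1 for _, label in nearest if label == "maj")  # majority neighbors
    if s_maj == k:
        return "Noise"
    if s_maj >= k / 2:
        return "Danger"
    return "Safe"


minority = [(0.0,), (0.1,), (0.2,), (1.0,), (3.0,)]
majority = [(0.9,), (1.1,), (1.2,), (2.8,), (2.9,), (3.1,), (3.2,)]
groups = [borderline_group(x, minority, majority, k=4) for x in minority]
print(groups)  # ['Safe', 'Safe', 'Safe', 'Danger', 'Noise']
```

Only the sample at 1.0, which sits among majority points but still has a minority neighbor, would be used as a seed for new samples.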

Adaptive Synthetic Sampling (ADASYN)
ADASYN [15] is an advanced form of SMOTE that calculates the density distribution for each sample of a minority class and determines the number of samples to be generated accordingly. ADASYN creates synthetic samples as follows. First, it finds the K nearest neighbors for each sample $x_i$ belonging to the minority class $S_{min}$ and denotes the number of those neighbors that belong to the majority class as $\Delta_i$. Then, it calculates the density distribution $r_i$, which can be expressed as Equation (3), while in Equation (4), $\hat{r}_i$ refers to the normalized $r_i$:

$$r_i = \frac{\Delta_i}{K} \quad (3)$$

$$\hat{r}_i = \frac{r_i}{\sum_{j} r_j} \quad (4)$$

In Equation (5), G is the number of samples to be generated for $S_{min}$, and $\beta \in [0, 1]$ is used to balance the samples between the two classes:

$$G = (|S_{maj}| - |S_{min}|) \times \beta \quad (5)$$

where $|S_{maj}|$ and $|S_{min}|$ are the sample sizes of the majority and minority classes. Next, ADASYN determines the number of synthetic samples ($g_i$) that need to be generated for each sample $x_i$ belonging to $S_{min}$ and generates these samples by repeating the process $g_i$ times for each $x_i$. Equation (6) expresses this as the formula:

$$g_i = \hat{r}_i \times G \quad (6)$$

GAN-Based Oversampling Technique

Generative Adversarial Network (GAN)
GAN [17] is a deep learning-based unsupervised learning model that generates fake data resembling real data by pitting one neural network (generator, G) against another (discriminator, D). G is trained with the goal of producing fake data that resemble real data, while D is trained to detect that the data created by G are fake. In other words, G and D learn in an adversarial way. Figure 1 depicts the structure of GAN and describes how G and D learn. First, G receives a random noise vector as input and generates data. When the generated and actual data are provided to D, it determines whether they are real or fake. G and D compete over this result and learn from it. Put differently, the goal of G is to maximize the probability that D determines the generated data as real, while the goal of D is to maximize the probability of discriminating generated data as fake.
Equation (7) shows the GAN's objective function for this learning process:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (7)$$

In the equation, $p_{data}(x)$ and $p_z(z)$ refer to the distribution of the real data and the distribution of the random noise from which fake data are generated, respectively. D receives data x as a real-data input value and outputs the probability of being real data ($D(x)$). G takes a random noise vector z as an input value and generates fake data ($G(z)$). Since the goal of D is to distinguish effectively between generated fake data and real data, D must learn so that $D(x)$ is 1 and $D(G(z))$ is 0. At the same time, since the goal of G is to deceive D, it should learn to make $D(G(z))$ equal to 1. In other words, the objective function of the equation aims at maximization from the perspective of D and minimization from the viewpoint of G.

Conditional GAN (CGAN)
CGAN [18] is designed to improve the unstable learning of GAN. Although the basic learning method is the same, CGAN can influence the data generation process directly and learn characteristics as well as the distribution by adding a feature y, which indicates a specific condition, to G and D. Figure 2 shows the process of generating data. The feature y desired by the user is entered into G along with a random noise vector z, and the information for the labeled class is input to D. The objective function for learning CGAN is the same as for GAN, but y is conditionally added, as expressed in Equation (8):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))] \quad (8)$$
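To make the max/min roles of D and G concrete, the sketch below evaluates a sample estimate of the adversarial value function for hand-picked discriminator outputs. It assumes the standard GAN form, the average of log D(x) over real samples plus the average of log(1 - D(G(z))) over generated samples; the function name and the probabilities are illustrative assumptions, and no network is trained. For CGAN the computation is the same, with D's outputs additionally conditioned on y.

```python
import math


def gan_value(d_real, d_fake):
    """Sample estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples
    d_fake: discriminator outputs D(G(z)) on generated samples
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well: V is close to 0 (its maximum).
strong_d = gan_value(d_real=[0.9, 0.95], d_fake=[0.05, 0.1])
# A discriminator that is fully fooled outputs 0.5 everywhere: V = -log 4.
fooled_d = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(strong_d > fooled_d)  # True
```

D is rewarded as the value rises towards 0, while G is rewarded as it pushes D(G(z)) upwards and thereby lowers the value.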

Conditional Tabular GAN (CTGAN)
Conventional GAN algorithms have shown strong performance in the process of learning original images and generating and predicting synthetic images for each condition [18,21]. However, they are difficult to apply to structured data [18,21] because of problems such as the variety of tabular data types, data distributions that do not follow the Gaussian distribution, multi-modal data, sparse matrices generated by one-hot encoding, and categorical variables with a high degree of imbalance [29].
As a result, CTGAN [26], a generative model designed to use GAN functions for structured data, was proposed. CTGAN is a model that combines the conditional GAN [18] and the tabular GAN [25] algorithms. A common problem with GANs is that they do not learn sparse categories well if certain categories are imbalanced. Therefore, CGANs allow conditions to be added to the generator to ensure that sparse categories are included in the learning process.
CTGAN proposes mode-specific normalization and training-by-sampling to solve these problems. Mode-specific normalization handles multimodal and non-Gaussian distributions by normalizing numerical data using a variational Gaussian mixture model. In the learning process, the normalized values of each numerical variable are used as the input, rather than the values of the original data. After learning is completed, the data created through G are converted back to the scale of the original data.
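The normalize-then-restore idea can be sketched as follows. This is an illustration under stated assumptions rather than the SDV implementation: the mixture parameters are hypothetical and given by hand instead of being fitted, the most likely mode is picked deterministically (the CTGAN paper samples the mode), and the scaling by four standard deviations follows the description in the CTGAN paper, not this text.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def normalize(value, modes):
    """Return (mode index, value scaled within that mode).

    modes: list of (weight, mean, std) tuples describing a Gaussian mixture.
    """
    scores = [w * gaussian_pdf(value, m, s) for w, m, s in modes]
    k = scores.index(max(scores))  # simplification: most likely mode, no sampling
    _, mean, std = modes[k]
    return k, (value - mean) / (4 * std)


def denormalize(k, alpha, modes):
    """Convert a normalized value back to the scale of the original data."""
    _, mean, std = modes[k]
    return alpha * 4 * std + mean


# Hypothetical two-mode mixture for a numerical column.
modes = [(0.6, 60.0, 5.0), (0.4, 80.0, 3.0)]
k, alpha = normalize(78.5, modes)
print(k, alpha)                      # 1 -0.125
print(denormalize(k, alpha, modes))  # 78.5
```

The generator works with the pair (mode, scaled value), and the last call shows the conversion back to the original scale after generation.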
Training-by-sampling is a method for sampling the conditional vector and the training data so that all categories are covered evenly. To make the conditional distribution produced by the generator equal to that of the actual data, the difference between the two distributions must be accurately estimated by the discriminator (critic). The specific procedure is shown in Figure 3. The procedure for training-by-sampling is as follows. First, select one of the categorical columns with equal probability and take the logarithm of the frequency of each category in that column to create a probability distribution over its values. Second, generate a conditional vector according to the selected column and category and randomly sample the training data accordingly, so that the generator generates data from the conditional vector and the latent variable. Third, by putting the actual data and the generated data into the discriminator and calculating the distance (score) between the two conditional distributions, the generator learns to generate data that satisfy the conditions.
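The first step of the procedure above, choosing a column uniformly and then a category with log-frequency probabilities, can be sketched as below. The names and the toy table are assumptions; adding 1 inside the logarithm is also an assumption made here so that a category seen only once keeps a nonzero probability.

```python
import math
import random
from collections import Counter


def log_frequency_probs(values):
    """Probability of each category, proportional to log(frequency + 1)."""
    counts = Counter(values)
    logs = {cat: math.log(n + 1) for cat, n in counts.items()}
    total = sum(logs.values())
    return {cat: v / total for cat, v in logs.items()}


def sample_condition(table, rng):
    """Pick a categorical column uniformly, then a category by log-frequency."""
    column = rng.choice(sorted(table))
    probs = log_frequency_probs(table[column])
    cats = sorted(probs)
    category = rng.choices(cats, weights=[probs[c] for c in cats], k=1)[0]
    return column, category


# Toy table with one heavily imbalanced categorical column.
table = {"sex": ["F"] * 90 + ["M"] * 10}
probs = log_frequency_probs(table["sex"])
print(round(probs["M"], 2))  # 0.35
print(sample_condition(table, random.Random(0))[0])  # sex
```

The rare category makes up 10% of the rows but is chosen as the condition about 35% of the time, which is how sparse categories get a fair share of the training signal.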

Data
The data source for this study is the "Epidemiologic data of Parkinson's disease dementia patients" from the National Biobank of Korea under the Korea Disease Control and Prevention Agency. Data were collected from 14 tertiary medical institutions (university hospitals) nationwide from January 2015 to December 2015 under the supervision of the Korea Centers for Disease Control and Prevention. A health survey was conducted using computer-assisted personal interviews (CAPI). We obtained the approval of the Korea Disease Control and Prevention Agency's Research Ethics Review Committee (No.

The data contain information on Alzheimer's disease and Parkinson's disease patients. The data classify Parkinson's disease patients into dementia, mild cognitive impairment, and normal cognitive function. The explanatory variables consist of 54 variables covering basic information, environmental factors, disease history, Alzheimer's/Parkinson's disease basic information, and clinical scales. Fourteen continuous variables and seven categorical variables were selected as the final explanatory variables based on feature importance. Missing values for each item were replaced by mean imputation. The dependent variable, "patient classification", was recoded into two classes after excluding Alzheimer's patients: 0 means Parkinson's disease patients with normal cognitive function (51 patients) and 1 means Parkinson's disease patients with dementia or mild cognitive impairment (125 and 223 patients, respectively). Out of the 399 Parkinson's disease patients, 51 had normal cognitive function, accounting for 12.78% of the total data. This yielded an IR value of 6.8, indicating the presence of an imbalance. Table 1 shows the numbers and ratios by category.

Experimental Design
First, IR values were adjusted to 6.8 (raw data), 9, and 19 to compare oversampling techniques according to the imbalance ratio. Where the majority class had insufficient data, data were created with CTGAN, which is specialized for structured data, and added to the original data to prevent data loss. Minority-class samples were randomly extracted from the original data as needed. The numbers of samples according to the IR value are shown in Table 2. This study used the ROS, SMOTE, B-SMOTE, and ADASYN techniques, comparing them with oversampling techniques using CGAN and CTGAN. The imblearn package was used for this purpose. Moreover, k = 5 was used for the k-NN-based SMOTE, B-SMOTE, and ADASYN. Sampling was adjusted to make the ratio of normal cognitive function (0) to dementia and mild cognitive impairment (1) equal to 1:1. Numbers could vary slightly because, unlike the other oversampling techniques, ADASYN in the package automatically adjusts the number of samples it generates.
When training CGAN, both G and D consisted of three hidden layers, with the number of epochs set to 1000. In addition, Leaky ReLU and Adam were used as the activation function and optimizer, respectively; Adam's learning rate, β1, and β2 were set to 0.0002, 0.5, and 0.9, respectively. CTGAN is available in the Synthetic Data Vault (SDV) [30], and the experiment was conducted with the number of epochs set to 100. The amount of data after applying each oversampling technique to the dataset is shown in Table 3. The support vector machine (SVM) [31], logistic regression (LR) [32], random forest (RF) [33], and multi-layer perceptron (MLP) [34] were used as classification models.

Performance Evaluation Methods and Indicators
The entire dataset was divided, with 80% used as training data and the remaining 20% as validation data. This study conducted a 10-fold cross-validation to circumvent the problem of greatly varying model performance by chance and determined the final performance based on the mean performance of the ten models.
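The fold construction and averaging described above can be sketched with the standard library. This is a minimal illustration, not the authors' pipeline: the function names are assumptions, and how the folds were combined with the 80/20 split is not specified here, so only the generic 10-fold mechanics are shown.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def mean_score(scores):
    """Final performance: the mean over the per-fold scores."""
    return sum(scores) / len(scores)


splits = list(k_fold_indices(n_samples=399, k=10))
print(len(splits))                           # 10
print(len(splits[0][0]), len(splits[0][1]))  # 359 40
```

Each sample appears in exactly one validation fold, so averaging the ten fold scores uses every observation once for evaluation.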
The F1-score and area under the curve (AUC), widely used in class imbalance studies, were used as indicators for performance evaluation [35,36]. In the confusion matrix, TP and TN indicate true positive (predict positive for positive) and true negative (predict negative for negative), respectively. FP and FN stand for false positive and false negative, respectively. The F1-score is the harmonic mean of precision and recall; the closer it is to 1, the better the classification performance of the minority class.
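As a small worked example of these definitions, the sketch below derives precision, recall, and the F1-score from confusion-matrix counts. The function name and the counts are illustrative assumptions, not values from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```

True negatives do not enter the computation, which is why the F1-score is not inflated by a large, easily classified majority class.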
The "receiver operating characteristic (ROC) curve" is a curve presenting the classification prediction result of the model with the TPR (true positive rate, recall) on the vertical axis and the FPR (false positive rate, 1-specificity) on the horizontal axis. AUC is the area under the ROC curve; the closer it is to 1, the better the performance. For a classifier evaluated at a single decision threshold, it is calculated as the mean of TPR and TNR.
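The sketch below computes the AUC in its rank form, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. It also checks that, when the scores are replaced by hard 0/1 predictions, this area equals the mean of TPR and TNR. The function names and the toy labels and scores are assumptions for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_tpr_tnr(labels, predictions):
    """Mean of the true positive rate and the true negative rate for hard predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (tp / n_pos + tn / n_neg) / 2


labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
hard = [1 if s >= 0.45 else 0 for s in scores]  # predictions at one threshold

print(round(roc_auc(labels, scores), 3))     # 0.889
print(round(roc_auc(labels, hard), 3))       # 0.667
print(round(mean_tpr_tnr(labels, hard), 3))  # 0.667
```

The score-based AUC summarizes performance over all thresholds, whereas the value obtained from hard predictions reflects only the single threshold that produced them.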

Results
This study presents the result of comparing classification performances when applying each oversampling technique after adjusting the IR of the experimental dataset (IR = 6.8) to 9 and 19. Values showing the best and lowest performance are bold and underlined, respectively. Table 4 shows the AUC scores for the classification results. The GAN-based oversampling techniques showed a higher performance than the traditional oversampling techniques in all areas. CTGAN showed a strong performance, especially in the SVM and LR classification models. CGAN produced high AUC scores in the MLP and RF classification models. Although the ROS technique exhibited the poorest performance among the traditional oversampling techniques, it did not do so in the LR because CGAN, which generally performed well, rapidly fell away in performance with the LR classification model. Moreover, although CTGAN showed higher AUC scores than conventional oversampling techniques in SVM, LR, and RF, it was confirmed that its performance decreased under classification by MLP. In other words, the combinations CTGAN + SVM, CTGAN + LR, CGAN + RF, and CGAN + MLP showed the highest performance, while the ROS method showed the lowest performance in most classification models. Moreover, despite the increase in the degree of imbalance, no technique showed greatly decreased performance. Rather, the overall classification performance increased slightly. Table 5 shows the F1-score values for the classification results. As with the AUC score, the GAN-based oversampling technique showed a better performance than the traditional oversampling technique. Effective performance can be seen in the combinations CTGAN + SVM, CTGAN + LR, CGAN + RF, and CGAN + MLP, while the ROS method showed the lowest overall performance. All methods showed stable performance, even for higher IR values. Figures 4 and 5 display the AUC and F1-score values for six oversampling techniques by classification model. 
Ranking the techniques revealed differences, even though the classification performances for the techniques appeared similar when only the best and lowest AUC and F1-score performances were examined. This is because the two measures indicate different things. Even considering these elements, the results of this study confirmed that CGAN and CTGAN showed better AUC and F1-score results than the existing oversampling techniques.

Discussion
Most medical datasets have class imbalance issues due to low incidence rates. It is very important to overcome this issue because the misclassification of a minority class can decrease sensitivity among classification performance components. Therefore, this study adjusted the imbalance ratio (IR) to 6.8 (raw data), 9, and 19 using actual epidemiological data on Parkinson's disease dementia patients. The study applied oversampling techniques using CGAN and CTGAN as well as more traditional oversampling techniques (ROS, SMOTE, ADASYN, and B-SMOTE); it aimed to solve the imbalance problem by comparing the performance of each technique through classification models (SVM, LR, RF, and MLP).
This study classified the levels of cognitive impairment associated with Parkinson's disease by applying oversampling techniques to three datasets with three different IR values and found that GAN-based oversampling techniques showed better AUC and F1-score values than traditional techniques. Nugraha et al. [37] used insurance fraud imbalance data and proposed CTGAN as an oversampling method, showing that over the application of 17 classification models, CTGAN presented a better performance (AUC, F1-score, precision, etc.) than ROS, SMOTE, and ADASYN. A study using imbalanced CVD clinical data by García-Vicente et al. [38] also found that the combination of CTGAN and the classification model LASSO showed strong potential for generating categorical data. Many previous studies [39,40,41] also showed that the CGAN-based oversampling technique achieved a higher performance than more traditional techniques over various classification models for datasets with complex structures because it was effective at generating data for the minority class, whereas the traditional minority oversampling techniques added data randomly rather than based on the actual data distribution, a significant limitation of these techniques [42]. Moreover, SMOTE-based oversampling techniques are ineffective at reproducing high-dimensional data and are more useful for low-dimensional data, which is another shortcoming. In contrast, previous studies [43] reported that GAN could overcome the disadvantages of existing oversampling techniques because it generated data according to the distribution of actual data and was effective even for high-dimensional data. The results of this study also demonstrated that GAN treated imbalanced data better than more traditional minority oversampling techniques such as SMOTE in high-dimensional data. More recently, Sharma et al. [44] developed a SMOTified-GAN algorithm, a data augmentation technique based on variations of GAN designed to overcome the class imbalance classification problem. However, future studies would be useful to evaluate the effectiveness of GAN on various imbalanced datasets.
The significance of this study lies in its comparison of the classification performance of various oversampling techniques, which confirmed that the combinations CTGAN + SVM, CTGAN + LR, CGAN + RF, and CGAN + MLP showed better performance and indicated that GAN-based oversampling contributed to improving classification accuracy in clinical data. The study also demonstrated that CTGAN oversampling could generate high-quality synthetic data without adjusting any hyperparameter. As a result, it will be possible to expand the application scope of GAN, which has been widely used for unstructured data such as images and videos.
This study had several limitations. First, since only one dataset was used, the study could not compare the performance of oversampling techniques according to dataset size or the ratio of categorical to continuous variables. Second, the optimal number of epochs for training CTGAN could not be determined. It was therefore necessary to train the model many times, and because the optimal number of epochs is unknown, the best performance may not have been identified. Third, there were many missing values (e.g., answered as "don't know") due to the nature of medical data. Moreover, there was little change in performance over variations in the IR because the sample size was small; since actual Parkinson's disease patient data were used, the sample size of the minority class differed little according to the degree of imbalance. Future studies should aim to identify oversampling techniques more accurately by applying oversampling to multiple datasets and checking the difference in classification performance while taking this into account.

Conclusions
This study confirmed the effectiveness of CTGAN and CGAN oversampling techniques by applying six oversampling techniques to imbalanced data and comparing their performance. Data imbalance is a critical problem because it occurs in many fields, including the medical field featured in this study. It should be possible to apply the superior performance of GAN-based oversampling to imbalance issues based on the study's results. Future studies need to identify the optimal oversampling technique by comparing its performance to the performance of other types of techniques in addition to more traditional oversampling techniques.

Data Availability Statement:
The data source for this study was the 'Epidemiologic data of Parkinson's disease dementia patients' of the National Biobank of Korea (https://nih.go.kr/biobank/cmm/main/mainPage.do; accessed on 15 June 2023) under the Korea Disease Control and Prevention Agency. In order to use this data, it is necessary to obtain approval from the Korea Centers for Disease Control and Prevention Research Ethics Review Committee and the Lotting-out Committee of the National Biobank of Korea.

Conflicts of Interest:
The authors declare that there is no conflict of interest regarding the publication of this article.