Rough-Fuzzy Based Synthetic Data Generation Exploring Boundary Region of Rough Sets to Handle Class Imbalance Problem

Abstract: Class imbalance is a prevalent problem that not only reduces the performance of machine learning techniques but also causes the loss of the inherent complex characteristics of the data. Although researchers have proposed various ways to deal with the problem, they have yet to consider how to select a proper treatment, especially when uncertainty levels are high. Applying rough-fuzzy theory to the imbalanced data learning problem is a promising research direction that both generates synthetic data and removes outliers. The proposed work identifies the positive, boundary, and negative regions of the target set using rough set theory and removes the objects in the negative region as outliers. It also explores the positive and boundary regions of the rough set by applying fuzzy theory to generate samples of the minority class and remove samples of the majority class. Thus, the proposed rough-fuzzy approach performs both oversampling and undersampling to handle the class imbalance problem. The experimental results demonstrate that the novel technique handles both qualitative and quantitative data.


Introduction
With the growing interest of researchers and the increasing number of proposed solutions, the imbalanced data problem has become one of the most significant and challenging issues. Imbalanced data refers to datasets in which the target class has an uneven distribution of observations. Predictive modeling is challenging for class-imbalanced datasets because most classification models are built on the premise that each class has an equal number of samples. As a result, models perform poorly in prediction, particularly for the minority class. This is a problem because, in general, the minority class is the more significant one, and classification errors are more likely for the minority class than for the majority class. For example, if five out of every 100 records in a dataset are diagnosed with a disease and the model predicts that none of the 100 patients has the disease, then the model's accuracy is 95%, although in reality it is not working correctly. The cost of not recognizing a patient who suffers from a rare disease may be severe and irreversible. Apart from this, imbalanced class distributions occur in numerous domains, including fraudulent credit card transactions [1], detection of oil spills from satellite images [2], network intrusion detection [3], financial risk analysis [4], text categorization [5], and information filtering [6]. Indeed, the wide range of occurrences of the problem increases its significance and explains the effort put into finding practical solutions.
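The 95% figure in the example above can be checked with a few lines of arithmetic. The following is a minimal sketch using only the numbers from the example; the variable names are illustrative and not taken from the paper.

```python
# Toy data from the example: 100 patients, 5 of whom have the disease (label 1).
y_true = [1] * 5 + [0] * 95
# A degenerate model that predicts "no disease" (label 0) for every patient.
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority (diseased) class: TP / (TP + FN).
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = true_pos / (true_pos + false_neg)

print(f"accuracy = {accuracy:.2f}, minority recall = {recall:.2f}")
# prints: accuracy = 0.95, minority recall = 0.00
```

The accuracy looks high only because the majority class dominates the count; not a single diseased patient is detected.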
By giving minority categories more weight, class-imbalanced learning approaches [7] aim to lessen the bias in model learning that favours majority categories. The strategies for handling class imbalance in object classification can be grouped into data-level [8], algorithm-level [9], and hybrid [10] approaches. Nevertheless, the majority of them use traditional imbalanced-learning algorithms, which cannot handle severely imbalanced datasets. SMOTE (Synthetic Minority Over-Sampling Technique) [11], a sampling-based technique introduced in 2002, aims to address the class imbalance issue. It is widely used because of its simplicity and efficiency. It combines oversampling and undersampling, and its oversampling step creates new minority class instances algorithmically rather than replicating existing ones. Subsequently, researchers have proposed many hybrid approaches [12][13][14] that combine SMOTE with different machine learning approaches. However, there has been little research on the class imbalance problem in settings with uncertainty. In the proposed work, we explore these uncertainties by integrating the concepts of rough set and fuzzy theory. We collected a class-imbalanced dataset from Kaggle and preprocessed [15] it by removing data irregularities and discretizing the continuous-valued attributes before applying Rough Set Theory (RST) to the dataset. Next, we use the RST-based feature selection algorithm to select the most relevant features and generate the dataset with the modified feature subset. The modified dataset is still imbalanced, and its values are discrete. RST is then applied to find the negative, positive, and boundary regions of the target sets, which are the collections of objects of the majority and minority classes. The negative region contains noise or outliers, which are discarded directly. The positive region objects are partitioned into groups based on their class labels, and the boundary region objects are uncertain or inconsistent. The uncertainty is measured using fuzzy theory, which gives, for each boundary object, the fuzzy membership value with which it belongs to each group in the positive region. Finally, based on the positive region and the fuzzy membership values of the objects in the boundary region, oversampling and undersampling are performed to balance the classes of the dataset. The workflow diagram of the proposed methodology is depicted in Figure 1.

Literature Survey
Research benefits society, and handling the data imbalance problem [16] can directly or indirectly improve data exploration and thus benefit the community. In the real world, datasets are often not balanced [17]. Such imbalanced data can substantially influence the behavior of a model or its outcome. Researchers have considered various techniques for class imbalance issues. Sharma et al. [18] created a bilingual dataset and proposed a solution for sentiment analysis of class-imbalanced code-mixed data. They divided the corpus into training and testing data, preprocessed it using Levenshtein distance, and applied the Synthetic Minority Oversampling Technique (SMOTE) with different machine learning classifiers to compare their performance. Srinivasan [19] combined SMOTE and a Generative Adversarial Network (GAN) as SMOTified GAN. With SMOTified GAN, they tried to overcome the deficiencies of SMOTE and GAN by using transfer learning concepts: they collected knowledge about the minority classes using SMOTE and then applied it to the random generator function of the GAN. Zhenchuan Li [20] used the credit card fraud dataset from Kaggle and private data from a financial company in China. They proposed a hybrid framework to handle class imbalance with overlap. A divide step was used to obtain overlapping and non-overlapping subsets, and in the conquer step, Artificial Neural Network (ANN) classifiers were used for the overlapping subset. M. Thanh Vo and Le [8] proposed a fake job description detection framework using an oversampling technique and applied it to the publicly available Employment Scam Aegean Dataset (EMSCAD). This framework improved the prediction results of traditional classifiers. In the preprocessing step, Support Vector Machine (SVM) and SMOTE were used to balance the training set, and classification was done with a logistic regression model. Joel Jang [9] used the Internet Movie Database (IMDB) to propose a new training architecture. They partitioned the training data into mutually exclusive subsets and then performed continual learning on a deep learning-based classifier to handle the class imbalance problem. They used Elastic Weight Consolidation (EWC) to stabilize the knowledge obtained from the previous partition and tested the proposed method using a CNN+BiLSTM model. Lee [21] used the Intrusion Detection Evaluation Dataset, CICIDS 2017, which contained regular traffic and 12 attacks. They compared the performance of a Random Forest model in classifying rare class data without resampling with its performance after GAN-based resampling. Banerjee [22] used a dataset of sarcastic and non-sarcastic tweets. They used SMOTE to preprocess the training dataset to handle imbalance during sarcasm detection and used six different classification algorithms to analyze the effect of oversampling experimentally. Shafqat and Byun [23] used data from an online shopping mall in Jeju. They proposed a hybrid GAN to solve the data imbalance problem and improve the performance of the recommendation system. Yafooz and Alsaeedi [24] used the Arabic dataset on Herbal Treatments for Diabetes to find out the opinions of YouTube users on videos related to diabetes. The user comments were analyzed using a Multilabel Classification (MLC) model. They noted the impact of normalization, stop word removal, tokenization, and Arabic stemming on the performance of MLCs. Among the oversampling, undersampling, and SMOTE techniques they applied, SMOTE performed best.
Sungho Suh [25] used the MNIST (Modified National Institute of Standards and Technology) database to present a classification enhancement GAN that improves the generated synthetic minority data, thereby improving prediction accuracy. They worked to reduce the ambiguity in cases where multiple similar classes hamper classification accuracy. Feature extraction and clustering consider the relationship between ambiguous classes to obtain subsets of ambiguous classes, and a new loss function was formulated with multiple subsets of ambiguous class labels. Ali Shariq Imran [26] took the CR23 and CR100K datasets from Coursera and Kaggle to address the data imbalance issue. To investigate the impact of text generation on sentiment analysis, they employed two text generation models, SentiGAN and CatGAN. The BLEU, NLLgen, and NLLdiv metrics were used to evaluate the quality of the generated text, and precision, recall, and the F-score were used to analyze the model's performance in sentiment analysis. Mollas [27] presented ETHOS, a textual dataset with two variants, binary and multilabel, and proposed the steps required to create a balanced dataset. They used a three-stage process to prepare the dataset, which included platform selection, data collection, data validation, and then data configuration or preparation. Experiments were then conducted using state-of-the-art (SOTA) techniques to assess performance on this dataset. Chen et al. [28] applied Rough Set Theory (RST) for feature selection in an imbalanced dataset. They measured feature significance by computing the neighborhood set and the general decision of each object. The relevance of each feature was calculated by considering the uneven distribution of the classes. After a discernibility matrix was constructed and the reducts were generated, the significance of the features was used to determine a subset of conditional features. The particle swarm optimization (PSO) algorithm was used to determine the optimized parameters of the algorithm. Zhang et al. [29] presented Multi-Imbalance, an open-source software package for multiclass imbalanced data classification, which covers seven different categories of multiclass imbalance learning algorithms. Behmanesh et al. [30] proposed an approach that uses fuzzy rough set theory in a weighted least squares twin support vector machine (FRLSTSVM) to classify imbalanced data. To create a hyperplane, the data points from the minority class remain unchanged, while subsets of data points from the majority class are selected using the new fuzzy rough set based method to reduce the number of majority points. The bias towards the majority class was overcome by embedding weight biases in the least squares TSVM using RST.
In these works, the researchers have attempted to generate new data to balance the training data. However, they have yet to consider the irregularities and overlapping of the data and their features. Real-world problems usually contain ambiguous and uncertain data; thus, managing these has long been a main focus of research. Although there have been many contributions in that area, the concept of fuzzy sets marked the beginning of a focused endeavor. Since then, numerous fields have used fuzzy theory to deal with outliers. Fuzzy classifiers are well known for their ability to address the issue of outliers while also delivering much-needed robustness in performance. For accurate prediction, the researchers in [31] developed a similarity classifier with the Archimedean-Dombi aggregation operators, which are well known for offering sufficient generalization in aggregating data.
There are many crisp oversampling techniques in the literature [32][33][34][35][36][37], which may not handle the class imbalance problem properly because of uncertainties in the data. Fuzzy theory is widely used for solving such problems. Liu et al. [38] suggested an oversampling strategy based on fuzzy theory that generates fuzzy rules and assigns a weight to each rule to assess how much a sample belongs to a fuzzy space. The synthetic data are then generated using the weighted fuzzy rules. Ren et al. [39] suggested a fuzzy oversampling approach based on the chromosome theory of inheritance and affinity propagation. They selected a representative sample of each class based on the significance of the samples, determined the representativeness of each sample using the Mahalanobis distance, and finally created the synthetic data using the chromosome theory of inheritance. Keeping all these issues and shortcomings in mind, we propose our rough-fuzzy approach to generate minority class samples and remove majority class samples by exploring the boundary and positive regions of the rough sets.

Objective
As the performance of most classification models is greatly affected by class-imbalanced data, learning a predictive model from imbalanced class data has been a challenging task [17,40,41]. Research and analysis [20] have also revealed that underrepresented classes are not the only cause of performance loss; the complex characteristics of the data, including class overlap and the presence of outliers, are equally responsible for reducing classification performance. Despite recent attempts by researchers to solve the challenge of learning from severely skewed datasets [42,43], the problem still affects models trained on datasets with outliers and uncertainties. To properly learn the predictive model, our major goal in this research is to examine the uncertainty region of class-imbalanced data using rough set and fuzzy set theory. Thus, the paper's main objective is to remove the negative region of the rough set as outliers and to explore the positive and boundary regions of the rough set with the help of fuzzy theory in order to generate synthetic data of the minority class and remove the uncertain or inconsistent data of the majority class.

Contribution
We have used the RST-based feature selection algorithm to select the most relevant features and generate the dataset with the modified feature subset. RST is then applied to find the negative, positive, and boundary regions of the target sets, which are the collections of objects of the majority and minority classes. The negative region contains noise or outliers, which are discarded directly. The positive region objects are partitioned into groups based on their class labels, and the boundary region objects are uncertain or inconsistent. The uncertainty is measured using fuzzy theory, which gives, for each boundary object, the fuzzy membership value with which it belongs to each group in the positive region. Finally, based on the positive region and the fuzzy membership values of the objects in the boundary region, oversampling and undersampling are performed to balance the classes of the dataset. The main contributions of the paper are as follows:

• The dataset is collected and preprocessed to remove irregularities from the data. The processed dataset is discretized by an efficient discretization algorithm [44]. The discretized dataset is fed to the RST-based feature selection algorithm to retain only the relevant features of the dataset.
• The negative, positive, and boundary regions of the target sets are identified, and the negative region is discarded as outliers. The positive region is categorized into different groups based on the class labels of the objects. The fuzzy membership values of each object in the boundary region are computed.
• A rough-fuzzy based oversampling and undersampling method is proposed to generate minority class objects and remove majority class objects with the help of the positive and boundary regions. In this method, the membership values computed using fuzzy theory play an important role in removing objects and generating new ones.
• Finally, the method is validated using different performance metrics. The method is also compared with some related state-of-the-art methods using the same metrics.

Summary
The rest of the paper is organized as follows: The preprocessing and RST-based feature selection technique is described in Section 2. Section 3 describes the proposed rough-fuzzy based sampling technique. The experimental results that demonstrate the method's effectiveness are presented in Section 4, and finally, the conclusions and future scope of the paper are given in Section 5.

Preprocessing and Feature Selection
The proposed work collects the imbalanced dataset from Kaggle and removes the irregularities and null values from it. Different discretization algorithms [45] were explored on the dataset, and it was observed that the rectified Chi2 algorithm [44] works better for the collected imbalanced dataset. Thus, in the proposed work, we apply the rectified Chi2 algorithm proposed in [44] to discretize the continuous-valued attributes of the datasets. The dataset may contain many features that are not necessary for training the model. Such features may reduce performance and raise complexity during training, which increases the need to select essential features before generating a model. There are various feature selection techniques, such as [46][47][48], based on filter, wrapper, and embedded methods. The proposed work uses an RST-based filter method to remove duplicated, correlated, and redundant features. The following subsections describe the concepts of RST related to our work and the RST-based feature selection algorithm.

Rough Set Theory
RST [49] is well suited to dealing with imprecise, inconsistent, and incomplete information and knowledge. Let DS = (U, A, L) be a decision system, where U is a finite set of objects, A is a finite set of features, and L is a set of class labels. The basis of RST is the indiscernibility relation defined over pairs of objects in U. We find the indiscernible sets of objects using the indiscernibility relation defined in Equation (1). These sets are called equivalence classes, and they are disjoint from one another.
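As an illustration of the indiscernibility relation, the sketch below groups objects that share the same value on every feature of a chosen subset, which is the usual reading of the relation in rough set theory. The toy table, the feature names `a` and `b`, and the function name `equivalence_classes` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def equivalence_classes(table, features):
    """Group object ids whose values agree on every feature in `features`."""
    groups = defaultdict(set)
    for obj_id, row in table.items():
        key = tuple(row[f] for f in features)  # signature of the object on P
        groups[key].add(obj_id)
    return list(groups.values())

# Hypothetical discretized table: object id -> feature values.
table = {
    "x1": {"a": 0, "b": 1},
    "x2": {"a": 0, "b": 1},
    "x3": {"a": 1, "b": 0},
    "x4": {"a": 1, "b": 1},
}

classes_ab = equivalence_classes(table, ["a", "b"])
classes_a = equivalence_classes(table, ["a"])
print(classes_ab)  # three classes: {x1, x2}, {x3}, {x4}
print(classes_a)   # two classes: {x1, x2}, {x3, x4}
```

Dropping a feature can only merge equivalence classes, never split them, which is why a smaller feature subset gives a coarser view of the data.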
We take the target sets based on the class labels of the decision system. Let L = {l_1, l_2, ..., l_k} be the set of the k class labels. The universe U of objects can be divided into three disjoint regions: positive, boundary, and negative. These three regions are defined based on the lower and upper approximations of the target sets. Let Q_i be the target set of objects whose class label is l_i, for i = 1, 2, ..., k. The lower approximation of Q_i with respect to the feature subset P ⊆ A is given by Equation (2), where [x]_P is the equivalence class of objects that are indiscernible from the object x under the indiscernibility relation defined in Equation (1).
Similarly, the upper approximation of Q_i with respect to the feature subset P ⊆ A is given by Equation (3).
The positive region, PR_P(DS), of the decision system DS with respect to the feature subset P ⊆ A is the union of the lower approximations of the target sets and is defined by Equation (4).
Similarly, the boundary region, BR_P(DS), and the negative region, NR_P(DS), of the decision system DS are defined by Equations (5) and (6), respectively.
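For reference, the standard rough set definitions that match the descriptions above can be written as follows. This is a sketch of the textbook forms and is only assumed to agree with Equations (1) to (5) of this work; the bracket notation for the equivalence class of x is likewise an assumption. The negative region of Equation (6) is not restated, since its exact form here cannot be inferred from the surrounding text.

```latex
% Equivalence class of x under the indiscernibility relation on P
[x]_P = \{\, y \in U : a(x) = a(y) \ \text{for all } a \in P \,\}

% Lower and upper approximations of the target set Q_i
\underline{P}\,Q_i = \{\, x \in U : [x]_P \subseteq Q_i \,\}
\overline{P}\,Q_i  = \{\, x \in U : [x]_P \cap Q_i \neq \emptyset \,\}

% Positive and boundary regions of the decision system
PR_P(DS) = \bigcup_{i=1}^{k} \underline{P}\,Q_i
BR_P(DS) = \bigcup_{i=1}^{k} \overline{P}\,Q_i \;-\; \bigcup_{i=1}^{k} \underline{P}\,Q_i
```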
Any object in the positive region definitely belongs to a single class, so this region is the more informative one for classification. For the boundary region objects, it is uncertain to which single class they belong. To extract some information for classification from them, we apply fuzzy theory to compute the class belongingness of the objects in the boundary region. The negative region, in contrast, is simply discarded as noise, as it is not required for classification purposes.
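The split into a certain (positive) and an uncertain (boundary) region can be illustrated on a toy decision table. The sketch below assumes the standard lower and upper approximation definitions; the data, the labels `major` and `minor`, and the function names are hypothetical, and the negative region is left out because its definition in this work is not reproduced here.

```python
from collections import defaultdict

def equivalence_classes(rows, features):
    """Partition object ids by their values on `features`."""
    groups = defaultdict(set)
    for obj_id, (values, _label) in rows.items():
        groups[tuple(values[f] for f in features)].add(obj_id)
    return list(groups.values())

def regions(rows, features):
    """Return the positive region, the boundary region and the per-class lower sets."""
    labels = {label for _values, label in rows.values()}
    targets = {l: {o for o, (_v, lab) in rows.items() if lab == l} for l in labels}
    lower = {l: set() for l in labels}
    upper = {l: set() for l in labels}
    for block in equivalence_classes(rows, features):
        for l, target in targets.items():
            if block <= target:   # block lies entirely inside the target set
                lower[l] |= block
            if block & target:    # block overlaps the target set
                upper[l] |= block
    positive = set().union(*lower.values())
    boundary = set().union(*upper.values()) - positive
    return positive, boundary, lower

# Hypothetical decision table: object id -> (feature values, class label).
rows = {
    "x1": ({"a": 0, "b": 1}, "major"),
    "x2": ({"a": 0, "b": 1}, "major"),
    "x3": ({"a": 1, "b": 0}, "minor"),
    "x4": ({"a": 1, "b": 1}, "major"),
    "x5": ({"a": 1, "b": 1}, "minor"),
}
positive, boundary, lower = regions(rows, ["a", "b"])
print(sorted(positive), sorted(boundary))
# prints: ['x1', 'x2', 'x3'] ['x4', 'x5']
```

Objects x4 and x5 are indiscernible but carry different labels, so they land in the boundary region; these are the objects to which fuzzy membership values are assigned later.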

Feature Selection
Feature selection [50,51] is one of the most important tasks for machine learning researchers. It helps to reduce the complexity of learning models by removing redundant or spurious features. If a feature subset P (⊆ A) provides a distribution pattern of the dataset that is similar to the one obtained with the whole feature set A, then the subset P is sufficient to describe the decision system DS. Hence, the remaining features (i.e., A − P) are redundant and must be removed from the dataset. This subset P must also be minimal in the sense that if we remove any one feature (say, a) from P, then the feature subset P − {a} is unable to provide the same distribution pattern. So, in terms of RST, the subset P is sufficient if it is minimal and provides an equivalence class structure similar to the one obtained with the whole feature set A. This subset P is termed a reduct in RST. Thus, for a decision system DS = (U, A, L), a set P ⊆ A is called a reduct if (i) both P and A provide the same set of equivalence classes, and (ii) P is minimal, i.e., after the removal of any feature a from P, P − {a} and A provide different sets of equivalence classes. However, finding the exact reduct is an NP-hard problem, and in RST an approximate solution is provided by the Quick Reduct generation algorithm [52,53]. In RST, we compute the dependency of L on a feature subset, say P ⊆ A, and this dependency (i.e., γ_P(L)) is defined by Equation (7).
In the Quick Reduct generation algorithm, we first randomly select one feature, say a, from the set A and let P = {a}. If γ_P(L) is equal to γ_A(L), we terminate the algorithm; otherwise, we select any one feature, say b, from the rest of A and set P = P ∪ {b}. If γ_P(L) is not equal to γ_A(L), we remove b from P, select the next feature from A, and continue the process. The detailed RST-based feature selection algorithm used in this work is described in Algorithm 1.

Algorithm 1: Rough Set Theory based Feature Selection (RSTFS).
Input: DS = (U, A, L), the dataset /* U is the set of instances, A is the set of features, and L is the set of class labels */
Output: DS = (U, P, L), the dataset after selection of important features
Let L = {l_1, l_2, ..., l_k} be the k class labels of the dataset;
Let Q_i be the set of objects with class label l_i, for i = 1, 2, ..., k;
Compute the dependency of L on A, i.e., γ_A(L), using Equations (2) and (7);
Randomly select a feature a ∈ A and set P = {a};
Compute γ_P(L) using Equations (2) and (7);
while γ_P(L) ≠ γ_A(L) do
    for each b ∈ (A − P) do
        Compute γ_{P∪{b}}(L) using Equations (2) and (7);
        if γ_{P∪{b}}(L) > γ_P(L) then
            P = P ∪ {b} and γ_P(L) = γ_{P∪{b}}(L);
        end
    end
end
return DS = (U, P, L);
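A dependency-driven greedy selection in the spirit of Algorithm 1 can be sketched as follows. Two assumptions are made: the dependency γ_P(L) is taken in its standard form, the fraction of objects that lie in the positive region, and the sketch uses the common Quick Reduct variant that starts from the empty set and always adds the feature with the highest dependency, rather than the random start of Algorithm 1. The data and the names `dependency` and `quick_reduct` are hypothetical.

```python
from collections import defaultdict

def dependency(rows, features):
    """gamma_P(L): share of objects whose equivalence class has a single label."""
    blocks = defaultdict(list)
    for values, label in rows:
        blocks[tuple(values[f] for f in features)].append(label)
    consistent = sum(len(ls) for ls in blocks.values() if len(set(ls)) == 1)
    return consistent / len(rows)

def quick_reduct(rows, all_features):
    """Greedily grow a feature subset until it matches the full dependency."""
    target = dependency(rows, all_features)
    reduct = []
    current = dependency(rows, reduct)
    while current < target:
        candidates = [f for f in all_features if f not in reduct]
        # Add the single feature that yields the highest dependency.
        best = max(candidates, key=lambda f: dependency(rows, reduct + [f]))
        reduct.append(best)
        current = dependency(rows, reduct)
    return reduct

# Hypothetical discretized dataset: (feature values, class label).
rows = [
    ({"a": 0, "b": 0, "c": 0}, "major"),
    ({"a": 0, "b": 1, "c": 0}, "major"),
    ({"a": 1, "b": 0, "c": 1}, "minor"),
    ({"a": 1, "b": 1, "c": 1}, "minor"),
    ({"a": 0, "b": 1, "c": 1}, "major"),
]
selected = quick_reduct(rows, ["a", "b", "c"])
print(selected)  # prints: ['a']
```

The loop always terminates: adding features never lowers the dependency, and once every feature is selected the dependency equals the target by definition. The result is an approximate reduct and is not guaranteed to be minimal.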

Rough-Fuzzy Based Oversampling Technique
The Rough Set Theory based Feature Selection algorithm (RSTFS) provides the decision system DS = (U, P, L), where U is the set of objects, each described by the selected features P, and L is the set of class labels. As there are k class labels, l_1, l_2, ..., l_k, the positive region PR_P of DS (obtained using Equations (2) and (4)) is divided into k clusters, R_1, R_2, ..., R_k. Here, the i-th cluster R_i contains all the objects of class label l_i that definitely belong to the target set Q_i, for i = 1, 2, ..., k. Let X_i be the representative of R_i, for i = 1, 2, ..., k. As a lower approximation set may be empty, some R_i may be empty. The boundary region BR_P of DS, in which the objects are uncertain, is identified by Equations (2), (3) and (5). In the proposed work, uncertainty is measured using fuzzy theory [54,55], which gives, for each object in BR_P, the fuzzy membership value with which it belongs to each cluster in PR_P. Let there be n objects, O_1, O_2, ..., O_n, in BR_P. Then, for each object O_i ∈ BR_P, we find the membership value F_ij using Equation (8), where d_ij is the Euclidean distance between O_i and X_j. This gives the belongingness of O_i to cluster R_j, for j = 1, 2, ..., k. The smaller d_ij is, the greater F_ij is; that is, the belongingness of a boundary-region object to a cluster R_j decreases with its distance from the cluster.
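The behaviour described above, membership falling as distance grows, can be illustrated with a distance-based membership function. Since Equation (8) is not reproduced here, the sketch assumes the familiar fuzzy c-means form with fuzzifier m = 2, and it assumes that the representative of a cluster is its component-wise mean; both are assumptions, not the paper's stated choices. All names and data points are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean, used here as an assumed cluster representative."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def memberships(obj, representatives, m=2.0):
    """Fuzzy c-means style memberships of `obj` in each cluster (they sum to 1)."""
    dists = [math.dist(obj, rep) for rep in representatives]
    if any(d == 0.0 for d in dists):
        # The object coincides with a representative: full membership there.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    inverse = [d ** -power for d in dists]
    return [v / sum(inverse) for v in inverse]

# Two hypothetical positive-region clusters and one boundary object.
reps = [centroid([(0, 0), (2, 0)]), centroid([(9, 0), (11, 0)])]  # (1, 0) and (10, 0)
f = memberships((4, 0), reps)
print([round(v, 3) for v in f])  # prints: [0.8, 0.2]
```

The boundary object is at distance 3 from the first representative and 6 from the second, so with m = 2 its membership in the first cluster is four times that in the second.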
Thus, for each object O_i, we get the membership vector [F_i1, F_i2, ..., F_ik], for i = 1, 2, ..., n. So, we define an n × k membership matrix, F = (F_ij)_{n×k}, where the i-th row gives the membership values of the i-th object in the k different clusters in PR_P. Let the actual class of O_i in BR_P be l_i. The oversampling and undersampling process of Algorithm 2 is performed for each O_i according to the two cases listed after the algorithm.
Algorithm 2: Rough-Fuzzy based Class Balancing Method (DS).
Input: DS = (U, P, L), the unbalanced dataset /* U is the set of instances with P selected features, and L is the decision class, namely the majority and minority classes */
Output: DS = (U′, P, L), the balanced dataset with generated instances
Let L = {l_1, l_2, ..., l_k} be the k classes;
Let L = L_major ∪ L_minor, where L_major and L_minor are the sets of majority classes and minority classes, respectively;
Compute PR_P and BR_P using Equations (2) to (5);

• l_i is a major class: Here, object O_i is of a majority class. If its fuzzy membership value for its own class l_i is at most a threshold (say, δ_1), i.e., if F_ii ≤ δ_1, then we remove the object from the dataset, i.e., we perform undersampling. We do not lose valuable information because the object was in a region of uncertainty. On the other hand, if its fuzzy membership value for another class, say class l_j (for j = 1, 2, ..., k and j ≠ i), is greater than a threshold (say, δ_2), i.e., if F_ij > δ_2, then we allow O_i to generate a synthetic object of class l_j, provided l_j is a minor class. In this case, we create a new object O_i^j of class l_j from O_i and the representative object X_j of cluster R_j. Thus, we create one object from O_i for each class l_j for which F_ij > δ_2 and l_j is a minor class.
• l_i is a minor class: Here, object O_i is of a minor class. Let F_ij > δ_2 for n_i clusters. Then, for each of these n_i clusters, say R_1, R_2, ..., R_{n_i}, a new object O′_t of class l_t is created, for t = 1, 2, ..., n_i. So, if there are u objects of minor classes in BR_P, then the total number of new objects created is ∑_{i=1}^{u} n_i.
How the new objects are generated from O_i with the help of the cluster representatives is described in Algorithm 3. After these two steps, it is desired that all minor classes either remain minor classes or are transformed into major classes. But if a minor class, say l_i, becomes a major class by turning some major class into a minor class, then we do not allow the creation of all the new objects of class l_i. In this case, without loss of generality, say O_1, O_2, ..., O_v of BR_P could generate new objects of class l_i in cluster R_i. Then we arrange the membership values F_1i, F_2i, ..., F_vi in descending order, select at most the first v values that make l_i a major class, and discard the other values. Based on each of these v objects, say O_j, for j = 1, 2, ..., v, and X_i (the representative of cluster R_i), a new object of class l_i is created.
If a minor class, say l_i, remains minor, then we directly create objects of that class from the objects of the same class in BR_P using the same Algorithm 3. If BR_P contains t objects of class l_i and we need to create T objects of class l_i to make it a major class, then we apply Algorithm 3 T/t times to each object O_j. This algorithm generates the new object O′_j as a linear combination of X_i and O_j, as defined in Equation (9), where w gives the weightage of X_i. The value of w is selected randomly in [0.8, 1.0) and is drawn T/t times to generate T/t new objects of class l_i. A larger value of w indicates that the new object O′_j is more similar to the representative X_i of cluster R_i than to the object O_j. For example, if the objects in the dataset are two-dimensional (2-D), with O_j = (2, 5) and X_i = (7, 2), then for w = 0.9, O′_j = 0.9(7, 2) + 0.1(2, 5) = (6.5, 2.3). Here, the distance of O′_j from X_i is 0.583 units and that of O′_j from O_j is 5.248 units. So the object O′_j is much closer to X_i than to O_j.
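The worked example above can be reproduced directly. The sketch below assumes that Equation (9) is the weighted combination used in the example, i.e., the new object equals w times the representative plus (1 − w) times the boundary object; the function name `generate` is hypothetical.

```python
import math
import random

def generate(obj, rep, w):
    """New object = w * representative + (1 - w) * boundary object."""
    return tuple(w * r + (1.0 - w) * o for o, r in zip(obj, rep))

o_j, x_i = (2.0, 5.0), (7.0, 2.0)
new = generate(o_j, x_i, w=0.9)
print([round(v, 3) for v in new])        # prints: [6.5, 2.3]
print(round(math.dist(new, x_i), 3))     # prints: 0.583
print(round(math.dist(new, o_j), 3))     # prints: 5.248

# Drawing the weight at random from the interval [0.8, 1.0), as described in the text.
rng = random.Random(0)
w = 0.8 + 0.2 * rng.random()
sample = generate(o_j, x_i, w)
```

Because w is close to 1, every generated object stays near the cluster representative, which keeps the synthetic minority samples inside the region where the class membership is certain.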
Algorithm 3:
Input: O_i and X_j /* O_i is a P-dimensional object, and X_j is the representative of the set R_j of objects of class l_j */
Output: O_i^j, the generated object
Let X_j = (x_j1, x_j2, ..., x_jd);
Thus, based on the fuzzy membership values of the objects in the boundary region of the rough sets, both removal and augmentation of data are done. In brief, the proposed synthesizing technique is a rough-fuzzy technique in which Rough Set Theory is used to segregate the precise or certain objects from the region of uncertainty, and fuzzy theory is used to explore the boundary region of the rough sets to generate synthesized data based on the fuzzy membership values. The rough-fuzzy based data sampling method is described in Algorithm 2.
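The two threshold rules for boundary objects can be summarized in a small decision function. This is a sketch of the per-object decision only, under stated assumptions: the threshold values 0.3 and 0.6 are hypothetical, memberships are passed in as a dictionary, and the later step that caps the number of generated objects so that class roles do not swap is omitted. For a minority object, the sketch follows the text literally and does not restrict the target classes.

```python
def plan_for_boundary_object(own, memberships, minority, delta1, delta2):
    """Return (remove, targets) for one boundary-region object.

    own         -- the object's actual class label
    memberships -- dict mapping class label -> fuzzy membership of the object
    minority    -- set of minority class labels
    """
    if own not in minority:
        # Majority object: drop it if it barely belongs to its own class.
        remove = memberships[own] <= delta1
        # It may still seed synthetic objects of minority classes it resembles.
        targets = [c for c, f in memberships.items()
                   if c != own and c in minority and f > delta2]
    else:
        # Minority object: never removed; seeds one object per strong cluster.
        remove = False
        targets = [c for c, f in memberships.items() if f > delta2]
    return remove, targets

minority = {"minor"}
a = plan_for_boundary_object("major", {"major": 0.2, "minor": 0.8}, minority, 0.3, 0.6)
b = plan_for_boundary_object("major", {"major": 0.9, "minor": 0.1}, minority, 0.3, 0.6)
c = plan_for_boundary_object("minor", {"major": 0.3, "minor": 0.7}, minority, 0.3, 0.6)
print(a, b, c)
# prints: (True, ['minor']) (False, []) (False, ['minor'])
```

Each class in `targets` would then be paired with its cluster representative to generate one synthetic object by the weighted combination described earlier.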

Result and Discussion
The collected imbalanced dataset is preprocessed and discretized, and its essential features are selected by the RST-based feature selection algorithm. The modified dataset is then partitioned into k folds (here, k = 10) [56] such that the proportion of each class of objects in each fold is the same as that in the whole dataset. As in the k-fold cross-validation technique, k − 1 folds are used for training, and the remaining fold is used for testing the classification models [57]. Both the training and testing datasets are class imbalanced because the dataset is imbalanced. The training dataset is balanced by the proposed sampling technique, and the performance of the different classification models is measured on the imbalanced test dataset, so the generated data are not used for testing the model. The model is thus trained and tested k times, and the average performance is taken as the performance of the model. To measure the performance of the proposed method, we consider various traditional machine learning classifiers, namely Naive Bayes, Logistic, Multilayer Perceptron (MLP), SGD, SimpleLogistic, SMO, Voted Perceptron, IBk, KStar, AdaBoost, AttributeSelectedClassifier (ASC), Bagging, ClassificationViaRegression (CVR), Filtered Classifier, IterativeClassifierOptimizer (ICO), LogitBoost, MultiClassClassifier (MCC), MCC Updateable, Random Committee, RandomizableFilteredClassifier (RFC), RandomSubSpace, Decision Table, JRip, PART, Hoeffding Tree, J48, LMT, Random Forest, Random Tree, and REPTree. For each classifier, a confusion matrix [58], shown in Table 1, is created for each test dataset, where the terms 'True Positive' (TP), 'False Positive' (FP), 'False Negative' (FN), and 'True Negative' (TN) have their usual meanings. These terms are used to compute performance metrics such as the Accuracy (A), Precision (P), Recall (R), and F-measure (F) of the classifier, as defined in Equations (10) to (13).
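The evaluation protocol described above, stratified folds with balancing applied only to the training part, can be sketched as follows. The fold construction is a simple round-robin deal per class and is only an assumption about how the stratification might be done; the label counts and the function name `stratified_folds` are hypothetical, and the balancing step itself is left as a comment.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Split indices into k folds that keep class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin
    return folds

# Hypothetical imbalanced label vector: 10 minority and 190 majority objects.
labels = ["minor"] * 10 + ["major"] * 190
folds = stratified_folds(labels, k=10)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Only the training part would be balanced here (by the sampling method);
    # the test fold keeps its original imbalance and holds no synthetic objects.
    minority_in_test = sum(labels[i] == "minor" for i in test_idx)
    assert minority_in_test == 1 and len(train_idx) == 180
```

Keeping synthetic objects out of the test fold matters because the reported metrics should reflect performance on the real, imbalanced distribution.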
In Table 2, we have considered three different types of training sets, namely the Imbalanced Dataset (ID), the dataset Balanced using Oversampling (BO), and the dataset Balanced with both oversampling and undersampling (BS). Almost all the classification models provide nearly 95% accuracy for the ID dataset. However, if we look at the precision, most models give undefined values, denoted by '-' in the table. This is because TP and FP are both 0, which means that a model trained on a dataset with comparatively few positive samples never predicts the positive class, and all the positive samples are misclassified as negative. Similarly, if we look at the recall, most classifiers provide 0, as the TP value is 0 although the FN value is nonzero. Consequently, the value of the F-measure is also undefined. Hence, we may conclude that even though the accuracy is quite good, the overall performance of the classification models on the imbalanced dataset (ID) is not satisfactory, and there is an urgent need to deal with this problem. That is why we have opted to balance the dataset, apply the same classifiers again, and check the performance obtained for all the evaluation metrics.
In the BO dataset, we have only oversampled the minority class using the synthetic data generation technique, which gives a satisfying result with balanced precision and recall values and well-defined F-measures. However, the accuracy ranges from around 80% to 93%, which is noticeably lower than for the original imbalanced dataset. Next, we experiment with the BS dataset, where we have also performed undersampling to remove some majority class samples. The addition of undersampling gives improved accuracy, which ranges from 77% to 98%. In this approach, the accuracy of the PART classifier decreases from 91% on the BO dataset to 77%; similarly, the accuracy of Naive Bayes decreases from 82% to 81%. For the rest of the models, the accuracy increases by around 8% to 10% compared to the BO dataset. One more observation is that in the Logistic model, although the accuracy has decreased, the model predicts the classes more accurately, as the precision and recall values are improved. As a result, we conclude that the models work more accurately and effectively after balancing the class observations.
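The effect described above, high accuracy together with undefined precision and F-measure, can be reproduced with a few lines of code. The following is a minimal sketch that assumes the standard confusion-matrix definitions of the four metrics; the function name `classification_metrics` and the counts are hypothetical and chosen only for illustration (100 records, 5 of them positive, and a classifier that predicts every record as negative).

```python
from typing import Dict, Optional


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, Optional[float]]:
    """Compute accuracy, precision, recall and F-measure from confusion-matrix counts.

    A metric whose denominator is zero is returned as None, which corresponds
    to an undefined ('-') entry in a results table.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else None
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or (precision + recall) == 0:
        f_measure = None
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Illustrative counts (an assumption, not taken from Table 2):
# every record is predicted as negative, so TP = FP = 0.
majority_only = classification_metrics(tp=0, fp=0, tn=95, fn=5)
print(majority_only)
```

With these counts the accuracy is 0.95 and the recall is 0, while precision and F-measure are undefined, which matches the pattern reported for the ID dataset.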

Comparison with Other Methods
From Table 2, it is observed that the proposed Rough-Fuzzy based Synthetic Data Generation (RFSDG) method provides the best result when the RF classifier model is applied to the BS dataset (obtained using both oversampling and undersampling). So, we have considered RFSDG + RF as the final proposed model and compared it with the following state-of-the-art methods: Chawla et al. [11], Lee and Kim [59], Schölkopf et al. [60], Vuttipittayamongkol and Elyan [61], Zhenchuan Li [20], and Kokkotis et al. [62], described below.

1. SMOTE (Chawla et al. [11]): A well-known oversampling method for imbalanced datasets that produces new minority instances by linear interpolation between neighboring points to balance the classes. A Random Forest classifier was used on the balanced dataset.
2. OSM (Lee and Kim [59]): The support vector machine was modified with fuzzy and KNN algorithms into an overlap-sensitive margin (OSM) classifier to deal with imbalanced and overlapping datasets.
3. OC-SVM (Schölkopf et al. [60]): With this one-class learning method, only the minority samples were used for training, without considering the majority samples. It is suitable for severely imbalanced datasets.
4. NB-Tomek (Vuttipittayamongkol and Elyan [61]): Majority-class elements were removed from the overlapping area while preventing excessive data removal, which could lead to greater information loss.
5. Hybrid (AE+ANN) (Zhenchuan Li [20]): The overlapping subset was identified first. Since this subset had a low imbalance ratio, a non-linear classifier was used to classify the samples in it.
6. Kokkotis et al. [62]: Reliable machine learning (ML) prediction models were developed for stroke disease, coping with a typical severe class imbalance problem. The effectiveness of their ML approach was investigated with the well-known classifiers Random Forest (RF), Logistic Regression (LR), Multilayer Perceptron (MLP), XGBoost, Support Vector Machine (SVM), and K-nearest Neighbours (KNN). We have taken the LR model performance for comparison, as it provides the best results.

The proposed method is compared with the other methods (in terms of accuracy, precision, recall, and F-measure) using two different real-life datasets, consisting of credit card data and health data, as shown in Tables 3 and 4. The best results obtained are marked in boldface in both tables, and it is observed that the proposed model provides much better performance than the other models. The comparative result is also visualized using a bar chart in Figure 2, which likewise shows that the proposed method provides the best result for both datasets.
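The linear interpolation that SMOTE uses to create minority instances (method 1 in the list above) can be sketched as follows. This is a simplified illustration and not the reference implementation of Chawla et al. [11]: the function names, the default of three nearest neighbours, the fixed random seed, and the toy minority points are all assumptions made for the example.

```python
import random
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_oversample(
    minority: List[Sequence[float]], n_new: int, k: int = 3, seed: int = 0
) -> List[List[float]]:
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are required")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: squared_distance(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (hypothetical values).
minority_points = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_like_oversample(minority_points, n_new=6)
print(new_points)
```

Every synthetic point lies on the line segment between two existing minority samples, so the new samples stay inside the region already occupied by the minority class.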
The models are also compared using the Receiver Operating Characteristic (ROC) curve [63] to evaluate the proposed model. In drawing the ROC curve, the False Positive Rate (FPR) is plotted along the X axis and the True Positive Rate (TPR) along the Y axis. The TPR and FPR are computed using Equations (14) and (15).
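For reference, the standard definitions of the two rates in terms of the confusion-matrix entries (TP, FP, TN, FN) are given below; it is assumed here that these are the definitions intended by the equations cited above.

```latex
\mathrm{TPR} = \frac{TP}{TP + FN},
\qquad
\mathrm{FPR} = \frac{FP}{FP + TN}
```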
The top-left corner of the plot is the 'ideal' point, where FPR = 0 and TPR = 1; hence, the larger the Area Under the Curve (AUC), the better the performance of the model. The ROC curves are drawn for all seven compared models on both datasets, as shown in Figure 3. The figure shows that the proposed model provides the best AUC.
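The construction of an ROC curve and its AUC can be illustrated with a short sketch. It assumes binary labels (1 for the positive class) and real-valued classifier scores, sweeps the decision threshold from high to low, and integrates the curve with the trapezoidal rule; the function names and the toy labels and scores are hypothetical and are not taken from the experiments.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: one negative sample is scored above one positive sample.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auc = auc(roc_points(toy_labels, toy_scores))
print(toy_auc)  # 0.75
```

A classifier that ranks every positive sample above every negative one passes through the ideal point (0, 1) and reaches an AUC of 1.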

Conclusions
An imbalanced dataset creates problems in classification that become acute in some cases: as we have seen in Table 2, even when a model provides good accuracy, it may still fail to predict the positive samples. To tackle this problem and improve classification quality, we have used rough-fuzzy theory to apply oversampling and undersampling. The proposed sampling improves the result, and we obtain well-defined precision and recall values with acceptable accuracy. We have generated new data for the minority class and dealt with the overlapping problem, noise, and outliers, which we identified as the research gap. The main demerit of the work is that discretized data are needed to apply the RST. Generalized versions of the RST, however, apply to real-valued datasets. There are many generalizations of the RST using an arbitrary binary relation [64]. The main objective of applying such a relation is to enlarge the positive region of the rough set in order to achieve more accuracy. For such generalizations of the RST, the neighborhood system helps to approximate a rough set better and thus to understand imperfect knowledge. On the other hand, there are many similarities between topological structures and the RST, which has motivated many researchers [65] to exploit topological generalizations such as infra-topology, supra-topology, and so on. This makes it possible to replace the RST with its topological counterparts for further generalization. All such generalizations of the RST are within the future scope of this paper, in order to handle imperfect knowledge more effectively when balancing class-imbalanced datasets.
In this paper, we have considered a linear combination technique for oversampling the data; many other techniques [66,67] may be applied for the same purpose, and a comparative study among them is also within the future scope of this paper.

Figure 1. The workflow of the proposed methodology.

Figure 2. Comparison of the performance of different methods.

Figure 3.

Table 1. The Confusion Matrix.

Table 2. Performance evaluation using different classifiers and different datasets.

Table 3. Comparison (in %) of different methods on Credit-card data.

Table 4. Comparison (in %) of different methods on Health data.