Addressing Binary Classification over Class Imbalanced Clinical Datasets Using Computationally Intelligent Techniques

Healthcare is a basic need of every human being, and clinical datasets play an important role in developing intelligent healthcare systems for monitoring people's health. Most real-world datasets are inherently class imbalanced, and clinical datasets are no exception. Imbalanced class distributions cause several problems when training classifiers, which consequently suffer from low accuracy, precision, and recall and a high degree of misclassification. We performed a brief literature review on class imbalanced learning. This study presents an empirical performance evaluation of six classifiers, namely Decision Tree, k-Nearest Neighbor, Logistic Regression, Artificial Neural Network, Support Vector Machine, and Gaussian Naïve Bayes, over five imbalanced clinical datasets, Breast Cancer Disease, Coronary Heart Disease, Indian Liver Patient, Pima Indians Diabetes Database, and Chronic Kidney Disease, with respect to seven class balancing techniques, namely Undersampling, Random oversampling, SMOTE, ADASYN, SVM-SMOTE, SMOTEEN, and SMOTETOMEK. In addition, we explore explanations for the superiority of particular classifiers and data-balancing techniques. Furthermore, we discuss recommendations on how to handle class imbalanced datasets when training supervised machine learning methods. The result analysis demonstrates that the SMOTEEN balancing method often performed better than the other six data-balancing techniques with all six classifiers and for all five clinical datasets. The other six balancing techniques performed almost equally to one another, and moderately worse than SMOTEEN.


Introduction
For the past few years, imbalanced data have attracted significant attention from researchers in machine learning. Different challenges occur at various stages of data mining applications [1]. Developments in technology and computational science have driven the availability and explosive growth of data obtained from real-world problems such as medical diagnosis [2,3], credit card fraud detection [4], intrusion detection, culture modeling [5], text classification, oil spill detection [6], land mine detection [7], etc. [8]. A classification dataset with skewed class proportions is called imbalanced.
Classifying imbalanced data is an important and frequently occurring challenge in data mining. Classes that make up the larger part of a dataset are known as majority classes, whereas minority classes make up a small proportion. The major difficulty with imbalanced datasets is that most machine learning algorithms are biased toward the majority class. Yet the minority class is usually of serious concern from a learning perspective, and misclassifying it is costly [9][10][11]. Acquiring knowledge from imbalanced datasets poses a new challenge for various data mining applications. This challenge reveals itself in two forms: minority interests and uncommon examples [12,13]. Standard learning algorithms lose performance when dealing with imbalanced learning problems [14]. Studies with most state-of-the-art classifiers have shown that a skewed class distribution is a major cause of significant performance loss. The degree of skew is expressed by the imbalance ratio (IR), the ratio of the number of instances in the majority class to the number of instances in the minority class. Several approaches are employed to address class imbalance, such as data sampling and boosting [15,16]. Data sampling has its own merits and demerits in terms of time savings and information loss. In many applications of supervised learning, a substantial difference among the prior probabilities of different classes is observed. This condition is known as the class imbalance problem [17]. Most machine learning algorithms face difficulties in classifying imbalanced data [18][19][20]. Data imbalance is a result of the nature of the dataspace. The summarized details of various significant clinical datasets are presented in Table 1. Imbalanced data classification is one of the top ten challenging issues in data mining [21].
Medical datasets often face the problem of imbalance. Herein, we used five clinical datasets for our study. Breast cancer is the second most common cancer in women after skin cancer. In 2018, the World Health Organization reported 2.09 million cases of breast cancer and 627,000 deaths from the disease. It develops in breast cells, and women are affected far more often than men. A lump in the breast, bloody discharge from the nipple, and changes in breast shape are the main symptoms [22]. Coronary Heart Disease (CHD) develops when the arteries are unable to supply sufficient oxygen-rich blood to the heart. It is generally caused by plaque (a waxy substance) building up in the larger coronary arteries, which blocks the flow of blood through them. In 2017, CHD, a very common heart disease, killed 365,914 people. About 20% of deaths due to CHD occur in adults below 65 years of age [23]. Liver disease causes almost 2 million deaths per year across the globe. Causes of liver disease include alcohol, obesity, and viruses, and it can also be inherited genetically. Scarring of the damaged liver (cirrhosis) can lead to liver failure, a deadly condition [24]. Chronic kidney disease (CKD) means that the kidneys are unable to filter the blood properly. Persons with high blood pressure or diabetes are at higher risk of kidney disease. Malfunctioning kidneys leave extra water and waste in the body, which can lead to high blood pressure and heart disease. According to the study, 37 million people, around 15% of US adults, suffer from CKD; 90% of adults with CKD are unaware of it, and 50% of persons with low kidney function who are not on dialysis are not aware that they have CKD.
According to current estimates, CKD is more common in people aged 65 years or older (38%) than in people aged 45-64 years (13%) or 18-44 years (7%), and it is more common in women (15%) than in men (12%). Dialysis and kidney transplant are the treatments for kidney failure [25,26]. Diabetes is a chronic disease that occurs when the pancreas does not produce insulin or when the body cannot properly use the insulin it produces. The global prevalence of diabetes in 2019 is estimated at 9.3% (463 million people), rising to 10.2% (578 million) by 2030 and 10.9% (700 million) by 2045. Prevalence is higher in urban (10.8%) than in rural (7.2%) areas and in rich (10.4%) than in poor countries (4.0%), and 50.1% of persons with diabetes are not aware that they have the disease. The global prevalence of impaired glucose tolerance is estimated at 7.5% (374 million) in 2019 and is predicted to reach 8.0% (454 million) by 2030 and 8.6% (548 million) by 2045 [26]. Worldwide, lung cancer remains the leading cause of cancer death in both women and men, and it is the third most common cancer. Lung cancer arises from the uncontrolled growth of abnormal cells in one or both lungs. The abnormal cells do not function normally and do not develop into healthy lung tissue. As they grow, they can form tumors that obstruct the normal function of the lungs, which supply oxygen to the body via the blood. The World Health Organization reported 1.76 million deaths out of 2.09 million total cases of lung cancer in 2018, and 10% of cancer deaths are due to lung cancer. Survival depends on the stage at diagnosis and is poorer if the cancer is diagnosed at a late stage [27].
In this paper, seven algorithms are used to balance the imbalanced data of five clinical datasets, and six well-known classifiers are implemented to classify the data. Four measures, accuracy, precision, recall, and F1-score, are used to evaluate performance. How imbalanced is a dataset? The answer ranges from mild to extreme, as shown in Table 2. The imbalance ratio (IR) for binary class data is the ratio of the number of samples of the majority class to the number of samples of the minority class.
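As a small illustration of the imbalance ratio defined above, the following Python sketch computes IR from a list of binary class labels. The function name `imbalance_ratio` and the toy label list are assumptions made for illustration; they are not taken from the study's code.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (majority class count) / (minority class count) for binary labels."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return max(counts.values()) / min(counts.values())

# Hypothetical toy labels: 90 negative and 10 positive samples.
toy_labels = [0] * 90 + [1] * 10
print(imbalance_ratio(toy_labels))  # 9.0
```

An IR of 1 corresponds to a perfectly balanced dataset; larger values indicate stronger imbalance.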
Class imbalance learning approaches can be divided into three major categories: (1) data-level strategies, (2) algorithm-level strategies, and (3) hybrid strategies, as shown in Figure 1. A data-level strategy uses a resampling procedure to handle class imbalance. Data-level strategies are further divided into random undersampling, oversampling, and a hybrid approach that combines undersampling and oversampling. An algorithm-level strategy develops new algorithms or updates existing ones so that they take the significance of the minority classes into account. A hybrid strategy combines data-level and algorithm-level strategies to deal with the class imbalance problem. The data-level strategy is the more successful for balancing class data, and it is applied prior to the learning process, during the data preprocessing stage. Hence, the main contribution of this paper is to design a performance evaluation setup and analyze the performance effects of important data-balancing techniques with various classification methods on five imbalanced clinical datasets: Breast Cancer Disease, Coronary Heart Disease, Indian Liver Patients, Pima Indians Diabetes Database, and Chronic Kidney Disease.
The paper is organized as follows: Section 2 outlines the related work dealing with the imbalanced data. Section 3 of this paper discusses the various algorithms used for balancing the clinical data. Section 4 talks about the experimental setup and gives a description of the dataset. The results are discussed in Section 5 of this paper. The conclusion is discussed in Section 6.

Related Works
In machine learning, data are crucial for training a model, and in the real world we constantly encounter imbalanced data. This section discusses prior work on the effectiveness of machine learning techniques on clinical datasets, most of which are inherently imbalanced. Various algorithms have been designed to mitigate the consequences of imbalance. Popular algorithms for balancing datasets are studied and analyzed, and different machine learning techniques are then applied to assess their performance.
Undersampling of majority instances and random oversampling (ROS) of minority instances change the class distribution of the original dataset. To overcome the drawbacks of these elementary sampling techniques, namely the overfitting risk of oversampling and the risk of information loss in undersampling, the Synthetic Minority Oversampling Technique (SMOTE) was introduced [29].
M. Mostafizur Rahman and D. N. Davis proposed a modified cluster-based undersampling method for balancing the data, which generates a good-quality training set for constructing classification models [17]. SMOTE offers a new technique for oversampling. The combination of undersampling and SMOTE gives better performance than plain undersampling. SMOTE was applied to various datasets with varying degrees of imbalance and varying amounts of training data, which provides a diverse test field [29].
Adaptive Synthetic (ADASYN) sampling adaptively produces synthetic data samples for the minority classes to decrease the bias introduced by the imbalanced data distribution. Moreover, learning performance is improved because ADASYN shifts the decision boundary to concentrate more on examples that are difficult to learn [12].
With the help of data sampling and deep neural networks, fraud can be detected in highly imbalanced big data. Random undersampling (RUS), random oversampling (ROS), and a combination of the two (ROS-RUS) were implemented to learn how different levels of class imbalance influence the training and performance of the model. ROS-RUS and ROS outperform RUS and the baseline models, with average Area Under Curve (AUC) scores of 0.8505 and 0.8509, respectively. The results confirm that the default decision threshold of 0.5 is not optimal when the training data are imbalanced, and it is recommended that the threshold be tuned to optimize performance on imbalanced classes [30].
In undersampling based on clustering (SBC), all samples in the dataset are first divided into clusters. SBC-based sampling methods then select majority class samples from each cluster based on the distance between minority and majority class samples. SBC has a very fast execution time along with high classification accuracy in predicting minority class samples [31].
TOMEK links can be applied as a data cleaning technique to an oversampled training set to create better-defined class clusters. Instead of removing only the majority class examples that form TOMEK links, instances from both classes are eliminated.
First, the original dataset (a) is oversampled with SMOTE (b); then, TOMEK links are identified (c) and removed, generating a balanced dataset with well-defined class clusters (d). The motivation behind SMOTE + ENN (Edited Nearest Neighbor) is similar to that of SMOTE + TOMEK links. ENN provides more in-depth data cleaning because it removes more instances than TOMEK links do. Unlike the undersampling method Neighborhood Cleaning Rule (NCL), ENN is used here to eliminate instances of both classes: any instance that is misclassified by its three nearest neighbors is removed from the training set [32]. SMOTE has over one hundred variants [33]. Hien M. Nguyen et al. proposed a technique in which an SVM is applied to the original dataset to distinguish between the classes, and B-SMOTE is used to find the minority samples near the hyperplane in order to eliminate these samples [34]. The support vector machine (SVM) was first introduced by Vapnik in 1995 and has been very successful in a wide range of applications, but its performance drops significantly on imbalanced data. SVM works well with linear as well as nonlinear datasets. The important training tuples, known as support vectors, help form a hyperplane that defines the data separation in a higher dimensional space [35]. Prominent classification techniques are used to classify the datasets. A. Endo et al. [8] implemented seven classifiers, namely Artificial Neural Network (ANN), Decision Trees with naive Bayes, Naive Bayes, Bayes Net, Logistic Regression (LR), ID3, and J48, and obtained the maximum accuracy with the logistic regression model. A decision tree (DT) has a flow-chart-like structure in which every node denotes a test on an attribute value, each branch represents an outcome of the test, and the leaf nodes represent classes.
A decision tree performs classification with little computation and easily generates understandable rules [36]. Gaussian Naive Bayes (GNB) is used when most of the attributes in a dataset are continuous; the algorithm assumes that the predictor values are samples from a Gaussian distribution [37]. The k-Nearest Neighbor (k-NN) prediction model is generally regarded as a lazy learning (no learning) approach; it predicts on the basis of the k nearest neighbors of the query point [37]. An Artificial Neural Network (ANN) is formed from artificial neurons, each of which receives input, alters its internal state (activation) according to the input, and produces output [38]. From this brief literature review, it can be inferred that no single balancing algorithm can be considered the state of the art for all datasets in all circumstances. Likewise, no single machine learning technique can be placed at the top of the hierarchy in terms of performance; each can produce the best results in domain-specific applications. A summary of significant related works on balancing techniques is given in Table 3.

Description of Data-Balancing Algorithms
The prime focus of our study is to analyze various balancing techniques over five clinical datasets with varying degrees of imbalance. In our experiment, we used seven balancing techniques, Undersampling, ROS, SMOTE, ADASYN, SVM-SMOTE, SMOTEEN, and SMOTETOMEK, to balance the datasets. After balancing, six machine learning techniques, LR, DT, SVM, GNB, k-NN, and ANN, are applied to the five clinical datasets: Breast Cancer Disease, Indian Liver Patient Dataset (ILPD), Kidney Disease, Coronary Heart Disease (CHD), and Pima Indians Diabetes.

Undersampling
In undersampling, randomly selected samples are deleted from the majority class of the training dataset, but random undersampling may throw out a potentially large number of samples. Because of the discarded samples, it can become very difficult to define the decision boundary between minority and majority instances, and classification performance is reduced as a result. Algorithm 1 shows the pseudo code for the undersampling approach.
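The idea of random undersampling can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the pseudo code of Algorithm 1; the function name `random_undersample`, the `seed` parameter, and the choice to shrink every class to the size of the smallest class are assumptions.

```python
import random
from collections import defaultdict

def random_undersample(X, y, seed=0):
    """Randomly drop samples until every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    n_min = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        # Sample without replacement; the minority class is kept in full.
        keep.extend(rng.sample(indices, n_min))
    keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: 9 majority and 3 minority samples.
X = [[i] for i in range(12)]
y = [0] * 9 + [1] * 3
X_bal, y_bal = random_undersample(X, y)
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

In this toy case six of the nine majority samples are discarded, which illustrates the information loss discussed above.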

Random Oversampling
In random oversampling, samples are chosen at random from the minority class, with replacement, and added to the training dataset. In other words, instances of the minority class are duplicated in the training dataset, which may cause some machine learning techniques to overfit. Algorithm 2 shows the pseudo code for the oversampling approach.
Many studies have observed that random selection of samples performs as well as, if not better than, many procedures in which samples are removed deliberately. Figure 2 illustrates the undersampling and oversampling strategies for class balancing.
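Random oversampling can be sketched in the same style. This is an illustrative sketch rather than the pseudo code of Algorithm 2; the function name `random_oversample`, the `seed` parameter, and the choice to grow every class to the size of the largest class are assumptions.

```python
import random
from collections import defaultdict

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    n_max = max(len(indices) for indices in by_class.values())
    chosen = []
    for indices in by_class.values():
        chosen.extend(indices)
        # Draw the missing samples with replacement (exact duplicates).
        chosen.extend(rng.choices(indices, k=n_max - len(indices)))
    return [X[i] for i in chosen], [y[i] for i in chosen]

# Hypothetical toy data: 9 majority and 3 minority samples.
X = [[i] for i in range(12)]
y = [0] * 9 + [1] * 3
X_bal, y_bal = random_oversample(X, y)
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

Because every added sample is an exact copy of an existing minority sample, no new information enters the training set, which is the source of the overfitting risk.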

SMOTE
Consider an imbalanced dataset in which the minority samples are far fewer than the majority samples, and in which each sample is represented by a feature vector in a vector space. For every minority sample x_i, its k nearest neighbors are selected from among the minority samples, and one of these neighbors, n, is chosen at random. A point is then chosen at random on the line segment between n and x_i. This point, syn, is the new synthesized sample, which is added to the dataset. Bal is the balancing parameter that controls the number of synthesized samples; Bal = 1 indicates an equal number of samples in the minority and majority classes. G_all is the total number of samples to be synthesized, while G denotes the number of samples to be synthesized from one minority sample. The synthesis of minority samples from x_i is repeated G times. Algorithm 3 displays the pseudo code for SMOTE [29].

Algorithm 3: Pseudo code of SMOTE
Input: X (original training data), bal (balance parameter), k (number of nearest neighbors)
1. S_min ← a set of minority samples in X
2. S_maj ← a set of majority samples in X
3.
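The synthesis step described in this section can be sketched as follows. This is a hedged illustration, not a reproduction of Algorithm 3: the function name `smote` and the `seed` parameter are assumptions, and the per-sample count G is passed in directly as `g` rather than derived from the balance parameter.

```python
import math
import random

def smote(minority, g, k=5, seed=0):
    """Synthesize g new points per minority sample by interpolating
    toward one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x, excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(g):
            n = rng.choice(neighbours)  # randomly selected neighbour n
            gap = rng.random()          # random position on the segment x -> n
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, n)])
    return synthetic

# Hypothetical minority samples on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, g=3, k=2)
print(len(new_points))  # 12
```

Every synthetic point lies on a segment between two existing minority samples, so, unlike random oversampling, no exact duplicates are created.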

SVM-SMOTE
In this method, the borderline area is estimated from the support vectors obtained by training an SVM on the original training set. Artificial data are then generated at random along the lines joining each minority class support vector with a number of its nearest neighbors. In this way, the method establishes a clear boundary between the minority and majority classes [34,40]. Algorithm 5 presents the pseudo code for SVM-SMOTE.

Algorithm 5: Pseudo code of SVM-SMOTE
2. T ← (N/100) × |X|
3. Compute SV+ by training SVMs on X
4. Compute amount by evenly distributing T among SV+
6. For each sv+_i ∈ SV+, compute m nearest neighbors on X.
7. If less than half of the m nearest neighbors come from the negative class, create amount[i] artificial positive instances along the lines joining sv+_i with its k positive nearest neighbors (in first to k-th nearest neighbor order), using the following formula (extrapolate to expand the positive class area):
x+_new = sv+_i + σ × (sv+_i − nn[i][j])
where nn[i][j] is the j-th positive nearest neighbor of sv+_i and σ is a random number in the range [0, 1].
8. Otherwise, use the following formula (interpolate as in SMOTE to consolidate the current boundary area of the positive class):
x+_new = sv+_i + σ × (nn[i][j] − sv+_i)
Stop
Output: X_new, the over-sampled training set
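The choice between extrapolation and interpolation in steps 7 and 8 can be illustrated with a short sketch. It covers only the generation of a single artificial instance and assumes that the positive support vector, one of its positive nearest neighbors, and the number of negative samples among its m nearest neighbors have already been computed. The update rules are the commonly stated form for this method and should be read as an assumption; the function name `svm_smote_point` is hypothetical.

```python
import random

def svm_smote_point(sv, neighbour, n_negative, m, rng):
    """Create one artificial positive instance from a positive support vector.

    sv         -- the positive support vector
    neighbour  -- one of its positive nearest neighbours
    n_negative -- how many of its m nearest neighbours are negative
    """
    sigma = rng.random()  # random number in [0, 1)
    if n_negative < m / 2:
        # Few negatives nearby: extrapolate away from the neighbour
        # to expand the positive class area.
        return [s + sigma * (s - n) for s, n in zip(sv, neighbour)]
    # Many negatives nearby: interpolate toward the neighbour, as in SMOTE,
    # to consolidate the current boundary area.
    return [s + sigma * (n - s) for s, n in zip(sv, neighbour)]

rng = random.Random(0)
print(svm_smote_point([1.0], [0.0], n_negative=1, m=10, rng=rng))  # lies in [1, 2)
print(svm_smote_point([1.0], [0.0], n_negative=8, m=10, rng=rng))  # lies in (0, 1]
```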

SMOTEEN
First, SMOTE determines the k nearest neighbors (k-NNs), denoted by ψ(x_i), of each minority sample x_i ∈ α_min. To generate a synthetic data sample x_new for x_i, SMOTE randomly selects an element x̂_i that lies in both ψ(x_i) and α_min. The feature vector of x_new is the sum of the feature vector of x_i and the vector difference between x̂_i and x_i multiplied by a random value δ between 0 and 1, that is, x_new = x_i + δ (x̂_i − x_i). This yields a synthetic point on the line segment joining x_i and x̂_i. Next, the Edited Nearest Neighbour (ENN) rule is applied to clean the overlap between classes. Algorithm 6 contains the pseudo code for SMOTEEN [33].
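The ENN cleaning step can be sketched as follows, assuming the rule stated earlier in the paper: an instance is removed when the majority vote of its three nearest neighbors disagrees with its own label. The function name `edited_nearest_neighbours` and the toy data are assumptions for illustration; this is not the pseudo code of Algorithm 6.

```python
import math
from collections import Counter

def edited_nearest_neighbours(X, y, k=3):
    """Keep only the samples whose label agrees with the vote of their k nearest neighbours."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(xi, X[j]))[:k]
        vote = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if vote == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: two clean clusters plus one class-1 point
# sitting inside the class-0 cluster.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11],
     [0.5, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
X_clean, y_clean = edited_nearest_neighbours(X, y)
print(len(X_clean))  # 8
```

In SMOTEEN this cleaning is applied to the dataset after SMOTE oversampling, so instances of either class that end up in the overlap region can be removed.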

SMOTETOMEK
SMOTETOMEK is another modified version of SMOTE, in which TOMEK links are used to remove noisy data. Two instances l and m form a TOMEK link if l is the nearest neighbor of m, m is the nearest neighbor of l, and l and m belong to different classes [32]. Algorithm 7 shows the pseudo code for SMOTETOMEK.
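The definition of a TOMEK link translates directly into code. The following sketch finds all such pairs in a small dataset; the function name `tomek_links` and the toy data are assumptions, and the sketch covers only link detection, not the full SMOTETOMEK procedure.

```python
import math

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    def nearest(i):
        others = (j for j in range(len(X)) if j != i)
        return min(others, key=lambda j: math.dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

# Hypothetical one-dimensional toy data.
X = [[0.0], [1.0], [1.2], [5.0], [5.5]]
y = [0, 0, 1, 1, 1]
print(tomek_links(X, y))  # [(1, 2)]
```

Samples 1 and 2 are each other's nearest neighbors and belong to different classes, so they form a link; samples 3 and 4 are also mutual nearest neighbors but share a class, so they do not.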

Description of Classification Methods
An explanation in brief for every classification technique implemented in this study is given below so as to give the fundamental information regarding these classification methods:

Logistic Regression
Logistic regression yields probability estimates rather than hard class predictions [41,42]. It describes the relation between a binary dependent variable and one or more independent variables and is also used for explaining the data. In simpler terms, it models the log-odds of an event as a linear function of a set of predictor variables. The estimated regression model can be represented by Equations (1) and (2).
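For reference, the logistic regression model is commonly written in the following standard form. The symbols β_0, ..., β_n (regression coefficients) and x_1, ..., x_n (predictor variables) are assumptions of this sketch and may differ from the notation of the paper's own equations.

```latex
\begin{align}
p(y = 1 \mid x) &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}} \\
\ln\frac{p}{1 - p} &= \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n
\end{align}
```

The first line gives the predicted probability of the positive class; the second shows that the log-odds are linear in the predictors.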

Decision Tree
A decision tree is a flow-chart-like tree structure in which every internal node represents a test on an attribute, every branch represents an outcome of the test, and every leaf node represents a class distribution. The topmost node in a tree is called the root node. A decision tree can easily produce understandable rules and performs classification with little computation [43]. It is shown in Figure 3.

Support Vector Machine
The support vector machine, a very powerful and widely used classification mechanism, was developed by V. Vapnik [44]. A hyperplane separates the two classes so that they fall on opposite sides of it. The aim is to maximize the margin, that is, the gap between the separating hyperplane and the nearest instances on either side of it.
Equation (3) represents the separating hyperplane.
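In the usual textbook notation, the separating hyperplane and the margin condition for a linearly separable problem can be written as below. The symbols w (weight vector), b (bias), and y_i ∈ {−1, +1} (class labels) are assumptions of this sketch and may not match the paper's own equations.

```latex
\begin{align}
w \cdot x + b &= 0 \\
y_i \,(w \cdot x_i + b) &\ge 1 \quad \text{for all training samples } i \\
\text{margin} &= \frac{2}{\lVert w \rVert}
\end{align}
```

Maximizing the margin is therefore equivalent to minimizing the norm of w subject to the constraints in the second line.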

k-Nearest Neighbour
The k-Nearest Neighbor (k-NN) prediction model is generally regarded as a lazy learning (no learning) approach; it predicts on the basis of the k nearest neighbors of the query point. Generally, the neighborhood is measured using the Euclidean distance [37], but other distance measures such as the Minkowski, Hamming, and Manhattan distances are also used as required [43]. The distance between two points x and y is given by Equation (6).
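A minimal k-NN classifier with Euclidean distance can be sketched as follows. The function name `knn_predict` and the toy data are assumptions for illustration; `math.dist` computes the Euclidean distance between two points.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training samples."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_train, y_train, [0.2, 0.2]))  # 0
print(knn_predict(X_train, y_train, [5.5, 5.5]))  # 1
```

No model is fitted in advance: all computation happens at prediction time, which is why k-NN is called a lazy learner.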

Gaussian Naïve Bayes
Gaussian Naïve Bayes is used if most of the attributes in the examples are continuous. The conditional probability is given by Equation (7), where µ_y and σ_y² are the mean and variance of the predictor distribution for class y.
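The Gaussian class-conditional likelihood is usually written as follows. Here x_i denotes the value of one continuous predictor and y the class; this standard form is given as a sketch and its notation may differ from that of the paper's own equation.

```latex
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}}
\exp\!\left(-\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}}\right)
```

Under the naive independence assumption, the likelihood of a full feature vector is the product of these per-predictor terms.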

Artificial Neural Network
Artificial Neural Networks simplify and imitate the behavior of the brain. An ANN is a network of units known as artificial neurons, which receive input, change their internal state (activation) according to that input, and produce output depending on the input and activation [38,43].

Bias (b): It helps shift the curve of the activation function.
Input Layer: The input layer incorporates the inputs and weights.
Activation Function: The activation function is a very important component, which gives the neural network its nonlinear characteristics. It converts the input of an artificial neuron (AN) into its output, which then serves as input to the next layer of ANs [45,46]. There are many activation functions, such as the sigmoid function given in Equation (8).
Hidden Layer: An ANN may have many hidden layers. Each hidden layer performs both summation and activation.
Output Layer: The output layer holds the set of outcomes generated by the preceding layer.
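The summation and activation performed by a single artificial neuron can be sketched as below, using the sigmoid activation. The function names `sigmoid` and `neuron_output` and the example weights are assumptions for illustration.

```python
import math

def sigmoid(z):
    """Sigmoid activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical inputs and weights: 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0
print(neuron_output([1.0, 2.0], [0.5, -0.25], bias=0.0))  # 0.5
```

Changing the bias shifts the point at which the neuron's output crosses 0.5, which is the sense in which the bias modifies the activation curve.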

Confusion Matrix (CM):
The confusion matrix is a tabular summary of the performance of a classification model [43]. The diagonal entries count the instances for which the learning algorithm gives the correct result.
True Positive (TP): Instances whose true class is positive and which are also predicted as positive.
False Positive (FP): Instances that are negative but are wrongly classified as positive by the learning algorithm.
True Negative (TN): Instances that are actually negative and are also predicted as negative.
False Negative (FN): Instances that are positive but are wrongly classified as negative by the learning algorithm.
Accuracy: The proportion of all correct results (true positives and true negatives) to the total number of cases checked.
Precision: How trustworthy the model's positive predictions are.
Recall: The ability of the model to detect the positive class.
F-Score/F-Measure: It combines precision and recall into a single value for the assessment of the classifier.
Put more simply:
1. Accuracy alone is not a sufficient metric to evaluate a classification model; at times it is misleading.
2. High recall and high precision: this is a good model.
3. Low recall and high precision: the model cannot detect the class well, but it is highly trustworthy when it does.
4. High recall and low precision: the model detects the class but also includes points of other classes.
5. Low recall and low precision: a poor model.
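The four measures can be computed from the confusion matrix counts using their standard definitions, as in the sketch below. The function name `classification_metrics` and the example counts are assumptions for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 score from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical imbalanced case: 10 positives, 90 negatives,
# and a model that detects only one of the positives.
metrics = classification_metrics(tp=1, fp=0, tn=90, fn=9)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example the accuracy is 0.91 although the recall is only 0.10, which illustrates point 1 above: on imbalanced data, accuracy alone can be misleading.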

Experimental Setup
To accomplish a comprehensive empirical performance analysis of different classifiers with several data-balancing techniques over the clinical datasets, experiments were conducted to evaluate the efficiency and effectiveness of the algorithms in terms of classifier accuracy (CA), precision, recall, and F1 score (F-measure). All experiments were conducted in the Python programming language in the Google Colab environment, which runs entirely in the cloud. Figure 4 depicts the experimental workflow of the proposed work.

Clinical Datasets
Clinical datasets are medical records collected from different patients for a specific disease. They are beneficial for providing cost-effective solutions for healthcare and medical diagnosis software systems. The five clinical datasets in this study, Breast Cancer Disease, Indian Liver Patient, Chronic Kidney Disease, Coronary Heart Disease, and Pima Indians Diabetes Database, were downloaded from the UCI Machine Learning repository; their features, instances, imbalance ratio (IR), and degree of imbalance are detailed in Table 4.

Results and Discussion
The experiments were conducted to evaluate seven balancing techniques and six classification techniques over five class imbalanced clinical datasets, as described in Table 4. Figure 5a-e demonstrates the effect of applying the various data-balancing methods. The classification results were assessed on the basis of well-known performance measures, namely accuracy, precision, recall, and F1 score.

Breast Cancer Disease dataset
The breast cancer disease dataset was first preprocessed, and then each of the seven data-balancing procedures-undersampling, random oversampling, SMOTE, ADASYN, SVM-SMOTE, SMOTEEN, and SMOTETOMEK-was applied separately. As illustrated in Figure 4, the balanced dataset was then tested against six significant classifiers. The following observations were noted:

• The balancing technique SMOTEEN with k-NN, SVM, LR, and ANN shows accuracies of 99.8%, 99.5%, 99.1%, and 99.1%, respectively. This is a 3% increase in accuracy compared to classification without data balancing (refer to Figure 6).
• The precision of both SVM and ANN with SMOTEEN was 100%. LR and k-NN also show a comparable precision of 99.5% (refer to Figure 7).
• Recall varies from 97.2% to 100% across the classifiers when SMOTEEN was applied. SVM achieved 100% recall on the BCD dataset (refer to Figure 8).
• The F1 scores of k-NN, SVM, and ANN with SMOTEEN were 99.8%, 99.5%, and 99.1%, respectively (refer to Figure 9).
• Thus, SMOTEEN provides the highest accuracy, recall, precision, and F1 score on BCD across the machine learning techniques, with k-NN in particular outperforming all others.

Indian Liver Patient Dataset
The ILPD dataset was processed in the same way as the BCD dataset. The following observations were noted (refer to Figure 10):
• Undersampling underperforms with all the classification methods on ILPD because significant data are lost during balancing.
• Compared to the other six data-balancing techniques, SMOTEEN shows better precision for GNB, DT, k-NN, LR, and ANN, with 94.8%, 89.5%, 89.5%, 89.5%, and 88.5%, respectively (refer to Figure 11).
• Likewise, recall with SMOTEEN was 86.7% for k-NN and DT and 83.7% for LR, whereas SVM, GNB, and ANN give low values.
• The F1 score with SMOTEEN is also high, 88.1%, for both k-NN and DT (refer to Figure 12), whereas LR, GNB, and ANN perform worse, with F1 scores of 84.5%, 83.4%, and 78.4%, respectively (refer to Figure 13).
• Thus, the experimental analysis indicates that SMOTEEN with k-NN is the most suitable combination for detecting liver disease compared to the other six balancing techniques. SMOTEEN with Decision Tree (DT) showed nearly equal performance on the ILPD dataset.

Chronic Kidney Disease Dataset
When the Chronic Kidney Disease dataset was processed in the same way as the BCD and ILPD datasets, the following observations were noted (refer to Figure 14):
• ROS outperformed all the other balancing techniques across all the machine learning algorithms in terms of precision (refer to Figure 15).

Coronary Heart Disease dataset
When the CHD dataset was processed, the following observations were noted:
• k-NN with SMOTEEN gives the highest accuracy, 92.2%, and DT with SMOTEEN gives 84%, compared to all other classifiers and balancing techniques (refer to Figure 18).
• SMOTEEN gives the highest precision, 90%, with k-NN, but DT, GNB, and SVM also perform well (refer to Figure 19).
• SMOTEEN gives the highest recall, 98.6%, with k-NN, but GNB and ANN underperform on CHD (refer to Figure 20).
• SMOTEEN with k-NN reported the highest F1 score, 94.1%, whereas DT, SVM, and LR with SMOTEEN reached F1 scores of 87.6%, 82.8%, and 82.5%, respectively (refer to Figure 21).

Pima Indians diabetes dataset
When the diabetes dataset was processed with the proposed experimental setup, the following observations were noted:

It is quite evident from the result analysis that the SMOTEEN balancing method often performed better than the other six data-balancing techniques on all five clinical datasets. This is because SMOTEEN combines oversampling and undersampling, using SMOTE and Edited Nearest Neighbors (ENN). ENN tends to remove more instances than Tomek links do. It eliminates cases from all classes, so any case that is misclassified by its three nearest neighbors is removed from the training set. In many cases, undersampling underperformed because it discarded potentially useful instances from the clinical datasets. ROS also underperformed with several classifiers because it makes exact copies of existing examples, which led the model to overfit. SMOTE moderately underperformed SMOTEEN in some cases because of its lack of flexibility and its tendency to overgeneralize. Rather than simply replicating existing minority cases, SMOTE takes instances in feature space for each target class and its neighbors and creates new instances that combine the attributes of the target case with those of its neighbors.
ADASYN is a slight improvement over SMOTE: it adds a small random value to the generated points to make them more realistic.
SVM-SMOTE focuses on producing new minority class samples near the dividing line found by the SVM, to help establish the borderline between the classes. Thus, wherever overfitting did not occur, SVM-SMOTE gave comparable results. Tomek links are pairs of instances from opposite classes that are each other's nearest neighbors. Usually, the majority class instances in these links are eliminated, as this is thought to increase class separation close to the decision boundary. Here, instead of removing instances solely from the majority class, instances of both classes in the Tomek links are removed. Consequently, this removal is sometimes inappropriate and leads to poor results.

Conclusions
The classification of data into specified class labels has always been a great challenge, and it is even harder when dealing with imbalanced data. In this study, we implemented seven balancing techniques (Undersampling, Random oversampling, SMOTE, ADASYN, SVM-SMOTE, SMOTEEN, and SMOTETOMEK) and six disease prediction techniques (Logistic Regression, Decision Tree, Support Vector Machine, k-Nearest Neighbor, Gaussian Naïve Bayes, and Artificial Neural Network) over five clinical datasets, namely BCD, ILPD, CKD, CHD, and Pima Indians Diabetes Database.
• SMOTEEN with k-NN provided the highest accuracy, recall, precision, and F1 score of all the machine learning techniques on the BCD dataset and achieved a 3% increase in accuracy compared to classification without data balancing.
• SMOTEEN with k-NN was found the most suitable for detecting liver disease.
• Moreover, k-NN with SMOTEEN gives the highest accuracy, 92.2%, on coronary heart disease compared to all other classifiers and balancing techniques.
• As far as the diabetes dataset is concerned, SMOTEEN with k-NN was found the most suitable, with an accuracy of 96.2%.
• SMOTE with Logistic Regression (LR) gives the highest value of accuracy, 99.2%, over the CHD dataset.
The performance of these balancing algorithms has been observed, and it is concluded that no single balancing technique generates the best results over all the datasets. If dataspace is important, then machine learning techniques cannot be ignored, and the balancing algorithms are equally important.