Effective Class-Imbalance Learning Based on SMOTE and Convolutional Neural Networks

Imbalanced Data (ID) is a problem that prevents Machine Learning (ML) models from achieving satisfactory results. ID arises when the number of samples belonging to one class outnumbers that of the other by a wide margin, biasing the learning process of such models towards the majority class. In recent years, several solutions have been put forward to address this issue, which opt either for synthetically generating new data for the minority class or for reducing the number of majority class samples to balance the data. Hence, in this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), mixed with a variety of well-known imbalanced-data solutions, namely oversampling and undersampling. To evaluate our methods, we used the KEEL, breast cancer, and Z-Alizadeh Sani datasets. In order to achieve reliable results, we conducted our experiments 100 times with randomly shuffled data distributions. The classification results demonstrate that the mixed Synthetic Minority Oversampling Technique (SMOTE)-Normalization-CNN outperforms the other methodologies, achieving 99.08% accuracy on the 24 imbalanced datasets. Therefore, the proposed mixed model can be applied to imbalanced binary classification problems on other real datasets.


Introduction
Learning a classifier from an imbalanced dataset is an important topic and still a complicated problem in supervised learning. In other words, class imbalance is a customary, long-standing challenge in classification problems (1)(2)(3)(4)(5), which deals with a dataset that contains an asymmetrically larger number of samples of the majority class. Imbalanced datasets appear in a vast range of real-world research, such as life sciences (6), facial age approximation (7), anomaly detection (8), detecting counterfeit credit card transactions (9), medical imaging (10), DNA sequence identification (11), and so forth. For an imbalanced binary classification problem, samples are typically characterized by two classes, namely majority and minority. In general terms, the minority class often contains samples of higher importance and interest than the majority class. Nevertheless, compared to the minority class, the majority class usually has a significantly larger number of samples, and sometimes the disparity may be extreme. Different situations can occur when confronting imbalanced datasets; four common cases are depicted in Figure 1, where the blue-filled circles represent the samples of the majority class while the red circles denote the minority class (12). It has been shown that the type of data complexity is the principal determining factor of classification performance reduction (13). Most of the classical classification methods, like decision trees (13)(14)(15), KNN (16,17), and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) (18,19), train models that maximize overall accuracy, sometimes ignoring the minority class (20)(21)(22). Hence, several techniques have been designed and implemented to handle imbalanced binary classification problems. Among these techniques, oversampling and undersampling are well known (1,23,24,25). The common undersampling and oversampling algorithms modify the initial class distribution of the dataset by excluding majority class samples or expanding minority class samples. Cost-sensitive learning algorithms are another solution to the above-mentioned issues of imbalanced data (26)(27)(28). Such algorithms assign misclassification costs to the classes, mainly lower costs to the samples of the majority class and higher costs to the minority class. In addition, Bagging (29) and Boosting (30) methods, which are based on ensemble learning, are among the other commonly used methods to handle imbalanced class problems (18,19,31,32).

In this paper, we use several undersampling and oversampling methods in the process of implementing our methodology, which are briefly introduced in the sequel:
1) RUS: Among undersampling methods, random undersampling (RUS) is the simplest one, in which samples of the majority class are randomly removed until suitably balanced data is obtained (33).
2) Tomek Links: Some undersampling techniques focus on overlap elimination. The Tomek Links (34) method, which is a modification of the Condensed Nearest Neighbor rule, is one of these methods.
3) One-Sided Selection: As a development of the Tomek Links algorithm, the One-Sided Selection (OSS) method (35) merges Tomek Links and the Condensed Nearest Neighbor algorithms.
4) NearMiss: NearMiss is another popular undersampling method that removes majority class samples. When two samples classified in different classes are very close to each other, it removes the sample belonging to the larger class (16).
5) ROS: Among the oversampling algorithms, random oversampling (ROS) is the simplest one; it merely selects and copies samples from the minority class randomly, leading to more balanced data (36).
6) SMOTE: The best-known oversampling method is the Synthetic Minority Oversampling Technique (SMOTE) (24,37,38), which leverages the kNN algorithm to identify the neighbors of minority class samples and generates a new sample by randomly selecting one of the k neighbors (39).

It is worth noting that the methods mentioned above may cause some unexpected issues. For example, undersampling techniques may discard valuable data that could be vital for training a classifier. In contrast, oversampling algorithms may cause overfitting. Also, for cost-sensitive learning techniques, it is not straightforward to determine the exact misclassification cost, and different misclassification costs might induce different outcomes. Moreover, Bagging and Boosting algorithms may exclude some valuable data because they resample in every single iteration, and they may face overfitting. Consequently, the classification results obtained by these methods are not stable. To address these problems, this paper proposes two DL-based methods mixed with different resampling methods for better tackling the issue of an imbalanced dataset. The existing DL-based methods, especially CNN architectures, have been employed in a wide variety of challenges, and they have proven to be extremely powerful in terms of learning balanced datasets; however, their efficacy has not been satisfactorily investigated when tackling imbalanced datasets (40). CNNs are architectures that contain convolutional blocks and can provide an end-to-end classification algorithm. These blocks are a stack of different layers, namely convolutional layers, pooling layers, and activation functions. The most significant attributes of such models are their learning capacity with fewer parameters and their translational invariance with respect to the input data. In CNNs, the input data are fed to multiple convolutional blocks, named the backbone as a whole, and then followed by a sequence of fully connected layers to be classified.
The training procedure is done using Focal Loss (FL), which optimizes the abstractions learned by the models to better handle complex samples. In particular, the main contributions of this paper are as follows. First, 24 popular imbalanced datasets from the KEEL Dataset Repository, the breast cancer dataset from KDD Cup, and the Z-Alizadeh Sani dataset are chosen, and the proposed pipeline is trained and validated 100 times with the object of achieving more reliable results. Second, to the best of our knowledge, this paper is the first to mix SMOTE with normalization and CNN/DNN models for imbalanced binary classification.

Related Work

One development of SMOTE augments it with the Local Outlier Factor (SMOTE-LOF) to filter noisy synthetic samples; unlike for small data samples, for a large-scale dataset with a small imbalance ratio, SMOTE-LOF outperforms SMOTE. In order to remove the overlap between the majority class and the minority class in an imbalanced dataset and obtain a balanced and normalized class distribution, the study (39) implemented two novel density-based methods: density-based undersampling (DB_US) and density-based hybrid sampling (DB_HS). The first method applies merely an undersampling algorithm, while the second implements both undersampling and oversampling approaches. In addition, the balanced datasets were modeled employing Random Forest (RF) and Support Vector Machine (SVM) classifiers. As a result, the two proposed methods eliminated high-density samples from the majority class and omitted the noise of both classes. The performance of these methods was examined on 16 imbalanced datasets. In the literature (51), the authors proposed a novel classification method called the Bagging Supervised Autoencoder Classifier (BSAC) to model credit scoring problems. This algorithm leverages a supervised autoencoder based on the principles of multi-task learning. Also, BSAC tackles the issue of imbalanced datasets by engaging a variant of the Bagging procedure based on undersampling techniques. Experiments on benchmark and real-world credit scoring datasets show the robustness and efficiency of BSAC. To improve the performance of the basic antlion optimization (ALO), a novel modified antlion optimization method (MALO) was introduced in (52). This algorithm adds an extra variable that depends on the step size of the ants when revising the antlion position. Also, MALO is adapted to sample-reduction problems to achieve better performance under various metrics. MALO was examined on several benchmarks and on balanced and imbalanced datasets. The results show that MALO outperforms the basic ALO method and some other comparable algorithms. Yang et al. (53) implemented a sampling-level technique called the gravitational balanced multiple kernel learning (GBMKL) algorithm, which uses the gravity approach to produce gravitation-balanced midpoint samples (GBMS) placed on the classification boundary. Moreover, to improve generalization, the classification boundary was modified according to the nearest neighbors of the boundary (NNB) samples. Finally, two regularization terms corresponding to GBMS and NNB were formulated to prevent overfitting. The resulting method was examined on 54 artificial and real-life imbalanced datasets, and the outcomes show the dominance of the implemented method. Tanimoto et al. (54) studied near-miss positive samples in imbalanced datasets. They showed that if the true positive samples are severely limited, the accuracy of the proposed model can be increased by obtaining modified label-like side information on positivity to identify near-miss samples among the true negatives. The proposed method follows learning using privileged information, which leverages side information for training the desired model without predicting the side information itself. The experiments show that the method outperforms existing algorithms. The research study (55) proposed a new development of SMOTE by merging it with the Kalman filter. After applying SMOTE to the given dataset, the implemented algorithm, called Kalman-SMOTE (KSMOTE), excludes the noisy samples in the resulting dataset, which simultaneously contains the initial data and the synthetically added samples. The method was examined on a broad range of datasets, and the results show that it outperforms the existing methods. Since oversampling techniques usually cannot achieve high performance in the presence of noise, the study (56) implemented an innovative oversampling algorithm, called IR-SMOTE, that handles this issue. By sorting the majority class samples and using the k-means clustering algorithm, the noise in minority class clusters is eliminated. After that, using kernel density estimation, the number of synthetic samples is compatibly assigned to each cluster. Finally, building on random-SMOTE, the algorithm was improved to add new samples with ensured diversity. The literature (40) studied the performance of convolutional neural networks (CNNs) in the presence of imbalanced data for classification problems. To explore this impact, the research used MNIST, CIFAR-10, and ImageNet as benchmarks, alongside undersampling, oversampling, two-phase training, and thresholding. The results show that imbalanced data has a detrimental effect on the performance of CNNs. Also, one should apply oversampling to the level that removes the imbalance, while the extent of imbalance determines the ideal undersampling ratio. In addition, oversampling does not lead to overfitting of CNNs. Fault diagnosis of complex equipment, which plays an important role in industry, is a crucial technology, and CNNs are a general tool for this purpose. In this setting, faults are rare, which leads to imbalanced data, and therefore one cannot apply CNN methods directly. To address this problem, a hierarchical training CNN is implemented in (57). First, the method uses a number-resampling technique to balance the data. Then, a magnet-loss pretraining algorithm is provided to handle the overlap between diverse faults. The proposed method was examined on the public CWRU dataset with an accuracy of 94.28%.

Methodology
In this paper, we applied our methods to various datasets collected from benchmark repositories, namely the KEEL, breast cancer, and Z-Alizadeh Sani datasets, in order to address the class imbalance problem. Figure 2 demonstrates an overview of our proposed methodology, whose details are included in this section.

Fig. 2. An overview of our proposed methodology.
Based on Figure 2, the main steps in our methodology include preprocessing, classification, and analysis of models.

Dataset Preprocessing
As stated before, the most acute problem in classifying imbalanced data is that classifiers become biased towards the majority class. There are several methods to overcome this issue, which are generally called resampling techniques. By adding minority class samples or removing samples from the majority class, resampling turns the data into more balanced data. In this regard, there are two principal approaches: oversampling and undersampling. Oversampling algorithms generate new samples, duplicated or synthetic, that belong to the minority class. In contrast, undersampling techniques delete samples that belong to the majority class to bring balance to the dataset (33). As a preprocessing step in our methodology, we have utilized various well-known oversampling and undersampling techniques for balancing the dataset. Normalization and dataset splitting are the next steps in data preprocessing. These are elaborated in the following.

Oversampling techniques 1) Random Over-Sampling (ROS)
The first and simplest method in this field is random oversampling (ROS), which rebalances the dataset distribution by increasing the number of samples in the minority class until the class distributions tend towards balance. This approach is non-heuristic, meaning that it makes no intelligent decisions about the decision boundary. Random oversampling is usually applied to the level that removes the imbalance. By merely duplicating samples from the minority class, ROS achieves balance in the training data. However, duplicating similar samples may lead to overfitting, particularly for the samples belonging to the minority classes (36,58). Figure 3 shows an illustration of the oversampling technique.

Fig. 3. Illustration of Random Over-Sampling.
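As an illustration, the following is a minimal sketch of ROS using the imbalanced-learn library; the toy arrays X and y are placeholders, not the data used in the paper.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler

# Toy imbalanced data: 90 majority (label 0) vs. 10 minority (label 1) samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)

# ROS duplicates randomly chosen minority samples until the classes balance.
ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)
print(np.bincount(y_res))  # -> [90 90]
```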

2) Synthetic Minority Oversampling Technique
Synthetic Minority Oversampling Technique (SMOTE) (24,59,60,61) is another resampling technique; it increases the number of minority class samples by creating synthetic samples in the minority class, and it is applied for balancing datasets with a highly unbalanced ratio. In order to avoid the issue of overfitting, the synthetic generation of new samples differs from mere duplication. The main idea behind SMOTE is to generate new samples of the minority class by interpolation between samples of this class that are in close vicinity of each other (15,62). Thus, SMOTE increases the number of minority class examples within an imbalanced dataset and consequently enables the classifier to achieve better generalizability. The formal procedure for SMOTE can be explained as follows. Firstly, N, the desired amount of oversampling, should be set to an integer number; it can be chosen so that the dataset becomes balanced with a ratio of 1:1 between the classes. Then, three main steps are taken iteratively: 1) randomly select a sample that belongs to the minority class; 2) select the K (default 5) nearest neighbors of this sample; 3) select N of these K neighbors randomly for interpolation and generation of new samples (63). An intuition of how SMOTE works is shown in Figure 4.
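To make these steps concrete, here is a minimal NumPy sketch of the SMOTE interpolation loop (an illustration, not the exact implementation used in the paper):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolation (a sketch)."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each sample's nearest neighbour is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))   # step 1: random minority sample
        j = rng.choice(idx[i, 1:])     # step 2: one of its k nearest neighbours
        gap = rng.random()             # step 3: interpolate between the pair
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic
```

In practice, the SMOTE class from the imbalanced-learn library implements the same procedure through its fit_resample method.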

Undersampling techniques 1) Random Under-Sampling
The simplest technique among undersampling methods is Random Under-Sampling (RUS), which is a data-level approach. Here, the algorithm reduces the number of majority class samples to balance the data. In RUS, we randomly select samples within the majority class and delete them, which makes the distribution of a class-imbalanced dataset with a highly unbalanced ratio more balanced. RUS is a non-heuristic approach that does not behave as intelligently as some other algorithms. Its main drawback is the high probability of losing valuable information within a dataset (15). More precisely, the principal issue with this method is that there is no control over what information about the majority class is being thrown away. As a result, samples that carry information about the decision boundary may be removed, and that valuable information is lost (33). An overview of RUS is shown in Figure 5.
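A minimal sketch of RUS with imbalanced-learn follows; the toy arrays are placeholders.

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # toy features
y = np.array([0] * 90 + [1] * 10)    # 90 majority vs. 10 minority labels

# RUS discards randomly chosen majority samples until the classes balance.
rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)
print(np.bincount(y_res))  # -> [10 10]
```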

2) Tomek Links
Tomek Links (TL) (34) is another effective undersampling technique used for balancing the data. TLs are pairs of samples that are very close to each other but belong to different classes. These samples are contiguous to the borderline between classes. In mathematical language, given a pair of samples $(x_i, x_j)$ from different classes, the pair forms a TL if there is no third sample $x_k$ satisfying either of the following inequalities:

$d(x_i, x_k) < d(x_i, x_j)$ or $d(x_j, x_k) < d(x_i, x_j)$,

where $d(\cdot, \cdot)$ is the distance between two samples (64). Generally, one of the two samples that form a TL is considered a noisy sample, or the two samples together are considered borderline (15). In this case, by eliminating the samples of the majority class that belong to the pairs forming TLs, the distance between the two classes increases, and the dataset becomes more balanced (33). See Figure 6, which shows how TLs can be used to reduce the number of samples in the majority class.
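The definition above is equivalent to requiring that the two samples be mutual nearest neighbors with different labels. The following sketch uses that equivalence to flag TL members (an illustration; imbalanced-learn's TomekLinks class performs the majority-class removal step for you):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_mask(X, y):
    """Boolean mask of samples that participate in a Tomek link (a sketch)."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]                 # nearest neighbour of each sample
    mask = np.zeros(len(X), dtype=bool)
    for i, j in enumerate(nearest):
        # A pair (i, j) is a Tomek link when the samples are mutual nearest
        # neighbours and carry different labels.
        if nearest[j] == i and y[i] != y[j]:
            mask[i] = mask[j] = True
    return mask
```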

3) One-Sided Selection (OSS)
One-Sided Selection (OSS) (35) is an undersampling technique whose main idea is to combine TL and the Condensed Nearest Neighbor rule. To address the issue of imbalanced datasets, this approach leaves the minority class samples completely intact. It filters out the redundant samples in the majority class through a modification of the Condensed Nearest Neighbor rule (65).
In OSS, $d(x, y)$ is the distance between a sample $x$ chosen from the majority class and a sample $y$ selected from the minority class, such that the pair meets the requirements for being a TL. Two scenarios can then occur: 1) the TL lies on the class boundary when both $x$ and $y$ lie in their correct class regions; 2) the TL lies inside one of the class regions when either $x$ or $y$ lies in the wrong region. OSS was introduced to decrease the number of majority class samples by omitting the data points that are borderline or noisy (66). Figure 7 illustrates a diagram of the OSS technique.
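For reference, imbalanced-learn ships an OSS implementation; a minimal usage sketch with placeholder toy data:

```python
import numpy as np
from imblearn.under_sampling import OneSidedSelection

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)

# OSS keeps the minority class intact: Condensed Nearest Neighbour drops
# redundant majority samples, then Tomek-link removal drops borderline ones.
oss = OneSidedSelection(random_state=0)
X_res, y_res = oss.fit_resample(X, y)
```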

4) NearMiss
The last undersampling technique used in this study is NearMiss (16). This method is based on the K-nearest neighbors algorithm and comes in three variants: NearMiss-1, NearMiss-2, and NearMiss-3. The main idea behind NearMiss is to consider the mean distances from samples of the majority class to samples of the minority class.

Contrary to randomly removing samples from the majority class, these methods eliminate samples in an informed way. NearMiss-1 retains the majority class samples whose mean distances to the three nearest samples of the minority class are smallest. NearMiss-2 retains the majority class samples with the smallest mean distances to the three farthest minority samples. Finally, NearMiss-3 selects a certain number of the closest majority class samples for every minority class sample (67).

As claimed in (16), experimental results showed that NearMiss-2 performs better than NearMiss-1 and NearMiss-3; it also outperforms the RUS technique (33).

It is worth noting that NearMiss can be fine-tuned in two respects: the variant, chosen from 1, 2, and 3, and the number of neighbors used when calculating the mean distances (three by default). An outline of the NearMiss algorithms is shown in Figure 8.
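Both knobs map directly onto imbalanced-learn's NearMiss class, as the following sketch (with placeholder toy data) shows:

```python
import numpy as np
from imblearn.under_sampling import NearMiss

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)

# version picks NearMiss-1/2/3; n_neighbors (default 3) is the number of
# minority neighbours over which the mean distances are computed.
nm = NearMiss(version=2, n_neighbors=3)
X_res, y_res = nm.fit_resample(X, y)
```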

Normalization
Normalization is one of the most crucial preprocessing steps for any machine learning task. It is done by scaling or transforming the original data to balance the contributions of different features in data samples. In this study, we normalized the input data so that each feature lies between zero and one.
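A minimal sketch of this normalization step with scikit-learn; fitting the scaler on the training split only is our assumption (a standard precaution against leakage), since the paper does not spell out this detail:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 300.0]])  # toy data
X_test = np.array([[3.0, 250.0]])

# Min-max scaling maps every feature of the training data into [0, 1];
# the test split reuses the training statistics.
scaler = MinMaxScaler()
X_train_n = scaler.fit_transform(X_train)
X_test_n = scaler.transform(X_test)
```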

Split dataset
Further, because the low number of samples in the datasets makes the classification results extremely unstable, we trained and evaluated our models over 100 runs. In each run, we first randomly shuffle and split the data into training and testing sets, train the model for 2000 epochs, and then evaluate it.
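The protocol can be sketched as below; the 80/20 split ratio and the stand-in classifier are assumptions for illustration, as the paper specifies only the shuffling, the 100 runs, and the 2000 epochs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in for the DNN/CNN
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.array([0] * 180 + [1] * 20)   # toy imbalanced labels

scores = []
for run in range(100):
    # A fresh shuffle and split every run; the 80/20 ratio is an assumption.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, shuffle=True, stratify=y, random_state=run)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, clf.predict(X_te)))

print(f"mean accuracy over 100 runs: {np.mean(scores):.4f}")
```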

Models
In this section, we introduce our two proposed Artificial Neural Network (ANN)-based models: a Deep Neural Network and a Convolutional Neural Network.

Proposed Deep Neural Network
Deep Neural Networks (DNNs) have recently become among the favorite approaches in various fields of Artificial Intelligence (68). These networks, commonly called models, are characterized by several layers that contain a huge number of computational units. These units are interconnected, meaning that the output of one unit is the input of another, and are conceived as an imitation of the physiological structure of the brain. In mathematical terms, they are a set of parametrized linear and nonlinear transformations that can be adjusted in order to output abstractions of the input data (69). This capability comes from the amalgamation of multiple layers of perceptrons. Although a single perceptron cannot handle data that are not linearly separable, perceptrons are the basis of Multi-Layer Perceptrons (MLPs), whose ability to transform highly non-linear data makes them a powerful and efficient tool in machine learning (70). The first method proposed in this paper is a DNN-based model, whose architecture is demonstrated in Figure 9. As observed in Figure 9, our DNN model comprises different layers: a fully connected layer followed by an activation function and a batch normalization layer; then another fully connected layer, activation function, and batch normalization; followed by a dropout layer and a single-neuron fully connected layer.
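A sketch of this architecture in PyTorch follows; the hidden widths and dropout rate are assumptions, since the exact hyperparameters are given in Figure 9 and Table 2:

```python
import torch.nn as nn

class DNNClassifier(nn.Module):
    """A sketch of the described DNN; hidden width and dropout probability
    are assumptions, not the paper's exact hyperparameters."""
    def __init__(self, in_features, hidden=64, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),   # single output neuron for binary labels
        )

    def forward(self, x):
        return self.net(x)          # raw logit; apply sigmoid for a probability
```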

Proposed Convolutional Neural Network
Convolutional operations are the main components of Convolutional Neural Network (CNN)-based models. These operations enable CNNs to extract and learn the salient features present in the input data (71). A CNN comprises different layers that output feature maps, obtained by sliding different kernels over the input and applying activation functions (72). The major advantage of CNNs over DNNs is their capability to reduce the computational cost of each layer. The convolutional features extracted by these models are compact representations of the input data, which can be further used in downstream tasks such as classification (73).
In this paper, the second method proposed for binary classification of the input data is a CNN-based model. The architecture of this model is demonstrated in Figure 10. As seen in Figure 10, our proposed model consists of 4 layers (two 1-dimensional convolutional layers and two fully connected layers). After each hidden layer, a non-linear activation function (ReLU) is applied to the output. To make our training process more efficient, we experimented with several loss functions, among which Focal Loss (FL) (74) provided better supervision of the network. FL was in fact invented to address the issue of class imbalance. It belongs to the cost-sensitive methods and was originally introduced for object detection, where the imbalance between background and salient objects is frequent. FL modifies the cross-entropy loss so that, during training, the neural network receives more cost for wrongly predicting complex training samples.
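A PyTorch sketch of this architecture follows; the channel counts and kernel size are assumptions, since the exact values are listed in Table 3:

```python
import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    """A sketch of the described CNN: two 1-D convolutional layers followed
    by two fully connected layers, with ReLU after each hidden layer."""
    def __init__(self, in_features, ch=16, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, ch, kernel_size=k, padding=k // 2), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=k, padding=k // 2), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(ch * in_features, 32), nn.ReLU(),
            nn.Linear(32, 1),       # single logit for binary classification
        )

    def forward(self, x):           # x: (batch, n_features)
        x = x.unsqueeze(1)          # add a channel axis -> (batch, 1, n_features)
        return self.fc(self.conv(x))
```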
More precisely, the cross-entropy loss function is among the most common loss functions in deep learning and originates from information theory. It is essentially identical to the negative log-likelihood loss; for binary classification problems, the binary cross-entropy loss function, denoted by $\mathrm{CE}$, is as follows:

$\mathrm{CE}(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\big(y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\big)$ (1)

where N is the number of samples, $\hat{y}$ is the predicted value, and $y$ denotes the ground truth label. The problem with the cross-entropy loss function is that, in the case of imbalanced classification, the larger class overwhelms the loss by dominating the gradient (75). Hence, to obtain the Focal Loss function, one can simplify and rewrite Equation (1) in the following way. Denote the probability of predicting the ground truth class by $p_t$ and define it as:

$p_t = \begin{cases}\hat{y} & \text{if } y = 1\\ 1 - \hat{y} & \text{otherwise}\end{cases}$ (2)

Therefore, $\mathrm{CE}$ can be rewritten and simplified as:

$\mathrm{CE}(p_t) = -\log(p_t)$ (3)

Finally, FL augments a modulating factor $(1 - p_t)^{\gamma}$ to the binary cross-entropy loss function, where $\gamma > 0$ is a tunable focusing parameter, which yields the following equation:

$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\log(p_t)$ (4)

In practice, an $\alpha$-balanced variant, $\mathrm{FL}(p_t) = -\alpha_t(1 - p_t)^{\gamma}\log(p_t)$, is commonly used; the $\alpha$ parameter reported in our simulation setup refers to this variant.
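A direct PyTorch implementation of Equation (4) in its α-balanced form might look as follows (a sketch; the defaults mirror the α and γ values reported in the simulation setup):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss, Equation (4) with the alpha-balancing term."""
    # Per-sample binary cross-entropy equals -log(p_t), i.e. Equation (3).
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # Equation (2)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```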

Experimental Results
This section comprises the simulation setup, dataset description, dataset split, evaluation metrics, and classification results.

Simulation setup
This section includes the implementation details of our proposed methods. The tools used in this paper are listed in Table 1. In our implementation, we used the Adam algorithm to optimize the models' parameters with a learning rate of 0.001. For the loss function, FL is used with the alpha parameter set to 0.25 and the gamma parameter set to 2. Further, the hyperparameters of the DNN and CNN models are described in detail in Tables 2 and 3.
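Putting these settings together, a minimal training loop might look as follows (a sketch reusing the DNNClassifier and focal_loss definitions above; the placeholder tensors and full-batch updates are assumptions, not details from the paper):

```python
import torch

# Placeholder tensors standing in for a resampled, normalised training split.
X_train_t = torch.randn(128, 10)
y_train_t = torch.randint(0, 2, (128,)).float()

model = DNNClassifier(in_features=10)   # or CNNClassifier; sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(2000):
    optimizer.zero_grad()
    logits = model(X_train_t).squeeze(1)
    # alpha=0.25 and gamma=2 as reported above; focal_loss sketched earlier.
    loss = focal_loss(logits, y_train_t, alpha=0.25, gamma=2.0)
    loss.backward()
    optimizer.step()
```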

Dataset description
In order to examine our proposed methods, we used the KEEL (76) dataset repository, the breast cancer dataset, and the Z-Alizadeh Sani dataset. As depicted in Table 4, they comprise different imbalanced datasets for classification tasks. In Table 4, the first column indicates the number of attributes of each dataset. The second column gives the total number of samples, i.e., the sum of positive and negative samples. The imbalance ratio between the minority (positive) and majority (negative) classes is given in the third column; it is obtained by dividing the number of negative samples by the number of positive samples. As described in Section 3.1.3 (Split dataset), each dataset was randomly shuffled and split into training and testing sets, and the models were trained for 2000 epochs. The generated models were trained and evaluated over 100 runs.

Evaluation metrics
This section elaborates the metrics used to evaluate the performance of our proposed models.
A fundamental classification evaluation tool is the confusion matrix, a way of demonstrating the numbers of correctly and incorrectly predicted samples of a classifier. It is usually a table that compares the actual and predicted states of the samples. Based on Figure 11, the minority and majority classes are marked as the positive and negative classes, respectively. Therefore, a confusion matrix is used to obtain performance metrics for the models on the imbalanced datasets. We utilized eight metrics, namely accuracy, precision, recall, F1-score, G-Mean, specificity, AUC-ROC, and kappa, for evaluating the DNN and CNN models (39, 52, 77-81).

Accuracy
Accuracy is the ratio of the number of samples that are predicted correctly to the total number of input samples, as formulated in Equation (7):

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (7)

Specificity
The specificity is the proportion of true-negative samples to the overall number of true-negative and false-positive samples. The specificity or True Negative Rate (TNR) of a classifier is calculated using Equation (8):

$\mathrm{Specificity} = \frac{TN}{TN + FP}$ (8)

Recall
Recall is another measurement that shows the ratio of correctly predicted positive samples to all the relevant samples, i.e., the samples that are actually positive. Recall is a significant metric for imbalanced datasets, demonstrating the learning accuracy of the positive class. It is calculated by Equation (9):

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (9)

G-Mean
The G-Mean is exploited as an accuracy metric since it gauges the accuracy rates of both the majority and minority classes. It is given by Equation (10):

$\text{G-Mean} = \sqrt{\mathrm{Recall} \times \mathrm{Specificity}}$ (10)

Precision
Precision shows how well a classifier performs in terms of predicting positive samples. As Equation (11) shows, it is calculated by dividing the number of true positives by the total number of samples predicted as positive.

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (11)

F1-Score

F1-Score, which is also called F-score or F-measure, indicates the balance that exists between recall and precision for a classifier. The closer it is to one, the more balanced precision and recall are. F1-Score is obtained by Equation (12):

$\text{F1-Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (12)

Kappa
The kappa metric compares the obtained classification accuracy with the accuracy of a random classifier. It is an important metric that indicates whether the accuracy of the classifier is at a reliable level. The values of kappa lie between -1 and 1, and three reliability levels are used to assess the accuracy: 1. Kappa >= 0.75: robust consistency, highly reliable accuracy. 2. 0.4 <= Kappa < 0.75: fair to good reliability. 3. Kappa < 0.4: unreliable accuracy. The kappa formula is specified in Equation (13):

$\kappa = \frac{p_o - p_e}{1 - p_e}$ (13)

where $p_o$ is the observed accuracy and $p_e$ is the accuracy expected by chance.

AUC-ROC
The AUC-ROC is a crucial measurement for evaluating the performance of classification models. A ROC plot represents the tradeoff between true positives and false positives, which indicates the relation between specificity and recall. The AUC specifies the separability power of the classifier and ranges from 0 to 1; a higher AUC means the model is better at distinguishing the minority and majority classes.
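For reference, all eight metrics can be computed from a model's predictions with scikit-learn; the helper below is a sketch that mirrors Equations (7)-(13):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, cohen_kappa_score,
                             confusion_matrix)

def report(y_true, y_pred, y_score):
    """y_pred: hard labels; y_score: predicted probabilities for AUC-ROC."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    specificity = tn / (tn + fp)                         # Equation (8)
    recall = recall_score(y_true, y_pred)                # Equation (9)
    return {
        "accuracy":    accuracy_score(y_true, y_pred),   # Equation (7)
        "precision":   precision_score(y_true, y_pred),  # Equation (11)
        "recall":      recall,
        "f1":          f1_score(y_true, y_pred),         # Equation (12)
        "g_mean":      np.sqrt(recall * specificity),    # Equation (10)
        "specificity": specificity,
        "auc_roc":     roc_auc_score(y_true, y_score),
        "kappa":       cohen_kappa_score(y_true, y_pred), # Equation (13)
    }
```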

Classification results
In this section, we demonstrate our experimental results based on the evaluation metrics accuracy, precision, recall, F1-score, G-Mean, specificity, AUC, and kappa. The results were obtained by averaging each metric over the imbalanced classification datasets from the KEEL repository, the breast cancer dataset, and the Z-Alizadeh Sani dataset.
The results are given in Tables 5-10. According to these tables, the mixed SMOTE + NORM. + CNN model is superior to the other models in terms of the evaluation criteria for the imbalanced datasets. This demonstrates the impact of using SMOTE in our CNN model, which enhances the overall performance.

Discussion
The classification performance usually drops and faces different failures in the presence of imbalanced datasets. However, imbalanced datasets exist in a broad range of real-life research. For imbalanced binary classification problems, samples are usually categorized into two classes, majority and minority. Generally, the minority class often contains the more significant and crucial samples compared to the majority class. Nevertheless, the majority class has a larger number of samples, and in some cases the disparity may be exceedingly serious. Therefore, handling these problems efficiently has become a crucial and significant topic in machine and deep learning. To overcome these challenges, we proposed two methods based on DNN and CNN algorithms. First, several classical and well-known undersampling and oversampling methods, namely RUS, Tomek Links, OSS, NearMiss, ROS, and SMOTE, were used in the data preprocessing procedure. Also, to achieve better performance, we normalized the datasets. Then, we used the focal loss function, which is widely implemented in neural network frameworks for class imbalance problems, in the process of training the models. Moreover, because the limited number of dataset samples causes unstable classification results, we trained and evaluated our models for 100 runs of 2000 epochs each. In the end, we analyzed our proposed models with respect to accuracy, precision, recall, F1-score, G-Mean, specificity, AUC, and kappa as evaluation metrics. Based on the 24 imbalanced datasets, the average performance scores of the evaluation metrics of the executed models are indicated in Table 11. According to Table 12, the results show the efficiency of our proposed model on 16 imbalanced datasets, where 99.00% recall, 99.00% G-Mean, and 98.98% F1-score were attained by the SMOTE+NORM.+CNN model. In addition, the results of the proposed SMOTE+NORM.+CNN model are compared with the related works on the Z-Alizadeh Sani dataset, as represented in Table 13.

Conclusion and future work
An unbalanced dataset, in which samples belonging to one or more classes are not evenly distributed between the majority and minority classes, is a challenging issue. In particular, imbalanced datasets cause deep learning-based models to obtain biased results in binary classification. To address this issue, we applied oversampling and undersampling techniques, namely SMOTE, TL, OSS, NearMiss, ROS, and RUS. Among these techniques, SMOTE is the most common and robust; it increases the number of minority class samples by generating synthetic samples and is employed for balancing datasets with an extremely unbalanced ratio. In this study, six deep learning-based model combinations were used to classify the majority and minority classes: SMOTE + NORM. + CNN/DNN, TL + NORM. + CNN/DNN, OSS + NORM. + CNN/DNN, NearMiss + NORM. + CNN/DNN, ROS + NORM. + CNN/DNN, and RUS + NORM. + CNN/DNN. To evaluate these models, we utilized the KEEL, breast cancer, and Z-Alizadeh Sani datasets. The results show that the mixed SMOTE-NORM-CNN model significantly outperforms the other models, achieving 99.08% accuracy, 99.09% precision, 99.08% sensitivity, 99.09% F1-score, 99.08% G-Mean, 99.03% specificity, 99.08% AUC, and 98.92% kappa on the 24 imbalanced datasets. Also, the proposed model was compared to the study (39) on the same datasets, and the mixed model performs well there too. Furthermore, we investigated the related methodologies on the Z-Alizadeh Sani dataset. The results indicate that our proposed model also outperforms these related studies.

Fig. 1 .
Fig. 1. The schematic diagram of several data distributions with two-dimensional binary-class-imbalanced data.

Fig. 10 .
Fig. 10. The architecture of the proposed CNN-based classifier.
Fig. 17. The average rate of the metrics for the RUS + NORM. + CNN/DNN.

Table 1 .
The implementation details

Table 2 .
The list of hyperparameters of the DNN model.

Table 3 .
The list of hyperparameters of the CNN-based model.

Table 4 .
Datasets description in detail.

According to the results obtained, the proposed SMOTE + NORM. + CNN model outperforms the other models in terms of the eight metrics on the datasets. For a better comparison of the results, the best average performance scores are also shown in Figures 12-17. According to these figures, the mixed SMOTE + NORM. + CNN model is consistently the strongest.

Table 11 .
The average performance scores of the evaluation metrics of the executed models. According to Table 11, it can be found that the mixed SMOTE+NORM.+CNN model has the best performance with 99.08% accuracy, 99.09% precision, 99.08% sensitivity, 99.09% F1-score, 99.08% G-Mean, 99.03% specificity, 99.08% AUC, and 98.92% kappa. Also, the comparison of performance metrics between our study and study (39) on the same imbalanced datasets is demonstrated in Table 12.
*Bold specifies that SMOTE + NORM. + CNN is the most robust model.

Table 12 .
The comparison of performance metrics on the same dataset.

Table 13 .
The comparison of the metrics results between the proposed study and related studies on the Z-Alizadeh Sani dataset.

According to Table 13, the outcomes show the dominance of the proposed SMOTE+NORM.+CNN model compared with other studies. The hybrid SMOTE+NORM.+CNN model delivers the best performance with 98.57% accuracy, 98.58% recall, 98.57% F1-score, 98.58% precision, 98.42% specificity, and 99.14% AUC. In addition, the F1-score metric is particularly informative for the imbalanced Z-Alizadeh Sani dataset because this metric shows the balance between recall and precision for classifiers.