A Selective Dynamic Sampling Back-Propagation Approach for Handling the Two-Class Imbalance Problem

Abstract: In this work, we developed a Selective Dynamic Sampling Approach (SDSA) to deal with the class imbalance problem. It is based on the idea of using only the most appropriate samples during the neural network training stage. The "average samples" are the best to train the neural network: they are neither hard nor easy to learn, and they could improve the classifier performance. The experimental results show that the proposed method deals successfully with the two-class imbalance problem. It is very competitive with respect to well-known over-sampling approaches and dynamic sampling approaches, and it often outperforms the under-sampling and standard back-propagation methods. SDSA is a very simple and efficient method for automatically selecting the most appropriate samples (average samples) during back-propagation training. In the training stage, SDSA uses significantly fewer samples than the popular over-sampling approaches and even than the standard back-propagation trained with the original dataset.


Introduction
In recent years, the class imbalance problem has been a hot topic in machine learning and data mining [1,2]. It appears when the classifier is trained with a dataset where the number of samples in one class is lower than the number of samples in the other class, and this produces an important deterioration in the classifier performance [3,4].
The most common methods for handling the class imbalance problem have been the re-sampling methods (under-sampling and over-sampling) [2,5,6], mainly because they are independent of the underlying classifier [7]. One of the most well-known over-sampling methods is the Synthetic Minority Over-sampling Technique (SMOTE), which generates artificial samples of the minority class by interpolating existing instances that lie close together [8]. It has motivated the development of other sampling methods: borderline-SMOTE, Adaptive Synthetic Sampling (ADASYN), SMOTE editing nearest neighbor, safe-level-SMOTE, Density-Based Synthetic Minority Over-sampling TEchnique (DBSMOTE) and SMOTE + Tomek's Links [9], among others (see [1,7,10]).
Interest has been observed in finding the best samples to build the classifiers. For example, borderline-SMOTE has been proposed to over-sample only the minority samples near the class decision borderline [11]. Similarly, in [12], safe-level-SMOTE is proposed to select minority class instances from the safe level region, and then these samples are used to generate synthetic instances. ADASYN has been developed to generate more synthetic data from minority class samples that are harder to learn than from those that are easy to learn [13]. In a similar way, the SPIDER approaches (a framework that integrates selective data pre-processing with the Ivotes ensemble method) over-sample locally only those minority class samples that are difficult to learn and include a removing or relabeling process of noisy samples from the majority class [14,15]. The approaches discussed above have in common that they use the K nearest neighbors rule as the basis and that they are applied before the classifier training stage.
On the other hand, the under-sampling methods have shown effectiveness in dealing with the class imbalance problem (see [7,8,10,16-19]). One of the most successful under-sampling methods has been random under-sampling, which eliminates random samples from the original dataset (usually from the majority class) to decrease the class imbalance; however, this method loses effectiveness when it removes significant samples [7]. Other important under-sampling methods that include a heuristic mechanism are: the neighborhood cleaning rule, based on Wilson editing [20], one-sided selection [21], Tomek links [22] and the Condensed Nearest Neighbor rule (CNN) [23]. Basically, the aim of the cleaning mechanism is: (i) to eliminate samples with a high likelihood of being noise or atypical samples or (ii) to eliminate redundant samples in CNN methods. Like the above approaches, these methods are applied before the training process, and they employ the K nearest neighbors rule as the basis (except the Tomek links method).
Another important alternative for facing the class imbalance has been the Cost-Sensitive (CS) approach, which has become one of the most relevant topics in machine learning research in recent years [24]. CS methods consider the costs associated with misclassifying samples, i.e., they use different cost matrices describing the costs of misclassifying any particular data sample [10]. Over- and under-sampling could be considered a special case of the CS techniques [25]. Another CS method is threshold-moving, which moves the output threshold toward inexpensive classes, such that samples with higher costs become harder to misclassify. It is applied in the test phase and does not affect the training phase [24].
Ensemble learning is an effective method that has increasingly been adopted to combine multiple classifiers and class imbalance approaches to improve the classification performance [2,4,5]. In order to combine the multiple classifiers, it is common to use hard and soft ensembles. The former use binary votes, while the latter use real-valued votes [26].
Recently, dynamic sampling methods have become an interesting way to deal with the class imbalance problem. They are attractive because they automatically find the proper sampling amount for each class in the training stage (unlike conventional strategies such as over- and/or under-sampling techniques). In addition, some dynamic sampling methods also identify the "best samples" for classifier training. For example, Lin et al. [27] propose a dynamic sampling method with the ability to identify samples with a high probability of being misclassified. The idea is that the classifier trained with these samples may produce better classification results. Other methods that can be considered as dynamic sampling are: (i) the snowball method (proposed in [28] and used as a dynamic training method in [29,30]); (ii) the genetic dynamic training technique [31,32], in which the authors employ a genetic algorithm to find the best over-sampling ratio; (iii) the mean square error (MSE) dynamic over-sampling method [19], which is based on the MSE of back-propagation for automatically identifying the over-sampling rate. Chawla et al. [33] present a WRAPPER paradigm (in which the search is guided by the classification goodness measure as a score) to discover the amount of under-sampling and over-sampling for a dataset. Debowski et al. [34] show a very similar work.
The dynamic sampling approaches are a special case of the sampling techniques. The main difference between these methods and the conventional sampling strategies lies in the time at which they sample the data or select the examples to be sampled (see [19,27,28,31,32]).
In this paper, a Selective Dynamic Sampling Approach (SDSA) to deal with the two-class imbalance problem is presented. This method automatically finds the appropriate sampling amount for each class through the selection of the "best samples" to train the multilayer perceptron with the back-propagation algorithm [35]. The proposed method was tested over thirty-five real datasets and compared to some state-of-the-art class imbalance approaches.

Selective Dynamic Sampling Approach
Researchers in the class imbalance problem have shown their interest in finding the best samples to build the classifiers, for example by eliminating those samples with a high probability of being noise or overlapped samples [18,36-40], or by focusing on those close to the borderline decision [11,13,41] (the latter has been less explored).
In accordance with the above discussion, three categories of samples can basically be identified in the class imbalance literature:
• Noise and rare or outlier samples. The first ones are instances with errors in their labels [7] or erroneous values in the features that describe them, and the last ones are the minority and rare samples located inside the majority class [42].
• Border or overlapped samples are those samples located where the decision boundary regions intersect [18,38].
• Safe samples are those with a high probability of being correctly labeled by the classifier, and they are surrounded by samples of the same class [42].
Nevertheless, those samples situated close to the borderline decision and far from the safe samples might be of interest; in other words, those that are neither hard nor easy to learn. These samples are called "average samples" [35].
In this section, a Selective Dynamic Sampling Approach (SDSA) to train the multilayer perceptron is presented. The aim of this proposal is to deal with the two-class imbalance problem, i.e., this method only works with two-class imbalanced datasets. SDSA is based on a modification of the "stochastic" back-propagation algorithm and is derived from the idea of using average samples to train Artificial Neural Networks (ANN) in order to improve the classifier performance. The proposed method consists of two steps, and it is described below:
Variable $\Delta_q$ is the normalized difference between the real neural network outputs for a sample q, where $z_0^q$ and $z_1^q$ are the two real neural network outputs for that sample. The ANN has only two outputs ($z_0^q$ and $z_1^q$), because it has been designed to work with two-class datasets [43].
The Selective Dynamic Sampling Approach (SDSA) is detailed in Algorithm 1, where $t_j^{(q)}$ and $z_j^{(q)}$ are the desired and real neural network outputs for a sample q, respectively.

Algorithm 1
The Selective Dynamic Sampling Approach (SDSA) based on the stochastic back-propagation multilayer perceptron.
Input: X (input dataset), N (number of features in X), K (number of classes in X), Q (number of samples in X), M (number of middle neurodes), J (number of output neurodes), I (number of iterations) and η (learning rate).

Selecting µ Values
The appropriate selection of the variable µ is critical to select the average samples or other kinds of samples (border or safe samples [42]). Variable µ is computed under the following consideration: the target ANN outputs ($t_j$) are usually codified in zero and one values [43]. For example, for a two-class problem (Class A and Class B), the desired ANN outputs are codified as (1, 0) and (0, 1) for Classes A and B, respectively. These values are the target ANN outputs ($t_j$), i.e., the desired final values emitted by the ANN after training. In accordance with this understanding, the expected µ values are:
• µ ≈ 1.0 for safe samples. It is expected that the ANN classifies them with a high accuracy level, i.e., it is expected that the real ANN outputs for all neurons ($z_j$) will be values close to (1, 0) and (0, 1) for Classes A and B, respectively. If we apply Equation (2), the expected value of ∆ is 1.0, at which the γ function has its maximum value.
• µ ≈ 0.0 for border samples. It is expected that the classifier misclassifies them. The expected ANN outputs for all neurons are values close to (0.5, 0.5); then, ∆ is approximately 0.0, at which the γ function has its maximum value for these samples.
• µ ≈ 0.5 for average samples. It is expected that the ANN classifies them correctly, but with less accuracy.
The recommended µ values to select the average samples are those around 0.5.An independent validation set to find the most appropriate µ value is proposed to avoid any bias in the testing process.
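The role of µ can be illustrated with a minimal Python sketch. The defining equations of ∆ and γ are not reproduced in this text, so the sketch assumes a normalized absolute difference for ∆ and a Gaussian-shaped γ centered at µ. Both forms, the width `sigma` and all function names are illustrative assumptions, not necessarily the authors' exact definitions.

```python
import math
import random

def delta(z0, z1):
    """Assumed normalized difference between the two network outputs.

    Gives 1.0 for outputs such as (1, 0) and 0.0 for (0.5, 0.5).
    """
    norm = math.hypot(z0, z1)
    return abs(z0 - z1) / norm if norm > 0 else 0.0

def gamma(d, mu, sigma=0.1):
    """Assumed Gaussian-shaped selection function; equals 1.0 when d == mu."""
    return math.exp(-((d - mu) ** 2) / (2.0 * sigma ** 2))

def use_sample(z0, z1, mu, rng=random):
    """Stochastically decide whether a sample is used to update the weights."""
    return rng.random() < gamma(delta(z0, z1), mu)
```

Under these assumptions, a sample whose ∆ is close to µ is almost always selected, while samples far from µ (safe or border samples when µ is around 0.5) are rarely used.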
For this independent validation, a minimal subset of the training data is used. Firstly, ten-fold cross-validation is applied to each dataset (Section 5.1); next, only 10% of the samples are randomly taken from each training fold ($TF_{10}$); then, $TF_{10}$ is split into two disjoint folds of the same size ($TF_5^{train}$ and $TF_5^{test}$, respectively). Next, the proposed method (SDSA) is applied over $TF_5^{train}$ and $TF_5^{test}$ to find the best µ value. The tested values for µ were 0.25, 0.375, 0.5, 0.625 and 0.75. Finally, the µ value for which the best Area Under the Curve (AUC) [44] rank was obtained on $TF_{10}$ is chosen for SDSA.
Note that this independent validation does not imply an important computational cost, because it only uses 10% of the training data to find the most appropriate µ value. It also keeps the performance estimate on the testing data unbiased, because the test data are not used.
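The validation procedure described above can be sketched as follows. This is a simplified illustration: it picks the candidate with the highest AUC on a single fold, whereas the text selects the value with the best AUC rank over the datasets. The callback `evaluate_auc` and both function names are hypothetical.

```python
import random

MU_CANDIDATES = [0.25, 0.375, 0.5, 0.625, 0.75]  # values tested in the text

def validation_split(training_fold, fraction=0.10, seed=0):
    """Draw `fraction` of a training fold and split it into two disjoint halves."""
    rng = random.Random(seed)
    k = max(2, int(len(training_fold) * fraction))
    subset = rng.sample(training_fold, k)  # sampling without replacement
    half = len(subset) // 2
    return subset[:half], subset[half:2 * half]

def select_mu(training_fold, evaluate_auc, candidates=MU_CANDIDATES):
    """Return the candidate mu with the highest validation AUC.

    `evaluate_auc(train, test, mu)` is a hypothetical callback that trains
    SDSA on `train` with the given mu and returns the AUC measured on `test`.
    """
    train, test = validation_split(training_fold)
    return max(candidates, key=lambda mu: evaluate_auc(train, test, mu))
```

Because only the training fold is passed in, the test fold of the outer cross-validation never influences the choice of µ.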

State-of-the-Art of the Class Imbalance Approaches
In the state-of-the-art of the class imbalance problem, the over- and under-sampling methods are very popular and successful approaches (see [7,8,10,16-19]). Over-sampling replicates samples in the minority class, and under-sampling eliminates samples from the majority class, biasing the discrimination process to compensate for the class imbalance.
This section describes some well-known sampling approaches that have been effectively applied to deal with the class imbalance problem. These approaches are used to compare the classification performance of the proposed method with that of state-of-the-art class imbalance approaches.

Under-Sampling Approaches
TL The Tomek Links method considers a pair of samples a and b from different classes to be a Tomek link (TL) if there is no sample c such that d(a, c) < d(a, b) or d(b, c) < d(a, b), where d is the distance between pairs of samples [22]. Samples in TL are noisy or lie in the decision border. This method removes those majority class samples belonging to TL [9].
CNN The main goal of the Condensed Nearest Neighbor algorithm is to reduce the size of the stored dataset of training samples while trying to maintain (or even improve) generalization accuracy. In this method, every member of X (the original training dataset) must be closer to a member of S (the pruned set) of the same class than to any member of S from a different class [23].
CNNTL combines CNN with TL [9].
NCL The Neighborhood Cleaning Rule uses the Editing Nearest Neighbor (ENN) rule, but only eliminates the majority class samples. ENN uses the k-NN (k > 1) classifier to estimate the class label of every sample in the dataset and discards those samples whose class labels disagree with the class associated with the majority of the k neighbors [20].
OSS The One-Sided Selection method performs TL, then CNN on the training dataset [21].
RUS The Random Under-Sampling method randomly eliminates samples from the majority class and biases the discrimination process to compensate for the class imbalance.
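As an illustration of the cleaning idea behind TL, the following minimal sketch detects Tomek links as pairs of mutually nearest neighbors with different class labels and then removes the majority-class member of each pair. It assumes the Euclidean distance and a brute-force neighbor search; the function names are illustrative, and ties in distance are resolved arbitrarily.

```python
import math

def tomek_links(samples, labels):
    """Return index pairs (i, j), i < j, that form Tomek links.

    Two samples of different classes form a link when each one is the
    nearest neighbor of the other.
    """
    def nearest(i):
        return min((j for j in range(len(samples)) if j != i),
                   key=lambda j: math.dist(samples[i], samples[j]))
    nn = [nearest(i) for i in range(len(samples))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

def remove_majority_in_links(samples, labels, majority_label):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(samples, labels)
            for k in pair if labels[k] == majority_label}
    keep = [k for k in range(len(samples)) if k not in drop]
    return [samples[k] for k in keep], [labels[k] for k in keep]
```

The brute-force search is quadratic in the number of samples, which is acceptable for a sketch but not for large datasets.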

Over-Sampling Approaches
ADASYN is an extension of SMOTE that creates more samples in the vicinity of the boundary between the two classes than in the interior of the minority class [13].
ADOMS The Adjusting the Direction Of the synthetic Minority clasS method sets the direction of the synthetic minority class samples. It works like SMOTE, but it generates synthetic examples along the first component of the main axis of the local data distribution [45].
ROS The Random Over-Sampling method duplicates samples randomly from the minority class, biasing the discrimination process to compensate for the class imbalance.
SMOTE [8] generates artificial samples of the minority class by interpolating existing instances that lie close together. It finds the k intra-class nearest neighbors for each minority sample, and then, synthetic samples are generated in the direction of some or all of those nearest neighbors.
B-SMOTE Borderline-SMOTE [11] selects samples from the minority class that are on the borderline (of the minority decision region, in the feature space) and only performs SMOTE on those samples, instead of over-sampling all or taking a random subset.

SMOTE-ENN This technique consists of applying SMOTE and then the ENN rule [9].
SMOTE-TL is the combination of SMOTE and TL [9].
SL-SMOTE Safe-Level SMOTE is based on SMOTE, but it generates synthetic minority class samples positioned closer to the largest safe level; thus, all synthetic samples are only generated in safe regions [12].
SPIDER-1 is an approach that combines a local over-sampling of those minority class samples that are difficult to learn with removing or relabeling noisy samples from the majority class [14].
SPIDER-2 The major difference between this method and SPIDER-1 is that it divides the pre-processing of the majority and minority class samples into two stages, i.e., it first pre-processes the majority class samples and next the minority class samples (considering the changes introduced in the first stage) [15].
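Most of the over-sampling methods listed above build on the SMOTE interpolation step, which can be sketched as follows. This is a minimal illustration that assumes the Euclidean distance, a brute-force neighbor search and at least two minority samples; the function name and parameters are illustrative and do not correspond to the KEEL implementation used in the experiments.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate `n_new` synthetic samples by interpolating minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest intra-class neighbors of the chosen sample (excluding itself)
        others = [s for j, s in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda s: math.dist(base, s))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic
```

Every synthetic sample lies on the segment between a minority sample and one of its minority neighbors, so no sample is created outside the region spanned by the minority class.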

Dynamic Sampling Techniques to Train Artificial Neural Networks
Dynamic sampling techniques have become an interesting way to deal with the class imbalance problem on the Multilayer Perceptron (MLP) trained with stochastic back-propagation [19,27,28,31,32]. Unlike conventional strategies such as over- and/or under-sampling techniques, dynamic sampling automatically finds, in the training stage, the proper sampling amount for each class. In this section, we present some details and the main features of two dynamic sampling methods.

Method 1. Dynamic Sampling
The basic idea of the Dynamic Sampling (DyS) method, proposed in [27], is to design a simple procedure that dynamically selects samples during the training process. In this method, no sample is deleted beforehand, which prevents information loss; instead, the samples that are hard to classify are dynamically selected to train the ANN, making the best use of the dataset. According to this main idea, the general steps in each epoch can be described as follows.
1. Randomly fetch a sample q from the training dataset.
2. Estimate the probability p that the example should be used for the training, where $\delta = z_j^q - \max_{i \neq c}\{z_i^q\}$, $z_i^q$ is the i-th real ANN output for the sample q and j is the class label to which q belongs. $r_c = Q_c/Q$ is the class ratio; $Q_c$ is the number of samples belonging to class c; and Q is the total number of samples.
3. Generate a uniform random real number µ between zero and one.
4. If µ < p, then use the sample q to update the weights by the back-propagation rules.
5. Repeat Steps 1-4 on all samples of the training dataset in each training epoch.
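The five DyS steps can be sketched as one training epoch in Python. The formula for the selection probability p is not reproduced in this text, so it is passed in as the callable `prob`; `forward`, `update` and `prob` are hypothetical placeholders for the network and for the rule of [27], and the sketch visits the samples in a random order rather than drawing them with replacement.

```python
import random

def margin(outputs, true_class):
    """delta: output of the true class minus the largest other output."""
    others = [z for i, z in enumerate(outputs) if i != true_class]
    return outputs[true_class] - max(others)

def dys_epoch(dataset, forward, update, prob, rng=random):
    """Run one DyS epoch and return how many samples were used for training.

    forward(x) -> list of network outputs; update(x, label) performs one
    back-propagation step; prob(delta, label) -> selection probability p.
    """
    used = 0
    for x, label in rng.sample(dataset, len(dataset)):  # Steps 1 and 5
        p = prob(margin(forward(x), label), label)      # Step 2
        if rng.random() < p:                            # Steps 3 and 4
            update(x, label)
            used += 1
    return used
```

A negative margin means the sample is currently misclassified, so a reasonable `prob` would return a high probability in that case and a lower one for confidently classified samples.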
In addition, the authors of [27] use an over-sampling method based on a heuristic technique to avoid bias caused by the class imbalance problem. In the first epoch, the samples of all classes except the largest class are over-sampled to make the dataset balanced. As the training process goes on, the over-sampling ratio (ρ) is attenuated in each epoch (ep) by a heuristic technique (Equation (4)). It is calculated as: where ep (> 2) is the epoch number and max represents the largest majority class.

Method 2. Dynamic Over-Sampling
In [19], a Dynamic Over-Sampling (DOS) technique to deal with the class imbalance problem was proposed. The main idea of DOS is to balance the MSE in the training stage (when a multi-class imbalanced dataset is used) through an over-sampling technique. Basically, the DOS method consists of two steps:
1. Before training: The training dataset is balanced at 100% through an effective over-sampling technique. In this work, SMOTE [8] is utilized.
2. During training: The MSE by class $E_j$ is used to determine the number of samples by class (or class ratio) to be forwarded to the ANN. The equation employed to obtain the class ratio is defined as: where J is the number of classes in the dataset and max identifies the largest majority class. Equation (5) allows balancing the MSE by class, reducing the impact of the class imbalance problem on the ANN.
The DOS method only uses the samples that are necessary for dealing with the class imbalance problem and, in this way, avoids the poor classification performance that results from training the ANN with imbalanced datasets.

Experimental Set-Up
In this section, the techniques, datasets and experimental framework used in this paper are described.

Database Description
Firstly, for the experimental stage, five real-world remote sensing databases are chosen: Cayo, Feltwell, Satimage, Segment and 92AV3C. The Cayo dataset comes from a particular region in the Gulf of Mexico [18]. The Feltwell dataset represents an agricultural area near the village of Feltwell (UK) [46]. The Satimage and Segment datasets are from the UCI (University of California, Irvine) Machine Learning Database Repository [47]. The 92AV3C dataset [48] corresponds to a hyperspectral image (145 × 145 pixels, 220 bands, 17 classes) taken over Northwestern Indiana's Indian Pines by the AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) sensor. In this work, we employed a reduced version of this dataset with six classes (2, 3, 4, 6, 7 and 8) and thirty-eight attributes, as in [18].
Only the two-class imbalance problem is studied. We decompose the multi-class problems into multiple two-class imbalanced problems. This proceeds as follows: one class ($c_j$) is taken from the original database (DB) to form the minority class ($c^+$), and the rest of the classes are joined to form the majority class ($c^-$). Then, we build the two-class database $DB_j$ (j = 1, 2, ..., J, where J is the number of classes in DB). In other words, $DB_j = c^+ \cup c^-$. Therefore, for each database, J two-class imbalanced datasets were obtained. The main characteristics of the resulting benchmarking datasets are shown in Table 1. This table shows that the datasets used in this work have several class imbalance levels (see the class imbalance ratio), ranging from a low to a high class imbalance ratio (for example, see the 92A3 and CAY4 datasets). In addition, the ten-fold cross-validation method was applied on all datasets shown in this table.
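The decomposition of a multi-class database into J two-class datasets can be sketched as follows. The imbalance ratio is computed here as the majority size divided by the minority size, which is a common definition and is assumed rather than taken from Table 1; the function names are illustrative.

```python
from collections import Counter

def one_vs_rest(labels):
    """Build one binary labeling per class: 1 for c+ (the chosen class), 0 for c-."""
    return {c: [1 if y == c else 0 for y in labels] for c in sorted(set(labels))}

def imbalance_ratio(binary_labels):
    """Size of the larger class divided by the size of the smaller class."""
    counts = Counter(binary_labels)
    return max(counts.values()) / min(counts.values())
```

For a database with classes of sizes 2, 3 and 5, the dataset built from the smallest class has an imbalance ratio of 8/2 = 4, while the one built from the largest class is perfectly balanced.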

Parameter Specification for the Algorithms Employed in the Experimentation
The stochastic back-propagation algorithm was used in this work (the source code of the back-propagation algorithm and of the dynamic sampling methods, as well as the datasets used in this work, are available at Ref. [49]), and for each training process, the weights were randomly initialized ten times. The learning rate (η) was set to 0.1, and the stopping criterion was established at 500 epochs or an MSE value lower than 0.001. A single hidden layer was used. The number of neurons in the hidden layer was set to four for every experiment.
All sampling methods use five nearest neighbors (except ENN, SPIDER-1 and SPIDER-2, which employ three), if applicable, and sample the training dataset to reach a balanced relative class distribution (if applicable). ADASYN and ADOMS use the Euclidean distance, and the rest of the methods employ the Heterogeneous Value Difference Metric (HVDM) [50], if applicable. SPIDER-1 applies a weak amplification pre-processing option, and SPIDER-2 employs relabeling of noisy samples from the majority class and an amplification option. The sampling methods were run using KEEL [51].
In order to identify the most suitable value for the variable µ, an independent validation set is considered to avoid any bias in the performance on the testing data, meaning that the testing data are not used for this validation (see Section 2.1). The most appropriate value for the variable µ obtained for the datasets used in this work (Table 1) is 0.375. The results presented in this paper were obtained with µ = 0.375. In addition, for this independent validation, only 200 epochs are used in the neural network training stage and about 8% of the samples of each dataset. This does not imply an important additional computational effort. The SDSAO and SDSAS methods are the proposed methods using ROS and SMOTE, respectively (see Section 4).

Classifier Performance and Significant Statistical Test
The Area Under the receiver operating characteristic Curve (AUC) [44] was used as the measure of classifier performance. It is one of the most widely used and accepted techniques for the evaluation of binary classifiers in class imbalance domains [10].
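For reference, the AUC of a two-class classifier can be computed directly from its output scores with the rank-based (Mann-Whitney) formulation: it is the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below is a brute-force illustration, not the evaluation code used in the experiments.

```python
def auc(scores, labels):
    """AUC as the probability that a positive is scored above a negative.

    `labels` holds 1 for the positive (minority) class and 0 for the negative one.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because the measure depends only on how positives are ordered relative to negatives, it is not dominated by the majority class, which is why it is preferred to plain accuracy in imbalanced domains.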
Additionally, in order to strengthen the results analysis, a non-parametric statistical test is carried out. The Friedman test is a non-parametric method in which the first step is to rank the algorithms for each dataset separately; the best performing algorithm gets rank 1, the second best rank 2, etc. In case of ties, average ranks are computed. The Friedman test uses the average rankings to calculate the Friedman statistic, which can be computed as

$$\chi_F^2 = \frac{12N}{K(K+1)} \left[ \sum_{j=1}^{K} R_j^2 - \frac{K(K+1)^2}{4} \right] \quad (6)$$

where K denotes the number of methods; N is the number of datasets; and $R_j$ is the average rank of method j on all datasets. On the other hand, Iman and Davenport [52] demonstrated that $\chi_F^2$ has a conservative behavior. They proposed a better statistic (Equation (7)), distributed according to the F-distribution with K − 1 and (K − 1)(N − 1) degrees of freedom,

$$F_F = \frac{(N-1)\,\chi_F^2}{N(K-1) - \chi_F^2} \quad (7)$$

In this work, the Friedman and Iman-Davenport tests are employed with the α = 0.05 level of significance, and the KEEL software [51] is utilized.
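The ranking step and the two statistics can be sketched in a few lines of Python using their standard textbook forms, $\chi_F^2 = \frac{12N}{K(K+1)}\left[\sum_j R_j^2 - \frac{K(K+1)^2}{4}\right]$ and $F_F = \frac{(N-1)\chi_F^2}{N(K-1) - \chi_F^2}$. The function names are illustrative; the experiments themselves rely on KEEL.

```python
def ranks(scores):
    """Rank scores so that the highest gets rank 1; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of positions i+1 .. j+1
        for m in range(i, j + 1):
            result[order[m]] = avg
        i = j + 1
    return result

def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square from the average rank of each of the K methods."""
    k = len(avg_ranks)
    return (12.0 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)

def iman_davenport(chi2_f, n_datasets, k):
    """Iman-Davenport correction of the Friedman statistic."""
    return (n_datasets - 1) * chi2_f / (n_datasets * (k - 1) - chi2_f)
```

With the values reported later in this paper (Friedman statistic 329.474, N = 35 datasets and K = 21 methods), `iman_davenport(329.474, 35, 21)` gives approximately 30.233, in agreement with the reported Iman-Davenport statistic.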
In addition, when the null hypothesis is rejected, a post-hoc test is used in order to find the particular pairwise method comparisons that produce statistically significant differences. The Holm-Shaffer post-hoc tests are applied in order to report any significant difference between individual methods. The Holm procedure rejects the hypotheses ($H_i$) one at a time until no further rejections can be done [53]. To accomplish this, the Holm method orders the p-values from the smallest to the largest, i.e., $p_1 \le p_2 \le \dots \le p_{k-1}$, corresponding to the hypothesis sequence $H_1, H_2, \dots, H_{k-1}$. Then, the Holm procedure rejects $H_1$ to $H_{i-1}$ if i is the smallest integer such that $p_i > \alpha/(k - i)$. This procedure starts with the most significant p-value. As soon as a certain null hypothesis cannot be rejected, all of the remaining hypotheses are retained as well [54]. The Shaffer method follows a very similar procedure to that proposed by Holm, but instead of rejecting $H_i$ if $p_i \le \alpha/(k - i)$, it rejects $H_i$ if $p_i \le \alpha/t_i$, where $t_i$ is the maximum number of hypotheses that can be true given that any (i − 1) hypotheses are false [55].
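The Holm step-down procedure can be sketched as follows, using its standard form: the sorted p-values are compared against α/(k − i), and testing stops at the first p-value that exceeds its threshold. The sketch covers only Holm; the Shaffer variant would replace the divisor by $t_i$, whose computation is not shown. The function name is illustrative.

```python
def holm(p_values, alpha=0.05):
    """Return booleans in input order: True where the hypothesis is rejected."""
    m = len(p_values)  # number of hypotheses (k - 1 in the notation of the text)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for step, idx in enumerate(order):          # step = i - 1
        if p_values[idx] > alpha / (m - step):  # threshold alpha / (k - i)
            break                               # retain this and all remaining
        rejected[idx] = True
    return rejected
```

For example, with p-values 0.01, 0.04, 0.03 and 0.005 and α = 0.05, the thresholds are 0.0125, 0.0167 and 0.025: the two smallest p-values are rejected, and 0.03 > 0.025 stops the procedure.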

Experimental Results and Discussion
In order to assess the performance of the proposed methods (SDSAO and SDSAS), a set of experiments has been carried out over thirty-five two-class datasets (Table 1) with ten well-known over-sampling approaches (ADASYN, ADOMS, B-SMOTE, ROS, SMOTE, SMOTE-ENN, SMOTE-TL, SPIDER-1, SPIDER-2 and SL-SMOTE), six popular under-sampling methods (TL, CNN, CNNTL, NCL, OSS and RUS) (for more detail about these re-sampling techniques, see Section 3) and two dynamic sampling approaches (DyS and DOS).
This section is organized as follows: First, the AUC values are shown, and the Friedman ranks are used to analyze the classification results (Table 2). Second, a statistical test is presented in order to strengthen the discussion of the results (Figure 1). Finally, the relationship between the training dataset size and the performance of the tested methods is studied (Figure 2). The results presented in Table 2 are the AUC values obtained in the classifying stage, averaged over ten folds and ten different weight initializations of the neural network (see Section 5).
In accordance with the averaged ranks shown in Table 2, all over-sampling methods and dynamic sampling approaches (SDSAO, SDSAS, DyS and DOS) can improve the standard back-propagation (BP) performance, and the worst approaches with respect to standard BP are the under-sampling techniques, except for RUS, NCL and TL, which show a better performance than the standard BP. This table also shows that only the ROS technique presents a better performance than the proposed methods. SDSAO and DyS show a slight advantage over SDSAS.
In addition, Table 2 indicates that the class Imbalance Ratio (IR) is not decisive for obtaining high AUC values; for example, the CAY7, SAT2, SEG1, SEG5 and 92A5 datasets present high values of AUC regardless of their IR, and in these datasets, most over-sampling methods and dynamic sampling approaches are very competitive. Other datasets support this fact, i.e., that IR is not critical for the classification performance; for example, the SEG4 and SEG5 datasets have the same IR, but the classification performance (using the standard BP) is very different (AUC values of 0.999 and 0.630, respectively). This confirms what was presented in other works, namely that other features of the data might become a strong problem for the class imbalance [2]. For example: (i) the class overlapping or noisy data [39,42,56,57]; (ii) the small disjuncts; (iii) the lack of density and information in the training data [58]; (iv) the significance of the borderline instances [13,59] and their relationship with noisy samples; and (v) the possible differences in the data distribution for the training and testing data, also known as the dataset shift [7].
In order to strengthen the analysis of the results, non-parametric statistical and post-hoc tests are applied (see Section 5.3). With the statistic distributed according to chi-square with 20 degrees of freedom, the Friedman statistic is 329.474, and the p-value computed by the Friedman test is $1.690 \times 10^{-10}$. With the statistic distributed according to the F-distribution with 20 and 680 degrees of freedom, the Iman and Davenport statistic is 30.233, and the p-value computed by their test is $2.588 \times 10^{-80}$. Then, the null hypothesis is rejected, i.e., the Friedman and Iman-Davenport tests indicate the existence of significant differences in the results. Due to these results, a post-hoc statistical analysis is required.
Figure 1 shows the results of the non-parametric statistical Holm and Shaffer post-hoc tests. The rows and columns correspond to the studied methods; as a consequence, the figure represents all C × C pairwise classifier comparisons. The filled circles mean that for these particular pairs of methods ($C_i \times C_j$; i = 1, 2, ..., C and i ≠ j), the null hypothesis was rejected by the Holm-Shaffer post-hoc tests. The color of the circles is darkest when the p-values are close to zero; this means that the statistical difference is significant.
Table 2. Back-propagation classification performance using the Area Under the receiver operating characteristic Curve (AUC). The results are averaged over ten folds and ten different weight initializations of the neural network. The best values are underlined in order to highlight them. ROS, Random Over-Sampling; SDSAO, Selective Dynamic Sampling Approach using ROS; DyS, Dynamic Sampling; SDSAS, Selective Dynamic Sampling Approach applying SMOTE; SMOTE-TL, SMOTE and TL; SMOTE, Synthetic Minority Over-sampling Technique; SL-SMOTE, Safe-Level SMOTE; SMOTE-ENN, SMOTE and Editing Nearest Neighbor (ENN) rule; ADOMS, Adjusting the Direction Of the synthetic Minority clasS method; B-SMOTE, Borderline-SMOTE; DOS, Dynamic Over-Sampling; RUS, Random Under-Sampling; ADASYN, Adaptive Synthetic Sampling; SPIDER 1 and 2, frameworks that integrate a selective data pre-processing with an ensemble method; NCL, Neighborhood Cleaning Rule; TL, Tomek links method; STANDARD, back-propagation without any pre-processing; OSS, One-Sided Selection method; CNN, Condensed Nearest Neighbor; CNNTL, Condensed Nearest Neighbor with TL (for more details, see Sections 3 and 4).
In accordance with Table 2 and Figure 1, most over-sampling methods present a better classification performance than the standard BP with statistical significance. The under-sampling methods do not present a statistical difference with respect to the standard BP performance, and all dynamic sampling approaches improve the standard BP performance with statistical differences.
ADASYN, SPIDER-1 and SPIDER-2 (over-sampling methods) and RUS, NCL and TL (under-sampling methods) tend to improve the classification results, but they do not significantly improve the standard BP performance. OSS, CNN and CNNTL classify worse than the standard BP; nevertheless, these approaches do not show a statistical difference with respect to it.
SDSAO, SDSAS and DyS are statistically better than ADASYN, SPIDER-1 and SPIDER-2 (over-sampling methods) and also than all under-sampling approaches studied in this work. The DOS performance is better than that of CNN, CNNTL, OSS and TL, with a statistical difference.
Table 2 shows that ROS tends to perform better than the proposed methods (SDSAO and SDSAS), and that DyS shows a slight advantage over SDSAS; however, in accordance with the Holm and Shaffer post-hoc tests, there is no statistical difference in classification performance among these methods (see Figure 1).
In general terms, most over-sampling methods and dynamic sampling approaches deal successfully with the class imbalance problem, but with respect to the training dataset size, SDSAS, SDSAO and DyS use significantly fewer samples than the over-sampling approaches: about 78% fewer samples than most over-sampling methods. In addition, SDSAS, SDSAO and DyS use fewer samples even than the standard BP trained with the original training dataset, about 60% fewer; these facts stand out in Figure 2. However, the DyS method applies ROS in each epoch or iteration (see Section 4), whereas SDSA applies ROS or SMOTE only once, before ANN training (see Section 2).
Figure 2 shows that the under-sampling methods employ significantly fewer samples than the rest of the techniques (except for the dynamic sampling approaches with respect to RUS, NCL and TL); however, in most cases their classification performance is worse than that of the standard BP (without statistical significance) or is not better than that of the standard BP (with statistical significance).
On the other hand, the worst methods studied in this paper (in agreement with Table 2) are those based on the CNN technique (OSS, CNN and CNNTL), i.e., those that use a k-NN rule as the basis and achieve an important size reduction of the training dataset. In contrast, NCL, which also belongs to the k-NN family, improves the classification performance of the back-propagation; however, the dataset size reduction reached by this method is not of CNN's magnitude and, in addition, it only eliminates majority samples. The use of TL (TL and SMOTE-TL) seems to increase the classification performance, but it does not eliminate many samples (see Figure 2), except for CNNTL, where we consider that the large reduction of the training dataset cancels the positive effect of TL. SMOTE-ENN does not seem to improve the classification performance of SMOTE in spite of including a cleaning step that removes both majority and minority samples. The methods that enhance the classifier performance are those that only eliminate samples from the majority class.
Furthermore, consider only the selective sampling methods (SL-SMOTE, B-SMOTE, ADASYN, SPIDER-1 and SPIDER-2), i.e., those in which the most appropriate samples are selected to be over-sampled. According to the results presented in Figure 2, SL-SMOTE and B-SMOTE obtain the best results, whereas the advantages of ADASYN, SPIDER-1 and SPIDER-2 are not clear (RUS often outperforms these approaches, but without statistical significance; see Figure 1). The classification results of SL-SMOTE, B-SMOTE and the proposed method do not differ with statistical significance, but SDSA uses fewer samples in the training stage than SL-SMOTE and B-SMOTE (see Figure 2).
Focusing on the dynamic sampling approaches, SDSAO presents a slight performance advantage over DyS and SDSAS, whereas DOS does not seem to be an attractive method. However, the aim of DOS is to identify a suitable over-sampling rate, whilst reducing the processing time and storage requirements, as well as keeping or increasing the ANN performance, to obtain a trade-off between classification performance and computational cost.
SDSA and DyS improve the classification performance by including a selective process, but while DyS tries to reduce the over-sampling ratio during the training (i.e., it applies the ROS method in each epoch with different class imbalance ratios; see Section 4), SDSA only tries to use the "best samples" to train the ANN.
Dynamic sampling approaches are a very attractive way to deal with the class imbalance problem. They address two important issues: (i) improving the classification performance; and (ii) reducing the classifier computational cost.

Conclusions and Future Work
We propose a new Selective Dynamic Sampling Approach (SDSA) to deal with the class imbalance problem. It is attractive because it automatically selects the best samples to train the multilayer perceptron neural network with stochastic back-propagation. SDSA identifies the most appropriate samples ("average samples") to train the neural network; they are neither hard nor easy to learn, and they lie between the safe and border areas of the training space. SDSA employs a Gaussian function to give priority to the average samples during the neural network training stage.
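The Gaussian prioritization summarized above might be sketched as follows. This is only an illustration under stated assumptions: the per-sample quantity `delta` (a normalized difference between the target and the network output), the form exp(-(delta - mu)^2 / (2 sigma^2)), the default values of `mu` and `sigma`, and the random acceptance rule are all hypothetical choices made for the example, not a restatement of the exact procedure used in the experiments.

```python
import math
import random


def gaussian_priority(delta, mu=0.5, sigma=0.2):
    """Priority in (0, 1]: highest when delta equals mu (assumed 'average' samples)."""
    return math.exp(-((delta - mu) ** 2) / (2.0 * sigma ** 2))


def select_samples(deltas, mu=0.5, sigma=0.2, rng=None):
    """Keep sample i with probability equal to its Gaussian priority."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    return [i for i, d in enumerate(deltas)
            if rng.random() < gaussian_priority(d, mu, sigma)]


# Hypothetical errors: near 0 = easy sample, near 1 = hard sample.
priorities = [gaussian_priority(d) for d in (0.0, 0.5, 1.0)]
```

In this sketch, samples whose error is close to `mu` are almost always used in the current epoch, while very easy and very hard samples are used only rarely, which is what reduces the number of samples presented to the network.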
The experimental results in this paper indicate that SDSA is a successful method to deal with the class imbalance problem, and that its performance is statistically equivalent to that of other well-known over-sampling and dynamic sampling approaches. It is statistically better than the under-sampling methods compared in this work and also than the standard back-propagation. In addition, in the neural network training stage, SDSA uses significantly fewer samples than the over-sampling methods, and even fewer than the standard back-propagation trained with the original dataset.
Future work will extend this study. The interest is to explore the effectiveness of SDSA in multi-class and highly imbalanced problems and to find a mechanism to automatically identify the most suitable µ value for each dataset. An appropriate selection of the µ value might significantly improve the proposed method. In addition, it is important to explore the possibility of using SDSA to obtain optimal subsets to train other classifiers, such as support vector machines, or to compare its effectiveness with other kinds of class imbalance approaches using other learning models.

3.1. Under-Sampling Approaches
TL: Tomek links are pairs of samples a and b from different classes such that there does not exist a sample c with d(a, c) < d(a, b) or d(b, c) < d(a, b).
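A Tomek link joins two samples of different classes that are each other's nearest neighbor. A minimal sketch of how such pairs could be detected is given below; the function name, the Euclidean distance and the toy data are assumptions made for the example, and exact distance ties are resolved by taking the first nearest neighbor found, which is a simplification.

```python
import math


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    n = len(points)
    # Nearest neighbor of each point (excluding itself), by Euclidean distance.
    nearest = []
    for i in range(n):
        best = min((j for j in range(n) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
        nearest.append(best)
    links = []
    for i in range(n):
        j = nearest[i]
        # Mutual nearest neighbors with different class labels form a link.
        if i < j and nearest[j] == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# Toy two-class data (hypothetical, for illustration only).
toy_points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.2), (5.0, 5.0)]
toy_labels = [0, 1, 0, 0, 1]
toy_links = tomek_links(toy_points, toy_labels)
```

In the toy data, only the first two points form a link: they are mutual nearest neighbors and belong to different classes. An under-sampling variant would then remove the majority-class member of each detected pair.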

Figure 1. Results of the non-parametric Holm and Shaffer post-hoc tests. The filled circles mean that, for these particular pairs of classifiers, the null hypothesis was rejected by both tests. The circles are darkest at p-values close to zero, i.e., when the statistical difference is most significant.

Figure 2. Number of samples used in the training process by the studied methods in contrast to the Area Under the receiver operating characteristic Curve (AUC) average ranks obtained in the classification test. The x axis represents the average ranks (the best performing method should have a rank of one or close to this value). The ten-fold cross-validation method was used. The number shown on the y axis corresponds to the average training fold size.
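The average ranks on the x axis can be illustrated with a short sketch: on each dataset the methods are ranked by AUC (rank one for the highest AUC, tied methods sharing the mean of the ranks they span), and the ranks are then averaged over the datasets. The function name and the AUC values below are hypothetical and are not taken from Table 2.

```python
def average_ranks(scores):
    """scores[d][m] = AUC of method m on dataset d; returns the mean rank per method."""
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        # Method indices sorted by descending AUC (rank 1 = best).
        order = sorted(range(n_methods), key=lambda m: -row[m])
        pos = 0
        while pos < n_methods:
            # Group tied methods and give each the mean of the ranks they span.
            end = pos
            while end + 1 < n_methods and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + 1 + end + 1) / 2.0
            for k in range(pos, end + 1):
                totals[order[k]] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]


# Hypothetical AUC values: three datasets (rows) by three methods (columns).
example_auc = [[0.90, 0.85, 0.80],
               [0.70, 0.75, 0.75],
               [0.95, 0.60, 0.80]]
ranks = average_ranks(example_auc)
```

Here the first method obtains the lowest (best) average rank, and the other two methods end up with the same average rank because of the tie on the second dataset.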

end while
FORWARD(x_q):
for m = 0 to m < M do

Table 1. A brief summary of the main characteristics of the newly produced benchmarking dataset.