DTO-SMOTE: Delaunay Tessellation Oversampling for Imbalanced Data Sets

One of the significant challenges in machine learning is the classification of imbalanced data. In many situations, standard classifiers cannot learn how to distinguish minority class examples from the others. Since many real-world problems are imbalanced, this problem has become highly relevant and deeply studied. This paper presents a new preprocessing method based on Delaunay tessellation and the preprocessing algorithm SMOTE (Synthetic Minority Over-sampling Technique), which we call DTO-SMOTE (Delaunay Tessellation Oversampling SMOTE). DTO-SMOTE constructs a mesh of simplices (in this paper, tetrahedrons) for creating synthetic examples. We compare results with five preprocessing algorithms (GEOMETRIC-SMOTE, SVM-SMOTE, SMOTE-BORDERLINE-1, SMOTE-BORDERLINE-2, and SMOTE), eight classification algorithms, and 61 binary-class data sets. For some classifiers, DTO-SMOTE outperforms the other methods in terms of Area Under the ROC Curve (AUC), Geometric Mean (GEO), and Generalized Index of Balanced Accuracy (IBA).


Introduction
Imbalanced data sets [1] are among the most challenging pattern classification problems. To illustrate some applications in recent years, methods for dealing with class imbalance have been applied in different areas, such as Medicine [2], Agriculture [3], Computer Network Protection [4], analysis of social media [5], and Financial Risks [6,7]. Although researchers and professionals have intensely studied this problem, the challenge remains open. Most real-world data sets are naturally imbalanced [8], and classifiers have difficulty learning from them. Different techniques to solve or mitigate this problem emerge all the time [9][10][11][12]. Nevertheless, this problem is far from being addressed entirely.
Synthetic Minority Oversampling Technique (SMOTE) [13] is one of the most used preprocessing methods. It works by synthetically generating instances on the line segment joining two examples. SMOTE has several variations in the literature; however, they follow the same procedure for synthetic instance generation. This procedure biases SMOTE (and its variants) toward generating a chain trail pattern, discussed further in Section 3.2. As far as we know, this bias has gone unnoticed in the literature. We argue that this undesirable tendency may limit the effective use of SMOTE in some domains.
To overcome this problem, we propose a new preprocessing method that relies on constructing a proximity graph based on Delaunay tessellation. The Delaunay tessellation (also known as triangulation in the plane) is a fundamental computational geometry structure. It is the dual of the Voronoi diagram and provides a connection graph for nearby points in the space, forming a tessellation composed of a set of simplices that completely covers the space. The Delaunay graph is interesting in data analysis, as it can represent the geometry of a point set or approximate its density [14]. We use the fact that any point inside a simplex can be described as a linear combination of its vertices to generate new synthetic samples.
We name this new method Delaunay Tessellation Oversampling SMOTE (DTO-SMOTE). Like SMOTE [13], the main idea of DTO-SMOTE is to synthetically generate new instances, aiming to balance the training data distribution before applying the learning algorithms. The main difference is that, instead of creating synthetic examples on the line segment joining two instances, we generate synthetic examples inside a simplex selected from the Delaunay tessellation. The vertices of the simplex are instances of the data set. The creation of synthetic instances follows a Dirichlet distribution, and we can use the distribution parameters to control where instances are generated according to the classes of the vertices. This approach makes it possible to avoid SMOTE's chain trail pattern, allowing artificial instances to cover the input space better.
This paper substantially extends our previous research [15] by introducing a simplex selection step based on quality measures and a new data generation procedure, and by widely extending the empirical evaluation with other learning algorithms, baselines, and data sets. The main contributions of this paper are: • We point out (for the first time) the chain trail pattern formed by the artificial instances that SMOTE generates.
• We propose a new preprocessing technique, named Delaunay Tessellation Oversampling SMOTE (DTO-SMOTE), which uses Delaunay tessellation to build a simplex mesh. Then, we use simplex quality measures to select candidates (our previous study draws a simplex at random) for instance generation and use a Dirichlet distribution to control where synthetic instances are created inside a simplex (our former study uses the barycenter of the simplex).
• We conduct an extensive experimental evaluation with 61 bi-class data sets (our previous study only considers 15 binary data sets). This empirical comparison includes five preprocessing methods and ten learning algorithms (our former analysis only compares with SMOTE and uses kNN as the learning algorithm). It shows our approach's appropriateness in many situations, with better average performance on binary data sets.
The remainder of this paper is organized as follows: Section 2 presents issues related to class imbalance, some methods for dealing with them, and performance measures. Section 3 discusses the SMOTE family of preprocessing algorithms and their bias of generating synthetic examples in a chain trail pattern. Section 4 presents our new method, DTO-SMOTE, in detail. Section 5 presents the experiments and results. Section 6 discusses the achieved results and describes insights, comments, and future work directions.

The Imbalanced Data Set Classification Problem
Classification problems in which one of the classes has more instances than the others are called imbalanced classification problems. In practice, although a strict imbalance ratio threshold does not exist, when there is a significant difference in the number of cases in each class, the task is considered a class imbalance problem. Several authors in the literature acknowledge that this imbalance may degrade classification performance [16]. Learning from imbalanced data is an important issue, as these data sets frequently occur in real life, where practical problems are rarely balanced. This problem occurs in areas like Engineering, Information Technology, Bioinformatics, Medicine, Security, and Business Management, among others [16].
Facing these problems, researchers have proposed some techniques to deal with imbalanced data set classification problems. There are three primary directives in this area: Preprocessing methods, Cost-Sensitive Learning, and Ensembles [16]. Section 2.2 briefly describes these three approaches, and Section 2.1 discusses the evaluation of classification problems. Based on these measures, we can assess if specific methods are working or not to solve imbalanced problems.

Performance Evaluation
Choosing adequate performance metrics to evaluate imbalanced data set classification problems is crucial to select the right preprocessing method and classification algorithm [17]. In the sequence, we review the most used measures for imbalanced classification problems [16,18]. Without loss of generality, we consider two-class classification problems to present these measures. Figure 1 shows a confusion matrix in which the two classes are named Class 0 (Negative or Majority class) and Class 1 (Positive or Minority class). In this confusion matrix, C_R0 corresponds to the actual number of majority class instances, and C_R1 corresponds to the actual number of minority class instances. Furthermore, C_P0 corresponds to the number of instances predicted as the majority class, and C_P1 corresponds to the number of instances predicted as the minority class. True Positive (TP) means that a classifier correctly classified a minority (positive) class sample as positive. False Positive (FP) means that an example whose correct label is the majority (negative) class was misclassified as belonging to the minority class. False Negative (FN) means that an actual minority class instance was misclassified as belonging to the majority class. Finally, True Negative (TN) means that an actual majority class example was correctly classified as belonging to the majority class. Based on this table, accuracy is defined as the ratio between the total number of samples classified correctly and the total number of test samples submitted to the classifier. Equation (1) presents Acc, or Accuracy.

Considering that, in real-life applications, the positive class is the most important, Acc is generally inadequate for imbalanced classification problems. For instance, a trivial classifier could achieve 95% accuracy in a classification problem where the majority (negative) class instances account for 95% of the data and yet mispredict all positive class instances. To cope with these issues, Sensitivity or Recall (REC) and Specificity (SPE), Equations (2) and (3), respectively, consider the correct classification in each class separately. Here, the True Positive Rate, or TPrate, is equal to Recall (REC), and the True Negative Rate, or TNrate, is equal to SPE, as in Equations (2) and (3), respectively.
Recall or True Positive Rate (TPrate-Equation (2)) is the percentage of positive instances correctly classified, whereas Specificity or True Negative Rate (TNrate-Equation (3)) is the percentage of negative cases correctly classified.
False Positive Rate (FPrate-Equation (4)) is the percentage of misclassified negative instances, and False Negative Rate (FNrate-Equation (5)) is the percentage of misclassified positive instances.
There are situations where the focal point is to achieve high assertiveness for the positive class. In this case, Precision (PRE, Equation (6)) is more suitable. Precision is the percentage of instances predicted as positive that are correctly classified as positive.
Precision and Recall are two conflicting factors, where improving one may imply degrading the other. One measure that relates Precision and Recall is F1. Equation (7) shows the F1 score, which is the weighted harmonic mean of Precision and Recall.
A measure that tries to capture the trade-off between errors in both classes for imbalanced problems is the Geometric Mean (GEO, Equation (8)). This measure is associated with the ROC curve [19]. Note that, when REC and SPE have high values, GEO also has a high value. On the other hand, when one of them has a low value, GEO diminishes.
AUC, or Area Under the Curve, is related to the ROC (Receiver Operating Characteristic) curve [19]. This curve plots the TPrate against the FPrate for various thresholds of the positive class likelihood. For a single threshold, AUC can be calculated by Equation (9); for multiple thresholds, the trapezoidal rule is used [19].
The Generalized Index of Balanced Accuracy (IBA, Equation (10)), described in Reference [18], quantifies a trade-off between the GMean and a measure of how balanced the per-class accuracies are [18]. The parameter α trades off the influence of the dominance term (TPrate − TNrate) and the GMean. The authors recommend α ≤ 0.5; when α is set to 0, IBA = GMean².
In the present work, we adopt α = 0, as recommended by the authors who proposed the metric [18].
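As a concrete illustration, the measures above can be computed directly from confusion-matrix counts. The sketch below is our own illustrative code (the function name is hypothetical), not the paper's implementation; it follows the definitions of REC, SPE, GEO, and IBA, with the dominance term (TPrate − TNrate) weighted by α:

```python
import math

def imbalance_metrics(tp, fn, fp, tn, alpha=0.0):
    """Recall, Specificity, GEO, and IBA from confusion-matrix counts."""
    rec = tp / (tp + fn)              # True Positive Rate (Recall)
    spe = tn / (tn + fp)              # True Negative Rate (Specificity)
    geo = math.sqrt(rec * spe)        # Geometric Mean of the two rates
    dominance = rec - spe             # TPrate - TNrate
    iba = (1 + alpha * dominance) * rec * spe  # with alpha = 0, IBA = GEO**2
    return rec, spe, geo, iba
```

With α = 0, as adopted in this work, IBA reduces to the square of GEO.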

Methods for Dealing with Class Imbalance
Dealing with imbalanced data is a challenging task. Real-world problems are often imbalanced, and several techniques have been proposed in the literature to overcome this problem. In this section, we review some of them.

Preprocessing
Preprocessing occurs before the data set is submitted to the classifier training process [16,20]. The main advantage of these methods is that they are algorithm-agnostic: they do not alter the original classifier algorithm and can therefore be used together with any classifier. On the other hand, preprocessing can be cost-intensive in terms of processing time. The main techniques for preprocessing imbalanced data sets are resampling algorithms [16], which work to achieve a balanced data set as a result. There are three resampling categories: Oversampling, Undersampling, and Hybrid methods.
Oversampling: Oversampling creates synthetic or duplicate minority class samples [16,20] to match the number of samples of the majority class. As a result, the training data set becomes balanced before the training phase. The Synthetic Minority Oversampling Technique [13,21], discussed in Section 3, is the primary method in this category.
Undersampling: Undersampling discards some majority class instances to match the number of samples of the minority class. The primary method in this category is Random Under-Sampling (RUS) [16,20,22]. Nevertheless, there are some issues regarding these techniques: when data are deleted, important information can be discarded, leading to weak classifier training due to the lack of relevant information. To deal with this loss of information, some strategies were proposed in the literature. In Evolutionary Undersampling [23], undersampling is framed as a search problem for prototypes: an evolutionary algorithm reduces the number of instances of a data set while aiming not to lose a significant classification accuracy rate. Another interesting method is ACOSampling [24], which is based on ant colony optimization [25] in the search phase. It adopts this strategy to determine the best subset of majority class instances to keep in the training set.
Hybrid: The main idea of hybrid methods is to minimize the drawbacks of undersampling and oversampling, while taking their benefits into account, to achieve a balanced data set. To illustrate this combination, we can cite methods like SMOTE + Tomek Link [26], SMOTE + ENN [27], SMOTE-RSB [28], and SMOTE-IPF [29], which combine the SMOTE oversampling technique with different data cleaning methods to remove spurious artificial instances introduced in the oversampling phase, as well as data clustering followed by oversampling [30].

Cost-Sensitive Learning
The main idea behind this strategy is that misclassification costs are uneven for the majority and minority classes. It is necessary to alter the original classifier algorithm to implement this strategy, so that the algorithm considers the different misclassification costs. Reference [31] presents a recent review of cost-sensitive approaches.

Ensemble Learning
Ensemble methods combine two or more classifiers, resulting in one classifier with (potentially) better performance than each of them used separately [32]. However, standard ensembles alone cannot solve imbalanced problems [16,33]. The reason is that ensembles aim mainly to improve accuracy, and, for imbalanced data classification, accuracy is not adequate due to the prevalence of the majority class. So, the most common techniques to apply ensembles to imbalanced classification problems are variations of Bagging [34] and Boosting [33] especially developed for imbalanced data.

Bagging
Bagging [33] consists of training several classifiers with different sampled data sets, randomly drawn from the original one with replacement. The final classification uses a majority or weighted vote from the pool of classifiers [16,33]. One way to extend Bagging to imbalanced data is to train different classifiers with bootstrapped replicas of the original data set. To ensure that minority class instances are present in each sampled data set, stratified sampling is performed. Within each data set, oversampling methods are used to balance the classes [33].

Boosting
Boosting [33] differs from Bagging in that Boosting weights samples to measure classification difficulty in the learning phase. At the beginning of the process, all instances have the same weight, and classifiers are trained iteratively, changing these weights. Difficult examples receive higher weights than easier ones. To extend Boosting to imbalanced scenarios, oversampling methods are introduced in the process to balance the original data set in each iteration [33].

The SMOTE Family of Oversampling
Our method was developed to relieve a bias that the SMOTE oversampling method suffers from, together with its many variations that reuse SMOTE's procedure of generating a synthetic instance by interpolation within the line segment joining two instances of the data set. In this section, we review the SMOTE family of oversampling methods.
The preprocessing technique SMOTE [13,21] is an oversampling technique whose main goal is to artificially generate new instances of the minority class by interpolating pairs of neighboring instances. The target is to achieve a balanced distribution among all the classes in the data set. To do that, SMOTE selects one minority class instance i and calculates its k minority class nearest neighbors. The generation of a new synthetic instance consists of interpolating on the line segment formed by the selected sample i and a random instance j, which lies in the k minority nearest neighborhood of i. In the original paper [13], the parameter k is set to 5. The generation of a new instance follows Equation (11), x_new = x_i + r (x_j − x_i), where: • x_new is the new instance vector; • x_i is the feature vector of instance i; • x_j is the feature vector of instance j; • r is a random number between 0 and 1.
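The interpolation step above can be sketched in a few lines of Python; the function name and fixed random seed are our own illustrative choices, not part of any SMOTE implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(x_i, x_j):
    """Generate one synthetic instance on the segment joining x_i and x_j,
    following x_new = x_i + r * (x_j - x_i), with r drawn uniformly in [0, 1)."""
    r = rng.random()
    return x_i + r * (x_j - x_i)

# The synthetic point lies on the diagonal segment between the two instances.
x_new = smote_sample(np.array([0.0, 0.0]), np.array([1.0, 1.0]))
```

Repeatedly drawing pairs from a small neighborhood produces points confined to a few such segments, which is precisely the chain trail pattern discussed in Section 3.2.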
In 2018, SMOTE turned 15 years old, and Reference [21] presents an extensive overview of its applicability and of the variants based on it. We briefly describe some of these methods next.

Borderline-SMOTE
Borderline-SMOTE [35] is a variation based on the observation that samples of the minority class that are far from the majority class boundary may contribute less to building the classifier than samples on the border. Thus, Borderline-SMOTE preferentially generates synthetic instances that lie near the decision border. Han et al. [35] describe that data set examples on and near the borderline are misclassified more often than cases in other regions. Here, borderline means a region where minority class examples are close to the majority ones.

SMOTE-SVM
SMOTE-SVM [36] generates artificial samples along the decision boundary. Like Borderline-SMOTE, SMOTE-SVM assumes that the decision boundary is the best place to create new synthetic samples, as this region is the most critical one for the training process. However, SMOTE-SVM uses support vectors to determine the decision boundary.

ADASYN
The basic idea of ADASYN [37] is to use a density distribution of each class's instances as a criterion to automatically choose the number of synthetic minority samples that need to be generated. Considering a binary-class data set, let m_s and m_l be the number of minority class and majority class samples, respectively. The degree of class imbalance d is calculated by Equation (12): d = m_s / m_l.
The number of minority class samples to be generated is calculated using Equation (13), G = (m_l − m_s) × β, where G is the number of artificial minority class instances to be created, and β is the desired balance coefficient. When β = 1, the target is a fully balanced data set.
The ratio r_i around each minority class instance x_i in Equation (14), r_i = ∆_i / K, is defined over the K nearest neighbors of x_i computed with the Euclidean distance, where ∆_i is the number of examples among the K nearest neighbors of x_i that belong to the majority class.
Normalizing r_i results in r̂_i, calculated by Equation (15): r̂_i = r_i / ∑ r_i.
After that, for each minority class instance x_i, the number of synthetic data samples g_i that need to be generated is calculated by Equation (16): g_i = r̂_i × G.
Finally, for each x_i, g_i new synthetic samples are generated by Equation (17), s_i = x_i + (x_zi − x_i) × λ, where s_i is a new synthetic sample, x_zi is a randomly chosen minority class sample, and λ is a random number between 0 and 1.
After all these steps, the output is a balanced data set.
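The per-instance allocation of Equations (12)–(16) can be sketched as follows, assuming the standard ADASYN formulation; the function and argument names are our own (`delta` holds ∆_i for each minority instance):

```python
import numpy as np

def adasyn_allocation(delta, K, m_s, m_l, beta=1.0):
    """Distribute synthetic samples among minority instances (ADASYN)."""
    d = m_s / m_l                             # degree of imbalance, Equation (12)
    G = (m_l - m_s) * beta                    # total synthetic samples, Equation (13)
    r = np.asarray(delta, dtype=float) / K    # ratios, Equation (14)
    r_hat = r / r.sum()                       # normalized ratios, Equation (15)
    g = np.rint(r_hat * G).astype(int)        # per-instance counts, Equation (16)
    return d, G, g
```

Minority instances whose neighborhoods contain many majority examples (large ∆_i) receive proportionally more synthetic samples.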

Geometric SMOTE
Geometric SMOTE [38] generates synthetic samples in a geometric region called a hyper-spheroid. There are three main steps in this method. First, it creates new synthetic examples on the surface of a hyper-sphere centered at an original minority class example and covering only minority class instances, ensuring that the new examples are not noise. Second, it increases the variety of generated examples by expanding the respective hyper-spheroid around the minority class instances, applying hyper-sphere expansion factors. Finally, Geometric SMOTE can use geometric transformations, such as translation, deformation, and truncation of the hyper-spheroids, to cover other areas of the space.

Manifold-Based Synthetic Oversampling
Manifold-based synthetic oversampling [39] transforms the original data set into a manifold structure and generates instances in the manifold space. Finally, the algorithm maps the synthetic samples back to the feature space. Aiming to improve SMOTE, the manifold oversampling method requires that the data set conform to the manifold property [39]. If this requirement is fulfilled, a manifold embedding method can be used to induce a manifold representation of the minority class.
Then, oversampling is applied in this embedded space. The most common manifold methods are Principal Component Analysis (PCA) and Denoising Autoencoders (DAE), according to Bellinger [39].

The SMOTE Chain Trail Pattern Bias
In general, SMOTE variants differ in where the synthetic instances are generated. However, they are similar in how these synthetic instances are created. In this section, we argue that SMOTE and other methods based on its artificial instance generation process have an undesirable bias: the synthetic instances follow a chain trail pattern. The reason is related to the pairwise interpolation process, in which artificial examples are generated on the line segment joining two examples. For highly imbalanced data sets, the interpolation process has a high likelihood of selecting the same pair, as the selection procedure uses the neighborhood of the instances, and an instance can be selected as a seed several times. Therefore, the generation process creates synthetic examples within the same line segment, and a chain trail pattern emerges. Recently, Reference [40] noticed that SMOTE has a bias where instances are placed inward, diverging from the original class distribution, but did not correlate this with the synthetic data generation process. As far as we know, this chain trail pattern has gone unnoticed in the literature.
On the other hand, although the chain trail pattern is not present in Geometric SMOTE, its synthetic data generation also leads to an undesirable bias. The artificial data generation process considers hyper-spheres centered on some instances, where the radii may grow. Furthermore, the seed candidates are instances far from the decision boundary, and the bias pattern follows these hyper-spheres. Therefore, Geometric SMOTE also has a bias related to the synthetic data generation procedure.
As we discuss in the next section, our approach interpolates over a simplex from a simplex mesh and thus is not restricted to a line segment, avoiding the chain trail pattern (as can be seen in Figure 2g). We also use simplex mesh quality measures to spread the synthetically generated data over the instance space.

Simplex Geometry
In computational geometry, triangulation is one of the main primitives in two-dimensional space. It is often used to determine the vicinity of a point by forming triangles with other nearby points. For higher-dimensional spaces, the analogue of a triangulation is a tessellation, and the analogue of a triangle is a simplex.
The Delaunay triangulation of a set of points P in the two-dimensional Euclidean space is defined as follows: the Delaunay triangulation DT(P) is a triangulation such that no point in P is inside the circumscribed circle of any triangle in DT(P). The definition extends to higher-dimensional spaces, where no point in P is inside the circumscribed hyper-sphere of any simplex in DT(P).
The Delaunay tessellation has some interesting properties. In the plane, the triangulation is a maximal planar graph and completely divides the space into triangles (provided no four points are co-circular). The center of the circumsphere of a Delaunay polyhedron is a vertex of the Voronoi cells. Figure 3 shows an example of a Voronoi diagram (in grey) and a Delaunay tessellation (in orange) for a two-dimensional space. The closest neighbor P_c to any point P_x lies on the edge P_c P_x of the Delaunay triangulation, as the nearest-neighbor graph is a subgraph of the Delaunay triangulation. This unique set of neighboring points defines the neighborhood of the point and represents a parameter-free definition of a point's surroundings. The triangulation maximizes the minimum angle of the triangles, avoiding the occurrence of sliver triangles. A sliver triangle is a triangle with one or two extremely acute angles, hence with a long/thin shape. Intuitively, the Delaunay tessellation is the group of simplices that are most regular in shape, compared to any other triangulation. Delaunay tessellation may be used to approximate a manifold by a simplex mesh [41,42], in computational geometry [43,44], and for planning in automated driving [45], among others. For a D-dimensional space, a simplex has D + 1 points. For instance, in the plane, a simplex has three points (a triangle), while, in a three-dimensional space, a simplex has four points (a tetrahedron).
Each point of a simplex is called a vertex. Alternatively, a simplex S can be described as the convex hull of its vertices, that is, the set of all points p ∈ R^D that can be expressed as a convex combination of its vertices s_i. The boundary of S consists of faces, which are simplices of lower dimension composed of a subset of the simplex vertices. Furthermore, any point inside a simplex can be expressed as a combination of its vertices as p = Sx. The vector x is called the barycentric coordinates of the point p with respect to the simplex S [46].
For a set P of points, different algorithms can be used to compute DT(P). A reasonably general approach for computing the tessellation in a D-dimensional space consists of converting the tessellation problem into finding the convex hull of P in (D + 1)-dimensional space, by giving each point p an extra coordinate equal to |p|². The computational complexity of this approach is O(n^⌈d/2⌉). This paper uses the Delaunay tessellation algorithm implemented in the Python SciPy package [47].
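For instance, the SciPy implementation can be exercised on a small arbitrary point set (the coordinates below are only an example):

```python
import numpy as np
from scipy.spatial import Delaunay

# Four non-coplanar points in 3-D: the tessellation is a single tetrahedron.
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
tess = Delaunay(points)
# Each row of `simplices` holds the indices of the D + 1 vertices of one simplex.
print(tess.simplices.shape)  # (1, 4): one tetrahedron with four vertices
```

For larger point sets, `tess.simplices` enumerates all tetrahedrons of the mesh, which is the structure DTO-SMOTE samples from.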

Mesh Generation
A data set is composed of a set of instances (X, Y), where X is an n-dimensional feature matrix and Y is the vector of class labels. Each x_i = {x_i,1, x_i,2, · · ·, x_i,n} represents the feature values, and y_i ∈ Y corresponds to the class value of instance i. The number of instances in the data set is m.
The first step of Delaunay Tessellation Oversampling (DTO-SMOTE) is to perform a dimensionality reduction. This operation reduces the n-dimensional feature matrix X to a three-dimensional space X_p3. The reasons for using a dimensionality reduction are twofold: first, for m points in n dimensions, a generic mesh contains O(m^⌈n/2⌉) simplices, and n + 1 points compose each simplex; thus, we would have many simplices to interpolate, and many vertices would compose each simplex. Second, the computational complexity of mesh building increases with the data dimension. Hence, projecting the high-dimensional data onto a lower-dimensional space makes the mesh building more manageable. As we project into a 3D space, the mesh is formed by tetrahedrons, and the interpolation process is carried out within four instances. However, the process is generic, and other dimensions can be used (including not applying any feature reduction at all). Any dimensionality reduction technique can be used. However, it should be kept in mind that this dimensionality reduction is only used for the mesh generation: the artificial instances are generated in the original space, and the classifiers' training is also performed in the original space. A similar procedure was used in Reference [39] to immerse the data into a lower-dimensional manifold space.
The next step consists of running the Delaunay tessellation algorithm on X_p3; in other words, we calculate DT(P), where P = X_p3. The result of this process is a set of tetrahedrons whose vertices lie in X_p3. Synthetically generating a new instance corresponds to an interpolation using the vertices of a tetrahedron. To this end, our algorithm randomly selects a tetrahedron and a point p inside this tetrahedron in barycentric coordinates and uses p for interpolation. As there is a one-to-one mapping between the barycentric coordinates and the Euclidean coordinates, the new instance's generation corresponds to an interpolation of the points of the selected tetrahedron in the original n-dimensional X-space.
For selecting a tetrahedron for interpolation, our algorithm computes quality indices for each tetrahedron. Furthermore, to consider the class neighborhood of the instances, we use the ratio of vertices of the tetrahedron associated with the minority class to weight the quality indices and turn them into probabilities. These probabilities are then used to randomly select tetrahedrons with replacement (this is necessary because some data sets do not produce enough tetrahedrons to generate the adequate number of artificial samples). The generation of the point p inside a tetrahedron follows a Dirichlet distribution. For tetrahedrons with vertices belonging to majority class instances, the algorithm adjusts the distribution parameters according to the class associated with each vertex. This step assigns higher probabilities to vertices and facets related to the minority class, similar to References [35,36]. These steps are explained in detail in the next sections.

Tetrahedral Quality Evaluation
Given a Delaunay tessellation DT(P), our approach selects a tetrahedron from DT(P) to generate a new instance. In principle, we can randomly choose a tetrahedron of DT(P). However, the tetrahedrons have different shapes. Furthermore, the tetrahedrons have different class labels associated with their vertices. Our method uses the shape and class distribution to guide the selection of the tetrahedrons.
The shapes of the tetrahedrons are generally associated with the quality of the tessellation [48]. Smaller tetrahedrons with a regular shape usually emerge from dense areas of the input space [14]. On the other hand, bigger tetrahedrons, or tetrahedrons with very acute angles, appear near the boundaries.
Maur [48] proposes different tetrahedral quality measures. The main idea is to quantify how far a tetrahedron is from the equilateral tetrahedron. To answer this question, the following quality measures are proposed: Relative Volume, Radius Ratio, Solid Angle, Minimum Solid Angle, Maximum Solid Angle, Edge Ratio, and Aspect Ratio, explained next.
Relative Volume: The relative volume of the current tetrahedron is computed as its real volume divided by the value of maximal volume in the tessellation [48]. Radius Ratio: The radius ratio is the weighed ratio between the radius of the inscribed sphere (r) to the radius of the circumscribed sphere (R), as shown in Equation (18).
Solid Angle: The solid angle is the area of a spherical triangle created on the unit sphere in which the center is in the tetrahedron vertex [48]. We compute the sum of four solid angles of the tetrahedron. Minimum Solid Angle: This returns the minimum solid angle instead of their sum. Maximum Solid Angle: This returns the maximum solid angle instead of their sum. Edge Ratio: The edge ratio computes the ratio between the length of the most prolonged edge E to the length of the shortest edge of the tetrahedron, as shown in Equation (19).
Aspect Ratio: The aspect ratio computes the ratio between the radius of the sphere that circumscribes the tetrahedron (R) and the length of the longest edge (E), as shown in Equation (20).
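As an illustration of one of these measures, the radius ratio can be computed as sketched below. Since Equation (18) is not reproduced here, the normalization 3r/R (which equals 1 for the regular tetrahedron and approaches 0 for degenerate shapes) is an assumption on our part, not necessarily the paper's exact formula:

```python
import numpy as np

def radius_ratio(verts):
    # verts: (4, 3) array of tetrahedron vertex coordinates.
    # Returns 3r/R (assumed normalization): 1 for the regular tetrahedron.
    a, b, c, d = verts
    # Volume from the scalar triple product.
    vol = abs(np.dot(b - a, np.cross(c - a, d - a))) / 6.0
    # Total area of the four triangular faces.
    faces = [(a, b, c), (a, b, d), (a, c, d), (b, c, d)]
    area = sum(np.linalg.norm(np.cross(q - p, s - p)) / 2.0 for p, q, s in faces)
    r = 3.0 * vol / area  # inscribed-sphere radius
    # Circumcenter x solves 2(v - a) . x = |v|^2 - |a|^2 for v in {b, c, d}.
    M = 2.0 * np.array([b - a, c - a, d - a])
    rhs = np.array([b @ b - a @ a, c @ c - a @ a, d @ d - a @ a])
    R = np.linalg.norm(np.linalg.solve(M, rhs) - a)  # circumscribed radius
    return 3.0 * r / R
```

For the regular tetrahedron with vertices (1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1), the ratio evaluates to 1.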
The tetrahedron quality measure to use is a hyperparameter of our algorithm. For each tetrahedron of the tessellation, we compute the selected quality measure and weight it by the proportion of its vertices that belong to the minority class, for which we want to generate new synthetic instances. If all four vertices belong to the majority class, the quality measure is multiplied by zero, and the tetrahedron is not taken into account for synthetic instance creation. On the other hand, if all four vertices belong to the minority class, the quality measure is multiplied by one. Tetrahedrons with mixed-class vertices are weighted accordingly; for instance, if two of the four vertices belong to the minority class, the computed quality measure is multiplied by 0.5.
These weighted tetrahedron quality measures are then normalized to sum to 1 so that they represent a probability distribution. To select tetrahedrons for interpolation, we then use weighted sampling with replacement, where the probabilities come from the tetrahedrons' quality measures, as described earlier.
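This weighting-and-sampling step can be sketched as follows (the names `simplex_probabilities`, `quality`, and `minority_fraction`, and all numeric values, are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def simplex_probabilities(quality, minority_fraction):
    # quality: per-tetrahedron quality measure (e.g. the solid angle);
    # minority_fraction: fraction of the 4 vertices in the minority class.
    w = np.asarray(quality) * np.asarray(minority_fraction)
    return w / w.sum()  # normalize so the weights sum to 1

quality = [0.9, 0.4, 0.7, 0.2]
minority_fraction = [1.0, 0.5, 0.0, 0.75]  # all-majority tetrahedron -> weight 0
p = simplex_probabilities(quality, minority_fraction)
# Weighted sampling with replacement, since some data sets yield few tetrahedrons.
chosen = rng.choice(len(p), size=10, replace=True, p=p)
```

Note that the all-majority tetrahedron (index 2) receives probability zero and can never be chosen.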

Synthetic Instance Generation
Once a tetrahedron has been selected, the synthetic instance generation process requires drawing a random point inside the tetrahedron. To this end, we use a Dirichlet distribution. This distribution is parametrized by a p-dimensional vector α of positive reals. A random draw from a Dirichlet distribution returns a p-dimensional vector (d_1, ..., d_p), where d_i ∈ (0, 1) and ∑_{i=1}^{p} d_i = 1. For the symmetric Dirichlet distribution, when α_i = 1 for all i ∈ 1..p, each point inside the simplex has an equal probability of being chosen. On the other hand, when α_i is a positive constant greater than 1, the distribution favors points near the simplex center. For a nonsymmetric distribution, different values of α_i can be used to control where points are likely to be generated; for the sake of visualization, Table 1 shows density plots of the three-dimensional Dirichlet distribution for different values of α. Given a tetrahedron, we use the parameter vector α to establish the region where the new point is likely to be generated. If all the tetrahedron's vertices are associated with minority class instances, then we use a symmetric Dirichlet distribution to draw the new instance; depending on the value of α, we can draw a point uniformly in the simplex (α_i = 1) or closer to the tetrahedron barycenter (α_i = c, c > 1). If not all tetrahedron vertices are associated with the minority class, the distribution is asymmetric, with α_i = 1 if the class of vertex i is the majority class and α_i = c, c > 1, otherwise, where c is a hyperparameter that can be set by the user.
In summary, let ∆_k = {a, b, c, d} be the index set of the instances that are vertices of a tetrahedron k selected as described in Section 4.3. We draw a random vector λ = (λ_a, λ_b, λ_c, λ_d) ~ Dir(α), where α_i = c if the class of vertex i is the minority class and α_i = 1 otherwise. Then, a new instance is generated as x_new = λ_a x_a + λ_b x_b + λ_c x_c + λ_d x_d, so that x_new can be interpreted as the weighted mean of the instances that are vertices of the tetrahedron ∆_k, with the weights λ drawn from the Dirichlet distribution.
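This draw can be sketched with NumPy's Dirichlet sampler (`draw_point` and its arguments are illustrative names, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

def draw_point(vertices, vertex_is_minority, c=7.0):
    # vertices: (4, d) tetrahedron vertex coordinates;
    # vertex_is_minority: boolean flag per vertex; c > 1 pulls the draw
    # toward the minority-class vertices.
    alpha = np.where(vertex_is_minority, c, 1.0)
    lam = rng.dirichlet(alpha)         # barycentric weights in (0, 1), summing to 1
    return lam @ np.asarray(vertices)  # x_new = sum_i lam_i * x_i

# Unit tetrahedron: the origin plus the three canonical basis vectors.
verts = np.vstack([np.zeros(3), np.eye(3)])
x_new = draw_point(verts, [False, True, True, False])
```

Since the weights are a convex combination, the generated point always lies inside the tetrahedron.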

Method Description
Here, we present the steps to build DTO-SMOTE; Algorithm 1 presents the algorithm. The main steps are outlined next.
1. Using the compression technique, reduce the feature space to a three-dimensional space, resulting in X_p^3.
2. Construct the Delaunay tessellation DT(X_p^3).
3. Calculate a weight for each simplex as W = (tetraprops × tetrastats), where tetraprops is the proportion of the simplex's vertices belonging to the minority class and tetrastats is the quality index calculated for the simplex, according to the selected quality measure.
4. Normalize the weights to sum to 1, so that they represent probabilities.
5. Randomly choose, with replacement, a simplex (tetrahedron) from DT(X_p^3), according to the probabilities calculated in step 4.
6. Once a simplex has been selected, generate a new sample x_new using a Dirichlet distribution, as described in Section 4.4.
7. Repeat steps 5 and 6 until the number of samples in the minority class matches the number of instances in the majority class.
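The steps above can be sketched end to end as follows. This is an illustrative implementation, not the authors' code: it uses SciPy's Delaunay tessellation and PCA from scikit-learn, substitutes the relative volume for the solid-angle quality measure to keep the sketch short, and interpolates in the original feature space (our reading of Section 4.4).

```python
import numpy as np
from scipy.spatial import Delaunay
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def dto_smote(X, y, minority_label, alpha_c=7.0):
    # Compress the feature space to three dimensions.
    X3 = PCA(n_components=3).fit_transform(X)
    # Delaunay tessellation of the compressed points.
    simplices = Delaunay(X3).simplices  # (n_simplices, 4) vertex indices
    # Weight each tetrahedron: minority-vertex proportion x quality index
    # (relative volume stands in for the solid angle in this sketch).
    vols = np.abs(np.linalg.det(X3[simplices[:, 1:]] - X3[simplices[:, :1]])) / 6.0
    quality = vols / vols.max()
    minority_frac = (y[simplices] == minority_label).mean(axis=1)
    w = quality * minority_frac
    # Normalize the weights into probabilities.
    p = w / w.sum()
    # Sample tetrahedrons with replacement and interpolate with Dirichlet
    # weights until the classes are balanced.
    n_new = int(np.sum(y != minority_label) - np.sum(y == minority_label))
    new_rows = []
    for k in rng.choice(len(simplices), size=n_new, replace=True, p=p):
        idx = simplices[k]
        alpha = np.where(y[idx] == minority_label, alpha_c, 1.0)
        lam = rng.dirichlet(alpha)
        new_rows.append(lam @ X[idx])  # weighted mean of the vertex instances
    X_out = np.vstack([X] + new_rows)
    y_out = np.concatenate([y, np.full(n_new, minority_label)])
    return X_out, y_out

# Toy imbalanced data set: 40 majority vs. 10 minority instances.
X = rng.normal(size=(50, 5))
y = np.array([0] * 40 + [1] * 10)
X_bal, y_bal = dto_smote(X, y, minority_label=1)
```

After oversampling, both classes contain the same number of instances.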

Experiments and Results
This section describes the experimental evaluation carried out to assess our new preprocessing technique. The experimental protocol is depicted in Figure 4. Experiments are conducted using 5-fold cross-validation. We opted for 5-fold instead of 10-fold cross-validation because, in imbalanced data sets, using 10 folds may lead to test sets with very few instances from the minority class. As suggested in Reference [49], preprocessing methods are applied only to the training set. Considering the random nature of DTO-SMOTE and the other oversampling methods used to build artificial data, we repeat the entire process three times, i.e., we ran 3 × 5-fold cross-validation.
We compare our method with the preprocessing techniques SMOTE [13], SMOTESVM [36], BORDERLINE SMOTE 1 [35], BORDERLINE SMOTE 2 [35], and GEOMETRIC SMOTE [38]. All these methods are executed with their default parameters. For running SMOTE, SMOTESVM, BORDERLINE SMOTE 1, and BORDERLINE SMOTE 2, we use the imbalanced-learn package [22]. For GEOMETRIC SMOTE, we use the implementation of the author, available at https://geometric-smote.readthedocs.io/. Furthermore, we also compare results without any preprocessing. Table 2 presents the aliases used for showing the results.
To evaluate our method's effectiveness, we compare the original and oversampled data sets with eight different learning algorithms, as presented in Table 3. All classifiers are available in the SciKit-Learn package and the Intel Python Distribution, at http://scikit-learn.org/ and https://software.intel.com/en-us/distribution-for-python, respectively, and were used with the default parameters described in Table 3.
Experiments were performed with 61 bi-class data sets. Table 4 shows the characteristics of the bi-class data sets used in our experiments.
For each data set, the tables present the Imbalance Ratio (IR), the number of samples (Samples), and the number of features (Features).
We use the Principal Component Analysis (PCA) algorithm to compress the data sets before applying the Delaunay Tessellation. This choice was motivated due to its simplicity and extensive use. However, any method that compresses the data to a lower dimension can be used instead of PCA. Furthermore, PCA was also used in Reference [39] for dimensionality reduction for manifold embedding.

Influence of Parameters
Our method has two parameters: the tetrahedron quality measure and the Dirichlet parameter α. The quality measure influences which tetrahedrons are selected for interpolation, while α influences where the interpolation occurs within a selected tetrahedron. We have experimented with all the tetrahedron quality measures described in Section 4.3. For the parameter α, we experimented with values ranging from 1 to 9.5 with step size 0.5. In total, we made 3 × 18 × 61 = 3294 runs on the bi-class data sets for each of the eight classifiers used in the experiments. Figure 5 depicts the influence of these parameters. The graphs show the variation in rank as a function of the α parameter for each tetrahedron quality measure, averaged over all the data sets. Results are homogeneous among the different metrics (AUC, GEO, and IBA for binary data sets). As can be seen, the quality measure based on the solid angle produces better results. Furthermore, α larger than four also produces good average results.
With these results, we chose the default values of α and geometry for the DTO-SMOTE algorithm. The selected value of α for bi-class problems is 7, and the selected quality measure is the solid angle. Therefore, we use the solid angle as the tetrahedron quality measure and α = 7 for bi-class problems in the comparison with the other SMOTE methods.

Experimental Results for bi-class Data Sets
For binary-class data sets, we evaluate three performance measures: Area Under the ROC curve (AUC), Geometric Mean (GEO), and Index of Balanced Accuracy (IBA). In this work, we set the α parameter of IBA to 0, as recommended by the authors.
Due to space limitations, we did not include in the paper numerical results for each data set. However, all the results are publicly available at https://github.com/carvalhoamc/DTO-SMOTE.
To draw general conclusions about our method's performance, we present average rank performance and statistical analysis.
To evaluate whether the differences among methods are statistically significant, we use the non-parametric Friedman multiple-comparison test (Table 5). The Friedman test is the non-parametric equivalent of the two-way ANOVA. Its null hypothesis states that all the algorithms are equivalent, so rejecting this hypothesis implies significant differences among the algorithms. When the Friedman test rejects the null hypothesis, we can proceed with a post-hoc test to detect the significant differences among the methods. For this, we used Shaffer post-hoc multiple comparisons to control the family-wise error. The results of the statistical tests are presented in the form of diagrams of significant differences. In these diagrams, results are ordered by decreasing performance, with the best algorithms placed to the left. A thick line joining two methods indicates that there is no statistically significant difference between them. All the results are based on the average performance of 3 runs of 5-fold cross-validation.
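As an illustration only (the numbers below are invented, not the paper's results), the Friedman test is available in SciPy:

```python
from scipy.stats import friedmanchisquare

# Hypothetical AUC values of three oversamplers on five data sets.
smote = [0.91, 0.85, 0.88, 0.79, 0.93]
dto   = [0.93, 0.88, 0.90, 0.82, 0.94]
geo   = [0.90, 0.84, 0.87, 0.80, 0.92]

stat, p_value = friedmanchisquare(smote, dto, geo)
# Reject the null hypothesis "all methods are equivalent" when p_value < 0.05,
# then run a post-hoc procedure (e.g. Shaffer) on the pairwise comparisons.
```

Each list holds one method's score per data set, so the test compares methods across matched data sets, exactly the blocked design the Friedman test assumes.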
In terms of AUC (Table 6), DTO-SMOTE performed better in terms of global rank with 7 of the eight classification algorithms (Table 3). Only with the KNN algorithm did our method not obtain the best average performance; for KNN, DTO-SMOTE was in second place, slightly behind the SMOTE algorithm. Regarding the critical differences diagram (Table 5), DTO-SMOTE is in first place with 7 of the 8 algorithms, significantly outperforming the data set without treatment.
In terms of GEO (Table 7), DTO-SMOTE performed better in terms of rank with ABC, DTREE, LRG, and RF (Table 3). There was a tie between DTO-SMOTE and SMOTE for the SVM algorithm. For KNN, MLP, and SGD, SMOTE presented better performance. The winner in terms of global rank for bi-class problems under the GEO measure was the DTO-SMOTE algorithm. Again, DTO-SMOTE significantly outperforms the data set without treatment in terms of GEO (Table 5).
Regarding IBA (Table 8), DTO-SMOTE has a better average rank with the ABC, LRG, RF, and SVM classifiers (Table 3). With DTREE, KNN, MLP, and SGD, SMOTE had the best rank performance. In terms of average rank, SMOTE is in first place, followed by DTO-SMOTE in second place. DTO-SMOTE is significantly better than no treatment, as can be seen in Table 5.
In general, our method (DTO-SMOTE) showed a better average performance when used as a preprocessing technique for several classifiers. It was ranked first or second in at least one performance measure for all eight learning algorithms. Furthermore, it has the best overall rank performance in terms of AUC and GEO, and the second best (with a lower standard deviation) in IBA. These results are a strong indicator of the utility of our proposed method for imbalanced bi-class classification problems. Table 5 shows all the pairwise comparisons of the data methods evaluated on bi-class data sets in terms of AUC, GEO, and IBA. Although there are few statistical differences between our proposed method (DTO-SMOTE) and the other SMOTE variants, our approach is ranked first with ABC, LRG, and RF for all three performance measures (AUC, GEO, and IBA). It is ranked first with DTREE for AUC and GEO, and first with SVM for AUC and IBA. For KNN, it is ranked second for all three measures and, for MLP and SGD, it is ranked second for GEO and IBA. Furthermore, DTO-SMOTE is statistically better than Original for all classifiers.
Table 6. Area Under the ROC curve (AUC) rank (solid angle and α = 7.0); see Table 5.
Table 7. Geometric Mean (GEO) rank (solid angle and α = 7.0); see Table 5.
Table 8. Index of Balanced Accuracy (IBA) rank (solid angle and α = 7.0); see Table 5.

Conclusions
This paper presents DTO-SMOTE, an oversampling algorithm that generates synthetic instances based on the Delaunay tessellation. DTO-SMOTE is an evolution of our previous work, presented in Reference [15]. In that study, we evaluated an initial version of our preprocessing method considering only the KNN learning algorithm. Furthermore, our previous version did not consider the shape of the tetrahedrons and used their barycenters for interpolation. This new version uses tetrahedron quality indices to select tetrahedrons, and a Dirichlet distribution to randomly choose a point for interpolation. We conducted a sizeable experimental study comparing our method with five SMOTE variations, evaluated with eight learning algorithms, 61 binary data sets, and three different performance measures.
Results indicate a better average performance when the method is used for class-imbalanced classification problems. According to the results presented, the main advantage is that DTO-SMOTE proved, on average, more effective than the other oversampling algorithms. In comparison to the other algorithms, DTO-SMOTE can be a little slower, as it needs to create a three-dimensional mesh. In future work, we plan to study imbalanced data set complexity by analyzing the simplex mesh and its quality indices, and we intend to assess whether refining the mesh can lead to better classification performance.
Our goal is to better understand the data set characteristics that lead to an increase in performance, as the simplex geometry could be linked to the data's local density [14]. We also plan to investigate the SMOTE bias theoretically. Such an analysis could reduce the effort of selecting a proper method and parameters for specific data sets. Our implementation, and all results and scripts for analysis, are publicly available on the Internet.