Article

Customized Instance Random Undersampling to Increase Knowledge Management for Multiclass Imbalanced Data Classification

by Claudia C. Tusell-Rey ¹, Oscar Camacho-Nieto ², Cornelio Yáñez-Márquez ¹,* and Yenny Villuendas-Rey ²,*
¹ Instituto Politécnico Nacional, Centro de Investigación en Computación, Av. Juan de Dios Bátiz s/n, GAM, Ciudad de México 07700, Mexico
² Instituto Politécnico Nacional, Centro de Innovación y Desarrollo Tecnológico en Cómputo, Av. Juan de Dios Bátiz s/n, GAM, Ciudad de México 07700, Mexico
* Authors to whom correspondence should be addressed.
Sustainability 2022, 14(21), 14398; https://doi.org/10.3390/su142114398
Submission received: 27 September 2022 / Revised: 18 October 2022 / Accepted: 1 November 2022 / Published: 3 November 2022
(This article belongs to the Special Issue Knowledge Management in Healthcare)

Abstract:
Imbalanced data constitutes a challenge for knowledge management. This problem is even more complex in the presence of hybrid (numeric and categorical) data with missing values and multiple decision classes. Unfortunately, health-related information is often multiclass, hybrid, and imbalanced. This paper introduces a novel undersampling procedure that deals with multiclass hybrid data. We explore its impact on the performance of the recently proposed customized naïve associative classifier (CNAC). The experiments and statistical analysis show that the proposed method surpasses existing classifiers, with the advantage of being able to deal with multiclass, hybrid, and incomplete data at a low computational cost. In addition, our experiments showed that the CNAC benefits from data sampling; therefore, we recommend using the proposed undersampling procedure to balance data for CNAC.

1. Introduction

Assessing the behavior of clients and employees has been vital for business profitability [1,2,3]. Obtaining insights into the expected behavior of subjects helps to increase the client’s satisfaction [4], predict customers’ purchases [5], enhance human resources responsibilities regarding employees [6,7], and acquire new clients [8]. Several researchers have addressed the problem of increasing business profitability by using individuals’ information [9].
In the healthcare industry, recent research has focused on the patient’s experiences and satisfaction, as well as their behavior, by considering the patients as critical layers in this industry [10]. Patient-centric health care has gained attention in the scientific community, with several investigations on artificial intelligence and its applications [11,12,13] and technological issues [14,15].
A vital issue in knowledge management in healthcare is the ability to deal with the available data. Most of the information from real-world problems is described by numerical and categorical features (hybrid data) and includes incomplete (missing) attribute values. In addition, such information is often organized in several decision classes (for instance, several diseases), with no guarantee of equal representation of the instances across the decision classes. Therefore, most business-related data is multiclass, hybrid, incomplete, and imbalanced. This last characteristic arises when one or more decision classes are overrepresented or underrepresented, which is extremely common in medical data [16,17].
Several supervised classifiers exist in the literature that have been applied to decision-making with data related to people’s behavior and satisfaction. Among them, we are interested in the recently proposed customized naïve associative classifier (CNAC) [4], which was found to be suitable for customers’ preferences assessment.
It is well-known that data imbalance is a challenge for supervised classifiers [18] and that many classifiers benefit from data sampling procedures [19]. In addition, several of the existing sampling procedures in the literature do not deal with multiclass or hybrid and incomplete data. The objective of the study was to assess the impact of undersampling methods for the CNAC in data related to individuals’ behavior and to introduce a novel undersampling algorithm, as well as to determine its performance concerning existing algorithms.
Although several sampling procedures exist in the literature, most are designed for binary-class problems. In addition, support for hybrid and missing data is often absent. Another known drawback is the computational complexity associated with sampling algorithms. This is why there is a need for a novel undersampling procedure. The significance of the paper lies in its assessment of the impact of sampling procedures on the recently proposed CNAC classifier, as well as in the introduction of a novel undersampling algorithm with tractable computational complexity, able to deal simultaneously with hybrid, multiclass, and missing data. The research contributions are the following:
  • An experimental study to assess the impact of data sampling for the customized naïve associative classifier;
  • A new undersampling algorithm for dealing with multiclass, hybrid and missing data, with tractable computational complexity bounded by O(n + n²).

2. Materials and Methods

To deal with class imbalance, researchers have focused on three approaches: data sampling; algorithm adaptation; and classifier ensembles and cost-sensitive learning [20]. Data sampling is devoted to modifying the imbalanced training data to obtain a balanced dataset by undersampling the majority classes, oversampling the minority classes, or both (hybrid sampling) [19]. On the other hand, the algorithm adaptation paradigm focuses on transforming existing supervised classifiers to make them robust in the presence of imbalanced data. Lastly, the classifier ensemble approach constructs a committee of classifiers to handle the imbalanced data and minimize the error committed by the individual (base) classifiers [20]. As stated before, this paper focuses on the data sampling approach and its impact on the customized naïve associative classifier.
Data sampling for supervised classification has been addressed in the literature since the 1960s [21], and its use for data balancing has been part of the scientific community since the 1990s [22]. Several studies show that data sampling enhances the performance of supervised classifiers, such as artificial neural networks [23], C4.5 [24], nearest neighbor, support vector machines, and naïve Bayes [25]. Sampling methods are often divided into oversampling, undersampling, and hybrid [18]. We briefly review some of the existing data sampling methods in the following.

2.1. Oversampling Methods

Oversampling methods obtain a balanced data set by creating new synthetic instances from the minority class. Some of the most used oversampling algorithms are briefly described in the following.
ADAptive SYNthetic Sampling (ADASYN) [26] adaptively generates minority class instances. Its main disadvantage is its inability to deal with hybrid, incomplete or multiclass data.
Adjusting the direction of the synthetic minority class examples (ADOMS) [27] uses the principal components axis (PCA) to generate synthetic minority class instances. Like ADASYN, its main disadvantage is its inability to deal with hybrid, incomplete, or multiclass data.
Agglomerative hierarchical clustering [28] (AHC) creates synthetic instances by clustering. Due to its agglomerative approach, the AHC algorithm has high computational complexity. In addition, it does not deal with hybrid, incomplete or multiclass data.
Random over-sampling [29] (ROS) randomly creates new synthetic instances. This method is fast, and is easily extended to multiclass data.
The synthetic minority over-sampling technique [30] (SMOTE) uses the k nearest neighbors to create synthetic minority instances. SMOTE does not deal with categorical, incomplete, or multiclass data. There are several variants of SMOTE [31], such as Borderline-SMOTE [32], Safe level SMOTE [33], SMOTE-RSB* [34], Geometric SMOTE [35] and SMOTE-WENN [36].
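As a rough illustration of the core SMOTE idea described above, the sketch below interpolates a randomly chosen minority instance toward one of its k nearest minority neighbors. This is a hedged, minimal sketch for purely numeric data; the function name and interface are our assumptions, and none of the borderline or safe-level refinements of the variants above are included.

```python
import math
import random

def smote_like(minority, n_new, k=5, seed=None):
    """Core SMOTE idea: pick a minority instance, pick one of its k nearest
    minority neighbors, and place a synthetic point on the segment between
    them. Ignores categorical and missing values (numeric data only)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    # precompute the k nearest minority neighbors of every minority instance
    neighbors = []
    for i, x in enumerate(minority):
        others = sorted((j for j in range(len(minority)) if j != i),
                        key=lambda j: math.dist(x, minority[j]))
        neighbors.append(others[:k])
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(neighbors[i])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic
```

Because every synthetic point is a convex combination of two minority instances, the generated points stay inside the convex hull of the minority class.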
Other recent works in oversampling include: the study by Jiang et al. [37], which uses the concept of classification contribution degree; the work in [38], which introduces a variant of SMOTE called the synthetic minority oversampling technique with natural neighbors (NaNSMOTE); the SCOTE algorithm [39], which transforms multiclass problems into binary ones; and the work in [40], which employs generative adversarial networks for oversampling purposes.
Another interesting approach is to generate virtual samples. Virtual sample generation (VSG) methods aim to predict the trends in small-size data and then generate new data from different distributions [41,42,43].
For medical data and other data related to individuals, creating synthetic, unrealistic instances can be considered controversial and unethical. In addition, most oversampling procedures use mathematical tools to create new data, such as means, standard deviations, and others. For mixed and incomplete data, such tools are not applicable. Another drawback of oversampling procedures is the consequent increase in the computational execution time of the decision-making algorithms due to the increased number of instances in the dataset.

2.2. Undersampling Methods

From a different point of view, undersampling methods aim to balance the data by selecting representative instances from the majority classes to reduce the data while preserving the quality of the original data. Some of the most used undersampling algorithms are briefly reviewed in the following:
Condensed nearest neighbor + Tomek’s modification of condensed nearest neighbor [29] (CNNTL) merges the CNN [21] and TL [44] algorithms. Although this combination of strategies is not per se a data balancing method, CNNTL has been used widely for dealing with data imbalance [45,46].
Class purity maximization [47] (CPM) selects two instances (one majority and one minority) as centers. By forming clusters, a classifier committee decides which instances are removed.
Neighborhood cleaning rule [48] (NCL) applies the ENN [49] to delete majority class instances. Although NCL does not deal with multiclass data, it can be extended for this task.
One-sided selection [22] (OSS) uses Tomek’s links to select the instances to delete. Majority class instances in Tomek’s links are deleted. The OSS method does not deal with multiclass data.
Random under-sampling [29] (RUS) randomly selects instances of the majority class. RUS, the same as ROS, can be easily extended to deal with multiclass data.
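The multiclass extension of RUS mentioned above can be sketched as follows: every class larger than the smallest one is reduced, at random and without replacement, to the size of the smallest class. A minimal sketch; the function name and interface are our assumptions.

```python
import random
from collections import defaultdict

def random_undersample(X, labels, seed=None):
    """Multiclass RUS: keep every instance of the smallest class and draw a
    random subset of that same size from each larger class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, d in zip(X, labels):
        by_class[d].append(x)
    min_count = min(len(v) for v in by_class.values())
    balanced = []
    for d, members in by_class.items():
        kept = members if len(members) == min_count else rng.sample(members, min_count)
        balanced.extend((x, d) for x in kept)
    return balanced
```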
Simulated annealing-based undersampling [46] (SAUS) uses the simulated annealing metaheuristic to obtain a balanced dataset, combining sensitivity and specificity measures as one optimization function. It deals with incomplete data but not with multiclass data.
Tomek’s modification of condensed nearest neighbor [44] (TL) also uses the information of Tomek’s links to guide the selection. Although this is not a data balancing method, like CNNTL, it has been widely used for data balancing.
Unlike oversampling, undersampling does not create synthetic data, and most undersampling algorithms can deal with mixed and incomplete data. In addition, by obtaining small datasets, undersampling algorithms can diminish the computational execution time of the decision-making algorithms. The main disadvantage of undersampling methods is that often the majority classes end up underrepresented after applying the sampling procedures, and therefore, the data quality is lost.

2.3. Hybrid Sampling Methods

Combining over- and undersampling algorithms, hybrid sampling techniques obtain a balanced dataset by deleting instances of the majority classes and creating synthetic instances of the minority classes.

3. Results

Given the limitations of some of the existing sampling algorithms, we aimed to design an undersampling method suitable for multiclass data and able to deal with hybrid (numerical and categorical) incomplete data. Let A = {A_1, ..., A_m} be a set of features that can be numeric or categorical. Each instance x from a set of instances X is described in terms of the features in A, such that x[i] represents the value of feature A_i in instance x. If the value is missing, we consider x[i] = ?.
In addition, we have a decision feature (label) D, which can take multiple categorical values: D(x) = d_i ∈ {d_1, ..., d_c}. To know the class value of an instance, we use D(x).
The number of instances in a class d_i is given by |{x ∈ X : D(x) = d_i}|. The minority class is d_min = argmin j=1..c |{x ∈ X : D(x) = d_j}|; all other classes are considered majority classes.

3.1. Proposed Undersampling Method

For our undersampling proposal, we wanted to deal with multiclass hybrid and incomplete data, and we wanted our proposal to be computationally tractable. Therefore, we chose to hybridize the concepts of structuralizations of the logical combinatorial approach to pattern recognition (LCPR) [50] with the ideas of random undersampling (RUS) [29].
Our proposal is named customized instance random undersampling (CIRUS) and works as follows. It computes the compact sets for each majority class, obtaining a structuralization of the class. Then, it randomly selects one of the compact sets and, from it, a random instance; the selected instance is added to the resulting set. The process continues until as many instances as the minority class count have been selected. Algorithm 1 shows the pseudocode of the proposed CIRUS.
Informally, a compact set is a connected component of a maximum-similarity graph; such a graph is obtained by forming an arc between each instance and its most similar instances. We computed the similarity between all pairs of instances to obtain this graph and create the desired arcs.
Algorithm 1: CIRUS
Inputs: dissimilarity function diss, imbalanced set of instances X
1  B ← ∅
2  Compute the minority class count as min_count = min j=1..c |{x ∈ X : D(x) = d_j}|
3  for each class subset X_i of X do
4    if |X_i| = min_count then B ← B ∪ X_i  // minority classes are preserved as-is
5    else
6      Compute the compact sets of X_i, obtaining a multi-list cs; each item of cs is itself a list with the instances of the corresponding compact set
7      count ← 0
8      repeat
9        Select a compact set at random: idx_cs ← Random(0, cs.Count)
10       Select an instance of that compact set at random: idx ← Random(0, cs[idx_cs].Count)
11       Add the selected instance: B ← B ∪ {cs[idx_cs][idx]}
12       count ← count + 1
13     until count = min_count
14   end if
15 end for
16 return B
Formally, given a similarity threshold β0 ≥ 0, a subset N of X is a compact set if and only if [50]:
  • For every x ∈ X: if there is some y ∈ N such that x is the most similar instance to y (max z∈X, z≠y sim(y, z) = sim(y, x) ≥ β0) or y is the most similar instance to x (max z∈X, z≠x sim(x, z) = sim(x, y) ≥ β0), then x ∈ N;
  • For every pair y, x ∈ N there is a chain y_1, ..., y_q ∈ N with y_1 = y and y_q = x such that, for every p ∈ {1, ..., q − 1}, y_{p+1} is the most similar instance to y_p (max z∈X, z≠y_p sim(y_p, z) = sim(y_p, y_{p+1}) ≥ β0) or y_p is the most similar instance to y_{p+1} (max z∈X, z≠y_{p+1} sim(y_{p+1}, z) = sim(y_{p+1}, y_p) ≥ β0);
  • Every isolated instance is a (degenerate) compact set.
Compact sets help find the inner structure of classes [51], with the benefit of dealing with hybrid and incomplete data.
For CIRUS, we neglect the β0 parameter by using β0 = 0, guaranteeing that each instance is connected to its most similar instances regardless of the similarity value. CIRUS needs a similarity function to compute the compact sets, and its ability to handle hybrid and incomplete data depends on the similarity function used. Several dissimilarities have been proposed to handle such data, perhaps the best known being the HEOM dissimilarity [52]. Generally, a dissimilarity function diss is transformed into a similarity function sim as sim = 1/diss. The capability of CIRUS to use a user-defined similarity is beneficial because it does not impose a predefined function to compare instances and gives the user the ability to select the desired function in a customized way.
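A minimal sketch of a HEOM-style dissimilarity and the sim = 1/diss transform discussed above. The interface is our assumption: per-feature ranges are supplied for numeric features, None marks a categorical feature, and "?" marks a missing value; the guard for identical instances in the transform is also ours.

```python
def heom(x, y, ranges, missing="?"):
    """HEOM-style dissimilarity over hybrid, possibly incomplete instances:
    per-feature distance is 1 when either value is missing, overlap (0/1)
    for categorical features, and a range-normalized absolute difference for
    numeric ones; features are combined Euclidean-style."""
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        if a == missing or b == missing:
            d = 1.0
        elif r is None:  # categorical feature: overlap distance
            d = 0.0 if a == b else 1.0
        else:  # numeric feature with range r = max - min
            d = abs(a - b) / r if r else 0.0
        total += d * d
    return total ** 0.5

def to_similarity(diss):
    """The sim = 1/diss transform, guarding the identical-instance case."""
    return float("inf") if diss == 0 else 1.0 / diss
```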
The computational complexity of the proposed CIRUS is bounded by O(n + n²), where n is the number of instances.
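Under the definitions above, Algorithm 1 can be sketched in Python. This is a hedged illustration, not the authors' implementation: the union-find computation of compact sets, the toy similarity interface, and sampling with possible repetition (mirroring the repeat loop of Algorithm 1) are our assumptions.

```python
import random
from collections import defaultdict

def compact_sets(instances, sim):
    """Compact sets as connected components of the maximum-similarity graph:
    each instance is linked to every instance attaining its maximum
    similarity (beta0 = 0, as in CIRUS). Union-find keeps it simple."""
    n = len(instances)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        best = max(sim(instances[i], instances[j]) for j in range(n) if j != i)
        for j in range(n):
            if j != i and sim(instances[i], instances[j]) == best:
                parent[find(i)] = find(j)  # merge i with its most similar instances
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(instances[i])
    return list(groups.values())

def cirus(X, labels, sim, seed=None):
    """CIRUS sketch: minority classes are kept as-is; from each majority
    class we repeatedly pick a random compact set and then a random instance
    from it, until the minority class count is reached."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, d in zip(X, labels):
        by_class[d].append(x)
    min_count = min(len(v) for v in by_class.values())
    balanced = []
    for d, members in by_class.items():
        if len(members) == min_count:
            chosen = members  # minority classes are preserved
        else:
            cs = compact_sets(members, sim)
            chosen = []
            while len(chosen) < min_count:
                group = cs[rng.randrange(len(cs))]
                chosen.append(group[rng.randrange(len(group))])
        balanced.extend((x, d) for x in chosen)
    return balanced
```

The similarity comparisons dominate, giving the O(n²) term of the complexity bound stated above.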

3.2. Experimental Setup

To test the impact of data sampling procedures on the customized naïve associative classifier (CNAC), we used 20 datasets associated with the business environment. These datasets relate to the behavior of patients, clients, users, employees, and others. The description of the datasets is given in Table 1. We used five-fold cross-validation and averaged the results.
To assess the degree of imbalance of a dataset, the imbalance ratio (IR) measure is used; a dataset is considered imbalanced if IR > 1.5:
IR = max j=1..c |{x ∈ X : D(x) = d_j}| / min j=1..c |{x ∈ X : D(x) = d_j}|
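The IR measure can be computed directly from the class labels; a minimal sketch:

```python
from collections import Counter

def imbalance_ratio(labels):
    """IR: size of the largest class divided by the size of the smallest."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def is_imbalanced(labels, threshold=1.5):
    """A dataset is considered imbalanced if IR > 1.5."""
    return imbalance_ratio(labels) > threshold
```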
We tested the performance of the CNAC classifier before and after data sampling. To assess the performance of the CNAC, we used the non-error rate measure (NER) [53], which is robust for multiclass imbalanced data.
Table 1. Description of the used datasets.
| Dataset | Instances | Numeric Attributes | Categorical Attributes | Missing Values | Classes | IR | Reference |
| alpha_bank | 30,477 | 1 | 6 | No | 2 | 6.90 | [54] |
| attribute_dataset | 500 | 1 | 11 | Yes | 2 | 1.38 | [55] |
| aug | 19,158 | 2 | 10 | No | 2 | 3.01 | [56] |
| churn_modelling | 10,000 | 6 | 4 | No | 2 | 3.91 | [57] |
| customer_behaviour | 400 | 2 | 1 | No | 2 | 1.80 | [58] |
| customer_segmentation | 8068 | 3 | 6 | Yes | 4 | 1.22 | [59] |
| customer_targeting | 6620 | 6 | 64 | No | 3 | 1.85 | [60] |
| deposit2020 | 40,000 | 5 | 8 | No | 2 | 12.81 | [61] |
| df_clean | 3583 | 1 | 6 | No | 2 | 9.23 | [62] |
| employee_satisfaction | 500 | 2 | 9 | No | 2 | 1.11 | [63] |
| in-vehicle-coupon | 12,684 | 1 | 24 | Yes | 2 | 1.32 | [64] |
| marketing_campaign | 2240 | 14 | 13 | Yes | 2 | 5.71 | [65] |
| marketing_series | 6499 | 3 | 16 | Yes | 2 | 2.79 | [66] |
| non-verbal-tourist | 73 | 4 | 18 | Yes | 6 | 9.00 | [4] |
| online_shoppers | 12,330 | 10 | 7 | No | 2 | 5.46 | [67] |
| promoted | 24,016 | 4 | 2 | Yes | 5 | 5.44 | [68] |
| telecom_churn | 3333 | 10 | 0 | No | 2 | 5.90 | [69] |
| telecom_churnV2 | 3333 | 15 | 4 | No | 2 | 5.90 | [70] |
| telecust | 1000 | 4 | 7 | No | 4 | 1.29 | [71] |
| term_deposit | 31,647 | 7 | 9 | No | 2 | 7.52 | [72] |
Considering the values in a confusion matrix (Figure 1), the NER is calculated as
NER = (Σ g=1..G Sn_g) / G,
where Sn_g = c_gg / n_g is the sensitivity of class g, c_gg is the number of correctly classified instances of class g, n_g is the total number of instances of class g, and G is the number of classes.
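The NER computation from a confusion matrix can be sketched as follows, assuming rows are indexed by true class and columns by predicted class:

```python
def non_error_rate(confusion):
    """NER: the mean of the per-class sensitivities Sn_g = c_gg / n_g, where
    confusion[g][h] counts instances of true class g predicted as class h."""
    sensitivities = []
    for g, row in enumerate(confusion):
        n_g = sum(row)  # total instances of true class g
        sensitivities.append(row[g] / n_g if n_g else 0.0)
    return sum(sensitivities) / len(confusion)
```

Unlike plain accuracy, NER weighs every class equally, which is why it is robust for imbalanced data: a classifier that ignores a small class is penalized through that class's low sensitivity.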
We applied six algorithms of the undersampling approach (CIRUS, CNNTL, NCL, OSS, RUS, and TL). The CIRUS algorithm is available in the EPIC software [73,74], and the remaining sampling algorithms are in the KEEL software [75].
It is essential to mention that several sampling algorithms do not deal with incomplete data; therefore, for those algorithms, we imputed missing values in the datasets using the mean for numeric features and the mode (most common value) for categorical features, both also integrated into the KEEL software. In addition, the KEEL software internally stores all feature values as numeric ones; due to this functionality, all tested sampling methods run even on hybrid feature data.
In Table 2, we show the sampling algorithms’ parameters and the ones for the CNAC classifier. For CNNTL, NCL and OSS, we used the default parameter values offered in the KEEL software.
We used nonparametric statistical tests to determine the existence or not of significant differences in the performance of CNAC and to compare the proposed sampling algorithms with respect to the state-of-the-art undersampling procedures. We selected the Friedman test [76], the Holm post hoc test [77], and the Wilcoxon test [78] for these purposes. In the next section, we show the experimental results and discuss them.
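The Friedman ranking used in this comparison can be sketched in pure Python: rank the algorithms within each dataset (averaging ranks on ties) and compute the chi-square statistic from the average ranks. A hedged sketch: the statistic below is the classical uncorrected Friedman chi-square, whereas statistical packages may apply corrections, and the Holm and Wilcoxon follow-ups are not shown.

```python
def friedman_statistic(scores):
    """Friedman chi-square for k algorithms over n datasets.
    scores[i][j] = performance of algorithm j on dataset i (higher is better)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])  # best score first
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1  # extend over a block of tied scores
            avg_rank = (i + j) / 2 + 1  # average rank for the tied block
            for t in range(i, j + 1):
                rank_sums[order[t]] += avg_rank
            i = j + 1
    avg_ranks = [s / n for s in rank_sums]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )
    return chi2, avg_ranks
```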

4. Discussion

In this section, we analyze two topics: the impact of sampling methods on the performance of CNAC; and the behavior of the proposed CIRUS with respect to the state-of-the-art sampling algorithms. Section 4.1 is devoted to the former topic, while Section 4.2 addresses the latter.

4.1. Impact of Undersampling Methods on the Performance of CNAC

First, we want to determine if the undersampling techniques impact the performance of the CNAC. Table 3 shows the non-error rate (NER) results for CNAC after undersampling.
Some undersampling algorithms could not execute in datasets with more than two decision classes. In some cases, this is due to the algorithm’s nature, which considers only two classes. For other algorithms, such as RUS, the KEEL software does not offer an implementation able to handle multiple classes.
Figure 2 presents the maximum, minimum, and average results for CNAC before and after undersampling algorithms.
As shown in Table 3, the CNAC obtained the best results using the unprocessed data for 13 datasets, and its performance was improved or maintained for six datasets after the CIRUS algorithm. These results support the idea that the proposed undersampling method does not degrade the performance of CNAC. However, three datasets suffered large performance drops: alpha_bank, deposit2020, and term_deposit.
We used the Friedman test to determine whether the performance differences were significant, setting a significance level of α = 0.05 (95% confidence). The results of the Friedman ranking are shown in Table 4. The Friedman test returns a probability value of p = 0.00037, so we reject the null hypothesis.
We define the null and alternative hypotheses as:
Hypothesis 0 (H0).
There are no differences in the performance of CNAC before and after undersampling the datasets.
Hypothesis 1 (H1).
There are differences in the performance of CNAC before and after undersampling the datasets.
Because the test rejected the null hypothesis, we applied the Holm post hoc test to determine between which algorithms the differences are significant. The results are given in Table 5.
The Holm test finds no significant difference between the performance of CNAC before and after applying the proposed CIRUS algorithm, and finds the performance of CNAC after all other undersampling algorithms to be significantly worse than its performance before their application. These results show that the proposed CIRUS does not significantly degrade the performance of CNAC.
However, further research is needed to determine under which specific conditions datasets can fully benefit from the proposed CIRUS, to avoid large performance drops such as those observed for the alpha_bank, deposit2020, and term_deposit datasets.
Regarding the time used to classify instances, the reduction achieved by the proposed algorithm was outstanding (Table 6 shows the time results), with a maximum speedup of 24.3 times for the biggest dataset (deposit2020). Figure 3 shows the relation between the time spent by CNAC without undersampling and the time spent after CIRUS.
The results confirm that the proposed CIRUS is helpful for the CNAC classifier because it maintains classifier performance while using less time for instance classification. We used the Wilcoxon test [78] to determine whether the differences in time are significant.
We set as null and alternative hypotheses the following:
Hypothesis 0 (H0).
There are no differences in the time expended by CNAC before and after undersampling the datasets.
Hypothesis 1 (H1).
There are differences in the time expended by CNAC before and after undersampling the datasets.
We set a significance level of α = 0.05 (95% confidence). The test returns a probability value of p = 0.000121, so we reject the null hypothesis. The results allow us to state that CIRUS significantly decreases the time expended by CNAC without sacrificing correctness.

4.2. Performance of the Proposed Undersampling Method with Respect to Others

To assess the performance of sampling methods, we use two measures: the NER of the classifier and the obtained IR of the datasets. Sampling aims to return a balanced dataset suitable for classification. Therefore, the best sampling methods will return datasets with IR = 1. Because we already presented the NER of the CNAC classifier after undersampling (Table 3), we will only provide in this section the analysis of the imbalance ratio obtained by the compared undersampling methods (Table 7, Table 8 and Table 9) and the statistical comparison of the NER of all undersampling algorithms (Table 10). The hypotheses are:
Hypothesis 0 (H0).
There is no difference in the imbalance ratio obtained by the compared undersampling methods.
Hypothesis 1 (H1).
There are differences in the imbalance ratio obtained by the compared sampling methods.
We set a significance level of 0.05 (95% confidence). For the datasets in which the algorithms could not execute, we considered the original IR of the dataset (without sampling) as the corresponding IR value.
Both RUS and the proposed CIRUS obtained a perfectly balanced dataset. The remaining methods fail in several datasets, bringing IRs as high as 11.95 for TL in the deposit2020 dataset. The Friedman test returns a p-value of zero, and the null hypothesis is rejected. The corresponding ranking is shown in Table 8.
The Holm test found no significant differences in IR for CIRUS and RUS; and found CIRUS to obtain significantly more balanced datasets than OSS, NCL, CNNTL, and TL methods.
We used the Friedman test to determine whether or not the differences in performance according to NER values (disregarding the baseline results without instance sampling) were significant. We define the null and alternative hypotheses as:
Hypothesis 0 (H0).
There are no differences in the performance of CNAC after undersampling the datasets.
Hypothesis 1 (H1).
There are differences in the performance of CNAC after undersampling the datasets.
We set a significance level of α = 0.05 (95% confidence). The results of the Friedman ranking are shown in Table 10. The Friedman test returns a probability value of p = 0.033632, and therefore we reject the null hypothesis.
Because the null hypothesis was rejected, we applied the Holm post hoc test to determine between which algorithms the differences are significant. The results are given in Table 11.
The experimental results show that the proposed CIRUS significantly exceeds RUS, CNNTL, and OSS according to NER values. Considering both IR and non-error rate measures, the proposal has excellent results, being significantly better in at least one measure (Table 12).
When we want to consider several performance measures simultaneously, the Pareto front (also called the Pareto frontier or Pareto curve) can be handy. The Pareto front is the set of all Pareto efficient solutions [79].
Given a certain state, a Pareto improvement is a new state in which some measure improves and no measure is degraded. A state is called Pareto-dominated if a possible Pareto improvement over it exists, and Pareto-optimal (Pareto-efficient) if no such improvement is possible. Two states are mutually non-dominated if each is better in at least one measure and worse in at least one other.
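The dominance relation and the Pareto front described above can be sketched as follows, assuming every measure has been oriented so that larger is better (e.g., NER directly, and closeness of IR to 1 as a negated deviation):

```python
def dominates(a, b):
    """a Pareto-dominates b if a is at least as good on every measure and
    strictly better on at least one (all measures oriented as maximize)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """The points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```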
As shown in Figure 4, CIRUS dominates all state-of-the-art algorithms while considering IR and NER measures; for this scenario, only CIRUS will be in the Pareto front.

5. Conclusions

We fulfilled the objectives of the study by introducing a novel undersampling algorithm (CIRUS). We also evaluated its performance with respect to state-of-the-art sampling methods and found that CIRUS obtained datasets significantly more balanced than those obtained by the compared undersampling methods, with significant differences in performance. It is essential to mention that the proposed CIRUS deals with multiclass, hybrid (numeric and categorical), and incomplete (missing) data.
In addition, we assessed the impact of undersampling algorithms on the performance of the customized naïve associative classifier. From the experiments and the corresponding statistical analysis, we can conclude that CNAC benefits from using the proposed sampling algorithm for data balancing. Accordingly, we recommend using CIRUS as data preprocessing for CNAC. However, the statistical analysis found that the performance of CNAC after all other undersampling algorithms was significantly worse than its performance before their application. Therefore, we do not recommend using undersampling techniques other than CIRUS for CNAC. We consider that these results are due to our proposal using a customized similarity function and compact sets to guarantee a good representation of the majority classes.
The main advantages of CIRUS are the following: it requires no parameters other than the similarity function; therefore, it does not impose a predefined function to compare instances and gives the user the ability to select the desired function in a customized way. It handles hybrid and incomplete data as well as multiclass data.
Its main disadvantage is its computational complexity, bounded by O(n + n²), where n is the number of instances. Although this complexity is tractable, it may be prohibitive for big-data scenarios.
In future work, we want to explore the impact of the proposed sampling methods for other lazy classifiers, such as the nearest neighbor [80] and the extended gamma [81] classifiers. In addition, we want to work on diminishing the complexity of the algorithm.

Author Contributions

Conceptualization, Y.V.-R.; methodology, C.C.T.-R. and Y.V.-R.; validation, Y.V.-R., O.C.-N. and C.Y.-M.; formal analysis, C.Y.-M.; investigation, C.C.T.-R.; data curation, O.C.-N. and C.C.T.-R.; writing—original draft preparation, C.C.T.-R.; writing—review and editing, Y.V.-R. and C.Y.-M.; visualization, Y.V.-R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable; we used publicly available data only.

Informed Consent Statement

Not applicable; we used publicly available data only.

Data Availability Statement

All the used datasets are available at the Kaggle repository (https://www.kaggle.com/datasets, accessed on 7 July 2021) or in the Machine Learning Repository of the University of California in Irvine (https://archive.ics.uci.edu/ml/datasets.php, accessed on 7 July 2021).

Acknowledgments

The authors would like to thank the Instituto Politécnico Nacional (Secretaría Académica, Comisión de Operación y Fomento de Actividades Académicas, Secretaría de Investigación y Posgrado, Escuela Superior de Turismo, Centro de Investigación en Computación, and Centro de Innovación y Desarrollo Tecnológico en Cómputo), the Consejo Nacional de Ciencia y Tecnología, and Sistema Nacional de Investigadores for their economic support developing this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lin, H.C.K.; Wang, T.H.; Lin, G.C.; Cheng, S.C.; Chen, H.R.; Huang, Y.M. Applying sentiment analysis to automatically classify consumer comments concerning marketing 4Cs aspects. Appl. Soft Comput. 2020, 97, 106755.
  2. Godinho, P.; Dias, J.; Torres, P. An Application of Data Mining Methods to the Analysis of Bank Customer Profitability and Buying Behavior. Data Anal. Appl. 1 Clust. Regres. Model.-Estim. Forecast. Data Min. 2019, 2, 225–240.
  3. Kim, A.; Yang, Y.; Lessmann, S.; Ma, T.; Sung, M.C.; Johnson, J.E. Can deep learning predict risky retail investors? A case study in financial risk behavior forecasting. Eur. J. Oper. Res. 2020, 283, 217–234.
  4. Tusell-Rey, C.C.; Tejeida-Padilla, R.; Camacho-Nieto, O.; Villuendas-Rey, Y.; Yáñez-Márquez, C. Improvement of Tourists Satisfaction According to Their Non-Verbal Preferences Using Computational Intelligence. Appl. Sci. 2021, 11, 2491.
  5. Sakar, C.O.; Polat, S.O.; Katircioglu, M.; Kastro, Y. Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and LSTM recurrent neural networks. Neural Comput. Appl. 2019, 31, 6893–6908.
  6. Fan, C.Y.; Fan, P.S.; Chan, T.Y.; Chang, S.H. Using hybrid data mining and machine learning clustering analysis to predict the turnover rate for technology professionals. Expert Syst. Appl. 2012, 39, 8844–8851.
  7. Fallucchi, F.; Coladangelo, M.; Giuliano, R.; William De Luca, E. Predicting employee attrition using machine learning techniques. Computers 2020, 9, 86.
  8. Keon, Y.; Kim, H.; Choi, J.Y.; Kim, D.; Kim, S.Y.; Kim, S. Call Center Call Count Prediction Model by Machine Learning. J. Adv. Inf. Technol. Converg. 2018, 8, 31–42.
  9. Kocakulah, M.C.; Komissarov, S. Using Activity-Based Costing to Increase Profitability of Individual Deposit Services in Banking. Manag. Account. Q. 2020, 21, 10–17.
  10. Esmaeilzadeh, P.; Dharanikota, S.; Mirzaei, T. The role of patient engagement in patient-centric health information exchange (HIE) initiatives: An empirical study in the United States. Inf. Technol. People 2021, ahead-of-print.
  11. Jabarulla, M.Y.; Lee, H.N. A blockchain and artificial intelligence-based, patient-centric healthcare system for combating the COVID-19 pandemic: Opportunities and applications. Healthcare 2021, 9, 1019.
  12. Barnes, R.; Zvarikova, K. Artificial intelligence-enabled wearable medical devices, clinical and diagnostic decision support systems, and Internet of Things-based healthcare applications in COVID-19 prevention, screening, and treatment. Am. J. Med. Res. 2021, 8, 9–22.
  13. Haldorai, A.; Ramu, A. An Analysis of Artificial Intelligence Clinical Decision-Making and Patient-Centric Framework. In Computational Vision and Bio-Inspired Computing; Springer: Berlin/Heidelberg, Germany, 2021; pp. 813–827.
  14. Gohar, A.; AbdelGaber, S.; Salah, M. A Patient-Centric Healthcare Framework Reference Architecture for Better Semantic Interoperability based on Blockchain, Cloud, and IoT. IEEE Access 2022, 10, 92137–92157. [Google Scholar] [CrossRef]
  15. Naresh, V.S.; Reddi, S.; Allavarpu, V.D. Blockchain-based patient centric health care communication system. Int. J. Commun. Syst. 2021, 34, e4749. [Google Scholar] [CrossRef]
  16. Xu, Z.; Shen, D.; Nie, T.; Kou, Y.; Yin, N.; Han, X. A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data. Inf. Sci. 2021, 572, 574–589. [Google Scholar] [CrossRef]
  17. Solanki, Y.S.; Chakrabarti, P.; Jasinski, M.; Leonowicz, Z.; Bolshev, V.; Vinogradov, A.; Jasinska, E.; Gono, R.; Nami, M. A hybrid supervised machine learning classifier system for breast cancer prognosis using feature selection and data imbalance handling approaches. Electronics 2021, 10, 699. [Google Scholar] [CrossRef]
  18. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer: Berlin/Heidelberg, Germany, 2018; Volume 10. [Google Scholar]
  19. López, V.; Fernández, A.; García, S.; Palade, V.; Herrera, F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 2013, 250, 113–141. [Google Scholar] [CrossRef]
  20. Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232. [Google Scholar] [CrossRef] [Green Version]
  21. Hart, P. The condensed nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory 1968, 14, 515–516. [Google Scholar] [CrossRef]
  22. Kubat, M.; Matwin, S. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the 14th International Conference on Machine Learning (ICML97), Nashville, TN, USA, 8–12 July 1997; pp. 179–186. [Google Scholar]
  23. Mazurowski, M.A.; Habas, P.A.; Zurada, J.M.; Lo, J.Y.; Baker, J.A.; Tourassi, G.D. Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Netw. 2008, 21, 427–436. [Google Scholar] [CrossRef] [Green Version]
  24. Yin, H.; Gai, K. An empirical study on preprocessing high-dimensional class-imbalanced data for classification. In Proceedings of the 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, New York, NY, USA, 24–26 August 2015; pp. 1314–1319. [Google Scholar]
  25. Koziarski, M.; Krawczyk, B.; Woźniak, M. Radial-based oversampling for noisy imbalanced data classification. Neurocomputing 2019, 343, 19–33. [Google Scholar] [CrossRef]
  26. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
  27. Tang, S.; Chen, S.P. The generation mechanism of synthetic minority class examples. In Proceedings of the 2008 International Conference on Information Technology and Applications in Biomedicine, Shenzhen, China, 30–31 May 2008; pp. 444–447. [Google Scholar]
  28. Cohen, G.; Hilario, M.; Sax, H.; Hugonnet, S.; Geissbuhler, A. Learning from imbalanced data in surveillance of nosocomial infection. Artif. Intell. Med. 2006, 37, 7–18. [Google Scholar] [CrossRef] [PubMed]
  29. Batista, G.E.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
  30. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  31. Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 2018, 61, 863–905. [Google Scholar] [CrossRef]
  32. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887. [Google Scholar]
  33. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining; Springer: Berlin/Heidelberg, Germany, 2009; pp. 475–482. [Google Scholar]
  34. Ramentol, E.; Caballero, Y.; Bello, R.; Herrera, F. SMOTE-RSB*: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl. Inf. Syst. 2012, 33, 245–265. [Google Scholar] [CrossRef]
  35. Douzas, G.; Bacao, F. Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE. Inf. Sci. 2019, 501, 118–135. [Google Scholar] [CrossRef]
  36. Guan, H.; Zhang, Y.; Xian, M.; Cheng, H.D.; Tang, X. SMOTE-WENN: Solving class imbalance and small sample problems by oversampling and distance scaling. Appl. Intell. 2021, 51, 1394–1409. [Google Scholar] [CrossRef]
  37. Jiang, Z.; Pan, T.; Zhang, C.; Yang, J. A new oversampling method based on the classification contribution degree. Symmetry 2021, 13, 194. [Google Scholar] [CrossRef]
  38. Li, J.; Zhu, Q.; Wu, Q.; Fan, Z. A novel oversampling technique for class-imbalanced learning based on SMOTE and natural neighbors. Inf. Sci. 2021, 565, 438–455. [Google Scholar] [CrossRef]
  39. Wei, J.; Huang, H.; Yao, L.; Hu, Y.; Fan, Q.; Huang, D. New imbalanced bearing fault diagnosis method based on Sample-characteristic Oversampling TechniquE (SCOTE) and multi-class LS-SVM. Appl. Soft Comput. 2021, 101, 107043. [Google Scholar] [CrossRef]
  40. Roy, S.K.; Haut, J.M.; Paoletti, M.E.; Dubey, S.R.; Plaza, A. Generative adversarial minority oversampling for spectral–spatial hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5500615. [Google Scholar] [CrossRef]
  41. Li, L.; Damarla, S.K.; Wang, Y.; Huang, B. A Gaussian mixture model based virtual sample generation approach for small datasets in industrial processes. Inf. Sci. 2021, 581, 262–277. [Google Scholar] [CrossRef]
  42. Kim, D.H.; Song, B.C. Virtual sample-based deep metric learning using discriminant analysis. Pattern Recognit. 2021, 110, 107643. [Google Scholar] [CrossRef]
  43. Lin, L.S.; Hu, S.C.; Lin, Y.S.; Li, D.C.; Siao, L.R. A new approach to generating virtual samples to enhance classification accuracy with small data—A case of bladder cancer. Math. Biosci. Eng. 2022, 19, 6204–6233. [Google Scholar] [CrossRef]
  44. Tomek, I. Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, 6, 769–772. [Google Scholar]
  45. Fotouhi, S.; Asadi, S.; Kattan, M.W. A comprehensive data level analysis for cancer diagnosis on imbalanced data. J. Biomed. Inform. 2019, 90, 103089. [Google Scholar] [CrossRef]
  46. Chennuru, V.K.; Timmappareddy, S.R. Simulated annealing based undersampling (SAUS): A hybrid multi-objective optimization method to tackle class imbalance. Appl. Intell. 2021, 52, 2092–2110. [Google Scholar] [CrossRef]
  47. Yoon, K.; Kwek, S. An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics. In Proceedings of the Fifth International Conference on Hybrid Intelligent Systems (HIS’05), Rio de Janeiro, Brazil, 6–9 November 2005; p. 6. [Google Scholar]
  48. Laurikkala, J. Improving identification of difficult small classes by balancing class distribution. In Conference on Artificial Intelligence in Medicine in Europe; Springer: Berlin/Heidelberg, Germany, 2001; pp. 63–66. [Google Scholar]
  49. Wilson, D.L. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 1972, 2, 408–421. [Google Scholar] [CrossRef] [Green Version]
  50. Martínez-Trinidad, J.F.; Guzmán-Arenas, A. The logical combinatorial approach to pattern recognition, an overview through selected works. Pattern Recognit. 2001, 34, 741–751. [Google Scholar] [CrossRef]
  51. García-Borroto, M.; Ruiz-Shulcloper, J. Selecting prototypes in mixed incomplete data. In Iberoamerican Congress on Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2005; pp. 450–459. [Google Scholar]
  52. Wilson, D.R.; Martinez, T.R. Improved heterogeneous distance functions. J. Artif. Intell. Res. 1997, 6, 1–34. [Google Scholar] [CrossRef]
  53. Ballabio, D.; Grisoni, F.; Todeschini, R. Multivariate comparison of classification performance measures. Chemom. Intell. Lab. Syst. 2018, 174, 33–44. [Google Scholar] [CrossRef]
  54. Available online: https://www.kaggle.com/raosuny/success-of-bank-telemarketing-data (accessed on 7 July 2021).
  55. Available online: https://archive.ics.uci.edu/ml/datasets/dresses_attribute_sales (accessed on 7 July 2021).
  56. Available online: https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists?select=aug_train.csv (accessed on 7 July 2021).
  57. Available online: https://www.kaggle.com/shivan118/churn-modeling-dataset (accessed on 7 July 2021).
  58. Available online: https://www.kaggle.com/denisadutca/customer-behaviour (accessed on 7 July 2021).
  59. Available online: https://www.kaggle.com/vetrirah/customer?select=Train.csv (accessed on 7 July 2021).
  60. Available online: https://www.kaggle.com/tsiaras/predicting-profitable-customer-segments (accessed on 7 July 2021).
  61. Available online: https://www.kaggle.com/arinzy/deposit-subscription-what-makes-consumers-buy (accessed on 7 July 2021).
  62. Available online: https://www.kaggle.com/c/warranty-claims/leaderboard (accessed on 7 July 2021).
  63. Available online: https://www.kaggle.com/mohamedharris/employee-satisfaction-index-dataset (accessed on 7 July 2021).
  64. Wang, T.; Rudin, C.; Doshi-Velez, F.; Liu, Y.; Klampfl, E.; MacNeille, P. A bayesian framework for learning rule sets for interpretable classification. J. Mach. Learn. Res. 2017, 18, 2357–2393. [Google Scholar]
  65. Available online: https://www.kaggle.com/rodsaldanha/arketing-campaign (accessed on 7 July 2021).
  66. Available online: https://www.kaggle.com/arashnic/marketing-series-customer-churn?select=train.csv (accessed on 7 July 2021).
  67. Available online: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset (accessed on 7 July 2021).
  68. Available online: https://www.kaggle.com/regivm/promotion-response-and-target-datasets?select=promoted.csv (accessed on 7 July 2021).
  69. Available online: https://www.kaggle.com/barun2104/telecom-churn (accessed on 7 July 2021).
  70. Available online: https://www.kaggle.com/sagnikpatra/edadata (accessed on 7 July 2021).
  71. Available online: https://www.kaggle.com/prathamtripathi/customersegmentation (accessed on 7 July 2021).
  72. Available online: https://www.kaggle.com/brajeshmohapatra/term-deposit-prediction-data-set (accessed on 7 July 2021).
  73. Hernández-Castaño, J.A.; Villuendas-Rey, Y.; Camacho-Nieto, O.; Yáñez-Márquez, C. Experimental platform for intelligent computing (EPIC). Comput. Sist. 2018, 22, 245–253. [Google Scholar] [CrossRef]
  74. Hernández-Castaño, J.A.; Villuendas-Rey, Y.; Nieto, O.C.; Rey-Benguría, C.F. A New Experimentation Module for the EPIC Software. Res. Comput. Sci. 2018, 147, 243–252. [Google Scholar] [CrossRef]
  75. Triguero, I.; González, S.; Moyano, J.M.; García, S.; Alcalá-Fdez, J.; Luengo, J.; Fernández, A.; del Jesús, M.J.; Sánchez, L.; Herrera, F. KEEL 3.0: An open source software for multi-stage analysis in data mining. Int. J. Comput. Intell. Syst. 2017, 10, 1238–1249. [Google Scholar] [CrossRef] [Green Version]
  76. Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 675–701. [Google Scholar] [CrossRef]
  77. Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 1979, 6, 65–70. [Google Scholar]
  78. Wilcoxon, F. Individual comparisons by ranking methods. Biometrics 1945, 1, 80–83. [Google Scholar] [CrossRef]
  79. Zitzler, E.; Thiele, L.; Laumanns, M.; Fonseca, C.M.; Da Fonseca, V.G. Performance assessment of multiobjective optimizers: An analysis and review. IEEE Trans. Evol. Comput. 2003, 7, 117–132. [Google Scholar] [CrossRef] [Green Version]
  80. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef] [Green Version]
  81. Villuendas-Rey, Y.; Yáñez-Márquez, C.; Anton-Vargas, J.A.; López-Yáñez, I. An extension of the gamma associative classifier for dealing with hybrid data. IEEE Access 2019, 7, 64198–64205. [Google Scholar] [CrossRef]
Figure 1. Confusion matrix of c classes.
Figure 2. Performance of CNAC before and after undersampling.
Figure 3. Percentage of testing time expended by CNAC before and after CIRUS.
Figure 4. Representation of the undersampling algorithms on a Pareto front.
Table 2. Parameters of the algorithms.

| Algorithm | Parameters |
|---|---|
| CNNTL | Number of neighbors: 5 |
| NCL | Number of neighbors: 5 |
| OSS | Number of neighbors: 5 |
| RUS | None |
| TL | None |
| CIRUS | Dissimilarity: HEOM [52], except for the non-verbal-tourist data, in which we used the function suggested in [4] |
| CNAC | Dissimilarity: HEOM [52], except for the non-verbal-tourist data, in which we used the function suggested in [4]; attribute weighting: none |
Table 3. NER results of CNAC, before and after applying undersampling methods (columns NCL through CIRUS report CNAC performance after each undersampling method).

| Dataset | CNAC | NCL | OSS | RUS | TL | CNNTL | CIRUS |
|---|---|---|---|---|---|---|---|
| alpha_bank | 0.84 | 0.55 | 0.53 | 0.54 | 0.54 | 0.53 | 0.53 |
| attribute_dataset | 0.48 | 0.51 | 0.47 | 0.50 | 0.50 | 0.49 | 0.49 |
| aug | 0.66 | 0.53 | 0.61 | 0.52 | 0.63 | 0.59 | 0.60 |
| churn_modelling | 0.56 | 0.56 | 0.54 | 0.56 | 0.56 | 0.54 | 0.56 |
| customer_behaviour | 0.72 | 0.77 | 0.73 | 0.77 | 0.78 | 0.67 | 0.72 |
| customer_segmentation | 0.42 | – | – | – | – | – | 0.34 |
| customer_targeting | 0.45 | – | – | – | – | – | 0.46 |
| deposit2020 | 0.84 | 0.78 | 0.77 | 0.78 | 0.78 | 0.78 | 0.78 |
| df_clean | 0.55 | 0.63 | 0.60 | 0.60 | 0.57 | 0.58 | 0.68 |
| employee_satisfaction | 0.53 | 0.51 | 0.53 | 0.52 | 0.55 | 0.53 | 0.52 |
| in-vehicle-coupon | 0.54 | 0.53 | 0.53 | 0.53 | 0.53 | 0.53 | 0.53 |
| marketing_campaign | 0.66 | 0.63 | 0.61 | 0.60 | 0.64 | 0.61 | 0.67 |
| marketing_series | 0.72 | 0.69 | 0.64 | 0.68 | 0.69 | 0.65 | 0.71 |
| non-verbal-tourist | 0.73 | – | – | – | – | – | 0.61 |
| online_shoppers | 0.64 | 0.65 | 0.66 | 0.61 | 0.65 | 0.66 | 0.65 |
| promoted | 0.99 | – | – | – | – | – | 0.98 |
| telecom_churn | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 | 0.65 |
| telecom_churnV2 | 0.67 | – | – | – | – | – | 0.67 |
| telecust | 0.36 | – | – | – | – | – | 0.34 |
| term_deposit | 0.76 | – | – | – | – | – | 0.69 |
Table 4. Friedman ranking of the results of CNAC before and after undersampling.

| Algorithm | Ranking |
|---|---|
| CNAC | 2.150 |
| CIRUS | 3.400 |
| TL | 3.875 |
| NCL | 4.150 |
| RUS | 4.750 |
| CNNTL | 4.800 |
| OSS | 4.875 |
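The average ranks in Table 4 are obtained by ranking, for each dataset, the NER of every configuration (rank 1 for the best, ties sharing the average rank) and then averaging each configuration's ranks over all datasets. The following is a minimal sketch of that ranking step; the three-dataset score matrix is hypothetical, whereas the study ranks seven configurations over twenty datasets.

```python
# Sketch of the Friedman average-rank computation: rank methods per dataset
# (rank 1 = best NER, ties get the average rank), then average per method.
def average_ranks(scores, higher_is_better=True):
    """scores: list of {method: score} dicts, one per dataset."""
    names = list(scores[0])
    totals = {n: 0.0 for n in names}
    for row in scores:
        ordered = sorted(names, key=lambda n: row[n], reverse=higher_is_better)
        i = 0
        while i < len(ordered):
            j = i
            # group tied scores so they share the average rank
            while j + 1 < len(ordered) and row[ordered[j + 1]] == row[ordered[i]]:
                j += 1
            avg_rank = (i + 1 + j + 1) / 2.0
            for k in range(i, j + 1):
                totals[ordered[k]] += avg_rank
            i = j + 1
    return {n: totals[n] / len(scores) for n in names}

# Hypothetical NER scores for three datasets and three methods
scores = [
    {"CNAC": 0.84, "CIRUS": 0.53, "TL": 0.54},
    {"CNAC": 0.48, "CIRUS": 0.49, "TL": 0.50},
    {"CNAC": 0.66, "CIRUS": 0.60, "TL": 0.63},
]
print(average_ranks(scores))
```

A lower average rank indicates a method that is consistently closer to the best result across datasets, which is why CNAC's 2.150 leads Table 4.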
Table 5. Holm's post hoc test, comparing CNAC before and after undersampling algorithms.

| i | Algorithm | z | p | Holm (α/i) |
|---|---|---|---|---|
| 6 | OSS | 3.988992 | 0.000066 | 0.008333 |
| 5 | CNNTL | 3.879203 | 0.000105 | 0.010000 |
| 4 | RUS | 3.806010 | 0.000141 | 0.012500 |
| 3 | NCL | 2.927700 | 0.003415 | 0.016667 |
| 2 | TL | 2.525141 | 0.011565 | 0.025000 |
| 1 | CIRUS | 1.829813 | 0.067278 | 0.050000 |
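Holm's step-down procedure compares each ordered p-value against the shrinking threshold α/i shown in the last column, and stops rejecting at the first comparison that fails. A minimal sketch with the values taken from Table 5 (α = 0.05):

```python
# Holm's step-down procedure: compare ordered p-values against alpha / i,
# where i counts the hypotheses still under consideration.
table5 = [  # (algorithm, z statistic, p-value), ordered by decreasing z
    ("OSS", 3.988992, 0.000066),
    ("CNNTL", 3.879203, 0.000105),
    ("RUS", 3.806010, 0.000141),
    ("NCL", 2.927700, 0.003415),
    ("TL", 2.525141, 0.011565),
    ("CIRUS", 1.829813, 0.067278),
]

def holm_decisions(rows, alpha=0.05):
    """Return (algorithm, rejected) pairs; once one comparison fails,
    all remaining hypotheses are retained as well."""
    n = len(rows)
    still_rejecting = True
    decisions = []
    for step, (name, _z, p) in enumerate(rows):
        threshold = alpha / (n - step)  # i = n, n - 1, ..., 1
        still_rejecting = still_rejecting and p <= threshold
        decisions.append((name, still_rejecting))
    return decisions

for name, rejected in holm_decisions(table5):
    print(name, "rejected" if rejected else "retained")
```

Only the CIRUS comparison fails its threshold (0.067278 > 0.05), so CNAC after CIRUS is the single configuration not significantly different from CNAC on the raw data.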
Table 6. Testing time results (in seconds) of CNAC before and after CIRUS.

| Dataset | Before CIRUS | After CIRUS | Gain |
|---|---|---|---|
| alpha_bank | 73.40 | 9.57 | 7.67 |
| attribute_dataset | 0.06 | 0.02 | 2.98 |
| aug | 148.74 | 12.72 | 11.69 |
| churn_modelling | 13.45 | 5.23 | 2.57 |
| customer_behaviour | 0.02 | 0.00 | 4.00 |
| customer_segmentation | 7.58 | 4.63 | 1.64 |
| customer_targeting | 82.59 | 30.02 | 2.75 |
| deposit2020 | 575.30 | 23.70 | 24.27 |
| df_clean | 0.03 | 0.00 | 12.10 |
| employee_satisfaction | 0.07 | 0.04 | 1.79 |
| in-vehicle-coupon | 123.69 | 14.51 | 8.52 |
| marketing_campaign | 3.06 | 0.45 | 6.81 |
| marketing_series | 7.75 | 2.77 | 2.79 |
| non-verbal-tourist | 0.00 | 0.00 | 0.00 |
| online_shoppers | 169.36 | 9.39 | 18.03 |
| promoted | 40.72 | 13.56 | 3.00 |
| telecom_churn | 4.07 | 0.62 | 6.57 |
| telecom_churnV2 | 6.03 | 0.96 | 6.31 |
| telecust | 0.20 | 0.09 | 2.20 |
| term_deposit | 536.08 | 28.43 | 18.85 |
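The Gain column of Table 6 is the ratio of testing time before CIRUS to testing time after CIRUS, i.e., the speedup obtained by reducing the training set. A one-line sketch using the alpha_bank row:

```python
# Speedup ("Gain") = testing time before CIRUS / testing time after CIRUS.
# Values taken from the alpha_bank row of Table 6.
before_s, after_s = 73.40, 9.57  # seconds
gain = before_s / after_s
print(f"Gain: {gain:.2f}")  # Gain: 7.67
```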
Table 7. Results of imbalance ratio for the undersampling algorithms.

| Dataset | NCL | OSS | RUS | TL | CNNTL | CIRUS |
|---|---|---|---|---|---|---|
| alpha_bank | 1.82 | 4.89 | 2.21 | 1.00 | 6.20 | 1.00 |
| attribute_dataset | 6.60 | 3.79 | 4.07 | 1.00 | 1.62 | 1.00 |
| aug | 2.40 | 1.44 | 1.37 | 1.00 | 2.23 | 1.00 |
| churn_modelling | 1.98 | 2.08 | 1.13 | 1.00 | 3.18 | 1.00 |
| customer_behaviour | 5.90 | 1.37 | 4.48 | 1.00 | 1.58 | 1.00 |
| customer_segmentation | – | – | – | – | – | 1.00 |
| customer_targeting | 1.22 | 1.22 | 1.22 | 1.22 | 1.22 | 1.00 |
| deposit2020 | 1.16 | 9.93 | 1.47 | 1.00 | 11.95 | 1.00 |
| df_clean | 1.12 | 6.21 | 1.98 | 1.00 | 8.16 | 1.00 |
| employee_satisfaction | 7.41 | 5.60 | 6.39 | 1.00 | 2.08 | 1.00 |
| in-vehicle-coupon | 5.34 | 2.78 | 3.40 | 1.00 | 1.27 | 1.00 |
| marketing_campaign | 1.63 | 3.82 | 1.07 | 1.00 | 4.98 | 1.00 |
| marketing_series | 2.81 | 1.54 | 1.69 | 1.00 | 2.10 | 1.00 |
| non-verbal-tourist | – | – | – | – | – | 1.00 |
| online_shoppers | 1.58 | 3.34 | 1.16 | 1.00 | 4.59 | 1.00 |
| promoted | – | – | – | – | – | 1.00 |
| telecom_churn | 1.91 | 4.36 | 1.12 | 1.00 | 5.32 | 1.00 |
| telecom_churnV2 | – | – | – | – | – | 1.00 |
| telecust | – | – | – | – | – | 1.00 |
| term_deposit | 3.46 | 6.08 | 4.67 | 1.00 | 5.52 | 1.00 |
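Table 7 reports the imbalance ratio (IR) of each dataset after undersampling, assuming the usual multiclass definition: the size of the largest class divided by the size of the smallest class, so that IR = 1.00 denotes a perfectly balanced dataset. A minimal sketch with a hypothetical label vector:

```python
# Multiclass imbalance ratio: majority-class size / minority-class size
# (assumed definition; IR = 1.00 means a fully balanced dataset).
from collections import Counter

def imbalance_ratio(labels):
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical three-class label vector: 50 "a", 25 "b", 10 "c"
labels = ["a"] * 50 + ["b"] * 25 + ["c"] * 10
print(imbalance_ratio(labels))  # 5.0
```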
Table 8. Friedman ranking of the undersampling algorithms, according to IR.

| Algorithm | Ranking |
|---|---|
| CIRUS | 1.325 |
| RUS | 2.375 |
| OSS | 3.950 |
| NCL | 4.250 |
| CNNTL | 4.450 |
| TL | 4.650 |
Table 9. Results of Holm's post hoc test, comparing undersampling algorithms according to IR.

| i | Algorithm | z | p | Holm (α/i) |
|---|---|---|---|---|
| 5 | TL | 5.620276 | 0.000000 | 0.010000 |
| 4 | CNNTL | 5.282214 | 0.000000 | 0.012500 |
| 3 | NCL | 4.944152 | 0.000001 | 0.016667 |
| 2 | OSS | 4.437060 | 0.000009 | 0.025000 |
| 1 | RUS | 1.774824 | 0.075927 | 0.050000 |
Table 10. Friedman ranking of NER values obtained by CNAC after applying undersampling algorithms.

| Algorithm | Ranking |
|---|---|
| CIRUS | 2.425 |
| TL | 3.150 |
| NCL | 3.375 |
| RUS | 3.900 |
| OSS | 4.075 |
| CNNTL | 4.075 |
Table 11. Holm's post hoc test comparison of undersampling algorithms according to the NER.

| i | Algorithm | z | p | Holm (α/i) |
|---|---|---|---|---|
| 5 | OSS | 2.789009 | 0.005287 | 0.010000 |
| 4 | CNNTL | 2.789009 | 0.005287 | 0.012500 |
| 3 | RUS | 2.493205 | 0.012660 | 0.016667 |
| 2 | NCL | 1.605793 | 0.108319 | 0.025000 |
| 1 | TL | 1.225474 | 0.220397 | 0.050000 |
Table 12. Summary of the comparison of CIRUS against the other undersampling algorithms, according to IR and NER.

| CIRUS vs. | IR | NER | Overall |
|---|---|---|---|
| OSS | Better | Better | Better |
| CNNTL | Better | Better | Better |
| RUS | Equal | Better | Better |
| NCL | Better | Equal | Better |
| TL | Better | Equal | Better |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

Tusell-Rey, C.C.; Camacho-Nieto, O.; Yáñez-Márquez, C.; Villuendas-Rey, Y. Customized Instance Random Undersampling to Increase Knowledge Management for Multiclass Imbalanced Data Classification. Sustainability 2022, 14, 14398. https://doi.org/10.3390/su142114398