Integrating Data Mining Techniques for Naïve Bayes Classification: Applications to Medical Datasets

Abstract: In this study, we designed a framework in which three techniques, the classification tree, association rules analysis (ASA), and the naïve Bayes classifier, were combined to improve the performance of the latter. A classification tree was used to discretize quantitative predictors into categories, and ASA was used to generate interactions in a thorough way, as discretized variables and interactions are key to improving the classification accuracy of the naïve Bayes classifier. We applied our methodology to three medical datasets to demonstrate the efficacy of the proposed method. The results showed that our methodology outperformed the existing techniques for all the illustrated datasets. Although our focus here was on medical datasets, our proposed methodology is equally applicable to datasets in many other areas.


Introduction
As one of the most important data mining tasks in medical research, classification has the defining purpose of predicting the group or class to which a new record belongs based on its observed values for significant predictor variables. For example, classification techniques can be used to assign new patients to a high-risk or low-risk group based on observations of predictors related to disease patterns. Among the many classifiers applied to medical problems, the naïve Bayes classification algorithm is widely used due to its simplicity, efficiency, and efficacy [1][2][3][4].
Several extensions of the naïve Bayes classifier have been proposed, with the goal of improving its classification performance. Presenting an overview of naïve Bayes variants, Al-Aidaroos et al. [5] roughly categorized them into four groups depending on whether they (1) manipulated a set of attributes; (2) allowed interdependencies between attributes; (3) used the principle of local learning; or (4) adjusted the probability by numeric weight. However, some naïve Bayes adaptations integrate more than one approach, a fact that these categorizations do not take into account. For example, Melingi and Vijayalakshmi [6] utilized an effective meta-heuristic algorithm for selecting features and integrated naïve Bayes (NB) and sample weighted random forest (SWRF) classifiers into a single classification approach to achieve an efficient technique for sub-acute ischemic stroke lesion segmentation. After preprocessing, the extracted features were selected by using the multiobjective enhanced firefly algorithm to minimize errors and reduce dimensionality. In the procedure proposed by Melingi and Vijayalakshmi, the hybrid NB-SWRF classifier was used for image segmentation.
Under the assumption that all categorical predictors are independent within each class (i.e., the conditional independence assumption), the naïve Bayes classifier works very well at predicting the class of a new record based on conditional probabilities computed using Bayes' theorem. However, for most datasets in real-world applications, the conditional independence assumption is violated. To alleviate this interdependence problem and improve classification, numerous researchers have proposed adapted naïve Bayes classifiers. Jiang et al. [7] reviewed several improved algorithms that deal with the interdependence issue and divided them into four main approaches: feature selection, structure extension, local learning, and data expansion.
In addition, some naïve Bayes adaptations have been hybridized with other classification techniques. For example, Farid et al. [8] proposed a hybrid algorithm for a naïve Bayes classifier to improve classification accuracy in multi-class classification tasks. In the hybrid naïve Bayes classifier, a decision tree is used to find a subset of important attributes for classification, with the corresponding weights serving as exponential parameters for calculating the conditional probability of the class. Abraham et al. [9] proposed a hybrid feature selection algorithm using the naïve Bayes classifier to reduce dimensionality by removing irrelevant data, increasing learning accuracy, and improving the comprehensibility of the results. Their proposed algorithm relied on naïve minimum description length (MDL) discretization to filter out the least relevant and irrelevant features via chi-square feature selection ranking and used a greedy algorithm, the wrapper subset selector, to identify the best feature set.
A new approach, associative classification with Bayes (AC-Bayes), has been used to resolve rule conflicts in the naïve Bayesian model [10]. In AC-Bayes, a small set of high-quality rules is generated by discovering both the frequent and mutually associated item sets, then the best n rules are selected to predict the class of new instances. When rule conflicts occur, the instances covered by the matched rules are collected to form a new training set, which is used to compute the posterior probabilities of each class, conditioned on the test instance.
By integrating association rule mining with classification tasks, associative classification (AC) algorithms improve classification accuracy and produce easy-to-understand rules. However, AC-based approaches often generate a large number of classification rules. Moreover, several attributes may be excluded from the AC model by various ranking and pruning methods. To cope with these shortcomings, Hadi et al. [11] proposed a new hybrid AC algorithm (HAC) in which the naïve Bayes algorithm was used to reduce the number of classification rules representing all the attribute values, thereby improving the classification accuracy.
In this study, we integrated both the classification tree and association rules analysis (ASA) with the naïve Bayes classifier into one framework. Our goal was to generate candidate variables and interactions via two data mining methods, the classification tree and ASA, in order to improve the classification performance of the naïve Bayes classifier. The focal step in the method we propose is finding interactions through ASA, which is the most thorough way of finding the combinations of variables that help to predict the class of the response. As the discretization method, we developed and used a classification tree with weighting, which we found to be the most effective way to partition quantitative predictors into levels for ASA. The proposed framework was applied to three medical datasets, all of which initially consist of quantitative predictors only. Our proposed methodology was shown to be significantly superior to all the established classifiers in terms of classification accuracy.
This study is organized as follows. The techniques that comprise our framework are reviewed in Section 2, followed by a detailed description of the framework and the proposed method. Applications of our framework to real datasets are described in Section 3, and performance comparisons between our framework and some well-known data classifiers are provided in Section 4. In Section 5, the implications of our results are discussed and the concluding remarks are presented.

Basic Concepts
In the context of statistical classification, our goal is to assign a new record x_p = (x_1, x_2, ..., x_p) to a particular class C_k* with a minimal probability of misclassification. It can be proved that this probability is minimized when the new record x_p is assigned to the class for which the posterior probability P(C_k | x_p) is maximized [12,13]. Based on Bayes' theorem, we can calculate the posterior probability P(C_k | x_p) for k = 1, 2, ..., m as follows:

P(C_k | x_p) = P(C_k) P(x_1, x_2, ..., x_p | C_k) / P(x_1, x_2, ..., x_p).    (1)

With the naïve Bayes classifier, based on the assumption that all the predictors x_1, x_2, ..., x_p are conditionally independent of each other given the class, we obtain:

P(C_k | x_p) ∝ P(C_k) P(x_1 | C_k) P(x_2 | C_k) ... P(x_p | C_k).    (2)

Note that all the probabilities in Equation (2) can be estimated from pivot tables of the response and predictor values in the training set. For example, P(x_1 | C_1) can be estimated by the proportion of records in class C_1 of the training set that take the value x_1, and P(C_1) can be estimated by the proportion of records in the training set belonging to class C_1. We assign to each new record the class with the highest posterior probability.
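As an illustration of Equation (2), all the required probabilities can be estimated from simple counts. The following sketch uses a hypothetical toy training set (not data from the study's datasets) and classifies a record by maximizing the unnormalized posterior; no zero-frequency smoothing is applied, to keep the count-based estimation transparent:

```python
from collections import Counter, defaultdict

# Toy training set: each row is a tuple of predictor values (x1, x2)
# with a class label y. All values here are illustrative only.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 2), "B"), ((2, 1), "B"), ((1, 1), "A")]

# Estimate P(C_k) and P(x_j | C_k) from pivot-table-style counts.
class_counts = Counter(y for _, y in train)
cond_counts = defaultdict(Counter)  # (class, predictor index) -> value counts
for x, y in train:
    for j, v in enumerate(x):
        cond_counts[(y, j)][v] += 1

def posterior_scores(x):
    """Unnormalized P(C_k) * prod_j P(x_j | C_k) for each class (Equation (2))."""
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / len(train)
        for j, v in enumerate(x):
            p *= cond_counts[(c, j)][v] / n_c
        scores[c] = p
    return scores

def classify(x):
    """Assign the class with the highest (unnormalized) posterior."""
    scores = posterior_scores(x)
    return max(scores, key=scores.get)
```

In practice a smoothing constant would be added to the conditional counts so that an unseen predictor level does not zero out an entire class.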

Classification Tree
Due to its transparent rules and visual presentation, the classification tree is one of the most frequently used data mining techniques for classification [14]; for this reason, we selected it as the discretization method in our framework. After testing multiple discretization methods with different criteria, we found that the most effective method for our framework was a classification tree that uses weights to calculate measures including the proportion of the data belonging to each class, the proportion of the data in the left and right child nodes, the Gini impurity index in each node, and the reduction in impurity achieved by a split. Note that we obtained the discretization results from the Salford Predictive Modeler software program (https://cdn2.hubspot.net/hub/160602/file-249977783-pdf/docs/JSM, accessed on 13 February 2021), which implements the classification tree with weighting as described above.
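The split criterion described above can be sketched as standard CART-style Gini reduction. This is only a sketch: the exact weighting used by the Salford implementation is not reproduced here, so each child node is simply weighted by its share of the data:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_reduction(values, labels, threshold):
    """Impurity reduction from splitting a quantitative predictor at
    `threshold`, weighting each child node by its proportion of the data."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return gini(labels) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
```

A tree-growing procedure would evaluate `gini_reduction` at every candidate threshold of a predictor and split at the value giving the largest reduction.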

Association Rules Analysis (ASA)
ASA is used to explore relationships between items in the form of rules, each of which has two parts: the first part comprises the left-hand-side item(s), or condition, and the second is a right-hand-side item, or result. All the rules are represented in the following format: if condition, then result [15][16][17]. Two measurements are attached to each rule. The first measurement, support (s), is computed by s = P(condition and result). The second measurement, confidence (c), is computed by c = P(condition and result)/P(condition). ASA finds all the rules that meet two key thresholds: minimum support and minimum confidence [18].
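The two rule measurements follow directly from their definitions. A minimal sketch, using a hypothetical dict-based record representation (this is not the input format of any specific ASA software):

```python
def support_confidence(records, condition, result):
    """Support and confidence of the rule `if condition then result`.
    `records` is a list of dicts; `condition` and `result` are dicts
    mapping attribute name -> required value."""
    n = len(records)
    matches_cond = [r for r in records
                    if all(r.get(k) == v for k, v in condition.items())]
    matches_both = [r for r in matches_cond
                    if all(r.get(k) == v for k, v in result.items())]
    s = len(matches_both) / n                                   # P(condition and result)
    c = len(matches_both) / len(matches_cond) if matches_cond else 0.0
    return s, c
```

A rule mining run would keep only rules whose `s` and `c` exceed the chosen minimum support and minimum confidence thresholds.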
This set of rules can be used for other purposes, including classification. A technique called classification rule mining (CRM), a subset of ASA, was developed to find a set of rules in a database in order to produce an accurate classifier [19,20]. In this technique, an item is used to represent a pair consisting of a main effect and its corresponding integer value. More specific than ASA, CRM has only one target, and this must be specified in advance. In general, the target of CRM is the response, which means the result of the rule (the right-hand-side item) can only be the response and its class. Therefore, the left-hand-side item (the condition) consists of an explanatory variable and its level. For example, assume that there are k categorical variables, X1, X2, ..., Xk, and a categorical response, Y.
We used CRM to find the combinations of levels of variables that appear frequently and strongly for each of the classes of the response through selected rules, which will be converted into new variables, called interactions (explained in detail in the next section). These interactions have the potential to improve classification accuracy when they are included in the models, as we will demonstrate with the focal datasets.

Many rules can be generated by CRM. As an example, a rule could be "If X1 = 1, X2 = 3, then Y = 1," with s = P(X1 = 1, X2 = 3, and Y = 1) and c = P(X1 = 1, X2 = 3, and Y = 1)/P(X1 = 1 and X2 = 3).

Proposed Method: Naïve Bayes Classifier Framework
The proposed framework for building a naïve Bayes classifier consists of four key steps (Figure 1):
Step 1 (Discretization by CT): Utilize a classification tree to discretize each quantitative explanatory variable and convert it into a categorical variable.
Step 2 (Rules generation by ASA): Utilize CRM, a subset of ASA, to generate classifier rules from all the categorical variables, i.e., the new categorical variables generated in Step 1 and the original categorical variables.
Step 3 (Interactions generation): Generate the interactions for all the classifier rules in Step 2.
Step 4 (Naïve Bayes model selection): Select the optimal model for the naïve Bayes classifier, i.e., the one that provides the best value for our selection method, from all the original categorical variables, all the categorical variables generated in Step 1, and all the interactions generated in Step 3.

Step 1: Discretization by CT
As noted, we recommend a classification tree with weighting as the discretization method for our framework. In this step, we fitted a classification tree with each predictor as the sole predictor to find its splitting values. In turn, these values were used to partition the quantitative variable into levels, converting each quantitative variable into a categorical one.
Step 2: Rule Generation by ASA
In Step 2, we used CRM to create rules from the datasets. The candidate variables for generating the rules are (i) all the original categorical variables; and (ii) all the newly generated categorical variables from Step 1. This step is expected to result in rules of the form "If Xi's = xi's, then Y = y," where xi is the level of variable Xi and y is the level of response Y. To perform the CRM, we used the classification based on associations (CBA) program developed by the Department of Information Systems and Computer Sciences at the National University of Singapore [19]. To simplify the process, we used the classifier rules obtained from CBA, as shown in the following section. All classifier rules became the input for Step 3.

Step 3: Interactions Generation
In Step 3, we generated the interactions for the naïve Bayes classifier from the classifier rules generated in Step 2. We generated interactions between the items on the left-hand side with the same settings as those that appear in the rule. Suppose the selected rule has three predictors, in the form "If Xi = xi, Xj = xj, and Xk = xk, then Y = y," where xi is the level of variable Xi, xj is the level of variable Xj, xk is the level of variable Xk, and y is the level of response Y. We generated the interaction among Xi, Xj, and Xk by labeling it as 1 if Xi = xi, Xj = xj, and Xk = xk, and as 0 otherwise. This interaction is denoted Xi(xi)Xj(xj)Xk(xk). For example, for the rule "If X1 = 2, X2 = 2, and X3 = 1, then Y = 1," we created an interaction among X1, X2, and X3, denoted X1(2)X2(2)X3(1). We have X1(2)X2(2)X3(1) = 1 if X1 = 2, X2 = 2, and X3 = 1, and 0 otherwise. The level of Y does not play any role in generating these variables. These interactions become the candidate variables in Step 4.
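The rule-to-interaction conversion can be sketched as an indicator function over the rule's left-hand side. A hypothetical dict-based representation of records and rule conditions is assumed here:

```python
def make_interaction(rule_condition):
    """Return a function computing the 0/1 interaction variable
    Xi(xi)Xj(xj)... for a rule's left-hand side.
    `rule_condition` maps variable name -> required level."""
    def indicator(record):
        return 1 if all(record.get(k) == v for k, v in rule_condition.items()) else 0
    return indicator

# For the rule "If X1 = 2, X2 = 2, and X3 = 1, then Y = 1" the
# interaction X1(2)X2(2)X3(1) is built from the left-hand side only;
# the level of Y plays no role.
x1_2_x2_2_x3_1 = make_interaction({"X1": 2, "X2": 2, "X3": 1})
```

Evaluating such an indicator over every record adds one new 0/1 column per classifier rule, which then joins the candidate pool for Step 4.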
Step 4: Naïve Bayes Model Selection
In Step 4, we selected the model for the naïve Bayes classifier by finding the set of predictors that gives the best accuracy under leave-one-out cross-validation (LOOCV), i.e., k-fold cross-validation with k equal to the number of observations. The candidate variables are (i) the original categorical variables; (ii) the categorical variables generated in Step 1; and (iii) the interactions generated in Step 3.
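The LOOCV accuracy used as the selection criterion can be sketched generically. The trivial majority-class model below is only a stand-in to keep the example self-contained; in the actual framework, `fit` and `predict` would be the naïve Bayes classifier trained on a candidate predictor set:

```python
from collections import Counter

def loocv_accuracy(records, labels, fit, predict):
    """Leave-one-out cross-validation: train on all records but one,
    predict the held-out record, and report the overall accuracy."""
    correct = 0
    n = len(records)
    for i in range(n):
        train_x = records[:i] + records[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, records[i]) == labels[i]
    return correct / n

# Stand-in model: always predicts the majority class of the training fold.
def majority_fit(xs, ys):
    return Counter(ys).most_common(1)[0][0]

def majority_predict(model, x):
    return model
```

Model selection then amounts to computing `loocv_accuracy` for each candidate subset of predictors and keeping the subset with the highest value.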

Illustrated Examples
We demonstrated our methodology using three datasets: the thyroid dataset, the diabetes dataset, and the appendicitis dataset. Note that each of these three datasets initially comprised only quantitative predictors.

Thyroid Dataset
Retrieved from the University of California Irvine (UCI) machine learning site (https://archive.ics.uci.edu/ml/datasets/thyroid+disease, accessed on 11 February 2021), the dataset provided information on the thyroid function of 215 patients: 150 (69.77%) with normal function, 35 (16.28%) with hyperfunction, and 30 (13.95%) with hypofunction. There were five predictors in the dataset, all of which were quantitative variables (Table 1). The objective of this analysis was to classify the patients as normal (Class 1), hyperfunction (Class 2), or hypofunction (Class 3). We applied our approach to the thyroid dataset via the following steps.
Step 1 (Discretization by CT): We discretized the five quantitative variables into categories using a classification tree. We fitted the model to predict the response, using one variable at a time, and thus obtaining the splitting values for each quantitative variable.
The classification model in which T3 resin was used as a predictor to classify the response yielded two splitting values: 99.5 and 117.5. Therefore, we generated the categorical variable by discretizing T3 resin (X1), which has three levels (Table 2).
The classification model in which thyroxine was used as a predictor to classify the response yielded two splitting values: 5.65 and 12.65. Therefore, we generated the categorical variable by discretizing thyroxine (X2), which has three levels (Table 2).
The classification model in which thyronine was used as a predictor to classify the response yielded two splitting values: 1.15 and 2.65. Therefore, we generated the categorical variable by discretizing thyronine (X3), which has three levels (Table 2).
The classification model in which thyroid was used as a predictor to classify the response yielded eight splitting values: 0.75, 1.05, 1.15, 1.45, 1.65, 1.75, 1.85, and 4.0. Therefore, we generated the categorical variable by discretizing thyroid (X4), which has nine levels (Table 2).
The classification model in which the TSH-value was used as a predictor to classify the response yielded two splitting values: 0.65 and 4.45. Therefore, we generated the categorical variable by discretizing the TSH-value (X5), which has three levels (Table 2).
Step 2 (Rules generation by ASA): We used CBA to obtain the classifier rules. In this step, the variables inputted into the process were the discretized categorical predictors (X1-X5). In total, 21 classifier rules were generated in this step (Table 3).
Step 3 (Interactions generation): We converted the 21 classifier rules into interactions. In total, 21 interactions were generated from the 21 classifier rules (Table 3).
Step 4 (Naïve Bayes model selection): We combined the 21 interactions with the discretized variables (X1-X5) to generate 26 candidate predictors for naïve Bayes. We searched for the model that gave the best LOOCV accuracy. We selected the model with all the discretized variables X1-X5 and the first 19 interactions shown in Table 3. The LOOCV value generated by this model was 99.53%.

Diabetes Dataset
Originally from the National Institute of Diabetes and Digestive and Kidney Diseases, the diabetes dataset was retrieved from Kaggle (https://www.kaggle.com/uciml/pimaindians-diabetes-database, accessed on 17 February 2021). In this dataset, eight quantitative variables were used to classify patients as either healthy or diabetic [21]. Of the 768 observations, there were 500 healthy patients (Class 0) and 268 patients with diabetes (Class 1); that is, 65.1% of the observations belonged to Class 0 and 34.9% to Class 1. There were eight predictors in the dataset, all of which were quantitative variables (Table 4). The objective of this analysis was to classify the patients as healthy (Class 0) or diabetic (Class 1). We applied our approach to the diabetes dataset via the following steps.

Step 1 (Discretization by CT):
We discretized the eight quantitative variables into categories using a classification tree. The discretized variables are shown in Table 5.
Step 2 (Rules generation by ASA): We used CBA to obtain the classifier rules. In this step, the variables inputted into the process were the discretized categorical predictors (X1-X8). In total, 77 classifier rules were generated in this step.
Step 3 (Interactions generation): We converted the 77 classifier rules into interactions. In total, 77 interactions were generated from the 77 classifier rules.
Given the high number of rules and interactions generated, we present only the first 10 rules and the interactions they generated in Table 6.
Step 4 (Naïve Bayes model selection): We combined the 77 interactions with the discretized variables (X1-X8) to generate 85 candidate predictors for naïve Bayes. We searched for the model that gave the best LOOCV value. We selected the model with X1, X2, X5, X6, X7, and X8 and the interaction generated from Rule 2, which is X4(2)X3(2)X2(1). The LOOCV value from this model was 81.25%.

Appendicitis Dataset
Retrieved from the KEEL website (https://sci2s.ugr.es/keel/dataset.php?cod=183, accessed on 21 April 2021), the appendicitis dataset comprised seven medical measures used to classify patients according to whether or not they had appendicitis. Among the 106 observations, there were 85 healthy patients (Class 0) and 21 patients who had appendicitis (Class 1); that is, 80.19% of the observations belonged to Class 0 and 19.81% to Class 1. There were seven predictors in the dataset, all of which were quantitative variables (Table 7). The objective of this analysis was to classify the patients as healthy (Class 0) or as having appendicitis (Class 1). We applied our approach to the appendicitis dataset via the following steps:
Step 1 (Discretization by CT): We discretized the seven quantitative variables into categories using a classification tree. The discretized variables are shown in Table 8.
Step 2 (Rules generation by ASA): We used CBA to obtain the classifier rules. In this step, the variables inputted into the process were the discretized categorical predictors (X1-X7). In total, 10 classifier rules were generated in this step.

Performance Comparison via Medical Datasets
In this section, we describe our application of the other well-known classification methods to the thyroid, diabetes, and appendicitis datasets in order to compare their performance with our methodology.
A comparison of the performance of the five methods is shown in Table 10. The five methods tested are as follows: (1) random forest (RF); (2) support vector machine (SVM); (3) k-nearest neighbors (kNN); (4) classification tree (CT); and (5) naïve Bayes (NB) with classification tree (CT) and ASA, which is our approach (NB + CT + ASA). The comparison is based on LOOCV accuracy. For random forest, we set the number of trees (ntree) to four levels: 100, 200, 500, and 1000. Then, for each number of trees, we searched for the best LOOCV value over all the numbers of variables considered at each split, as indicated in Table 10.
For SVM, the LOOCV value, as shown in Table 10, was found for each of the four kernel types: the sigmoid kernel, the linear kernel, the polynomial kernel, and the radial basis kernel.
For kNN, the indicated LOOCV accuracy value shown in Table 10 is the highest for all the odd numbers of neighbors (k) from 1 to 19.
For the classification tree, the LOOCV accuracy value shown in Table 10 was obtained from the number of splits that gave the best LOOCV value among all possible numbers.
As shown in Table 10, our approach provided the highest LOOCV value of all the methods for all three medical datasets, with the most impressive performance shown for the appendicitis dataset.

Discussion and Conclusions
Our naïve Bayes model selection framework provides a classifier that significantly outperformed the other well-known data mining techniques tested, i.e., classification tree, random forest, kNN, and SVM. Our approach has an advantage over the other methods in that it generates interactions through ASA, an unconventional way of creating interactions by finding the combinations of the levels of the variables that are important for predicting the class of a categorical response. In particular, ASA is effective at finding the combinations of the levels of variables that appear frequently and strongly for each of the classes of the response through selected rules. The model's effectiveness in this regard is very helpful for working with unbalanced datasets such as the thyroid and appendicitis datasets. Moreover, our experiments with different discretization methods showed the classification tree to be the most effective for our approach.
We demonstrated that the integration of three techniques, the classification tree, ASA, and the naïve Bayes classifier, constitutes a superior and practical classifier. Based on our application examples, it is evident that the newly generated variables and interactions made a significant contribution to improving the naïve Bayes classifier.
Author Contributions: Conceptualization and methodology, P.C.; software, P.C. and C.Y.; formal analysis, P.C. and A.P.; validation and investigation, P.C. and S.H.; writing-original draft preparation, P.C., A.P. and S.H.; writing-review and editing, P.C., S.H. and C.Y. All authors have read and agreed to the published version of the manuscript.