An Accurate and Easy to Interpret Binary Classifier Based on Association Rules Using Implication Intensity and Majority Vote

In supervised learning, classifiers range from simpler, more interpretable and generally less accurate ones (e.g., CART, C4.5, J48) to more complex, less interpretable and more accurate ones (e.g., neural networks, SVM). Within this tradeoff between interpretability and accuracy, we propose a new classifier based on association rules that is both easy to interpret and competitive in accuracy. To illustrate this proposal, its performance is compared to that of other widely used methods on six open access datasets.


Introduction
The problem of classification is crucial in many applications. For example, in handwritten digit recognition, a digitized image of the written digit (the input) is processed, and each character must be classified as one of the digits 0-9 (ten classes in all). In a SPAM filter, each message is processed and must be classified as SPAM or ham (not SPAM). The inputs are features of the messages (the frequency of some key words, of capital letters, etc.). In medicine, inputs such as blood indicators and patients' other pieces of information can be used to decide if a disease is present or absent. The last two examples are instances of binary classification. In general, each classifier is a flexible model that learns from data, in its particular way, using a large database of observations for which the ground truth is known (the training set). Once the classifier is tuned up in order to have a small misclassification rate over the training set, it can be tested with new observations (new handwritten character images, new messages, new patients). It will assign to each observation the most plausible class (i.e., the most likely letter or number, whether it is SPAM or not, whether the patient suffers from the disease or not) and its accuracy can be tested against a test set [1].
Among the many existing classification methods, one can cite decision tree algorithms, support vector machines, Bayesian algorithms, rule-based algorithms, neural networks, distance based methods, genetic algorithms and associative classification [2].
Focusing on the rule-based algorithms, one can highlight the seminal work [3] and the papers [4][5][6][7], which led to classification algorithms with good results. Those methods are based on the mining of rules by the well known Apriori algorithm [8], which has been improved since, and has been successfully applied to very large datasets, mining all the possible rules concerning frequent itemsets (sacrificing only small support rules). In the context of classification, class association rules are rules implying a particular category of the classification variable. In [3], the authors introduced the first serious approach to classification based on association rules (CBA), which arranges the classification rules under a precedence order, so that each observation is classified by the strongest applicable rule and, if no rule applies, by a default class, keeping the misclassification rate small [9,10]. The strength of a rule is measured by its confidence, as it estimates the probability of the class given the other features of the rule. Their results are successful with respect to the ROC curves, but they usually produce rather complex structures describing the classification process.
In this paper, another rule-based method is proposed, with two main differences from the CBA strategy. First, only 2-length classifying rules are considered (and are combined for classification improvement). Second, a new quality measure is used, taking into account both the confidence of the rules and the statistical effect of the premises over the conclusions of the rules.
For this second aspect of the rules, the concept of implication intensity of Statistical Implicative Analysis (SIA) (see [11,12]) is used in this work. SIA is a data analysis method for both hierarchical clustering and association rules. It uses statistical independence (among the variables) as a baseline, to measure the similarity between variables (for the clustering task) and the quality of rules (for the association rule task). SIA was initially developed in the field of Didactics of Mathematics [12], and is now being used in a wider range of data and domains [13][14][15][16][17].
In this work, the proposed classifier is compared to other classification methods of widespread use in the literature (Naïve Bayes, radial basis neural networks, decision trees J48 and simple CART) with breast cancer open access datasets from the UCI Machine Learning repository, namely Wisconsin Breast Cancer (WBC), Wisconsin Diagnosis Breast Cancer (WDBC) and Wisconsin Prognosis Breast Cancer (WPBC) [18], through the confusion matrices and the measures derived from them: accuracy, precision, sensitivity and specificity. These datasets have been used to test classifiers in, for example, [2][19][20][21][22][23]. In addition, the Haberman dataset has been used to test algorithms in [24]. The EEG Eye State Data Set (IDs_mapping) has been used in [25] for classification of the eye state using the k-Nearest Neighbors algorithm and multilayer perceptron neural network models, and also in [26], which proposes a novel EEG eye state identification approach based on Incremental Attribute Learning (IAL). The Cervical Cancer Behavior Risk Data Set (SOBAR) in [27] was also used in this study.
All the computations in this paper have been conducted by means of the R Statistical Software [28].

Methodology
The aim of this paper is to define an accurate and easy to interpret binary classifier, and to show that it is as competitive as other widely used and more complex classifiers. For that aim, a quality measure taken from Statistical Implicative Analysis, described in Section 2.1, was used. The rationale of the classifier is presented in Section 2.2, and the criteria for comparing the performance of the competing algorithms are given in Section 2.4.
In Section 2.3, the open access datasets used for the test are introduced, and the results of the comparison are shown in Section 3. Finally, the results are discussed in Section 4.

A New Quality Measure of Association Rules
Association rules are endowed with several quality (or interestingness) measures. The most widely used ones are confidence, support, and lift, and they can be easily displayed when rules are mined through the Apriori algorithm [8], implemented in the popular arules R package [29].
There are several other quality measures, each one extracting a slightly different facet of the considered rule [30][31][32]. For the sake of good prediction rates, the confidence is the most important one, since it measures how likely the right-hand side (rhs) of the rule is observed for an individual who is seen to hold the left-hand side (lhs) of the rule.
For very large datasets, the computation is an issue, and the Apriori algorithm needs to sacrifice rules affecting a low proportion of the sample, in order to finish in a reasonable time. The support of a rule is the fraction of observations (relative frequency) holding both the lhs and rhs of the rule, and only rules exceeding a minimum support and a minimum confidence are mined.
The lift of a rule is the ratio of the confidence of the rule to the support of the rhs. It can be interpreted as the effect of the lhs over the rhs: when lift > 1, the lhs 'raises' the chances of observing the rhs, while a lift < 1 means that the lhs 'reduces' the chances of observing the rhs.
Lift and confidence are neither completely independent nor tightly related. One can have rules with a confidence such as 0.80 (hence giving a good predictive ability), but a lift of only 0.90, meaning that the lhs reduces the initial chances of the rhs by 10%. Although the best prediction is the occurrence of the rhs, because of the confidence, the effect of the lhs should not be overlooked.
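As an illustration of these three measures, the following minimal Python sketch computes support, confidence and lift for a 2-length rule from two 0/1 vectors. The function name rule_measures and the sample of 90 observations are hypothetical; the counts were chosen only to reproduce the situation described above (confidence 0.80 with lift 0.90).

```python
def rule_measures(lhs, rhs):
    """Support, confidence and lift of the rule lhs -> rhs for 0/1 vectors."""
    n = len(lhs)
    n_lhs = sum(lhs)
    n_rhs = sum(rhs)
    n_both = sum(1 for a, b in zip(lhs, rhs) if a == 1 and b == 1)
    support = n_both / n                 # fraction holding both lhs and rhs
    confidence = n_both / n_lhs          # fraction of lhs holders that also hold rhs
    lift = confidence / (n_rhs / n)      # confidence relative to the support of the rhs
    return support, confidence, lift


# Hypothetical sample (assumption, not taken from the paper):
# 8 rows with lhs and rhs, 2 with lhs only, 72 with rhs only, 8 with neither.
lhs = [1] * 8 + [1] * 2 + [0] * 72 + [0] * 8
rhs = [1] * 8 + [0] * 2 + [1] * 72 + [0] * 8
support, confidence, lift = rule_measures(lhs, rhs)
print(support, confidence, lift)
```

Here the rhs is already very frequent (80 of 90 observations), so a rule can predict it well while its lhs still lowers the chances of observing it.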
All the quality measures are descriptive, and therefore subject to the natural variability of the sampling scheme that produced the dataset. In order to measure the effect of the lhs over the rhs, one can use statistical inference on the value of the lift of a rule in the whole population. To that aim, the implication intensity of [11] is used: writing N ab for the random number of counterexamples of the rule a ⇒ b (observations holding a but not b) under statistical independence of a and b, the implication intensity is the probability that N ab exceeds the number of counterexamples actually observed. Depending on the random sampling scheme, the distribution of N ab can be Binomial or Poisson, and a Gaussian approximation is feasible for large sample sizes [11]. Other probabilistic approaches to define the interestingness of rules can be seen in [33].
For the data in Table 1 (100 observations, 40 of them with A = 1, 25 with B = 1, and 20 with both), there are 20 observed counterexamples, so the implication intensity of A ⇒ B is P(N ab > 20). We can see A as the outcome of Bernoulli trials of parameter p = 40/100, and similarly for B, with parameter p = 25/100. Under statistical independence, the counterexamples (A = 1, B = 0) can be seen as Bernoulli trials of parameter p = 0.4 · (1 − 0.25) = 0.3, and the observed number of counterexamples is exceeded with probability P(N ab > 20) = 0.9835. Although the confidence of the rule is poor, 20/40 = 0.5, the lift is 0.5/0.25 = 2, showing that an itemset containing A (A = 1) has twice the chance of containing B (B = 1). The implication intensity accounts for this highly significant effect of A over B.
Bearing this in mind, the authors want to build a classifier based upon rules that are strong, showing a large confidence, and also significant, that is, rules whose lhs shows a significant positive effect on the rhs (a strong implication intensity). To this end, the notion of implifidence (a term formed by the contraction of implication and confidence) is used as a quality measure.
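The implication intensity of the example can be checked numerically. The sketch below is only an illustration under the Binomial sampling scheme: it assumes that each of the n observations is a counterexample (A = 1, B = 0) with probability P(A) · (1 − P(B)) under independence, and it evaluates P(N ab > observed counterexamples) exactly. The function name implication_intensity and its argument names are hypothetical.

```python
from math import comb


def implication_intensity(n, n_a, n_b, n_ab):
    """P(N > observed counterexamples) under independence (Binomial model)."""
    observed = n_a - n_ab                  # counterexamples: A = 1 and B = 0
    p = (n_a / n) * (1 - n_b / n)          # counterexample probability under independence
    cdf = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(observed + 1))
    return 1 - cdf                         # probability of exceeding the observed count


# Counts of the example: 100 observations, 40 with A, 25 with B, 20 with both.
intensity = implication_intensity(n=100, n_a=40, n_b=25, n_ab=20)
print(intensity)
```

With these counts the expected number of counterexamples under independence is 30, well above the 20 observed, which is why the intensity is close to 1.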

Definition 2 ([34]). The implifidence IC of a rule a ⇒ b is defined from the implication intensity I and the confidence C, evaluated on the mined rule a → b and on its transposition, the rule b → a; the exact expression is given in [34].
The properties of IC(·) with respect to the confidence C(·) and the intensity I(·) are studied in [34], where the authors show, under several conditions, their relationship.

The Proposed Approach
Usually, input variables are numerical or categorical data. In order to extract association rules, they are turned into transactions. Numerical data can be binned into a small number of intervals (for example, variable V i has a range of data that can be split into n i intervals I 1 , I 2 , . . . , I n i ). For each j = 1, 2, . . . , n i , one can define the binary variable V i,j , indicator of interval I j (i.e., taking value 1 only for data whose variable V i belongs to interval I j , and 0 elsewhere). The number of intervals for the numeric variables is the first parameter of our classifier.
The output binary classification variable Y is replaced by two other binary variables Y 0 and Y 1 , each one indicator of the corresponding class.
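The transformation of the variables can be sketched as follows. The text does not state how the intervals are chosen, so equal-width intervals are an assumption made here purely for illustration, and the function names interval_indicators and class_indicators are hypothetical.

```python
def interval_indicators(values, n_intervals):
    """Equal-width binning (an assumption) of a numeric variable into 0/1 columns."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals or 1.0          # avoid a zero width for constant data
    columns = [[0] * len(values) for _ in range(n_intervals)]
    for row, v in enumerate(values):
        j = min(int((v - lo) / width), n_intervals - 1)   # the maximum goes to the last interval
        columns[j][row] = 1                         # V_{i,j} = 1 only for the interval holding v
    return columns


def class_indicators(y):
    """Replace a 0/1 class variable Y by its two indicators Y0 and Y1."""
    return [1 - c for c in y], list(y)


cols = interval_indicators([1, 2, 3, 4, 5, 6], n_intervals=3)
y0, y1 = class_indicators([0, 1, 1, 0, 0, 1])
print(cols, y0, y1)
```

Each observation activates exactly one indicator per original variable, which is what allows every 2-length rule to have a single binary variable as its lhs.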
The notion of implifidence [34], introduced in the previous section as the quality measure that takes into account both the confidence (best for prediction) and a statistically significant effect of the lhs over the rhs, can be used. Every 2-length classifying rule (i.e., whose lhs is a binary input variable and whose rhs is either Y 0 or Y 1 ) can be mined. The implifidence threshold is the second parameter of the model, and it will filter the rules in order to keep only the significant ones.
Once the significant rules have been detected, the involved variables are said to be the significant variables for the classification process.
The more variables are used in the classification process, the lower the prediction error that may be reached. In contrast, the resulting classification method becomes more complex, and less comprehensible for the practitioner. In order to reach a trade-off, a third parameter has to be chosen: an odd number of significant variables to be grouped, which then classify by majority voting.
The final group of significant variables shall be the one with the lowest sample prediction error, and it will constitute the final classification method.
To summarize, the process is the following:
1. Transform every input variable V i into a set of binary variables V i,1 , V i,2 , . . . , V i,n i (this involves one or more parameters, namely the number of intervals into which each numeric variable is categorized). The classification variable Y is doubled into the binary variables Y 0 and Y 1 .
2. Mine all the 2-length classifying rules (whose lhs is a binary input variable and whose rhs is either Y 0 or Y 1 ). This choice allows us to mine all the rules without the sacrifice of a minimum support, even for large datasets, in a reasonable time.
3. Choose a threshold i 0 (a new parameter in [0, 1]) for implifidence, and keep the rules exceeding that implifidence. These are called significant rules. An original variable V i is called a significant variable if each of its related binary variables (V i,1 , V i,2 , . . . , V i,n i ) is the lhs of a significant rule.
4. Extract the subset of significant variables (in this step, variables with low effect on the classification variable Y are discarded).
5. Make the table of 1-predictor classifications, using the significant rules, where each individual (row) is classified in accordance with each of the selected significant variables (column).
6. Choose a low odd number m of 'premises' (another parameter). Rules will be built with at most m premises in order to classify data.
7. For every combination of m significant variables:
• Classify each individual in the sample using the classifying rules involving the m variables, by majority voting among the outcomes of those rules.
• Compare the classification with the true class.
• Assess the prediction of the m-tuple by the four measures of performance (accuracy, precision, sensitivity, specificity).
8. Choose the m-tuple with the best performance (accuracy by default, but any other measure can be chosen).
9. Return the final classifier, which uses the m-tuple leading to the best performance.
For example, if a variable V has 3 categories a, b and c, one then considers the rules where each category implies the positive class (a → 1, b → 1 and c → 1), and the rules where each category implies the negative class (a → 0, b → 0 and c → 0). If, for category a, neither of its two rules is significant, it will not be possible to use this variable V to classify an individual for whom V = a. Then V is not a significant variable.
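The search over groups of significant variables (steps 5 to 8 of the summary) can be sketched in a few lines of Python. This is a minimal illustration and not the code of the R package: the table of 1-predictor classifications is assumed to be a dictionary mapping each significant variable to the 0/1 class it predicts for every individual, accuracy is the only criterion considered, and the names majority_vote, best_tuple and the toy data are hypothetical.

```python
from itertools import combinations


def majority_vote(votes):
    """Class 1 if more than half of the (odd number of) votes are 1, else 0."""
    return 1 if sum(votes) > len(votes) / 2 else 0


def best_tuple(one_predictor_table, y_true, m):
    """Try every m-tuple of significant variables and keep the most accurate one."""
    best = None
    for combo in combinations(sorted(one_predictor_table), m):
        predictions = [
            majority_vote([one_predictor_table[v][i] for v in combo])
            for i in range(len(y_true))
        ]
        accuracy = sum(p == t for p, t in zip(predictions, y_true)) / len(y_true)
        if best is None or accuracy > best[1]:
            best = (combo, accuracy)
    return best


# Toy table of 1-predictor classifications (hypothetical data).
y_true = [1, 1, 1, 0, 0, 0]
table = {
    "V1": [1, 1, 0, 0, 0, 1],
    "V2": [1, 0, 1, 0, 1, 0],
    "V3": [0, 1, 1, 1, 0, 0],
    "V4": [0, 0, 0, 1, 1, 1],
}
combo, accuracy = best_tuple(table, y_true, m=3)
print(combo, accuracy)
```

In this toy example each of V1, V2 and V3 misclassifies two of the six individuals on its own, but they err on different individuals, so their majority vote is correct everywhere.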
We have implemented the algorithm in a convenient R package for Linux, freely available in [35]. It contains the function SIAclassif(), which runs the algorithm and returns the classifier, as well as the function predict.SIA(), which takes the object given by the previous function and applies it to new data, returning the predicted class of all the instances present in a new dataset.
In the Appendix A, one can find the text output of our algorithm, which explains to the user how the instances are classified according to their values in the significant variables.
The WBC dataset relates the malignancy of tumors (the class attribute, 2 for benign, 4 for malignant) to nine attributes: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses. It has 683 records with a class distribution of 444 negative and 239 positive records.
The WDBC dataset relates the diagnosis (M = malignant, B = benign) to ten cell nuclei features, extracted by a computer vision diagnostic system: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension [36]. For each image, as it contains several cells, three values of each feature are kept: the mean value, the standard error, and the mean of the three largest values. In summary, there are 569 records with 30 input features, and a class distribution of 212 positive and 357 negative records.
In the WPBC dataset, the outcome (R = recurrent, N = non-recurrent) is related to time (recurrence time, for recurrent, and disease-free time, for non-recurrent), to the same kind of features as in the WDBC dataset, and to tumor size and lymph node status. In summary, there are 194 records on 30 input features with 46 positive and 148 negative records.
The Haberman dataset reflects a study on the survival of patients who had undergone surgery for breast cancer. It contains 306 instances and 3 attributes (the age of the patient at the time of operation, the year the patient underwent surgery and the number of positive axillary nodes detected) as well as the survival status class (1 = the patient survived 5 years or longer, and 2 = the patient died within 5 years). It gives 81 positive and 225 negative records.
EEG Eye State classification is important and useful to detect the cognitive state of humans. The dataset includes 14 continuous EEG measurements. The eye state was detected via a camera during the EEG measurement and added manually to the file later, after analyzing the video frames. A '1' indicates the eye-closed state and a '0' the eye-open state, with 6723 positive and 8257 negative records.
The Cervical Cancer Behavior Risk Data Set contains 18 explanatory attributes and the class variable ca cervix (1 = has cervical cancer, 0 = no cervical cancer) in 72 instances, with 21 positive and 51 negative records [37].

Classifier Comparison Criteria
To evaluate our approach, we have compared our results to those of different classifiers (J48, Naive Bayes, svmRadial, CART and K-nearest neighbors). All the machine learning algorithms used for the comparison in this paper were run using RWeka [38]. The RWeka package provides a collection of functions that give R access to the machine learning algorithms in the Java-based Weka software package [39].
The effectiveness of a classification model is measured from the confusion matrix, which counts the correct and incorrect classifications for each possible value of the variable being classified [1,40].
The entries in the confusion matrix have the following meaning in the context of our study: TP is the number of true positives (the number of correct predictions that an instance is positive), TN is the number of true negatives (the number of correct predictions that an instance is negative), FP is the number of false positives (the number of incorrect predictions that an instance is positive) and FN is the number of false negatives (the number of incorrect predictions that an instance is negative).
The accuracy is the proportion of the total number of predictions that were correct, (TP + TN)/(TP + TN + FP + FN). The precision is the measure of accuracy provided that a specific class has been predicted, TP/(TP + FP). The sensitivity is the measure of the ability of a prediction model to select instances of a certain class from a data set, TP/(TP + FN). The specificity corresponds to the true negative rate, which is commonly used in two class problems, TN/(TN + FP) [20,22].
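The four criteria follow directly from the entries of the confusion matrix, as in the minimal Python sketch below. The function name performance and the counts in the usage lines are hypothetical; returning NaN for a 0/0 fraction is an illustrative choice, not something stated in the paper.

```python
def performance(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity and specificity from a confusion matrix."""

    def ratio(num, den):
        # A 0/0 fraction is undefined; NaN is used here to flag it (assumption).
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical confusion matrix: 40 TP, 50 TN, 5 FP, 5 FN.
scores = performance(tp=40, tn=50, fp=5, fn=5)
print(scores)
```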

Results
As stated in Section 2.4, the decision was made to use 5-fold cross-validation. Each dataset was randomly divided into five groups. Holding out one of these groups, all methods were fitted to all the instances in the other four groups. Then, all fitted methods were asked to classify the instances in the held-out group, from which the confusion matrix and the four performance criteria were computed. Repeating the process for each of the five groups and averaging the obtained values, the results shown in Table 2 and represented for ease of interpretation in Figure 1 are obtained. These results are discussed in Section 4.
When the classes are unbalanced, some classes may not appear in some training/test sets, which leads to undefined values of the performance criteria (because of the fractions 0/0).
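The 5-fold scheme described above can be sketched as follows. This is a generic illustration rather than the code used for the experiments: the splitting rule, the seed, and the callables fit and evaluate (which stand for fitting any of the compared classifiers on the training indices and scoring it on the held-out indices) are assumptions.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Randomly split the indices 0..n-1 into k groups of nearly equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n, fit, evaluate, k=5, seed=0):
    """Average a performance criterion over the k held-out groups."""
    folds = k_fold_indices(n, k, seed)
    scores = []
    for held_out in folds:
        # Train on every group except the one that is held out.
        train = [i for fold in folds if fold is not held_out for i in fold]
        model = fit(train)
        scores.append(evaluate(model, held_out))
    return sum(scores) / k


folds = k_fold_indices(683)          # e.g., the 683 records of the WBC dataset
print([len(f) for f in folds])
```

Since the split above is purely random, a fold of a small or unbalanced dataset may contain no instance of one class, which is the situation that produces the undefined 0/0 criteria.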
The classifier has been implemented in an R package, and it can be freely downloaded from [35] and used as follows under Linux:
• Place the Clasif.zip file in a local folder (that we denote as path_to_file).
• Run install.packages(path_to_file, repos = NULL, type="source") in the R Console. The package will be installed from source.
• Run library(Classif) in the R Console.
• Run Classif() in the R Console. It provides a comfortable windows interface where the user can pick the dataset file and select the number of votes without writing code. Its output is the comparison among the classifiers that we show in Table 2. The object rules (that the user can print just by typing rules in the R Console) contains the classifying rules and the accuracy for each possible number of votes.
• Internally, the function SIAclassif() is the one that performs all the steps of our algorithm, according to the implifidence threshold and the maximum number of votes. It computes the implifidence of all the rules, it selects the significant variables, and it computes all the combinations of groups of significant variables and the resulting classification for each one, taking the one which maximizes the accuracy (by default, or any other criterion specified by the user). It prints a sentence explaining the rule for classification. As an example, the result for the WBC dataset is shown in Appendix A.
• The function predict.SIA() requires, as a first argument, the object returned by the function SIAclassif(), and as a second argument, a data frame with the new instances, in order to apply the classification and produce the predicted output.

Conclusions
We have proposed a machine learning algorithm for classification based on association rules, using a rather novel quality measure, the implifidence (see Section 2.1), and a majority vote among a set of significant rules.
Our approach has been tested using six well known open access datasets, and its results have been compared with those of four other well established algorithms for classification (Naive Bayes, radial basis neural networks, decision trees J48 and simple CART), using 5-fold cross-validation in order to guard against overfitting.
As can be seen in Figure 1, our proposal gets a top or second position among the five classifiers in four of the six datasets, with regard to accuracy, precision and sensitivity. Our model is ranked top or second in only two of the datasets regarding specificity.
The main benefit of our approach is the simplicity of interpretation of the classification rule (see Appendix A for an example), since the classification is directly related to the observed variables.
One can consider that, in the trade-off between accuracy and interpretability of the model, our proposal outperforms all the other tested methods. Even if J48 and CART are also interpretable, their trees are usually more complex than the rules we get with our method.
In this study, the default parameters for all the classifiers have been kept. The effect of choosing different parameters in every model will be analysed in detail in forthcoming research.
Author Contributions: All authors have contributed equally to all the steps of this piece of research, from conceptualization to review and editing. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are openly available in UCI Machine Learning Repository at http://archive.ics.uci.edu/ml.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A
As an example for practitioners, we show below the output of our classifier, the function SIAclassif(), stored in an object called rules, for the WBC dataset, using up to 7 significant variables. Output classes are "malignant" and "benign", while input features are V1, V2, ..., V8 and V9.

• If the user wants to use only one variable, the best classification is done by checking V2, which predicts the class "malignant" for an individual whenever it takes one of the values 10, 3, 4, 5, 6, 7, 8 or 9. The accuracy of this classifier is 0.926793557833089.
• If the user accepts to use up to three variables, the best classification reaches an accuracy of 0.961932650073206 by checking the variables V8, V3 and V6. The rule forecasts the class "malignant" for a record if at least two of the variables V8, V3 and V6 take one of their listed values (see component [[2]] below).
• In general, the user decides on a particular classifier according to the importance given to the accuracy, as well as to the simplicity of the rule. In any case, the interpretation of the classification of our approach is easy, contrary to other classifiers, because the practitioner only needs to check the values of a few of the original variables in order to decide the output class.
The user can choose to maximize other criteria such as precision, sensitivity, specificity, or any linear combination of all the four measures. If the user types rules in the R Console, the output is the following: