Attribute Selecting in Tree-Augmented Naive Bayes by Cross Validation Risk Minimization

As an important improvement to naive Bayes, Tree-Augmented Naive Bayes (TAN) exhibits excellent classification performance and efficiency since it allows every attribute to depend on at most one other attribute in addition to the class variable. However, its performance may be degraded when some attributes are redundant. In this paper, we propose an attribute Selective Tree-Augmented Naive Bayes (STAN) algorithm, which builds a sequence of approximate models, each involving only the top-ranked attributes, and searches this sequence for the model that minimizes the cross validation risk. Five different approaches to ranking the attributes have been explored. As the models can be evaluated simultaneously in one pass through the data, the search is efficient and avoids local optima in the model space. Extensive experiments on 70 UCI data sets demonstrate that STAN achieves superior performance while maintaining efficiency and simplicity.


Introduction
Naive Bayes (NB) [1,2] has attracted considerable attention due to its computational efficiency and competitive classification performance. Its efficiency originates in the independence assumption among the attributes given the class. Figure 1a describes an example of an NB structure with 4 attributes, where X_1, ..., X_4 are the attribute variables and Y is the class variable. However, the independence assumption rarely holds in real-world applications. Although it has been demonstrated that some violations of the independence assumption are not harmful to classification accuracy [3], it is clear that many are. Many efforts have been made to allow specific dependencies between attributes while retaining the naive Bayesian classifier's desirable simplicity and efficiency [4][5][6]. Tree-Augmented Naive Bayes (TAN) [4] is among the algorithms that best improve the accuracy of naive Bayesian classifiers by alleviating the attribute independence assumption. TAN allows every attribute to depend on at most one attribute other than the class. The dependencies among the attributes can therefore be described by a tree structure, which is found using a scoring measure called conditional mutual information. Figure 1b describes an example of a TAN structure with 4 attributes, where X_2 depends only on X_1, while X_3 and X_4 depend on X_2. Since TAN exploits first-order dependencies among the attributes, its classification performance can be greatly improved over NB at the cost of a search for the tree structure.
However, TAN demands that all attributes be connected to the class, so it exploits all the attributes regardless of whether they are redundant. As a result, there has been a growing body of work improving TAN since it was proposed [7]. Jiang et al. [8] presented a Forest-Augmented Naive Bayes (FAN) for better ranking performance. Alhussan et al. [9] proposed a fine-tuning stage in a Bayesian Network (BN) learning algorithm to more accurately estimate the probability terms used by the BN; they apply the algorithm to fine-tune TAN and other models. Wang et al. [10] presented a semi-lazy TAN classifier, which builds a TAN identical to the original TAN at training time but adjusts the dependence relations for each new test instance at classification time. Campos et al. [11] proposed an extended version of the well-known tree-augmented naive Bayes. Their structure learning procedure explores a superset of the structures considered by TAN, yet achieves global optimality of the learning score function in a very efficient way. The procedure is enhanced with a new score function that only takes into account arcs that are relevant to predicting the class, as well as an optimization over the equivalent sample size during learning. Jiang et al. [12] investigated the class probability estimation performance of TAN in terms of Conditional Log Likelihood (CLL); an improved algorithm was presented by choosing each attribute as the root of the tree and then averaging all the spanning TAN classifiers. Cerquides and Mántaras [13] introduced decomposable distributions over TANs and proposed four classifiers based on them; these classifiers provide clearly significant improvements, especially when data is scarce. This body of work focuses mainly on TAN structure learning; few attempts have been made to eliminate redundant attributes during training.
Attribute selection in Bayesian network classifiers has also been investigated. Zhang et al. [14] proposed a discriminative model selection approach that chooses different single models for different test instances and retains the interpretability of single models. Langley and Sage [15] proposed Selective Bayes Classifiers (SBC) using a hill-climbing search for the optimal attribute subset. The same strategy has also been applied to the Averaged One-Dependence Estimator (AODE) to find the optimal parent and child attribute sets [16]. As this carries out a greedy search through the feature space, it often falls into a local optimum. Furthermore, the evaluation of the successively added attributes is time-consuming. Recently, an attribute selection approach based on efficient attribute subset construction and evaluation has been investigated in NB [17], the k-dependence Bayesian classifier (KDB) [18], and averaged n-dependence estimators (AnDE) [19,20]. However, the performance of TAN [4] with this attribute selection has not been explored.
This paper proposes an attribute Selective TAN (STAN) algorithm based on cross validation risk minimization. The attribute subsets are first constructed very efficiently, as each subset is obtained by adding one attribute to the previous one. Classification models based on these different attribute subsets can then be searched by cross validation risk minimization in one pass through the training data. Five different approaches to ranking the attributes have been explored. Unlike traditional attribute selection based on hill climbing, this strategy is efficient and able to avoid local optima in the model space. Extensive experiments on 70 UCI data sets demonstrate that STAN, combined with an attribute ranking approach called Minimum Redundancy-Maximum Relevance (MRMR), achieves superior performance while maintaining efficiency and simplicity. It provides statistically significantly better predictions than regular TAN. The win/draw/loss record in terms of zero-one loss is 34/22/14, which means STAN with MRMR obtains lower zero-one loss than regular TAN on 34 data sets, the same zero-one loss on 22 data sets, and greater zero-one loss on 14 data sets.

Bayesian Network Classifiers
The classification problem can be described as a procedure that, given a data set D and an unclassified observation x, assigns a class to x. Suppose we have N observations in D. Each observation is a pair (x, y), consisting of an a-dimensional attribute vector x = [x_1, ..., x_a]^T and a target class y, drawn from the underlying random variables X = {X_1, ..., X_a} and Y.
A Bayesian network classifier addresses this classification task by first modelling the joint distribution P(y, x) with a Bayesian network B, and then calculating the posterior distribution P(y|x) by Bayes' rule. A Bayesian network is characterised by a pair B = <G, Θ>. The first component, G, is a directed acyclic graph. The nodes in G represent random variables, including the attributes X_1, ..., X_a and the class variable Y. The arcs in G represent directed dependencies between the nodes. If X_j points directly to X_i via a directed edge (an arc), we say X_j is the parent of X_i, or X_i is the child of X_j. Different Bayesian network classifiers assume different dependencies among the attributes, but all assume Y is the parent of every attribute and has no parents itself.
The second component of the pair, namely Θ, represents the set of parameters that quantifies the network. It contains a parameter θ_{x_i|y,π_i} for each value x_i of node X_i, each class y of Y, and each value π_i of Π_i, where Π_i is the set of parent nodes of X_i in network G. θ_{x_i|y,π_i} = P_B(x_i|y, π_i), abbreviated from P_B(X_i = x_i | Y = y, Π_i = π_i), represents the probability that variable X_i takes the value x_i given that Y takes the class y and Π_i takes the value π_i. It is obvious that ∑_{x_i} θ_{x_i|y,π_i} = 1. When the data set D is given, the log-likelihood of the data under a specific network structure is maximized when θ_{x_i|y,π_i} corresponds to the empirical estimate of the probability from the data, that is, θ_{x_i|y,π_i} = P_D(X_i = x_i | Y = y, Π_i = π_i) [21]. This yields the Maximum-Likelihood Estimation (MLE) of the parameters Θ.
A Bayesian network defines a unique joint probability distribution given by

P_B(y, x) = P_B(y) ∏_{i=1}^{a} P_B(x_i | y, π_i).

By Bayes' rule, the posterior distribution of a new unclassified example x can be calculated as

P_B(y | x) = P_B(y, x) / ∑_{y'} P_B(y', x).

So we can easily classify x into the class arg max_y P_B(y | x).
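For concreteness, the classification rule above can be sketched in Python. The table layout, function name, and probabilities below are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of Bayesian network classification, assuming the
# structure and conditional probability tables are already learned.

def predict_posterior(x, prior, cpts, parents):
    """Compute P(y | x) for every class y.

    prior:   dict y -> P(y)
    cpts:    dict i -> dict (x_i, y, pa_val) -> P(x_i | y, pa_val)
    parents: dict i -> index of the single parent attribute of X_i
             (None if X_i has no attribute parent)
    """
    scores = {}
    for y, py in prior.items():
        p = py
        for i, xi in enumerate(x):
            pa_val = None if parents[i] is None else x[parents[i]]
            p *= cpts[i][(xi, y, pa_val)]
        scores[y] = p                      # unnormalized P(y, x)
    z = sum(scores.values())               # normalizing constant sum_y P(y, x)
    return {y: p / z for y, p in scores.items()}
```

The returned dictionary sums to one, and arg max over it gives the predicted class.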

Tree Augmented Naive Bayes
Although naive Bayes [1] performs surprisingly well on many data sets, its independence assumption among attributes rarely holds in the real world. In order to relax this assumption, Friedman et al. [4] proposed to augment the naive Bayes structure with edges among the attributes, where needed, thus dispensing with its strong independence assumptions. In order to learn the optimal set of augmenting edges in polynomial time, a tree restriction is imposed on the form of the allowed interactions. The resulting structure is called a Tree-Augmented Naive Bayesian (TAN) network, in which the class variable has no parents and each attribute has as parents at most one other attribute in addition to the class variable. Thus, each attribute can have one augmenting edge pointing to it, and all the augmenting edges form a tree structure.
In order to learn a TAN structure that maximizes the log likelihood, they proposed to use the conditional mutual information between attributes given the class variable as the weight of an edge in the graph. The conditional mutual information between attributes X and Z given the class variable Y is defined as

I(X; Z | Y) = ∑_{x,z,y} P(x, z, y) log [ P(x, z | y) / ( P(x | y) P(z | y) ) ].

Roughly speaking, this function measures the information that Z provides about X when the value of Y is known.
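This quantity can be estimated directly from empirical frequencies. The following helper is a small illustrative sketch, not the paper's code:

```python
from collections import Counter
from math import log

def cond_mutual_info(xs, zs, ys):
    """Empirical I(X; Z | Y) from three parallel lists of observed values."""
    n = len(ys)
    pxzy = Counter(zip(xs, zs, ys))   # joint counts of (x, z, y)
    pxy = Counter(zip(xs, ys))
    pzy = Counter(zip(zs, ys))
    py = Counter(ys)
    mi = 0.0
    for (x, z, y), c in pxzy.items():
        # P(x,z,y) * log [ P(x,z|y) / (P(x|y) P(z|y)) ]
        mi += (c / n) * log((c / py[y]) /
                            ((pxy[(x, y)] / py[y]) * (pzy[(z, y)] / py[y])))
    return mi
```

When X and Z are identical given Y the score equals the conditional entropy of X; when they are conditionally independent it is zero.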
The procedure to construct the TAN structure consists of five main steps:

1. Compute I(X_i; X_j | Y) between each pair of attributes, i ≠ j, from the training data.

2. Build a complete undirected graph in which the nodes are the attributes X_1, ..., X_a. Annotate the weight of the edge connecting X_i to X_j with I(X_i; X_j | Y).

3. Build a maximum weighted spanning tree.

4. Transform the resulting undirected tree into a directed one by choosing a root variable and setting the direction of all edges to be outward from it, thus determining the parent node Π_i of each node X_i.

5. Construct a TAN model by adding a node labelled Y and adding an arc from Y to each X_i.
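Steps 2 through 4 can be sketched with a Prim-style greedy search that repeatedly adds the heaviest edge leaving the current tree; the function name and weight matrix below are illustrative assumptions:

```python
def tan_parents(weights, root=0):
    """Given weights[i][j] = I(X_i; X_j | Y), return the parent index of
    each attribute in the directed maximum spanning tree (None for root)."""
    a = len(weights)
    in_tree = {root}
    parent = {root: None}
    while len(in_tree) < a:
        best = None
        # Scan all edges leaving the current tree; keep the heaviest.
        for i in in_tree:
            for j in range(a):
                if j not in in_tree and (best is None or weights[i][j] > best[2]):
                    best = (i, j, weights[i][j])
        i, j, _ = best
        parent[j] = i          # directing the edge outward from the root
        in_tree.add(j)
    return parent
```

The class node Y (step 5) is implicit: it is simply an additional parent of every attribute.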
Note that in TAN, the vector Π_i degenerates to a scalar, as TAN allows only one parent for each attribute. So the TAN model defines a unique joint probability distribution given by

P_TAN(x, y) = P_TAN(y) ∏_{i=1}^{a} P_TAN(x_i | y, π_i).

The structure and the parameters of the TAN model can be learned in one pass through the data.

Motivation
It can be seen from Equation (4) that the joint probability P_TAN(x, y) is estimated as the product of the prior probability P_TAN(y) and the conditional probabilities P_TAN(x_i | y, π_i). Considering only the top-ranked attributes produces an approximation to P_TAN(x, y). This implies that it is possible to build a sequence of alternative selective models such that each one is a trivial extension of the previous one. Different models that build upon one another in this way can be evaluated efficiently in a single set of computations. So in just one pass through the data, the cross validation risks of the different models can be obtained. Risk minimization then yields the best model, and with it the optimal attribute subset for classification in the framework of TAN.

Building Model Sequence
When searching for the best attribute subset, the size of the search space for a attributes is 2^a. Instead of searching the whole space exhaustively, it is natural to impose some restrictions on the construction of the model space. As TAN computes the joint probability P_TAN(x, y) by sequentially multiplying the conditional probabilities P_TAN(x_i | y, π_i) into the prior P_TAN(y), considering only the top s attributes yields an approximate model

P_s(x, y) = P_TAN(y) ∏_{i=1}^{s} P_TAN(x_i | y, π_i), where 1 ≤ s ≤ a.

By this means we can construct a model sequence of size a. These models can be evaluated efficiently in a single set of computations, as each model is only a trivial extension of the previous one. Although each model is only an approximation to the TAN model, regular TAN itself (s = a) is also included in this model sequence. Consequently, this attribute selective model can be expected to be no worse than regular TAN.
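The key point is that the scores of all a nested models for one class fall out of a single cumulative product. A minimal sketch (names and numbers are illustrative):

```python
def all_model_scores(prior_y, cond_probs):
    """prior_y: P(y); cond_probs: list of P(x_i | y, pi_i) for the ranked
    attributes.  Returns [score of model 1, ..., score of model a] for
    one class y: extending model s to model s+1 costs one multiplication."""
    scores = []
    p = prior_y
    for cp in cond_probs:
        p *= cp
        scores.append(p)
    return scores
```

Evaluating the full model therefore evaluates every truncated model for free.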

Ranking the Attributes
Since the selective strategy considers only the top s attributes, it relies on a ranking of the attributes. As the purpose of attribute selection is to eliminate redundant attributes, we should prioritize the more informative ones. We could rank the attributes based on their marginal relevance with respect to the class. Fortunately, attribute ranking has been extensively investigated in the feature selection literature [22]. Here we adopt the five most well-known strategies for measuring the relevance between an attribute and the class.

1. Mutual Information (MI) (mutual information measures the amount of information shared by two variables; this can also be interpreted as the amount of information that one variable provides about another) is an intuitive score since it is a measure of correlation between an attribute and the class variable. Before presenting the definition of mutual information, we first present the concepts of entropy and conditional entropy. The entropy of a random variable X is defined as

H(X) = -∑_x P(x) log P(x).

The conditional entropy H(X|Y) of X given Y is

H(X|Y) = -∑_{x,y} P(x, y) log P(x | y).

The mutual information between X and Y is defined as the difference between the entropy H(X) and the conditional entropy H(X|Y):

I(X; Y) = H(X) - H(X|Y).

This heuristic scores each attribute independently of the others.

2. Symmetrical Uncertainty (SU) (symmetrical uncertainty is the normalized mutual information; its range is [0, 1], where the value 1 indicates that knowledge of either variable completely predicts the other and the value 0 indicates that the two variables are independent) [23] can be interpreted as mutual information normalized to the interval [0, 1]:

SU(X, Y) = 2 I(X; Y) / ( H(X) + H(Y) ).

Mutual information is biased in favor of attributes with more values and hence larger entropy. Symmetrical uncertainty, being normalized to the range [0, 1], compensates for this bias and makes the scores comparable. As a result, we can expect a more appropriate ranking of the attributes based on symmetrical uncertainty.

3. Minimum Redundancy-Maximum Relevance (MRMR) (MRMR always tries to select the attribute with the best trade-off between relevance to the class variable and averaged redundancy with the attributes already selected), which was proposed by Peng et al. [24], not only considers mutual information to ensure feature relevance, but also introduces a penalty to enforce low correlation with the features already selected. MRMR is very similar to Mutual Information Feature Selection (MIFS) [25], except that the latter replaces 1/k with a more general configurable parameter β, where k is the number of attributes selected so far, which is also the number of steps taken. Assume that at step k the selected attribute set is A^{S_k}, while A^{-S_k} = A \ A^{S_k} is the set difference between the original set of attributes A and A^{S_k}. The attribute returned by the MRMR criterion at step k + 1 is

X_{k+1} = arg max_{X ∈ A^{-S_k}} [ I(X; Y) - (1/k) ∑_{X' ∈ A^{S_k}} I(X; X') ].

At each step, this strategy selects the attribute with the best trade-off between the relevance I(X; Y) of X to the class Y and the averaged redundancy of X with the selected attributes X' ∈ A^{S_k}.

4. Conditional Mutual Information Maximization (CMIM) (CMIM tries to select the attribute whose minimal conditional mutual information with the class, given the attributes already selected, is maximal) proposes to select the feature whose minimal relevance conditioned on the selected attributes is maximal. This heuristic was proposed by Fleuret [26] and later also by Chen et al. [27] as direct rank (dRank). CMIM computes the mutual information of X and the class variable Y, conditioned on each attribute X' ∈ A^{S_k} previously selected. The minimal value is retained, and the attribute with the maximal minimal conditional relevance is selected. In formal notation, the variable returned at step k + 1 by the CMIM strategy is

X_{k+1} = arg max_{X ∈ A^{-S_k}} min_{X' ∈ A^{S_k}} I(X; Y | X').

5. Joint Mutual Information (JMI) (JMI tries to select the attribute that is most complementary with the existing attributes), which was proposed by Yang and Moody [28] and later also by Meyer et al. [29], selects a candidate attribute if it is complementary with the existing attributes. As a result, JMI focuses on increasing the complementary information between attributes. The variable returned by the JMI criterion at step k + 1 is

X_{k+1} = arg max_{X ∈ A^{-S_k}} ∑_{X' ∈ A^{S_k}} I(X, X'; Y).

The score in JMI is the information between the class variable and a joint random variable (X, X'), defined by pairing the candidate X with each attribute X' previously selected.
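The mutual information score underlying these criteria can be estimated from data as follows; this is a small illustrative helper, not the paper's code:

```python
from collections import Counter
from math import log

def entropy(values):
    """Empirical entropy H(X) of a list of observed values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())

def mutual_info(xs, ys):
    """Empirical I(X; Y) = H(X) - H(X|Y)."""
    n = len(ys)
    pxy = Counter(zip(xs, ys))
    py = Counter(ys)
    # H(X|Y) = -sum_{x,y} P(x,y) log P(x|y)
    h_x_given_y = -sum((c / n) * log(c / py[y]) for (x, y), c in pxy.items())
    return entropy(xs) - h_x_given_y
```

A perfectly predictive attribute scores H(X); an independent one scores zero.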
Note that for the first two scores, we can simply rank the attributes in descending order of their MI or SU scores. For the last three methods, a forward selection search strategy is involved: attributes are selected sequentially, iteratively constructing the attribute subset. Suppose at step k, the set of attributes selected so far is A^{S_k}, and A^{-S_k} = A \ A^{S_k} is the set difference between the original set of attributes A and A^{S_k}. At step k + 1 of the forward selection search, these methods select the attribute X_{k+1} that maximizes the given score in Equation (10), Equation (11) or Equation (12). The attribute sets are then updated as

A^{S_{k+1}} = A^{S_k} ∪ {X_{k+1}},  A^{-S_{k+1}} = A^{-S_k} \ {X_{k+1}}.

Initially, the set A^S is empty, so all three methods select the attribute arg max_{X ∈ A} I(X; Y), i.e., the one with maximal mutual information with respect to the class variable. The procedure terminates when the set A^{-S} becomes empty.
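The forward selection loop can be sketched for the MRMR criterion, assuming a precomputed relevance vector I(X_i; Y) and pairwise redundancy matrix I(X_i; X_j) (all names here are illustrative):

```python
def mrmr_ranking(relevance, redundancy):
    """relevance[i] = I(X_i; Y); redundancy[i][j] = I(X_i; X_j).
    Returns the full MRMR attribute ordering as a list of indices."""
    a = len(relevance)
    selected, remaining = [], set(range(a))
    # First pick: maximal mutual information with the class.
    first = max(remaining, key=lambda i: relevance[i])
    selected.append(first)
    remaining.remove(first)
    while remaining:
        k = len(selected)
        def score(i):
            # Relevance minus averaged redundancy with attributes so far.
            return relevance[i] - sum(redundancy[i][j] for j in selected) / k
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

CMIM and JMI follow the same loop with their own `score` functions.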

Cross Validation Risk Minimization
Once the model space has been built, we next need to select the best model in this space. A natural idea is to apply the models to the training examples and select the one with the best accuracy. However, this causes overfitting, as the models would be trained and tested on the same examples: a low error rate on the training data does not imply a low error rate on the test data. A more practical way is to use only part of the training examples to construct the models and leave the rest for testing and model selection. This is the idea of cross validation. In order to obtain the risks of the different models in one pass through the training data, incremental leave-one-out cross validation [30] is adopted.
In the process of leave-one-out cross validation, the training set D is divided into a validation set (containing only one instance) and an effective training set (containing |D| − 1 instances). Each instance in D acts as the validation instance in turn: its contribution to the models is removed, and it serves only for validation. This realizes cross validation on the training set. During learning, no use is made of any instance in the test set.
As the Root Mean Squared Error (RMSE) is a finer-grained measure of the calibration of the probability estimates than zero-one loss, RMSE is adopted to measure the cross validation risk. Since we search for the model that minimizes the empirical risk, we call the score the Cross Validation Risk (CVR):

CVR = sqrt( (1/N) ∑_{(x,y) ∈ D} (1 − P(y | x))^2 ).

Based on the above methodology, we develop the training algorithm of attribute selective TAN, described in Algorithm 1. It involves two passes through the training set D. The first pass collects the information needed to form the table of joint frequencies of all combinations of two attribute values and the class label. The second pass evaluates all the models by leave-one-out cross validation.

Algorithm 1: Training of STAN
1: Form the table of joint frequencies in one pass through D
2: Build the TAN structure and rank the attributes
3: for each instance inst in D do
4:   Remove inst from the frequency table
5:   Predict inst by all models in Equation (5)
6:   Accumulate the squared error for each model
7:   Add inst back to the frequency table
8: end for
9: Compute the CVR score for each model as in Equation (13)
10: Select the model with the lowest CVR

This strategy searches the model space in one additional pass through the training data, so it is efficient. Furthermore, local optima are avoided. This differs from attribute selection methods that rely on hill-climbing search [16], where multiple passes through the training data may be required and only a local optimum can be found. In our strategy, if the search space is expanded, better models can be obtained.
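The remove/predict/add-back loop of Algorithm 1 can be sketched as follows; the prediction function and data layout are placeholders, not the paper's implementation:

```python
from collections import Counter

def loo_squared_errors(data, predict):
    """Incremental leave-one-out CV over a shared frequency table.

    data:    list of (x, y) pairs with hashable x.
    predict: function (counts, x) -> dict y -> P(y | x).
    Returns the accumulated squared error on the true class."""
    counts = Counter()
    for x, y in data:
        counts[(x, y)] += 1                # one pass: build the table
    sq_err = 0.0
    for x, y in data:
        counts[(x, y)] -= 1                # remove inst from the table
        probs = predict(counts, x)
        sq_err += (1.0 - probs.get(y, 0.0)) ** 2
        counts[(x, y)] += 1                # add inst back
    return sq_err
```

In the real algorithm, one such accumulator is kept per model in the sequence, so all candidate models are validated in the same pass.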
From the training process in Algorithm 1, we can see that the space complexity of the table of joint frequencies of all combinations of two attribute values and the class label is O(c(av)^2), where v is the average number of values per attribute and c is the number of classes. Attribute selection requires no additional memory. The time complexity consists of three parts. The first is the derivation of the frequencies required to populate the table, which takes O(ta^2) time, where t is the number of training instances. The second part is the attribute ranking, which takes O(a^2) time and is negligible compared to the first part. The last part is the attribute selection in a second pass through the training data, which takes O(tca) time, since for each example we need to compute the joint probability in Equation (4). So the overall time complexity is O(ta^2 + tca). The time complexity of classifying a single example is O(ca) in the worst case, because some attributes may be omitted after attribute selection.

1. Missing values are treated as a distinct value rather than being replaced with the mode (for nominal attributes) or the mean (for numeric attributes) as in the Weka software.

2. Root mean squared error is calculated exclusively on the true class label. This differs from Weka's implementation, where all class labels are considered.
The base probabilities are estimated using m-estimation (m = 1) [33]. Five-bin equal frequency discretization is performed to discretize the numeric attributes, as in [34]. All algorithms have been run on the data sets in 10-fold cross validation mode.
In the experiments, we compare STAN with regular TAN. According to the different attribute ranking strategies described in Section 3.3, we develop five versions of STAN, namely STAN_MI, STAN_SU, STAN_JMI, STAN_CMIM and STAN_MRMR. It is worth noting that in the implementation of STAN_MRMR, we use the following criterion instead of Equation (10), as suggested by the authors [24]:

X_{k+1} = arg max_{X ∈ A^{-S_k}} [ I(X; Y) / ( (1/k) ∑_{X' ∈ A^{S_k}} I(X; X') + 0.01 ) ],

where 0.01 is added to avoid division by zero. We also compare the best version of STAN with state-of-the-art one-dependence BNCs, namely AODE [5] and KDB_1 (KDB with k = 1).

Win/Draw/Loss Analysis
In this subsection, we demonstrate the classification performance of the proposed algorithms. Two commonly used performance measures are reported, namely Zero-One Loss (ZOL) and Root Mean Squared Error (RMSE). ZOL is the proportion of instances that are misclassified. RMSE is the square root of the mean squared error of the probability estimates, where the error on a test example is the difference between 1.0 and the probability the algorithm assigns to that example's true class.
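The true-class-only RMSE described above reduces to a one-liner; this illustrative helper mirrors the definition:

```python
from math import sqrt

def rmse_true_class(true_probs):
    """RMSE over the probabilities assigned to the true class: the per
    instance error is 1 minus the probability of the true class."""
    return sqrt(sum((1.0 - p) ** 2 for p in true_probs) / len(true_probs))
```

A perfectly calibrated, perfectly confident classifier scores 0.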
Tables A1 and A2 in the Appendix A provide the detailed ZOL and RMSE results of eight algorithms on 70 data sets. In order to present a brief summary of the comparison of different algorithms, statistical win/draw/loss records in terms of the above performance measures are reported in Tables 2 and 3.
The win/draw/loss record indicates the frequency of one algorithm wins, draws with or loses to another algorithm with respect to the specified measure on 70 data sets. For example, win/draw/loss of STAN MI against TAN with respect to ZOL is 26/22/22, which means STAN MI obtains lower ZOL than TAN on 26 data sets, the same ZOL as TAN on 22 data sets and greater ZOL than TAN on 22 data sets.
To decide whether two competing algorithms have equal chances of winning, a standard binomial sign test [35] is applied to these records. Given the null hypothesis that wins and losses are equiprobable, the binomial test indicates the probability of observing the specified numbers of wins and losses. In our analysis, the draws are split equally between the wins and losses; if the number of draws is odd, one draw is ignored. We reject the hypothesis and consider the difference between the two algorithms significant if the p value is less than the critical value 0.05; such values are shown in bold font. The p values reported are the outcomes of one-tailed tests. For example, the p value of STAN_MI against TAN is 0.3601, which means the probability of observing at least 37 (26 + 22/2) wins in 70 comparisons is 0.3601 under the binomial distribution. Since 0.3601 is greater than 0.05, we conclude that the difference between STAN_MI and TAN is not significant, although STAN_MI obtains lower ZOL than TAN more often than the reverse.
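The test just described can be computed exactly from the binomial distribution; the helper below is an illustrative sketch of that calculation:

```python
from math import comb

def sign_test_p(wins, draws, losses):
    """One-tailed binomial sign test on a win/draw/loss record, with
    draws split evenly (an odd leftover draw is ignored)."""
    w = wins + draws // 2
    l = losses + draws // 2
    n = w + l
    # P(at least w successes out of n fair coin flips)
    return sum(comb(n, k) for k in range(w, n + 1)) / 2 ** n
```

For the 26/22/22 record in the text this yields a p value of about 0.36, matching the reported 0.3601.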
We first compare the different versions of STAN with TAN. From Table 2, we find that, relative to TAN, STAN_MI achieves lower error almost as often as higher. STAN_SU, STAN_JMI and STAN_CMIM deliver lower error more often than TAN, though only significantly so for STAN_JMI and STAN_CMIM on RMSE. STAN_MRMR reduces both zero-one loss and RMSE significantly often relative to TAN. We conclude that, with the MRMR ranking strategy, STAN_MRMR achieves the best performance among the five variants.
We also present a scatter plot of STAN_MRMR against TAN in terms of ZOL in Figure 2. The points below the diagonal represent the data sets on which STAN_MRMR achieves lower ZOL than TAN. It can be seen that STAN_MRMR provides statistically significantly better predictions than regular TAN. Next, we compare STAN_MRMR with state-of-the-art one-dependence Bayesian classifiers, AODE [5] and KDB [6]. We use the version of KDB with k = 1, denoted KDB_1. The win/draw/loss results are summarized in Table 3. STAN_MRMR achieves lower error almost as often as higher relative to AODE, while it obtains lower zero-one loss and RMSE more often than KDB_1 than the reverse.

Analysis of Training and Classification Time
In this subsection, we compare the averaged training and classification times of the proposed algorithms. The averaged training and classification times of all 8 algorithms are plotted in Figure 3.
It can be seen that selective TAN requires more training time than regular TAN. This is because selective TAN involves one additional pass through the training data. The differences among the various versions of STAN are not significant. The five STAN algorithms require more training time than AODE and KDB_1. As far as classification time is concerned, the five STAN algorithms achieve the same results as regular TAN. As classification in STAN uses fewer attributes than in regular TAN, the classification times were expected to be lower; however, the plot does not indicate this trend. A closer examination of the classification times on each data set shows that the measured times on most data sets are 0 due to the limited number of instances. AODE requires more classification time since it classifies each test instance with multiple one-dependence estimators.

Discussion
In this paper, we propose an attribute Selective Tree-Augmented Naive Bayes (STAN) algorithm, which builds a sequence of approximate models by adding one attribute at a time to the previous model and searches the model space to minimize the cross validation risk. Extensive experiments on 70 UCI data sets demonstrate that STAN achieves superior performance while maintaining efficiency and simplicity. The conclusions are summarized as follows:
• STAN algorithms with different ranking strategies achieve superior classification performance to regular TAN at the cost of a modest increase in training time.
• The MRMR ranking strategy achieves the best classification performance among the ranking strategies, and its advantage over regular TAN is significant.
• STAN with the MRMR ranking strategy is comparable with AODE and superior to KDB_1 in terms of accuracy, while requiring less classification time than AODE.
As cross validation risk minimization provides an efficient search of the model space, expanding the space would be expected to produce better models. In the future, it is therefore worthwhile to expand the model space by varying the dependence level so as to find more practical models.

Acknowledgments: This research has been supported in part by the NAU Educational Technology Center through the use of the NAU HPC Cluster.

Conflicts of Interest:
The authors declare no conflict of interest.