Article

Attribute Selecting in Tree-Augmented Naive Bayes by Cross Validation Risk Minimization

1 Department of E-Commerce, Nanjing Audit University, Nanjing 211815, China
2 School of Finance, Nanjing Audit University, Nanjing 211815, China
* Author to whom correspondence should be addressed.
Mathematics 2021, 9(20), 2564; https://doi.org/10.3390/math9202564
Submission received: 5 September 2021 / Revised: 30 September 2021 / Accepted: 9 October 2021 / Published: 13 October 2021
(This article belongs to the Special Issue Machine Learning and Data Mining: Techniques and Tasks)

Abstract

As an important improvement to naive Bayes, Tree-Augmented Naive Bayes (TAN) exhibits excellent classification performance and efficiency, since it allows every attribute to depend on at most one other attribute in addition to the class variable. However, its performance may be degraded when some attributes are redundant. In this paper, we propose an attribute Selective Tree-Augmented Naive Bayes (STAN) algorithm, which builds a sequence of approximate models, each involving only the top-ranked attributes, and searches this sequence for the model that minimizes the cross validation risk. Five different approaches to ranking the attributes have been explored. As the models can be evaluated simultaneously in one learning pass through the data, the method is efficient and avoids local optima in the model space. Extensive experiments on 70 UCI data sets demonstrate that STAN achieves superior performance while maintaining efficiency and simplicity.

1. Introduction

Naive Bayes (NB) [1,2] has attracted considerable attention due to its computational efficiency and competitive classification performance. Its efficiency originates in the independence assumption among the attributes given the class. Figure 1a shows an example of an NB structure with 4 attributes, where X_1, ..., X_4 are the attribute variables and Y is the class variable. However, the independence assumption rarely holds in real-world applications. Although it has been demonstrated that some violations of the independence assumption are not harmful to classification accuracy [3], it is clear that many are. Many efforts have been made to allow specific dependencies between attributes while retaining the naive Bayesian classifier's desirable simplicity and efficiency [4,5,6].
The Tree-Augmented Naive Bayes (TAN) [4] is an algorithm that effectively improves the accuracy of naive Bayesian classifiers by alleviating their attribute independence assumption. TAN allows every attribute to depend on at most one attribute other than the class, so the dependencies among the attributes can be described by a tree structure, which is found using a scoring measure called conditional mutual information. Figure 1b shows an example of a TAN structure with 4 attributes, where X_2 depends only on X_1, while X_3 and X_4 depend on X_2. Since TAN exploits the first-order dependencies among the attributes, its classification performance can be greatly improved over NB at the cost of a search for the tree structure.
However, TAN requires all attributes to be connected to the class, so it exploits all attributes regardless of whether they are redundant. As a result, a growing body of work has sought to improve TAN since it was proposed [7]. Jiang et al. [8] presented a Forest Augmented Naive Bayes (FAN) for better ranking performance. Alhussan et al. [9] proposed a fine-tuning stage in a Bayesian Network (BN) learning algorithm to more accurately estimate the probability terms used by the BN; they applied the algorithm to fine-tune TAN and other models. Wang et al. [10] presented a semi-lazy TAN classifier, which builds a TAN identical to the original TAN at training time but adjusts the dependence relations for a new test instance at classification time. De Campos et al. [11] proposed an extended version of the well-known tree-augmented naive Bayes; its structure learning procedure explores a superset of the structures considered by TAN, yet achieves global optimality of the learning score function in a very efficient way, and it is enhanced with a new score function that only takes into account arcs that are relevant to predicting the class, as well as an optimization over the equivalent sample size during learning. Jiang et al. [12] investigated the class probability estimation performance of TAN in terms of Conditional Log Likelihood (CLL) and presented an improved algorithm that chooses each attribute as the root of the tree and then averages all the spanning TAN classifiers. Cerquides and Mántaras [13] introduced decomposable distributions over TANs and proposed four classifiers based on them; these classifiers provide clearly significant improvements, especially when data are scarce. This body of work focuses mainly on TAN structure learning; few of these methods try to eliminate redundant attributes during training.
Attribute selection in Bayesian network classifiers has also been investigated. Zhang et al. [14] proposed a discriminative model selection approach that chooses different single models for different test instances while retaining the interpretability of single models. Langley and Sage [15] proposed Selective Bayes Classifiers (SBC) based on a hill-climbing search for the optimal attribute subset. The same strategy has also been applied to the Averaged One-Dependence Estimator (AODE) to find the optimal parent and child attribute sets [16]. As this carries out a greedy search through the feature space, it often falls into a local optimum; furthermore, the evaluation of the successively added attributes is time-consuming. Recently, an attribute selection approach based on efficient attribute subset construction and evaluation has been investigated for NB [17], the k-dependence Bayesian classifier (KDB) [18], and averaged n-dependence estimators (AnDE) [19,20]. However, the performance of TAN [4] with this attribute selection has not been explored.
This paper proposes an attribute Selective TAN (STAN) algorithm based on cross validation risk minimization. The attribute subsets are first constructed very efficiently, as each subset can be obtained from the previous one by adding a single attribute. The classification models based on these different subsets can then be searched by cross validation risk minimization in one learning pass through the training data. Five different approaches to ranking the attributes have been explored. Unlike traditional attribute selection based on hill climbing, the strategy in this paper is efficient and able to avoid local optima in the model space. Extensive experiments on 70 UCI data sets demonstrate that STAN, combined with the attribute ranking approach called Minimum Redundancy-Maximum Relevance (MRMR), achieves superior performance while maintaining efficiency and simplicity. It provides better predictions than regular TAN significantly more often than the reverse: the win/draw/loss result in terms of zero-one loss is 34/22/14, which means STAN with MRMR obtains lower zero-one loss than regular TAN on 34 data sets, the same zero-one loss on 22 data sets, and greater zero-one loss on 14 data sets.

2. Preliminaries

2.1. Bayesian Network Classifiers

The classification problem can be described as a procedure that, given a data set D and an unclassified observation x, assigns a class to x. Suppose we have N observations in D. Each observation is a pair (x, y), consisting of an a-dimensional attribute vector x = [x_1, ..., x_a]^T and a target class y, drawn from the underlying random variables X = {X_1, ..., X_a} and Y.
A Bayesian network classifier addresses this classification task by first modelling the joint distribution P(y, x) with a certain Bayesian network B, and then calculating the posterior distribution P(y | x) by Bayes' rule. A Bayesian network is characterised by a pair B = ⟨G, Θ⟩. The first component, G, is a directed acyclic graph. The nodes in G represent random variables, including the attributes X_1, ..., X_a and the class variable Y. The arcs in G represent directed dependencies between the nodes. If X_j points directly to X_i via a directed edge (an arc), we say X_j is the parent of X_i, or X_i is the child of X_j. Different Bayesian network classifiers assume different dependencies among the attributes, but all assume Y is the parent of all attributes and has no parents itself.
The second component of the pair, namely Θ, represents the set of parameters that quantifies the network. It contains a parameter θ_{x_i|y,π_i} for each value x_i of node X_i, each y of Y and each π_i of Π_i, where Π_i is the set of parent nodes of X_i in network G. θ_{x_i|y,π_i} = P_B(x_i | y, π_i), abbreviated from P_B(X_i = x_i | Y = y, Π_i = π_i), represents the probability that variable X_i takes the value x_i given that Y takes the class y and Π_i takes the value π_i. It is obvious that θ_{x_i|y,π_i} is constrained by ∑_{x_i ∈ X_i} θ_{x_i|y,π_i} = 1.
When the data set D is given, the log-likelihood of the data given a specific network structure is maximized when θ_{x_i|y,π_i} corresponds to the empirical estimate of the probability from the data, that is, θ_{x_i|y,π_i} = P_D(X_i = x_i | Y = y, Π_i = π_i) [21]. This produces the Maximum-Likelihood Estimation (MLE) of the parameters Θ.
A Bayesian network defines a unique joint probability distribution given by
P_B(x_1, \ldots, x_a, y) = P_B(y) \prod_{i=1}^{a} P_B(x_i \mid y, \pi_i) = P_B(y) \prod_{i=1}^{a} \theta_{x_i \mid y, \pi_i}    (1)
By Bayes' rule, the posterior distribution of a new unclassified example x can be calculated as follows,
P_B(y \mid \mathbf{x}) = \frac{P_B(\mathbf{x}, y)}{\sum_{y'} P_B(\mathbf{x}, y')}    (2)
So we can easily classify x into the class \arg\max_y P_B(y \mid \mathbf{x}).
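To make the factorization in Equations (1) and (2) concrete, the following minimal Python sketch (not the paper's code; the CPT layout, variable names and toy numbers are illustrative assumptions) classifies one example by multiplying the relevant conditional probabilities into the class prior and normalizing over the classes.

```python
import numpy as np

def classify(prior, cpts, parents, x):
    """Posterior over classes for one example, Equations (1)-(2).
    prior[y] = P_B(y); cpts[i][y, pa, v] = P_B(X_i = v | y, Pi_i = pa)
    (a node without an attribute parent uses pa = 0); parents[i] is the
    index of X_i's attribute parent or None; x holds discrete values."""
    joint = prior.copy()                          # start from P_B(y)
    for i, cpt in enumerate(cpts):
        pa = 0 if parents[i] is None else x[parents[i]]
        joint = joint * cpt[:, pa, x[i]]          # multiply in P_B(x_i | y, pi_i)
    return joint / joint.sum()                    # normalize: Bayes' rule

# toy example: 2 classes, two binary attributes, X_1 is the parent of X_2
prior = np.array([0.6, 0.4])
cpt_x1 = np.array([[[0.8, 0.2]],                  # P(X_1 | y=0), dummy parent
                   [[0.3, 0.7]]])                 # P(X_1 | y=1)
cpt_x2 = np.array([[[0.9, 0.1], [0.4, 0.6]],      # P(X_2 | y=0, X_1=0/1)
                   [[0.5, 0.5], [0.2, 0.8]]])     # P(X_2 | y=1, X_1=0/1)
posterior = classify(prior, [cpt_x1, cpt_x2], [None, 0], x=[1, 0])
print(posterior.argmax(), posterior)              # predicted class and P_B(y | x)
```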

2.2. Tree Augmented Naive Bayes

Although naive Bayes [1] performs surprisingly well on many data sets, its independence assumption among attributes rarely holds in the real world. In order to relax this independence assumption, Friedman et al. [4] proposed to augment the naive Bayes structure with edges among the attributes, when needed, thus dispensing with its strong assumptions about independence. In order to learn the optimal set of augmenting edges in polynomial time, a tree restriction is imposed on the form of the allowed interactions. The resulting structure is called a Tree-Augmented Naive Bayesian (TAN) network, in which the class variable has no parents and each attribute has as parents at most one other attribute in addition to the class variable. Thus, each attribute can have one augmenting edge pointing to it, and all the augmented edges form a tree structure.
In order to learn a TAN structure such that the log likelihood is maximized, they proposed to use conditional mutual information between attributes given the class variable as the weight of an edge in the graph. The conditional mutual information between attributes X and Z given the class variable Y is defined as
I(X; Z \mid Y) = \sum_{x, z, y} P(x, z, y) \log \frac{P(x, z \mid y)}{P(x \mid y) P(z \mid y)}    (3)
Roughly speaking, this function measures the information that Z provides about X when the value of Y is known.
The procedure to construct the TAN structure consists of five main steps (a code sketch of this procedure is given at the end of this section):
1. Compute I(X_i; X_j | Y) between each pair of attributes, i ≠ j, from the training data.
2. Build a complete undirected graph in which the nodes are the attributes X_1, ..., X_a. Annotate the weight of the edge connecting X_i to X_j by I(X_i; X_j | Y).
3. Build a maximum weighted spanning tree.
4. Transform the resulting undirected tree into a directed one by choosing a root variable and setting the direction of all edges to be outward from it, thus obtaining the parent node Π_i of each node X_i.
5. Construct a TAN model by adding a node labelled by Y and adding an arc from Y to each X_i.
Note that in TAN, the parent set Π_i reduces to a single attribute parent π_i, as TAN allows only one attribute parent for each attribute. So the TAN model defines a unique joint probability distribution given by
P_{\mathrm{TAN}}(\mathbf{x}, y) = P_{\mathrm{TAN}}(y) \prod_{i=1}^{a} P_{\mathrm{TAN}}(x_i \mid y, \pi_i)    (4)
The structure and the parameters of the TAN model can be learned in one learning pass through the data.
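As a concrete illustration of the five-step procedure above, the following Python sketch (an assumption-laden reimplementation, not the authors' code) estimates the conditional mutual information of Equation (3) from empirical counts, builds the maximum weighted spanning tree with Prim's algorithm, and directs the edges away from attribute 0, which is chosen as the root.

```python
import numpy as np
from itertools import combinations

def cond_mutual_info(xi, xj, y):
    """Empirical I(X_i; X_j | Y) of Equation (3) from discrete arrays."""
    cmi = 0.0
    for yv in np.unique(y):
        mask = (y == yv)
        py = mask.mean()
        vi, vj = xi[mask], xj[mask]
        for a in np.unique(vi):
            for b in np.unique(vj):
                p_ab = np.mean((vi == a) & (vj == b))      # P(x, z | y)
                p_a, p_b = np.mean(vi == a), np.mean(vj == b)
                if p_ab > 0:
                    cmi += py * p_ab * np.log(p_ab / (p_a * p_b))
    return cmi

def learn_tan_structure(X, y):
    """Return parents[i]: attribute parent of X_i (None for the root)."""
    a = X.shape[1]
    w = np.zeros((a, a))
    for i, j in combinations(range(a), 2):                 # step 1: pairwise CMI
        w[i, j] = w[j, i] = cond_mutual_info(X[:, i], X[:, j], y)
    # steps 2-4: maximum weighted spanning tree (Prim), rooted at attribute 0
    parents = [None] * a
    in_tree, best_w, best_p = {0}, w[0].copy(), np.zeros(a, dtype=int)
    while len(in_tree) < a:
        cand = [i for i in range(a) if i not in in_tree]
        nxt = max(cand, key=lambda i: best_w[i])
        parents[nxt] = int(best_p[nxt])                    # edge directed away from root
        in_tree.add(nxt)
        for i in cand:
            if i != nxt and w[nxt, i] > best_w[i]:
                best_w[i], best_p[i] = w[nxt, i], nxt
    return parents                                         # step 5: Y is parent of every X_i

# usage on a small random discrete data set
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 4)); y = rng.integers(0, 2, size=200)
print(learn_tan_structure(X, y))
```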

3. Attribute Selective Tree-Augmented Naive Bayes

3.1. Motivation

It can be seen from Equation (4) that the joint probability P_TAN(x, y) is estimated by the product of the prior probability P_TAN(y) and the conditional probabilities P_TAN(x_i | y, π_i). Considering only the top-ranked attributes therefore produces an approximation to P_TAN(x, y). This implies that it is possible to build a sequence of alternative selective models such that each one is a trivial extension of the previous one. Different models that build upon one another in this way can be evaluated efficiently in a single set of computations, so in just one learning pass through the data the cross validation risks of all the models can be obtained. Minimizing this risk yields the best model, and hence the optimal attribute subset for classification in the framework of TAN.

3.2. Building Model Sequence

When trying to find the best attribute subset, the size of the search space for a attributes is 2^a. Instead of searching the whole space exhaustively, it is natural to impose some restrictions on the construction of the model space. As TAN computes the joint probability P_TAN(x, y) by sequentially multiplying the conditional probabilities P_TAN(x_i | y, π_i) into the prior P_TAN(y), considering only the top s attributes results in an approximate model of P_TAN(x, y), where 1 ≤ s ≤ a. So the model space of attribute selective TAN is
P_{\mathrm{TAN}}^{s}(\mathbf{x}, y) = P_{\mathrm{TAN}}(y) \prod_{i=1}^{s} P_{\mathrm{TAN}}(x_i \mid y, \pi_i)    (5)
In this way we construct a model sequence of size a. These models can be evaluated efficiently in a single set of computations, as each one is only a trivial extension of the one before it. Although each model is only an approximation to the TAN model, regular TAN itself is included in this model sequence (s = a). Consequently, this attribute selective model can be expected to be no worse than regular TAN.
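Because each model in Equation (5) extends the previous one by a single factor, all a models can be scored with one running product per class. A minimal sketch, under the same illustrative CPT layout as the earlier sketch (hypothetical names, not the authors' code):

```python
import numpy as np

def nested_posteriors(prior, cpts, parents, x):
    """Posteriors P^s_TAN(y | x) for every truncation level s = 1..a.
    The running product is reused, so scoring all nested models costs
    a single pass over the (already ranked) attributes."""
    joint = prior.copy()
    posteriors = []
    for i, cpt in enumerate(cpts):            # attributes assumed ranked beforehand
        pa = 0 if parents[i] is None else x[parents[i]]
        joint = joint * cpt[:, pa, x[i]]      # extend model s-1 to model s
        posteriors.append(joint / joint.sum())
    return posteriors                          # posteriors[s-1] uses the top s attributes
```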

3.3. Ranking the Attributes

Since the selective strategy considers only the top s attributes, it relies on a ranking of the attributes. As the purpose of attribute selection is to eliminate redundant attributes, we should prioritize the more informative ones. We could rank the attributes based on their marginal relevance with respect to the class. Fortunately, attribute ranking has been extensively investigated in the feature selection literature [22]. Here we adopt the five most well-known strategies for measuring the relevance between an attribute and the class.
1. Mutual Information (MI) (mutual information measures the amount of information shared by two variables; this can also be interpreted as the amount of information that one variable provides about another) is an intuitive score since it measures the correlation between an attribute and the class variable. Before presenting the definition of mutual information, we first present the concepts of entropy and conditional entropy. The entropy of a random variable X is defined as
H(X) = -\sum_{x \in X} P(x) \log_2 P(x)    (6)
The conditional entropy H(X | Y) of X given Y is
H(X \mid Y) = -\sum_{y \in Y} P(y) \sum_{x \in X} P(x \mid y) \log_2 P(x \mid y)    (7)
The mutual information between X and Y is defined as the difference between the entropy H(X) and the conditional entropy H(X | Y):
I(X; Y) = H(X) - H(X \mid Y) = \sum_{y \in Y} \sum_{x \in X} P(x, y) \log_2 \frac{P(x, y)}{P(x) P(y)}    (8)
This heuristic considers a score for each attribute independently of others.
2. Symmetrical Uncertainty (SU) (symmetrical uncertainty is the normalized mutual information; its range is [0, 1], where the value 1 indicates that knowledge of either variable completely predicts the value of the other and the value 0 indicates that the two variables are independent) [23] can be interpreted as a form of mutual information normalized to the interval [0, 1]:
SU(X; Y) = \frac{2 I(X; Y)}{H(X) + H(Y)}    (9)
Mutual information is obviously biased in favor of attributes with more values and hence larger entropy. Symmetrical uncertainty, which is normalized to the range [0, 1], removes this bias and ensures the scores are comparable and have the same effect. As a result, we can expect to obtain a more appropriate ranking of attributes based on symmetrical uncertainty.
3. The Minimum Redundancy-Maximum Relevance (MRMR) criterion (MRMR, short for Minimum Redundancy-Maximum Relevance, always tries to select the attribute with the best trade-off between relevance to the class variable and averaged redundancy with respect to the attributes already selected), proposed by Peng et al. [24], not only uses mutual information to ensure feature relevance, but also introduces a penalty to enforce low correlation with the features already selected. MRMR is very similar to Mutual Information Feature Selection (MIFS) [25], except that the latter replaces 1/k with a more general configurable parameter β, where k is the number of attributes selected so far, which is also the number of steps taken. Assume that at step k the attribute set selected so far is A_k^S, while Ā_k^S = A \ A_k^S is the set difference between the original set of attributes A and A_k^S. The attribute returned by the MRMR criterion at step k + 1 is
X_{k+1}^{\mathrm{MRMR}} = \arg\max_{X \in \bar{A}_k^S} \Big[ I(X; Y) - \frac{1}{k} \sum_{X' \in A_k^S} I(X; X') \Big]    (10)
At each step, this strategy selects the attribute with the best trade-off between the relevance I(X; Y) of X to the class Y and the averaged redundancy of X with respect to the selected attributes X' ∈ A_k^S.
4. Conditional Mutual Information Maximization (CMIM) (CMIM, short for Conditional Mutual Information Maximization, tries to select the attribute that maximizes the minimal mutual information with the class conditioned on the attributes already selected) selects the feature whose minimal relevance conditioned on the selected attributes is maximal. This heuristic was proposed by Fleuret [26] and later also by Chen et al. [27] as direct rank (dRank). CMIM computes the mutual information between X and the class variable Y, conditioned on each attribute X' ∈ A_k^S previously selected. The minimal value is retained, and the attribute with the maximal minimal conditional relevance is selected. In formal notation, the variable returned at step k + 1 according to the CMIM strategy is
X_{k+1}^{\mathrm{CMIM}} = \arg\max_{X \in \bar{A}_k^S} \min_{X' \in A_k^S} I(X; Y \mid X')    (11)
5. Joint Mutual Information (JMI) (JMI, short for Joint Mutual Information, tries to select the attribute that is most complementary to the existing attributes), proposed by Yang and Moody [28] and later also by Meyer et al. [29], selects a candidate attribute if it is complementary with the existing attributes. As a result, JMI focuses on increasing the complementary information between attributes. The variable returned by the JMI criterion at step k + 1 is
X_{k+1}^{\mathrm{JMI}} = \arg\max_{X \in \bar{A}_k^S} \sum_{X' \in A_k^S} I(X, X'; Y)    (12)
The score in JMI is the information between the class variable and a joint random variable (X, X'), defined by pairing the candidate X with each attribute X' previously selected.
Note that for the first two scores, we can simply rank the attributes in descending order of their MI or SU scores. For the last three methods, a forward selection search strategy is involved, which means attributes are selected sequentially, iteratively constructing the attribute subset. Suppose that at step k the set of attributes selected so far is A_k^S, and Ā_k^S = A \ A_k^S is the set difference between the original set of attributes A and A_k^S. At step k + 1 of the forward selection search, these methods select the attribute X_{k+1} that maximizes the score in Equation (10), (11) or (12). The attribute sets are then updated as A_{k+1}^S ← A_k^S ∪ {X_{k+1}} and Ā_{k+1}^S ← Ā_k^S \ {X_{k+1}}. Initially, the selected set A^S is empty, so all three methods start by selecting the attribute arg max_{X ∈ A} I(X; Y), that is, the attribute with maximal mutual information with respect to the class variable. The procedure terminates when the unselected set Ā^S becomes empty. A code sketch of these ranking strategies is given below.
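The following sketch (illustrative; the function names and the plain empirical estimates are assumptions, not the paper's implementation) ranks attributes by MI and by the MRMR forward selection of Equation (10); SU, CMIM and JMI follow the same forward-selection template with their respective scores, and the experiments in Section 4 use the quotient variant of Equation (14).

```python
import numpy as np

def mutual_info(u, v):
    """Empirical I(U; V) of Equation (8) for two discrete arrays."""
    mi = 0.0
    for a in np.unique(u):
        for b in np.unique(v):
            p_ab = np.mean((u == a) & (v == b))
            p_a, p_b = np.mean(u == a), np.mean(v == b)
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

def rank_mi(X, y):
    """Descending-MI ranking used by STAN_MI."""
    scores = [mutual_info(X[:, i], y) for i in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda i: -scores[i])

def rank_mrmr(X, y):
    """Forward selection with the MRMR criterion of Equation (10)."""
    a = X.shape[1]
    relevance = [mutual_info(X[:, i], y) for i in range(a)]
    redundancy = [[mutual_info(X[:, i], X[:, j]) for j in range(a)] for i in range(a)]
    selected = [int(np.argmax(relevance))]            # first pick: max I(X; Y)
    remaining = set(range(a)) - set(selected)
    while remaining:
        k = len(selected)
        def score(i):                                  # relevance minus averaged redundancy
            return relevance[i] - sum(redundancy[i][j] for j in selected) / k
        nxt = max(remaining, key=score)
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(300, 5)); y = (X[:, 0] + X[:, 2]) % 2
print(rank_mi(X, y), rank_mrmr(X, y))
```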

3.4. Cross Validation Risk Minimization

Since the model space has been built, we next need to select the best model in this space. A natural idea is to apply these models to the training examples and select the model with the best accuracy. However, this would cause overfitting, as the models would be trained and tested on the same examples: a low error rate on the training data does not imply a low error rate on the testing data. A more practical way is to use only part of the training examples to construct the models and leave the rest for testing and model selection. This is the idea of cross validation. In order to obtain the risks of different models in one learning pass through the training data, incremental leave-one-out cross validation [30] is adopted.
In the process of leave-one-out cross validation, the training set D is divided into a validation set (containing only one instance) and an effective training set (containing |D| − 1 instances). Each instance in D acts as the validation instance in turn: its contribution to the models is removed and it serves only for validation. This realizes cross validation on the training set. During learning, no use is made of any instance in the test set.
As the Root Mean Squared Error (RMSE) is a finer-grained measure of the calibration of the probability estimates than zero-one loss, RMSE is adopted to measure the cross validation risk. Since we search for the model that minimizes the empirical risk, we call the score the Cross Validation Risk (CVR):
CVR = \sqrt{\frac{1}{|D|} \sum_{i=1}^{|D|} \big(1.0 - P_{\mathrm{TAN}}^{s}(y = y_i \mid \mathbf{x}_i)\big)^2}    (13)
Based on the above methodologies, we develop the training algorithm of attribute selective TAN, described in Algorithm 1. It involves two passes through the training set D. The first pass collects the information needed to form the table of joint frequencies of all combinations of two attribute values and the class label. The second pass evaluates all the models by leave-one-out cross validation.
Algorithm 1 Training algorithm of attribute selective TAN.
1: Form the table of joint frequencies of all combinations of 2 attribute values and the class label    ▹ first pass through training data
2: Rank the attributes by Equations (8)–(12)
3: for each instance inst ∈ D do    ▹ second pass through training data
4:     Remove inst from the frequency table
5:     Predict inst by all models in Equation (5)
6:     Accumulate the squared error for each model
7:     Add inst back to the frequency table
8: end for
9: Compute the CVR score for each model as in Equation (13)
10: Select the model with the lowest CVR
This strategy can search the entire model space in just one more pass through the training data, so it is efficient. Furthermore, local optima are avoided. This differs from attribute selection methods that rely on hill-climbing search [16], where multiple passes through the training data may be required and only a local optimum can be obtained. In our strategy, if the search space is expanded, better models can be obtained.
From the training process in Algorithm 1, we can see that the space complexity of the table of joint frequencies of all combinations of two attribute values and the class label is O(c(av)^2), where v is the average number of values per attribute and c is the number of classes. Attribute selection does not require additional memory. The time complexity consists of three parts. The first is the derivation of the frequencies required to populate the table, whose time complexity is O(ta^2), where t is the number of training instances. The second part is the attribute ranking, whose time complexity is O(a^2); this part is negligible compared to the first. The last part is attribute selection in a second pass through the training data, whose time complexity is O(tca), since for each example we need to compute the joint probability in Equation (4). So the overall time complexity is O(ta^2 + tca). The time complexity of classifying a single example is O(ca) in the worst case, because some attributes may be omitted after attribute selection.
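A minimal sketch of the selection loop in Algorithm 1, assuming discrete data, a fixed TAN structure and attribute ranking, and simple add-one smoothing instead of the paper's m-estimation: each instance is removed from the joint frequency table, every nested model of Equation (5) is scored on it, and the counts are then restored.

```python
import numpy as np

def select_tan_attributes(X, y, parents, order, n_classes, n_vals):
    """Pick the number of top-ranked attributes s that minimizes the
    leave-one-out CVR of Equation (13). `parents[i]` is X_i's attribute
    parent (or None), `order` the attribute ranking, `n_vals[i]` the
    cardinality of X_i."""
    n, a = X.shape
    cls = np.zeros(n_classes)
    tab = [np.zeros((n_classes,
                     1 if parents[i] is None else n_vals[parents[i]],
                     n_vals[i])) for i in range(a)]

    def update(xi, yi, delta):                    # add or remove one instance
        cls[yi] += delta
        for i in range(a):
            pa = 0 if parents[i] is None else xi[parents[i]]
            tab[i][yi, pa, xi[i]] += delta

    for xi, yi in zip(X, y):                      # first pass: frequency table
        update(xi, yi, +1)

    sq_err = np.zeros(a)                          # one accumulator per model s
    for xi, yi in zip(X, y):                      # second pass: incremental LOOCV
        update(xi, yi, -1)
        joint = (cls + 1.0) / (cls.sum() + n_classes)          # smoothed P(y)
        for s, i in enumerate(order):
            pa = 0 if parents[i] is None else xi[parents[i]]
            counts = tab[i][:, pa, :]
            joint = joint * (counts[:, xi[i]] + 1.0) / (counts.sum(axis=1) + n_vals[i])
            post = joint / joint.sum()
            sq_err[s] += (1.0 - post[yi]) ** 2
        update(xi, yi, +1)

    cvr = np.sqrt(sq_err / n)                     # Equation (13) for each model
    return order[:int(np.argmin(cvr)) + 1], cvr

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(200, 4)); y = X[:, 0] ^ X[:, 1]
subset, cvr = select_tan_attributes(X, y, [None, 0, 1, 2], [0, 1, 2, 3], 2, [2, 2, 2, 2])
print(subset, np.round(cvr, 3))
```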

4. Experiments and Analysis

4.1. Experimental Methodology

We have performed the experiments on 70 UCI data sets [31], covering a wide spectrum in the number of instances (24–5,749,132), attributes (3–166) and classes (2–50), which allows us to examine the performance of the proposed algorithm on data sets with various characteristics. Table 1 lists the data sets, including the name of each data set and its numbers of instances, attributes, and classes. Note that the data sets are listed in ascending order of the number of instances.
The experiments have been performed on a Linux HPC cluster with 4 nodes, each with 64 GB RAM. The experimental system is implemented in C++. In our experimental system, several strategies that differ from the Weka software [32] were adopted, namely:
1.
Missing values are treated as a distinct value rather than being replaced with the modes (for nominal attributes) and means (for numeric attributes) as in the Weka software.
2.
Root mean squared error is calculated exclusively on the true class label. This is different from Weka’s implementation, where all class labels are considered.
The base probabilities are estimated using m-estimation (m = 1) [33]. Five-bin equal-frequency discretization is performed to discretize the numeric attributes, as in [34]. All the algorithms have been run on the data sets in 10-fold cross validation mode.
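For reference, a short sketch of these two estimation and preprocessing choices under their usual textbook definitions (the exact base probability used by the authors' m-estimate is not stated here, so the uniform prior below is an assumption):

```python
import numpy as np

def m_estimate(count, total, n_values, m=1.0):
    """m-estimation of a probability: (count + m*p0) / (total + m),
    with a uniform prior p0 = 1/n_values assumed here."""
    return (count + m / n_values) / (total + m)

def equal_frequency_cuts(values, n_bins=5):
    """Cut points for 5-bin equal-frequency discretization of a numeric attribute."""
    return np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])

x = np.random.default_rng(3).normal(size=1000)
cuts = equal_frequency_cuts(x)
binned = np.digitize(x, cuts)                     # integer bins 0..4
print(cuts, np.bincount(binned))
print(m_estimate(count=3, total=10, n_values=4))  # smoothed estimate of 3/10
```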
In the experiments, we compare STAN with regular TAN. According to the different attribute ranking strategies described in Section 3.3, we develop five versions of STAN, namely STAN_MI, STAN_SU, STAN_JMI, STAN_CMIM and STAN_MRMR. It is worth noting that in the implementation of STAN_MRMR, we use the following criterion instead of Equation (10), as suggested by the authors [24]:
X_{k+1}^{\mathrm{MRMR}} = \arg\max_{X \in \bar{A}_k^S} \frac{I(X; Y)}{\frac{1}{k} \sum_{X' \in A_k^S} I(X; X') + 0.01}    (14)
where 0.01 is added to avoid division by zero. We also compare the best version of STAN with state-of-the-art one-dependence Bayesian network classifiers, namely AODE [5] and KDB_1 (KDB with k = 1).

4.2. Win/Draw/Loss Analysis

In this subsection, we demonstrate the classification performance of the proposed algorithms. Two commonly used performance measures are reported, namely Zero-One Loss (ZOL) and Root Mean Squared Error (RMSE). ZOL is the proportion of instances that are misclassified. RMSE is the square root of the mean squared error of the probability estimates, where the error for a testing example is the difference between 1.0 and the probability estimated by the algorithm for that example's true class.
Table A1 and Table A2 in Appendix A provide the detailed ZOL and RMSE results of the eight algorithms on the 70 data sets. In order to present a brief summary of the comparison of the different algorithms, statistical win/draw/loss records in terms of the above performance measures are reported in Table 2 and Table 3.
The win/draw/loss record indicates how often one algorithm wins against, draws with, or loses to another algorithm with respect to the specified measure on the 70 data sets. For example, the win/draw/loss of STAN_MI against TAN with respect to ZOL is 26/22/22, which means STAN_MI obtains lower ZOL than TAN on 26 data sets, the same ZOL as TAN on 22 data sets, and greater ZOL than TAN on 22 data sets.
To decide whether two algorithms have equal chances of winning, a standard binomial sign test [35] is applied to these records. Given the null hypothesis that wins and losses are equiprobable, the binomial test indicates the probability of observing the specified numbers of wins and losses. In our analysis, the number of draws is split equally between wins and losses; if the number of draws is odd, one draw is ignored. We reject the hypothesis and consider the difference between the two algorithms significant if the p value is less than the critical value 0.05, in which case it is shown in bold font. The p values reported are the outcomes of one-tailed tests. For example, the p value of STAN_MI against TAN is 0.3601, which means the probability of at least 37 (= 26 + 22/2) wins in 70 comparisons is 0.3601 according to the binomial distribution. Since 0.3601 is greater than 0.05, we conclude that the difference between STAN_MI and TAN is not significant, although STAN_MI obtains lower ZOL than TAN more often than the reverse.
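The one-tailed sign-test p values in Tables 2 and 3 can be reproduced from the win/draw/loss counts; a short sketch using scipy's binomial survival function, with draws split as described above:

```python
from scipy.stats import binom

def sign_test_p(wins, draws, losses):
    """One-tailed binomial sign test with draws split equally (odd draw dropped)."""
    w = wins + draws // 2
    n = wins + losses + 2 * (draws // 2)
    return binom.sf(w - 1, n, 0.5)        # P(X >= w) under equal win/loss chances

print(round(sign_test_p(26, 22, 22), 4))  # STAN_MI vs. TAN on ZOL -> about 0.36
print(round(sign_test_p(34, 22, 14), 4))  # STAN_MRMR vs. TAN on ZOL -> about 0.01
```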
We first compare the different versions of STAN with TAN. From Table 2, we can see that, relative to TAN, STAN_MI achieves lower error almost as often as higher. STAN_SU, STAN_JMI and STAN_CMIM deliver lower error more often than TAN, though only significantly so for STAN_JMI and STAN_CMIM on RMSE. STAN_MRMR reduces both zero-one loss and RMSE significantly more often than TAN. We conclude that, with the MRMR ranking strategy, STAN_MRMR achieves the best performance among the five variants.
We also present the scatter plot of STAN_MRMR against TAN in terms of ZOL in Figure 2. The points below the diagonal represent the data sets on which STAN_MRMR achieves lower ZOL than TAN. It can be seen that STAN_MRMR provides better predictions than regular TAN significantly more often than the reverse.
Next, we compare STAN_MRMR with the state-of-the-art one-dependence Bayesian classifiers AODE [5] and KDB [6]. We use the version of KDB with k = 1, denoted KDB_1. The win/draw/loss results are summarized in Table 3. STAN_MRMR achieves lower error almost as often as higher relative to AODE, while it obtains lower zero-one loss and RMSE relative to KDB_1 significantly more often than the reverse.

4.3. Analysis of Training and Classification Time

In this subsection, we compare the averaged training and classification times of the proposed algorithms. The averaged training and classification times of all 8 algorithms are plotted in Figure 3.
Selective TAN requires more training time than regular TAN, which can be explained by the fact that selective TAN involves one more pass through the training data. The differences among the various versions of STAN are not significant. The five STAN algorithms require more training time than AODE and KDB_1.
As far as classification time is concerned, the five STAN algorithms take about the same time as regular TAN. As the classification process uses fewer attributes in STAN than in regular TAN, the classification times would be expected to be lower than those of regular TAN; however, the plot does not show this trend. A closer look at the classification times of the different algorithms on each data set shows that the classification times on most data sets are 0, due to the limited number of instances in those data sets. AODE requires more classification time since it needs to classify each test instance by multiple one-dependence estimators.

5. Discussion

In this paper, we propose an attribute Selective Tree-Augmented Naive Bayes (STAN) algorithm, which builds a sequence of approximate models by adding one attribute at a time to the previous model and searches the model space to minimize the cross validation risk. Extensive experiments on 70 UCI data sets demonstrate that STAN achieves superior performance while maintaining efficiency and simplicity. The conclusions are summarized as follows:
  • STAN algorithms with different ranking strategies achieve better classification performance than regular TAN at the cost of a modest increase in training time.
  • MRMR ranking strategy achieves the best classification performance compared to other ranking strategies, and the advantage over regular TAN is significant.
  • STAN with the MRMR ranking strategy is comparable with AODE and superior to KDB_1 in terms of accuracy, while requiring less classification time than AODE.
As the cross validation risk minimization provides an efficient search through the model space, expanding the space can be expected to produce better models. In future work it is therefore worthwhile to expand the model space by varying the dependence level so as to find more practical models.

Author Contributions

Conceptualization, S.C.; methodology, S.C.; software, S.C.; validation, L.L.; formal analysis, S.C.; investigation, S.C.; resources, S.C.; data curation, S.C.; writing—original draft preparation, S.C.; writing—review and editing, S.C.; visualization, Z.Z.; supervision, S.C.; project administration, S.C.; funding acquisition, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sets used in this manuscript can be found in the UCI data repository (http://archive.ics.uci.edu/ml (accessed on 5 September 2021)).

Acknowledgments

This research has been supported in part by the NAU Educational Technology Center through the use of the NAU HPC Cluster.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. ZOL results of 8 algorithms on 70 data sets.
Dataset | TAN | STAN_MI | STAN_SU | STAN_JMI | STAN_CMIM | STAN_MRMR | AODE | KDB_1
contact-lenses0.3750 ± 0.37580.3750 ± 0.34250.3750 ± 0.34250.3750 ± 0.34250.3750 ± 0.34250.3750 ± 0.34250.4167 ± 0.35740.2917 ± 0.3543
lung-cancer0.5938 ± 0.22650.6875 ± 0.26410.7188 ± 0.25020.5625 ± 0.32890.6250 ± 0.26050.6562 ± 0.27330.4688 ± 0.28850.5938 ± 0.3082
labor-negotiations0.1053 ± 0.12340.1053 ± 0.12340.0877 ± 0.12720.1053 ± 0.12340.1228 ± 0.20900.1053 ± 0.12340.0526 ± 0.06750.1053 ± 0.1146
post-operative0.3667 ± 0.20750.3222 ± 0.21170.3444 ± 0.21240.3111 ± 0.16630.3111 ± 0.16630.3222 ± 0.22630.3444 ± 0.18820.3444 ± 0.1748
zoo0.0099 ± 0.05270.0099 ± 0.05270.0099 ± 0.05270.0099 ± 0.05270.0198 ± 0.05420.0099 ± 0.05270.0198 ± 0.03840.0495 ± 0.0614
promoters0.1321 ± 0.10360.1792 ± 0.13070.1887 ± 0.12510.1792 ± 0.12800.1604 ± 0.10720.1321 ± 0.11080.1038 ± 0.06480.1321 ± 0.0891
echocardiogram0.3664 ± 0.15490.3893 ± 0.11910.3664 ± 0.09220.3969 ± 0.09980.3969 ± 0.09980.3511 ± 0.11320.3435 ± 0.11430.3664 ± 0.1511
lymphography0.1757 ± 0.10030.1622 ± 0.10070.1689 ± 0.10050.1622 ± 0.10070.1689 ± 0.10470.2432 ± 0.11770.1486 ± 0.09910.1757 ± 0.0791
iris0.0667 ± 0.06320.0667 ± 0.06320.0667 ± 0.06320.0667 ± 0.06320.0667 ± 0.06320.0667 ± 0.06320.0600 ± 0.06550.0733 ± 0.0505
teaching-ae0.4901 ± 0.12450.4901 ± 0.12450.4901 ± 0.12450.5232 ± 0.12160.5232 ± 0.12160.5166 ± 0.14360.4834 ± 0.11790.4834 ± 0.1079
hepatitis0.1484 ± 0.12800.1548 ± 0.12460.1548 ± 0.12640.1484 ± 0.12800.1484 ± 0.12640.1806 ± 0.12360.1935 ± 0.12440.2194 ± 0.1205
wine0.0618 ± 0.06430.0562 ± 0.05340.0562 ± 0.05340.0618 ± 0.06470.0674 ± 0.06110.0449 ± 0.05320.0281 ± 0.04040.0674 ± 0.0633
autos0.2293 ± 0.13740.1951 ± 0.12780.2000 ± 0.11180.2098 ± 0.10670.1951 ± 0.11620.2146 ± 0.12300.2537 ± 0.11040.2293 ± 0.1374
sonar0.2788 ± 0.08400.3029 ± 0.10860.3029 ± 0.10860.2981 ± 0.09250.2692 ± 0.10320.3173 ± 0.13010.1394 ± 0.08880.2548 ± 0.0914
glass-id0.2617 ± 0.09440.2664 ± 0.09230.2290 ± 0.07570.2523 ± 0.09440.2523 ± 0.09440.2243 ± 0.06710.1589 ± 0.05760.2383 ± 0.0720
new-thyroid0.0791 ± 0.06470.0791 ± 0.06470.0791 ± 0.06470.0791 ± 0.06470.0791 ± 0.06470.0791 ± 0.06470.0512 ± 0.05440.0651 ± 0.0454
audio0.2920 ± 0.09260.3053 ± 0.06760.3053 ± 0.06760.3009 ± 0.08510.3009 ± 0.09340.2876 ± 0.08210.2301 ± 0.06490.3097 ± 0.1054
hungarian0.1973 ± 0.06060.2041 ± 0.06130.2109 ± 0.07080.1973 ± 0.05240.1905 ± 0.05110.1871 ± 0.07890.1429 ± 0.06760.2075 ± 0.0625
heart-disease-c0.2112 ± 0.10050.2211 ± 0.11540.2244 ± 0.10370.2178 ± 0.10200.2079 ± 0.10810.1914 ± 0.08200.1848 ± 0.10670.2178 ± 0.1428
haberman0.2843 ± 0.10230.2778 ± 0.08680.2778 ± 0.08680.2745 ± 0.08510.2745 ± 0.08510.2680 ± 0.09850.2712 ± 0.11880.2778 ± 0.1024
primary-tumor0.5752 ± 0.09600.5841 ± 0.11880.5841 ± 0.11880.5841 ± 0.11880.5782 ± 0.12090.5841 ± 0.11840.5162 ± 0.09840.5841 ± 0.1119
ionosphere0.0684 ± 0.05100.0741 ± 0.04530.0769 ± 0.04480.0741 ± 0.05420.0712 ± 0.04090.0855 ± 0.04280.0826 ± 0.04050.0684 ± 0.0441
dermatology0.0464 ± 0.03900.0383 ± 0.03450.0437 ± 0.03590.0410 ± 0.02870.0437 ± 0.03780.0301 ± 0.02820.0219 ± 0.02750.0301 ± 0.0258
horse-colic0.2092 ± 0.06290.1875 ± 0.05240.1793 ± 0.05670.1793 ± 0.07150.1685 ± 0.06180.1902 ± 0.06040.2038 ± 0.05900.2120 ± 0.0615
house-votes-840.0552 ± 0.03750.0644 ± 0.03860.0644 ± 0.03860.0552 ± 0.03150.0529 ± 0.04040.0552 ± 0.03680.0529 ± 0.03460.0690 ± 0.0353
cylinder-bands0.3296 ± 0.07190.3833 ± 0.07300.3796 ± 0.08210.3722 ± 0.06340.3759 ± 0.06770.3704 ± 0.07740.1611 ± 0.04210.2074 ± 0.0575
chess0.0926 ± 0.04920.0907 ± 0.05090.0907 ± 0.05090.0944 ± 0.05530.0907 ± 0.05090.0907 ± 0.05090.1053 ± 0.06310.0998 ± 0.0354
syncon0.0300 ± 0.02490.0283 ± 0.02410.0283 ± 0.02410.0317 ± 0.02660.0267 ± 0.02350.0317 ± 0.02910.0200 ± 0.01630.0200 ± 0.0156
balance-scale0.1328 ± 0.01560.1328 ± 0.01560.1328 ± 0.01560.1328 ± 0.01560.1328 ± 0.01560.1328 ± 0.01560.1120 ± 0.01590.1424 ± 0.0307
soybean0.0469 ± 0.01360.0586 ± 0.01950.0571 ± 0.01800.0469 ± 0.01580.0454 ± 0.01000.0410 ± 0.00950.0542 ± 0.01840.0644 ± 0.0205
credit-a0.1696 ± 0.03700.1667 ± 0.03940.1623 ± 0.03740.1739 ± 0.04600.1696 ± 0.04440.1536 ± 0.03770.1261 ± 0.02100.1696 ± 0.0417
breast-cancer-w0.0415 ± 0.02730.0443 ± 0.02520.0429 ± 0.02710.0415 ± 0.02710.0386 ± 0.02070.0372 ± 0.02370.0386 ± 0.02480.0486 ± 0.0181
pima-ind-diabetes0.2526 ± 0.05090.2487 ± 0.04160.2487 ± 0.04160.2409 ± 0.05050.2461 ± 0.04800.2396 ± 0.05500.2513 ± 0.06360.2578 ± 0.0583
vehicle0.2837 ± 0.06030.3014 ± 0.05050.3014 ± 0.05050.3121 ± 0.04790.2837 ± 0.05700.2884 ± 0.06540.3132 ± 0.05630.3026 ± 0.0627
anneal0.0468 ± 0.01820.0468 ± 0.01820.0468 ± 0.01820.0468 ± 0.01820.0468 ± 0.01820.0468 ± 0.01820.0735 ± 0.02320.0445 ± 0.0156
tic-tac-toe0.2286 ± 0.03950.2286 ± 0.03950.2286 ± 0.03950.2286 ± 0.03950.2286 ± 0.03950.2286 ± 0.03950.2683 ± 0.04320.2463 ± 0.0382
vowel0.0667 ± 0.02590.0616 ± 0.02840.0616 ± 0.02840.0646 ± 0.02700.0707 ± 0.03540.0616 ± 0.02840.0808 ± 0.02960.2162 ± 0.0272
german0.2700 ± 0.05150.2750 ± 0.04110.2760 ± 0.04700.2770 ± 0.06530.2740 ± 0.06040.2670 ± 0.03980.2410 ± 0.05350.2660 ± 0.0634
led0.2660 ± 0.05690.2660 ± 0.05690.2660 ± 0.05690.2660 ± 0.05690.2660 ± 0.05690.2660 ± 0.05690.2700 ± 0.06040.2640 ± 0.0603
contraceptive-mc0.4739 ± 0.03450.4800 ± 0.03280.4779 ± 0.03180.4793 ± 0.03330.4739 ± 0.02340.4745 ± 0.02660.4671 ± 0.04550.4684 ± 0.0276
yeast0.4481 ± 0.03600.4481 ± 0.03600.4461 ± 0.03240.4481 ± 0.03600.4481 ± 0.03600.4454 ± 0.03220.4205 ± 0.04020.4394 ± 0.0326
volcanoes0.3559 ± 0.02500.3539 ± 0.02760.3539 ± 0.02760.3533 ± 0.02940.3533 ± 0.02940.3539 ± 0.02760.3539 ± 0.03310.3520 ± 0.0258
car0.0567 ± 0.01820.0567 ± 0.01820.0567 ± 0.01820.0567 ± 0.01820.0567 ± 0.01820.0567 ± 0.01820.0845 ± 0.01930.0567 ± 0.0182
segment0.0615 ± 0.01420.0610 ± 0.01330.0610 ± 0.01330.0610 ± 0.01330.0610 ± 0.01330.0615 ± 0.01300.0563 ± 0.00910.0567 ± 0.0158
hypothyroid0.0332 ± 0.01260.0319 ± 0.01100.0322 ± 0.00980.0326 ± 0.01040.0316 ± 0.01060.0310 ± 0.00950.0348 ± 0.01180.0338 ± 0.0137
splice-c4.50.0466 ± 0.01290.0349 ± 0.00890.0349 ± 0.00890.0349 ± 0.01020.0318 ± 0.00780.0340 ± 0.00880.0375 ± 0.00870.0482 ± 0.0152
kr-vs-kp0.0776 ± 0.02280.0569 ± 0.01860.0579 ± 0.01870.0569 ± 0.01860.0607 ± 0.01450.0551 ± 0.01420.0854 ± 0.01870.0544 ± 0.0171
abalone0.4692 ± 0.02850.4692 ± 0.02850.4692 ± 0.02850.4692 ± 0.02850.4690 ± 0.02790.4692 ± 0.02850.4551 ± 0.02140.4656 ± 0.0237
spambase0.0696 ± 0.01060.0689 ± 0.01150.0685 ± 0.01180.0696 ± 0.01060.0689 ± 0.01120.0682 ± 0.01140.0635 ± 0.01140.0702 ± 0.0121
phoneme0.2733 ± 0.01770.2413 ± 0.01190.2413 ± 0.01190.2413 ± 0.01190.2413 ± 0.01190.2444 ± 0.01190.2100 ± 0.01440.2120 ± 0.0123
wall-following0.1147 ± 0.01160.0872 ± 0.00920.0872 ± 0.00920.0867 ± 0.00970.0861 ± 0.01080.0924 ± 0.01100.1514 ± 0.01010.1043 ± 0.0094
page-blocks0.0541 ± 0.01000.0530 ± 0.00810.0530 ± 0.00810.0541 ± 0.01000.0550 ± 0.00990.0530 ± 0.00810.0502 ± 0.00660.0590 ± 0.0102
optdigits0.0438 ± 0.00640.0441 ± 0.00640.0441 ± 0.00640.0441 ± 0.00640.0443 ± 0.00680.0441 ± 0.00670.0283 ± 0.00950.0454 ± 0.0070
satellite0.1310 ± 0.01260.1321 ± 0.01180.1321 ± 0.01180.1318 ± 0.01260.1340 ± 0.01350.1322 ± 0.01320.1301 ± 0.01310.1392 ± 0.0135
musk20.0917 ± 0.00860.1003 ± 0.01420.1073 ± 0.01650.0996 ± 0.01460.0997 ± 0.01870.1028 ± 0.01270.1511 ± 0.01010.0867 ± 0.0097
mushrooms0.0001 ± 0.00040.0001 ± 0.00040.0001 ± 0.00040.0001 ± 0.00040.0000 ± 0.00000.0001 ± 0.00040.0002 ± 0.00050.0006 ± 0.0009
thyroid0.2294 ± 0.01110.2294 ± 0.01110.2301 ± 0.01210.2294 ± 0.01110.2294 ± 0.01110.2294 ± 0.01110.2421 ± 0.01360.2319 ± 0.0146
pendigits0.0576 ± 0.00640.0576 ± 0.00640.0576 ± 0.00640.0552 ± 0.00560.0544 ± 0.00580.0568 ± 0.00640.0254 ± 0.00290.0529 ± 0.0066
sign0.2853 ± 0.00940.2853 ± 0.00940.2853 ± 0.00940.2853 ± 0.00940.2853 ± 0.00940.2853 ± 0.00940.2960 ± 0.01190.3055 ± 0.0140
nursery0.0654 ± 0.00620.0654 ± 0.00620.0654 ± 0.00620.0654 ± 0.00620.0654 ± 0.00620.0654 ± 0.00620.0733 ± 0.00590.0654 ± 0.0061
magic0.1613 ± 0.00760.1611 ± 0.00860.1611 ± 0.00860.1613 ± 0.00760.1613 ± 0.00760.1611 ± 0.00760.1726 ± 0.00840.1759 ± 0.0107
letter-recog0.1941 ± 0.00850.1941 ± 0.00850.1941 ± 0.00850.1941 ± 0.00850.1941 ± 0.00850.1941 ± 0.00850.1514 ± 0.00890.1920 ± 0.0112
adult0.1641 ± 0.00370.1609 ± 0.00400.1609 ± 0.00400.1635 ± 0.00340.1642 ± 0.00370.1631 ± 0.00450.1679 ± 0.00320.1638 ± 0.0044
shuttle0.0097 ± 0.00130.0085 ± 0.00130.0085 ± 0.00130.0085 ± 0.00130.0085 ± 0.00130.0085 ± 0.00130.0101 ± 0.00100.0163 ± 0.0012
connect-40.2354 ± 0.00500.2354 ± 0.00500.2354 ± 0.00500.2354 ± 0.00500.2354 ± 0.00500.2354 ± 0.00500.2422 ± 0.00470.2406 ± 0.0030
waveform0.0368 ± 0.00150.0370 ± 0.00140.0370 ± 0.00140.0369 ± 0.00140.0369 ± 0.00140.0367 ± 0.00150.0343 ± 0.00080.0396 ± 0.0021
localization0.4367 ± 0.00330.4367 ± 0.00330.4367 ± 0.00330.4367 ± 0.00330.4367 ± 0.00330.4367 ± 0.00330.4333 ± 0.00270.4642 ± 0.0040
census-income0.0675 ± 0.00160.0585 ± 0.00200.0571 ± 0.00100.0567 ± 0.00100.0599 ± 0.00110.0544 ± 0.00150.1106 ± 0.00150.0667 ± 0.0014
poker-hand0.3295 ± 0.00150.3294 ± 0.00150.3294 ± 0.00150.3294 ± 0.00150.3294 ± 0.00150.3294 ± 0.00150.4812 ± 0.00280.3291 ± 0.0012
donation0.0001 ± 0.00000.0001 ± 0.00000.0001 ± 0.00000.0001 ± 0.00000.0001 ± 0.00000.0001 ± 0.00000.0002 ± 0.00000.0001 ± 0.0000
Table A2. RMSE results of 8 algorithms on 70 data sets.
Dataset | TAN | STAN_MI | STAN_SU | STAN_JMI | STAN_CMIM | STAN_MRMR | AODE | KDB_1
contact-lenses0.6077 ± 0.18310.5438 ± 0.20910.5438 ± 0.20910.5635 ± 0.22780.5438 ± 0.20910.5635 ± 0.22780.5226 ± 0.22210.5024 ± 0.2104
lung-cancer0.7623 ± 0.13570.8044 ± 0.15150.7955 ± 0.14120.6807 ± 0.26430.7364 ± 0.17090.7690 ± 0.14630.6614 ± 0.24440.7523 ± 0.2928
labor-negotiations0.2935 ± 0.19750.2988 ± 0.20810.2847 ± 0.21320.2877 ± 0.19720.3131 ± 0.25140.2915 ± 0.20330.2104 ± 0.14550.3014 ± 0.1907
post-operative0.5340 ± 0.13930.5153 ± 0.13540.5206 ± 0.12410.5017 ± 0.11500.5017 ± 0.11500.5133 ± 0.14050.5136 ± 0.10590.5289 ± 0.1031
zoo0.1309 ± 0.11310.1168 ± 0.10540.1168 ± 0.10540.1144 ± 0.10520.1477 ± 0.11440.1397 ± 0.11390.1344 ± 0.09350.1984 ± 0.1255
promoters0.3264 ± 0.16590.3883 ± 0.17210.3895 ± 0.16980.3864 ± 0.17490.3702 ± 0.16470.3485 ± 0.17480.2795 ± 0.09400.3292 ± 0.1603
echocardiogram0.5276 ± 0.10170.5144 ± 0.08900.4999 ± 0.06400.5073 ± 0.07410.5073 ± 0.07410.4986 ± 0.06930.4829 ± 0.08080.5288 ± 0.1034
lymphography0.3813 ± 0.12270.3816 ± 0.12310.3891 ± 0.12500.3814 ± 0.12300.3874 ± 0.11750.4369 ± 0.09960.3274 ± 0.13950.3726 ± 0.1169
iris0.2211 ± 0.13530.2211 ± 0.13530.2211 ± 0.13530.2211 ± 0.13530.2211 ± 0.13530.2211 ± 0.13530.2224 ± 0.13030.2273 ± 0.1252
teaching-ae0.6189 ± 0.06710.6189 ± 0.06710.6189 ± 0.06710.6272 ± 0.08340.6272 ± 0.08340.6404 ± 0.07490.6105 ± 0.06840.6224 ± 0.0683
hepatitis0.3434 ± 0.14790.3530 ± 0.13960.3409 ± 0.13260.3416 ± 0.14220.3475 ± 0.14590.3751 ± 0.14420.3711 ± 0.10790.4188 ± 0.1082
wine0.2026 ± 0.12230.2020 ± 0.11790.2063 ± 0.12340.2142 ± 0.12180.2202 ± 0.11800.1923 ± 0.10290.1528 ± 0.10070.2210 ± 0.0927
autos0.4725 ± 0.12910.4214 ± 0.16370.4241 ± 0.13500.4339 ± 0.13270.4312 ± 0.13560.4350 ± 0.14270.4760 ± 0.11020.4736 ± 0.1286
sonar0.4856 ± 0.08900.5085 ± 0.10870.5085 ± 0.10870.5027 ± 0.09890.4805 ± 0.09200.5161 ± 0.11720.3349 ± 0.11090.4629 ± 0.0783
glass-id0.4360 ± 0.05850.4504 ± 0.06080.4286 ± 0.06040.4381 ± 0.05940.4371 ± 0.05830.4170 ± 0.05870.3654 ± 0.05460.4199 ± 0.0650
new-thyroid0.2554 ± 0.09910.2554 ± 0.09910.2554 ± 0.09910.2554 ± 0.09910.2554 ± 0.09910.2554 ± 0.09910.2221 ± 0.08500.2262 ± 0.0710
audio0.5212 ± 0.08550.5168 ± 0.06630.5175 ± 0.06600.5151 ± 0.08030.5139 ± 0.08090.5136 ± 0.06720.4639 ± 0.06060.5294 ± 0.0939
hungarian0.3895 ± 0.07110.3882 ± 0.07400.3816 ± 0.06840.3855 ± 0.06100.3870 ± 0.05480.3778 ± 0.06280.3506 ± 0.08450.3917 ± 0.0684
heart-disease-c0.4177 ± 0.08610.4203 ± 0.08810.4159 ± 0.07780.4171 ± 0.08200.4152 ± 0.08100.3874 ± 0.06920.3605 ± 0.08440.4135 ± 0.0989
haberman0.4433 ± 0.07590.4299 ± 0.07560.4299 ± 0.07560.4280 ± 0.07430.4280 ± 0.07430.4283 ± 0.07280.4402 ± 0.08200.4416 ± 0.0776
primary-tumor0.7280 ± 0.05790.7272 ± 0.05740.7264 ± 0.06100.7272 ± 0.05740.7266 ± 0.05800.7258 ± 0.05720.6972 ± 0.05850.7250 ± 0.0589
ionosphere0.2573 ± 0.10770.2616 ± 0.09910.2654 ± 0.09810.2596 ± 0.10440.2452 ± 0.09470.2638 ± 0.09410.2841 ± 0.07240.2434 ± 0.1047
dermatology0.1826 ± 0.06950.1792 ± 0.07450.1786 ± 0.07190.1786 ± 0.06330.1878 ± 0.07930.1576 ± 0.05930.1145 ± 0.06170.1521 ± 0.0784
horse-colic0.4289 ± 0.06720.3829 ± 0.05850.3714 ± 0.05690.3746 ± 0.05930.3764 ± 0.05870.3879 ± 0.05200.4029 ± 0.07090.4185 ± 0.0597
house-votes-840.2181 ± 0.07920.2337 ± 0.08030.2337 ± 0.08030.2161 ± 0.06630.2123 ± 0.07890.2115 ± 0.06860.2016 ± 0.07360.2235 ± 0.0721
cylinder-bands0.4405 ± 0.04200.4407 ± 0.02820.4393 ± 0.02800.4365 ± 0.02780.4423 ± 0.02810.4436 ± 0.02460.3656 ± 0.04510.4312 ± 0.0590
chess0.2594 ± 0.04700.2589 ± 0.04770.2590 ± 0.04750.2613 ± 0.04950.2597 ± 0.04790.2590 ± 0.04750.2855 ± 0.04850.2671 ± 0.0384
syncon0.1602 ± 0.06880.1608 ± 0.08070.1608 ± 0.08070.1651 ± 0.07480.1557 ± 0.08620.1617 ± 0.08000.1287 ± 0.04480.1271 ± 0.0686
balance-scale0.3971 ± 0.01860.3971 ± 0.01860.3971 ± 0.01860.3971 ± 0.01860.3971 ± 0.01860.3971 ± 0.01860.3999 ± 0.02340.4014 ± 0.0200
soybean0.2014 ± 0.03410.2139 ± 0.03650.2062 ± 0.03470.1914 ± 0.02940.1945 ± 0.03370.1828 ± 0.02650.2224 ± 0.04020.2206 ± 0.0436
credit-a0.3704 ± 0.04430.3404 ± 0.02420.3377 ± 0.03960.3386 ± 0.03140.3371 ± 0.03340.3473 ± 0.03070.3164 ± 0.03870.3692 ± 0.0419
breast-cancer-w0.1928 ± 0.06180.1877 ± 0.05440.1931 ± 0.05950.1794 ± 0.05660.1830 ± 0.04750.1746 ± 0.05270.1778 ± 0.08790.1951 ± 0.0461
pima-ind-diabetes0.4225 ± 0.04420.4142 ± 0.04960.4142 ± 0.04960.4116 ± 0.05160.4114 ± 0.05010.4079 ± 0.04950.4071 ± 0.04380.4212 ± 0.0494
vehicle0.4638 ± 0.04580.4691 ± 0.03890.4691 ± 0.03890.4706 ± 0.03870.4634 ± 0.04480.4611 ± 0.04250.4653 ± 0.03430.4637 ± 0.0433
anneal0.1813 ± 0.03660.1813 ± 0.03660.1813 ± 0.03660.1813 ± 0.03660.1813 ± 0.03660.1813 ± 0.03660.2311 ± 0.03730.1815 ± 0.0330
tic-tac-toe0.4023 ± 0.02690.4023 ± 0.02690.4023 ± 0.02690.4023 ± 0.02690.4023 ± 0.02690.4023 ± 0.02690.3995 ± 0.02120.4050 ± 0.0252
vowel0.2316 ± 0.04070.2232 ± 0.03900.2232 ± 0.03900.2305 ± 0.04050.2366 ± 0.04810.2232 ± 0.03900.2593 ± 0.03470.4182 ± 0.0185
german0.4389 ± 0.04760.4380 ± 0.03890.4374 ± 0.04160.4348 ± 0.04290.4385 ± 0.03540.4368 ± 0.03390.4147 ± 0.03050.4333 ± 0.0392
led0.5000 ± 0.03760.5000 ± 0.03760.5000 ± 0.03760.5000 ± 0.03760.5000 ± 0.03760.5000 ± 0.03760.4970 ± 0.03640.4991 ± 0.0375
contraceptive-mc0.5955 ± 0.01480.5970 ± 0.01310.5957 ± 0.01120.5969 ± 0.01320.5955 ± 0.01220.5965 ± 0.01230.5938 ± 0.01830.5923 ± 0.0164
yeast0.6204 ± 0.02260.6204 ± 0.02260.6205 ± 0.01950.6204 ± 0.02260.6204 ± 0.02260.6201 ± 0.01880.6063 ± 0.01950.6144 ± 0.0201
volcanoes0.5313 ± 0.01550.5324 ± 0.01460.5324 ± 0.01460.5322 ± 0.01440.5322 ± 0.01440.5324 ± 0.01460.5284 ± 0.01840.5297 ± 0.0168
car0.2405 ± 0.01710.2405 ± 0.01710.2405 ± 0.01710.2405 ± 0.01710.2405 ± 0.01710.2405 ± 0.01710.3065 ± 0.01510.2404 ± 0.0170
segment0.2215 ± 0.02550.2216 ± 0.02540.2216 ± 0.02540.2216 ± 0.02540.2216 ± 0.02540.2218 ± 0.02540.2069 ± 0.01430.2166 ± 0.0256
hypothyroid0.1528 ± 0.02620.1448 ± 0.01950.1442 ± 0.01870.1467 ± 0.01870.1445 ± 0.01950.1447 ± 0.01920.1636 ± 0.02770.1517 ± 0.0268
splice-c4.50.1917 ± 0.02480.1670 ± 0.01500.1670 ± 0.01500.1650 ± 0.01530.1638 ± 0.01600.1640 ± 0.01410.1720 ± 0.02110.1944 ± 0.0245
kr-vs-kp0.2358 ± 0.02230.2249 ± 0.02180.2230 ± 0.02140.2244 ± 0.02290.2206 ± 0.02150.2220 ± 0.02000.2658 ± 0.01550.2159 ± 0.0229
abalone0.5635 ± 0.00800.5635 ± 0.00800.5635 ± 0.00800.5635 ± 0.00800.5634 ± 0.00790.5635 ± 0.00800.5576 ± 0.00770.5638 ± 0.0081
spambase0.2377 ± 0.01870.2370 ± 0.01940.2366 ± 0.01960.2374 ± 0.01960.2367 ± 0.01970.2365 ± 0.01960.2282 ± 0.02120.2383 ± 0.0206
phoneme0.5048 ± 0.01330.4789 ± 0.01040.4789 ± 0.01040.4789 ± 0.01040.4789 ± 0.01040.4799 ± 0.01010.4397 ± 0.01230.4385 ± 0.0127
wall-following0.3113 ± 0.01460.2598 ± 0.01210.2598 ± 0.01210.2602 ± 0.01420.2658 ± 0.01490.2661 ± 0.01210.3677 ± 0.01360.2968 ± 0.0145
page-blocks0.2127 ± 0.02110.2095 ± 0.01980.2095 ± 0.01980.2127 ± 0.02110.2141 ± 0.02100.2095 ± 0.01980.2024 ± 0.01100.2168 ± 0.0173
optdigits0.1919 ± 0.01250.1924 ± 0.01270.1924 ± 0.01270.1924 ± 0.01270.1931 ± 0.01330.1922 ± 0.01290.1542 ± 0.02360.1968 ± 0.0162
satellite0.3396 ± 0.01730.3403 ± 0.01650.3403 ± 0.01650.3407 ± 0.01720.3424 ± 0.01770.3406 ± 0.01760.3307 ± 0.01610.3479 ± 0.0195
musk20.2946 ± 0.01440.2961 ± 0.01100.2826 ± 0.01720.2982 ± 0.01150.2998 ± 0.01210.2762 ± 0.01590.3837 ± 0.01150.2847 ± 0.0153
mushrooms0.0083 ± 0.00820.0083 ± 0.00820.0083 ± 0.00820.0083 ± 0.00820.0036 ± 0.00350.0083 ± 0.00820.0112 ± 0.00980.0188 ± 0.0155
thyroid0.4156 ± 0.01030.4156 ± 0.01030.4158 ± 0.01060.4156 ± 0.01030.4156 ± 0.01030.4156 ± 0.01030.4334 ± 0.01090.4193 ± 0.0127
pendigits0.2140 ± 0.01300.2140 ± 0.01300.2140 ± 0.01300.2127 ± 0.01350.2116 ± 0.01330.2138 ± 0.01340.1420 ± 0.00470.2060 ± 0.0120
sign0.4736 ± 0.00580.4736 ± 0.00580.4736 ± 0.00580.4736 ± 0.00580.4736 ± 0.00580.4736 ± 0.00580.4835 ± 0.00420.4911 ± 0.0073
nursery0.2194 ± 0.00680.2194 ± 0.00680.2194 ± 0.00680.2194 ± 0.00680.2194 ± 0.00680.2194 ± 0.00680.2510 ± 0.00470.2193 ± 0.0068
magic0.3437 ± 0.00680.3438 ± 0.00720.3438 ± 0.00720.3437 ± 0.00680.3437 ± 0.00680.3435 ± 0.00720.3505 ± 0.00790.3547 ± 0.0070
letter-recog0.4120 ± 0.00850.4120 ± 0.00850.4120 ± 0.00850.4120 ± 0.00850.4120 ± 0.00850.4120 ± 0.00850.3755 ± 0.00920.4106 ± 0.0101
adult0.3354 ± 0.00400.3322 ± 0.00370.3322 ± 0.00370.3339 ± 0.00350.3353 ± 0.00380.3335 ± 0.00400.3476 ± 0.00350.3345 ± 0.0037
shuttle0.0907 ± 0.00460.0865 ± 0.00460.0865 ± 0.00460.0865 ± 0.00460.0865 ± 0.00460.0865 ± 0.00460.0944 ± 0.00330.1036 ± 0.0037
connect-40.4435 ± 0.00310.4435 ± 0.00310.4435 ± 0.00310.4435 ± 0.00310.4435 ± 0.00310.4435 ± 0.00310.4506 ± 0.00180.4480 ± 0.0022
waveform0.1597 ± 0.00180.1597 ± 0.00180.1597 ± 0.00180.1595 ± 0.00210.1596 ± 0.00180.1593 ± 0.00180.1528 ± 0.00200.1684 ± 0.0051
localization0.6321 ± 0.00140.6321 ± 0.00140.6321 ± 0.00140.6321 ± 0.00140.6321 ± 0.00140.6321 ± 0.00140.6520 ± 0.00100.6501 ± 0.0012
census-income0.2247 ± 0.00250.2134 ± 0.00190.2104 ± 0.00190.2090 ± 0.00180.2119 ± 0.00180.2043 ± 0.00230.2932 ± 0.00200.2219 ± 0.0024
poker-hand0.4987 ± 0.00060.4987 ± 0.00060.4987 ± 0.00060.4987 ± 0.00060.4987 ± 0.00060.4987 ± 0.00060.5392 ± 0.00060.4987 ± 0.0005
donation0.0081 ± 0.00090.0081 ± 0.00090.0081 ± 0.00090.0081 ± 0.00090.0081 ± 0.00090.0081 ± 0.00090.0120 ± 0.00050.0082 ± 0.0009

References

  1. Duda, R.O.; Hart, P.E. Pattern Classification and Scene Analysis; John Wiley and Sons: Hoboken, NJ, USA, 1973.
  2. Zaidi, N.A.; Carman, M.J.; Cerquides, J.; Webb, G.I. Naive-bayes inspired effective pre-conditioner for speeding-up logistic regression. In Proceedings of the IEEE International Conference on Data Mining, Shenzhen, China, 14–17 December 2014; pp. 1097–1102.
  3. Domingos, P.; Pazzani, M. Beyond independence: Conditions for the optimality of the simple bayesian classifier. In Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, 3–6 July 1996; pp. 105–112.
  4. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 1997, 29, 131–163.
  5. Webb, G.I.; Boughton, J.R.; Wang, Z. Not so naive bayes: Aggregating one-dependence estimators. Mach. Learn. 2005, 58, 5–24.
  6. Sahami, M. Learning limited dependence bayesian classifiers. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; ACM: New York, NY, USA, 1996; pp. 335–338.
  7. Wang, L.; Liu, Y.; Mammadov, M.; Sun, M.; Qi, S. Discriminative structure learning of bayesian network classifiers from training dataset and testing instance. Entropy 2019, 21, 489.
  8. Jiang, L.; Zhang, H.; Cai, Z.; Su, J. Learning tree augmented naive bayes for ranking. In Database Systems for Advanced Applications; Springer: Berlin/Heidelberg, Germany, 2005; pp. 688–698.
  9. Alhussan, A.; El Hindi, K. Selectively fine-tuning bayesian network learning algorithm. Int. J. Pattern Recognit. Artif. Intell. 2016, 30, 1651005.
  10. Wang, Z.; Webb, G.I.; Zheng, F. Adjusting dependence relations for semi-lazy tan classifiers. In AI 2003: Advances in Artificial Intelligence; Gedeon, T.D., Fung, L.C.C., Eds.; Springer: Berlin/Heidelberg, Germany, 2003; pp. 453–465.
  11. De Campos, C.P.; Corani, G.; Scanagatta, M.; Cuccu, M.; Zaffalon, M. Learning extended tree augmented naive structures. Int. J. Approx. Reason. 2016, 68, 153–163.
  12. Jiang, L.; Cai, Z.; Wang, D.; Zhang, H. Improving tree augmented naive bayes for class probability estimation. Knowl.-Based Syst. 2012, 26, 239–245.
  13. Cerquides, J.; De Mántaras, R.L. TAN classifiers based on decomposable distributions. Mach. Learn. 2005, 59, 323–354.
  14. Zhang, L.; Jiang, L.; Li, C. A discriminative model selection approach and its application to text classification. Neural Comput. Appl. 2019, 31, 1173–1187.
  15. Langley, P.; Sage, S. Induction of selective bayesian classifiers. In Proceedings of the 10th International Conference on Uncertainty in Artificial Intelligence; Morgan Kaufmann Publishers Inc., 1994; pp. 399–406.
  16. Zheng, F.; Webb, G.I. Finding the right family: Parent and child selection for averaged one-dependence estimators. In European Conference on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2007; pp. 490–501.
  17. Chen, S.; Webb, G.I.; Liu, L.; Ma, X. A novel selective naive bayes algorithm. Knowl.-Based Syst. 2020, 192, 105361.
  18. Martínez, A.M.; Webb, G.I.; Chen, S.; Zaidi, N.A. Scalable learning of bayesian network classifiers. J. Mach. Learn. Res. 2016, 17, 1–35.
  19. Chen, S.; Martínez, A.M.; Webb, G.I. Highly scalable attributes selection for averaged one-dependence estimators. In Proceedings of the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Tainan, Taiwan, 13–16 May 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 86–97.
  20. Chen, S.; Martínez, A.M.; Webb, G.I.; Wang, L. Sample-based attribute selective ande for large data. IEEE Trans. Knowl. Data Eng. 2017, 29, 172–185.
  21. Zaidi, N.A.; Webb, G.I.; Carman, M.J.; Petitjean, F.; Buntine, W.; Hynes, M.; Sterck, H.D. Efficient parameter learning of bayesian network classifiers. Mach. Learn. 2017, 106, 1289–1329.
  22. Brown, G.; Pocock, A.; Zhao, M.J.; Lujan, M. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 2012, 13, 27–66.
  23. Yu, L.; Liu, H. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the Twentieth International Conference on Machine Learning, ICML'03, Washington, DC, USA, 21–24 August 2003; pp. 856–863.
  24. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
  25. Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 1994, 5, 537–550.
  26. Fleuret, F. Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 2004, 5, 1531–1555.
  27. Chen, S.; Martínez, A.M.; Webb, G.I.; Wang, L. Selective AnDE for large data learning: A low-bias memory constrained approach. Knowl. Inf. Syst. 2017, 50, 475–503.
  28. Yang, H.H.; Moody, J. Data visualization and feature selection: New algorithms for nongaussian data. Adv. Neural Inf. Process. Syst. 2000, 12, 687–693.
  29. Meyer, P.E.; Schretter, C.; Bontempi, G. Information-theoretic feature selection in microarray data using variable complementarity. IEEE J. Sel. Top. Signal Process. 2008, 2, 261–274.
  30. Kohavi, R. The power of decision tables. In European Conference on Machine Learning; Lavrac, N., Wrobel, S., Eds.; Springer: Berlin/Heidelberg, Germany, 1995; pp. 174–189.
  31. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: http://archive.ics.uci.edu/ml (accessed on 5 September 2021).
  32. Witten, I.H.; Frank, E.; Trigg, L.E.; Hall, M.A.; Holmes, G.; Cunningham, S.J. Weka: Practical Machine Learning Tools and Techniques with JAVA Implementations. Available online: https://researchcommons.waikato.ac.nz/handle/10289/1040 (accessed on 9 October 2021).
  33. Cestnik, B. Estimating probabilities: A crucial task in machine learning. In Proceedings of the European Conference on Artificial Intelligence, Stockholm, Sweden, 1990; Volume 90, pp. 147–149.
  34. Flores, M.J.; Gámez, J.A.; Martínez, A.M.; Puerta, J.M. Handling numeric attributes when comparing bayesian network classifiers: Does the discretization method matter? Appl. Intell. 2011, 34, 372–385.
  35. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
Figure 1. Structure of NB and TAN.
Figure 2. Scatter plot of STAN MRMR to TAN in terms of ZOL.
Figure 3. Averaged (a) training time and (b) classification time of all algorithms on 70 datasets (seconds).
Table 1. Data sets.
No. | Name | Inst | Att | Class | No. | Name | Inst | Att | Class
1 | contact-lenses | 24 | 4 | 3 | 36 | tic-tac-toe | 958 | 9 | 2
2 | lung-cancer | 32 | 56 | 3 | 37 | vowel | 990 | 13 | 11
3 | labor-negotiations | 57 | 16 | 2 | 38 | german | 1000 | 20 | 2
4 | post-operative | 90 | 8 | 3 | 39 | led | 1000 | 7 | 10
5 | zoo | 101 | 16 | 7 | 40 | contraceptive-mc | 1473 | 9 | 3
6 | promoters | 106 | 57 | 2 | 41 | yeast | 1484 | 8 | 10
7 | echocardiogram | 131 | 6 | 2 | 42 | volcanoes | 1520 | 3 | 4
8 | lymphography | 148 | 18 | 4 | 43 | car | 1728 | 6 | 4
9 | iris | 150 | 4 | 3 | 44 | segment | 2310 | 19 | 7
10 | teaching-ae | 151 | 5 | 3 | 45 | hypothyroid | 3163 | 25 | 2
11 | hepatitis | 155 | 19 | 2 | 46 | splice-c4.5 | 3177 | 60 | 3
12 | wine | 178 | 13 | 3 | 47 | kr-vs-kp | 3196 | 36 | 2
13 | autos | 205 | 25 | 7 | 48 | abalone | 4177 | 8 | 3
14 | sonar | 208 | 60 | 2 | 49 | spambase | 4601 | 57 | 2
15 | glass-id | 214 | 9 | 3 | 50 | phoneme | 5438 | 7 | 50
16 | new-thyroid | 215 | 5 | 3 | 51 | wall-following | 5456 | 24 | 4
17 | audio | 226 | 69 | 24 | 52 | page-blocks | 5473 | 10 | 5
18 | hungarian | 294 | 13 | 2 | 53 | optdigits | 5620 | 64 | 10
19 | heart-disease-c | 303 | 13 | 2 | 54 | satellite | 6435 | 36 | 6
20 | haberman | 306 | 3 | 2 | 55 | musk2 | 6598 | 166 | 2
21 | primary-tumor | 339 | 17 | 22 | 56 | mushrooms | 8124 | 22 | 2
22 | ionosphere | 351 | 34 | 2 | 57 | thyroid | 9169 | 29 | 20
23 | dermatology | 366 | 34 | 6 | 58 | pendigits | 10,992 | 16 | 10
24 | horse-colic | 368 | 21 | 2 | 59 | sign | 12,546 | 8 | 3
25 | house-votes-84 | 435 | 16 | 2 | 60 | nursery | 12,960 | 8 | 5
26 | cylinder-bands | 540 | 39 | 2 | 61 | magic | 19,020 | 10 | 2
27 | chess | 551 | 39 | 2 | 62 | letter-recog | 20,000 | 16 | 26
28 | syncon | 600 | 60 | 6 | 63 | adult | 48,842 | 14 | 2
29 | balance-scale | 625 | 4 | 3 | 64 | shuttle | 58,000 | 9 | 7
30 | soybean | 683 | 35 | 19 | 65 | connect-4 | 67,557 | 42 | 3
31 | credit-a | 690 | 15 | 2 | 66 | waveform | 100,000 | 21 | 3
32 | breast-cancer-w | 699 | 9 | 2 | 67 | localization | 164,860 | 5 | 11
33 | pima-ind-diabetes | 768 | 8 | 2 | 68 | census-income | 299,285 | 41 | 2
34 | vehicle | 846 | 18 | 4 | 69 | poker-hand | 1,025,010 | 10 | 10
35 | anneal | 898 | 38 | 6 | 70 | donation | 5,749,132 | 11 | 2
Table 2. Win/draw/loss records of different versions of STAN vs. TAN in terms of ZOL and RMSE on 70 data sets.
STAN_MI vs. TAN: ZOL 26/22/22 (p = 0.3601); RMSE 28/21/21 (p = 0.2352)
STAN_SU vs. TAN: ZOL 29/20/21 (p = 0.2015); RMSE 32/19/19 (p = 0.0740)
STAN_JMI vs. TAN: ZOL 22/29/19 (p = 0.4050); RMSE 34/20/16 (p = 0.0207)
STAN_CMIM vs. TAN: ZOL 30/21/19 (p = 0.1142); RMSE 34/18/18 (p = 0.0361)
STAN_MRMR vs. TAN: ZOL 34/22/14 (p = 0.0112); RMSE 38/17/15 (p = 0.0038)
Table 3. Win/draw/loss records of STANMRMR vs. AODE and KDB 1 in terms of ZOL and RMSE on 70 data sets.
STAN_MRMR vs. AODE: ZOL 33/1/36 (p = 0.4050); RMSE 31/0/39 (p = 0.2015)
STAN_MRMR vs. KDB_1: ZOL 40/7/23 (p = 0.0266); RMSE 45/1/24 (p = 0.0077)
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
