Learning a Flexible K-Dependence Bayesian Classifier from the Chain Rule of Joint Probability Distribution

As one of the most common types of graphical models, the Bayesian classifier has become an extremely popular approach to dealing with uncertainty and complexity. The scoring functions once proposed and widely used for a Bayesian network are not appropriate for a Bayesian classifier, in which class variable C is considered as a distinguished one. In this paper, we aim to clarify the working mechanism of Bayesian classifiers from the perspective of the chain rule of joint probability distribution. By establishing the mapping relationship between conditional probability distribution and mutual information, a new scoring function, Sum_MI , is derived and applied to evaluate the rationality of the Bayesian classifiers. To achieve global optimization and high dependence representation, the proposed learning algorithm, the flexible K-dependence Bayesian (FKDB) classifier, applies greedy search to extract more information from the K-dependence network structure. Meanwhile, during the learning procedure, the optimal attribute order is determined dynamically, rather than rigidly. In the experimental study, functional dependency analysis is used to improve model interpretability when the structure complexity is restricted.


Introduction
Graphical models [1,2] provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering: uncertainty and complexity.The two most common types of graphical models are directed graphical models (also called Bayesian networks) [3,4] and undirected graphical models (also called Markov networks) [5].A Bayesian network (BN) is a type of statistical model consisting of a set of conditional probability distributions and a directed acyclic graph (DAG), in which the nodes denote a set of random variables and arcs describing conditional (in)dependence relationship between them.Therefore, BNs can be used to predict the consequences of intervention.The conditional dependencies in the graph are often estimated using known statistical and computational methods.
Supervised classification is an outstanding task in data analysis and pattern recognition.It requires the construction of a classifier, that is a function that assigns a class label to instances described by a set of variables.There are numerous classifier paradigms, among which Bayesian classifiers [6][7][8][9][10][11], based on probabilistic graphical models (PGMs) [2], are well known and very effective in domains with uncertainty.Given class variable C and a set of attributes X = {X 1 , X 2 , • • • , X n }, the aim of supervised learning is to predict from a training set the class of a testing instance x = {x 1 , • • • , x n }, where x i is the value of the i-th attribute.We wish to precisely estimate the conditional probability of P (c|x) by selecting arg max C P (c|x), where P (•) is a probability distribution function and c ∈ {c 1 , • • • , c k } are the k classes.By applying Bayes' theorem, the classification process can be done in the following way with the BNs: This kind of classifier is known as generative, and it forms the most common approach in the BN literature for classification [6][7][8][9][10][11].
Many scoring functions, e.g., maximum likelihood (ML) [12], Bayesian information criterion (BIC) [13], minimum description length (MDL) [14] and Akaike information criterion (AIC) [15], were proposed to evaluate whether the learned BN best fits the dataset.For BN, all attributes (including class variable) are treated equally, while for Bayesian classifiers, the class variable is treated as a distinguished one.Additionally, these scoring functions do not work well for Bayesian classifiers [9].In this paper, we limit our attention to a class of network structures, restricted Bayesian classifiers, which require that the class variable C be a parent of every attribute and no attribute be the parent of C. P (c, x) can be rewritten in terms of the product of a set of conditional distributions, which is also known as the chain rule of joint probability distribution.
where P a i denotes a set of parent attributes of the node X i , except the class variable, i.e., P a i = {X 1 , • • • , X i−1 }.Each node X i has a conditional probability distribution (CPD) representing P (x i |P a i , c).If the Bayesian classifier can be constructed based on Equation ( 2), the corresponding model is "optimal", since all conditional dependencies implicated in the joint probability distribution are fully described, and the main term determining the classification will take every attribute into account.
From Equation (2), the order of attributes {X 1 , • • • , X n } is fixed in such a way that an arc between two attributes {X l , X h } always goes from the lower ordered attribute X l to the higher ordered attribute X h .That is, the network can only contain arcs X l → X h where l < h.The first few lower ordered attributes are more important than the higher ordered ones, because X l may be possible parent attributes of X h , but X h cannot be possible parent attributes of X l .One attribute may be dependent on several other attributes, and this dependence relationship will propagate to the whole attribute set.A slight move in one part may affect the whole situation.Finding an optimal order requires searching the space of all possible network structures for one that best describes the data.Without restrictive assumptions, learning Bayesian networks from data is NP-hard [16].Because of the limitation of time and space complexity, only a limited number of conditional probabilities can be encoded in the network.Additionally, precise estimation of P (x i |P a i , c) is non-trivial when given too many parent attributes.One of the most important features of BNs is the fact that they provide an elegant mathematical structure for modeling complicated relationships, while keeping a relatively simple visualization of these relationships.If the network can capture all or at least the most important dependencies that exist in a database, we would expect a classifier to achieve optimal prediction accuracy.If the structure complexity is restricted to some extent, higher dependence cannot be represented.The restricted Bayesian classifier family can offer different tradeoffs between structure complexity and prediction performance.The simplest model is the naive Bayes [6,7], where C is the parent of all predictive attributes, and there are no dependence relationships among them.On the basis of this, we can progressively increase the level of dependence, giving rise to a extension family of naive Bayes models, e.g., tree-augmented naive Bayes (TAN) [8] or K-dependence Bayesian network (KDB) [10,11].
Different Bayesian classifiers correspond to different factorizations of P (x|c).However, few studies have proposed to learn Bayesian classifiers from the perspective of the chain rule.This paper first establishes the mapping relationship between conditional probability distribution and mutual information, then proposes to evaluate the rationality of the Bayesian classifier from the perspective of information quantity.To build an optimal Bayesian classifier, the key point is to achieve the largest sum of mutual information that corresponds to the largest a posteriori probability.The working mechanisms of three classical restricted Bayesian classifiers, i.e., NB, TAN and KDB, are analyzed and evaluated from the perspectives of the chain rule and information quantity implicated in the graphical structure.On the basis of this, the proposed learning algorithm, the flexible K-dependence Bayesian (FKDB) classifier, applies greedy search of the mutual information space to represent high-dependence relationships.The optimal attribute order is determined dynamically during the learning procedure.The experimental results on the UCImachine learning repository [17] validate the rationality of the FKDB classifier from the viewpoints of zero-one loss and information quantity.

The Mapping Relationship between Probability Distribution and Mutual Information
Information theory is the theoretical foundation of modern digital communication and was invented in the 1940s by Claude E. Shannon.Though Shannon was principally concerned with the problem of electronic communications, the theory has much broader applicability.Many commonly-used measures are based on the entropy of information theory and used in a variety of classification algorithms [18].Definition 1. [19].The entropy of an attribute (or random variable) is a function that attempts to characterize its unpredictability.When given a discrete random variable X with any possible value x and probability distribution function P (•), entropy is defined as follows, Definition 2. [19].Conditional entropy measures the amount of information needed to describe attribute X when another attribute Y is observed.Given discrete random variables X and Y and their possible value x, y, conditional entropy is defined as follows, Definition 3. [19].The mutual information I(X; Y ) of two random variables is a measure of the variables' mutual dependence and is defined as: Each part of the right side of Equation (2), i.e., P (x i |P a i , c), corresponds to a local structure of the restricted Bayesian classifier.Additionally, there should exist a strong relationship between X i and {P a i , C}, which can be measured by I(X i ; P a i , C).
For example, let us consider the simplest situation in which the attribute set is composed of just two attributes {X 1 , X 2 }.The joint probability distribution is: Figure 1a shows the corresponding "optimal" network structure, which is a triangle, and also the basic local structure of restricted Bayesian classifier.Similar to the learning procedure of TAN and KDB, we also use I(X i ; X j |C) to measure the weight of the arc between attributes X i and X j .Besides, we use I(X i ; C) to measure the weight of the arc between class variable C and attribute X i .The arcs in Figure 1a are divided into two groups by their final targets, i.e., the arc pointing to X 1 (as Figure 1b shows) and arcs pointing to X 2 (as Figure 1c shows).Suppose there exists information flow in the network, then the information quantity provided to X 1 and X 2 will be I(X 1 ; C) and I(X 2 ; C) + I(X 1 ; X 2 |C) = I(X 2 ; X 1 , C), respectively.
Thus, the mapping relationships between conditional probability distribution and mutual information are: and To ensure the robustness of entire Bayesian structure, the sum of mutual information I(X i ; P a i , C) should be maximized.Scoring function Sum_M I is proposed to measure the size of information quantity implicated in the Bayesian classifier and defined as follows,

Restricted Bayesian Classifier Analysis
In the following discussion, we will analyze and summarize the working mechanisms of some popular Bayesian classifiers to clarify their rationality from the viewpoints of information theory and probability theory.
NB: NB simplified the estimation of P (x|c) by conditional independence assumption: Then, the following equation is often calculated in practice, rather than Equation (2).
As Figure 2 shows, the NB classifier can be considered as a BN with a fixed network structure, where every attribute X i has the class variable as its only parent attribute, i.e., P a i will be restricted to being null.NB can only represent a zero-dependence relationship between predictive attributes.There exists no information flow, but that between predictive attributes and the class variable.TAN: The disadvantage of the NB classifier is that it assumes that all attributes are conditionally independent given the class, while this often is not a realistic assumption.As Figure 3 shows, TAN introduces more dependencies by allowing each attribute to have an extra parent from the other attributes, i.e., P a i can contain at most one attribute.TAN is based on the Chow-Liu algorithm [20] and can achieve global optimization by building a maximal spanning tree (MST).This algorithm is quadratic in the number of attributes.As a one-dependence Bayesian classifier, TAN is optimal.Different attribute orders provide the same undirected Bayesian network, which is the basis of TAN.When a different attribute is selected as the root node, the direction of some arcs may reverse.For example, Figure 3a,b represents the same dependence relationship while X 1 and X 4 are selected as the root nodes, respectively.Additionally, corresponding chain rules are described as: and: Sum_M I is the same for Figure 3a, b.That is the main reason why TAN performs almost the same, while the causal relationships implicated in the network structure differ.To achieve diversity, Ma and Shi [21] proposed the RTAN algorithm, the output of which is TAN ensembles.Each sub-classifier is trained with different training subsets sampled from the original instances, and the final decision is generated by a majority of votes.KDB: In KDB, the probability of each attribute value is conditioned by the class variable and, at most, K predictive attributes.The KDB algorithm adopts a greedy strategy in order to identify the graphical structure of the resulting classifier.KDB sets the order of attributes by calculating mutual information and achieves the weights of the relationship between attributes by calculating conditional mutual information.For example, given five predictive attributes {X 1 , X 2 , X 3 , X 4 , X 5 } and supposing that I(X 1 ; C) > I(X 2 ; C) > I(X 3 ; C) > I(X 4 ; C) > I(X 5 ; C), the attribute order is {X 1 , X 2 , X 3 , X 4 , X 5 } by comparing mutual information.
From the chain rule of joint probability distribution, there will be: Obviously, with more attributes to be considered as possible parent attributes, more causal relationships will be represented, and Sum_M I will be larger correspondingly.However, because of the time and space complexity overhead, only a limited number of attributes will be considered.For KDB, each predictive attribute can select at most K attributes as parent attributes.Figure 4 gives an example to show corresponding KDB models when given different K values.In summary, from the viewpoint of probability theory, all of these algorithms can be regarded as different variations of the chain rule.Different algorithms tried to get different levels of tradeoff between computational complexity and classification accuracy.One advantage of NB is avoiding model selection, because selecting between alternative models can be expected to increase variance and allow a learning system to overfit the training data.However, the conditional independence assumption makes NB neglect the conditional mutual information between predictive attributes.Thus, NB is zero-dependence based and performs the worst among the three algorithms.TAN proposes to achieve global optimization by building MST to weigh the one-dependence causal relationships, i.e., TAN can only have at most one parent, except the class variable.Thus, only a limited number of dependencies or a limited information quantity can be represented in TAN.KDB allows for higher dependence to represent much more complicated relationships between attributes and can have at most K parent attributes.However, KDB is guided by a rigid ordering obtained by using the mutual information between the predictive attribute and the class variable.Mutual information does not consider the interaction between predictive attributes, and this marginal knowledge may result in sub-optimal order.Suppose K = 2 and I(C; X 1 ) > I(C; X 2 ) > I(C; X 3 ) > I(C; X 4 ) > I(C; X 5 ); X 3 will use X 2 as the parent attribute, even if they are independent of each other.When K = 1, KDB performs poorer than TAN, because it can only achieve a local optimal network structure.Besides, as described in Equation (9), I(X i ; X j |C) can only partially measure the dependence between X i and {X j , C}.

The Flexible K-Dependence Bayesian Classifier
To retain the privileges of TAN and KDB, i.e., global optimization and higher dependence representation, we presently give an algorithm, i.e., FKDB, which also allows one to construct K-dependence classifiers along the attribute dependence spectrum.To achieve the optimal attribute order, FKDB considers not only the dependence between the predictive attribute and the class variable, but also the dependencies among predictive attributes.As the learning procedure proceeds, the attributes will be put into order one by one.Thus, the order is determined dynamically.
Let S represent the attribute set, and predictive attributes will be added to S in a sequential order.The newly-added attribute X j must select parent attributes from S. To achieve global optimization, X j should have the strongest relationship with its parent attributes on average, i.e., the largest mutual information should be between X j and {P a j , C}.Once selected, X j will be added to S as possible parent attributes of the following attribute.FKDB applies greedy search of the mutual information space to find an optimal ordering of all of the attributes, which may help to fully describe the interaction between attributes.
Algorithm 1 is described as follows: Algorithm 1 Algorithm FKDB.
Input: a database of pre-classified instances, DB, and the K value for the maximum allowable degree of attribute dependence.
Output: a K-dependence Bayesian classifiers with conditional probability tables determined from the input data.
1. Let the used attribute list, S, be empty.
2. Select attribute X root that corresponds to the largest value I(X i ; C), and add it to S.
3. Add an arc from C to X root .4. Repeat until S includes all domain attributes 5.
• Select attribute X i , which is not in S and corresponds to the largest sum value: where X j ∈ S and q = min(|S|; K).

6.
• Add a node to BN representing X i .7.
• Add an arc from C to X i in BN .8. • Add q arcs from q distinct attributes X j in S to X i .9.
• Add X i to S. 10.Compute the conditional probability tables inferred by the structure of BN using counts from DB, and output BN .
FKDB requires that at most K parent attributes can be selected for each new attribute.To make the working mechanism of FKDB clear, we set K = 2 in the following discussion.Because I(X i ; X j |C) = I(X j ; X i |C), we describe the relationships between attributes using an upper triangular matrix of conditional mutual information.The format and one example with five predictive attributes {X 0 , X 1 , X 2 , X 3 , X 4 } are shown in Figure 5a,b, respectively.Suppose that I(X 0 ; C) > I(X 3 ; C) > I(X 2 ; C) > I(X 4 ; C) > I(X 1 ; C), X 0 is added into S as the root node.X 3 = arg max (I(X i ; C) + I(X 0 ; X i |C)) (X i / ∈ S); thus, X 3 is added to S; and S = {X 0 , X 3 }.
∈ S, X j , X k ∈ S); thus, X 4 is added into S, and X 1 will be the last one in the order.Thus, the whole attribute order and causal relationship can be achieved simultaneously.The final network structures is illustrated in Figure 6.Additionally, the order number of predictive attributes is also annotated.
Optimal attribute order and high dependence representation are two key points for learning KDB.Note that KDB achieves these two goals in different steps.KDB first computes and compares mutual information to get an attribute order before structured learning.Then, during the structured learning procedure, each predictive attribute X i can select at most K attributes as parent attributes by comparing conditional mutual information (CMI).Because these two steps are separate, the attribute order cannot ensure that the first K strongest dependencies between X i and other attributes should be represented.On the other hand, to achieve the optimal attribute order, FKDB considers not only the dependence between predictive attribute and class variable, but also the dependencies among predictive attributes.As the learning procedure proceeds, the attributes will be put into order one by one.Thus, the order is determined dynamically.That is why the classifier is named "flexible".
We will further compare KDB and FKDB with an example.Suppose that for KDB, the attribute order is {X 1 , X 2 , X 3 , X 4 }; Figure 7 shows the corresponding network structure of KDB when K = 2 corresponds to the CMI matrix shown in Figure 7b, and the learning steps are annotated.The weight of dependencies between attributes are depicted in Figure 7b.Although the dependence relationship between X 2 and X 1 is the weakest, X 1 is selected as the parent attribute of X 2 ; whereas the strong dependence between X 4 and X 1 is neglected.Suppose that for FKDB, the mutual information I(X i ; C) is the same for all predictive attributes.Figure 8a shows the network structure of FKDB corresponding to the CMI matrix shown in Figure 8b, and learning steps are also annotated.The weights of causal relationships are depicted in Figure 8b, from which we can see that all strong causal relationships are implicated in the final network structure.

Experimental Study
In order to verify the efficiency and effectiveness of the proposed FKDB (K = 2), we conduct experiments on 45 datasets from the UCI machine learning repository.Table 1 summarizes the characteristics of each dataset, including the numbers of instances, attributes and classes.Missing values for qualitative attributes are replaced with modes, and those for quantitative attributes are replaced with means from the training data.For each benchmark dataset, numeric attributes are discretized using MDL discretization [22].The following algorithms are compared: • NB, standard naive Bayes.
• KDB (K = 2), standard K-dependence Bayesian classifier.All algorithms were coded in MATLAB 7.0 (MathWorks, Natick, MA, USA) on a Pentium 2.93 GHz/1 G RAM computer.Base probability estimates P (c), P (c, x i ) and P (c, x i , x j ) were smoothed using the Laplace estimate, which can be described as follows: where F (•) is the frequency with which a combination of terms appears in the training data, M is the number of training instances for which the class value is known, M i is the number of training instances for which both the class and attribute X i are known and M ij is the number of training instances for which all of the class and attributes X i and X j are known.m is the number of attribute values of class C; m i is the number of attribute value combinations of C and X i ; and m ij is the number of attribute value combinations of C, X j and X i .
In the following experimental study, functional dependencies (FDs) [24] are used to detect redundant attribute values and to improve model interpretability.To maintain the K-dependence restriction, P (x i |x 1 , • • • , x K , c) will be used as an approximate estimation of ).If there exists FD:x 2 → x 1 , then x 2 functionally determines x 1 and x 1 is extraneous for classification.According to the augmentation rule of probability [24], Correspondingly, in practice, FKDB uses P (x i |x 2 , • • • , x K+1 , c) instead, which still maintains Kdependence restriction, whereas it represents more causal relationships.
FDs use the following criterion: is the number of training cases with both values and l is a user-specified minimum frequency.A large number of deterministic attributes, which are on the left side of the FD, will increase the risk of incorrect inference and, at the same time, needs more computer memory to store credible FDs.Consequently, only the one-one FDs are selected in our current work.Besides, as no formal method has been used to select an appropriate value for l, we use the setting that l = 100, which is achieved from empirical studies.Kohavi and Wolpert [25] presented a powerful tool from sampling theory statistics for analyzing supervised learning scenarios.Suppose c and ĉ are the true class label and that generated by classifier A, respectively, for the i-th testing sample; the zero-one loss is defined as: where δ(c, ĉ) = 1 if ĉ = c and 0 otherwise.Table 2 presents for each dataset the zero-one loss and the standard deviation, which are estimated by 10-fold cross-validation to give an accurate estimation of the average performance of an algorithm.Statistically, a win/draw/loss record (W/D/L) is calculated for each pair of competitors A and B with regard to a performance measure M .The record represents the number of datasets in which A respectively beats, loses to or ties with B on M .Small improvements may be attributable to chance.Runs with the various algorithms are carried out on the same training sets and evaluated on the same test sets.In particular, the cross-validation folds are the same for all of the experiments on each dataset.Finally, related algorithms are compared via a one-tailed binomial sign test with a 95 percent confidence level.Table 3 shows the W/D/L records corresponding to zero-one loss.When dependence complexity increases, the performance of TAN gets better than that of NB.RTAN investigates the diversity of TAN by the K statistic.The bagging mechanism helps RTAN to achieve superior performance to TAN.FKDB performs undoubtedly the best.However, surprisingly, as a 2-dependence Bayesian classifier, the advantage of KDB is not obvious when compared to 1-dependence classifiers, and it even performs poorer than RTAN in general.However, when the data size increases to a certain extent, e.g., 4177 (the size of dataset "Abalone"), as Table 4 shows, the prediction performance of all restricted classifiers can be evaluated from the perspective of the dependence level.Two-dependence Bayesian classifiers, e.g., FKDB and KDB, perform the best.The one-dependence Bayesian classifier, e.g., TAN, performs better.Additionally, 0-dependence Bayesian classifiers, e.g., NB, perform the worst.Friedman proposed a non-parametric measure [28], the Friedman test, which compares the ranks of the algorithms for each dataset separately.The null-hypothesis is that all of the algorithms are equivalent, and there is no difference in average ranks.We can compute the Friedman statistic: To compare the relative performance of classifiers A and B, the zero-one loss ratio (ZLR) is proposed in this paper and defined as ZLR(A/B) = ξ i (A)/ ξ i (B).Figures 9-12 compare FKDB with NB, TAN, RTAN and KDB, respectively.Each figure is divided into four parts by comparing data size and ZLR.That is, the data size is greater than 4177 while ZLR ≥ 1 or ZLR < 1, and the data size is smaller than 4177 while ZLR ≥ 1 or ZLR < 1.In different parts, different symbols are used to represent different situations.When dealing with small datasets (data size <4177), the performance superiority of FKDB is not obvious when compared to the 0-dependence (NB) or 1-dependence Bayesian classifiers (TAN).For some datasets, e.g., "Lung Cancer" and "Hungarian", NB even performs the best.Because precise estimation of conditional mutual information is determined by probability estimation, which is affected greatly by data size, the robustness of network structure will be affected negatively by imprecise probability estimation.For example, for dataset "Lung Cancer" with 32 instances and 56 attributes, it is almost impossible to ensure that the basic causal relationships learned are of a high confidence level.That is why a simple structure can perform better than a complicated one.Since each submodel of RTAN can represent only a small proportion of all dependencies, the complementarity of the bagging mechanism works and helps to improve the performance of TAN.KDB shows equivalent performance to FKDB.
The comparison results of Inf o_ratio(F KDB/KDB) are shown in Figure 13, from which the superiority of FKDB in extracting information is much more obvious when dealing with large datasets.The increased information quantity does help to decrease zero-one loss.However, note that the growth rate of information quantity is not in proportion to the descent rate of zero-one loss.For some datasets, e.g., "Localization" and "Poker-hand", KDB and FKDB achieve the same Sum_M I, while their zero-one losses are different.The same Sum_M I corresponds to the same causal relationships.
The network structures learned from KDB and FKDB are similar, because the major dependencies are all implicated, except that the directions of some arcs are different.Dependence "X 3 − X 4 " can be represented by conditional probability distribution P (x 3 |x 4 , c) or P (x 4 |x 3 , c).Just as we clarified in Section 3, although the basic structures described in Figure 3a,b are the same, the corresponding joint probability distributions represented by Equations ( 13) and ( 14) are different.Since ZLR ≈ 1 for these two datasets, the difference in zero-one loss can be explained from the perspective of probability distribution.To prove the relevance of information quantity to zero-one loss, Figure 14 is divided into four zones.Similar to the comparison of Equations ( 13) and ( 14), the same information quantity does not certainly correspond to the same Bayesian network and, then, the same zero-one loss.Zone A contains 27 datasets and describes the situation that ZLR<1 and Inf o_ratio ≥ 1.The performance superiority of FKDB over KDB can be attributed to mining more information or correct conditional dependence representation.Zone D contains 6 datasets and describes the situation that ZLR>1 and Inf o_ratio ≤ 1.The performance inferiority of FKDB over KDB can be attributed to mining less information.Thus, the information quantity is strongly correlated to zero-one loss on 73.3% ( 27+6 45 ) of all datasets.On the other hand, although FKDB has proven its effectiveness from the perspective of W/D/L results and the Friedman test, the information quantity is a very important score, but not the only one.

Conclusions
BNs can graphically describe conditional dependence between attributes, and they have been previously demonstrated to be computationally efficient approaches to further reducing zero-one loss.Conditional mutual information is commonly applied to weigh the dependencies between attributes, while it cannot measure the information quantity provided to predictive attributes.On the basis of analyzing and summarizing the working mechanisms of three popular Bayesian classifiers from the viewpoints of information theory and probability theory, this paper proposed to mine reliable dependencies by maximizing the sum of mutual information.The experimental results validate the mapping relationship between conditional probability distribution and mutual information.

Figure 1 .
Figure 1.Arcs grouped according to their final targets.

Figure 2 .
Figure 2. The zero-dependence relationship between the attributes of the NB model.

Figure 3 .
Figure 3.The one-dependence relationship between the attributes of the tree-augmented naive Bayes (TAN) model.

Figure 4 .
Figure 4.The K-dependence relationship between attributes inferred from the Kdependence Bayesian (KDB) classifier.

Figure 5 .
Figure 5.The upper triangular matrix of conditional mutual information between attributes and one example.

Figure 6 .
Figure 6.The final network structure of flexible K-dependence Bayesian (FKDB).Additionally, the order number of predictive attributes is also annotated.

Figure 7 .
Figure 7.The K-dependence relationships among attributes inferred from the KDB learning algorithm are shown (a), and the learning steps are annotated.The unused causal relationship (b) is annotated in pink.

Figure 8 .
Figure 8.The K-dependency relationships among attributes inferred from the FKBN learning algorithm are shown (a), and the learning steps are annotated.The unused causal relationship (b) is annotated in pink.

Figure 9 .
Figure 9.The experimental results of zero-one loss ratio ZLR(F KDB/N B).

Figure 10 .
Figure 10.The experimental results of zero-one loss ratio ZLR(F KDB/T AN ).

Figure 11 .
Figure 11.The experimental results of zero-one loss ratio ZLR(F KDB/RT AN ).

Table 2 .
Experimental results of zero-one loss.

Table 3 .
Win/draw/loss record (W/D/L) comparison results of zero-one loss on all datasets.

Table 4 .
Win/draw/loss record (W/D/L) comparison results of zero-one loss when the data size > 4177.