How to Mine Information from Each Instance to Extract an Abbreviated and Credible Logical Rule

Decision trees are particularly promising in symbolic representation and reasoning due to their comprehensible nature, which resembles the hierarchical process of human decision making. However, their drawbacks, caused by the single-tree structure, cannot be ignored. A rigid decision path may cause the majority class to overwhelm the other classes when dealing with imbalanced data sets, and pruning removes not only superfluous nodes, but also subtrees. The proposed learning algorithm, flexible hybrid decision forest (FHDF), mines the information implicit in each instance to form logical rules on the basis of a chain rule of local mutual information, and then combines these rules into different decision tree structures and, finally, decision forests. The most credible decision path from the decision forest can be selected to make a prediction. Furthermore, functional dependencies (FDs), which are extracted from the whole data set based on association rule analysis, perform embedded attribute selection to remove nodes rather than subtrees, thus helping to achieve different levels of knowledge representation and improve model comprehension in the framework of semi-supervised learning. Naive Bayes replaces the leaf nodes at the bottom of the tree hierarchy, where the conditional independence assumption may hold. This technique reduces the potential for overfitting and overtraining and improves the prediction quality and generalization. Experimental results on UCI data sets demonstrate the efficacy of the proposed approach. Entropy 2014, 16 5243


Introduction
The rapid development of information and web technology has made a significant amount of data readily available for knowledge discovery. Growing interest in symbolic representation and reasoning highlights knowledge discovery as a clearly identifiable and technically rich subfield of artificial intelligence. Tools used to investigate big data should offer easy-to-understand models and predictive decisions. Decision trees are particularly promising in this regard, due to their comprehensible nature, which resembles the hierarchical process of human decision making. To split nodes while growing the tree, the vast majority of oblique and univariate decision-tree induction algorithms employ different impurity-based measures [1,2]. Despite advantages such as the ability to explain the decision process and low computational cost, their drawbacks, caused by the single-tree structure, cannot be ignored.
Commonly, to classify an instance, we just follow one path from the unique root to a leaf, examining every decision made along the way. This approach usually works well. In reality, however, classifiers frequently have to deal with imbalanced data sets, where there are many more instances of a certain class than of another. In such cases, a rigid decision path may cause the majority class to overwhelm the other classes, so that the minority class is ignored. The decision tree corresponds to a locally optimal solution for the training data, but it does not give a globally optimal solution for the whole data set. Besides, low-quality training data may lead to overfitted or fragile classifiers. Thus, eliminating redundant attributes is generally used as a data preprocessing technique to improve the quality of the data. However, pruning removes not only superfluous nodes, but also the subtrees that have those nodes as subroots. If some key nodes are removed by mistake, the negative effect caused by the absent nodes may propagate and aggravate the situation. To ensure a credible and robust decision path, according to Occam's razor, only a limited number of attributes can be used for prediction. Thus, just a portion of the information implicit in the training data will be utilized.
To resemble the hierarchical process of human decision making and to make the final model much more flexible, the core idea of this paper is to build a decision forest rather than a single tree to model the training data. The theoretical foundation of the traditional decision tree algorithm is the chain rule of joint mutual information, which applies mutual and conditional mutual information to describe the direct and indirect relationships between predictive attributes and the class label. Correspondingly, the proposed learning algorithm, flexible hybrid decision forest (FHDF), applies the chain rule of local mutual information to form a logical rule for each instance. Rules with the same root form a decision tree structure, and the trees together form a decision forest. Thus, we can choose the most credible path from the forest to classify an instance. To reduce the potential for overfitting and overtraining, functional dependencies (FDs) [3,4] are employed as prior knowledge to remove redundant attributes rather than subtrees. Additionally, different sets of redundant attributes may be found and removed for different test instances. Because FDs are unrelated to the class label, they can be extracted from the whole data set, rather than only from the training data, on the basis of association rule analysis [5]; thus, FHDF works in the framework of semi-supervised learning. Naive Bayes (NB) [6,7], an important classifier in data mining that is applied to many real-world classification problems because of its high classification performance, replaces the leaf nodes at the bottom of the tree hierarchy, where the conditional independence assumption may hold. This technique ensures that all attributes will be utilized for prediction, thus improving the prediction quality and generalization.
The rest of this paper is organized as follows. Section 2 first proposes the theoretical basis of the decision forest, namely the chain rule of local mutual information, then clarifies the rationality of FDs and introduces related work on NB. Section 3 describes the learning procedure of FHDF. Section 4 compares various approaches on data sets from the UCI repository. Finally, Section 5 presents possible future work.

Information Theory
In the 1940s, Claude E. Shannon introduced information theory, the theoretical basis of modern digital communication. Although Shannon was principally concerned with the problem of electronic communications, the theory has much broader applicability. Many commonly used measures are based on the entropy of information theory and are used in a variety of classification algorithms.
In the following discussion, Greek letters ($\alpha$, $\beta$, $\gamma$) denote sets of attributes. Lower-case letters denote specific values taken by the corresponding attributes (for instance, $x_i$ represents the event that $X_i = x_i$).
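Definition 1. Entropy is a measure of the uncertainty of random variable $C$:

$$H(C) = -\sum_{c \in C} P(c) \log P(c),$$

where $P(c)$ is the marginal probability distribution function of $C$.

Definition 2. Mutual information $I(C; X)$ is the reduction of entropy about variable $C$ after observing all possible values of $X$:

$$I(C; X) = \sum_{x \in X} \sum_{c \in C} P(x, c) \log \frac{P(x, c)}{P(x) P(c)},$$

where $P(x, c)$ is the joint probability distribution function of $X$ and $C$.

Definition 4. Conditional mutual information $CMI(C; X_i \mid X_j)$ is defined to measure the expected value of the mutual information of two random variables $C$ and $X_i$ given the value of a third variable $X_j$:

$$CMI(C; X_i \mid X_j) = \sum_{x_i \in X_i} \sum_{x_j \in X_j} \sum_{c \in C} P(x_i, x_j, c) \log \frac{P(c, x_i \mid x_j)}{P(c \mid x_j) P(x_i \mid x_j)}.$$

Definition 5. Conditional local mutual information $CLMI(C; x_i \mid x_j)$ is defined to measure the reduction of entropy about variable $C$ after observing $X_i = x_i$ when $X_j = x_j$ always holds:

$$CLMI(C; x_i \mid x_j) = \sum_{c \in C} P(x_i, x_j, c) \log \frac{P(c, x_i \mid x_j)}{P(c \mid x_j) P(x_i \mid x_j)}.$$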
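The measures defined above can be estimated from frequency counts. The following is a minimal sketch, not the authors' implementation: it assumes discrete attributes stored as tuples, base-2 logarithms and unsmoothed relative frequencies, and the function names `entropy`, `clmi` and `mutual_information` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(C): entropy of the class, estimated from a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def clmi(rows, labels, new, given=()):
    """CLMI(C; x_new | given); with an empty `given` this is LMI(C; x_new).

    `new` is an (attribute_index, value) pair and `given` is a sequence of
    such pairs that are assumed to hold (assumed data layout).
    """
    n = len(rows)
    attr, value = new
    in_given = [all(r[a] == v for a, v in given) for r in rows]
    n_g = sum(in_given)                                   # count(given)
    n_gx = sum(g and r[attr] == value for r, g in zip(rows, in_given))
    total = 0.0
    for c in set(labels):
        n_gc = sum(g and y == c for g, y in zip(in_given, labels))
        n_gxc = sum(g and y == c and r[attr] == value
                    for r, g, y in zip(rows, in_given, labels))
        if n_gxc:  # terms with zero joint frequency contribute nothing
            # P(x, g, c) * log[ P(c, x | g) / (P(c | g) P(x | g)) ]
            total += (n_gxc / n) * log2(n_gxc * n_g / (n_gc * n_gx))
    return total

def mutual_information(rows, labels, attr):
    """I(C; X_attr), obtained by summing LMI over every value of X_attr."""
    return sum(clmi(rows, labels, (attr, v)) for v in {r[attr] for r in rows})

rows = [("a0", "b0"), ("a0", "b1"), ("a1", "b0"), ("a1", "b1")]
labels = ["c0", "c0", "c1", "c1"]
print(entropy(labels), clmi(rows, labels, (0, "a0")),
      mutual_information(rows, labels, 0))
```

In this toy data the first attribute determines the class, so its mutual information equals the class entropy, while the second attribute carries no information about the class.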
Mutual information and conditional mutual information are commonly applied to roughly describe the direct or conditional relationships between a predictive attribute and class label $C$. However, in the real world, the relationships between attributes may differ greatly as the situation changes. Local mutual information and conditional local mutual information can be used to describe these dynamic changes, thus making the final model much more flexible. For example, as Figure 1 shows, suppose that the overall relationship among attributes $\{X_1, X_2, X_3\}$ and class label $C$ is just like a rectangle. However, for the $k$-th instance, $\{X_2, X_3\}$ are independent of $C$, and the local relationship between $X_1$ and $C$ is just like an oval.

. . .
According to the chain rule of joint mutual information, which is the theoretical basis of the decision tree learning algorithm, the mutual information between attributes $\{X_1, \dots, X_n\}$ and class label $C$ can be expanded as:

$$I(C; X_1, \dots, X_n) = I(C; X_1) + CMI(C; X_2 \mid X_1) + \cdots + CMI(C; X_n \mid X_1, \dots, X_{n-1}).$$

On the other hand, for a sample $t = \{x_1, x_2, \dots, x_n\}$ in data set $S$, the chain rule of local mutual information between $t$ and class label $C$ can be represented as:

$$LMI(C; t) = LMI(C; x_1) + CLMI(C; x_2 \mid x_1) + \cdots + CLMI(C; x_n \mid x_1, \dots, x_{n-1}).$$

On the basis of the chain rule of local mutual information described above, we can convert each instance into a single logical rule or decision path. We first choose the most appropriate attribute value from $t$, e.g., $x_1$, which satisfies $LMI(C; x_1) = \max_{1 \le i \le n} LMI(C; x_i)$. The training set is split into finer subsets with attribute value $x_1$. Then, we choose the next attribute value, e.g., $x_2$, which satisfies $CLMI(C; x_2 \mid x_1) = \max_{1 \le i \le n,\, i \ne 1} CLMI(C; x_i \mid x_1)$. The training subset is further split into finer subsets with attribute values $x_1$ and $x_2$. This procedure continues until all instances in the subset have the same class label or the instances in the subset satisfy some stopping criterion. Finally, each training instance corresponds to a logical rule or decision path.

Rules that share the same root node can be combined into a complete decision tree, so that the structural information they have in common is expressed only once. Thus, the final model is composed of not one tree, but one forest. This technique reduces the potential for overfitting and overtraining and improves the prediction quality and generalization. For example, given attributes $X = \{X_1, X_2, X_3, X_4\}$ and three instances $\{t_1, t_2, t_3\}$, according to the comparison results of local mutual information and conditional local mutual information, instances $\{t_1, t_2, t_3\}$ can be converted into three logical rules. The conversion procedure is shown in Figure 2.
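The greedy conversion of one instance into a decision path, as described above, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: probabilities are plain relative frequencies, logarithms are base 2, the only stopping criterion modelled is a pure (or empty) subset, and `clmi` and `extract_rule` are hypothetical names.

```python
from collections import Counter
from math import log2

def clmi(rows, labels, new, given=()):
    """CLMI(C; x_new | given); reduces to LMI(C; x_new) when `given` is empty."""
    n = len(rows)
    attr, value = new
    in_given = [all(r[a] == v for a, v in given) for r in rows]
    n_g = sum(in_given)
    n_gx = sum(g and r[attr] == value for r, g in zip(rows, in_given))
    total = 0.0
    for c in set(labels):
        n_gc = sum(g and y == c for g, y in zip(in_given, labels))
        n_gxc = sum(g and y == c and r[attr] == value
                    for r, g, y in zip(rows, in_given, labels))
        if n_gxc:
            total += (n_gxc / n) * log2(n_gxc * n_g / (n_gc * n_gx))
    return total

def extract_rule(instance, rows, labels):
    """Greedily order the attribute values of `instance` into a decision path."""
    path, unused = [], list(range(len(instance)))
    subset = list(labels)
    while unused:
        # First step: largest LMI. Later steps: largest CLMI given the
        # attribute values already placed on the path.
        best = max(unused,
                   key=lambda a: clmi(rows, labels, (a, instance[a]), path))
        path.append((best, instance[best]))
        unused.remove(best)
        subset = [y for r, y in zip(rows, labels)
                  if all(r[a] == v for a, v in path)]
        if len(set(subset)) <= 1:  # pure (or empty) subset: stop growing
            break
    label = Counter(subset).most_common(1)[0][0] if subset else None
    return path, label

rows = [("a0", "b0"), ("a0", "b1"), ("a0", "b0"),
        ("a1", "b0"), ("a1", "b1"), ("a1", "b0")]
labels = ["y0", "y0", "y0", "y1", "y0", "y1"]
print(extract_rule(("a1", "b0"), rows, labels))
print(extract_rule(("a0", "b1"), rows, labels))
```

In this toy data, the instance `("a1", "b0")` needs both attribute values before its subset becomes pure, whereas `("a0", "b1")` stops after the first attribute value, so the two rules have different lengths.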
Figure 2. The conversion procedure from instances $\{t_1, t_2, t_3\}$ to logical rules. Node labels recovered from the figure: $LMI(X_3 = c_0; C \mid X_2 = b_0, X_1 = a_1)$ and $LMI(X_4 = d_0; C \mid X_2 = b_1, X_3 = c_1)$.

Because Rule 2 and Rule 3 have the same root, they can be combined into a single tree. Finally, from instances $\{t_1, t_2, t_3\}$, we can build two tree structures, as Figure 3 shows.
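Merging rules that share a root into trees can be sketched with nested dictionaries. The three rules below are hypothetical stand-ins (they are not the rules of Figure 2), the node layout and the names `build_forest` and `classify` are assumptions, and `classify` simply returns the first matching rule instead of modelling the selection of the most credible path.

```python
def new_node():
    # Assumed node layout: an optional class label plus child tests.
    return {"label": None, "children": {}}

def build_forest(rules):
    """Group rules by root attribute and merge shared prefixes into trees.

    Each rule is (path, label), where path is a list of (attribute, value).
    """
    forest = {}
    for path, label in rules:
        root_attr = path[0][0]
        node = forest.setdefault(root_attr, new_node())
        for test in path:
            node = node["children"].setdefault(test, new_node())
        node["label"] = label
    return forest

def classify(forest, instance):
    """Return the label of the first rule that `instance` satisfies, else None."""
    for tree in forest.values():
        node = tree
        while True:
            matches = [child for (attr, val), child in node["children"].items()
                       if instance.get(attr) == val]
            if not matches:
                break
            node = matches[0]
        if node["label"] is not None:
            return node["label"]
    return None

# Hypothetical rules: the second and third share the root X3.
rules = [
    ([("X1", "a1"), ("X2", "b0"), ("X3", "c0")], "C1"),
    ([("X3", "c1"), ("X2", "b0")], "C2"),
    ([("X3", "c1"), ("X2", "b1"), ("X4", "d0")], "C1"),
]
forest = build_forest(rules)
print(sorted(forest))  # two trees, rooted at X1 and X3
print(classify(forest, {"X1": "a0", "X2": "b1", "X3": "c1", "X4": "d0"}))
```

An instance that matches no rule yields `None`, which corresponds to the case where a new rule has to be derived for the instance.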

Functional Dependency Rules of Probability
Given a relation $R$ (in a relational database), attribute $Y$ of $R$ is functionally dependent on attribute $X$ of $R$, and $X$ functionally determines $Y$ (in symbols, $X \to Y$), if each value of $X$ in $R$ is associated with exactly one value of $Y$. We demonstrated functional dependency rules of probability in [8,9] to build a linkage between FDs and probability theory. The following rules are the main ones:

• Representation equivalence of probability: Suppose data set $S$ consists of two attribute sets $\{\alpha, \beta\}$ and $\beta$ can be inferred from $\alpha$, i.e., the FD $\alpha \to \beta$ holds; then the following joint probability distribution holds:

$$P(\alpha) = P(\alpha, \beta).$$

• Augmentation rule of probability: If FD $\alpha \to \beta$ holds and $\gamma$ is a set of attributes, then the following joint probability distribution holds:

$$P(\alpha, \gamma) = P(\alpha, \beta, \gamma).$$

• Transitivity rule of probability: If FDs $\alpha \to \beta$ and $\beta \to \gamma$ hold, then the following joint probability distribution holds:

$$P(\alpha) = P(\alpha, \gamma).$$

• Pseudo-transitivity rule of probability: If $\beta\gamma \to \delta$ and $\alpha \to \beta$ hold, then the following joint probability distribution holds:

$$P(\alpha, \gamma) = P(\alpha, \gamma, \delta).$$

When two attributes are strongly related, the classifier may overweight the inference from the two attributes, resulting in prediction bias. FDs help to avoid this situation, and the high-dimensional representation, or even the whole classification model, is simplified. For example, if FD $x_1 \to x_2$ exists for instance $t = \{x_1, x_2, \dots, x_n\}$, by applying the representation equivalence of probability and the augmentation rule of probability, we will have:

$$P(x_1, x_2) = P(x_1), \qquad P(x_1, x_2, c) = P(x_1, c).$$

Then, from the definition of LMI, we will get:

$$LMI(C; x_1, x_2) = \sum_{c \in C} P(x_1, x_2, c) \log \frac{P(x_1, x_2, c)}{P(c) P(x_1, x_2)} = \sum_{c \in C} P(x_1, c) \log \frac{P(x_1, c)}{P(c) P(x_1)} = LMI(C; x_1).$$

Thus, $x_2$ is extraneous for classifying $t$. Removing extraneous attributes on the basis of FDs does not change the underlying conditional probability and, in turn, leads to lower variance. Therefore, FDs have the potential to be a valuable supplement to the classifier over a considerable range of classification tasks. Discovering FDs from existing databases is an important issue that has been investigated for many years and has recently been addressed from a data mining viewpoint in a novel and much more efficient way [10,11].
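The probability rules above can be checked numerically on a small table. The sketch below is illustrative only: the data are made up, probabilities are relative frequencies, logarithms are base 2, and `fd_holds`, `prob` and `lmi` are hypothetical helper names.

```python
from math import log2

def fd_holds(rows, lhs, rhs):
    """True if the attributes at indices `lhs` functionally determine `rhs`."""
    seen = {}
    for r in rows:
        key = tuple(r[i] for i in lhs)
        val = tuple(r[i] for i in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def prob(rows, assignment):
    """Empirical joint probability of an {attribute_index: value} assignment."""
    return sum(all(r[i] == v for i, v in assignment.items())
               for r in rows) / len(rows)

def lmi(rows, assignment, class_attr):
    """LMI between the class column and the joint event `assignment`."""
    p_x = prob(rows, assignment)
    total = 0.0
    for c in {r[class_attr] for r in rows}:
        p_xc = prob(rows, {**assignment, class_attr: c})
        if p_xc > 0:
            total += p_xc * log2(p_xc / (prob(rows, {class_attr: c}) * p_x))
    return total

# Columns: X1, X2, class. X1 -> X2 holds here, but X2 -> X1 does not.
rows = [("a0", "b0", "c0"), ("a0", "b0", "c0"), ("a1", "b1", "c1"),
        ("a1", "b1", "c0"), ("a2", "b0", "c1")]
print(fd_holds(rows, [0], [1]), fd_holds(rows, [1], [0]))
print(prob(rows, {0: "a0"}), prob(rows, {0: "a0", 1: "b0"}))
print(lmi(rows, {0: "a0"}, 2), lmi(rows, {0: "a0", 1: "b0"}, 2))
```

Because `a0` always co-occurs with `b0`, adding `X2 = b0` to the event `X1 = a0` changes neither the joint probability nor the local mutual information with the class.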

Naive Bayes
Supervised classification is a basic task in artificial intelligence and knowledge discovery. The aim of supervised learning is to predict, from a training set, the class of a testing instance $x = \{x_1, \dots, x_n\}$, where $x_i$ is the value of the $i$-th attribute. We estimate the conditional probability $P(c \mid x)$ and select $\arg\max_c P(c \mid x)$, where $c \in \{c_1, \dots, c_k\}$ ranges over the $k$ classes. From Bayes' theorem, we have:

$$P(c \mid x) = \frac{P(c) P(x \mid c)}{P(x)} \propto P(c) P(x \mid c).$$

The Bayesian network (BN) classifier is becoming increasingly popular in many areas, such as decision aid, diagnosis and complex system control, because of its inference capabilities [12,13]. The structure of a BN is a directed acyclic graph, where the nodes correspond to domain attributes and the arcs between the nodes represent direct dependencies between the attributes. The absence of an arc between two nodes $X_1$ and $X_2$ denotes that $X_2$ is independent of $X_1$ given its parents.
However, the accurate estimation of $P(x \mid c)$ is a complex process. Learning an optimal BN structure from existing data has been proven to be an NP-hard problem. NB avoids this problem by assuming that the attributes are independent given the class:

$$P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c).$$

Then, the following equation is often calculated in practice rather than Equation (15):

$$P(c \mid x) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c).$$
NB has a simple structure, whereby an arc exists from the class node to every other node, but no arcs exist among the other nodes, as illustrated in Figure 4. One advantage of NB is that it avoids model selection, because selecting between alternative models can be expected to increase variance and allow a learning system to overfit the training data [14].
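A naive Bayes classifier over discrete attributes fits in a few lines. The sketch below is not the authors' MATLAB code: it assumes add-one smoothing of the class prior and of the conditional probabilities $P(x_i \mid c)$, works in log space to avoid underflow, and uses the hypothetical names `train_nb` and `predict_nb`.

```python
from collections import Counter
from math import log

def train_nb(rows, labels):
    """Collect the frequency counts that naive Bayes needs."""
    class_counts = Counter(labels)
    value_counts = Counter()                 # (attr, value, class) -> count
    values = [set() for _ in rows[0]]        # observed values per attribute
    for r, y in zip(rows, labels):
        for i, v in enumerate(r):
            value_counts[(i, v, y)] += 1
            values[i].add(v)
    return class_counts, value_counts, values, len(rows)

def predict_nb(model, x):
    """Return arg max_c P(c) * prod_i P(x_i | c), computed in log space."""
    class_counts, value_counts, values, n = model
    k = len(class_counts)

    def log_score(c):
        score = log((class_counts[c] + 1) / (n + k))   # smoothed prior
        for i, v in enumerate(x):
            # Add-one smoothing of P(x_i | c) (an assumption of this sketch).
            score += log((value_counts[(i, v, c)] + 1)
                         / (class_counts[c] + len(values[i])))
        return score

    return max(class_counts, key=log_score)

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(predict_nb(model, ("sunny", "hot")), predict_nb(model, ("rain", "cool")))
```

No structure is learned here: the model consists only of counts, which is why NB involves no model selection.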

FHDF
The learning procedure of the FHDF algorithm is described as follows:

Output: Hybrid decision forest model.

Pre-processing phase: Mine association rules from data set $T_1 + T_2$, then convert them into FDs and compute the closure of the FDs. Find extraneous attributes based on FD analysis and remove them from the attribute set.

Training phase:

Step 1: For each instance $t = \{x_1, \dots, x_n\}$ in $T_1$, generate the root node $X_i$ that satisfies

$$LMI(C; x_i) = \max_{1 \le k \le n} LMI(C; x_k).$$

Then, verify whether all instances in the corresponding subset, which satisfies $X_i = x_i$, have the same class label.
(a) If yes, exit and create a leaf node;
(b) otherwise, node $X_i$ is created and added as one branch node.

Step 2: Calculate the CLMI of each attribute value among the attributes that have not been used so far, and select attribute $X_j$, which satisfies:

$$CLMI(C; x_j \mid \pi) = \max_{x_k \notin \pi} CLMI(C; x_k \mid \pi),$$

where $\pi$ denotes the attribute values already placed on the path. Then, verify whether the CLMI satisfies the stopping criterion:
(a) If yes, create NB as a leaf node;
(b) otherwise, child node $X_j$ is created and added to the branch.
Repeat the same process for each instance $t$ in $T_1$ from Step 1.

Step 3: Get the order of attribute values in each instance, which will be converted into the classification rule. Use the rules with the same root to construct a hybrid decision tree.

Testing phase:

Step 4: Use the rules to assign class labels to unlabeled instances in $T_2$.

Step 5: For any instance $t$ that does not match any rule in the decision forest, start again from the training phase, obtain the attribute order and get the sub-optimal rule for classification.
Knowledge implicit in a database can be divided into certain and uncertain parts. The certain part is commonly described by a set of rules, such as a decision tree. The uncertain part is usually described from the viewpoint of probability, as in a BN. The working mechanism of the proposed hybrid model, FHDF, resembles the hierarchical process of human decision making. First, we convert each instance into a rule by applying the chain rule of local mutual information. The extracted robust knowledge structure can solve the most certain problems based on definite knowledge and experience. Second, if the criterion of information growth cannot be satisfied, the knowledge is scattered and the independence assumption may be satisfied to some extent; then, NB is applied to make full use of the rest of the attributes.
Before this structured learning process of knowledge accumulation and induction, FDs are utilized to remove extraneous information and to build a simplified and robust knowledge structure. For example, suppose that data set $S$ and the logical rules inferred from it are as shown in Tables 1 and 2, respectively; the corresponding FHDF structure is shown in Figure 5a. If FD $\{X_1 = a_1\} \to \{X_2 = b_0\}$ holds, then for Instances 3, 4 and 5, $X_2$ is extraneous for further inference. As Figure 5b shows, $X_2$ will be removed from the FHDF structure.

Figure 5. The corresponding flexible hybrid decision forest (FHDF) structure before and after applying functional dependency (FD). Node labels recovered from the figure: $CMI(X_3; C \mid a_0, c_0)$, $CMI(X_2; C \mid a_1, b_1)$, $CMI(X_2; C \mid a_0, c_1)$ and $CMI(X_2; C \mid a_1, b_0)$.

Equation (8) is the theoretical basis of FHDF. For each instance $t$ in the training set, the chain rule of local mutual information gives:

$$LMI(C; t) = LMI(C; x_1) + CLMI(C; x_2 \mid x_1) + \cdots + CLMI(C; x_n \mid x_1, \dots, x_{n-1}).$$

According to Occam's razor, a shorter hypothesis is more believable, whereas a longer one is more likely to hold by coincidence. To make the final prediction credible, information growth must be assured when attributes are added in order. When adding a new attribute value $x_i$, the prerequisite for continuing is that the information growth must be more than 20%. Otherwise, FHDF creates an NB leaf node with all of the other attributes that have not been processed to perform further inference.
To sum up, FHDF extracts logical rules from the logical viewpoint by calculating LMI and CLMI, while NB predicts from the probabilistic viewpoint. Besides, FHDF presents outstanding flexibility and adaptability when there is a lack of prior knowledge. For example, given a testing instance $t = \{a_3, b_1, c_0\}$, no rule in the decision forest can be used to assign a class label. Suppose that the attribute order is {c 0 , a


Bias and Variance
Kohavi and Wolpert presented a bias-variance decomposition of the expected misclassification rate [15], which is a powerful tool from sampling theory statistics for analyzing supervised learning scenarios. Suppose $c$ and $\hat{c}$ are the true class label and the label generated by a learning algorithm, respectively; the zero-one loss function is defined as:

$$\xi(c, \hat{c}) = 1 - \delta(c, \hat{c}),$$

where $\delta(c, \hat{c}) = 1$ if $\hat{c} = c$ and zero otherwise. The bias term measures the squared difference between the average output of the target and that of the algorithm. This term is defined as follows:

$$bias^2 = \frac{1}{2} \sum_{c \in C} \left[ P(c \mid x) - \hat{P}(c \mid x) \right]^2,$$

where $x$ is the combination of any attribute value. The variance term is a real-valued non-negative quantity and equals zero for an algorithm that always makes the same guess regardless of the training set. The variance increases as the algorithm becomes more sensitive to changes in the training set. It is defined as follows:

$$variance = \frac{1}{2} \left[ 1 - \sum_{c \in C} \hat{P}(c \mid x)^2 \right].$$

Moore and McCabe illustrated bias and variance through shooting arrows at a target [16], as described in Figure 6. The perfect model can be regarded as the bull's eye on a target and the learned classifier as an arrow fired at the bull's eye. Bias and variance describe what happens when an archer fires many arrows at the target. Bias means that the aim is off and the arrows land consistently off the bull's eye in the same direction. Variance means that the arrows are scattered. Large variance means that repeated shots are widely scattered on the target: they do not give similar results, but differ widely among themselves.
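The two terms can be estimated from the predictions that classifiers trained on different training sets make for the same test instances. The sketch below follows the Kohavi-Wolpert definitions under two assumptions that are not stated in the text: the target is deterministic (the true label has probability one, so the noise term is zero), and the per-instance terms are averaged uniformly over the test instances. The name `bias_variance` is hypothetical.

```python
from collections import Counter

def bias_variance(true_labels, predictions):
    """Average squared bias and variance under zero-one loss.

    predictions[i] lists the labels predicted for test instance i by models
    trained on different training sets (assumed input layout).
    """
    bias2 = variance = 0.0
    for y, preds in zip(true_labels, predictions):
        m = len(preds)
        p_hat = {c: k / m for c, k in Counter(preds).items()}
        classes = set(p_hat) | {y}
        # Deterministic target: P(c | x) is 1 for the true label, else 0.
        bias2 += 0.5 * sum(((1.0 if c == y else 0.0) - p_hat.get(c, 0.0)) ** 2
                           for c in classes)
        variance += 0.5 * (1.0 - sum(p * p for p in p_hat.values()))
    n = len(true_labels)
    return bias2 / n, variance / n

# One test instance with true label "A", predicted by three models.
b, v = bias_variance(["A"], [["A", "A", "B"]])
print(b, v, b + v)
```

With a deterministic target, the two terms add up to the misclassification rate: here one of three predictions is wrong, and the terms are 1/9 and 2/9.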

Statistical Results on UCI Data Sets
In order to verify the efficiency and effectiveness of the proposed FHDF, we conduct experiments on 43 data sets from the UCI machine learning repository. Table 3 summarizes the characteristics of each data set, including the number of instances, attributes and classes. Large data sets with more than 3000 instances are annotated with the symbol "*". Missing values for qualitative attributes are replaced with modes, and those for quantitative attributes are replaced with means from the training data. For each benchmark data set, numeric attributes are discretized using MDL discretization [17]. The following techniques are compared:
• NB, standard naive Bayes.
• C4.5, standard decision tree.
• TAN, tree-augmented naive Bayes applying incremental learning.
• NBTree, decision tree with naive Bayes as the leaf node.
• RTAN, tree-augmented naive Bayes ensembles.
• FHDF, the proposed flexible hybrid decision forest.
All algorithms were coded in MATLAB 7.0 on a Pentium 2.93 GHz/1 G RAM computer. Base probability estimates $P(c)$, $P(c, x_i)$ and $P(c, x_i, x_j)$ were smoothed using the Laplace estimate, which can be described as follows:

$$\hat{P}(c) = \frac{F(c) + 1}{K + k}, \qquad \hat{P}(c, x_i) = \frac{F(c, x_i) + 1}{K_i + k_i}, \qquad \hat{P}(c, x_i, x_j) = \frac{F(c, x_i, x_j) + 1}{K_{ij} + k_{ij}},$$

where $F(\cdot)$ is the frequency with which a combination of terms appears in the training data, $K$ is the number of training instances for which the class value is known, $K_i$ is the number of training instances for which both the class and attribute $X_i$ are known and $K_{ij}$ is the number of training instances for which all of the class and attributes $X_i$ and $X_j$ are known. $k$ is the number of values of class $C$, $k_i$ is the number of value combinations of $C$ and $X_i$ and $k_{ij}$ is the number of value combinations of $C$, $X_j$ and $X_i$.

Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). It is a class of machine learning techniques that uses both labeled and unlabeled data for training; typically, a small amount of labeled data is combined with a large amount of unlabeled data. Many machine learning researchers have found that when unlabeled data are used in conjunction with a small amount of labeled data, a considerable improvement in learning accuracy can be achieved. The acquisition of labeled data for a learning problem often requires a skilled human agent (e.g., to transcribe an audio segment) or a physical experiment (e.g., determining the 3D structure of a protein or determining whether oil is present at a particular location).

Given data set $S$ with attribute set $X$, suppose FD $\alpha \to \beta$ has been deduced from the whole data set and $\alpha, \beta \subset X$. From the representation equivalence of probability, we can obtain:

$$P(\alpha) = P(\alpha, \beta).$$

By applying the augmentation rule of probability, we will get:

$$P(\alpha, c) = P(\alpha, \beta, c).$$

Therefore, FD $\alpha \to \beta$ still holds in the training data regardless of the class label. Thus, we can use the whole data set to extract FDs with high confidence, and the framework of FHDF is accordingly semi-supervised learning.

In the following discussion, we use FDs that are extracted based on association rule analysis as the domain knowledge. These rules can be effective in uncovering unknown relationships, thereby providing results that can be the basis of forecasts and decisions. They have proven to be useful tools for an enterprise as it strives to improve its competitiveness and profitability. To ensure the validity of FDs, the minimum number of instances that satisfy an FD is 100. With an increasing number of attributes, more RAM is needed to store joint probability distributions. An important restriction of our algorithm is that the number of attributes on the left side of an FD, i.e., the number of key attributes, should be no more than two.
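The Laplace estimate described above is a one-line computation once the counts are known. The sketch below uses made-up class frequencies, and `laplace` is a hypothetical helper name; the same function applies to $P(c, x_i)$ and $P(c, x_i, x_j)$ with the corresponding counts.

```python
def laplace(freq, known, n_combinations):
    """Laplace estimate (F + 1) / (K + k).

    freq: frequency F of the value combination in the training data,
    known: number K of training instances for which the involved values
           are known,
    n_combinations: number k of possible value combinations.
    """
    return (freq + 1) / (known + n_combinations)

# Made-up example: three classes observed 7, 3 and 0 times among K = 10.
class_freq = {"c0": 7, "c1": 3, "c2": 0}
K, k = sum(class_freq.values()), len(class_freq)
estimates = {c: laplace(f, K, k) for c, f in class_freq.items()}
print(estimates)
```

The unseen class `c2` receives the non-zero estimate 1/13 instead of 0, and the smoothed estimates still sum to one.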
Table 4 presents for each data set the zero-one loss, which is estimated by 10-fold cross-validation to give an accurate estimation of the average performance of an algorithm. The bias and variance results are shown in Tables 5 and 6, respectively, in which only the 15 large data sets are selected, for reasons of statistical significance. The zero-one loss, bias or variance across multiple data sets provides a gross measure of relative performance. Statistically, a win/draw/loss record (W/D/L) is calculated for each pair of competitors A and B with regard to a performance measure M. The record represents the number of data sets in which A respectively beats, ties with or loses to B on M. Small improvements in leave-one-out error may be attributable to chance. Consequently, it may be beneficial to use a statistical test to assess whether an improvement is significant. A standard binomial sign test, assuming that wins and losses are equiprobable, is applied to these records. A difference is considered significant when the outcome of a two-tailed binomial sign test is less than 0.05. Tables 7-10 show the W/D/L records corresponding to zero-one loss, bias and variance, respectively.
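The two-tailed binomial sign test on a W/D/L record can be computed exactly. The sketch below assumes that draws are discarded before testing, a common convention that the text does not state explicitly; `sign_test_p` is a hypothetical name. The example records are the NB versus C4.5 comparisons discussed later in this section (22 wins and 16 losses on all data sets; 2 wins and 12 losses on the large ones).

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-tailed binomial sign test with equiprobable wins and losses.

    Draws are assumed to be excluded beforehand (an assumption of this sketch).
    """
    n = wins + losses
    if n == 0:
        return 1.0
    # Probability of a result at least as extreme as the smaller count.
    tail = sum(comb(n, i) for i in range(min(wins, losses) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p(22, 16))  # all data sets
print(sign_test_p(2, 12))   # large data sets
```

Under these assumptions, the record on all data sets is not significant at the 0.05 level, whereas the record on the large data sets is.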
To clarify the main reason why the same classifier performs differently when data size changes, in the following discussion, we propose a new parameter, named clear winning percentage (CWP), which is defined as follows, to evaluate the extent to which one classifier is relatively superior to another.

$$CWP = \frac{Win - Loss}{Win + Draw + Loss}$$

Table 4. Experimental results of 0-1 loss. C4.5, standard decision tree; TAN, tree-augmented naive Bayes applying incremental learning; NBTree, decision tree with naive Bayes as the leaf node; RTAN, tree-augmented naive Bayes ensembles.

If $CWP > 0$, then the classifier performs better. Otherwise, it performs worse. Figure 7 shows the comparison results of CWP corresponding to Tables 7 and 8. NB delivers fast and effective classification with a clear theoretical foundation. However, NB is confronted with the limitations of the conditional independence assumption. From Figure 7a, we can see that, although the conditional independence assumption rarely holds in the real world, NB exhibits competitive performance compared to C4.5, $CWP = (22 - 16)/43 = 0.14 > 0$. When dealing with small data sets, the instances can be regarded as sparsely distributed, and the conditional independence assumption is satisfied to some extent. For example, there are as many as 38 attributes and only 894 instances for data set "Anneal". After calculating and comparing the sum of CMI between one attribute and all of the other attributes, we find that most attributes have weak relationships to other attributes or are even nearly independent of them. However, when dealing with large data sets, the relationships between attributes become much more obvious, and the limitation of this assumption cannot be neglected, $CWP = (2 - 12)/15 = -0.67 < 0$.
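The CWP values quoted above can be reproduced directly from the W/D/L records. In the sketch below, the draw counts (5 and 1) are inferred from the stated totals of 43 and 15 data sets, and `cwp` is a hypothetical function name.

```python
def cwp(win, draw, loss):
    """Clear winning percentage: (Win - Loss) / (Win + Draw + Loss)."""
    return (win - loss) / (win + draw + loss)

# NB versus C4.5; draw counts inferred from the totals 43 and 15.
print(round(cwp(22, 5, 16), 2))  # all 43 data sets
print(round(cwp(2, 1, 12), 2))   # 15 large data sets
```

The sign of CWP flips between the two settings, which is the observation the text uses to contrast small and large data sets.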
As Figure 7b shows, a similar result appears when TAN and C4.5 are compared. The main reason may be that logical rules, which are extracted based on information theory, play the main role in classification and can represent credible linear dependencies among attributes as data size increases. Besides, as Figure 7c shows, although TAN is restricted to have at most one parent node for each predictive attribute, its structure is more reasonable than that of NB and can exhibit the relationships between attributes to some extent. From Figure 7d-f, when compared with these three single-structure models, the hybrid model, i.e., NBTree, shows superior or at least equivalent performance. To decrease the computational overhead, NB and TAN simplify the network structure under different independence assumptions, whereas the hybrid model can utilize the advantages of both logical rules and probabilistic estimation. Because a leaf of the tree structure contains only a limited number of instances, NB can evenly make use of the rest of the attributes. RTAN investigates the diversity of TAN by the K statistic. From Figure 7g-j, this bagging mechanism helps RTAN to achieve superior performance to NB and TAN. However, when compared to decision tree learning algorithms, i.e., C4.5 and NBTree, the advantage of RTAN is not obvious. FHDF is motivated by the desire to weaken the independence assumption by removing correlated attributes on the basis of FD analysis. It also applies the bagging mechanism to describe the hyperplane in which each instance exists. From Figure 7k-o, FHDF demonstrates comprehensive and superior performance.

Bias can help to evaluate the extent to which the final model learned from training data fits the whole data set. From Table 9, we can see that NB fits the data most poorly, because its structure is fixed regardless of what the true data distribution is. In contrast, TAN performs much better than C4.5 and NBTree. The main reason may be that each predictive attribute of TAN is affected by its direct parent node, which is selected by calculating CMI from the global viewpoint. As for C4.5 and NBTree, the final prediction relies greatly on the first few attributes corresponding to the main parts of the single-tree structure, i.e., the root and the branch nodes. Thus, C4.5 and NBTree can easily reach a locally rather than globally optimal solution. In view of this, the bagging mechanism can help RTAN and FHDF to make full use of the information that the training data supply, since the knowledge hierarchy implicit in the training data can be described in different submodels, and the negative effect caused by changes in the data distribution is mitigated to a certain extent. The complicated relationships among attributes are measured and depicted from the viewpoint of information theory; thus, robust performance can be achieved.

With respect to variance, as Table 10 shows, NB performs the best of these algorithms, because its network structure is fixed and, thus, not sensitive to changes in the training set. By contrast, C4.5 performs the worst. The main reason may be that the logical rules resemble the hierarchical process of human decision making. Attributes located further down a rule are conditioned on those before them. If any attribute changes its position, the attributes that follow will be affected greatly. Thus, for different training sets, especially when a branch contains a very small number of instances, the conditional distribution estimates may differ greatly and different attribute orders may be obtained. For example, there are 58 attributes and 4601 instances for data set "Spambase". When 10-fold cross-validation is applied, only 4601 × 0.9 ≈ 4141 instances can be used for training. Suppose that each attribute has two values and that a leaf node must have at least 100 instances to ensure statistical significance: each logical rule will then use at most five attributes for prediction. That means that the order of the remaining 53 attributes may be essentially random. TAN and RTAN need to calculate CMI to build a maximal spanning tree, which may cause overfitting. On the other hand, FHDF and NBTree both use NB as the leaf node, thus retaining the simplicity and direct theoretical foundation of the decision tree, while mitigating the negative effect of the probability distribution of the training set. Besides, when different training and testing sets are given, FHDF uses the same FDs in the semi-supervised learning framework to eliminate redundant attributes. These FDs are extracted from the whole data set and are entirely unrelated to the training set.
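The Spambase arithmetic above can be checked with a few lines. The sketch assumes, as the text does, that every attribute has two values, and additionally that each split divides the instances evenly; `max_rule_length` is a hypothetical name.

```python
def max_rule_length(n_train, values_per_attr, min_leaf):
    """Longest rule whose leaf still holds at least `min_leaf` instances,
    assuming every split divides the instances evenly (a simplification)."""
    if values_per_attr < 2:
        raise ValueError("each attribute needs at least two values")
    depth = 0
    while n_train / values_per_attr ** (depth + 1) >= min_leaf:
        depth += 1
    return depth

n_train = round(4601 * 0.9)               # 9 of 10 folds are used for training
depth = max_rule_length(n_train, 2, 100)
print(n_train, depth, 58 - depth)
```

After five binary splits a leaf holds about 129 instances, and a sixth split would leave about 65, below the threshold of 100. This is why at most five attributes are used and the remaining 53 are left without a reliable order.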
To further describe the working mechanism of FHDF, we vary the information growth threshold and the number of key attributes while learning from training data. The comparison results of average bias and average variance are shown in Figure 8. When the information growth increased from 10% to 30%, as can be seen from Figure 8a, the bias increased correspondingly, while the variance decreased. The main reason may be that, when instance $t$ is converted into a classification rule, a higher information growth threshold causes fewer attribute values to be selected. Thus, the classification rule will be too short to precisely fit instance $t$. However, when different training data are given, the short rule may still roughly hold for many instances. When the number of key attributes increased from one to three, as can be seen from Figure 8b, the bias and variance remained almost the same. With more key attributes used, according to Occam's razor, the extracted FDs may be coincidental and not credible. Furthermore, the number of FDs with one key attribute is much larger than the number with two or three key attributes, so the negative effect on bias and variance can be neglected.

Conclusions
FHDF has demonstrated a number of advantages over previous decision tree learning algorithms. With sparsely distributed data, ensemble learning has difficulty in predicting with certainty, thus resulting in an increased error rate. Regardless of which part is considered as the testing part, FDs are extracted from the whole data set and, thus, remain the same. The computational demands for determining the structure become lower, especially if a large number of attributes are available. The size of the conditional probability tables increases exponentially with the number of parents, so sufficient labeled instances are a prerequisite for precise parameter estimation. However, from the probabilistic analysis perspective, if the data set is too small, the distributions of different attribute values become uneven: some attributes may be closely distributed, whereas others may be sparsely distributed. As a result, an unreliable probability estimate might be obtained. Because both CLMI and the conditional probability distribution must be estimated precisely, determining when NB can be used as the leaf node remains an unsolved problem. A number of techniques have been developed for extending the decision tree to handle numeric data. Hence, extending the current work to a more general FHDF framework is necessary.

Definition 3. Local mutual information $LMI(C; x)$ is defined to measure the reduction of entropy about variable $C$ after observing $X = x$:

$$LMI(C; x) = \sum_{c \in C} P(x, c) \log \frac{P(x, c)}{P(c) P(x)}.$$

Figure 1. The overall and local relationships among $\{X_1, X_2, X_3\}$ and $C$.

Figure 4. The network structure of naive Bayes.

Figure 6. Bias and variance in shooting arrows at a target.

Figure 7. The 0-1 loss comparison of learning algorithms on all and large data sets.

3 , b 1 } by calculating LMI and CLMI; as shown in Table 1, only training Instance 8 meets the first two conditions and can be used to classify $t$. The class label of $t$ should be $C_2$.

Table 1. Corresponding classification rules.

Table 5. Experimental results of bias.

Table 6. Experimental results of variance.

Table 7. Win/draw/loss record (W/D/L) comparison results of 0-1 loss on all data sets.

Table 8. W/D/L comparison results of 0-1 loss on large data sets.

Table 9. W/D/L comparison results of bias on large data sets.

Table 10. W/D/L comparison results of variance on large data sets.

Figure 8. Comparison results of the bias and variance of FHDF given different information growth and a different number of key attributes.