1. Introduction
Bayesian networks (BNs), which were introduced by Pearl [1], can encode dependencies among all variables, and their success has led to a recent flurry of algorithms for learning BNs from data [2–5]. A BN = <N, A, Θ> is a directed acyclic graph with a conditional probability distribution for each node, collectively represented by Θ, which quantifies how much each node depends on its parents. Each node n ∈ N represents a domain variable, and each arc a ∈ A between nodes represents a probabilistic dependency. A BN can be used as a classifier that characterizes the joint distribution P(x, y) of the class variable Y and a set of attributes X = {X1, X2, ⋯, Xn}, and predicts the class label with the highest conditional probability. (In the following discussion, lower-case letters denote specific values taken by the corresponding attributes; for instance, xi represents the event that Xi = xi.) Denoting the parent nodes of xi by Pa(xi), the joint distribution PB(x, y) can be represented by factors over the network structure B, as follows:

$$P_B(x, y) = P(y) \prod_{i=1}^{n} P(x_i \mid Pa(x_i)).$$
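As an illustration of this factorization (a minimal sketch, not the paper's implementation; the data structures below are hypothetical), the joint probability of one instance can be evaluated by a single pass over the nodes:

```python
# Sketch: evaluating the factored joint P_B(x, y) from conditional
# probability tables (CPTs). The table layout here is hypothetical.

def joint_probability(x, y, prior, cpts, parents):
    """P_B(x, y) = P(y) * prod_i P(x_i | Pa(x_i)).

    x       : tuple of attribute values (x_1, ..., x_n)
    y       : class value
    prior   : dict mapping y -> P(y)
    cpts    : list of dicts; cpts[i] maps (x_i, y, parent values) -> probability
    parents : list of lists; parents[i] holds the indices of X_i's
              attribute parents (the class Y is always a parent)
    """
    p = prior[y]
    for i, xi in enumerate(x):
        pa_vals = tuple(x[j] for j in parents[i])  # values of Pa(x_i) besides y
        p *= cpts[i][(xi, y, pa_vals)]
    return p
```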
The inference of a general BN has been shown to be NP-hard [6], even for approximate solutions [7]. Moreover, learning unrestricted BNs does not necessarily lead to a classifier with good performance: Friedman et al. [9] observed that unrestricted BN classifiers do not outperform even naive Bayes (NB) [8], the simplest BN, which considers only the dependence between each attribute Xi and the class variable Y, on a large sample of benchmark data sets. Many BN classifiers have been proposed to overcome the limitations of NB. One practical approach for structure learning is to impose restrictions on the structures of BNs, for example, learning tree-like structures. Sahami [10] proposed a general framework for describing limited dependence among variables, called the k-dependence Bayesian (KDB) classifier. Friedman et al. [9] proposed tree-augmented naive Bayes (TAN), a structure-learning algorithm that learns a maximum spanning tree over the attributes. Both algorithms apply conditional mutual information to measure the weights of arcs between predictive attributes. As the data size grows, its ability to represent higher-degree dependencies helps KDB obtain better classification performance than TAN.
The key differences between Bayesian classifiers lie in their structure-learning algorithms. Many criteria, such as the Bayesian scoring function [11], minimum description length (MDL) [12] and the Akaike information criterion (AIC) [13], have been proposed to find a single global graph structure BG that best characterizes the true distribution of the given data. Considering the time and space complexity overhead, only a limited number of conditional probabilities can be encoded in a BN, yet all credible dependencies must be represented to obtain an accurate estimate of the true joint distribution. However, these criteria can only approximately measure the overall interdependencies between attributes; they cannot identify how the interdependencies change when attributes take different values. Thus, the candidate graph structures may have very close scores and are non-negligible in the posterior sense [14]. To extend the limited representation of BG, some researchers have proposed aggregating several candidate BNs. Averaged one-dependence estimators (AODE), proposed by Webb et al. [15], aggregate the predictions of all qualified members of a restricted class of one-dependence estimators. Zheng et al. [16] proposed subsumption resolution (SR) to efficiently identify occurrences of the specialization-generalization relationship and to eliminate generalizations at classification time. By introducing functional dependency (FD) analysis into the learning procedure, the interpretability and robustness of different Bayesian classifiers can be improved greatly. After eliminating highly dependent attribute values through FD analysis, the maximal spanning tree (MST) of TAN is rebuilt with the remaining attribute values for each test instance, so the extraneous effect caused by logical relationships between attribute values is mitigated [17]. To evaluate the feasibility of integrating probabilistic and logical reasoning into the framework of AODE, we first select the branch nodes of the MST as the super parents, and then refine AODE by applying FD analysis to delete redundant children attributes [18].
In this paper, local mutual information (LMI) and conditional local mutual information (CLMI), which are derived from classical information theory, are applied to build a local graph structure BL. BL can be considered a complement to BG that describes local causal relationships. To construct classifiers at arbitrary points (values of k) along the attribute dependence spectrum, both BL and BG are built within the framework of the KDB model. We also propose substitution-elimination resolution (SER), a new type of semi-naive Bayesian operation that substitutes or eliminates generalizations to achieve an accurate estimate of the conditional probability distribution while reducing computational complexity. SER deals only with specific values, and only in the context of other specific values. We prove that this adjustment is theoretically correct and demonstrate experimentally that it can considerably improve zero-one loss, bias and variance.
The remainder of this paper is organized as follows: Section 2 first presents the background theory (information theory and the functional dependency rules of probability) and then clarifies the rationale of SER. Section 3 introduces the basic ideas of KDB, local KDB and the proposed algorithm, averaged k-dependence Bayesian classifiers (AKDB), which averages the outputs of KDB and local KDB. Section 4 compares the various approaches on data sets from the UCI Machine Learning Repository. Finally, Section 5 presents possible future work.
3. KDB, Local KDB and AKDB
KDB allows us to construct classifiers at arbitrary points (values of k) along the feature dependence spectrum while retaining most of the computational efficiency of the naive Bayesian model. KDB thus presents an alternative to the general trend in BN learning algorithms of conducting an expensive search through the space of network structures.
KDB is supplied with a database of pre-classified instances, DB, and the value k for the maximum allowable degree of feature dependence. It outputs a k-dependence Bayesian classifier with conditional probability tables determined from the input data. The algorithm is as follows:
Algorithm 1. KDB.

1. For each attribute Xi, compute the mutual information (MI) I(Xi; Y), where Y is the class.
2. Compute the class-conditional mutual information (CMI) I(Xi; Xj|Y) for each pair of attributes Xi and Xj, where i ≠ j.
3. Let the used variable list, S, be empty.
4. Let the Bayesian network being constructed, BN, begin with a single class node, Y.
5. Repeat until S includes all domain attributes:
5.1. Select the attribute Xmax that is not in S and has the highest value I(Xmax; Y).
5.2. Add a node representing Xmax to BN.
5.3. Add an arc from Y to Xmax in BN.
5.4. Add m = min(|S|, k) arcs from the m distinct attributes Xj in S with the highest values of I(Xmax; Xj|Y).
5.5. Add Xmax to S.
6. Compute the conditional probability tables inferred by the structure of BN using counts from DB, and output BN.
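To make the control flow of Algorithm 1 concrete, the following is a minimal Python sketch under stated assumptions: the callables mi and cmi (standing for I(Xi; Y) and I(Xi; Xj|Y)) are assumed to be precomputed from DB, and only the structure, i.e., the parent sets, is returned, not the probability tables of step 6.

```python
# Minimal sketch of Algorithm 1 (KDB structure learning).
# Assumes mi(i) returns I(X_i; Y) and cmi(i, j) returns I(X_i; X_j | Y),
# both precomputed from the training database DB.

def kdb_structure(n_attributes, k, mi, cmi):
    """Return parents[i]: the indices of X_i's attribute parents.

    The class Y is implicitly a parent of every attribute, so each
    attribute ends up with at most k + 1 parents in the network.
    """
    order = sorted(range(n_attributes), key=mi, reverse=True)  # steps 1, 5.1
    s = []          # the used variable list S
    parents = {}
    for i in order:
        m = min(len(s), k)                                     # step 5.4
        parents[i] = sorted(s, key=lambda j: cmi(i, j), reverse=True)[:m]
        s.append(i)                                            # step 5.5
    return parents
```

The conditional probability tables would then be estimated from counts in DB, as in step 6.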
From Definitions 3–6, we can obtain the following results:
MI and CMI are commonly applied to roughly measure the direct or conditional relationships between the predictive attributes and the class variable Y. In the real world, however, the relationships between attributes may differ significantly as the situation changes: for some instances, attributes A and B are highly related, while for other instances A is independent of B but highly related to Y. Consider the relationships among the attributes Gender, Pregnant and Breast Cancer: Gender = female and Breast Cancer = yes are highly related. By contrast, if Gender = female, we cannot draw any definite conclusion about the value of Pregnant, nor about the value of Gender if Breast Cancer = no. Traditional Bayesian classifiers, e.g., KDB, which are learned on the basis of classical information theory, cannot describe such interdependencies. However, LMI and CLMI can be used to identify these dynamic changes, thus making the final model much more flexible.
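The formal definitions of LMI and CLMI are given in Section 2 (Definitions 3–6). As a hedged illustration only, assuming LMI takes the usual pointwise form I(xi; Y) = Σy P(xi, y) log [P(xi, y)/(P(xi)P(y))], it could be estimated from simple relative frequencies as follows (the function and variable names here are ours, not the paper's):

```python
# Sketch: local mutual information between one attribute value x_i and
# the class Y, assuming the pointwise form
#   I(x_i; Y) = sum_y P(x_i, y) * log( P(x_i, y) / (P(x_i) * P(y)) ).
# Probabilities are estimated by simple relative frequencies.
from math import log

def local_mutual_information(xi_column, y_column, xi_value):
    m = len(y_column)
    p_xi = sum(1 for v in xi_column if v == xi_value) / m
    total = 0.0
    for y in set(y_column):
        p_y = sum(1 for w in y_column if w == y) / m
        p_joint = sum(1 for v, w in zip(xi_column, y_column)
                      if v == xi_value and w == y) / m
        if p_joint > 0:  # terms with zero joint probability contribute nothing
            total += p_joint * log(p_joint / (p_xi * p_y))
    return total
```

CLMI would be computed analogously, conditioning each term on the class value.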
As shown in Figure 3, for the first instance, the attribute value x2 is independent of the other attribute values, and the local relationship between {x1, x3} and the class variable Y resembles a triangle. For the ith instance, {x2, x3} are independent of Y, and the local relationship between x1 and Y resembles an oval. For the last instance, x3 is independent of the other attribute values, and the local relationship between {x1, x2} and Y resembles a broken line. When all situations are considered together, the overall relationship between the attributes {X1, X2, X3} and the class variable Y resembles a rectangle.
KDB learns the basic relationships of the full BN. If, in the first two learning steps of KDB, I(Xi; Y) and I(Xi; Xj|Y) are replaced by I(xi; Y) and I(xi; xj|Y), respectively, then a local KDB that describes the local relationships of each test instance can be inferred. On this basis, FD analysis is introduced into the learning procedure to improve model robustness.
The learning procedure of the local KDB is described as follows:
Algorithm 2. Local KDB.

For each test instance x = {x1, x2, ⋯, xn}:

1. For each attribute value xi ∈ x, compute the LMI I(xi; Y), where Y is the class.
2. Compute the CLMI I(xi; xj|Y) for each pair of attribute values xi and xj, where i ≠ j and xi, xj ∈ x.
3. Let the used variable list, S, be empty.
4. Let the Bayesian network being constructed, BN, begin with a single class node, Y.
5. Repeat until S includes all attribute values:
5.1. Select the attribute value xmax that is not in S and has the highest value I(xmax; Y).
5.2. Add a node representing xmax to BN.
5.3. Add an arc from Y to xmax in BN.
5.4. Add m = min(|S|, k) arcs from the m distinct attribute values xj in S with the highest values of I(xmax; xj|Y).
5.5. Add xmax to S.
6. Apply SER to substitute generalizations or eliminate redundant conditional probabilities. Compute the conditional probability tables inferred by the structure of BN using counts from DB, and output BN.
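The per-instance structure search mirrors the sketch given for Algorithm 1; in the sketch below, lmi and clmi stand for the instance-specific I(xi; Y) and I(xi; xj|Y) of Algorithm 2, and the SER step of step 6 is sketched separately after the FD criterion later in this section.

```python
# Minimal sketch of Algorithm 2's structure search for one test
# instance x. lmi(i) and clmi(i, j) are assumed to return I(x_i; Y)
# and I(x_i; x_j | Y) evaluated at the specific values in x.

def local_kdb_structure(x, k, lmi, clmi):
    order = sorted(range(len(x)), key=lmi, reverse=True)
    s, parents = [], {}
    for i in order:
        m = min(len(s), k)
        parents[i] = sorted(s, key=lambda j: clmi(i, j), reverse=True)[:m]
        s.append(i)
    return parents  # SER then substitutes/eliminates generalizations (step 6)
```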
The final classifier, AKDB, estimates the class membership probabilities by averaging the KDB and local KDB classifiers. The basic idea of AKDB can be explained from the perspective of medical diagnosis: KDB describes the basic relationships between different symptoms, which can be explained by domain knowledge learned from books or in school, whereas the local KDB describes the possible relationships between the symptoms of a specific patient. To make a definite diagnosis, rich experience (corresponding to the robust KDB model) and a flexible mind (corresponding to the dynamic local KDB model) are both necessary and important.
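In symbols, assuming a uniform average of the two posterior estimates (the equal weighting here is our illustrative assumption, not necessarily the exact combination rule defined later in this section), AKDB predicts

$$\hat{y} = \arg\max_{y} \frac{1}{2}\left(P_{B_G}(y \mid \mathbf{x}) + P_{B_L}(y \mid \mathbf{x})\right),$$

where $P_{B_G}$ and $P_{B_L}$ denote the class posteriors estimated by KDB and the local KDB, respectively.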
FDs require a method for inferring from the training data whether one attribute value is a generalization of another, using the following criterion:

$$F(x_i) = F(x_i, x_j) \quad \text{and} \quad F(x_i) \ge l$$

to infer that xj is a generalization of xi, where F(xi) is the number of training cases with value xi, F(xi, xj) is the number of training cases with both values, and l is a user-specified minimum frequency. A large number of determinant attributes on the left side of an FD would increase the risk of incorrect inference and, at the same time, require more computer memory to store credible FDs. Consequently, only one-one FDs are selected in our current work. Besides, as no formal method has been proposed for selecting an appropriate value of l, we use the same setting as that proposed by Webb et al. [15], i.e., l = 100, which was obtained from empirical studies.
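Under this criterion, detecting the credible one-one specialization-generalization pairs from counts might look like the following sketch (F is realized with a simple frequency counter; the function name is ours, and l = 100 as stated above):

```python
# Sketch: detecting one-one specialization-generalization pairs between
# attributes i and j using the criterion
#   F(x_i) = F(x_i, x_j)  and  F(x_i) >= l.
from collections import Counter

def find_generalizations(instances, i, j, l=100):
    """Return pairs (xi, xj) such that value xj of attribute j is
    inferred to be a generalization of value xi of attribute i."""
    f_xi = Counter(inst[i] for inst in instances)
    f_pair = Counter((inst[i], inst[j]) for inst in instances)
    pairs = []
    for (xi, xj), both in f_pair.items():
        if both == f_xi[xi] and f_xi[xi] >= l:
            pairs.append((xi, xj))  # every case with xi also has xj
    return pairs
```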
The learning framework of the local KDB is as follows: during training time, FD analysis is applied to detect all possible specialization-generalization relationships; during classification time, the local KDB first builds the basic network structure for each test instance t, then selects the specialization-generalization relationships that hold in t, and applies SER to refine the network structure. From the definitions of local mutual information and FD, we can see that both deal with attribute values rather than attributes. In the real world, the interdependencies may vary when attributes take different values: for some test instances, attributes Xi and Xj are independent, while for other test instances Xi may depend on Xj. Classical Bayesian classifiers, e.g., TAN and KDB, which build the network structure by computing mutual information and conditional mutual information, cannot resolve such situations, whereas the local KDB helps to remedy this limitation.
Another feature of our algorithm that makes it very suitable for data mining domains is its relatively small computational complexity. Computing the network structure of KDB requires O(n²mcv²) time (dominated by Step 2), and that of the local KDB requires only O(n²mc) time, where n is the number of attributes, m is the number of training instances, c is the number of classes, and v is the average number of discrete values that an attribute may take. Moreover, classifying an instance with either KDB or local KDB requires O(nck) time. Forming the additional two-dimensional probability estimate table for SER requires O(mn²v²) time, and classification of a single instance, which requires considering each pair of attributes to detect dependencies, has time complexity O(cn).
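As a hypothetical illustration of these orders of magnitude (the figures are assumed for exposition, not drawn from our experiments), take n = 20, m = 10^4, c = 2 and v = 5. The dominant structure-learning cost of KDB is then on the order of

$$n^2 m c v^2 = 20^2 \times 10^4 \times 2 \times 5^2 = 2 \times 10^8$$

operations, whereas the corresponding O(n²mc) term of the local KDB amounts to only 8 × 10^6.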