Open Access: this article is freely available and reusable
Entropy 2013, 15(7), 2716-2735; doi:10.3390/e15072716
Article
Efficient Approximation of the Conditional Relative Entropy with Applications to Discriminative Learning of Bayesian Network Classifiers
^{1} Department of Electrical Engineering, IST, University of Lisbon, Lisbon 1049-001, Portugal
^{2} PIA, Instituto de Telecomunicações, Lisbon 1049-001, Portugal
^{3} Department of Computer Science, IST, University of Lisbon, Lisbon 1049-001, Portugal
^{4} SQIG, Instituto de Telecomunicações, Lisbon 1049-001, Portugal
^{5} Department of Mathematics, IST, University of Lisbon, Lisbon 1049-001, Portugal
^{*} Author to whom correspondence should be addressed.
Received: 8 June 2013; in revised form: 3 July 2013 / Accepted: 3 July 2013 / Published: 12 July 2013
Abstract
We propose a minimum variance unbiased approximation to the conditional relative entropy of the distribution induced by the observed frequency estimates, for multi-classification tasks. This approximation extends a decomposable scoring criterion, named approximate conditional log-likelihood (aCLL), primarily used for discriminative learning of augmented Bayesian network classifiers. Our contribution is twofold: (i) it addresses multi-classification tasks and not only binary-classification ones; and (ii) it covers broader stochastic assumptions than a uniform distribution over the parameters. Specifically, we consider a Dirichlet distribution over the parameters, which is shown experimentally to yield a very good approximation to CLL. In addition, for Bayesian network classifiers, a closed-form equation is found for the parameters that maximize the scoring criterion.
Keywords:
conditional relative entropy; approximation; discriminative learning; Bayesian network classifiers
1. Introduction
Bayesian networks [1] are probabilistic graphical models that represent the joint probability distribution of a set of random variables. They encode specific conditional independence properties pertaining to the joint distribution via a directed acyclic graph (DAG). To achieve this, each vertex (aka node) in the DAG contains a random variable, and edges between them represent the dependencies between the variables. Specifically, given a DAG, a node is conditionally independent of its non-descendants given its parents. Besides serving as a representation of a set of independencies, the DAG also serves as a skeleton for factorizing a distribution via the chain rule of probability. The chief advantage of Bayesian networks is that they specify dependencies only when necessary, providing compact representations of complex domains that lead to a significant reduction in the cost of learning and inference.
Bayesian networks have been widely used for classification [2,3,4], being known in this context as Bayesian network classifiers (BNC). The use of generative learning methods in choosing the Bayesian network structure has been pointed out as the likely cause for their poor performance when compared to much simpler methods [2,5]. In contrast to generative learning, where the goal is to be able to describe (or generate) the entire data, discriminative learning focuses on the capacity of a model to discriminate between different classes. To achieve this end, generative methods usually maximize the log-likelihood (LL) or a score thereof, whereas discriminative methods focus on maximizing the conditional log-likelihood (CLL). Unfortunately, maximizing the CLL of a BNC turns out to be computationally much more challenging than maximizing LL. For this reason, the community has resorted to decomposing the learning procedure into generative-discriminative subtasks [3,6,7]. More recently, Carvalho et al. proposed a new scoring criterion, called approximate conditional log-likelihood (aCLL), for fully discriminative learning of BNCs, exhibiting good performance, both in terms of accuracy and computational cost [8]. The proposed scoring criterion was shown to be the minimum variance unbiased (MVU) approximation to CLL.
Despite the good performance of aCLL, the initial proposal has three significant restrictions. First, it was only devised for binary-classification tasks. Second, it was derived under the assumption of a uniform distribution over the parameters. Third, the parameters of the network that maximize the score remained unknown. In this paper, we address all these shortcomings, which makes it possible to apply aCLL in a broader setup, while maintaining the desired property of an MVU approximation to CLL. In order to lift the first two restrictions, we start by deriving the approximation for multi-classification tasks under much more relaxed stochastic assumptions. In this context, we consider multivariate symmetric distributions over the parameters and detail the pertinent cases of uniform and Dirichlet distributions. The constants required by the approximation are computed analytically for binary and ternary uniform distributions. In addition, a Monte Carlo method is proposed to compute these constants numerically for other distributions, including the Dirichlet. In addressing the third shortcoming, the parameters of the BNC that maximize the proposed approximation to the conditional log-likelihood (CLL) are derived. Finally, maximizing CLL is shown to be equivalent to minimizing the conditional relative entropy between the (conditional) distribution (of the class, given the other variables) induced by the observed frequency estimates and the one induced by the learned BNC model.
To gauge the performance of the proposed approximation, we conducted a set of experiments over 89 biologically relevant datasets already used in previous works [9,10]. The results show that the models that maximized aCLL under a symmetric Dirichlet assumption attain a higher CLL, with great statistical significance, in comparison with the generative models that maximized LL and the discriminative models that maximized aCLL under a symmetric uniform distribution. In addition, aCLL under a symmetric uniform distribution also significantly outperforms LL in obtaining models with higher CLL.
The paper is organized as follows. In Section 2, we review the essentials of BNCs and the aCLL approximation. In Section 3, we present the main contribution of this paper, namely, we extend aCLL to multi-classification tasks under general stochastic assumptions and derive the parameters that maximize aCLL for BNCs. Additionally, in Section 4, we relate the proposed scoring criterion with the conditional relative entropy, and in Section 5, we provide experimental results. Finally, we draw some conclusions and mention future work in Section 6.
2. Background
In this section, we review the basic concepts of BNCs required to understand the proposed methods. We then discuss the difference between generative and discriminative learning of BNCs and present the aCLL scoring criterion [8].
2.1. Bayesian Network Classifiers
Let X be a discrete random variable taking values in a countable set, $\mathcal{X}\subset \mathbb{R}$. In all that follows, the domain, $\mathcal{X}$, is finite. We denote an $(n+1)$-dimensional random vector by $\mathbf{X}=({X}_{1},\cdots ,{X}_{n},C)$, where each component, ${X}_{i}$, is a random variable over ${\mathcal{X}}_{i}$ and C is a random variable over $\mathcal{C}=\{1,\cdots ,s\}$. The variables, ${X}_{1},\cdots ,{X}_{n}$, are called attributes or features, and C is called the class variable. For each variable, ${X}_{i}$, we denote the elements of ${\mathcal{X}}_{i}$ by ${x}_{i1},\dots ,{x}_{i{r}_{i}}$, where ${r}_{i}$ is the number of values that ${X}_{i}$ can take. We say that ${x}_{ik}$ is the k-th value of ${X}_{i}$, with $k\in \{1,\cdots ,{r}_{i}\}$. The probability that $\mathbf{X}$ takes value, $\mathbf{x}$, is denoted by $P(\mathbf{x})$, with conditional probabilities, $P(\mathbf{x}\mid \mathbf{z})$, defined correspondingly.
A Bayesian network classifier (BNC) is a triple $B=(\mathbf{X},G,\Theta )$, where $\mathbf{X}=({X}_{1},\cdots ,{X}_{n},C)$ is a random vector. The network structure, $G=(\mathbf{X},E)$, is a directed acyclic graph (DAG) with nodes in $\mathbf{X}$ and edges E representing direct dependencies between the variables. We denote by ${\Pi}_{{X}_{i}}$ the (possibly empty) set of parents of ${X}_{i}$ in G. For efficiency purposes, it is common to restrict the dependencies between the attributes and the class variable, imposing all attributes to have the class variable as the parent; rigorously, these are called augmented naive Bayes classifiers, but it is common to refer to them abusively as BNCs. In addition, the parents of ${X}_{i}$ without the class variable are denoted by ${\Pi}_{{X}_{i}}^{*}={\Pi}_{{X}_{i}}\backslash \{C\}$. We denote the number of possible configurations of the parent set, ${\Pi}_{{X}_{i}}^{*}$, by ${q}_{i}^{*}$. The actual parent configurations are ordered (arbitrarily) and denoted by ${w}_{i1}^{*},\cdots ,{w}_{i{q}_{i}^{*}}^{*}$, and we say that ${w}_{ij}^{*}$ is the jth configuration of ${\Pi}_{{X}_{i}}^{*}$, with $j\in \{1,\cdots ,{q}_{i}^{*}\}$. Taking into account this notation, the third element of the BNC triple denotes the parameters, Θ, given by the families, ${\{{\theta}_{ijck}\}}_{i\in \{1\cdots n\},\phantom{\rule{0.166667em}{0ex}}j\in \{1,\cdots ,{q}_{i}^{*}\},\phantom{\rule{0.166667em}{0ex}}c\in \{1,\cdots ,s\},k\in \{1,\cdots ,{r}_{i}\}}$ and ${\{{\theta}_{c}\}}_{c\in \{1,\cdots ,s\}}$, which encode the local distributions of the network via:
$${P}_{B}(C=c)={\theta}_{c}\quad \text{and}\tag{1}$$
$${P}_{B}({X}_{i}={x}_{ik}\mid {\Pi}_{{X}_{i}}^{*}={w}_{ij}^{*},C=c)={\theta}_{ijck}\tag{2}$$
A BNC, B, defines a unique joint probability distribution over $\mathbf{X}$ given by:
$${P}_{B}({X}_{1},\cdots ,{X}_{n},C)={P}_{B}(C)\prod _{i=1}^{n}{P}_{B}({X}_{i}\mid {\Pi}_{{X}_{i}})\tag{3}$$
The conditional independence properties pertaining to the joint distribution are essentially determined by the network structure. Specifically, ${X}_{i}$ is conditionally independent of its non-descendants given its parents, ${\Pi}_{{X}_{i}}$, in G [1]; and so, C depends on all attributes (as desired).
The problem of learning a BNC given data, D, consists in finding the BNC that best fits D. This can be achieved by a score-based learning algorithm, where a scoring criterion is considered in order to quantify the fit of a BNC. Contributions in this area of research are typically divided into two different problems: scoring and searching. The scoring problem focuses on devising new scoring criteria to measure the goodness of a certain network structure given the data. On the other hand, the searching problem concentrates on identifying one or more network structures that yield a high value for the scoring criterion in mind. If the search is conducted with respect to a neighborhood structure defined on the space of possible solutions, then we are in the presence of local score-based learning.
Local score-based learning algorithms can be extremely efficient if the scoring criterion employed is decomposable, that is, if the scoring criterion can be expressed as a sum of local scores associated to each network node and its parents. In this case, any change over the network structure carried out during the search procedure is evaluated by considering only the difference to the score of the previously assessed network. The most common scoring criteria employed in BNC learning are reviewed in [11,12,13]. We refer the interested reader to newly developed scoring criteria in the works of Carvalho et al. [8], de Campos [14] and Silander et al. [15].
Unfortunately, even performing local score-based learning, searching for unrestricted BNC structures from data is NP-hard [16]. Worse than that, finding approximate solutions for unrestricted structures is also NP-hard [17]. These results led the community to search for the largest subclass of BNCs for which there is an optimal and efficient learning algorithm. The first attempts confined the network to tree-augmented naive (TAN) structures [2] and used the Edmonds [18] and Chow-Liu [19] optimal branching algorithms to learn the network. More general classes of BNCs have eluded efforts to develop optimal and efficient learning algorithms. Indeed, Chickering [20] showed that learning networks constrained to have in-degree at most two is already NP-hard.
2.2. Generative versus Discriminative Learning of Bayesian Network Classifiers
For convenience, we introduce some additional notation. Let data, D, be given by:
$$D=\{{y}_{1},\cdots ,{y}_{N}\},\quad \text{where}\quad {y}_{t}=({y}_{t}^{1},\cdots ,{y}_{t}^{n},{c}_{t})$$
Generative learning reduces to maximizing the likelihood of the data, by using the log-likelihood scoring criterion or a score thereof (for instance, [14,15]). The log-likelihood scoring criterion can be written as:
$$\mathrm{LL}(B\mid D)=\sum _{t=1}^{N}\log {P}_{B}({y}_{t}^{1},\cdots ,{y}_{t}^{n},{c}_{t})\tag{4}$$
On the other hand, discriminative learning concerns maximizing the conditional likelihood of the data. The reason why this is a form of discriminative learning is that it focuses on correctly discriminating between classes by maximizing the probability of obtaining the correct classification. The conditional loglikelihood (CLL) scoring criterion can be written as:
$$\text{CLL}(B\mid D)=\sum _{t=1}^{N}\log {P}_{B}({c}_{t}\mid {y}_{t}^{1},\cdots ,{y}_{t}^{n})\tag{5}$$
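To make the difference between the two criteria concrete, the following minimal sketch evaluates LL and CLL for a toy naive Bayes classifier. The parameter values, the single binary attribute and the four data instances are hypothetical and chosen only for illustration; they do not come from the experiments in this paper.

```python
import math

# Hypothetical naive Bayes BNC: one binary attribute X1, class C in {0, 1}.
theta_c = {0: 0.6, 1: 0.4}                 # P_B(C = c)
theta_x = {0: {0: 0.9, 1: 0.1},            # P_B(X1 = x | C = 0)
           1: {0: 0.2, 1: 0.8}}            # P_B(X1 = x | C = 1)

def joint(x, c):
    """P_B(x, c), factorized as P_B(C = c) * P_B(X1 = x | C = c)."""
    return theta_c[c] * theta_x[c][x]

def log_likelihood(data):
    """LL(B | D): sum of log joint probabilities."""
    return sum(math.log(joint(x, c)) for x, c in data)

def conditional_log_likelihood(data):
    """CLL(B | D): sum of log P_B(c | x) = log P_B(x, c) - log P_B(x)."""
    total = 0.0
    for x, c in data:
        evidence = sum(joint(x, k) for k in theta_c)   # P_B(x)
        total += math.log(joint(x, c)) - math.log(evidence)
    return total

data = [(0, 0), (0, 0), (1, 1), (0, 1)]    # toy (x1, c) instances
ll = log_likelihood(data)
cll = conditional_log_likelihood(data)
```

Since $P_B(c\mid x)\ge P_B(x,c)$, CLL is never smaller than LL on the same data; the two criteria differ only by the evidence term, which is the part that couples all class values together.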
Unlike LL, CLL does not decompose over the network structure, and therefore, there is no closed-form equation for optimal parameter estimates for the CLL scoring criterion. This issue was first approached by splitting the problem into two distinct tasks: finding optimal-CLL parameters [6,7] and finding optimal-CLL structures [3,21]. Although showing promising results, these approaches remain computationally expensive. Indeed, optimal-CLL parameters have been obtained by resorting to gradient descent methods, and optimal-CLL structures have been found only with global search methods, which makes both approaches very inefficient. Recently, a least-squares approximation to CLL, called approximate conditional log-likelihood (aCLL), was proposed, which enables fully discriminative learning of BNCs in a very efficient way [8]. The aCLL scoring criterion is presented in detail in the next section.
2.3. A First Approximation to the Conditional Log-Likelihood
In this section, we present the approximation to the conditional log-likelihood proposed in [8]. Therein, it was assumed that the class variable was binary, that is, $\mathcal{C}=\{0,1\}.$ For the binary case, the conditional probability of the class variable can then be written as:
$${P}_{B}({c}_{t}\mid {y}_{t}^{1},\cdots ,{y}_{t}^{n})=\frac{{P}_{B}({y}_{t}^{1},\cdots ,{y}_{t}^{n},{c}_{t})}{{P}_{B}({y}_{t}^{1},\cdots ,{y}_{t}^{n},{c}_{t})+{P}_{B}({y}_{t}^{1},\cdots ,{y}_{t}^{n},1-{c}_{t})}\tag{6}$$
For convenience, the two terms in the denominator were denoted by:
$${U}_{t}={P}_{B}({y}_{t}^{1},\cdots ,{y}_{t}^{n},{c}_{t})\quad \text{and}\quad {V}_{t}={P}_{B}({y}_{t}^{1},\cdots ,{y}_{t}^{n},1-{c}_{t})\tag{7}$$
so that Equation (6) becomes simply:
$${P}_{B}({c}_{t}\mid {y}_{t}^{1},\cdots ,{y}_{t}^{n})=\frac{{U}_{t}}{{U}_{t}+{V}_{t}}\tag{8}$$
The BNC, B, was omitted from the notation of both ${U}_{t}$ and ${V}_{t}$ for the sake of readability.
The log-likelihood (LL) and the conditional log-likelihood (CLL) now take the form:
$$\mathrm{LL}(B\mid D)=\sum _{t=1}^{N}\log {U}_{t},\quad \text{and}\tag{9}$$
$$\text{CLL}(B\mid D)=\sum _{t=1}^{N}\left(\log {U}_{t}-\log ({U}_{t}+{V}_{t})\right)\tag{10}$$
As mentioned in Section 2.1, an efficient scoring criterion must be decomposable. The LL score decomposes over the network structure. To better see why this is the case, substitute the expression for the joint probability distribution, in Equations (1)–(3), into the LL criterion, in Equation (9), to obtain:
$$\mathrm{LL}(B\mid D)=\sum _{i=1}^{n}\sum _{j=1}^{{q}_{i}^{*}}\sum _{k=1}^{{r}_{i}}\sum _{c=1}^{s}{N}_{ijck}\log {\theta}_{ijck}+\sum _{c=1}^{s}{N}_{c}\log {\theta}_{c}\tag{11}$$
where ${N}_{ijck}$ is the number of instances in D, where ${X}_{i}$ takes its k-th value, its parents (excluding the class variable) take their j-th configuration and C takes its c-th value, and ${N}_{c}$ is the number of instances where C takes its c-th value.
Unfortunately, CLL does not decompose over the network structure, because $\log ({U}_{t}+{V}_{t})$ cannot be expressed as a sum of local contributions. To achieve the decomposability of CLL, an approximation:
$$\widehat{f}({U}_{t},{V}_{t})=\alpha \log {U}_{t}+\beta \log {V}_{t}+\gamma \tag{12}$$
of the original function:
$$f({U}_{t},{V}_{t})=\log {U}_{t}-\log ({U}_{t}+{V}_{t})$$
was proposed in [8], where α, β and γ are real numbers to be chosen, so as to minimize the approximation error. To determine suitable values of α, β and γ, uniformity assumptions about ${U}_{t}$ and ${V}_{t}$ were made. To this end, let ${\Delta}^{2}=\{(x,y):x+y\le 1\text{ and }x,y\ge 0\}$ be the 2-simplex set. As ${U}_{t}$ and ${V}_{t}$ are expected to become exponentially small as the number of attributes grows, it was assumed that
$${U}_{t},{V}_{t}\le p<\frac{1}{2}$$
for some $0<p<\frac{1}{2}$. Combining this constraint with
$$({U}_{t},{V}_{t})\sim \text{Uniform}({\Delta}^{2})$$
yielded the following assumption:
Assumption 2.1
There exists a small positive $p<\frac{1}{2}$, such that:
$$({U}_{t},{V}_{t})\sim {\mathrm{Uniform}({\Delta}^{2})|}_{{U}_{t},{V}_{t}\le p}=\mathrm{Uniform}([0,p]\times [0,p])$$
In [8], it was shown that under Assumption 2.1, the values of $\alpha ,\beta $ and γ that minimize the mean square error (MSE) of $\widehat{f}$ with respect to $f$ are given by:
$$\alpha =\frac{{\pi}^{2}+6}{24},\phantom{\rule{1.em}{0ex}}\beta =\frac{{\pi}^{2}-18}{24}\phantom{\rule{1.em}{0ex}}\text{and}\phantom{\rule{1.em}{0ex}}\gamma =\frac{{\pi}^{2}}{12\ln (2)}-\left(2+\frac{({\pi}^{2}-6)\log p}{12}\right)\tag{13}$$
This resulted in a decomposable approximation for the CLL, called approximate CLL, defined as:
$$\mathrm{aCLL}(B\mid D)=\sum _{c=0}^{1}(\alpha {N}_{c}+\beta {N}_{1-c})\log {\theta}_{c}+\sum _{i=1}^{n}\sum _{j=1}^{{q}_{i}^{*}}\sum _{k=1}^{{r}_{i}}\sum _{c=0}^{1}(\alpha {N}_{ijck}+\beta {N}_{ij(1-c)k})\log {\theta}_{ijck}+N\gamma \tag{14}$$
This decomposable approximation has some desirable properties. It is unbiased, that is, the mean difference between $\widehat{f}$ and f is zero for all values of p. In addition, $\widehat{f}$ is the approximation with the lowest variance amongst unbiased ones, leading to a minimum variance unbiased (MVU) approximation of f. Moreover, since the goal is to maximize $\text{CLL}(B\mid D)$, the constant γ from Equation (14) can be dropped, yielding an approximation that disregards the value, p (used in Assumption 2.1). For this reason, p need not be known to maximize $\mathrm{aCLL}$. Despite these advantageous properties, the parameters that maximize aCLL remained unknown. In addition, the aCLL score is not directly applicable to multi-classification tasks, and no stochastic assumptions other than the uniformity of ${U}_{t}$ and ${V}_{t}$ were studied.
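The unbiasedness claim can be checked numerically. The sketch below is illustrative only: it assumes base-2 logarithms (suggested by the $\ln(2)$ factor in γ), takes $\beta =({\pi}^{2}-18)/24$ and γ as the difference of its two terms, and uses hypothetical function names. It samples $({U}_{t},{V}_{t})$ uniformly on ${[0,p]}^{2}$ and averages $\widehat{f}-f$.

```python
import math
import random

def acll_constants(p):
    """alpha, beta, gamma of Equation (13), assuming base-2 logarithms."""
    alpha = (math.pi ** 2 + 6) / 24
    beta = (math.pi ** 2 - 18) / 24
    gamma = (math.pi ** 2 / (12 * math.log(2))
             - (2 + (math.pi ** 2 - 6) * math.log2(p) / 12))
    return alpha, beta, gamma

def f(u, v):
    """Exact per-instance CLL term: log U - log(U + V)."""
    return math.log2(u) - math.log2(u + v)

def f_hat(u, v, alpha, beta, gamma):
    """Decomposable approximation of Equation (12)."""
    return alpha * math.log2(u) + beta * math.log2(v) + gamma

def mean_error(p, m=100_000, seed=0):
    """Monte Carlo estimate of E[f_hat - f] under Uniform([0, p]^2)."""
    rng = random.Random(seed)
    alpha, beta, gamma = acll_constants(p)
    total = 0.0
    for _ in range(m):
        # 1 - random() lies in (0, 1], which avoids log(0).
        u = p * (1.0 - rng.random())
        v = p * (1.0 - rng.random())
        total += f_hat(u, v, alpha, beta, gamma) - f(u, v)
    return total / m
```

Under these assumptions, `mean_error(p)` should stay close to zero for any choice of `p`, while `alpha - beta` equals one exactly, which is the relation used later when the two constants are merged into a single β.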
3. Extending the Approximation to CLL
The shortcomings of the initial aCLL proposal make it natural to explore a variety of extensions. Specifically, in this section, we derive a general closed-form expression for aCLL grounded in regression theory. This yields a new scoring criterion for multi-classification tasks, with broader stochastic assumptions than the uniform one, while maintaining the desirable property of an MVU approximation to CLL. In addition, we also provide the parameters that maximize this new general form in the context of BNCs.
3.1. Generalizing aCLL to Multi-Classification Tasks
We now set out to consider multi-classification tasks under a generalization of Assumption 2.1. In this case, let
$${U}_{t,{c}_{t}}={P}_{B}({y}_{t}^{1},\cdots ,{y}_{t}^{n},{c}_{t})$$
so that the conditional probability, ${P}_{B}({c}_{t}\mid {y}_{t}^{1},\cdots ,{y}_{t}^{n})$, in Equation (8) becomes, now,
$${P}_{B}({c}_{t}\mid {y}_{t}^{1},\cdots ,{y}_{t}^{n})=\frac{{U}_{t,{c}_{t}}}{{\displaystyle \sum _{c=1}^{s}{U}_{t,c}}}$$
where ${U}_{t,c}={P}_{B}({y}_{t}^{1},\cdots ,{y}_{t}^{n},c)$, for all $1\le c\le s$. Observe that the vectors, $({y}_{t}^{1},\cdots ,{y}_{t}^{n},c)$, for all $1\le c\le s$ with $c\ne {c}_{t}$, called the complement samples, may or may not occur in D. Hence, for multi-classification tasks, the conditional log-likelihood in Equation (10) can be rewritten as:
$$\text{CLL}(B\mid D)=\sum _{t=1}^{N}\left(\log {U}_{t,{c}_{t}}-\log \left(\sum _{c=1}^{s}{U}_{t,c}\right)\right)$$
In this case, the approximation, $\widehat{f}$, in Equation (12) consists now in approximating:
$$f({U}_{t,1},\cdots ,{U}_{t,s})=\log {U}_{t,{c}_{t}}-\log \left(\sum _{c=1}^{s}{U}_{t,c}\right)$$
by a function of the form:
$$\widehat{f}({U}_{t,1},\cdots ,{U}_{t,s})=\alpha \log {U}_{t,{c}_{t}}+\sum _{c\ne {c}_{t}}{\beta}_{c}\log {U}_{t,c}+\gamma \tag{15}$$
Notice that ${\beta}_{{c}_{t}}$ is not defined in Equation (15); hence, considering ${\beta}_{{c}_{t}}=\alpha -1$, the approximation in Equation (15) is equivalent to the following linear approximation:
$$\begin{aligned}-\log \left(\sum _{c=1}^{s}{U}_{t,c}\right)&=f({U}_{t,1},\cdots ,{U}_{t,s})-\log {U}_{t,{c}_{t}}\\ &\approx \widehat{f}({U}_{t,1},\cdots ,{U}_{t,s})-\log {U}_{t,{c}_{t}}\\ &=(\alpha -1)\log {U}_{t,{c}_{t}}+\sum _{c\ne {c}_{t}}{\beta}_{c}\log {U}_{t,c}+\gamma \\ &=\sum _{c=1}^{s}{\beta}_{c}\log {U}_{t,c}+\gamma \end{aligned}\tag{16}$$
Therefore, we aim at minimizing the expected squared error for the approximation in Equation (16). We are able to achieve this if we assume that the joint distribution, $({U}_{t,1},\cdots ,{U}_{t,s})$, is symmetric.
Assumption 3.1
For all permutations, $({\pi}_{1},\cdots ,{\pi}_{s})$, we have that $({U}_{t,1},\cdots ,{U}_{t,s})\sim ({U}_{t,{\pi}_{1}},\cdots ,{U}_{t,{\pi}_{s}}).$
Assumption 3.1 imposes that ${U}_{t,c}$ and ${U}_{t,{c}^{\prime}}$ are identically distributed, for all $1\le c,{c}^{\prime}\le s$, that is, ${U}_{t,c}\sim {U}_{t,{c}^{\prime}}$. This is clearly much more general than the symmetric uniformity imposed in Assumption 2.1 for binary classification.
Under Assumption 3.1, the approximation in Equation (16) is such that ${\beta}_{1}=\cdots ={\beta}_{s}$. Let β denote the common value, and consider that:
$$A=-\log \left(\sum _{c=1}^{s}{U}_{t,c}\right)\phantom{\rule{1.em}{0ex}}\text{and}\phantom{\rule{1.em}{0ex}}B=\sum _{c=1}^{s}\log {U}_{t,c}\tag{17}$$
Our goal is to find β and γ that minimize the following expected value:
$$E[{(A-(\beta B+\gamma ))}^{2}]\tag{18}$$
Let ${\sigma}_{A}^{2}$ and ${\sigma}_{B}^{2}$ denote the variance of A and B, respectively, and let ${\sigma}_{AB}$ denote the covariance between A and B. Standard regression allows us to derive the next result [22].
Theorem 3.2
Assume that ${\sigma}_{B}^{2}>0$. The unique values of β and γ that minimize Equation (18) are given by:
$$\beta =\frac{{\sigma}_{AB}}{{\sigma}_{B}^{2}}\phantom{\rule{1.em}{0ex}}\text{and}\phantom{\rule{1.em}{0ex}}\gamma =E[A-\beta B]\tag{19}$$
Moreover, it follows that:
$$E[A-(\beta B+\gamma )]=0,\phantom{\rule{1.em}{0ex}}\text{and}\tag{20}$$
$$E[{(A-(\beta B+\gamma ))}^{2}]={\sigma}_{A}^{2}-\frac{{\sigma}_{AB}^{2}}{{\sigma}_{B}^{2}}\tag{21}$$
From Equation (20), we conclude that the approximation in Equation (16) is unbiased, and since it minimizes Equation (18), it is an MVU approximation. As in [8], we are adopting estimation terminology for approximations. In this case, by taking $\widehat{A}=(\beta B+\gamma )$, Equation (18) is precisely $E[{(A-\widehat{A})}^{2}]=\mathrm{MSE}(\widehat{A})$, where MSE is the mean squared error of the approximation/estimation. The MSE coincides with the variance when the approximation/estimator is unbiased, and so, the approximation in Theorem 3.2 is MVU. In addition, the standard error of the approximation in Equation (16) is given by the square root of Equation (21).
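As a small illustration of Theorem 3.2, the sketch below fits β and γ from empirical moments and checks Equations (20) and (21) on the fitted residuals. The function name `fit_beta_gamma` and the five $(B,A)$ pairs are hypothetical; with empirical averages in place of expectations, both identities hold up to floating-point error.

```python
def fit_beta_gamma(pairs):
    """Return (beta, gamma, mse) for the least-squares fit A ~ beta * B + gamma.

    `pairs` is a list of (B, A) samples; all moments are empirical averages.
    """
    m = len(pairs)
    mean_b = sum(b for b, _ in pairs) / m
    mean_a = sum(a for _, a in pairs) / m
    var_b = sum((b - mean_b) ** 2 for b, _ in pairs) / m
    var_a = sum((a - mean_a) ** 2 for _, a in pairs) / m
    cov_ab = sum((b - mean_b) * (a - mean_a) for b, a in pairs) / m
    beta = cov_ab / var_b                  # Equation (19)
    gamma = mean_a - beta * mean_b         # Equation (19)
    mse = var_a - cov_ab ** 2 / var_b      # Equation (21)
    return beta, gamma, mse

# Hypothetical (B, A) samples, for illustration only.
pairs = [(-4.0, 1.1), (-3.0, 0.9), (-2.5, 0.6), (-1.0, 0.2), (-0.5, 0.3)]
beta, gamma, mse = fit_beta_gamma(pairs)
residuals = [a - (beta * b + gamma) for b, a in pairs]
mean_residual = sum(residuals) / len(residuals)          # Equation (20)
mean_sq_residual = sum(r * r for r in residuals) / len(residuals)
```

Here `mean_residual` is zero up to rounding (unbiasedness) and `mean_sq_residual` matches `mse`, so the standard error of the fit is simply `mse ** 0.5`.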
The previous result allows us to generalize the aCLL scoring criterion for multi-classification tasks. The values of β and γ needed by the approximation in Equation (16) are given by Equation (19). This results in a decomposable approximation of CLL and a generalization of aCLL for multi-classification as:
$$\mathrm{aCLL}(B\mid D)=\sum _{t=1}^{N}\left(\log {U}_{t,{c}_{t}}+\sum _{c=1}^{s}\beta \log {U}_{t,c}+\gamma \right)\tag{22}$$
3.1.1. Symmetric Uniform Assumption for Multi-Classification Tasks
Herein, we analyze aCLL under the symmetric uniform assumption. We start by providing a general explanation of why the aCLL approximation is robust to the choice of p, which was already noticed for the binary case. In addition, we confirm that the analysis in Section 2.3 for the binary case coincides with that given in Section 3.1 when $s=2$. Finally, we provide the constants β and γ for ternary-classification tasks, under the symmetric uniform assumption of $({U}_{t,1},{U}_{t,2},{U}_{t,3})$.
Assumption 3.3
Let $({U}_{t,1},\cdots ,{U}_{t,s})\sim Uniform({[0,p]}^{s})$.
Start by noticing that under Assumption 3.3, changes in p correspond to multiplying the random vector, $({U}_{t,1},\cdots ,{U}_{t,s})$, by a scale factor. Indeed, if each random variable is multiplied by a common scale factor, κ, it results in the addition of constant values to A and B. To this end, note that:
$$-\log \left(\sum _{c=1}^{s}\kappa {U}_{t,c}\right)=-\log \left(\kappa \sum _{c=1}^{s}{U}_{t,c}\right)=-\log (\kappa )-\log \left(\sum _{c=1}^{s}{U}_{t,c}\right)$$
and:
$$\sum _{c=1}^{s}\log (\kappa {U}_{t,c})=\sum _{c=1}^{s}\left(\log \kappa +\log {U}_{t,c}\right)=s\log (\kappa )+\sum _{c=1}^{s}\log {U}_{t,c}$$
Since these additive terms have no effect on variances and covariances, it follows that β is not affected by changes in p. Therefore, the choice of p is irrelevant for maximizing aCLL when a uniform distribution is chosen. Moreover, it is enough to obtain the parameter, β, as by Equation (22) aCLL maximization is insensitive to the constant term, γ.
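The scale-invariance argument can be sketched in a few lines. The code below assumes the sign convention $A=-\log ({\sum}_{c}{U}_{t,c})$ and $B={\sum}_{c}\log {U}_{t,c}$, under which β is negative as in Equation (13). The five sample vectors and the factor `kappa` are hypothetical; the point is only that rescaling leaves β unchanged and shifts γ by a constant.

```python
import math

def fit(vectors):
    """Fit A ~ beta * B + gamma, with A = -log(sum_c u_c) and B = sum_c log(u_c)."""
    a_vals = [-math.log(sum(u)) for u in vectors]
    b_vals = [sum(math.log(x) for x in u) for u in vectors]
    m = len(vectors)
    mean_a = sum(a_vals) / m
    mean_b = sum(b_vals) / m
    cov_ab = sum((a - mean_a) * (b - mean_b) for a, b in zip(a_vals, b_vals)) / m
    var_b = sum((b - mean_b) ** 2 for b in b_vals) / m
    beta = cov_ab / var_b
    return beta, mean_a - beta * mean_b

# Hypothetical samples of (U_{t,1}, U_{t,2}, U_{t,3}), for illustration only.
vectors = [(0.2, 0.5, 0.1), (0.7, 0.3, 0.9), (0.05, 0.6, 0.4),
           (0.8, 0.1, 0.2), (0.35, 0.45, 0.15)]
s = 3
kappa = 1e-3
scaled = [tuple(kappa * x for x in u) for u in vectors]

beta, gamma = fit(vectors)
beta_scaled, gamma_scaled = fit(scaled)
# A shifts by -log(kappa) and B by s*log(kappa), so gamma shifts by:
expected_gamma_shift = -(1 + s * beta) * math.log(kappa)
```

The two fitted slopes agree to rounding error, and `gamma_scaled - gamma` equals `expected_gamma_shift`, which is why only β matters when maximizing the score.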
We also stress that for the binary case, with $({U}_{t,1},{U}_{t,2})\sim \mathrm{Uniform}({[0,p]}^{2})$, the values of β and γ, given by Equation (19), with $\alpha =1+\beta $, coincide with those given by Equation (13).
Finally, by using Mathematica 9.0, we were able to obtain an analytical expression of β for ternary classification tasks.
Example 3.4
For the ternary case where $({U}_{t,1},{U}_{t,2},{U}_{t,3})\sim Uniform({[0,p]}^{3})$, the constant, β, that minimizes Equation (18) is
$$\frac{1}{36}\left(15{\pi}^{2}2\left(11+9ln(3)12ln(2)+60{ln}^{2}(2)+72{\mathrm{Li}}_{2}(2)+24{\mathrm{Li}}_{2}(1/4)\right)\right)$$
where ${\mathrm{Li}}_{n}(z)$ is the polylogarithm function of order n.
3.1.2. Symmetric Dirichlet Assumption for Multi-Classification Tasks
In this section, we provide an alternative assumption, which will lead us to a very good approximation of CLL. Instead of a symmetric uniform distribution, we assume that the random vector, $({U}_{t,1},\cdots ,{U}_{t,s},{W}_{t})$, where ${W}_{t}=1-{\sum}_{c=1}^{s}{U}_{t,c}$, follows a symmetric Dirichlet distribution. In the following, we omit the random variable, ${W}_{t}$, from the vector, as it is completely defined given ${U}_{t,1},\cdots ,{U}_{t,s}$. The Dirichlet distribution is used because it is the conjugate family for a multinomial sample; this ties in well with the fact that data is assumed to be a multinomial sample when learning BNCs [23].
In order to take advantage of Theorem 3.2, we consider the following symmetric Dirichlet assumption:
Assumption 3.5
Let $({U}_{t,1},\cdots ,{U}_{t,s})\sim Dir({a}_{1},\cdots ,{a}_{s},b)$, where ${a}_{c}=a$, for all $c=1,\cdots ,s$.
Assumption 3.5 implies that the tuple, $({y}_{t}^{1},\cdots ,{y}_{t}^{n},c)$, occurs in the data exactly $a-1$ times, for all $c=1,\cdots ,s$. Moreover, there are $b-1$ instances in the data different from $({y}_{t}^{1},\cdots ,{y}_{t}^{n},c)$, for all $c=1,\cdots ,s$. Given a dataset of size N, it is reasonable to assume $a=1$ and $b=N$. Indeed, probabilities ${U}_{t,c}$ are expected to be very low, becoming exponentially smaller as the number of attributes, n, grows; therefore, it is reasonable to assume that $({y}_{t}^{1},\cdots ,{y}_{t}^{n},{c}_{t})$ occurs only once in the data and that its complement instances, $({y}_{t}^{1},\cdots ,{y}_{t}^{n},c)$, with $c\ne {c}_{t}$, do not even occur in the data. Thus, by starting with a non-informative prior, that is, with distribution $\mathrm{Dir}(1,\cdots ,1,1)$ over the s-simplex and, then, conditioning this distribution over the multinomial observation of the data of size N, we obtain the prior:
$$({U}_{t,1},\cdots ,{U}_{t,s})\sim \mathrm{Dir}({a}_{1},\cdots ,{a}_{s},N)\tag{23}$$
where ${a}_{c}=1$ for $c\ne {c}_{t}$ and ${a}_{{c}_{t}}=2$. However, such a distribution is asymmetric, and it does not satisfy the conditions of Theorem 3.2. We address this problem by considering the symmetric distribution:
$$({U}_{t,1},\cdots ,{U}_{t,s})\sim \mathrm{Dir}(1,\cdots ,1,N)\tag{24}$$
which is very close to the distribution in Equation (23). We stress that the goal of any assumption is to find good approximations for the constants, β and γ, and that the conditions upon which such approximations were performed need not hold true exactly.
3.1.3. Estimating Parameters β and γ
We now focus our attention on finding, numerically, the values of β and γ, via a Monte Carlo method. To this end, several random vectors, $({u}_{1},\cdots ,{u}_{s})$, are generated with the envisaged distribution in order to define a set, $\mathcal{S}$, of pairs $(B,A)$ given by $A=-\log ({\sum}_{c=1}^{s}{u}_{c})$ and $B={\sum}_{c=1}^{s}\log ({u}_{c})$. In this way, we are sampling the random variables, A and B, as defined in Equation (17). Given $\mathcal{S}$, it is straightforward to find the best linear fit of the form $A=\widehat{\beta}B+\widehat{\gamma}$ and moreover, by the strong law of large numbers, $\widehat{\beta}$ and $\widehat{\gamma}$ converge almost surely (and hence in probability) to β and γ, respectively, as the set, $\mathcal{S}$, grows. The general method is described in Algorithm 1.
Algorithm 1 General Monte Carlo method to estimate β and γ
Input: number of samples, m

Algorithm 1 works for any distribution (even for nonsymmetric ones). The idea in Step 3 is to use the standard simulation technique of using the cumulative distribution function (CDF) of the univariate marginal for ${U}_{t,1}$ to generate a sample for the first component, ${u}_{1}$, of the vector and, afterwards, apply the CDF of the univariate conditional distributions, to generate samples for the remaining components, ${u}_{2},\cdots ,{u}_{s}$.
We used Algorithm 1 to cross-check the values of β and γ for the case of a symmetric uniform distribution obtained analytically in Equation (13) and Example 3.4, for binary and ternary classification tasks, respectively.
Example 3.6
For binary classification under Assumption 3.3 and for $m=20,000$, the estimates agreed with the analytical expression in Equation (13) to four decimal places. For ternary classification, we required 80,000 samples to achieve the same precision for the analytical expression given in Example 3.4. We estimated the values of β and γ to be in this case $\widehat{\beta}=-0.198873$ and $\widehat{\gamma}=1.85116$.
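The cross-check in this example can be reproduced along the following lines. This is a minimal sketch and not the authors' implementation: it specializes the inverse-CDF sampling step to $\mathrm{Uniform}({[0,p]}^{s})$, where the components are independent and each inverse CDF is simply $p\cdot r$ for a uniform draw r. It assumes natural logarithms and the convention $A=-\log ({\sum}_{c}{u}_{c})$; the function name, seed and the value $p=0.001$ are hypothetical.

```python
import math
import random

def estimate_uniform(s, p, m, seed=0):
    """Monte Carlo estimate of (beta, gamma) under Uniform([0, p]^s)."""
    rng = random.Random(seed)
    sum_a = sum_b = sum_bb = sum_ab = 0.0
    for _ in range(m):
        # Inverse-CDF sampling of each component; 1 - random() is in (0, 1],
        # so no component is exactly zero and the logarithms are defined.
        u = [p * (1.0 - rng.random()) for _ in range(s)]
        a = -math.log(sum(u))
        b = sum(math.log(x) for x in u)
        sum_a += a
        sum_b += b
        sum_bb += b * b
        sum_ab += a * b
    mean_a = sum_a / m
    mean_b = sum_b / m
    # Least-squares slope and intercept of A on B (Theorem 3.2).
    beta = (sum_ab / m - mean_a * mean_b) / (sum_bb / m - mean_b ** 2)
    gamma = mean_a - beta * mean_b
    return beta, gamma

beta_binary, _ = estimate_uniform(s=2, p=0.001, m=50_000)
beta_ternary, _ = estimate_uniform(s=3, p=0.001, m=50_000)
beta_analytic = (math.pi ** 2 - 18) / 24     # binary value from Equation (13)
```

With these settings the binary estimate should land close to `beta_analytic`, and the ternary estimate close to the value reported above. The estimate of γ depends on both the logarithm base and p, so it is not compared here.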
Although Algorithm 1 works well for Dirichlet distributions, we note that there are simpler and more efficient ways to generate samples for a distribution, $\mathrm{Dir}({a}_{1},\cdots ,{a}_{s},{a}_{s+1})$. Consider $s+1$ independent Gamma random variables, ${Y}_{i}\sim \mathrm{Gamma}({a}_{i},1)$, for $i=1,\cdots ,s+1$, and let ${Y}_{0}={\sum}_{i=1}^{s+1}{Y}_{i}$. Then, it is well known that $({Y}_{1}/{Y}_{0},\cdots ,{Y}_{s}/{Y}_{0})\sim \mathrm{Dir}({a}_{1},\cdots ,{a}_{s},{a}_{s+1})$. Clearly, it is much simpler to sample $s+1$ independent random variables than to sample marginal and conditional distributions. The modified algorithm is presented in Algorithm 2.
Algorithm 2 Monte Carlo method to estimate β and γ for Dir$(a,\cdots ,a,b)$
Input: number of samples, m, and hyperparameters, a and b.

Next, we illustrate the use of Algorithm 2 under the conditions discussed in Equation (24).
Example 3.7
For binary classification under Assumption 3.5 and with ${a}_{1}={a}_{2}=1$, $b=1,000$ and $m=100,000$, the estimated values for β and γ are $\widehat{\beta}=0.39291$ and $\widehat{\gamma}=0.61698$, respectively. For ternary classification, and under the same assumptions, the estimated values for β and γ are $\widehat{\beta}=0.239266$ and $\widehat{\gamma}=0.6085$, respectively.
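The Gamma-based sampling scheme described for Algorithm 2 can be sketched as follows. This is only a minimal illustration in Python (the experiments in this article were run in Mathematica), not the authors' implementation. The function names `sample_dirichlet_head` and `estimate_beta_gamma` are hypothetical, natural logarithms are assumed, the fit is an ordinary least-squares line of $A$ on $B$, and $A$ and $B$ follow the definitions as printed in the text above.

```python
import math
import random


def sample_dirichlet_head(alphas, rng):
    """Draw from Dir(alphas) via normalized Gamma variables and
    return the first len(alphas) - 1 components (u_1, ..., u_s)."""
    ys = [rng.gammavariate(a, 1.0) for a in alphas]  # Y_i ~ Gamma(a_i, 1)
    y0 = sum(ys)                                     # Y_0 = sum of all Y_i
    return [y / y0 for y in ys[:-1]]


def estimate_beta_gamma(s, a, b, m, seed=0):
    """Monte Carlo estimate of the least-squares line A = beta * B + gamma
    for (u_1, ..., u_s) taken from Dir(a, ..., a, b), using m samples."""
    rng = random.Random(seed)
    alphas = [a] * s + [b]
    a_vals, b_vals = [], []
    for _ in range(m):
        u = sample_dirichlet_head(alphas, rng)
        a_vals.append(math.log(sum(u)))               # A = log(sum_c u_c)
        b_vals.append(sum(math.log(x) for x in u))    # B = sum_c log(u_c)
    mean_a = sum(a_vals) / m
    mean_b = sum(b_vals) / m
    cov_ab = sum((x - mean_b) * (y - mean_a) for x, y in zip(b_vals, a_vals))
    var_b = sum((x - mean_b) ** 2 for x in b_vals)
    beta_hat = cov_ab / var_b                         # slope of the fit
    gamma_hat = mean_a - beta_hat * mean_b            # intercept of the fit
    return beta_hat, gamma_hat


if __name__ == "__main__":
    # Setting of Example 3.7 for binary classification: Dir(1, 1, 1000).
    print(estimate_beta_gamma(s=2, a=1.0, b=1000.0, m=100_000))
```

With `s=2`, `a=1.0` and `b=1000.0`, the call mirrors the setting of Example 3.7. The estimates should be comparable in magnitude to the values reported there, up to Monte Carlo error and the sign and logarithm conventions of Equation (17), which are not restated in this section.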
3.2. Parameter Maximization for aCLL
The main goal for devising an approximation to CLL is to obtain an expression that decomposes over the BNC structure. By plugging in the probability expansion of Equation (3) in the general aCLL scoring criterion for multiclassification given in Equation (22), we obtain:
$$\mathrm{aCLL}(B\mid D)=\sum _{c=1}^{s}\left(\alpha {N}_{c}+\beta \sum _{{c}^{\prime}\ne c}{N}_{{c}^{\prime}}\right)\log({\theta}_{c})+\sum _{i=1}^{n}\sum _{j=1}^{{q}_{i}^{*}}\sum _{k=1}^{{r}_{i}}\sum _{c=1}^{s}\left(\alpha {N}_{ijck}+\beta \sum _{{c}^{\prime}\ne c}{N}_{ij{c}^{\prime}k}\right)\log({\theta}_{ijck})+N\gamma \tag{25}$$
where $\alpha =1+\beta $. This score is decomposable, as it allows one to compute, independently, the contribution of each node (and its parents) to the global score. However, the values of the parameters, ${\theta}_{ijck}$ and ${\theta}_{c}$, that maximize the aCLL score in Equation (25) remain unknown. This problem was left open in [8], and a further approximation was required to obtain optimal BNC parameters.
We are able to obtain the optimal values of ${\theta}_{ijck}$ and ${\theta}_{c}$ by assuming that they are lower-bounded. This lower bound follows naturally from adopting pseudocounts, commonly used in BNCs to smooth observed frequencies with Dirichlet priors and increase the quality of the classifier [2]. Pseudocounts impose the common-sense assumption that no situation has probability zero. Indeed, it is a common mistake to assign a probability of zero to an event that is extremely unlikely, but not impossible [24].
Theorem 3.8
Let ${N}^{\prime}>0$ be the number of pseudocounts. The parameters, ${\theta}_{ijck}$, that maximize the aCLL scoring criterion in Equation (25) are given by:
$${\theta}_{ijck}=\frac{{N}_{ij+ck}}{{N}_{ij+c}}\quad \text{and}\quad {\theta}_{c}=\frac{{N}_{+c}}{{N}_{+}} \tag{26}$$
where:
$${N}_{ij+ck}=\begin{cases}\alpha {N}_{ijck}+\beta \sum _{{c}^{\prime}\ne c}{N}_{ij{c}^{\prime}k} & \text{if } \alpha {N}_{ijck}+\beta \sum _{{c}^{\prime}\ne c}{N}_{ij{c}^{\prime}k}\ge {N}^{\prime}\\ {N}^{\prime} & \text{otherwise,}\end{cases}$$
$${N}_{+c}=\begin{cases}\alpha {N}_{c}+\beta \sum _{{c}^{\prime}\ne c}{N}_{{c}^{\prime}} & \text{if } \alpha {N}_{c}+\beta \sum _{{c}^{\prime}\ne c}{N}_{{c}^{\prime}}\ge {N}^{\prime}\\ {N}^{\prime} & \text{otherwise,}\end{cases}$$
$${N}_{ij+c}=\sum _{k=1}^{{r}_{i}}{N}_{ij+ck}\quad \text{and}\quad {N}_{+}=\sum _{c=1}^{s}{N}_{+c},$$
constrained to ${\theta}_{ijck}\ge \frac{{N}^{\prime}}{{N}_{ij+c}}$ and ${\theta}_{c}\ge \frac{{N}^{\prime}}{{N}_{+}}$, for all $i,j,c$ and k.
Proof:
We only show the maximization for the parameters, ${\theta}_{ijck}$, as the maximization for ${\theta}_{c}$ is similar. Then, by taking the summand of Equation (25), which depends only on the parameters, ${\theta}_{ijck}$, we have:
$$\begin{aligned}& \sum _{i=1}^{n}\sum _{j=1}^{{q}_{i}^{*}}\sum _{k=1}^{{r}_{i}}\sum _{c=1}^{s}\left(\alpha {N}_{ijck}+\beta \sum _{{c}^{\prime}\ne c}{N}_{ij{c}^{\prime}k}\right)\log\left({\theta}_{ijck}\right)\\ &\quad =\sum _{i=1}^{n}\sum _{j=1}^{{q}_{i}^{*}}\sum _{k=1}^{{r}_{i}}\sum _{c=1}^{s}\left[\underset{(a)}{\underbrace{{N}_{ij+ck}\log\left({\theta}_{ijck}\right)}}+\underset{(b)}{\underbrace{\left(\left(\alpha {N}_{ijck}+\beta \sum _{{c}^{\prime}\ne c}{N}_{ij{c}^{\prime}k}\right)-{N}_{ij+ck}\right)\log\left({\theta}_{ijck}\right)}}\right]\end{aligned} \tag{27}$$
Observe that if $\alpha {N}_{ijck}+\beta {\sum}_{{c}^{\prime}\ne c}{N}_{ij{c}^{\prime}k}\ge {N}^{\prime}$, then ${N}_{ij+ck}=\alpha {N}_{ijck}+\beta {\sum}_{{c}^{\prime}\ne c}{N}_{ij{c}^{\prime}k}$. Thus, summand (b) in Equation (27) is only different from zero when $\alpha {N}_{ijck}+\beta {\sum}_{{c}^{\prime}\ne c}{N}_{ij{c}^{\prime}k}<{N}^{\prime}$. In this case, ${N}_{ij+ck}={N}^{\prime}$, which implies that $(\alpha {N}_{ijck}+\beta {\sum}_{{c}^{\prime}\ne c}{N}_{ij{c}^{\prime}k})-{N}^{\prime}<0$. Since the coefficient of $\log({\theta}_{ijck})$ in summand (b) is then negative, the value for ${\theta}_{ijck}$ that maximizes summand (b) is the minimal value for ${\theta}_{ijck}$, that is, $\frac{{N}^{\prime}}{{N}_{ij+c}}=\frac{{N}_{ij+ck}}{{N}_{ij+c}}$. Finally, by Gibbs' inequality, we derive that the distribution for ${\theta}_{ijck}$ that maximizes summand (a) in Equation (27) is ${\theta}_{ijck}=\frac{{N}_{ij+ck}}{{N}_{ij+c}}$. Since the maximality of the summands, (a) and (b), is obtained with the same values for ${\theta}_{ijck}$, we have that the values for ${\theta}_{ijck}$ that maximize Equation (25) are given by: ${\theta}_{ijck}=\frac{{N}_{ij+ck}}{{N}_{ij+c}}.$ ☐
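The closed form in Theorem 3.8 is simple to evaluate for one attribute and one parent configuration. The sketch below is a hedged illustration, not the authors' code: the function name `acll_parameters`, the nested-list layout `counts[c][k]` for ${N}_{ijck}$, and the numeric values of `alpha`, `beta` and `n_prime` in the usage lines are assumptions chosen only so that the clamping at ${N}^{\prime}$ is visible.

```python
def acll_parameters(counts, alpha, beta, n_prime):
    """Evaluate theta_ijck = N_{ij+ck} / N_{ij+c} for fixed i and j.

    counts[c][k] holds N_ijck for class c and attribute value k.
    Each adjusted count is clamped from below at the pseudocount n_prime.
    """
    s = len(counts)        # number of classes
    r = len(counts[0])     # number of values of attribute X_i
    adjusted = []
    for c in range(s):
        row = []
        for k in range(r):
            others = sum(counts[cp][k] for cp in range(s) if cp != c)
            value = alpha * counts[c][k] + beta * others
            row.append(max(value, n_prime))   # N_{ij+ck}
        adjusted.append(row)
    # Normalize per class: N_{ij+c} = sum_k N_{ij+ck}
    return [[v / sum(row) for v in row] for row in adjusted]


if __name__ == "__main__":
    # Hypothetical counts and constants (alpha = 1 + beta), for illustration only.
    theta = acll_parameters([[10, 0], [2, 8]], alpha=0.6, beta=-0.4, n_prime=0.5)
    for row in theta:
        print(row, sum(row))
```

Because every adjusted count is at least ${N}^{\prime}$, each returned row is a proper distribution over k and satisfies the lower bound ${\theta}_{ijck}\ge {N}^{\prime}/{N}_{ij+c}$ stated in the theorem.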
The role of the pseudocounts is to guarantee that the values, ${N}_{ij+ck}$ or ${N}_{+c}$, cannot be smaller than ${N}^{\prime}$. By plugging the parameters given in Equation (26) of Theorem 3.8 into the aCLL criterion in Equation (25), we obtain:
$$\widehat{\mathrm{a}}\text{CLL}(G\mid D)=\sum _{c=1}^{s}\left(\alpha {N}_{c}+\beta \sum _{{c}^{\prime}\ne c}{N}_{{c}^{\prime}}\right)\log\left(\frac{{N}_{+c}}{{N}_{+}}\right)+\sum _{i=1}^{n}\sum _{j=1}^{{q}_{i}^{*}}\sum _{k=1}^{{r}_{i}}\sum _{c=1}^{s}\left(\alpha {N}_{ijck}+\beta \sum _{{c}^{\prime}\ne c}{N}_{ij{c}^{\prime}k}\right)\log\left(\frac{{N}_{ij+ck}}{{N}_{ij+c}}\right)+N\gamma \tag{28}$$
The notation using G as an argument instead of B emphasizes that once the parameters, Θ, are decided upon, the criterion is a function of the network structure, G, only.
Finally, we show that aCLL with observed frequency estimates (OFE), written $\widehat{\mathrm{a}}$CLL, is not score-equivalent. Two BNCs are said to be equivalent if they can represent precisely the same set of distributions. Verma and Pearl [25] showed that this is equivalent to checking whether the underlying DAGs of the two BNCs have the same skeleton and the same v-structures. A score-equivalent scoring criterion is one that assigns the same score to equivalent BNC structures [12,14,26].
Theorem 3.9
The $\widehat{\mathrm{a}}$CLL scoring criterion is decomposable and non-score-equivalent.
Proof:
Decomposability follows directly from the definition in Equation (28). Concerning non-score-equivalence, it suffices to provide a counterexample in which two equivalent structures do not score the same. To this purpose, consider a BNC with two attributes ($n=2$) and $D=\{(0,0,1),(0,1,1),(1,1,0),(1,1,1)\}$ as the training set. The structures, $G\equiv {X}_{1}\to {X}_{2}$ and $H\equiv {X}_{2}\to {X}_{1}$ (we omitted the node representing the class variable, C, pointing to ${X}_{1}$ and ${X}_{2}$), for B are equivalent, but it is straightforward to check that $\widehat{\mathrm{a}}\text{CLL}(G\mid D)\ne \widehat{\mathrm{a}}\text{CLL}(H\mid D)$. ☐
Interesting scoring criteria in the literature are decomposable, since it is infeasible to learn undecomposable scores. On the other hand, both score-equivalent and non-score-equivalent decomposable scores can be learned efficiently, although the algorithms to learn them are different. In general, the score-equivalence property does not seem to be important, as non-score-equivalent scores typically perform better than score-equivalent ones [12,14].
4. Information-Theoretic Interpretation of the Conditional Log-Likelihood
Herein, we provide some insights about information-theoretic concepts and how they relate to the CLL. These results are well known in the literature [2,27]; nonetheless, they are presented here in detail to provide a clear understanding that approximating the CLL is equivalent to approximating the conditional relative entropy. We refer the interested reader to the textbook of Cover and Thomas [28] for further details.
The relative entropy is a measure of the distance between two distributions. The relative entropy, also known as Kullback-Leibler divergence, between two probability mass functions, $p(x)$ and $q(x)$, for a random variable, X, is defined as:
$${D}_{KL}(p(x)\mid \mid q(x))=\sum _{x}p(x)\log\frac{p(x)}{q(x)}$$
This quantity is always nonnegative and is zero if and only if $p=q$. Although interpreted as a distance between distributions, it is not a true distance, as it is not symmetric and it does not satisfy the triangle inequality. However, it is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p. In terms of encoding, this means that if, instead of using the true distribution, p, to encode X, we used the code for a distribution, q, we would need ${H}_{p}(X)+{D}_{KL}(p(x)\mid \mid q(x))$ bits on average to describe X. In addition, the conditional relative entropy is the average of the relative entropies between the conditional probability mass functions, $p(y\mid x)$ and $q(y\mid x)$, averaged over the probability mass function, $p(x)$, that is:
$${D}_{KL}(p(y\mid x)\mid \mid q(y\mid x))={E}_{X}[{D}_{KL}(p(y\mid X)\mid \mid q(y\mid X))]=\sum _{x}p(x)\sum _{y}p(y\mid x)\log\frac{p(y\mid x)}{q(y\mid x)}$$
The relationship between log-likelihood and entropy is well established. Lewis had already shown that maximizing the log-likelihood is equivalent to minimizing the entropy [27]. In addition, Friedman et al. [2] came to the same conclusion and related it to the relative entropy. They concluded that minimizing the Kullback-Leibler divergence between the distribution, ${\widehat{P}}_{D}$, induced by the observed frequency estimates (OFE) and the distribution, ${P}_{B}$, given by the Bayesian network classifier, B, is equivalent to minimizing the entropy, ${H}_{{P}_{B}}$, and, thus, maximizing $\mathrm{LL}(B\mid D)$.
Friedman et al. [2] also hinted that maximizing the conditional log-likelihood is equivalent to minimizing the conditional relative entropy. Assuming that $\mathbf{A}=\mathbf{X}\backslash \{C\}$, and for the case of BNCs, this fact can be taken from:
$$\begin{aligned}\text{CLL}(B\mid D) &= \sum _{t=1}^{N}\log{P}_{B}({c}_{t}\mid {\mathbf{y}}_{t})\\ &= N\sum _{{\mathbf{y}}_{t}}\sum _{{c}_{t}}{\widehat{P}}_{D}({\mathbf{y}}_{t},{c}_{t})\log{P}_{B}({c}_{t}\mid {\mathbf{y}}_{t})\\ &= N\sum _{{\mathbf{y}}_{t}}\sum _{{c}_{t}}{\widehat{P}}_{D}({\mathbf{y}}_{t},{c}_{t})\log{P}_{B}({c}_{t}\mid {\mathbf{y}}_{t})+N{H}_{{\widehat{P}}_{D}}(C\mid \mathbf{A})-N{H}_{{\widehat{P}}_{D}}(C\mid \mathbf{A})\\ &= N\sum _{{\mathbf{y}}_{t}}\sum _{{c}_{t}}{\widehat{P}}_{D}({\mathbf{y}}_{t},{c}_{t})\log{P}_{B}({c}_{t}\mid {\mathbf{y}}_{t})-N\sum _{{\mathbf{y}}_{t}}\sum _{{c}_{t}}{\widehat{P}}_{D}({\mathbf{y}}_{t},{c}_{t})\log{\widehat{P}}_{D}({c}_{t}\mid {\mathbf{y}}_{t})-N{H}_{{\widehat{P}}_{D}}(C\mid \mathbf{A})\\ &= -N\sum _{{\mathbf{y}}_{t}}\sum _{{c}_{t}}{\widehat{P}}_{D}({\mathbf{y}}_{t},{c}_{t})\log\frac{{\widehat{P}}_{D}({c}_{t}\mid {\mathbf{y}}_{t})}{{P}_{B}({c}_{t}\mid {\mathbf{y}}_{t})}-N{H}_{{\widehat{P}}_{D}}(C\mid \mathbf{A})\\ &= -N\sum _{{\mathbf{y}}_{t}}\sum _{{c}_{t}}{\widehat{P}}_{D}({\mathbf{y}}_{t}){\widehat{P}}_{D}({c}_{t}\mid {\mathbf{y}}_{t})\log\frac{{\widehat{P}}_{D}({c}_{t}\mid {\mathbf{y}}_{t})}{{P}_{B}({c}_{t}\mid {\mathbf{y}}_{t})}-N{H}_{{\widehat{P}}_{D}}(C\mid \mathbf{A})\\ &= -N\sum _{{\mathbf{y}}_{t}}{\widehat{P}}_{D}({\mathbf{y}}_{t})\sum _{{c}_{t}}{\widehat{P}}_{D}({c}_{t}\mid {\mathbf{y}}_{t})\log\frac{{\widehat{P}}_{D}({c}_{t}\mid {\mathbf{y}}_{t})}{{P}_{B}({c}_{t}\mid {\mathbf{y}}_{t})}-N{H}_{{\widehat{P}}_{D}}(C\mid \mathbf{A})\\ &= -N{D}_{KL}({\widehat{P}}_{D}(C\mid \mathbf{A})\mid \mid {P}_{B}(C\mid \mathbf{A}))-N{H}_{{\widehat{P}}_{D}}(C\mid \mathbf{A})\end{aligned} \tag{29}$$
Indeed, the main goal in classification is to learn the model that best approximates the conditional probability, induced by the OFE, of C, given $\mathbf{A}$. It follows from Equation (29) that by maximizing $\text{CLL}(B\mid D)$, such a goal is achieved, since we are minimizing ${D}_{KL}({\widehat{P}}_{D}(C\mid \mathbf{A})\mid \mid {P}_{B}(C\mid \mathbf{A}))$, which is the conditional relative entropy between the conditional distribution induced by the OFE and the conditional distribution given by the target model, B.
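The identity in Equation (29) can be checked numerically on a toy example. The sketch below is a hedged illustration only: the function names `cll` and `conditional_kl_and_entropy`, the data layout (a list of `(y, c)` pairs with `y` a tuple of attribute values) and the conditional table `model[y][c]` standing in for ${P}_{B}(c\mid \mathbf{y})$ are all assumptions made for this example, and the numbers are hypothetical rather than taken from the experiments.

```python
import math
from collections import Counter


def cll(data, model):
    """Conditional log-likelihood: sum over instances of log P_B(c_t | y_t)."""
    return sum(math.log(model[y][c]) for y, c in data)


def conditional_kl_and_entropy(data, model):
    """Return (D_KL(P_D(C|A) || P_B(C|A)), H_{P_D}(C|A)) under the
    observed frequency estimates induced by data."""
    n = len(data)
    joint = Counter(data)                  # counts of (y, c)
    marginal = Counter(y for y, _ in data) # counts of y
    kl, entropy = 0.0, 0.0
    for (y, c), n_yc in joint.items():     # unobserved pairs contribute zero
        p_joint = n_yc / n                 # P_D(y, c)
        p_cond = n_yc / marginal[y]        # P_D(c | y)
        kl += p_joint * math.log(p_cond / model[y][c])
        entropy -= p_joint * math.log(p_cond)
    return kl, entropy


if __name__ == "__main__":
    # Hypothetical training set and model conditionals, for illustration only.
    data = [((0, 0), 1), ((0, 1), 1), ((1, 1), 0), ((1, 1), 1)]
    model = {
        (0, 0): {0: 0.2, 1: 0.8},
        (0, 1): {0: 0.3, 1: 0.7},
        (1, 1): {0: 0.4, 1: 0.6},
    }
    kl, entropy = conditional_kl_and_entropy(data, model)
    n = len(data)
    print(cll(data, model), -n * kl - n * entropy)  # the two sides of Equation (29)
```

Since the entropy term depends only on the data, the two printed values move together as the model changes, which is the sense in which maximizing CLL minimizes the conditional relative entropy.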
In conclusion, since aCLL is an MVU approximation of CLL, by finding the model, B, that maximizes $\mathrm{aCLL}(B\mid D)$, we expect to minimize the conditional relative entropy between ${\widehat{P}}_{D}(C\mid \mathbf{A})$ and ${P}_{B}(C\mid \mathbf{A})$.
5. Experimental Results
We implemented the TAN learning algorithm (cf. Section 2.1), endowed with the aCLL scoring criterion, in Mathematica 7.0 on top of the Combinatorica package [29]. For the experimental results, we considered optimal TAN structures learned with LL and aCLL under both (symmetric) uniform and Dirichlet assumptions. The main goal of this empirical assessment is to compare the CLL attained with these optimal structures, in order to determine which one better approximates CLL. To this end, only the scoring criteria vary, and the searching procedure is fixed to yield an optimal TAN structure; this ensures that only the choice of the scoring criterion affects the learned model. To avoid overfitting, which arises naturally when complex structures are searched, we improved the performance of all BNCs by smoothing parameter estimates according to a Dirichlet prior [13]. The smoothing parameter, ${N}^{\prime}$, was set to 5, as suggested in [2].
We performed our evaluation on 89 benchmark datasets used by Barash et al. [9]. These datasets constitute biologically validated data retrieved from the TRANSFAC database [30] for binary classification tasks. The aCLL constants considered for the uniform assumption are those given in Equation (13). The constants considered under the Dirichlet assumption are those given in Example 3.7, since all the datasets have a size around 1,000.
The CLL of the optimal TAN learned by each scoring criterion was computed, and it is depicted in Figure 1. From the figure, it is clear that the scoring criterion that obtained the highest CLL was aCLL under Dir(1, 1, 1000), followed by aCLL under Unif(${[0,p]}^{2}$). We numerically computed the standard relative error of the approximation under the Dirichlet assumption and obtained 13.92%, which is half of the relative error for the uniform case [8]. For this reason, we expect aCLL under Dirichlet to be more stable than under the uniform assumption. To evaluate the statistical significance of these results, we used Wilcoxon signed-rank tests. This test is applicable when paired scoring differences, along the datasets, are independent and not necessarily normally distributed [31]. Results are presented in Table 1.
Figure 1.
Conditional log-likelihood (CLL) score achieved by optimal tree-augmented naive Bayes (TAN) structures with different scoring criteria.
Table 1.
Wilcoxon signed-rank tests for the CLL score obtained by optimal TANs in binary classification tasks.
| Searching, Score, Assumption | TAN, LL | TAN, aCLL, Unif${[0,p]}^{2}$ |
|---|---|---|
| TAN, aCLL, Dir(1, 1, 1000) | 7.08; $7.21\times {10}^{-13}$; ⇐ | 6.84; $3.96\times {10}^{-12}$; ⇐ |
| TAN, aCLL, Unif${[0,p]}^{2}$ | 3.52; $2.15\times {10}^{-4}$; ⇐ | |
Each entry of the table gives the z-test statistic and the p-value, in this order, of the significance test for the corresponding pair of BNCs. The arrow points to the superior scoring criterion, in terms of higher CLL.
From Table 1, it is clear that aCLL is significantly better than LL for obtaining a model with higher CLL. In addition, the Dirichlet assumption is more suitable than the uniform one for learning BNCs.
To illustrate the usage for multiclassification tasks, we applied our method for ternary classification. For that, we used the same 89 datasets as for the binary classification to create synthetic data for ternary classification tasks. We added to each dataset, i, a third class that corresponded to a class of the dataset, $i+1$. The aCLL constants considered for the uniform and Dirichlet assumptions are those given in Examples 3.4 and 3.7, respectively, since all the datasets have a size around 1,000.
To compute our scoring, we also considered the number of pseudocounts, ${N}^{\prime}=5$, and, similarly to the binary case, we performed a Wilcoxon signed-rank test to evaluate the statistical significance of the CLL computed by each scoring criterion. The results are presented in Table 2.
Table 2.
Wilcoxon signed-rank tests for the CLL score obtained by optimal TANs in ternary classification tasks.
| Searching, Score, Assumption | TAN, LL | TAN, aCLL, Unif${[0,p]}^{3}$ |
|---|---|---|
| TAN, aCLL, Dir(1, 1, 1, 1000) | 6.91; $2.42\times {10}^{-12}$; ⇐ | 4.07; $2.35\times {10}^{-5}$; ⇑ |
| TAN, aCLL, Unif${[0,p]}^{3}$ | 7.44; $5.03\times {10}^{-14}$; ⇐ | |
From Table 2, it is clear that aCLL also outperforms LL in ternary classification for obtaining a model with higher CLL. For the ternary case, the uniform assumption for learning BNCs outperformed the Dirichlet assumption.
6. Conclusions
In this work, we explored three major shortcomings of the initial proposal of the aCLL scoring criterion [8]: (i) it addressed only binary-classification tasks; (ii) it assumed only a uniform distribution over the parameters; (iii) in the context of discriminative learning of BNCs, it did not provide the optimal parameters that maximize it. The effort of exploring the aforementioned limitations culminated in the proposal of a non-trivial extension of aCLL for multiclassification tasks under diverse stochastic assumptions. Whenever possible, the approximation constants were computed analytically; this included binary and ternary classification tasks under a symmetric uniform assumption. In addition, a Monte Carlo method was proposed to compute the constants required for the approximation under a symmetric Dirichlet assumption. In the context of discriminative learning of BNCs, we showed that the extended score is decomposable over the BNC structure and provided the parameters that maximize it. This decomposition allows score-based learning procedures to be employed locally, making full (structure and parameters) discriminative learning of BNCs very efficient. Such discriminative learning is equivalent to minimizing the conditional relative entropy between the conditional distribution of the class given the attributes induced by the OFE and the one given by the learned discriminative model.
The merits of the devised scoring criteria under two different assumptions were evaluated on real biological data. These assumptions adopted a symmetric uniform and a symmetric Dirichlet distribution over the parameters. Optimal discriminative models learned with aCLL under both the symmetric uniform and the symmetric Dirichlet assumptions showed higher CLL than those learned generatively with LL. Moreover, among the proposed criteria, the symmetric Dirichlet assumption was also shown to approximate CLL better than the symmetric uniform one in the binary case. This was expected, as the Dirichlet is a conjugate distribution for a multinomial sample, which ties in with the fact that the data are assumed to be a multinomial sample when learning BNCs.
Directions for future work include extending aCLL to unsupervised learning and to dealing with missing data.
Acknowledgments
This work was partially supported by Fundação para a Ciência e Tecnologia (FCT), under grant, PEstOE/EEI/LA0008/2013, and by the projects, NEUROCLINOMICS (PTDC/EIAEIA/111239/2009), ComFormCrypt (PTDC/EIACCO/113033/2009) and InteleGen (PTDC/DTPFTO/1747/2012), also funded by FCT.
Conflict of Interest
The authors declare no conflict of interest.
References
1. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: San Francisco, CA, USA, 1988.
2. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 1997, 29, 131–163.
3. Grossman, D.; Domingos, P. Learning Bayesian Network Classifiers by Maximizing Conditional Likelihood. In Proceedings of the Twenty-first International Conference on Machine Learning, Banff, Alberta, Canada, 4–8 July 2004; ACM: New York, NY, USA, 2004; pp. 46–53.
4. Su, J.; Zhang, H. Full Bayesian Network Classifiers. In Proceedings of the Twenty-third International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; Cohen, W.W., Moore, A., Eds.; ACM: New York, NY, USA, 2006; pp. 897–904.
5. Domingos, P.; Pazzani, M.J. On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 1997, 29, 103–130.
6. Greiner, R.; Zhou, W. Structural Extension to Logistic Regression: Discriminative Parameter Learning of Belief Net Classifiers. In Proceedings of the Eighteenth National Conference on Artificial Intelligence and Fourteenth Conference on Innovative Applications of Artificial Intelligence, Edmonton, Alberta, Canada, 28 July–2 August 2002; Dechter, R., Sutton, R.S., Eds.; AAAI/MIT: Cambridge, MA, USA, 2002; pp. 167–173.
7. Su, J.; Zhang, H.; Ling, C.X.; Matwin, S. Discriminative Parameter Learning for Bayesian Networks. In Proceedings of the Twenty-fifth International Conference on Machine Learning, Helsinki, Finland, 5–9 June 2008; Cohen, W.W., McCallum, A., Roweis, S.T., Eds.; ACM: New York, NY, USA, 2008; pp. 1016–1023.
8. Carvalho, A.M.; Roos, T.; Oliveira, A.L.; Myllymäki, P. Discriminative learning of Bayesian networks via factorized conditional log-likelihood. J. Mach. Learn. Res. 2011, 12, 2181–2210.
9. Barash, Y.; Elidan, G.; Friedman, N.; Kaplan, T. Modeling Dependencies in Protein-DNA Binding Sites. In Proceedings of the Seventh Annual International Conference on Computational Biology, Berlin, Germany, 10–13 April 2003; ACM: New York, NY, USA, 2003; pp. 28–37.
10. Carvalho, A.M.; Oliveira, A.L.; Sagot, M.F. Efficient Learning of Bayesian Network Classifiers: An Extension to the TAN Classifier. In Proceedings of the 20th Australian Joint Conference on Artificial Intelligence, Gold Coast, Australia, 2–6 December 2007; Orgun, M.A., Thornton, J., Eds.; Springer: Berlin, Germany, 2007; Volume 4830, pp. 16–25.
11. Carvalho, A.M. Scoring Functions for Learning Bayesian Networks; Technical Report; INESC-ID: Lisbon, Portugal, 2009.
12. Yang, S.; Chang, K.C. Comparison of score metrics for Bayesian network learning. IEEE Trans. Syst. Man. Cybern. A 2002, 32, 419–428.
13. Heckerman, D.; Geiger, D.; Chickering, D.M. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn. 1995, 20, 197–243.
14. De Campos, L.M. A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. J. Mach. Learn. Res. 2006, 7, 2149–2187.
15. Silander, T.; Roos, T.; Kontkanen, P.; Myllymäki, P. Bayesian Network Structure Learning using Factorized NML Universal Models. In Proceedings of the Fourth European Workshop on Probabilistic Graphical Models, Hirtshals, Denmark, 17–19 September 2008; pp. 257–264.
16. Chickering, D.M.; Heckerman, D.; Meek, C. Large-sample learning of Bayesian networks is NP-hard. J. Mach. Learn. Res. 2004, 5, 1287–1330.
17. Dagum, P.; Luby, M. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artif. Intell. 1993, 60, 141–153.
18. Edmonds, J. Optimum branchings. J. Res. Nat. Bur. Stand. 1967, 71, 233–240.
19. Chow, C.K.; Liu, C.N. Approximating discrete probability distributions with dependence trees. IEEE Trans. Inform. Theory 1968, 14, 462–467.
20. Chickering, D.M. Learning Bayesian Networks Is NP-Complete. In Learning from Data: AI and Statistics V; Springer: Berlin, Germany, 1996; pp. 121–130.
21. Bilmes, J. Dynamic Bayesian Multinets. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, Stanford University, Stanford, CA, USA, 30 June–3 July 2000; Boutilier, C., Goldszmidt, M., Eds.; Morgan Kaufmann: Burlington, MA, USA, 2000; pp. 38–45.
22. Johnson, R.A.; Wichern, D.W. Applied Multivariate Statistical Analysis; Prentice Hall: Englewood Cliffs, NJ, USA, 2007.
23. Heckerman, D. A Tutorial on Learning Bayesian Networks; Technical Report MSR-TR-95-06, Microsoft Research; Microsoft: Redmond, WA, USA, 1995.
24. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009.
25. Verma, T.; Pearl, J. Equivalence and Synthesis of Causal Models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, Cambridge, MA, USA, 27–29 July 1990; Bonissone, P.P., Henrion, M., Kanal, L.N., Lemmer, J.F., Eds.; Elsevier: Amsterdam, The Netherlands, 1990; pp. 255–270.
26. Chickering, D.M. Learning equivalence classes of Bayesian-network structures. J. Mach. Learn. Res. 2002, 2, 445–498.
27. Lewis, P.M. Approximating probability distributions to reduce storage requirements. Inform. Control 1959, 2, 214–225.
28. Cover, T.; Thomas, J. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2006.
29. Pemmaraju, S.V.; Skiena, S.S. Computational Discrete Mathematics: Combinatorics and Graph Theory with Mathematica; Cambridge University Press: Cambridge, UK, 2003.
30. Wingender, E.; Chen, X.; Fricke, E.; Geffers, R.; Hehl, R.; Liebich, I.; Krull, M.; Matys, V.; Michael, H.; Ohnhäuser, R.; et al. The TRANSFAC system on gene expression regulation. Nucleic Acids Res. 2001, 29, 281–283.
31. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).