# Learning Genetic Population Structures Using Minimization of Stochastic Complexity


Department of Mathematics and Statistics, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki, Finland

Department of Mathematics, Royal Institute of Technology, S-100 44 Stockholm, Sweden

Department of Mathematics, Åbo Akademi University, FIN-20500 Åbo, Finland

Author to whom correspondence should be addressed.

Received: 21 February 2010 / Accepted: 28 April 2010 / Published: 5 May 2010

(This article belongs to the Special Issue Entropy in Genetics and Computational Biology)

Considerable research effort has been devoted to probabilistic modeling of genetic population structures within the past decade. In particular, a wide spectrum of Bayesian models has been proposed for unlinked molecular marker data from diploid organisms. Here we derive a theoretical framework for learning the genetic population structure of a haploid organism from biallelic markers for which potential patterns of dependence are a priori unknown and are to be explicitly incorporated in the model. Our framework is based on the principle of minimizing the stochastic complexity of an unsupervised classification under a tree augmented factorization of the predictive data distribution. We also discuss a fast implementation of the learning framework using deterministic algorithms.

The concept of a structured or subdivided population has received intensive attention both in applied and theoretical population genetics for several decades. Intuitively, such populations can be considered to harbor multiple pools of individuals, each associated with distinct allele frequencies over a set of molecular marker loci. For a mathematical conceptualization of such a structure in population genetics, see [1,2]. In a traditional formulation, statistical inference regarding the extent of subdivision that a population harbors is performed through summary statistics calculated from genotype data available for a number of local, geographically determined sample populations. Many different forms of such summary statistics exist in the literature; for details see the previously cited references. Although the traditional formulation is still widely utilized in applied population genetics, a complementary approach to determining the extent of population subdivision has grown particularly popular over the most recent decade. This approach is fundamentally based on statistical learning of a mixture model for genotype data from multiple marker loci, where the mixture components represent gene pools that have drifted apart over time. In a typical setting both the appropriate number and the contents of the mixture components are unknown a priori for a sample of individuals that have been genotyped. Currently, a large body of literature exists on the mixture-based inference of population structure, where numerous ramifications of a simple mixture model can be found [3,4,5,6,7,8,9]. A survey of the existing methods from a spatially explicit perspective can be found in [10].

We have earlier demonstrated how a Bayesian representation of statistical uncertainty related to population structure can be derived from a combination of generalized exchangeability and random urn models [7]. The operational interpretation of this model is that the observed genotypes and the alleles within each genotype are conditionally independent over all the considered loci, given the underlying allele frequency parameters in each latent pool of the total population. The latent pools themselves are represented by a stochastic partition of the sampled dataset, where the number of pools necessary for sufficiently expressive representation of the molecular variation is explicitly determined. In machine learning terminology the stochastic partitions perform unsupervised classification of data from multiple finite alphabets, such that the occurrence of the letters is considered to be conditionally independent over the alphabets given the classification of the items for which the letter combinations are observed.

In the current work we derive a generalization of the unsupervised classification approach to inferring genetic population structures from biallelic multilocus data, where the loci are no longer assumed to be conditionally independent given a pool. The statistical learning approach is derived using information theoretic notions, where the minimum description length (MDL) criterion is used for minimization of the stochastic complexity (SC) associated with the inferred population structure. To encode possible dependencies among the loci, we augment the classifier with a binary tree to enable a sparse factorization of the joint distribution of the multilocus genotypes, where the amount of dependence is determined by the level of complexity in the data when filtered through the MDL criterion. The mathematical model thus enables simultaneous learning of the unsupervised classification and the graphical structures that are needed for representing the possible dependencies. It will be explicitly shown that SC (per sample) is the sum of the length of the description of the samples within the classification and of the length of the description of the classification with respect to the classification model chosen. Some parts of our findings were earlier reported in condensed form in a conference proceedings article [11]. We point out that the general notion of augmenting a classifier by a tree is due to Chow and Liu [12], and in a more extensive form to [13] in a supervised context.

This article is organized as follows. In Section 2.1 we introduce a tree-based factorization of the joint distribution of the multilocus genotypes, and in Section 2.2 the SC criteria for learning trees and unsupervised classifications, respectively, are derived. The final sections discuss deterministic algorithms for learning the optimal population structure under the introduced framework and provide some concluding remarks.

Assume that the observed data consist in total of n genotypes over d biallelic marker loci from a haploid organism. Thus, each observation resides in the d-dimensional binary hypercube ${B}^{d}$, and we let x denote its elements carrying d binary components, i.e.,

$${B}^{d}:=\left\{x\mid x={\left({x}_{i}\right)}_{i=1}^{d},{x}_{i}\in \{0,1\}\right\}$$

Consider one latent pool in the population into which t of the total n samples are assigned in terms of an unsupervised classification into k disjoint classes. The classification itself will be made notationally explicit in a later section, whereas here we seek a probabilistic representation of the possible dependencies among the d allele frequencies of the individual loci in any particular pool. At one extreme, such a representation should also allow for the simplest possible model where all loci are independent, thus encoding the situation where the data do not display evidence for dependence. Using the general Markov theory developed for graphical models of multivariate distributions, we may derive the sought probabilistic representation that enjoys these properties. We let

$${X}^{t}={\left\{{x}^{\left(l\right)}\right\}}_{l=1}^{t}$$

be a set of ${x}^{\left(l\right)}\in {B}^{d}$, which are considered as t independent realizations of the multilocus genotypes from a single latent pool with its specific (unknown) allele frequencies. Note that the data are here assumed complete, such that no components are missing in any ${x}^{\left(l\right)}$.

Let $\mathcal{G}=(V,E)$ be an acyclic graph with the set of nodes (or vertices) $V=\{1,\dots ,d\}$ and the edges E. Each connected component of $\mathcal{G}$ is a tree and $\mathcal{G}$ may also be called a forest. We shall first consider forests that consist of one single tree.

If we choose a direction for the edges of $\mathcal{G}$, the node i in the directed edge $\left(i,j\right)$ in E is said to be the parent of node j and j is called a child of the node i. The notations i and j are used interchangeably to denote any particular locus in the remainder of the text. A tree is characterized by the property that any two vertices are connected by a unique path. Hence the parent of a node is uniquely given. The root of a directed tree is the node lacking a parent. If $\mathcal{G}$ is a directed tree, then we designate by ${\mathcal{G}}^{\sim}$ the undirected version obtained from $\mathcal{G}$ by replacing the directed edges by undirected edges.

The structure or topology of the tree $\mathcal{G}$ (and effectively of ${\mathcal{G}}^{\sim}$) is thus given by

$$\Pi =\left(\Pi \left[1\right],\Pi \left[2\right],\dots ,\Pi \left[d\right]\right),$$

where $\Pi \left[j\right]$ is the parent of the node j, such that $\Pi \left[1\right]=\varnothing $ (the empty set). We suppose here that the nodes are ordered so that 1 is the root of the tree and that $\Pi \left[i\right]<i$ for all i (topological order).

For each node we assign a binary random variable ${X}_{i}$ corresponding to the ith locus, where the possible alleles are labeled by ${x}_{i}\in \{0,1\}$. Each edge $\left(j,i\right)$ in E (directed or not) is a statement of dependence between ${X}_{j}$ and ${X}_{i}$, which implies that the joint allele frequencies of the locus combination $j,i$ do not factorize into the product of the marginal allele frequencies. On the contrary, the absence of an edge indicates lack of direct dependence between the corresponding loci.

We assume that the joint distribution of $\left({X}_{1},\dots ,{X}_{d}\right)$ is factorized along $\mathcal{G}$ in the sense that

$$P\left({X}_{1}={x}_{1},\dots ,{X}_{d}={x}_{d}\right)=\prod _{i=1}^{d}P\left({X}_{i}={x}_{i}|{X}_{\Pi \left[i\right]}={x}_{\Pi \left[i\right]}\right)$$

In addition to the structure Π we thus need to assign for each node the table of conditional probabilities $P\left({X}_{i}={x}_{i}|{X}_{\Pi \left[i\right]}={x}_{\Pi \left[i\right]}\right)$ (allele frequencies) in order to fully specify a tree dependent joint distribution.
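As a concrete illustration, the tree factorization above can be evaluated programmatically. The following Python sketch uses invented conditional allele frequencies (the names `joint_prob`, `parents`, `theta`, `phi` and all numeric values are our own illustrative assumptions, not part of the paper) and checks that the factorized distribution sums to one over the hypercube ${B}^{d}$:

```python
from itertools import product

# Sketch of the tree factorization: each locus i has a parent parents[i]
# (None for a root), a root allele frequency, and conditional probabilities
# theta[i] = P(x_i = 1 | parent = 1) and phi[i] = P(x_i = 1 | parent = 0).
# All numeric values below are illustrative, not estimates from any data set.

def joint_prob(x, parents, root_p, theta, phi):
    """P(x_1, ..., x_d) = prod_i P(x_i | x_{Pi[i]}) for a binary vector x."""
    p = 1.0
    for i, xi in enumerate(x):
        pa = parents[i]
        if pa is None:                  # root node: marginal allele frequency
            p1 = root_p[i]
        elif x[pa] == 1:                # parent allele is 1
            p1 = theta[i]
        else:                           # parent allele is 0
            p1 = phi[i]
        p *= p1 if xi == 1 else 1.0 - p1
    return p

# Example tree on d = 4 loci with structure Pi = (None, 0, 0, 2), 0-based.
parents = [None, 0, 0, 2]
root_p = {0: 0.3}
theta = {1: 0.9, 2: 0.6, 3: 0.2}
phi = {1: 0.1, 2: 0.4, 3: 0.7}

total = sum(joint_prob(x, parents, root_p, theta, phi)
            for x in product([0, 1], repeat=4))
print(round(total, 10))  # a valid factorization sums to 1 over B^d
```

Any choice of conditional probability tables yields a proper joint distribution, which is why the sum over all $2^d$ binary vectors equals one.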

Simple Markovian-process-based dependence representations of linkage between loci have earlier been introduced in [8] and [14], where it was assumed that the linkage map of the loci is available a priori and is explicitly incorporated in the models. The tree and forest based factorizations of the joint allele probabilities defined below also utilize simple Markovian structures along the node ordering, as any locus will have at most a single parent. The essential difference is that the linkage map of the loci is not assumed known a priori, but is implicitly represented by the node ordering along the tree or forest, which is learned from the observed data. In situations where it is known for each locus on which chromosome it is positioned, it is straightforward to incorporate this information into our framework by learning the trees (forests) separately for each subset of loci positioned within a different chromosome. In order not to burden the notation excessively, we abstain from an explicit representation of this possibility.

For notational convenience we shall in the sequel express the joint distributions $P\left({X}_{1}={x}_{1},\dots ,{X}_{d}={x}_{d}\right)$ and other probability distributions also by omitting the random variables but including the structure of the tree as

$$P\left({x}_{1},\dots ,{x}_{d}\mid \Pi \right)=\prod _{i=1}^{d}P\left({x}_{i}|{x}_{\Pi \left[i\right]}\right).$$

A directed tree $\mathcal{G}=(V,E)$ equipped with a tree dependent probability distribution (2) factorized along $\mathcal{G}=(V,E)$ is the model introduced in [12], and it is also called a dependence tree. The tree, the binary random variables ${X}_{1},\dots ,{X}_{d}$, and the probability distribution (2) constitute a special Bayesian network [15,16]. However, Meilă and Jordan [17] argue that a mixture of trees, which will be introduced below using the probability distributions (2) as components, is not to be regarded as a Bayesian network.

The undirected graph ${\mathcal{G}}^{\sim}$ is also known as the Markov tree of the probability distribution $P\left({x}_{1},\dots ,{x}_{d}\mid \Pi \right)$ [18]. A tree dependent distribution factorized along ${\mathcal{G}}^{\sim}$ with $P\left({x}_{1},\dots ,{x}_{d}\mid \Pi \right)$ $>0$ enjoys the global Markov property, and hence vertex separation implies conditional independence [16]. Therefore ${\mathcal{G}}^{\sim}$ is a perfect representation of $P\left({x}_{1},\dots ,{x}_{d}\mid \Pi \right)$ [19].

An example of a Chow-Liu dependence tree in the sense of the above presentation for seven loci is given by

$$\left(\Pi \left[1\right],\Pi \left[2\right],\Pi \left[3\right],\Pi \left[4\right],\Pi \left[5\right],\Pi \left[6\right],\Pi \left[7\right]\right)=(\varnothing ,1,2,1,4,5,3),$$

which leads to the factorization

$$P\left({x}_{1}\right)P\left({x}_{2}|{x}_{1}\right)P\left({x}_{3}|{x}_{2}\right)P\left({x}_{4}|{x}_{1}\right)P\left({x}_{5}|{x}_{4}\right)P\left({x}_{6}|{x}_{5}\right)P\left({x}_{7}|{x}_{3}\right),$$

of the joint distribution of the alleles within the considered latent pool. An illustration of a disconnected dependence tree for seven nodes is given by

$$\left(\Pi \left[1\right],\Pi \left[2\right],\Pi \left[3\right],\Pi \left[4\right],\Pi \left[5\right],\Pi \left[6\right],\Pi \left[7\right]\right)=(\varnothing ,1,\varnothing ,\varnothing ,4,4,5),$$

which now leads to the factorization

$$P\left({x}_{1}\right)P\left({x}_{2}|{x}_{1}\right)P\left({x}_{3}\right)P\left({x}_{4}\right)P\left({x}_{5}|{x}_{4}\right)P\left({x}_{6}|{x}_{4}\right)P\left({x}_{7}|{x}_{5}\right).$$

Bayesian networks can be equivalent in the sense that they imply the same set of independencies between the variables; that is, they have the same underlying undirected graph but may disagree on the direction of some of the edges. One cannot distinguish between equivalent graphs using observations of the variables of the graph. Two rooted trees ${\mathcal{G}}_{1}$ and ${\mathcal{G}}_{2}$ are equivalent if they have the same underlying undirected graph ${\mathcal{G}}^{\sim}$. The characterizations of equivalence are recapitulated and proved in [20,21]. In the current work the dependencies between loci are unambiguously represented by the undirected graph ${\mathcal{G}}^{\sim}$.

Since the joint distribution factorized along a forest of dependence trees will, in view of (3), in general consist of factors having the same structure as for the joint distribution factorized according to one single dependence tree, we restrict the attention to the tree dependent probability distribution

$$P\left(x\right)=P\left({x}_{1}\right)\cdot P\left({x}_{2}|{x}_{\Pi \left(2\right)}\right)\dots P\left({x}_{d-1}|{x}_{\Pi (d-1)}\right)\cdot P\left({x}_{d}|{x}_{\Pi \left(d\right)}\right).$$

The factors $P\left({x}_{i}|{x}_{\Pi \left(i\right)}\right)$ can be written as

$$P\left({x}_{i}|{x}_{\Pi \left(i\right)}\right)={\left({\theta}_{i}^{{x}_{i}}{\left(1-{\theta}_{i}\right)}^{\left(1-{x}_{i}\right)}\right)}^{{x}_{\Pi \left(i\right)}}\cdot {\left({\varphi}_{i}^{{x}_{i}}{\left(1-{\varphi}_{i}\right)}^{\left(1-{x}_{i}\right)}\right)}^{1-{x}_{\Pi \left(i\right)}},$$

where for $i=2,\dots ,d$

$${\theta}_{i}=P\left({x}_{i}=1|{x}_{\Pi \left(i\right)}=1\right),$$

and

$${\varphi}_{i}=P\left({x}_{i}=1|{x}_{\Pi \left(i\right)}=0\right),$$

and

$$P\left({x}_{1}\right)={\theta}_{1}^{{x}_{1}}{\left(1-{\theta}_{1}\right)}^{\left(1-{x}_{1}\right)}.$$

Recalling from the previous section the definition of the data that was assigned to a particular latent pool, we obtain the joint probability of the t d-dimensional samples from (4) and (5) as

$$\prod _{l=1}^{t}P\left({x}^{\left(l\right)}\right)={\theta}_{1}^{{n}_{1}}{\left(1-{\theta}_{1}\right)}^{t-{n}_{1}}\prod _{i=2}^{d}{\theta}_{i}^{{n}_{i}(1,1)}{\left(1-{\theta}_{i}\right)}^{{n}_{i}(0,1)}\cdot {\varphi}_{i}^{{n}_{i}(1,0)}{\left(1-{\varphi}_{i}\right)}^{{n}_{i}(0,0)},$$

where, as all ${x}^{\left(l\right)}$ are realizations under the same dependence tree,

$${n}_{i}(1,1)=\sum _{l=1}^{t}{x}_{i}^{\left(l\right)}{x}_{\Pi \left(i\right)}^{\left(l\right)},\phantom{\rule{1.em}{0ex}}{n}_{i}(1,0)=\sum _{l=1}^{t}{x}_{i}^{\left(l\right)}\left(1-{x}_{\Pi \left(i\right)}^{\left(l\right)}\right),$$

$${n}_{i}(0,1)=\sum _{l=1}^{t}\left(1-{x}_{i}^{\left(l\right)}\right){x}_{\Pi \left(i\right)}^{\left(l\right)},\phantom{\rule{1.em}{0ex}}{n}_{i}(0,0)=\sum _{l=1}^{t}\left(1-{x}_{i}^{\left(l\right)}\right)\left(1-{x}_{\Pi \left(i\right)}^{\left(l\right)}\right),$$

for $i=2,\dots ,d$, and

$${n}_{1}=\sum _{l=1}^{t}{x}_{1}^{\left(l\right)}.$$

Obviously, ${n}_{i}(1,1)$ counts the number of times we have simultaneously ${x}_{i}^{\left(l\right)}=1$ and ${x}_{\Pi \left(i\right)}^{\left(l\right)}=1$ in ${X}^{t}={\left\{{x}^{\left(l\right)}\right\}}_{l=1}^{t}$, and the interpretations of the remaining corresponding quantities are analogous. We call (8) the (first order) Chow expansion of the joint likelihood of a Markov tree [22].
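The count statistics above are straightforward to compute from a binary data matrix. The following sketch uses 0-based locus indices and a randomly generated illustrative matrix `X` (the names `parents` and `counts` are our own, not the paper's notation):

```python
import numpy as np

# Sketch of the count statistics n_1 and n_i(u, v): `parents` encodes Pi with
# None marking the root locus 0; X has shape (t, d) with one haploid genotype
# per row. The data here are random and purely illustrative.

rng = np.random.default_rng(0)
t, d = 200, 4
parents = [None, 0, 1, 1]            # a small tree over the d loci
X = rng.integers(0, 2, size=(t, d))  # t genotypes over d biallelic loci

n1 = int(X[:, 0].sum())              # n_1: count of allele 1 at the root locus
counts = {}
for i in range(1, d):
    pa = parents[i]
    for u in (0, 1):                 # allele at locus i
        for v in (0, 1):             # allele at the parent locus Pi(i)
            counts[(i, u, v)] = int(((X[:, i] == u) & (X[:, pa] == v)).sum())

# Sanity check: for each non-root locus the four counts partition the t samples.
for i in range(1, d):
    assert sum(counts[(i, u, v)] for u in (0, 1) for v in (0, 1)) == t
print(n1, counts[(1, 1, 1)])
```

These counts are exactly the sufficient statistics that enter the Chow expansion (8) and, later, the stochastic complexity expressions.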

Since the Chow expansion derived above involves a considerable number of unknown allele frequency parameters, and different combinations of trees and unsupervised classifications will be associated with varying numbers of such parameters, it is necessary to handle them in a statistically appropriate manner to ensure coherent learning of the population structure. Here we use the marginal likelihood of the tree topology $P\left({X}^{t}|\Pi \right)$, which can equivalently be considered as the prior predictive distribution of ${X}^{t}$, to ensure consistent learning under the family of the considered probability models.

The joint likelihood of the tree network in (5) is now represented by the parametrization

$${P}_{\underline{\theta},\underline{\varphi}}\left(x\right)=\prod _{i=1}^{d}{P}_{{\theta}_{i},{\varphi}_{i}}\left({x}_{i}|{x}_{\Pi \left(i\right)}\right),$$

where from (6) and (7)

$$\underline{\theta}=\left({\theta}_{1},\dots ,{\theta}_{d}\right),\phantom{\rule{1.em}{0ex}}\underline{\varphi}=\left({\varphi}_{2},\dots ,{\varphi}_{d}\right).$$

Given that Θ and Φ denote the d-fold and $(d-1)$-fold products of the unit interval, respectively, we have $\underline{\theta}\in \Theta $ and $\underline{\varphi}\in \Phi $.

Given a prior probability density $g\left(\underline{\theta},\underline{\varphi}\right)$ on $\Theta \times \Phi $, we obtain

$$P\left({X}^{t}|\Pi \right)={\int}_{\Theta}{\int}_{\Phi}\prod _{l=1}^{t}{P}_{\underline{\theta},\underline{\varphi}}\left({x}^{\left(l\right)}\right)g\left(\underline{\theta},\underline{\varphi}\right)d\underline{\theta}d\underline{\varphi}.$$

This is known in statistics as a prior predictive distribution. We choose the prior $g\left(\underline{\theta},\underline{\varphi}\right)$ by local meta independence of parameters [15,20,21], such that

$$g\left(\underline{\theta},\underline{\varphi}\right)=\prod _{i=1}^{d}h\left({\theta}_{i}\right)\prod _{i=2}^{d}z\left({\varphi}_{i}\right).$$

An explicit expression for the above prior predictive data distribution is derived in the Appendix.

Given the above results we can now derive an expression of the stochastic complexity for any Chow expansion of joint allele frequencies in a latent pool, in a sense made precise in [23,24]. This result will later be invoked in the task of unsupervised classification, such that our learning approach will aim at minimizing the SC over the space of all possible combinations of latent pools and their associated Chow expansions. By using the result in (9), we obtain the criterion for minimizing the SC as the expression

$$\begin{array}{cc}\hfill -logP\left({X}^{t}|\Pi \right)& =-log{I}_{1}-log{I}_{2}-log{I}_{3}\hfill \\ & =log\frac{\Gamma \left({\alpha}_{1}\right)\cdot \Gamma \left({\alpha}_{2}\right)}{\Gamma \left({\alpha}_{1}+{\alpha}_{2}\right)}+log\frac{\Gamma \left(t+{\alpha}_{1}+{\alpha}_{2}\right)}{\Gamma \left({n}_{1}+{\alpha}_{1}\right)\cdot \Gamma \left(t-{n}_{1}+{\alpha}_{2}\right)}\hfill \\ & +\sum _{i=2}^{d}log\frac{\Gamma \left({\alpha}_{1}\right)\cdot \Gamma \left({\alpha}_{2}\right)}{\Gamma \left({\alpha}_{1}+{\alpha}_{2}\right)}\frac{\Gamma \left({n}_{i}(1,1)+{n}_{i}(0,1)+{\alpha}_{1}+{\alpha}_{2}\right)}{\Gamma \left({n}_{i}(1,1)+{\alpha}_{1}\right)\cdot \Gamma \left({n}_{i}(0,1)+{\alpha}_{2}\right)}\hfill \\ & +\sum _{i=2}^{d}log\frac{\Gamma \left({\alpha}_{1}\right)\cdot \Gamma \left({\alpha}_{2}\right)}{\Gamma \left({\alpha}_{1}+{\alpha}_{2}\right)}\frac{\Gamma \left({n}_{i}(1,0)+{n}_{i}(0,0)+{\alpha}_{1}+{\alpha}_{2}\right)}{\Gamma \left({n}_{i}(1,0)+{\alpha}_{1}\right)\cdot \Gamma \left({n}_{i}(0,0)+{\alpha}_{2}\right)},\hfill \end{array}$$

where the hyperparameters are chosen as ${\alpha}_{1}={\alpha}_{2}=1/2$, which defines a product of respective Jeffreys' priors in (10). In order to minimize the log probability $-logP\left({X}^{t}|\Pi \right)$ we derive an asymptotic expansion of the above expression. This expansion is instructive both for algorithmic purposes and for gaining an explicit illustration of the mechanics of SC minimization.
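The exact Gamma-function expression above is easy to evaluate numerically with log-Gamma arithmetic. A minimal sketch, assuming the Jeffreys choice $\alpha_1=\alpha_2=1/2$ and using our own helper names (`beta_term`, `neg_log_marginal`) with a tiny invented data summary:

```python
from math import lgamma

# Sketch of -log P(X^t | Pi): each conditional frequency parameter contributes
# one Gamma-ratio term of the form log [Gamma(a1)Gamma(a2)/Gamma(a1+a2)]
# + log [Gamma(a+b+a1+a2)/(Gamma(a+a1)Gamma(b+a2))] for counts a (ones), b (zeros).

def beta_term(a, b, alpha1=0.5, alpha2=0.5):
    """Negative log of the Beta-Binomial-type marginal for one parameter."""
    return (lgamma(alpha1) + lgamma(alpha2) - lgamma(alpha1 + alpha2)
            + lgamma(a + b + alpha1 + alpha2)
            - lgamma(a + alpha1) - lgamma(b + alpha2))

def neg_log_marginal(t, n1, counts, d):
    """counts[(i, u, v)] = n_i(u, v) for non-root loci i = 1..d-1 (0-based)."""
    total = beta_term(n1, t - n1)                                  # root locus
    for i in range(1, d):
        total += beta_term(counts[(i, 1, 1)], counts[(i, 0, 1)])   # theta_i part
        total += beta_term(counts[(i, 1, 0)], counts[(i, 0, 0)])   # phi_i part
    return total

# Tiny worked example: t = 4 samples, d = 2 loci, locus 1 has parent locus 0.
t, n1 = 4, 2
counts = {(1, 1, 1): 2, (1, 0, 1): 0, (1, 1, 0): 0, (1, 0, 0): 2}
sc = neg_log_marginal(t, n1, counts, d=2)
print(sc > 0)  # -log of a probability < 1 is strictly positive
```

Working on the log scale with `lgamma` avoids the overflow that direct evaluation of the Gamma ratios would cause for realistic sample sizes.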

The formulae derived in the Appendix establish the following basic result. If the likelihood of the tree topology $P\left({X}^{t}|\Pi \right)$ for a dependence tree is evaluated assuming both local meta independence and Jeffreys' prior, which equals the Beta distribution Be($1/2,1/2$), for each of the parameters, then

$$-logP\left({X}^{t}|\Pi \right)=log\frac{t!}{\Gamma \left({n}_{1}+1/2\right)\cdot \Gamma \left(t-{n}_{1}+1/2\right)}+t\cdot \sum _{i=2}^{d}h\left({n}_{i}\left(1\right)/t\right)-t\cdot \sum _{i=2}^{d}{I}_{i,\Pi \left(i\right)}+\frac{1}{2}\sum _{i=2}^{d}\left[log\left({n}_{\Pi \left(i\right)}\left(1\right)\right)+log\left({n}_{\Pi \left(i\right)}\left(0\right)\right)\right]+C,$$

where ${I}_{i,\Pi \left(i\right)}$ is given in (50), ${n}_{i}\left(1\right)$ and ${n}_{i}\left(0\right)$ count the samples with ${x}_{i}^{\left(l\right)}=1$ and ${x}_{i}^{\left(l\right)}=0$, respectively, and C is bounded in t.

Since the learning of the tree network structure is achieved by minimizing $-logP\left({X}^{t}|\Pi \right)$ as a function of the structure Π, in view of (12) we see that this corresponds to minimizing

$$-t\cdot \sum _{i=2}^{d}{I}_{i,\Pi \left(i\right)}+\frac{1}{2}\sum _{i=2}^{d}\left[log\left({n}_{\Pi \left(i\right)}\left(1\right)\right)+log\left({n}_{\Pi \left(i\right)}\left(0\right)\right)\right],$$

since all other terms are independent of the network structure. This is additionally equivalent to maximization of

$$\sum _{i=2}^{d}{I}_{i,\Pi \left(i\right)}-\frac{1}{2}\cdot \left[\frac{1}{t}\sum _{i=2}^{d}\left(log\left({n}_{\Pi \left(i\right)}\left(1\right)\right)+log\left({n}_{\Pi \left(i\right)}\left(0\right)\right)\right)\right].$$
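The per-edge contribution to this objective is the empirical mutual information between a locus and its prospective parent, minus the $\frac{1}{2t}\left[log\,{n}_{\Pi \left(i\right)}\left(1\right)+log\,{n}_{\Pi \left(i\right)}\left(0\right)\right]$ penalty. A sketch of that edge score (the function name `penalized_score` and the simulated data are our own illustrative assumptions):

```python
from math import log
import numpy as np

# Sketch of the penalized edge score: empirical mutual information I_{i,j}
# between binary loci xi and xj, minus (1/2t)[log n_j(1) + log n_j(0)],
# where xj plays the role of the prospective parent Pi(i).

def penalized_score(xi, xj):
    t = len(xi)
    mi = 0.0
    for u in (0, 1):
        for v in (0, 1):
            p_uv = np.mean((xi == u) & (xj == v))
            p_u, p_v = np.mean(xi == u), np.mean(xj == v)
            if p_uv > 0:
                mi += p_uv * log(p_uv / (p_u * p_v))
    n1, n0 = int((xj == 1).sum()), int((xj == 0).sum())
    # If the parent locus is monomorphic the penalty is left at 0 here
    # (an assumption made for this sketch to avoid log 0).
    penalty = (log(n1) + log(n0)) / (2 * t) if n1 > 0 and n0 > 0 else 0.0
    return mi - penalty

rng = np.random.default_rng(1)
a = rng.integers(0, 2, 500)
b = a.copy(); flip = rng.random(500) < 0.1; b[flip] ^= 1  # b depends strongly on a
c = rng.integers(0, 2, 500)                               # c independent of a

print(penalized_score(b, a) > penalized_score(c, a))
```

For a strongly linked pair the mutual information dominates the penalty, while for an independent pair the small-sample mutual information is typically swamped by it, so the score turns negative and (by step A3 below) the edge is excluded.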

We now consider derivation of SC for unsupervised classifications of the n samples into k disjoint classes, interpreted to represent latent pools in the population, each associated with distinct allele frequencies. In terms of data, an unsupervised classification refers to a subdivision of the observed multilocus genotypes into the sets $\{{X}^{{t}_{c}},c=1,\dots ,k\}$. There are various ways of representing such subdivisions; here we define them in terms of class membership functions

$${u}_{c}^{\left(l\right)}:=\left\{\begin{array}{cc}1\hfill & \text{if}\phantom{\rule{4.pt}{0ex}}{x}^{\left(l\right)}\in c\phantom{\rule{4.pt}{0ex}}\hfill \\ 0\hfill & \text{otherwise,}\hfill \end{array}\right.$$

and incorporate these in the matrix

$${U}^{n}={\left\{{u}_{c}^{\left(l\right)}\right\}}_{l=1,c=1}^{n,k}.$$

Hence

$${t}_{c}=\sum _{l=1}^{n}{u}_{c}^{\left(l\right)}$$

is the number of binary vectors assigned to class $c=1,\dots ,k$.

Next, let $\lambda ={\left\{{\lambda}_{c}\right\}}_{c=1}^{k}$ be a discrete probability distribution, ${\sum}_{c=1}^{k}{\lambda}_{c}=1$, ${\lambda}_{c}\ge 0$, which we use as the prevalence of ${u}_{c}^{\left(l\right)}$ so that

$${\lambda}_{c}=P\left({u}_{c}^{\left(l\right)}=1\right),\phantom{\rule{1.em}{0ex}}l=1,\dots ,n.$$

This can be interpreted as the probability of the event that ${x}^{\left(l\right)}$ is sampled from the latent pool indexed by c. Then we introduce

$${P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left(x\mid {\Pi}_{c}\right)=\prod _{i=1}^{d}{P}_{{\theta}_{ic},{\varphi}_{ic}}\left({x}_{i}|{x}_{{\Pi}_{c}\left(i\right)}\right)$$

as the conditional probability of x within the latent pool c. Thus, each of the classes is equipped with a dependence tree, or possibly a forest of dependence trees, designated by ${\Pi}_{c}$. Using (15) and (16) we can now write

$$P\left(x\right)=\sum _{c=1}^{k}{\lambda}_{c}{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left(x\mid {\Pi}_{c}\right),$$

which is formally a mixture of trees in the sense of Meilă and Jordan [17]. A difference in the present setting is that each ${\Pi}_{c}$ may also be a forest. Figure 1 provides an illustration of the joint structure of the probability model for a simple example with $d=5$ and $k=2$.
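The mixture-of-trees density can be sketched in a few lines. The following example builds two latent pools over $d=3$ loci, each with its own parent structure and invented conditional allele frequencies (all names and numbers here are illustrative assumptions, not the paper's), and verifies that the mixture is a proper distribution:

```python
from itertools import product

# Sketch of P(x) = sum_c lambda_c P(x | Pi_c): a mixture of tree-factorized
# distributions over binary loci. theta[i]/phi[i] give P(x_i = 1 | parent = 1/0).

def tree_prob(x, parents, root_p, theta, phi):
    p = 1.0
    for i, xi in enumerate(x):
        pa = parents[i]
        p1 = root_p if pa is None else (theta[i] if x[pa] == 1 else phi[i])
        p *= p1 if xi == 1 else 1.0 - p1
    return p

def mixture_prob(x, pools):
    return sum(lam * tree_prob(x, pa, r, th, ph) for lam, pa, r, th, ph in pools)

pools = [  # (lambda_c, parents Pi_c, root P(x=1), theta, phi), k = 2 pools
    (0.6, [None, 0, 1], 0.2, {1: 0.8, 2: 0.7}, {1: 0.3, 2: 0.1}),
    (0.4, [None, 0, 0], 0.9, {1: 0.5, 2: 0.4}, {1: 0.6, 2: 0.2}),
]

total = sum(mixture_prob(x, pools) for x in product([0, 1], repeat=3))
print(round(total, 10))  # mixture of proper distributions sums to 1
```

Since each component is a proper distribution and the weights sum to one, the mixture integrates to one regardless of the tree structures chosen per pool.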

In view of the preceding we have the complete integrated likelihood

$$P\left(\left({X}^{n},{U}^{n}\right)|\underline{\Pi}\right)={\int}_{\Lambda}{\int}_{\Theta}{\int}_{\Phi}\prod _{l=1}^{n}\prod _{c=1}^{k}{\left[{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right){\lambda}_{c}\right]}^{{u}_{c}^{\left(l\right)}}g\left({\underline{\theta}}_{c},{\underline{\varphi}}_{c}\right)\psi \left(\lambda \right)d{\underline{\theta}}_{c}d{\underline{\varphi}}_{c}d\lambda ,$$

where $\underline{\Pi}$ denotes the collection of the trees (forests) $\{{\Pi}_{c},c=1,\dots ,k\}$, and further

$${\underline{\theta}}_{c}=\left({\theta}_{1c},\dots ,{\theta}_{dc}\right),\phantom{\rule{1.em}{0ex}}{\underline{\varphi}}_{c}=\left({\varphi}_{2c},\dots ,{\varphi}_{dc}\right)$$

and Θ and Φ are the appropriate spaces for ${\underline{\theta}}_{1},\dots ,{\underline{\theta}}_{k}$ and ${\underline{\varphi}}_{1},\dots ,{\underline{\varphi}}_{k}$ to assume their values in. The space Λ is equal to $\{\lambda \mid {\sum}_{c=1}^{k}{\lambda}_{c}=1,{\lambda}_{c}\ge 0\}$ and $\psi \left(\lambda \right)$ is a probability density on Λ. The above integral factorizes as

$$P\left({X}^{n},{U}^{n}|\underline{\Pi}\right)={Q}^{*}\left({U}^{n}\right)\cdot {P}^{*}\left({X}^{n}|{U}^{n},\underline{\Pi}\right),$$

where the auxiliary notations are defined as

$${Q}^{*}\left({U}^{n}\right)={\int}_{\Lambda}\prod _{c=1}^{k}{\lambda}_{c}^{{t}_{c}}\psi \left(\lambda \right)d\lambda ,$$

and

$${P}^{*}\left({X}^{n}|{U}^{n},\underline{\Pi}\right)=\prod _{c=1}^{k}{\int}_{{\Theta}_{c}}{\int}_{{\Phi}_{c}}\prod _{l=1}^{n}{\left[{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right)\right]}^{{u}_{c}^{\left(l\right)}}g\left({\underline{\theta}}_{c},{\underline{\varphi}}_{c}\right)d{\underline{\theta}}_{c}d{\underline{\varphi}}_{c}.$$

With regard to the last integral above the situation is analogous to (8), so that

$$\prod _{l=1}^{n}{\left[{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right)\right]}^{{u}_{c}^{\left(l\right)}}={K}_{1}\prod _{i=2}^{d}{\theta}_{ic}^{{n}_{ic}(1,1)}{\left(1-{\theta}_{ic}\right)}^{{n}_{ic}(0,1)}\cdot {\varphi}_{ic}^{{n}_{ic}(1,0)}{\left(1-{\varphi}_{ic}\right)}^{{n}_{ic}(0,0)},$$

where ${\theta}_{ic}$ is defined as

$${\theta}_{ic}=P\left({x}_{i}=1|{x}_{{\Pi}_{c}\left(i\right)}=1\right)$$

and analogously for ${\varphi}_{ic}$, and

$${K}_{1}={\theta}_{1c}^{{n}_{1c}}{\left(1-{\theta}_{1c}\right)}^{{t}_{c}-{n}_{1c}}.$$

As all ${x}^{\left(l\right)}$ with the label c follow the factorization of the multivariate distribution for the same dependence tree ${\Pi}_{c}$,

$${n}_{ic}(1,1)=\sum _{l:{u}_{c}^{\left(l\right)}=1}{x}_{i}^{\left(l\right)}{x}_{{\Pi}_{c}\left(i\right)}^{\left(l\right)},\phantom{\rule{1.em}{0ex}}{n}_{ic}(1,0)=\sum _{l:{u}_{c}^{\left(l\right)}=1}{x}_{i}^{\left(l\right)}\left(1-{x}_{{\Pi}_{c}\left(i\right)}^{\left(l\right)}\right),$$

$${n}_{ic}(0,1)=\sum _{l:{u}_{c}^{\left(l\right)}=1}\left(1-{x}_{i}^{\left(l\right)}\right){x}_{{\Pi}_{c}\left(i\right)}^{\left(l\right)},\phantom{\rule{1.em}{0ex}}{n}_{ic}(0,0)=\sum _{l:{u}_{c}^{\left(l\right)}=1}\left(1-{x}_{i}^{\left(l\right)}\right)\left(1-{x}_{{\Pi}_{c}\left(i\right)}^{\left(l\right)}\right),$$

for $i=2,\dots ,d$, and

$${n}_{1c}=\sum _{l:{u}_{c}^{\left(l\right)}=1}{x}_{1}^{\left(l\right)}.$$

Here ${t}_{c}$ is given in (14), and ${n}_{ic}(1,1)$ counts the number of times the pair $\left({x}_{i}^{\left(l\right)}=1,{x}_{{\Pi}_{c}\left(i\right)}^{\left(l\right)}=1\right)$ occurs in the data set simultaneously with ${u}_{c}^{\left(l\right)}=1$.

Hence, under the above assumptions the same computations that were used in the preceding section entail the formula

$$\begin{array}{cc}\hfill -log{P}^{*}\left({X}^{n}|{U}^{n},\underline{\Pi}\right)& =-\sum _{c=1}^{k}logP\left({X}^{n}|{\Pi}_{c}\right)\hfill \\ & =klog\pi +\sum _{c=1}^{k}log\frac{{t}_{c}!}{\Gamma \left({n}_{1c}+1/2\right)\cdot \Gamma \left({t}_{c}-{n}_{1c}+1/2\right)}\hfill \\ & +2\cdot k\cdot (d-1)\cdot log\pi \hfill \\ & +\sum _{c=1}^{k}\sum _{i=2}^{d}log\frac{\left({n}_{ic}(1,1)+{n}_{ic}(0,1)\right)!}{\Gamma \left({n}_{ic}(1,1)+1/2\right)\cdot \Gamma \left({n}_{ic}(0,1)+1/2\right)}\hfill \\ & +\sum _{c=1}^{k}\sum _{i=2}^{d}log\frac{\left({n}_{ic}(1,0)+{n}_{ic}(0,0)\right)!}{\Gamma \left({n}_{ic}(1,0)+1/2\right)\cdot \Gamma \left({n}_{ic}(0,0)+1/2\right)},\hfill \end{array}$$

where we have introduced the notation $-logP\left({X}^{n}|{\Pi}_{c}\right)$ for the expression in (11) when the class index c is inserted in all the appropriate places. The forest of Chow-Liu trees is merely implicit in the right hand side of this expression, but is, of course, required for the computation and definition of the statistics of the locus pairs.

The integral ${\int}_{\Lambda}{\prod}_{c=1}^{k}{\lambda}_{c}^{{t}_{c}}\psi \left(\lambda \right)d\lambda $ is explicitly evaluated by taking $\psi \left(\lambda \right)$ as the Dirichlet density (see [26] for a rationale), which means that

$$\psi \left({\lambda}_{1},\dots ,{\lambda}_{k}\right)=\left\{\begin{array}{cc}\frac{\Gamma \left({\sum}_{c=1}^{k}{\alpha}_{c}\right)}{{\prod}_{c=1}^{k}\Gamma \left({\alpha}_{c}\right)}\prod _{c=1}^{k}{\lambda}_{c}^{{\alpha}_{c}-1},\hfill & \text{if}\phantom{\rule{4.pt}{0ex}}{\lambda}_{1},\dots ,{\lambda}_{k}\in \Lambda \phantom{\rule{4.pt}{0ex}}\hfill \\ 0\hfill & \text{otherwise.}\hfill \end{array}\right.$$

Then in (19)

$${Q}^{*}\left({U}^{n}\right)={\int}_{\Lambda}\prod _{c=1}^{k}{\lambda}_{c}^{{t}_{c}}\psi \left(\lambda \right)d\lambda =\frac{\Gamma \left({\sum}_{c=1}^{k}{\alpha}_{c}\right)}{{\prod}_{c=1}^{k}\Gamma \left({\alpha}_{c}\right)}{\int}_{\Lambda}\prod _{c=1}^{k}{\lambda}_{c}^{{t}_{c}+{\alpha}_{c}-1}d\lambda =\frac{\Gamma \left({\sum}_{c=1}^{k}{\alpha}_{c}\right)}{{\prod}_{c=1}^{k}\Gamma \left({\alpha}_{c}\right)}\cdot \frac{{\prod}_{c=1}^{k}\Gamma \left({t}_{c}+{\alpha}_{c}\right)}{\Gamma \left({\sum}_{c=1}^{k}{t}_{c}+{\sum}_{c=1}^{k}{\alpha}_{c}\right)}.$$

If we choose the Jeffreys' prior, which corresponds to ${\alpha}_{c}=1/2$ for all c (for this result and the computations required, cf. [24] or [27], p. 218), we obtain

$$\begin{array}{cc}\hfill -log{Q}^{*}\left({U}^{n}\right)& =\frac{k}{2}log\pi -log\Gamma (k/2)\hfill \\ & +log\Gamma \left(n+k/2\right)-\sum _{c=1}^{k}log\Gamma \left({t}_{c}+1/2\right).\hfill \end{array}$$

We have thus established that SC (per item) is the sum of the length of the description of the items within the classification and of the length of the description of the classification with respect to the classification model chosen, i.e.,

$$\text{SC}:=-\frac{1}{n}log{P}^{*}\left({X}^{n}|{U}^{n},\underline{\mathbf{\Pi}}\right)+\left(-\frac{1}{n}log{Q}^{*}\left({U}^{n}\right)\right),$$

respectively, with detailed expressions from (23) and (25). This formula can be used to evaluate the complexity of any classification ${U}^{n}$ of data ${X}^{n}$, irrespective of the way it has been arrived at.
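The classification part $-log{Q}^{*}\left({U}^{n}\right)$ is a simple function of the class sizes ${t}_{1},\dots ,{t}_{k}$. A sketch under the Jeffreys choice ${\alpha}_{c}=1/2$ (the function name `neg_log_Q` and the example class sizes are our own):

```python
from math import lgamma

# Sketch of -log Q*(U^n) with symmetric Dirichlet prior, alpha_c = 1/2:
# (k/2) log pi - log Gamma(k/2) + log Gamma(n + k/2) - sum_c log Gamma(t_c + 1/2).
# Note k * lgamma(0.5) equals (k/2) log pi, since Gamma(1/2) = sqrt(pi).

def neg_log_Q(sizes):
    k, n = len(sizes), sum(sizes)
    return (k * lgamma(0.5) - lgamma(k / 2.0)
            + lgamma(n + k / 2.0)
            - sum(lgamma(tc + 0.5) for tc in sizes))

sizes = [12, 30, 8]          # illustrative class sizes: k = 3 pools, n = 50
val = neg_log_Q(sizes)
print(val > 0)
```

As a sanity check, for $k=1$ the whole sample lies in a single class, $Q^{*}=1$, and the expression evaluates to zero exactly.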

Chow and Liu [12] established an algorithm for maximization of ${\sum}_{i=2}^{d}{I}_{i,\Pi \left(i\right)}$ as a function of the tree structure. Suzuki [28] has extended the Chow-Liu algorithm by adding to ${\sum}_{i=2}^{d}{I}_{i,\Pi \left(i\right)}$ terms penalizing tree networks that are too complex. We shall next present a variant of these two algorithms for any single class c containing ${t}_{c}$ samples.

The procedure for constructing disconnected tree networks consists of the following steps:

**A1.**- Compute the numbers$$I{P}_{i,j}=\sum _{u=0}^{1}\sum _{v=0}^{1}{\widehat{P}}_{i,j}\left(u,v\right)log\frac{{\widehat{P}}_{i,j}\left(u,v\right)}{{\widehat{P}}_{i}\left(u\right)\cdot {\widehat{P}}_{j}\left(v\right)}-\frac{1}{2}\cdot \frac{1}{t}\left[log\left({n}_{c}\left(1\right)\right)+log\left({n}_{c}\left(0\right)\right)\right]$$
**A2.**- Construct a complete undirected graph with the binary variables as nodes.
**A3.**- Construct a maximum weighted spanning tree with the extra condition that an edge is in the tree only if $I{P}_{i,j}>0$.
**A4.**- Make the maximum weighted spanning tree directed by choosing a root variable and setting the direction of all edges to be outward from the root.

There are several algorithms for constructing a maximum weighted spanning tree in step **A3**, when the condition permitting disconnected graphs is not imposed. The most time-honoured algorithm for the task is the Borůvka-Choquet-Kruskal algorithm [29].
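Since step **A3** admits forests, the construction can be sketched as a Kruskal-style scan over edges in decreasing weight order, discarding every edge with non-positive penalized weight. The sketch below is illustrative only; the dictionary layout for the precomputed weights $I{P}_{i,j}$ and the function name are our own conventions:

```python
def max_weight_spanning_forest(d, weight):
    """Build a maximum-weight spanning forest over nodes 0..d-1.
    weight[(i, j)] holds the penalized mutual information IP_{i,j};
    an edge enters the forest only if its weight is positive (step A3)
    and it does not close a cycle."""
    parent = list(range(d))

    def find(x):  # union-find root with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for (i, j) in sorted(weight, key=weight.get, reverse=True):
        if weight[(i, j)] <= 0:  # remaining edges cannot qualify
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # accepting the edge keeps the graph acyclic
            parent[ri] = rj
            forest.append((i, j))
    return forest
```

Nodes left untouched by the scan become isolated roots, which is exactly the disconnected case permitted by the condition $I{P}_{i,j}>0$.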

Having completed the steps **A1**-**A4** we have a tree structure
where each ${\widehat{\Pi}}_{c}^{\left(s\right)}$ corresponds to a Chow-Liu tree over a subset of the nodes $\{1,\dots ,d\}$, with its own distinct root. For a disconnected tree we then have the stochastic complexity
where each term on the right hand side is of the form
with obvious definitions of the quantities involved.

$${\widehat{\Pi}}_{c}={\left\{{\widehat{\Pi}}_{c}^{\left(s\right)}\right\}}_{s=1}^{r},$$

$$-logP\left({X}^{{t}_{c}}|{\widehat{\Pi}}_{c}\right)=-\sum _{s=1}^{r}logP\left({X}^{{t}_{c}}|{\widehat{\Pi}}_{c}^{\left(s\right)}\right),$$

$$\begin{array}{cc}\hfill -logP\left({X}^{{t}_{c}}|{\widehat{\Pi}}_{c}^{\left(s\right)}\right)& =log\pi +log\frac{{t}_{c}^{\left(s\right)}!}{\Gamma \left({n}_{1}^{\left(s\right)}+1/2\right)\cdot \Gamma \left({t}_{c}^{\left(s\right)}-{n}_{1}^{\left(s\right)}+1/2\right)}\hfill \end{array}$$

$$\begin{array}{cc}\hfill +& ({d}^{\left(s\right)}-1)log\pi \hfill \\ \hfill +& \sum _{i=2}^{{d}^{\left(s\right)}}log\frac{\Gamma \left({n}_{\Pi \left(i\right)}^{\left(s\right)}\left(1\right)+1\right)}{\Gamma \left({n}_{i}^{\left(s\right)}(1,1)+1/2\right)\cdot \Gamma \left({n}_{i}^{\left(s\right)}(0,1)+1/2\right)}\hfill \\ \hfill +& ({d}^{\left(s\right)}-1)log\pi \hfill \\ \hfill +& \sum _{i=2}^{{d}^{\left(s\right)}}log\frac{\Gamma \left({n}_{\Pi \left(i\right)}^{\left(s\right)}\left(0\right)+1\right)}{\Gamma \left({n}_{i}^{\left(s\right)}(1,0)+1/2\right)\cdot \Gamma \left({n}_{i}^{\left(s\right)}(0,0)+1/2\right)},\hfill \end{array}$$

The case ${d}^{\left(s\right)}=1$ corresponds to a node that is not connected to any other node, and hence only the first term on the right hand side of (30) is needed.
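Under the Jeffreys prior, the right hand side of (30) is a finite sum of log-gamma terms and can be computed directly from the sufficient counts. A hedged sketch; the count layout passed to `neg_log_marginal_tree` is our own convention, not the paper's notation:

```python
from math import lgamma, log, pi

def neg_log_marginal_tree(t, n1_root, counts):
    """-log P(X^{t_c} | tree) as in (30): t items in the class,
    n1_root ones observed at the root, and for each non-root node i
    counts[i] = (n_par1, n11, n01, n_par0, n10, n00), where n_par(v)
    is the parent count and nuv the count of (x_i = u, x_parent = v)."""
    # root factor: log pi + log(t! / (Gamma(n1+1/2) Gamma(t-n1+1/2)))
    s = log(pi) + lgamma(t + 1) - lgamma(n1_root + 0.5) - lgamma(t - n1_root + 0.5)
    for (np1, n11, n01, np0, n10, n00) in counts:
        # one log pi and one gamma ratio per parent value, as in (30)
        s += log(pi) + lgamma(np1 + 1) - lgamma(n11 + 0.5) - lgamma(n01 + 0.5)
        s += log(pi) + lgamma(np0 + 1) - lgamma(n10 + 0.5) - lgamma(n00 + 0.5)
    return s
```

The empty-data case returns zero, since the marginal probability of no observations is one.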

Recall the logarithm of the marginal likelihood of the data as a function of an instance of an unsupervised classification and the set of Chow expansions,
Below we show how an algorithm for computing ${max}_{{U}^{n},\Pi}\frac{1}{n}L\left(\left({X}^{n},{U}^{n}\right)|\underline{\mathbf{\Pi}}\right)$ can be formulated in terms of a maximum likelihood estimation procedure. Under the previously stated assumptions, as n grows to infinity, we have for fixed ${X}^{n}$ and k the expression
where R is bounded in n for fixed ${X}^{n}$ and k. By stating that ${X}^{n}$ is fixed as $n\to \infty $, we mean that when ${x}^{(n+1)}$ is added, the preceding ${x}^{\left(l\right)}$ in ${X}^{n}$, $l\le n$, are not changed. We shall merely show that the desired expansion is another way of writing (12) above, when terms corresponding to the similar expansion of $-log{Q}^{*}\left({U}^{n}\right)$ in (25) are added.

$$L\left(\left({X}^{n},{U}^{n}\right)|\underline{\mathbf{\Pi}}\right)=logP\left(\left({X}^{n},{U}^{n}\right)|\underline{\mathbf{\Pi}}\right)$$

$$\begin{array}{cc}\hfill \underset{{U}^{n},\underline{\mathbf{\Pi}}}{max}\frac{1}{n}L\left(\left({X}^{n},{U}^{n}\right)|\underline{\mathbf{\Pi}}\right)& =\underset{{U}^{n},\underline{\mathbf{\Pi}},\Theta ,\lambda}{max}\frac{1}{n}\sum _{l=1}^{n}\sum _{c=1}^{k}\left[{u}_{c}^{\left(l\right)}log{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right)+{u}_{c}^{\left(l\right)}log{\lambda}_{c}\right]\hfill \\ & -\frac{1}{2}k\cdot \left(2d\right)\frac{logn}{n}+R.\hfill \end{array}$$

To evaluate the above expression more explicitly, we start by considering in the right hand side of (32) the maximization of the likelihood
which is equivalent to the maximization of
We first maximize with respect to λ to obtain
where ${\widehat{\lambda}}_{c}$ is
But then the argument following (47) shows that
In other words we are going to maximize
by maximization of
as a function of $\underline{\mathbf{\Pi}}$, which we can do for each class c separately using the Chow-Liu algorithm. Having found the optimum tree topology, we have also obtained the maximum likelihood estimates ${\underline{\widehat{\theta}}}_{c},{\underline{\widehat{\varphi}}}_{c}$ for each $c=1,\dots ,k$.

$$\underset{{U}^{n},\underline{\mathbf{\Pi}},\Theta ,\lambda}{max}\prod _{l=1}^{n}\prod _{c=1}^{k}{\left[{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right){\lambda}_{c}\right]}^{{u}_{c}^{\left(l\right)}},$$

$$\underset{{U}^{n},\underline{\Pi},\Theta ,\lambda}{max}\frac{1}{n}\sum _{l=1}^{n}\sum _{c=1}^{k}\left[{u}_{c}^{\left(l\right)}log{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right)+{u}_{c}^{\left(l\right)}log{\lambda}_{c}\right].$$

$$\underset{{U}^{n},\underline{\mathbf{\Pi}}}{max}\frac{1}{n}\sum _{l=1}^{n}\sum _{c=1}^{k}\left[{u}_{c}^{\left(l\right)}log{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}{\Pi}_{c}\right)\right]+\sum _{c=1}^{k}{\widehat{\lambda}}_{c}log{\widehat{\lambda}}_{c},$$

$${\widehat{\lambda}}_{c}=\frac{{\sum}_{l=1}^{n}{u}_{c}^{\left(l\right)}}{n}=\frac{{t}_{c}}{n}.$$

$$\begin{array}{cc}\hfill \frac{1}{n}\sum _{c=1}^{k}\sum _{l=1}^{n}{u}_{c}^{\left(l\right)}log{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right)& =\frac{1}{n}\sum _{c=1}^{k}log{K}_{1c}\hfill \\ & +\frac{1}{n}\sum _{c=1}^{k}log\frac{{t}_{c}!}{\Gamma \left({n}_{1c}+1/2\right)\cdot \Gamma \left({t}_{c}-{n}_{1c}+1/2\right)}\hfill \\ & -\sum _{c=1}^{k}\frac{{t}_{c}}{n}\sum _{i=2}^{d}h\left({n}_{ic}\left(1\right)/n\right)+\sum _{c=1}^{k}\frac{{t}_{c}}{n}\sum _{i=2}^{d}{I}_{i,{\Pi}_{c}\left(i\right)}.\hfill \end{array}$$

$$\frac{1}{n}\sum _{c=1}^{k}\sum _{l=1}^{n}{u}_{c}^{\left(l\right)}log{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right)$$

$$\sum _{c=1}^{k}\frac{{t}_{c}}{n}\sum _{i=2}^{d}{I}_{i,{\Pi}_{c}\left(i\right)},$$

However, we have already in (12) the additional terms $\frac{1}{2}{\sum}_{i=2}^{d}\left[log\left({n}_{{\Pi}_{c}\left(i\right)}\left(1\right)\right)+log\left({n}_{{\Pi}_{c}\left(i\right)}\left(0\right)\right)\right]$, which can be subsumed in $\frac{1}{2}k\cdot \left(2d\right)\frac{logn}{n}$.

Clearly we can use Stirling’s formula in a similar way as in (44) to expand $-log{Q}^{*}\left({U}^{n}\right)$ in (25) so as to obtain
The general asymptotic expansion result due to Schwartz [30], as in [31], c.f. [32] and [33] chapter 5, provides a similar kind of expansion, however without making comparable assumptions about the prior densities. This special application is based on the fact that ${\left[{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right){\lambda}_{c}\right]}^{{u}_{c}^{\left(l\right)}}$ belongs to the exponential family of distributions with convex parameter sets. In general, the study of the asymptotics of approximate Bayesian model selection under implicit priors in the presence of hidden states like class variables is very challenging; for some recent significant progress see [33,34].

$$-\frac{1}{n}log{Q}^{*}\left({U}^{n}\right)\approx -\sum _{c=1}^{k}{\widehat{\lambda}}_{c}log{\widehat{\lambda}}_{c}-\frac{k-1}{2}\frac{logn}{n}.$$

We shall next consider an algorithm for unsupervised classification of ${X}^{n}$, i.e., for finding ${U}^{n}$ that maximizes $L\left(\left({X}^{n},{U}^{n}\right)|\underline{\mathbf{\Pi}}\right)$ for a given value of k, using the expansion in (31). A discussion of this kind of expansion in clustering theory is found in [35]. We first define the rules of identification to be used in the algorithm.

A tree augmented supervised Bayesian classifier [13] is based on the following rule of identification, first suggested by Chow and Liu [12]. An x in ${B}^{d}$ is assigned to (identified with) class ${c}_{*}$, if
where ${\widehat{\lambda}}_{c}$ is given in (34) and

$${c}_{*}=arg\underset{1\le c\le k}{max}{P}_{{\underline{\widehat{\theta}}}_{c},{\underline{\widehat{\varphi}}}_{c}}\left(x\mid {\Pi}_{c}\right){\widehat{\lambda}}_{c},$$

$${P}_{{\underline{\widehat{\theta}}}_{c},{\underline{\widehat{\varphi}}}_{c}}\left(x\mid {\Pi}_{c}\right)=\prod _{i=1}^{d}{P}_{{\widehat{\theta}}_{ic},{\widehat{\varphi}}_{ic}}\left({x}_{i}|{x}_{\Pi \left(i\right)}\right).$$
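For a class model given as a rooted tree with its conditional probability tables, the rule of identification above amounts to comparing log scores $log{\widehat{\lambda}}_{c}+log{P}_{{\underline{\widehat{\theta}}}_{c},{\underline{\widehat{\varphi}}}_{c}}\left(x\mid {\Pi}_{c}\right)$. A minimal sketch under our own data layout (parent array, root probability, and tables θ, φ indexed by node); names are illustrative, not from the text:

```python
from math import log

def tree_loglik(x, parent, root_p, theta, phi):
    """Log of the Chow expansion of P(x | Pi_c): root_p = P(x_0 = 1),
    theta[i] = P(x_i = 1 | x_parent = 1), phi[i] = P(x_i = 1 | x_parent = 0)."""
    ll = log(root_p if x[0] == 1 else 1.0 - root_p)
    for i in range(1, len(x)):
        p1 = theta[i] if x[parent[i]] == 1 else phi[i]
        ll += log(p1 if x[i] == 1 else 1.0 - p1)
    return ll

def assign_class(x, lam, models):
    """Rule of identification: the class c maximizing lambda_c * P(x | Pi_c),
    where models[c] = (parent, root_p, theta, phi)."""
    scores = [log(lam[c]) + tree_loglik(x, *models[c]) for c in range(len(lam))]
    return max(range(len(lam)), key=scores.__getitem__)
```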

To simplify the required notation, we drop here the superscript from the classification matrix and set ${U}^{n}=U$. The findings in the preceding subsection show that we can maximize
using the following alternating algorithm:

$$\underset{U,\underline{\Pi},\Theta ,\lambda}{max}\prod _{l=1}^{n}\prod _{c=1}^{k}{\left[{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right){\lambda}_{c}\right]}^{{u}_{c}^{\left(l\right)}}$$

**B1.**- Fix k, set $w=0$ and store an arbitrary (random) ${U}_{\left(w\right)}$.
**B2.**- Find the structure ${\widehat{\Pi}}_{\left(w\right)}$ maximizing$$\sum _{c=1}^{k}\frac{{t}_{c}}{n}\sum _{i=2}^{d}{I}_{i,{\Pi}_{c}\left(i\right)}-\frac{1}{2}k\cdot \left(2d\right)\frac{logn}{n}$$(using the steps **A1**-**A4**).
**B3.**- For ${U}_{\left(w\right)}$ and ${\widehat{\Pi}}_{\left(w\right)}$ compute the maximum likelihood estimates ${\widehat{\Theta}}_{\left(w\right)}$ and ${\widehat{\lambda}}_{\left(w\right)}$.
**B4.**- Given ${\widehat{\Theta}}_{\left(w\right)}$, ${\widehat{\lambda}}_{\left(w\right)}$, and ${\widehat{\Pi}}_{\left(w\right)}$ determine ${U}_{(w+1)}={\left\{{\left({u}_{c}^{\left(l\right)}\right)}_{(w+1)}\right\}}_{c,l=1}^{n,k}$ using$${\left({u}_{c}^{\left(l\right)}\right)}_{(w+1)}=\left\{\begin{array}{cc}1\hfill & \text{if}\phantom{\rule{4.pt}{0ex}}{c}_{*}^{\left(l\right)}=c\phantom{\rule{4.pt}{0ex}}\hfill \\ 0\hfill & \text{otherwise,}\hfill \end{array}\right.$$$${c}_{*}^{\left(l\right)}=arg\underset{1\le c\le k}{max}{P}_{{\underline{\widehat{\theta}}}_{c},{\underline{\widehat{\varphi}}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right){\widehat{\lambda}}_{c}.$$
**B5.**- If ${U}_{(w+1)}={U}_{\left(w\right)}$, then stop, otherwise set $w=w+1$ and go to **B2**.

It can be proved in the same way as in [31] that this algorithm will re-enter step **B2** only a finite number of times and, after having stopped, will have found a local maximum of
as a function of $U,\underline{\mathbf{\Pi}},\Theta ,\lambda $. This is easily seen, because each step of the algorithm increases the value of the likelihood function: only non-negative terms are added in step **B4**.

$$\prod _{l=1}^{n}\prod _{c=1}^{k}{\left[{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right){\lambda}_{c}\right]}^{{u}_{c}^{\left(l\right)}}$$
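The alternating structure of **B1**-**B5** can be captured in a generic skeleton: re-estimate the model from the current partition, reclassify every item by the rule of identification, and stop at a fixed point. The `fit` and `assign` arguments stand in for the Chow-Liu estimation and the classifier; the skeleton is our illustrative rendering, not the paper's implementation:

```python
def classify(X, k, fit, assign, init, max_iter=100):
    """Alternating algorithm B1-B5 as a generic fixed-point iteration."""
    U = init(X, k)                    # B1: arbitrary starting classification
    model = fit(X, U, k)              # B2-B3: structures and ML estimates
    for _ in range(max_iter):
        U_new = [assign(x, model) for x in X]   # B4: reclassify each item
        if U_new == U:                # B5: partition unchanged, stop
            break
        U = U_new
        model = fit(X, U, k)          # B2-B3 again for the new partition
    return U, model
```

Any `fit`/`assign` pair in which reassignment never decreases the fitted likelihood inherits the finite-termination argument above; a nearest-mean toy pair already converges in a couple of sweeps.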

Due to local meta independence and the other assumptions, the class-wise components of ${\widehat{\Theta}}_{\left(w\right)}$, ${\widehat{\lambda}}_{\left(w\right)}$, and ${\widehat{\mathbf{\Pi}}}_{\left(w\right)}$ for each class c are estimated at this step of the algorithm using only those items that ${U}_{\left(w\right)}$ assigns to class c.

The estimation procedure above, i.e., finding $\widehat{U},\underline{\widehat{\mathbf{\Pi}}},\widehat{\Theta},\widehat{\lambda}$ such that
is an example of what is known as the classification maximum likelihood estimate. The procedure has been proved to yield biased estimates of the parameters of the probability distribution [35]. In addition, the family of distributions dealt with here is not identifiable [31]. However, despite this, the classification performance need not be impaired in practice, provided that the underlying classes are represented by a wealth of samples.

$$\left(\widehat{U},\underline{\widehat{\Pi}},\widehat{\Theta},\widehat{\lambda}\right)=\text{arg}\underset{U,\underline{\Pi},\Theta ,\lambda}{max}\prod _{l=1}^{n}\prod _{c=1}^{k}{\left[{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right){\lambda}_{c}\right]}^{{u}_{c}^{\left(l\right)}}$$

Finally, in order to establish k, the number of classes, from the data ${X}^{n}$ it is possible to proceed by executing the above algorithm for all feasible values of k and choosing $\widehat{k}$ and the corresponding $\widehat{U}$ such that
is minimal.

$$-\frac{1}{n}log{P}^{*}\left({X}^{n}|{\widehat{U}}^{n},\underline{\mathbf{\Pi}}\right)+\left(-\frac{1}{n}log{Q}^{*}\left({\widehat{U}}^{n}\right)\right),$$

In view of the fact that we are actually dealing with an exponential family in ${\left[{P}_{{\underline{\theta}}_{c},{\underline{\varphi}}_{c}}\left({x}^{\left(l\right)}\mid {\Pi}_{c}\right){\lambda}_{c}\right]}^{{u}_{c}^{\left(l\right)}}$, we note the consistency result in [36], which establishes the fact that maximizing
as a function of k will produce a consistent estimate of the model and k.

$$L\left(\left({X}^{n},{U}^{n}\right)|\underline{\mathbf{\Pi}}\right)$$

As the success of statistical mixture models applied to inferring population structures neatly demonstrates, general probabilistic machine learning theory contains many contributions that have potentially fruitful applications in diverse areas of scientific inquiry. Molecular biology in general is both an attractive target for applications of generic probabilistic machine learning tools and a source of inspiration for theoretical research on such methods, given the rich variety of biological problems that necessitate the use of advanced computational and statistical methods to arrive at meaningful solutions.

From a theoretical perspective it is fairly intuitive to attempt to represent possible dependencies between marker loci using the relatively sparse model structures the general theory of probabilistic graphical models and networks has to offer. An introduction to the general theory can be found in [16]. It is worth noting that while our machine learning formulation does not include representation of linkage distances in terms of explicit parameters for that purpose, conditional distributions of alleles defined in terms of the tree factorization can still flexibly represent a wide range of dependencies from near independence to complete linkage between loci. Consequently, the evolutionary time scale related to the linkage patterns remains implicit in our formulation, and it is therefore dependent on the characteristics of a particular data set that is investigated.

We have intentionally abstained from considering explicitly any ancestral relationships of the observed samples in terms of a graph (or a tree), either at the level of individual samples or inferred latent pools of them. Such graphs would obviously increase the biological realism of our approach when incorporated in the dependence model; however, as the level of computational complexity associated with our machine learning method is already very high, explicit models of ancestral relationships would likely render it practically inapplicable to data sets harboring large numbers of samples and latent pools.

In a typical theoretical formulation, probabilistic classification is based on ranking of the posterior probabilities of classes given an observed feature vector. This is in fact the optimal rule of identification if the “true” description of the data is used to compute the posterior probabilities. Wong and Poon [37] claimed that the tree-aided classifier of Chow and Liu minimizes an upper bound on the Bayes error rate, if the true distribution is approximated by (a mixture of) tree dependent distributions. However, it was later shown in [38] that the result was erroneous and that more caution is needed in the interpretation of classifier error rate in this context.

The information theoretic approach based on minimization of stochastic complexity adopted here [24,39,40,41] is closely related to the fully Bayesian approach, where a comparable model would lead to a posterior distribution over the possible combinations of classifications and expansion structures. Our current approach generalizes previous “naive classifiers” using class-conditional probability distributions expressing independence between features in [31,42], which are also trained by minimization of stochastic complexity. Similarly to the Bayesian modeling paradigm, SC enforces a trade-off between descriptional/predictive accuracy and modeling complexity.

Our result on stochastic complexity for class-conditional probability distributions factorized along a (rooted) tree, whose nodes correspond to the components of a binary vector, was obtained by applying the results in [23,24] to a Chow expansion of the joint probability integrated with respect to Jeffreys’ prior. Generally, minimization of stochastic complexity corresponds in many cases to the minimum description length (MDL) principle of model choice. The MDL principle is discussed for learning the structure of graphs in [43,44], while surveys and tutorials of algorithms and techniques for learning graph structure from data are given in [45,46].

The procedure of learning trees from data was first presented in [12]. In this procedure the mutual information between all pairs of nodes is computed using the relevant sample frequencies, and the best tree is selected as the one that gives the maximum overall mutual information. This is in fact a maximum likelihood estimate of the tree, the asymptotic consistency of which was proved in [47] for increasing sets of independent samples from a tree dependent distribution. The procedure was extended by Suzuki [28], who observed the connection of the Chow-Liu estimate to MDL. The techniques in [44] do not lead to this, as pointed out by Suzuki [28]. The procedure of structure learning applied here is, as far as the probability distributions involved are concerned, closely related to the Bayesian algorithms in [20,21,48]. Learning of graphs from data using the search algorithms of Cooper and Herskovits is an NP-complete problem [49].

The deterministic algorithms for learning the population structure by minimizing stochastic complexity that were introduced in the previous section can be considered relatively implementation-friendly, although they still become considerably computation-intensive as the number of samples and marker loci increases. Given that such algorithms typically only converge to local optima when the model structure and the topology of the search space are complex, it would be necessary to execute the algorithms multiple times from different random starting configurations to gain information about the stability of the learned optimal structures. Furthermore, since the stochastic complexities of any two model structures can be compared analytically, the differences in optimal data encoding efficiency can easily be assessed over multiple runs of the algorithms.

An alternative to the deterministic learning algorithms considered in this work would be to consider a family of Monte Carlo algorithms to either approximate the SC optimal population structure or to perform a fully Bayesian analysis where the posterior distribution over the population structures is approximated. We have earlier considered Markov chain Monte Carlo (MCMC)-based learning of unsupervised classification and graphical models [50,51,52], and in particular demonstrated that standard reversible Metropolis-Hastings algorithms may dramatically fail when the level of complexity of the considered models is very high. To resolve this issue, Corander et al. [50] introduced a parallel non-reversible MCMC algorithm for Bayesian model learning where the topology of the model space in combination with the probabilistic search operators is not allowed to influence the acceptance ratio of a Metropolis-Hastings algorithm. This strategy was illustrated to be much more fruitful than a standard reversible MCMC algorithm for learning a large dimensional unsupervised classification model. A particular strength of the non-reversible algorithm is that it enables more freedom in the design of the search operators utilized in the proposal mechanism, since the proposal probabilities need not be calculated explicitly. On the other hand, the currently considered learning problem is so complex in general that any realistic implementation of a stochastic learning algorithm must be done within a true parallel computing environment to prevent the computation times becoming prohibitive in practice. Our future aim is to implement such algorithms and the deterministic algorithms considered in this work to compare their relative levels of performance for solving the learning task of unsupervised classification augmented with trees.
Also, an interesting generalization of the introduced linkage modeling framework would be to consider multi-allelic loci as well as data from diploid and tetraploid organisms.

Work of JC was supported by ERC project no. 239784. The authors would like to thank two anonymous reviewers and Yaqiong Cui for comments that enabled us to improve the original version of the article.

The integral on the right hand side of (9) will, by force of (10), factorize as
where
and
There is an explicit expression for each of the factors ${I}_{1}$, ${I}_{2}$ and ${I}_{3}$, if the prior densities $h(\xb7)$ and $z(\xb7)$ are Beta densities, e.g.,
where ${\alpha}_{i}>0$. Then, θ has a Be$\left({\alpha}_{1},{\alpha}_{2}\right)$ distribution. Using the Beta integral
we obtain, e.g., in each factor of ${I}_{2}$ in (41)
Thus we have
as well as

$$P\left({X}^{t}|\Pi \right)={I}_{1}\xb7{I}_{2}\xb7{I}_{3},$$

$${I}_{1}={\int}_{0}^{1}{\theta}_{1}^{{n}_{1}}{\left(1-{\theta}_{1}\right)}^{t-{n}_{1}}h\left({\theta}_{1}\right)d{\theta}_{1},$$

$${I}_{2}=\prod _{i=2}^{d}{\int}_{0}^{1}{\theta}_{i}^{{n}_{i}(1,1)}{\left(1-{\theta}_{i}\right)}^{{n}_{i}(0,1)}h\left({\theta}_{i}\right)d{\theta}_{i},$$

$${I}_{3}=\prod _{i=2}^{d}{\int}_{0}^{1}{\varphi}_{i}^{{n}_{i}(1,0)}{\left(1-{\varphi}_{i}\right)}^{{n}_{i}(0,0)}z\left({\varphi}_{i}\right)d{\varphi}_{i}.$$

$$h\left(\theta \right)=\left\{\begin{array}{cc}\frac{\Gamma \left({\alpha}_{1}+{\alpha}_{2}\right)}{\Gamma \left({\alpha}_{1}\right)\cdot \Gamma \left({\alpha}_{2}\right)}{\theta}^{{\alpha}_{1}-1}{\left(1-\theta \right)}^{{\alpha}_{2}-1}\hfill & 0\le \theta \le 1\phantom{\rule{4.pt}{0ex}}\hfill \\ 0\hfill & \text{elsewhere,}\hfill \end{array}\right.$$

$${\int}_{0}^{1}{\theta}^{{\alpha}_{1}-1}{(1-\theta )}^{{\alpha}_{2}-1}d\theta =\frac{\Gamma \left({\alpha}_{1}\right)\cdot \Gamma \left({\alpha}_{2}\right)}{\Gamma \left({\alpha}_{1}+{\alpha}_{2}\right)}$$

$${\int}_{0}^{1}{\theta}_{i}^{{n}_{i}(1,1)}{\left(1-{\theta}_{i}\right)}^{{n}_{i}(0,1)}h\left({\theta}_{i}\right)d{\theta}_{i}=\frac{\Gamma \left({\alpha}_{1}+{\alpha}_{2}\right)}{\Gamma \left({\alpha}_{1}\right)\cdot \Gamma \left({\alpha}_{2}\right)}\frac{\Gamma \left({n}_{i}(1,1)+{\alpha}_{1}\right)\cdot \Gamma \left({n}_{i}(0,1)+{\alpha}_{2}\right)}{\Gamma \left({n}_{i}(1,1)+{n}_{i}(0,1)+{\alpha}_{1}+{\alpha}_{2}\right)}.$$

$${I}_{2}=\prod _{i=2}^{d}\frac{\Gamma \left({\alpha}_{1}+{\alpha}_{2}\right)}{\Gamma \left({\alpha}_{1}\right)\cdot \Gamma \left({\alpha}_{2}\right)}\frac{\Gamma \left({n}_{i}(1,1)+{\alpha}_{1}\right)\cdot \Gamma \left({n}_{i}(0,1)+{\alpha}_{2}\right)}{\Gamma \left({n}_{i}(1,1)+{n}_{i}(0,1)+{\alpha}_{1}+{\alpha}_{2}\right)},$$

$${I}_{3}=\prod _{i=2}^{d}\frac{\Gamma \left({\alpha}_{1}+{\alpha}_{2}\right)}{\Gamma \left({\alpha}_{1}\right)\cdot \Gamma \left({\alpha}_{2}\right)}\frac{\Gamma \left({n}_{i}(1,0)+{\alpha}_{1}\right)\cdot \Gamma \left({n}_{i}(0,0)+{\alpha}_{2}\right)}{\Gamma \left({n}_{i}(1,0)+{n}_{i}(0,0)+{\alpha}_{1}+{\alpha}_{2}\right)},$$

$${I}_{1}=\frac{\Gamma \left({\alpha}_{1}+{\alpha}_{2}\right)}{\Gamma \left({\alpha}_{1}\right)\cdot \Gamma \left({\alpha}_{2}\right)}\frac{\Gamma \left({n}_{1}+{\alpha}_{1}\right)\cdot \Gamma \left(t-{n}_{1}+{\alpha}_{2}\right)}{\Gamma \left(t+{\alpha}_{1}+{\alpha}_{2}\right)}.$$

Assuming ${\alpha}_{1}={\alpha}_{2}=1/2$ we obtain the generic term (denoted by ${E}_{1/2}^{\left(2\right)}$) in ${I}_{2}$ (see (11)) as

$$\begin{array}{cc}\hfill {E}_{1/2}^{\left(2\right)}& \equiv \frac{\Gamma \left({\alpha}_{1}+{\alpha}_{2}\right)}{\Gamma \left({\alpha}_{1}\right)\cdot \Gamma \left({\alpha}_{2}\right)}\frac{\Gamma \left({n}_{i}(1,1)+{\alpha}_{1}\right)\cdot \Gamma \left({n}_{i}(0,1)+{\alpha}_{2}\right)}{\Gamma \left({n}_{i}(1,1)+{n}_{i}(0,1)+{\alpha}_{1}+{\alpha}_{2}\right)}\hfill \\ & =\frac{1}{\pi}\frac{\Gamma \left({n}_{i}(1,1)+1/2\right)\cdot \Gamma \left({n}_{i}(0,1)+1/2\right)}{\Gamma \left({n}_{i}(1,1)+{n}_{i}(0,1)+1\right)}.\hfill \end{array}$$

Invoking Stirling’s formula in a straightforward calculation for $-log{E}_{1/2}^{\left(2\right)}$ in $-log{I}_{2}$ in (11) entails
where C is bounded in t and $h\left(x\right)=-xlogx-(1-x)log(1-x)$, $0\le x\le 1$, is the binary entropy function (in natural logarithms) of the empirical distribution
Here ${\widehat{\theta}}_{i}^{\left(1\right)}$ is the maximum likelihood estimate (based on ${X}^{t}$) of ${\theta}_{i}^{\left(1\right)}=P\left({x}_{i}=1|{x}_{\Pi \left(i\right)}=1\right)$.

$$-log{E}_{1/2}^{\left(2\right)}=\left({n}_{i}(1,1)+{n}_{i}(0,1)\right)h\left({\widehat{\theta}}_{i}^{\left(1\right)}\right)+\frac{1}{2}log\left({n}_{i}(1,1)+{n}_{i}(0,1)\right)+C,$$

$$\left({\widehat{\theta}}_{i}^{\left(1\right)},1-{\widehat{\theta}}_{i}^{\left(1\right)}\right)=\left(\frac{{n}_{i}(1,1)}{{n}_{i}(1,1)+{n}_{i}(0,1)},\frac{{n}_{i}(0,1)}{{n}_{i}(1,1)+{n}_{i}(0,1)}\right).$$

For a generic term (denoted by $-log{E}_{1/2}^{\left(3\right)}$) in $-log{I}_{3}$ in (11) we obtain in the same way
where ${\widehat{\theta}}_{i}^{\left(0\right)}$ is the maximum likelihood estimate of ${\theta}_{i}^{\left(0\right)}=P\left({x}_{i}=1|{x}_{\Pi \left(i\right)}=0\right)$.

$$-log{E}_{1/2}^{\left(3\right)}=\left({n}_{i}(1,0)+{n}_{i}(0,0)\right)h\left({\widehat{\theta}}_{i}^{\left(0\right)}\right)+\frac{1}{2}log\left({n}_{i}(1,0)+{n}_{i}(0,0)\right)+C,$$

Next we consider the result of inserting the terms
and
in the right hand side of (11). This gives the following expression
The generic term in the sum is
By the definition of the binary entropy function $h\left(x\right)$, this expression is equal to
Let us introduce the auxiliary quantities
and
Then we have as an identity from the right hand side of (47)
The quantities in the right hand side of the last equality can be regrouped as
The first four terms are equal to
The remaining terms are
where we have defined the mutual information [25] between ${X}_{i}$ and ${X}_{\Pi \left(i\right)}$ by
using the maximum likelihood estimates (i.e., observed relative frequencies) of the two-dimensional distribution ${P}_{i,\Pi \left(i\right)}\left(u,v\right)$ as well as of the marginal distributions ${P}_{i}\left(u\right)$ and ${P}_{\Pi \left(i\right)}\left(v\right)$.

$$\left({n}_{i}(1,1)+{n}_{i}(0,1)\right)h\left({\widehat{\theta}}_{i}^{\left(1\right)}\right)$$

$$\left({n}_{i}(1,0)+{n}_{i}(0,0)\right)h\left({\widehat{\theta}}_{i}^{\left(0\right)}\right)$$

$$\sum _{i=2}^{d}\left[\left({n}_{i}(1,1)+{n}_{i}(0,1)\right)h\left({\widehat{\theta}}_{i}^{\left(1\right)}\right)+\left({n}_{i}(1,0)+{n}_{i}(0,0)\right)h\left({\widehat{\theta}}_{i}^{\left(0\right)}\right)\right].$$

$$\left({n}_{i}(1,1)+{n}_{i}(0,1)\right)h\left({\widehat{\theta}}_{i}^{\left(1\right)}\right)+\left({n}_{i}(1,0)+{n}_{i}(0,0)\right)h\left({\widehat{\theta}}_{i}^{\left(0\right)}\right).$$

$$\begin{array}{cc}& =-{n}_{i}(1,1)log\left(\frac{{n}_{i}(1,1)}{{n}_{i}(1,1)+{n}_{i}(0,1)}\right)-{n}_{i}(0,1)log\left(\frac{{n}_{i}(0,1)}{{n}_{i}(1,1)+{n}_{i}(0,1)}\right)\hfill \\ & -{n}_{i}(1,0)log\left(\frac{{n}_{i}(1,0)}{{n}_{i}(1,0)+{n}_{i}(0,0)}\right)-{n}_{i}(0,0)log\left(\frac{{n}_{i}(0,0)}{{n}_{i}(1,0)+{n}_{i}(0,0)}\right).\hfill \end{array}$$

$${n}_{\Pi \left(i\right)}\left(1\right)={n}_{i}(1,1)+{n}_{i}(0,1),{n}_{\Pi \left(i\right)}\left(0\right)={n}_{i}(1,0)+{n}_{i}(0,0),$$

$${n}_{i}\left(1\right)={n}_{i}(1,1)+{n}_{i}(1,0),{n}_{i}\left(0\right)={n}_{i}(0,1)+{n}_{i}(0,0).$$

$$-{n}_{i}(1,1)log\left(\frac{{n}_{i}(1,1)}{{n}_{i}(1,1)+{n}_{i}(0,1)}\right)-{n}_{i}(0,1)log\left(\frac{{n}_{i}(0,1)}{{n}_{i}(1,1)+{n}_{i}(0,1)}\right)$$

$$-{n}_{i}(1,0)log\left(\frac{{n}_{i}(1,0)}{{n}_{i}(1,0)+{n}_{i}(0,0)}\right)-{n}_{i}(0,0)log\left(\frac{{n}_{i}(0,0)}{{n}_{i}(1,0)+{n}_{i}(0,0)}\right)=$$

$$=-{n}_{i}(1,1)log\left(\frac{{n}_{i}(1,1)/t\cdot {n}_{i}\left(1\right)/t}{{n}_{\Pi \left(i\right)}\left(1\right)/t\cdot {n}_{i}\left(1\right)/t}\right)-{n}_{i}(0,1)log\left(\frac{{n}_{i}(0,1)/t\cdot {n}_{i}\left(0\right)/t}{{n}_{\Pi \left(i\right)}\left(1\right)/t\cdot {n}_{i}\left(0\right)/t}\right)$$

$$-{n}_{i}(1,0)log\left(\frac{{n}_{i}(1,0)/t\cdot {n}_{i}\left(1\right)/t}{{n}_{\Pi \left(i\right)}\left(0\right)/t\cdot {n}_{i}\left(1\right)/t}\right)-{n}_{i}(0,0)log\left(\frac{{n}_{i}(0,0)/t\cdot {n}_{i}\left(0\right)/t}{{n}_{\Pi \left(i\right)}\left(0\right)/t\cdot {n}_{i}\left(0\right)/t}\right).$$

$$-{n}_{i}(1,1)log\left({n}_{i}\left(1\right)/t\right)-{n}_{i}(0,1)log\left({n}_{i}\left(0\right)/t\right)$$

$$-{n}_{i}(1,0)log\left({n}_{i}\left(1\right)/t\right)-{n}_{i}(0,0)log\left({n}_{i}\left(0\right)/t\right)$$

$$-{n}_{i}(1,1)log\left(\frac{{n}_{i}(1,1)/t}{{n}_{\Pi \left(i\right)}\left(1\right)/t\cdot {n}_{i}\left(1\right)/t}\right)-{n}_{i}(0,1)log\left(\frac{{n}_{i}(0,1)/t}{{n}_{\Pi \left(i\right)}\left(1\right)/t\cdot {n}_{i}\left(0\right)/t}\right)$$

$$-{n}_{i}(1,0)log\left(\frac{{n}_{i}(1,0)/t}{{n}_{\Pi \left(i\right)}\left(0\right)/t\cdot {n}_{i}\left(1\right)/t}\right)-{n}_{i}(0,0)log\left(\frac{{n}_{i}(0,0)/t}{{n}_{\Pi \left(i\right)}\left(0\right)/t\cdot {n}_{i}\left(0\right)/t}\right).$$

$$-{n}_{i}(1,0)log\left({n}_{i}\left(1\right)/t\right)-{n}_{i}(0,0)log\left({n}_{i}\left(0\right)/t\right)=$$

$$-\left({n}_{i}(1,1)+{n}_{i}(1,0)\right)log\left({n}_{i}\left(1\right)/t\right)-\left({n}_{i}(0,1)+{n}_{i}(0,0)\right)log\left({n}_{i}\left(0\right)/t\right)=$$

$$-{n}_{i}\left(1\right)log\left({n}_{i}\left(1\right)/t\right)-{n}_{i}\left(0\right)log\left({n}_{i}\left(0\right)/t\right)=t\cdot h\left({n}_{i}\left(1\right)/t\right).$$

$$-{n}_{i}(1,0)log\left(\frac{{n}_{i}(1,0)/t}{{n}_{\Pi \left(i\right)}\left(0\right)/t\cdot {n}_{i}\left(1\right)/t}\right)-{n}_{i}(0,0)log\left(\frac{{n}_{i}(0,0)/t}{{n}_{\Pi \left(i\right)}\left(0\right)/t\cdot {n}_{i}\left(0\right)/t}\right)=$$

$$-t\xb7{I}_{i,\Pi \left(i\right)},$$

$${I}_{i,\Pi \left(i\right)}=\sum _{u=0}^{1}\sum _{v=0}^{1}{\widehat{P}}_{i,\Pi \left(i\right)}\left(u,v\right)log\frac{{\widehat{P}}_{i,\Pi \left(i\right)}\left(u,v\right)}{{\widehat{P}}_{i}\left(u\right)\xb7{\widehat{P}}_{\Pi \left(i\right)}\left(v\right)}$$
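The decomposition above can be verified numerically. The following sketch (the helper `entropy_decomposition` is ours for illustration, not part of the paper) takes a 2×2 table of counts, where `n[u][v]` plays the role of $n_{i}(u,v)$ and `t` is the total sample size, and evaluates both sides of the identity $-\sum_{u,v} n_{i}(u,v)\log\widehat{P}(u\mid v) = t\cdot h\left(n_{i}(1)/t\right) - t\cdot I_{i,\Pi(i)}$; it assumes all four joint counts are strictly positive.

```python
import math

def entropy_decomposition(n):
    """Check the split of the conditional-entropy count sum into
    t*h(n_i(1)/t) - t*I for a 2x2 contingency table.
    n[u][v] = count of observations with X_i = u and X_{Pi(i)} = v;
    all counts are assumed strictly positive."""
    t = sum(n[u][v] for u in (0, 1) for v in (0, 1))
    n_i = [n[u][0] + n[u][1] for u in (0, 1)]    # marginal counts of X_i
    n_pi = [n[0][v] + n[1][v] for v in (0, 1)]   # marginal counts of the parent

    # Left-hand side: -sum_{u,v} n(u,v) log P_hat(u|v), P_hat(u|v) = n(u,v)/n_pi(v)
    lhs = -sum(n[u][v] * math.log(n[u][v] / n_pi[v])
               for u in (0, 1) for v in (0, 1))

    # t * h(n_i(1)/t): scaled binary entropy of the empirical marginal of X_i
    p = n_i[1] / t
    th = -t * (p * math.log(p) + (1 - p) * math.log(1 - p))

    # t * I: scaled empirical mutual information between X_i and its parent
    tI = t * sum((n[u][v] / t) *
                 math.log((n[u][v] / t) / ((n_i[u] / t) * (n_pi[v] / t)))
                 for u in (0, 1) for v in (0, 1))

    return lhs, th - tI
```

For any positive table the two returned values coincide; for an independent table (e.g. `[[5, 5], [5, 5]]`) the mutual information term vanishes and both sides reduce to the scaled marginal entropy.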

- Ewens, W.J. Mathematical Population Genetics, 2nd ed.; Springer-Verlag: New York, NY, USA, 2004.
- Nagylaki, T. Theoretical Population Genetics; Springer-Verlag: Berlin, Germany, 1992.
- Pritchard, J.K.; Stephens, M.; Donnelly, P. Inference of population structure using multilocus genotype data. Genetics **2000**, 155, 945–959.
- Dawson, K.J.; Belkhir, K. A Bayesian approach to the identification of panmictic populations and the assignment of individuals. Genet. Res. **2001**, 78, 59–77.
- Corander, J.; Waldmann, P.; Sillanpää, M.J. Bayesian analysis of genetic differentiation between populations. Genetics **2003**, 163, 367–374.
- Corander, J.; Marttinen, P. Bayesian identification of admixture events using multi-locus molecular markers. Mol. Ecol. **2006**, 15, 2833–2843.
- Corander, J.; Gyllenberg, M.; Koski, T. Random partition models and exchangeability for Bayesian identification of population structure. Bull. Math. Biol. **2007**, 69, 797–815.
- Falush, D.; Stephens, M.; Pritchard, J.K. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics **2003**, 164, 1567–1587.
- Guillot, G.; Estoup, A.; Mortier, F.; Cosson, J.F. A spatial statistical model for landscape genetics. Genetics **2005**, 170, 1261–1280.
- Guillot, G.; Leblois, R.; Coulon, A.; Frantz, A.C. Statistical methods in spatial genetics. Mol. Ecol. **2009**, 18, 4734–4756.
- Gyllenberg, M.; Carlsson, J.; Koski, T. Bayesian network classification of binarized DNA fingerprinting patterns. In Mathematical Modelling and Computing in Biology and Medicine; Capasso, V., Ed.; Progetto Leonardo: Bologna, Italy, 2003; pp. 60–66.
- Chow, C.K.; Liu, C.N. Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory **1968**, 14, 462–467.
- Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. **1997**, 29, 131–163.
- Corander, J.; Tang, J. Bayesian analysis of population structure based on linked molecular information. Math. Biosci. **2007**, 205, 19–31.
- Cowell, R.G.; Dawid, A.P.; Lauritzen, S.L.; Spiegelhalter, D.J. Probabilistic Networks and Expert Systems; Springer-Verlag: New York, NY, USA, 1999.
- Koski, T.; Noble, J.M. Bayesian Networks: An Introduction; Wiley: Chichester, UK, 2009.
- Meilă, M.; Jordan, M.I. Learning with mixtures of trees. J. Mach. Learn. Res. **2000**, 1, 1–48.
- Pearl, J. Probabilistic Reasoning in Intelligent Systems; Morgan Kaufmann: San Francisco, CA, USA, 1988.
- Becker, A.; Geiger, D.; Meek, C. Perfect tree-like Markovian distributions. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, 2000; pp. 19–23.
- Heckerman, D.; Geiger, D.; Chickering, D.M. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn. **1995**, 20, 197–243.
- Heckerman, D.; Geiger, D.; Chickering, D.M. Likelihoods and Parameter Priors for Bayesian Networks; Microsoft Research Technical Report MSR-TR-95-54.
- Duda, R.O.; Hart, P.E. Pattern Classification and Scene Analysis; Wiley: New York, NY, USA, 1973.
- Clarke, B.S.; Barron, A.R. Jeffreys' prior is asymptotically least favorable under entropy risk. J. Stat. Planning Inference **1994**, 41, 37–60.
- Rissanen, J. Fisher information and stochastic complexity. IEEE Trans. Inf. Theory **1996**, 42, 40–47.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: New York, NY, USA, 1991.
- Gyllenberg, M.; Koski, T. Bayesian predictiveness, exchangeability and sufficientness in bacterial taxonomy. Math. Biosci. **2002**, 177–178, 161–184.
- DeGroot, M.H. Optimal Statistical Decisions; McGraw-Hill: New York, NY, USA, 1970.
- Suzuki, J. Learning Bayesian belief networks based on the minimum description length principle: Basic properties. IEICE Trans. Fundamentals **1999**, 82, 2237–2245.
- Kučera, L. Combinatorial Algorithms; Adam Hilger: Bristol, UK, 1990.
- Schwarz, G. Estimating the dimension of a model. Ann. Statist. **1978**, 6, 461–464.
- Gyllenberg, M.; Koski, T.; Verlaan, M. Classification of binary vectors by stochastic complexity. J. Multiv. Analysis **1997**, 63, 47–72.
- Kass, R.E.; Wasserman, L. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. Amer. Stat. Assoc. **1995**, 90, 928–934.
- Drton, M.; Sturmfels, B.; Sullivant, S. Lectures on Algebraic Statistics; Birkhäuser: Basel, Switzerland, 2009.
- Rusakov, D.; Geiger, D. Asymptotic model selection for naive Bayesian networks. J. Mach. Learn. Res. **2005**, 6, 1–35.
- Biernacki, C.; Celeux, G.; Govaert, G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Patt. Anal. Mach. Intel. **2000**, 22, 719–725.
- Haughton, D.M.A. On the choice of the model to fit data from an exponential family. Ann. Statist. **1988**, 16, 342–355.
- Wong, S.K.M.; Poon, F.C.S. Comments on "Approximating discrete probability distributions with dependence trees". IEEE Trans. Patt. Anal. Mach. Intel. **1989**, 11, 333–335.
- Balagani, K.S.; Phoha, V.V. On the relationship between dependence tree classification error and Bayes error rate. IEEE Trans. Patt. Anal. Mach. Intel. **2007**, 29, 1866–1868.
- Rissanen, J. Stochastic complexity in learning. J. Comp. System Sci. **1997**, 55, 89–95.
- Vitányi, P.M.B.; Li, M. Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Theory **2000**, 46, 446–464.
- Yamanishi, K. A decision-theoretic extension of stochastic complexity and its applications to learning. IEEE Trans. Inf. Theory **1998**, 44, 1424–1439.
- Gyllenberg, M.; Koski, T.; Lund, T.; Gyllenberg, H.G. Bayesian predictive identification and cumulative classification of bacteria. Bull. Math. Biol. **1999**, 61, 85–111.
- Friedman, N.; Goldszmidt, M. Learning Bayesian networks with local structure. In Learning in Graphical Models; Jordan, M., Ed.; MIT Press: Cambridge, MA, USA, 1997; pp. 421–459.
- Lam, W.; Bacchus, F. Learning Bayesian belief networks: An approach based on the MDL principle. Comput. Intel. **1994**, 10, 269–293.
- Buntine, W. A guide to the literature on learning probabilistic networks from data. IEEE Trans. Knowl. Data Eng. **1996**, 8, 195–210.
- Sangüesa, R.; Cortés, U. Learning causal networks from data: A survey and a new algorithm for recovering possibilistic causal networks. AI Commun. **1997**, 10, 31–61.
- Chow, C.K.; Wagner, T.J. Consistency of an estimate of tree-dependent probability distributions. IEEE Trans. Inf. Theory **1973**, 19, 369–371.
- Cooper, G.F.; Herskovits, E. A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. **1992**, 9, 309–347.
- Chickering, D.M. Learning Bayesian networks is NP-complete. In Learning from Data: Artificial Intelligence and Statistics V; Fisher, D., Lenz, H.-J., Eds.; Springer-Verlag: New York, NY, USA, 1996; pp. 121–130.
- Corander, J.; Gyllenberg, M.; Koski, T. Bayesian model learning based on a parallel MCMC strategy. Stat. Comput. **2006**, 16, 355–362.
- Corander, J.; Ekdahl, M.; Koski, T. Parallel interacting MCMC for learning of topologies of graphical models. Data Mining Knowl. Discovery **2008**, 17, 431–456.
- Corander, J.; Gyllenberg, M.; Koski, T. Bayesian unsupervised classification framework based on stochastic partitions of data and a parallel search strategy. Adv. Data Anal. Classification **2009**, 3, 3–24.

© 2010 by the authors; licensee MDPI, Basel, Switzerland. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).