# Learning Genetic Population Structures Using Minimization of Stochastic Complexity

## Abstract

## 1. Introduction

## 2. Results and Discussion

#### 2.1. Tree-based factorization of the joint distribution of multilocus genotypes

#### 2.2. Chow expansion of the joint distribution of multilocus genotypes

#### Prior predictive data distributions under Chow expansion

#### 2.3. Stochastic complexity and learning of classifications and tree structures

#### Asymptotic expansion of the stochastic complexity for a Chow expansion

#### The stochastic complexity for an unsupervised classification under Chow expansion

**Figure 1.** Graphical representation of the dependence structure for an unsupervised classification model augmented by Chow-Liu trees. Here $d=5$ and $k=2$. The unbroken arrows correspond to dependence between the stochastic nodes, and the dashed arrows correspond to the dependence of the root nodes on the classification variable λ, which is connected to the trees by a random switch (represented by the curved arrow) according to the probabilities in λ.

#### 2.4. Algorithms for learning unsupervised classifications and Chow expansions

#### Deterministic algorithm for learning Chow expansions

**A1.**- Compute the numbers $$IP_{i,j}=\sum_{u=0}^{1}\sum_{v=0}^{1}\widehat{P}_{i,j}\left(u,v\right)\log\frac{\widehat{P}_{i,j}\left(u,v\right)}{\widehat{P}_{i}\left(u\right)\cdot\widehat{P}_{j}\left(v\right)}-\frac{1}{2}\cdot\frac{1}{t}\left[\log\left(n_{c}\left(1\right)\right)+\log\left(n_{c}\left(0\right)\right)\right]$$
**A2.**- Construct a complete undirected graph with the binary variables as nodes.
**A3.**- Construct a maximum weighted spanning tree with the extra condition that an edge is in the tree only if $I{P}_{i,j}>0$.
**A4.**- Make the maximum weighted spanning tree directed by choosing a root variable and setting the direction of all edges to be outward from the root.
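Step **A1** can be illustrated with a minimal Python sketch. It assumes that $t$ is the number of observations used for the empirical frequencies $\widehat{P}$ and that the counts $n_{c}(1)$ and $n_{c}(0)$ are supplied by the caller, since their definition is not reproduced in this section. The function names `empirical_mi` and `ip_weight` are hypothetical.

```python
import math
from collections import Counter


def empirical_mi(xi, xj):
    """Plug-in mutual information (natural log) of two binary sequences."""
    t = len(xi)
    joint = Counter(zip(xi, xj))
    marg_i = Counter(xi)
    marg_j = Counter(xj)
    mi = 0.0
    # Only observed (u, v) pairs contribute; 0 * log 0 is taken as 0.
    for (u, v), n_uv in joint.items():
        p_uv = n_uv / t
        mi += p_uv * math.log(p_uv / ((marg_i[u] / t) * (marg_j[v] / t)))
    return mi


def ip_weight(xi, xj, n1, n0):
    """Penalized edge weight IP_{i,j} of step A1.

    n1 and n0 stand for n_c(1) and n_c(0); both must be positive
    (assumption: the caller supplies them as defined in the paper).
    """
    t = len(xi)  # assumption: t is the number of observations
    penalty = 0.5 * (math.log(n1) + math.log(n0)) / t
    return empirical_mi(xi, xj) - penalty
```

With two identical columns the mutual information term dominates and the weight is positive, whereas for two empirically independent columns only the negative penalty remains, so the edge would be excluded in step **A3**.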

**A3**, when the condition for permitting disconnected graphs is not imposed. The most time-honoured algorithm for the task is the Borůvka-Choquet-Kruskal algorithm [29].
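As an illustration of step **A3**, the following is a minimal sketch of a Kruskal-style greedy procedure with a union-find structure, extended with the extra condition that only edges with positive weight may enter the tree. The result may therefore be a forest rather than a single spanning tree. The function name `max_spanning_forest` and the dictionary layout of the weights are assumptions made for this example.

```python
def max_spanning_forest(d, weights):
    """Greedy maximum weighted spanning forest over nodes 0..d-1.

    weights maps an undirected edge (i, j) to its weight IP_{i,j}.
    An edge is accepted only if its weight is positive and it does
    not close a cycle.
    """
    parent = list(range(d))  # union-find forest, one set per node

    def find(a):
        # Find the set representative, halving paths along the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = []
    # Visit edges from heaviest to lightest.
    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if w <= 0:
            break  # all remaining edges violate IP_{i,j} > 0
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
    return edges
```

Dropping the `w <= 0` check gives the unconstrained maximum weighted spanning tree for a connected graph.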

**A1**-**A4** we have a tree structure
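Step **A4** can be sketched as a breadth-first traversal that assigns each node its parent. This is a minimal sketch which assumes that, when the positivity condition leaves the graph disconnected, one root is chosen per connected component (here simply the node with the smallest index). The function name `orient_forest` is hypothetical.

```python
from collections import deque


def orient_forest(d, edges):
    """Direct the undirected edges outward from one root per component.

    Returns a dict mapping each node to its parent, with None for roots.
    """
    adj = {v: [] for v in range(d)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)

    parent = {}
    for root in range(d):
        if root in parent:
            continue  # already reached from an earlier root
        parent[root] = None
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u  # edge is directed u -> v
                    queue.append(v)
    return parent
```

The returned parent map plays the role of the structure $\Pi$: node $i$ has parent $\Pi(i)$, and a root has none.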

#### Deterministic algorithm for learning unsupervised classification augmented by Chow expansions

**B1.**- Fix k, set $w=0$ and store an arbitrary (random) ${U}_{\left(w\right)}$.
**B2.**- Find the structure $\widehat{\Pi}_{\left(w\right)}$ maximizing $$\sum_{c=1}^{k}\frac{t_{c}}{n}\sum_{i=2}^{d}I_{i,\Pi_{c}\left(i\right)}-\frac{1}{2}k\cdot\left(2d\right)\frac{\log n}{n}$$ (**A1-A4**).
**B3.**- For $U_{\left(w\right)}$ and $\widehat{\Pi}_{\left(w\right)}$ compute the maximum likelihood estimates $\widehat{\Theta}_{\left(w\right)}$ and $\widehat{\lambda}_{\left(w\right)}$.
**B4.**- Given $\widehat{\Theta}_{\left(w\right)}$, $\widehat{\lambda}_{\left(w\right)}$, and $\widehat{\Pi}_{\left(w\right)}$ determine $U_{(w+1)}=\left\{\left(u_{c}^{\left(l\right)}\right)_{(w+1)}\right\}_{c,l=1}^{n,k}$ using $$\left(u_{c}^{\left(l\right)}\right)_{(w+1)}=\begin{cases}1 & \text{if } c_{*}^{\left(l\right)}=c\\ 0 & \text{otherwise,}\end{cases}$$ $$c_{*}^{\left(l\right)}=\arg\max_{1\le c\le k}P_{\underline{\widehat{\theta}}_{c},\underline{\widehat{\varphi}}_{c}}\left(x^{\left(l\right)}\mid \Pi_{c}\right)\widehat{\lambda}_{c}.$$
**B5.**- If $U_{(w+1)}=U_{\left(w\right)}$, then stop, otherwise set $w=w+1$ and go to **B2**.
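The control flow of steps **B1**-**B5** can be sketched in Python as an alternation between model fitting and hard reassignment. This is only an illustration under explicit assumptions: the class-conditional model is pluggable, and the stand-in used here is a product of independent Bernoulli variables (a tree with no edges), not the Chow expansion, so the structure search of **B2** is trivial. All function names and the `max_iter` safety cap are hypothetical additions.

```python
import math


def safe_log(p):
    """log(p) with log(0) mapped to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def fit_independent(data, labels, k):
    """Stand-in for B2-B3: ML estimates of lambda_c and Bernoulli theta_c."""
    n, d = len(data), len(data[0])
    model = []
    for c in range(k):
        rows = [x for x, lab in zip(data, labels) if lab == c]
        t_c = len(rows)
        lam = t_c / n
        # An empty class gets lambda = 0; its theta values are placeholders.
        theta = [sum(r[i] for r in rows) / t_c if t_c else 0.5 for i in range(d)]
        model.append((lam, theta))
    return model


def log_score(model, c, x):
    """log of P(x | class c) * lambda_c under the stand-in model."""
    lam, theta = model[c]
    s = safe_log(lam)
    for xi, th in zip(x, theta):
        s += safe_log(th if xi == 1 else 1.0 - th)
    return s


def classify(data, labels, k, fit=fit_independent, score=log_score, max_iter=100):
    """Iterate B2-B5 starting from the initial labels chosen in B1."""
    model = fit(data, labels, k)
    for _ in range(max_iter):  # safety cap, not part of the algorithm
        # B4: assign each item to the class with the highest score.
        new_labels = [max(range(k), key=lambda c: score(model, c, x)) for x in data]
        if new_labels == labels:  # B5: classification unchanged, stop
            break
        labels = new_labels
        model = fit(data, labels, k)  # B2-B3 for the next round
    return labels, model
```

To use a tree-augmented model instead, `fit` would learn one directed tree per class and `score` would evaluate the corresponding tree-factorized likelihood. The loop itself stays the same.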

**B2** only a finite number of times and, after having stopped, will have found a local minimum of

**B4**.

#### 2.5. Discussion

## Acknowledgements

## Appendices

#### A.1. Prior predictive data distributions under Chow expansion

#### A.2. Asymptotic expansion of the stochastic complexity for a Chow expansion

## References

