Probability Distribution on Full Rooted Trees

The recursive and hierarchical structure of full rooted trees is applicable to statistical models in various fields, such as data compression, image processing, and machine learning. In most of these cases, the full rooted tree is not a random variable; as such, model selection to avoid overfitting is problematic. One method to solve this problem is to assume a prior distribution on the full rooted trees. This enables optimal model selection based on Bayes decision theory. For example, by assigning a low prior probability to a complex model, the maximum a posteriori estimator prevents the selection of an overly complex model. Furthermore, we can average all the models weighted by their posteriors. In this paper, we propose a probability distribution on a set of full rooted trees. Its parametric representation is suitable for calculating the properties of our distribution, such as the mode, expectation, and posterior distribution, using recursive functions. Although such distributions have been proposed in previous studies, they are only applicable to specific applications. Therefore, we extract their mathematically essential components and derive new generalized methods to calculate the expectation, posterior distribution, etc.


Review of Literature and Motivation
Full rooted trees are utilized in various fields of study. For example, for text compression in information theory, a full rooted tree represents a set of contexts, which are strings of the most recent symbols at each time point, and it is known as a context tree [1]. In image processing, it represents a variable block-size segmentation, and it is known as quadtree block partitioning [2]. In machine learning, it represents a nonlinear function that comprises many conditional branches and is known as a decision tree [3]. In most of these studies, the rooted tree is not a random variable and serves as an index of a statistical model or function; i.e., one full rooted tree τ corresponds to one statistical model p(x; τ) or one function f τ (x).
Full rooted trees' recursive and hierarchical structures are suitable for representing complex statistical models or functional structures. For example, the expansion of the leaf nodes represents an increase in the contexts of a context tree [1], a division of a block on the image in quadtree partitioning [2], or the addition of a conditional branch in a decision tree [3]. The expressive ability and extensibility of full rooted trees render them widely applicable in various fields.
However, this hierarchical expressive capability causes a problem in tree selection, i.e., the selection of one statistical model or function. This is because the optimal tree under the criterion of the likelihood or squared loss for training data is inevitably the deepest one. This phenomenon is called overfitting in the field of machine learning. Therefore, most previous studies have applied a stopping rule for node expansion [2,3], introduced a normalization term into the objective function [4], or averaged the statistical models or the functions with some weights [1,4,5]. However, these algorithmic modifications are heuristic at times.
A theoretical way to solve this problem is to consider the full rooted tree as a random variable and assume a prior distribution on it. An appropriate prior distribution provides a unified method for selecting one full rooted tree or combining them based on Bayes decision theory (see, e.g., [6]). Although Bayes decision theory is typically applied to statistical models with unknown continuous parameters, it is also applicable to statistical models with unknown discrete random variables, such as full rooted trees (see, e.g., [7]). By assigning a high prior probability to a shallow tree and a low prior probability to a deep tree, we can avoid the complex statistical model corresponding to a deep tree.
As mentioned above, most previous studies regard a full rooted tree as a non-stochastic variable. However, a few studies adopted the above-mentioned approach. In terms of text compression, the complete Bayesian interpretation of the context tree weighting method was first investigated by the authors of [8]. Both the theory and the associated algorithms have been improved in the decade since they were first investigated (see, e.g., [9]). Moreover, similar results obtained from rich real data analysis have been reported recently [10,11] (note that the prior form reported in [10,11] is extremely restricted and cannot be updated as a posterior, in contrast to that reported in [8,9]). In image processing, the authors of [12] were the first to regard the quadtree as a stochastic model, and its optimal estimation was derived under the Bayesian criteria. In machine learning, the authors of [13] redefined the decision tree as a stochastic generative model and improved most tree weighting methods (e.g., [5]).
However, these studies depend on specific data or generative models. This might have been the reason that more than 25 years passed before the first study [8] pertaining to text compression was applied to image processing [12] and machine learning [13].

The Objective of This Study
Therefore, we separate the mathematically essential component of the discussion from the modifiable component based on specific data or the generative model. Mathematically, a tree is defined as a connected graph without cycles (see, e.g., [14]). A rooted tree is a tree that has one node known as a root node, and a full rooted tree is a rooted tree in which each inner node has the same number of child nodes. Subsequently, we can define a finite set of subtrees of a full rooted tree. This full rooted tree, which contains all the subtrees in the finite set, is denoted as a base tree herein.
A trivial method to define a probability distribution for this set is to assign occurrence probabilities to all subtrees and regard these values as parameters. In other words, we can define the categorical distribution for the finite set of subtrees of the base tree. However, this definition requires the same number of parameters as the subtrees, which increases in a doubly exponential order with the depth of the base tree.
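The doubly exponential growth mentioned above is easy to verify numerically. The number N(d) of full rooted subtrees of a perfect k-ary base tree of depth d satisfies N(0) = 1 and N(d) = 1 + N(d − 1)^k, since the root is either kept as a leaf or expanded into k independently chosen subtrees. A minimal sketch (the function name is ours, not the paper's):

```python
def num_subtrees(k: int, d: int) -> int:
    """Count the full rooted subtrees of a perfect k-ary base tree of depth d.

    The root is either kept as a leaf (1 case) or expanded, in which case
    each of its k children independently roots any subtree of depth d - 1,
    giving N(d) = 1 + N(d - 1) ** k.
    """
    if d == 0:
        return 1
    return 1 + num_subtrees(k, d - 1) ** k

# For k = 2, the counts for d = 0, 1, 2, 3, 4 are 1, 2, 5, 26, 677:
# the number of parameters of a naive categorical distribution grows
# doubly exponentially with the depth of the base tree.
for d in range(5):
    print(d, num_subtrees(2, d))
```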
Therefore, we propose an efficient parametric representation of the probability distribution of a set of subtrees. It is suitable for the recursive structure of full rooted trees and allows the number of parameters to be reduced. Moreover, it enables us to calculate its mode, expectation, posterior distribution, etc., using recursive functions. Therefore, it is efficient from a computational viewpoint. Furthermore, we expect these recursive functions to be effective as a subroutine of the variational Bayesian method and the Markov chain Monte Carlo method in hierarchical Bayesian modeling (see, e.g., [15]).
Strictly speaking, our distribution has already been proposed independently in source coding [8], image processing [12], and machine learning [13], as mentioned above. The substantial novelty of our study is the extraction of the essence from the previous discussions, which depend on the applicational objects, and its representation as a clear mathematical theory. This theoretically expands the potential application of probability distributions on full rooted trees. Subsequently, we derive new generalized methods to evaluate the characteristics of the probability distribution of full rooted trees, which could not be derived in previous studies pertaining to real-world applications. More precisely, only Theorems 1 and 3 and Corollary 2 have been used in previous studies. Meanwhile, the other methods expand the possibility of applying the probability distribution on full rooted trees.

Organization of This Paper
The remainder of this paper is organized as follows: In Section 2, we present the notation used herein. In Section 3, we define the prior for full rooted trees. In Section 4, we describe the algorithms for calculating the properties of the proposed distribution, e.g., the marginal distribution for each node and efficient calculations of the expectation, the mode, and the posterior distribution. In Section 5, we discuss the usefulness of our distribution in statistical decision theory and hierarchical Bayesian modeling. In Section 6, we propose some future work. In Section 7, we conclude the paper.

Notations Used for Full Rooted Trees
In this section, we define the notation for rooted trees, which is illustrated in Figure 1. Let k ∈ N denote the maximum number of child nodes and d_max ∈ N denote the maximum depth. Let τ_p = (V_p, E_p) denote the perfect k-ary rooted tree whose depth is d_max and whose root node is v_λ ("perfect" means that all inner nodes have exactly k children and all leaf nodes have the same depth). V_p and E_p denote the sets of nodes and edges of τ_p, respectively. Then, let I_p ⊂ V_p and L_p ⊂ V_p denote the set of inner nodes and the set of leaf nodes of τ_p, respectively. For each node v ∈ V_p, Ch_p(v) ⊂ V_p denotes the set of child nodes of v on τ_p. The notation for the relation between two nodes v, v′ ∈ V_p is as follows.
Let An(v) ⊂ V_p denote the set of ancestor nodes of v ∈ V_p. Subsequently, we consider rooted subtrees of τ_p whose root node is v_λ and in which all inner nodes have exactly k children. They are called full rooted subtrees, and τ_p is called a base tree. Let T denote the set of all full rooted subtrees of τ_p. Let V_τ and E_τ denote the set of nodes and the set of edges of τ ∈ T, respectively. Let I_τ ⊂ V_τ and L_τ ⊂ V_τ denote the set of inner nodes and the set of leaf nodes of τ ∈ T, respectively.

Definition of Probability Distribution on Full Rooted Subtrees
In this section, we define a probability distribution on the set T of full rooted subtrees. Let T denote the random variable on T and τ denote its realization.
Definition 1. Given parameters α := (α_v)_{v∈V_p} ∈ [0, 1]^{|V_p|}, where α_v = 0 for any v ∈ L_p, we define the probability distribution p(τ) on T as below:
p(τ) := ∏_{v∈I_τ} α_v ∏_{v′∈L_τ} (1 − α_{v′}). (1)
Intuitively, α_v represents the probability that v has child nodes under the condition that v is contained in the tree (this will be proved as a theoretical fact in Remark 2). Therefore, the occurrence probability of a full rooted subtree decays exponentially as its depth increases.
Example 1. An example of the probability distribution on full rooted subtrees for k = 2 and d_max = 2 is shown in Figure 2.
Theorem 1. The distribution (1) fulfills the condition of a probability distribution; that is, ∑_{τ∈T} p(τ) = 1.
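Theorem 1 can be checked numerically for a small base tree. The sketch below assumes the product form p(τ) = ∏_{v∈I_τ} α_v · ∏_{v∈L_τ} (1 − α_v) with α_v = 0 on the base tree's leaves; the path-tuple encoding of nodes is illustrative, not taken from the paper. For k = 2 and d_max = 2 it enumerates all 5 full rooted subtrees and confirms that their probabilities sum to one:

```python
import itertools
import math
import random

def subtree_probs(node, depth, dmax, k, alpha):
    """Yield p(tau) for every full subtree rooted at `node`.

    `alpha` maps a node (a path tuple) to alpha_v; nodes at depth dmax are
    leaves of the base tree, for which alpha_v = 0 by Definition 1.
    """
    a = 0.0 if depth == dmax else alpha[node]
    yield 1.0 - a                    # node kept as a leaf: factor (1 - alpha_v)
    if depth < dmax:                 # node expanded: alpha_v times children
        kids = [list(subtree_probs(node + (i,), depth + 1, dmax, k, alpha))
                for i in range(k)]
        for combo in itertools.product(*kids):
            yield a * math.prod(combo)

rng = random.Random(0)
k, dmax = 2, 2
alpha = {node: rng.random()          # random alpha_v for every inner node
         for node in [(), (0,), (1,)]}
probs = list(subtree_probs((), 0, dmax, k, alpha))
print(len(probs), round(sum(probs), 10))   # 5 subtrees, total probability 1.0
```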

Example 2.
Before proving Theorem 1 in the general case, we describe an example where d_max = 2 and k = 2 (see Figure 2). First, we factorize the sum with respect to the root node and the subtrees rooted at its child nodes; substituting α_v + (1 − α_v) = 1 repeatedly from the deepest nodes then yields ∑_{τ∈T} p(τ) = 1. The general proof of Theorem 1 follows the same two steps, namely, factorization and substitution. We first prove Lemma 1, the essential lemma, since it is used not only in the proof of Theorem 1 but also in the proofs of other theorems later.
Lemma 1. Let F : T → R be a real-valued function on the set T of the full rooted subtrees of the base tree τ_p. If F has the form
F(τ) = ∏_{v∈I_τ} G(v) ∏_{v′∈L_τ} H(v′),
where G : V_p → R and H : V_p → R are real-valued functions on V_p, then the summation ∑_{τ∈T} F(τ) can be recursively decomposed as
∑_{τ∈T} F(τ) = φ(v_λ),
where φ : V_p → R is defined as below:
φ(v) := H(v) for v ∈ L_p, and φ(v) := H(v) + G(v) ∏_{v′∈Ch_p(v)} φ(v′) for v ∈ I_p. (11)
Proof. Let [v_λ] denote the tree that consists of only the root node v_λ of the base tree τ_p. Then, the sum is divided into the following cases.
where (14) holds because [v_λ] has no inner node and its only leaf node is v_λ, and (15) holds because every tree in T \ {[v_λ]} contains v_λ and thus the corresponding factor G(v_λ).
We have already pointed out that each tree τ ∈ T \ {[v_λ]} contains v_λ as its inner node. The remaining structure of τ is determined by the shapes of the k subtrees whose root nodes are the child nodes of v_λ (see Figure 3). We index them in an appropriate order: let v_λi denote the i-th child node of v_λ for i ∈ {0, 1, . . . , k−1}; i.e., {v_λ0, . . . , v_λ(k−1)} = Ch_p(v_λ). Let T_{v_λi} denote the set of subtrees whose root node is v_λi. Then, there is a natural bijection between T \ {[v_λ]} and T_{v_λ0} × · · · × T_{v_λ(k−1)}; therefore, the summation in (15) is further factorized. Consequently, from (12) and (18), we have (19). The underbraced parts (a) and (b) have the same structure except for the depth of the root node of the subtree. Therefore, (b) can be decomposed in a manner similar to the steps from (12) to (18). We can continue this decomposition down to the leaf nodes. Then, let T_v denote the set of subtrees whose root node is v ∈ V_p in general; i.e., we define a notion similar to T_{v_λi} not only for v_λ0, v_λ1, . . . , v_λ(k−1) but also for any other node v ∈ V_p. Finally, we obtain an alternative definition of φ : V_p → R, which is equivalent to (11).
The equivalence is confirmed by substituting it into both sides of (19). Therefore, Lemma 1 is proved.
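The recursion of Lemma 1 is straightforward to implement. The sketch below (node names and the dictionary representation are illustrative) computes φ(v_λ) for arbitrary per-node values G and H, where φ(v) = H(v) on the base tree's leaves and φ(v) = H(v) + G(v) ∏_{v′∈Ch_p(v)} φ(v′) otherwise, and checks it against the brute-force sum ∑_τ ∏_{v∈I_τ} G(v) ∏_{v∈L_τ} H(v):

```python
import itertools
import math
import random

def phi(node, depth, dmax, k, G, H):
    """Recursive phi of Lemma 1: sums F(tau) over all subtrees below `node`."""
    if depth == dmax:                      # leaf of the base tree
        return H[node]
    children = math.prod(phi(node + (i,), depth + 1, dmax, k, G, H)
                         for i in range(k))
    return H[node] + G[node] * children

def brute_force(node, depth, dmax, k, G, H):
    """List F(tau) = prod of G over inner nodes times prod of H over leaves."""
    out = [H[node]]                        # node kept as a leaf
    if depth < dmax:
        kids = [brute_force(node + (i,), depth + 1, dmax, k, G, H)
                for i in range(k)]
        out += [G[node] * math.prod(c) for c in itertools.product(*kids)]
    return out

rng = random.Random(1)
k, dmax = 2, 3
nodes = [()]
for node in nodes:                         # breadth-first listing of all nodes
    if len(node) < dmax:
        nodes.extend(node + (i,) for i in range(k))
G = {v: rng.random() for v in nodes}
H = {v: rng.random() for v in nodes}
assert math.isclose(phi((), 0, dmax, k, G, H),
                    sum(brute_force((), 0, dmax, k, G, H)))
```

The recursion touches each node once, so the cost is linear in |V_p|, while the brute-force sum ranges over a doubly exponential number of subtrees.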
Then, the proof of Theorem 1 is as follows.
Proof of Theorem 1. Using Lemma 1 with G(v) = α_v and H(v) = 1 − α_v, we can divide the cases of the sum and factorize the common terms of ∑_{τ∈T} p(τ) in a recursive manner, obtaining ∑_{τ∈T} p(τ) = φ(v_λ). Then, we prove φ(v) = 1 for any node v ∈ V_p by structural induction. For any leaf node v ∈ L_p, α_v = 0 from Definition 1. Therefore, φ(v) = 1 − α_v = 1. For any inner node v ∈ I_p, assuming φ(v′) = 1 as the induction hypothesis for any v′ ∈ Ch_p(v), we have φ(v) = (1 − α_v) + α_v ∏_{v′∈Ch_p(v)} φ(v′) = (1 − α_v) + α_v = 1. Thus, ∑_{τ∈T} p(τ) = φ(v_λ) = 1.

Remark 1.
Although Theorem 1 is also proved in [12,13], we extract its essential part as Lemma 1. In [10,11], a restricted case of Theorem 1 is proved, in which α_v has a common value for all v ∈ I_p.

Properties of Probability Distribution on Full Rooted Subtrees
In this section, we describe properties of the probability distribution on full rooted subtrees and methods to calculate them. All the proofs are in Appendix A. Note that the motivation and usefulness of Conditions 1, 2, 3, and 4 in this section will be described in Section 5.

Probability of Events on Nodes
At the beginning, we explain why v ∈ V_T determines a probabilistic event. We consider that any v ∈ V_p is given as a non-stochastic constant and is fixed. After that, a full rooted subtree is randomly chosen according to the probability distribution proposed in Section 3. Then, V_T sometimes contains v and sometimes does not, depending on the realization τ of the random variable T. Therefore, v ∈ V_T determines a probabilistic event on p(τ). Although the probability of such an event is trivially represented as ∑_{τ∈T} I{v ∈ V_τ} p(τ), where I{·} denotes the indicator function, we derive computationally efficient forms without the summation over τ in the following.

Theorem 2.
For any v ∈ V_p, we have the following:
Pr{v ∈ V_T} = ∏_{v′∈An(v)} α_{v′}, (27)
Pr{v ∈ I_T} = α_v ∏_{v′∈An(v)} α_{v′}, (28)
Pr{v ∈ L_T} = (1 − α_v) ∏_{v′∈An(v)} α_{v′}. (29)
Example 3. Let us consider p(τ) shown in Figure 2. Trivially, Pr{v_01 ∈ V_T}, Pr{v_1 ∈ I_T}, and Pr{v_0 ∈ L_T} can be calculated by summing p(τ) over the corresponding subtrees. The same probabilities are also given by (27), (28), and (29), respectively.
Remark 2. Probabilities of many other events on nodes are derived from Theorem 2. For example, Pr{v ∈ I_T | v ∈ V_T} = Pr{v ∈ I_T} / Pr{v ∈ V_T} = α_v, which confirms the intuitive interpretation of α_v described in Section 3.
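The node marginals of Theorem 2 have a simple interpretation: a node belongs to the tree exactly when every strict ancestor is expanded, so Pr{v ∈ V_T} is the product of α over the ancestors of v, and multiplying by α_v or (1 − α_v) gives the inner-node and leaf-node probabilities. The sketch below (node encoding illustrative) verifies this against brute-force enumeration:

```python
import itertools
import math
import random

def enum_trees(node, depth, dmax, k, alpha):
    """Yield (inner_set, leaf_set, probability) for subtrees below `node`."""
    a = 0.0 if depth == dmax else alpha[node]
    yield (frozenset(), frozenset([node]), 1.0 - a)
    if depth < dmax:
        kids = [list(enum_trees(node + (i,), depth + 1, dmax, k, alpha))
                for i in range(k)]
        for combo in itertools.product(*kids):
            inner = frozenset([node]).union(*(c[0] for c in combo))
            leaf = frozenset().union(*(c[1] for c in combo))
            yield (inner, leaf, a * math.prod(c[2] for c in combo))

rng = random.Random(2)
k, dmax = 2, 3
alpha, stack = {}, [()]
while stack:                                   # random alpha_v per inner node
    u = stack.pop()
    if len(u) < dmax:
        alpha[u] = rng.random()
        stack.extend(u + (i,) for i in range(k))

trees = list(enum_trees((), 0, dmax, k, alpha))
v = (0, 1)                                     # an arbitrary test node
anc_prod = math.prod(alpha[v[:i]] for i in range(len(v)))   # ancestors (), (0,)
a_v = alpha.get(v, 0.0)
assert math.isclose(anc_prod, sum(p for i, l, p in trees if v in i | l))
assert math.isclose(a_v * anc_prod, sum(p for i, l, p in trees if v in i))
assert math.isclose((1 - a_v) * anc_prod, sum(p for i, l, p in trees if v in l))
```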

Mode
We describe an algorithm to find the mode of p(τ) at O(k^{d_max+1}) computational cost. (O(·) denotes the Big-O notation; i.e., f(n) = O(g(n)) means that ∃ c > 0, ∃ n_0 > 0, ∀ n > n_0, |f(n)| ≤ c · g(n).) Note that the size of the search space T is of the order of Ω(2^{k^{d_max−2}}) in general.
To find the mode recursively, we define a real-valued function ψ : V_p → R and a flag variable δ_v ∈ {0, 1} as follows.

Definition 2.
For any v ∈ V_p, we define
ψ(v) := 1 − α_v for v ∈ L_p,
ψ(v) := max{1 − α_v, α_v ∏_{v′∈Ch_p(v)} ψ(v′)} for v ∈ I_p,
and the flag variable δ_v := 1 if the maximum is attained by α_v ∏_{v′∈Ch_p(v)} ψ(v′), and δ_v := 0 otherwise. We can calculate ψ(v) and δ_v simultaneously. Then, the mode of p(τ) is given by the following proposition.

Proposition 2. arg max_{τ∈T} p(τ) is identified as the tree τ* in which each node v ∈ V_{τ*} is an inner node if and only if δ_v = 1.
Then, the following theorem holds.
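The mode search can be sketched as a max-product analogue of Lemma 1. We assume (this is our reading of Definition 2, not a verbatim transcription) that ψ(v) is the largest probability factor attainable by any subtree hanging at v, computed as ψ(v) = max{1 − α_v, α_v ∏_{v′∈Ch_p(v)} ψ(v′)} with ψ(v) = 1 − α_v on base-tree leaves, and that δ_v records whether expansion attains the maximum. The probability of the mode is then ψ(v_λ), which the sketch checks against brute force:

```python
import itertools
import math
import random

def psi(node, depth, dmax, k, alpha):
    """Return (psi(v), delta_v): the best achievable factor below `node`
    and whether it is attained by expanding the node (delta_v = 1)."""
    a = 0.0 if depth == dmax else alpha[node]
    keep = 1.0 - a                        # node kept as a leaf
    if depth == dmax:
        return keep, 0
    grow = a * math.prod(psi(node + (i,), depth + 1, dmax, k, alpha)[0]
                         for i in range(k))
    return (grow, 1) if grow > keep else (keep, 0)

def all_probs(node, depth, dmax, k, alpha):
    """Brute-force probabilities of every full subtree below `node`."""
    a = 0.0 if depth == dmax else alpha[node]
    out = [1.0 - a]
    if depth < dmax:
        kids = [all_probs(node + (i,), depth + 1, dmax, k, alpha)
                for i in range(k)]
        out += [a * math.prod(c) for c in itertools.product(*kids)]
    return out

rng = random.Random(3)
k, dmax = 2, 3
alpha, stack = {}, [()]
while stack:
    u = stack.pop()
    if len(u) < dmax:
        alpha[u] = rng.random()
        stack.extend(u + (i,) for i in range(k))

best, _ = psi((), 0, dmax, k, alpha)
assert math.isclose(best, max(all_probs((), 0, dmax, k, alpha)))
```

The recursion visits each node once, matching the O(k^{d_max+1}) cost stated above, while the brute-force search space is doubly exponential in d_max.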

Expectation
Let f : T → R denote a real-valued function on T. Here, we discuss sufficient conditions on f under which the expectation E[f(T)] := ∑_{τ∈T} p(τ) f(τ) can be calculated efficiently at O(k^{d_max+1}) cost.
Note that the size of T is of the order of Ω(2^{k^{d_max−2}}) in general.

Condition 1.
There exist g : V_p → R and h : V_p → R such that
f(τ) = ∏_{v∈I_τ} g(v) ∏_{v′∈L_τ} h(v′).
Theorem 4. Under Condition 1, we define a recursive function φ : V_p → R as in (49); then, E[f(T)] = φ(v_λ). This follows from Lemma 1, since p(τ) f(τ) has the product form with G(v) = α_v g(v) and H(v) = (1 − α_v) h(v).

Condition 2.
There exist g : V_p → R and h : V_p → R such that
f(τ) = ∑_{v∈I_τ} g(v) + ∑_{v′∈L_τ} h(v′).
Theorem 5. Under Condition 2, we define a recursive function ξ : V_p → R as in (51), with which E[f(T)] can be calculated at O(k^{d_max+1}) cost.

Shannon Entropy
The Shannon entropy −∑_{τ∈T} p(τ) log p(τ) can be calculated by applying Theorem 5 (Corollary 1), since log p(τ) = ∑_{v∈I_τ} log α_v + ∑_{v′∈L_τ} log(1 − α_{v′}) satisfies Condition 2.
Remark 5. The Kullback-Leibler divergence (see, e.g., [16]) between two tree distributions p(τ) and p′(τ) can be calculated in a manner similar to Corollary 1. This fact may be useful for variational Bayesian inference, in which the Kullback-Leibler divergence is minimized. This will be future work.
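Because log p(τ) decomposes into a sum over the nodes of τ, the entropy follows from the node marginals of Theorem 2 without enumerating T: H(T) = −∑_v [Pr{v ∈ I_T} log α_v + Pr{v ∈ L_T} log(1 − α_v)], with the convention 0 · log 0 = 0. A sketch (node encoding illustrative), verified against brute force:

```python
import itertools
import math
import random

def enum_probs(node, depth, dmax, k, alpha):
    """Brute-force probabilities of every full subtree below `node`."""
    a = 0.0 if depth == dmax else alpha[node]
    out = [1.0 - a]
    if depth < dmax:
        kids = [enum_probs(node + (i,), depth + 1, dmax, k, alpha)
                for i in range(k)]
        out += [a * math.prod(c) for c in itertools.product(*kids)]
    return out

def entropy_closed(dmax, k, alpha):
    """Entropy from the node marginals; 0 * log 0 terms are skipped."""
    total = 0.0
    stack = [((), 1.0)]                    # pairs (node, Pr{node in V_T})
    while stack:
        v, pr_v = stack.pop()
        a = alpha.get(v, 0.0)              # alpha_v = 0 on base-tree leaves
        if pr_v * a > 0.0:
            total -= pr_v * a * math.log(a)
        if pr_v * (1.0 - a) > 0.0:
            total -= pr_v * (1.0 - a) * math.log(1.0 - a)
        if len(v) < dmax:
            stack.extend((v + (i,), pr_v * a) for i in range(k))
    return total

rng = random.Random(4)
k, dmax = 2, 3
alpha, stack = {}, [()]
while stack:
    u = stack.pop()
    if len(u) < dmax:
        alpha[u] = rng.random()
        stack.extend(u + (i,) for i in range(k))

probs = enum_probs((), 0, dmax, k, alpha)
brute = -sum(p * math.log(p) for p in probs if p > 0.0)
assert math.isclose(entropy_closed(dmax, k, alpha), brute)
```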

Conjugate Prior for a Probability Distribution on Full Rooted Subtrees
Here, we consider that α_v ∈ [0, 1] is also a realization of a random variable. Let α denote {α_v}_{v∈V_p}, and we write p(τ) as p(τ|α) to emphasize the dependency on α in the following theorem, which provides an example of conjugate priors for p(τ|α).
Theorem 6. The following probability distribution is a conjugate prior for p(τ|α):
p(α) = ∏_{v∈I_p} Beta(α_v | β_v, γ_v),
where Beta(·|β_v, γ_v) denotes the probability density function of the beta distribution whose parameters are β_v and γ_v. More precisely, the posterior is given by
p(α|τ) = ∏_{v∈I_p} Beta(α_v | β_{v|τ}, γ_{v|τ}),
where β_{v|τ} := β_v + I{v ∈ I_τ}, γ_{v|τ} := γ_v + I{v ∈ L_τ}, and I{·} denotes the indicator function.
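Since each α_v enters the likelihood p(τ|α) as an independent Bernoulli-type factor (α_v if v ∈ I_τ, 1 − α_v if v ∈ L_τ), the beta hyperparameters are updated per node by simple counting, and nodes outside V_τ keep their prior parameters. A sketch under this assumption (the exact displayed form of Theorem 6 is abbreviated in the text; variable and function names here are ours):

```python
def update_beta_params(beta, gamma, inner_nodes, leaf_nodes):
    """Posterior update suggested by Theorem 6: observing a tree tau adds
    one 'expand' count for each inner node of tau and one 'stop' count for
    each leaf node of tau; all other nodes keep their prior parameters."""
    beta_post, gamma_post = dict(beta), dict(gamma)
    for v in inner_nodes:
        beta_post[v] = beta_post[v] + 1
    for v in leaf_nodes:
        if v in gamma_post:          # alpha_v is fixed to 0 on base-tree
            gamma_post[v] = gamma_post[v] + 1   # leaves, which have no prior
    return beta_post, gamma_post

# Observe the tree with inner node () and leaves (0,), (1,) in a
# k = 2, d_max = 2 base tree with a uniform Beta(1, 1) prior per node.
nodes = [(), (0,), (1,)]
beta = {v: 1.0 for v in nodes}
gamma = {v: 1.0 for v in nodes}
b, g = update_beta_params(beta, gamma, inner_nodes=[()], leaf_nodes=[(0,), (1,)])
print(b[()], g[(0,)])   # 2.0 2.0: posterior mean of alpha for the root is 2/3
```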

Probability Distribution on a Full Rooted Tree as a Conjugate Prior
We define another random variable X on a set X and assume X depends on T; i.e., it follows a distribution p(x|τ). Here, we discuss a sufficient condition of p(x|τ), under which p(τ) becomes a conjugate prior for it, and we can efficiently calculate the posterior p(τ|x).

Condition 3.
There exist two functions g : V_p × X → R and h : V_p × X → R such that p(x|τ) has the following form:
p(x|τ) = ∏_{v∈I_τ} g(x, v) ∏_{v′∈L_τ} h(x, v′).
Note that g and h are not necessarily probability density functions.

Example 7.
For given μ_1, μ_2 ∈ R and σ_1, σ_2 ∈ R_{>0}, let N(x|μ_1, σ_1^2) and N(x|μ_2, σ_2^2) denote the probability density functions of the normal distributions governed by them. Let x := (x_v)_{v∈V_p}. If we assume that x_v follows N(x_v|μ_1, σ_1^2) for v ∈ I_τ and N(x_v|μ_2, σ_2^2) for v ∈ L_τ, we can construct p(x|τ) that satisfies Condition 3. In other words, the elements of the |V_p|-dimensional vector x follow a mixture of two normal distributions, and which of the two is chosen is determined by τ.

Theorem 7.
Under Condition 3, we define q(x|v) and α_{v|x} as follows:
q(x|v) := h(x, v) for v ∈ L_p,
q(x|v) := (1 − α_v) h(x, v) + α_v g(x, v) ∏_{v′∈Ch_p(v)} q(x|v′) for v ∈ I_p, (61)
α_{v|x} := α_v g(x, v) ∏_{v′∈Ch_p(v)} q(x|v′) / q(x|v). (62)
Then, the posterior is given by
p(τ|x) = ∏_{v∈I_τ} α_{v|x} ∏_{v′∈L_τ} (1 − α_{v′|x}). (63)
It should be noted that the calculation of q(x|v) and α_{v|x} requires O(k^{d_max+1}) cost, whereas the direct summation over T requires Ω(2^{k^{d_max−2}}) cost in general.
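The recursive quantities of Theorem 7 can be sketched as follows, assuming (since the displayed equations are abbreviated in the text) the context-tree-weighting style recursion q(x|v) = h(x,v) for v ∈ L_p and q(x|v) = (1 − α_v) h(x,v) + α_v g(x,v) ∏ q(x|v′) otherwise, with α_{v|x} = α_v g(x,v) ∏ q(x|v′) / q(x|v); the posterior then has the same product form as p(τ) with α replaced by α_{·|x}. The sketch checks this against brute-force Bayes for one fixed observation x (function and variable names are ours):

```python
import itertools
import math
import random

def posterior_alpha(node, depth, dmax, k, alpha, g, h, alpha_post):
    """Return q(x|node) and fill alpha_post with the updated alpha_{v|x}."""
    if depth == dmax:
        return h[node]
    grow = alpha[node] * g[node] * math.prod(
        posterior_alpha(node + (i,), depth + 1, dmax, k, alpha, g, h, alpha_post)
        for i in range(k))
    q = (1.0 - alpha[node]) * h[node] + grow
    alpha_post[node] = grow / q
    return q

def enum_trees(node, depth, dmax, k, first, second):
    """Yield prod of `first` over inner nodes times prod of `second` over
    leaves, for every full subtree below `node`, in a fixed order."""
    yield second[node]
    if depth < dmax:
        kids = [list(enum_trees(node + (i,), depth + 1, dmax, k, first, second))
                for i in range(k)]
        for combo in itertools.product(*kids):
            yield first[node] * math.prod(combo)

rng = random.Random(5)
k, dmax = 2, 2
nodes = [(), (0,), (1,), (0, 0), (0, 1), (1, 0), (1, 1)]
alpha = {v: rng.random() for v in nodes if len(v) < dmax}
g = {v: rng.random() for v in nodes}       # g(x, v), h(x, v) for one fixed x
h = {v: rng.random() for v in nodes}

alpha_post = {}
q_root = posterior_alpha((), 0, dmax, k, alpha, g, h, alpha_post)

# Brute-force Bayes: p(tau|x) is proportional to p(x|tau) p(tau).
a_full = {v: alpha.get(v, 0.0) for v in nodes}
joint = list(enum_trees((), 0, dmax, k,
                        {v: a_full[v] * g[v] for v in nodes},
                        {v: (1 - a_full[v]) * h[v] for v in nodes}))
post = list(enum_trees((), 0, dmax, k,
                       {v: alpha_post.get(v, 0.0) for v in nodes},
                       {v: 1 - alpha_post.get(v, 0.0) for v in nodes}))
assert math.isclose(q_root, sum(joint))            # q(x|root) equals p(x)
for a, b in zip(post, joint):
    assert math.isclose(a, b / q_root)             # matching enumeration order
```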
Moreover, if we assume the following condition to be stronger than Condition 3, we can calculate the posterior p(τ|x) more efficiently with O(d max ) cost.

Condition 4.
In addition to Condition 3, we assume that there exists a path from v_λ to a leaf node v_end ∈ L_p and another function h′ : V_p × X → R that satisfy (64) and (65). Here, I{·} denotes the indicator function. In other words, only h(x, v) on the path from v_λ to v_end takes a value different from 1.
Corollary 2. Under Condition 4, q(x|v) and α_{v|x} can be updated as in (66) and (67), where v_ch is the child node of v on the path from v_λ to v_end. Note that we need not calculate q(x|v) for nodes v off this path to update the posterior, and thus the update costs only O(d_max).
Remark 6. Condition 4 is effective at representing the generation of sequential data x_1, x_2, . . . , x_N, in which there exists a path from the root node v_λ to a leaf node v^n_end ∈ L_p for each n ∈ {1, 2, . . . , N} (v^n_end and v^{n′}_end may differ from each other for n ≠ n′). Notable previous studies using Corollary 2 are [8][9][10][11][12][13] (in [10,11], only (66) is used, but (67) is not). In other words, they treat only the case under Condition 4. The other theorems in this paper have potential applications to broader fields of study.

Discussion
In this section, we describe the usefulness of our results in statistical decision theory (see, e.g., [6]) and hierarchical Bayesian modeling (see, e.g., [15]). First, our results are useful in model selection and model averaging under the Bayes criterion in statistical decision theory (see, e.g., [6]). The proposed probability distribution p(τ) is a conjugate prior for stochastic models p(x|τ) satisfying Condition 3, as shown in Theorem 7, and the MAP estimate arg max_τ p(τ|x) can be efficiently calculated by applying Theorem 3 to the posterior distribution p(τ|x) obtained by Theorem 7. This is the Bayes optimal model selection based on the posterior distribution. Furthermore, we can calculate ∑_τ p(x_new|τ) p(τ|x), i.e., the weighting of the stochastic models based on the posterior distribution, by using Theorems 4 and 7, since the stochastic models p(x|τ) satisfying Condition 3 also satisfy Condition 1. This is model averaging of all possible trees with Bayes optimal weights. This corresponds to methodologies that do not select a single tree but aggregate several trees, such as those in [4,5]. It should be noted that the occurrence probability of a deep tree decays exponentially in our proposed probability distribution. Therefore, we can avoid the deep tree, which often corresponds to a complex statistical model, as mentioned in Section 1.
Second, one example of the applications derived from our results is hyperparameter learning. As mentioned in Remark 6, Condition 4 has been applied to various stochastic models p(x|τ) in previous studies [8][9][10][11][12][13]. Conditions 1 and 3 are more generalized conditions than Condition 4, since the stochastic model p(x|τ) satisfying Condition 4 also satisfies Conditions 1 and 3. In addition, the logarithm of a function f (τ) satisfying Conditions 1 and 3 (and a stochastic model p(x|τ) satisfying Condition 4) satisfies Condition 2. Therefore, we can calculate ∑ τ∈T p(τ|x) log p(x|τ) by using Theorems 7 and 5. In particular, the fact that we can calculate the expectations E[p(x|T)] = ∑ τ∈T p(τ|x)p(x|τ) and E[log p(x|T)] = ∑ τ∈T p(τ|x) log p(x|τ) of the stochastic model p(x|τ) satisfying Condition 4 implies that we can learn hyperparameters of the stochastic models in [8][9][10][11][12][13] by hierarchical Bayesian modeling with variational Bayesian methods (see, e.g., [15]). To the best of our knowledge, there are no unified studies treating hyperparameter learning for these models.

Future Work
Since the present study is purely theoretical, applying the derived theorems is left for future studies. Theorems 1 and 3 and Corollary 2 have already been used in previous studies [8][9][10][11][12][13]; the other theorems now await such applications.
In this study, we did not use approximate algorithms such as variational Bayes or Markov chain Monte Carlo methods (see, e.g., [15]). Such algorithms are required for learning hierarchical models that contain the probability distribution on full rooted subtrees, and the methods proposed herein may serve as their subroutines. Extending our methods to such approximate algorithms is another direction for future work.
In this study, the class of trees is restricted to that of full trees, in which every inner node has the same number of child nodes. Hence, generalization of the class to that of arbitrary rooted trees can be considered in future studies.

Conclusions
In this paper, we discussed probability distributions on full rooted subtrees. Although such a distribution has been used in many fields of study, such as information theory [8][9][10][11], image processing [12], and machine learning [13], it depends significantly on the specific applications and data generative models. By contrast, we discussed it theoretically, collectively, and independently of a specific data generative model. Subsequently, we derived new generalized methods to evaluate the characteristics of the probability distribution on full rooted subtrees, which had not been derived in previous studies. The derived methods efficiently calculate the probabilities of events on the nodes, the mode, the expectation, the Shannon entropy, and the posterior distribution of a full rooted subtree. Therefore, this study expands the possibility of applying the probability distribution on full rooted subtrees.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A
Proof of Theorem 2. First, we prove (27). Let I{·} denote the indicator function. Then, Pr{v ∈ V_T} is expressed as ∑_{τ∈T} I{v ∈ V_τ} p(τ). Here, v ∈ V_τ is equivalent to the condition that no leaf node of τ is an ancestor node of v. Therefore, using Lemma 1, we obtain a recursive function φ_v. We further transform this function.
If v′ ∉ An(v), then I{v′ ∉ An(v)} = 1; consequently, φ_v(v′) has the same form as (22), and every child node v′′ ∈ Ch_p(v′) also satisfies v′′ ∉ An(v).
If v′ ∈ An(v), then v′ cannot be in L_p, and v′ has exactly one child node in An(v) ∪ {v}. Let v_ch denote it. Then, φ_v(v′′) = 1 for the other child nodes v′′ ∈ Ch_p(v′) \ {v_ch}. Therefore, (A5) is represented as follows.
Next, we prove (28). It is proved in a manner similar to the proof of (27), since I{v ∈ I_τ} also induces a product form. Lastly, we prove (29). We have Pr{v ∈ L_T} = Pr{v ∈ V_T} − Pr{v ∈ I_T}, because v ∈ L_τ is equivalent to v ∈ V_τ and v ∉ I_τ. Therefore, (29) follows from (27) and (28).
Proof of Theorem 4. The function p(τ) f(τ) satisfies the assumption of Lemma 1 with G(v) = α_v g(v) and H(v) = (1 − α_v) h(v). Then, using Lemma 1, Theorem 4 straightforwardly follows.
Proof of Theorem 5. First, we switch the order of the summation as follows.
where (A16) is because of Theorem 2. Next, we decompose the right-hand side of (A17) until it has the same form as (51).
Comparing (A17) and (A20), we obtain (A21). The underbraced parts (a) and (b) have the same structure; therefore, (b) can be decomposed in a manner similar to the steps from (A17) to (A20). We can continue this decomposition down to the leaf nodes.
Finally, we have an alternative definition of ξ : V_p → R, which is equivalent to (51).
The equivalence is confirmed by substituting it into both sides of (A21). Therefore, Theorem 5 is proved.
Proof of Theorem 6. By the Bayes theorem, we have the posterior in the product form above, where we used the conjugate property between the Bernoulli distribution and the beta distribution for each term.
Proof of Theorem 7. We prove (63) from the right-hand side to the left.
In the following, we transform each of the above products in order. First, the first product is transformed by substituting (62) as follows.
Next, the second product is transformed as follows.
where (A31) is because of (62) and (A32) is because q(x|v) = h(x, v) for v ∈ L p . Lastly, the third product is transformed as follows.
For a node v on the path from v_λ to v_end, we substitute (64) and (65) into (61). Since the resulting expression has the same form as (A6), q(x|v) = 1 for nodes v off the path is derived in a similar manner.