Meta-Tree Random Forest: Probabilistic Data-Generative Model and Bayes Optimal Prediction

This paper deals with the problem of predicting a new targeting variable corresponding to a new explanatory variable, given a training dataset. To predict the targeting variable, we consider a model tree, which is used to represent a conditional probabilistic structure of a targeting variable given an explanatory variable, and discuss statistical optimality of prediction based on the Bayes decision theory. The optimal prediction based on the Bayes decision theory is given by weighting all the model trees in the model tree candidate set, i.e., a set of model trees in which the true model tree is assumed to be included. Because the number of model trees in the candidate set increases exponentially with the maximum depth of the trees, so does the computational complexity of weighting them. To solve this issue, we introduce the notion of a meta-tree and propose an algorithm called MTRF (Meta-Tree Random Forest), which uses multiple meta-trees. Theoretical and experimental analyses show the superiority of MTRF to previous decision tree-based algorithms.


Introduction
Various studies in pattern recognition deal with the problem of predicting a targeting variable y_{n+1} corresponding to an explanatory variable x_{n+1}, given pairs of explanatory and targeting variables {(x_i, y_i)}_{i=1}^n. In many of them, the targeting variable y_{n+1} is predicted with a tree T. One way to use a tree T is to represent a function y_{n+1} = f(x_{n+1}; T) and predict y_{n+1}. A tree T used to represent the function y_{n+1} = f(x_{n+1}; T) is called a decision tree in the literature. In this paper, however, this tree is called a function tree, because we distinguish a tree used to represent a function from a tree used to represent a data-generative model (the latter will be explained in the next paragraph). Among previous studies, algorithms that use a single function tree are discussed in, e.g., CART [1], and algorithms that use multiple function trees have also been proposed. For example, Random Forest [2] constructs multiple function trees from {(x_i, y_i)}_{i=1}^n and aggregates them to predict y_{n+1} from x_{n+1}. There are various extensions of Random Forest, e.g., Generalized Random Forest [3] generalizes the splitting rule of a function tree. Boosting constructs function trees sequentially from {(x_i, y_i)}_{i=1}^n and combines the constructed trees to predict y_{n+1} from x_{n+1}; gradient boosting methods include Gradient Boost [4] and XGBoost [5]. Further, previous studies, such as Alternating Decision Forest [6] and Boosted Random Forest [7], combine the ideas of Random Forest and Boosting. Moreover, combinations of function trees and neural networks have also been discussed. In Neural Decision Forest [8], inner nodes are replaced by randomized multi-layer perceptrons. In Deep Neural Decision Forest [9], they are replaced by deep convolutional neural networks (CNNs).
In Adaptive Neural Trees [10], not only their nodes are replaced by CNNs but also their edges are replaced by nonlinear functions. In Deep Forest [11], Random Forests are combined in a cascade structure like deep CNNs. The function tree is also used to understand the deep neural network as an interpretable input-output function in Reference [12]. In these algorithms, a function tree is used to represent the function from x n+1 to y n+1 . Although this function is trained for the data {(x i , y i )} n i=1 under a certain criterion, statistical optimality for prediction of y n+1 is not necessarily discussed theoretically because these algorithms usually do not assume a probabilistic data-generative structure of x and y. As we will describe later in detail, on the other hand, we assume a probabilistic data-generative structure of x and y, and consider a statistical optimal prediction of y n+1 . This is the crucial difference between our study and the related works.
To predict the targeting variable y_{n+1}, another way to use a tree T is to represent a data-generative model p(y|x, T), i.e., a conditional probabilistic structure of y given x. A tree T used to represent the data-generative model p(y|x, T) is called a model tree throughout this paper. Although not many studies have assumed the model tree, Reference [13] did. Because Reference [13] assumed that a targeting variable y_{n+1} is generated according to p(y|x_{n+1}, T), statistical optimality for the prediction of y_{n+1} can be discussed theoretically. Specifically, based on the Bayes decision theory [14], Reference [13] proposed the optimal prediction of y_{n+1}.
In order to explain the optimal prediction based on the Bayes decision theory in more detail, we introduce the terminology model tree candidate set. The model tree candidate set is a set of model trees in which the true model tree (When the model tree T is used to represent the true data-generative model p(y|x, T), we call it true model tree.) is assumed to be included. One example of a model tree candidate set is a set of model trees whose depth is up to d ∈ N. The optimal prediction based on the Bayes decision theory is given by weighting all the model trees in the model tree candidate set and the optimal weight of each model tree is given by the posterior probability p(T|{(x i , y i )} n i=1 ) of the model tree T. Because the number of all the model trees in the model tree candidate set increases exponentially according to the maximum depth of model trees, the computational complexity of weighting them increases exponentially according to the maximum depth of model trees. One way to reduce the computational complexity is to restrict the model tree candidate set.
To represent a restricted model tree candidate set, Reference [13] proposed the notion of a meta-tree. The concept of a meta-tree was originally used for data compression in information theory (see, e.g., Reference [15]); recently, it was also used for image compression [16]. As shown in Figure 1, a model tree candidate set is composed of the model trees represented by the meta-tree.
A meta-tree is also used for the prediction of y_{n+1} in the algorithm proposed by Reference [13]. In summary, a meta-tree has the following two roles: (i) it represents a model tree candidate set; and (ii) it is used for the prediction algorithm. The characteristics of a meta-tree are as follows:
• If the true model tree is in the model tree candidate set represented by a meta-tree, the statistically optimal prediction, i.e., the optimal prediction based on the Bayes decision theory, can be calculated.
• The number of model trees in the model tree candidate set represented by a meta-tree increases exponentially with the depth of the meta-tree.
• The computational cost of the learning process of a single meta-tree has the same order as that of a single function tree.
Under the assumption that the true model tree is in the restricted model tree candidate set represented by a meta-tree, Reference [13] proposed the optimal prediction based on the Bayes decision theory. As described above, if the true model tree is actually included in the model tree candidate set, the optimal prediction based on the Bayes decision theory can be calculated. Hence, it is desirable to construct a model tree candidate set that includes as many model trees as possible within the allowed computational cost. Motivated by this fact, this paper extends the model tree candidate set of Reference [13]. Instead of considering a single meta-tree as in Reference [13], we consider multiple meta-trees. By using the model tree candidate set represented by multiple meta-trees, we predict y_{n+1} based on the Bayes decision theory. We call this proposed algorithm MTRF (Meta-Tree Random Forest). The characteristics of MTRF are as follows:
• If the true model tree is in the model tree candidate set represented by any of the meta-trees of MTRF, the statistically optimal prediction, i.e., the optimal prediction based on the Bayes decision theory, can be calculated.
• The number of model trees in the model tree candidate set represented by multiple meta-trees is a constant factor larger than that contained in a single meta-tree.
• The computational cost of the learning process of the meta-trees of MTRF has the same order as that of the multiple function trees of Random Forest.
Regarding a meta-tree in Reference [13] or the multiple meta-trees of MTRF, how to construct the meta-tree(s) is a problem. One way to solve this issue is to use function tree-based algorithms; for example, a meta-tree in Reference [13] can be constructed by regarding a function tree in CART as a meta-tree, and the multiple meta-trees of MTRF can be constructed by regarding the function trees in Random Forest as meta-trees. In this way, by regarding a function tree as a meta-tree, our proposed method can be applied to any practical application in which a function tree-based algorithm is used, such as insurance claim tasks (e.g., Reference [5]), letter classification (e.g., Reference [6]), semantic segmentation (e.g., Reference [8]), face recognition (e.g., Reference [11]), and music classification (e.g., Reference [11]), and we can improve every previous function tree-based algorithm. This is because we can perform the prediction of y_{n+1} by using more model trees than the previous algorithms. A more detailed discussion will be given in Sections 3 and 4.
The rest of this paper is organized as follows. In Section 2, we state the preliminaries of our study. Section 3 explains our proposed method-MTRF. In Section 4, we revisit the previous studies, such as CART [1] and Random Forest [2], and compare them with MTRF. Section 5 shows the results of experiments. Section 6 concludes this paper.

Model Tree
Let K denote the dimension of explanatory variables. The space of feature values is denoted by X := {0, 1, . . . , |X| − 1}, and the explanatory variable is denoted by x := (x_1, . . . , x_K) ∈ X^K. In addition, the space of label values is denoted by Y := {0, 1, . . . , |Y| − 1}, and the targeting variable is denoted by y ∈ Y.
In the field of pattern recognition, the probabilistic structure of x and y is not necessarily assumed. In particular, it is not assumed in most function tree-based algorithms. In contrast, we consider a triplet (T, k, θ) defined in Definition 1 below and assume the probabilistic structure p(y|x, θ, T, k) as in Definition 2. We call (T, k) a model tree (in Section 1, we called T a model tree; more precisely, however, a model tree is (T, k)) and θ a tree parameter.

Definition 1. The notation T denotes an |X|-ary regular (here, "regular" means that all inner nodes have |X| child nodes) tree whose depth is up to D_max, and T denotes the set of such T. Let s be a node of a tree and S be the set of nodes. The set of the inner nodes of T is denoted by I(T), and the set of the leaf nodes of T is denoted by L(T). Let k_s ∈ {1, 2, . . . , K} be the component of the feature assign vector at node s; the explanatory variable x_{k_s} ∈ X is assigned to the inner node s ∈ I(T). Let k := (k_s)_{s∈S} denote a feature assign vector, and K denote the set of k. Let θ_s := (θ_{0|s}, θ_{1|s}, . . . , θ_{|Y|−1|s}) ∈ (0, 1)^{|Y|}, where ∑_{y=0}^{|Y|−1} θ_{y|s} = 1, be a parameter assigned to s ∈ S. We define θ := (θ_s)_{s∈S}, and Θ as the set of θ.
Definition 2. Because the value of explanatory variable x corresponds to a path from the root node to the leaf node of T whose feature assign vector is k, let s(x) ∈ L(T) denote the corresponding leaf node. Then, for given x, θ, T, and k, the label variable y is generated according to p(y|x, θ, T, k) := θ y|s(x) .
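To make Definitions 1 and 2 concrete, here is a small sketch in Python (our own illustration, not code from the paper; the dictionary layout and the function names are hypothetical). It routes an explanatory variable x to its leaf s(x) and samples y from the categorical parameter θ_{s(x)}:

```python
import random

# A hypothetical depth-1 binary model tree (T, k) over X = {0, 1}, Y = {0, 1}:
# the root tests feature index 2 (0-indexed here for convenience); each leaf s
# holds a categorical parameter theta_s over Y.
MODEL_TREE = {
    "feature": 2,                      # k_s: index of the feature tested at the root
    "children": {
        0: {"theta": [0.9, 0.1]},      # leaf reached when x[2] == 0
        1: {"theta": [0.2, 0.8]},      # leaf reached when x[2] == 1
    },
}

def leaf_of(x, node):
    """Follow the path s(x) from the root to the leaf selected by x."""
    while "feature" in node:
        node = node["children"][x[node["feature"]]]
    return node

def sample_y(x, tree, rng):
    """Draw y ~ p(y | x, theta, T, k) = theta_{y | s(x)}."""
    theta = leaf_of(x, tree)["theta"]
    return rng.choices(range(len(theta)), weights=theta)[0]

rng = random.Random(0)
x = (1, 0, 1)                          # K = 3 binary features
print(leaf_of(x, MODEL_TREE)["theta"]) # x[2] == 1, so the right leaf: [0.2, 0.8]
```

Here x[2] selects the right branch, so y is drawn from θ = (0.2, 0.8).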

Problem Setup
In this subsection, we introduce our problem setup. Let x_i := (x_{i1}, . . . , x_{iK}) ∈ X^K (i = 1, . . . , n) denote the value of the explanatory variable of the i-th data point and x^n := (x_1, . . . , x_n). We assume that x_1, . . . , x_n are i.i.d. random variables drawn from a probability distribution. Let y_i ∈ Y (i = 1, . . . , n) denote the targeting variable of the i-th data point and y^n := (y_1, . . . , y_n). We assume that y_1, . . . , y_n are generated according to p(y|x, θ, T, k) defined in Section 2.1. In this regard, we assume that the true model tree (T*, k*) and the true tree parameter θ* are unknown, but a model tree candidate set M ⊂ T × K is known. Note that, as explained in Section 1, the model tree candidate set M is a set of model trees (T, k) in which the true model tree is assumed to be included. Further, we assume that a class of parametrized distributions {p(y|x, θ, T, k) : θ ∈ Θ, (T, k) ∈ M} is known.
The purpose of the prediction problem is to predict unknown targeting variable y n+1 corresponding to an explanatory variable x n+1 by using the given data x n , y n , x n+1 .

Optimal Prediction Based on the Bayes Decision Theory
We introduce a statistically optimal prediction under the problem setup in Section 2.2. From the viewpoint of the Bayes decision theory [14], we shall derive the optimal prediction based on the Bayes decision theory. To this end, we assume priors p(θ), p(T), and p(k).
First, we define a decision function δ : X^{Kn} × Y^n × X^K → Y, which maps the training data x^n, y^n and a new explanatory variable x_{n+1} to a predicted label. Second, we define a 0-1 loss as

ℓ(y_{n+1}, δ(x^n, y^n, x_{n+1})) := 0 if δ(x^n, y^n, x_{n+1}) = y_{n+1}, and 1 otherwise.

Third, using the 0-1 loss, we define the loss function as the expectation of the 0-1 loss with respect to the new targeting variable y_{n+1}. Next, given the loss function, we define the risk function as the expectation of the loss function with respect to the training data x^n, y^n, and x_{n+1}. Finally, the Bayes risk function is defined as the expectation of the risk function with respect to θ, T, and k. Then, we have the following theorem.

Theorem 1. Under the setup in Section 2.2, the decision function δ*(x^n, y^n, x_{n+1}) that minimizes the Bayes risk function is given as

δ*(x^n, y^n, x_{n+1}) = arg max_{y_{n+1}} p(y_{n+1}|x^n, y^n, x_{n+1}). (7)

The proof of Theorem 1 is in Appendix A. We call δ*(x^n, y^n, x_{n+1}) the Bayes decision.
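As a toy numerical illustration of the Bayes decision in Theorem 1 (our own sketch; the two candidate models, their posterior weights, and their predictives are invented numbers), the decision maximizes the posterior-weighted mixture of the candidates' predictive distributions:

```python
# Minimal sketch of the Bayes decision delta*(x^n, y^n, x_{n+1}): it maximizes
# p(y_{n+1} | x^n, y^n, x_{n+1}), which expands to a mixture of the candidate
# models' predictives weighted by their posterior probabilities.
posterior = {"model_A": 0.7, "model_B": 0.3}        # p(T, k | x^n, y^n), invented
predictive = {                                      # p(y_{n+1} | x_{n+1}, ..., T, k)
    "model_A": [0.6, 0.4],
    "model_B": [0.1, 0.9],
}

def bayes_decision(posterior, predictive):
    n_labels = len(next(iter(predictive.values())))
    mixture = [
        sum(posterior[m] * predictive[m][y] for m in posterior)
        for y in range(n_labels)
    ]
    return max(range(n_labels), key=mixture.__getitem__), mixture

y_hat, mixture = bayes_decision(posterior, predictive)
print(y_hat)   # mixture = [0.7*0.6 + 0.3*0.1, 0.7*0.4 + 0.3*0.9], so y_hat = 1
```

Note that the label with the largest single-model probability (model_B's 0.9) does not automatically win; the posterior weights decide.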

Previous Study
In this subsection, we introduce how to calculate the Bayes decision in Reference [13]. Because the proof in Reference [13] lacks some explanation, we give a more rigorous discussion in comparison. This is one of the contributions of this paper.
As we have explained in Section 1, Reference [13] introduced the concept of a "meta-tree" to represent a restricted model tree candidate set. For T and k, a meta-tree, denoted by M_{T,k}, represents a model tree candidate set that is composed of the model trees (T′, k′) where T′ is a sub-tree of T and k′ = k. An example of M_{T,k} is shown in Figure 3, where T is a complete tree with D_max = 2 and k = (3, 4, 2). In this example, the meta-tree M_{T,k} represents a model tree candidate set composed of four model trees. A meta-tree is also used for the prediction of y_{n+1} in the algorithm proposed by Reference [13]. In short, a meta-tree has the following two roles: first, it represents a model tree candidate set, and, second, it is used for the prediction algorithm.

Expectation over the Parameters
In this subsection, we explain how to calculate (9). To calculate (9) analytically, a conjugate prior of θ_s given T and k is assumed.
Assumption 1. We assume p(θ s |T, k) = Dirichlet(θ s |α s ) for s ∈ L(T), where α s denotes the hyper-parameter of the Dirichlet distribution.
Under this assumption, q(y_{n+1}|x_{n+1}, x^n, y^n, T, k) has a closed-form expression, and (9) can be calculated exactly. In addition, q(y_{n+1}|x_{n+1}, x^n, y^n, T, k) has an important property. We describe this in the next lemma as preparation for Section 2.4.2.
Lemma 1. For any T, T′ ∈ T, x^n ∈ X^{Kn}, y^n ∈ Y^n, and x_{n+1} ∈ X^K, if s(x_{n+1}) ∈ L(T) ∩ L(T′), then q(y_{n+1}|x_{n+1}, x^n, y^n, T, k) = q(y_{n+1}|x_{n+1}, x^n, y^n, T′, k). (11) The proof of Lemma 1 is in Appendix B. From this lemma, the right- and left-hand sides of (11) are denoted by q_{s(x_{n+1})}(y_{n+1}|x_{n+1}, x^n, y^n, k) because they depend not on T and T′ but only on s(x_{n+1}).
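Under Assumption 1, the closed form of q(y_{n+1}|x_{n+1}, x^n, y^n, T, k) at the leaf s(x_{n+1}) is the standard Dirichlet-categorical posterior predictive, i.e., a smoothed ratio of the label counts at that leaf; this also makes Lemma 1 intuitive, since the value depends only on the leaf and not on the rest of T. A minimal sketch (our own code; the helper name and the prior values are illustrative):

```python
from collections import Counter

def leaf_predictive(labels_at_leaf, alpha, n_labels):
    """Posterior predictive q(y | ...) at a leaf under a Dirichlet(alpha) prior:
    (alpha[y] + count of y at the leaf) / (sum(alpha) + total count at the leaf).
    """
    counts = Counter(labels_at_leaf)          # missing labels count as zero
    total = sum(alpha) + len(labels_at_leaf)
    return [(alpha[y] + counts[y]) / total for y in range(n_labels)]

# Labels of the training points whose x is routed to the leaf s(x_{n+1});
# the symmetric prior alpha = (1/2, 1/2) is an illustrative choice.
q = leaf_predictive([0, 0, 1, 0], alpha=[0.5, 0.5], n_labels=2)
print(q)   # [(0.5 + 3) / 5, (0.5 + 1) / 5] = [0.7, 0.3]
```

Because only the counts at s(x_{n+1}) enter the formula, any two trees sharing that leaf give the same value, as Lemma 1 states.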

Summation over All Model Trees Represented by a Meta-Tree
In this subsection, we explain how to calculate (10). The exact calculation of (10) requires a summation over all model trees represented by a single meta-tree. However, the computational complexity of this calculation is huge because the number of model trees increases exponentially with the depth of the meta-tree. Reference [13] solved this problem. The advantages of the method proposed in Reference [13] are as follows:
• This method calculates (10) exactly.
• The computational complexity of calculating (10) in learning parameters is O(nD_max), which is the same as that of building a single function tree.
To execute this method, an assumption about the prior probability distribution of T (Assumption 2) is required. It should be noted that p(T) is a prior that does not depend on x^n and y^n, unlike the construction methods of function trees.
Assumption 2. Let g_s ∈ [0, 1] denote a hyper-parameter assigned to each node s ∈ S. Then, we assume the prior of T′ ∈ T_{M_{T,k}} as

p(T′) = ∏_{s ∈ I(T′)} g_s ∏_{s′ ∈ L(T′)} (1 − g_{s′}), (12)

where g_s = 0 for a leaf node s of the meta-tree M_{T,k}.
Remark 1. It should be noted that (12) is a probability distribution, i.e., ∑_{T′∈T_{M_{T,k}}} p(T′) = 1. See Lemma A1 in Appendix C.
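Remark 1 can be checked numerically. The sketch below (our own code; the naming of nodes by root paths over {0, 1} is ours) enumerates every sub-tree of a complete binary meta-tree of depth 2, assuming the prior (12) is the product of g_s over inner nodes and (1 − g_s) over leaf nodes with g_s = 0 at the meta-tree's leaves, and verifies that the prior sums to one:

```python
import itertools
import random

# Numerical check of Remark 1 on a complete binary meta-tree of depth 2.
DEPTH = 2
rng = random.Random(1)
nodes = [""] + ["".join(p) for d in range(1, DEPTH + 1)
                for p in itertools.product("01", repeat=d)]
# Assumption 2: g_s = 0 at the meta-tree's leaves, arbitrary in (0, 1) elsewhere.
g = {s: 0.0 if len(s) == DEPTH else rng.uniform(0.1, 0.9) for s in nodes}

def subtree_probs(s=""):
    """Yield p(T') for every sub-tree T' of the meta-tree rooted at s."""
    yield 1.0 - g[s]                       # T' makes s a leaf: factor (1 - g_s)
    if len(s) == DEPTH:                    # meta-tree leaf: cannot split further
        return
    for p0 in subtree_probs(s + "0"):      # s is inner: factor g_s, and each
        for p1 in subtree_probs(s + "1"):  # child sub-tree contributes independently
            yield g[s] * p0 * p1

probs = list(subtree_probs())
print(len(probs), round(sum(probs), 10))   # 5 sub-trees, total prior mass 1.0
```

The count of five sub-trees also matches the exponential growth mentioned above: a depth-d binary meta-tree already represents 1 + c² sub-trees, where c is the count for depth d − 1.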
As we have described in Section 1, a meta-tree is used for the prediction of y_{n+1} in the algorithm proposed by Reference [13]. By using a meta-tree M_{T,k}, the recursive function to calculate (10) is defined as follows.

Definition 3. We define the following recursive function q̃_s(y_{n+1}|x_{n+1}, x^n, y^n, M_{T,k}) for any node s on the path in M_{T,k} corresponding to x_{n+1}:

q̃_s(y_{n+1}|x_{n+1}, x^n, y^n, M_{T,k}) :=
  q_s(y_{n+1}|x_{n+1}, x^n, y^n, k), (if s is a leaf node of M_{T,k}),
  (1 − g_{s|x^n,y^n}) q_s(y_{n+1}|x_{n+1}, x^n, y^n, k) + g_{s|x^n,y^n} q̃_{s_child}(y_{n+1}|x_{n+1}, x^n, y^n, M_{T,k}), (otherwise), (13)

where s_child is the child node of s on the path corresponding to x_{n+1} in M_{T,k}, and g_{s|x^n,y^n} is also recursively updated as follows:

g_{s|x^n,y^n} :=
  g_s, (if n = 0),
  g_{s|x^{n−1},y^{n−1}} q̃_{s_child}(y_n|x_n, x^{n−1}, y^{n−1}, M_{T,k}) / q̃_s(y_n|x_n, x^{n−1}, y^{n−1}, M_{T,k}), (otherwise). (14)
Now, (10) can be calculated as shown in the following theorem.
Theorem 2. q̃(y_{n+1}|x_{n+1}, x^n, y^n, k) can be calculated by

q̃(y_{n+1}|x_{n+1}, x^n, y^n, k) = q̃_{s_λ}(y_{n+1}|x_{n+1}, x^n, y^n, M_{T,k}), (15)

where s_λ is the root node of M_{T,k}. In addition, p(T′|x^{n+1}, y^{n+1}, k) can be calculated as

p(T′|x^{n+1}, y^{n+1}, k) = ∏_{s ∈ I(T′)} g_{s|x^{n+1},y^{n+1}} ∏_{s′ ∈ L(T′)} (1 − g_{s′|x^{n+1},y^{n+1}}). (16)

The proof of Theorem 2 is in Appendix C. Surprisingly, by using this theorem, the computational complexity of calculating (10) is O(nD_max).
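The recursion in Definition 3 runs along the single path of x_{n+1}, which is what yields the O(nD_max) complexity in Theorem 2. Below is our own sketch of that computation (the three-node path, its g_s values, and the leaf predictives q_s are invented; the g update implements our reading of the recursive update of g_{s|x^n,y^n}, i.e., g_s ← g_s · q̃_child(y)/q̃_s(y)):

```python
# q~_s is computed leaf-to-root along the path of x_{n+1} in O(depth).
def q_tilde_along_path(path):
    """path = [(g_s, q_s)] from root to leaf; q_s is a vector over Y.
    Returns the per-node mixtures q~_s, root first."""
    g_leaf, q = path[-1]
    mixtures = [list(q)]                       # at the leaf, q~_s = q_s
    for g_s, q_s in reversed(path[:-1]):       # inner nodes, bottom-up
        child = mixtures[0]
        mixtures.insert(0, [(1 - g_s) * qs + g_s * qc
                            for qs, qc in zip(q_s, child)])
    return mixtures

def update_g(path, mixtures, y_obs):
    """Posterior update of g_s after observing y_obs:
    g_s <- g_s * q~_child(y_obs) / q~_s(y_obs) at each inner node on the path."""
    return [(g_s * mixtures[i + 1][y_obs] / mixtures[i][y_obs], q_s)
            for i, (g_s, q_s) in enumerate(path[:-1])] + [path[-1]]

path = [(0.5, [0.5, 0.5]),      # root s_lambda
        (0.5, [0.4, 0.6]),      # inner node on the path of x_{n+1}
        (0.0, [0.1, 0.9])]      # leaf of the meta-tree (g_s = 0)
mixtures = q_tilde_along_path(path)
print(mixtures[0])              # q~_{s_lambda}(y | ...), the mixture in (15)
```

With g_s = 1/2 everywhere, the root mixture combines the predictives q_s of all truncation depths on the path; observing y = 1 then shifts the weights g_s toward the deeper, better-fitting nodes.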

Proposed Method
In this section, we introduce our proposed method. Here, we reconsider the general setup where M = T × K. Recall that the Bayes decision (7) is

δ*(x^n, y^n, x_{n+1}) = arg max_{y_{n+1}} ∑_{k∈K} p(k|x^n, y^n) q̃(y_{n+1}|x_{n+1}, x^n, y^n, k), (17)

where q̃(y_{n+1}|x_{n+1}, x^n, y^n, k) is defined as in (10). In (17), q̃(y_{n+1}|x_{n+1}, x^n, y^n, k) can be calculated in the same way as in Section 2.4. However, regarding the calculation of (17), we need to calculate q̃(y_{n+1}|x_{n+1}, x^n, y^n, k) for every k that satisfies p(k) > 0. Because this computational cost increases exponentially with D_max, it is hard to calculate all of them in general. To reduce the computational complexity of (17), we restrict the model tree candidate set to the set represented by multiple meta-trees. The advantages of building multiple meta-trees are as follows:
• As we have explained in Section 2.4.2, the number of model trees represented by each of the meta-trees increases exponentially with the depth of the meta-trees. In addition, the number of model trees in multiple meta-trees is a constant factor larger than that contained in a single meta-tree.
• If the true model tree is a sub-tree of any of the meta-trees, the statistically optimal prediction, i.e., the optimal prediction based on the Bayes decision theory, can be calculated.
• If we build B meta-trees, the computational cost of learning their parameters is O(nBD_max), which is the same as that of building B function trees.
Now, we introduce the next assumption.

Assumption 3. For B meta-trees M_{T_1,k_1}, . . . , M_{T_B,k_B}, we define K := {k_1, . . . , k_B}. Then, we assume a uniform distribution on K as the prior probability distribution of k ∈ K.
An example of K is shown in Figure 4. Next, we prove a lemma about computing the posterior probability distribution of k. In fact, the recursive function defined in Definition 3 is also useful for calculating the posterior probability distribution on K sequentially. The proof of Lemma 2 is in Appendix D.
Lemma 2. By decomposing x^n and y^n into the individual data (x_i, y_i) (i = 1, . . . , n) and performing the recursive calculation of q̃_s(y_i|x_i, x^{i−1}, y^{i−1}, M_{T,k}) as in (13), the posterior probability distribution of k ∈ K is expressed as follows:

p(k|x^n, y^n) = p(k) ∏_{i=1}^{n} q̃_{s_λ}(y_i|x_i, x^{i−1}, y^{i−1}, M_{T,k}) / ∑_{k′∈K} p(k′) ∏_{i=1}^{n} q̃_{s_λ}(y_i|x_i, x^{i−1}, y^{i−1}, M_{T,k′}). (18)

From Theorem 2 and Lemma 2, we immediately obtain the next theorem.
Theorem 3. Let the model tree candidate set represented by the B meta-trees be

M := {(T′, k_b) : T′ ∈ T_{M_{T_b,k_b}}, b = 1, . . . , B}. (19)

Then, if the true model tree is included in (19), the Bayes decision δ*(x^n, y^n, x_{n+1}) can be calculated as

δ*(x^n, y^n, x_{n+1}) = arg max_{y_{n+1}} ∑_{k∈K} p(k|x^n, y^n) q̃(y_{n+1}|x_{n+1}, x^n, y^n, k). (20)

If the true model tree is not included in (19), the right-hand side of (20) gives a sub-optimal approximation of the Bayes decision. By using Theorem 3, we can calculate the optimal (or possibly sub-optimal) prediction efficiently. We name this algorithm Meta-Tree Random Forest (MTRF). We summarize MTRF in Algorithm 1.
Remark 2. One way to construct multiple meta-trees is to use the ideas of Random Forest. Random Forest builds function trees with an impurity measure (e.g., the entropy or the Gini coefficient) from the training data x^n and y^n. By regarding the trees (T_1, k_1), . . . , (T_B, k_B) built by Random Forest as meta-trees M_{T_1,k_1}, . . . , M_{T_B,k_B}, we can construct B meta-trees. In our experiments in Section 5, we adopt this method. The computational complexity of MTRF is O(nBD_max) in the learning process, which is the same as that of Random Forest. This is computable unless n, B, or D_max is extremely large. In addition, we can parallelize the procedure over the meta-trees.
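The combination step of MTRF can be sketched as follows (our own code with invented per-tree predictive values): under Assumption 3, each meta-tree's posterior weight is proportional to the product of its sequential predictives q̃(y_i|x_i, x^{i−1}, y^{i−1}, M_{T,k}), as in Lemma 2; we compute the products in log space for numerical stability:

```python
import math

def mtrf_predict(seq_predictives, new_predictives):
    """seq_predictives[b] = the values q~(y_i | ...) over the training data for
    meta-tree b; new_predictives[b] = q~(y_{n+1} | x_{n+1}, ...) as a vector."""
    log_w = [sum(math.log(q) for q in qs) for qs in seq_predictives]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]          # unnormalized posteriors
    z = sum(w)
    w = [wi / z for wi in w]                        # p(k_b | x^n, y^n), cf. (18)
    n_labels = len(new_predictives[0])
    mixture = [sum(w[b] * new_predictives[b][y] for b in range(len(w)))
               for y in range(n_labels)]
    return max(range(n_labels), key=mixture.__getitem__), w

y_hat, w = mtrf_predict(
    seq_predictives=[[0.9, 0.8, 0.9], [0.5, 0.5, 0.5]],   # tree 1 fits better
    new_predictives=[[0.2, 0.8], [0.6, 0.4]],
)
print(y_hat)   # 1: the prediction is dominated by the better-fitting meta-tree
```

This makes the contrast with Random Forest concrete: the weights w are data-dependent posteriors rather than the uniform 1/B.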

CART and Random Forest Revisited
Although CART [1] and Random Forest [2] do not assume the model tree and they use trees to express a function y = f (x; T) (i.e., function tree), they can be regarded as methods of constructing a model tree candidate set.
First, we consider algorithms that use a single function tree, such as CART [1]. Such algorithms can be regarded as selecting a single model tree and predicting y_{n+1} by using the single function tree that corresponds to that model tree. For example, the prediction of CART, denoted by δ_CART(x^n, y^n, x_{n+1}), is given by

δ_CART(x^n, y^n, x_{n+1}) = arg max_{y_{n+1}} p(y_{n+1}|x_{n+1}, θ̂(x^n, y^n), T̂, k̂),

where (T̂, k̂) and θ̂(x^n, y^n) denote an estimated model tree and tree parameter, respectively. Second, as another example, let us consider Random Forest [2]. Random Forest can be regarded as selecting multiple model trees and predicting y_{n+1} by weighting the multiple function trees that correspond to the multiple model trees. It should be noted that the number of function trees is equal to the number of model trees in Random Forest. The prediction of Random Forest, denoted by δ_RF(x^n, y^n, x_{n+1}), is given by

δ_RF(x^n, y^n, x_{n+1}) = arg max_{y_{n+1}} (1/|M̂|) ∑_{(T̂,k̂)∈M̂} p(y_{n+1}|x_{n+1}, θ̂(x^n, y^n), T̂, k̂), (23)

where M̂ = {(T̂_1, k̂_1), . . . , (T̂_B, k̂_B)} ⊂ M denotes the set of model trees constructed by Random Forest.

Comparison of Random Forest with MTRF
Let us consider the situation where the trees that represent the function trees constructed by Random Forest are the meta-trees of MTRF. Then, the size of the model tree candidate set is far larger in MTRF than in Random Forest.
Random Forest can be regarded as approximating (9), (10), and (17) as in (23). Further, the approximation of the weight p(T, k|x^n, y^n) ≈ 1/|M̂| is not accurate, and the optimality of each function tree's weight for predicting y_{n+1} has not usually been discussed. In contrast, MTRF calculates (9) and (10) exactly and approximates (17) as in (20). Moreover, the optimal weights of T and k are given by (16) and (18), respectively.

Experiments
Most machine learning algorithms require extensive numerical experiments to confirm improvements independently of a specific dataset. In our paper, however, such an improvement is theoretically explained in Sections 3 and 4. If we redefine all the function trees constructed by any existing method as meta-trees, we necessarily improve it in the sense of the Bayes risk function, since our model tree candidate set contains all the model trees that correspond to the original function trees. This improvement is independent of both the datasets and the original methods. Therefore, we perform our experiments on only three types of datasets and compare with Random Forest. Similar results are expected in any other situation.

Experiment 1
Although the theoretical superiority of MTRF to Random Forest has been explained in Sections 3 and 4, we performed Experiment 1 to confirm their performance on synthetic data. The procedure of Experiment 1 is as follows:

2. To make the true model trees diverse, we determine an index of the true feature assign vector k and the true model trees as below:
• In the true model tree (A), we assign x_j to the node whose depth is j.
3. We determine the parameters θ_s according to the Dirichlet distribution.
4. We generate n K-dimensional training data and 100 K-dimensional test data from the true model tree. From (7), the Bayes decision takes the same value for any distribution p(x^n). Therefore, in our experiments, we set the probability distribution of x as follows: let U({0, 1}) denote the uniform distribution on {0, 1}; then, each component of x is drawn independently from U({0, 1}).
5. We perform Random Forest with B function trees (T_1, k_1), . . . , (T_B, k_B). We set its impurity measure to the entropy and its max depth to D_max. Afterward, we make B meta-trees M_{T_1,k_1}, . . . , M_{T_B,k_B} (their max depth is also D_max). Then, we calculate the average prediction error rates of Random Forest and MTRF, where we set the hyper-parameter g_s to g_s = 1/2 in MTRF.
From Figure 6, when we compare the performance of Random Forest with that of MTRF, we can confirm that MTRF outperforms Random Forest in all conditions. These results are theoretically reasonable because the trees that represent the function trees constructed by Random Forest are meta-trees of MTRF.

Experiment 2
In Experiment 2, we used real data (nursery school data and mushroom data) from the UCI repository (University of California, Irvine Machine Learning Repository). The procedure of Experiment 2 is mostly the same as that of Experiment 1, but the way of sampling data is different: instead of Steps 2, 3, and 4 in Experiment 1, we randomly sampled n + t data from the whole dataset and divided them into n training data and t test data. We repeated this random sampling 2500 times.
The details of the nursery school data are as follows. The explanatory variables are 8 discrete attributes of a family whose child wants to enter a nursery school (e.g., finance, health, and family structure). Each attribute takes up to 5 values. The targeting variable is whether the nursery school should welcome the child or not. We want to predict whether a new child should be welcomed, given his/her attributes and the training data. We set |X| = 5, |Y| = 2, n = 100, 200, . . . , 1000, t = 1000, K = 8, B = 5, and D_max = 2, 3, 4. The result is shown in Figure 7a. The details of the mushroom data are as follows. The explanatory variables are 22 discrete attributes of a mushroom (e.g., smell, shape, surface, and color). Each attribute takes up to 10 values. The targeting variable is whether the mushroom is edible, inedible, or unknown. We want to predict whether a new mushroom is edible, given its attributes and the training data. We set |X| = 10, |Y| = 3, n = 100, 200, . . . , 1000, t = 1000, K = 22, B = 5, and D_max = 2, 3, 4. The result is shown in Figure 7b.
From Figure 7, we can confirm that MTRF outperforms Random Forest in all conditions.

Conclusions and Future Work
In this paper, we have considered a model tree and derived the Bayes decision. We have addressed the computational complexity problem of the Bayes decision by introducing the notion of a meta-tree, and we have proposed MTRF (Meta-Tree Random Forest), which uses multiple meta-trees. As shown in Theorem 3, if the true model tree is included in the model tree candidate set represented by multiple meta-trees, MTRF calculates the Bayes decision with efficient computational cost. Even if the true model tree is not included in this candidate set, MTRF gives an approximation of the Bayes decision. The advantages of MTRF have been theoretically analyzed. In Section 4, we have explained that, if we redefine all the function trees constructed by Random Forest as meta-trees, we necessarily improve it in the sense of the Bayes risk function, since our model tree candidate set contains all the model trees that correspond to the original function trees. We have performed experiments in Section 5 to check the theoretical analysis.
As described in Section 1, it is desirable to construct a model tree candidate set that includes as many model trees as possible within the allowed computational cost, because if the true model tree is included in the candidate set, the Bayes decision is calculated. Hence, the main problem of MTRF is how to construct multiple meta-trees. In Section 5, we constructed multiple meta-trees by using the idea of Random Forest, but other constructions are possible. One possible future direction is to use other ideas, such as Boosting.

Abbreviations
The following abbreviation is used in this manuscript:

MTRF  Meta-Tree Random Forest
Appendix A. Proof of Theorem 1

Therefore, the decision function a that minimizes F(a) is

a = arg max_{y_{n+1}} p(y_{n+1}|x^n, y^n, x_{n+1}). (A3)

Appendix B. Proof of Lemma 1
Proof of Lemma 1. Let Θ_{s(x_{n+1})} denote the space of the parameters on s(x_{n+1}), and define Θ_{s′≠s(x_{n+1})} as the space of the parameters on the other nodes. Then, we can rewrite the left-hand side of (11) as below:

q(y_{n+1}|x_{n+1}, x^n, y^n, T, k) = ∫_Θ p(y_{n+1}|x_{n+1}, θ, T, k) p(θ|x^n, y^n, T, k) dθ,

where the last equality follows from (1). Looking into the term ∫_{Θ_{s′≠s(x_{n+1})}} p(θ|x^n, y^n, T, k) dθ_{s′≠s(x_{n+1})} in (A6), we apply Bayes' theorem in step (a). Because we can transform the right-hand side of (11) in the same way, we obtain (11).

Appendix C. Proof of Theorem 2
Proof of Theorem 2. We prove Theorem 2 by mathematical induction according to the following three steps:
Step 1: We prove (15) for n = 0 with Lemmas A1 and A2.
Step 2: We prove (16) for n = 0 by using the result of Step 1.
Step 3: We prove (15) and (16) for general n by induction.
In the following, we give a proof of each step.
Step 1: First, we prove (15) for n = 0. Let S(x_i) (i = 1, . . . , n + 1) denote the set of all nodes on the path of x_i in the meta-tree M_{T,k}, and let T_s := {T′ ∈ T_{M_{T,k}} : s ∈ L(T′)}. Then, we transform q̃(y_1|x_1, k) as below, where (a) follows from (10) and (b) follows from Lemma 1. The right-hand side of (A12) contains a summation with respect to T′, shown in (A13). Regarding (A13), we prove the following lemmas.
Lemma A1. The prior probability distribution p(T′) assumed in (12) satisfies ∑_{T′∈T_{M_{T,k}}} p(T′) = 1.

Proof of Lemma A1. In this proof, for simplicity, we write T instead of T_{M_{T,k}}. Before giving the formal proof, we consider an example. Consider the set of trees T shown in Figure A1, which is composed of five model trees. Among them, T_1 is included in {[s_λ]}, and T_2 to T_5 are included in T \ {[s_λ]}. Therefore, the next equation holds:

∑_{T′∈T} p(T′) = (1 − g_{s_λ}) + g_{s_λ}[(1 − g_{s_0})(1 − g_{s_1}) + g_{s_0}(1 − g_{s_00})(1 − g_{s_01})(1 − g_{s_1}) + (1 − g_{s_0})g_{s_1}(1 − g_{s_10})(1 − g_{s_11}) + g_{s_0}(1 − g_{s_00})(1 − g_{s_01})g_{s_1}(1 − g_{s_10})(1 − g_{s_11})] =(a) 1, (A17)

where (a) follows from g_s = 0 for a leaf node s of the meta-tree (see Assumption 2). Now, we give the formal proof by induction. Let [s_λ] denote the tree that consists of only the root node s_λ of T. Then, we have

∑_{T′∈T} p(T′) = p([s_λ]) + ∑_{T′∈T\{[s_λ]}} p(T′). (A23)

Because each tree T′ ∈ T \ {[s_λ]} is identified by the |X| sub-trees whose root nodes are the child nodes of s_λ, let s_λchild,j denote the j-th child node of s_λ for 0 ≤ j ≤ |X| − 1, and let T_{s_λchild,j} denote the set of sub-trees whose root node is s_λchild,j. Figure A1 shows an example of T_{s_λchild,0} and T_{s_λchild,1}. Then, the summation on the right-hand side of (A23) is factorized over these sub-tree sets. Consequently, because the goal is to show (A22), Lemma A1 is proven by induction.

Figure A1. An illustration of T_{s_λchild,0} (left lower enclosure) and T_{s_λchild,1} (right lower enclosure).
In the trees T_2 to T_5, s_λ has the child nodes s_λchild,0 and s_λchild,1. Therefore, we can decompose the child nodes of s_λ in the trees T_2 to T_5 into the nodes included in T_{s_λchild,0} and the nodes included in T_{s_λchild,1}.
where (a) follows from Assumption 2, and (b) follows from the fact that all trees T′ ∈ T_s have the node s and its ancestor nodes. Further, as we will prove later, it holds that

∑_{T′∈T_s} ∏_{s′∈I(T′)\(A_s∪{s})} g_{s′} ∏_{s′∈L(T′)\{s}} (1 − g_{s′}) = 1. (A28)
To prove (A28), we can apply the factorization of Lemma A1 to (A28) if we regard each node whose parent node is s or an element of A_s as a root node, as shown in Figure A2. Let J denote the number of such factorized trees, and let r_j denote the root node of the j-th factorized tree. Then, we can rewrite (A28), where step (a) follows from Lemma A2. Figure A2. An illustration of r_0, r_1, r_2, and r_3. In this example, the dotted line represents the path from s_λ to s, and A_s is the set of the two ancestor nodes of s. Furthermore, the dotted rectangles represent the four factorized trees T_0, T_1, T_2, and T_3.
Here, let s_λchild denote the node that is a child node of s_λ and an element of S(x_1). Then, we can decompose

∑_{s∈S(x_1)\{s_λ}} q_s(y_1|x_1, k)(1 − g_s) ∏_{s′∈A_s\{s_λ}} g_{s′} = (1 − g_{s_λchild}) q_{s_λchild}(y_1|x_1, k) + g_{s_λchild} ∑_{s∈S(x_1)\{s_λ,s_λchild}} q_s(y_1|x_1, k)(1 − g_s) ∏_{s′∈A_s\{s_λ,s_λchild}} g_{s′}. (A32)

The equation (A32) has a structure similar to (A31). In this way, we can decompose (A31) recursively down to the leaf node included in S(x_1). In addition, we can utilize the assumption that g_s = 0 for any leaf node s of the meta-tree M_{T,k}. Consequently, we can prove that (A31) coincides with q̃_{s_λ}(y_1|x_1, M_{T,k}).
Step 2: Next, we prove (16) for n = 0. Because the parameters g_s are updated only when their nodes are elements of S(x_1), we decompose the right-hand side of (16) for n = 0 into the factors whose nodes are on the path S(x_1) and the remaining factors, as in (A34). We transform the part of (A34) whose parameters g_s are updated by x_1 and y_1, where (a) follows from the fact that s ∈ L(T′) ∩ S(x_1) is uniquely determined as s(x_1), (b) follows from Definition 3, and (c) follows from the cancellation of q̃_s(y_1|x_1, M_{T,k}). Thus, the right-hand side of (A34) is calculated as (A39).