Abstract
This paper deals with the problem of predicting a new targeting variable corresponding to a new explanatory variable, given a training dataset. To predict the targeting variable, we consider a model tree, which represents a conditional probabilistic structure of the targeting variable given the explanatory variable, and we discuss statistical optimality of prediction based on the Bayes decision theory. The optimal prediction based on the Bayes decision theory is given by weighting all the model trees in the model tree candidate set, that is, a set of model trees in which the true model tree is assumed to be included. Because the number of model trees in the candidate set grows exponentially with the maximum depth of the model trees, so does the computational complexity of weighting them. To solve this issue, we introduce the notion of a meta-tree and propose an algorithm called MTRF (Meta-Tree Random Forest), which uses multiple meta-trees. Theoretical and experimental analyses show the superiority of MTRF over previous decision tree-based algorithms.
1. Introduction
Various studies in pattern recognition deal with the problem of predicting a targeting variable corresponding to an explanatory variable, given pairs of explanatory and targeting variables. In many of them, the targeting variable is predicted with a tree T. One way to use a tree T is to represent a function from the explanatory variable to the targeting variable and to predict with that function. A tree T used to represent such a function is called a decision tree in the literature. In this paper, however, this tree is called a function tree, because we distinguish a tree used to represent a function from a tree used to represent a data-generative model (a tree used to represent a data-generative model is explained in the next paragraph). Among previous studies, algorithms that use a single function tree are discussed in, e.g., CART [1], and algorithms that use multiple function trees are discussed in, e.g., Random Forest [2], which constructs multiple function trees from the training data and aggregates them for prediction. There are various extensions of Random Forest; e.g., Generalized Random Forest [3] generalizes the splitting rule of a function tree. Boosting constructs function trees sequentially from the training data and combines them for prediction. There are various Boosting methods, e.g., the gradient boosting methods of Gradient Boosting [4] and XGBoost [5]. Further, previous studies such as Alternating Decision Forest [6] and Boosted Random Forest [7] combine the ideas of Random Forest and Boosting. Moreover, combinations of function trees and neural networks have also been discussed. In Neural Decision Forest [8], inner nodes are replaced by randomized multi-layer perceptrons. In Deep Neural Decision Forest [9], they are replaced by deep convolutional neural networks (CNNs). In Adaptive Neural Trees [10], not only the nodes are replaced by CNNs but also the edges are replaced by nonlinear functions. In Deep Forest [11], Random Forests are combined in a cascade structure like deep CNNs. The function tree is also used to interpret a deep neural network as an interpretable input–output function in Reference [12]. In these algorithms, a function tree is used to represent the function from the explanatory variable to the targeting variable. Although this function is trained from the data under a certain criterion, statistical optimality of the prediction is not necessarily discussed theoretically, because these algorithms usually do not assume a probabilistic data-generative structure of x and y. As we will describe later in detail, we assume a probabilistic data-generative structure of x and y and consider a statistically optimal prediction. This is the crucial difference between our study and the related works.
To predict the targeting variable, another way to use a tree T is to represent a data-generative model, i.e., a conditional probabilistic structure of y given the explanatory variable. A tree T used to represent the data-generative model is called a model tree throughout this paper. Although few studies have assumed the model tree, Reference [13] has. Because Reference [13] assumed that the targeting variable is generated according to the probabilistic model represented by a model tree, statistical optimality of the prediction can be discussed theoretically. Specifically, based on the Bayes decision theory [14], Reference [13] proposed the optimal prediction.
In order to explain the optimal prediction based on the Bayes decision theory in more detail, we introduce the term model tree candidate set. The model tree candidate set is a set of model trees in which the true model tree (when the model tree T is used to represent the true data-generative model, we call it the true model tree) is assumed to be included. One example of a model tree candidate set is the set of model trees whose depth is at most some given maximum. The optimal prediction based on the Bayes decision theory is given by weighting all the model trees in the model tree candidate set, and the optimal weight of each model tree is given by its posterior probability.
Because the number of model trees in the model tree candidate set grows exponentially with the maximum depth of the model trees, so does the computational complexity of weighting them. One way to reduce the computational complexity is to restrict the model tree candidate set. To represent a restricted model tree candidate set, Reference [13] proposed the notion of a meta-tree. The concept of a meta-tree was originally used for data compression in information theory (see, e.g., Reference [15]); recently, it was also used for image compression [16]. As shown in Figure 1, a model tree candidate set is composed of the model trees represented by the meta-tree.
Figure 1.
An example of a meta-tree and four model trees represented by it.
A meta-tree is also used for the prediction itself in the algorithm proposed by Reference [13]. In summary, a meta-tree has the following two roles: (i) it represents a model tree candidate set; and (ii) it is used in the prediction algorithm. The characteristics of a meta-tree are as follows:
- If the true model tree is in a model tree candidate set represented by a meta-tree, the statistically optimal prediction—optimal prediction based on the Bayes decision theory—can be calculated.
- The number of model trees in a model tree candidate set represented by a meta-tree increases exponentially with the depth of the meta-tree.
- The computational cost of the learning process for a single meta-tree is of the same order as that for a single function tree.
Under the assumption that the true model tree is in the restricted model tree candidate set represented by a meta-tree, Reference [13] proposed the optimal prediction based on the Bayes decision theory.
As described above, if the true model tree is actually included in the model tree candidate set, the optimal prediction based on the Bayes decision theory can be calculated. Hence, it is desirable to construct a model tree candidate set that includes as many model trees as possible within the allowed computational cost. Motivated by this fact, this paper extends the model tree candidate set of Reference [13]. Instead of a single meta-tree as in Reference [13], we consider multiple meta-trees, and we predict the new targeting variable based on the Bayes decision theory by using the model tree candidate set represented by the multiple meta-trees. We call the proposed algorithm MTRF (Meta-Tree Random Forest). The characteristics of MTRF are as follows:
- If the true model tree is in a model tree candidate set represented by any of the meta-trees of MTRF, the statistically optimal prediction—optimal prediction based on the Bayes decision theory— can be calculated.
- The number of model trees in the model tree candidate set represented by multiple meta-trees is a constant factor (the number of meta-trees) larger than that represented by a single meta-tree.
- The computational cost of the learning process for the meta-trees of MTRF is of the same order as that for the multiple function trees of Random Forest.
Regarding the single meta-tree of Reference [13] or the multiple meta-trees of MTRF, how to construct them is a problem. One way to solve this issue is to use function tree-based algorithms; for example, the meta-tree of Reference [13] can be constructed by regarding a function tree built by CART as a meta-tree, and the multiple meta-trees of MTRF can be constructed by regarding the function trees built by Random Forest as meta-trees. In this way, by regarding a function tree as a meta-tree, our proposed method can be applied to any practical application in which a function tree-based algorithm is used—insurance claim tasks (e.g., Reference [5]), letter classification (e.g., Reference [6]), semantic segmentation (e.g., Reference [8]), face recognition (e.g., Reference [11]), music classification (e.g., Reference [11]), etc.—and we can improve every previous function tree-based algorithm, because we can perform the prediction by using more model trees than the previous algorithms. A more detailed discussion is given in Section 3 and Section 4.
The rest of this paper is organized as follows. In Section 2, we state the preliminaries of our study. Section 3 explains our proposed method—MTRF. In Section 4, we revisit the previous studies, such as CART [1] and Random Forest [2], and compare them with MTRF. Section 5 shows the results of experiments. Section 6 concludes this paper.
2. Preliminaries
2.1. Model Tree
Let K denote the dimension of explanatory variables. The space of feature values is denoted by , and the explanatory variable is denoted by . In addition, the space of label values is denoted by , and the targeting variable is denoted by .
In the field of pattern recognition, the probabilistic structure of x and y is not necessarily assumed. In particular, it is not assumed in most function tree-based algorithms. In contrast, we consider a triplet defined in Definition 1 below and assume the probabilistic structure in Definition 2. We call a model tree (in Section 1, we called T a model tree; more precisely, however, a model tree is .) and a tree parameter.
Definition 1.
The notation T denotes the -ary regular (Here, “regular” means that all inner nodes have child nodes.) tree whose depth is up to , and denotes the set of T. Let s be a node of a tree and be a set of s. The set of the inner nodes of T is denoted by , and the set of the leaf nodes of T is denoted by . Let be a component of the feature assign vector of node s. The explanatory variable is assigned to the inner node . Let denote a feature assign vector, and denote a set of . Let , where , be a parameter assigned to . We define , and Θ is the set of .
Definition 2.
Because the value of explanatory variable corresponds to a path from the root node to the leaf node of T whose feature assign vector is , let denote the corresponding leaf node. Then, for given , , T, and k, the label variable y is generated according to
Figure 2.
(a) Two examples of and , where and are 2-ary regular trees, , and . (b) .
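To make the data-generative structure in Definition 2 concrete, the following is a minimal sketch (not the authors' implementation) of a model tree in Python: an inner node holds the feature index assigned to it, a leaf node holds a parameter vector over the label alphabet, and a label is drawn from the leaf reached by following the path determined by the explanatory variable. All class, function, and variable names are hypothetical.

```python
import numpy as np

class ModelTreeNode:
    """A node of a model tree: inner nodes split on a feature, leaves hold theta."""
    def __init__(self, feature=None, children=None, theta=None):
        self.feature = feature      # feature index assigned to an inner node
        self.children = children    # dict: feature value -> child node (None at a leaf)
        self.theta = theta          # label distribution at a leaf node

    def is_leaf(self):
        return self.children is None

def generate_label(node, x, rng):
    """Follow the path determined by x and draw y from the reached leaf's theta."""
    while not node.is_leaf():
        node = node.children[x[node.feature]]
    return rng.choice(len(node.theta), p=node.theta)

# A hypothetical depth-2 binary model tree over binary features and binary labels.
rng = np.random.default_rng(0)
leaf = lambda p: ModelTreeNode(theta=np.array([p, 1.0 - p]))
tree = ModelTreeNode(feature=0, children={
    0: ModelTreeNode(feature=1, children={0: leaf(0.9), 1: leaf(0.2)}),
    1: leaf(0.5),
})
y = generate_label(tree, np.array([0, 1]), rng)     # draw a label for x = (0, 1)
```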
2.2. Problem Setup
In this subsection, we introduce our problem setup. Let denote the value of the explanatory variable of the i-th data and . We assume that are i.i.d. random variables drawn from a probability distribution. Let denote the targeting variable of the i-th data and . We assume that are generated according to the model defined in Section 2.1. Here, we assume that the true model tree and the true tree parameter are unknown, but a model tree candidate set is known. Note that, as explained in Section 1, the model tree candidate set is a set of model trees in which the true model tree is assumed to be included. Further, we assume that the class of parametrized distributions is known.
The purpose of the prediction problem is to predict the unknown targeting variable corresponding to a new explanatory variable by using the given training data.
2.3. Optimal Prediction Based on the Bayes Decision Theory
We introduce a statistically optimal prediction under the problem setup in Section 2.2. From the viewpoint of the Bayes decision theory [14], we shall derive the optimal prediction based on the Bayes decision theory. To this end, we assume priors , , and .
First, we define a decision function as
Second, we define a 0–1 loss as follows:
Third, using the 0–1 loss, we define the loss function, which is the expectation of the 0–1 loss taken over the new targeting variable :
Next, given the loss function, we define the risk function, which is the expectation of the loss function taken over the training data , , and :
Finally, the Bayes risk function, which is the expectation of the risk function taken over , T, and , is defined as
Then, we have the following theorem:
Theorem 1.
Under the setup in Section 2.2, the decision function that minimizes the Bayes risk function is given as
The proof of Theorem 1 is in Appendix A. We call this decision function the Bayes decision.
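As a toy illustration of Theorem 1, the following sketch computes the Bayes decision for a small candidate set: each candidate model tree contributes its predictive distribution, the contributions are weighted by the posterior probabilities of the trees, and the label maximizing the mixture is returned. The posterior weights and per-tree predictive distributions used here are hypothetical placeholders; computing them efficiently is the subject of Section 2.4 and Section 3.

```python
import numpy as np

def bayes_decision(posterior_weights, predictive_dists):
    """posterior_weights: shape (M,); predictive_dists: shape (M, |Y|).
    Returns the label maximizing the posterior-weighted mixture probability."""
    mixture = posterior_weights @ predictive_dists      # p(y | x_{n+1}, training data)
    return int(np.argmax(mixture))

weights = np.array([0.6, 0.3, 0.1])                     # toy posterior over 3 model trees
dists = np.array([[0.8, 0.2], [0.4, 0.6], [0.5, 0.5]])  # toy predictive distributions
print(bayes_decision(weights, dists))                   # -> 0
```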
2.4. Previous Study
In this subsection, we introduce how the Bayes decision is calculated in Reference [13]. Because the proof in Reference [13] lacks some explanation, we give a more rigorous discussion here; this is one of the contributions of this paper.
As explained in Section 1, Reference [13] introduced the concept of a “meta-tree” to represent a restricted model tree candidate set. For T and , a meta-tree—denoted by —represents a model tree candidate set that is composed of the model trees where is a sub-tree of T and . An example of is shown in Figure 3, where T is a complete tree with and . In this example, a meta-tree represents a model tree candidate set composed of four model trees. A meta-tree is also used for prediction in the algorithm proposed by Reference [13]. In short, a meta-tree has two roles: first, it represents a model tree candidate set, and, second, it is used in the prediction algorithm.
Figure 3.
An example of a meta-tree , where T is a complete tree with and .
Let denote a set of T represented by a meta-tree and let . Under the assumption that the true model tree is included in , the Bayes decision in Theorem 1 is expressed as
Reference [13] exactly calculates (9) and (10) under some assumptions. We describe them in Section 2.4.1 and Section 2.4.2, respectively.
2.4.1. Expectation over the Parameters
In this subsection, we explain how to calculate (9). To calculate (9) analytically, a conjugate prior of given T and , is assumed.
Assumption 1.
We assume for , where denotes the hyper-parameter of the Dirichlet distribution.
Under this assumption, has a closed-form expression and (9) can be calculated exactly. In addition, has an important property, which we describe in the next lemma as preparation for Section 2.4.2.
Lemma 1.
For any , , , and , if , then
The proof of Lemma 1 is in Appendix B. From this lemma, the right- and left-hand sides of (11) are denoted by because they depend not on T and but on .
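To illustrate Assumption 1 and the closed-form expression it yields, the following sketch computes the Dirichlet–categorical posterior predictive probability at a single leaf from the label counts of the training data routed to that leaf. In line with Lemma 1, the result depends only on the statistics at the node, not on which tree the node belongs to. The function name and values are hypothetical.

```python
import numpy as np

def leaf_predictive(counts, alpha):
    """Posterior predictive p(y | data at leaf) under a Dirichlet(alpha) prior."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha.sum())

print(leaf_predictive(counts=[3, 1], alpha=[0.5, 0.5]))  # -> [0.7 0.3]
```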
2.4.2. Summation over All Model Trees Represented by a Meta-Tree
In this subsection, we explain how to calculate (10). The exact calculation of (10) requires a summation over all model trees represented by a single meta-tree. However, the computational complexity of this calculation is huge because the number of model trees increases exponentially with the depth of the meta-tree. Reference [13] solved this problem with the recursive computation described below.
To execute this method, an assumption about the prior probability distribution of T—Assumption A2—is required. It should be noted that is different from the construction method of function trees, given and .
Assumption A2.
Let denote a hyper-parameter assigned to any node . Then, we assume the prior of as
where for a leaf node s of a meta-tree .
Remark 1.
As described in Section 1, a meta-tree is also used for prediction in the algorithm proposed by Reference [13]. By using a meta-tree , the recursive function to calculate the Bayes decision (10) is defined as follows.
Definition 3.
We define the following recursive function for any node s on the path in corresponding to .
where is the child node of s on the path corresponding to in , and is also recursively updated as follows:
Now, (10) can be calculated as shown in the following theorem.
Theorem 2.
can be calculated by
where is the root node of . In addition, can be calculated as
The proof of Theorem 2 is in Appendix C. Surprisingly, by using this theorem, the computational complexity of calculating (10) is .
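The following is a schematic sketch, in the spirit of the recursion in Definition 3 and Theorem 2 (and of the context-tree weighting idea of Reference [15]): every node on the path that the explanatory variable determines keeps a mixing weight and label counts, the predictive distribution is mixed bottom-up along the path, and the weights are updated by Bayes' rule once the label is observed. The data layout and the exact conventions (e.g., the handling of the deepest node) are assumptions of this sketch and may differ from the paper's notation.

```python
import numpy as np

def path_predictive(path_nodes, y=None):
    """path_nodes: list (root to leaf) of dicts with keys 'g' (float weight),
    'counts' and 'alpha' (numpy arrays over the label alphabet).
    Returns the mixture predictive distribution at the root; if y is given,
    also updates each node's weight and counts by Bayes' rule."""
    q_tilde = None
    for node in reversed(path_nodes):
        c, a = node['counts'], node['alpha']
        q_leaf = (c + a) / (c.sum() + a.sum())       # Dirichlet posterior predictive
        if q_tilde is None:                          # deepest node: no deeper mixture
            q_tilde = q_leaf
        else:                                        # mix "stop here" vs. "go deeper"
            q_tilde = node['g'] * q_leaf + (1.0 - node['g']) * q_tilde
        node['_q_leaf'], node['_q_tilde'] = q_leaf, q_tilde
    if y is not None:
        for node in path_nodes:                      # posterior update after observing y
            node['g'] = node['g'] * node['_q_leaf'][y] / node['_q_tilde'][y]
            node['counts'][y] += 1
    return q_tilde

# Toy usage: a path of three nodes with uniform initial weights and empty counts.
path = [{'g': 0.5, 'counts': np.zeros(2), 'alpha': np.full(2, 0.5)} for _ in range(3)]
print(path_predictive(path))                         # prior predictive: [0.5 0.5]
path_predictive(path, y=1)                           # observe y = 1 and update the path
```

Because only the nodes on a single root-to-leaf path are visited per data point, one sequential pass of this sketch costs time proportional to the depth of the meta-tree, which is in line with the efficiency claimed after Theorem 2.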
3. Proposed Method
In this section, we introduce our proposed method. Here, we reconsider the general setup where . Recall that the Bayes decision (7) is
where is defined as in (10). In (17), can be calculated in the same way as in Section 2.4. However, to evaluate (17), we need to calculate for all that satisfy . Because this computational cost increases exponentially with , it is hard to calculate all of them in general. To reduce the computational complexity of (17), we restrict the model tree candidate set to the set represented by multiple meta-trees. The advantages of building multiple meta-trees are as follows:
- As explained in Section 2.4.2, the number of model trees represented by each meta-tree increases exponentially with the depth of the meta-tree. In addition, the number of model trees represented by multiple meta-trees is a constant factor larger than the number contained in a single meta-tree.
- If the true model tree is a sub-tree of any of the meta-trees, the statistically optimal prediction—optimal prediction based on the Bayes decision theory—can be calculated.
- If we build B meta-trees, the computational cost of learning their parameters is , which is of the same order as that of building B function trees.
Now, we introduce the next assumption.
Assumption A3.
For B meta-trees , we define . Then, we assume a uniform distribution on as a prior probability distribution of .
An example of is shown in Figure 4.
Figure 4.
An example of and . In this example, the prior probability and .
Next, we prove the following lemma about computing a posterior probability distribution of . In fact, the recursive function we have defined in Definition 3 is also useful to calculate a posterior probability distribution on sequentially. The proof of Lemma 2 is in Appendix D.
Lemma 2.
By decomposing and into the set of i-th data and performing the recursive calculation of as in (13), a posterior probability distribution of is expressed as follows:
From Theorem 2 and Lemma 2, we immediately obtain the next theorem.
Theorem 3.
Let us consider the model tree candidate set represented by B meta-trees, i.e.,
By using Theorem 3, we can efficiently calculate the Bayes decision over the restricted model tree candidate set; this is the exact Bayes decision when the true model tree is represented by one of the meta-trees, and an approximation otherwise. We name this algorithm Meta-Tree Random Forest (MTRF). We summarize MTRF in Algorithm 1.
| Algorithm 1 MTRF |
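As a schematic sketch of the prediction step of MTRF: each meta-tree returns a predictive distribution for the new explanatory variable, the meta-trees are mixed with posterior weights (by Lemma 2, proportional under the uniform prior of Assumption A3 to the product of each meta-tree's sequential predictive probabilities on the training data), and the label maximizing the mixture is returned. The function name, array shapes, and numerical values are hypothetical.

```python
import numpy as np

def mtrf_predict(meta_tree_predictives, log_evidences):
    """meta_tree_predictives: shape (B, |Y|), one predictive distribution per meta-tree.
    log_evidences: shape (B,), log of each meta-tree's product of sequential
    predictive probabilities on the training data (uniform prior assumed)."""
    w = np.exp(log_evidences - np.max(log_evidences))   # unnormalized posterior weights
    w /= w.sum()
    mixture = w @ meta_tree_predictives                 # mixture predictive over labels
    return int(np.argmax(mixture)), mixture

preds = np.array([[0.7, 0.3], [0.4, 0.6], [0.55, 0.45]])  # toy values for B = 3
log_ev = np.array([-10.2, -12.5, -11.0])
label, dist = mtrf_predict(preds, log_ev)
```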
Remark 2.
One way to construct multiple meta-trees is to use the ideas of Random Forest. Random Forest builds function trees by using an impurity measure (e.g., the entropy or the Gini index) from the training data. By regarding the B function trees built by Random Forest as meta-trees, we can construct B meta-trees. In our experiments in Section 5, we adopt this method. The computational complexity of MTRF in the learning process is , which is the same as that of Random Forest; this is computable unless n, B, or is extremely large. In addition, the procedure can be parallelized over the meta-trees.
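A sketch of this construction, assuming scikit-learn is used to build the function trees of Random Forest: only the split structure of each fitted tree (the feature assigned to each inner node and the child links) is reused as a meta-tree, and the Dirichlet statistics and node weights would then be learned by the recursion of Definition 3. The toy data and variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy discrete training data.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8))
y = (X[:, 0] ^ X[:, 3]).astype(int)

# Build B function trees with the entropy impurity and a depth limit.
forest = RandomForestClassifier(n_estimators=10, max_depth=4,
                                criterion="entropy", random_state=0).fit(X, y)

# Reuse each fitted tree's shape as a meta-tree skeleton.
meta_trees = []
for est in forest.estimators_:
    t = est.tree_
    meta_trees.append({
        "feature": t.feature,        # feature index per node (negative at leaves)
        "left": t.children_left,     # left child index per node (-1 at leaves)
        "right": t.children_right,   # right child index per node (-1 at leaves)
    })
```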
4. CART and Random Forest Revisited and Comparison with MTRF
4.1. CART and Random Forest Revisited
Although CART [1] and Random Forest [2] do not assume a model tree and use trees to express a function (i.e., function trees), they can be regarded as methods of constructing a model tree candidate set.
First, we consider algorithms that use a single function tree, such as CART [1]. Such algorithms can be regarded as selecting a single model tree and predicting by using the single function tree that corresponds to that model tree. For example, the prediction of CART—denoted by —is given by
where and denote an estimated model tree and tree parameter, respectively.
Second, as another example, let us consider Random Forest [2]. Random Forest can be regarded as selecting multiple model trees and predicting by weighting the multiple function trees that correspond to the multiple model trees. It should be noted that the number of function trees is equal to the number of model trees in Random Forest. The prediction of Random Forest—denoted by —is given by
where denotes the set of model trees that are constructed by Random Forest.
4.2. Comparison of Random Forest with MTRF
Let us consider the situation where the trees that represent the function trees constructed by Random Forest are used as the meta-trees of MTRF. Then, the model tree candidate set is far larger in MTRF than in Random Forest.
Random Forest can be regarded as approximating (9), (10), and (17) as in (23). Further, this approximation of the weights is not necessarily accurate, and the optimality of each function tree’s weight for prediction is usually not discussed. In contrast, MTRF calculates (9) and (10) of the Bayes decision exactly and approximates (17) as in (20). Moreover, the optimal weights of T and are given by (16) and (18), respectively.
5. Experiments
Most machine learning algorithms require extensive numerical experiments to confirm an improvement independent of a specific dataset. In our paper, however, such an improvement is explained theoretically in Section 3 and Section 4. If we redefine all the function trees constructed by any existing method as meta-trees, we necessarily improve that method in the sense of the Bayes risk function, since our model tree candidate set contains all the model trees that correspond to the original function trees. This improvement is independent of both the datasets and the original methods. Therefore, we perform our experiments only on three types of datasets and compare with Random Forest. Similar results are expected in other settings.
5.1. Experiment 1
Although the theoretical superiority of MTRF over Random Forest has been explained in Section 3 and Section 4, we performed Experiment 1 to confirm their performance on synthetic data. The procedure of Experiment 1 is as follows:
- We initialize the hyper-parameter , which is assigned on each node s.
- To make the true model trees diverse, we determine an index of the true feature assign vector and the true model trees as below:
- In the true model tree (A), we assign to the node whose depth is j.
- true model tree (A-1): the complete 2-ary model tree with depth 4.
- true model tree (A-2): the regular 2-ary model tree with depth 4, in which only the left nodes have child nodes and the right nodes do not.
- In the true model tree (B), we assign a different feature to each node.
- true model tree (B-1): the complete 2-ary model tree with depth 4.
- We determine parameters according to the Dirichlet distribution.
- We generate n training data and 100 test data of K-dimensional explanatory variables from the true model tree (a sketch of this data-generation step is given after this list). From (7), the Bayes decision takes the same value for any distribution of the explanatory variable. Therefore, in our experiments, we set the probability distribution of the explanatory variable as follows: Let denote the uniform distribution on . Then, the probability distribution of the explanatory variable is expressed as
- We run Random Forest with B function trees , setting its impurity to the entropy and its maximum depth to . Afterward, we construct B meta-trees from these function trees (their maximum depth is , too). Then, we calculate the average prediction error rates of Random Forest and MTRF, where we set the hyper-parameter of MTRF to .
- We repeat Steps 3–5 Q times.
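The following is a sketch of the data-generation step above, assuming for brevity a complete 2-ary model tree of depth 2 over binary features (the experiments use depth 4) and labels drawn at the leaf reached by each explanatory variable; the leaf parameters are drawn from a Dirichlet distribution as in the procedure. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
K, depth, n_train, n_test = 3, 2, 400, 100

features = [0, 1]                                        # feature assigned at each depth
leaf_theta = rng.dirichlet(alpha=[0.5, 0.5], size=2 ** depth)  # one theta per leaf

def sample(n):
    """Draw n pairs (x, y) from the toy true model tree."""
    X = rng.integers(0, 2, size=(n, K))
    leaf_idx = X[:, features[0]] * 2 + X[:, features[1]]  # root-to-leaf path -> leaf index
    y = np.array([rng.choice(2, p=leaf_theta[i]) for i in leaf_idx])
    return X, y

X_train, y_train = sample(n_train)
X_test, y_test = sample(n_test)
```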
Figure 5 shows an example of these model trees (note that, for simplicity, we illustrate the model trees with depth 2).
Figure 5.
An example of true model tree (A-1) (a), model tree (A-2) (b), and model tree (B-1) (c).
We set , , and . The result is shown in Figure 6.
Figure 6.
Relationships between the number of learning data n and the average prediction error rate where the true model tree is model (A-1) (a), model (A-2) (b), and model (B-1) (c).
From Figure 6, when we compare the performance of Random Forest with that of MTRF, we can confirm that MTRF outperforms Random Forest in all conditions. These results are theoretically reasonable because the trees that represent the function trees constructed by Random Forest are meta-trees of MTRF.
5.2. Experiment 2
In Experiment 2, we used real data (nursery school data and mushroom data) from the UCI repository (University of California, Irvine Machine Learning Repository). The procedure of Experiment 2 is mostly the same as that of Experiment 1, but the way of sampling data is different; instead of Steps 2, 3, and 4 of Experiment 1, we randomly sampled data from the whole dataset and divided them into n training data and t test data. We repeated this random sampling 2500 times.
The details of the nursery school data are as follows. The explanatory variables are 8 discrete attributes of a family whose child applies to a nursery school (e.g., finance, health, and family structure). Each attribute takes at most 5 values. The targeting variable is whether the nursery school should accept the child or not. We predict whether a new child should be accepted from the child's attributes and the learning data. We set , , , , , , and . The result is shown in Figure 7a.
Figure 7.
Relationships between the number of the learning data n and the average prediction error rate on nursery school data (a)/mushroom data (b) in UCI repository.
The details of the mushroom data are as follows. The explanatory variables are 22 discrete attributes of a mushroom (e.g., smell, shape, surface, and color). The dimension of each attribute is at most 10. The targeting variable is whether the mushroom is edible, inedible, or unknown. We predict whether a new mushroom is edible from its attributes and the learning data. We set , , , , , , and . The result is shown in Figure 7b.
From Figure 7, we can confirm that MTRF outperforms Random Forest in all conditions.
6. Conclusions and Future Work
In this paper, we have considered a model tree and derived the Bayes decision. We have addressed the computational complexity problem of the Bayes decision by introducing the notion of a meta-tree and have proposed MTRF (Meta-Tree Random Forest), which uses multiple meta-trees. As shown in Theorem 3, if the true model tree is included in the model tree candidate set represented by the multiple meta-trees, MTRF calculates the Bayes decision at an efficient computational cost. Even if the true model tree is not included in that candidate set, MTRF gives an approximation of the Bayes decision. The advantages of MTRF have been analyzed theoretically. In Section 4, we have explained that, if we redefine all the function trees constructed by Random Forest as meta-trees, we necessarily improve Random Forest in the sense of the Bayes risk function, since our model tree candidate set contains all the model trees that correspond to the original function trees. We have performed experiments in Section 5 to check the theoretical analysis.
As described in Section 1, it is desirable to construct a model tree candidate set that includes as many model trees as possible within the allowed computational cost, because if the true model tree is included in the model tree candidate set, the Bayes decision can be calculated. Hence, the main problem of MTRF is how to construct the multiple meta-trees. In Section 5, we constructed them by using the idea of Random Forest. However, other constructions are possible; one possible future direction is to use other ideas, such as Boosting. This is left for future work.
Author Contributions
Conceptualization, N.D., S.S., Y.N. and T.M.; Methodology, N.D., S.S., Y.N. and T.M.; Software, N.D. and Y.N.; Validation, N.D., S.S., Y.N. and T.M.; Formal Analysis, N.D., S.S., Y.N. and T.M.; Investigation, N.D., S.S., Y.N. and T.M.; Resources, N.D.; Data Curation, N.D.; Writing—Original Draft Preparation, N.D., S.S. and Y.N.; Writing—Review & Editing, N.D., S.S., Y.N. and T.M.; Visualization, N.D., S.S. and Y.N.; Supervision, T.M.; Project Administration, T.M.; Funding Acquisition, S.S. and T.M. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported in part by JSPS KAKENHI Grant Numbers JP17K06446, JP19K04914 and JP19K14989.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Publicly available datasets were analyzed in this study. This data can be found here: https://archive.ics.uci.edu/ml/datasets/Nursery and https://archive.ics.uci.edu/ml/datasets/Mushroom (accessed on 8 June 2021).
Acknowledgments
The authors would like to thank all the members of Matsushima Lab. for their comments and discussions.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
Abbreviations
The following abbreviation is used in this manuscript:
| MTRF | Meta-Tree Random Forest |
Appendix A. Proof of Theorem 1
Appendix B. Proof of Lemma 1
Appendix C. Proof of Theorem 2
Proof of Theorem 2.
We prove Theorem 2 with mathematical induction according to the following three steps:
In the following, we give a proof of each step.
Step 1: First, we prove (15) for . Let denote the set of all nodes on the path of in the meta-tree . Let . Then, we transform as below:
where (a) follows from (10), and (b) follows from Lemma 1. The right-hand side of (A12) contains the summation with respect to T as below:
Regarding (A13), we prove the following lemmas.
Lemma A1.
Proof of Lemma A1.
In this proof, for simplicity, we use instead of . Before giving the formal proof, we consider an example. Let us consider the set of trees shown in Figure A1. In this example, is composed of five model trees. Among these model trees, is included in and – are included in . Therefore, the next equation holds:
where (a) follows from for a leaf node s of the meta-tree (see Assumption A2).
Now, we give the formal proof. Because
we prove
by induction.
Let denote the tree that consists of only the root node of T. Then, we have
Because each tree is identified by sub-trees whose root nodes are the child nodes of , let denote the j-th child node of for and denote the set of sub-trees whose root node is . Figure A1 shows an example of and . Then, the summation in the right-hand side of (A23) is factorized as
Consequently, because the goal is to show (A22), Lemma A1 is proven by induction. □
Lemma A2.
Let denote the set of all ancestor nodes of s. Then, we have
Figure A1.
An illustration of (left lower enclosure) and (right lower enclosure). In the trees –, has child nodes and . Therefore, we can decompose the child nodes of in the trees – into the nodes included in and the nodes included in .
Proof of Lemma A2.
where (a) follows from Assumption A2, and (b) follows from the fact that all trees have the node s and its ancestor nodes.
To prove (A28), we can apply the factorization of Lemma A1 to (A28) if we regard each node whose parent node is s or an element of as a root node, as shown in Figure A2. Let J denote the number of those factorized trees, and let denote the root node of the j-th factorized tree. Then, we can rewrite (A28) as follows:
Hence, the proof is complete. □
Figure A2.
An illustration of ,,, and . In this example, the dotted line represents the path from to s and is the set of two parent nodes of s. Furthermore, the rectangles of the dotted line represent four factorized trees , , , and .
Here, let denote the node that is a child node of and an element of . Then, we can decompose
Equation (A32) has a structure similar to (A31). In this way, we can decompose (A31) recursively down to the leaf node included in . In addition, we can utilize the assumption for any leaf node s of a meta-tree . Consequently, we can prove that (A31) coincides with .
Step 2: Next, we prove (16) for . Because the parameters are updated only when their nodes are the elements of , we decompose the right-hand side of (16) for as follows:
We transform the part of (A34), whose parameters are renewed by and , as follows:
where (a) follows from the fact that is uniquely determined as , (b) follows from Definition 3, and (c) follows from the reduction of .
Thus, the right-hand side of (A34) is calculated as
where (a) follows from Assumption A2, the denominator of (b) follows from Theorem 2 for , the numerator of (b) follows from Definition 3 and the definition of , and (c) follows from Bayes’ theorem.
Appendix D. Proof of Lemma 2
Proof of Lemma 2.
where (a) follows from Assumption A3 and are i.i.d.; (b) follows from Theorem 2 at each . □
References
- Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984. [Google Scholar]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Athey, S.; Tibshirani, J.; Wager, S. Generalized Random Forests. Ann. Stat. 2019, 47, 1148–1178. [Google Scholar] [CrossRef]
- Friedman, J.H. Stochastic Gradient Boosting. Comput. Stat. Data Anal. 2002, 38, 367–378. [Google Scholar] [CrossRef]
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar] [CrossRef]
- Schulter, S.; Wohlhart, P.; Leistner, C.; Saffari, A.; Roth, P.M.; Bischof, H. Alternating Decision Forests. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 508–515. [Google Scholar] [CrossRef]
- Mishina, Y.; Murata, R.; Yamauchi, Y.; Yamashita, T.; Fujiyoshi, H. Boosted Random Forest. IEICE Trans. Inf. Syst. 2015, E98.D, 1630–1636. [Google Scholar] [CrossRef]
- Bulo, S.; Kontschieder, P. Neural Decision Forests for Semantic Image Labelling. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 81–88. [Google Scholar] [CrossRef]
- Kontschieder, P.; Fiterau, M.; Criminisi, A.; Bulo, S.R. Deep Neural Decision Forests. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1467–1475. [Google Scholar] [CrossRef]
- Tanno, R.; Arulkumaran, K.; Alexander, D.; Criminisi, A.; Nori, A. Adaptive Neural Trees. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 6166–6175. [Google Scholar]
- Zhou, Z.H.; Feng, J. Deep Forest: Towards An Alternative to Deep Neural Networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), Melbourne, Australia, 19–25 August 2017; pp. 3553–3559. [Google Scholar] [CrossRef]
- Frosst, N.; Hinton, G. Distilling a Neural Network Into a Soft Decision Tree. arXiv 2017, arXiv:1711.09784. [Google Scholar]
- Suko, T.; Nomura, R.; Matsushima, T.; Hirasawa, S. Prediction Algorithm for Decision Tree Model. IEICE Tech. Rep. Theor. Found. Comput. 2003, 103, 93–98. (In Japanese) [Google Scholar]
- Berger, J.O. Statistical Decision Theory and Bayesian Analysis; Springer: New York, NY, USA, 1985. [Google Scholar] [CrossRef]
- Matsushima, T.; Hirasawa, S. A Bayes Coding Algorithm Using Context Tree. In Proceedings of the 1994 IEEE International Symposium on Information Theory, Trondheim, Norway, 27 June–1 July 1994; p. 386. [Google Scholar] [CrossRef]
- Nakahara, Y.; Matsushima, T. A Stochastic Model of Block Segmentation Based on the Quadtree and the Bayes Code for It. In Proceedings of the 2020 Data Compression Conference (DCC), Snowbird, UT, USA, 24–27 March 2020; pp. 293–302. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).