1. Introduction
It is widely known that the performance of multiple combined classifiers far exceeds that of any single classifier used independently [1]. Tree ensembles such as Random Forest (RF) [2] have achieved impressive success across various classification and regression tasks. RF builds a number of decision trees during training, each of which contributes to the final outcome through an aggregation method known as bagging [3]. Each decision tree is built from a bootstrap sample of the data, with random subsets of features considered at each split. The predictions of the individual trees are combined to produce the final RF output, delivering high accuracy and robustness across different datasets and outperforming single classifiers, especially on complex predictive tasks [4].
RF has established itself as an indispensable tool in data mining, with applications covering bioinformatics [5], image processing [6], financial analytics [7], and natural language processing [8]. However, traditional RF algorithms face substantial challenges when handling high-dimensional data or running on hardware-constrained devices such as smartphones and Internet-of-Things (IoT) systems [9]. In these scenarios, the traditional algorithms often exhibit lower computational efficiency and reduced predictive accuracy. These limitations become even more pronounced in the rapidly evolving computational landscape, where edge computing and IoT devices demand faster and more reliable algorithms [10].
To address these computational demands, researchers have explored various optimization techniques, particularly those focusing on decision-tree construction [11]. Decision trees represent the relationship between features and target variables through a hierarchical structure of conjunctive conditions, where each internal node corresponds to a feature and a threshold. In RF, finding the optimal split by identifying the best feature–threshold pair is crucial for both computational efficiency and predictive accuracy [12]. Accordingly, optimizing the node-splitting process can significantly improve the overall performance of RF [13].
Traditional RF algorithms determine optimal splits by exhaustively searching through all possible splitting points, resulting in a computational complexity of $O(n)$ for each split, where $n$ is the number of instances. This exhaustive search is computationally heavy and inefficient. To address this problem, algorithms such as XGBoost [14] and LightGBM [15] employ histogram-based techniques that reduce the complexity to $O(b)$, where $b \ll n$, by grouping feature values into discrete bins. While histogram-based methods can substantially accelerate training, they remain wasteful because they allocate the same computational effort to all features, including those that are not particularly informative.
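To make the histogram idea concrete, the following minimal sketch evaluates candidate splits for a single feature only at bin boundaries, so that each split costs $O(b)$ once per-bin statistics are accumulated. This is an illustration of the general technique, not the XGBoost or LightGBM implementation; the function name and the equal-width binning are our own choices.

```python
# Minimal sketch of histogram-based split search for one feature (illustration only;
# `best_split_from_histogram` and the equal-width binning are our assumptions).
import numpy as np

def best_split_from_histogram(x, y, n_bins=32):
    """Return (threshold, weighted_mse) of the best split on one feature,
    evaluating only the n_bins bin boundaries instead of all n distinct values."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bin_ids = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

    # Per-bin sufficient statistics (count, sum, sum of squares) for the MSE impurity.
    counts = np.bincount(bin_ids, minlength=n_bins)
    sums = np.bincount(bin_ids, weights=y, minlength=n_bins)
    sq_sums = np.bincount(bin_ids, weights=y ** 2, minlength=n_bins)

    best = (None, np.inf)
    for b in range(1, n_bins):                       # candidate thresholds = bin boundaries
        nL, nR = counts[:b].sum(), counts[b:].sum()
        if nL == 0 or nR == 0:
            continue
        sL, sR = sums[:b].sum(), sums[b:].sum()
        qL, qR = sq_sums[:b].sum(), sq_sums[b:].sum()
        mseL = qL / nL - (sL / nL) ** 2              # variance = E[y^2] - (E[y])^2
        mseR = qR / nR - (sR / nR) ** 2
        objective = (nL * mseL + nR * mseR) / (nL + nR)   # weighted child impurity
        if objective < best[1]:
            best = (edges[b], objective)
    return best
```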
1.1. Related Works
In 2001, Breiman [2] introduced the RF algorithm by combining the bagging technique, the random subspace method, and the Classification and Regression Trees (CART) algorithm. With the rapid growth of information technology, datasets have expanded dramatically in both size and complexity, and traditional RF algorithms face significant challenges in meeting time and computational demands [16]. Enhancing RFs to handle large datasets has therefore become a major focus in both research and industry [17].
Recent advancements have emerged in algorithmic optimization and parallel computing techniques. Yates and Islam [9] developed FastForest, which incorporates Subbagging to reduce the size of bootstrap samples, Logarithmic Split-Point Sampling (LSPS) to decrease the computational cost of node splits, and Dynamic Restricted Subspacing (DRS) to adjust the feature subset size during feature selection. These optimizations collectively reduce the data processed by each tree, thereby accelerating node splitting. In the context of incremental learning, Domingos and Hulten [18] introduced the Hoeffding Tree (HT) in the Very Fast Decision Tree (VFDT) system, which uses the Hoeffding bound to guide the choice of decision nodes, ensuring a rapid and effective approximation of the optimal decision from limited samples. The Extremely Fast Decision Tree (EFDT) algorithm [19] builds upon HT by introducing the Hoeffding Anytime Tree (HATT), which continuously monitors and updates splits as more data become available. Both HT and HATT are incremental decision tree learning techniques, particularly suited to big data environments and data stream applications.
Parallel computing frameworks have also been leveraged to enhance decision tree algorithms. Mu et al. [20] proposed a parallel decision tree algorithm based on MapReduce, using Pearson’s correlation coefficient for optimal split selection. Xu [21] introduced an improved RF algorithm based on Spark, leveraging the Fayyad boundary point principle for efficient feature discretization. Yin et al. [22] presented a fast parallel RF algorithm on Spark, integrating a modified Gini coefficient to reduce feature redundancy and applying an approximate equal-frequency binning method for split optimization.
Although these algorithms have sped up the training of decision tree ensembles, most of them still rely on exhaustive searches to find the optimal split at each node, which can be costly for large datasets. MABSplit presents a cutting-edge solution to this bottleneck [23]. At its core, MABSplit treats each candidate feature–threshold pair as an arm in a multi-armed bandit (MAB) problem, aiming to identify the optimal split efficiently by estimating each candidate’s impurity reduction through sampling. Specifically, the algorithm iteratively samples batches of data points, updates confidence intervals on the impurity-reduction estimate of each candidate split, and eliminates splits whose lower confidence bound exceeds the upper confidence bound of the currently best-performing candidate. This adaptive procedure reduces the computational complexity from linear to logarithmic in the number of samples, achieving substantial speed-ups without compromising predictive accuracy.
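The following schematic sketches this batched confidence-interval elimination principle. It is an illustration of the idea, not the authors' implementation: the helper `estimate_fn`, its return convention, and the confidence constant `conf` are hypothetical.

```python
# Schematic of batched confidence-interval elimination (illustration only).
import numpy as np

def eliminate_arms(estimate_fn, arms, data, batch=256, conf=2.0, seed=0):
    """arms: candidate splits; estimate_fn(arm, rows) -> batch estimate of the
    objective (lower is better). Returns the surviving best arm."""
    rng = np.random.default_rng(seed)
    mu = {a: 0.0 for a in arms}                     # running objective estimates
    n = {a: 0 for a in arms}
    active = set(arms)
    while len(active) > 1 and min(n[a] for a in active) < len(data):
        idx = rng.choice(len(data), size=batch, replace=True)
        for a in active:
            est = estimate_fn(a, data[idx])         # objective estimated on this batch
            mu[a] = (mu[a] * n[a] + est * batch) / (n[a] + batch)
            n[a] += batch
        width = {a: conf / np.sqrt(n[a]) for a in active}   # CI half-width ~ 1/sqrt(n)
        best = min(active, key=lambda a: mu[a])             # currently best-performing arm
        threshold = mu[best] + width[best]                  # its upper confidence bound
        active = {a for a in active if mu[a] - width[a] <= threshold}  # drop provably worse arms
    return min(active, key=lambda a: mu[a])
```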
Despite its considerable strengths, MABSplit has certain aspects that could be further enhanced:
Lack of Memory Mechanism: MABSplit has no mechanism for leveraging information gathered from previously explored splits. Valuable information from earlier computations is not used to improve subsequent split evaluations, so exploration in the early phase is less efficient than it could be.
Lower Efficiency for Similar Candidates: The computational efficiency of MABSplit relies on sufficient heterogeneity among the true impurity reductions of different feature–threshold pairs. When most splits have similar impurity reductions (as in highly symmetric datasets), MABSplit fails to achieve its promised logarithmic sample complexity and reduces to a batched version of the naïve approach, resulting in no significant speed advantage.
Reduced Accuracy with Limited Samples: MABSplit employs confidence intervals to estimate split quality, derived from asymptotic statistical properties of impurity measures such as the Gini impurity and entropy. However, these estimates may become unreliable when sample sizes are small or class probabilities approach extreme values.
1.2. Our Contributions
In this paper, we introduce BayesSplit, a novel node-splitting algorithm that extends MABSplit to improve the computational efficiency and predictive accuracy of RFs. BayesSplit treats the probability of impurity reduction as a Beta posterior distribution, which is iteratively refined based on batched observations. Based on this, posterior confidence intervals are used to adaptively select splits most likely to maximize impurity reduction, ensuring statistically robust and data-driven tree construction. Our key contributions include:
(1) A Novel Bayesian-Based Impurity Estimation Framework
A Bayesian framework is developed that treats each split’s impurity reduction as a random event with an unknown success probability. The framework initializes uninformative Beta priors that evolve into informative posteriors as observations accumulate. After each batch of samples is evaluated, the Beta parameters of candidate splits are updated according to the observed impurity reductions. These posterior distributions are used to derive confidence intervals that balance exploration of uncertain splits with exploitation of promising ones. By comparing the confidence bounds of each split across all candidates, BayesSplit eliminates suboptimal splits while allocating computational resources to the most promising candidates. This Bayesian-based Impurity Estimation Framework naturally accommodates prior knowledge and new data, enabling robust decision-making and faster convergence to optimal splits.
(2) Two Bayesian Optimization Strategies to Achieve High Computational Efficiency
Dynamic Posterior Parameter Refinement: Drawing inspiration from Thompson Sampling (TS), this strategy treats each split’s impurity reduction as a reward signal for updating the parameters of its Beta distribution. After a batch is sampled, each split is evaluated to determine whether it reduced impurity: $\alpha$ is incremented when impurity decreases and $\beta$ when it does not. This process creates a memory mechanism that accumulates evidence across iterations.
Posterior-Derived Confidence Bounding: This strategy uses each split’s posterior Beta distribution to establish confidence intervals. Rather than relying on frequentist approximations, the confidence bounds are set directly from the posterior parameters, enabling more accurate uncertainty quantification across diverse data conditions.
Our experimental results demonstrate that BayesSplit further reduces quantization errors and captures the underlying data distribution more accurately than existing approaches.
2. Algorithmic Background
To facilitate understanding of the subsequent technical content, we summarize the key notation used in this paper.
Table 1 provides a list of symbols and their meanings, serving as a reference for the algorithmic development and theoretical analysis of BayesSplit.
2.1. Node-Splitting Description in RFs and Decision Trees
An RF is composed of multiple classification or regression trees, where each tree maps the feature space to the response. Consider a dataset $\mathcal{D}$ with $n$ data points $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the $i$-th feature vector and $y_i$ is the corresponding target. Each tree in an RF is built independently on a bootstrapped dataset drawn from the original data, while a random subset of features $F$ is considered at each node.
In the decision tree node-splitting process, let $R_m$ represent the region in the feature space corresponding to node $m$ (typically a hyper-rectangle) [24]. Using the pair $(f, t)$ to split node $m$, $R_m$ is partitioned into two subregions $R_m^L$ and $R_m^R$, corresponding to the left and right child nodes of node $m$. For a node $m$ in decision tree $T$, $n_m$ denotes the number of samples falling into $R_m$. Finding the optimal split, i.e., determining the best pair $(f^*, t^*)$ that maximizes node-splitting effectiveness, is accomplished by maximizing the reduction in label impurity:

$$(f^*, t^*) = \operatorname*{arg\,max}_{f \in F,\, t \in T_f} \left[ I(R_m) - \frac{n_L}{n_m} I(R_m^L) - \frac{n_R}{n_m} I(R_m^R) \right] \tag{1}$$

where $n_L$ and $n_R$ represent the number of samples in the left and right child nodes, respectively, $I(\cdot)$ represents the impurity measure, and $T_f$ denotes the permissible thresholds for feature $f$. Popular impurity measures include the Gini index and entropy for classification, as well as the mean-squared error (MSE) for regression [25]:

$$I_{\text{Gini}}(R_m) = 1 - \sum_{k=1}^{K} p_k^2 \tag{2}$$

$$I_{\text{Entropy}}(R_m) = -\sum_{k=1}^{K} p_k \log p_k \tag{3}$$

$$I_{\text{MSE}}(R_m) = \frac{1}{n_m} \sum_{x_i \in R_m} \left( y_i - \bar{y}_m \right)^2 \tag{4}$$

where $K$ represents the number of classes in the target variable, $p_k$ is the proportion of samples at node $m$ belonging to class $k$, and $\bar{y}_m$ is calculated as $\bar{y}_m = \frac{1}{n_m} \sum_{x_i \in R_m} y_i$. In Equation (1), $I(R_m)$ denotes the impurity of node $m$ before splitting and does not depend on the split feature $f$ or threshold $t$. Since $I(R_m)$ does not affect the minimization, we simplify Equation (1) for a direct evaluation of the split effect, written as follows:

$$\mu(f, t) = \frac{n_L}{n_m} I(R_m^L) + \frac{n_R}{n_m} I(R_m^R), \qquad (f^*, t^*) = \operatorname*{arg\,min}_{f \in F,\, t \in T_f} \mu(f, t) \tag{5}$$

We define $\mu(f, t)$ as the optimization objective. Note that lower values of $\mu(f, t)$ correspond to higher impurity reductions, while higher values correspond to lower impurity reductions.
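For reference, the sketch below gives minimal implementations of the impurity measures and the objective $\mu(f, t)$ of Equations (2)–(5), assuming integer class labels for classification and real-valued targets for regression; the function names are ours.

```python
# Minimal reference implementations of Equations (2)-(5) (sketch; names are ours).
import numpy as np

def gini(y):                       # Equation (2)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):                    # Equation (3)
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mse(y):                        # Equation (4)
    return np.mean((y - y.mean()) ** 2)

def split_objective(x_f, y, t, impurity=gini):
    """mu(f, t) from Equation (5): size-weighted impurity of the two children."""
    left, right = y[x_f <= t], y[x_f > t]
    if len(left) == 0 or len(right) == 0:     # degenerate split: no reduction possible
        return impurity(y)
    n = len(y)
    return (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
```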
2.2. Confidence Interval Estimation in MABSplit
Since computing $\mu(f, t)$ exactly requires a full pass over the data, MABSplit draws $n' \ll n$ samples to construct point estimates and confidence intervals for the impurity reduction. At the core of this approach is the delta method, which transforms empirical estimates of the class distributions into reliable estimates of the impurity measures.

Let $p_k^L$ and $p_k^R$ represent the proportions of the full data points in class $k$ falling into each of the two subsets created by the split $(f, t)$. For a given split $(f, t)$, MABSplit constructs empirical estimates $\hat{p}_k^L$ and $\hat{p}_k^R$, representing the proportion of class-$k$ samples in the left and right child nodes based on the $n'$ subsamples, respectively:

$$\hat{p}_k^L = \frac{1}{n'} \sum_{i=1}^{n'} \mathbb{1}\{x_{if} \le t,\, y_i = k\}, \qquad \hat{p}_k^R = \frac{1}{n'} \sum_{i=1}^{n'} \mathbb{1}\{x_{if} > t,\, y_i = k\} \tag{6}$$

These estimates jointly follow a multinomial distribution with parameters $(n', p)$, where $p = (p_1^L, \ldots, p_K^L, p_1^R, \ldots, p_K^R)$. By the Central Limit Theorem (CLT), we obtain the following:

$$\sqrt{n'}\,(\hat{p} - p) \xrightarrow{d} \mathcal{N}(0, \Sigma) \tag{7}$$

where $\hat{p} = (\hat{p}_1^L, \ldots, \hat{p}_K^L, \hat{p}_1^R, \ldots, \hat{p}_K^R)$ and $\Sigma$ is the corresponding covariance matrix. For the impurity metrics considered here (e.g., Gini or entropy), $\mu(f, t)$ can be written as a function of $p$. Let $\nabla\mu(p)$ be the gradient of $\mu$ with respect to $p$. Applying the delta method, we obtain the following:

$$\sqrt{n'}\,\big(\mu(\hat{p}) - \mu(p)\big) \xrightarrow{d} \mathcal{N}\big(0,\; \nabla\mu(p)^{\top} \Sigma\, \nabla\mu(p)\big) \tag{8}$$

This allows MABSplit to construct confidence intervals whose widths scale as $O(1/\sqrt{n'})$; these confidence intervals are asymptotically valid as $n' \to \infty$. In MABSplit, each batch of data points updates $\hat{p}$ and $\Sigma$, which in turn refine the point estimates and their corresponding confidence intervals.
While the delta method provides a theoretically sound framework for constructing confidence intervals, its practical reliability depends on the validity of the asymptotic assumptions guaranteed by the CLT. Specifically, Equation (8) shows that the width of the confidence intervals depends on the sample size through a $1/\sqrt{n'}$ scaling. In the early stages of MABSplit, where only small batches of data are available, this approximation may break down. For example, when the class proportions approach 0, the derivative of impurity metrics such as entropy becomes unstable, leading to large variance and unreliable confidence intervals.
In summary, MABSplit exhibits reduced accuracy with limited samples, as statistical estimation under small-sample conditions can introduce substantial variance in impurity estimates.
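To make the delta-method construction of Equation (8) concrete, the sketch below computes a confidence interval for the Gini impurity of a single child node from a subsample of labels. It is an illustration under our own assumptions (a 95% normal interval, Gini impurity only), not the MABSplit code.

```python
# Delta-method confidence interval for the Gini impurity of one node (sketch only).
import numpy as np

def gini_delta_ci(y_sample, n_classes, z=1.96):
    """Approximate 95% CI for G(p) = 1 - sum_k p_k^2 from a label subsample."""
    n = len(y_sample)
    p_hat = np.bincount(y_sample, minlength=n_classes) / n
    g_hat = 1.0 - np.sum(p_hat ** 2)

    grad = -2.0 * p_hat                                   # dG/dp_k = -2 p_k
    cov = (np.diag(p_hat) - np.outer(p_hat, p_hat)) / n   # multinomial covariance of p_hat
    var = grad @ cov @ grad                               # delta method: grad' Sigma grad
    half_width = z * np.sqrt(max(var, 0.0))
    return g_hat - half_width, g_hat + half_width
```

When any estimated proportion is 0 or 1, the variance estimate degenerates; for entropy, the gradient term $-(1 + \log p_k)$ additionally blows up as $p_k \to 0$, which is exactly the small-sample failure mode discussed above.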
3. BayesSplit: A Bayesian Node-Splitting Algorithm
3.1. Overview of the Framework
BayesSplit treats impurity reduction as a Bernoulli event with Beta-conjugate priors, as shown in Figure 1. The framework begins with uninformative priors, which are gradually updated into informative posteriors as observations accumulate. After each batch of samples is evaluated, the Beta parameters of the candidate splits are updated based on the observed impurity reductions.
The framework consists of three key components: (1) InitializePosterior establishes posterior distributions for each candidate split based on initial impurity evaluations; (2) UpdatePosteriorAndBounds implements the Dynamic Posterior Parameter Refinement strategy, updating beliefs about split effectiveness based on empirical observations; and (3) Filter uses Posterior-Derived Confidence Bounds to eliminate splits that are demonstrably suboptimal, focusing resources on promising candidates.
The precise splitting approach of BayesSplit is outlined in Algorithm 1. We preprocess the input data to identify candidate splits $(f, t)$, or arms, for each feature $f$. All potential solutions to Equation (5) are tracked by maintaining a set $S_{\text{solution}}$, which initially includes every candidate arm $(f, t)$.
Algorithm 1 BayesSplit
1: S_solution ← {(f, t) : f ∈ F, t ∈ T_f}  // Set of potential solutions to Equation (5)
2: n_used ← 0  // Number of data points sampled
3: for all (f, t) ∈ S_solution do
4:   μ̂_ft ← ∅, CI_ft ← (−∞, +∞)  // Initialize mean and CI for each arm
5: end for
6: for all f ∈ F do
7:   Create empty histogram h_f with equally spaced bins
8: end for
9: (α, β) ← InitializePosterior(D, h, I)  // Set up posterior distributions
10: while n_used < n and |S_solution| > 1 do
11:   Draw a batch sample X_B of size B with replacement from D
12:   for all unique f in S_solution do
13:     for all x in X_B do
14:       Insert x_f into histogram h_f  // Update histograms with sampled data
15:     end for
16:   end for
17:   for all (f, t) ∈ S_solution do
18:     Update μ̂_ft based on histogram h_f
19:   end for
20:   (α, β, CI) ← UpdatePosteriorAndBounds(S_solution, α, β)  // Adjust posterior and refine bounds
21:   S_solution ← Filter(S_solution, CI)  // Retain promising splits
22:   n_used ← n_used + B
23: end while
24: if |S_solution| = 1 then
25:   return the single remaining (f, t) ∈ S_solution
26: else
27:   Compute μ(f, t) exactly for all (f, t) ∈ S_solution
28:   return arg min_{(f, t) ∈ S_solution} μ(f, t)
29: end if
Algorithm 1, as depicted in Figure 1, begins with the candidate splits at the top and produces the optimal split $(f^*, t^*)$ at the bottom. In each iteration, it samples a batch of data points and updates histograms to refine the impurity estimates. When evaluating a candidate split, BayesSplit calculates the impurity gain $\Delta_{ft}$ and updates the posterior Beta parameters, which are then used to compute confidence intervals for each split. The algorithm renews $S_{\text{solution}}$ until convergence by eliminating suboptimal splits.
This Bayesian framework fundamentally distinguishes BayesSplit from previous approaches by providing adaptive, accurate, and computationally efficient node-splitting decisions, thus enhancing both computational efficiency and predictive accuracy of RFs.
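A compact Python rendering of the loop in Algorithm 1 is sketched below under our own assumptions: the histogram bookkeeping is folded into an abstract `objective` callback, the confidence multiplier `c` and all helper names are ours, and remaining ties are broken by an exact pass. It is meant to show the control flow, not to reproduce the authors' implementation.

```python
# Sketch of the BayesSplit loop (assumptions: abstract `objective`, multiplier c=2.0).
import numpy as np

def bayes_split(X, y, candidates, objective, batch=256, c=2.0, max_used=None, seed=0):
    """candidates: list of (feature, threshold) arms.
    objective(f, t, idx): estimate of mu(f, t) from the rows indexed by idx (lower is better)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    max_used = max_used or n
    active = list(candidates)
    alpha = {a: 1.0 for a in active}                   # Beta(1, 1) non-informative priors
    beta = {a: 1.0 for a in active}
    mu_prev = {a: np.inf for a in active}
    n_used = 0

    while n_used < max_used and len(active) > 1:
        idx = rng.choice(n, size=batch, replace=True)  # draw a batch of data points
        for a in active:
            mu_new = objective(a[0], a[1], idx)
            if mu_new < mu_prev[a]:                    # Dynamic Posterior Parameter Refinement:
                alpha[a] += 1.0                        #   reward = 1 if the estimate improved
            else:
                beta[a] += 1.0                         #   reward = 0 otherwise
            mu_prev[a] = mu_new
        # Posterior-Derived Confidence Bounding from the Beta mean and standard deviation.
        mean = {a: alpha[a] / (alpha[a] + beta[a]) for a in active}
        std = {a: np.sqrt(alpha[a] * beta[a] /
                          ((alpha[a] + beta[a]) ** 2 * (alpha[a] + beta[a] + 1)))
               for a in active}
        best_lcb = max(mean[a] - c * std[a] for a in active)
        active = [a for a in active if mean[a] + c * std[a] >= best_lcb]   # Filter step
        n_used += batch

    if len(active) == 1:
        return active[0]
    exact = {a: objective(a[0], a[1], np.arange(n)) for a in active}       # exact fallback
    return min(active, key=lambda a: exact[a])
```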
3.2. Dynamic Posterior Parameter Refinement
BayesSplit is inspired by TS, a seminal Bayesian heuristic for solving MAB problems [26]. TS has shown robust performance compared to alternatives such as Upper Confidence Bound (UCB) algorithms [27]. It is widely applied in Bernoulli bandit problems, where rewards are binary, with a value of 1 for success and 0 for failure. In BayesSplit, “success” specifically corresponds to a split that successfully reduces node impurity. To clarify the application of TS within BayesSplit, we begin by briefly reviewing the Beta-Bernoulli bandit.
Beta distributions serve as natural representations of uncertainty for Bernoulli trials due to their conjugacy properties, meaning that if the prior is Beta$(\alpha, \beta)$, the posterior after observing outcomes from Bernoulli trials remains a Beta distribution with updated parameters. The probability density function (PDF) of the Beta distribution on the interval $[0, 1]$ is given by the following:

$$p(\theta) = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)} \tag{9}$$

where $B(\alpha, \beta)$ is the Beta function, defined as follows:

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)} \tag{10}$$
Here, $\Gamma(\cdot)$ denotes the gamma function, which generalizes the factorial.
Suppose there are $K$ actions, and when action $k$ is played, it yields a reward of 1 with probability $\theta_k$ and a reward of 0 with probability $1 - \theta_k$ [28]. Each $\theta_k$ represents the probability of success, or the mean reward. Once an action $k$ is chosen, the resulting reward $r_t$ is generated with success probability $\theta_k$. We assume an independent prior belief for each $\theta_k$, where these priors follow a Beta distribution with parameters $\alpha_k$ and $\beta_k$. For each action $k$, the prior PDF of $\theta_k$ is calculated as follows:

$$p(\theta_k) = \frac{\Gamma(\alpha_k + \beta_k)}{\Gamma(\alpha_k)\,\Gamma(\beta_k)}\, \theta_k^{\alpha_k - 1} (1 - \theta_k)^{\beta_k - 1} \tag{11}$$

If action $x_t = k$ is chosen at step $t$, the parameters are updated based on the observed reward $r_t$. The posterior update for each action’s distribution follows the simple rule:

$$(\alpha_k, \beta_k) \leftarrow \begin{cases} (\alpha_k, \beta_k) & \text{if } x_t \ne k \\ (\alpha_k + r_t,\; \beta_k + 1 - r_t) & \text{if } x_t = k \end{cases} \tag{12}$$
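The short sketch below runs this Beta-Bernoulli update in a Thompson-Sampling loop; the variable names and the toy success probabilities are ours, chosen only to illustrate the update rule above.

```python
# Beta-Bernoulli Thompson Sampling with the posterior update rule above (toy sketch).
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([0.3, 0.55, 0.6])          # unknown success probabilities (toy values)
alpha = np.ones(len(true_theta))                 # Beta(1, 1) priors
beta = np.ones(len(true_theta))

for t in range(2000):
    samples = rng.beta(alpha, beta)              # one posterior draw per action
    k = int(np.argmax(samples))                  # play the action with the largest draw
    reward = rng.random() < true_theta[k]        # Bernoulli reward
    alpha[k] += reward                           # success: alpha_k <- alpha_k + r_t
    beta[k] += 1 - reward                        # failure: beta_k  <- beta_k + (1 - r_t)

print(alpha / (alpha + beta))                    # posterior means concentrate near true_theta
```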
Initially, BayesSplit assigns a non-informative Beta$(1, 1)$ prior to each candidate split, representing uncertainty about split effectiveness. Before the main iterations begin, the algorithm performs a posterior initialization phase in which each candidate split undergoes a preliminary impurity reduction evaluation. If this evaluation shows that a split reduces impurity (a success), its parameters update as $\alpha \leftarrow \alpha + 1$; otherwise (a failure), $\beta \leftarrow \beta + 1$. This initialization step provides an informed starting point for the posterior distributions, leveraging global data characteristics to guide early exploration (see Algorithm 2). It ensures that all candidate splits begin without bias and are immediately updated based on observed impurity reductions. As more batches are processed, the influence of the prior becomes negligible, and split selection is determined by empirical evidence.
In each subsequent iteration, BayesSplit samples a new batch of data $X_B$ to evaluate all candidate splits $(f, t)$ in $S_{\text{solution}}$. This ongoing update process refines the value of $\hat{\mu}_{ft}$, the mean objective estimate computed from the histograms. Using the current estimate, BayesSplit calculates the impurity gain $\Delta_{ft}$ between consecutive iterations. The posterior parameters are then updated based on this gain: specifically, when $\Delta_{ft} > 0$, confidence increases by incrementing $\alpha_{ft}$; when $\Delta_{ft} \le 0$, confidence decreases by incrementing $\beta_{ft}$. This dynamic mechanism actively accumulates evidence across iterations, enhancing convergence efficiency (see Algorithm 3).
Algorithm 2 InitializePosterior
Require: D: input data; h: list of histograms; I(·): impurity measure; (α₀, β₀): non-informative prior parameters of the Beta distribution
Ensure: initialized posterior parameters (α, β)
1: (α_fb, β_fb) ← (α₀, β₀) for every candidate split
2: for each feature f do  // Iterate over all features
3:   for each bin b in h_f do  // Iterate over bins in histogram
4:     Compute impurity reduction Δ for splitting on bin b
5:     if Δ > 0 then  // Check if split reduces impurity
6:       α_fb ← α_fb + 1
7:     else
8:       β_fb ← β_fb + 1
9:     end if
10:   end for
11: end for
12: (α, β) ← {(α_fb, β_fb)} for all candidate splits
13: return (α, β)
3.3. Posterior-Derived Confidence Bounding
Unlike the classical TS, which selects arms based on posterior sampling alone, BayesSplit uses posterior distributions to construct explicit confidence intervals for impurity reductions. Arms whose confidence intervals indicate potential optimality are retained for further exploration; others are eliminated early.
Confidence intervals in BayesSplit are derived directly from the statistical properties of the posterior Beta distribution, providing a theoretically sound basis for uncertainty quantification. The moment-generating function (MGF) of the Beta distribution can be expressed through a confluent hypergeometric function, written as follows:

$$M_X(s) = \mathbb{E}\!\left[e^{sX}\right] = {}_{1}F_{1}(\alpha;\, \alpha + \beta;\, s) = 1 + \sum_{j=1}^{\infty} \left( \prod_{r=0}^{j-1} \frac{\alpha + r}{\alpha + \beta + r} \right) \frac{s^{j}}{j!} \tag{13}$$

From this, the $j$-th raw moment of a Beta$(\alpha, \beta)$ random variable $X$ is given by:

$$\mathbb{E}\!\left[X^{j}\right] = \frac{(\alpha)_j}{(\alpha + \beta)_j} \tag{14}$$

where $(x)_j = x(x+1)\cdots(x+j-1)$ is the Pochhammer symbol, or rising factorial. The mean and variance are therefore as follows:

$$\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}, \qquad \operatorname{Var}[X] = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} \tag{15}$$

For each candidate split $(f, t)$ with posterior parameters $\alpha_{ft}$ and $\beta_{ft}$, BayesSplit computes the posterior mean as $\alpha_{ft} / (\alpha_{ft} + \beta_{ft})$, which represents our current belief about the probability that the split reduces impurity, based on all observed evidence. The half-width of the confidence interval $C_{ft}$ is based on the standard deviation of the posterior Beta distribution, $\sigma_{ft} = \sqrt{\alpha_{ft}\beta_{ft} / \big((\alpha_{ft} + \beta_{ft})^2 (\alpha_{ft} + \beta_{ft} + 1)\big)}$, which naturally decreases as more samples are accumulated.
Algorithm 3 UpdatePosteriorAndBounds
1: for all (f, t) ∈ S_solution do
2:   Compute impurity gain Δ_ft ← μ̂_ft^prev − μ̂_ft  // Compute impurity gains from new batches
3:   if Δ_ft > 0 then
4:     Set reduced flag r_ft ← 1
5:   else
6:     Set reduced flag r_ft ← 0
7:   end if
8:   Update posterior parameters:
9:     α_ft ← α_ft + r_ft  // Increase confidence if split reduced impurity
10:     β_ft ← β_ft + (1 − r_ft)  // Decrease confidence if split increased impurity
11: end for
12: Update CI_ft from the posterior mean and standard deviation for all (f, t) ∈ S_solution
13: return updated (α, β, CI)
At each iteration, BayesSplit applies a filtering mechanism that retains only those splits whose confidence intervals still overlap with that of the currently best arm, i.e., splits whose bounds do not place them below the best arm with high probability. This criterion ensures that only splits that are demonstrably suboptimal are removed from consideration. As sampling progresses, the confidence intervals narrow, enabling the algorithm to identify the optimal split while minimizing exploration of suboptimal splits. The iterative process continues until either a single optimal split remains or the algorithm reaches a predefined computational budget.
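As a variant of this filtering step, the sketch below uses exact Beta posterior quantiles (via SciPy) instead of mean ± c·std bounds; the 95% credible level and the function name are our choices, and this is not necessarily the authors' exact rule.

```python
# Filtering sketch using exact Beta posterior quantiles (assumption: 95% credible level).
from scipy.stats import beta as beta_dist

def filter_candidates(alpha, beta, active, level=0.95):
    """alpha, beta: dicts mapping each arm to its posterior Beta parameters."""
    lo = (1 - level) / 2
    hi = 1 - lo
    lcb = {a: beta_dist.ppf(lo, alpha[a], beta[a]) for a in active}
    ucb = {a: beta_dist.ppf(hi, alpha[a], beta[a]) for a in active}
    best_lcb = max(lcb.values())                 # lower bound of the most promising arm
    # Keep arms whose upper bound still reaches the best arm's lower bound.
    return [a for a in active if ucb[a] >= best_lcb]
```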
While both Dynamic Posterior Parameter Refinement and Posterior-Derived Confidence Bounding contribute to BayesSplit’s efficiency, their roles are complementary rather than independent. Dynamic Posterior Parameter Refinement is the primary driver of performance gains. It accumulates evidence across iterations to update split-quality estimates adaptively, addressing MABSplit’s lack of memory mechanism by leveraging historical data to inform future sampling. The accumulated evidence enables faster convergence to optimal splits, particularly when many feature–threshold pairs have similar impurity reductions, a scenario where MABSplit requires substantially more samples to distinguish among candidates. Posterior-Derived Confidence Bounding adds statistical robustness and accuracy under limited-sample conditions, since its confidence intervals are derived directly from the exact properties of the Beta distribution.
The synergy between the two strategies is crucial: Dynamic Posterior Parameter Refinement provides increasingly informative posteriors that tighten the confidence bounds, and Posterior-Derived Confidence Bounding ensures statistically sound elimination decisions, creating a feedback loop that focuses computational resources on the most competitive splits.
Implementation Details: In practice, sampling without replacement is utilized for efficiency, similar to MABSplit, achieving substantial computational savings without significantly impacting performance.
3.4. Convergence Analysis
In this section, we show that BayesSplit’s posterior estimate of the impurity reduction for each feature–threshold pair converges to the true parameter as the number of observations increases. Based on these estimates, the algorithm eliminates suboptimal splits with high confidence, and the retained candidate converges to the optimal split with high probability. For any feature–threshold pair $(f, t)$, let $\theta_{ft}$ denote the true probability that split $(f, t)$ successfully reduces node impurity. We treat the event of observing an impurity reduction as a Bernoulli trial with success probability $\theta_{ft}$.
We make the following standard assumptions for Bayesian consistency:
1. The Bernoulli likelihood is identifiable: different parameter values induce different likelihoods.
2. The prior Beta distribution assigns non-zero density to the true parameter $\theta_{ft}$.
Lemma 1. For any feature–threshold pair $(f, t)$, given a sufficient number of samples, the posterior estimate in BayesSplit converges to the true impurity reduction probability $\theta_{ft}$. As $N \to \infty$, the posterior mean tends to $\theta_{ft}$, and the posterior distribution concentrates its mass at $\theta_{ft}$.
Proof. We use a conjugate Beta–Bernoulli model for inference. Suppose the prior for $\theta_{ft}$ is Beta$(\alpha_0, \beta_0)$. After observing $N$ independent impurity-reduction outcomes $r_1, r_2, \ldots, r_N$ for split $(f, t)$, where $r_i \in \{0, 1\}$ indicates whether impurity is reduced ($r_i = 1$) or not ($r_i = 0$), the posterior distribution is as follows:

$$\theta_{ft} \mid r_1, \ldots, r_N \sim \text{Beta}(\alpha_N, \beta_N) \tag{16}$$

with updated parameters written as follows:

$$\alpha_N = \alpha_0 + \sum_{i=1}^{N} r_i, \qquad \beta_N = \beta_0 + N - \sum_{i=1}^{N} r_i \tag{17}$$

The posterior mean at this stage is $\mathbb{E}[\theta_{ft} \mid r_{1:N}] = \alpha_N / (\alpha_N + \beta_N)$. We can rewrite the posterior mean as follows:

$$\frac{\alpha_N}{\alpha_N + \beta_N} = \frac{\alpha_0 + \beta_0}{\alpha_0 + \beta_0 + N} \cdot \frac{\alpha_0}{\alpha_0 + \beta_0} + \frac{N}{\alpha_0 + \beta_0 + N} \cdot \frac{1}{N}\sum_{i=1}^{N} r_i \tag{18}$$

As $N$ grows large, the weight on the prior mean vanishes, while the weight on the sample mean approaches 1. By the Law of Large Numbers, the sample mean $\frac{1}{N}\sum_{i=1}^{N} r_i$ converges to $\theta_{ft}$. Therefore, $\mathbb{E}[\theta_{ft} \mid r_{1:N}] \to \theta_{ft}$ as $N \to \infty$.

In the Beta–Bernoulli model, the posterior variance of the random variable $\theta_{ft}$ given $r_{1:N}$ is written as follows:

$$\operatorname{Var}[\theta_{ft} \mid r_{1:N}] = \frac{\alpha_N \beta_N}{(\alpha_N + \beta_N)^2 (\alpha_N + \beta_N + 1)} \tag{19}$$

Thus, as $N$ increases, the posterior variance shrinks, and the distribution becomes increasingly concentrated around $\theta_{ft}$. Another perspective uses the Kullback–Leibler (KL) divergence [29]. For any candidate value $\theta' \ne \theta_{ft}$, the KL divergence between the Bernoulli distributions with parameters $\theta_{ft}$ and $\theta'$ is written as follows:

$$D_{\mathrm{KL}}\big(\theta_{ft} \,\|\, \theta'\big) = \theta_{ft} \log \frac{\theta_{ft}}{\theta'} + (1 - \theta_{ft}) \log \frac{1 - \theta_{ft}}{1 - \theta'} \tag{20}$$

By Gibbs’ inequality, $D_{\mathrm{KL}}(\theta_{ft} \,\|\, \theta') \ge 0$, with equality only when $\theta' = \theta_{ft}$. With i.i.d. observations, the average log-likelihood ratio converges to $-D_{\mathrm{KL}}(\theta_{ft} \,\|\, \theta')$, implying the following:

$$\frac{1}{N} \log \frac{\prod_{i=1}^{N} p(r_i \mid \theta')}{\prod_{i=1}^{N} p(r_i \mid \theta_{ft})} \longrightarrow -D_{\mathrm{KL}}\big(\theta_{ft} \,\|\, \theta'\big) < 0 \tag{21}$$

The likelihood of the data under any $\theta' \ne \theta_{ft}$ therefore diminishes exponentially relative to the likelihood under $\theta_{ft}$. Consequently, the posterior probability of such a $\theta'$ becomes negligible, and the posterior mass concentrates at $\theta_{ft}$ as $N \to \infty$. □
The above analysis shows that BayesSplit’s posterior estimate converges to the true impurity reduction probability $\theta_{ft}$, and its uncertainty decreases as $N$ grows. Given enough sampling, the algorithm will, with high probability, correctly identify the optimal split.
3.5. Optimal Solution Regret Bound
BayesSplit’s multi-armed bandit formulation allows us to derive a finite-time regret guarantee for the node-splitting process. Consider a node with $n$ data points, a feature set $F$, and permissible thresholds $T_f$ for each feature $f \in F$ (so there are $\sum_{f \in F} |T_f|$ candidate arms). Assume that $(f^*, t^*)$ is the optimal feature–threshold pair that maximizes node-splitting effectiveness, i.e., $(f^*, t^*) = \arg\min_{f \in F,\, t \in T_f} \mu(f, t)$. The following theorem provides an upper bound on the expected regret after $T$ total computations.
Theorem 1. Fix $\epsilon > 0$. For the multi-armed bandit formulation of BayesSplit, the finite-time expected regret after $T$ total computations satisfies the following:

$$\mathbb{E}[\mathcal{R}(T)] \le (1 + \epsilon) \sum_{(f, t) \ne (f^*, t^*)} \frac{\Delta_{ft} \ln T}{D_{\mathrm{KL}}\big(\theta_{ft} \,\|\, \theta^*\big)} + O(1) \tag{22}$$

where $D_{\mathrm{KL}}(\theta_{ft} \,\|\, \theta^*)$ is the KL divergence between the probability of achieving an impurity reduction with the optimal split $(f^*, t^*)$ and that of any suboptimal split $(f, t)$, defined as $D_{\mathrm{KL}}(\theta_{ft} \,\|\, \theta^*) = \theta_{ft} \log \frac{\theta_{ft}}{\theta^*} + (1 - \theta_{ft}) \log \frac{1 - \theta_{ft}}{1 - \theta^*}$, with $\Delta_{ft} = \theta^* - \theta_{ft}$.
A complete derivation and step-by-step proof of Theorem 1 can be found in Appendix A. The proof follows standard multi-armed bandit analyses by bounding the number of times a suboptimal split $(f, t)$ can be selected. The regret bound implies that BayesSplit rapidly converges on the optimal split when one is clearly superior. In scenarios where several splits exhibit comparable performance, the cumulative regret increases slightly, yet it remains sublinear overall.
4. Performance Analysis
To evaluate the effectiveness of BayesSplit, we conducted a series of experiments. Initially, the wall-clock training time and corresponding generalization performance of decision tree ensembles utilizing BayesSplit were measured. Subsequently, we assessed the computational efficiency and resulting generalization performance under a fixed computational budget. All experiments were performed on a ThinkPad T14 Gen 3 laptop, equipped with a 12th Generation Intel Core i7-1260P processor and 32 GB of RAM.
In these comparative experiments, we evaluated the histogrammed versions of three decision tree ensembles with and without BayesSplit: Random Forest (RF), ExtraTrees [30], and Random Patches (RP) [31]. Random Forest creates an ensemble of decision trees built on bootstrapped samples, while a random subset of features is considered at every node split to balance variance reduction and computational complexity. ExtraTrees further randomizes this process by randomly selecting split thresholds, reducing variance and training time but potentially increasing bias. Random Patches samples subsets of instances and features simultaneously for each tree, enhancing ensemble diversity and improving generalization, particularly for high-dimensional datasets.
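For reference, all three baseline ensembles are available in scikit-learn, with a Random-Patches-style learner obtainable from BaggingClassifier by subsampling both rows and columns. The settings below are illustrative only and are not the paper's experimental configuration.

```python
# Baseline ensembles in scikit-learn (illustrative settings, not the paper's configuration).
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              BaggingClassifier)
from sklearn.tree import DecisionTreeClassifier

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt")
extra = ExtraTreesClassifier(n_estimators=100, max_features="sqrt")
random_patches = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # called `base_estimator` in scikit-learn < 1.2
    n_estimators=100,
    max_samples=0.8,                      # subsample instances for each tree
    max_features=0.8,                     # subsample features for each tree
    bootstrap=True,
    bootstrap_features=False,
)
```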
We performed experiments using all datasets originally employed by MABSplit, supplemented with additional publicly available classification and regression datasets from the UCI Machine Learning Repository. This dataset selection ensured coverage of varying complexities and data characteristics, while consistent preprocessing and benchmarking allowed for a fair comparison of BayesSplit, MABSplit, and the naïve approach. In addition to these core experiments, we conducted additional experiments to evaluate BayesSplit’s sensitivity to hyperparameter settings and its performance on imbalanced datasets, with detailed results provided in Appendix B.
4.1. Wall-Clock Time Comparisons
Classification: We assessed the performance of BayesSplit by comparing it with MABSplit and the baseline brute-force solver on several classification tasks, evaluating wall-clock training time, number of histogram insertions, and test accuracy. As shown in Figure 2, incorporating the BayesSplit and MABSplit subroutines significantly reduces training time relative to the naïve approach, with BayesSplit providing up to a 95% reduction compared to the baseline methods. Table 2 further confirms their efficiency by showing fewer histogram insertions, with BayesSplit’s refined histogram-based splits providing additional computational savings for the ExtraTrees model. The identical insertion counts between BayesSplit and MABSplit for the RF and RP models occur because both algorithms eliminate candidates at similar rates when significant differences in split quality are evident. Given the strong correlation between histogram insertions and training time, improving node splitting by reducing sample complexity is clearly justified.
Figure 3 illustrates that integrating BayesSplit and MABSplit produces test accuracy comparable to that of the baseline models. Notably, BayesSplit provides significant advantages to the ExtraTrees model, which typically relies on random split thresholds that can lead to suboptimal decisions. Through Bayesian optimization, BayesSplit refines those random splits, resulting in more precise decision boundaries and improved accuracy.
Regression: Across four regression datasets with diverse characteristics, BayesSplit consistently reduces training time by 20–70% compared to MABSplit, as shown in Figure 4. To ensure fairness given the varying histogram bin counts of the baseline regression models, we excluded the number of histogram insertions from this analysis. Table 3 shows that using BayesSplit yields lower test MSEs than the naïve solver or MABSplit; the improvement in generalization performance along with reduced training time highlights the effectiveness of Bayesian optimization in efficiently focusing on the most promising splits.
4.2. Fixed Budget Comparisons
Classification: Under a fixed computational budget defined by a set number of histogram insertions, forests trained with BayesSplit split more nodes and require fewer data-point queries than those using the naïve solver. Consequently, these forests can accommodate more trees and achieve improved generalization.
Figure 5 compares the number of classification trees built under a fixed budget, with the corresponding test accuracies reported in Table 4. For RF and RP, BayesSplit and MABSplit perform similarly across all five datasets. However, in ExtraTrees, BayesSplit surpasses MABSplit in both tree count and test accuracy, indicating that BayesSplit effectively captures and leverages the additional randomness in ExtraTrees’ split-selection process.
Regression: Under a fixed computational budget, integrating BayesSplit consistently outperforms both the naïve approach and MABSplit, allowing more trees to be trained and yielding lower test MSEs in all baseline models.
Figure 6 shows that BayesSplit supports additional regression trees within the same budget, and Table 5 confirms notable improvements in predictive performance via reduced test MSEs. Notably, compared to MABSplit, BayesSplit reduces the test MSEs by up to 25%.
BayesSplit proves particularly well-suited to regression tasks due to the continuous nature of the target variable, which enables more precise estimation of impurity reductions. This precision results in tighter confidence intervals within the Bayesian updating framework, allowing the algorithm to allocate more samples to regions of high uncertainty while limiting evaluations in low-variance segments. Consequently, BayesSplit converges more rapidly toward optimal splits, reducing unnecessary computations and enhancing overall efficiency.
4.3. Feature Stability Comparisons
Beyond predictive performance, Random Forests offer valuable insights into feature importance, thereby enhancing model explainability [32]. We evaluate feature importance using two well-established metrics: Out-of-Bag Permutation Importance (OOB-PI) and Mean Decrease in Impurity (MDI). Specifically, OOB-PI quantifies the change in out-of-bag error when the values of a feature are shuffled, reflecting its contribution to predictive accuracy. MDI, on the other hand, calculates the average reduction in impurity (e.g., Gini index or entropy for classification, MSE for regression) across all decision nodes where a given feature is used for splitting, indicating its effectiveness in reducing overall impurity within the model. To ensure robustness, the top-$k$ features identified by these metrics are further assessed for stability using standardized formulas [33].
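Both metrics can be computed with scikit-learn, as sketched below; we use a held-out split in place of the out-of-bag samples used for OOB-PI in the paper, and the dataset, model settings, and top-$k$ value are illustrative only.

```python
# MDI and permutation importance with scikit-learn (sketch; held-out set stands in for OOB).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

mdi = model.feature_importances_                          # Mean Decrease in Impurity
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

top_k = 5
print("top-5 by MDI:        ", np.argsort(mdi)[::-1][:top_k])
print("top-5 by permutation:", np.argsort(perm.importances_mean)[::-1][:top_k])
```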
Table 6 shows that under a fixed computational budget, forests trained with BayesSplit achieve a 10–40% improvement in feature stability compared to MABSplit, corresponding to an average increase of approximately 30.28%. This improvement is particularly significant in resource-constrained environments, such as IoT and edge computing applications, where model interpretability and robustness are critical. By consistently identifying the most relevant features across multiple iterations, BayesSplit not only enhances predictive performance but also mitigates the risk of overfitting to irrelevant or noisy data.
5. Conclusions and Future Work
In this work, the BayesSplit algorithm was proposed as a Bayesian enhancement of MABSplit for decision-tree node splitting. While MABSplit introduces multi-armed bandit techniques with frequentist confidence intervals, BayesSplit advances this approach by developing a Bayesian-based impurity estimation framework where impurity reduction events are treated as Bernoulli trials. On the benchmark datasets, BayesSplit reduced wall-clock training time by 20–70% relative to MABSplit and by up to 95% relative to the naïve approach. Compared to MABSplit, it also lowered regression MSEs by as much as 25%.
Beyond computational efficiency, BayesSplit exhibits enhanced feature stability. Our experiments primarily focused on standard computing environments, making it an important future research direction to benchmark and optimize BayesSplit for resource-constrained devices such as smartphones and IoT platforms. Given the growing relevance of edge computing, which involves limited memory, power constraints, and specialized hardware architectures, dedicated optimizations and platform-specific considerations are critical. Future work could also extend BayesSplit to parallel frameworks such as Apache Spark, further broadening its applicability in large-scale real-time processing scenarios.
Finally, we note that gradient boosting decision tree (GBDT) frameworks such as XGBoost, LightGBM, and CatBoost remain strong baselines in predictive performance. These boosting algorithms often achieve higher accuracy than bagging-based methods on many tasks due to their sequential error-correcting training process. In contrast, our work focuses on the Random Forest framework, where trees are constructed in parallel, aiming to improve training efficiency without sacrificing accuracy. Nevertheless, integrating BayesSplit’s adaptive splitting strategy into boosting frameworks offers potential advantages, such as selectively updating the residual targets of only the necessary data points rather than recomputing residuals for the entire dataset at each iteration. Although such integration presents practical and algorithmic challenges, it represents a promising direction for future research, as indicated by recent advances such as FastForest [9]. Thus, our proposed approach complements rather than directly competes with optimized gradient boosting and hybrid methods. We intend to explore this topic in greater detail in future work.