Article

On the Oracle Properties of Bayesian Random Forest for Sparse High-Dimensional Gaussian Regression

by
Oyebayo Ridwan Olaniran
1,*,† and
Ali Rashash R. Alzahrani
2,†
1
Department of Statistics, Faculty of Physical Sciences, University of Ilorin, Ilorin 240101, Nigeria
2
Mathematics Department, Faculty of Sciences, Umm Al-Qura University, Makkah 24382, Saudi Arabia
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2023, 11(24), 4957; https://doi.org/10.3390/math11244957
Submission received: 18 October 2023 / Revised: 30 November 2023 / Accepted: 12 December 2023 / Published: 14 December 2023

Abstract: Random forest (RF) is a widely used data prediction and variable selection technique. However, the variable selection aspect of RF can become unreliable when there are more irrelevant variables than relevant ones. In response, we introduced the Bayesian random forest (BRF) method, specifically designed for high-dimensional datasets with a sparse covariate structure. Our research demonstrates that BRF possesses the oracle property, which means it achieves strong selection consistency without sacrificing estimation efficiency or incurring additional bias.

1. Introduction

Several techniques for handling high-dimensional data have been proposed from different areas of research, such as in oncology (modeling and identification of relevant genetic biomarkers for tumorous cancer cells) [1,2,3,4,5]. The methodologies of the techniques differ, but the collective standpoint is to find an efficient way to analyze high-dimensional data [6]. In a broader sense, high-dimensionality (HD) refers to a modeling situation where the number of unknown parameters p is far greater than the sample size n, that is, $p \gg n$ [7]. This scenario includes supervised regression and classification with the number of explanatory variables or features largely greater than the sample size, unsupervised learning with more attributes than samples, and hypothesis testing with more considered hypotheses than observations [8]. Ref. [9] identified the need for developing robust methods for high-dimensional data. Classical methods like ordinary least squares, logistic regression, and k-NN often break down due to an ill-conditioned design matrix when $p \gg n$. Ref. [10] described two major approaches to analyzing high-dimensional data, namely: the modification of n > p approaches to accommodate high-dimensional data or developing a new approach. Modifying approaches involves moving from complex to simple models by selecting relevant subsets of the p variables. This approach is widely referred to as variable selection.
Variable selection is an approach used to adapt existing low-dimensional data modeling techniques for high-dimensional data. Simultaneously, penalized regression involves imposing constraints on dimensionality to achieve a similar objective. The primary advantage of variable selection methods is their ability to preserve the desirable qualities of low-dimensional approaches like the maximum likelihood estimator (MLE), even though they may struggle to address the complexity of high-dimensional datasets. Penalized methods such as LASSO [11] and SCAD [12], among others, offer a partial solution to the problem but introduce bias in estimation. Both approaches share the drawback of not fully capturing the complexities of high-dimensional datasets, including interactions, nonlinearity, and non-normality [13]. One robust procedure that has been shown to overcome these challenges in both low- and high-dimensional scenarios is classification and regression trees (CART) [14,15]. CART is a non-parametric statistical method that relaxes dimensionality assumptions and naturally accommodates the modeling of interactions and nonlinearity.
The strength of CART in terms of simplicity and interpretability is offset by a significant drawback, which often leads to a loss of accuracy. In the late 20th century, a new methodological framework emerged for combining multiple models to create a more comprehensive model, known as ensemble modeling. One of the earliest ensemble techniques within the CART framework is bagging (bootstrap aggregating) [16]. The bagging process involves taking multiple versions of the bootstrap sample [17] from the training dataset and fitting an unpruned CART to each of these bootstrap samples. The final predictor is derived by averaging these different model versions. Remarkably, this procedure works well and typically outperforms its competitors in most situations. Some intuitive explanations for why and how it works were provided in [18]. This concept has spurred subsequent work, including the development of random forests (RF) [19], which present a broader framework for tree ensembles. RF enhances bagging by replacing all covariates in the CART’s splitting step with a random sub-sampling of covariates. This adjustment helps reduce the correlation between adjacent trees, thereby enhancing the predictive accuracy.
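To make the bagging recipe above concrete, the following R sketch grows unpruned CART trees on bootstrap samples and averages their predictions. It uses the rpart package as a stand-in for CART; the helper names bagged_cart and predict_bagged, the default of J = 100 trees, and the control settings are illustrative assumptions rather than the authors' code.

# A minimal bagging sketch with unpruned CART trees (illustrative, not the authors' implementation)
library(rpart)

bagged_cart <- function(formula, data, J = 100) {
  trees <- vector("list", J)
  for (j in seq_len(J)) {
    boot_idx <- sample(nrow(data), replace = TRUE)              # bootstrap sample of the training data
    trees[[j]] <- rpart(formula, data = data[boot_idx, ],
                        control = rpart.control(cp = 0, minsplit = 5))  # unpruned CART
  }
  trees
}

predict_bagged <- function(trees, newdata) {
  preds <- sapply(trees, predict, newdata = newdata)            # one column of predictions per tree
  rowMeans(preds)                                               # average over the J trees
}

Random forests add one change to this loop: at each split, only a random subset of the covariates is considered, which lowers the correlation between adjacent trees.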
The complexity of dealing with high-dimensional data has led to the development of multiple versions of random forest (RF) algorithms in the context of regression modeling. One prominent characteristic of high-dimensional datasets is their sparsity, which means that there are relatively few relevant predictors within the predictor space. This sparsity is often observed in microarray data, where only a small number of genes are associated with a specific disease outcome [13,20]. The traditional approach of RF, which involves randomly subsampling either $\sqrt{p}$ or $p/3$ of the predictors, fails to effectively capture this sparsity [21]. This failure arises from RF’s unrealistic assumption that the predictor space should be densely populated with relevant variables to achieve reasonable accuracy. In contrast, boosting techniques, as introduced by [20], specifically address this issue by boosting weak trees rather than averaging all the trees, as performed in RF. However, boosting comes at the cost of reduced predictive accuracy compared to RF.
In the context of high-dimensional data, there have been several proposals for tackling the random subsampling of predictors used by RF, which fails to distinguish relevant from irrelevant variables. Ref. [22] proposed a novel approach to address the issue of variable importance measures in RF, particularly focusing on high-dimensional data with numerous candidate predictors. The strength of the paper lies in its introduction of a computationally fast heuristic variable importance test based on a modified version of permutation variable importance, inspired by cross-validation procedures. The proposed approach is designed to efficiently handle situations where many variables lack information, which is common in high-dimensional settings. The paper substantiates its claims through simulation studies based on real data, comparing the new approach with an existing method by [23]. The results indicate that the proposed approach not only controls the type I error but also demonstrates comparable power while significantly reducing the computation time. However, the paper is weak in that the method fails to consider variable ranking, and it was also developed for the classification scenario alone.
In a study conducted by [24], a novel two-stage quality-based sampling method named ts-RF is proposed for the subspace selection of single-nucleotide polymorphisms (SNPs) in genome-wide association studies (GWAS). This method tackles the challenge of handling high-dimensional GWAS data, where a considerable portion of SNPs is irrelevant to the studied disease. The procedure involves utilizing a p-value assessment to differentiate between informative and irrelevant SNPs, followed by a further classification of informative SNPs into highly and weakly informative sub-groups. When sampling SNP subspaces to construct random forests (RF), only SNPs from these two subgroups are taken into account. This ensures that the feature subspaces used to split nodes in the trees consist solely of highly informative SNPs, potentially enhancing the accuracy of the prediction models. While the paper recognizes the limitations of traditional RF in GWAS and seeks to address them with this innovative sampling method, it does not consider the inherent interaction effects specific to high-dimensional data.
In their work, [25] introduced a novel random forest (RF) algorithm tailored to address challenges associated with high-dimensional data in classification tasks. The algorithm’s primary strength lies in its proposed subspace feature sampling method, enhancing forest diversity and randomness. This results in the creation of trees with lower prediction errors. Another notable strength is the incorporation of a greedy technique to manage cardinal categorical features during decision tree construction. This addresses the efficiency of node splitting, particularly in scenarios with very high cardinality, leading to reduced computational time for constructing the RF model. Experimental validation on various high-dimensional real datasets, including both standard machine learning and image datasets, underscores the superior performance of the proposed approach. It exhibits significant reductions in prediction errors compared to existing RF methods. However, a limitation of the approach is its variable subsampling before tree building, which might overlook the complex structures inherent in high-dimensional datasets, especially when the functional relationship between the response and predictors is nonlinear. Additionally, it is crucial to note that the method was designed exclusively for classification tasks rather than continuous response variables, as in the context of this paper.
Also, in another related study by [26], a novel approach called “enriched random forest” (ERF) was introduced to address limitations in traditional random forest methods, especially in dealing with high-dimensional datasets where the number of features significantly exceeds the number of samples, and only a small percentage of features are genuinely informative. The authors identified a decline in the performance of a traditional random forest under such conditions. To address this issue, they proposed a weighted random sampling mechanism at each node, aiming to diminish the influence of less informative features. This modification is intended to improve the performance of the random forest algorithm, particularly in scenarios where relevant features are scarce. The evaluation of the proposed method includes the use of high-dimensional micro-array datasets in both regression and classification settings. While the approach used in ERF bears similarity to the one proposed in this paper, there are notable distinctions. The differences lie in the type of weights employed, the stage at which the weights are introduced, and the Bayesian framework utilized. While ERF utilized the raw correlation between the response and predictor variables, our approach used the probability of the ordered t-statistic obtained from the slope coefficient of the simple regression analysis of the association between each variable and the response variable. Additionally, in ERF, the weight was introduced at the predictor random sampling stage, whereas we introduced the weight at the splitting stage, thereby ensuring the maintenance of the interaction modeling strength of the random forest. The concept behind ERF somewhat resembles performing variable selection before employing random forest, which, although beneficial, is more prone to neglecting the interaction effects of variables, thus potentially diminishing the predictive strength of random forest.
In addition to extensions of Random Forest (RF) concerning variable selection and predictive performance, there have been various modifications aimed at enhancing model interpretability. Despite its ease of application, RF lacks a formal model and is often characterized as a black box model [27]. Bayesian procedures are renowned for their ability to offer model interpretation by formulating the model’s posterior distribution. One of the earliest Bayesian adaptations of RF is Bayesian additive regression trees (BART, [28]). BART represents a notable Bayesian modified boosting technique that has gained significant popularity for its effectiveness in modeling intricate relationships within data. Researchers have actively explored and refined BART, positioning it as a robust approach for constructing the ensembles of classification and regression tree (CART) models. Recent methodological enhancements have broadened BART’s capabilities, addressing challenges related to computational efficiency [29], high-dimensional data handling [30], and improved variable selection through sparsity-inducing priors [31]. These advancements enhance BART’s adaptability in addressing contemporary statistical modeling challenges.
Extensions of BART, such as dynamic BART [32] for time-varying effects and spatial BART [33] tailored for spatial data, showcase its versatility. BART has demonstrated applications across diverse fields, including finance [34], healthcare [35], and environmental science [36]. Despite its successes, BART encounters challenges, particularly in high-dimensional settings, such as scalability issues and the need for further methodological developments to enhance interpretability. Previous modifications proposed by [30,31] aimed to extend BART in the context of high-dimensionality and sparsity. However, these modifications do not directly extend the RF framework; variable selection remains random, and sparsity is addressed through appropriate prior specifications, potentially limiting their effectiveness.
The methods described above primarily delved into modifications of RF that centered around adopting a greedy approach to tackle challenges related to high-dimensional data. Notably, numerous frequentist methods, including ERF, have integrated some form of variable selection before splitting. However, this approach tends to compromise the inherent strengths of RF, particularly when dealing with intricate functional relationships. In response to this limitation, our paper proposes an innovative solution that harnesses the benefits of both random and greedy variable selection approaches. On a different note, high-dimensional Bayesian approaches such as sparse-BART maintain the random variable subsampling employed by RF but only address the high-dimensional issue by specifying a particular shrinking prior. This prior compels the range of variables to be selected within a specified range. The underlying assumption is that the relevant variables will fall within this specified range; otherwise, the inherent problem with RF persists. While these high-dimensional Bayesian approaches address some challenges, they also introduce assumptions that may not always align with the underlying data distribution, potentially limiting their generalizability.
In essence, our proposed solution aims to strike a balance between the benefits of random and greedy variable selection, offering a nuanced approach that preserves the robustness of RF in capturing complex relationships within high-dimensional datasets. This stands in contrast to existing methods that often lean towards one extreme, either sacrificing interpretability or assuming specific data characteristics. Through empirical validation and comparative analysis, our approach seeks to demonstrate its effectiveness in overcoming the limitations encountered by traditional RF and its variants in high-dimensional scenarios. Specifically, the proposed procedure initiates with a thorough greedy search, systematically evaluating the contribution of each predictor. Following this, the random search stage is dynamically updated based on the ordered probabilities derived during the preceding greedy stage. This innovative two-stage approach aims to preserve the robustness of RF while simultaneously mitigating prediction errors in high-dimensional settings. By combining the strengths of both variable selection strategies, this methodology seeks to strike a balance, enhancing the adaptability of RF to intricate functional relationships in high-dimensional datasets. Through this hybrid approach, the paper endeavors to contribute to the evolution of RF methodologies for improved performance in complex and challenging data scenarios.

2. Random Forest and Sum of Trees Models

Suppose that we let $D = [y_i, x_{i1}, x_{i2}, \ldots, x_{ip}]$, $i = 1, 2, \ldots, n$, be an $n \times p$ dataset with $y_i$ assuming continuous values and $x_i = [x_{i1}, x_{i2}, \ldots, x_{ip}]$ be the vector of p covariates. Thus, we can define a single regression tree using the formulation of [28]
$$y_i = f(x_{i1}, x_{i2}, \ldots, x_{ip}) + \epsilon_i,$$
where the random noise $\epsilon_i$ is assumed to be independent and identically Gaussian distributed with a mean of 0 and a constant variance $\sigma^2$. Consequently, in the same formulation by [28], a sum-of-trees model can be defined as:
$$y_i = h(x_{i1}, x_{i2}, \ldots, x_{ip}) + \epsilon_i,$$
where $h(x_{i1}, x_{i2}, \ldots, x_{ip}) = \sum_{j=1}^{J} f_j(x_{i1}, x_{i2}, \ldots, x_{ip})$ and J is the total number of trees in the forest. In the notation of a tree, we have
$$y = \sum_{j=1}^{J} I_j(\beta_{mj} : x \in R_{mj}),$$
where $\beta_m$ is an estimate of y in the region $R_m$, and $I_j(\beta_{mj} : x \in R_{mj})$ is a single regression tree.
The model’s parameters $\beta_m$ and $\sigma^2$ in (3) are often estimated using frequentist approaches. These approaches include bagging [16], stacking [37], boosting [38], and random forest [19]. Stacking improves the performance of the single tree in the model (1) by forming a linear combination of different covariates in x. Similarly, boosting improves (1) by fitting a single tree model on data sample points not used by an earlier fitted tree. Bagging and random forest iteratively randomize the sample data D and fit (1) on each uniquely generated sample. The random forest improves over bagging by using a random subset of the p covariates, often denoted by $m_{try}$, to fit (1) instead of all p covariates. This procedure has been shown to dramatically lower the correlations between adjacent trees and thus reduce the overall prediction risk of (3). Friedman [39] defined the risk of using the random forest for estimating (3) as:
$$Var(\hat{y}_{RF}) = \rho(x)\,\sigma^2(x)$$
where $\rho(x)$ is the pairwise correlation between adjacent trees and $\sigma^2(x)$ is the variance of any randomly selected tree from the forest. Equation (4) implies that $\rho(x)$ plays a vital role in shrinking the risk towards 0 as $J \to \infty$. Achieving infinite forests is rarely possible due to computational difficulty. This drawback has necessitated the development of Bayesian [40] alternatives that are adaptive in nature in terms of averaging many posterior distributions. Chipman [28] proposed the Bayesian additive regression trees (BART) that average many posterior distributions of single Bayesian classification and regression trees (BCART, [41]). The approach is specifically similar to a form of boosting proposed in [20]. BART boosts weak trees by placing some form of deterministic priors on them. Although the empirical bake-off results of 42 datasets used to test the BART procedure showed an improved performance over RF, there is no theoretical backup on why BART is better than RF. Several authors have queried the improvement, especially in high-dimensional settings where RF still enjoys moderate acceptance. Hernandez [13] claimed that the BART algorithm implemented in R as the package “bartMachine” is memory hungry even at moderate p, with T, the number of MCMC iterations for posterior sampling, fixed at 1000.
Taddy [42] proposed the Bayesian and empirical Bayesian forests (BF and EBF) to maintain the structure of RF and modify the data randomization technique using a Bayesian approach. BF replaces the uniform randomization with the Dirichlet posterior distribution of the sampled observations. Similarly, EBF considers a hierarchical parameter structure, estimating the next-stage prior hyperparameters using the current data. The results from the empirical analysis showed that BF and EBF are not different from RF except in the aspect of model interpretation.

Variable Selection: Inconsistency and Inefficiency of Random Forest in Sparse High-Dimensional Setting

Definition 1.
Let $D_{hd} = [Y \mid X]$ be a partitioned matrix composed of the $n \times 1$ response vector Y and the $n \times p$ covariate matrix X with entries $y_i$, $x_{ik}$, for $i = 1, \ldots, n$ and $k = 1, \ldots, p$; then the rectangular matrix
$$D_{hd} = \begin{bmatrix} y_1 & x_{11} & x_{12} & x_{13} & x_{14} & \cdots & x_{1p} \\ y_2 & x_{21} & x_{22} & x_{23} & x_{24} & \cdots & x_{2p} \\ y_3 & x_{31} & x_{32} & x_{33} & x_{34} & \cdots & x_{3p} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ y_n & x_{n1} & x_{n2} & x_{n3} & x_{n4} & \cdots & x_{np} \end{bmatrix}$$
is referred to as a high-dimensional data matrix if $p \gg n$ [8].
If we redefine some of the columns in $D_{hd}$ such that the entries are zeros, thus truncating the matrix structure, we have a sparse HD matrix. Sparsity is inherent in HD data, where only a few of the p covariates x are usually related to the response y.
Definition 2.
A typical sparse HD matrix is given by
$$D_{shd} = \begin{bmatrix} y_1 & x_{11} & 0 & x_{13} & 0 & \cdots & 0 & x_{1p} \\ y_2 & x_{21} & 0 & x_{23} & 0 & \cdots & 0 & x_{2p} \\ y_3 & x_{31} & 0 & x_{33} & 0 & \cdots & 0 & x_{3p} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ y_n & x_{n1} & 0 & x_{n3} & 0 & \cdots & 0 & x_{np} \end{bmatrix}$$
as given in Olaniran and Abdullah [43].
The risk of random forests (RF), as indicated in Equation (4), increases in high-dimensional situations with a sparse covariate matrix. This increase is a consequence of the random selection of the variable subset of size $m_{try}$ used to build Equation (1). The hypergeometric sampling used to select $m_{try} \le p$ variables is notably sensitive to the mixing parameter $\pi$, which represents the proportion of relevant variables. Specifically, when the proportion of relevant variables ($\pi$) is higher, the risk associated with RF is lower, and vice versa.
It is worth noting that many software implementations of RF regression commonly set $m_{try} = p/3$. This choice is expected to yield a satisfactory set of covariates for predicting the target variable y under two conditions: when the data matrix D is of low dimensionality or when the number of observations (n) is greater than the number of covariates (p).
In the specific scenario where $D = D_{shd}$, Theorem 1 is employed to establish an upper bound for the probability of correctly selecting at least one relevant covariate from the set of p. This theorem is the foundation for defining the selection consistency of random forests in high-dimensional, sparse settings.
Theorem 1.
Given p covariates, of which $r \le p$ are relevant, and if we set the RF subsample size $m_{try} = p/3$ equal to r, then the probability that at least one of the r subsampled covariates is relevant converges to $1 - e^{-1}$ as $p \to \infty$.
Proof. 
Proposition 1.
Let $R_1, R_2, \ldots, R_r$ denote the events that the k-th subsampled covariate is relevant, $k = 1, \ldots, r$. Then, the event that at least one covariate among the r is relevant is $R_1 \cup R_2 \cup \cdots \cup R_r$, and the required probability is $P(R_1 \cup R_2 \cup \cdots \cup R_r)$. It is worth noting that the subsample selection performed by RF is without replacement, implying that the sample space of p covariates can be partitioned into two (R relevant covariates and p − R irrelevant covariates). This partitioning performed without replacement is often referred to as a hypergeometric or non-mutually exclusive process [44]. Thus, by the generalization of the principle of non-mutually exclusive events, we have
$$P(R_1 \cup R_2 \cup \cdots \cup R_r) = \sum_{k=1}^{r} P(R_k) - \sum_{\substack{j,k=1;\, k>j}}^{r} P(R_j \cap R_k) + \sum_{\substack{i,j,k=1;\, k>j>i}}^{r} P(R_i \cap R_j \cap R_k) - \cdots + (-1)^{r-1} P(R_1 \cap R_2 \cap \cdots \cap R_r)$$
$$P(R_1 \cup R_2 \cup \cdots \cup R_r) = \binom{r}{1}\frac{1}{r} - \binom{r}{2}\frac{1}{r}\cdot\frac{1}{r-1} + \binom{r}{3}\frac{1}{r}\cdot\frac{1}{r-1}\cdot\frac{1}{r-2} - \cdots + (-1)^{r-1}\binom{r}{r}\frac{1}{r!} = 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{r-1}\frac{1}{r!}$$
Recall the exponential series $e^{\xi} = 1 + \xi + \frac{\xi^2}{2!} + \frac{\xi^3}{3!} + \cdots + \frac{\xi^r}{r!} + \cdots$; if $\xi = -1$,
$$e^{-1} = 1 - 1 + \frac{1}{2!} - \frac{1}{3!} + \cdots + (-1)^{r}\frac{1}{r!} + \cdots$$
Thus, as $r \to \infty$,
$$P(R_1 \cup R_2 \cup \cdots \cup R_r) \to 1 - e^{-1} = 0.6321$$
Theorem 1 implies that, when p grows infinitely, the maximum proportion of relevant covariates that would be selected is 63.2%, assuming the subsample size $m_{try}$ chosen equals the number of relevant covariates in p. Figure 1 shows the convergence over varying p covariates.
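As a quick numerical check of the series used in the proof, the short R snippet below (illustrative, not from the paper) evaluates the partial sums $1 - \frac{1}{2!} + \frac{1}{3!} - \cdots$ and compares them with $1 - e^{-1}$.

# Partial sums of the alternating series in the proof versus the limit 1 - exp(-1)
partial_sum <- function(r) sum((-1)^((1:r) + 1) / factorial(1:r))
round(sapply(c(2, 3, 5, 10, 20), partial_sum), 5)
# 0.50000 0.66667 0.63333 0.63212 0.63212
1 - exp(-1)   # 0.6321206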
Lemma 1.
RF variable selection is consistent if $\lim_{p \to \infty} P(\hat{M} = M) = 1$.
Remark 1.
Lemma 1 indicates that for an RF model $\hat{M}$ fitted using r subsampled covariates, the RF variable selection is consistent if the fitted model converges almost surely to the true model M that contains all relevant variables R.
Corollary 1.
$\lim_{p \to \infty} P\left(\bigcup_{k=1}^{r_p} R_k\right) = 0.6321 < 1$; then, the RF variable selection is inconsistent in HD with a large p.
Now that we have established the inconsistency of RF in the HD setting, the following lemma presents the RF variance for the $D_{shd}$ matrix.
Lemma 2.
Let $\pi$ be the proportion of relevant covariates with $0 \le \pi \le 1$; the variance of a single tree $\sigma^2(x)$ can be decomposed into random and noise variances such that the risk of RF in a high-dimensional setting with $p \gg n$ can be defined as:
$$Var(\hat{y}_{RF}) = \rho(x)\,\pi\,\sigma_1^2(x) + (1 - \pi)\,\sigma_2^2(x)$$
where $\sigma_1^2(x)$ and $\sigma_2^2(x)$ are the random and noise variances, respectively.
Remark 2.
It is clear from Lemma 2 that the risk of RF in (11) is larger than (4) when $\pi < 1$; thus, RF violates the oracle property conditions defined by [45,46,47], among others, as:
 i.
Identification of the right subset model M such that $P(\hat{M} = M) \to 1$.
 ii.
Achievement of the optimal estimation rate, $\sqrt{n}(\hat{M} - M) \xrightarrow{d} N(0, Var(M))$.
Numerous scholars have contended that an effective estimator, denoted by $\hat{M}$, should meet certain oracle properties. However, the examination of Theorem 1 and Lemma 2 reveals that RF theoretically fails to fulfil these conditions within the context of sparse high-dimensional (HD) settings. This observation underscores the necessity of introducing an alternative methodology that can exhibit these desirable properties. The primary objective of this section was to present the theoretical limitations of RF with respect to variable selection and predictive performance, specifically in sparse HD settings. Previous studies, including those by [26,48], have primarily focused on empirically illustrating the limitations of random forests (RF) in high-dimensional (HD) settings. However, none of these studies has delved into the theoretical oracle properties in a high-dimensional context. Hence, the investigation presented here is particularly significant as it establishes the foundation for a comparative analysis of these properties in relation to those of the proposed Bayesian random forest (BRF), a comparison that will be detailed in the subsequent sections.
In delving into the theoretical constraints of RF, it becomes evident that, in sparse HD scenarios, RF struggles to meet the criteria deemed essential for an ideal estimator. This deficiency motivates the quest for alternative approaches that can overcome these limitations. The subsequent sections will elucidate how the proposed BRF method addresses and potentially surpasses these challenges, providing a more robust and versatile framework for variable selection and predictive performance in sparse HD settings. Thus, this critical evaluation not only highlights the shortcomings of RF but also paves the way for a more nuanced discussion on the merits of the proposed BRF methodology.

3. Priors and Posterior Specification of Bayesian Random Forest for Sparse HD

The Bayesian random forest (BRF) proposed here has three major prior parameters. The first is the model uncertainty prior defined over a tree I. Here, we propose a uniform prior $I \sim U(0,1)$ following [49], such that $Pr(I) = 1$ for any candidate tree. We used this prior specification to retain the average weighing procedure of RF so that each tree $I_j$ has an equal voting right. The core advantage of this prior is to retain RF’s strength in correcting the over-fitting problem by averaging over all trees. The second form of the prior is the terminal node parameter $\beta_M$ and $\sigma^2$ prior, and here we propose the normal inverse gamma prior $NIG(\mu_M, \sigma^2\Sigma, a_0, b_0)$ following [49], where $\mu_M$ and $\Sigma$ are the prior mean and covariance for the parameter $\beta_M$, and $a_0$ and $b_0$ are the prior sample size and sum of squares of the response $y_i$ for the parameter $\sigma^2$. Furthermore, we assumed a conditional prior of the tree parameters $\beta_M$ on $\sigma^2$, that is, for a single tree with M terminal nodes:
$$P(\beta_M, \sigma^2) = P(\beta_M \mid \sigma^2)\, P(\sigma^2).$$
This can be easily extended to J trees with the assumption of a constant model variance $\sigma^2$ over all trees. Thus, we have:
$$P(I_1, I_2, \ldots, I_J) = \prod_{j=1}^{J} P(I_j, \beta_{M_j})\, P(\sigma^2)$$
$$P(I_1, I_2, \ldots, I_J) = \prod_{j=1}^{J} P(\beta_{M_j} \mid \sigma^2)\, P(I_j)\, P(\sigma^2)$$
with $Pr(I_j) = 1$,
$$P(I_1, I_2, \ldots, I_J) = \prod_{j=1}^{J} P(\beta_{M_j} \mid \sigma^2)\, P(\sigma^2)$$
$$P(\beta_{M_j} \mid \sigma^2) = \frac{\exp\left[-\frac{1}{2\sigma^2}(\beta_{M_j} - \mu_{M_j})'\,\Sigma_j^{-1}\,(\beta_{M_j} - \mu_{M_j})\right]}{\sqrt{(2\pi)^{M}\,|\sigma^2\Sigma_j|}}$$
If we assume that the trees are independent and identically distributed, then
$$\prod_{j=1}^{J} P(\beta_{M_j} \mid \sigma^2) \sim N(J\mu_M, J\sigma^2\Sigma)$$
$$\prod_{j=1}^{J} P(\beta_{M_j} \mid \sigma^2) = \frac{\exp\left[-\frac{1}{2J\sigma^2}(\beta_{M_j} - J\mu_{M_j})'\,\Sigma_j^{-1}\,(\beta_{M_j} - J\mu_{M_j})\right]}{\left[(2\pi)^{JM}\,|\sigma^2 J\Sigma_j|\right]^{J/2}}$$
$$Pr(\sigma^2) = \frac{b_0^{a_0}\,(\sigma^2)^{-a_0 - 1}\exp(-b_0/\sigma^2)}{\Gamma(a_0)}$$
$$Pr(I_1, I_2, \ldots, I_J) = \frac{\exp\left[-\frac{1}{2J\sigma^2}(\beta_{M_j} - J\mu_{M_j})'\,\Sigma_j^{-1}\,(\beta_{M_j} - J\mu_{M_j})\right]}{\left[(2\pi)^{JM}\,|\sigma^2 J\Sigma_j|\right]^{J/2}} \times \frac{b_0^{a_0}\,(\sigma^2)^{-a_0 - 1}\exp(-b_0/\sigma^2)}{\Gamma(a_0)}$$
$$Pr(I_1, I_2, \ldots, I_J) = \frac{b_0^{a_0}\,(\sigma^2)^{-(a_0 + (J/2) + 1)}}{\Gamma(a_0)\left[(2\pi)^{JM}\,|J\Sigma_j|\right]^{J/2}} \times \exp\left\{-\frac{1}{2\sigma^2}\left[(\beta_{M_j} - J\mu_{M_j})'\, J^{-1}\Sigma_j^{-1}\,(\beta_{M_j} - J\mu_{M_j}) + 2b_0\right]\right\}$$
The Bayes theorem leads to the posterior density of trees:
$$Pr(I_1, \ldots, I_J \mid y, x) = \frac{\prod_{j=1}^{J} Pr(\beta_{M_j} \mid \sigma^2)\, Pr(\sigma^2)\, L(y, x \mid I_1, \ldots, I_J)}{\int_{\beta_{M_j}}\int_{\sigma^2}\prod_{j=1}^{J} Pr(\beta_{M_j} \mid \sigma^2)\, Pr(\sigma^2)\, L(y, x \mid I_1, \ldots, I_J)\, d\beta_{M_j}\, d\sigma^2}$$
The integral in the denominator of Equation (21) cannot be solved analytically; thus, it is often dropped in most Bayesian analyses, as suggested by [49], and hence we proceed as:
$$Pr(I_1, I_2, \ldots, I_J \mid y, x) \propto \prod_{j=1}^{J} Pr(\beta_{M_j} \mid \sigma^2)\, Pr(\sigma^2)\, L(y, x \mid I_1, I_2, \ldots, I_J).$$
The likelihood of J trees can be defined as:
$$L(y, x \mid I_1, I_2, \ldots, I_J) = \prod_{j=1}^{J} L(y, x \mid I_j)$$
$$L(y, x \mid I_j) = \frac{\exp\left[-\frac{1}{2\sigma^2}(y - \beta_{M_j})'(y - \beta_{M_j})\right]}{\sqrt{(2\pi\sigma^2)^{n}}}$$
$$L(y, x \mid I_1, I_2, \ldots, I_J) = \frac{\exp\left[-\frac{1}{2\sigma^2}\sum_{j=1}^{J}(y - \beta_{M_j})'(y - \beta_{M_j})\right]}{\sqrt{(2\pi\sigma^2)^{Jn}}}.$$
Therefore, the posterior of the Bayesian random forest regression is:
$$Pr(I_1, I_2, \ldots, I_J \mid y, x) \propto \frac{(\sigma^2)^{-(a_0 + (J/2) + 1)}}{\Gamma(a_0)\left[(2\pi)^{JM}\,|J\Sigma_j|\right]^{J/2}} \times \exp\left\{-\frac{1}{2\sigma^2}\left[(\beta_{M_j} - J\mu_{M_j})'\, J^{-1}\Sigma_j^{-1}\,(\beta_{M_j} - J\mu_{M_j}) + 2b_0\right]\right\} \times \frac{\exp\left[-\frac{1}{2\sigma^2}\sum_{j=1}^{J}(y - \beta_{M_j})'(y - \beta_{M_j})\right]}{\sqrt{(2\pi\sigma^2)^{Jn}}}$$
$$Pr(I_1, I_2, \ldots, I_J \mid y, x) \propto (\sigma^2)^{-(a_1 + (J/2) + 1)} \times \exp\left\{-\frac{1}{2\sigma^2}\left[(\beta_{M_j} - J\mu_{M_j 1})'(J\Sigma_{j1})^{-1}(\beta_{M_j} - J\mu_{M_j 1}) + 2b_1\right]\right\}$$
where:
$$J\mu_{M_j 1} = \left[(J\Sigma_j)^{-1} + (J V_j)\right]^{-1}\left[(J\Sigma_j)^{-1}\mu_{M_j} + (J V_j)^{-1} n_{M_j}\bar{y}_{M_j}\right]$$
where $V_j$ is an $m \times m$ matrix of data information such that the diagonal of $V_j$ is $\sigma^2_{mj}$, which is defined as:
$$\sigma^2_{mj} = n_{mj}^{-1}\sum_{i=1}^{n_{mj}}(y_i - \bar{y}_{mj})^2$$
$$J\Sigma_{j1} = \left[(J\Sigma_j)^{-1} + (J V_j)\right]^{-1}$$
$$a_1 = a_0 + n/2$$
$$b_1 = b_0 + \frac{\mu_{M_j}'\Sigma_j^{-1}\mu_{M_j} + y'y - \mu_{M_j 1}'\Sigma_{j1}^{-1}\mu_{M_j 1}}{2}$$
The marginal densities of $Pr(I_1, I_2, \ldots, I_J \mid y, x)$ are important when performing inference about $\beta_M$ and $\sigma^2$. The marginal density of $\beta_M$ is given by:
$$Pr(\beta_M \mid I_1, I_2, \ldots, I_J, y, x) = \int_{\sigma^2} Pr(I_1, I_2, \ldots, I_J \mid y, x)\, d\sigma^2$$
$$Pr(\beta_M \mid I_1, I_2, \ldots, I_J, y, x) = \int_{\sigma^2} (\sigma^2)^{-(a_1 + (J/2) + 1)} \times \exp\left\{-\frac{1}{2\sigma^2}\left[(\beta_{M_j} - J\mu_{M_j 1})'(J\Sigma_{j1})^{-1}(\beta_{M_j} - J\mu_{M_j 1}) + 2b_1\right]\right\} d\sigma^2$$
According to [50], the marginal distribution is identical to the Student-t distribution defined as:
$$Pr(\beta_M \mid I_1, I_2, \ldots, I_J, y, x) \sim t(\mu_{M_j 1}, s^2\Sigma_{j1}, 2a_1)$$
where,
$$s^2 = \frac{b_1}{a_1 - 1} = \frac{b_0 + \frac{\mu_{M_j}'\Sigma_j^{-1}\mu_{M_j} + y'y - \mu_{M_j 1}'\Sigma_{j1}^{-1}\mu_{M_j 1}}{2}}{a_0 + n/2 - 1}$$
Therefore, the posterior mean for β M is
$$\hat{\beta}_M = J^{-1}\left[(J\Sigma_j)^{-1} + (J V_j)\right]^{-1}\left[(J\Sigma_j)^{-1}\mu_{M_j} + (J V_j)^{-1} n_{M_j}\bar{y}_{M_j}\right]$$
and the posterior variance for β M is
$$var(\beta_M) = J^{-1}\,\frac{2a_1 s^2}{2a_1 - 2}\left[(J\Sigma_j)^{-1} + (J V_j)\right]^{-1}$$
The posterior mean of β M can be interpreted as the weighted average of prior mean μ M and data mean y ¯ M . The scaling factor is the joint contribution of prior and data information. Similarly, the posterior variance of β M is the scaled form of the joint contribution of data and prior information matrix. The parameters β M and σ 2 can be extracted from their posterior density using a hybrid of the Metropolis-Hastings and Gibbs sampler algorithms described as follows.
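To illustrate the flavour of these updates, the R sketch below carries out the conjugate normal-inverse-gamma update for a single terminal node, a deliberate simplification of the forest-level formulas above; the function name and the prior values passed in the example call are illustrative assumptions, not the authors' settings.

# Conjugate NIG update for one terminal node: posterior mean as a weighted
# average of the prior mean and the node data mean (single-node simplification)
nig_node_posterior <- function(y_node, mu0, n0, a0, b0) {
  n_m    <- length(y_node)
  ybar_m <- mean(y_node)
  mu1 <- (n0 * mu0 + n_m * ybar_m) / (n0 + n_m)       # weighted prior/data mean
  n1  <- n0 + n_m
  a1  <- a0 + n_m / 2
  b1  <- b0 + 0.5 * sum((y_node - ybar_m)^2) +
         n0 * n_m * (ybar_m - mu0)^2 / (2 * (n0 + n_m))
  s2  <- b1 / (a1 - 1)
  list(post_mean = mu1,                               # posterior mean of the node parameter
       post_var  = s2 / n1,                           # marginal posterior variance (Student-t)
       df        = 2 * a1)                            # degrees of freedom of the t marginal
}

set.seed(1)
nig_node_posterior(rnorm(5, mean = 10, sd = 2), mu0 = 0, n0 = 1, a0 = 2, b0 = 2)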

3.1. Hybrid Gibbs and MH Procedure for Extracting Posterior Information from Bayesian Random Regression Forest with Gaussian Response

  • Step 0: Define initial values $\beta_{M_j}^{0}$ and $(\sigma^2)^{0}$ such that $Pr(\beta_{M_j}^{0} \mid y, x) > 0$ and $Pr[(\sigma^2)^{0}] > 0$.
  • Step 1: For $v = 1, 2, \ldots, V$:
  • Step 2: Sample $\tilde{\sigma}^2$ from the lognormal proposal distribution $q_1(\tilde{\sigma}^2, \nu_1) = LN[(\sigma^2)^{v-1}, \nu_1]$.
  • Step 3: For $j = 1, 2, \ldots, J$ trees:
  • Step 4: Sample $\tilde{\beta}_{M_j}$ from the independent multivariate normal proposal distribution $q_2(\tilde{\beta}_{M_j}, \nu_2) = INM(\beta_{M_j}^{v-1}, \nu_2)$.
  • Step 5: Calculate the moving probability for $\beta_{M_j}$ by
    $$\pi_1(\beta_{M_j}^{v-1}, \tilde{\beta}_{M_j}) = \min\left\{\frac{Pr(\tilde{\beta}_{M_j} \mid y, x)}{Pr(\beta_{M_j}^{v-1} \mid y, x)}, 1\right\}$$
  • Step 6: Sample $U_1 \sim U(0, 1)$; then
    $$\beta_{M_j}^{v} = \begin{cases} \tilde{\beta}_{M_j} & \text{if } U_1 \le \pi_1(\beta_{M_j}^{v-1}, \tilde{\beta}_{M_j}) \\ \beta_{M_j}^{v-1} & \text{if } U_1 > \pi_1(\beta_{M_j}^{v-1}, \tilde{\beta}_{M_j}) \end{cases}$$
  • Step 7: Compute the residuals $\epsilon_i = y_i - J^{-1}\sum_{j=1}^{J} I_j(\beta_{M_j}^{v} : x \in R_{M_j})$.
  • Step 8: Calculate the moving probability for $\sigma^2$ by
    $$\pi_2[(\sigma^2)^{v-1}, \tilde{\sigma}^2] = \min\left\{\frac{Pr(\tilde{\sigma}^2 \mid \epsilon_i)\; q_1[(\sigma^2)^{v-1} \mid \tilde{\sigma}^2, \nu_1]}{Pr[(\sigma^2)^{v-1} \mid \epsilon_i]\; q_1[\tilde{\sigma}^2 \mid (\sigma^2)^{v-1}, \nu_1]}, 1\right\}$$
  • Step 9: Sample $U_2 \sim U(0, 1)$; then
    $$(\sigma^2)^{v} = \begin{cases} \tilde{\sigma}^2 & \text{if } U_2 \le \pi_2[(\sigma^2)^{v-1}, \tilde{\sigma}^2] \\ (\sigma^2)^{v-1} & \text{if } U_2 > \pi_2[(\sigma^2)^{v-1}, \tilde{\sigma}^2] \end{cases}$$
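The following R sketch mirrors the sampler above for a single terminal node with no covariates (the setting used in the validation below): a random-walk normal proposal for the node mean and a lognormal proposal for $\sigma^2$. The priors, tuning constants $\nu_1$ and $\nu_2$, and burn-in are illustrative assumptions, not the authors' settings.

# Simplified MH-within-Gibbs sampler for one terminal node (illustrative sketch)
set.seed(123)
y  <- rnorm(15, mean = 10, sd = 2)      # toy node data
V  <- 10000; nu1 <- 0.4; nu2 <- 0.5     # iterations and proposal scales
beta <- numeric(V); sig2 <- numeric(V)
beta[1] <- mean(y); sig2[1] <- var(y)

log_post_beta <- function(b, s2)        # normal likelihood + vague normal prior on the node mean
  sum(dnorm(y, b, sqrt(s2), log = TRUE)) + dnorm(b, 0, 100, log = TRUE)
log_post_sig2 <- function(s2, b)        # normal likelihood + inverse-gamma(2, 2) prior via change of variable
  sum(dnorm(y, b, sqrt(s2), log = TRUE)) + dgamma(1 / s2, 2, 2, log = TRUE) - 2 * log(s2)

for (v in 2:V) {
  # Steps 4-6: symmetric random-walk Metropolis update for the node mean
  beta_prop <- rnorm(1, beta[v - 1], nu2)
  log_r1 <- log_post_beta(beta_prop, sig2[v - 1]) - log_post_beta(beta[v - 1], sig2[v - 1])
  beta[v] <- if (log(runif(1)) <= log_r1) beta_prop else beta[v - 1]

  # Steps 8-9: Metropolis-Hastings update for sigma^2 with a lognormal proposal
  sig2_prop <- rlnorm(1, log(sig2[v - 1]), nu1)
  log_r2 <- log_post_sig2(sig2_prop, beta[v]) - log_post_sig2(sig2[v - 1], beta[v]) +
            dlnorm(sig2[v - 1], log(sig2_prop), nu1, log = TRUE) -
            dlnorm(sig2_prop, log(sig2[v - 1]), nu1, log = TRUE)
  sig2[v] <- if (log(runif(1)) <= log_r2) sig2_prop else sig2[v - 1]
}
c(post_mean = mean(beta[-(1:1000)]), post_sig2 = mean(sig2[-(1:1000)]))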
The proposed algorithm is a combination of the Metropolis–Hastings algorithm and the Gibbs sampler. It is a Metropolis–Hastings algorithm with further updates on $\sigma^2$ using the Gibbs sampler. The algorithm’s validity was demonstrated with a simulated response variable y with no predictor variables. A regression tree with three terminal nodes was assumed. Each terminal node of the regression tree consists of five observations with a mean of $\beta_m = 10$ and variance $\sigma_m^2 = 4$; $m = 1, 2, 3$. The regression tree was replicated 5 times using bootstrapping to make a forest. The MCMC iteration results are shown below. The first column shows the trace plot of the parameter for each node over the trees. The trace plots for the four parameters show that the iterations converge very sharply at 10,000. The autocorrelation plots show an exponential decay, suggesting independent MCMC chains. The acceptance rate lies within the tolerable range of 20–40%, as suggested by [49]. These features established the validity of the algorithm. In addition, the histogram supports the analytical densities proposed for the posterior distribution of the parameters. The posterior densities for $\beta_m$ are significantly closer to the Student-t distribution, with values at the tail end, while the $\sigma^2$ density is very much closer to a Gamma. The overall model standard error estimate using the Bayesian random forest (BRF) algorithm is $\sigma_{brf} = 1.42$, while that of the frequentist estimate is $\sigma_{rf} = 2.04$. This established that, empirically, BRF is more efficient than the frequentist RF method (Figure 2 and Figure 3).
The performance of the BRF procedure was also examined for predicting the response y given three covariates that correspond to genes. The three covariates are denoted by $Gene_1$, $Gene_2$, and $Gene_3$. The functional relation between the response and genes was defined as:
$$y_i = 5 + 10 \times Gene_1 + 20 \times Gene_2 + 30 \times Gene_3 + \epsilon_i$$
where $i = 1, 2, \ldots, 30$. Figure 4 shows a single tree diagram from a Bayesian random forest consisting of five trees. The plot shows 20 simulated terminal nodes.
The first white box in Figure 4 contains the value 2.4, corresponding to the split decision value. The variable $Gene_3$ was split into two daughter nodes, with the left node terminating on the predicted $\hat{y} = 13.39$ based on the condition that $Gene_3 < 2.4$. Subsequently, the right node further splits into another two daughter nodes using the condition $Gene_3 < 0.69$. Again, the left daughter node terminates on the predicted $\hat{y} = 17.17$. The process continues until the maximum number of nodes, set to 20, is reached. The condition on the number of nodes is also referred to as the maximal tree depth by [6]. The reason for more splits on $Gene_3$ can be easily observed from how the response was simulated, such that $Gene_3$ is the most relevant in terms of weights. If the variable importance score is desired, $Gene_3$ will be the most important variable since it occurs more frequently than the others in the predictor set.
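A minimal R sketch of this toy setting is given below; it regenerates data from the stated functional relation and fits a single CART tree with rpart as a stand-in for one tree of the forest. The seed and rpart settings are illustrative assumptions, so the split values will not match Figure 4 exactly.

# Toy three-gene data-generating process (n = 30) and a single regression tree
library(rpart)
set.seed(2023)
n <- 30
genes <- data.frame(Gene1 = rnorm(n), Gene2 = rnorm(n), Gene3 = rnorm(n))
genes$y <- 5 + 10 * genes$Gene1 + 20 * genes$Gene2 + 30 * genes$Gene3 + rnorm(n)

tree_fit <- rpart(y ~ Gene1 + Gene2 + Gene3, data = genes,
                  control = rpart.control(cp = 0, minsplit = 4))
tree_fit$variable.importance   # Gene3 carries the largest weight, mirroring Figure 4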
The posterior estimates in Equations (36) and (37) obtained earlier rely on the accuracy of the prior parameter values assumed. In most occasions, searching for appropriate prior parameters may be difficult, especially in the case of the sum of trees model. A data-driven approach is often used, such as those used in [28,41]. Another alternative is the empirical Bayes. Empirical Bayes [51,52,53] allows the experimenter to estimate the prior hyperparameter values from the data. It is a hierarchical modeling approach where the parameter of a second stage or later model depends on the initial stage data. The empirical Bayes approach is often used when hierarchical data are available. However, it can also be applied for non-hierarchical situations, as extended by [54] using bootstrapped data to construct confidence intervals. The sum of trees modeling strategy is thus further simplified using the bootstrap prior technique. The approach was used to obtain the prior hyperparameters μ M , Σ for each tree. The major advantage of the bootstrap prior is that it guarantees an unbiased estimate of β M for each tree. To achieve a fast grasp of the approach, we consider the Bayesian inference of a single tree j with one of the terminal node parameters defined by β m and λ m . The likelihood of a single tree I j ( β m : x R m ) , L [ y , x | I j ( β m : x R m ) ] can be written as
$$L[y, x \mid I_j(\beta_m : x \in R_m)] = \left(\frac{\lambda_{mj}^{1/2}}{\sqrt{2\pi}}\right)^{n_{mj}} \exp\left[-\frac{\lambda_{mj}}{2}\sum_{i: x_i \in R_{mj}}^{n_{mj}}(y_i - \beta_{mj})^2\right]$$
where $\lambda_{mj} = \sigma_{mj}^{-2}$ is interpreted as the precision for node m. Correspondingly, we can write the prior density for the parameters of a single tree as:
$$Pr(\beta_{mj}, \lambda_{mj} \mid n_0, a_0, b_0) = \frac{(\lambda_{mj} n_0)^{1/2}}{\sqrt{2\pi}} \exp\left[-\frac{\lambda_{mj}}{2}(\beta_{mj} - \mu_0)^2\right] \times \frac{b_0^{a_0}\lambda_{mj}^{a_0 - 1}\exp(-\lambda_{mj} b_0)}{\Gamma(a_0)}$$
where $n_0$ is the prior sample size for terminal node m, $\mu_0$ is the prior mean obtained from $n_0$, $a_0$ is the prior sample size for the precision $\lambda_{mj}$, and $b_0$ is the prior sum of squares deviation from the prior mean $\mu_0$. The posterior distribution of a single tree thus follows from the Bayes theorem:
$$Pr(\beta_{mj}, \lambda_{mj} \mid y, x) = \frac{Pr(\beta_{mj}, \lambda_{mj}) \times L[y, x \mid I_j(\beta_m : x \in R_m)]}{\int_{\beta_{mj}}\int_{\lambda_{mj}} Pr(\beta_{mj}, \lambda_{mj}) \times L[y, x \mid I_j(\beta_m : x \in R_m)]\, d\beta_{mj}\, d\lambda_{mj}}.$$
After a little arrangement, the posterior distribution can be defined as:
$$Pr(\beta_{mj}, \lambda_{mj} \mid y, x) = \frac{(\lambda_{mj} n_1)^{1/2}}{\sqrt{2\pi}} \exp\left[-\frac{\lambda_{mj}}{2}(\beta_{mj} - \mu_1)^2\right] \times \frac{b_1^{a_1}\lambda_{mj}^{a_1 - 1}\exp(-\lambda_{mj} b_1)}{\Gamma(a_1)}$$
where $\mu_{1m} = (n_0\mu_0 + n_m\bar{y}_m)/(n_0 + n_m)$ is the posterior estimate of $\beta_m$, $n_1 = n_0 + n_m$ is the posterior sample size with which $\mu_{1m}$ can be estimated, $a_1 = a_0 + n_m/2$ is the posterior sample size with which $\lambda_m$ can be estimated, and $b_1 = b_0 + \frac{1}{2}\sum_{i=1}^{n_m}(y_i - \bar{y}_m)^2 + \frac{n_0 n_m(\bar{y}_m - \mu_0)^2}{2(n_0 + n_m)}$. The terminal node estimate $\bar{y}_m$, defined as
$$\bar{y}_m = n_m^{-1}\sum_{i=1}^{n_m}(y_i \mid x_i \in R_m),$$
is the maximum likelihood estimate of $E(y_i \mid x_i \in R_m)$. The estimate is unbiased, and it is the one used in RF algorithms. The variance of the estimate follows as
$$var(\bar{y}_m) = n_m^{-1}\sigma_m^2$$
The corresponding empirical Bayes estimates for Equations (42) and (43) are
$$\hat{\mu}_{[EB]m} = (\hat{n}_0\hat{\mu}_0 + n_m\bar{y}_m)/(\hat{n}_0 + n_m)$$
$$\hat{\sigma}^2_{[EB]m} = \frac{\hat{b}_0 + \frac{1}{2}\sum_{i=1}^{n_m}(y_i - \bar{y}_m)^2 + \frac{\hat{n}_0 n_m(\bar{y}_m - \hat{\mu}_0)^2}{2(\hat{n}_0 + n_m)}}{\hat{a}_0 + n_m/2}$$
As a further update on the empirical Bayes procedure, the prior hyperparameters are estimated from a bootstrapped sample by following the procedure below:
  • Creating B bootstrap samples y b from the initial sample y m in the terminal node m;
  • Estimating the hyperparameters (prior parameters) each time the samples are generated using the maximum likelihood (ML) method;
  • Updating the posterior estimates using the hyperparameters in Step (2) above using Equations (44) and (45);
  • Then, obtaining the bootstrap empirical Bayesian estimates μ ^ B T and σ ^ B T 2 using
$$\hat{\mu}_{BT} = B^{-1}\sum_{b=1}^{B}\hat{\mu}_{[EB]mb}$$
$$\hat{\sigma}^2_{BT} = B^{-1}\sum_{b=1}^{B}\hat{\sigma}^2_{[EB]mb}$$
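A compact R sketch of this bootstrap-prior routine for a single terminal node is shown below; the number of bootstrap replicates B, the choice $\hat{n}_0 = B$ (as in the proof of Theorem 2 that follows), and the ML-style choices for $\hat{a}_0$ and $\hat{b}_0$ are illustrative assumptions rather than the authors' implementation.

# Bootstrap-prior empirical Bayes estimates for one terminal node (illustrative)
bootstrap_prior_node <- function(y_node, B = 200) {
  n_m    <- length(y_node)
  ybar_m <- mean(y_node)
  mu_eb  <- numeric(B); sig2_eb <- numeric(B)
  for (b in seq_len(B)) {
    y_b   <- sample(y_node, n_m, replace = TRUE)   # bootstrap sample within node m
    mu0_b <- mean(y_b)                             # ML estimate of the prior mean
    n0_b  <- B                                     # prior sample size fixed at B
    a0_b  <- n_m / 2                               # illustrative prior shape
    b0_b  <- 0.5 * sum((y_b - mu0_b)^2)            # illustrative prior sum of squares
    mu_eb[b]   <- (n0_b * mu0_b + n_m * ybar_m) / (n0_b + n_m)
    sig2_eb[b] <- (b0_b + 0.5 * sum((y_node - ybar_m)^2) +
                   n0_b * n_m * (ybar_m - mu0_b)^2 / (2 * (n0_b + n_m))) /
                  (a0_b + n_m / 2)
  }
  c(mu_BT = mean(mu_eb), sig2_BT = mean(sig2_eb))  # averages over the B updates
}

set.seed(11)
bootstrap_prior_node(rnorm(10, mean = 10, sd = 2))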
Theorem 2.
The bootstrap prior estimate $\hat{\mu}_{BT}$ is a uniformly minimum variance unbiased estimator of $\beta_m = E(y_i \mid x_i \in R_m)$ under conditions of mild regularity.
Proof. 
$$\hat{\mu}_{BT} = B^{-1}\sum_{b=1}^{B}\left[\frac{\hat{n}_0\hat{\mu}_0 + n_m\bar{y}_m}{\hat{n}_0 + n_m}\right]$$
$$\hat{\mu}_{BT} = B^{-1}\sum_{b=1}^{B}\left[\frac{\hat{n}_0\hat{\mu}_0}{\hat{n}_0 + n_m} + \frac{n_m\bar{y}_m}{\hat{n}_0 + n_m}\right]$$
Suppose we fix the prior parameters as $\hat{n}_{0b} = B$ and $\hat{\mu}_{0b} = \bar{y}_{mb}$, where $\bar{y}_{mb}$ is the ML estimate based on a bootstrap sample $y_b$ selected from $y_m$. That is,
$$\bar{y}_{mb} = n_m^{-1}\sum_{i=1}^{n_m}(y_{bi} \mid x_i \in R_m)$$
Then,
$$\hat{\mu}_{BT} = B^{-1}\sum_{b=1}^{B}\left[\frac{B\bar{y}_{mb}}{B + n_m} + \frac{n_m\bar{y}_m}{B + n_m}\right]$$
$$\hat{\mu}_{BT} = \frac{\sum_{b=1}^{B}\bar{y}_{mb}}{B + n_m} + \frac{n_m\bar{y}_m}{B + n_m}$$
$$E[\hat{\mu}_{BT}] = E\left[\frac{\sum_{b=1}^{B}\bar{y}_{mb}}{B + n_m} + \frac{n_m\bar{y}_m}{B + n_m}\right]$$
$$E[\hat{\mu}_{BT}] = \frac{\sum_{b=1}^{B}E[\bar{y}_{mb}]}{B + n_m} + \frac{n_m E[\bar{y}_m]}{B + n_m}$$
Since $\bar{y}_m$ and $\bar{y}_{mb}$ are known unbiased estimates of $\beta_m$,
$$E[\hat{\mu}_{BT}] = \frac{\sum_{b=1}^{B}\beta_m}{B + n_m} + \frac{n_m\beta_m}{B + n_m}$$
$$= \frac{1}{B + n_m}\left[B\beta_m + n_m\beta_m\right]$$
$$E[\hat{\mu}_{BT}] = \beta_m$$
Therefore, $\hat{\mu}_{BT}$ is unbiased for estimating $\beta_m$. Also, the MSE is the combination of the squared bias and the variance of the estimate; hence, following from the above derivation, the MSE is just the variance of the estimate. Thus,
$$var[\hat{\mu}_{BT}] = var\left[\frac{\sum_{b=1}^{B}\bar{y}_{mb}}{B + n_m} + \frac{n_m\bar{y}_m}{B + n_m}\right]$$
$$= \frac{\sum_{b=1}^{B} var[\bar{y}_{mb}]}{(B + n_m)^2} + \frac{n_m^2\, var[\bar{y}_m]}{(B + n_m)^2}$$
$$= \frac{\sum_{b=1}^{B}\left[n_m^{-1}\sigma_m^2\right]}{(B + n_m)^2} + \frac{n_m^2\left[n_m^{-1}\sigma_m^2\right]}{(B + n_m)^2}$$
$$var[\hat{\mu}_{BT}] = \left[\frac{n_m^2 + B}{(B + n_m)^2}\right] n_m^{-1}\sigma_m^2$$
Hence, it can be shown that the limiting form of $\frac{n_m^2 + B}{(B + n_m)^2}$ is 0 by applying L’Hôpital’s rule:
$$\lim_{B \to \infty}\frac{n_m^2 + B}{(B + n_m)^2} = \lim_{B \to \infty}\frac{\frac{d(n_m^2 + B)}{dB}}{\frac{d(B + n_m)^2}{dB}}$$
$$\lim_{B \to \infty}\frac{n_m^2 + B}{(B + n_m)^2} = \lim_{B \to \infty}\frac{1}{2(B + n_m)}$$
$$\lim_{B \to \infty}\frac{n_m^2 + B}{(B + n_m)^2} = 0$$
The above derivation implies that, for a fixed node sample size $n_m$, $\lim_{B \to \infty} var[\hat{\mu}_{BT}] = 0$. This affirms that the experimenter can control the stability of the estimator by increasing the number of bootstrap samples B. In addition, $var[\hat{\mu}_{BT}] = \left[\frac{n_m^2 + B}{(B + n_m)^2}\right] n_m^{-1}\sigma_m^2 < var[\hat{\mu}_{ML}] = n_m^{-1}\sigma_m^2$, because the factor $\frac{n_m^2 + B}{(B + n_m)^2}$ is below 1 and converges to zero with increasing B.
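The shrinkage factor can be checked numerically; the short R snippet below (illustrative) evaluates $\frac{n_m^2 + B}{(B + n_m)^2}$ for a node of size $n_m = 5$ and increasing B.

# Variance shrinkage factor of the bootstrap prior estimator (the ML estimator's factor is 1)
shrink <- function(B, n_m = 5) (n_m^2 + B) / (B + n_m)^2
round(sapply(c(10, 100, 1000, 10000), shrink), 5)
# 0.15556 0.01134 0.00101 0.00010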
Therefore, the frequentist (ML) estimator is less efficient than the estimator $\hat{\mu}_{BT}$; because both are unbiased, this comparison is valid. The proposed estimator reduces the MSE through both bias and variance reduction, whereas the conventional Bayesian estimator reduces the MSE by lowering the variance alone; hence, $\hat{\mu}_{BT}$ is also more efficient within the Bayesian framework.
Therefore, $\hat{\mu}_{BT}$ is a minimum variance unbiased estimator for estimating the population mean $\beta_m$. The proof established here serves as a baseline for using a bootstrapped prior with the sum of trees model.
On the other hand, we have assumed the independence of trees in the above derivations, a premise derived from the original formulation of RF, where each tree is constructed using a random subset of the high-dimensional (HD) predictors. However, in certain situations, this assumption may be deemed invalid, especially in a typical HD setting where the correlation between adjacent predictors is not necessarily zero ($\rho_{x_i, x_j} \neq 0$, $i \neq j$). In such cases, the a priori distribution may not be absolutely continuous with respect to the Lebesgue measure, leading to a departure from the absolute continuity condition. A probability measure P is deemed absolutely continuous with respect to another measure Q if, for every set A with Q-measure zero, the P-measure of A is also zero. Mathematically, this translates to $Q(A) = 0 \Rightarrow P(A) = 0$ for all A [49]. In Bayesian statistics, the specified condition carries implications for the application of particular types of prior distributions. We chose the uniform $U(0,1)$ prior for each candidate tree in the forest, guided by the independence assumption. This choice aligns with decisions made in several related studies [5,28,30,31,41,43,52,53,55]. However, in scenarios lacking absolute continuity, especially in dependent situations, the uniform prior $U(0,1)$ may prove inadequate. In such cases, the most appropriate prior distribution becomes the Cantor ternary distribution [56]. Nevertheless, providing detailed derivations of the posterior distribution for this case extends beyond the scope of this paper.
Moreover, in situations of dependent high-dimensional data (HD) where the independent assumption is not valid, integrating the fractal and Hausdorff dimensions into dimensionality reduction techniques or regularization methods in machine learning can guide the learning process [57]. This integration proves valuable in preserving essential features while reducing noise or irrelevant details. Fractal or Hausdorff dimensions can function as metrics for assessing the complexity of models in high-dimensional analysis, facilitating model selection by balancing complexity and performance. It is important to highlight that this approach represents a form of dimension reduction, akin to the greedy approach of identifying relevant variables before fitting random forests (RF). While effective for this specific scenario, the fact that it weakens the robustness of RF in terms of accommodating interaction effects raises doubts about its applicability in genomic analysis.

3.2. A New Weighted Splitting for the Bayesian Random Forest in Sparse High-Dimensional Setting

Apart from the probabilistic interpretation update on random forest regression achieved using Bayesian modeling, we also dealt with the variable selection principle used during splitting. Tree-based methods use a greedy approach to build trees. In a high-dimensional setting with a large number of covariates, modeling with all the covariates increases the computational time and thus randomly subsampling variables or using a deterministic approach is suitable for tree-based methods. There are two popular approaches for handling high-dimensional data:
  • Greedy search: Identifying the relevant subset of variables and fitting the desired model on them.
  • Random search: Randomly selecting the subset (whether relevant or irrelevant) and fitting the desired model on them.
Neither approach is 100% perfect for variable selection: greedy search fails to capture the interaction effects between variables and sometimes overfits, while random search, although it does not overfit if replicated a large number of times, tends to suffer a loss of efficiency when the variable space is populated with irrelevant variables. The RF regression algorithm randomly selects variables from the predictor space by selecting a fixed number, $p/3$, irrespective of their predictive interaction with the response variable. This subsample size does not take into account the number of relevant predictors in the entire predictor space, and thus the chance of selecting irrelevant features increases with an increasing p. Therefore, using the same data configuration, the predictive performance of RF reduces with an increasing p.
The weakness of RF can be attributed to its random subset selection mechanism. Updating the subset selection with a data-driven approach such that predictors are ranked in the order of relative correlation with response y will be fruitful. The motivation behind this idea follows from a greedy background, by trying to build a sum of tree models with only a relevant subset of predictors. However, this will affect the interaction modeling strength of RF which might further lead to a reduction in predictive power. In addition, we intend to update and not modify RF so as to maintain all its strength. Based on this fact, we developed a new framework that combines the strength of greedy search as well as random search by ranking the variables based on their initially computed importance.
Let $T_1, T_2, T_3, \ldots, T_p$ represent p independent t statistics with a cumulative distribution function denoted by F(t). This assumption of independence aligns with similar assumptions made in previous studies, including [24,25,26]. The assumption’s validity is justified in this study’s context, where the goal is to assess the marginal contribution of each variable to the response variable. In particular, $T_k$ corresponds to the t statistic for each covariate $x_k$ after fitting a Bayesian simple linear regression model of the response variable y on $x_k$. The definition of $T_k$ is as follows:
$$T_k = \frac{\hat{\theta}_k}{SD(\hat{\theta}_k)}$$
where $\hat{\theta}_k$ is the Bayesian estimated weight of $x_k$ in the simple linear regression model:
$$y = \theta_0 + \theta_k x_k + \epsilon$$
$\theta_0$ is the bias of estimating y using $x_k$, and $\epsilon$ is the random noise that arises during the estimation of y with the linear model; it is considered to be independent, identically Gaussian-distributed noise with a mean of zero and a constant variance $\delta^2$ [26]. $SD(\hat{\theta}_k)$ is the posterior standard deviation of $\theta_k$. These assumptions follow because, at this stage, only one variable is considered at a time. The t statistics $T_k$ are then ranked in increasing order of magnitude as $T_{(1)} \le T_{(2)} \le T_{(3)} \le \cdots \le T_{(p)}$, where $T_{(k)}$ is the k-th order statistic ($k = 1, 2, \ldots, p$). Then, the cumulative distribution function (CDF) of the largest order statistic $T_{(p)}$ is given by:
$$F_p(t) = Pr\left[T_{(p)} \le t\right]$$
$$F_p(t) = Pr\left[\text{all } T_{(k)} \le t\right] = F^{p}(t)$$
Also, we can see that $Pr\left(T_{(k)} \ge \text{all } T_{(p-k)}\right) \equiv Pr\left(\text{all } T_{(p-k)} \le T_{(k)}\right)$; thus,
$$F_{(p-k)}(t) = Pr\left(\text{all } T_{(p-k)} \le T_{(k)}\right).$$
Equation (69) can be interpreted as the probability that at least $p - k$ of the $T_{(k)}$ are less than or equal to t. This also implies that all other $(p - k)$ variables are less relevant to the response y than $x_k$.
$$Pr\left(\text{all } T_{(p-k)} \le T_{(k)}\right) = \sum_{l = p-k}^{p}\binom{p}{l} F^{l}(t)\left[1 - F(t)\right]^{p-l}$$
$$F_{(p-k)}(t) = \sum_{l = p-k}^{p}\binom{p}{l} F^{l}(t)\left[1 - F(t)\right]^{p-l}.$$
We now refer to $F_{(p-k)}(t)$ as the weight $w_k$, which is the probability that each of the other $p - k$ variables is less important to y than $x_k$. A binary regression tree’s splitting mechanism is then updated using this weight in the following way:
$$Q_m^w(T) = (1 - w_k)\left[\sum_{i: x_k \in R_1(j,s)}^{n_{1m}}(y_i - \hat{\beta}_{1m})^2 + \sum_{i: x_k \in R_2(j,s)}^{n_{2m}}(y_i - \hat{\beta}_{2m})^2\right]$$
If a variable $x_k$ is important and subsequent splitting on it will be significant, the weighted deviation $Q_m^w(T)$ reduces to zero (since $w_k \to 1$). Because variables with lower weights w will not be further split on in the tree-building algorithm, this strategy helps accelerate the algorithm and improve the variable selection component of random forest regression. The procedure below summarizes BRF for a Gaussian response, and an illustrative code sketch follows the procedure.
  • Step 0: Start with input data D = [ x ; y ]
  • Step 1: Analyze each variable $x_k \in x$ individually by running a univariate analysis and save the bootstrap Bayesian t statistic $t_{BP}$.
  • Step 2: Calculate the probability of maximal weight $w_k$ for each variable $x_k \in x$.
  • Step 3: For each of the J trees, where $j = 1, 2, \ldots, J$:
  • Step 4: Compute the bootstrap prior predictive density weights $\omega_i$ from a normal-inverse-gamma (NIG) distribution with parameters $\mu_M^{BP}, \sigma^2\Sigma_M^{BP}, a_{BP}, b_{BP}$.
  • Step 5: Generate a Bayesian weighted simple random sample $D^{*}$ of size N with replacement from the training data D using the weights $\omega_i$.
  • Step 6: Generate a Bayesian weighted simple random sample.
  • Step 7: Grow a weighted-predictor CART tree $I_j$ by iteratively repeating the following steps for each terminal node m until the minimum node size $n_{min}$ is reached:
    (a)
    Randomly select $m_{try} = p/3$ variables without replacement from the p available variables.
    (b)
    Choose the best variable and split-point from the selected variables.
    (c)
    Divide the node into two daughter nodes.
    (d)
    Compute the weighted splitting criterion $Q_m^w(T)$ and identify the node with the minimum deviance $Q_m^w(T)$.
  • Step 8: Print the ensemble of trees I j over J iterations.
  • Step 10: To predict the test data $x_{te}$, apply:
    $$\hat{y}_{brf}^{J} = \frac{1}{J}\sum_{j=1}^{J} I_j(x_{te})$$
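The sketch below illustrates the weighting stage of the procedure in R: per-variable t statistics from simple linear regressions, the order-statistic probability $F_{(p-k)}(t)$ used as the weight $w_k$, and the weighted node deviance $Q_m^w(T)$. Frequentist lm() t statistics stand in for the bootstrap Bayesian t statistics, and the function names are illustrative, not the authors' implementation.

# Stage 1: marginal t statistics and order-statistic weights w_k (illustrative sketch)
weighted_split_weights <- function(X, y) {
  p <- ncol(X)
  tstat <- sapply(seq_len(p), function(k)
    coef(summary(lm(y ~ X[, k])))[2, "t value"])       # slope t statistic for x_k
  Ft <- pt(tstat, df = length(y) - 2)                  # CDF value F(t) of each statistic
  rk <- rank(tstat)                                    # rank k of each variable
  sapply(seq_len(p), function(k)                       # w_k = P(at least p - k statistics <= T_(k))
    sum(dbinom((p - rk[k]):p, size = p, prob = Ft[k])))
}

# Stage 2: weighted deviance of a candidate split, Q_m^w(T) = (1 - w_k) * (SSE_left + SSE_right)
weighted_node_deviance <- function(y_left, y_right, w_k) {
  (1 - w_k) * (sum((y_left - mean(y_left))^2) + sum((y_right - mean(y_right))^2))
}

# Example: the relevant variable receives a weight near 1, shrinking its split deviance
set.seed(3)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- 3 * X[, 1] + rnorm(100)
round(weighted_split_weights(X, y), 3)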

3.3. Oracle Properties of Bayesian Random Forest

In this section, we show that, if the Bayesian bootstrap prior estimator $\hat{\mu}_{BT}$ is used for estimating $\beta_M$ and the weighted splitting approach is utilized, the Bayesian random forest (BRF) enjoys the oracle properties.
Theorem 3.
Suppose $\hat{\beta}_M = \hat{\mu}_{BT}$ and $F_{(p-k)}(t) \to 1$; then, the Bayesian random forest (BRF) satisfies the following:
 i.
Identification of the right subset model M such that $P(\hat{M} = M) \to 1$.
 ii.
Achievement of the optimal estimation rate, $\sqrt{n}(\hat{M} - M) \xrightarrow{d} N(0, Var(M))$.
Proof. 
From Theorem 1, we know that the probability of selecting at least one relevant subset R from the set p using RF is
$$P(R_1 \cup R_2 \cup \cdots \cup R_r) = \sum_{k=1}^{r} P(R_k) - \sum_{\substack{j,k=1;\, k>j}}^{r} P(R_j \cap R_k) + \sum_{\substack{i,j,k=1;\, k>j>i}}^{r} P(R_i \cap R_j \cap R_k) - \cdots + (-1)^{r-1} P(R_1 \cap R_2 \cap \cdots \cap R_r).$$
Now, using the weighted splitting, there is assurance that the selected variable $x_k$ is relevant provided $F_{(p-k)}(t) \to 1$. This implies that the random selection of variables for splitting in BRF is a mutually exclusive process, that is, $P(R_i \cap R_j) = 0$ for $i \neq j$. Thus, the probability of selecting at least one relevant subset R from the set p using BRF is
$$P(R_1 \cup R_2 \cup \cdots \cup R_r) = \sum_{k=1}^{r} P(R_k) = \binom{r}{1}\frac{1}{r} = 1$$
Lemma 3.
BRF variable selection is consistent if $\lim_{p \to \infty} P(\hat{M} = M) = 1$.
Corollary 2.
From Equation (74), $\lim_{p \to \infty} P\left(\bigcup_{k=1}^{r_p} R_k\right) = 1$; then, the BRF variable selection is consistent in HD with a large p.
Theorem 2 revealed that the Bayesian estimator $\hat{\mu}_{BT}$ is a uniformly minimum variance unbiased estimator of the parameter $\beta_M$ under conditions of mild regularity. This implies that
$$var(\hat{y}_{brf}) = var(\hat{\mu}_{BT}) = \left[\frac{n_m^2 + B}{(B + n_m)^2}\right] n_m^{-1}\sigma_m^2 < var(\hat{y}_{rf}) = n_m^{-1}\sigma_m^2$$
Remark 3.
Equation (75) implies that BRF is more efficient than RF as the bootstrap size $B \to \infty$.

4. Simulation and Results

In this section, we conducted an empirical evaluation of BRF in comparison to its major competitors using both simulation and real-life data. The analyses were performed through ten repetitions of 10-fold cross-validation on the datasets. All the analyses were executed in the R statistical package v. 3.6.1. We utilized the newly built brf function for BRF, the glmnet function [39] for LASSO, gbm for gradient boosting [20], rfsrc for random forest, wbart for BART1 as described in [28], and bartMachine for BART2 [58].
To implement the Bayesian forest method (BF), we modified the case.wt parameter of rfsrc from [59], introducing random weights distributed exponentially with a rate parameter of 1. It is worth noting that we employed two different R packages for BART due to the observed discrepancies in the results produced by these packages. Detailed information regarding the setup of tuning parameters can be found in Table 1. An illustrative sketch of this setup is given below.
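The following R sketch shows how the frequentist competitors and the BF modification could be set up on a toy training set; the package calls follow the CRAN interfaces of glmnet, gbm, and randomForestSRC, while the toy data, tuning values, and the omission of brf, wbart, and bartMachine are simplifications for illustration only.

# Toy training data (placeholder for the simulated scenarios)
set.seed(1)
train <- data.frame(matrix(rnorm(200 * 20), 200, 20))
names(train) <- paste0("x", 1:20)
train$y <- train$x1 + 2 * train$x2 + rnorm(200)

library(glmnet); library(gbm); library(randomForestSRC)

fit_lasso <- cv.glmnet(x = as.matrix(train[, 1:20]), y = train$y, alpha = 1)   # LASSO
fit_gbm   <- gbm(y ~ ., data = train, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 3)                        # gradient boosting
fit_rf    <- rfsrc(y ~ ., data = train, ntree = 1000)                          # random forest

# Bayesian forest (BF): exponential(1) case weights in rfsrc, i.e., a Bayesian
# bootstrap replacing the uniform resampling of observations
fit_bf    <- rfsrc(y ~ ., data = train, ntree = 1000,
                   case.wt = rexp(nrow(train), rate = 1))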
Two simulation scenarios were created based on the problem we intend to tackle in this paper. The simulation scenarios were adapted from the works of [13,28]. In each of the scenarios, six levels of low- and high-dimensional settings were defined as p = 50 , 100 , 500 , 1000 , 5000 , 10 , 000 and used so as to mimic the realistic gene expression datasets. The sample size corresponding to the number of patients n which is usually far smaller than p was fixed at 200 in all the scenarios. Here, the root mean square error ( R M S E ) and average root mean square ( A R M S E ) were used as performance measures over the 10-fold cross-validation. Note that p = 50 & 100 were used to examine the behavior of the methods in low-dimensional data situations.
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n_{test}} \left(y_i - \hat{y}_i\right)^2}{n_{test}}}$$
$$\mathrm{ARMSE} = \frac{\sum_{e=1}^{10} \mathrm{RMSE}_e}{10}$$
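In R, the two measures above translate directly to the following helpers (y_obs, y_pred, and fold_rmse are placeholder names):

    # RMSE on a single test fold
    rmse <- function(y_obs, y_pred) sqrt(sum((y_obs - y_pred)^2) / length(y_obs))
    # ARMSE: average of the 10 fold-wise RMSE values
    armse <- function(fold_rmse) sum(fold_rmse) / length(fold_rmse)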
  • Scenario 1: Linear Case: Set $x_1, \ldots, x_p$ as multivariate standard normal $N(0,1)$ random variables with the associated covariance structure defined as $\Sigma = \mathrm{blockdiag}(\Sigma_1^{(1)}, \ldots, \Sigma_1^{(G)}) = I_G \otimes \Sigma_1$, where ⊗ is the Kronecker product. Here, we assume that the first five predictors $[x_1, \ldots, x_5]$ are relevant, and the associated covariance structure is defined as
$$\Sigma_1(i,j) = \begin{cases} \rho & \text{if } i \neq j, \\ 1 & \text{if } i = j, \end{cases}$$
such that the first five variables have a pairwise correlation value of ρ = 0.9; likewise, the other blocks of five variables have the same correlation structure. The response is then simulated as $y = x_1 + 2x_2 + 3x_3 + 4x_4 + 5x_5 + \epsilon$, where $[x_6, \ldots, x_p]$ is the irrelevant predictor set. Note, with the covariance structure Σ defined, that the p − 5 irrelevant variables are independent and identically distributed and $\epsilon \sim N(0,1)$.
  • Scenario 2: Nonlinear Case: This follows the same structure as Scenario 1, except for the simulation of the response, which is defined as $y = 10\sin(x_1 x_2) + 20(x_3 - 0.5)^2 + 10|x_4 - 0.5| + 5(x_5 - 0.5)^3 + \epsilon$, and ρ = 0.2. A sketch of the data-generating process for both scenarios is given below.
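The sketch below illustrates the data-generating process for the two scenarios; it assumes, for simplicity, the same block-correlation structure for all p variables, and the object names are illustrative.

    library(MASS)  # mvrnorm()

    set.seed(1)
    n <- 200; p <- 50; G <- p / 5
    rho <- 0.9                                    # 0.9 for Scenario 1, 0.2 for Scenario 2
    Sigma1 <- matrix(rho, 5, 5); diag(Sigma1) <- 1
    Sigma  <- kronecker(diag(G), Sigma1)          # I_G (Kronecker product) Sigma_1
    x <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

    # Scenario 1 (linear): first five predictors relevant
    y_lin <- x[, 1] + 2 * x[, 2] + 3 * x[, 3] + 4 * x[, 4] + 5 * x[, 5] + rnorm(n)

    # Scenario 2 (nonlinear)
    y_nl <- 10 * sin(x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
            10 * abs(x[, 4] - 0.5) + 5 * (x[, 5] - 0.5)^3 + rnorm(n)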

4.1. Simulation Results

Table 2 summarizes the 10-fold cross-validation simulation of a Gaussian response for the seven methods. As expected for Scenario 1 with the linearity assumption, LASSO takes the lead, followed by the new method BRF. Also, the ARMSE increases with p for most of the methods, except GBM, which is unaffected by the increase in p. RF also performs much better than the other ensemble methods, BF, BART1, BART2, and most especially GBM. Although an increase in p degrades the performance of RF significantly, the situation is different for BRF, as the increase in p corresponds to only a marginal increase in the ARMSE. BART2 tends to outperform BART1 more in the low-dimensional case than in the high-dimensional situation. BF performs better than BART1 and BART2; however, BRF still takes the lead within the Bayesian class of models. The boxplot in Figure 5 corroborates the findings in Table 2, with the median RMSEs of BRF and LASSO being the lowest over the different data dimensions.
The box and whisker plot for GBM (blue) was observed to be the highest in all data dimension situations. Table 3 summarizes the 10-fold cross-validation simulation of the Gaussian response for the seven methods when the nonlinear model is assumed. The performance of all methods degrades drastically when compared to the linear case in Table 2. LASSO performs the worst, as expected in this situation. The performances of BART1 and BART2 are again better in the low-dimensional situation where p < n, precisely for p = 50 and 100. However, their performances deteriorate quickly as p approaches 500 and are in fact worse than that of LASSO as p approaches 10,000. The GBM performance is again unaffected by the increase in p, but it is not much different from that of LASSO. RF, and likewise BF, performs moderately better than BART1 and BART2 for p > 1000. BRF simultaneously achieves robustness to the increase in p and maintains the lowest RMSE in both low- and high-dimensional settings when compared with the six other competing methods. The boxplot in Figure 6 corroborates the findings in Table 3, with the median RMSE of BRF being the lowest for p > 500.

4.2. Variable Selection

The two scenario models (linear and nonlinear) were investigated to determine the best method in terms of the selection of the five relevant variables imposed. Table 4 presents the results of the variable selection performance of BRF alongside the competing methods. For the linear model, the average proportion of relevant variables identified using LASSO is 1 and constant over all six data dimensions used. This result corroborates the findings in Table 2, where LASSO was found to be the best in terms of the lowest ARMSE. All five relevant variables were correctly identified by LASSO under the linearity assumption. BRF competes favorably with LASSO, identifying about 4/5 of the relevant variables up to p = 5000. BRF also consistently identified all the relevant variables in the low-dimensional conditions with p = 50 and 100. The performances of RF, BF, and GBM are very similar, with GBM slightly above RF and BF. BART2 also consistently identified 4/5 of the relevant variables up to p = 1000. However, its performance at p = 5000 and 10,000 is not presented due to computational difficulty in computing the probability of inclusion for p > 1000. The lowest performance was observed with BART1 over all the data dimensions used.
For the nonlinear condition, none of the methods achieved 100% identification, as the functional path is now rough, but BRF is still the best for p = 50, and it converges to 2/5 from p = 1000. The LASSO performance is not consistent here, which also corroborates the high ARMSE observed in Table 3. BART2 competes with BRF at various levels of p and is in fact the highest for p ≤ 1000. Similarly poor performance was observed for BART1 under the nonlinear condition.

4.3. Predicting Tumor Size and Biomarker Score

Three real-life cancer datasets were used for the prediction of tumor size and biomarker score. The two breast cancer datasets were used to predict the size of the tumor before the patients underwent chemotherapy. The third dataset was used to predict the biomarker score of lung cancer for patients with a history of smoking. A detailed description of the datasets can be found below:
  • Breast1 Cancer: In [60], 22,283 gene expression profiles were obtained using the Affymetrix Human Genome U133A Array in 61 patients prior to chemotherapy. The pre-chemotherapy sizes of the tumors were recorded for both negative estrogen receptor (ER−) and positive estrogen receptor (ER+) patients. A preliminary analysis carried out on the dataset using a Bayesian t-test revealed that only 7903 genes are relevant at some specific threshold.
  • Breast2 Cancer: In [61], 22,575 gene expression profiles were obtained using a 60-mer oligonucleotide array in 60 patients with ER-positive primary breast cancer treated with tamoxifen monotherapy for 5 years. Data were generated from whole-tissue sections of breast cancers. The pre-chemotherapy sizes of the tumors were recorded for both negative estrogen receptor (ER−) and positive estrogen receptor (ER+) patients. A preliminary analysis carried out on the dataset using a Bayesian t-test revealed that only 4808 genes are relevant at some specific threshold.
  • Lung Cancer: In [62], 22,215 gene expression profiles were obtained using the Affymetrix-suggested protocol on 163 patients. The biomarker score to detect the presence or absence of lung cancer was recorded alongside the gene expression profile. A preliminary analysis carried out on the dataset using a Bayesian t-test revealed that only 7187 genes are relevant at some specific threshold.
The RMSEs of the methods were obtained for the test sets arising from the 10-fold cross-validation. Table 5 summarizes the ARMSE for the test sets over the 10-fold cross-validation. For Breast1 and Breast2, BRF was found to be the best, with the lowest ARMSE. In terms of ranking, RF was found to be in second position when compared with the other methods. For the prediction of the biomarker score, the best method is LASSO, with the lowest ARMSE. On average, BRF has the lowest ARMSE over the three datasets. The standard error of the mean (SEM) estimates measure the relative spread of the RMSE for each dataset. The SEM results show that the most stable method is BRF, with the least SEM over most datasets except Lung.
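For reference, the SEM reported in parentheses in Table 5 corresponds to the usual standard error of the fold-wise RMSE values; a one-line R sketch under that assumption (fold_rmse is a placeholder vector of the 10 fold RMSEs):

    sem <- function(fold_rmse) sd(fold_rmse) / sqrt(length(fold_rmse))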

5. Discussion of Results

BRF achieves impressive results because it employs Bayesian estimation at the tree node parameter stage and combines a greedy and a random search to select splitting variables. In contrast, RF fails because it randomly selects variables without considering their importance. Random search is adequate for low-dimensional cases, as seen in various simulation conditions. However, as the number of irrelevant variables increases, the performance of random search deteriorates significantly. For example, in a simulation with five relevant variables, the probabilities of selecting at least one relevant variable when $mtry = \sqrt{p}$ are approximately 0.546, 0.416, 0.202, 0.150, 0.06, and 0.05 for p = 50, 100, 500, 1000, 5000, and 10,000, respectively. This demonstrates that, as the data dimension grows with a fixed sample size n, more irrelevant variables are selected, resulting in a poor model fit.
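These probabilities can be reproduced, up to rounding and the exact choice of mtry (assumed here to be the square root of p), from the hypergeometric distribution:

    # Probability that a random draw of mtry candidate variables contains at least
    # one of the 5 relevant variables
    p_vals <- c(50, 100, 500, 1000, 5000, 10000)
    mtry   <- round(sqrt(p_vals))
    1 - dhyper(0, m = 5, n = p_vals - 5, k = mtry)
    # approx. 0.546 0.416 0.202 0.150 0.069 0.049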
The new approach, BRF, directly addresses this issue by ensuring that only relevant variables are used, regardless of the dataset's dimension. This is akin to what GBM does, as it assesses the influence of each variable on the response. However, BRF surpasses GBM due to its application of Bayesian estimation methods and a robust data-driven prior technique. Moreover, it is clear that BRF's performance relies on correctly identifying variables during the greedy search. If irrelevant variables are ranked higher than relevant ones, performance will suffer, emphasizing the need for a robust procedure for preliminary variable ranking. While the bootstrap prior technique performed reasonably well in both the linear and nonlinear scenarios, the accuracy of BRF could be further improved by introducing a more effective subset selection procedure.

6. Conclusions

This paper investigated the strengths and flaws of random forest (RF) for modeling high-dimensional data. A major weakness of RF methods is that they are not governed by any statistical model and thus cannot provide probabilistic results as in the Bayesian setting. Another critical issue with RF methods arises in high-dimensional data with a large number of predictors but a small number of relevant ones. The performance of RF tends to deteriorate as the dimension of the data grows under this condition. These two issues motivated the development of the Bayesian random forest (BRF) presented in this paper. The theoretical results revealed that BRF satisfies the oracle properties under mild regularity conditions. Furthermore, the various empirical results from the simulation and real-life data analyses established that BRF is more consistent and efficient than the other competing methods for modeling nonlinear functional relationships in low- and high-dimensional situations. Also, BRF was found to be better than the competing Bayesian methods, especially in high-dimensional settings. A potential limitation of the proposed BRF method stems from the assumptions of independence, zero mean, and constant variance that are relied upon during the greedy variable ranking process.

Author Contributions

Conceptualization, O.R.O. and A.R.R.A.; methodology, O.R.O.; software, O.R.O.; validation, O.R.O. and A.R.R.A.; formal analysis, O.R.O.; investigation, O.R.O.; resources, A.R.R.A.; data curation, O.R.O.; writing—original draft preparation, O.R.O.; writing—review and editing, O.R.O.; visualization, A.R.R.A.; supervision, O.R.O.; project administration, O.R.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gohil, S.H.; Iorgulescu, J.B.; Braun, D.A.; Keskin, D.B.; Livak, K.J. Applying high-dimensional single-cell technologies to the analysis of cancer immunotherapy. Nat. Rev. Clin. Oncol. 2021, 18, 244–256. [Google Scholar] [CrossRef] [PubMed]
  2. Quist, J.; Taylor, L.; Staaf, J.; Grigoriadis, A. Random forest modelling of high-dimensional mixed-type data for breast cancer classification. Cancers 2021, 13, 991. [Google Scholar] [CrossRef] [PubMed]
  3. Nederlof, I.; Horlings, H.M.; Curtis, C.; Kok, M. A high-dimensional window into the micro-environment of triple negative breast cancer. Cancers 2021, 13, 316. [Google Scholar] [CrossRef] [PubMed]
  4. Olaniran, O.R.; Olaniran, S.F.; Popoola, J.; Omekam, I.V. Bayesian Additive Regression Trees for Predicting Colon Cancer: Methodological Study (Validity Study). Turk. Klin. J. Biostat. 2022, 14, 103–109. [Google Scholar] [CrossRef]
  5. Olaniran, O.R.; Abdullah, M.A.A. Bayesian weighted random forest for classification of high-dimensional genomics data. Kuwait J. Sci. 2023, 50, 477–484. [Google Scholar] [CrossRef]
  6. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2009. [Google Scholar]
  7. Olaniran, O.R. Shrinkage based variable selection techniques for the sparse Gaussian regression model: A Monte-Carlo simulation comparative study. Proc. Aip Conf. Proc. 2021, 2423, 070014. [Google Scholar]
  8. Bühlmann, P.; van De Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  9. Gündüz, N.; Fokoue, E. Predictive performances of implicitly and explicitly robust classifiers on high dimensional data. Commun. Fac. Sci. Univ.-Ank.-Ser. Math. Stat. 2017, 66, 14–36. [Google Scholar]
  10. Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  11. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  12. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  13. Hernández, B.; Raftery, A.E.; Pennington, S.R.; Parnell, A.C. Bayesian additive regression trees using Bayesian model averaging. Stat. Comput. 2018, 28, 869–890. [Google Scholar] [CrossRef]
  14. Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and Regression Trees; Taylor & Francis: Abingdon, UK, 1984. [Google Scholar]
  15. Hwang, K.; Lee, K.; Park, S. Variable selection methods for multi-class classification using signomial function. J. Oper. Res. Soc. 2017, 68, 1117–1130. [Google Scholar] [CrossRef]
  16. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  17. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: Boca Raton, FL, USA, 1994. [Google Scholar]
  18. Breiman, L. Arcing classifier (with discussion and a rejoinder by the author). Ann. Stat. 1998, 26, 801–849. [Google Scholar] [CrossRef]
  19. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  20. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  21. Hastie, T.; Friedman, J.; Tibshirani, R. Overview of Supervised Learning; Springer: New York, NY, USA, 2010. [Google Scholar]
  22. Janitza, S.; Celik, E.; Boulesteix, A.L. A computationally fast variable importance test for random forests for high-dimensional data. Adv. Data Anal. Classif. 2018, 12, 885–915. [Google Scholar] [CrossRef]
  23. Altmann, A.; Toloşi, L.; Sander, O.; Lengauer, T. Permutation importance: A corrected feature importance measure. Bioinformatics 2010, 26, 1340–1347. [Google Scholar] [CrossRef]
  24. Nguyen, T.T.; Huang, J.Z.; Wu, Q.; Nguyen, T.T.; Li, M.J. Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests. BMC Genom. 2015, 16, S5. [Google Scholar] [CrossRef]
  25. Wang, Q.; Nguyen, T.T.; Huang, J.Z.; Nguyen, T.T. An efficient random forests algorithm for high dimensional data classification. Adv. Data Anal. Classif. 2018, 12, 953–972. [Google Scholar] [CrossRef]
  26. Ghosh, D.; Cabrera, J. Enriched random forest for high dimensional genomic data. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 19, 2817–2828. [Google Scholar] [CrossRef]
  27. Sage, A.J.; Liu, Y.; Sato, J. From Black Box to Shining Spotlight: Using Random Forest Prediction Intervals to Illuminate the Impact of Assumptions in Linear Regression. Am. Stat. 2022, 76, 414–429. [Google Scholar] [CrossRef]
  28. Chipman, H.A.; George, E.I.; McCulloch, R.E. BART: Bayesian additive regression trees. Ann. Appl. Stat. 2010, 4, 266–298. [Google Scholar] [CrossRef]
  29. Linero, A.R.; Basak, P.; Li, Y.; Sinha, D. Bayesian survival tree ensembles with submodel shrinkage. Bayesian Anal. 2022, 17, 997–1020. [Google Scholar] [CrossRef]
  30. Linero, A.R. Bayesian regression trees for high-dimensional prediction and variable selection. J. Am. Stat. Assoc. 2018, 113, 626–636. [Google Scholar] [CrossRef]
  31. Linero, A.R.; Yang, Y. Bayesian regression tree ensembles that adapt to smoothness and sparsity. J. R. Stat. Soc. Ser. Stat. Methodol. 2018, 80, 1087–1110. [Google Scholar] [CrossRef]
  32. Linero, A.R.; Sinha, D.; Lipsitz, S.R. Semiparametric mixed-scale models using shared Bayesian forests. Biometrics 2020, 76, 131–144. [Google Scholar] [CrossRef]
  33. Krueger, R.; Bansal, P.; Buddhavarapu, P. A new spatial count data model with Bayesian additive regression trees for accident hot spot identification. Accid. Anal. Prev. 2020, 144, 105623. [Google Scholar] [CrossRef]
  34. Clark, T.E.; Huber, F.; Koop, G.; Marcellino, M.; Pfarrhofer, M. Tail forecasting with multivariate bayesian additive regression trees. Int. Econ. Rev. 2023, 64, 979–1022. [Google Scholar] [CrossRef]
  35. Waldmann, P. Genome-wide prediction using Bayesian additive regression trees. Genet. Sel. Evol. 2016, 48, 1–12. [Google Scholar] [CrossRef]
  36. Kim, C. Bayesian additive regression trees in spatial data analysis with sparse observations. J. Stat. Comput. Simul. 2022, 92, 3275–3300. [Google Scholar] [CrossRef]
  37. Breiman, L. Stacked regressions. Mach. Learn. 1996, 24, 49–64. [Google Scholar] [CrossRef]
  38. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef]
  39. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1. [Google Scholar] [CrossRef] [PubMed]
  40. Olaniran, O.R.; Olaniran, S.F.; Popoola, J. Bayesian regularized neural network for forecasting naira-USD exchange rate. In Proceedings of the International Conference on Soft Computing and Data Mining, Virtual, 30–31 May 2022; pp. 213–222. [Google Scholar]
  41. Chipman, H.A.; George, E.I.; McCulloch, R.E. Bayesian CART model search. J. Am. Stat. Assoc. 1998, 93, 935–948. [Google Scholar] [CrossRef]
  42. Taddy, M.A.; Gramacy, R.B.; Polson, N.G. Dynamic trees for learning and design. J. Am. Stat. Assoc. 2011, 106, 109–123. [Google Scholar] [CrossRef]
  43. Olaniran, O.R.; Abdullah, M.A.A.B.; Affendi, M.A. BayesRandomForest: An R implementation of Bayesian Random Forest for Regression Analysis of High-dimensional Data. Rom. Stat. Rev. 2018, 66, 95–102. [Google Scholar]
  44. Johnson, N.L.; Kemp, A.W.; Kotz, S. Univariate Discrete Distributions; John Wiley & Sons: New York, NY, USA, 2005. [Google Scholar]
  45. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429. [Google Scholar] [CrossRef]
  46. Shi, G.; Lim, C.Y.; Maiti, T. High-dimensional Bayesian Variable Selection Methods: A Comparison Study. Calcutta Stat. Assoc. Bull. 2016, 68, 16–32. [Google Scholar] [CrossRef]
  47. Heinze, G.; Wallisch, C.; Dunkler, D. Variable selection—A review and recommendations for the practicing statistician. Biom. J. 2018, 60, 431–449. [Google Scholar] [CrossRef]
  48. Hastie, T.; Tibshirani, R.; Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations; CRC Press: Boca Raton, FL, USA, 2015. [Google Scholar]
  49. Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis; CRC Press: Boca Raton, FL, USA, 2013. [Google Scholar]
  50. Denison, D.G.; Holmes, C.C.; Mallick, B.K.; Smith, A.F. Bayesian Methods for Nonlinear Classification and Regression; John Wiley & Sons: Hoboken, NJ, USA, 2002. [Google Scholar]
  51. Olaniran, O.R.; Yahya, W.B. Bayesian Hypothesis Testing of Two Normal Samples using Bootstrap Prior Technique. J. Mod. Appl. Stat. Methods 2017, 16, 34. [Google Scholar] [CrossRef]
  52. Olaniran, O.R.; Abdullah, M.A.A. Bayesian variable selection for multiclass classification using Bootstrap Prior Technique. Austrian J. Stat. 2019, 48, 63–72. [Google Scholar] [CrossRef]
  53. Olaniran, O.R.; Abdullah, M.A.A. Bayesian analysis of extended cox model with time-varying covariates using bootstrap prior. J. Mod. Appl. Stat. Methods 2020, 18, 7. [Google Scholar] [CrossRef]
  54. Laird, N.M.; Louis, T.A. Empirical Bayes confidence intervals based on bootstrap samples. J. Am. Stat. Assoc. 1987, 82, 739–750. [Google Scholar] [CrossRef]
  55. Pratola, M.T. Efficient Metropolis—Hastings proposal mechanisms for Bayesian regression tree models. Bayesian Anal. 2016, 11, 885–911. [Google Scholar] [CrossRef]
  56. Presnell, B. A Geometric Derivation of the Cantor Distribution. Am. Stat. 2022, 76, 73–77. [Google Scholar] [CrossRef]
  57. Karbauskaitė, R.; Dzemyda, G. Fractal-based methods as a technique for estimating the intrinsic dimensionality of high-dimensional data: A survey. Informatica 2016, 27, 257–281. [Google Scholar] [CrossRef]
  58. Bleich, J.; Kapelner, A.; George, E.I.; Jensen, S.T. Variable selection for BART: An application to gene regulation. Ann. Appl. Stat. 2014, 8, 1750–1781. [Google Scholar] [CrossRef]
  59. Ishwaran, H.; Kogalur, U.B.; Blackstone, E.H.; Lauer, M.S. Random survival forests. Ann. Appl. Stat. 2008, 2, 841–860. [Google Scholar] [CrossRef]
  60. Iwamoto, T.; Bianchini, G.; Booser, D.; Qi, Y.; Coutant, C.; Ya-Hui Shiang, C.; Santarpia, L.; Matsuoka, J.; Hortobagyi, G.N.; Symmans, W.F.; et al. Gene pathways associated with prognosis and chemotherapy sensitivity in molecular subtypes of breast cancer. J. Natl. Cancer Inst. 2010, 103, 264–272. [Google Scholar] [CrossRef]
  61. Ma, X.J.; Wang, Z.; Ryan, P.D.; Isakoff, S.J.; Barmettler, A.; Fuller, A.; Muir, B.; Mohapatra, G.; Salunga, R.; Tuggle, J.T.; et al. A two-gene expression ratio predicts clinical outcome in breast cancer patients treated with tamoxifen. Cancer Cell 2004, 5, 607–616. [Google Scholar] [CrossRef]
  62. Gustafson, A.M.; Soldi, R.; Anderlind, C.; Scholand, M.B.; Qian, J.; Zhang, X.; Cooper, K.; Walker, D.; McWilliams, A.; Liu, G.; et al. Airway PI3K pathway activation is an early and reversible event in lung cancer development. Sci. Transl. Med. 2010, 2, 1–25. [Google Scholar] [CrossRef]
Figure 1. Probability of selecting relevant variables for RF at varying dimensionality p.
Figure 2. Simulation plot 1 of the hybrid algorithm for Bayesian random forest regression.
Figure 3. Simulation plot 2 of the hybrid algorithm for Bayesian random forest regression.
Figure 4. Single regression tree plot from the forest of five trees using Bayesian random forest (BRF) hybrid Gibbs and MH procedure. The colored box corresponds to the terminal node or final prediction, and the white box corresponds to the decision node or split point.
Figure 5. Boxplot of test 10-fold cross-validation RMSE of Scenario 1. The black middle line in each box represents the median. The dots represent outliers in the RMSE results. The outliers in GBM are the highest.
Figure 6. Boxplot of test 10-fold cross-validation RMSE of Scenario 2 for Gaussian response. The black middle line in each box represents the median.
Table 1. Tuning parameters set-up for the various methods used in the data analysis.

Method   Tuning Parameter Set-Up
LASSO    λ ∈ [0, 1] is selected via 10-fold cross-validation. Other settings are default, as in glmnet.
GBM      Number of trees is fixed at 1000; all other settings are default.
RF       mtry is the default p/3; number of trees is fixed at 1000. Other settings are default.
BART1    All settings are default.
BART2    All settings are default.
BF       mtry is the default; number of trees is fixed at 1000; case.wt ∼ exp(1). Other settings are default.
BRF      mtry is the default p/3; number of trees is fixed at 1000; search type is random; split weight is obtained using F_{(p_k)}(t).
Table 2. Average test root mean square error (ARMSE) over 10-fold cross-validation for Scenario 1.

Scenario 1: Linear
Method    p = 50    p = 100   p = 500   p = 1000   p = 5000   p = 10,000
BRF       2.005     2.012     2.084     2.113      2.114      2.156
GBM       8.853     8.861     8.843     8.854      8.854      8.847
LASSO     0.950     0.949     0.952     0.958      0.967      0.977
RF        2.247     2.296     2.513     2.603      2.824      2.947
BF        2.568     2.627     2.855     3.022      3.185      3.590
BART1     3.843     4.521     2.763     3.126      5.364      7.210
BART2     2.596     3.113     3.007     3.621      5.658      8.395
Table 3. Average test root mean square error (ARMSE) over 10-fold cross-validation for Scenario 2.

Scenario 2: Nonlinear
Method    p = 50    p = 100   p = 500   p = 1000   p = 5000   p = 10,000
BRF       19.498    19.896    19.889    19.906     21.637     22.543
GBM       30.355    30.352    30.444    30.551     30.708     30.964
LASSO     32.404    32.453    32.585    33.526     34.626     35.230
RF        20.664    21.264    23.266    23.742     25.043     25.954
BF        23.288    22.993    26.062    26.095     29.121     28.210
BART1     16.844    16.493    24.151    28.180     31.917     38.156
BART2     14.037    18.071    22.193    27.450     33.522     37.932
Table 4. Average proportion of relevant variables selected in 10-fold cross-validation.

Method    p = 50    p = 100   p = 500   p = 1000   p = 5000   p = 10,000
Linear
BRF       1.00      0.98      0.90      0.84       0.76       0.68
RF        0.96      0.84      0.80      0.78       0.68       0.64
BF        0.98      0.84      0.76      0.68       0.64       0.64
GBM       0.96      0.84      0.82      0.80       0.76       0.76
LASSO     1.00      1.00      1.00      1.00       1.00       1.00
BART1     0.76      0.86      0.82      0.80       0.48       0.24
BART2     0.88      0.88      0.84      0.80       -          -
Nonlinear
BRF       0.66      0.58      0.44      0.40       0.40       0.40
RF        0.60      0.60      0.44      0.42       0.40       0.38
BF        0.58      0.58      0.40      0.36       0.36       0.26
GBM       0.58      0.56      0.40      0.40       0.38       0.34
LASSO     0.62      0.72      0.42      0.40       0.40       0.40
BART1     0.54      0.56      0.40      0.30       0.14       0.10
BART2     0.62      0.64      0.52      0.44       -          -
Table 5. Average test RMSE (and standard error) over 10-fold cross-validation for the regression cancer datasets.

Dataset    BRF        GBM        LASSO      RF         BF         BART1      BART2
Breast1    1.014      1.131      1.283      1.117      1.120      1.128      1.123
           (0.071)    (1.086)    (0.258)    (0.673)    (0.749)    (0.661)    (1.380)
Breast2    0.347      0.450      0.458      0.448      0.449      0.456      0.452
           (0.048)    (0.352)    (0.246)    (0.298)    (0.576)    (0.139)    (0.239)
Lung       1.099      5.243      0.825      2.287      2.420      1.589      1.934
           (0.243)    (1.125)    (0.309)    (0.162)    (0.298)    (0.717)    (0.255)