Optimal Feature Set Size in Random Forest Regression

Abstract: One of the most important hyper-parameters in the Random Forest (RF) algorithm is the feature set size used to search for the best partitioning rule at each node of trees. Most existing research on feature set size has focused primarily on classification problems. We studied the effect of feature set size in the context of regression. Through experimental studies using many datasets, we first investigated whether RF regression predictions are affected by the feature set size. Then, we found a rule associated with the optimal size based on the characteristics of each dataset. Lastly, we developed a search algorithm for estimating the best feature set size in RF regression. We showed that the proposed search algorithm can provide improvements over other choices, such as using the default size specified in the randomForest R package or using the common grid search method.


Introduction
Random forest [1] is one of the most successful ensemble machine learning methods because it has many advantages. It is suitable for large data, with fast model fitting and evaluation; is applicable to both regression and classification; is robust to outliers; can handle both simple linear and complicated nonlinear associations; and produces competitive prediction accuracy for high-dimensional data [2,3]. These advantages have led RF to attract increasing interest in a variety of research fields. In particular, it has been actively used with great success in many biomedical applications. For example, RF has been successfully utilized in cancer prognosis and diagnosis [4,5], to predict infectious diseases with high accuracy [6,7], and to identify disease-associated genes in microarray data [8,9].
The RF prediction performance can be further improved by tuning its hyper-parameters, the factors that determine the learning process, which are usually set by users. In the RF algorithm, there are three main hyper-parameters that highly affect the performance [10]: (1) ntrees, which controls how many trees are constructed in the ensemble; (2) nodesize, which decides the size of each tree by controlling the minimum number of observations in terminal nodes; and (3) mtry, which controls the feature set size used to search for the best split rule at each node of trees. Many studies have explored the influence of these three hyper-parameters on the RF prediction performance. Oshiro et al. [11] and Probst and Boulesteix [12] investigated how to obtain the optimal number of trees. Han and Kim [13] examined the effect of tree size and developed a new ensemble method for improving the RF model by growing deeper trees. Bernard et al. [14], Goldstein et al. [15], and Han and Kim [16] studied the impact of the feature set size on the prediction performance in the context of classification.
The feature set size, i.e., mtry, can be considered the most salient hyper-parameter in the sense that it controls both the accuracy of individual trees and the diversity between pairs of trees in the ensemble [1]. Setting a large value produces accurate individual trees, but the trees will be similar to each other; setting a small value enhances the diversity between trees, but each tree will have poor prediction accuracy. These two indicators have a trade-off relationship, which must be balanced to achieve the best RF prediction accuracy. Hence, using an appropriate feature set size is the most important task in RF model fitting for obtaining good prediction accuracy. Unfortunately, no theoretical result is available for selecting the optimal feature set size that provides the highest accuracy, and, in most cases, the default value specified in a software package is used. For example, the default mtry value is p/3 for the randomForest R package and p for the RandomForestRegressor in Python's sklearn.ensemble package, where p denotes the number of features.
Various hyper-parameter optimization algorithms have been applied to obtain an appropriate feature set size. (1) Manual tuning is a traditional optimization method performed by users by hand. It depends on a trial-and-error process, so it is effective only for experienced users. (2) Random search randomly selects feature set sizes, evaluates the RF model for each size, and then estimates the optimal size based on the RF predictions. It is very fast, since only a small number of candidates is considered, but the true optimal size can be missed. (3) Grid search has a similar mechanism to random search, except that every feature set size is explored. It can estimate the true optimal size with high accuracy, but it is computationally intensive. (4) Recently, a Bayesian model-based hyper-parameter optimization algorithm called BOA [17] has been increasingly adopted in the machine learning and deep learning communities [18,19]. This method selects the hyper-parameters of the next step using a probability model constructed from the previous steps. We propose a novel algorithm that can replace the grid search method more efficiently without using a probability model.
We investigated the effect of feature set size on the RF prediction performance in the context of regression because most existing studies have focused on classification problems. There are not enough references for regression problems, even though classification and regression are quite different analysis tasks. Using a total of 56 real and artificial datasets, we first investigated whether the RF prediction performance is affected by the feature set size, and then we found a rule associated with the optimal size for each dataset. Finally, we developed a search algorithm that combines a typical grid search with two unique concepts for estimating the best size in RF regression. We compared the proposed method with the typical grid search algorithm. In addition, we compared the size estimated by the proposed method with the optimal size and the default size specified in the randomForest R package.
The rest of the paper is organized as follows. Section 2 begins with a brief introduction to the RF algorithm in regression. Through experimental studies, we first explore the influence of feature set size on the RF prediction performance, and then we examine relationships between the optimal feature set size and the characteristics of given datasets. In Section 3, we propose a search algorithm for estimating the best size in regression. In Section 4, we study the impact of our proposed search algorithm on the RF prediction by comparing it with the default size and the optimal size. In addition, we compare the proposed algorithm with a typical grid search method. The paper ends with conclusions in Section 5.

Random Forest Algorithm in Regression
RF, a popular ensemble machine learning method, can be viewed as a modified version of bagging [20], since the two share conceptually similar algorithms; the main difference is that RF further reduces the variance of the prediction model by combining more de-correlated trees than bagging. The RF algorithm starts by generating bootstrapped samples from a training dataset. Multiple regression trees are fitted on the bootstrapped samples. When the trees are constructed, RF uses a random feature subset, instead of all features, to find the best split rule at each split of trees. The results of each tree are combined to produce a final prediction. The detailed RF algorithm for regression is described in Algorithm 1.

Algorithm 1 The RF algorithm in regression.

Training Phase:
Given:
- D: training set with n observations, p features, and the response variable.
- B: number of regressors in the ensemble.

Procedure:
For b = 1 to B:
  1. Generate a bootstrapped sample D*_b from the training set D.
  2. Grow a regression tree using the bootstrapped sample D*_b. For a given node t,
     (i) randomly sample mtry features from the full set of features;
     (ii) find the best split rule using the random feature subset;
     (iii) split the node t into two child nodes using the best split rule.
     Repeat (i)-(iii) until the stopping rules are met.
  3. Obtain a trained regressor R_b.

Test Phase:
For a test instance x, the prediction estimated by the B regressors is given as:

  ŷ(x) = (1/B) Σ_{b=1}^{B} R_b(x)
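The two phases of Algorithm 1 reduce to a bootstrap draw and a simple average. A minimal sketch in Python, purely for illustration (the function names and the stand-in constant "trees" are ours, not from any package):

```python
import random
import statistics

def bootstrap_sample(data, rng):
    """Training phase, step 1: draw n observations with replacement."""
    return rng.choices(data, k=len(data))

def rf_predict(regressors, x):
    """Test phase: average the predictions of the B trained regressors."""
    return statistics.fmean(r(x) for r in regressors)

rng = random.Random(42)
print(bootstrap_sample([1, 2, 3, 4, 5], rng))   # a resample of size 5

# Toy ensemble of three constant "trees" standing in for fitted regressors:
trees = [lambda x: 1.0, lambda x: 2.0, lambda x: 3.0]
print(rf_predict(trees, x=None))                # 2.0
```

In a real fit, each regressor would be a tree grown on its own bootstrapped sample with random feature subsets at each node, as in the procedure above.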
Influence on Prediction Performance
In the literature, three mtry values, 1, log2(p) + 1, and p/3, have been primarily used in RF model fitting. The first two were introduced by Breiman [1], and the last was recommended for regression problems by Hastie et al. [2] and Liaw and Wiener [21]. We consider p/3 as the default mtry value throughout this paper.
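These three conventional choices are easy to compute; a small helper (ours, not from any package, and the floor-rounding convention is our assumption, following common implementations):

```python
import math

def common_mtry_values(p):
    """The three mtry values most often used in the literature:
    1, log2(p) + 1 (Breiman [1]), and p/3 (the regression default).
    Floor rounding is assumed for the non-integer cases."""
    return {
        "one": 1,
        "log2_rule": math.floor(math.log2(p)) + 1,
        "default": max(1, p // 3),
    }

print(common_mtry_values(15))  # {'one': 1, 'log2_rule': 4, 'default': 5}
```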
We conducted experimental studies to investigate the influence of the feature set size on the RF prediction accuracy in regression. The experiments were based on 56 real or artificial datasets that were used in other studies or came from the UCI data repository [22]. Tables 1 and 2 contain information about the datasets. The design of the experiments is as follows: we randomly split each dataset into a 60% training set for fitting and a 40% test set for evaluation. For the RF model fitting, we used the full range of values from 1 to p as the value of mtry. We set the number of trees to 100, which is a common choice in many RF applications [15,16,43]. For evaluation, the root mean squared error (RMSE) was calculated to measure the RF prediction performance. All experiments were repeated 100 times to obtain stable results.
The results are organized into three figures. First, Figure 1 compares the RMSE for different mtry values using four representative datasets. The large fluctuation of the RMSE means that the performance of the RF is severely affected by the mtry value. Moreover, the results clearly show that the optimal mtry, which achieves the smallest RMSE, differs from the common choices in all four datasets.

Second, in Figure 2, we explore a relative distance, defined as (optimal mtry − default mtry)/p, to measure the difference between the optimal size and the default size. There are only four datasets where the default size matches the optimal size; in most cases, the default size fails to offer the best RF prediction accuracy. In addition, there are more datasets where the optimal size is larger than the default size than datasets in the opposite case.

Last, Figure 3 compares the optimal size and the default size using a relative RMSE, defined as log(RMSE with default mtry / RMSE with optimal mtry). A relative RMSE greater than 0 means that the optimal size achieves more accurate prediction than the default size. The RF model with the optimal size is clearly more accurate than that with the default size because almost all boxes are located above 0. The performance difference is large when the optimal size is larger than the default size, but quite small when the optimal size is smaller than the default size.

In summary, through the experimental results, we observed that: (1) the mtry value plays an important role in the RF prediction performance; (2) the default mtry value cannot guarantee the best RF performance in regression problems; (3) the optimal mtry value differs from dataset to dataset; and (4) there is no clear pattern for the optimal size, so it is difficult to guess the best mtry value in advance.

Figure 3. Comparison of the RF prediction accuracies using the optimal mtry and the default mtry. The x-axis indicates the 56 datasets in the same order as in Figure 2. The y-axis indicates the relative RMSE. Each box-plot is based on 100 relative RMSEs.
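The two diagnostics used in Figures 2 and 3 are straightforward to compute once the optimal mtry and the RMSEs are known. A minimal sketch (the function names and the example numbers are ours):

```python
import math

def relative_distance(optimal_mtry, default_mtry, p):
    """Figure 2 diagnostic: positive when the optimal size exceeds the
    default size, scaled by the number of features p."""
    return (optimal_mtry - default_mtry) / p

def relative_rmse(rmse_default, rmse_optimal):
    """Figure 3 diagnostic: log of the RMSE ratio. Values above 0 mean
    the optimal size predicts more accurately than the default."""
    return math.log(rmse_default / rmse_optimal)

# Hypothetical dataset with p = 15, default mtry = 5, optimal mtry = 9:
print(round(relative_distance(9, 5, 15), 3))   # 0.267
print(round(relative_rmse(1.10, 1.00), 3))     # 0.095
```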

Rule for Optimal Feature Set Size
In this section, we further try to find a rule associated with the optimal mtry using the characteristics of datasets. In the context of classification, Bernard et al. [14] and Goldstein et al. [15] observed that the optimal size tends to be related to the number of relevant features, where relevance means that a feature is significantly associated with the target. If there are many relevant features in a given dataset, a smaller mtry value is preferred so that the relevant features are utilized equally for splitting nodes in the tree construction. As a result, the trees will be more diverse. In this situation, a larger mtry value would produce similar trees in the ensemble because the most dominant feature would repeatedly be selected for the split rule at each node. On the other hand, if there are only a few relevant features in a given dataset, a larger mtry value may be advantageous for increasing the accuracy of individual trees. However, these observations are limited to classification problems and may not be valid in regression problems. Hence, we sought a relationship between the optimal mtry and the characteristics of given datasets, focusing on the relevance of features, in the context of regression.
How do we measure the relevance between the response variable Y and a feature X_i, where i = 1, ..., p, in regression? The typical Pearson correlation coefficient between Y and X_i can be ineffective because it captures linear associations only. Hence, we consider a modified Pearson correlation coefficient between Y and Ŷ_i, the predicted response for X_i obtained from a regression tree of Y on X_i. In detail, a decision tree of Y on X_i is fitted with pruning, and then the predicted response Ŷ_i is obtained by applying X_i to the fitted tree. We can capture both linear and nonlinear associations by using the decision tree [2]. We denote the modified Pearson correlation coefficients by R_i, i = 1, ..., p; a larger R_i gives stronger evidence that X_i is a feature relevant to the response Y.
To investigate whether the optimal mtry is related to the relevance of features, we consider a classification tree, where the target variable Y_dt is a binary outcome that is 1 if the optimal size is larger than the default size and 0 if the optimal size is smaller than or equal to the default size. We also create a variety of factors X_dt to be used in the decision tree: (1) nrel, the number of relevant features; (2) nirr, the number of irrelevant features; (3) scor, the standard deviation of R_i; (4) mcor, the mean of R_i; (5) ccor, the coefficient of variation of R_i; and (6) pn, the ratio of features to observations. The definitions of Y_dt and X_dt are summarized in Table 3.

Table 3. Description of Y_dt and X_dt for the classification tree analysis. R denotes the collection of R_i, i = 1, ..., p; Q1(R) and Q3(R) denote the first and third quartiles of R.

Attribute   Definition
Y_dt        1 if optimal mtry > default mtry; 0 otherwise
nrel        number of relevant features (based on Q3(R))
nirr        number of irrelevant features (based on Q1(R))
scor        Sd(R)
mcor        Mean(R)
ccor        Sd(R)/Mean(R)
pn          p/n

Figure 4 depicts the pruned classification tree of Y_dt on X_dt based on the 56 datasets, where scor, the standard deviation of R_i, is the most significant factor in discriminating the 56 datasets. The result shows that, for scor < 0.135, there are many datasets whose optimal size is less than or equal to the default size; for scor ≥ 0.135, most datasets have an optimal size larger than the default size. These results can be interpreted as follows. If the features in a given dataset have similar relevance to each other, the optimal size tends to be small, so setting a smaller mtry value can help achieve good prediction performance. Conversely, if the features range widely from irrelevant to relevant, the optimal size tends to be large, so a larger mtry value may be preferred. We found that the optimal feature set size may be related to the relevance of the features in a given dataset, but this is not enough to select a specific mtry in RF model fitting. Hence, in the next section, we develop a search algorithm for estimating the best mtry in RF regression.
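The factors that summarize R, and the direction rule implied by the 0.135 split, can be sketched as follows (the function names and the exact tie-handling at the threshold are ours):

```python
import statistics

SCOR_SPLIT = 0.135  # split point found by the classification tree (Figure 4)

def dataset_factors(R, n):
    """Summary factors for the relevance scores R = (R_1, ..., R_p)."""
    p = len(R)
    scor = statistics.stdev(R)        # standard deviation of R
    mcor = statistics.fmean(R)        # mean of R
    return {
        "scor": scor,
        "mcor": mcor,
        "ccor": scor / mcor,          # coefficient of variation of R
        "pn": p / n,                  # features-to-observations ratio
    }

def search_direction(R, n):
    """Widely spread relevance scores (scor >= 0.135) suggest an
    optimal mtry above the default, hence a forward search."""
    return "forward" if dataset_factors(R, n)["scor"] >= SCOR_SPLIT else "backward"

# Features ranging from irrelevant to highly relevant => forward search:
print(search_direction([0.1, 0.2, 0.3, 0.9], n=100))  # forward
```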

Search Algorithm for Optimal Feature Set Size in Random Forest Regression
We develop a search algorithm for estimating the best mtry in RF regression using the findings in Section 2. This algorithm is also motivated by Han and Kim [16].
Our proposed algorithm combines a typical grid search method with two unique concepts: (1) SearchDirection, which controls the direction of the search, "forward" or "backward"; and (2) SearchSize, which decides how many feature set sizes are examined at a time and is set to ceiling(p/10) by default. These two concepts make our proposed search algorithm more efficient than the typical grid search method by reducing the number of searches. The proposed algorithm searches using the out-of-bag mean squared error (OOB-MSE), which is computed on the out-of-bag samples; the OOB-MSE is known to be a good estimate of the true MSE [1]. In the SearchDirection = "forward" setting, the proposed algorithm searches from mtry = floor(p/3) to p in increasing order. If SearchDirection is "backward", the search is performed from mtry = floor(p/3) down to 1 in decreasing order.
We now describe our proposed search algorithm using a simple example. Suppose that SearchDirection is "forward" and a dataset has 15 features (p = 15); then, SearchSize is set to ceiling(15/10) = 2 by default. The proposed method first computes the OOB-MSE of the RF model with mtry = floor(15/3) = 5, denoted ê_5. Then, it compares ê_5 with the next two OOB-MSEs, ê_6 and ê_7, obtained from the RF models with mtry = 6 and mtry = 7, respectively, given that SearchSize is 2. If ê_5 is smaller than the minimum of ê_6 and ê_7, the algorithm estimates the best mtry as 5 and terminates the search. Otherwise, the minimum of ê_6 and ê_7 becomes the current best and is compared with the next two OOB-MSEs, ê_8 and ê_9. The algorithm iterates until the stopping criteria are met. The detailed search algorithm is given in Algorithm 2.
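The worked example generalizes directly. Below is a sketch of the search loop, not a reproduction of Algorithm 2: oob_mse is a user-supplied callable that fits an RF with the given mtry and returns its OOB-MSE (here mocked with a convex curve), and the tie-breaking details are our assumption:

```python
import math

def search_mtry(p, oob_mse, direction="forward", search_size=None):
    """Search for the best mtry starting from floor(p/3), comparing the
    current best OOB-MSE against the next SearchSize candidates and
    stopping as soon as a batch brings no improvement."""
    if search_size is None:
        search_size = math.ceil(p / 10)          # SearchSize default
    best_m = max(1, p // 3)                      # start at floor(p/3)
    best_err = oob_mse(best_m)
    step = 1 if direction == "forward" else -1
    m = best_m
    while True:
        batch = [m + step * k for k in range(1, search_size + 1)]
        batch = [b for b in batch if 1 <= b <= p]
        if not batch:                            # reached 1 or p
            return best_m
        errs = {b: oob_mse(b) for b in batch}
        cand = min(errs, key=errs.get)
        if errs[cand] >= best_err:               # no improvement: stop
            return best_m
        best_m, best_err = cand, errs[cand]
        m = batch[-1]                            # continue past this batch

# Mock OOB-MSE curve with its minimum at mtry = 9 (p = 15, as in the text):
print(search_mtry(15, lambda m: (m - 9) ** 2))  # 9
```

With the mocked curve, the search visits mtry = 5, then batches {6, 7} and {8, 9}, and stops at the batch {10, 11}, which brings no improvement; the grid search would instead evaluate all 15 sizes.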

Experimental Study
This section studies the impact of our proposed search algorithm by comparing the RF prediction accuracy obtained with the mtry estimated by our proposed algorithm against that obtained with the optimal mtry, the default mtry, and the mtry estimated by a typical grid search method. We determine the SearchDirection from the results of the classification tree in Figure 4. Specifically, if a dataset falls into node number 2, the SearchDirection is assigned as "backward"; if a dataset falls into node number 3, the SearchDirection is assigned as "forward".

Estimated Size vs. Optimal Size
The results of the comparison between the mtry estimated by our proposed algorithm and the optimal mtry are presented in two figures. First, Figure 5 compares the relative distance between the estimated mtry and the optimal mtry, with the relative distance between the optimal mtry and the default mtry shown above it as a reference. The two subfigures clearly show that the estimated mtry is closer to the optimal mtry than the default, and, in 23 out of 56 datasets, the estimated mtry exactly matches the optimal mtry. This demonstrates that the proposed search algorithm performs well in estimating the optimal mtry. Second, Figure 6 compares the relative RMSE between the estimated mtry and the optimal mtry. Since almost all boxes are located near 0, the estimated mtry and the optimal mtry produce similar RF prediction performances.

Figure 6. Comparison of the RF prediction performances using the mtry estimated by the proposed search algorithm and the optimal mtry. The x-axis and y-axis indicate the 56 datasets and the relative RMSE, respectively. Each box-plot is based on 100 relative RMSEs.

Figure 7 compares the relative RMSE between the estimated mtry and the default. The results show that using the estimated mtry offers an improvement over the default size in terms of the RF prediction because many boxes are located above 0. Interestingly, the distribution of the boxes is quite similar to that in Figure 3. This confirms that the optimal mtry and the estimated mtry have similar performance. We further conducted paired t-tests based on the 100 RMSE pairs of the estimated mtry and the default mtry for each dataset.
We observed that the RF prediction using the estimated mtry is statistically better than that using the default mtry on 38 out of 56 datasets, and, for the other 18 datasets, the differences are not statistically significant. There is no dataset where the default mtry produces better prediction than the estimated mtry.

Figure 7. Comparison of the RF prediction using the size estimated by the proposed search algorithm and the default size of p/3. The x-axis and y-axis indicate the 56 datasets and the relative RMSEs, respectively. Each box-plot is based on 100 relative RMSEs.

Figure 8 compares the mtry estimated by the proposed search algorithm with the one estimated by the typical grid search method in terms of prediction accuracy and computational cost. The typical grid search method can be considered a special version of the proposed search algorithm in which the search is always performed from mtry = 1 to p with a SearchSize of p. The results demonstrate that, while the RF prediction accuracies using the two estimated sizes do not differ, because almost all boxes are located near 0, the proposed search algorithm is much faster than the typical grid search method.

Conclusions
We focused on the influence of mtry on the RF prediction in regression problems because most existing studies about mtry have been conducted primarily on classification problems. Using a total of 56 real and artificial datasets, we carried out several experiments and observed that mtry has a significant impact on the RF regression prediction. In addition, the default mtry of p/3 cannot guarantee the highest accuracy in regression.
We also found that it is difficult to find a specific rule for estimating the best mtry, since the optimal size seems domain dependent. Fortunately, we found a relationship between the optimal size and the characteristics of datasets. If the features of a dataset have similar relevance to each other, the optimal size tends to be relatively small, so smaller mtry values are preferred. In the opposite case, larger mtry values are preferred.

Figure 8. Comparison between the proposed search algorithm and the typical grid search method in terms of the relative RMSE and the relative computing time. The x-axis represents the 56 datasets. The y-axis on the left is the log(relative RMSE), while the y-axis on the right is the relative computing time, defined as log(computing time of the proposed method / computing time of the grid search algorithm).
We developed a search algorithm for estimating the optimal mtry in RF regression, motivated by a method for RF classification [16]. The proposed search algorithm combines a typical grid search method with the two unique concepts of "SearchDirection" and "SearchSize". These two additional concepts allow our proposed search algorithm to estimate the best mtry more efficiently than the typical grid search method. In the experimental studies, we demonstrated that the size estimated by the proposed algorithm is close to the optimal size and produces better prediction accuracy than the default size. We also confirmed that the proposed algorithm provides prediction accuracy similar to the typical grid search method, but is more efficient in terms of computational cost.