Article

Enhancing Diversity and Improving Prediction Performance of Subsampling-Based Ensemble Methods

Department of Mathematics and Statistics, Wellesley College, Wellesley, MA 02481, USA
* Author to whom correspondence should be addressed.
Stats 2025, 8(4), 86; https://doi.org/10.3390/stats8040086
Submission received: 27 August 2025 / Revised: 22 September 2025 / Accepted: 24 September 2025 / Published: 26 September 2025
(This article belongs to the Section Applied Statistics and Machine Learning Methods)

Abstract

This paper investigates how diversity among training samples impacts the predictive performance of a subsampling-based ensemble. It is well known that diverse training samples improve ensemble predictions, and smaller subsampling rates naturally lead to enhanced diversity. However, this approach of achieving a higher degree of diversity often comes with the cost of a reduced training sample size, which is undesirable. This paper introduces two novel subsampling strategies—partition and shift subsampling—as alternative schemes designed to improve diversity without sacrificing the training sample size in subsampling-based ensemble methods. From a probabilistic perspective, we investigate their impact on subsample diversity when utilized with tree-based sub-ensemble learners in comparison to the benchmark random subsampling. Through extensive simulations and eight real-world examples in both regression and classification contexts, we found a significant improvement in the predictive performance of the developed methods. Notably, this gain is particularly pronounced on challenging datasets or when higher subsampling rates are employed.

1. Introduction

Regression and classification are two of the most fundamental statistical problems, both falling within the realm of supervised learning where the given dataset contains a response variable of interest. Their distinction lies in the type of response variable: when the response is quantitative, it is a regression problem; when the response is categorical, it is referred to as classification. Over the past few decades, a vast array of statistical and machine learning tools have been developed to tackle regression and classification problems. For regression, common approaches include ordinary linear regression [1] and penalized regression [2], among others. For classification, techniques such as generalized linear regression [3], neural networks [4], support vector machines [5], Bayesian methods [6], decision trees [7], and others may be applied. In this paper, we focus our attention on tree-based methods, which are versatile tools that are applicable to both regression and classification tasks.
Classification and Regression Trees (CART), first introduced by Breiman et al. [7], offer a general method for regression and classification through recursive binary splitting of the given feature space. While it is easy to visualize and straightforward to implement, CART suffers from large sampling variation, which often leads to less competitive predictive performance as compared to other existing methods. To address its drawback, Breiman [8] developed bagging (bootstrap aggregation), which averages over predictions from numerous base learners. Building on this, Breiman [9] later proposed the random forest algorithm, further decorrelating individual trees to enhance the predictive power.
In this paper, we focus on subsampling-based ensemble methods, a computationally efficient alternative to conventional ensembles. Instead of drawing bootstrap samples of the same size as the original sample size n, they employ subsamples of a smaller size k (k < n) taken without replacement as individual training datasets. The reduced training sample size directly translates into improved computational efficiency, alleviating the computational burden of conventional ensemble estimators. Moreover, Bühlmann and Yu [10] showed that subbagging, for example, can achieve predictive power comparable to that of the conventional bagging estimator. Other recent work studying subsampling-based ensemble methods includes Mentch and Hooker [11], Peng et al. [12], and Wang and Wei [13].
For subsampling-based ensemble methods like subbagging and sub-random forests, it is well recognized that diversity among training samples is crucial for the predictive performance [14]. Reducing the subsampling proportion intuitively increases the diversity. However, this comes at the cost of a smaller training sample size, negatively impacting the prediction accuracy. To address this inherent trade-off, we develop two novel subsampling schemes designed to enhance the training sample diversity without sacrificing the size.
The remainder of the paper is organized as follows: In Section 2, Materials and Methods, we begin by illustrating the relationship between the subsampling rates and predictive performance in subsampling-based ensembles in Section 2.1. Then, Section 2.2 details our proposed subsampling schemes: partition subsampling and shift subsampling. Moreover, this section also quantifies their diversity levels against the benchmark random subsampling from a probabilistic perspective. Section 3 is devoted to numerical investigations: We present extensive simulation studies in Section 3.1 for both regression and classification scenarios to evaluate our methods’ performance against the benchmark. Following this, Section 3.2 showcases eight real-world data examples in regression and classification problems. Finally, Section 4 concludes the paper with a brief summary and discussion of future work.

2. Materials and Methods

2.1. The Role of Diversity

It is widely recognized that diversity is a cornerstone for the predictive performance of ensemble estimators. Indeed, the success of ensembles in machine learning is often attributed directly to the level of diversity they embody. The previous literature has established strong connections between ensemble diversity and performance [14,15,16]. Over the past two decades, numerous methods for measuring diversity have been proposed. Diversity can arise from variations in the samples used to train base learners or from employing distinct base learning algorithms within an ensemble. In addition, ensemble diversity can also be achieved by modifying the machinery of the model-building process. Rotation forest [17] and AdaBoost [18] are two examples of this latter approach, as they create diverse models by transforming the feature space or iteratively adjusting data weights. Quantifying diversity can therefore involve metrics related to either of these sources [19]. Furthermore, diversity can also be assessed by focusing on the predictions generated by individual base learners across an ensemble [20]. While efforts have been made to develop a unified framework for diversity quantification [21], a widely accepted approach has yet to exist. In this paper, we focus on measuring diversity through the similarity between subsamples used to train individual base learners, specifically within the context of subsampling-based homogeneous ensemble methods using CART as the base learners.
For subsampling-based ensemble methods, such as subbagging, the diversity among training samples is largely influenced by the size of these training sets. Intuitively, larger training sample sizes increase the likelihood of substantial overlaps among randomly generated subsamples, thereby impairing the diversity. The success of ensemble methods is rooted in the principle “wisdom of the crowds”: individual models often make different types of errors. By aggregating the predictions of different base learners, these errors tend to cancel each other out, resulting in a more accurate final outcome. However, while a larger sample size generally reduces the variation in a single model, its impact on ensemble performance is more nuanced. If all base learners are trained on the same or similar large dataset, they are likely to be highly correlated and therefore make similar errors. In short, enhancing the diversity of an ensemble is more critical to the success of the ensemble method than increasing the sample size. To better illustrate this phenomenon, we present an analysis using the Wine dataset in this section [22]. This dataset addresses a multi-class classification task, aiming to categorize wines into one of three regions based on 13 features quantifying their chemical composition. The dataset comprises 178 observations, with 58, 65, and 47 instances for each class, corresponding to approximately 34%, 38%, and 28% of the total observations, respectively. We randomly sampled 170 observations from the raw dataset for ease of implementation of 10-fold cross validation, as described below. More information on the Wine dataset can be found in Section 3.2, as well as in the UCI Machine Learning Repository (https://archive.ics.uci.edu/).
We define the subsampling proportion, p, as the percentage of the dataset used to generate each training sample, such that k = n p , where n is the learning sample size, and k is the subsample size. To assess the effect of the subsampling proportion on the predictive performance, we utilized a subbagging estimator, aggregating 500 individual trees to produce the ensemble outcome. For each value of p ranging from 0.40 to 0.95 incremented by 0.05, we computed the 10-fold cross-validated accuracy. (Under 10-fold cross validation, n denotes the size of the delete-one-fold learning sample size.) To mitigate the impact of randomness inherent in ensemble learning, we refit the subbagging estimator 100 times, reporting the average cross-validated accuracy scores.
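To make this setup concrete, the following R sketch outlines one way to compute the 10-fold cross-validated accuracy of a subbagging classifier under the benchmark random subsampling. It assumes the Wine data are stored in a data frame named wine with a factor response column class; these names, as well as the use of single randomForest trees (ntree = 1) as base learners, are illustrative assumptions rather than the authors' exact implementation.

```r
# A sketch of the Figure 1 experiment: 10-fold cross-validated accuracy of a subbagging
# classifier under random subsampling, for one subsampling proportion p
library(randomForest)

subbag_cv_accuracy <- function(data, p, n_trees = 500, n_folds = 10) {
  folds <- sample(rep(1:n_folds, length.out = nrow(data)))   # random fold assignment
  correct <- 0
  for (f in 1:n_folds) {
    train <- data[folds != f, ]
    test  <- data[folds == f, ]
    k <- floor(p * nrow(train))                              # subsample size k = np
    votes <- replicate(n_trees, {
      idx  <- sample(nrow(train), k)                         # random subsampling (benchmark)
      tree <- randomForest(class ~ ., data = train[idx, ],
                           ntree = 1, mtry = ncol(train) - 1)  # one tree, all predictors
      as.character(predict(tree, test))
    })
    pred <- apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote
    correct <- correct + sum(pred == test$class)
  }
  correct / nrow(data)
}

# accuracy curve over subsampling proportions (averaged over many refits in the paper)
sapply(seq(0.40, 0.95, by = 0.05), function(p) subbag_cv_accuracy(wine, p))
```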
Figure 1 illustrates the relationship between accuracy and varying subsampling proportions. It is evident that increasing the subsampling proportion leads to a significant decrease in accuracy, indicating a deterioration in predictive performance. This consistent pattern was also observed in our exploration of some regression datasets, which will be presented in Section 3.2.
As revealed by Figure 1, reducing the subsampling rate offers a direct means to enhance the diversity and improve the predictive performance. Nevertheless, this approach inherently reduces the training sample size, which negatively impacts the prediction outcomes. To circumvent this trade-off, we propose two novel subsampling schemes designed to generate a more diverse set of subsamples without compromising their size. Our objective is to foster enhanced diversity among training samples while preserving their scale, thereby further augmenting the performance of subsampling-based ensembles.

2.2. Proposed Methods

In this section, we describe two novel subsampling schemes devised to maximize the diversity among training samples for subsampling-based ensemble methods, thereby enhancing predictive performance without sacrificing training sample size. Before detailing these proposed methods, we will outline some general notation, review the conventional random subsampling approach, and provide additional motivation for our developed methods.
Let B be the total number of subsamples, each of size k ( k < n ). In subbagging, B also corresponds to the number of individual trees of the ensemble. The traditional approach generates these training samples through random subsampling, drawing k instances from the n observations in a learning dataset without replacement and repeating this process B times independently. In contrast to this conventional approach, we will next discuss our two developed schemes: partition subsampling and shift subsampling.
Our proposal is inspired by the success and broad applications of previous work [23,24,25,26], in which partition and shift subsampling schemes have proven efficient and effective for U-statistic variance estimation [27,28] in applications ranging from cross validation and model comparison and selection to the assessment of the AUC (area under the ROC curve). The potential of these schemes to improve the diversity within subsampling-based ensembles motivated our research.

2.2.1. Partition Subsampling

Without loss of generality, assume n is divisible by k. Partition subsampling is applicable whenever the subsampling proportion p ≤ 1/2. It can be realized as follows: We begin by randomly shuffling the given learning sample of size n. This shuffled dataset is then systematically partitioned into n/k disjoint subsamples, each of size k. This process is repeated Bk/n times to obtain a total of B subsamples. Given a random partition, the generated n/k data subsets are inherently mutually exclusive, thereby maximizing the diversity among them. Furthermore, the partition subsampling scheme guarantees a minimum number of mutually non-overlapping subsamples (i.e., training samples) within an ensemble. Algorithm 1 outlines the detailed procedure, and Figure 2 displays the diagram of the partition subsampling scheme.
Algorithm 1 Partition Subsampling.
  • Input: Training dataset of size n, a number of subsamples to be generated B, a size k such that k ≤ n/2 for each subsample
  • Output: B subsamples, each of size k
  • while less than B subsamples have been generated do
  •     shuffle the training sample;
  •    systematically partition the shuffled training dataset into n/k non-overlapping subsamples
  • end while
  • Return B generated subsamples
Remark 1.
The assumption that n is divisible by k is set primarily for notational simplicity, and the algorithm can be easily modified to account for cases where this condition is not met. Let ⌊·⌋ and ⌈·⌉ represent the floor and ceiling of a real number, respectively. When this assumption does not hold, we partition each given learning sample of size n into ⌊n/k⌋ subsamples of size k. This process is repeated ⌈B/⌊n/k⌋⌉ times. In the final random partition, fewer than ⌊n/k⌋ subsamples may be selected to obtain a total of B subsamples.
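A minimal R sketch of Algorithm 1 is given below; it returns subsamples as lists of row indices and, for simplicity, assumes n is divisible by k as in the algorithm statement. The function name partition_subsamples is ours, not from an existing package.

```r
# A sketch of Algorithm 1: partition subsampling of row indices 1..n
partition_subsamples <- function(n, k, B) {
  stopifnot(k <= n / 2, n %% k == 0)                        # assumptions of Algorithm 1
  subsamples <- list()
  while (length(subsamples) < B) {
    shuffled <- sample(n)                                   # randomly shuffle the learning sample
    blocks   <- split(shuffled, rep(1:(n %/% k), each = k)) # n/k disjoint subsamples of size k
    subsamples <- c(subsamples, blocks)
  }
  subsamples[1:B]                                           # keep exactly B subsamples
}

subs <- partition_subsamples(n = 100, k = 20, B = 20)
length(intersect(subs[[1]], subs[[2]]))                     # 0: subsamples within a partition are disjoint
```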

2.2.2. Shift Subsampling

When the subsampling proportion p is greater than 1/2, the partition subsampling scheme is no longer applicable. To address this, we propose an alternative shift subsampling scheme that works effectively for larger subsampling rates. Shift subsampling generates pairs of subsamples, each of size k, with the minimal number of overlaps. Specifically, from a randomly shuffled learning dataset, we extract the first k instances and the last k instances to form a pair of subsamples. This pairing results in 2k − n overlaps, which is the smallest possible number of between-subsample overlaps in this context. This process is repeated B/2 times to yield a total of B subsamples. Note that when p = 1/2, the partition subsampling and shift subsampling are identical. Algorithm 2 describes this procedure in detail, and Figure 3 displays the diagram of the shift subsampling scheme.
Algorithm 2 Shift Subsampling.
  • Input: Training dataset of size n, a number of subsamples to be generated B, a size k such that n / 2 < k < n for each subsample
  • Output: B subsamples, each of size k
  • while less than B subsamples have been generated do
  •     shuffle the training sample;
  •     extract the first k and last k observations of the shuffled training dataset to form two subsamples
  • end while
  • Return B generated subsamples
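Analogously, a minimal R sketch of Algorithm 2 is shown below, again returning lists of row indices; the function name shift_subsamples is illustrative.

```r
# A sketch of Algorithm 2: shift subsampling of row indices 1..n, for n/2 < k < n
shift_subsamples <- function(n, k, B) {
  stopifnot(k > n / 2, k < n)
  subsamples <- list()
  while (length(subsamples) < B) {
    shuffled <- sample(n)                              # randomly shuffle the learning sample
    first <- shuffled[1:k]                             # the first k observations
    last  <- shuffled[(n - k + 1):n]                   # the last k observations
    subsamples <- c(subsamples, list(first, last))     # the pair shares exactly 2k - n points
  }
  subsamples[1:B]
}

subs <- shift_subsamples(n = 100, k = 60, B = 20)
length(intersect(subs[[1]], subs[[2]]))                # 2k - n = 20 overlaps within a pair
```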
Shift subsampling is particularly effective when the training sample size k is relatively large ( k > n / 2 ). In this context, random subsampling is more likely to generate data subsets with a significant amount of overlap. Shift subsampling is designed to maximize diversity between training datasets, making it a valuable scheme for larger sample sizes. Using a larger training sample size is generally desirable, as it often leads to more accurate individual predictions. Therefore, from this aspect, shift subsampling is expected to yield more accurate results than partition subsampling, which is limited to k n / 2 scenarios. These inherent design features—maximizing the diversity and utilizing larger sample sizes—explain the advantage and success of shift subsampling over other methods, as demonstrated in the numerical studies in Section 3.
Remark 2.
For random subsampling, the cost of the most efficient algorithm that generates a subsample of size k ( k < n ) is O ( k ) . Thus, creating B subsamples requires a total cost of O ( B k ) . In comparison, for partition subsampling, an initial random shuffle of the dataset of size n demands an O ( n ) effort, which yields n / k subsamples. Thus, under partition subsampling, the total cost for generating B subsamples is O ( n ( B k / n ) ) , which simplifies to O ( B k ) . Similarly, shift subsampling also starts with an O ( n ) shuffle but produces only two subsamples. Therefore, generating B subsamples demands a computational effort of order O ( n B / 2 ) , where, by design, k > n / 2 in the context of shift subsampling. Overall, the computational costs of partition subsampling, shift subsampling, and the benchmark random subsampling are all comparable.

2.2.3. Probabilistic Investigation

As the two proposed subsampling schemes entail, each of them aims to maximize diversity among training datasets. To further justify their superiority over conventional random subsampling, we take a probabilistic approach to measuring their diversity, focusing on between-subsample overlaps.
Given a subsampling strategy, let X be the number of overlaps between two randomly generated subsamples. When p ≤ 1/2 (i.e., k ≤ n/2), X takes values in {0, 1, …, k}, while X ∈ {2k − n, …, k} for p > 1/2 (i.e., k > n/2). In both scenarios, it is easy to see that, under random subsampling, the probability mass function of X can be written as
$$P(X = c \mid \text{random subsampling}) = \frac{\binom{n}{k}\binom{k}{c}\binom{n-k}{k-c}}{\binom{n}{k}^{2}} = \frac{\binom{k}{c}\binom{n-k}{k-c}}{\binom{n}{k}}, \qquad (1)$$
where c ∈ {0, …, k} for p ≤ 1/2 and c ∈ {2k − n, …, k} for p > 1/2. This agrees with the probability mass function of a Hypergeometric distribution with parameters (n, k, k).
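As a quick numerical check of Equation (1), the hypergeometric form can be evaluated directly with R's dhyper(); the values n = 100 and k = 20 below are illustrative.

```r
# Overlap distribution under random subsampling: Hypergeometric(n, k, k)
n <- 100; k <- 20
c_vals <- 0:k
pmf <- dhyper(c_vals, m = k, n = n - k, k = k)   # P(X = c) = C(k, c) C(n - k, k - c) / C(n, k)
sum(pmf)                                          # probabilities sum to 1
sum(c_vals * pmf)                                 # E(X) = k^2 / n = 4
```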
In contrast, under the partition subsampling scheme (i.e., k ≤ n/2), the probability that a pair of generated subsamples has exactly c overlaps can be expressed as follows:
$$P(X = c \mid \text{partition subsampling}) =
\begin{cases}
\dfrac{\frac{Bk}{n}\binom{n/k}{2}}{\binom{B}{2}} + \dfrac{\binom{n-k}{k}\binom{Bk/n}{2}(n/k)^{2}}{\binom{B}{2}\binom{n}{k}}, & \text{when } c = 0;\\[2ex]
\dfrac{\binom{k}{c}\binom{n-k}{k-c}\binom{Bk/n}{2}(n/k)^{2}}{\binom{B}{2}\binom{n}{k}}, & \text{when } c = 1, \ldots, k.
\end{cases} \qquad (2)$$
Equation (2) can be verified as follows.
Proof. 
In the following derivations, we utilize the Law of Total Probability, a fundamental rule in probability theory, to calculate the probability of zero overlaps by considering two mutually exclusive scenarios. In the case of partition subsampling, the pair of subsamples may be drawn from the same partition or may come from two different partitions. Under each of these two conditions, we decompose the probability of zero overlaps to complete the derivation. Specifically,
$$P(X = 0) = P(\{X = 0\} \cap \{\text{a within-partition pair}\}) + P(\{X = 0\} \cap \{\text{a between-partition pair}\}).$$
Then, by the General Multiplication Rule in probability theory,
$$\begin{aligned}
P(\{X = 0\} \cap \{\text{a within-partition pair}\}) &= P(X = 0 \mid \text{a within-partition pair})\, P(\text{a within-partition pair}),\\
P(\{X = 0\} \cap \{\text{a between-partition pair}\}) &= P(X = 0 \mid \text{a between-partition pair})\, P(\text{a between-partition pair}).
\end{aligned}$$
Hence,
$$\begin{aligned}
P(X = 0) &= P(X = 0 \mid \text{a within-partition pair})\, P(\text{a within-partition pair}) + P(X = 0 \mid \text{a between-partition pair})\, P(\text{a between-partition pair})\\
&= \frac{\frac{Bk}{n}\binom{n/k}{2}}{\binom{B}{2}} + \frac{\binom{n}{k}\binom{n-k}{k}}{\binom{n}{k}^{2}} \times \frac{\binom{Bk/n}{2}\binom{n/k}{1}\binom{n/k}{1}}{\binom{B}{2}}\\
&= \frac{\frac{Bk}{n}\binom{n/k}{2}}{\binom{B}{2}} + \frac{\binom{n-k}{k}\binom{Bk/n}{2}(n/k)^{2}}{\binom{B}{2}\binom{n}{k}}.
\end{aligned}$$
Similarly, when c = 1, …, k, we have
$$\begin{aligned}
P(X = c) &= P(\{X = c\} \cap \{\text{a within-partition pair}\}) + P(\{X = c\} \cap \{\text{a between-partition pair}\})\\
&= P(X = c \mid \text{a within-partition pair})\, P(\text{a within-partition pair}) + P(X = c \mid \text{a between-partition pair})\, P(\text{a between-partition pair})\\
&= 0 + \frac{\binom{n}{k}\binom{k}{c}\binom{n-k}{k-c}}{\binom{n}{k}^{2}} \times \frac{\binom{Bk/n}{2}\binom{n/k}{1}\binom{n/k}{1}}{\binom{B}{2}}\\
&= \frac{\binom{k}{c}\binom{n-k}{k-c}\binom{Bk/n}{2}(n/k)^{2}}{\binom{B}{2}\binom{n}{k}}.
\end{aligned}$$
   □
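The zero-overlap probability in Equation (2) can also be checked by simulation. The sketch below (with n = 100, k = 20, and B = 20, matching Figure 4) generates partition subsamples, estimates P(X = 0) empirically, and compares it with the closed form; the same approach applies to the shift-subsampling formula in Equation (3) below.

```r
# Monte Carlo check of the zero-overlap probability in Equation (2)
# (n = 100, k = 20, B = 20, as in Figure 4)
set.seed(1)
n <- 100; k <- 20; B <- 20
R <- 5000
zero_overlap <- replicate(R, {
  # generate B subsamples by partition subsampling: B*k/n shuffles, n/k blocks each
  subsamples <- do.call(rbind, lapply(seq_len(B * k / n), function(s) {
    matrix(sample(n), nrow = n / k, byrow = TRUE)      # each row is one subsample of size k
  }))
  pair <- sample(nrow(subsamples), 2)                  # pick a random pair of subsamples
  length(intersect(subsamples[pair[1], ], subsamples[pair[2], ])) == 0
})
mean(zero_overlap)                                     # empirical P(X = 0), about 0.22 here
# closed-form value from Equation (2) with c = 0
(B * k / n) * choose(n / k, 2) / choose(B, 2) +
  dhyper(0, k, n - k, k) * choose(B * k / n, 2) * (n / k)^2 / choose(B, 2)
```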
Furthermore, under shift subsampling (i.e., k > n/2), the probability mass function of X is given by
$$P(X = c \mid \text{shift subsampling}) =
\begin{cases}
\dfrac{B/2}{\binom{B}{2}} + \dfrac{4\binom{B/2}{2}\binom{k}{2k-n}}{\binom{B}{2}\binom{n}{k}}, & \text{when } c = 2k - n;\\[2ex]
\dfrac{4\binom{B/2}{2}\binom{k}{c}\binom{n-k}{k-c}}{\binom{B}{2}\binom{n}{k}}, & \text{when } c = 2k - n + 1, \ldots, k,
\end{cases} \qquad (3)$$
for c = 2k − n, …, k.
The proof of Equation (3) is presented below.
Proof. 
Let $S_i$ and $S_j$ denote a randomly chosen pair of the B generated subsamples. In the case of shift subsampling, such a pair may be drawn from the same shuffled dataset or may come from two different shuffled datasets. Under each of these two conditions, we decompose the probability of minimal overlaps to complete the derivation. By applying the Law of Total Probability and the General Multiplication Rule in probability theory, when c = 2k − n, i.e., the minimal number of overlaps, we have
$$\begin{aligned}
P(X = 2k - n) &= P(|S_i \cap S_j| = 2k - n \mid S_i, S_j \text{ from the same shuffle})\, P(S_i, S_j \text{ from the same shuffle})\\
&\quad + P(|S_i \cap S_j| = 2k - n \mid S_i, S_j \text{ from different shuffles})\, P(S_i, S_j \text{ from different shuffles})\\
&= 1 \times \frac{B/2}{\binom{B}{2}} + \frac{\binom{B/2}{2}\binom{2}{1}\binom{2}{1}}{\binom{B}{2}} \times \frac{\binom{n}{k}\binom{k}{2k-n}\binom{n-k}{k-(2k-n)}}{\binom{n}{k}^{2}}\\
&= \frac{B/2}{\binom{B}{2}} + \frac{4\binom{B/2}{2}\binom{k}{2k-n}}{\binom{B}{2}\binom{n}{k}}.
\end{aligned}$$
Similarly, when c = 2k − n + 1, …, k,
$$\begin{aligned}
P(X = c) &= P(|S_i \cap S_j| = c \mid S_i, S_j \text{ from the same shuffle})\, P(S_i, S_j \text{ from the same shuffle})\\
&\quad + P(|S_i \cap S_j| = c \mid S_i, S_j \text{ from different shuffles})\, P(S_i, S_j \text{ from different shuffles})\\
&= 0 + \frac{\binom{B/2}{2}\binom{2}{1}\binom{2}{1}}{\binom{B}{2}} \times \frac{\binom{n}{k}\binom{k}{c}\binom{n-k}{k-c}}{\binom{n}{k}^{2}}\\
&= \frac{4\binom{B/2}{2}\binom{k}{c}\binom{n-k}{k-c}}{\binom{B}{2}\binom{n}{k}}.
\end{aligned}$$
   □
Figure 4 and Figure 5 illustrate the probability mass functions (PMFs) under the various subsampling schemes, with parameters set to n = 100, B = 20, and p = 0.2 for partition subsampling and p = 0.6 for shift subsampling. The two PMFs are shown as overlapping bar charts, with the shorter bars positioned in front. In Figure 4, the first black bar for partition subsampling, which corresponds to the probability of generating a non-overlapping pair of subsamples (X = 0), is much taller than that under random subsampling (see the first gray bar). In Figure 5, the first black bar for shift subsampling, which displays the probability of generating a pair of subsamples with 2k − n overlaps (i.e., the minimal overlap), is, once again, much taller than that under random subsampling (in this case, the gray bar has a height that is almost equal to zero). These plots clearly demonstrate how the proposed subsampling designs significantly boost the likelihood of achieving the minimal number of overlaps between subsamples compared to simple random subsampling. As indicated by Equations (1)–(3), the exact reduction in between-subsample overlaps depends on n, B, and p. Nevertheless, the clear benefit of incorporating partition and shift subsampling to enhance training sample diversity is evident.
There are two possible ways to quantify and compare, from a probabilistic perspective, the diversity resulting from various subsampling schemes: 1. comparing the probability that each method achieves the minimal number of overlaps between a pair of randomly generated subsamples, and 2. comparing the expected number of overlaps between a pair of subsamples under a given subsampling scheme. Specifically, Table 1 summarizes the comparison for the probability of attaining maximum diversity (i.e., minimum overlaps) among the three subsampling methods.
Furthermore, the expected number of overlaps between a pair of subsamples generated from a specific subsampling scheme can be expressed as
$$E(X) = \sum_{c} c\, P(X = c),$$
where the formula for P ( X = c ) depends on the subsampling method. Table 2 compares the expected number of overlaps between a pair of random subsamples under different subsampling schemes when n = 50 , 100 , or 500 , and B = 10 , 20 , 50 , or 100. Partition and shift subsampling consistently produce fewer expected overlaps. However, this advantage diminishes as either the sample size n or the ensemble size B increases.
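For the random-subsampling benchmark, E(X) reduces to the hypergeometric mean k²/n, which matches the Random Subsampling rows in Table 2; under the proposed schemes, the within-partition (or within-shuffle) pairs pull the expectation down. A small R helper, sketched below under the divisibility assumptions of Section 2.2 (the function name expected_overlap is ours), reproduces values of the kind reported in Table 2.

```r
# Expected number of overlaps between a random pair of subsamples
# (assumes n*p and n/k are integers, as in Section 2.2)
expected_overlap <- function(n, p, B) {
  k <- n * p
  if (k <= n / 2) {                                           # partition subsampling
    between <- choose(B * k / n, 2) * (n / k)^2 / choose(B, 2) # P(between-partition pair)
    between * k^2 / n                                          # within-partition pairs contribute zero overlap
  } else {                                                     # shift subsampling
    same <- (B / 2) / choose(B, 2)                             # P(the two halves of one shuffle)
    same * (2 * k - n) + (1 - same) * k^2 / n                  # independent pairs overlap k^2/n on average
  }
}
expected_overlap(50, 0.2, 10)    # ~1.1, cf. Table 2
expected_overlap(50, 0.6, 10)    # ~17.1
(0.2 * 50)^2 / 50                # random-subsampling benchmark: 2.0
```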
In the following section, we conduct comprehensive simulation studies to numerically assess the performance of the proposed subsampling schemes against the benchmark in both classification and regression contexts.
Remark 3.
Because both of our proposed subsampling methods begin with a random shuffle of the entire learning sample, they are best suited for data where the observations are independently and identically distributed. We acknowledge that this approach may inadvertently distort inherent structures within datasets that have complex stratification or strong temporal dependencies. In such cases, specialized algorithms designed to preserve these data features, such as those that account for stratified or time-series data, may be more appropriate.

3. Results

3.1. Simulation Studies

In this section, we evaluate the performance of the proposed subsampling schemes through simulation studies in regression and classification scenarios. For both designs, we consider subsample sizes from 0.10n to 0.95n, incremented by 0.05n. We use the proposed partition subsampling scheme to generate individual training samples for the ensemble when k ∈ {0.10n, …, 0.50n}, and the developed shift subsampling scheme for k ∈ {0.55n, …, 0.95n}. For comparison purposes, we also implement the conventional random subsampling approach as a benchmark for subsampling-based ensemble methods. We fit individual trees within an ensemble using the randomForest package in R [29]. Each tree was grown to its maximum possible depth until a stopping criterion, such as a minimal mean squared error for regression, was met. The same tree-fitting algorithm was used in both Section 3.1 and Section 3.2.
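As a concrete illustration of this setup, the following hedged R sketch shows how one subbagging fit might be assembled for a regression task: one tree per subsample via randomForest with ntree = 1 and mtry set to all predictors (no feature subsampling), with predictions averaged across the ensemble. It reuses the partition_subsamples()/shift_subsamples() helpers sketched in Section 2.2, assumes the response column is named y, and the specific settings (e.g., nodesize) are illustrative rather than the authors' exact configuration.

```r
# A sketch of a subbagging regressor built from the proposed subsampling schemes
library(randomForest)

fit_subbagging <- function(train, test, p, B = 500) {
  n <- nrow(train)
  k <- floor(p * n)
  idx_list <- if (k <= n / 2) partition_subsamples(n, k, B) else shift_subsamples(n, k, B)
  preds <- sapply(idx_list, function(idx) {
    tree <- randomForest(y ~ ., data = train[idx, ],
                         ntree = 1,                 # a single tree per subsample
                         mtry = ncol(train) - 1,    # use all predictors at each split
                         nodesize = 5)              # small terminal nodes, i.e., deep trees
    predict(tree, test)
  })
  rowMeans(preds)                                   # average predictions across the ensemble
}
```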

3.1.1. Classification Simulation Study

For the classification simulation study, we generated R = 500 independent samples of size n, with n ∈ {100, 500}. Within each simulated dataset, six x-variables, each of length n, were independently drawn from a standard Uniform distribution. Subsequently, n random errors were simulated from a Logistic distribution with a location parameter of 0 and a scale parameter of 5. The “continuous” response, Y^c, was then determined based on the following true relationship:
$$Y^{c}_{i} = 1 + 5X_{1,i} + 4X_{2,i} + 3X_{3,i} + 2X_{4,i} + 0.1X_{5,i} + 0X_{6,i} + \epsilon_i \quad (1 \le i \le n).$$
The binary outcome Y was obtained by dichotomizing the continuous response Y^c based on a varying threshold. This threshold was chosen to yield a target ratio of 1’s to 0’s of 50-50, 40-60, 30-70, or 20-80 in the final dataset. Specifically, the threshold was set to be the qth percentile (q = 50, 60, 70, 80) of the Logistic distribution with a location parameter of 8.05 and a scale parameter of 2, which represents the expected distribution of Y^c conditional on the x-variables. A value of Y^c greater than the threshold resulted in Y = 1; otherwise, Y = 0. We deliberately included X6 as an irrelevant predictor to introduce noise and elevate the classification task’s difficulty.
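A minimal R sketch of this data-generating process is shown below; the seed and sample size are illustrative, and the 50-50 case uses q = 50 (replace with 60, 70, or 80 for the other class ratios).

```r
# Simulating one classification dataset from the relationship above
set.seed(2025)
n <- 100
X <- matrix(runif(n * 6), ncol = 6)                     # six Uniform(0, 1) predictors
eps <- rlogis(n, location = 0, scale = 5)               # Logistic(0, 5) errors
y_cont <- 1 + 5*X[, 1] + 4*X[, 2] + 3*X[, 3] + 2*X[, 4] + 0.1*X[, 5] + 0*X[, 6] + eps
threshold <- qlogis(0.50, location = 8.05, scale = 2)   # q-th percentile of Logistic(8.05, 2)
y <- as.integer(y_cont > threshold)                     # dichotomize into the binary outcome
table(y)                                                # roughly a 50-50 split
```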
For each simulated dataset, we applied the subbagging estimator, varying the number of trees within each ensemble to 50, 100, and 500. As previously mentioned, we also explored a wide range of subsampling rates, from p = 0.10 to p = 0.95, with an increment of 0.05. This resulted in subsample sizes ranging from k = 0.10n to k = 0.95n. We fit the benchmark subbagging estimator, which uses the default random subsampling scheme, alongside those constructed using our developed partition or shift subsampling strategies. Their performance was compared using a 10-fold cross-validated accuracy measure. The average cross-validated accuracy scores across the 500 iterations are summarized in Table 3, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9 and Table 10. It is worth mentioning that, in Table 6, the values for the smallest subsampling proportion (p = 0.10) are marked as NA. This occurred because, at such a low subsampling rate and given the imbalanced nature of the simulated dataset, some training sets generated through 10-fold cross validation are likely to contain only 0s, making it impossible to fit the model. (For example, when n = 100 and p = 0.1, 10-fold cross validation results in ensemble training sets with only nine observations each. Given a 20-80 positive–negative ratio, it is highly likely that some of these small training samples will contain no positive cases (1’s) due to random chance.)
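A quick back-of-the-envelope calculation illustrates why these NAs arise; the counts below assume roughly 18 positives in each delete-one-fold learning sample of 90 observations and treat subsamples as independent, both of which are simplifications.

```r
# Chance that a size-9 subsample drawn from ~18 positives and ~72 negatives has no positives
dhyper(0, m = 18, n = 72, k = 9)            # about 0.12 per subsample
# With hundreds of subsamples per ensemble, an all-negative training set is almost certain
1 - (1 - dhyper(0, 18, 72, 9))^500          # essentially 1
```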
The simulation results further confirmed the role of diversity on the ensemble predictive performance: across all three subsampling methods, as the subsampling rate increases—reducing diversity among individual training samples—the accuracy consistently deteriorates. This pattern holds true for both sample sizes. On the other hand, as the sample size n increases from 100 to 500, or the number of trees per ensemble becomes larger, there is a slight improvement in accuracy scores across various subsampling rates, given the same positive–negative ratio.
The proposed subsampling schemes demonstrate superior predictive performance as the subsampling rate increases. Notably, the performance gains from our methods become apparent at relatively small subsampling rates, particularly with larger sample sizes. For example, when n = 100 with a 50-50 positive–negative ratio, the partition subsampling scheme starts to outperform the benchmark at approximately p = 0.30. This threshold drops to around p = 0.15 for n = 500. Overall, the proposed methods ultimately achieve a higher accuracy than the benchmark beyond a certain threshold. An additional notable observation is that shift subsampling yields a more significant improvement in accuracy. For example, in Table 3, shift subsampling with k = 0.95n boosts the benchmark’s accuracy by up to 6% (see Table 3 when k = 0.95n and #trees = 100: the improvement in accuracy is (0.625 − 0.591)/0.591 ≈ 6%), whereas partition subsampling (e.g., at k = 0.1n to 0.50n) only improves the accuracy by 1–2%.

3.1.2. Regression Simulation Study

For the regression simulation study, we used a Multivariate Adaptive Regression Spline (MARS) model, also known as Friedman #1 [30]. This model allows the generation of datasets that exhibit an underlying non-linear relationship between the response and five predictor variables. It is a common benchmark for evaluating ensemble methods and has been considered in several previous studies, including Bühlmann and Yu [10], Mentch and Hooker [11], Wang and Wei [13]. For each sample of size n (either 100 or 500), we independently simulated five x-variables from a standard Uniform distribution. Random errors were then generated from a normal distribution with a mean of 0 and a standard deviation of 1 , 5 , or 10 . The response was then determined based on the following relationship:
$$Y_i = 10\sin(\pi X_{1,i} X_{2,i}) + 20(X_{3,i} - 0.05)^2 + 10X_{4,i} + 5X_{5,i} + \epsilon_i \quad (1 \le i \le n).$$
In total, we considered R = 500 independent samples.
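For reference, one such sample can be simulated in R as follows; this is a sketch in which the seed, n, and error standard deviation are illustrative, and the mean function follows the relationship exactly as written above.

```r
# Simulating one regression dataset from the Friedman #1 relationship above
set.seed(2025)
n <- 100; sigma <- 5
X <- matrix(runif(n * 5), ncol = 5)                     # five Uniform(0, 1) predictors
y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.05)^2 +
     10 * X[, 4] + 5 * X[, 5] + rnorm(n, sd = sigma)
dat <- data.frame(y = y, X)                             # columns: y, X1, ..., X5
```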
Similar to the classification study, we applied the subbagging estimator with varying numbers of trees—specifically, 50, 100, and 500 per ensemble. We also explored a wide range of subsampling rates, from p = 0.10 to p = 0.95 , increasing by 0.05. We fit both the benchmark subbagging estimator (using the default random subsampling scheme) and those developed with our partition or shift subsampling strategies. To compare their performance, we computed the 10-fold cross-validated mean squared error (MSE). The average cross-validated MSEs from the 500 iterations are summarized in Table 11, Table 12, Table 13, Table 14, Table 15 and Table 16.
The simulation results for the regression show a slightly different pattern than those observed in the classification scenario. For random subsampling, the MSE initially decreases with the subsampling rate but then increases. In contrast, under partition and shift subsampling, the MSE decreases monotonically with a larger subsampling rate. This ultimately leads to a much reduced MSE under shift subsampling at k = 0.95n, with an up to 37% reduction in MSE compared to the benchmark (see Table 14 for the specific case where k = 0.95n and the number of trees is 500: (4.830 − 7.682)/7.682 ≈ −37%, i.e., a 37% reduction in MSE). In addition, as expected, we found that increasing the number of trees or decreasing the error standard deviation slightly reduces the MSE. This pattern holds true for both sample sizes. Furthermore, a larger sample size consistently leads to better overall performance, resulting in a lower MSE. Finally, the threshold for observing superior performance from our proposed methods is lower when the error standard deviation increases or when the number of trees increases.

3.2. Real Data Examples

In this section, we present eight real data examples to demonstrate the practical applications of the proposed subsampling schemes compared to the benchmark random subsampling in both the regression and classification scenarios.

3.2.1. Classification Datasets

We first evaluated our proposed methods and the benchmark using five diverse classification datasets, each presenting unique characteristics in terms of the sample size, number of features, class distribution, and application domains. In addition to the Wine dataset [22] discussed in Section 2.1, we also considered several other classification datasets. More specifically, the Iris dataset [31], one of the earliest known datasets used to evaluate classification methods, includes 150 observations. It uses four continuous variables (petal and sepal length and width) to predict one of three balanced iris plant species. In addition, we utilized the Cleveland database for heart disease diagnosis [32]. This dataset has 300 observations and contains information on thirteen demographic and health-related variables to predict one of five heart disease severity levels (0–4, where 0 indicates no disease). We also analyzed the Pima Indians Diabetes dataset [33], comprising 760 observations with eight health-related variables to predict diabetes presence (i.e., binary outcome). Lastly, the Statlog (German Credit Data) dataset [34] contains 1000 observations and classifies individuals as good or bad credit risks based on 20 financial and demographic features. All the aforementioned datasets are available in the UCI Machine Learning Repository (https://archive.ics.uci.edu/).
Table 17 summarizes the characteristics of these datasets. For ease of implementation of 10-fold cross validation, we round the number of observations down to a multiple of 10 by randomly removing fewer than 10 observations from each dataset. All reported sample sizes reflect these adjusted counts.
We applied subbagging estimators by varying the number of trees per ensemble from 20 to 1000. For each of the five datasets, we generated subsamples for individual classification trees using random subsampling (the benchmark) and the proposed partition and shift subsampling schemes. Moreover, we set the subsampling proportion p from 0.1 to 0.95, incrementally increasing by 0.05. This resulted in subsample sizes k ranging from 0.1n to 0.95n. Consistent with our approach in Section 2.1, we assessed the performance of each method using 10-fold cross-validated accuracy scores for each dataset and setting. To mitigate the impact of randomness inherent in ensemble learning, we refit the subbagging estimator 100 times, reporting the average cross-validated accuracy scores. The results are summarized in Table 18, Table 19, Table 20, Table 21 and Table 22.
Across all five classification datasets, the accuracy generally declines with the subsampling rate past a certain inflection point. This trend is most noticeable with random subsampling, which leads to a significant drop in accuracy at k = 0.95 n . In contrast, partition and shift subsampling enable higher subsampling rates to improve the accuracy. For example, shift subsampling on the Wine dataset achieves an accuracy of nearly or above 96%, representing an improvement of up to 5% over the benchmark. Further, the improvement of shift subsampling reaches 10% on the German dataset. Consistent with our findings in Section 3.1, the accuracy also improves with a larger ensemble size and a more balanced class ratio.

3.2.2. Regression Datasets

To further demonstrate the performance of the developed methods in a different setting, we considered three real-world regression datasets. First, we analyzed the Housing dataset [35], which predicts house price per unit area for properties in Xindian District, New Taipei City, Taiwan, using six features. Next, we utilized the Energy Efficiency dataset [36], which assesses building energy efficiency based on eight building parameters. Finally, we incorporated the Forest Fires dataset [37]. This dataset attempts to predict the burned area of forest fires in the northeast region of Portugal using twelve meteorological features. Following the documentation and the prior literature [38], a natural logarithm transformation was applied to the highly skewed response variable. One thing worthy of attention is that the Forest Fires dataset is particularly challenging to model. Previous attempts to analyze it using machine learning techniques have not been very successful [38].
Table 23 presents the summary characteristics of these datasets. Similar to our previous approach, we rounded the number of observations down to a multiple of 10 by randomly removing fewer than 10 observations to simplify the 10-fold cross validation. All the sample sizes reported in Table 23 are after these adjustments.
Using the three aforementioned datasets, we examined the performance of different subbagging estimator configurations, evaluating each based on the cross-validated mean squared error (MSE). Consistent with the settings in Section 3.2.1, we varied the subsampling proportion p from 0.10 to 0.95 and the number of trees per ensemble from 20 to 1000. All MSE scores were computed using 10-fold cross validation. The complete results are summarized in Table 24, Table 25 and Table 26.
Under random subsampling, a different inflection point of MSE is observed for each dataset. Specifically, in the Housing dataset, the MSE initially decreases as the subsampling proportion (p) increases from 0.10 to 0.40 and then begins to rise for p values between 0.40 and 0.95. Conversely, the Forest Fires dataset consistently shows an increase in the MSE with larger subsampling proportions. For the Energy dataset, however, increased subsampling proportions lead to lower errors. Similarly, distinct trends in the prediction error (MSE) are also observed when utilizing partition or shift subsampling. Specifically, under the proposed subsampling schemes: In the Housing dataset, the MSE decreases as the subsampling proportion (p) increases from approximately 0.10 to 0.60 and then rises from roughly 0.60 to 0.95. For the Energy dataset, increased subsampling proportions generally yield lower errors. On the contrary, in the Forest Fires dataset, higher subsampling proportions are associated with increased errors. Across all three datasets, as the number of trees per ensemble increases, we observe a general trend of decreasing MSE. This reduction is most pronounced when the number of trees grows from 20 to 100; beyond that point, only marginal further decreases in MSE are observed. This holds true across all different subsampling schemes.
Regarding the comparison between the benchmark and our proposed methods, the benefits of our subsampling schemes become more pronounced as the subsampling proportion increases. For both the Housing and Energy datasets, our proposed method begins to show improvement at a subsampling proportion of approximately p = 0.40. In contrast, for the challenging Forest Fires dataset, a reduction in MSE is consistently observed across all tested values of p, which demonstrates the advantage of our proposed methods for more challenging datasets. Across all datasets, a marginal positive correlation between the number of trees per ensemble and the magnitude of the MSE improvement is also noted.

3.3. Further Justification

As discussed in the previous sections, to mitigate the effects of randomness, we used 500 iterations for each simulation study. For the real data examples, each subbagging estimator was refitted 100 times. The reported performance scores are therefore averages over a large number of iterations, effectively accounting for random variations.
To further demonstrate the superior performance of our proposed subsampling schemes, in what follows, we use the Wine dataset as a case study to better justify the effectiveness and significance of our developed methods compared to the benchmark random subsampling.
In Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10 below, we display the cross-validated accuracy scores of the different methods. The solid dots represent the average score over 100 refits of the subbagging estimator for a given subsample size and number of trees. The error bars around each dot indicate the standard error. As anticipated, the standard error of the accuracy score decreases as the number of trees increases. These plots show that, on average, our proposed partition and shift subsampling schemes consistently outperform the benchmark random subsampling, with only a few exceptions where their performance scores are quite comparable. This performance gain becomes particularly significant as the subsampling rate and the number of trees increase, even after accounting for sampling variations. As discussed in Section 2.2.2, shift subsampling tends to be particularly advantageous over the other competing methods at relatively large subsampling rates. For example, in Figure 10, when p ≥ 0.5 and with 1000 trees per ensemble, shift subsampling is significantly better than the benchmark, yielding a much higher cross-validated accuracy. Overall, the threshold for such a significant performance gain appears at a lower subsampling rate as the number of trees grows.

4. Discussion

This paper explores how diversity among training samples impacts the predictive performance of subsampling-based ensemble methods. To improve the diversity without compromising the training sample size, we introduce two novel subsampling schemes: partition subsampling and shift subsampling. Our probabilistic analyses further justify the improved diversity the proposed methods offer compared to the benchmark random subsampling. Through extensive simulation studies and real-world data illustrations, we show their superior performance in both regression and classification scenarios. In particular, the benefits of utilizing the developed subsampling strategies become more noticeable on challenging datasets or at larger subsampling rates, and the percentage improvement is larger in regression problems.
For future work, it would be interesting to extend these schemes to other subsampling-based ensemble methods, such as sub-random forest or non-tree-based sub-ensemble estimators. Given their adaptability, we anticipate similar positive trends and conclusions in these broader applications.

Author Contributions

Conceptualization, Q.W.; methodology, M.O. and Q.W.; software, M.O. and Q.W.; validation, M.O. and Q.W.; formal analysis, M.O. and Q.W.; investigation, M.O. and Q.W.; resources, M.O.; data curation, M.O.; writing—original draft preparation, M.O. and Q.W.; writing—review and editing, M.O. and Q.W.; visualization, M.O. and Q.W.; supervision, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All the real datasets presented in this paper are available in the UCI Machine Learning Repository (https://archive.ics.uci.edu/).

Acknowledgments

The publication fees for this article are supported by the Wellesley College Library and Technology Services Open Access Fund.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Galton, F. Regression Towards Mediocrity in Hereditary Stature. J. Anthropol. Inst. Great Br. Irel. 1886, 15, 246–263. [Google Scholar] [CrossRef]
  2. Heckman, N.; Ramsay, J. Penalized Regression with Model-Based Penalties. Can. J. Stat. 2000, 28, 241–258. [Google Scholar] [CrossRef]
  3. Dobson, A.; Barnett, A. An Introduction to Generalized Linear Models, 4th ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018. [Google Scholar]
  4. Gurney, K. An Introduction to Neural Networks; CRC Press: London, UK, 2018. [Google Scholar]
  5. Zhou, Y.; Gallins, P. A Review and Tutorial of Machine Learning Methods for Microbiome Host Trait Prediction. Front. Genet. 2019, 10, 579. [Google Scholar] [CrossRef] [PubMed]
  6. Schoot, R.v.; Depaoli, S.; King, R.; Kramer, B.; Märtens, K.; Tadesse, M.G.; Vannucci, M.; Gelman, A.; Veen, D.; Willemsen, J.; et al. Bayesian statistics and modelling. Nat. Rev. Methods Prim. 2021, 1, 1. [Google Scholar] [CrossRef]
  7. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; Taylor and Francis: New York, NY, USA, 1984. [Google Scholar]
  8. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  9. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  10. Bühlmann, P.; Yu, B. Analyzing bagging. Ann. Stat. 2002, 30, 927–961. [Google Scholar] [CrossRef]
  11. Mentch, L.; Hooker, G. Quantifying uncertainty in random forest via confidence intervals and hypothesis tests. J. Mach. Learn. Res. 2016, 17, 1–41. [Google Scholar]
  12. Peng, W.; Coleman, T.; Mentch, L. Rates of convergence for random forests via generalized U-statistics. Electron. J. Stat. 2022, 16, 232–292. [Google Scholar] [CrossRef]
  13. Wang, Q.; Wei, Y. Quantifying uncertainty of subsampling-based ensemble methods under a U-statistic framework. J. Stat. Comput. Simul. 2022, 92, 3706–3726. [Google Scholar] [CrossRef]
  14. Kuncheva, L.; Whitaker, C. Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. Mach. Learn. 2003, 51, 181–207. [Google Scholar] [CrossRef]
  15. Brown, G.; Wyatt, J.; Harris, R.; Yao, X. Diversity creation methods: A survey and categorisation. Inf. Fusion 2005, 6, 5–20. [Google Scholar] [CrossRef]
  16. Kuncheva, L. Combining Pattern Classifiers: Methods and Algorithms; Wiley: Hoboken, NJ, USA, 2004. [Google Scholar]
  17. Rodríguez, J.J.; Kuncheva, L.I.; Alonso, C.J. Rotation Forest: A new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1619–1630. [Google Scholar] [CrossRef]
  18. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef]
  19. Cunningham, P. Ensembles in Machine Learning. 2022. Available online: https://medium.com/data-science/ensembles-in-machine-learning-9128215629d1 (accessed on 15 August 2025).
  20. Tang, E.K.; Suganthan, P.N.; Yao, X. An analysis of diversity measures. Mach. Learn. 2006, 65, 247–271. [Google Scholar] [CrossRef]
  21. Wood, D.; Mu, T.; Webb, A.; Reeve, H.W.J.; Lujan, M.; Brown, G. A Unified Theory of Diversity in Ensemble Learning. J. Mach. Learn. Res. 2023, 24, 1–49. [Google Scholar]
  22. Aeberhard, S.; Forina, M. Wine; UCI Machine Learning Repository: 1991. Available online: https://archive.ics.uci.edu/dataset/109/wine (accessed on 15 August 2025).
  23. Wang, Q.; Lindsay, B.G. Variance estimation of a general U-statistic with application to cross-validation. Stat. Sin. 2014, 24, 1117–1141. [Google Scholar]
  24. Wang, Q.; Guo, A. An efficient variance estimator of AUC with applications to binary classification. Stat. Med. 2020, 39, 4281–4300. [Google Scholar] [CrossRef]
  25. Wang, Q.; Cai, X. An efficient variance estimator for cross-validation under partition-sampling. Statistics 2021, 55, 660–681. [Google Scholar] [CrossRef]
  26. Wang, Q.; Cai, X. A new perspective on U-statistic variance estimation. Stat 2025, 14, e70070. [Google Scholar] [CrossRef]
  27. Hoeffding, W. A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 1948, 19, 293–325. [Google Scholar] [CrossRef]
  28. Lee, A.J. U-Statistics: Theory and Practice; Marcel Dekker: New York, NY, USA, 1990. [Google Scholar]
  29. Liaw, A.; Wiener, M. Classification and Regression by randomForest. R News 2002, 2, 18–22. [Google Scholar]
  30. Friedman, J. Multivariate Adaptive Regression Splines. Ann. Stat. 1991, 19, 1–67. [Google Scholar] [CrossRef]
  31. Fisher, R.A. Iris; UCI Machine Learning Repository: 1936. Available online: https://archive.ics.uci.edu/dataset/53/iris (accessed on 15 August 2025).
  32. Janosi, A.; Steinbrunn, W.; Pfisterer, M.; Detrano, R. Heart Disease: Cleveland Database; UCI Machine Learning Repository: 1988. Available online: https://archive.ics.uci.edu/dataset/45/heart+disease (accessed on 15 August 2025).
  33. Turney, P. Pima Indians Diabetes Data Set; UCI Machine Learning Repository: 1990. Available online: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database (accessed on 15 August 2025).
  34. Hofmann, H. Statlog (German Credit Data); UCI Machine Learning Repository: 1994. Available online: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data (accessed on 15 August 2025).
  35. Yeh, I. Real Estate Valuation; UCI Machine Learning Repository: 2018. Available online: https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set (accessed on 15 August 2025).
  36. Tsanas, A.; Xifara, A. Energy Efficiency; UCI Machine Learning Repository: 2012. Available online: https://archive.ics.uci.edu/dataset/242/energy+efficiency (accessed on 15 August 2025).
  37. Cortez, P.; Morais, A. Forest Fires; UCI Machine Learning Repository: 2008. Available online: https://archive.ics.uci.edu/dataset/162/forest+fires (accessed on 15 August 2025).
  38. Cortez, P.; Morais, A. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In Proceedings of the 13th Portuguese Conference on Artificial Intelligence (EPIA 2007), Guimarães, Portugal, 3–7 December 2007; pp. 512–523. [Google Scholar]
Figure 1. Cross-validated accuracy against subsampling proportion based on the Wine dataset (averaged over 100 iterations).
Figure 2. Diagram that displays the partition subsampling scheme that generates B subsamples of size k.
Figure 3. Diagram that displays the shift subsampling scheme that generates B subsamples of size k.
Figure 4. Probability mass functions of the number of overlaps between a pair of subsamples generated from random subsampling and partition subsampling when n = 100, B = 20, and p = 0.2.
Figure 5. Probability mass functions of the number of overlaps between a pair of subsamples generated from random subsampling and shift subsampling when n = 100, B = 20, and p = 0.6.
Figure 6. Cross-validated accuracy with error bars based on the Wine dataset (#trees = 20).
Figure 7. Cross-validated accuracy with error bars based on the Wine dataset (#trees = 50).
Figure 8. Cross-validated accuracy with error bars based on the Wine dataset (#trees = 100).
Figure 9. Cross-validated accuracy with error bars based on the Wine dataset (#trees = 500).
Figure 10. Cross-validated accuracy with error bars based on the Wine dataset (#trees = 1000).
Table 1. Comparison of the probability of maximum diversity under a subsampling scheme.
$p \le 1/2$:
  Random:    $P(X = 0) = \binom{n-k}{k} \big/ \binom{n}{k}$
  Partition: $P(X = 0) = \frac{Bk}{n}\binom{n/k}{2} \big/ \binom{B}{2} + \binom{n-k}{k}\binom{Bk/n}{2}(n/k)^{2} \big/ \big[\binom{B}{2}\binom{n}{k}\big]$
$p > 1/2$:
  Random:    $P(X = 2k - n) = \binom{k}{2k-n} \big/ \binom{n}{k}$
  Shift:     $P(X = 2k - n) = \frac{B/2}{\binom{B}{2}} + \frac{4\binom{B/2}{2}\binom{k}{2k-n}}{\binom{B}{2}\binom{n}{k}}$
Table 2. Expected number of overlaps between a pair of subsamples.
Random Subsampling
n\k     0.1n   0.2n   0.3n   0.4n   0.5n    0.6n    0.7n    0.8n    0.9n
50      0.5    2.0    4.5    8.0    12.5    18.0    24.5    32.0    40.5
100     1.0    4.0    9.0    16.0   25.0    36.0    49.0    64.0    81.0
500     5.0    20.0   45.0   80.0   125.0   180.0   245.0   320.0   405.0
Partition and Shift Subsampling (B = 10)
n\k     0.1n   0.2n   0.3n   0.4n   0.5n    0.6n    0.7n    0.8n    0.9n
50      0.0    1.1    3.3    6.7    11.1    17.1    24.0    31.8    40.4
100     0.0    2.2    6.7    13.3   22.2    34.2    48.0    63.6    80.9
500     0.0    11.1   33.3   66.7   111.1   171.1   240.0   317.8   404.4
Partition and Shift Subsampling (B = 20)
n\k     0.1n   0.2n   0.3n   0.4n   0.5n    0.6n    0.7n    0.8n    0.9n
50      0.3    1.6    3.9    7.4    11.8    17.6    24.3    31.9    40.5
100     0.5    3.2    7.9    14.7   23.7    35.2    48.5    63.8    80.9
500     2.6    15.8   39.5   73.7   118.4   175.8   242.6   318.9   404.7
Partition and Shift Subsampling (B = 50)
n\k     0.1n   0.2n   0.3n   0.4n   0.5n    0.6n    0.7n    0.8n    0.9n
50      0.4    1.8    4.3    7.8    12.2    17.8    24.4    32.0    40.5
100     0.8    3.7    8.6    15.5   24.5    35.7    48.8    63.9    81.0
500     4.1    18.4   42.9   77.6   122.4   178.4   244.1   319.6   404.9
Partition and Shift Subsampling (B = 100)
n\k     0.1n   0.2n   0.3n   0.4n   0.5n    0.6n    0.7n    0.8n    0.9n
50      0.5    1.9    4.4    7.9    12.4    17.9    24.5    32.0    40.5
100     0.9    3.8    8.8    15.8   24.7    35.8    48.9    64.0    81.0
500     4.5    19.2   43.9   78.8   123.7   179.2   244.5   319.8   404.9
Table 3. Cross-validated accuracies in binary classification when n = 100 and a 50-50 positive–negative ratio.
Random Subsampling
#trees\k   0.10n   0.15n   0.20n   0.25n   0.30n   0.35n   0.40n   0.45n   0.50n
50         0.629   0.633   0.634   0.632   0.631   0.630   0.631   0.624   0.628
100        0.638   0.642   0.638   0.636   0.639   0.637   0.631   0.631   0.633
500        0.645   0.636   0.640   0.638   0.636   0.635   0.636   0.633   0.630
#trees\k   0.55n   0.60n   0.65n   0.70n   0.75n   0.80n   0.85n   0.90n   0.95n
50         0.624   0.625   0.623   0.617   0.611   0.609   0.609   0.597   0.588
100        0.629   0.625   0.616   0.616   0.617   0.612   0.608   0.596   0.591
500        0.626   0.625   0.624   0.618   0.613   0.609   0.606   0.601   0.595
Partition and Shift Subsampling
#trees\k   0.10n   0.15n   0.20n   0.25n   0.30n   0.35n   0.40n   0.45n   0.50n
50         0.624   0.630   0.634   0.631   0.633   0.631   0.634   0.629   0.633
100        0.632   0.638   0.634   0.635   0.639   0.640   0.637   0.636   0.636
500        0.641   0.638   0.641   0.640   0.638   0.640   0.644   0.640   0.638
#trees\k   0.55n   0.60n   0.65n   0.70n   0.75n   0.80n   0.85n   0.90n   0.95n
50         0.630   0.633   0.631   0.629   0.626   0.624   0.626   0.621   0.620
100        0.639   0.637   0.630   0.628   0.632   0.629   0.630   0.624   0.625
500        0.637   0.634   0.636   0.632   0.630   0.628   0.628   0.626   0.627
Table 4. Cross-validated accuracies in binary classification when n = 100 and a 40-60 positive–negative ratio.
Random Subsampling
#trees\k   0.10n   0.15n   0.20n   0.25n   0.30n   0.35n   0.40n   0.45n   0.50n
50         0.638   0.645   0.648   0.639   0.645   0.645   0.645   0.644   0.641
100        0.647   0.647   0.652   0.644   0.645   0.648   0.643   0.647   0.636
500        0.645   0.657   0.650   0.651   0.651   0.651   0.644   0.644   0.648
#trees\k   0.55n   0.60n   0.65n   0.70n   0.75n   0.80n   0.85n   0.90n   0.95n
50         0.630   0.635   0.634   0.630   0.622   0.618   0.614   0.610   0.606
100        0.643   0.633   0.632   0.627   0.623   0.621   0.620   0.607   0.597
500        0.641   0.637   0.637   0.631   0.626   0.624   0.610   0.615   0.607
Partition and Shift Subsampling
#trees\k   0.10n   0.15n   0.20n   0.25n   0.30n   0.35n   0.40n   0.45n   0.50n
50         0.636   0.644   0.647   0.641   0.648   0.645   0.650   0.647   0.646
100        0.644   0.645   0.651   0.647   0.651   0.649   0.648   0.654   0.644
500        0.644   0.654   0.650   0.653   0.654   0.654   0.651   0.650   0.655
#trees\k   0.55n   0.60n   0.65n   0.70n   0.75n   0.80n   0.85n   0.90n   0.95n
50         0.641   0.643   0.646   0.643   0.638   0.634   0.637   0.635   0.637
100        0.651   0.643   0.646   0.640   0.638   0.641   0.641   0.633   0.633
500        0.650   0.648   0.651   0.642   0.643   0.645   0.633   0.643   0.638
Table 5. Cross-validated accuracies in binary classification when n = 100 and a 30-70 positive–negative ratio.
Random Subsampling
#trees\k   0.10n   0.15n   0.20n   0.25n   0.30n   0.35n   0.40n   0.45n   0.50n
50         0.679   0.686   0.685   0.682   0.684   0.680   0.676   0.673   0.673
100        0.679   0.688   0.689   0.689   0.685   0.684   0.680   0.676   0.676
500        0.651   0.682   0.692   0.692   0.681   0.687   0.679   0.679   0.674
#trees\k   0.55n   0.60n   0.65n   0.70n   0.75n   0.80n   0.85n   0.90n   0.95n
50         0.665   0.672   0.670   0.664   0.657   0.656   0.648   0.643   0.632
100        0.672   0.669   0.670   0.664   0.665   0.657   0.653   0.647   0.632
500        0.676   0.670   0.668   0.663   0.659   0.656   0.647   0.644   0.632
Partition and Shift Subsampling
#trees\k   0.10n   0.15n   0.20n   0.25n   0.30n   0.35n   0.40n   0.45n   0.50n
50         0.676   0.685   0.685   0.685   0.687   0.683   0.683   0.681   0.681
100        0.677   0.685   0.692   0.691   0.688   0.689   0.688   0.685   0.684
500        0.650   0.681   0.691   0.695   0.688   0.692   0.686   0.689   0.683
#trees\k   0.55n   0.60n   0.65n   0.70n   0.75n   0.80n   0.85n   0.90n   0.95n
50         0.677   0.681   0.681   0.679   0.677   0.675   0.668   0.674   0.667
100        0.682   0.681   0.683   0.682   0.684   0.678   0.677   0.675   0.671
500        0.689   0.683   0.683   0.679   0.680   0.679   0.671   0.675   0.669
Table 6. Cross-validated accuracies in binary classification with n = 100 and a 20-80 positive–negative ratio.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | NA | 0.757 | 0.750 | 0.752 | 0.749 | 0.748 | 0.747 | 0.742 | 0.740
100 | NA | 0.748 | 0.754 | 0.752 | 0.751 | 0.747 | 0.747 | 0.746 | 0.741
500 | NA | 0.715 | 0.734 | 0.748 | 0.751 | 0.745 | 0.749 | 0.742 | 0.738
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.739 | 0.731 | 0.726 | 0.717 | 0.716 | 0.712 | 0.711 | 0.700 | 0.693
100 | 0.735 | 0.733 | 0.733 | 0.721 | 0.717 | 0.718 | 0.705 | 0.700 | 0.700
500 | 0.737 | 0.737 | 0.733 | 0.734 | 0.721 | 0.715 | 0.707 | 0.704 | 0.692

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | NA | 0.757 | 0.757 | 0.757 | 0.754 | 0.755 | 0.753 | 0.751 | 0.750
100 | NA | 0.751 | 0.755 | 0.757 | 0.757 | 0.754 | 0.753 | 0.757 | 0.749
500 | NA | 0.715 | 0.736 | 0.752 | 0.757 | 0.753 | 0.757 | 0.753 | 0.752
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.750 | 0.746 | 0.743 | 0.737 | 0.740 | 0.737 | 0.743 | 0.733 | 0.734
100 | 0.749 | 0.748 | 0.748 | 0.743 | 0.737 | 0.742 | 0.735 | 0.733 | 0.738
500 | 0.750 | 0.748 | 0.750 | 0.752 | 0.744 | 0.741 | 0.736 | 0.737 | 0.736
Table 7. Cross-validated accuracies in binary classification with n = 500 and a 50-50 positive–negative ratio.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.660 | 0.663 | 0.658 | 0.658 | 0.655 | 0.655 | 0.652 | 0.650 | 0.650
100 | 0.669 | 0.666 | 0.665 | 0.663 | 0.660 | 0.657 | 0.656 | 0.654 | 0.650
500 | 0.672 | 0.668 | 0.669 | 0.664 | 0.663 | 0.660 | 0.659 | 0.656 | 0.652
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.644 | 0.646 | 0.643 | 0.639 | 0.636 | 0.633 | 0.627 | 0.622 | 0.613
100 | 0.650 | 0.644 | 0.642 | 0.641 | 0.637 | 0.635 | 0.629 | 0.622 | 0.614
500 | 0.651 | 0.649 | 0.644 | 0.644 | 0.637 | 0.634 | 0.630 | 0.624 | 0.615

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.660 | 0.663 | 0.661 | 0.659 | 0.659 | 0.660 | 0.658 | 0.655 | 0.658
100 | 0.669 | 0.669 | 0.669 | 0.666 | 0.664 | 0.661 | 0.663 | 0.661 | 0.659
500 | 0.673 | 0.671 | 0.673 | 0.668 | 0.668 | 0.666 | 0.666 | 0.664 | 0.661
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.653 | 0.655 | 0.653 | 0.652 | 0.650 | 0.650 | 0.647 | 0.646 | 0.644
100 | 0.660 | 0.655 | 0.655 | 0.654 | 0.653 | 0.654 | 0.651 | 0.649 | 0.648
500 | 0.661 | 0.660 | 0.657 | 0.659 | 0.654 | 0.653 | 0.653 | 0.651 | 0.652
Table 8. Cross-validated accuracies in binary classification with n = 500 and a 40-60 positive–negative ratio.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.670 | 0.668 | 0.670 | 0.664 | 0.663 | 0.662 | 0.664 | 0.657 | 0.660
100 | 0.676 | 0.675 | 0.675 | 0.671 | 0.666 | 0.667 | 0.662 | 0.663 | 0.663
500 | 0.678 | 0.678 | 0.677 | 0.673 | 0.672 | 0.671 | 0.667 | 0.662 | 0.661
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.654 | 0.654 | 0.650 | 0.646 | 0.643 | 0.641 | 0.635 | 0.633 | 0.617
100 | 0.658 | 0.652 | 0.651 | 0.647 | 0.643 | 0.646 | 0.639 | 0.634 | 0.620
500 | 0.659 | 0.660 | 0.653 | 0.650 | 0.645 | 0.638 | 0.637 | 0.634 | 0.621

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.669 | 0.667 | 0.671 | 0.666 | 0.666 | 0.667 | 0.670 | 0.664 | 0.667
100 | 0.676 | 0.676 | 0.675 | 0.672 | 0.670 | 0.672 | 0.671 | 0.669 | 0.669
500 | 0.678 | 0.680 | 0.680 | 0.678 | 0.677 | 0.677 | 0.674 | 0.670 | 0.670
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.663 | 0.665 | 0.663 | 0.660 | 0.659 | 0.658 | 0.654 | 0.656 | 0.652
100 | 0.668 | 0.663 | 0.663 | 0.662 | 0.659 | 0.664 | 0.662 | 0.659 | 0.653
500 | 0.671 | 0.670 | 0.667 | 0.666 | 0.663 | 0.660 | 0.660 | 0.662 | 0.656
Table 9. Cross-validated accuracies in binary classification with n = 500 and a 30-70 positive–negative ratio.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.704 | 0.705 | 0.704 | 0.702 | 0.701 | 0.698 | 0.697 | 0.694 | 0.692
100 | 0.709 | 0.709 | 0.706 | 0.704 | 0.702 | 0.702 | 0.699 | 0.698 | 0.695
500 | 0.713 | 0.709 | 0.710 | 0.709 | 0.706 | 0.703 | 0.701 | 0.700 | 0.695
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.690 | 0.689 | 0.685 | 0.684 | 0.680 | 0.675 | 0.671 | 0.666 | 0.654
100 | 0.694 | 0.690 | 0.685 | 0.685 | 0.683 | 0.677 | 0.673 | 0.665 | 0.656
500 | 0.694 | 0.693 | 0.689 | 0.686 | 0.682 | 0.674 | 0.676 | 0.667 | 0.658

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.703 | 0.705 | 0.705 | 0.704 | 0.706 | 0.703 | 0.702 | 0.701 | 0.700
100 | 0.707 | 0.710 | 0.708 | 0.708 | 0.706 | 0.706 | 0.704 | 0.705 | 0.703
500 | 0.712 | 0.710 | 0.713 | 0.713 | 0.710 | 0.709 | 0.707 | 0.707 | 0.704
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.699 | 0.698 | 0.697 | 0.697 | 0.696 | 0.692 | 0.693 | 0.692 | 0.689
100 | 0.704 | 0.701 | 0.699 | 0.700 | 0.699 | 0.696 | 0.695 | 0.693 | 0.691
500 | 0.704 | 0.704 | 0.701 | 0.700 | 0.699 | 0.693 | 0.698 | 0.695 | 0.694
Table 10. Cross-validated accuracies in binary classification with n = 500 and a 20-80 positive–negative ratio.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.764 | 0.765 | 0.762 | 0.762 | 0.762 | 0.761 | 0.759 | 0.760 | 0.752
100 | 0.766 | 0.768 | 0.765 | 0.767 | 0.766 | 0.762 | 0.760 | 0.759 | 0.755
500 | 0.768 | 0.769 | 0.768 | 0.767 | 0.766 | 0.764 | 0.764 | 0.763 | 0.756
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.751 | 0.750 | 0.747 | 0.744 | 0.739 | 0.738 | 0.732 | 0.727 | 0.715
100 | 0.752 | 0.751 | 0.748 | 0.744 | 0.743 | 0.738 | 0.731 | 0.727 | 0.714
500 | 0.754 | 0.753 | 0.749 | 0.745 | 0.742 | 0.739 | 0.735 | 0.727 | 0.715

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.764 | 0.765 | 0.763 | 0.764 | 0.765 | 0.764 | 0.762 | 0.763 | 0.761
100 | 0.766 | 0.769 | 0.766 | 0.768 | 0.767 | 0.765 | 0.763 | 0.763 | 0.762
500 | 0.767 | 0.769 | 0.769 | 0.769 | 0.768 | 0.766 | 0.766 | 0.766 | 0.763
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.759 | 0.760 | 0.759 | 0.758 | 0.756 | 0.756 | 0.754 | 0.753 | 0.752
100 | 0.761 | 0.761 | 0.760 | 0.758 | 0.759 | 0.757 | 0.754 | 0.755 | 0.752
500 | 0.762 | 0.763 | 0.761 | 0.759 | 0.759 | 0.758 | 0.757 | 0.754 | 0.754
Table 11. Cross-validated mean squared error (MSE) in regression when n = 100 and error standard deviation = 1.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 21.848 | 16.881 | 15.197 | 13.961 | 12.943 | 12.463 | 12.072 | 11.710 | 11.584
100 | 21.454 | 16.774 | 14.880 | 13.811 | 12.976 | 12.314 | 11.882 | 11.427 | 11.337
500 | 21.166 | 16.445 | 14.907 | 13.517 | 12.726 | 12.213 | 11.724 | 11.531 | 11.171
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 11.371 | 11.374 | 11.306 | 11.281 | 11.497 | 11.779 | 12.319 | 12.973 | 15.020
100 | 11.176 | 11.053 | 11.132 | 11.336 | 11.439 | 11.660 | 12.061 | 13.060 | 15.029
500 | 10.995 | 11.142 | 11.100 | 11.118 | 11.305 | 11.781 | 12.117 | 12.864 | 14.706

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 26.835 | 20.025 | 17.472 | 15.801 | 14.501 | 13.593 | 12.977 | 12.382 | 12.217
100 | 26.530 | 19.779 | 17.060 | 15.613 | 14.205 | 13.462 | 12.776 | 12.273 | 11.940
500 | 26.337 | 19.578 | 17.027 | 15.371 | 14.147 | 13.346 | 12.549 | 12.015 | 11.744
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 11.757 | 11.552 | 11.371 | 11.149 | 11.039 | 10.896 | 10.829 | 10.708 | 10.716
100 | 11.538 | 11.359 | 11.153 | 10.929 | 10.807 | 10.685 | 10.603 | 10.611 | 10.616
500 | 11.412 | 11.292 | 11.065 | 10.778 | 10.621 | 10.593 | 10.449 | 10.395 | 10.353
Table 12. Cross-validated mean squared error (MSE) in regression when n = 100 and error standard deviation = 5.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 25.983 | 20.812 | 19.077 | 17.853 | 17.099 | 16.646 | 15.946 | 15.837 | 15.711
100 | 25.421 | 20.605 | 18.826 | 17.528 | 16.846 | 16.276 | 15.880 | 15.508 | 15.518
500 | 25.094 | 20.325 | 18.515 | 17.246 | 16.300 | 16.130 | 15.710 | 15.315 | 15.375
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 15.453 | 15.603 | 15.519 | 15.801 | 16.217 | 16.518 | 17.251 | 18.232 | 20.501
100 | 15.227 | 15.387 | 15.468 | 15.616 | 16.099 | 16.439 | 16.836 | 17.947 | 20.065
500 | 15.374 | 15.244 | 15.353 | 15.508 | 15.953 | 16.285 | 16.891 | 18.015 | 20.188

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 30.820 | 24.195 | 21.291 | 19.773 | 18.297 | 17.696 | 16.995 | 16.317 | 16.218
100 | 30.645 | 23.596 | 21.180 | 19.520 | 18.080 | 17.340 | 16.773 | 16.264 | 15.942
500 | 30.594 | 23.523 | 20.847 | 19.106 | 17.954 | 17.321 | 16.429 | 15.991 | 15.705
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 15.661 | 15.471 | 15.373 | 15.287 | 15.000 | 15.021 | 15.011 | 15.020 | 14.921
100 | 15.599 | 15.441 | 15.158 | 15.097 | 15.025 | 14.824 | 14.651 | 14.706 | 14.684
500 | 15.324 | 15.193 | 14.873 | 14.922 | 14.783 | 14.858 | 14.650 | 14.697 | 14.678
Table 13. Cross-validated mean squared error (MSE) in regression when n = 100 and error standard deviation = 10.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 30.869 | 25.840 | 23.904 | 22.958 | 21.948 | 21.463 | 21.223 | 20.957 | 20.809
100 | 30.245 | 25.395 | 23.928 | 22.482 | 21.736 | 21.259 | 20.713 | 20.858 | 20.857
500 | 30.002 | 25.292 | 23.275 | 22.268 | 21.579 | 21.034 | 20.882 | 20.653 | 20.559
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 20.828 | 21.187 | 21.357 | 21.517 | 22.041 | 22.369 | 22.898 | 24.689 | 27.948
100 | 20.820 | 20.665 | 21.062 | 21.218 | 21.615 | 22.299 | 22.934 | 24.815 | 27.276
500 | 20.541 | 20.826 | 20.996 | 21.022 | 21.589 | 22.130 | 23.248 | 24.457 | 27.548

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 36.313 | 29.430 | 26.444 | 25.066 | 23.298 | 22.624 | 22.022 | 21.708 | 21.345
100 | 35.636 | 28.738 | 26.095 | 24.535 | 23.151 | 22.362 | 21.731 | 21.120 | 20.787
500 | 35.449 | 28.779 | 25.830 | 24.174 | 23.159 | 22.080 | 21.621 | 20.886 | 20.904
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 20.755 | 20.805 | 20.636 | 20.561 | 20.287 | 20.281 | 20.277 | 20.489 | 20.632
100 | 20.694 | 20.370 | 20.413 | 20.180 | 20.196 | 20.204 | 20.261 | 20.124 | 20.147
500 | 20.559 | 20.393 | 20.235 | 20.155 | 20.017 | 20.058 | 20.217 | 19.975 | 20.114
Table 14. Cross-validated mean squared error (MSE) in regression when n = 500 and error standard deviation = 1.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 9.589 | 7.667 | 7.522 | 6.588 | 5.998 | 6.028 | 5.971 | 5.899 | 5.436
100 | 9.582 | 8.171 | 7.044 | 6.205 | 6.054 | 5.914 | 5.898 | 5.962 | 5.805
500 | 8.752 | 7.766 | 6.532 | 6.371 | 5.765 | 5.668 | 4.993 | 5.645 | 5.515
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 5.595 | 5.939 | 5.439 | 5.601 | 5.636 | 5.805 | 6.021 | 6.278 | 7.157
100 | 5.260 | 5.294 | 5.275 | 6.015 | 5.674 | 5.328 | 5.402 | 6.498 | 7.287
500 | 5.450 | 5.456 | 5.524 | 5.317 | 4.918 | 5.874 | 5.795 | 6.023 | 7.682

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 10.598 | 9.410 | 8.381 | 7.665 | 6.863 | 6.497 | 6.296 | 5.935 | 6.019
100 | 11.076 | 9.142 | 8.173 | 7.496 | 6.659 | 6.450 | 6.278 | 6.029 | 5.962
500 | 10.697 | 8.860 | 7.866 | 7.208 | 6.718 | 6.434 | 6.095 | 5.922 | 5.757
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 5.832 | 5.599 | 5.204 | 5.478 | 5.550 | 5.290 | 5.029 | 5.203 | 5.238
100 | 5.545 | 5.513 | 5.152 | 5.349 | 5.262 | 5.285 | 5.184 | 5.071 | 4.881
500 | 5.455 | 5.497 | 5.343 | 5.025 | 5.380 | 4.879 | 5.081 | 5.151 | 4.830
Table 15. Cross-validated mean squared error (MSE) in regression when n = 500 and error standard deviation = 5.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 12.942 | 11.877 | 10.840 | 10.418 | 10.320 | 9.930 | 10.007 | 9.578 | 9.578
100 | 13.385 | 11.338 | 10.704 | 9.907 | 10.125 | 9.716 | 9.695 | 9.278 | 10.473
500 | 12.413 | 11.400 | 10.324 | 10.758 | 9.939 | 10.054 | 9.572 | 9.630 | 9.206
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 9.613 | 9.544 | 10.246 | 10.023 | 9.593 | 10.021 | 10.268 | 11.293 | 12.632
100 | 9.591 | 9.693 | 9.311 | 9.665 | 9.635 | 9.980 | 10.174 | 11.339 | 11.533
500 | 9.600 | 9.400 | 9.345 | 9.594 | 9.915 | 9.779 | 10.239 | 11.128 | 11.867

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 14.373 | 13.066 | 11.881 | 11.210 | 11.031 | 10.540 | 10.251 | 9.871 | 9.852
100 | 14.602 | 12.354 | 11.762 | 10.778 | 10.872 | 10.177 | 10.135 | 9.678 | 10.484
500 | 13.746 | 12.534 | 11.313 | 11.616 | 10.538 | 10.558 | 10.100 | 9.959 | 9.427
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 9.807 | 9.573 | 9.978 | 9.662 | 9.213 | 9.446 | 9.148 | 9.558 | 9.447
100 | 9.664 | 9.585 | 9.164 | 9.394 | 8.953 | 9.255 | 9.112 | 9.620 | 9.053
500 | 9.609 | 9.392 | 9.206 | 9.272 | 9.294 | 8.984 | 9.228 | 9.168 | 9.192
Table 16. Cross-validated mean squared error (MSE) in regression when n = 500 and error standard deviation = 10.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 17.821 | 16.886 | 15.930 | 15.673 | 15.390 | 15.329 | 14.700 | 14.984 | 15.629
100 | 17.670 | 15.639 | 15.853 | 15.577 | 15.086 | 14.819 | 15.555 | 14.985 | 14.555
500 | 17.688 | 16.154 | 15.313 | 14.767 | 14.713 | 14.253 | 14.483 | 14.804 | 14.855
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 14.760 | 15.842 | 14.573 | 14.872 | 16.519 | 16.417 | 16.940 | 17.591 | 18.661
100 | 15.464 | 14.970 | 15.641 | 15.357 | 15.672 | 14.752 | 15.899 | 16.994 | 19.143
500 | 15.125 | 15.499 | 15.157 | 15.024 | 16.372 | 15.648 | 16.370 | 16.448 | 18.713

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 19.409 | 17.597 | 16.598 | 16.446 | 16.016 | 15.646 | 14.898 | 15.217 | 15.642
100 | 18.962 | 16.909 | 16.755 | 16.256 | 15.431 | 15.232 | 15.640 | 15.109 | 14.557
500 | 19.126 | 17.266 | 16.096 | 15.521 | 15.287 | 14.651 | 14.798 | 14.965 | 14.754
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 14.874 | 15.749 | 14.459 | 14.299 | 15.147 | 14.949 | 14.755 | 15.127 | 14.755
100 | 15.161 | 14.623 | 15.018 | 14.615 | 14.521 | 14.054 | 14.287 | 14.570 | 14.786
500 | 15.010 | 14.929 | 14.707 | 14.374 | 15.164 | 14.149 | 14.849 | 13.775 | 14.812
Table 17. Classification dataset quantitative summary.

Dataset | Sample Size | #Features | Label Counts | Label Distribution
Wine | 170 | 13 | 58, 65, 47 | 0.34, 0.38, 0.28
Iris | 150 | 4 | 50, 50, 50 | 0.33, 0.33, 0.33
Cleveland | 300 | 13 | 161, 55, 36, 35, 13 | 0.54, 0.18, 0.12, 0.12, 0.04
Diabetes | 760 | 8 | 494, 266 | 0.65, 0.35
German | 1000 | 20 | 700, 300 | 0.70, 0.30
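To make the evaluation behind Tables 18–22 concrete, the sketch below runs only the random-subsampling benchmark (not the proposed partition or shift schemes) on the Iris data from Table 17, using scikit-learn's BaggingClassifier with bootstrap=False so that each tree is grown on a subsample of distinct observations. The 10-fold setup, the number of trees, and the subsampling rate shown here are illustrative choices and not necessarily the exact settings used in the paper.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Random-subsampling tree ensemble: bootstrap=False draws each training
# subsample without replacement, so max_samples=0.3 means every tree
# sees 30% of the observations (k = 0.3n). The default base learner of
# BaggingClassifier is a decision tree.
ensemble = BaggingClassifier(
    n_estimators=500,   # number of trees in the sub-ensemble
    max_samples=0.3,    # subsampling rate k/n
    bootstrap=False,    # subsample without replacement
    random_state=0,
)

scores = cross_val_score(ensemble, X, y, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```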
Table 18. Cross-validated accuracy scores based on the Wine dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.934 | 0.935 | 0.937 | 0.939 | 0.940 | 0.944 | 0.945 | 0.946 | 0.944
50 | 0.947 | 0.946 | 0.947 | 0.948 | 0.951 | 0.953 | 0.953 | 0.952 | 0.952
100 | 0.953 | 0.948 | 0.951 | 0.955 | 0.955 | 0.956 | 0.958 | 0.955 | 0.953
500 | 0.954 | 0.950 | 0.953 | 0.958 | 0.961 | 0.962 | 0.961 | 0.958 | 0.954
1000 | 0.954 | 0.949 | 0.953 | 0.960 | 0.962 | 0.962 | 0.961 | 0.959 | 0.954
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.943 | 0.944 | 0.941 | 0.939 | 0.936 | 0.933 | 0.931 | 0.920 | 0.903
50 | 0.950 | 0.947 | 0.946 | 0.944 | 0.941 | 0.939 | 0.934 | 0.921 | 0.904
100 | 0.951 | 0.948 | 0.946 | 0.944 | 0.941 | 0.939 | 0.934 | 0.920 | 0.904
500 | 0.951 | 0.949 | 0.946 | 0.944 | 0.941 | 0.940 | 0.931 | 0.917 | 0.905
1000 | 0.951 | 0.950 | 0.946 | 0.944 | 0.942 | 0.940 | 0.931 | 0.914 | 0.904

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.931 | 0.937 | 0.942 | 0.937 | 0.940 | 0.939 | 0.942 | 0.944 | 0.948
50 | 0.953 | 0.954 | 0.948 | 0.950 | 0.950 | 0.952 | 0.952 | 0.953 | 0.956
100 | 0.961 | 0.956 | 0.955 | 0.954 | 0.951 | 0.953 | 0.955 | 0.957 | 0.960
500 | 0.968 | 0.962 | 0.958 | 0.954 | 0.954 | 0.958 | 0.962 | 0.962 | 0.963
1000 | 0.968 | 0.963 | 0.956 | 0.955 | 0.952 | 0.959 | 0.962 | 0.964 | 0.965
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.947 | 0.947 | 0.949 | 0.949 | 0.950 | 0.950 | 0.950 | 0.952 | 0.949
50 | 0.956 | 0.957 | 0.958 | 0.957 | 0.959 | 0.959 | 0.958 | 0.957 | 0.957
100 | 0.961 | 0.960 | 0.962 | 0.962 | 0.961 | 0.960 | 0.960 | 0.960 | 0.957
500 | 0.964 | 0.965 | 0.967 | 0.966 | 0.966 | 0.966 | 0.964 | 0.963 | 0.960
1000 | 0.965 | 0.966 | 0.966 | 0.968 | 0.967 | 0.966 | 0.965 | 0.964 | 0.961
Table 19. Cross-validated accuracy scores based on the Iris dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.950 | 0.950 | 0.948 | 0.947 | 0.950 | 0.948 | 0.949 | 0.950 | 0.950
50 | 0.951 | 0.951 | 0.950 | 0.950 | 0.951 | 0.950 | 0.952 | 0.951 | 0.951
100 | 0.952 | 0.951 | 0.952 | 0.951 | 0.952 | 0.952 | 0.953 | 0.952 | 0.951
500 | 0.952 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.952
1000 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.950 | 0.951 | 0.951 | 0.950 | 0.950 | 0.949 | 0.947 | 0.946 | 0.946
50 | 0.952 | 0.952 | 0.952 | 0.952 | 0.952 | 0.950 | 0.947 | 0.946 | 0.945
100 | 0.952 | 0.952 | 0.952 | 0.953 | 0.953 | 0.951 | 0.947 | 0.946 | 0.946
500 | 0.951 | 0.951 | 0.953 | 0.953 | 0.953 | 0.953 | 0.947 | 0.947 | 0.946
1000 | 0.952 | 0.950 | 0.953 | 0.953 | 0.953 | 0.953 | 0.947 | 0.946 | 0.946

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.945 | 0.948 | 0.950 | 0.948 | 0.948 | 0.947 | 0.948 | 0.950 | 0.948
50 | 0.951 | 0.951 | 0.951 | 0.949 | 0.949 | 0.950 | 0.950 | 0.951 | 0.951
100 | 0.950 | 0.952 | 0.951 | 0.951 | 0.952 | 0.952 | 0.952 | 0.952 | 0.952
500 | 0.952 | 0.952 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953
1000 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.948 | 0.949 | 0.949 | 0.949 | 0.951 | 0.949 | 0.951 | 0.951 | 0.951
50 | 0.952 | 0.951 | 0.952 | 0.951 | 0.952 | 0.952 | 0.951 | 0.952 | 0.952
100 | 0.953 | 0.953 | 0.953 | 0.952 | 0.953 | 0.953 | 0.952 | 0.952 | 0.952
500 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953
1000 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.951
Table 20. Cross-validated accuracy scores based on the Cleveland dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.574 | 0.574 | 0.572 | 0.572 | 0.571 | 0.568 | 0.568 | 0.565 | 0.564
50 | 0.574 | 0.577 | 0.576 | 0.575 | 0.575 | 0.573 | 0.571 | 0.568 | 0.566
100 | 0.576 | 0.579 | 0.578 | 0.578 | 0.576 | 0.573 | 0.571 | 0.566 | 0.564
500 | 0.578 | 0.577 | 0.580 | 0.580 | 0.578 | 0.573 | 0.570 | 0.565 | 0.562
1000 | 0.579 | 0.577 | 0.580 | 0.580 | 0.577 | 0.573 | 0.569 | 0.565 | 0.562
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.560 | 0.557 | 0.556 | 0.551 | 0.548 | 0.542 | 0.537 | 0.527 | 0.515
50 | 0.565 | 0.561 | 0.559 | 0.558 | 0.555 | 0.547 | 0.542 | 0.533 | 0.518
100 | 0.563 | 0.562 | 0.560 | 0.558 | 0.554 | 0.550 | 0.543 | 0.533 | 0.519
500 | 0.561 | 0.558 | 0.557 | 0.557 | 0.557 | 0.553 | 0.544 | 0.532 | 0.522
1000 | 0.560 | 0.557 | 0.557 | 0.556 | 0.556 | 0.555 | 0.545 | 0.531 | 0.522

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.572 | 0.576 | 0.576 | 0.575 | 0.576 | 0.573 | 0.576 | 0.575 | 0.572
50 | 0.580 | 0.583 | 0.582 | 0.580 | 0.580 | 0.581 | 0.578 | 0.575 | 0.575
100 | 0.583 | 0.584 | 0.583 | 0.583 | 0.583 | 0.582 | 0.581 | 0.579 | 0.577
500 | 0.584 | 0.585 | 0.583 | 0.583 | 0.583 | 0.584 | 0.584 | 0.582 | 0.580
1000 | 0.583 | 0.585 | 0.584 | 0.584 | 0.584 | 0.584 | 0.584 | 0.583 | 0.579
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.569 | 0.569 | 0.568 | 0.566 | 0.566 | 0.564 | 0.561 | 0.561 | 0.561
50 | 0.573 | 0.570 | 0.572 | 0.571 | 0.567 | 0.566 | 0.564 | 0.563 | 0.564
100 | 0.575 | 0.574 | 0.570 | 0.570 | 0.567 | 0.565 | 0.565 | 0.563 | 0.561
500 | 0.576 | 0.574 | 0.571 | 0.568 | 0.566 | 0.565 | 0.563 | 0.562 | 0.560
1000 | 0.578 | 0.573 | 0.569 | 0.567 | 0.565 | 0.565 | 0.562 | 0.562 | 0.560
Table 21. Cross-validated accuracy scores based on the Diabetes dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.751 | 0.751 | 0.754 | 0.753 | 0.753 | 0.755 | 0.754 | 0.754 | 0.753
50 | 0.759 | 0.758 | 0.758 | 0.759 | 0.760 | 0.760 | 0.761 | 0.761 | 0.761
100 | 0.761 | 0.761 | 0.760 | 0.761 | 0.762 | 0.763 | 0.764 | 0.764 | 0.762
500 | 0.761 | 0.760 | 0.760 | 0.763 | 0.765 | 0.766 | 0.767 | 0.766 | 0.766
1000 | 0.761 | 0.761 | 0.760 | 0.763 | 0.764 | 0.766 | 0.768 | 0.767 | 0.765
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.751 | 0.751 | 0.753 | 0.749 | 0.750 | 0.748 | 0.748 | 0.746 | 0.739
50 | 0.759 | 0.759 | 0.759 | 0.758 | 0.757 | 0.755 | 0.755 | 0.750 | 0.741
100 | 0.762 | 0.762 | 0.763 | 0.762 | 0.760 | 0.759 | 0.757 | 0.752 | 0.741
500 | 0.766 | 0.767 | 0.768 | 0.765 | 0.763 | 0.761 | 0.759 | 0.753 | 0.742
1000 | 0.767 | 0.768 | 0.769 | 0.766 | 0.764 | 0.762 | 0.759 | 0.753 | 0.742

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.750 | 0.751 | 0.752 | 0.752 | 0.750 | 0.753 | 0.752 | 0.753 | 0.752
50 | 0.760 | 0.759 | 0.759 | 0.759 | 0.758 | 0.758 | 0.760 | 0.760 | 0.759
100 | 0.763 | 0.761 | 0.760 | 0.761 | 0.762 | 0.760 | 0.761 | 0.763 | 0.762
500 | 0.765 | 0.762 | 0.761 | 0.761 | 0.762 | 0.763 | 0.763 | 0.765 | 0.764
1000 | 0.765 | 0.761 | 0.760 | 0.762 | 0.762 | 0.762 | 0.763 | 0.764 | 0.765
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.753 | 0.754 | 0.752 | 0.754 | 0.751 | 0.752 | 0.753 | 0.751 | 0.752
50 | 0.759 | 0.761 | 0.760 | 0.759 | 0.760 | 0.760 | 0.759 | 0.759 | 0.759
100 | 0.763 | 0.764 | 0.763 | 0.764 | 0.764 | 0.763 | 0.764 | 0.763 | 0.763
500 | 0.765 | 0.765 | 0.766 | 0.766 | 0.768 | 0.768 | 0.768 | 0.767 | 0.767
1000 | 0.765 | 0.765 | 0.767 | 0.767 | 0.768 | 0.769 | 0.769 | 0.769 | 0.768
Table 22. Cross-validated accuracy scores based on the German dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.717 | 0.716 | 0.715 | 0.716 | 0.715 | 0.715 | 0.711 | 0.711 | 0.707
50 | 0.730 | 0.730 | 0.728 | 0.727 | 0.725 | 0.722 | 0.720 | 0.717 | 0.715
100 | 0.737 | 0.736 | 0.735 | 0.730 | 0.728 | 0.725 | 0.722 | 0.720 | 0.719
500 | 0.743 | 0.737 | 0.735 | 0.734 | 0.730 | 0.729 | 0.726 | 0.723 | 0.721
1000 | 0.743 | 0.737 | 0.735 | 0.734 | 0.732 | 0.729 | 0.726 | 0.724 | 0.722
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.704 | 0.703 | 0.701 | 0.698 | 0.695 | 0.689 | 0.686 | 0.677 | 0.658
50 | 0.714 | 0.711 | 0.707 | 0.704 | 0.700 | 0.695 | 0.691 | 0.681 | 0.657
100 | 0.715 | 0.713 | 0.710 | 0.707 | 0.702 | 0.699 | 0.692 | 0.684 | 0.655
500 | 0.720 | 0.717 | 0.714 | 0.711 | 0.706 | 0.703 | 0.697 | 0.688 | 0.655
1000 | 0.720 | 0.717 | 0.715 | 0.712 | 0.708 | 0.704 | 0.698 | 0.688 | 0.656

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.724 | 0.726 | 0.727 | 0.727 | 0.727 | 0.726 | 0.724 | 0.722 | 0.721
50 | 0.733 | 0.733 | 0.735 | 0.736 | 0.736 | 0.734 | 0.730 | 0.730 | 0.727
100 | 0.738 | 0.736 | 0.738 | 0.737 | 0.738 | 0.736 | 0.733 | 0.732 | 0.729
500 | 0.744 | 0.736 | 0.735 | 0.738 | 0.738 | 0.737 | 0.735 | 0.735 | 0.732
1000 | 0.744 | 0.735 | 0.733 | 0.737 | 0.738 | 0.737 | 0.735 | 0.734 | 0.733
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.720 | 0.718 | 0.717 | 0.716 | 0.715 | 0.713 | 0.711 | 0.711 | 0.709
50 | 0.727 | 0.724 | 0.724 | 0.723 | 0.721 | 0.718 | 0.716 | 0.718 | 0.715
100 | 0.729 | 0.727 | 0.724 | 0.723 | 0.721 | 0.720 | 0.719 | 0.718 | 0.716
500 | 0.731 | 0.729 | 0.724 | 0.722 | 0.721 | 0.720 | 0.721 | 0.721 | 0.720
1000 | 0.732 | 0.729 | 0.724 | 0.721 | 0.721 | 0.722 | 0.720 | 0.723 | 0.721
Table 23. Regression dataset quantitative summary.

Dataset | Sample Size | Number of Features | Missing Values?
Housing | 410 | 6 | no
Energy | 760 | 8 | no
Forest fires | 510 | 12 | no
Table 24. Cross-validated MSEs based on the Housing dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 64.451 | 61.747 | 60.106 | 58.868 | 58.251 | 57.800 | 57.795 | 57.965 | 58.182
50 | 62.261 | 59.850 | 58.379 | 57.461 | 56.724 | 56.422 | 56.495 | 56.583 | 56.664
100 | 61.594 | 59.323 | 57.780 | 56.882 | 56.214 | 55.840 | 55.872 | 56.125 | 56.505
500 | 61.017 | 58.729 | 57.280 | 56.338 | 55.785 | 55.508 | 55.491 | 55.693 | 56.048
1000 | 60.948 | 58.670 | 57.161 | 56.321 | 55.735 | 55.450 | 55.428 | 55.630 | 56.030
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 58.629 | 59.492 | 60.969 | 62.438 | 65.022 | 66.863 | 69.860 | 73.526 | 78.874
50 | 57.316 | 58.389 | 59.688 | 61.636 | 63.378 | 65.921 | 68.987 | 72.986 | 78.365
100 | 57.075 | 58.190 | 59.481 | 61.073 | 63.421 | 65.714 | 68.670 | 72.721 | 78.387
500 | 56.778 | 57.762 | 59.212 | 60.793 | 62.968 | 65.302 | 68.569 | 72.465 | 78.059
1000 | 56.696 | 57.735 | 59.184 | 60.834 | 62.907 | 65.249 | 68.549 | 72.501 | 78.092

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 67.151 | 64.379 | 62.169 | 61.107 | 59.718 | 59.252 | 58.622 | 58.385 | 58.076
50 | 64.793 | 61.910 | 60.431 | 58.964 | 58.215 | 57.628 | 56.984 | 56.486 | 56.482
100 | 64.054 | 61.286 | 59.705 | 58.504 | 57.843 | 57.001 | 56.499 | 56.172 | 55.841
500 | 63.357 | 60.803 | 59.118 | 58.048 | 57.183 | 56.580 | 56.052 | 55.692 | 55.560
1000 | 63.402 | 60.742 | 59.072 | 57.983 | 57.105 | 56.519 | 56.033 | 55.668 | 55.497
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 57.468 | 57.993 | 57.527 | 57.551 | 58.006 | 58.132 | 58.414 | 58.666 | 59.275
50 | 56.236 | 55.999 | 56.197 | 56.518 | 56.438 | 56.746 | 57.128 | 57.367 | 57.831
100 | 55.851 | 55.737 | 55.769 | 55.914 | 56.123 | 56.312 | 56.481 | 57.019 | 57.375
500 | 55.361 | 55.358 | 55.411 | 55.495 | 55.719 | 55.909 | 56.216 | 56.621 | 57.078
1000 | 55.357 | 55.307 | 55.357 | 55.474 | 55.605 | 55.869 | 56.232 | 56.542 | 57.054
Table 25. Cross-validated MSEs based on the Energy dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 1.783 | 0.886 | 0.551 | 0.444 | 0.410 | 0.394 | 0.380 | 0.369 | 0.356
50 | 1.678 | 0.828 | 0.520 | 0.425 | 0.398 | 0.386 | 0.374 | 0.363 | 0.351
100 | 1.631 | 0.815 | 0.509 | 0.418 | 0.392 | 0.381 | 0.370 | 0.360 | 0.349
500 | 1.604 | 0.793 | 0.499 | 0.413 | 0.389 | 0.379 | 0.369 | 0.358 | 0.348
1000 | 1.600 | 0.790 | 0.498 | 0.412 | 0.389 | 0.379 | 0.369 | 0.358 | 0.347
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.344 | 0.336 | 0.325 | 0.315 | 0.302 | 0.291 | 0.279 | 0.270 | 0.267
50 | 0.339 | 0.330 | 0.320 | 0.311 | 0.300 | 0.288 | 0.277 | 0.268 | 0.265
100 | 0.339 | 0.330 | 0.319 | 0.309 | 0.298 | 0.287 | 0.275 | 0.267 | 0.265
500 | 0.338 | 0.327 | 0.318 | 0.308 | 0.298 | 0.286 | 0.275 | 0.267 | 0.264
1000 | 0.337 | 0.327 | 0.317 | 0.308 | 0.297 | 0.286 | 0.275 | 0.267 | 0.264

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 3.024 | 1.603 | 0.910 | 0.696 | 0.513 | 0.433 | 0.441 | 0.408 | 0.350
50 | 2.706 | 1.437 | 0.872 | 0.603 | 0.470 | 0.428 | 0.382 | 0.360 | 0.360
100 | 2.496 | 1.358 | 0.862 | 0.586 | 0.460 | 0.406 | 0.389 | 0.355 | 0.339
500 | 2.501 | 1.358 | 0.824 | 0.579 | 0.453 | 0.400 | 0.368 | 0.353 | 0.336
1000 | 2.452 | 1.358 | 0.835 | 0.577 | 0.454 | 0.400 | 0.371 | 0.350 | 0.336
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.334 | 0.329 | 0.315 | 0.317 | 0.300 | 0.282 | 0.261 | 0.256 | 0.246
50 | 0.330 | 0.332 | 0.316 | 0.309 | 0.283 | 0.276 | 0.266 | 0.265 | 0.249
100 | 0.323 | 0.314 | 0.297 | 0.299 | 0.282 | 0.273 | 0.271 | 0.254 | 0.254
500 | 0.324 | 0.313 | 0.300 | 0.293 | 0.279 | 0.267 | 0.261 | 0.254 | 0.246
1000 | 0.325 | 0.311 | 0.301 | 0.287 | 0.279 | 0.268 | 0.260 | 0.252 | 0.246
Table 26. Cross-validated MSEs based on the Forest Fires dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 2.106 | 2.115 | 2.137 | 2.182 | 2.171 | 2.171 | 2.208 | 2.150 | 2.206
50 | 2.063 | 2.070 | 2.087 | 2.127 | 2.119 | 2.129 | 2.139 | 2.131 | 2.144
100 | 2.061 | 2.085 | 2.076 | 2.077 | 2.098 | 2.117 | 2.101 | 2.106 | 2.140
500 | 2.043 | 2.057 | 2.069 | 2.075 | 2.086 | 2.096 | 2.102 | 2.111 | 2.123
1000 | 2.036 | 2.055 | 2.065 | 2.078 | 2.084 | 2.092 | 2.102 | 2.110 | 2.115
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 2.206 | 2.230 | 2.268 | 2.276 | 2.249 | 2.292 | 2.325 | 2.346 | 2.404
50 | 2.163 | 2.157 | 2.186 | 2.198 | 2.211 | 2.231 | 2.250 | 2.298 | 2.369
100 | 2.143 | 2.149 | 2.161 | 2.190 | 2.201 | 2.220 | 2.232 | 2.270 | 2.345
500 | 2.131 | 2.149 | 2.163 | 2.169 | 2.190 | 2.211 | 2.237 | 2.269 | 2.345
1000 | 2.134 | 2.143 | 2.155 | 2.169 | 2.187 | 2.206 | 2.232 | 2.266 | 2.341

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 2.112 | 2.106 | 2.145 | 2.175 | 2.130 | 2.178 | 2.158 | 2.173 | 2.194
50 | 2.056 | 2.049 | 2.087 | 2.095 | 2.100 | 2.101 | 2.111 | 2.114 | 2.117
100 | 2.048 | 2.057 | 2.060 | 2.089 | 2.072 | 2.091 | 2.089 | 2.102 | 2.105
500 | 2.031 | 2.043 | 2.050 | 2.055 | 2.065 | 2.072 | 2.085 | 2.078 | 2.091
1000 | 2.027 | 2.042 | 2.047 | 2.059 | 2.058 | 2.076 | 2.083 | 2.083 | 2.087
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 2.169 | 2.202 | 2.181 | 2.165 | 2.195 | 2.204 | 2.200 | 2.195 | 2.253
50 | 2.134 | 2.139 | 2.141 | 2.128 | 2.158 | 2.154 | 2.158 | 2.161 | 2.169
100 | 2.116 | 2.127 | 2.132 | 2.117 | 2.120 | 2.134 | 2.157 | 2.157 | 2.162
500 | 2.090 | 2.094 | 2.107 | 2.111 | 2.108 | 2.116 | 2.128 | 2.129 | 2.138
1000 | 2.091 | 2.095 | 2.106 | 2.108 | 2.114 | 2.118 | 2.128 | 2.137 | 2.143