Article

Enhancing Diversity and Improving Prediction Performance of Subsampling-Based Ensemble Methods

Department of Mathematics and Statistics, Wellesley College, Wellesley, MA 02481, USA
* Author to whom correspondence should be addressed.
Stats 2025, 8(4), 86; https://doi.org/10.3390/stats8040086
Submission received: 27 August 2025 / Revised: 22 September 2025 / Accepted: 24 September 2025 / Published: 26 September 2025
(This article belongs to the Section Applied Statistics and Machine Learning Methods)

Abstract

This paper investigates how diversity among training samples impacts the predictive performance of a subsampling-based ensemble. It is well known that diverse training samples improve ensemble predictions, and smaller subsampling rates naturally lead to enhanced diversity. However, this approach of achieving a higher degree of diversity often comes with the cost of a reduced training sample size, which is undesirable. This paper introduces two novel subsampling strategies—partition and shift subsampling—as alternative schemes designed to improve diversity without sacrificing the training sample size in subsampling-based ensemble methods. From a probabilistic perspective, we investigate their impact on subsample diversity when utilized with tree-based sub-ensemble learners in comparison to the benchmark random subsampling. Through extensive simulations and eight real-world examples in both regression and classification contexts, we found a significant improvement in the predictive performance of the developed methods. Notably, this gain is particularly pronounced on challenging datasets or when higher subsampling rates are employed.

1. Introduction

Regression and classification are two of the most fundamental statistical problems, both falling within the realm of supervised learning where the given dataset contains a response variable of interest. Their distinction lies in the type of response variable: when the response is quantitative, it is a regression problem; when the response is categorical, it is referred to as classification. Over the past few decades, a vast array of statistical and machine learning tools have been developed to tackle regression and classification problems. For regression, common approaches include ordinary linear regression [1] and penalized regression [2], among others. For classification, techniques such as generalized linear regression [3], neural networks [4], support vector machines [5], Bayesian methods [6], decision trees [7], and others may be applied. In this paper, we focus our attention on tree-based methods, which are versatile tools that are applicable to both regression and classification tasks.
Classification and Regression Trees (CART), first introduced by Breiman et al. [7], offer a general method for regression and classification through recursive binary splitting of the given feature space. While it is easy to visualize and straightforward to implement, CART suffers from large sampling variation, which often leads to less competitive predictive performance as compared to other existing methods. To address its drawback, Breiman [8] developed bagging (bootstrap aggregation), which averages over predictions from numerous base learners. Building on this, Breiman [9] later proposed the random forest algorithm, further decorrelating individual trees to enhance the predictive power.
In this paper, we focus on subsampling-based ensemble methods, a computationally efficient alternative to conventional ensembles. Instead of drawing bootstrap samples of the same size as the original sample size n, they employ subsamples of a smaller size k (k < n) taken without replacement as individual training datasets. The reduced training sample size directly translates into improved computational efficiency, alleviating the computational burden of conventional ensemble estimators. Moreover, Bühlmann and Yu [10] showed that subbagging, for example, can achieve predictive power comparable to that of the conventional bagging estimator. Other recent work studying subsampling-based ensemble methods includes Mentch and Hooker [11], Peng et al. [12], and Wang and Wei [13].
For subsampling-based ensemble methods like subbagging and sub-random forests, it is well recognized that diversity among training samples is crucial for the predictive performance [14]. Reducing the subsampling proportion intuitively increases the diversity. However, this comes at the cost of a smaller training sample size, negatively impacting the prediction accuracy. To address this inherent trade-off, we develop two novel subsampling schemes designed to enhance the training sample diversity without sacrificing the size.
The remainder of the paper is organized as follows: In Section 2, Materials and Methods, we begin by illustrating the relationship between the subsampling rates and predictive performance in subsampling-based ensembles in Section 2.1. Then, Section 2.2 details our proposed subsampling schemes: partition subsampling and shift subsampling. Moreover, this section also quantifies their diversity levels against the benchmark random subsampling from a probabilistic perspective. Section 3 is devoted to numerical investigations: We present extensive simulation studies in Section 3.1 for both regression and classification scenarios to evaluate our methods’ performance against the benchmark. Following this, Section 3.2 showcases eight real-world data examples in regression and classification problems. Finally, Section 4 concludes the paper with a brief summary and discussion of future work.

2. Materials and Methods

2.1. The Role of Diversity

It is widely recognized that diversity is a cornerstone for the predictive performance of ensemble estimators. Indeed, the success of ensembles in machine learning is often attributed directly to the level of diversity they embody. The previous literature has established strong connections between ensemble diversity and performance [14,15,16]. Over the past two decades, numerous methods for measuring diversity have been proposed. Diversity can arise from variations in the samples used to train base learners or from employing distinct base learning algorithms within an ensemble. In addition, ensemble diversity can also be achieved by modifying the machinery of the model-building process. Rotation forest [17] and AdaBoost [18] are two examples of this latter approach, as they create diverse models by transforming the feature space or iteratively adjusting data weights. Quantifying diversity can therefore involve metrics related to either of these sources [19]. Furthermore, diversity can also be assessed by focusing on the predictions generated by individual base learners across an ensemble [20]. While efforts have been made to develop a unified framework for diversity quantification [21], a widely accepted approach has yet to exist. In this paper, we focus on measuring diversity through the similarity between subsamples used to train individual base learners, specifically within the context of subsampling-based homogeneous ensemble methods using CART as the base learners.
For subsampling-based ensemble methods, such as subbagging, the diversity among training samples is largely influenced by the size of these training sets. Intuitively, larger training sample sizes increase the likelihood of substantial overlaps among randomly generated subsamples, thereby impairing the diversity. The success of ensemble methods is rooted in the principle “wisdom of the crowds”: individual models often make different types of errors. By aggregating the predictions of different base learners, these errors tend to cancel each other out, resulting in a more accurate final outcome. However, while a larger sample size generally reduces the variation in a single model, its impact on ensemble performance is more nuanced. If all base learners are trained on the same or similar large dataset, they are likely to be highly correlated and therefore make similar errors. In short, enhancing the diversity of an ensemble is more critical to the success of the ensemble method than increasing the sample size. To better illustrate this phenomenon, we present an analysis using the Wine dataset in this section [22]. This dataset addresses a multi-class classification task, aiming to categorize wines into one of three regions based on 13 features quantifying their chemical composition. The dataset comprises 178 observations, with 58, 65, and 47 instances for each class, corresponding to approximately 34%, 38%, and 28% of the total observations, respectively. We randomly sampled 170 observations from the raw dataset for ease of implementation of 10-fold cross validation, as described below. More information on the Wine dataset can be found in Section 3.2, as well as in the UCI Machine Learning Repository (https://archive.ics.uci.edu/).
We define the subsampling proportion, p, as the percentage of the dataset used to generate each training sample, such that k = n p , where n is the learning sample size, and k is the subsample size. To assess the effect of the subsampling proportion on the predictive performance, we utilized a subbagging estimator, aggregating 500 individual trees to produce the ensemble outcome. For each value of p ranging from 0.40 to 0.95 incremented by 0.05, we computed the 10-fold cross-validated accuracy. (Under 10-fold cross validation, n denotes the size of the delete-one-fold learning sample size.) To mitigate the impact of randomness inherent in ensemble learning, we refit the subbagging estimator 100 times, reporting the average cross-validated accuracy scores.
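To make this setup concrete, the following R sketch outlines one way to compute the 10-fold cross-validated accuracy of a subbagging classifier under the benchmark random subsampling. It assumes the Wine data are stored in a data frame named wine with a factor response column class; these names, as well as the use of single randomForest trees (ntree = 1) as base learners, are illustrative assumptions rather than the authors' exact implementation.

```r
# A sketch of the Figure 1 experiment: 10-fold cross-validated accuracy of a subbagging
# classifier under random subsampling, for one subsampling proportion p
library(randomForest)

subbag_cv_accuracy <- function(data, p, n_trees = 500, n_folds = 10) {
  folds <- sample(rep(1:n_folds, length.out = nrow(data)))   # random fold assignment
  correct <- 0
  for (f in 1:n_folds) {
    train <- data[folds != f, ]
    test  <- data[folds == f, ]
    k <- floor(p * nrow(train))                              # subsample size k = np
    votes <- replicate(n_trees, {
      idx  <- sample(nrow(train), k)                         # random subsampling (benchmark)
      tree <- randomForest(class ~ ., data = train[idx, ],
                           ntree = 1, mtry = ncol(train) - 1)  # one tree, all predictors
      as.character(predict(tree, test))
    })
    pred <- apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote
    correct <- correct + sum(pred == test$class)
  }
  correct / nrow(data)
}

# accuracy curve over subsampling proportions (averaged over many refits in the paper)
sapply(seq(0.40, 0.95, by = 0.05), function(p) subbag_cv_accuracy(wine, p))
```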
Figure 1 illustrates the relationship between accuracy and varying subsampling proportions. It is evident that increasing the subsampling proportion leads to a significant decrease in accuracy, indicating a deterioration in predictive performance. This consistent pattern was also observed in our exploration of some regression datasets, which will be presented in Section 3.2.
As revealed by Figure 1, reducing the subsampling rate offers a direct means to enhance the diversity and improve the predictive performance. Nevertheless, this approach inherently reduces the training sample size, which negatively impacts the prediction outcomes. To circumvent this trade-off, we propose two novel subsampling schemes designed to generate a more diverse set of subsamples without compromising their size. Our objective is to foster enhanced diversity among training samples while preserving their scale, thereby further augmenting the performance of subsampling-based ensembles.

2.2. Proposed Methods

In this section, we describe two novel subsampling schemes devised to maximize the diversity among training samples for subsampling-based ensemble methods, thereby enhancing predictive performance without sacrificing training sample size. Before detailing these proposed methods, we will outline some general notation, review the conventional random subsampling approach, and provide additional motivation for our developed methods.
Let B be the total number of subsamples, each of size k ( k < n ). In subbagging, B also corresponds to the number of individual trees of the ensemble. The traditional approach generates these training samples through random subsampling, drawing k instances from the n observations in a learning dataset without replacement and repeating this process B times independently. In contrast to this conventional approach, we will next discuss our two developed schemes: partition subsampling and shift subsampling.
Our proposal is inspired by the success and broad applications of previous work [23,24,25,26], in which partition and shift subsampling schemes have proven efficient and effective for U-statistic variance estimation [27,28] in applications ranging from cross validation and model comparison and selection to the assessment of the AUC (area under the ROC curve). The potential of these schemes to improve the diversity within subsampling-based ensembles motivated our research.

2.2.1. Partition Subsampling

Without loss of generality, assume n is divisible by k. Partition subsampling is applicable whenever the subsampling proportion p ≤ 1/2. It can be realized as follows: We begin by randomly shuffling the given learning sample of size n. This shuffled dataset is then systematically partitioned into n/k disjoint subsamples, each of size k. This process is repeated Bk/n times to obtain a total of B subsamples. Given a random partition, the generated n/k data subsets are inherently mutually exclusive, thereby maximizing the diversity among them. Furthermore, the partition subsampling scheme guarantees a minimum number of mutually non-overlapping subsamples (i.e., training samples) within an ensemble. Algorithm 1 outlines the detailed procedure, and Figure 2 displays the diagram of the partition subsampling scheme.
Algorithm 1 Partition Subsampling.
  • Input: Training dataset of size n, a number of subsamples to be generated B, a size k such that k ≤ n/2 for each subsample
  • Output: B subsamples, each of size k
  • while less than B subsamples have been generated do
  •     shuffle the training sample;
  •    systematically partition the shuffled training dataset into n/k non-overlapping subsamples
  • end while
  • Return B generated subsamples
Remark 1.
The assumption that n is divisible by k is set primarily for notational simplicity, and the algorithm can be easily modified to account for cases where this condition is not met. Let ⌊·⌋ and ⌈·⌉ represent the floor and ceiling of a real number, respectively. When this assumption does not hold, we partition each given learning sample of size n into ⌊n/k⌋ subsamples of size k. This process is repeated ⌈B/⌊n/k⌋⌉ times. In the final random partition, fewer than ⌊n/k⌋ subsamples may be selected to obtain a total of B subsamples.
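A minimal R sketch of Algorithm 1 is given below; it returns subsamples as lists of row indices and, for simplicity, assumes n is divisible by k as in the algorithm statement. The function name partition_subsamples is ours, not from an existing package.

```r
# A sketch of Algorithm 1: partition subsampling of row indices 1..n
partition_subsamples <- function(n, k, B) {
  stopifnot(k <= n / 2, n %% k == 0)                        # assumptions of Algorithm 1
  subsamples <- list()
  while (length(subsamples) < B) {
    shuffled <- sample(n)                                   # randomly shuffle the learning sample
    blocks   <- split(shuffled, rep(1:(n %/% k), each = k)) # n/k disjoint subsamples of size k
    subsamples <- c(subsamples, blocks)
  }
  subsamples[1:B]                                           # keep exactly B subsamples
}

subs <- partition_subsamples(n = 100, k = 20, B = 20)
length(intersect(subs[[1]], subs[[2]]))                     # 0: subsamples within a partition are disjoint
```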

2.2.2. Shift Subsampling

When the subsampling proportion p is greater than 1/2, the partition subsampling scheme is no longer applicable. To address this, we propose an alternative shift subsampling scheme that works effectively for larger subsampling rates. Shift subsampling generates pairs of subsamples, each of size k, with the minimal number of overlaps. Specifically, from a randomly shuffled learning dataset, we extract the first k instances and the last k instances to form a pair of subsamples. This pairing results in 2k − n overlaps, which is the smallest possible number of between-subsample overlaps in this context. This process is repeated B/2 times to yield a total of B subsamples. Note that when p = 1/2, the partition subsampling and shift subsampling are identical. Algorithm 2 describes this procedure in detail, and Figure 3 displays the diagram of the shift subsampling scheme.
Algorithm 2 Shift Subsampling.
  • Input: Training dataset of size n, a number of subsamples to be generated B, a size k such that n / 2 < k < n for each subsample
  • Output: B subsamples, each of size k
  • while less than B subsamples have been generated do
  •     shuffle the training sample;
  •     extract the first k and last k observations of the shuffled training dataset to form two subsamples
  • end while
  • Return B generated subsamples
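Analogously, a minimal R sketch of Algorithm 2 is shown below, again returning lists of row indices; the function name shift_subsamples is illustrative.

```r
# A sketch of Algorithm 2: shift subsampling of row indices 1..n, for n/2 < k < n
shift_subsamples <- function(n, k, B) {
  stopifnot(k > n / 2, k < n)
  subsamples <- list()
  while (length(subsamples) < B) {
    shuffled <- sample(n)                              # randomly shuffle the learning sample
    first <- shuffled[1:k]                             # the first k observations
    last  <- shuffled[(n - k + 1):n]                   # the last k observations
    subsamples <- c(subsamples, list(first, last))     # the pair shares exactly 2k - n points
  }
  subsamples[1:B]
}

subs <- shift_subsamples(n = 100, k = 60, B = 20)
length(intersect(subs[[1]], subs[[2]]))                # 2k - n = 20 overlaps within a pair
```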
Shift subsampling is particularly effective when the training sample size k is relatively large ( k > n / 2 ). In this context, random subsampling is more likely to generate data subsets with a significant amount of overlap. Shift subsampling is designed to maximize diversity between training datasets, making it a valuable scheme for larger sample sizes. Using a larger training sample size is generally desirable, as it often leads to more accurate individual predictions. Therefore, from this aspect, shift subsampling is expected to yield more accurate results than partition subsampling, which is limited to k n / 2 scenarios. These inherent design features—maximizing the diversity and utilizing larger sample sizes—explain the advantage and success of shift subsampling over other methods, as demonstrated in the numerical studies in Section 3.
Remark 2.
For random subsampling, the cost of the most efficient algorithm that generates a subsample of size k ( k < n ) is O ( k ) . Thus, creating B subsamples requires a total cost of O ( B k ) . In comparison, for partition subsampling, an initial random shuffle of the dataset of size n demands an O ( n ) effort, which yields n / k subsamples. Thus, under partition subsampling, the total cost for generating B subsamples is O ( n ( B k / n ) ) , which simplifies to O ( B k ) . Similarly, shift subsampling also starts with an O ( n ) shuffle but produces only two subsamples. Therefore, generating B subsamples demands a computational effort of order O ( n B / 2 ) , where, by design, k > n / 2 in the context of shift subsampling. Overall, the computational costs of partition subsampling, shift subsampling, and the benchmark random subsampling are all comparable.

2.2.3. Probabilistic Investigation

As the two proposed subsampling schemes entail, each of them aims to maximize diversity among training datasets. To further justify their superiority over conventional random subsampling, we take a probabilistic approach to measuring their diversity, focusing on between-subsample overlaps.
Given a subsampling strategy, let X be the number of overlaps between two randomly generated subsamples. When p ≤ 1/2 (i.e., k ≤ n/2), X takes values in {0, 1, …, k}, while X ∈ {2k − n, …, k} for p > 1/2 (i.e., k > n/2). In both scenarios, it is easy to see that, under random subsampling, the probability mass function of X can be written as
$$P(X = c \mid \text{random subsampling}) = \frac{\binom{n}{k}\binom{k}{c}\binom{n-k}{k-c}}{\binom{n}{k}^{2}} = \frac{\binom{k}{c}\binom{n-k}{k-c}}{\binom{n}{k}}, \qquad (1)$$
where c ∈ {0, …, k} for p ≤ 1/2 and c ∈ {2k − n, …, k} for p > 1/2. This agrees with the probability mass function of a Hypergeometric distribution with parameters (n, k, k).
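As a quick numerical check of Equation (1), the hypergeometric form can be evaluated directly with R's dhyper(); the values n = 100 and k = 20 below are illustrative.

```r
# Overlap distribution under random subsampling: Hypergeometric(n, k, k)
n <- 100; k <- 20
c_vals <- 0:k
pmf <- dhyper(c_vals, m = k, n = n - k, k = k)   # P(X = c) = C(k, c) C(n - k, k - c) / C(n, k)
sum(pmf)                                          # probabilities sum to 1
sum(c_vals * pmf)                                 # E(X) = k^2 / n = 4
```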
In contrast, under the partition subsampling scheme (i.e., k ≤ n/2), the probability that a pair of generated subsamples has exactly c overlaps can be expressed as follows:
$$P(X = c \mid \text{partition subsampling}) =
\begin{cases}
\dfrac{\frac{Bk}{n}\binom{n/k}{2}}{\binom{B}{2}} + \dfrac{\binom{n-k}{k}\binom{Bk/n}{2}(n/k)^{2}}{\binom{B}{2}\binom{n}{k}}, & \text{when } c = 0;\\[2ex]
\dfrac{\binom{k}{c}\binom{n-k}{k-c}\binom{Bk/n}{2}(n/k)^{2}}{\binom{B}{2}\binom{n}{k}}, & \text{when } c = 1, \ldots, k.
\end{cases} \qquad (2)$$
Equation (2) can be verified as follows.
Proof. 
In the following derivations, we utilize the Law of Total Probability, a fundamental rule in probability theory, to calculate the probability of zero overlaps by considering two mutually exclusive scenarios. In the case of partition subsampling, the pair of subsamples may be drawn from the same partition or may come from two different partitions. Under each of these two conditions, we decompose the probability of zero overlaps to complete the derivation. Specifically,
$$P(X = 0) = P(\{X = 0\} \cap \{\text{a within-partition pair}\}) + P(\{X = 0\} \cap \{\text{a between-partition pair}\}).$$
Then, by the General Multiplication Rule in probability theory,
$$\begin{aligned}
P(\{X = 0\} \cap \{\text{a within-partition pair}\}) &= P(X = 0 \mid \text{a within-partition pair})\, P(\text{a within-partition pair}),\\
P(\{X = 0\} \cap \{\text{a between-partition pair}\}) &= P(X = 0 \mid \text{a between-partition pair})\, P(\text{a between-partition pair}).
\end{aligned}$$
Hence,
$$\begin{aligned}
P(X = 0) &= P(X = 0 \mid \text{a within-partition pair})\, P(\text{a within-partition pair}) + P(X = 0 \mid \text{a between-partition pair})\, P(\text{a between-partition pair})\\
&= \frac{\frac{Bk}{n}\binom{n/k}{2}}{\binom{B}{2}} + \frac{\binom{n}{k}\binom{n-k}{k}}{\binom{n}{k}^{2}} \times \frac{\binom{Bk/n}{2}\binom{n/k}{1}\binom{n/k}{1}}{\binom{B}{2}}\\
&= \frac{\frac{Bk}{n}\binom{n/k}{2}}{\binom{B}{2}} + \frac{\binom{n-k}{k}\binom{Bk/n}{2}(n/k)^{2}}{\binom{B}{2}\binom{n}{k}}.
\end{aligned}$$
Similarly, when c = 1, …, k, we have
$$\begin{aligned}
P(X = c) &= P(\{X = c\} \cap \{\text{a within-partition pair}\}) + P(\{X = c\} \cap \{\text{a between-partition pair}\})\\
&= P(X = c \mid \text{a within-partition pair})\, P(\text{a within-partition pair}) + P(X = c \mid \text{a between-partition pair})\, P(\text{a between-partition pair})\\
&= 0 + \frac{\binom{n}{k}\binom{k}{c}\binom{n-k}{k-c}}{\binom{n}{k}^{2}} \times \frac{\binom{Bk/n}{2}\binom{n/k}{1}\binom{n/k}{1}}{\binom{B}{2}}\\
&= \frac{\binom{k}{c}\binom{n-k}{k-c}\binom{Bk/n}{2}(n/k)^{2}}{\binom{B}{2}\binom{n}{k}}.
\end{aligned}$$
   □
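The zero-overlap probability in Equation (2) can also be checked by simulation. The sketch below (with n = 100, k = 20, and B = 20, matching Figure 4) generates partition subsamples, estimates P(X = 0) empirically, and compares it with the closed form; the same approach applies to the shift-subsampling formula in Equation (3) below.

```r
# Monte Carlo check of the zero-overlap probability in Equation (2)
# (n = 100, k = 20, B = 20, as in Figure 4)
set.seed(1)
n <- 100; k <- 20; B <- 20
R <- 5000
zero_overlap <- replicate(R, {
  # generate B subsamples by partition subsampling: B*k/n shuffles, n/k blocks each
  subsamples <- do.call(rbind, lapply(seq_len(B * k / n), function(s) {
    matrix(sample(n), nrow = n / k, byrow = TRUE)      # each row is one subsample of size k
  }))
  pair <- sample(nrow(subsamples), 2)                  # pick a random pair of subsamples
  length(intersect(subsamples[pair[1], ], subsamples[pair[2], ])) == 0
})
mean(zero_overlap)                                     # empirical P(X = 0), about 0.22 here
# closed-form value from Equation (2) with c = 0
(B * k / n) * choose(n / k, 2) / choose(B, 2) +
  dhyper(0, k, n - k, k) * choose(B * k / n, 2) * (n / k)^2 / choose(B, 2)
```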
Furthermore, under shift subsampling (i.e., k > n/2), the probability mass function of X is given by
$$P(X = c \mid \text{shift subsampling}) =
\begin{cases}
\dfrac{B/2}{\binom{B}{2}} + \dfrac{4\binom{B/2}{2}\binom{k}{2k-n}}{\binom{B}{2}\binom{n}{k}}, & \text{when } c = 2k - n;\\[2ex]
\dfrac{4\binom{B/2}{2}\binom{k}{c}\binom{n-k}{k-c}}{\binom{B}{2}\binom{n}{k}}, & \text{when } c = 2k - n + 1, \ldots, k,
\end{cases} \qquad (3)$$
for c = 2k − n, …, k.
The proof of Equation (3) is presented below.
Proof. 
Let $S_i$ and $S_j$ denote a randomly chosen pair of the B generated subsamples. In the case of shift subsampling, such a pair may be drawn from the same shuffled dataset or may come from two different shuffled datasets. Under each of these two conditions, we decompose the probability of minimal overlaps to complete the derivation. By applying the Law of Total Probability and the General Multiplication Rule in probability theory, when c = 2k − n, i.e., the minimal number of overlaps, we have
$$\begin{aligned}
P(X = 2k - n) &= P(|S_i \cap S_j| = 2k - n \mid S_i, S_j \text{ from the same shuffle})\, P(S_i, S_j \text{ from the same shuffle})\\
&\quad + P(|S_i \cap S_j| = 2k - n \mid S_i, S_j \text{ from different shuffles})\, P(S_i, S_j \text{ from different shuffles})\\
&= 1 \times \frac{B/2}{\binom{B}{2}} + \frac{\binom{B/2}{2}\binom{2}{1}\binom{2}{1}}{\binom{B}{2}} \times \frac{\binom{n}{k}\binom{k}{2k-n}\binom{n-k}{k-(2k-n)}}{\binom{n}{k}^{2}}\\
&= \frac{B/2}{\binom{B}{2}} + \frac{4\binom{B/2}{2}\binom{k}{2k-n}}{\binom{B}{2}\binom{n}{k}}.
\end{aligned}$$
Similarly, when c = 2k − n + 1, …, k,
$$\begin{aligned}
P(X = c) &= P(|S_i \cap S_j| = c \mid S_i, S_j \text{ from the same shuffle})\, P(S_i, S_j \text{ from the same shuffle})\\
&\quad + P(|S_i \cap S_j| = c \mid S_i, S_j \text{ from different shuffles})\, P(S_i, S_j \text{ from different shuffles})\\
&= 0 + \frac{\binom{B/2}{2}\binom{2}{1}\binom{2}{1}}{\binom{B}{2}} \times \frac{\binom{n}{k}\binom{k}{c}\binom{n-k}{k-c}}{\binom{n}{k}^{2}}\\
&= \frac{4\binom{B/2}{2}\binom{k}{c}\binom{n-k}{k-c}}{\binom{B}{2}\binom{n}{k}}.
\end{aligned}$$
   □
Figure 4 and Figure 5 illustrate the probability mass functions (PMFs) under the various subsampling schemes, with parameters set to n = 100, B = 20, and p = 0.2 for partition subsampling and p = 0.6 for shift subsampling. The two PMFs are shown as overlapping bar charts, with the shorter bars positioned in front. In Figure 4, the first black bar for partition subsampling, which corresponds to the probability of generating a non-overlapping pair of subsamples (X = 0), is much taller than that under random subsampling (see the first gray bar). In Figure 5, the first black bar for shift subsampling, which displays the probability of generating a pair of subsamples with 2k − n overlaps (i.e., the minimal overlap), is, once again, much taller than that under random subsampling (in this case, the gray bar has a height that is almost equal to zero). These plots clearly demonstrate how the proposed subsampling designs significantly boost the likelihood of achieving the minimal number of overlaps between subsamples compared to simple random subsampling. As indicated by Equations (1)–(3), the exact reduction in between-subsample overlaps depends on n, B, and p. Nevertheless, the clear benefit of incorporating partition and shift subsampling to enhance training sample diversity is evident.
There are two possible ways to quantify and compare, from a probabilistic perspective, the diversity resulting from various subsampling schemes: 1. comparing the probability that each method achieves the minimal number of overlaps between a pair of randomly generated subsamples, and 2. comparing the expected number of overlaps between a pair of subsamples under a given subsampling scheme. Specifically, Table 1 summarizes the comparison for the probability of attaining maximum diversity (i.e., minimum overlaps) among the three subsampling methods.
Furthermore, the expected number of overlaps between a pair of subsamples generated from a specific subsampling scheme can be expressed as
$$E(X) = \sum_{c} c\, P(X = c),$$
where the formula for P ( X = c ) depends on the subsampling method. Table 2 compares the expected number of overlaps between a pair of random subsamples under different subsampling schemes when n = 50 , 100 , or 500 , and B = 10 , 20 , 50 , or 100. Partition and shift subsampling consistently produce fewer expected overlaps. However, this advantage diminishes as either the sample size n or the ensemble size B increases.
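For the random-subsampling benchmark, E(X) reduces to the hypergeometric mean k²/n, which matches the Random Subsampling rows in Table 2; under the proposed schemes, the within-partition (or within-shuffle) pairs pull the expectation down. A small R helper, sketched below under the divisibility assumptions of Section 2.2 (the function name expected_overlap is ours), reproduces values of the kind reported in Table 2.

```r
# Expected number of overlaps between a random pair of subsamples
# (assumes n*p and n/k are integers, as in Section 2.2)
expected_overlap <- function(n, p, B) {
  k <- n * p
  if (k <= n / 2) {                                           # partition subsampling
    between <- choose(B * k / n, 2) * (n / k)^2 / choose(B, 2) # P(between-partition pair)
    between * k^2 / n                                          # within-partition pairs contribute zero overlap
  } else {                                                     # shift subsampling
    same <- (B / 2) / choose(B, 2)                             # P(the two halves of one shuffle)
    same * (2 * k - n) + (1 - same) * k^2 / n                  # independent pairs overlap k^2/n on average
  }
}
expected_overlap(50, 0.2, 10)    # ~1.1, cf. Table 2
expected_overlap(50, 0.6, 10)    # ~17.1
(0.2 * 50)^2 / 50                # random-subsampling benchmark: 2.0
```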
In the following section, we conduct comprehensive simulation studies to numerically assess the performance of the proposed subsampling schemes against the benchmark in both classification and regression contexts.
Remark 3.
Because both of our proposed subsampling methods begin with a random shuffle of the entire learning sample, they are best suited for data where the observations are independently and identically distributed. We acknowledge that this approach may inadvertently distort inherent structures within datasets that have complex stratification or strong temporal dependencies. In such cases, specialized algorithms designed to preserve these data features, such as those that account for stratified or time-series data, may be more appropriate.

3. Results

3.1. Simulation Studies

In this section, we evaluate the performance of the proposed subsampling schemes through simulation studies in regression and classification scenarios. For both designs, we consider subsample sizes from 0.10n to 0.95n, incremented by 0.05n. We use the proposed partition subsampling scheme to generate individual training samples for the ensemble when k ∈ {0.10n, …, 0.50n}, and the developed shift subsampling scheme for k ∈ {0.55n, …, 0.95n}. For comparison purposes, we also implement the conventional random subsampling approach as a benchmark for subsampling-based ensemble methods. We fit individual trees within an ensemble using the randomForest package in R [29]. Each tree was grown to its maximum possible depth until a stopping criterion, such as a minimal mean squared error for regression, was met. The same tree-fitting algorithm was used in both Section 3.1 and Section 3.2.
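As a concrete illustration of this setup, the following hedged R sketch shows how one subbagging fit might be assembled for a regression task: one tree per subsample via randomForest with ntree = 1 and mtry set to all predictors (no feature subsampling), with predictions averaged across the ensemble. It reuses the partition_subsamples()/shift_subsamples() helpers sketched in Section 2.2, assumes the response column is named y, and the specific settings (e.g., nodesize) are illustrative rather than the authors' exact configuration.

```r
# A sketch of a subbagging regressor built from the proposed subsampling schemes
library(randomForest)

fit_subbagging <- function(train, test, p, B = 500) {
  n <- nrow(train)
  k <- floor(p * n)
  idx_list <- if (k <= n / 2) partition_subsamples(n, k, B) else shift_subsamples(n, k, B)
  preds <- sapply(idx_list, function(idx) {
    tree <- randomForest(y ~ ., data = train[idx, ],
                         ntree = 1,                 # a single tree per subsample
                         mtry = ncol(train) - 1,    # use all predictors at each split
                         nodesize = 5)              # small terminal nodes, i.e., deep trees
    predict(tree, test)
  })
  rowMeans(preds)                                   # average predictions across the ensemble
}
```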

3.1.1. Classification Simulation Study

For the classification simulation study, we generated R = 500 independent samples of size n, with n ∈ {100, 500}. Within each simulated dataset, six x-variables, each of length n, were independently drawn from a standard Uniform distribution. Subsequently, n random errors were simulated from a Logistic distribution with a location parameter of 0 and a scale parameter of 5. The “continuous” response, Y^c, was then determined based on the following true relationship:
$$Y^{c}_{i} = 1 + 5X_{1,i} + 4X_{2,i} + 3X_{3,i} + 2X_{4,i} + 0.1X_{5,i} + 0X_{6,i} + \epsilon_i \quad (1 \le i \le n).$$
The binary outcome Y was obtained by dichotomizing the continuous response Y^c based on a varying threshold. This threshold was chosen to yield a target ratio of 1’s to 0’s of 50-50, 40-60, 30-70, or 20-80 in the final dataset. Specifically, the threshold was set to be the qth percentile (q = 50, 60, 70, 80) of the Logistic distribution with a location parameter of 8.05 and a scale parameter of 2, which represents the expected distribution of Y^c conditional on the x-variables. A value of Y^c greater than the threshold resulted in Y = 1; otherwise, Y = 0. We deliberately included X6 as an irrelevant predictor to introduce noise and elevate the classification task’s difficulty.
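A minimal R sketch of this data-generating process is shown below; the seed and sample size are illustrative, and the 50-50 case uses q = 50 (replace with 60, 70, or 80 for the other class ratios).

```r
# Simulating one classification dataset from the relationship above
set.seed(2025)
n <- 100
X <- matrix(runif(n * 6), ncol = 6)                     # six Uniform(0, 1) predictors
eps <- rlogis(n, location = 0, scale = 5)               # Logistic(0, 5) errors
y_cont <- 1 + 5*X[, 1] + 4*X[, 2] + 3*X[, 3] + 2*X[, 4] + 0.1*X[, 5] + 0*X[, 6] + eps
threshold <- qlogis(0.50, location = 8.05, scale = 2)   # q-th percentile of Logistic(8.05, 2)
y <- as.integer(y_cont > threshold)                     # dichotomize into the binary outcome
table(y)                                                # roughly a 50-50 split
```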
For each simulated dataset, we applied the subbagging estimator, varying the number of trees within each ensemble to 50, 100, and 500. As previously mentioned, we also explored a wide range of subsampling rates, from p = 0.10 to p = 0.95, with an increment of 0.05. This resulted in subsample sizes ranging from k = 0.10n to k = 0.95n. We fit the benchmark subbagging estimator, which uses the default random subsampling scheme, alongside those constructed using our developed partition or shift subsampling strategies. Their performance was compared using a 10-fold cross-validated accuracy measure. The average cross-validated accuracy scores across the 500 iterations are summarized in Table 3, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9 and Table 10. It is worth mentioning that, in Table 6, the values for the smallest subsampling proportion (p = 0.10) are marked as NA. This occurred because, at such a low subsampling rate and given the imbalanced nature of the simulated dataset, some training sets generated through 10-fold cross validation are likely to contain only 0s, making it impossible to fit the model. (For example, when n = 100 and p = 0.1, 10-fold cross validation results in ensemble training sets with only nine observations each. Given a 20-80 positive–negative ratio, it is highly likely that some of these small training samples will contain no positive cases (1’s) due to random chance.)
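A quick back-of-the-envelope calculation illustrates why these NAs arise; the counts below assume roughly 18 positives in each delete-one-fold learning sample of 90 observations and treat subsamples as independent, both of which are simplifications.

```r
# Chance that a size-9 subsample drawn from ~18 positives and ~72 negatives has no positives
dhyper(0, m = 18, n = 72, k = 9)            # about 0.12 per subsample
# With hundreds of subsamples per ensemble, an all-negative training set is almost certain
1 - (1 - dhyper(0, 18, 72, 9))^500          # essentially 1
```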
The simulation results further confirmed the role of diversity on the ensemble predictive performance: across all three subsampling methods, as the subsampling rate increases—reducing diversity among individual training samples—the accuracy consistently deteriorates. This pattern holds true for both sample sizes. On the other hand, as the sample size n increases from 100 to 500, or the number of trees per ensemble becomes larger, there is a slight improvement in accuracy scores across various subsampling rates, given the same positive–negative ratio.
The proposed subsampling schemes demonstrate superior predictive performance as the subsampling rate increases. Notably, the performance gains from our methods become apparent at relatively small subsampling rates, particularly with larger sample sizes. For example, when n = 100 with a 50-50 positive–negative ratio, the partition subsampling scheme starts to outperform the benchmark at approximately p = 0.30. This threshold drops to around p = 0.15 for n = 500. Overall, the proposed methods ultimately achieve a higher accuracy than the benchmark beyond a certain threshold. An additional notable observation is that shift subsampling yields a more significant improvement in accuracy. For example, in Table 3, shift subsampling with k = 0.95n boosts the benchmark’s accuracy by up to 6% (see Table 3 when k = 0.95n and #trees = 100: the improvement in accuracy is (0.625 − 0.591)/0.591 ≈ 6%), whereas partition subsampling (e.g., at k = 0.1n to 0.50n) only improves the accuracy by 1–2%.

3.1.2. Regression Simulation Study

For the regression simulation study, we used a Multivariate Adaptive Regression Spline (MARS) model, also known as Friedman #1 [30]. This model allows the generation of datasets that exhibit an underlying non-linear relationship between the response and five predictor variables. It is a common benchmark for evaluating ensemble methods and has been considered in several previous studies, including Bühlmann and Yu [10], Mentch and Hooker [11], Wang and Wei [13]. For each sample of size n (either 100 or 500), we independently simulated five x-variables from a standard Uniform distribution. Random errors were then generated from a normal distribution with a mean of 0 and a standard deviation of 1 , 5 , or 10 . The response was then determined based on the following relationship:
$$Y_i = 10\sin(\pi X_{1,i} X_{2,i}) + 20(X_{3,i} - 0.05)^2 + 10X_{4,i} + 5X_{5,i} + \epsilon_i \quad (1 \le i \le n).$$
In total, we considered R = 500 independent samples.
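For reference, one such sample can be simulated in R as follows; this is a sketch in which the seed, n, and error standard deviation are illustrative, and the mean function follows the relationship exactly as written above.

```r
# Simulating one regression dataset from the Friedman #1 relationship above
set.seed(2025)
n <- 100; sigma <- 5
X <- matrix(runif(n * 5), ncol = 5)                     # five Uniform(0, 1) predictors
y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.05)^2 +
     10 * X[, 4] + 5 * X[, 5] + rnorm(n, sd = sigma)
dat <- data.frame(y = y, X)                             # columns: y, X1, ..., X5
```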
Similar to the classification study, we applied the subbagging estimator with varying numbers of trees—specifically, 50, 100, and 500 per ensemble. We also explored a wide range of subsampling rates, from p = 0.10 to p = 0.95 , increasing by 0.05. We fit both the benchmark subbagging estimator (using the default random subsampling scheme) and those developed with our partition or shift subsampling strategies. To compare their performance, we computed the 10-fold cross-validated mean squared error (MSE). The average cross-validated MSEs from the 500 iterations are summarized in Table 11, Table 12, Table 13, Table 14, Table 15 and Table 16.
The simulation results for the regression show a slightly different pattern than those observed in the classification scenario. For random subsampling, the MSE initially decreases with the subsampling rate but then increases. In contrast, under partition and shift subsampling, the MSE decreases monotonically with a larger subsampling rate. This ultimately leads to a much reduced MSE under shift subsampling at k = 0.95n, with an up to 37% reduction in MSE compared to the benchmark (see Table 14 for the specific case where k = 0.95n and the number of trees is 500: (4.830 − 7.682)/7.682 ≈ −37%, i.e., a 37% reduction in MSE). In addition, as expected, we found that increasing the number of trees or decreasing the error standard deviation slightly reduces the MSE. This pattern holds true for both sample sizes. Furthermore, a larger sample size consistently leads to better overall performance, resulting in a lower MSE. Finally, the threshold for observing superior performance from our proposed methods is lower when the error standard deviation increases or when the number of trees increases.

3.2. Real Data Examples

In this section, we present eight real data examples to demonstrate the practical applications of the proposed subsampling schemes compared to the benchmark random subsampling in both the regression and classification scenarios.

3.2.1. Classification Datasets

We first evaluated our proposed methods and the benchmark using five diverse classification datasets, each presenting unique characteristics in terms of the sample size, number of features, class distribution, and application domains. In addition to the Wine dataset [22] discussed in Section 2.1, we also considered several other classification datasets. More specifically, the Iris dataset [31], one of the earliest known datasets used to evaluate classification methods, includes 150 observations. It uses four continuous variables (petal and sepal length and width) to predict one of three balanced iris plant species. In addition, we utilized the Cleveland database for heart disease diagnosis [32]. This dataset has 300 observations and contains information on thirteen demographic and health-related variables to predict one of five heart disease severity levels (0–4, where 0 indicates no disease). We also analyzed the Pima Indians Diabetes dataset [33], comprising 760 observations with eight health-related variables to predict diabetes presence (i.e., binary outcome). Lastly, the Statlog (German Credit Data) dataset [34] contains 1000 observations and classifies individuals as good or bad credit risks based on 20 financial and demographic features. All the aforementioned datasets are available in the UCI Machine Learning Repository (https://archive.ics.uci.edu/).
Table 17 summarizes the characteristics of these datasets. For ease of implementation of 10-fold cross validation, we round the number of observations down to a multiple of 10 by randomly removing fewer than 10 observations from each dataset. All reported sample sizes reflect these adjusted counts.
We applied subbagging estimators by varying the number of trees per ensemble from 20 to 1000. For each of the five datasets, we generated subsamples for individual classification trees using random subsampling (the benchmark) and the proposed partition and shift subsampling schemes. Moreover, we set the subsampling proportion p from 0.1 to 0.95, incrementally increasing by 0.05. This resulted in subsample sizes k ranging from 0.1n to 0.95n. Consistent with our approach in Section 2.1, we assessed the performance of each method using 10-fold cross-validated accuracy scores for each dataset and setting. To mitigate the impact of randomness inherent in ensemble learning, we refit the subbagging estimator 100 times, reporting the average cross-validated accuracy scores. The results are summarized in Table 18, Table 19, Table 20, Table 21 and Table 22.
Across all five classification datasets, the accuracy generally declines with the subsampling rate past a certain inflection point. This trend is most noticeable with random subsampling, which leads to a significant drop in accuracy at k = 0.95 n . In contrast, partition and shift subsampling enable higher subsampling rates to improve the accuracy. For example, shift subsampling on the Wine dataset achieves an accuracy of nearly or above 96%, representing an improvement of up to 5% over the benchmark. Further, the improvement of shift subsampling reaches 10% on the German dataset. Consistent with our findings in Section 3.1, the accuracy also improves with a larger ensemble size and a more balanced class ratio.

3.2.2. Regression Datasets

To further demonstrate the performance of the developed methods in a different setting, we considered three real-world regression datasets. First, we analyzed the Housing dataset [35], which predicts house price per unit area for properties in Xindian District, New Taipei City, Taiwan, using six features. Next, we utilized the Energy Efficiency dataset [36], which assesses building energy efficiency based on eight building parameters. Finally, we incorporated the Forest Fires dataset [37]. This dataset attempts to predict the burned area of forest fires in the northeast region of Portugal using twelve meteorological features. Following the documentation and the prior literature [38], a natural logarithm transformation was applied to the highly skewed response variable. One thing worthy of attention is that the Forest Fires dataset is particularly challenging to model. Previous attempts to analyze it using machine learning techniques have not been very successful [38].
Table 23 presents the summary characteristics of these datasets. Similar to our previous approach, we rounded the number of observations down to a multiple of 10 by randomly removing fewer than 10 observations to simplify the 10-fold cross validation. All the sample sizes reported in Table 23 are after these adjustments.
Using the three aforementioned datasets, we examined the performance of different subbagging estimator configurations, evaluating each based on the cross-validated mean squared error (MSE). Consistent with the settings in Section 3.2.1, we varied the subsampling proportion p from 0.10 to 0.95 and the number of trees per ensemble from 20 to 1000. All MSE scores were computed using 10-fold cross validation. The complete results are summarized in Table 24, Table 25 and Table 26.
Under random subsampling, a different inflection point of MSE is observed for each dataset. Specifically, in the Housing dataset, the MSE initially decreases as the subsampling proportion (p) increases from 0.10 to 0.40 and then begins to rise for p values between 0.40 and 0.95. Conversely, the Forest Fires dataset consistently shows an increase in the MSE with larger subsampling proportions. For the Energy dataset, however, increased subsampling proportions lead to lower errors. Similarly, distinct trends in the prediction error (MSE) are also observed when utilizing partition or shift subsampling. Specifically, under the proposed subsampling schemes: In the Housing dataset, the MSE decreases as the subsampling proportion (p) increases from approximately 0.10 to 0.60 and then rises from roughly 0.60 to 0.95. For the Energy dataset, increased subsampling proportions generally yield lower errors. On the contrary, in the Forest Fires dataset, higher subsampling proportions are associated with increased errors. Across all three datasets, as the number of trees per ensemble increases, we observe a general trend of decreasing MSE. This reduction is most pronounced when the number of trees grows from 20 to 100; beyond that point, only marginal further decreases in MSE are observed. This holds true across all different subsampling schemes.
Regarding the comparison between the benchmark and our proposed methods, the benefits of our subsampling schemes become more pronounced as the subsampling proportion increases. For both the Housing and Energy datasets, our proposed method begins to show improvement at a subsampling proportion of approximately p = 0.40. In contrast, for the challenging Forest Fires dataset, a reduction in MSE is consistently observed across all tested values of p, which demonstrates the advantage of our proposed methods for more challenging datasets. Across all datasets, a marginal positive correlation between the number of trees per ensemble and the magnitude of the MSE improvement is also noted.

3.3. Further Justification

As discussed in the previous sections, to mitigate the effects of randomness, we used 500 iterations for each simulation study. For the real data examples, each subbagging estimator was refitted 100 times. The reported performance scores are therefore averages over a large number of iterations, effectively accounting for random variations.
To further demonstrate the superior performance of our proposed subsampling schemes, in what follows, we use the Wine dataset as a case study to better justify the effectiveness and significance of our developed methods compared to the benchmark random subsampling.
In Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10 below, we display the cross-validated accuracy scores of the different methods. The solid dots represent the average score over 100 refits of the subbagging estimator for a given subsample size and number of trees. The error bars around each dot indicate the standard error. As anticipated, the standard error of the accuracy score decreases as the number of trees increases. These plots show that, on average, our proposed partition and shift subsampling schemes consistently outperform the benchmark random subsampling, with only a few exceptions where their performance scores are quite comparable. This performance gain becomes particularly significant as the subsampling rate and the number of trees increase, even after accounting for sampling variations. As discussed in Section 2.2.2, shift subsampling tends to be particularly advantageous over the other competing methods at relatively large subsampling rates. For example, in Figure 10, when p ≥ 0.5 and with 1000 trees per ensemble, shift subsampling is significantly better than the benchmark, yielding a much higher cross-validated accuracy. Overall, the threshold for such a significant performance gain appears at a lower subsampling rate as the number of trees grows.

4. Discussion

This paper explores how diversity among training samples impacts the predictive performance of subsampling-based ensemble methods. To improve the diversity without compromising the training sample size, we introduce two novel subsampling schemes: partition subsampling and shift subsampling. Our probabilistic analyses further justify the improved diversity the proposed methods offer compared to the benchmark random subsampling. Through extensive simulation studies and real-world data illustrations, we show their superior performance in both regression and classification scenarios. In particular, the benefits of utilizing the developed subsampling strategies become more noticeable on challenging datasets or at larger subsampling rates, and the percentage improvement is larger in regression problems.
For future work, it would be interesting to extend these schemes to other subsampling-based ensemble methods, such as sub-random forest or non-tree-based sub-ensemble estimators. Given their adaptability, we anticipate similar positive trends and conclusions in these broader applications.

Author Contributions

Conceptualization, Q.W.; methodology, M.O. and Q.W.; software, M.O. and Q.W.; validation, M.O. and Q.W.; formal analysis, M.O. and Q.W.; investigation, M.O. and Q.W.; resources, M.O.; data curation, M.O.; writing—original draft preparation, M.O. and Q.W.; writing—review and editing, M.O. and Q.W.; visualization, M.O. and Q.W.; supervision, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All the real datasets presented in this paper are available in the UCI Machine Learning Repository (https://archive.ics.uci.edu/).

Acknowledgments

The publication fees for this article are supported by the Wellesley College Library and Technology Services Open Access Fund.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Galton, F. Regression Towards Mediocrity in Hereditary Stature. J. Anthropol. Inst. Great Br. Irel. 1886, 15, 246–263. [Google Scholar] [CrossRef]
  2. Heckman, N.; Ramsay, J. Penalized Regression with Model-Based Penalties. Can. J. Stat. 2000, 28, 241–258. [Google Scholar] [CrossRef]
  3. Dobson, A.; Barnett, A. An Introduction to Generalized Linear Models, 4th ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018. [Google Scholar]
  4. Gurney, K. An Introduction to Neural Networks; CRC Press: London, UK, 2018. [Google Scholar]
  5. Zhou, Y.; Gallins, P. A Review and Tutorial of Machine Learning Methods for Microbiome Host Trait Prediction. Front. Genet. 2019, 10, 579. [Google Scholar] [CrossRef] [PubMed]
  6. Schoot, R.v.; Depaoli, S.; King, R.; Kramer, B.; Märtens, K.; Tadesse, M.G.; Vannucci, M.; Gelman, A.; Veen, D.; Willemsen, J.; et al. Bayesian statistics and modelling. Nat. Rev. Methods Prim. 2021, 1, 1. [Google Scholar] [CrossRef]
  7. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; Taylor and Francis: New York, NY, USA, 1984. [Google Scholar]
  8. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  9. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  10. Bühlmann, P.; Yu, B. Analyzing bagging. Ann. Stat. 2002, 30, 927–961. [Google Scholar] [CrossRef]
  11. Mentch, L.; Hooker, G. Quantifying uncertainty in random forest via confidence intervals and hypothesis tests. J. Mach. Learn. Res. 2016, 17, 1–41. [Google Scholar]
  12. Peng, W.; Coleman, T.; Mentch, L. Rates of convergence for random forests via generalized U-statistics. Electron. J. Stat. 2022, 16, 232–292. [Google Scholar] [CrossRef]
  13. Wang, Q.; Wei, Y. Quantifying uncertainty of subsampling-based ensemble methods under a U-statistic framework. J. Stat. Comput. Simul. 2022, 92, 3706–3726. [Google Scholar] [CrossRef]
  14. Kuncheva, L.; Whitaker, C. Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. Mach. Learn. 2003, 51, 181–207. [Google Scholar] [CrossRef]
  15. Brown, G.; Wyatt, J.; Harris, R.; Yao, X. Diversity creation methods: A survey and categorisation. Inf. Fusion 2005, 6, 5–20. [Google Scholar] [CrossRef]
  16. Kuncheva, L. Combining Pattern Classifiers: Methods and Algorithms; Wiley: Hoboken, NJ, USA, 2004. [Google Scholar]
  17. Rodríguez, J.J.; Kuncheva, L.I.; Alonso, C.J. Rotation Forest: A new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1619–1630. [Google Scholar] [CrossRef]
  18. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef]
  19. Cunningham, P. Ensembles in Machine Learning. 2022. Available online: https://medium.com/data-science/ensembles-in-machine-learning-9128215629d1 (accessed on 15 August 2025).
  20. Tang, E.K.; Suganthan, P.N.; Yao, X. An analysis of diversity measures. Mach. Learn. 2006, 65, 247–271. [Google Scholar] [CrossRef]
  21. Wood, D.; Mu, T.; Webb, A.; Reeve, H.W.J.; Lujan, M.; Brown, G. A Unified Theory of Diversity in Ensemble Learning. J. Mach. Learn. Res. 2023, 24, 1–49. [Google Scholar]
  22. Aeberhard, S.; Forina, M. Wine; UCI Machine Learning Repository: 1991. Available online: https://archive.ics.uci.edu/dataset/109/wine (accessed on 15 August 2025).
  23. Wang, Q.; Lindsay, B.G. Variance estimation of a general U-statistic with application to cross-validation. Stat. Sin. 2014, 24, 1117–1141. [Google Scholar]
  24. Wang, Q.; Guo, A. An efficient variance estimator of AUC with applications to binary classification. Stat. Med. 2020, 39, 4281–4300. [Google Scholar] [CrossRef]
  25. Wang, Q.; Cai, X. An efficient variance estimator for cross-validation under partition-sampling. Statistics 2021, 55, 660–681. [Google Scholar] [CrossRef]
  26. Wang, Q.; Cai, X. A new perspective on U-statistic variance estimation. Stat 2025, 14, e70070. [Google Scholar] [CrossRef]
  27. Hoeffding, W. A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 1948, 19, 293–325. [Google Scholar] [CrossRef]
  28. Lee, A.J. U-Statistics: Theory and Practice; Marcel Dekker: New York, NY, USA, 1990. [Google Scholar]
  29. Liaw, A.; Wiener, M. Classification and Regression by randomForest. R News 2002, 2, 18–22. [Google Scholar]
  30. Friedman, J. Multivariate Adaptive Regression Splines. Ann. Stat. 1991, 19, 1–67. [Google Scholar] [CrossRef]
  31. Fisher, R.A. Iris; UCI Machine Learning Repository: 1936. Available online: https://archive.ics.uci.edu/dataset/53/iris (accessed on 15 August 2025).
  32. Janosi, A.; Steinbrunn, W.; Pfisterer, M.; Detrano, R. Heart Disease: Cleveland Database; UCI Machine Learning Repository: 1988. Available online: https://archive.ics.uci.edu/dataset/45/heart+disease (accessed on 15 August 2025).
  33. Turney, P. Pima Indians Diabetes Data Set; UCI Machine Learning Repository: 1990. Available online: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database (accessed on 15 August 2025).
  34. Hofmann, H. Statlog (German Credit Data); UCI Machine Learning Repository: 1994. Available online: https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data (accessed on 15 August 2025).
  35. Yeh, I. Real Estate Valuation; UCI Machine Learning Repository: 2018. Available online: https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set (accessed on 15 August 2025).
  36. Tsanas, A.; Xifara, A. Energy Efficiency; UCI Machine Learning Repository: 2012. Available online: https://archive.ics.uci.edu/dataset/242/energy+efficiency (accessed on 15 August 2025).
  37. Cortez, P.; Morais, A. Forest Fires; UCI Machine Learning Repository: 2008. Available online: https://archive.ics.uci.edu/dataset/162/forest+fires (accessed on 15 August 2025).
  38. Cortez, P.; Morais, A. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In Proceedings of the 13th Portuguese Conference on Artificial Intelligence (EPIA 2007), Guimarães, Portugal, 3–7 December 2007; pp. 512–523. [Google Scholar]
Figure 1. Cross-validated accuracy against subsampling proportion based on the Wine dataset (averaged over 100 iterations).
Figure 2. Diagram that displays the partition subsampling scheme that generates B subsamples of size k.
Figure 3. Diagram that displays the shift subsampling scheme that generates B subsamples of size k.
Figure 4. Probability mass functions of the number of overlaps between a pair of subsamples generated from random subsampling and partition subsampling when n = 100, B = 20, and p = 0.2.
Figure 5. Probability mass functions of the number of overlaps between a pair of subsamples generated from random subsampling and shift subsampling when n = 100, B = 20, and p = 0.6.
Figure 6. Cross-validated accuracy with error bars based on the Wine dataset (#trees = 20).
Figure 7. Cross-validated accuracy with error bars based on the Wine dataset (#trees = 50).
Figure 8. Cross-validated accuracy with error bars based on the Wine dataset (#trees = 100).
Figure 9. Cross-validated accuracy with error bars based on the Wine dataset (#trees = 500).
Figure 10. Cross-validated accuracy with error bars based on the Wine dataset (#trees = 1000).
Table 1. Comparison of the probability of maximum diversity under a subsampling scheme.
$p \le 1/2$:
  Random:    $P(X = 0) = \binom{n-k}{k} \big/ \binom{n}{k}$
  Partition: $P(X = 0) = \frac{Bk}{n}\binom{n/k}{2} \big/ \binom{B}{2} + \binom{n-k}{k}\binom{Bk/n}{2}(n/k)^{2} \big/ \big[\binom{B}{2}\binom{n}{k}\big]$
$p > 1/2$:
  Random:    $P(X = 2k - n) = \binom{k}{2k-n} \big/ \binom{n}{k}$
  Shift:     $P(X = 2k - n) = \frac{B/2}{\binom{B}{2}} + \frac{4\binom{B/2}{2}\binom{k}{2k-n}}{\binom{B}{2}\binom{n}{k}}$
Table 2. Expected number of overlaps between a pair of subsamples.
Random Subsampling
n\k     0.1n   0.2n   0.3n   0.4n   0.5n    0.6n    0.7n    0.8n    0.9n
50      0.5    2.0    4.5    8.0    12.5    18.0    24.5    32.0    40.5
100     1.0    4.0    9.0    16.0   25.0    36.0    49.0    64.0    81.0
500     5.0    20.0   45.0   80.0   125.0   180.0   245.0   320.0   405.0
Partition and Shift Subsampling (B = 10)
n\k     0.1n   0.2n   0.3n   0.4n   0.5n    0.6n    0.7n    0.8n    0.9n
50      0.0    1.1    3.3    6.7    11.1    17.1    24.0    31.8    40.4
100     0.0    2.2    6.7    13.3   22.2    34.2    48.0    63.6    80.9
500     0.0    11.1   33.3   66.7   111.1   171.1   240.0   317.8   404.4
Partition and Shift Subsampling (B = 20)
n\k     0.1n   0.2n   0.3n   0.4n   0.5n    0.6n    0.7n    0.8n    0.9n
50      0.3    1.6    3.9    7.4    11.8    17.6    24.3    31.9    40.5
100     0.5    3.2    7.9    14.7   23.7    35.2    48.5    63.8    80.9
500     2.6    15.8   39.5   73.7   118.4   175.8   242.6   318.9   404.7
Partition and Shift Subsampling (B = 50)
n\k     0.1n   0.2n   0.3n   0.4n   0.5n    0.6n    0.7n    0.8n    0.9n
50      0.4    1.8    4.3    7.8    12.2    17.8    24.4    32.0    40.5
100     0.8    3.7    8.6    15.5   24.5    35.7    48.8    63.9    81.0
500     4.1    18.4   42.9   77.6   122.4   178.4   244.1   319.6   404.9
Partition and Shift Subsampling (B = 100)
n\k     0.1n   0.2n   0.3n   0.4n   0.5n    0.6n    0.7n    0.8n    0.9n
50      0.5    1.9    4.4    7.9    12.4    17.9    24.5    32.0    40.5
100     0.9    3.8    8.8    15.8   24.7    35.8    48.9    64.0    81.0
500     4.5    19.2   43.9   78.8   123.7   179.2   244.5   319.8   404.9
Table 3. Cross-validated accuracies in binary classification when n = 100 and a 50-50 positive–negative ratio.
Random Subsampling
#trees\k   0.10n   0.15n   0.20n   0.25n   0.30n   0.35n   0.40n   0.45n   0.50n
50         0.629   0.633   0.634   0.632   0.631   0.630   0.631   0.624   0.628
100        0.638   0.642   0.638   0.636   0.639   0.637   0.631   0.631   0.633
500        0.645   0.636   0.640   0.638   0.636   0.635   0.636   0.633   0.630
#trees\k   0.55n   0.60n   0.65n   0.70n   0.75n   0.80n   0.85n   0.90n   0.95n
50         0.624   0.625   0.623   0.617   0.611   0.609   0.609   0.597   0.588
100        0.629   0.625   0.616   0.616   0.617   0.612   0.608   0.596   0.591
500        0.626   0.625   0.624   0.618   0.613   0.609   0.606   0.601   0.595
Partition and Shift Subsampling
#trees\k   0.10n   0.15n   0.20n   0.25n   0.30n   0.35n   0.40n   0.45n   0.50n
50         0.624   0.630   0.634   0.631   0.633   0.631   0.634   0.629   0.633
100        0.632   0.638   0.634   0.635   0.639   0.640   0.637   0.636   0.636
500        0.641   0.638   0.641   0.640   0.638   0.640   0.644   0.640   0.638
#trees\k   0.55n   0.60n   0.65n   0.70n   0.75n   0.80n   0.85n   0.90n   0.95n
50         0.630   0.633   0.631   0.629   0.626   0.624   0.626   0.621   0.620
100        0.639   0.637   0.630   0.628   0.632   0.629   0.630   0.624   0.625
500        0.637   0.634   0.636   0.632   0.630   0.628   0.628   0.626   0.627
Table 4. Cross-validated accuracies in binary classification when n = 100 and a 40-60 positive–negative ratio.
Random Subsampling
#trees\k   0.10n   0.15n   0.20n   0.25n   0.30n   0.35n   0.40n   0.45n   0.50n
50         0.638   0.645   0.648   0.639   0.645   0.645   0.645   0.644   0.641
100        0.647   0.647   0.652   0.644   0.645   0.648   0.643   0.647   0.636
500        0.645   0.657   0.650   0.651   0.651   0.651   0.644   0.644   0.648
#trees\k   0.55n   0.60n   0.65n   0.70n   0.75n   0.80n   0.85n   0.90n   0.95n
50         0.630   0.635   0.634   0.630   0.622   0.618   0.614   0.610   0.606
100        0.643   0.633   0.632   0.627   0.623   0.621   0.620   0.607   0.597
500        0.641   0.637   0.637   0.631   0.626   0.624   0.610   0.615   0.607
Partition and Shift Subsampling
#trees\k   0.10n   0.15n   0.20n   0.25n   0.30n   0.35n   0.40n   0.45n   0.50n
50         0.636   0.644   0.647   0.641   0.648   0.645   0.650   0.647   0.646
100        0.644   0.645   0.651   0.647   0.651   0.649   0.648   0.654   0.644
500        0.644   0.654   0.650   0.653   0.654   0.654   0.651   0.650   0.655
#trees\k   0.55n   0.60n   0.65n   0.70n   0.75n   0.80n   0.85n   0.90n   0.95n
50         0.641   0.643   0.646   0.643   0.638   0.634   0.637   0.635   0.637
100        0.651   0.643   0.646   0.640   0.638   0.641   0.641   0.633   0.633
500        0.650   0.648   0.651   0.642   0.643   0.645   0.633   0.643   0.638
Table 5. Cross-validated accuracies in binary classification when n = 100 and a 30-70 positive–negative ratio.
Random Subsampling
#trees\k   0.10n   0.15n   0.20n   0.25n   0.30n   0.35n   0.40n   0.45n   0.50n
50         0.679   0.686   0.685   0.682   0.684   0.680   0.676   0.673   0.673
100        0.679   0.688   0.689   0.689   0.685   0.684   0.680   0.676   0.676
500        0.651   0.682   0.692   0.692   0.681   0.687   0.679   0.679   0.674
#trees\k   0.55n   0.60n   0.65n   0.70n   0.75n   0.80n   0.85n   0.90n   0.95n
50         0.665   0.672   0.670   0.664   0.657   0.656   0.648   0.643   0.632
100        0.672   0.669   0.670   0.664   0.665   0.657   0.653   0.647   0.632
500        0.676   0.670   0.668   0.663   0.659   0.656   0.647   0.644   0.632
Partition and Shift Subsampling
#trees\k   0.10n   0.15n   0.20n   0.25n   0.30n   0.35n   0.40n   0.45n   0.50n
50         0.676   0.685   0.685   0.685   0.687   0.683   0.683   0.681   0.681
100        0.677   0.685   0.692   0.691   0.688   0.689   0.688   0.685   0.684
500        0.650   0.681   0.691   0.695   0.688   0.692   0.686   0.689   0.683
#trees\k   0.55n   0.60n   0.65n   0.70n   0.75n   0.80n   0.85n   0.90n   0.95n
50         0.677   0.681   0.681   0.679   0.677   0.675   0.668   0.674   0.667
100        0.682   0.681   0.683   0.682   0.684   0.678   0.677   0.675   0.671
500        0.689   0.683   0.683   0.679   0.680   0.679   0.671   0.675   0.669
Table 6. Cross-validated accuracies in binary classification with n = 100 and a 20-80 positive–negative ratio.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | NA | 0.757 | 0.750 | 0.752 | 0.749 | 0.748 | 0.747 | 0.742 | 0.740
100 | NA | 0.748 | 0.754 | 0.752 | 0.751 | 0.747 | 0.747 | 0.746 | 0.741
500 | NA | 0.715 | 0.734 | 0.748 | 0.751 | 0.745 | 0.749 | 0.742 | 0.738
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.739 | 0.731 | 0.726 | 0.717 | 0.716 | 0.712 | 0.711 | 0.700 | 0.693
100 | 0.735 | 0.733 | 0.733 | 0.721 | 0.717 | 0.718 | 0.705 | 0.700 | 0.700
500 | 0.737 | 0.737 | 0.733 | 0.734 | 0.721 | 0.715 | 0.707 | 0.704 | 0.692

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | NA | 0.757 | 0.757 | 0.757 | 0.754 | 0.755 | 0.753 | 0.751 | 0.750
100 | NA | 0.751 | 0.755 | 0.757 | 0.757 | 0.754 | 0.753 | 0.757 | 0.749
500 | NA | 0.715 | 0.736 | 0.752 | 0.757 | 0.753 | 0.757 | 0.753 | 0.752
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.750 | 0.746 | 0.743 | 0.737 | 0.740 | 0.737 | 0.743 | 0.733 | 0.734
100 | 0.749 | 0.748 | 0.748 | 0.743 | 0.737 | 0.742 | 0.735 | 0.733 | 0.738
500 | 0.750 | 0.748 | 0.750 | 0.752 | 0.744 | 0.741 | 0.736 | 0.737 | 0.736
Table 7. Cross-validated accuracies in binary classification with n = 500 and a 50-50 positive–negative ratio.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.660 | 0.663 | 0.658 | 0.658 | 0.655 | 0.655 | 0.652 | 0.650 | 0.650
100 | 0.669 | 0.666 | 0.665 | 0.663 | 0.660 | 0.657 | 0.656 | 0.654 | 0.650
500 | 0.672 | 0.668 | 0.669 | 0.664 | 0.663 | 0.660 | 0.659 | 0.656 | 0.652
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.644 | 0.646 | 0.643 | 0.639 | 0.636 | 0.633 | 0.627 | 0.622 | 0.613
100 | 0.650 | 0.644 | 0.642 | 0.641 | 0.637 | 0.635 | 0.629 | 0.622 | 0.614
500 | 0.651 | 0.649 | 0.644 | 0.644 | 0.637 | 0.634 | 0.630 | 0.624 | 0.615

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.660 | 0.663 | 0.661 | 0.659 | 0.659 | 0.660 | 0.658 | 0.655 | 0.658
100 | 0.669 | 0.669 | 0.669 | 0.666 | 0.664 | 0.661 | 0.663 | 0.661 | 0.659
500 | 0.673 | 0.671 | 0.673 | 0.668 | 0.668 | 0.666 | 0.666 | 0.664 | 0.661
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.653 | 0.655 | 0.653 | 0.652 | 0.650 | 0.650 | 0.647 | 0.646 | 0.644
100 | 0.660 | 0.655 | 0.655 | 0.654 | 0.653 | 0.654 | 0.651 | 0.649 | 0.648
500 | 0.661 | 0.660 | 0.657 | 0.659 | 0.654 | 0.653 | 0.653 | 0.651 | 0.652
Table 8. Cross-validated accuracies in binary classification with n = 500 and a 40-60 positive–negative ratio.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.670 | 0.668 | 0.670 | 0.664 | 0.663 | 0.662 | 0.664 | 0.657 | 0.660
100 | 0.676 | 0.675 | 0.675 | 0.671 | 0.666 | 0.667 | 0.662 | 0.663 | 0.663
500 | 0.678 | 0.678 | 0.677 | 0.673 | 0.672 | 0.671 | 0.667 | 0.662 | 0.661
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.654 | 0.654 | 0.650 | 0.646 | 0.643 | 0.641 | 0.635 | 0.633 | 0.617
100 | 0.658 | 0.652 | 0.651 | 0.647 | 0.643 | 0.646 | 0.639 | 0.634 | 0.620
500 | 0.659 | 0.660 | 0.653 | 0.650 | 0.645 | 0.638 | 0.637 | 0.634 | 0.621

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.669 | 0.667 | 0.671 | 0.666 | 0.666 | 0.667 | 0.670 | 0.664 | 0.667
100 | 0.676 | 0.676 | 0.675 | 0.672 | 0.670 | 0.672 | 0.671 | 0.669 | 0.669
500 | 0.678 | 0.680 | 0.680 | 0.678 | 0.677 | 0.677 | 0.674 | 0.670 | 0.670
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.663 | 0.665 | 0.663 | 0.660 | 0.659 | 0.658 | 0.654 | 0.656 | 0.652
100 | 0.668 | 0.663 | 0.663 | 0.662 | 0.659 | 0.664 | 0.662 | 0.659 | 0.653
500 | 0.671 | 0.670 | 0.667 | 0.666 | 0.663 | 0.660 | 0.660 | 0.662 | 0.656
Table 9. Cross-validated accuracies in binary classification with n = 500 and a 30-70 positive–negative ratio.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.704 | 0.705 | 0.704 | 0.702 | 0.701 | 0.698 | 0.697 | 0.694 | 0.692
100 | 0.709 | 0.709 | 0.706 | 0.704 | 0.702 | 0.702 | 0.699 | 0.698 | 0.695
500 | 0.713 | 0.709 | 0.710 | 0.709 | 0.706 | 0.703 | 0.701 | 0.700 | 0.695
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.690 | 0.689 | 0.685 | 0.684 | 0.680 | 0.675 | 0.671 | 0.666 | 0.654
100 | 0.694 | 0.690 | 0.685 | 0.685 | 0.683 | 0.677 | 0.673 | 0.665 | 0.656
500 | 0.694 | 0.693 | 0.689 | 0.686 | 0.682 | 0.674 | 0.676 | 0.667 | 0.658

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.703 | 0.705 | 0.705 | 0.704 | 0.706 | 0.703 | 0.702 | 0.701 | 0.700
100 | 0.707 | 0.710 | 0.708 | 0.708 | 0.706 | 0.706 | 0.704 | 0.705 | 0.703
500 | 0.712 | 0.710 | 0.713 | 0.713 | 0.710 | 0.709 | 0.707 | 0.707 | 0.704
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.699 | 0.698 | 0.697 | 0.697 | 0.696 | 0.692 | 0.693 | 0.692 | 0.689
100 | 0.704 | 0.701 | 0.699 | 0.700 | 0.699 | 0.696 | 0.695 | 0.693 | 0.691
500 | 0.704 | 0.704 | 0.701 | 0.700 | 0.699 | 0.693 | 0.698 | 0.695 | 0.694
Table 10. Cross-validated accuracies in binary classification with n = 500 and a 20-80 positive–negative ratio.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.764 | 0.765 | 0.762 | 0.762 | 0.762 | 0.761 | 0.759 | 0.760 | 0.752
100 | 0.766 | 0.768 | 0.765 | 0.767 | 0.766 | 0.762 | 0.760 | 0.759 | 0.755
500 | 0.768 | 0.769 | 0.768 | 0.767 | 0.766 | 0.764 | 0.764 | 0.763 | 0.756
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.751 | 0.750 | 0.747 | 0.744 | 0.739 | 0.738 | 0.732 | 0.727 | 0.715
100 | 0.752 | 0.751 | 0.748 | 0.744 | 0.743 | 0.738 | 0.731 | 0.727 | 0.714
500 | 0.754 | 0.753 | 0.749 | 0.745 | 0.742 | 0.739 | 0.735 | 0.727 | 0.715

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 0.764 | 0.765 | 0.763 | 0.764 | 0.765 | 0.764 | 0.762 | 0.763 | 0.761
100 | 0.766 | 0.769 | 0.766 | 0.768 | 0.767 | 0.765 | 0.763 | 0.763 | 0.762
500 | 0.767 | 0.769 | 0.769 | 0.769 | 0.768 | 0.766 | 0.766 | 0.766 | 0.763
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 0.759 | 0.760 | 0.759 | 0.758 | 0.756 | 0.756 | 0.754 | 0.753 | 0.752
100 | 0.761 | 0.761 | 0.760 | 0.758 | 0.759 | 0.757 | 0.754 | 0.755 | 0.752
500 | 0.762 | 0.763 | 0.761 | 0.759 | 0.759 | 0.758 | 0.757 | 0.754 | 0.754
Table 11. Cross-validated mean squared error (MSE) in regression when n = 100 and error standard deviation = 1.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 21.848 | 16.881 | 15.197 | 13.961 | 12.943 | 12.463 | 12.072 | 11.710 | 11.584
100 | 21.454 | 16.774 | 14.880 | 13.811 | 12.976 | 12.314 | 11.882 | 11.427 | 11.337
500 | 21.166 | 16.445 | 14.907 | 13.517 | 12.726 | 12.213 | 11.724 | 11.531 | 11.171
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 11.371 | 11.374 | 11.306 | 11.281 | 11.497 | 11.779 | 12.319 | 12.973 | 15.020
100 | 11.176 | 11.053 | 11.132 | 11.336 | 11.439 | 11.660 | 12.061 | 13.060 | 15.029
500 | 10.995 | 11.142 | 11.100 | 11.118 | 11.305 | 11.781 | 12.117 | 12.864 | 14.706

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 26.835 | 20.025 | 17.472 | 15.801 | 14.501 | 13.593 | 12.977 | 12.382 | 12.217
100 | 26.530 | 19.779 | 17.060 | 15.613 | 14.205 | 13.462 | 12.776 | 12.273 | 11.940
500 | 26.337 | 19.578 | 17.027 | 15.371 | 14.147 | 13.346 | 12.549 | 12.015 | 11.744
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 11.757 | 11.552 | 11.371 | 11.149 | 11.039 | 10.896 | 10.829 | 10.708 | 10.716
100 | 11.538 | 11.359 | 11.153 | 10.929 | 10.807 | 10.685 | 10.603 | 10.611 | 10.616
500 | 11.412 | 11.292 | 11.065 | 10.778 | 10.621 | 10.593 | 10.449 | 10.395 | 10.353
Table 12. Cross-validated mean squared error (MSE) in regression when n = 100 and error standard deviation = 5.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 25.983 | 20.812 | 19.077 | 17.853 | 17.099 | 16.646 | 15.946 | 15.837 | 15.711
100 | 25.421 | 20.605 | 18.826 | 17.528 | 16.846 | 16.276 | 15.880 | 15.508 | 15.518
500 | 25.094 | 20.325 | 18.515 | 17.246 | 16.300 | 16.130 | 15.710 | 15.315 | 15.375
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 15.453 | 15.603 | 15.519 | 15.801 | 16.217 | 16.518 | 17.251 | 18.232 | 20.501
100 | 15.227 | 15.387 | 15.468 | 15.616 | 16.099 | 16.439 | 16.836 | 17.947 | 20.065
500 | 15.374 | 15.244 | 15.353 | 15.508 | 15.953 | 16.285 | 16.891 | 18.015 | 20.188

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 30.820 | 24.195 | 21.291 | 19.773 | 18.297 | 17.696 | 16.995 | 16.317 | 16.218
100 | 30.645 | 23.596 | 21.180 | 19.520 | 18.080 | 17.340 | 16.773 | 16.264 | 15.942
500 | 30.594 | 23.523 | 20.847 | 19.106 | 17.954 | 17.321 | 16.429 | 15.991 | 15.705
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 15.661 | 15.471 | 15.373 | 15.287 | 15.000 | 15.021 | 15.011 | 15.020 | 14.921
100 | 15.599 | 15.441 | 15.158 | 15.097 | 15.025 | 14.824 | 14.651 | 14.706 | 14.684
500 | 15.324 | 15.193 | 14.873 | 14.922 | 14.783 | 14.858 | 14.650 | 14.697 | 14.678
Table 13. Cross-validated mean squared error (MSE) in regression when n = 100 and error standard deviation = 10.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 30.869 | 25.840 | 23.904 | 22.958 | 21.948 | 21.463 | 21.223 | 20.957 | 20.809
100 | 30.245 | 25.395 | 23.928 | 22.482 | 21.736 | 21.259 | 20.713 | 20.858 | 20.857
500 | 30.002 | 25.292 | 23.275 | 22.268 | 21.579 | 21.034 | 20.882 | 20.653 | 20.559
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 20.828 | 21.187 | 21.357 | 21.517 | 22.041 | 22.369 | 22.898 | 24.689 | 27.948
100 | 20.820 | 20.665 | 21.062 | 21.218 | 21.615 | 22.299 | 22.934 | 24.815 | 27.276
500 | 20.541 | 20.826 | 20.996 | 21.022 | 21.589 | 22.130 | 23.248 | 24.457 | 27.548

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 36.313 | 29.430 | 26.444 | 25.066 | 23.298 | 22.624 | 22.022 | 21.708 | 21.345
100 | 35.636 | 28.738 | 26.095 | 24.535 | 23.151 | 22.362 | 21.731 | 21.120 | 20.787
500 | 35.449 | 28.779 | 25.830 | 24.174 | 23.159 | 22.080 | 21.621 | 20.886 | 20.904
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 20.755 | 20.805 | 20.636 | 20.561 | 20.287 | 20.281 | 20.277 | 20.489 | 20.632
100 | 20.694 | 20.370 | 20.413 | 20.180 | 20.196 | 20.204 | 20.261 | 20.124 | 20.147
500 | 20.559 | 20.393 | 20.235 | 20.155 | 20.017 | 20.058 | 20.217 | 19.975 | 20.114
Table 14. Cross-validated mean squared error (MSE) in regression when n = 500 and error standard deviation = 1.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 9.589 | 7.667 | 7.522 | 6.588 | 5.998 | 6.028 | 5.971 | 5.899 | 5.436
100 | 9.582 | 8.171 | 7.044 | 6.205 | 6.054 | 5.914 | 5.898 | 5.962 | 5.805
500 | 8.752 | 7.766 | 6.532 | 6.371 | 5.765 | 5.668 | 4.993 | 5.645 | 5.515
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 5.595 | 5.939 | 5.439 | 5.601 | 5.636 | 5.805 | 6.021 | 6.278 | 7.157
100 | 5.260 | 5.294 | 5.275 | 6.015 | 5.674 | 5.328 | 5.402 | 6.498 | 7.287
500 | 5.450 | 5.456 | 5.524 | 5.317 | 4.918 | 5.874 | 5.795 | 6.023 | 7.682

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 10.598 | 9.410 | 8.381 | 7.665 | 6.863 | 6.497 | 6.296 | 5.935 | 6.019
100 | 11.076 | 9.142 | 8.173 | 7.496 | 6.659 | 6.450 | 6.278 | 6.029 | 5.962
500 | 10.697 | 8.860 | 7.866 | 7.208 | 6.718 | 6.434 | 6.095 | 5.922 | 5.757
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 5.832 | 5.599 | 5.204 | 5.478 | 5.550 | 5.290 | 5.029 | 5.203 | 5.238
100 | 5.545 | 5.513 | 5.152 | 5.349 | 5.262 | 5.285 | 5.184 | 5.071 | 4.881
500 | 5.455 | 5.497 | 5.343 | 5.025 | 5.380 | 4.879 | 5.081 | 5.151 | 4.830
Table 15. Cross-validated mean squared error (MSE) in regression when n = 500 and error standard deviation = 5.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 12.942 | 11.877 | 10.840 | 10.418 | 10.320 | 9.930 | 10.007 | 9.578 | 9.578
100 | 13.385 | 11.338 | 10.704 | 9.907 | 10.125 | 9.716 | 9.695 | 9.278 | 10.473
500 | 12.413 | 11.400 | 10.324 | 10.758 | 9.939 | 10.054 | 9.572 | 9.630 | 9.206
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 9.613 | 9.544 | 10.246 | 10.023 | 9.593 | 10.021 | 10.268 | 11.293 | 12.632
100 | 9.591 | 9.693 | 9.311 | 9.665 | 9.635 | 9.980 | 10.174 | 11.339 | 11.533
500 | 9.600 | 9.400 | 9.345 | 9.594 | 9.915 | 9.779 | 10.239 | 11.128 | 11.867

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 14.373 | 13.066 | 11.881 | 11.210 | 11.031 | 10.540 | 10.251 | 9.871 | 9.852
100 | 14.602 | 12.354 | 11.762 | 10.778 | 10.872 | 10.177 | 10.135 | 9.678 | 10.484
500 | 13.746 | 12.534 | 11.313 | 11.616 | 10.538 | 10.558 | 10.100 | 9.959 | 9.427
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 9.807 | 9.573 | 9.978 | 9.662 | 9.213 | 9.446 | 9.148 | 9.558 | 9.447
100 | 9.664 | 9.585 | 9.164 | 9.394 | 8.953 | 9.255 | 9.112 | 9.620 | 9.053
500 | 9.609 | 9.392 | 9.206 | 9.272 | 9.294 | 8.984 | 9.228 | 9.168 | 9.192
Table 16. Cross-validated mean squared error (MSE) in regression when n = 500 and error standard deviation = 10.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 17.821 | 16.886 | 15.930 | 15.673 | 15.390 | 15.329 | 14.700 | 14.984 | 15.629
100 | 17.670 | 15.639 | 15.853 | 15.577 | 15.086 | 14.819 | 15.555 | 14.985 | 14.555
500 | 17.688 | 16.154 | 15.313 | 14.767 | 14.713 | 14.253 | 14.483 | 14.804 | 14.855
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 14.760 | 15.842 | 14.573 | 14.872 | 16.519 | 16.417 | 16.940 | 17.591 | 18.661
100 | 15.464 | 14.970 | 15.641 | 15.357 | 15.672 | 14.752 | 15.899 | 16.994 | 19.143
500 | 15.125 | 15.499 | 15.157 | 15.024 | 16.372 | 15.648 | 16.370 | 16.448 | 18.713

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
50 | 19.409 | 17.597 | 16.598 | 16.446 | 16.016 | 15.646 | 14.898 | 15.217 | 15.642
100 | 18.962 | 16.909 | 16.755 | 16.256 | 15.431 | 15.232 | 15.640 | 15.109 | 14.557
500 | 19.126 | 17.266 | 16.096 | 15.521 | 15.287 | 14.651 | 14.798 | 14.965 | 14.754
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
50 | 14.874 | 15.749 | 14.459 | 14.299 | 15.147 | 14.949 | 14.755 | 15.127 | 14.755
100 | 15.161 | 14.623 | 15.018 | 14.615 | 14.521 | 14.054 | 14.287 | 14.570 | 14.786
500 | 15.010 | 14.929 | 14.707 | 14.374 | 15.164 | 14.149 | 14.849 | 13.775 | 14.812
Table 17. Classification dataset quantitative summary.

Dataset | Sample Size | #Features | Label Counts | Label Distribution
Wine | 170 | 13 | 58, 65, 47 | 0.34, 0.38, 0.28
Iris | 150 | 4 | 50, 50, 50 | 0.33, 0.33, 0.33
Cleveland | 300 | 13 | 161, 55, 36, 35, 13 | 0.54, 0.18, 0.12, 0.12, 0.04
Diabetes | 760 | 8 | 494, 266 | 0.65, 0.35
German | 1000 | 20 | 700, 300 | 0.70, 0.30
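To make the evaluation behind Tables 18–22 concrete, the sketch below runs only the random-subsampling benchmark (not the proposed partition or shift schemes) on the Iris data from Table 17, using scikit-learn's BaggingClassifier with bootstrap=False so that each tree is grown on a subsample of distinct observations. The 10-fold setup, the number of trees, and the subsampling rate shown here are illustrative choices and not necessarily the exact settings used in the paper.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Random-subsampling tree ensemble: bootstrap=False draws each training
# subsample without replacement, so max_samples=0.3 means every tree
# sees 30% of the observations (k = 0.3n). The default base learner of
# BaggingClassifier is a decision tree.
ensemble = BaggingClassifier(
    n_estimators=500,   # number of trees in the sub-ensemble
    max_samples=0.3,    # subsampling rate k/n
    bootstrap=False,    # subsample without replacement
    random_state=0,
)

scores = cross_val_score(ensemble, X, y, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```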
Table 18. Cross-validated accuracy scores based on the Wine dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.934 | 0.935 | 0.937 | 0.939 | 0.940 | 0.944 | 0.945 | 0.946 | 0.944
50 | 0.947 | 0.946 | 0.947 | 0.948 | 0.951 | 0.953 | 0.953 | 0.952 | 0.952
100 | 0.953 | 0.948 | 0.951 | 0.955 | 0.955 | 0.956 | 0.958 | 0.955 | 0.953
500 | 0.954 | 0.950 | 0.953 | 0.958 | 0.961 | 0.962 | 0.961 | 0.958 | 0.954
1000 | 0.954 | 0.949 | 0.953 | 0.960 | 0.962 | 0.962 | 0.961 | 0.959 | 0.954
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.943 | 0.944 | 0.941 | 0.939 | 0.936 | 0.933 | 0.931 | 0.920 | 0.903
50 | 0.950 | 0.947 | 0.946 | 0.944 | 0.941 | 0.939 | 0.934 | 0.921 | 0.904
100 | 0.951 | 0.948 | 0.946 | 0.944 | 0.941 | 0.939 | 0.934 | 0.920 | 0.904
500 | 0.951 | 0.949 | 0.946 | 0.944 | 0.941 | 0.940 | 0.931 | 0.917 | 0.905
1000 | 0.951 | 0.950 | 0.946 | 0.944 | 0.942 | 0.940 | 0.931 | 0.914 | 0.904

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.931 | 0.937 | 0.942 | 0.937 | 0.940 | 0.939 | 0.942 | 0.944 | 0.948
50 | 0.953 | 0.954 | 0.948 | 0.950 | 0.950 | 0.952 | 0.952 | 0.953 | 0.956
100 | 0.961 | 0.956 | 0.955 | 0.954 | 0.951 | 0.953 | 0.955 | 0.957 | 0.960
500 | 0.968 | 0.962 | 0.958 | 0.954 | 0.954 | 0.958 | 0.962 | 0.962 | 0.963
1000 | 0.968 | 0.963 | 0.956 | 0.955 | 0.952 | 0.959 | 0.962 | 0.964 | 0.965
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.947 | 0.947 | 0.949 | 0.949 | 0.950 | 0.950 | 0.950 | 0.952 | 0.949
50 | 0.956 | 0.957 | 0.958 | 0.957 | 0.959 | 0.959 | 0.958 | 0.957 | 0.957
100 | 0.961 | 0.960 | 0.962 | 0.962 | 0.961 | 0.960 | 0.960 | 0.960 | 0.957
500 | 0.964 | 0.965 | 0.967 | 0.966 | 0.966 | 0.966 | 0.964 | 0.963 | 0.960
1000 | 0.965 | 0.966 | 0.966 | 0.968 | 0.967 | 0.966 | 0.965 | 0.964 | 0.961
Table 19. Cross-validated accuracy scores based on the Iris dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.950 | 0.950 | 0.948 | 0.947 | 0.950 | 0.948 | 0.949 | 0.950 | 0.950
50 | 0.951 | 0.951 | 0.950 | 0.950 | 0.951 | 0.950 | 0.952 | 0.951 | 0.951
100 | 0.952 | 0.951 | 0.952 | 0.951 | 0.952 | 0.952 | 0.953 | 0.952 | 0.951
500 | 0.952 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.952
1000 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.950 | 0.951 | 0.951 | 0.950 | 0.950 | 0.949 | 0.947 | 0.946 | 0.946
50 | 0.952 | 0.952 | 0.952 | 0.952 | 0.952 | 0.950 | 0.947 | 0.946 | 0.945
100 | 0.952 | 0.952 | 0.952 | 0.953 | 0.953 | 0.951 | 0.947 | 0.946 | 0.946
500 | 0.951 | 0.951 | 0.953 | 0.953 | 0.953 | 0.953 | 0.947 | 0.947 | 0.946
1000 | 0.952 | 0.950 | 0.953 | 0.953 | 0.953 | 0.953 | 0.947 | 0.946 | 0.946

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.945 | 0.948 | 0.950 | 0.948 | 0.948 | 0.947 | 0.948 | 0.950 | 0.948
50 | 0.951 | 0.951 | 0.951 | 0.949 | 0.949 | 0.950 | 0.950 | 0.951 | 0.951
100 | 0.950 | 0.952 | 0.951 | 0.951 | 0.952 | 0.952 | 0.952 | 0.952 | 0.952
500 | 0.952 | 0.952 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953
1000 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.948 | 0.949 | 0.949 | 0.949 | 0.951 | 0.949 | 0.951 | 0.951 | 0.951
50 | 0.952 | 0.951 | 0.952 | 0.951 | 0.952 | 0.952 | 0.951 | 0.952 | 0.952
100 | 0.953 | 0.953 | 0.953 | 0.952 | 0.953 | 0.953 | 0.952 | 0.952 | 0.952
500 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953
1000 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.953 | 0.951
Table 20. Cross-validated accuracy scores based on the Cleveland dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.574 | 0.574 | 0.572 | 0.572 | 0.571 | 0.568 | 0.568 | 0.565 | 0.564
50 | 0.574 | 0.577 | 0.576 | 0.575 | 0.575 | 0.573 | 0.571 | 0.568 | 0.566
100 | 0.576 | 0.579 | 0.578 | 0.578 | 0.576 | 0.573 | 0.571 | 0.566 | 0.564
500 | 0.578 | 0.577 | 0.580 | 0.580 | 0.578 | 0.573 | 0.570 | 0.565 | 0.562
1000 | 0.579 | 0.577 | 0.580 | 0.580 | 0.577 | 0.573 | 0.569 | 0.565 | 0.562
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.560 | 0.557 | 0.556 | 0.551 | 0.548 | 0.542 | 0.537 | 0.527 | 0.515
50 | 0.565 | 0.561 | 0.559 | 0.558 | 0.555 | 0.547 | 0.542 | 0.533 | 0.518
100 | 0.563 | 0.562 | 0.560 | 0.558 | 0.554 | 0.550 | 0.543 | 0.533 | 0.519
500 | 0.561 | 0.558 | 0.557 | 0.557 | 0.557 | 0.553 | 0.544 | 0.532 | 0.522
1000 | 0.560 | 0.557 | 0.557 | 0.556 | 0.556 | 0.555 | 0.545 | 0.531 | 0.522

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.572 | 0.576 | 0.576 | 0.575 | 0.576 | 0.573 | 0.576 | 0.575 | 0.572
50 | 0.580 | 0.583 | 0.582 | 0.580 | 0.580 | 0.581 | 0.578 | 0.575 | 0.575
100 | 0.583 | 0.584 | 0.583 | 0.583 | 0.583 | 0.582 | 0.581 | 0.579 | 0.577
500 | 0.584 | 0.585 | 0.583 | 0.583 | 0.583 | 0.584 | 0.584 | 0.582 | 0.580
1000 | 0.583 | 0.585 | 0.584 | 0.584 | 0.584 | 0.584 | 0.584 | 0.583 | 0.579
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.569 | 0.569 | 0.568 | 0.566 | 0.566 | 0.564 | 0.561 | 0.561 | 0.561
50 | 0.573 | 0.570 | 0.572 | 0.571 | 0.567 | 0.566 | 0.564 | 0.563 | 0.564
100 | 0.575 | 0.574 | 0.570 | 0.570 | 0.567 | 0.565 | 0.565 | 0.563 | 0.561
500 | 0.576 | 0.574 | 0.571 | 0.568 | 0.566 | 0.565 | 0.563 | 0.562 | 0.560
1000 | 0.578 | 0.573 | 0.569 | 0.567 | 0.565 | 0.565 | 0.562 | 0.562 | 0.560
Table 21. Cross-validated accuracy scores based on the Diabetes dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.751 | 0.751 | 0.754 | 0.753 | 0.753 | 0.755 | 0.754 | 0.754 | 0.753
50 | 0.759 | 0.758 | 0.758 | 0.759 | 0.760 | 0.760 | 0.761 | 0.761 | 0.761
100 | 0.761 | 0.761 | 0.760 | 0.761 | 0.762 | 0.763 | 0.764 | 0.764 | 0.762
500 | 0.761 | 0.760 | 0.760 | 0.763 | 0.765 | 0.766 | 0.767 | 0.766 | 0.766
1000 | 0.761 | 0.761 | 0.760 | 0.763 | 0.764 | 0.766 | 0.768 | 0.767 | 0.765
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.751 | 0.751 | 0.753 | 0.749 | 0.750 | 0.748 | 0.748 | 0.746 | 0.739
50 | 0.759 | 0.759 | 0.759 | 0.758 | 0.757 | 0.755 | 0.755 | 0.750 | 0.741
100 | 0.762 | 0.762 | 0.763 | 0.762 | 0.760 | 0.759 | 0.757 | 0.752 | 0.741
500 | 0.766 | 0.767 | 0.768 | 0.765 | 0.763 | 0.761 | 0.759 | 0.753 | 0.742
1000 | 0.767 | 0.768 | 0.769 | 0.766 | 0.764 | 0.762 | 0.759 | 0.753 | 0.742

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.750 | 0.751 | 0.752 | 0.752 | 0.750 | 0.753 | 0.752 | 0.753 | 0.752
50 | 0.760 | 0.759 | 0.759 | 0.759 | 0.758 | 0.758 | 0.760 | 0.760 | 0.759
100 | 0.763 | 0.761 | 0.760 | 0.761 | 0.762 | 0.760 | 0.761 | 0.763 | 0.762
500 | 0.765 | 0.762 | 0.761 | 0.761 | 0.762 | 0.763 | 0.763 | 0.765 | 0.764
1000 | 0.765 | 0.761 | 0.760 | 0.762 | 0.762 | 0.762 | 0.763 | 0.764 | 0.765
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.753 | 0.754 | 0.752 | 0.754 | 0.751 | 0.752 | 0.753 | 0.751 | 0.752
50 | 0.759 | 0.761 | 0.760 | 0.759 | 0.760 | 0.760 | 0.759 | 0.759 | 0.759
100 | 0.763 | 0.764 | 0.763 | 0.764 | 0.764 | 0.763 | 0.764 | 0.763 | 0.763
500 | 0.765 | 0.765 | 0.766 | 0.766 | 0.768 | 0.768 | 0.768 | 0.767 | 0.767
1000 | 0.765 | 0.765 | 0.767 | 0.767 | 0.768 | 0.769 | 0.769 | 0.769 | 0.768
Table 22. Cross-validated accuracy scores based on the German dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.717 | 0.716 | 0.715 | 0.716 | 0.715 | 0.715 | 0.711 | 0.711 | 0.707
50 | 0.730 | 0.730 | 0.728 | 0.727 | 0.725 | 0.722 | 0.720 | 0.717 | 0.715
100 | 0.737 | 0.736 | 0.735 | 0.730 | 0.728 | 0.725 | 0.722 | 0.720 | 0.719
500 | 0.743 | 0.737 | 0.735 | 0.734 | 0.730 | 0.729 | 0.726 | 0.723 | 0.721
1000 | 0.743 | 0.737 | 0.735 | 0.734 | 0.732 | 0.729 | 0.726 | 0.724 | 0.722
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.704 | 0.703 | 0.701 | 0.698 | 0.695 | 0.689 | 0.686 | 0.677 | 0.658
50 | 0.714 | 0.711 | 0.707 | 0.704 | 0.700 | 0.695 | 0.691 | 0.681 | 0.657
100 | 0.715 | 0.713 | 0.710 | 0.707 | 0.702 | 0.699 | 0.692 | 0.684 | 0.655
500 | 0.720 | 0.717 | 0.714 | 0.711 | 0.706 | 0.703 | 0.697 | 0.688 | 0.655
1000 | 0.720 | 0.717 | 0.715 | 0.712 | 0.708 | 0.704 | 0.698 | 0.688 | 0.656

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 0.724 | 0.726 | 0.727 | 0.727 | 0.727 | 0.726 | 0.724 | 0.722 | 0.721
50 | 0.733 | 0.733 | 0.735 | 0.736 | 0.736 | 0.734 | 0.730 | 0.730 | 0.727
100 | 0.738 | 0.736 | 0.738 | 0.737 | 0.738 | 0.736 | 0.733 | 0.732 | 0.729
500 | 0.744 | 0.736 | 0.735 | 0.738 | 0.738 | 0.737 | 0.735 | 0.735 | 0.732
1000 | 0.744 | 0.735 | 0.733 | 0.737 | 0.738 | 0.737 | 0.735 | 0.734 | 0.733
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.720 | 0.718 | 0.717 | 0.716 | 0.715 | 0.713 | 0.711 | 0.711 | 0.709
50 | 0.727 | 0.724 | 0.724 | 0.723 | 0.721 | 0.718 | 0.716 | 0.718 | 0.715
100 | 0.729 | 0.727 | 0.724 | 0.723 | 0.721 | 0.720 | 0.719 | 0.718 | 0.716
500 | 0.731 | 0.729 | 0.724 | 0.722 | 0.721 | 0.720 | 0.721 | 0.721 | 0.720
1000 | 0.732 | 0.729 | 0.724 | 0.721 | 0.721 | 0.722 | 0.720 | 0.723 | 0.721
Table 23. Regression dataset quantitative summary.

Dataset | Sample Size | Number of Features | Missing Values?
Housing | 410 | 6 | no
Energy | 760 | 8 | no
Forest fires | 510 | 12 | no
Table 24. Cross-validated MSEs based on the Housing dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 64.451 | 61.747 | 60.106 | 58.868 | 58.251 | 57.800 | 57.795 | 57.965 | 58.182
50 | 62.261 | 59.850 | 58.379 | 57.461 | 56.724 | 56.422 | 56.495 | 56.583 | 56.664
100 | 61.594 | 59.323 | 57.780 | 56.882 | 56.214 | 55.840 | 55.872 | 56.125 | 56.505
500 | 61.017 | 58.729 | 57.280 | 56.338 | 55.785 | 55.508 | 55.491 | 55.693 | 56.048
1000 | 60.948 | 58.670 | 57.161 | 56.321 | 55.735 | 55.450 | 55.428 | 55.630 | 56.030
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 58.629 | 59.492 | 60.969 | 62.438 | 65.022 | 66.863 | 69.860 | 73.526 | 78.874
50 | 57.316 | 58.389 | 59.688 | 61.636 | 63.378 | 65.921 | 68.987 | 72.986 | 78.365
100 | 57.075 | 58.190 | 59.481 | 61.073 | 63.421 | 65.714 | 68.670 | 72.721 | 78.387
500 | 56.778 | 57.762 | 59.212 | 60.793 | 62.968 | 65.302 | 68.569 | 72.465 | 78.059
1000 | 56.696 | 57.735 | 59.184 | 60.834 | 62.907 | 65.249 | 68.549 | 72.501 | 78.092

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 67.151 | 64.379 | 62.169 | 61.107 | 59.718 | 59.252 | 58.622 | 58.385 | 58.076
50 | 64.793 | 61.910 | 60.431 | 58.964 | 58.215 | 57.628 | 56.984 | 56.486 | 56.482
100 | 64.054 | 61.286 | 59.705 | 58.504 | 57.843 | 57.001 | 56.499 | 56.172 | 55.841
500 | 63.357 | 60.803 | 59.118 | 58.048 | 57.183 | 56.580 | 56.052 | 55.692 | 55.560
1000 | 63.402 | 60.742 | 59.072 | 57.983 | 57.105 | 56.519 | 56.033 | 55.668 | 55.497
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 57.468 | 57.993 | 57.527 | 57.551 | 58.006 | 58.132 | 58.414 | 58.666 | 59.275
50 | 56.236 | 55.999 | 56.197 | 56.518 | 56.438 | 56.746 | 57.128 | 57.367 | 57.831
100 | 55.851 | 55.737 | 55.769 | 55.914 | 56.123 | 56.312 | 56.481 | 57.019 | 57.375
500 | 55.361 | 55.358 | 55.411 | 55.495 | 55.719 | 55.909 | 56.216 | 56.621 | 57.078
1000 | 55.357 | 55.307 | 55.357 | 55.474 | 55.605 | 55.869 | 56.232 | 56.542 | 57.054
Table 25. Cross-validated MSEs based on the Energy dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 1.783 | 0.886 | 0.551 | 0.444 | 0.410 | 0.394 | 0.380 | 0.369 | 0.356
50 | 1.678 | 0.828 | 0.520 | 0.425 | 0.398 | 0.386 | 0.374 | 0.363 | 0.351
100 | 1.631 | 0.815 | 0.509 | 0.418 | 0.392 | 0.381 | 0.370 | 0.360 | 0.349
500 | 1.604 | 0.793 | 0.499 | 0.413 | 0.389 | 0.379 | 0.369 | 0.358 | 0.348
1000 | 1.600 | 0.790 | 0.498 | 0.412 | 0.389 | 0.379 | 0.369 | 0.358 | 0.347
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.344 | 0.336 | 0.325 | 0.315 | 0.302 | 0.291 | 0.279 | 0.270 | 0.267
50 | 0.339 | 0.330 | 0.320 | 0.311 | 0.300 | 0.288 | 0.277 | 0.268 | 0.265
100 | 0.339 | 0.330 | 0.319 | 0.309 | 0.298 | 0.287 | 0.275 | 0.267 | 0.265
500 | 0.338 | 0.327 | 0.318 | 0.308 | 0.298 | 0.286 | 0.275 | 0.267 | 0.264
1000 | 0.337 | 0.327 | 0.317 | 0.308 | 0.297 | 0.286 | 0.275 | 0.267 | 0.264

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 3.024 | 1.603 | 0.910 | 0.696 | 0.513 | 0.433 | 0.441 | 0.408 | 0.350
50 | 2.706 | 1.437 | 0.872 | 0.603 | 0.470 | 0.428 | 0.382 | 0.360 | 0.360
100 | 2.496 | 1.358 | 0.862 | 0.586 | 0.460 | 0.406 | 0.389 | 0.355 | 0.339
500 | 2.501 | 1.358 | 0.824 | 0.579 | 0.453 | 0.400 | 0.368 | 0.353 | 0.336
1000 | 2.452 | 1.358 | 0.835 | 0.577 | 0.454 | 0.400 | 0.371 | 0.350 | 0.336
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 0.334 | 0.329 | 0.315 | 0.317 | 0.300 | 0.282 | 0.261 | 0.256 | 0.246
50 | 0.330 | 0.332 | 0.316 | 0.309 | 0.283 | 0.276 | 0.266 | 0.265 | 0.249
100 | 0.323 | 0.314 | 0.297 | 0.299 | 0.282 | 0.273 | 0.271 | 0.254 | 0.254
500 | 0.324 | 0.313 | 0.300 | 0.293 | 0.279 | 0.267 | 0.261 | 0.254 | 0.246
1000 | 0.325 | 0.311 | 0.301 | 0.287 | 0.279 | 0.268 | 0.260 | 0.252 | 0.246
Table 26. Cross-validated MSEs based on the Forest Fires dataset.

Random Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 2.106 | 2.115 | 2.137 | 2.182 | 2.171 | 2.171 | 2.208 | 2.150 | 2.206
50 | 2.063 | 2.070 | 2.087 | 2.127 | 2.119 | 2.129 | 2.139 | 2.131 | 2.144
100 | 2.061 | 2.085 | 2.076 | 2.077 | 2.098 | 2.117 | 2.101 | 2.106 | 2.140
500 | 2.043 | 2.057 | 2.069 | 2.075 | 2.086 | 2.096 | 2.102 | 2.111 | 2.123
1000 | 2.036 | 2.055 | 2.065 | 2.078 | 2.084 | 2.092 | 2.102 | 2.110 | 2.115
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 2.206 | 2.230 | 2.268 | 2.276 | 2.249 | 2.292 | 2.325 | 2.346 | 2.404
50 | 2.163 | 2.157 | 2.186 | 2.198 | 2.211 | 2.231 | 2.250 | 2.298 | 2.369
100 | 2.143 | 2.149 | 2.161 | 2.190 | 2.201 | 2.220 | 2.232 | 2.270 | 2.345
500 | 2.131 | 2.149 | 2.163 | 2.169 | 2.190 | 2.211 | 2.237 | 2.269 | 2.345
1000 | 2.134 | 2.143 | 2.155 | 2.169 | 2.187 | 2.206 | 2.232 | 2.266 | 2.341

Partition and Shift Subsampling
#trees\k | 0.10n | 0.15n | 0.20n | 0.25n | 0.30n | 0.35n | 0.40n | 0.45n | 0.50n
20 | 2.112 | 2.106 | 2.145 | 2.175 | 2.130 | 2.178 | 2.158 | 2.173 | 2.194
50 | 2.056 | 2.049 | 2.087 | 2.095 | 2.100 | 2.101 | 2.111 | 2.114 | 2.117
100 | 2.048 | 2.057 | 2.060 | 2.089 | 2.072 | 2.091 | 2.089 | 2.102 | 2.105
500 | 2.031 | 2.043 | 2.050 | 2.055 | 2.065 | 2.072 | 2.085 | 2.078 | 2.091
1000 | 2.027 | 2.042 | 2.047 | 2.059 | 2.058 | 2.076 | 2.083 | 2.083 | 2.087
#trees\k | 0.55n | 0.60n | 0.65n | 0.70n | 0.75n | 0.80n | 0.85n | 0.90n | 0.95n
20 | 2.169 | 2.202 | 2.181 | 2.165 | 2.195 | 2.204 | 2.200 | 2.195 | 2.253
50 | 2.134 | 2.139 | 2.141 | 2.128 | 2.158 | 2.154 | 2.158 | 2.161 | 2.169
100 | 2.116 | 2.127 | 2.132 | 2.117 | 2.120 | 2.134 | 2.157 | 2.157 | 2.162
500 | 2.090 | 2.094 | 2.107 | 2.111 | 2.108 | 2.116 | 2.128 | 2.129 | 2.138
1000 | 2.091 | 2.095 | 2.106 | 2.108 | 2.114 | 2.118 | 2.128 | 2.137 | 2.143