Super RaSE: Super Random Subspace Ensemble Classiﬁcation

Abstract: We propose a new ensemble classification algorithm, named Super Random Subspace Ensemble (Super RaSE), to tackle the sparse classification problem. The proposed algorithm is motivated by the Random Subspace Ensemble (RaSE) algorithm (Tian and Feng 2021b). RaSE was shown to be a flexible framework that can be coupled with any existing base classifier. However, the success of RaSE largely depends on the proper choice of the base classifier, which is unfortunately unknown to us. In this work, we show that Super RaSE avoids the need to choose a base classifier by randomly sampling a collection of classifiers together with the subspaces. As a result, Super RaSE is more flexible and robust than RaSE. In addition to the vanilla Super RaSE, we also develop the iterative Super RaSE, which adaptively updates the base classifier distribution as well as the subspace distribution. We show that the Super RaSE algorithm and its iterative version perform competitively on a wide range of simulated datasets and two real data examples. The new Super RaSE algorithm and its iterative version are implemented in a new version of the R package RaSEn.


Ensemble learning is a popular machine learning framework that combines multiple learning algorithms to obtain better prediction performance and to increase the stability of any single algorithm (Dietterich 2000; Rokach 2010). Popular examples include bagging (Breiman 1996) and random forests (Breiman 2001), which aggregate a collection of weak learners formed by decision trees. More recent ensemble learning methods include the random subspace method (Ho 1998), the super learner (Van der Laan et al. 2007), model averaging (Feng et al. 2021; Raftery et al. 1997), random rotation (Blaser and Fryzlewicz 2016), random projection (Cannings and Samworth 2017; Durrant and Kabán 2015), and random subspace ensemble classification (Tian and Feng 2021a,b).

This paper is largely motivated by the Random Subspace Ensemble (RaSE) classification framework (Tian and Feng 2021b), which we briefly review. Suppose we want to predict the class label y from the feature vector x. For a given base classifier, the RaSE algorithm constructs B_1 weak learners, where each weak learner is formed by applying the specified base classifier on a properly chosen subspace. To choose each subspace, B_2 random subspaces are generated and the optimal one is selected according to certain criteria (e.g., cross-validation error). In the end, the predicted labels from the B_1 weak learners are averaged and compared to a data-driven threshold, forming the final classifier. Tian and Feng (2021b) also proposed an iterative version of RaSE, which updates the random subspace distribution according to the selected proportion of each feature.

Powerful as the RaSE algorithm and its iterative versions are, one major limitation is that a single base classifier needs to be specified before the RaSE framework can be used.

The rest of the paper is organized as follows. In Section 2, we introduce the super random subspace ensemble (SRaSE) classification algorithm as well as its iterative version.

We consider i.i.d. observations distributed as (x, y) ∈ R^p × {0, 1}, where p is the number of predictors and y ∈ {0, 1} is the class label. We use S_Full = {1, ..., p} to represent the whole feature set. We assume the marginal densities of x for class 0 (y = 0) and class 1 (y = 1) exist and denote them by f^(0) and f^(1), respectively. Thus, the joint distribution of (x, y) can be described by the following mixture model:

x | (y = 0) ∼ f^(0),   x | (y = 1) ∼ f^(1),

where y is a Bernoulli variable with success probability π_1 = 1 − π_0 ∈ (0, 1). For any subspace S, we use |S| to denote its cardinality. When restricting to the feature subspace S, the corresponding marginal densities of class 0 and class 1 are denoted by f_S^(0) and f_S^(1), respectively.

Here, we are concerned with a high-dimensional classification problem, where the dimension p is comparable to or even larger than the sample size n. In high-dimensional problems, we usually believe that only a handful of features contribute to the response, which is usually referred to as the sparse classification problem. For sparse classification problems, it is important to accurately separate the signal features from the noise features.
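To fix ideas, the following R sketch (a toy illustration, not part of the RaSEn package) simulates data from such a sparse two-class Gaussian mixture in which only the first three features carry signal; the dimension, sample size, and effect size are arbitrary choices made for illustration:

# Toy sparse two-class Gaussian mixture: only features 1-3 carry signal.
set.seed(1)
n   <- 200                                  # sample size
p   <- 50                                   # number of features
pi1 <- 0.5                                  # success probability P(y = 1)
mu1 <- c(rep(0.8, 3), rep(0, p - 3))        # class-1 mean shift, supported on S* = {1, 2, 3}

y <- rbinom(n, 1, pi1)                      # Bernoulli class labels
x <- matrix(rnorm(n * p), n, p)             # class-0 density f^(0): N(0, I_p)
x[y == 1, ] <- sweep(x[y == 1, , drop = FALSE], 2, mu1, "+")   # shift class-1 rows to f^(1)

table(y); dim(x)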

Following Tian and Feng (2021b), we introduce the definition of a discriminative set.

A feature subset S is called a discriminative set if y is conditionally independent of x_{S^c} given x_S, where S^c = S_Full \ S. We call S a minimal discriminative set if it has minimal cardinality among all discriminative sets, and we denote it by S*.

To train each weak learner (e.g., the j-th one), B_2 independent random subspaces {S_j1, ..., S_jB_2} are generated from the subspace distribution D, and B_2 base classifiers {T_j1, ..., T_jB_2} are sampled with replacement from the base classifier distribution D̃ over the candidate base classifier set T = {T_1, ..., T_M}. By default, we use a uniform distribution for D̃; however, users can supply other distributions if they have prior beliefs about which classifiers may work better. Then, for k = 1, ..., B_2, we train the classifier T_jk using only the features in S_jk. We then choose the optimal subspace and base classifier pair (T_j*, S_j*) using 5-fold cross-validation. We denote the base classifier T_j* applied on subspace S_j* by C_n^{T_j* − S_j*}, where the subscript n emphasizes that the classifier depends on the sample with n observations. Finally, we aggregate the outputs of {C_n^{T_j* − S_j*}}_{j=1}^{B_1} by taking a simple average to form the final decision function. The whole procedure is summarized in Algorithm 1.

Algorithm 1: Super Random Subspace Ensemble classification (SRaSE)

Input: training data {(x_i, y_i)}_{i=1}^n, new data x, subspace distribution D, integers B_1 and B_2, the candidate base classifier set T = {T_1, ..., T_M}, base classifier distribution D̃
Output: predicted label C_n^{RaSE}(x), the selected proportion of each base classifier ζ, and, for each base classifier T_i ∈ T where i ∈ {1, ..., M}, the selected proportion of each feature η_i

for j = 1, ..., B_1 do
1    Independently generate base classifiers {T_j1, ..., T_jB_2} from D̃
2    Independently generate random subspaces {S_j1, ..., S_jB_2} from D
3    For k = 1, ..., B_2, train the classifier T_jk using only the features in S_jk
4    Select the optimal subspace and base classifier pair (T_j*, S_j*) from {(T_jk, S_jk)}_{k=1}^{B_2} using 5-fold cross-validation
5 end
6 Construct the ensemble decision function ν_n(x) = B_1^{-1} Σ_{j=1}^{B_1} C_n^{T_j* − S_j*}(x)
7 Set the threshold α̂ according to (1)
8 Compute the selected proportion of each method ζ = (ζ_1, ..., ζ_M)^T, where ζ_i = B_1^{-1} Σ_{j=1}^{B_1} 1(T_j* = T_i)
9 For each method T_i, i = 1, ..., M, compute the selected proportion of each feature η_i = (η_i1, ..., η_ip)^T, where η_il = B_1^{-1} Σ_{j=1}^{B_1} 1(T_j* = T_i) 1(l ∈ S_j*), l = 1, ..., p
10 Output the predicted label C_n^{RaSE}(x) = 1(ν_n(x) > α̂), the selected proportion of each method ζ = (ζ_1, ..., ζ_M)^T, and the selected proportion of each feature for each method η_i = (η_i1, ..., η_ip)^T

Following Tian and Feng (2021b), by default the subspace distribution D is chosen as a hierarchical uniform distribution over the subspaces. In particular, letting the integer D denote the upper bound of the subspace size, we first generate the subspace size d from the discrete uniform distribution over {1, ..., D}. Then, the subspaces {S_jk, j = 1, ..., B_1, k = 1, ..., B_2} are independent and follow the uniform distribution over the set {S ⊆ S_Full : |S| = d}. In addition, in Step 7 of Algorithm 1, we choose the decision threshold α̂ to minimize the empirical classification error on the training set:

α̂ = argmin_{α ∈ [0, 1]} n^{-1} Σ_{i=1}^{n} [ 1(ν_n(x_i) > α, y_i = 0) + 1(ν_n(x_i) ≤ α, y_i = 1) ].   (1)

In Algorithm 1, there are two important by-products. The first one is the selected proportion of each method, ζ = (ζ_1, ..., ζ_M)^T, out of the B_1 weak learners. The higher the proportion for a method (e.g., kNN) is, the more appropriate it may be for the particular data. In the numerical studies and real data analyses, we will provide more interpretations of these results.
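To make Algorithm 1 concrete before turning to the second by-product, here is a minimal R sketch of the vanilla procedure under simplifying assumptions: only LDA and logistic regression serve as candidate base classifiers, both D and D̃ are uniform, and all function and variable names are ours for illustration rather than the interface of the RaSEn package:

library(MASS)   # for lda()

# Candidate base classifiers: each entry provides a fit/predict pair.
# (Illustrative set; any collection of classifiers could be used.)
cand <- list(
  lda = list(
    fit  = function(x, y) lda(x, grouping = factor(y)),
    pred = function(m, x) as.numeric(as.character(predict(m, x)$class))
  ),
  logistic = list(
    fit  = function(x, y) {
      df <- as.data.frame(x); names(df) <- paste0("V", seq_along(df)); df$y <- y
      glm(y ~ ., data = df, family = binomial)
    },
    pred = function(m, x) {
      df <- as.data.frame(x); names(df) <- paste0("V", seq_along(df))
      as.numeric(predict(m, df, type = "response") > 0.5)
    }
  )
)

# Cross-validated classification error of classifier cl restricted to subspace S.
cv_error <- function(cl, x, y, S, nfold = 5) {
  fold <- sample(rep_len(1:nfold, length(y)))
  err <- 0
  for (f in 1:nfold) {
    tr   <- fold != f
    m    <- cl$fit(x[tr, S, drop = FALSE], y[tr])
    yhat <- cl$pred(m, x[!tr, S, drop = FALSE])
    err  <- err + sum(yhat != y[!tr])
  }
  err / length(y)
}

# Vanilla Super RaSE sketch: B1 weak learners, each chosen from B2 candidate pairs.
super_rase <- function(x, y, B1 = 50, B2 = 20, D = 5) {
  p <- ncol(x)
  learners <- vector("list", B1)
  for (j in 1:B1) {
    best <- NULL
    for (k in 1:B2) {
      S  <- sample(p, sample(D, 1))   # subspace S_jk from the hierarchical uniform D
      cl <- sample(names(cand), 1)    # base classifier T_jk from the uniform D-tilde
      e  <- cv_error(cand[[cl]], x, y, S)
      if (is.null(best) || e < best$err) best <- list(S = S, cl = cl, err = e)
    }
    fit <- cand[[best$cl]]$fit(x[, best$S, drop = FALSE], y)
    learners[[j]] <- list(S = best$S, cl = best$cl, fit = fit)
  }
  # Ensemble decision function nu_n and threshold alpha-hat minimizing training error, as in (1).
  nu <- function(xnew) {
    rowMeans(sapply(learners, function(L)
      cand[[L$cl]]$pred(L$fit, xnew[, L$S, drop = FALSE])))
  }
  nu_tr  <- nu(x)
  alphas <- sort(unique(c(0, nu_tr, 1)))
  alpha  <- alphas[which.min(sapply(alphas, function(a) mean((nu_tr > a) != y)))]
  zeta   <- prop.table(table(sapply(learners, `[[`, "cl")))  # selected proportion of each method
  list(learners = learners, nu = nu, alpha = alpha, zeta = zeta)
}

# Usage with training data (x, y):
#   fit  <- super_rase(x, y)
#   yhat <- as.numeric(fit$nu(x) > fit$alpha)

The returned zeta corresponds to the selected proportion of each method ζ; extracting the per-method feature selection proportions η_i from the stored subspaces in learners follows the same pattern.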

The second by-product is the selected proportion of each feature under each base method, η_i. The feature selection proportion depends on the particular base method; the underlying reason is that when we use different base methods on the same data, different signals may be captured.

The updating scheme for the subspace distribution is as follows. For method i, we again first generate the subspace size d from the uniform distribution over {1, ..., D}, as before. It is easy to observe that each subspace S can be equivalently represented by a binary p-dimensional vector indicating whether each feature is included in S. To be more specific, the equivalent vector representation of a subspace S is J = (J_1, ..., J_p)^T, where J_l = 1(l ∈ S), l = 1, ..., p. Then, we generate J from a restrictive multinomial distribution whose parameter is determined by the feature selection proportions from the previous round.

Algorithm 2: Iterative Super Random Subspace Ensemble classification (Iterative SRaSE)

Input: training data {(x_i, y_i)}_{i=1}^n, new data x, integers B_1 and B_2, the candidate base classifier set T = {T_1, ..., T_M}, initial base classifier distribution D̃^(0), initial subspace distributions for each base classifier {D_i^(0), i = 1, ..., M}, and the number of iterations
Output: predicted label C_n^{RaSE}(x), the selected proportion of each base classifier ζ (from the final iteration), and, for each base classifier T_i ∈ T where i ∈ {1, ..., M}, the selected proportion of each feature η_i (from the final iteration)

for t = 0, 1, ... (up to the specified number of iterations) do
   for j = 1, ..., B_1 do
1     Independently generate base classifiers {T_j1, ..., T_jB_2} from D̃^(t)
2     For each sampled classifier T_jk = T_i, independently generate a random subspace S_jk from D_i^(t)
3     For k = 1, ..., B_2, train the classifier T_jk using only the features in S_jk
4     Select the optimal subspace and base classifier pair (T_j*, S_j*) from {(T_jk, S_jk)}_{k=1}^{B_2} using 5-fold cross-validation
5  end
6  Compute the selected proportion of each base classifier ζ^(t) as in Algorithm 1
7  For each base classifier T_i, compute the selected proportion of each feature η_i^(t) as in Algorithm 1
8  Set D̃^(t+1) to be a discrete distribution over the candidate base classifier set determined by ζ^(t)
9  For each base classifier T_i, set D_i^(t+1) to be a restrictive multinomial distribution with parameter η̄_i^(t), where η̄_il^(t) = η_il^(t) 1(η_il^(t) > C_0 / log p) + (C_0 / log p) 1(η_il^(t) ≤ C_0 / log p), and the subspace size d is sampled from the uniform distribution over {1, ..., D}
10 end
11 Set the threshold α̂ according to (1)
12 Construct the ensemble decision function ν_n(x) = B_1^{-1} Σ_{j=1}^{B_1} C_n^{T_j* − S_j*}(x)
13 Output the predicted label C_n^{RaSE}(x) = 1(ν_n(x) > α̂), the selected proportion of each base classifier ζ = (ζ_1, ..., ζ_M)^T, and the selected proportion of each feature for each base classifier η_i = (η_i1, ..., η_ip)^T

where C_0 is a constant. Note that here the parameter η̄_i characterizes the subspace distribution.
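As a concrete illustration, the following R sketch gives one plausible reading of this update, assuming the restricted parameter η̄_i simply floors each selected proportion at C_0 / log p before weighted sampling without replacement; the exact parameterization of the restrictive multinomial distribution may differ from this sketch:

# Sketch: draw a subspace for base classifier i from the updated distribution.
# eta_i : selected proportion of each feature for classifier i (length p)
# D     : upper bound on the subspace size
# C0    : constant flooring small proportions (0.1 is an illustrative default)
sample_subspace <- function(eta_i, D, C0 = 0.1) {
  p         <- length(eta_i)
  floor_val <- C0 / log(p)
  eta_bar   <- ifelse(eta_i > floor_val, eta_i, floor_val)   # restricted weights
  d <- sample(D, 1)                                          # subspace size from Unif{1, ..., D}
  S <- sample(p, d, prob = eta_bar / sum(eta_bar))           # d distinct features, weighted
  J <- as.integer(seq_len(p) %in% S)                         # binary representation J_l = 1(l in S)
  list(S = sort(S), J = J)
}

# Example: features 2 and 7 were frequently selected in the previous round.
eta_i <- rep(0, 20); eta_i[c(2, 7)] <- c(0.6, 0.3)
sample_subspace(eta_i, D = 5)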

We name the iterative algorithm Iterative Super RaSE, the details of which are summarized in Algorithm 2.

In the Iterative Super RaSE algorithm, the base classifier distribution is initially set to D̃^(0), which by default is a uniform distribution over all base classifiers. As the iterations proceed, the base classifiers that are selected more frequently will have a higher chance of being sampled in the next step, resulting in a different D̃^(t). The adaptive nature of the Iterative Super RaSE algorithm enables us to discover the best-performing base methods for each dataset and, in turn, reduces its classification error.
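For example, one simple illustrative way to realize this adaptivity in R is to make the next-round sampling probabilities proportional to the current selected proportions ζ^(t), with a small floor so that no base classifier is ruled out entirely; the precise update rule used by Iterative Super RaSE may differ:

# Sketch: update the base classifier sampling distribution from the selected
# proportions zeta of the previous round (eps is an illustrative floor).
update_classifier_dist <- function(zeta, eps = 0.02) {
  w <- pmax(zeta, eps)   # keep every base classifier selectable
  w / sum(w)             # sampling probabilities for the next round
}

update_classifier_dist(c(lda = 0.70, logistic = 0.25, knn = 0.05))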

Besides the base classifier distribution, the subspace distribution is also continuously updated during the iteration process. In our implementation of the algorithm, the initial subspace distribution D_i^(0) for each base classifier is set to the default hierarchical uniform distribution described above.

For each simulation setting, we independently generate a test data set of size 1000.

It is easy to verify that the feature subset {1, 2, 5} is the minimal discriminative set S*. In Table 1, we summarize the test classification error rates. In addition to the test classification error, it is useful to investigate the two by-products, namely the selected proportion of each base classifier and the selected proportion of each feature under each classifier. The results show that having a larger sample size helps us to select the model from which the data is generated.

Now, looking at the column for n = 1000, we can see that as the iteration proceeds, the percentage of LDA increases as well, reaching almost 100% for SRaSE_2.

Having considered two parametric models, we now move to an example where the label is generated in a nonparametric way (Tian and Feng 2021b). We follow the idea of nearest neighbors when assigning the labels.

The detailed data generation process is as follows. First, 10 initial points z_1, ..., z_10 are generated i.i.d. from N(0_{p×1}, I_p). Out of the 10 points, five are labeled 0 and the other five are labeled 1. Then, each observation pair (x, y) is generated as follows. First, we randomly select one element from {z_1, ..., z_10}; suppose it is z_{k_i}, corresponding to the i-th observation. Then, y_i takes the label of z_{k_i}, and the feature vector x_i is generated as x_i ∼ N((z_{k_i, S*}^T, 0_{1×(p−5)})^T, 0.5^2 I_p). The general idea is that we are generating a mixture of ten Gaussian clusters surrounding each of the {z_1, ..., z_10} when they are embedded in the p-dimensional space.
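A short R sketch of this generating mechanism, assuming for illustration that S* consists of the first five coordinates, is as follows:

# Sketch of the nearest-neighbor-style generator: a mixture of ten Gaussian
# clusters whose centers differ only on the (assumed) first five coordinates.
set.seed(2)
p <- 200; n <- 400; sd_noise <- 0.5

z     <- matrix(rnorm(10 * p), 10, p)          # ten initial points z_1, ..., z_10
z_lab <- rep(c(0, 1), each = 5)                # five labeled 0, five labeled 1

k  <- sample(10, n, replace = TRUE)            # pick a center z_{k_i} for each observation
y  <- z_lab[k]                                 # y_i inherits the label of z_{k_i}
mu <- cbind(z[k, 1:5], matrix(0, n, p - 5))    # mean (z_{k_i, S*}, 0_{1 x (p - 5)})
x  <- mu + matrix(rnorm(n * p, sd = sd_noise), n, p)   # N(mu, 0.5^2 I_p)

table(y)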

We consider p = 200 and n ∈ {200, 400, 1000}. A summary of the test classification error rates over 200 repetitions is presented in Table 3.

Next, we show the average selected proportion of each base method for different sample sizes and iteration numbers in Figure 7.

We focus on the observations corresponding to the numbers 7 (class 0) and 9 (class 1). The average test classification errors along with their standard deviations are reported in Table 5.