Simultaneous Feature Selection and Classification for Data-Adaptive Kernel-Penalized SVM

Abstract: Simultaneous feature selection and classification have been explored in the literature as extensions of support vector machine (SVM) techniques, typically by adding penalty terms directly to the loss function. However, it is the kernel function that controls the performance of an SVM, and an imbalance in the data will deteriorate that performance. In this paper, we examine a new method of simultaneous feature selection and binary classification. Instead of penalizing the standard loss function of the SVM, a penalty is added directly to a data-adaptive kernel function to control the performance of the SVM: the kernel function is first conformally transformed, and an SVM classifier is then re-fitted based on the sparse features selected. Both convex and non-convex penalties, such as the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD) and the minimax concave penalty (MCP), are explored, and the oracle property of the estimator is established accordingly. An iterative optimization procedure is applied, as no analytic form of the estimated coefficients is available. Numerical comparisons show that the proposed method outperforms its competitors when the data are imbalanced, and performs similarly to them when the data are balanced. The method can be readily applied to medical images from different platforms.


Introduction
As one of the most critical tasks in high-dimensional data mining, the performance of a classification model relies on selecting the most relevant features while removing irrelevant ones. This offers several advantages, including a lower risk of overfitting, reduced model complexity (and hence improved generalization capability) and lower computational cost [1]. In scientific research, classification models have served as useful tools of artificial intelligence in various areas such as financial credit risk assessment [2], signal processing and pattern recognition [3]. In these tasks, the fundamental goal is accurate prediction across various situations.
The support vector machine (SVM) offers a method of classification with adequate generalizing ability and few local minima, depending on only a few parameters [4], and has achieved success in applications as a powerful, flexible classifier of high accuracy; see, e.g., [5]. However, in its standard formulation, the method cannot assess the importance of different features [6], and its performance may deteriorate severely when redundant variables are used in the decision rule, becoming as poor as random guessing due to the accumulation of random noise, especially in a high-dimensional space [7,8]. Consequently, feature selection within the SVM framework is essential; as shown later, the proposed method outperforms its competitors when the data are imbalanced and performs as well as others when the data are balanced. Our contribution in this paper is mainly two-fold. Firstly, the proposed method can select relevant predictors in the feature space, with selection consistency established theoretically; to our knowledge, no previous work has studied the theoretical properties of such a selection procedure within the SVM framework. We also employ non-convex penalty functions, including SCAD and MCP, which are much more difficult to handle computationally due to the non-convex objective function. Secondly, we take the data imbalance issue into consideration during feature selection with a data-adaptive procedure, which is not straightforward, especially when the input space is divergingly large. This imbalance issue, which may severely deteriorate the performance of a classifier, has not been accommodated in previous feature selection procedures.
The methodology is partially motivated by an ongoing prostate cancer study from London, ON, Canada. The goal is to construct a classifier to predict the cancerous areas with imaging intensity measures that come from different platforms such as MRI and CT. Several issues need to be considered. One is that redundant measures might deteriorate the performance of the classifier, so a feature selection technique is necessary. Another is the imbalance in the data: the cancerous proportion of the prostate is only 8% on average, indicating an extreme imbalance. Hence, how to perform accurate classification while accommodating these issues needs to be addressed.
The rest of the paper is organized as follows. In Section 2, the framework of SVMs and the penalized SVM is introduced. In Section 3, our model is constructed; not only is the oracle property established, but an algorithm is also introduced for implementation purposes. Section 4 reports the experimental results comparing the numerical performance of different models under different scenarios, and the model is also applied to a real data set. Remarks conclude the paper, and technical proofs are given in Appendix A.

Notation and Framework
Consider a binary classification problem. Given a random sample {(x_i, y_i)}_{i=1}^n, x_i is a vector of features in the input space I = R^p, y_i represents the class index taking values +1 or −1, and p, the dimension of the input space, indicates the number of features available. The goal is to determine a rule so that future observations with only the features available can be labeled into the corresponding class. The support vector machine (SVM) is a technique to obtain such a rule. The SVM finds a linear boundary separating the two classes by maximizing the smallest distance from the observations of each class to the boundary if the samples are linearly separable. When the samples are not linearly separable, the method finds a nonlinear boundary by mapping the input data x into a high-dimensional feature space F = R^l using a nonlinear mapping function s : R^p → R^l, and searching for a linear discriminant function, or a hyperplane, in the feature space F:

D(x) = β^T s(x) + b,

where β = (β_1, β_2, . . . , β_l)^T is an l-dimensional vector of parameters, s(x) = (s_1(x), . . . , s_l(x))^T is an l-dimensional column vector, and b is a scalar bias term. Hence, an individual point with observation x can be classified by the sign of D(x) = β^T s(x) + b once the parameters β and b are determined, and the boundary D(x) = 0 is nonlinear in the input space. Theoretically, the solution to the SVM can be obtained by maximizing the aggregated margin between the separating boundaries [25]. In the meantime, the features used to construct the rule should be limited, or even sparse, so that the rule is easy to implement in practice. Mathematically, the SVM boundary is the solution of minimizing

(1/2)||β||^2 + C ∑_{i=1}^n ξ_i

with respect to β and b, subject to the constraints

y_i(β^T s(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, . . . , n,

where C is the so-called soft margin parameter that determines the trade-off between the margin and the classification error, and ξ = (ξ_1, . . . , ξ_n)^T are non-negative slack variables.
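The soft-margin formulation above can be illustrated with a few lines of scikit-learn; this is a minimal sketch on synthetic toy data, not the implementation used in the paper.

```python
# A minimal sketch of the soft-margin SVM described above, using
# scikit-learn on synthetic toy data (an illustration only).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two Gaussian clouds, one per class.
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# C is the soft-margin parameter trading margin width against slack.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.score(X, y))  # training accuracy of the linear boundary
```

Larger values of C penalize slack more heavily and narrow the margin; smaller values tolerate more misclassified points.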
Equivalently, this optimization problem can be represented via the Lagrangian dual: maximize

∑_{i=1}^n α_i − (1/2) ∑_{i=1}^n ∑_{j=1}^n α_i α_j y_i y_j <x_i, x_j>

subject to the constraints

0 ≤ α_i ≤ C,  ∑_{i=1}^n α_i y_i = 0,

where the α_i's are the dual variables, and < · , · > is the inner product operator. Generally, a scalar function K(·, ·), called a kernel function, is adopted to replace the inner product of the two vectors x_i and x_j in the dual function. If we denote by SV the set {j | α_j > 0, j = 1, 2, . . . , n} and call the x_i, i ∈ SV, the support vectors, the kernel form of the SVM boundary can be written as

D(x) = ∑_{i ∈ SV} α_i y_i K(x_i, x) + b,

and, consequently, the estimated bias term b_j obtained by using the j-th support vector x_j is defined as

b_j = y_j − ∑_{i ∈ SV} α_i y_i K(x_i, x_j).

The bias term b_j is proved to be identical for all j in the set SV [7]. Thus, in practice, with the estimated coefficients α_i, we can take the average of the estimated b_j's over all support vectors as the estimate of b.
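The kernel-form boundary D(x) = ∑_i α_i y_i K(x_i, x) + b can be reconstructed by hand from a fitted classifier; the following sketch (toy data assumed) checks the dual representation against scikit-learn's own decision function.

```python
# Sketch: rebuild D(x) = sum_i alpha_i y_i K(x_i, x) + b from a fitted SVC
# to illustrate the dual representation above (toy data assumed).
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=80))
y[y == 0] = 1  # guard against an exact zero

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors.
K = rbf_kernel(clf.support_vectors_, X, gamma=0.5)
D_manual = clf.dual_coef_ @ K + clf.intercept_

# Should match sklearn's own decision function.
print(np.allclose(D_manual.ravel(), clf.decision_function(X)))
```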
Although the kernel form of the SVM is developed through the projection of the input space to a higher-dimensional space, in practice we may specify the kernel function instead of finding the projection mapping. A number of commonly used kernel functions are available; for example, the radial kernel function

K(x, z) = h(||x − z||),

where h(·) is a probability density function. When h(·) comes from a Gaussian distribution with variance σ^2, the kernel function is then

K(x, z) = exp{−||x − z||^2 / (2σ^2)}.  (6)
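The Gaussian radial kernel in (6) is easy to compute directly; this sketch (σ and the toy vectors are assumptions) verifies two basic properties: self-similarity equals one, and symmetry.

```python
# A small sketch of the Gaussian radial kernel above (sigma assumed).
import numpy as np

def rbf(x, z, sigma=1.0):
    """Gaussian radial basis kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(rbf(x, x))               # 1.0: a point is maximally similar to itself
print(rbf(x, z) == rbf(z, x))  # True: the kernel is symmetric
```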

Geometric Interpretation of SVM Kernels
Geometrically speaking, when the input space I is a Euclidean space, a Riemannian metric is induced in the feature space F. Let f be the mapped result of x ∈ R^p in F, i.e., f = s(x) ∈ R^l; then a small change dx in the input space is mapped into the vector df in the feature space so that

df = ∑_i (∂s/∂x_i) dx_i.

Thus, the squared length of df can be written in the quadratic form

||df||^2 = ∑_i ∑_j s_ij(x) dx_i dx_j.

Consequently, the p × p matrix S(x) = [s_ij(x)] defines the Riemannian metric, which can be derived from the kernel K, and S(x) is positive definite [26]. More straightforwardly, the following lemma demonstrates the connection between a kernel function K and a mapping s.

Lemma 1 ([26]). Suppose K(x, z) is a reproducing kernel function and s(x) is the corresponding mapping in the support vector machine. Then

s_ij(x) = [∂^2 K(x, z) / ∂x_i ∂z_j]_{z=x}.  (9)

To increase the separability between the two categories, the spatial resolution around the boundary surface in F needs to be enlarged. This motivates us to increase the magnification factor v(x) around the boundary D(x) = 0. Therefore, the mapping s or, equivalently, the related kernel K, is to be modified so that s_ij(x) is enlarged around the boundary. This idea is especially useful when dealing with imbalanced data, since it is known that an imbalance in the data can severely affect the performance of an SVM; a data-adaptive kernel function is constructed to address this problem based on the assumed form of the original kernel function.
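Lemma 1 can be checked numerically. For the Gaussian kernel with variance σ^2, the mixed second partials at z = x give the metric (1/σ^2) I; the sketch below (toy point and step size are assumptions) approximates them by finite differences.

```python
# Numerical sketch of Lemma 1: s_ij(x) = d^2 K(x,z) / dx_i dz_j at z = x.
# For the Gaussian kernel with variance sigma^2 this metric is (1/sigma^2) I.
import numpy as np

def K(x, z, sigma=1.0):
    d = x - z
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def metric(x, sigma=1.0, h=1e-4):
    """Mixed second partials of K via finite differences, evaluated at z = x."""
    p = len(x)
    S = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            ei, ej = np.eye(p)[i] * h, np.eye(p)[j] * h
            S[i, j] = (K(x + ei, x + ej, sigma) - K(x + ei, x, sigma)
                       - K(x, x + ej, sigma) + K(x, x, sigma)) / h ** 2
    return S

x = np.array([0.3, -1.2])
print(np.round(metric(x, sigma=1.0), 3))  # approximately the identity matrix
```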

Penalized SVM
When there is a large number of features available, not all of them will contribute to the construction of the classifier. Redundant features and extra noise in the input may deteriorate the accuracy of the classifier while increasing its complexity if they are all included in the model. The number of features can be controlled within the SVM framework. Under the standard prediction-risk framework of a loss-plus-penalty form, the potential misclassification cost can be specified by a universal weight c for each sample point from the two classes, namely Q_i = c if y_i = 1 and Q_i = 1 − c if y_i = −1 for some 0 < c < 1, and the classification boundary can be estimated by a linear weighted SVM [7,27] by solving

min_{β, b} ∑_{i=1}^n Q_i {1 − y_i(x_i^T β + b)}_+ + λ||β||^2,

where (1 − t)_+ = max{1 − t, 0} denotes the hinge loss, β are the coefficients of the features, b is the intercept and λ is a positive regularization parameter. When the weight c = 0.5, the linear weighted SVM reduces to the standard SVM [27]. When the hinge loss is considered as E[Q{1 − y(X^T β + b)}_+], an analytic form of the estimator β̂ is available. Furthermore, in terms of selecting variables from the input space, suppose the true model has sparse features or, equivalently, x = (z^T, u^T)^T, where the k × 1 vector z is the feature vector corresponding to the non-zero coefficients and the (p − k) × 1 covariate vector u corresponds to the redundant information. To select the vector z, [8] proposed a general form of penalty terms added directly to the loss function:

min_{β, b} ∑_{i=1}^n {1 − y_i(x_i^T β + b)}_+ + ∑_{j=1}^p p_λ(|β_j|),

where p_λ(·) is a symmetric, non-convex penalty function with a tuning parameter λ. Oracle properties were developed under some regularity conditions, and common penalty functions such as the smoothly clipped absolute deviation (SCAD) [14] penalty and the minimax concave penalty (MCP) [28] were explored. However, such a feature selection process screens features in the input space.
As shown in Lemma 1, when the classes are not linearly separable in the input space, the SVM maps the input space into a projected space, and the linear boundary is obtained in that projected space. As the kernel function controls the classifier's performance, a straightforward idea is to select features and enhance the performance of the SVM by introducing penalty terms directly into the kernel function. Based on this observation, we propose a method of simultaneous feature selection and classification that penalizes the kernel function in the SVM through a data-adaptive kernel procedure.

Methodology of Data-Adaptive Kernel-Penalized SVM
In this section, a data-adaptive kernel-penalized SVM is proposed. The method simultaneously selects features and conducts classification with a data-adaptive kernel function. Instead of adding a penalty to the standard hinge loss function, we propose to introduce the penalty term directly into the SVM under the kernel formulation, so that the number of predictors is controlled. To accommodate the common imbalance issue in real applications, a data-adaptive kernel is employed. Moreover, the oracle properties of the estimator of the true parameters under the proposed setting are developed.

Kernel-Based Parameters
We focus on the Gaussian radial basis function (RBF) kernel, as in (6), to develop the proposed method, where the parameter σ is originally assumed to be universal for all components of the input vectors x. In fact, the parameter σ can be extended to be component-specific:

K(x, z) = exp{−∑_{j=1}^p (x_j − z_j)^2 / (2σ_j^2)},

where p is the dimension of the input space I [6]. Consequently, the contributions of the corresponding predictors can be determined by the parameters σ = (σ_1, σ_2, . . . , σ_p). For instance, if σ_j is very large, the j-th predictor tends to contribute very little to the kernel function, as the corresponding component in the exponent will be close to zero. Conversely, if σ_j is small, the contribution of the j-th predictor will be large and its importance increases accordingly. Thus, by controlling the j-th component of the parameter vector σ, the importance of the j-th predictor can be determined. This provides a method of feature selection by directly estimating the parameters in the kernel function. Accordingly, we propose the following modification of the kernel function:

K(x, z; w) = exp{−||w ⊗ (x − z)||^2 / 2},  (13)

where w = (w_1, w_2, . . . , w_p) = (1/σ_1, 1/σ_2, . . . , 1/σ_p) and ⊗ represents the component-wise product. That is, w assigns a weight to the contribution of each component of the kernel. When w_j is large, the contribution of the j-th feature will be large and hence its importance increases. Conversely, when w_j is small, the j-th predictor tends to contribute little to the kernel function and might not be included during the construction of an SVM. However, even if the absolute value of w_j is small (but not zero), its influence on the kernel function still exists. Including too many active features in the classifier may dramatically complicate the model and introduce extra noise. Forcing the effects of some features to be exactly zero may therefore resolve this issue.
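The effect of a zero weight can be seen directly: a feature with w_j = 0 drops out of the similarity computation entirely. A minimal sketch (weights and toy vectors are assumptions):

```python
# Sketch of the component-weighted RBF kernel: w_j = 0 removes feature j
# from the similarity computation entirely (toy vectors, weights assumed).
import numpy as np

def weighted_rbf(x, z, w):
    """K(x, z; w) = exp(-||w * (x - z)||^2 / 2), component-wise product."""
    d = w * (x - z)
    return np.exp(-np.dot(d, d) / 2.0)

w = np.array([1.0, 0.5, 0.0])     # third feature given zero weight
x = np.array([1.0, 2.0, 3.0])
z1 = np.array([0.0, 1.0, 3.0])
z2 = np.array([0.0, 1.0, -9.0])   # differs from z1 only in the third feature

# Identical kernel values: the zero-weighted feature is effectively deselected.
print(weighted_rbf(x, z1, w) == weighted_rbf(x, z2, w))
```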
This can be achieved by introducing a penalty on the weights w under the assumption that the set of active features is sparse.

Data-Adaptive Kernel Functions
To deal with imbalance in the data and enhance the performance of the SVM, we employ a data-adaptive kernel function when constructing the SVM. The data-adaptive kernel SVM is a two-stage procedure: an SVM is applied in the first stage to identify a temporary boundary, and the kernel function is modified adaptively in the second stage based on the boundary and support vectors identified in the first stage. It has been proven capable of increasing the separability between two classes by enlarging the spatial resolution around the boundary surface. This is especially important when the data are imbalanced, as it has been demonstrated that class imbalance can severely affect the performance of an SVM [29]. To illustrate, let f be the mapped result of x ∈ R^p in F, i.e., f = s(x) ∈ R^l. A small change dx in the input space will be mapped into the vector df in the feature space such that

df = ∑_i (∂s/∂x_i) dx_i,

and the squared length of df can be written in the quadratic form

||df||^2 = ∑_i ∑_j s_ij(x) dx_i dx_j,

where s_ij(x) can be regarded as a local magnification factor [26]. To enlarge the spatial separation around the boundary, the kernel function is adapted based on the data. Let C(x, x') be a positive scalar function such that

C(x, x') = c(x) c(x'),

where x and x' are vectors of features in the input space, and c(x) is a positive univariate scalar function. Then the kernel function K is updated as

K̃(x, x') = C(x, x') K(x, x') = c(x) c(x') K(x, x'),  (15)

where K(x, x') is the kernel function in the first stage and K̃(x, x') is the updated kernel in the second stage. This can be viewed as a modification of the original mapping s(x) to a new mapping function s̃(x), satisfying

s̃(x) = c(x) s(x).  (14)

This process is referred to as adaptive scaling, and K̃ can easily be shown to satisfy the Mercer positivity condition, the sufficient condition for a real function to be a kernel function [22]. When s̃_ij(x) has larger values at the support vectors than at other data points, the updated mapping s̃ can increase the separation, provided the positive function c(x) is properly chosen.
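The Mercer positivity of the conformally scaled kernel is easy to see numerically: written in matrix form, the update is D K D with D = diag(c), which preserves positive semi-definiteness. A small sketch (toy data and an arbitrary positive c are assumptions):

```python
# Sketch of conformal (adaptive) scaling: K~(x, x') = c(x) c(x') K(x, x').
# Written as D K D with D = diag(c), it stays positive semi-definite.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
K = rbf_kernel(X, X, gamma=1.0)

c = 1.0 + np.abs(rng.normal(size=30))  # any positive scaling function c(x)
K_tilde = np.outer(c, c) * K           # conformal transformation of the kernel

# Smallest eigenvalue stays (numerically) non-negative: still a valid kernel.
print(np.linalg.eigvalsh(K_tilde).min() >= -1e-8)
```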
In particular, when the kernel function is Gaussian, we have the following result, derived from Amari and Wu [26].

Theorem 1. When a Gaussian radial basis kernel as in (6) is used, the modified magnification factor is

s̃_ij(x) = c_i(x) c_j(x) + {c^2(x)/σ^2} I(i = j),

where c_i(x) = ∂c(x)/∂x_i, and I(·) is the indicator function.
Thus, to make s̃_ij larger, we need to make the positive scalar c(x) and its first-order derivatives relatively large. The authors propose to adaptively scale the primary kernel function K by constructing c(x) with an L1-norm radial basis function of the decision function together with a local factor k_M(x) = |N_M(x)|, where D(x) is the decision function in (1) obtained from the first stage, N_M(x) = {j : ||s(x_j) − s(x)||^2 < M, y_j ≠ y}, |A| is the cardinality of the set A, y is the class label associated with x, and M can be regarded as the distance between the nearest and the farthest support vectors from s(x). This process is important when the data are imbalanced since, by incorporating k_M(x) into c(x), the adaptive scaling process updates the spatial information and balances the data locally by considering only the support vectors from the opposite class near the boundary, which determine the location of the decision hyperplane. This method has been proved to achieve greater separability even when the data are imbalanced. The magnification effect is roughly largest near the initial separating boundary and decreases steadily with distance from the boundary. We incorporate this data-adaptive kernel procedure to accommodate imbalanced classes.
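The count k_M(x) = |N_M(x)| can be computed without the explicit mapping s, using the kernel trick for feature-space distances: ||s(a) − s(b)||^2 = K(a,a) − 2K(a,b) + K(b,b). The following sketch illustrates this; the radius M, kernel width, and toy support vectors are assumptions.

```python
# Sketch of k_M(x) = |N_M(x)|: opposite-class support vectors whose
# feature-space distance to s(x) is below M, via the kernel trick
# ||s(a) - s(b)||^2 = K(a,a) - 2 K(a,b) + K(b,b).
import numpy as np

def rbf(a, b, sigma=1.0):
    d = a - b
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def k_M(x, y, sv_X, sv_y, M, sigma=1.0):
    """Count opposite-class support vectors within feature-space radius."""
    count = 0
    for xj, yj in zip(sv_X, sv_y):
        dist2 = rbf(xj, xj, sigma) - 2.0 * rbf(xj, x, sigma) + rbf(x, x, sigma)
        if dist2 < M and yj != y:
            count += 1
    return count

sv_X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
sv_y = np.array([1, -1, -1])
print(k_M(np.zeros(2), 1, sv_X, sv_y, M=0.5))  # only the nearby -1 vector counts
```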

Data-Adaptive Kernel-Penalized SVM
To control the number of features in the classifier, a penalty term p_λ(w) on the weights is introduced to select the features through the kernel function. We propose to add the penalty term directly to the dual optimization problem of the SVM, which contains the kernel function. Specifically, the data-adaptive kernel-penalized SVM is initially proposed as the solution to

min_{α, w} { −∑_{i=1}^n α_i + (1/2) ∑_{i=1}^n ∑_{j=1}^n α_i α_j y_i y_j K̃(x_i, x_j; w) + ∑_{j=1}^p p_λ(|w_j|) }  (18)

such that

0 ≤ α_i ≤ C,  ∑_{i=1}^n α_i y_i = 0,  w_j ≥ 0,  j = 1, . . . , p,

where K̃(x, z) is the data-adaptive kernel function from (15), c(x) in K̃(x, z) is from (16) and the primary kernel function is from (13). Once the estimate of w is obtained, the predictors with non-zero coefficients are considered to be the truly active predictors that affect the decision boundary. The boundary is estimated by

D(x) = ∑_{i ∈ SV} α_i y_i K̃(x_i, x; ŵ) + b,  (19)

and the intercept b can be estimated as before. With the decision rule in (19), a test observation x is assigned to a class by the sign of D(x).
There are several options for the specific form of the penalty. In general, non-convex penalties satisfying Assumptions 1 and 2 below can be used.
1. SCAD penalty [14]: defined through its derivative p'_λ(t) = λ{I(t ≤ λ) + (aλ − t)_+ / ((a − 1)λ) I(t > λ)} for t > 0 and some a > 2.
2. MCP [28]: p_λ(t) = λ ∫_0^t (1 − u/(aλ))_+ du for t > 0 and some a > 1.
3. L0-norm smooth approximation [11]: ||w||_0 = |{i : w_i ≠ 0}|. Unlike the L_p norms with p > 0, the L0 "norm" is not precisely a norm, since absolute homogeneity does not hold, and it is not smooth. Thus, a smooth concave approximation of the L0 norm is applied, yielding the penalty function p_λ(w_i) = 1 − exp(−λ|w_i|), where λ is an approximation parameter.
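The non-convex penalties can be sketched in closed form; the following uses the standard SCAD and MCP expressions with the a-values adopted later in the paper (a = 3.7 and a = 3), both of which vanish at zero and flatten out for large coefficients.

```python
# Sketches of the SCAD and MCP penalty functions (standard closed forms;
# the a-values follow the paper's choices a = 3.7 and a = 3).
import numpy as np

def scad(t, lam, a=3.7):
    """SCAD penalty at t: linear, then quadratic, then constant."""
    t = np.abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    return lam ** 2 * (a + 1) / 2

def mcp(t, lam, a=3.0):
    """MCP at t: quadratically tapered, constant beyond a * lam."""
    t = np.abs(t)
    if t <= a * lam:
        return lam * t - t ** 2 / (2 * a)
    return a * lam ** 2 / 2

lam = 1.0
# Both penalties are zero at zero and constant for large coefficients,
# which avoids the over-shrinkage of large effects seen with LASSO.
print(scad(0, lam), scad(10, lam) == scad(20, lam))
print(mcp(0, lam), mcp(10, lam) == mcp(20, lam))
```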

Remark 1.
A penalty is usually added to the loss function in the literature; however, the standard loss function does not contain the kernel function. When the data are imbalanced, the performance of a standard SVM will be affected, and consequently, features selected without accounting for the class imbalance may be unreliable in imbalanced applications. In contrast, the data-adaptive kernel-penalized SVM fulfils the feature selection process while taking the class imbalance into account.

Remark 2.
Although other types of kernels, such as the polynomial kernel K(x, z) = (1 + ∑_{j=1}^p x_j z_j)^d, are also available to describe the mapping, not all kernels are feasible for simultaneous feature selection and classification because of technical difficulties. For example, polynomial kernels are determined only by the order parameter d, and it is not obvious how feature selection could be conducted during the classification process. Nevertheless, the proposed method remains very attractive in applications, since the Gaussian RBF kernel adopted here is the most popular kernel.

Remark 3.
The constraints in the dual function include the non-negativity of the parameters w, which correspond to the positive scale parameters in the Gaussian kernel. This constraint can be removed by using a quadratic form of the parameters in the penalized kernels, e.g., writing w_j = v_j^2 and optimizing over v_j without constraints.

An Algorithm to Solve Data-Adaptive Kernel-Penalized SVM
To solve the data-adaptive kernel-penalized SVM in (18), a two-stage algorithm is proposed. In the first stage, a standard SVM is obtained so that the location information of the support vectors and the temporary decision boundary are available. The primary kernel function is then updated adaptively by (15) in the second stage, and the optimization with both the updated kernel and the penalty is then solved to obtain the final boundary as well as the selected features.
Since the objective function in (18) is non-convex, an iterative procedure is adopted [11]. To be specific, in the t-th iteration, t = 1, 2, . . . , T, a standard dual optimization problem for an SVM with the (t − 1)-th estimated kernel parameter vector w^(t−1) is solved:

max_α ∑_{i=1}^n α_i − (1/2) ∑_{i=1}^n ∑_{j=1}^n α_i α_j y_i y_j K̃(x_i, x_j; w^(t−1))

such that

0 ≤ α_i ≤ C,  ∑_{i=1}^n α_i y_i = 0,

and the result is denoted α^(t). During this stage, the support vectors are identified by the non-zero α_i's, c(x) can be constructed through (16), and the data-adaptive kernel function K̃(x, z) can be constructed by (15). Finally, a non-linear optimization with fixed α^(t) is solved:

min_w ∑_{i=1}^n ∑_{j=1}^n α_i^(t) α_j^(t) y_i y_j K̃(x_i, x_j; w) + ∑_{j=1}^p p_λ(|w_j|)

such that w_j ≥ 0, j = 1, 2, . . . , p.
The process stops when ||w^(t) − w^(t−1)|| is sufficiently small.
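The alternating scheme above can be sketched in a few lines: fit an SVM with the current weighted kernel (stage 1), then take a projected gradient step on the kernel weights under a penalty (stage 2). For simplicity this sketch uses an L1 penalty, a numeric gradient, a plain (non-adaptive) weighted kernel, and toy data; all of these are simplifying assumptions, not the paper's exact algorithm.

```python
# Simplified sketch of the two-stage iteration: alternate between an SVM fit
# with the current weighted kernel (stage 1) and a projected gradient step on
# the kernel weights w under an L1 penalty (stage 2). Toy data, L1 penalty,
# step size, and the numeric gradient are simplifying assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = np.sign(X[:, 0] - X[:, 1] + 0.2 * rng.normal(size=n))

def gram(X, w):
    diff = X[:, None, :] - X[None, :, :]               # pairwise differences
    return np.exp(-np.einsum('ijk,k->ij', diff ** 2, w ** 2))

w, lam, step = np.ones(p), 0.05, 0.01
for t in range(20):
    clf = SVC(kernel="precomputed", C=1.0).fit(gram(X, w), y)  # stage 1: alpha
    a = np.zeros(n)
    a[clf.support_] = np.abs(clf.dual_coef_.ravel())           # alpha_i >= 0
    # stage 2: numeric gradient of the penalized kernel objective in w
    obj = lambda w_: (a * y) @ gram(X, w_) @ (a * y) + lam * np.abs(w_).sum()
    g = np.array([(obj(w + 1e-5 * e) - obj(w - 1e-5 * e)) / 2e-5
                  for e in np.eye(p)])
    w = np.maximum(w - step * g, 0.0)                  # projected update: w_j >= 0

print(np.round(w, 2))  # estimated kernel weights after the alternating rounds
```

In the paper's algorithm, the kernel in stage 2 is the data-adaptive K̃ and the penalty is SCAD, MCP or the L0 approximation rather than L1.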

The Oracle Property
In this subsection, we develop the oracle property of the estimator. We show that, under some regularity conditions, the distance between the estimates and the true values of the parameters goes to zero with probability 1 when the sample size is sufficiently large. Here, we only need to consider the optimization in the second stage of (22), since all the unknown information regarding the parameters w is included in this stage (note that α is considered a fixed constant vector in the second stage). Define the estimator as the minimizer of L_2(w) = ∑_i ∑_j α_i α_j y_i y_j K(x_i, x_j; w) + ∑_{j=1}^p p_λ(|w_j|), and consider the following regularity conditions:

C1. The densities of Z given Y = 1 and Y = −1 are continuous with common support in R^q, where Z are the truly relevant predictors.
C2. E(Z_j^2) < ∞ for 1 ≤ j ≤ q, i.e., the second-order moments of all active predictors are finite.
C3. The true parameter β_0 is a non-zero and unique vector.
C4. q = O(n^c) for some 0 ≤ c < 1/2, namely, lim_{n→∞} q/n^c < ∞.
C5. The eigenvalues of n^{−1}[X^2]^T X^2 are finite, where X is the input matrix and (·)^2 is the component-wise square.
Conditions C1-C3 are assumptions ensuring that the oracle estimator constructed in our proposed method is consistent and that the optimal classification decision rule is not constant. Condition C4 is a common requirement in high-dimensional inference, indicating that the number of truly active predictors cannot diverge at a rate faster than √n. Condition C5 gives an upper bound for the largest eigenvalue of the squared design matrix, which is necessary in our proposed method due to the quadratic form in the radial kernel functions. With these conditions, the following oracle property holds:

Theorem 2. Assume that Conditions C1-C5 and Assumptions 1-2 for the penalty are satisfied. If max{|p''_λ(|w_j|)| : w_j ≠ 0} → 0, then there exists a local minimizer ŵ of L_2(w) = ∑_{i=1}^l ∑_{j=1}^l α_i α_j y_i y_j K(x_i, x_j; w) + ∑_{j=1}^p p_λ(|w_j|) such that ||ŵ − w_true|| = O_p(√(q/n)), where w_true is the true value of w.
A detailed proof is provided in Appendix A. Theorem 2 guarantees that the estimate of the parameter in the proposed method acts as if the true values of the parameters were known. When the sample size is sufficiently large, the distance between the estimates and the true values of the parameters will be small enough. Consequently, the estimated decision rule in (19) can be obtained as if the true decision boundary were known, and it can then be employed to classify new observations. Though various SVM-based feature selection procedures are available in the literature, the proposed method is different in that it directly obtains a minimal subset of features and simultaneously classifies objects by penalizing the kernel function, eliminating noisy features without ranking them. The proposed method is more time-efficient than the methods in the literature, and it improves the classification performance, especially when the data are imbalanced.

Numerical Studies
In this section, simulation studies are carried out to assess the performance of the data-adaptive kernel-penalized SVM, and to compare the proposed method with some other penalized SVMs in the literature. In the data-adaptive (DA) kernel-penalized SVM, the SCAD (DA-SCAD-SVM), MCP (DA-MCP-SVM) penalties and L 0 norm approximation (DA-L0-SVM) are used. For other penalized SVMs, we use the penalties of SCAD (SCAD-SVM, [8]), MCP (MCP-SVM, [8]), L 1 norm (L 1 -SVM, [30]), adaptively weighted L 1 norm with a weight parameter c = 0.5 (adapt L 1 -SVM, [10]) and L 0 norm approximation (L 0 -SVM, [11]). The comparisons are made under various levels of imbalance in the data. The abilities of identifying the relevant features and controlling the test error are compared when the data are both balanced and imbalanced.
In terms of tuning the regularization parameters for all of the approaches considered, we adopt a procedure similar to [31]. The prediction error is estimated using five-fold cross-validation. The initial value of w is set as 1^T. During the second stage of solving the data-adaptive kernel-penalized SVM, a gradient descent procedure is adopted for the non-linear optimization problem. The iterative algorithm stops when the change in the estimates of w in two consecutive rounds, namely ||w^(t+1) − w^(t)||, is smaller than a given threshold, which is set as 10^{-4} for fast convergence.
For the tuning parameter λ in the penalty term, we use the SVM-extended Bayesian information criterion (SVMIC) proposed in [8],

SVMIC(S) = ∑_{i=1}^n ξ_i + |S| log n + 2γ log C(p, |S|),  (23)

where ξ_i, i = 1, 2, . . . , n, are the optimal slack variables, S is a subset of {1, 2, . . . , p}, |S| is the cardinality of S, and C(p, |S|) denotes the binomial coefficient (p choose |S|). This idea is motivated by the standard Bayesian information criterion and its extension in [32]. The range of λ is set as {2^{-6}, 2^{-5}, . . . , 2^3}, and γ is set as 0.5 in the tuning procedure without loss of generality [32]. The value of λ is chosen as the one that minimizes (23). Note that the values of the slack variables ξ_i in (23) are not directly available, but they can be calculated as ξ_i = [1 − y_i D(x_i)]_+ for i = 1, . . . , n, where [t]_+ = max{0, t} and D(x_i) is obtained by (19) [33].
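An SVMIC-style criterion of this kind is straightforward to compute; the following is a sketch mirroring the description above (the exact constant structure in [8] may differ, and the toy slack values and γ = 0.5 are assumptions).

```python
# Sketch of an SVMIC-style criterion: hinge slacks plus an EBIC-type
# complexity term; smaller values indicate a better model.
import math

def svmic(xi, n_features, selected, gamma=0.5):
    """Sum of slacks + |S| log n + 2 gamma log C(p, |S|)."""
    n, s, p = len(xi), len(selected), n_features
    return sum(xi) + s * math.log(n) + 2 * gamma * math.log(math.comb(p, s))

# Toy slack variables for two candidate models on the same data (n = 6).
xi_small = [0.0, 0.2, 0.0, 0.5, 0.1, 0.0]  # 2 selected features
xi_large = [0.0, 0.1, 0.0, 0.4, 0.1, 0.0]  # 5 features, barely better fit
print(svmic(xi_small, 10, [1, 2]) < svmic(xi_large, 10, [0, 1, 2, 3, 4]))
```

The complexity term penalizes the larger model enough that the small model wins despite slightly larger slacks.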
As suggested in [8], for SCAD and MCP penalties, the constant a values will be set as 3.7 and 3, respectively.
Tables 1 and 2 summarize the performance for different combinations of imbalance levels and numbers of predictors, based on 100 replications. The sample sizes n are fixed at 100 and 400, respectively. The 'Relevant' and 'Irrelevant' columns show the mean numbers of truly active and inactive predictors selected by each model, respectively. The 'True' column gives the percentage of the 100 replications in which the true model, containing exactly the five active predictors, is correctly selected. Values in parentheses are the corresponding empirical standard errors. In general, the SVMs with non-convex penalized data-adaptive kernels show a much greater probability of correctly selecting the true model as n increases, which is consistent with the asymptotic oracle property. According to the Relevant column, the SVMs with the SCAD and MCP penalties find the most relevant predictors compared with the other methods. The SVM with the L0-norm approximation can find some relevant predictors, while the SVMs with an L1-norm penalty, with or without adaptive weights, tend to fail in selecting the correct predictors. According to the Irrelevant column, the two data-adaptive kernel-penalized methods exclude most irrelevant predictors and hence eliminate the noisy ones. The missing relevant predictor, if any, is mostly X_1, which has the weakest effect in the simulation settings.
On the other hand, as the imbalance level of the data increases, the prediction error tends to increase. However, given a specific imbalance level, the test prediction errors from the data-adaptive kernel-penalized SVMs are universally smaller than those of the other approaches, because these methods retain the fewest noisy predictors, so the prediction error is minimized. More importantly, as the imbalance level increases, our data-adaptive kernel-penalized SVMs outperform all the other methods, which agrees with the fact that the data-adaptive kernel can improve classification performance. This adaptive scaling of the kernel is applicable only to our setting and not to the other methods, due to the lack of kernel functions in their model structures (penalized SVMs place penalty terms directly on the loss function, which is not described in kernel form). Meanwhile, the feature selection performance changes little, especially for the non-convex penalized data-adaptive kernel SVMs.
It is worth noting that the combinations of (n, p) show that our method still performs well even when the number of potential predictors is proportional to, or larger than, the sample size. This suggests that the method may also work in big-data or ultra-high-dimensional settings. Indeed, the oracle property of our proposed method indicates that the true predictors can still be selected even when the dimension of the input space grows proportionally to the sample size, that is, in a high-dimensional setting.

A Real Data Example
A publicly available Wisconsin Breast Cancer (WBC) data set from the UCI Machine Learning Repository [34] provides an illustration of the proposed method. The data set can be found and downloaded via https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). The WBC data set contains 569 observations (212 malignant and 357 benign tumors). Thirty continuous features, such as the radius (mean of distances from the center to points on the perimeter) of the cancer, as well as its texture (standard deviation of gray-scale values), smoothness (local variation in radius lengths) and area, are used to classify the two classes of malignant and benign tumors. These features are measured from a digitized image of a fine needle aspirate (FNA) of a breast mass, which describes the cell nuclei shown in the images. We refer readers to the full description of the data set in [35]. As a pre-processing step, the features were first standardized.
Different methods are applied to the data set, both with and without penalties. For the classifiers without penalties, the Gaussian kernel is adopted, with all the input features used to estimate the decision boundary. For those with penalties, we use the data-adaptive kernel-penalized SVMs with SCAD and MCP penalties, as well as the penalized SVMs with SCAD and MCP penalties applied to the hinge loss.
The numbers of selected features and the test errors from all the considered methods are reported. For the approaches that require a two-stage optimization process, the solutions of the first-stage optimization are used as the initial values for the second-stage optimization if needed. For the SCAD and MCP penalties, the constant a is fixed at 3.7 and 3, respectively, the same values as in the simulation process. A five-fold cross-validation is conducted to obtain the tuning parameters, which are chosen from the sets B ∈ {0.1, 0.5, 1, 5, 10, 20, 50, 80, 100, 200, 500} and σ ∈ {0.1, 0.5, 1, 2, 3, 4, 5, 10, 50, 100}.
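For the unpenalized Gaussian-kernel baseline, the five-fold cross-validated search over B and σ can be sketched with scikit-learn; note that `SVC` parameterizes the RBF kernel as exp(−γ‖x − x′‖²), so the paper's σ corresponds to γ = 1/(2σ²), and `C` plays the role of B. This is a stand-in for the standard SVM competitor only, not for the data-adaptive kernel-penalized method itself.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = data.target

B_grid = [0.1, 0.5, 1, 5, 10, 20, 50, 80, 100, 200, 500]
sigma_grid = [0.1, 0.5, 1, 2, 3, 4, 5, 10, 50, 100]

# Translate the paper's (B, sigma) grid into sklearn's (C, gamma) grid.
param_grid = {"C": B_grid,
              "gamma": [1.0 / (2.0 * s**2) for s in sigma_grid]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # five-fold CV
search.fit(X, y)
```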
The tuning parameter λ is selected by grid search over {2^−14, 2^−9, . . . , 2^5} in 100 repetitions. Table 3 summarizes the classification outcomes: the mean and the standard deviation (in parentheses) of the prediction error and the number of predictors selected under the different approaches. It is clear that the data-adaptive kernel-penalized SVMs perform the best among all approaches, with a significantly lower prediction error and fewer selected predictors than any other method. Compared with the penalized SVMs with SCAD and MCP penalties, the data-adaptive kernel-penalized SVMs with the corresponding penalties still outperform, even though the penalties are the same. MCP seems to be the better choice of penalty term, since the number of selected predictors is the smallest and the standard deviation is smaller. The adaptively weighted L1-norm SVM and the L1-norm SVM perform moderately. Clearly, the numerical results confirm that the data-adaptive kernel-penalized SVMs with SCAD or MCP penalties are both promising classifiers with low prediction errors and excellent feature selection abilities.
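The two non-convex penalties compared above have standard closed forms (SCAD from Fan and Li, MCP from Zhang), which can be written down directly with the constants a = 3.7 and a = 3 used in the experiments:

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    # SCAD penalty, applied elementwise:
    #   lam*|t|                                  if |t| <= lam
    #   -(t^2 - 2*a*lam*|t| + lam^2)/(2(a-1))    if lam < |t| <= a*lam
    #   (a+1)*lam^2/2                            if |t| > a*lam
    t = np.abs(beta)
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            -(t**2 - 2.0 * a * lam * t + lam**2) / (2.0 * (a - 1.0)),
            (a + 1.0) * lam**2 / 2.0,
        ),
    )

def mcp_penalty(beta, lam, a=3.0):
    # Minimax concave penalty, applied elementwise:
    #   lam*|t| - t^2/(2a)   if |t| <= a*lam
    #   a*lam^2/2            otherwise
    t = np.abs(beta)
    return np.where(t <= a * lam, lam * t - t**2 / (2.0 * a), a * lam**2 / 2.0)
```

Both penalties agree with the LASSO near zero but flatten out for large coefficients, which is why they shrink noise predictors to zero without over-penalizing the large true coefficients; this is the source of the oracle property discussed earlier.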

Concluding Remarks
In this paper, we propose the data-adaptive kernel-penalized SVM, a new method that simultaneously achieves feature selection and classification, especially when the data are imbalanced. Instead of penalizing the loss function of the SVM, as has been done in the literature, a non-convex penalty is added directly to the kernel form of the SVM. The benefit is that features are selected more accurately in the feature space instead of the original input space, because it is the kernel function that mainly determines the classification process. Moreover, the data-adaptive kernel makes the SVM perform well even when the data are imbalanced, whereas, in this setting, other penalized SVMs do not work well due to their lack of flexibility. Along with the oracle properties, if the true sparsity in the feature space is known, our proposed method works well in both the simulation study and the real data example, possibly even in ultra-high-dimensional settings.
The method proposed in this paper is an embedded approach, as mentioned in the introduction, and the forms of the penalty terms are not limited to those applied in the methodology above. In terms of the multi-category classification problem, the methodology can be extended to fit the direct method, though the data-adaptive kernels need to be modified. Another issue is the choice of the primary kernel function: the proposed methodology is based on the Gaussian RBF kernel due to its natural link with the contribution of the predictors. Extensions will be considered in future work.
Furthermore, in terms of the cancer image data set, the patients included in the study may have prostate diseases other than cancer, which calls for multi-category classifiers. Moreover, measurement errors will likely arise from the co-registration of measures from different platforms, which may affect the accuracy of the classifier. Future work will continue this study, taking all of these issues into consideration.

Appendix

The lemma shows how the mapping s is associated with the corresponding kernel function K. Thus, given a specific form of the kernel function and an adaptive scaling function c(x), we have Theorems 1 and 2.
Since y_i ∈ {1, −1}, we have y_i y_j ∈ {1, −1} for all (i, j), taking the value 1 with probability π_+^2 + π_−^2 and the value −1 with probability 2π_+π_−, assuming independence between y_i and y_j, where π_+ = Pr(y_i = 1) and π_− = Pr(y_i = −1). Furthermore, it is easy to check that 0 ≤ E(y_i y_j) = π_+^2 + π_−^2 − 2π_+π_− = (π_+ − π_−)^2 ≤ 1.

Now, let Λ_n(u) = nq^{−1}·[L_1(β + √(q/n)·u) − L_1(β)], where (·)^2 denotes the component-wise square. Since exp(x) > x + 1 for all x and α_i ≥ 0 for all i, the first term in (A7) is bounded below accordingly. Taking the standard Taylor expansion with respect to u yields the second term in (A7). Combining (A8) and (A10) gives (A11). Note that the first part of (A11) vanishes, since ∂L_1(β)/∂β = 0 by the necessary condition that β = arg min L_1(β), and the second term, which is clearly non-negative, dominates (A11).

In terms of the penalty term, P_n(β) = nq^{−1} ∑_{j=1}^{p} [p_{λ_n}(|β_j + √(q/n)·u_j|) − p_{λ_n}(|β_j|)]. Using p_{λ_n}(0) = 0 and p_{λ_n}(·) ≥ 0, this is at least ∑_{j=1}^{q} nq^{−1}·[p_{λ_n}(|β_j + √(q/n)·u_j|) − p_{λ_n}(|β_j|)], which, by a Taylor expansion, is bounded by q^{−1/2}·max{|p′_{λ_n}(β_j)| : β_j ≠ 0}·‖u‖. Thus, by choosing a sufficiently large ∆, P_n(β) is also dominated by the second term in (A11). Hence L(β) = Λ_n(u) + P_n(β) is dominated by a non-negative term with probability tending to 1 within a ball. This indicates that, with probability at least 1 − ε, there exists a local minimum in the ball {β + √(q/n)·u : ‖u‖ ≤ ∆}, and hence there exists a local minimizer β̂ such that ‖β̂ − β‖ = O_p(√(q/n)). Note that when the kernel function K is updated by K̃, nothing changes except that the kernel is multiplied by two finite constants constructed from the first stage of the SVM, and hence the theorem still holds. This completes the proof.
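The expectation identity used at the start of the proof can be written out in full: since π_+ + π_− = 1 and the labels are assumed independent,

```latex
\begin{align*}
\mathbb{E}(y_i y_j)
  &= (+1)\,\Pr(y_i y_j = 1) + (-1)\,\Pr(y_i y_j = -1) \\
  &= \bigl(\pi_+^2 + \pi_-^2\bigr) - 2\pi_+\pi_- \\
  &= (\pi_+ - \pi_-)^2 \in [0, 1],
\end{align*}
```

which is maximized at 1 when the data are completely imbalanced (π_+ ∈ {0, 1}) and equals 0 when the classes are perfectly balanced (π_+ = π_− = 1/2).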