Abstract
Distance weighted discrimination (DWD) is an appealing classification method that is capable of overcoming data piling problems in high-dimensional settings. Variable selection in multicategory classification poses great challenges, especially when various sparsity structures are assumed in these settings. In this paper, we propose a multicategory generalized DWD (MgDWD) method that maintains intrinsic variable group structures during selection using a sparse group lasso penalty. Theoretically, we derive minimizer uniqueness for the penalized MgDWD loss function and consistency properties for the proposed classifier. We further develop an efficient algorithm based on the proximal operator to solve the optimization problem. The performance of MgDWD is evaluated using finite sample simulations and miRNA data from an HIV study.
1. Introduction
Classification problems appear in diverse practical applications, such as spam e-mail classification, disease diagnosis and drug discovery, among many others (e.g., [,,]). In these classification problems, the goal is to predict class labels based on a given set of variables. Recent research has focused extensively on linear classification: see [,] for comprehensive introductions. Among many linear classification methods, support vector machines (SVMs) (see [,]) and distance-weighted discrimination (DWD) (see [,,]) are two commonly used large-margin based classification methods.
Owing to the recent advent of new technologies for data acquisition and storage, classification with high dimensional features, i.e., a large number of variables, has become a ubiquitous problem in both theoretical and applied scientific studies. Typically, only a small number of instances are available, a setting we refer to as high-dimensional, low-sample size (HDLSS), as in []. In the HDLSS setting, a so-called “data-piling” phenomenon is observed in [] for SVMs, occurring when projections of many training instances onto a vector normal to the separating hyperplane are nearly identical, suggesting severe overfitting. DWD was originally proposed to overcome data-piling in the HDLSS setting. In binary classification problems, linear SVMs seek a hyperplane maximizing the smallest margin for all data points, while DWD seeks a hyperplane minimizing the sum of inverse margins over all data points. Reference [] suggests replacing the inverse margins by the q-th power of the inverse margins in a generalized DWD method; see [] for a detailed description. Formally, for a training data set of N observations, where and , binary generalized linear DWD seeks a proper separating hyperplane through the optimization problem
where a and are the intercept and slope parameters, respectively. The slack variable is introduced to ensure that the corresponding margin is non-negative and the constant is a tuning parameter to control the overlap between classes. Problem (1) can also be written in a loss-plus-penalty form (e.g., []) as
where
with , and . When , (1) becomes the standard DWD problem in [] while problem (2) appears in [,].
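As a minimal sketch of the loss in (3), the snippet below assumes the generalized DWD form commonly used in the literature: V_q(u) = 1 − u for u ≤ q/(q+1), and V_q(u) = q^q / ((q+1)^(q+1) u^q) otherwise. The function name `gdwd_loss` is hypothetical and is introduced only for illustration.

```python
import numpy as np

def gdwd_loss(u, q=1.0):
    """Generalized DWD loss V_q(u), evaluated element-wise (assumed form)."""
    u = np.asarray(u, dtype=float)
    threshold = q / (q + 1.0)
    const = q**q / (q + 1.0) ** (q + 1.0)
    # Clip before dividing so the branch that is not selected never sees u <= 0.
    safe_u = np.maximum(u, threshold)
    return np.where(u <= threshold, 1.0 - u, const / safe_u**q)
```

Under this assumed form, the two branches meet at u = q/(q+1), so the loss is continuous and non-increasing in the margin u; q = 1 gives the standard DWD loss.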
The binary classification problem (1) is well studied. However, in many applications such as image classification [], cancer diagnosis [] and speech recognition [], to name a few, problems with more than two categories are commonplace. To solve these multicategory problems with the DWD classifier, existing approaches build on either formulation (1) or (2). One common strategy is to extend problem (1) to multiple classes by solving a series of binary problems in a one-versus-one (OVO) or one-versus-rest (OVR) method (e.g., []). Instead of reducing the multicategory problem to a binary one, another strategy based on problem (1) considers all classes at once. As shown in [], this approach generally works better than the OVO and OVR methods. Based on an extension of problem (2), [] proposes multicategory DWD, written in a loss-plus-penalty form as
with and where and are the intercept and slope parameters for each category k, respectively. Although these methods can be applied to multicategory classification in the HDLSS setting, both problems (2) and (4) use the ℓ2 penalty and do not perform feature selection. As discussed in [], for high-dimensional classification, taking all features into consideration does not work well for two reasons. First, based on prior knowledge, only a small number of variables are relevant to the classification problem: a good classifier in high dimensions should have the ability to sparsely select important variables and discard redundant ones. Second, classifiers using all available variables in high-dimensional settings may have poor classification performance.
Much of the SVM literature has considered variable selection in high-dimensional classification problems to improve performance (e.g., [,,]). Among the DWD literature, to the best of our knowledge, only [] considered variable selection and classification simultaneously. Wang and Zou [] considered an ℓ1 rather than an ℓ2 penalty in problem (2) to improve interpretability through sparsity in binary classification. Moreover, [] made selections based on the strengths of input variables within individual classes but ignored the strengths of input variable groupings, thereby selecting more factors than necessary for each class. To overcome this weakness, in this paper we develop a multicategory generalized DWD method that is capable of performing variable selection and classification simultaneously. Our approach incorporates sparsity and group structure information via the sparse group lasso penalty (see [,,,,]).
Although DWD is well studied, it is less popular than the SVM for binary classification, arguably for computational and theoretical reasons. For an up-to-date list of works on DWD mostly focused on the case, see [,]. Theoretical asymptotic properties of large-margin classifiers in high dimensional settings were studied in [], and [] derived an expression for asymptotic generalization error. In terms of computation, [] solved the standard DWD problem in (1) as a second-order cone programming (SOCP) problem using a primal-dual interior-point method that is computationally expensive when N or p is large. To overcome computational bottlenecks, [] proposed an approach based on a novel formulation of the primal DWD model in (1): this method, proposed in [], does not scale to large data sets and requires further work. Lam et al. [] designed a new algorithm for large DWD problems with and based on convergent multi-block ADMM-type methods (see []). Wang and Zou [] solved the lasso-penalized binary DWD problem by combining majorization–minimization and coordinate descent methods: the lasso penalty does not directly permit a SOCP solution. In fact, solution identifiability in the generalized DWD problem with requires more constraints and remains an open research problem (see []). To the best of our knowledge, no work focusing on computational aspects of lasso penalized multicategory generalized DWD (MgDWD) exists. The same holds for sparse group lasso-penalized MgDWD.
The theoretical and computational contributions of this paper are as follows. First, we establish the uniqueness of the minimizer in the population form of the MgDWD problem. Second, we prove a non-asymptotic estimation error bound for the sparse group lasso-regularized MgDWD loss function in the ultra-high dimensional setting under mild regularity conditions. Third, we develop a fast, efficient algorithm able to solve the sparse group lasso-penalized MgDWD problem using proximal methods.
The rest of this paper is organized as follows. In Section 2.1, we introduce the MgDWD problem with sparse group lasso penalty. In Section 2.2 and Section 2.3, we establish theoretical properties of the population classifier and regularized empirical loss. We propose a computational algorithm in Section 2.4. Section 3 illustrates the finite sample performance of our method through simulation studies and a real data analysis. Proofs of the major theorems are given in Appendix A.
2. Methodology
2.1. Model Setup
We begin with some basic set-up and notation. Consider the multicategory classification problem for a random sample of N independent and identically distributed (i.i.d.) observations from some distribution . Here, y is the categorical response taking values in , and is the covariate vector. We wish to obtain a proper separating hyperplane for each category , where and are intercept and slope parameters, respectively.
In this paper, we consider MgDWD with sparse group lasso regularization. That is, we estimate a classification boundary by solving the constrained optimization problem
where is as defined in (3).
To approach this problem, we apply the concept of a “margin vector” to extend the definition of a (binary) margin to the multicategory case. Denote the margin vector of an observation as , with satisfying . Let be the class indicator vector with . The multicategory margin of the data point is then given as . Therefore, the MgDWD loss can be rewritten as
Based on (6), Lemma 1 describes the Fisher consistency of the MgDWD loss.
Lemma 1.
Consequently, can be treated as an effective proxy of and, for any new observation , a reasonable prediction of its label is
Speaking to the sparse group lasso (SGL) regularization in (5), the penalty encourages an element-wise sparse estimator that selects important variables for each category, indicated by . Assuming that parameters in different categories share the same information, we use an penalty to encourage a group-wise sparsity structure that removes covariates that are irrelevant across all categories, that is, where . Specifically, let and , where the k-th column is the slope vector for the category label k and the j-th row is the group coefficient for the variable . If is noise in the classification problem or is not relevant to category label k, then the entry of should be shrunk to exactly zero. The SGL penalty of (5) can be written as a convex combination of the lasso and group lasso penalties in terms of as
where is the scale of the penalty and tunes the propensity between the element-wise and group-wise sparsity structure.
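The SGL penalty described above can be illustrated with a short sketch. It assumes that the coefficient matrix `B` has one row per variable and one column per category, as in the text, and that τ weights the element-wise (lasso) term while 1 − τ weights the group-wise term; this assignment of τ, along with the names `sgl_penalty`, `lam` and `tau`, is an assumption made for illustration.

```python
import numpy as np

def sgl_penalty(B, lam, tau):
    """Sparse group lasso penalty for a (p x K) coefficient matrix B.

    Rows of B are the variable-wise groups; columns are the categories.
    Assumed weighting: tau on the lasso term, (1 - tau) on the group term.
    """
    B = np.asarray(B, dtype=float)
    lasso_term = np.abs(B).sum()                    # element-wise sparsity
    group_term = np.linalg.norm(B, axis=1).sum()    # sum of row-wise Euclidean norms
    return lam * (tau * lasso_term + (1.0 - tau) * group_term)
```

With `tau = 1` the penalty reduces to the lasso, and with `tau = 0` it reduces to the group lasso over variables.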
2.2. Population MgDWD
In this subsection, some basic results pertaining to unpenalized population MgDWD are given. These results are necessary for further theoretical analysis.
Denote the marginal probability mass of y as with and , and the conditional probability density functions of given by . Let be the collection of coefficient vectors for all labels and . The population version of the MgDWD problem in (6) is
where is the vectorization of the matrix and is a random vector. Denote the true parameter value as a minimizer of the population MgDWD problem, namely,
where is the set of sum-constrained with , where ⊗ denotes the Kronecker product.
To facilitate our theoretical analysis, we first define the gradient vector and Hessian matrix of the population MgDWD loss function. We then introduce some regularity conditions necessary to derive theoretical properties of this problem. Let be a diagonal matrix constructed from the vector , and let ∘ and ⊕ be the Hadamard product and the direct matrix sum, respectively. Denote the gradient vector of the population MgDWD loss function (8) as
with
and its Hessian matrix as
where denotes the second derivative of the function ; is a random vector with and ; and
The block structure of implies a parallel relationship between each category. The relationship between the is reflected by the sum-to-zero constraint in the definition of .
We assume the following regularity conditions.
(C1) The densities of given , i.e., the , are continuous and have finite second moments.
(C2) for all , where .
(C3) for all .
Remark 1.
Condition (C1) ensures that , and are well defined and continuous in . For the theoretically optimal hyperplane , the case with leaves useless for classification. On the other hand, when and , the hyperplane is the empty set and is similarly meaningless. Condition (C2) is proposed to avoid the case where so that always contains information relevant to the classification problem. For bounded random variables, condition (C2) should be assumed with caution. Condition (C3) implies the positive definiteness of .
By convexity and the second-order Lagrange condition, the following theorem shows that the local minimizer of the population MgDWD problem exists and is unique.
Theorem 1.
Under the regularity conditions (C1)–(C3), the true parameter is the unique minimizer of with , and
with , where
The bounds in Theorem 1 show how q affects the loss function . The upper bound is a decreasing function of q with
In the lower bound , the first term is an increasing function of q and the last term is a decreasing function of q, with
Consequently, for the given population , a larger q encourages the population MgDWD estimator to focus more on the regions that correspond to misclassifications. As a result, the estimator’s performance will be similar to that of the hinge loss as q → ∞. Setting q too small will lead to an ineffective classifier due to the unreasonable penalty placed on the well classified region . This variation in the lower bound with respect to q provides a necessary condition for the existence of an optimal q.
Remark 2.
The explicit relationship between q and is complicated. While it may be more desirable to prove that a greater value of q results in a smaller value of the loss function , there is no explicit formula for the optimal value in terms of q.
2.3. Estimator Consistency
Under the unpenalized framework presented in the previous subsection, all covariates will contribute to the classification task for each category: this scenario may lead to a classifier that overfits to the training data set. In this subsection, we study the consistency of the estimator for (5) in ultra-high dimensional settings.
To achieve structural sparsity in the estimator, the regularization parameter λ in (7) must be large enough to dominate the gradient of the empirical MgDWD loss evaluated at the theoretical minimizer with high probability. On the other hand, λ should also be as small as possible to reduce the bias incurred by the SGL regularization term.
Lemma 2 provides a suitable choice of under the following assumption.
(A1) The predictors are independent sub-Gaussian random vectors satisfying , and where , there exists a constant such that for any , . From here on, we define as the largest eigenvalue of .
Lemma 2.
Denote , where , with , and is the identity matrix of size K. Under condition (A1),
with probability at least , where
for constants .
It is difficult to obtain a closed form for the conjugate of the SGL penalty, say, . Instead, we use a regularized upper bound . Based on Lemma 2, we propose a theoretical tuning parameter value
where is some given constant satisfying .
Before we can derive an error bound for the estimator in (5), we impose two additional assumptions.
(A2) For the true parameter value , there is a -sparse structure in the coefficients with element-wise and group-wise support sets
with cardinality and , respectively.
(A3) There exist some positive constants and such that
with and
where , is the complement of , , and .
Theorem 2.
Suppose that conditions (A1)–(A3) hold. Then with in (5), we have that
with probability at least , where and .
Remark 3.
The sub-Gaussian distribution assumption (A1) is common in high-dimensional scenarios. This assumption characterizes the tail behavior of a collection of random variables including Gaussian, Bernoulli, and bounded variables as special cases. Assumption (A2) describes structural sparsity at two levels. The element-wise size is the size of the underlying generative model, and the group-wise size is the size of the signal covariate set. Both and are allowed to depend on the sample size N. As a result, the dimension p is allowed to increase with the sample size N. Assumption (A3) guarantees that eigenvalues are positive in this sparse scenario.
Remark 4.
In practice, the tuning parameters λ and τ in (7) are commonly chosen by M-fold cross validation. That is, we choose the pair with the highest prediction accuracy among the sub-data sets , specifically,
where .
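The cross validation procedure of Remark 4 can be sketched generically. The callables `fit(X, y, lam, tau)` and `predict(model, X)` are hypothetical placeholders for any SGL-MgDWD solver and its prediction rule; they are not part of the method as published, and the fold assignment by random permutation is likewise an assumption.

```python
import numpy as np
from itertools import product

def cross_validate(X, y, fit, predict, lambdas, taus, n_folds=5, seed=0):
    """Pick the (lam, tau) pair with the highest M-fold held-out accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    all_idx = np.arange(len(y))
    best_pair, best_acc = None, -np.inf
    for lam, tau in product(lambdas, taus):
        correct = 0
        for held_out in folds:
            train = np.setdiff1d(all_idx, held_out)
            model = fit(X[train], y[train], lam, tau)
            correct += int(np.sum(predict(model, X[held_out]) == y[held_out]))
        acc = correct / len(y)
        if acc > best_acc:  # ties keep the pair that was tried first
            best_pair, best_acc = (lam, tau), acc
    return best_pair, best_acc
```

Setting `n_folds` equal to the sample size gives leave-one-out cross validation.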
2.4. Computational Algorithm
In this section, we propose an efficient algorithm to solve problem (5). Our approach uses the proximal algorithm (see []) for solving high dimensional regularization problems. In two main steps, this approach obtains a solution to the constrained optimization problem by applying the proximal operator to the solution to the unconstrained problem.
Since regularization is not needed for the intercept terms , it can be separated from the coefficients in . The empirical MgDWD loss of (8) is given by
where . Various properties of the loss function follow below.
Lemma 3.
The loss function has Lipschitz continuous partial derivatives. In particular, for and any , we have that
where is the largest group sample size. For and any , we have that
where is the k-th column of and indicates the observations belonging to the k-th group.
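The empirical loss and the partial derivatives that Lemma 3 concerns can be sketched numerically. The snippet assumes that the empirical MgDWD loss is the average of V_q evaluated at each observation's own-class score, a_k + x'β_k with k = y_i, and that V_q has the generalized DWD form with threshold q/(q+1). Labels are coded 0, ..., K − 1, and all function names are hypothetical.

```python
import numpy as np

def gdwd_value_and_slope(u, q=1.0):
    """Return V_q(u) and its derivative V_q'(u), element-wise (assumed form)."""
    u = np.asarray(u, dtype=float)
    thr = q / (q + 1.0)
    safe_u = np.maximum(u, thr)  # keeps the unselected branch away from u <= 0
    value = np.where(u <= thr, 1.0 - u, q**q / (q + 1.0) ** (q + 1.0) / safe_u**q)
    slope = np.where(u <= thr, -1.0, -(thr ** (q + 1.0)) / safe_u ** (q + 1.0))
    return value, slope

def empirical_loss_and_gradient(a, B, X, y, q=1.0):
    """a: (K,) intercepts, B: (p, K) slopes, X: (N, p), y: (N,) labels in 0..K-1."""
    N, K = X.shape[0], a.shape[0]
    u = (X @ B + a)[np.arange(N), y]            # own-class score of each observation
    value, slope = gdwd_value_and_slope(u, q)
    weights = np.eye(K)[y] * slope[:, None] / N  # nonzero only in column y_i
    grad_a = weights.sum(axis=0)                 # derivative w.r.t. each intercept
    grad_B = X.T @ weights                       # derivative w.r.t. each slope column
    return value.mean(), grad_a, grad_B
```

Under these assumptions, the k-th column of the gradient involves only the observations in class k, and the slope of V_q is bounded and Lipschitz continuous, which is the type of property that a quadratic majorization relies on.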
Hence, following the majorization–minimization scheme, we can majorize the empirical MgDWD loss by a quadratic function, that is,
for some , where and denote the Lipschitz constants in Lemma 3. Instead of minimizing directly, we apply gradient descent to minimize its surrogate upper bound function. The gradient descent updates are given by
Next, we address the problem’s constraints and regularization simultaneously by applying the proximal operator. For , it is clear that
where . For , the minimization problem can be expressed as
which implies that we can implement minimization for p groups in parallel. The following theorem provides the solution to (13).
Theorem 3.
Let and . Then the constrained regularization problem
has a solution of the form
for some .
In the special case with , the constrained regularization problem in Theorem 3 reduces to the constrained lasso problem with solution . Combined with (14), the proximal operator , given by
can be introduced to realize the group sparsity of .
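To illustrate how a proximal operator produces group sparsity under the sum-to-zero constraint, the sketch below treats only the special case without the element-wise lasso term. In that case, minimizing ½‖b − v‖² + t‖b‖₂ subject to the entries of b summing to zero can be solved by centering v and then applying group soft-thresholding, because the scaling step keeps the sum at zero. The function name `constrained_group_prox` is hypothetical, and this is a sketch of the special case rather than the general operator in (15).

```python
import numpy as np

def constrained_group_prox(v, t):
    """Minimize 0.5*||b - v||^2 + t*||b||_2 subject to sum(b) = 0."""
    v = np.asarray(v, dtype=float)
    centered = v - v.mean()              # projection onto {b : sum(b) = 0}
    norm = np.linalg.norm(centered)
    if norm <= t:
        return np.zeros_like(centered)   # the whole group is shrunk to zero
    return (1.0 - t / norm) * centered   # group soft-thresholding
```

A whole row of coefficients is set exactly to zero whenever the norm of its centered version does not exceed the threshold, which is how a variable is removed across all categories at once.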
For the standard lasso problem, the subgradient has a closed form given by , with . However, under the constraint on , the naive solution is misleading in that it satisfies the constraint but does not achieve shrinkage, let alone loss function minimization. The term is suggestive of the intersection between the subdifferential set and the constraint set ; in this sense, might not have a closed form. Here we consider using coordinate descent to solve the constrained lasso problem. For some fixed coordinate m, since , we have that . Rewriting the objective function of the lasso-constrained problem in a coordinate-wise form, we obtain
Next, Theorem 4 provides the solution to the optimization problem (16).
Theorem 4.
Suppose that and . Then the regularization problem
has solution
where .
By Theorem 4, given some , the coordinate-wise minimizer for any can be expressed as the proximal operator
with and . If we fix m during iteration, then the shrinkage of will be indirectly reflected in the other . We propose that m change with k in the coordinate-wise minimization process to ensure that every coordinate can be equally shrunk. We summarize our proposed algorithm in Algorithm 1.
| Algorithm 1: Proximal gradient descent algorithm for SGL-MgDWD. |
| Input: |
| Initialization:, , . |
| Output: and . |
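The overall flow of a proximal gradient scheme of this kind can be sketched for a simplified special case. The code below is not the authors' Algorithm 1: it assumes the generalized DWD loss with threshold q/(q+1), keeps only the group lasso term (no element-wise lasso term), uses a single conservative step size derived from the bound V_q'' ≤ (q+1)²/q instead of the constants in Lemma 3, and enforces the sum-to-zero constraints by centering. All names are hypothetical.

```python
import numpy as np

def dwd_loss(u, q):
    thr = q / (q + 1.0)
    safe = np.maximum(u, thr)
    return np.where(u <= thr, 1.0 - u, q**q / (q + 1.0) ** (q + 1.0) / safe**q)

def dwd_slope(u, q):
    thr = q / (q + 1.0)
    safe = np.maximum(u, thr)
    return np.where(u <= thr, -1.0, -(thr ** (q + 1.0)) / safe ** (q + 1.0))

def objective(a, B, X, y, lam, q):
    u = (X @ B + a)[np.arange(len(y)), y]
    return dwd_loss(u, q).mean() + lam * np.linalg.norm(B, axis=1).sum()

def fit_group_mgdwd(X, y, K, lam, q=1.0, n_iter=300):
    """Proximal gradient descent for group-lasso-penalized MgDWD (sketch)."""
    N, p = X.shape
    a, B = np.zeros(K), np.zeros((p, K))
    X1 = np.hstack([np.ones((N, 1)), X])
    # Conservative step 1/L, with L = max V_q'' * ||[1, X]||_2^2 / N.
    step = q * N / ((q + 1.0) ** 2 * np.linalg.norm(X1, 2) ** 2)
    onehot = np.eye(K)[y]
    history = [objective(a, B, X, y, lam, q)]
    for _ in range(n_iter):
        u = (X @ B + a)[np.arange(N), y]
        W = onehot * dwd_slope(u, q)[:, None] / N
        # Gradient step, then centering to restore the sum-to-zero constraint.
        a = a - step * W.sum(axis=0)
        a = a - a.mean()
        V = B - step * (X.T @ W)
        V = V - V.mean(axis=1, keepdims=True)
        # Row-wise group soft-thresholding (scaling preserves zero row sums).
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        B = np.maximum(1.0 - step * lam / np.maximum(norms, 1e-12), 0.0) * V
        history.append(objective(a, B, X, y, lam, q))
    return a, B, history
```

Predicted labels can then be taken as `np.argmax(X_new @ B + a, axis=1)`. Handling the element-wise lasso term together with the sum-to-zero constraint requires the coordinate-wise treatment discussed in the text and is not covered by this sketch.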
3. Numerical Analysis
In this section, we use both simulated and real data sets to evaluate the finite sample properties of our proposed method. We compare the finite sample performance of SGL-MgDWD with ℓ1-regularized multinomial logistic regression (ℓ1-logistic).
3.1. Simulation Studies
The data is generated from the following model. Consider the K-category classification problem where and is the density function of a normal distribution with mean vector and covariance matrix , where with , for . In this model, only the first two variables contribute to the classification and their corresponding parameter vectors and form two groups of coefficients. The true model has the sparsity structure for a total of coefficients. We set the sample size for each category to and 400, and the number of classes to and 11. We consider dimensionality and 1000.
In what follows, we compare the proposed SGL-MgDWD method with the OVR method based on SGL-MgDWD with (OVR-SGL-gDWD). For SGL-MgDWD, logistic regression and OVR, the tuning parameter is optimized over a discrete set by minimizing the prediction error using 5-fold cross validation. In each simulation, we conduct 100 runs and use a testing set of equal size to evaluate each method’s performance using the following criteria:
- Testing set accuracy, measuring the rate of correct classification;
- Signal, as the average number of correctly-selected element-wise and group-wise signals, that is, with and , respectively, denoted by the pair ;
- Noise, as the average number of incorrectly-selected element-wise and group-wise components, that is, with and , respectively, denoted by the pair .
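The signal and noise criteria above can be computed for a single run as in the following sketch. It assumes that the estimated and true coefficient matrices share the layout used in the text (rows are variables, columns are categories) and that a group counts as selected when any entry in its row is nonzero; the function name `selection_counts` is hypothetical.

```python
import numpy as np

def selection_counts(B_hat, B_true, tol=0.0):
    """Return ((elementwise, groupwise) signal, (elementwise, groupwise) noise)."""
    est = np.abs(np.asarray(B_hat, dtype=float)) > tol
    true = np.abs(np.asarray(B_true, dtype=float)) > tol
    est_group, true_group = est.any(axis=1), true.any(axis=1)
    signal = (int(np.sum(est & true)), int(np.sum(est_group & true_group)))
    noise = (int(np.sum(est & ~true)), int(np.sum(est_group & ~true_group)))
    return signal, noise
```

Averaging these pairs over repeated runs gives the reported signal and noise summaries.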
Table 1.
Simulation results for the SGL-MgDWD, ℓ1-logistic, and OVR methods with . Time is measured relative to a baseline logistic regression model with , , and . Numbers in parentheses denote standard deviations.
Table 2.
Simulation results for the SGL-MgDWD, ℓ1-logistic, and OVR methods with . Time is measured relative to a baseline logistic regression model with , , and . Numbers in parentheses denote standard deviations.
As shown in Table 1 and Table 2, the proposed SGL-MgDWD method performs better than the ℓ1-logistic and OVR methods. Specifically, in each scenario, predictions from the SGL-MgDWD method had higher accuracy than those from the other two methods. Similarly, the SGL-MgDWD method correctly selected the signal components of the model with fewer incorrectly selected noise components, again relative to the ℓ1-logistic and OVR methods. These simulation results also demonstrate that test accuracy increases with increasing sample size and that test accuracy decreases with higher dimension p at fixed . This is consistent with the derived theoretical properties. All computations were performed using TensorFlow 2.3 on a CPU (Threadripper 2950X at 4.1 GHz).
3.2. HIV Data Analysis
Symptomatic distal sensory polyneuropathy (sDSP) is a common debilitating condition among people with HIV. This condition leads to neuropathic pain and is associated with substantial comorbidities and increased health care costs. Plasma miRNA profiles show differences between HIV patients with and without sDSP, and several miRNA biomarkers are reported to be associated with the presence of sDSP in HIV patients (see []). The corresponding binary classification problem was analyzed in [] using random forest classifiers. However, the HIV data set can be further classified into four classes. The HIV data set has 1715 miRNA measures for 40 patients and is partitioned into four groups () with patients in each category: non-HIV, HIV with no brain damage (HIVNBD), HIV with brain damage but stable (HIVBDS) and HIV with brain damage and unstable (HIVBDU). In the following analysis, we apply our proposed method to this classification problem. The primary aim is to identify critical miRNA biomarkers for each of the four groups. Beyond achieving a finer classification, this analysis is helpful in assessing related pathogenic effects for each patient group.
Given the small sample size of , we chose the tuning parameter by maximizing leave-one-out cross validation accuracy. We fixed . Table 3 shows the signal for coefficient estimates obtained from the SGL-MgDWD method using the selected . We conclude that there are 22 critical miRNA biomarkers important to the classification problem. In particular, the biomarkers miR-25-star, miR-3171, miR-3924 and miR-4307 are not relevant to the non-HIV group; miR-4641, miR-4655-3p and miR-660 are not relevant to the HIVNBD group; miR-217 and miR-4683 are not relevant to the HIVBDS group; and miR-217 and miR-4307 are not relevant to the HIVBDU group.
Table 3.
Signal for the coefficient estimates obtained from the SGL-MgDWD method with for the HIV data set. The symbols “+” and “-” denote positive and negative coefficient estimates, respectively, while “0” denotes a zero coefficient (i.e., an irrelevant variable).
Author Contributions
Conceptualization, L.K. and N.T.; Methodology, T.S., L.K. and N.T.; Formal Analysis, Y.L.; Data Curation, W.G.B., E.A. and C.P.; Writing—Review & Editing, Y.W., B.J. and L.K.; Supervision, B.J., L.K. and N.T. All authors have read and agreed to the published version of the manuscript.
Funding
A Canadian Institutes of Health Research Team Grant and Canadian HIV-Ageing Multidisciplinary Programmatic Strategy (CHAMPS) in NeuroHIV (Christopher Power) supported these studies. Bei Jiang and Linglong Kong were supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). Christopher Power and Linglong Kong were supported by Canada Research Chairs in Neurological Infection and Immunity and Statistical Learning, respectively. Niansheng Tang was supported by grants from the National Natural Science Foundation of China (grant number: 11671349) and the Key Projects of the National Natural Science Foundation of China (grant number: 11731011).
Acknowledgments
The authors are thankful for the invitation of the two guest editors, Farouk Nathoo and Ejaz Ahmed. This work has also benefited from two anonymous reviewers’ constructive comments and valuable feedback. The authors also thank Matthew Pietrosanu for his great help with editing.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Proofs
Appendix A.1. Proof of Lemma 1
Proof.
For simplicity, we write and . Using the Lagrange multiplier method, we define
Then for each k,
with
Without loss of generality, assume that . Note that , and so and if and only if .
If , then when , which implies that for all . Hence, substituting into (A1) yields
However, , contradicting the sum-to-zero constraint. Therefore, for and the result follows. □
Appendix A.2. Proof of Theorem 1
Lemma A1.
Under (C1), exists, and it is convex on .
Proof.
The existence of will be satisfied if
We divide into two disjoint subsets. Defining , it is clear that
Note that when . On the other hand, for ,
if for all . This completes the proof of the existence of .
Recall that
where is a convex function of u, so its composition with the affine mapping is still convex in . Clearly, , so the non-negatively-weighted integral and sum both preserve convexity. □
Lemma A2.
A minimizer of exists on , where .
Proof.
By Jensen’s inequality, for any , we have that
Let , where . For some , we have that
Note that when . By the Cauchy–Schwarz inequality, .
Hence, if , then . The contrapositive of this result implies the existence of a minimizer in the unconstrained problem. That is, the closed set is bounded for some large enough C. This guarantees the existence of a solution, as desired. □
Lemma A3.
Under (C1), exists and
Proof.
The existence of will follow if
for . Note that when .
For every , is a Lebesgue integrable function of . For any , exists and . Hence, by the Leibniz integral rule, we have that
and for any ,
which is sufficient to show that
□
Lemma A4.
Suppose (C1) is satisfied. Then (C2) implies that .
Proof.
We can rewrite as
Then for any and its corresponding , we have that
Let be a local minimizer. It follows that and since and . Therefore,
For any and its corresponding , we always have that
If , then so that and . If and , then , giving the same conclusions as the previous case. If and , then so that and . Consequently, when , then neither nor equal ⌀, so follows.
Note that implies that or , and so special attention should be paid to bounded random variables. □
Lemma A5.
Under (C1), exists and
Furthermore, when (C2) and (C3) hold.
Proof.
The existence of follows if all of its entries are absolutely integrable, that is, for any ,
Equivalently, the result follows if for all . Note that when .
Let be a test function belonging to the Schwartz space . Then with some support denoted by .
Clearly, is not differentiable at Q but is Lipschitz continuous. Therefore, the measurable function is a locally integrable function of . Then the (regular) generalized functions belong to the dual space of .
For the distributional derivative of with respect to , we have that
implying that the function is integrable on . Therefore, by Fubini’s Theorem,
which implies that
Recall that can be written as
which contains a Schwartz product between the differentiable function and the generalized function . Note that
where and
where is the Dirac delta function and the distributional derivative of . Recall that and for some constant c and function f.
Thus, by the product rule for the distributional derivative of the Schwartz product,
Substituting the above expression, we obtain
Similarly, for , we have the distributional derivative
Recall that the distributional derivative does not depend on the order of differentiation and agrees with the classical derivative whenever the latter exists. To summarize, we have that
The are symmetric matrices, so is also symmetric.
In the sense of generalized functions, differentiation is a continuous operation with respect to convergence in . Therefore, and ; and , which coincides with results from the hinge loss.
Next, if and only if and its Schur complement are both symmetric and positive definite. We can deduce that if and only if for all k.
Note that there exists such that on . Then for any ,
which implies that if and only if when is assumed to be non-singular. Assuming that implies that . □
Proof of Theorem 1
By Lemma A2, a minimizer exists with (by Lemma A4) and (by Lemma A5). By the second-order Lagrange condition and the convexity of (by Lemma A1), a minimizer of the population MgDWD loss is unique.
Recall from (A2) that
It follows that
and
Consequently, .
Note that when and when . The difference between these two results is attributed to pointwise convergence.
Let with and . By Fubini’s theorem and the dominated convergence theorem,
Similarly,
hence
As a result, coincides with the population hinge/SVM loss and is independent of . □
Appendix A.3. Proof of Lemma 2
Proof.
By the definition of ,
where
with , and
Denoting
we have that . Note that the are N i.i.d. random variables with
By Hoeffding’s inequality, we have that
where .
Regarding the , we have that
which implies that the are N independent sub-Gaussian random variables with variance proxy . Taking , we have that
Taking a union bound over the entries of yields that
On one hand,
so for any ,
and . Applying Hoeffding’s lemma,
Applying a square root to Theorem 2.1 of [] with , we have that
On the other hand, since the are N independent sub-Gaussian random variables with variance proxy ,
and . Similarly, we have that
for a constant .
Applying the union bound to (A5), it follows that
and the desired result follows. □
Appendix A.4. Proof of Theorem 2
Lemma A6.
Suppose that . Then , where
, denotes the complement of , , and .
Proof.
Since is the minimizer, we have that
where is the vector without the components, replacing for . Then
By the convexity of L,
Note that
Combining the above results, we have that
□
Lemma A7.
Assume that conditions (A1)–(A3) are satisfied. Then
with probability at most , where
and for any and for some constant .
Proof.
Given any and with ,
where , .
The bounded gradient implies the Lipschitz continuity of so that . Since , we have that
Note that
By Hoeffding’s inequality, we have that
Thus with
Next, we consider covering with -balls such that for any and in the same ball, where is a small positive number. The number of -balls required to cover an m-dimensional unit ball is bounded by . Then for those , we require a covering number of at most . Let denote such an -net. We have that
Furthermore, for any ,
Therefore . Taking , we have that
Setting and , we obtain the desired result that
□
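The covering argument in the proof of Lemma A7 can be illustrated with a small experiment. The sketch below assumes the classical volumetric bound (1 + 2/ε)^m for the Euclidean unit ball, which may differ in its constants from the bound used in the paper; `covering_bound`, `sample_unit_ball` and `greedy_net` are hypothetical helper names introduced only for this illustration.

```python
import math
import random

def covering_bound(m, eps):
    """Classical volumetric bound on the number of eps-balls
    needed to cover the unit ball in R^m (assumed form)."""
    return (1.0 + 2.0 / eps) ** m

def sample_unit_ball(n, m, seed=0):
    """Draw n points uniformly from the Euclidean unit ball by rejection."""
    rng = random.Random(seed)
    pts = []
    while len(pts) < n:
        x = [rng.uniform(-1.0, 1.0) for _ in range(m)]
        if math.hypot(*x) <= 1.0:
            pts.append(x)
    return pts

def greedy_net(points, eps):
    """Greedily keep points that are more than eps away from all kept
    centers. The result is eps-separated and every input point lies
    within eps of some center."""
    net = []
    for p in points:
        if all(math.dist(p, c) > eps for c in net):
            net.append(p)
    return net

if __name__ == "__main__":
    pts = sample_unit_ball(500, 2)
    net = greedy_net(pts, 0.5)
    print("net size:", len(net), "bound:", covering_bound(2, 0.5))
```

Because the greedy centers are ε-separated, their number cannot exceed the volumetric bound. A concentration bound at each center, combined with a union bound over the net and the Lipschitz property, then controls the supremum over the whole ball.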
Proof of Theorem 2.
Consider a disjoint partition on the coordinate set , that is, with . Note that each subvector has at most non-zero coordinates. Denote and so that and . We have the decomposition
By Lemma A7,
with high probability. By Lemma A5, is twice differentiable so that
Consequently, is bounded below by with high probability.
Note that
From (A8),
Clearly, and . We conclude that
after which the desired result follows from straightforward algebraic manipulation. □
Appendix A.5. Proof of Lemma 3
Proof.
Since
we have that
The derivative with respect to is
Thus,
where is the Lipschitz constant of . We have that .
The derivative with respect to is
Therefore, the derivative with respect to is . Note that
and
thus
We conclude that . □
Appendix A.6. Proof of Theorem 3
Lemma A8.
The indicator function
where , has subdifferential
Proof.
Suppose that . Then if and only if both
Let . Then since . Thus, . If , then . If there exists satisfying for some , then , so we must have that . This is a contradiction.
Now, for any , we have that if and only if both
For and , since and , it must be that . □
Proof of Theorem 3.
It is sufficient to minimize the objective function
where . Then the subdifferential of is
For an optimal solution , we have that if and only if there exist , and such that . Since , we have that , so
If , then for , and
If , then , and
Note that . Taking the norm of both sides, we see that
Substituting this result back into the case, we have that
Combining the above two cases gives the desired result. □
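The step of taking norms on both sides of the optimality condition is the same device that yields the well-known group soft-thresholding operator. The following is a minimal sketch for the unconstrained problem of minimizing ½‖x − v‖² + λ‖x‖₂; it deliberately ignores the constraint encoded by the indicator function in Lemma A8, so it should not be read as the operator of Theorem 3 itself. The names `group_soft_threshold` and `optimality_residual` are illustrative.

```python
import math

def group_soft_threshold(v, lam):
    """Minimizer of 0.5 * ||x - v||^2 + lam * ||x||_2 (unconstrained case).
    Returns the zero vector when ||v||_2 <= lam, otherwise shrinks v
    towards zero by the factor (1 - lam / ||v||_2)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= lam:
        return [0.0] * len(v)
    scale = 1.0 - lam / norm
    return [scale * x for x in v]

def optimality_residual(x, v, lam):
    """Stationarity residual x - v + lam * x / ||x||_2 for a nonzero x.
    It vanishes at the minimizer."""
    norm = math.sqrt(sum(c * c for c in x))
    return [xi - vi + lam * xi / norm for xi, vi in zip(x, v)]

if __name__ == "__main__":
    v, lam = [3.0, 4.0], 1.0
    x = group_soft_threshold(v, lam)
    print("x =", x)
    print("residual =", optimality_residual(x, v, lam))
```

The two branches mirror the two cases in the proof: the solution is either exactly zero, when the input is small relative to the penalty level, or a rescaled copy of the input whose norm is obtained by taking norms in the stationarity equation.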
Appendix A.7. Proof of Theorem 4
Proof.
Denote the objective function by
When , we obtain a lasso problem with
When , the subdifferential of is
We see that if and only if there exist and with
If , then and , hence
If , then and . If , then , and . Note that if , then or .
If , then and , hence
If , then and . If , then and . Note that is equivalent to .
Let . We can summarize the two cases above as
If , then and , thus
If , then or . Thus if . If , then or . Thus if . Rewriting the two cases above, we have that
If , then
Note that and . If , then
Note that and . Rewriting the two cases above, we have that
On the other hand, when , it follows that since . □
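For the lasso special case mentioned at the start of this proof, the sign-based case analysis leads to the familiar soft-thresholding rule. The sketch below assumes the scalar problem of minimizing ½(x − v)² + λ|x| and compares the closed form with a brute-force grid search; it is an illustration of the technique, not a reproduction of the full update in Theorem 4, and the helper names are hypothetical.

```python
def soft_threshold(v, lam):
    """Closed-form minimizer of 0.5 * (x - v)^2 + lam * |x|."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def lasso_objective(x, v, lam):
    """Scalar lasso objective evaluated at x."""
    return 0.5 * (x - v) ** 2 + lam * abs(x)

def grid_argmin(v, lam, lo=-5.0, hi=5.0, steps=20001):
    """Brute-force minimizer over an evenly spaced grid, used as a check."""
    best_x, best_val = lo, lasso_objective(lo, v, lam)
    for i in range(1, steps):
        x = lo + (hi - lo) * i / (steps - 1)
        val = lasso_objective(x, v, lam)
        if val < best_val:
            best_x, best_val = x, val
    return best_x

if __name__ == "__main__":
    for v in (1.7, -0.2, -3.0):
        print(v, soft_threshold(v, 0.4), grid_argmin(v, 0.4))
```

The three branches correspond to a positive solution, a negative solution and a zero solution, which is the same partition by sign that the subdifferential argument above works through.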
References
- Haralick, R.M.; Shanmugam, K.; Dinstein, I.H. Textural features for image classification. IEEE Trans. Syst. Man Cybern. 1973, SMC-3, 610–621.
- Wang, X.; Zhang, H.H.; Wu, Y. Multiclass probability estimation with support vector machines. J. Comput. Graph. Stat. 2019, 28, 586–595.
- Hansen, J.H.; Hasan, T. Speaker recognition by machines and humans: A tutorial review. IEEE Signal Process. Mag. 2015, 32, 74–99.
- Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification; John Wiley & Sons: New York, NY, USA, 2012.
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer Science & Business Media: New York, NY, USA, 2009.
- Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
- Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: Cambridge, UK, 2000.
- Marron, J.S.; Todd, M.J.; Ahn, J. Distance-weighted discrimination. J. Am. Stat. Assoc. 2007, 102, 1267–1271.
- Qiao, X.; Zhang, H.H.; Liu, Y.; Todd, M.J.; Marron, J.S. Weighted distance weighted discrimination and its asymptotic properties. J. Am. Stat. Assoc. 2010, 105, 401–414.
- Marron, J. Distance-weighted discrimination. Wiley Interdiscip. Rev. Comput. Stat. 2015, 7, 109–114.
- Zhang, L.; Lin, X. Some considerations of classification for high dimension low-sample size data. Stat. Methods Med. Res. 2013, 22, 537–550.
- Wang, B.; Zou, H. Another look at distance-weighted discrimination. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2018, 80, 177–198.
- Liu, Y.; Zhang, H.H.; Wu, Y. Hard or soft classification? Large-margin unified machines. J. Am. Stat. Assoc. 2011, 106, 166–177.
- Huang, H.; Liu, Y.; Du, Y.; Perou, C.M.; Hayes, D.N.; Todd, M.J.; Marron, J.S. Multiclass distance-weighted discrimination. J. Comput. Graph. Stat. 2013, 22, 953–969.
- Wang, B.; Zou, H. A multicategory kernel distance weighted discrimination method for multiclass classification. Technometrics 2019, 61, 396–408.
- Wang, B.; Zou, H. Sparse distance weighted discrimination. J. Comput. Graph. Stat. 2016, 25, 826–838.
- Wang, L.; Shen, X. On L1-norm multiclass support vector machines: Methodology and theory. J. Am. Stat. Assoc. 2007, 102, 583–594.
- Zhang, X.; Wu, Y.; Wang, L.; Li, R. Variable selection for support vector machines in moderately high dimensions. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2016, 78, 53–76.
- Peng, B.; Wang, L.; Wu, Y. An error bound for L1-norm support vector machine coefficients in ultra-high dimension. J. Mach. Learn. Res. 2016, 17, 8279–8304.
- Simon, N.; Friedman, J.; Hastie, T.; Tibshirani, R. A sparse-group lasso. J. Comput. Graph. Stat. 2013, 22, 231–245.
- Friedman, J.; Hastie, T.; Tibshirani, R. A note on the group lasso and a sparse group lasso. arXiv 2010, arXiv:1001.0736.
- Cai, T.T.; Zhang, A.; Zhou, Y. Sparse group lasso: Optimal sample complexity, convergence rate, and statistical inference. arXiv 2019, arXiv:1909.09851.
- Yu, D.; Zhang, L.; Mizera, I.; Jiang, B.; Kong, L. Sparse wavelet estimation in quantile regression with multiple functional predictors. Comput. Stat. Data Anal. 2019, 136, 12–29.
- He, Q.; Kong, L.; Wang, Y.; Wang, S.; Chan, T.A.; Holland, E. Regularized quantile regression under heterogeneous sparsity with application to quantitative genetic traits. Comput. Stat. Data Anal. 2016, 95, 222–239.
- Huang, H. Large dimensional analysis of general margin based classification methods. arXiv 2019, arXiv:1901.08057.
- Huang, H.; Yang, Q. Large scale analysis of generalization error in learning using margin based classification methods. arXiv 2020, arXiv:2007.10112.
- Lam, X.Y.; Marron, J.; Sun, D.; Toh, K.C. Fast algorithms for large-scale generalized distance weighted discrimination. J. Comput. Graph. Stat. 2018, 27, 368–379.
- Sun, D.; Toh, K.C.; Yang, L. A convergent 3-block semiproximal alternating direction method of multipliers for conic programming with 4-type constraints. SIAM J. Optim. 2015, 25, 882–915.
- Parikh, N.; Boyd, S. Proximal algorithms. Found. Trends Optim. 2014, 1, 127–239.
- Asahchop, E.L.; Branton, W.G.; Krishnan, A.; Chen, P.A.; Yang, D.; Kong, L.; Zochodne, D.W.; Brew, B.J.; Gill, M.J.; Power, C. HIV-associated sensory polyneuropathy and neuronal injury are associated with miRNA–455-3p induction. JCI Insight 2018, 3, e122450.
- Hsu, D.; Kakade, S.; Zhang, T. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab. 2012, 17, 52.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).