Sparse Multicategory Generalized Distance Weighted Discrimination in Ultra-High Dimensions

Distance weighted discrimination (DWD) is an appealing classification method that is capable of overcoming the data-piling problem in high-dimensional settings. Variable selection in multicategory classification poses great challenges, especially when various sparsity structures are assumed in these settings. In this paper, we propose a multicategory generalized DWD (MgDWD) method that maintains intrinsic variable group structures during selection using a sparse group lasso penalty. Theoretically, we derive the uniqueness of the minimizer of the population MgDWD loss function and consistency properties for the proposed penalized classifier. We further develop an efficient algorithm based on the proximal operator to solve the optimization problem. The performance of MgDWD is evaluated using finite sample simulations and miRNA data from an HIV study.


Introduction
Classification problems appear in diverse practical applications, such as spam e-mail classification, disease diagnosis and drug discovery, among many others (e.g., [1][2][3]). In these classification problems, the goal is to predict class labels based on a given set of variables. Recent research has focused extensively on linear classification: see [4,5] for comprehensive introductions. Among many linear classification methods, support vector machines (SVMs) (see [6,7]) and distance-weighted discrimination (DWD) (see [8][9][10]) are two commonly used large-margin based classification methods.
Owing to the recent advent of new technologies for data acquisition and storage, classification with high-dimensional features, i.e., a large number of variables, has become a ubiquitous problem in both theoretical and applied scientific studies. Typically, only a small number of instances are available, a setting we refer to as high-dimensional, low-sample size (HDLSS), as in [11]. In the HDLSS setting, a so-called "data-piling" phenomenon is observed in [8] for SVMs, occurring when projections of many training instances onto a vector normal to the separating hyperplane are nearly identical, suggesting severe overfitting. DWD was originally proposed to overcome data piling in the HDLSS setting. In binary classification problems, linear SVMs seek a hyperplane maximizing the smallest margin over all data points, while DWD seeks a hyperplane minimizing the sum of inverse margins over all data points. Reference [8] suggests replacing the inverse margins by the q-th power of the inverse margins in a generalized DWD method; see [12] for a detailed description. Formally, for a training data set {(y_i, X_i)}_{i=1}^N of N observations, where X_i ∈ R^p and y_i ∈ {−1, 1}, binary generalized linear DWD seeks a proper separating hyperplane {X : a + X^⊤b = 0} through the optimization problem
$$\arg\min_{a,\, b,\, \eta}\ \sum_{i=1}^{N}\left(\frac{1}{d_i^{\,q}} + c\,\eta_i\right)\quad \text{subject to}\quad d_i = y_i(a + X_i^\top b) + \eta_i \ge 0,\ \ \eta_i \ge 0,\ \ \|b\|_2 \le 1, \tag{1}$$
where a and b are the intercept and slope parameters, respectively. The slack variable η_i is introduced to ensure that the corresponding margin d_i is non-negative, and the constant c > 0 is a tuning parameter to control the overlap between classes. Problem (1) can also be written in a loss-plus-penalty form (e.g., [12]) as
$$\arg\min_{a,\, b}\ \frac{1}{N}\sum_{i=1}^{N}\phi_q\{y_i(a + X_i^\top b)\} + \lambda \|b\|_2^2, \tag{2}$$
where
$$\phi_q(u) = \begin{cases} 1 - u, & u \le Q,\\ \varphi_q(u), & u > Q, \end{cases} \tag{3}$$
with Q = q/(q+1), q > 0, and ϕ_q(u) = (1 − Q)(Qu^{−1})^q. When q = 1, (1) becomes the standard DWD problem in [8], while problem (2) appears in [9,13].
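For concreteness, the generalized DWD loss and its first derivative can be evaluated in a few lines. The following Python/NumPy sketch implements φ_q and φ_q′ under the definition above; the function names are ours and this is only an illustration of the loss, not the authors' software.

```python
import numpy as np

def phi_q(u, q=1.0):
    """Generalized DWD loss: 1 - u for u <= Q and (1 - Q) * (Q / u)**q for u > Q,
    where Q = q / (q + 1)."""
    Q = q / (q + 1.0)
    u = np.asarray(u, dtype=float)
    safe = np.maximum(u, Q)  # avoids invalid powers in the branch that is discarded
    return np.where(u <= Q, 1.0 - u, (1.0 - Q) * (Q / safe) ** q)

def phi_q_grad(u, q=1.0):
    """First derivative of phi_q: -1 for u <= Q and -q(1-Q)Q^q / u^(q+1) for u > Q;
    the two branches match at u = Q, so phi_q is continuously differentiable."""
    Q = q / (q + 1.0)
    u = np.asarray(u, dtype=float)
    safe = np.maximum(u, Q)
    return np.where(u <= Q, -1.0, -q * (1.0 - Q) * Q ** q / safe ** (q + 1.0))
```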
The binary classification problem (1) is well studied. However, in many applications, such as image classification [1], cancer diagnosis [2] and speech recognition [3], to name a few, problems with more than two categories are commonplace. To solve these multicategory problems with the DWD classifier, approaches based on either formulation (1) or (2) are common. One common strategy is to extend problem (1) to multiple classes by solving a series of binary problems in a one-versus-one (OVO) or one-versus-rest (OVR) manner (e.g., [14]). Instead of reducing the multicategory problem to a binary one, another strategy based on problem (1) considers all classes at once. As shown in [14], this approach generally works better than the OVO and OVR methods. Based on an extension of problem (2), [15] proposes a multicategory DWD, written in a loss-plus-penalty form (4) with y_i, k ∈ {1, . . . , K}, where a_k and b_k = (b_{1k}, . . . , b_{pk})^⊤ are the intercept and slope parameters for each category k, respectively. Although these methods can be applied to multicategory classification in the HDLSS setting, both problems (2) and (4) use the L_2 penalty and do not perform feature selection. As discussed in [16], for high-dimensional classification, taking all features into consideration does not work well for two reasons. First, based on prior knowledge, only a small number of variables are relevant to the classification problem: a good classifier in high dimensions should have the ability to sparsely select important variables and discard redundant ones. Second, classifiers using all available variables in high-dimensional settings may have poor classification performance.
Much of the SVM literature has considered variable selection in high-dimensional classification problems to improve performance (e.g., [17][18][19]). Among the DWD literature, to the best of our knowledge, only [16] has considered variable selection and classification simultaneously. Wang and Zou [16] considered an L_1 rather than an L_2 penalty in problem (2) to improve interpretability through sparsity in binary classification. Moreover, [16] made selections based on the strengths of input variables within individual classes but ignored the strengths of input variable groupings, thereby selecting more factors than necessary for each class. To overcome this weakness, in this paper we develop a multicategory generalized DWD method that is capable of performing variable selection and classification simultaneously. Our approach incorporates sparsity and group structure information via the sparse group lasso penalty (see [20][21][22][23][24]).
Although DWD is well studied, it is less popular than the SVM for binary classification, arguably for computational and theoretical reasons. For an up-to-date list of works on DWD, mostly focused on the q = 1 case, see [14,15]. Theoretical asymptotic properties of large-margin classifiers in high-dimensional settings were studied in [25], and [26] derived an expression for the asymptotic generalization error. In terms of computation, [8] solved the standard DWD problem in (1) as a second-order cone programming (SOCP) problem using a primal-dual interior-point method that is computationally expensive when N or p is large. To overcome this computational bottleneck, [12] proposed an approach based on a novel formulation of the primal DWD model in (1); however, this method does not scale to large data sets and requires further work. Lam et al. [27] designed a new algorithm for large DWD problems with q ≥ 2 and K = 2 based on convergent multi-block ADMM-type methods (see [28]). Wang and Zou [16] solved the lasso-penalized binary DWD problem by combining majorization-minimization and coordinate descent methods, since the lasso penalty does not directly permit an SOCP solution. In fact, solution identifiability in the generalized DWD problem with q > 1 requires more constraints and remains an open research problem (see [8]). To the best of our knowledge, no work focusing on computational aspects of lasso-penalized multicategory generalized DWD (MgDWD) exists; the same holds for sparse group lasso-penalized MgDWD.
The theoretical and computational contributions of this paper are as follows. First, we establish the uniqueness of the minimizer in the population form of the MgDWD problem. Second, we prove a non-asymptotic L 2 estimation error bound for the sparse group lasso-regularized MgDWD loss function in the ultra-high dimensional setting under mild regularity conditions. Third, we develop a fast, efficient algorithm able to solve the sparse group lasso-penalized MgDWD problem using proximal methods.
The rest of this paper is organized as follows. In Section 2.1, we introduce the MgDWD problem with the sparse group lasso penalty. In Sections 2.2 and 2.3, we establish theoretical properties of the population classifier and the regularized empirical loss. We propose a computational algorithm in Section 2.4. Section 3 illustrates the finite sample performance of our method through simulation studies and a real data analysis. Proofs of the major theorems are given in Appendix A.

Model Setup
We begin with some basic setup and notation. Consider the multicategory classification problem for a random sample {(y_i, X_i)}_{i=1}^N of N independent and identically distributed (i.i.d.) observations from some distribution P(y, X). Here, y is the categorical response taking values in Y = {1, . . . , K}, and X = (x_1, . . . , x_p)^⊤ ∈ X ⊂ R^p is the covariate vector. We wish to obtain a proper separating hyperplane {X ∈ X : a_k + X^⊤b_k = 0} for each category k ∈ Y, where a_k and b_k = (b_{1k}, . . . , b_{pk})^⊤ are intercept and slope parameters, respectively.
In this paper, we consider MgDWD with sparse group lasso regularization. That is, we estimate a classification boundary by solving the constrained optimization problem
$$\min_{\{a_k,\, b_k\}}\ \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K}\phi_q(a_k + X_i^\top b_k)\,1\{y_i = k\} + \lambda\sum_{j=1}^{p}\Big\{\tau\|\beta_j\|_1 + (1-\tau)\|\beta_j\|_2\Big\} \quad \text{subject to}\quad \sum_{k=1}^{K} a_k = 0,\ \ \sum_{k=1}^{K} b_k = 0_p, \tag{5}$$
where φ_q is as defined in (3) and β_j = (b_{j1}, . . . , b_{jK})^⊤ collects the coefficients of variable x_j across the K categories.
To approach this problem, we apply the concept of a "margin vector" to extend the definition of a (binary) margin to the multicategory case. Denote the margin vector of an observation X_i as F_i = (f_{i1}, . . . , f_{iK})^⊤ with f_{ik} = a_k + X_i^⊤b_k, and let E_i = (e_{i1}, . . . , e_{iK})^⊤ be the class indicator vector with e_{ik} = 1{y_i = k}. The multicategory margin of the data point (y_i, X_i) is then given as f_{i y_i} = a_{y_i} + X_i^⊤b_{y_i} = E_i^⊤F_i. Therefore, the MgDWD loss can be rewritten as
$$\frac{1}{N}\sum_{i=1}^{N}\phi_q(E_i^\top F_i). \tag{6}$$
Based on (6), Lemma 1 describes the Fisher consistency of the MgDWD loss.
Lemma 1. Given X = u, the minimizer of the conditional expectation of (6) is and k * = argmin k∈Y Pr{y = k|X = u}.
Consequently, f̂_k(u) can be treated as an effective proxy of Pr{y = k | X = u} and, for any new observation X*, a reasonable prediction of its label y* is
$$\hat{y}^* = \arg\max_{k \in Y}\ \hat{a}_k + X^{*\top}\hat{b}_k.$$
Speaking to the sparse group lasso (SGL) regularization in (5), the L_1 penalty encourages an element-wise sparse estimator that selects important variables for each category, indicated by b̂_{jk} ≠ 0. Assuming that parameters in different categories share the same information, we use an L_2 penalty to encourage a group-wise sparsity structure that removes covariates that are irrelevant across all categories, that is, where β̂_j = (b̂_{j1}, . . . , b̂_{jK})^⊤ = 0_K. Specifically, let x_j = (x_{1j}, . . . , x_{Nj})^⊤ and B = (b_{jk}) ∈ R^{p×K}, where the k-th column b_k is the slope vector for the category label k and the j-th row β_j is the group coefficient for the variable x_j. If x_j is noise in the classification problem or is not relevant to category label k, then the entry b_{jk} of B should be shrunk to exactly zero. The SGL penalty of (5) can be written as a convex combination of the lasso and group lasso penalties in terms of β_j as
$$\lambda\sum_{j=1}^{p}\Big\{\tau\|\beta_j\|_1 + (1-\tau)\|\beta_j\|_2\Big\}, \tag{7}$$
where λ > 0 is the scale of the penalty and τ ∈ [0, 1] tunes the propensity between the element-wise and group-wise sparsity structures.
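As a concrete illustration of the model setup, the sketch below evaluates the empirical MgDWD loss (6) plus the SGL penalty (7), together with the prediction rule above. It is a minimal NumPy sketch under the assumption that labels are coded 0, …, K−1 for indexing; the function names are illustrative rather than the authors' implementation.

```python
import numpy as np

def mgdwd_objective(alpha, B, X, y, lam, tau, q=1.0):
    """Empirical MgDWD loss plus the sparse group lasso penalty.
    alpha: (K,) intercepts, B: (p, K) slopes, X: (N, p), y: (N,) labels in {0,...,K-1}."""
    N = X.shape[0]
    Q = q / (q + 1.0)
    F = alpha + X @ B                      # margin matrix with entries f_ik = a_k + X_i' b_k
    m = F[np.arange(N), y]                 # multicategory margins f_{i, y_i}
    safe = np.maximum(m, Q)
    loss = np.where(m <= Q, 1.0 - m, (1.0 - Q) * (Q / safe) ** q).mean()
    penalty = lam * (tau * np.abs(B).sum()
                     + (1.0 - tau) * np.linalg.norm(B, axis=1).sum())
    return loss + penalty

def predict_labels(alpha, B, X_new):
    """Prediction rule: y_hat = argmax_k (a_k + x' b_k)."""
    return np.argmax(alpha + X_new @ B, axis=1)
```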

Population MgDWD
In this subsection, some basic results pertaining to unpenalized population MgDWD are given. These results are necessary for further theoretical analysis.
Denote the marginal probability mass of y as Pr(y = k) = π_k with π_k > 0 and ∑_{k=1}^K π_k = 1, and the conditional probability density function of X given y = k by g(X | y = k) = g_k(X). Let Θ = (θ_1, . . . , θ_K) be the collection of coefficient vectors θ_k = (a_k, b_k^⊤)^⊤ for all labels and Z = (1, X^⊤)^⊤. The population version of the MgDWD problem in (6) is
$$L(\vartheta) = E\big[\phi_q\{I(Y)^\top \Theta^\top Z\}\big], \tag{8}$$
where ϑ = vec{Θ} is the vectorization of the matrix Θ and I(Y) = (1{y = 1}, . . . , 1{y = K})^⊤ is a random vector. Denote the true parameter value ϑ* as a minimizer of the population MgDWD problem, namely,
$$\vartheta^* = \arg\min_{\vartheta \in \mathcal{C}} L(\vartheta),$$
where C = {ϑ ∈ R^{K(p+1)} : Cϑ = 0_{p+1}} is the set of sum-constrained ϑ with C = 1_K^⊤ ⊗ I_{p+1}, where ⊗ denotes the Kronecker product.
To facilitate our theoretical analysis, we first define the gradient vector and Hessian matrix of the population MgDWD loss function. We then introduce some regularity conditions necessary to derive theoretical properties of this problem. Let diag{v} be a diagonal matrix constructed from the vector v, and let • and ⊕ be the Hadamard product and the direct matrix sum, respectively. Denote the gradient vector of the population MgDWD loss function (8) as S(ϑ) and its Hessian matrix as H(ϑ), where ϕ_q″ denotes the second derivative of the function ϕ_q; I(Y, X) = I(Y) • I(X) is a random vector with I(X) = (1{X ∈ X_1}, . . . , 1{X ∈ X_K})^⊤ and X_k = {X ∈ X : Z^⊤θ_k > Q}. The block structure of H(ϑ) implies a parallel relationship between the categories. The relationship between the θ_k is reflected by the sum-to-zero constraint in the definition of C.
We assume the following regularity conditions. (C1) The densities of X given y = k ∈ Y , i.e., the g k (X), are continuous and have finite second moments.
Remark 1. Condition (C1) ensures that L, S and H are well defined and continuous in ϑ. For the theoretically optimal hyperplane {X ∈ X : Z^⊤θ_k* = 0}, the case with θ_k* = 0_{p+1} leaves X useless for classification. On the other hand, when a_k* ≠ 0 and b_k* = 0_p, the hyperplane is the empty set and is similarly meaningless. Condition (C2) is proposed to avoid the case where b_k* = 0_p, so that ϑ* always contains information relevant to the classification problem. For bounded random variables, condition (C2) should be assumed with caution. Condition (C3) implies the positive definiteness of H(ϑ*).
By convexity and the second-order Lagrange condition, the following theorem shows that the local minimizer of the population MgDWD problem exists and is unique.

Theorem 1.
Under the regularity conditions (C1)–(C3), the true parameter ϑ* ∈ C is the unique minimizer of L(ϑ), with b_k* ≠ 0_p. The bounds in Theorem 1 show how q affects the loss function L(ϑ*). The upper bound v(k, q) is a decreasing function of q with lim_{q→0} v(k, q) = 1. In the lower bound u(k, q), the first term Pr{X ∉ X_k* | y = k} is an increasing function of q and the last term Q^{2q} Pr{Q < Z^⊤θ_k* ≤ Q^{−1} | y = k} is a decreasing function of q, with lim_{q→0} u(k, q) = 1. Consequently, for the given population P(y, X), a larger q encourages the population MgDWD estimator to focus more on the regions {X ∉ X_k, y = k} that correspond to misclassifications. As a result, the estimator's performance will be similar to the hinge loss as q → ∞. Setting q too small will lead to an ineffective classifier due to the unreasonable penalty placed on the well-classified region {X ∈ X_k, y = k}. This variation in the lower bound with respect to q provides a necessary condition for the existence of an optimal q.

Remark 2. The explicit relationship between q and ϑ* is complicated. While it may be more desirable to prove that a greater value of q results in a smaller value of the loss function L(ϑ), there is no explicit formula for the optimal value ϑ* in terms of q.

Estimator Consistency
Under the unpenalized framework presented in the previous subsection, all covariates will contribute to the classification task for each category: this scenario may lead to a classifier that overfits to the training data set. In this subsection, we study the consistency of the estimator for (5) in ultra-high dimensional settings.
To achieve structural sparsity in the estimator, the regularization parameter λ in (7) must be large enough to dominate the gradient of the empirical MgDWD loss evaluated at the theoretical minimizer ϑ* = vec{Θ*} with high probability. On the other hand, λ should also be as small as possible to reduce the bias incurred by the SGL regularization term. Lemma 2 provides a suitable choice of λ under the following assumption. (A1) The predictors X = (x_1, . . . , x_p)^⊤ ∈ R^p are independent sub-Gaussian random vectors satisfying E X = 0_p and, with Var(X) = Σ, there exists a constant κ > 0 such that for any γ ∈ R^p, E exp(γ^⊤Σ^{−1/2}X) ≤ exp(‖γ‖_2^2 κ^2/2). From here on, we define ς_1^2 as the largest eigenvalue of Σ.
It is difficult to obtain a closed form for the conjugate of the SGL penalty; instead, we propose a theoretical tuning parameter value, given in (9), where c_0 is some given constant. Before we can derive an error bound for the estimator in (5), we impose two additional assumptions.
(A2) For the true parameter value ϑ*, there is an (s_e, s_g)-sparse structure in the coefficients B*, with element-wise and group-wise support sets of cardinality |E| = s_e and |G| = s_g, respectively.
(A3) There exist some positive constants ς_2 and ς_3 such that the eigenvalue conditions discussed in Remark 3 hold. Under the choice of λ given in (9), we show the L_2-consistency of the estimator in (5).

Remark 3. The sub-Gaussian distribution assumption (A1) is common in high-dimensional scenarios.
This assumption characterizes the tail behavior of a collection of random variables, including Gaussian, Bernoulli, and bounded variables as special cases. Assumption (A2) describes structural sparsity at two levels. The element-wise size s_e < pK is the size of the underlying generative model, and the group-wise size s_g < p is the size of the signal covariate set. Both s_e and s_g are allowed to depend on the sample size N. As a result, the dimension p is allowed to increase with the sample size N. Assumption (A3) guarantees that the relevant eigenvalues are positive in this sparse scenario.

Remark 4.
In practice, the tuning parameters λ and τ in (7) are commonly chosen by M-fold cross validation. That is, we choose the pair (τ, λ) with the highest prediction accuracy averaged over the held-out sub-data sets D_m, as sketched below.
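The cross-validation scheme of Remark 4 can be organized as a simple grid search. The sketch below assumes a user-supplied solver `fit` (standing in for Algorithm 1) that returns (alpha, B), and a prediction function such as `predict_labels` above; both names are placeholders rather than part of the paper.

```python
import numpy as np
from itertools import product

def choose_tuning(X, y, fit, predict, lambdas, taus, M=5, q=1.0, seed=0):
    """Select (tau, lambda) by M-fold cross validation, maximizing average accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), M)
    best_pair, best_acc = None, -np.inf
    for tau, lam in product(taus, lambdas):
        accs = []
        for m in range(M):
            test = folds[m]
            train = np.setdiff1d(np.arange(X.shape[0]), test)
            alpha, B = fit(X[train], y[train], lam=lam, tau=tau, q=q)
            accs.append(np.mean(predict(alpha, B, X[test]) == y[test]))
        if np.mean(accs) > best_acc:
            best_pair, best_acc = (tau, lam), float(np.mean(accs))
    return best_pair, best_acc
```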

Computational Algorithm
In this section, we propose an efficient algorithm to solve problem (5). Our approach uses the proximal algorithm (see [29]) for solving high dimensional regularization problems. In two main steps, this approach obtains a solution to the constrained optimization problem by applying the proximal operator to the solution to the unconstrained problem.
Since regularization is not needed for the intercept terms α = (a_1, . . . , a_K)^⊤, they can be separated from the coefficients in B. The empirical MgDWD loss corresponding to (8) is the sample average in (6). Various properties of the loss function L(ϑ) follow below.
Lemma 3. The loss function L(ϑ) has Lipschitz continuous partial derivatives. In particular, the partial derivative with respect to α has Lipschitz constant L_α = (n_max/N)L_q, where L_q = (q + 1)^2/q and n_max is the largest group sample size; the partial derivative with respect to B, S(B) = ∂L(ϑ)/∂B = N^{−1}X^⊤{E • φ_q′(F)}, has Lipschitz constant L_B = L_q N^{−1} max_k ‖diag(e_k)X‖_2^2, where e_k is the k-th column of E and indicates the observations belonging to the k-th group.
Hence, following the majorization-minimization scheme, we can majorize the empirical MgDWD loss L(ϑ) by a quadratic function around some ϑ* = vec{(α*^⊤, B*^⊤)^⊤}, where L_α and L_B denote the Lipschitz constants in Lemma 3. Instead of minimizing L(ϑ) directly, we apply gradient descent to minimize this surrogate upper bound, which yields gradient updates with step sizes 1/L_α and 1/L_B. Next, we address the problem's constraints and regularization simultaneously by applying the proximal operator. For α*, the proximal step reduces to a projection onto the sum-to-zero constraint. For B, the minimization problem separates across the rows β_j of B, as in (13), which implies that we can implement the minimization for the p groups in parallel. The following theorem provides the solution to (13).
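Before stating that result, the sketch below shows how one majorization-minimization update looks in practice: a gradient step on the smooth MgDWD loss with step sizes 1/L_α and 1/L_B, followed by centering the intercepts and applying a row-wise proximal operator to B. This is a schematic Python rendering under our own naming, not the authors' implementation; `prox_sgl_row` is the constrained SGL proximal operator sketched after Algorithm 1 below.

```python
import numpy as np

def mgdwd_prox_step(alpha, B, X, E, lam, tau, q, L_alpha, L_B, prox_sgl_row):
    """One proximal gradient (MM) update for constrained SGL-MgDWD.
    alpha: (K,), B: (p, K), X: (N, p), E: (N, K) with E[i, k] = 1{y_i = k}."""
    N, p = X.shape
    Q = q / (q + 1.0)
    F = alpha + X @ B                                   # margin matrix f_ik
    d_phi = np.where(F <= Q, -1.0,
                     -q * (1.0 - Q) * Q ** q / np.maximum(F, Q) ** (q + 1.0))
    G = E * d_phi                                       # only the observed class contributes
    # Gradient step on the quadratic majorizer of the empirical loss
    alpha_new = alpha - G.sum(axis=0) / (N * L_alpha)
    B_new = B - (X.T @ G) / (N * L_B)
    # Proximal step: the intercepts are only constrained to sum to zero
    alpha_new = alpha_new - alpha_new.mean()
    # Each row beta_j is handled independently (in parallel, if desired)
    for j in range(p):
        B_new[j] = prox_sgl_row(B_new[j], lam * tau / L_B, lam * (1.0 - tau) / L_B)
    return alpha_new, B_new
```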
Theorem 3. Let ρ_1, ρ_2 ≥ 0 and β* ∈ R^K. Then the constrained regularization problem (13) has a solution of the form (14) for some u ∈ ∂‖β‖_1.
In the special case with ρ_2 = 0, the constrained regularization problem in Theorem 3 reduces to the constrained lasso problem with solution β̂* = P_K(β* − ρ_1 u). Combined with (14), the proximal operator U can be introduced to realize the group sparsity of β̂*. For the standard lasso problem, the subgradient u has a closed form, which yields the usual soft-thresholding solution S(β*, ρ_1). However, under the constraint on β̂*, the naive solution P_K S(β*, ρ_1) is misleading in that it satisfies the constraint but does not achieve shrinkage, let alone loss function minimization. The term P_K u is suggestive of the intersection between the subdifferential set ∂‖β‖_1 and the constraint set {β ∈ R^K : β^⊤1_K = 0}; in this sense, β̂* might not have a closed form. Here we consider using coordinate descent to solve the constrained lasso problem.
For some fixed coordinate m, since β^⊤1_K = 0, we have that b_m = −∑_{l≠m} b_l. Rewriting the objective function of the constrained lasso problem in a coordinate-wise form yields problem (16). Next, Theorem 4 provides the solution to the optimization problem (16).
By Theorem 4, given some m ∈ {1, . . . , K}, the coordinate-wise minimizer for any k ≠ m can be expressed via the proximal operator with s = ∑_{l≠k,m} b_l and t = (b_k* − b_m* − s)/2. If we fix m during the iteration, then the shrinkage of b_m is only reflected indirectly through the other b_k. We therefore propose that m change with k in the coordinate-wise minimization process to ensure that every coordinate can be equally shrunk. We summarize our proposed algorithm in Algorithm 1; a sketch of the resulting row-wise proximal operator follows.
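The row-wise proximal operator referenced above can be sketched as follows: a constrained lasso solved by coordinate descent with a rotating pivot m (in the spirit of Theorem 4), followed by the group soft-thresholding suggested by Theorem 3. This is our own schematic rendering under stated assumptions, with an explicitly solved scalar subproblem standing in for the coordinate-wise operator; it is not the authors' code.

```python
import numpy as np

def double_soft_threshold(t, s, rho):
    """Minimize (b - t)^2 + rho * (|b| + |b + s|) over b in R, the scalar
    subproblem arising from the coordinate-wise reformulation (cf. Theorem 4)."""
    lo, hi = min(0.0, -s), max(0.0, -s)
    if t - rho > hi:
        return t - rho
    if t + rho < lo:
        return t + rho
    if lo < t < hi:
        return t
    kinks = [lo, hi]
    vals = [(b - t) ** 2 + rho * (abs(b) + abs(b + s)) for b in kinks]
    return kinks[int(np.argmin(vals))]

def prox_sgl_row(beta_star, rho1, rho2, n_sweeps=50):
    """Proximal operator for one group beta_j under the SGL penalty with the
    sum-to-zero constraint 1'beta = 0."""
    K = beta_star.size
    b = np.asarray(beta_star, dtype=float) - np.mean(beta_star)   # feasible start
    for _ in range(n_sweeps):
        for m in range(K):                     # rotate the pivot coordinate m
            for k in range(K):
                if k == m:
                    continue
                s = b.sum() - b[k] - b[m]      # s = sum_{l != k, m} b_l
                t = (beta_star[k] - beta_star[m] - s) / 2.0
                b[k] = double_soft_threshold(t, s, rho1)
            b[m] = -(b.sum() - b[m])           # keep the constraint exact
    # Group soft-thresholding realizes group-wise sparsity
    norm = np.linalg.norm(b)
    return np.maximum(1.0 - rho2 / norm, 0.0) * b if norm > 0 else b
```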

Numerical Analysis
In the following section, we use both simulated and real data sets to evaluate the finite sample properties of our proposed method. We compare the finite sample performance of SGL-MgDWD with L 1 -regularized multinomial logistic regression (L 1 -logistic).

Simulation Studies
The data are generated from the following model. Consider the K-category classification problem where π_k = K^{−1} and g_k(X) is the density function of a normal distribution with mean vector µ_k = (µ_{1k}, µ_{2k}, 0_{p−2}^⊤)^⊤ and covariance matrix I_p, where (µ_{1k}, µ_{2k}) = (2 cos(πr_k), 2 sin(πr_k)) with r_k = 2(k − 1)/K for k = 1, . . . , K. In this model, only the first two variables contribute to the classification, and their corresponding parameter vectors β_1 and β_2 form two groups of coefficients. The true model has the sparsity structure (s_e, s_g) = (2K, 2) for a total of K(p + 1) coefficients. We set the sample size for each category to n_k = 50, 100, 200 and 400, the number of classes to K = 5 and 11, and the dimensionality to p = 100 and 1000.
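The simulation design described above can be generated directly; the sketch below (with labels coded 0, …, K−1 for convenience) is only an illustration of the data-generating model.

```python
import numpy as np

def simulate_data(n_k=100, K=5, p=100, seed=0):
    """Class k has density N(mu_k, I_p) with (mu_1k, mu_2k) = (2cos(pi r_k), 2sin(pi r_k)),
    r_k = 2(k - 1)/K, and all remaining mean components equal to zero."""
    rng = np.random.default_rng(seed)
    X_blocks, y_blocks = [], []
    for k in range(1, K + 1):
        r_k = 2.0 * (k - 1) / K
        mu = np.zeros(p)
        mu[0], mu[1] = 2.0 * np.cos(np.pi * r_k), 2.0 * np.sin(np.pi * r_k)
        X_blocks.append(rng.standard_normal((n_k, p)) + mu)
        y_blocks.append(np.full(n_k, k - 1))
    return np.vstack(X_blocks), np.concatenate(y_blocks)
```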
In what follows, we compare the proposed SGL-MgDWD method with the OVR method based on SGL-MgDWD with K = 2 (OVR-SGL-gDWD). For SGL-MgDWD, logistic regression and OVR, the tuning parameter λ is optimized over a discrete set by minimizing the prediction error using 5-fold cross validation. In each simulation, we conduct 100 runs and use a testing set of equal size to evaluate each method's performance using the following criteria:
• Testing set accuracy, measuring the rate of correct classification;
• Signal, the average number of correctly selected element-wise and group-wise signals, that is, with b̂_{jk} ≠ 0 and β̂_j ≠ 0, respectively, denoted by the pair (s_e^+, s_g^+);
• Noise, the average number of incorrectly selected element-wise and group-wise components, that is, with b̂_{jk} ≠ 0 and β̂_j ≠ 0, respectively, denoted by the pair (n_e^+, n_g^+).
A sketch of how these selection counts can be computed follows this paragraph. Simulation results are summarized in Tables 1 and 2. As shown in Tables 1 and 2, the proposed SGL-MgDWD method performs better than the L_1-logistic and OVR methods. Specifically, in each scenario, predictions from the SGL-MgDWD method had higher accuracy relative to the other two methods. Similarly, the SGL-MgDWD method correctly selected the signal components of the model with fewer incorrectly selected noise components, again relative to the L_1-logistic and OVR methods. These simulation results also demonstrate that test accuracy increases with increasing sample size n_k and decreases with higher dimension p at fixed n_k. This is consistent with the derived theoretical properties. All computations were performed using TensorFlow 2.3 on the CPU of a Threadripper 2950X at 4.1 GHz.
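The selection criteria above reduce to counting nonzero patterns in the estimated coefficient matrix. A minimal sketch is given below, assuming the true coefficient matrix is available as B_true; the function name and tolerance are ours.

```python
import numpy as np

def selection_metrics(B_hat, B_true, tol=1e-8):
    """Return (s_e^+, s_g^+) and (n_e^+, n_g^+): correctly and incorrectly selected
    element-wise and group-wise components of B_hat relative to B_true."""
    est_e = np.abs(B_hat) > tol          # estimated element-wise support
    true_e = np.abs(B_true) > tol        # true element-wise support
    est_g = est_e.any(axis=1)            # estimated group-wise support (rows beta_j)
    true_g = true_e.any(axis=1)          # true group-wise support
    signal = (int(np.sum(est_e & true_e)), int(np.sum(est_g & true_g)))
    noise = (int(np.sum(est_e & ~true_e)), int(np.sum(est_g & ~true_g)))
    return signal, noise
```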

HIV Data Analysis
Symptomatic distal sensory polyneuropathy (sDSP) is a common debilitating condition among people with HIV. This condition leads to neuropathic pain and is associated with substantial comorbidities and increased health care costs. Plasma miRNA profiles show differences between HIV patients with and without sDSP, and several miRNA biomarkers are reported to be associated with the presence of sDSP in HIV patients (see [30]). The corresponding binary classification problem was analyzed in [30] using random forest classifiers. However, the HIV data set can be further classified into four classes. The HIV data set has 1715 miRNA measures for 40 patients and is partitioned into four groups (K = 4) with n_k = 10 patients in each category: non-HIV, HIV with no brain damage (HIVNBD), HIV with brain damage but stable (HIVBDS), and HIV with brain damage and unstable (HIVBDU). In the following analysis, we apply our proposed method to this classification problem. The primary aim is to identify critical miRNA biomarkers for each of the four groups. Beyond achieving a finer classification, this analysis is helpful in assessing related pathogenic effects for each patient group.
Given the small sample size of N = 40, we chose the tuning parameter λ by maximizing leave-one-out cross-validation accuracy. We fixed (q, τ) = (1, 0.1). Table 3 shows the signal for the coefficient estimates obtained from the SGL-MgDWD method using the selected λ. We conclude that there are 22 critical miRNA biomarkers important to the classification problem. In particular, the biomarkers miR-25-star, miR-3171, miR-3924 and miR-4307 are not relevant to the non-HIV group; miR-4641, miR-4655-3p and miR-660 are not relevant to the HIVNBD group; miR-217 and miR-4683 are not relevant to the HIVBDS group; and miR-217 and miR-4307 are not relevant to the HIVBDU group.

Table 1. Simulation results for the SGL-MgDWD, L_1-logistic, and OVR methods with K = 5. Time is measured relative to a baseline logistic regression model with K = 5, p = 100, and N = 50. Numbers in parentheses denote standard deviations.

Table 3. Signal for the coefficient estimates obtained from the SGL-MgDWD method with (q, τ) = (1, 0.1) for the HIV data set. The symbols "+" and "−" denote positive and negative coefficient estimates, respectively, while "0" denotes a zero coefficient (i.e., an irrelevant variable).

Acknowledgments:
The authors are thankful to the two guest editors, Farouk Nathoo and Ejaz Ahmed, for their invitation. This work has also benefited from the constructive comments and valuable feedback of two anonymous reviewers. The authors also thank Matthew Pietrosanu for his great help with editing.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proofs
Appendix A.1. Proof of Lemma 1

Proof. For simplicity, we write p_j = P(y = j | X) and f_k = f_k(X). Using the Lagrange multiplier method, we define the corresponding Lagrangian; then, for each k, the stationarity condition holds. Without loss of generality, assume that p_1 > p_2 ≥ p_3 ≥ · · · ≥ p_{K−1} > p_K. Note that −1 ≤ φ_q′ < 0, and so p_k ≥ −φ_q′(f_k)p_k = µ > 0, with µ = p_k if and only if f_k ≤ Q.
However, ∑_{k=1}^K f_k > 0, contradicting the sum-to-zero constraint. Therefore, µ = p_K < p_k for k < K and the result follows.

Proof. The existence of L(ϑ) will be satisfied if the expectation defining L(ϑ) is finite. We divide X into two disjoint subsets; bounding the contribution of each completes the proof of the existence of L(ϑ).
Recall that φ_q(u) is a convex function of u, so its composition with the affine mapping u = Z^⊤θ_k is still convex in θ_k. Clearly, g_k(X), π_k > 0, so the non-negatively-weighted integral and sum both preserve convexity.
Hence, if ‖ϑ‖_2 > C + 1 for a sufficiently large constant C, then L(ϑ) > C. The contrapositive of this result implies the existence of a minimizer in the unconstrained problem. That is, the closed set {ϑ ∈ C : L(ϑ) ≤ C} is bounded for some large enough C. This guarantees the existence of a solution, as desired.

∂L(ϑ)/∂ϑ
For every θ_jk ∈ R, φ_q(Z^⊤θ_k) is a Lebesgue integrable function of X. For any u ∈ R, φ_q′(u) exists and |φ_q′(u)| ≤ 1. Hence, by the Leibniz integral rule, we may differentiate under the integral sign, and the same holds for any l ≠ k, which is sufficient to establish the claim.

Proof. We can rewrite φ_q′(u) in terms of ϕ_q(u) and the indicator 1{u > Q}. Then for any γ ∈ R^{p+1} and its corresponding X_k = {X ∈ X : Z^⊤γ > Q}, the stated expression holds. Let ϑ* ∈ C be a local minimizer. It follows that P S(ϑ*) = 0. For any γ ∈ R^{p+1} and its corresponding X_k = {X ∈ X : Z^⊤γ > Q}, we always have the following. If γ = 0_{p+1}, then X_k = ∅, so that Pr{y = k, X ∉ X_k} = π_k and Pr{y = k, X ∈ X_k} = 0. If γ_1 ≤ Q and γ_{/1} = 0_p, then X_k = ∅, giving the same conclusions as in the previous case. If γ_1 > Q and γ_{/1} = 0_p, then X_k = X, so that Pr{y = k, X ∉ X_k} = 0 and Pr{y = k, X ∈ X_k} = π_k. Consequently, when 0 < Pr{X ∈ X_k | y = k} < 1, X_k equals neither ∅ nor X, so b_k ≠ 0 follows.
Let η be a test function belonging to the Schwartz space D. Then η′ ∈ D with some support denoted by supp(η′).
Clearly, φ_q′(u) is not differentiable at Q but is Lipschitz continuous. Therefore, the measurable function S_k(θ_k) is a locally integrable function of θ_k. Then the (regular) generalized functions S_k(θ_k) belong to the dual space of D.
For the distributional derivative of S_k(θ_k) with respect to θ_jk, note that the function f(θ_jk, X) = φ_q′(Z^⊤θ_k) Z π_k g_k(X) η′(θ_jk) is integrable on R × X. Therefore, by Fubini's theorem, the order of integration can be exchanged. Recall that φ_q′ can be written as a Schwartz product between the differentiable function ϕ_q′(u) and the generalized function 1{u > Q}. Note that c_jk = (Q − ∑_{l≠j} z_l θ_lk)/z_j, where δ(x) is the Dirac delta function and the distributional derivative of 1{x > 0}, and recall the usual rule for evaluating f(x)δ(x − c) for some constant c and function f. Thus, by the product rule for the distributional derivative of the Schwartz product, and substituting the above expression, we obtain the corresponding term. Similarly, for l ≠ k, we have the analogous distributional derivative. Recall that the distributional derivative does not depend on the order of differentiation and agrees with the classical derivative whenever the latter exists. To summarize, the H_k(θ_k) are symmetric matrices, so H(ϑ) is also symmetric. In the sense of generalized functions, differentiation is a continuous operation with respect to convergence in D′, which coincides with results from the hinge loss.
Next, H(ϑ) ≻ O_{K(p+1)} if and only if H_1(θ_1) and its Schur complement ⊕_{k=2}^{K} H_k(θ_k) are both symmetric and positive definite. We can deduce that H(ϑ) ≻ O_{K(p+1)} if and only if H_k(θ_k) ≻ O_{p+1} for all k.
Note that there exists c > 0 such that ϕ_q″(Z^⊤θ_k) ≥ c on X_k. Then for any γ ∈ R^{p+1}, γ^⊤H_k(θ_k)γ = 0 if and only if γ = 0_{p+1} when Var{X | X ∈ X_k, y = k} is assumed to be non-singular. Assuming that Var{X | y = k} ≻ O implies that Var{X | X ∈ X_k, y = k} ≻ O.
Proof of Theorem 1. By Lemma A2, a minimizer ϑ* ∈ C exists with b_k* ≠ 0_p (by Lemma A4) and H(ϑ*) ≻ O_{K(p+1)} (by Lemma A5). By the second-order Lagrange condition and the convexity of L(ϑ) (by Lemma A1), the minimizer of the population MgDWD loss is unique.
Recall from (A2) that It follows that The difference between these two results is attributed to pointwise convergence. Let f m = 1 − A(k, m) ∈ D with m = 1, 2, . . . and η ∈ D. By Fubini's theorem and the dominated convergence theorem, Similarly, As a result, A(k, ∞) coincides with the population hinge/SVM loss and A(k, 0) is independent of θ * k .

Appendix A.3. Proof of Lemma 2
Proof. By the definition of P, we have that |d_ik| ≤ 1 − K^{−1}. Note that the d_ik are N i.i.d. random variables. By Hoeffding's inequality, we obtain the stated bound, where c_1 > 1.
Appendix A.4. Proof of Theorem 2

Proof. Since θ̂ = ϑ* + δ is the minimizer, we have the basic inequality, where β* is the vector ϑ* without the a_k components, replacing δ̂ for δ. Then, by the convexity of L, the stated bound holds. Combining the above results, we have the following.

Lemma A7. Assume that conditions (A1)-(A3) are satisfied. Then, with probability at most 2(Kp)^{2(s_e+K)(1−c_3^2)}, the stated event occurs, where ∆L(u, v) = L(u + v) − L(u) for any u, v ∈ R^{K(p+1)} and for some constant c_3 > 1.
Next, we consider covering V with ε-balls such that for any v_1 and v_2 in the same ball the difference in R is controlled. Therefore, sup_{v∈V} R(v) ≤ sup_{v∈N} R(v) + 2ς_2 ε. Taking ε = (s_e + K) log(pK)/(2N), we have the desired bound. Setting c_3 = 1 + √(2c_4) with c_4 > 1, we obtain the desired result.

Proof of Theorem 2. Consider a disjoint partition of the coordinate set of δ = θ̂ − ϑ*. Note that each subvector v_m has at most s_e + K non-zero coordinates. Denote v_0 = 0 and u_m = ϑ* + ∑_{l=0}^{m−1} v_l, so that u_1 = ϑ* and u_M + v_M = ϑ* + δ. We have the corresponding decomposition with high probability. By Lemma A5, L is twice differentiable, so that ∆L(ϑ*, δ) is bounded below by (ς_3^2/2)‖δ‖_2^2 − Λ_3‖δ‖_2 with high probability. Note that, from (A8), the stated cone condition on δ holds. Clearly, ‖δ_{E^+}‖_1 ≤ √(s_e + K)‖δ_{E^+}‖_2 ≤ √(s_e + K)‖δ‖_2 and ∑_{j∈G^+} ‖δ_j‖_2 ≤ √(s_g + 1)‖δ‖_2. We conclude the bound on ‖δ‖_2, after which the desired result follows from straightforward algebraic manipulation.
Appendix A.5. Proof of Lemma 3

Proof.
The derivative with respect to α follows from the chain rule. Thus, L_α = (n_max/N)L_q, where L_q = (q + 1)^2/q is the Lipschitz constant of φ_q′. The derivative with respect to vec(B^⊤) follows similarly; therefore, the derivative with respect to B is S(B) = N^{−1}X^⊤{E • φ_q′(F)}. We conclude that L_B = L_q N^{−1} max_k ‖diag(e_k)X‖_2^2.
Appendix A.6. Proof of Theorem 3

Lemma A8. The indicator function δ_R(x) of the constraint set R = {x ∈ R^p : 1_p^⊤x = 0} has subdifferential ∂δ_R(x) = {g ∈ R^p : g = s1_p, s ∈ R} for x ∈ R and ∂δ_R(x) = ∅ otherwise.

Proof. Suppose that x ∈ R. Then g ∈ ∂δ_R(x) if and only if δ_R(y) ≥ δ_R(x) + ⟨g, y − x⟩ for all y, that is, g^⊤(y − x) ≤ 0 for all y ∈ R.
Let z = y − x. Then z ∈ R since 1_p^⊤(y − x) = 0. Thus, g^⊤z ≤ 0. If g^⊤z = 0, then g ∈ {g ∈ R^p : g = s1_p, s ∈ R}. If there exists g ∈ ∂δ_R(x) satisfying g^⊤z < 0 for some z ∈ R, then −z ∈ R, so we must also have that g^⊤z > 0, a contradiction. Now, for any x ∉ R, we have that g ∈ ∂δ_R(x) if and only if δ_R(y) ≥ δ_R(x) + ⟨g, y − x⟩ for all y ∈ R, which requires g^⊤(x − y) ≥ ∞.
For x ∉ R and y ∈ R, since z = x − y ∈ R^p and g^⊤z ≥ ∞ is impossible for finite g, it must be that ∂δ_R(x) = ∅.
Combining the above two cases gives the desired result.
When s = 0, the subdifferential of G(b) takes a simple form. We see that 0 ∈ ∂G(b*) if and only if there exist u ∈ ∂|b| and v ∈ ∂|b + s| with b* = b − (u + v).