Article

Sparse Multicategory Generalized Distance Weighted Discrimination in Ultra-High Dimensions

1 Key Lab of Statistical Modeling and Data Analysis of Yunnan Province, Yunnan University, Kunming 650091, China
2 Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
3 Department of Medicine (Neurology), University of Alberta, Edmonton, AB T6G 2G1, Canada
* Authors to whom correspondence should be addressed.
Entropy 2020, 22(11), 1257; https://doi.org/10.3390/e22111257
Submission received: 30 September 2020 / Revised: 26 October 2020 / Accepted: 2 November 2020 / Published: 5 November 2020

Abstract

Distance weighted discrimination (DWD) is an appealing classification method that is capable of overcoming data piling problems in high-dimensional settings. Especially when various sparsity structures are assumed in these settings, variable selection in multicategory classification poses great challenges. In this paper, we propose a multicategory generalized DWD (MgDWD) method that maintains intrinsic variable group structures during selection using a sparse group lasso penalty. Theoretically, we derive minimizer uniqueness for the penalized MgDWD loss function and consistency properties for the proposed classifier. We further develop an efficient algorithm based on the proximal operator to solve the optimization problem. The performance of MgDWD is evaluated using finite sample simulations and miRNA data from an HIV study.

1. Introduction

Classification problems appear in diverse practical applications, such as spam e-mail classification, disease diagnosis and drug discovery, among many others (e.g., [1,2,3]). In these classification problems, the goal is to predict class labels based on a given set of variables. Recent research has focused extensively on linear classification: see [4,5] for comprehensive introductions. Among many linear classification methods, support vector machines (SVMs) (see [6,7]) and distance-weighted discrimination (DWD) (see [8,9,10]) are two commonly used large-margin based classification methods.
Owing to the recent advent of new technologies for data acquisition and storage, classification with high dimensional features, i.e., a large number of variables, has become a ubiquitous problem in both theoretical and applied scientific studies. Typically, only a small number of instances are available, a setting we refer to as high-dimensional, low-sample size (HDLSS), as in [11]. In the HDLSS setting, a so-called “data-piling” phenomenon is observed in [8] for SVMs, occurring when projections of many training instances onto a vector normal to the separating hyperplane are nearly identical, suggesting severe overfitting. DWD was originally proposed to overcome data-piling in the HDLSS setting. In binary classification problems, linear SVMs seek a hyperplane maximizing the smallest margin over all data points, while DWD seeks a hyperplane minimizing the sum of inverse margins over all data points. Reference [8] suggests replacing the inverse margins by the q-th power of the inverse margins in a generalized DWD method; see [12] for a detailed description. Formally, for a training data set $\{(y_i, X_i)\}_{i=1}^N$ of $N$ observations, where $X_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$, binary generalized linear DWD seeks a proper separating hyperplane $\{X : a + X^\top b = 0\}$ through the optimization problem
$$\operatorname*{argmin}_{a,\, b}\ \sum_{i=1}^N \frac{1}{d_i^q} \quad \text{s.t.} \quad d_i = y_i(a + X_i^\top b) + \eta_i \ge 0\ \forall i, \quad \eta_i \ge 0\ \forall i, \quad \sum_i \eta_i \le c, \quad \|b\|_2^2 = 1,$$ (1)
where $a$ and $b$ are the intercept and slope parameters, respectively. The slack variable $\eta_i$ is introduced to ensure that the corresponding margin $d_i$ is non-negative, and the constant $c > 0$ is a tuning parameter that controls the overlap between classes. Problem (1) can also be written in a loss-plus-penalty form (e.g., [12]) as
$$(\hat{a}, \hat{b}) = \operatorname*{argmin}_{a,\, b}\ \frac{1}{N}\sum_{i=1}^N \phi_q\{y_i(a + X_i^\top b)\} + \lambda\|b\|_2^2,$$ (2)
where
$$\phi_q(u) = \begin{cases} 1 - u, & \text{if } u \le Q, \\ \varphi_q(u), & \text{if } u > Q, \end{cases}$$ (3)
with $Q = q/(q+1)$, $q > 0$, and $\varphi_q(u) = (1 - Q)(Q/u)^q$. When $q = 1$, (1) becomes the standard DWD problem in [8], while problem (2) appears in [9,13].
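For concreteness, the generalized DWD loss and its first derivative can be evaluated with a few lines of Python; this is a minimal sketch in our own notation (the function names are not from the paper), with the derivative obtained by differentiating $\varphi_q$.

```python
import numpy as np

def phi_q(u, q=1.0):
    """Generalized DWD loss: 1 - u for u <= Q and (1 - Q) * (Q / u)**q for u > Q, with Q = q/(q+1)."""
    u = np.asarray(u, dtype=float)
    Q = q / (q + 1.0)
    return np.where(u <= Q, 1.0 - u, (1.0 - Q) * (Q / np.maximum(u, Q)) ** q)

def phi_q_prime(u, q=1.0):
    """First derivative of phi_q: -1 for u <= Q and -(Q / u)**(q + 1) for u > Q."""
    u = np.asarray(u, dtype=float)
    Q = q / (q + 1.0)
    return np.where(u <= Q, -1.0, -(Q / np.maximum(u, Q)) ** (q + 1))
```

With $q = 1$ this reproduces the standard DWD loss, namely $1 - u$ for $u \le 1/2$ and $1/(4u)$ otherwise.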
The binary classification problem (1) is well studied. However, in many applications such as image classification [1], cancer diagnosis [2] and speech recognition [3], to name a few, problems with more than two categories are commonplace. To solve these multicategory problems with the DWD classifier, approaches based on either formulation (1) or (2) are common. One common strategy is to extend problem (1) to multiple classes by solving a series of binary problems in a one-versus-one (OVO) or one-versus-rest (OVR) method (e.g., [14]). Instead of reducing the multicategory problem to a binary one, another strategy based on problem (1) considers all classes at once. As shown in [14], this approach generally works better than the OVO and OVR methods. Based on an extension of problem (2), [15] proposes multicategory DWD, written in a loss-plus-penalty form as
min a k , b k 1 N i = 1 N ϕ q a y i + X i b y i + λ k = 1 K b k 2 2 s . t . k = 1 K a k = 0 ; k = 1 K b j k = 0 , j = 1 , , p ,
with $y_i, k \in \{1, \dots, K\}$ and where $a_k$ and $b_k = (b_{1k}, \dots, b_{pk})^\top$ are the intercept and slope parameters for each category $k$, respectively. Although these methods can be applied to multicategory classification in the HDLSS setting, both problems (2) and (4) use the $L_2$ penalty and do not perform feature selection. As discussed in [16], for high dimensional classification, taking all features into consideration does not work well for two reasons. First, based on prior knowledge, only a small number of variables are relevant to the classification problem: a good classifier in high dimensions should have the ability to sparsely select important variables and discard redundant ones. Second, classifiers using all available variables in high-dimensional settings may have poor classification performance.
Much of the SVM literature has considered variable selection in high-dimensional classification problems to improve performance (e.g., [17,18,19]). Within the DWD literature, to the best of our knowledge, only [16] considered variable selection and classification simultaneously. Wang and Zou [16] considered an $L_1$ rather than an $L_2$ penalty in problem (2) to improve interpretability through sparsity in binary classification. Moreover, [16] made selections based on the strengths of input variables within individual classes but ignored the strengths of input variable groupings, thereby selecting more factors than necessary for each class. To overcome this weakness, in this paper we develop a multicategory generalized DWD method that is capable of performing variable selection and classification simultaneously. Our approach incorporates sparsity and group structure information via the sparse group lasso penalty (see [20,21,22,23,24]).
Although DWD is well studied, it is less popular than the SVM for binary classification, arguably for computational and theoretical reasons. For an up-to-date list of works on DWD, mostly focused on the $q = 1$ case, see [14,15]. Theoretical asymptotic properties of large-margin classifiers in high dimensional settings were studied in [25], and [26] derived an expression for the asymptotic generalization error. In terms of computation, [8] solved the standard DWD problem in (1) as a second-order cone programming (SOCP) problem using a primal-dual interior-point method that is computationally expensive when $N$ or $p$ is large. To overcome this computational bottleneck, [12] proposed an approach based on a novel formulation of the primal DWD model in (1); however, this method does not scale to large data sets and requires further work. Lam et al. [27] designed a new algorithm for large DWD problems with $q \ge 2$ and $K = 2$ based on convergent multi-block ADMM-type methods (see [28]). Wang and Zou [16] solved the lasso-penalized binary DWD problem by combining majorization–minimization and coordinate descent methods: the lasso penalty does not directly permit an SOCP solution. In fact, solution identifiability in the generalized DWD problem with $q > 1$ requires more constraints and remains an open research problem (see [8]). To the best of our knowledge, no work focusing on computational aspects of lasso-penalized multicategory generalized DWD (MgDWD) exists. The same holds for sparse group lasso-penalized MgDWD.
The theoretical and computational contributions of this paper are as follows. First, we establish the uniqueness of the minimizer in the population form of the MgDWD problem. Second, we prove a non-asymptotic L 2 estimation error bound for the sparse group lasso-regularized MgDWD loss function in the ultra-high dimensional setting under mild regularity conditions. Third, we develop a fast, efficient algorithm able to solve the sparse group lasso-penalized MgDWD problem using proximal methods.
The rest of this paper is organized as follows. In Section 2.1, we introduce the MgDWD problem with the sparse group lasso penalty. In Section 2.2 and Section 2.3, we establish theoretical properties of the population classifier and the regularized empirical loss. We propose a computational algorithm in Section 2.4. Section 3 illustrates the finite sample performance of our method through simulation studies and a real data analysis. Proofs of the major theorems are given in Appendix A.

2. Methodology

2.1. Model Setup

We begin with some basic set-up and notation. Consider the multicategory classification problem for a random sample $\{(y_i, X_i)\}_{i=1}^N$ of $N$ independent and identically distributed (i.i.d.) observations from some distribution $P(y, X)$. Here, $y$ is the categorical response taking values in $\mathcal{Y} = \{1, \dots, K\}$, and $X = (x_1, \dots, x_p)^\top \in \mathcal{X} \subseteq \mathbb{R}^p$ is the covariate vector. We wish to obtain a proper separating hyperplane $\{X \in \mathcal{X} \mid a_k + X^\top b_k = 0\}$ for each category $k \in \mathcal{Y}$, where $a_k$ and $b_k = (b_{1k}, \dots, b_{pk})^\top$ are intercept and slope parameters, respectively.
In this paper, we consider MgDWD with sparse group lasso regularization. That is, we estimate a classification boundary by solving the constrained optimization problem
$$\min_{a_k,\, b_k}\ \frac{1}{N}\sum_{i=1}^N \phi_q(a_{y_i} + X_i^\top b_{y_i}) + \lambda_1 \sum_{k=1}^K \sum_{j=1}^p |b_{jk}| + \lambda_2 \sum_{j=1}^p \sqrt{\sum_{k=1}^K b_{jk}^2} \quad \text{s.t.} \quad \sum_{k=1}^K a_k = 0; \ \sum_{k=1}^K b_{jk} = 0, \ j = 1, \dots, p,$$ (5)
where ϕ q is as defined in (3).
To approach this problem, we apply the concept of a “margin vector” to extend the definition of a (binary) margin to the multicategory case. Denote the margin vector of an observation $X_i$ as $F_i = (f_{i1}, \dots, f_{iK})^\top$, with $f_{ik} = a_k + X_i^\top b_k$ satisfying $\sum_{k=1}^K f_{ik} = 0$. Let $E_i = (e_{i1}, \dots, e_{iK})^\top$ be the class indicator vector with $e_{ik} = \mathbb{1}\{y_i = k\}$. The multicategory margin of the data point $(y_i, X_i)$ is then given as $f_{iy_i} = a_{y_i} + X_i^\top b_{y_i} = E_i^\top F_i$. Therefore, the MgDWD loss can be rewritten as
$$\phi_q(a_{y_i} + X_i^\top b_{y_i}) = \phi_q(E_i^\top F_i) = E_i^\top \phi_q(F_i) = \sum_{k=1}^K \mathbb{1}\{y_i = k\}\,\phi_q(a_k + X_i^\top b_k),$$ (6)
where $\phi_q(\cdot)$ is applied elementwise to the vector $F_i$.
Based on (6), Lemma 1 describes the Fisher consistency of the MgDWD loss.
Lemma 1.
Given $X = u$, the minimizer of the conditional expectation of (6) is $\tilde{F}(u) = \big(\tilde{f}_1(u), \dots, \tilde{f}_K(u)\big)^\top$, satisfying
$$\operatorname*{argmax}_{k \in \mathcal{Y}} \tilde{f}_k(u) = \operatorname*{argmax}_{k \in \mathcal{Y}} \Pr\{y = k \mid X = u\},$$
where
$$\tilde{f}_k(u) = \begin{cases} Q\left(\dfrac{\Pr\{y = k \mid X = u\}}{\Pr\{y = k_* \mid X = u\}}\right)^{\frac{1}{q+1}}, & k \ne k_*, \\[2ex] -\,Q\displaystyle\sum_{l \ne k_*}\left(\dfrac{\Pr\{y = l \mid X = u\}}{\Pr\{y = k_* \mid X = u\}}\right)^{\frac{1}{q+1}}, & k = k_*, \end{cases}$$
and $k_* = \operatorname*{argmin}_{k \in \mathcal{Y}} \Pr\{y = k \mid X = u\}$.
Consequently, $\tilde{f}_k(u)$ can be treated as an effective proxy of $\Pr\{y = k \mid X = u\}$ and, for any new observation $X$, a reasonable prediction of its label $y$ is
$$\hat{y} = \operatorname*{argmax}_{k \in \mathcal{Y}} \{a_k + X^\top b_k\}.$$
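As a small illustration of this prediction rule, the following sketch (our own helper, not part of the paper's software) computes the fitted margins and assigns labels by the argmax:

```python
import numpy as np

def predict_labels(X_new, a_hat, B_hat):
    """Assign each row x of X_new (n x p) to argmax_k {a_k + x^T b_k}.

    a_hat is the length-K intercept vector and B_hat the p x K slope matrix;
    labels are returned as 1, ..., K to match y.
    """
    F = a_hat[None, :] + X_new @ B_hat  # n x K matrix of fitted margins f_k
    return np.argmax(F, axis=1) + 1
```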
Speaking to the sparse group lasso (SGL) regularization in (5), the $L_1$ penalty encourages an element-wise sparse estimator that selects important variables for each category, indicated by $\hat{b}_{jk} \ne 0$. Assuming that parameters in different categories share the same information, we use an $L_2$ penalty to encourage a group-wise sparsity structure that removes covariates that are irrelevant across all categories, that is, where $\hat{\beta}_j = (\hat{b}_{j1}, \dots, \hat{b}_{jK})^\top = \mathbf{0}_K$. Specifically, let $\mathbf{x}_j = (x_{1j}, \dots, x_{Nj})^\top$ and $B = (b_{jk}) \in \mathbb{R}^{p \times K}$, where the $k$-th column $b_k$ is the slope vector for the category label $k$ and the $j$-th row $\beta_j$ is the group coefficient vector for the variable $\mathbf{x}_j$. If $\mathbf{x}_j$ is noise in the classification problem or is not relevant to category label $k$, then the entry $b_{jk}$ of $B$ should be shrunk to exactly zero. The SGL penalty of (5) can be written as a convex combination of the lasso and group lasso penalties in terms of $\beta_j$ as
$$\lambda_1 \sum_{k=1}^K \sum_{j=1}^p |b_{jk}| + \lambda_2 \sum_{j=1}^p \sqrt{\sum_{k=1}^K b_{jk}^2} = \lambda \sum_{j=1}^p \big\{\tau \|\beta_j\|_1 + (1 - \tau)\|\beta_j\|_2\big\},$$ (7)
where $\lambda > 0$ is the scale of the penalty and $\tau \in [0, 1]$ balances the element-wise and group-wise sparsity structures.
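The equivalence above is easy to check numerically; the sketch below (our own code) evaluates the SGL penalty row-wise for a given coefficient matrix $B$, with $\lambda_1 = \lambda\tau$ and $\lambda_2 = \lambda(1-\tau)$.

```python
import numpy as np

def sgl_penalty(B, lam, tau):
    """Sparse group lasso penalty lam * sum_j {tau * ||beta_j||_1 + (1 - tau) * ||beta_j||_2},
    where beta_j is the j-th row of the p x K matrix B."""
    l1 = np.abs(B).sum(axis=1)          # ||beta_j||_1 for each row j
    l2 = np.sqrt((B ** 2).sum(axis=1))  # ||beta_j||_2 for each row j
    return lam * np.sum(tau * l1 + (1.0 - tau) * l2)

# The same value in the (lambda_1, lambda_2) parameterization:
# lam1, lam2 = lam * tau, lam * (1 - tau)
# lam1 * np.abs(B).sum() + lam2 * np.sqrt((B ** 2).sum(axis=1)).sum()
```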

2.2. Population MgDWD

In this subsection, some basic results pertaining to unpenalized population MgDWD are given. These results are necessary for further theoretical analysis.
Denote the marginal probability mass of $y$ as $\Pr(y = k) = \pi_k$ with $\pi_k > 0$ and $\sum_{k=1}^K \pi_k = 1$, and the conditional probability density function of $X$ given $y = k$ by $g(X \mid y = k) = g_k(X)$. Let $\Theta = (\theta_1, \dots, \theta_K)$ be the collection of coefficient vectors $\theta_k = (a_k, b_k^\top)^\top$ for all labels and $Z = (1, X^\top)^\top$. The population version of the MgDWD problem in (6) is
$$L(\vartheta) = E\big[I(y)^\top \phi_q(\Theta^\top Z)\big] = \sum_{k=1}^K \pi_k \int_{\mathcal{X}} \phi_q(Z^\top \theta_k)\, g_k(x)\, dx,$$ (8)
where $\vartheta = \mathrm{vec}\{\Theta\}$ is the vectorization of the matrix $\Theta$ and $I(y) = (\mathbb{1}\{y = 1\}, \dots, \mathbb{1}\{y = K\})^\top$ is a random indicator vector. Denote the true parameter value $\vartheta^*$ as a minimizer of the population MgDWD problem, namely,
$$\vartheta^* \in \operatorname*{argmin}_{\vartheta \in \mathcal{C}} L(\vartheta),$$
where $\mathcal{C} = \big\{\vartheta \in \mathbb{R}^{K(p+1)} \mid \mathbf{C}\vartheta = \mathbf{0}_{p+1}\big\}$ is the set of sum-constrained $\vartheta$ with $\mathbf{C} = \mathbf{1}_K^\top \otimes \mathbf{I}_{p+1}$, where $\otimes$ denotes the Kronecker product.
To facilitate our theoretical analysis, we first define the gradient vector and Hessian matrix of the population MgDWD loss function. We then introduce some regularity conditions necessary to derive theoretical properties of this problem. Let $\mathrm{diag}\{v\}$ be a diagonal matrix constructed from the vector $v$, and let $\circ$ and $\oplus$ be the Hadamard product and the direct matrix sum, respectively. Denote the gradient vector of the population MgDWD loss function (8) as
$$S(\vartheta) = E\big[\{I(y) \circ \phi_q'(\Theta^\top Z)\} \otimes Z\big] = \mathrm{vec}\big\{(S_1, \dots, S_K)\big\},$$
with
$$S_k = E\big[\mathbb{1}\{y = k\}\,\phi_q'(Z^\top \theta_k)\, Z\big] = \pi_k \int_{\mathcal{X}} \phi_q'(Z^\top \theta_k)\, Z\, g_k(X)\, dX,$$
and its Hessian matrix as
$$H(\vartheta) = E\big[\mathrm{diag}\big\{I(y, X) \circ \varphi_q''(\Theta^\top Z)\big\} \otimes (ZZ^\top)\big] = \bigoplus_{k=1}^K H_k,$$
where $\varphi_q''$ denotes the second derivative of the function $\varphi_q$; $I(y, X) = I(y) \circ I(X)$ is a random vector with $I(X) = (\mathbb{1}\{X \in \mathcal{X}_1\}, \dots, \mathbb{1}\{X \in \mathcal{X}_K\})^\top$ and $\mathcal{X}_k = \{X \in \mathcal{X} \mid Z^\top \theta_k > Q\}$; and
$$H_k = E\big[\mathbb{1}\{y = k, X \in \mathcal{X}_k\}\,\varphi_q''(Z^\top \theta_k)\, ZZ^\top\big] = \pi_k \int_{\mathcal{X}_k} \varphi_q''(Z^\top \theta_k)\, ZZ^\top g_k(X)\, dX.$$
The block structure of H ( ϑ ) implies a parallel relationship between each category. The relationship between the θ k is reflected by the sum-to-zero constraint in the definition of C .
We assume the following regularity conditions.
(C1) The densities of $X$ given $y = k \in \mathcal{Y}$, i.e., the $g_k(X)$, are continuous and have finite second moments.
(C2) $0 < \Pr\{X \in \mathcal{X}_k^* \mid y = k\} < 1$ for all $k \in \mathcal{Y}$, where $\mathcal{X}_k^* = \{X \in \mathcal{X} \mid Z^\top \theta_k^* > Q\}$.
(C3) $\mathrm{Var}(X \mid X \in \mathcal{X}_k^*, y = k) \succ \mathbf{O}_p$ (i.e., is positive definite) for all $k \in \mathcal{Y}$.
Remark 1.
Condition (C1) ensures that $L$, $S$ and $H$ are well defined and continuous in $\vartheta$. For the theoretically optimal hyperplane $\{X \in \mathcal{X} \mid Z^\top \theta_k^* = 0\}$, the case with $\theta_k^* = \mathbf{0}_{p+1}$ leaves $X$ useless for classification. On the other hand, when $a_k^* \ne 0$ and $b_k^* = \mathbf{0}_p$, the hyperplane is the empty set and is similarly meaningless. Condition (C2) is proposed to avoid the case where $b_k^* = \mathbf{0}_p$, so that $\vartheta^*$ always contains information relevant to the classification problem. For bounded random variables, condition (C2) should be assumed with caution. Condition (C3) implies the positive definiteness of $H(\vartheta^*)$.
By convexity and the second-order Lagrange condition, the following theorem shows that the local minimizer of the population MgDWD problem exists and is unique.
Theorem 1.
Under the regularity conditions (C1)–(C3), the true parameter $\vartheta^* \in \mathcal{C}$ is the unique minimizer of $L(\vartheta)$, with $b_k^* \ne \mathbf{0}_p$ and
$$L(\vartheta^*) = \sum_{k=1}^K A(k, q)\,\pi_k,$$
with $0 \le u(k, q) \le A(k, q) \le v(k, q) \le 1$, where
$$A(k, q) = 1 - E\left[\mathbb{1}\{X \in \mathcal{X}_k^*\}\left\{1 - \left(\frac{Q}{Z^\top \theta_k^*}\right)^q\right\} \,\middle|\, y = k\right], \quad u(k, q) = \Pr\{X \notin \mathcal{X}_k^* \mid y = k\} + Q^{2q}\,\Pr\{Q < Z^\top \theta_k^* \le Q^{-1} \mid y = k\}, \quad v(k, q) = \Pr\{Z^\top \theta_k^* \le 1 \mid y = k\} + \inf_{\epsilon > 0}\left(\frac{Q}{1 + \epsilon}\right)^q \Pr\{Z^\top \theta_k^* > 1 + \epsilon \mid y = k\}.$$
The bounds in Theorem 1 show how $q$ affects the loss function $L(\vartheta^*)$. The upper bound $v(k, q)$ is a decreasing function of $q$ with
$$\lim_{q \to 0} v(k, q) = 1 \quad \text{and} \quad \lim_{q \to \infty} v(k, q) = \Pr\{Z^\top \theta_k^* \le 1 \mid y = k\}.$$
In the lower bound $u(k, q)$, the first term $\Pr\{X \notin \mathcal{X}_k^* \mid y = k\}$ is an increasing function of $q$ and the last term $Q^{2q}\,\Pr\{Q < Z^\top \theta_k^* \le Q^{-1} \mid y = k\}$ is a decreasing function of $q$, with
$$\lim_{q \to 0} u(k, q) = 1 \quad \text{and} \quad \lim_{q \to \infty} u(k, q) = \Pr\{Z^\top \theta_k^* \le 1 \mid y = k\}.$$
Consequently, for the given population $P(y, X)$, a larger $q$ encourages the population MgDWD estimator to focus more on the regions $\{X \notin \mathcal{X}_k^*, y = k\}$ that correspond to misclassifications. As a result, the estimator’s performance will be similar to that of the hinge loss as $q \to \infty$. Setting $q$ too small will lead to an ineffective classifier due to the unreasonable penalty placed on the well-classified region $\{X \in \mathcal{X}_k^*, y = k\}$. This variation in the lower bound with respect to $q$ provides a necessary condition for the existence of an optimal $q$.
Remark 2.
The explicit relationship between q and ϑ is complicated. While it may be more desirable to prove that a greater value of q results in a smaller value of the loss function L ( ϑ ) , there is no explicit formula for the optimal value ϑ in terms of q.

2.3. Estimator Consistency

Under the unpenalized framework presented in the previous subsection, all covariates will contribute to the classification task for each category: this scenario may lead to a classifier that overfits to the training data set. In this subsection, we study the consistency of the estimator for (5) in ultra-high dimensional settings.
To achieve structural sparsity in the estimator, the regularization parameter $\lambda$ in (7) must be large enough to dominate the gradient of the empirical MgDWD loss evaluated at the theoretical minimizer $\vartheta^* = \mathrm{vec}\{\Theta^*\}$ with high probability. On the other hand, $\lambda$ should also be as small as possible to reduce the bias incurred by the SGL regularization term
$$P(\beta) = \sum_{j=1}^p \big\{\tau \|\beta_j\|_1 + (1 - \tau)\|\beta_j\|_2\big\}.$$
Lemma 2 provides a suitable choice of λ under the following assumption.
(A1) The predictors $X = (x_1, \dots, x_p)^\top \in \mathbb{R}^p$ are independent sub-Gaussian random vectors satisfying $E(X) = \mathbf{0}_p$ and, with $\mathrm{Var}(X) = \Sigma$, there exists a constant $\kappa > 0$ such that for any $\gamma \in \mathbb{R}^p$, $E\exp(\gamma^\top \Sigma^{-1/2} X) \le \exp(\|\gamma\|_2^2 \kappa^2 / 2)$. From here on, we define $\varsigma_1^2$ as the largest eigenvalue of $\Sigma$.
Lemma 2.
Denote $S(\vartheta) = \frac{1}{N}(\mathbf{I}_K \otimes \mathbf{Z}^\top)\,\mathrm{diag}(\mathrm{vec}\{\mathbf{E}\})\,\mathrm{vec}\{\phi_q'(\mathbf{Z}\Theta)\}$, where $\mathbf{E} = (E_1, \dots, E_N)^\top$, $\mathbf{Z} = (Z_1, \dots, Z_N)^\top$ with $Z_i = (1, X_i^\top)^\top$, and $\mathbf{I}_K$ is the identity matrix of size $K$. Under condition (A1),
$$\tilde{P}\big\{\mathbf{P}\,S(\vartheta^*)\big\} \le \tau \Lambda_1 + (1 - \tau)\Lambda_2$$
with probability at least $1 - 2(Kp)^{1 - c_1^2} - p^{1 - c_2^2}$, where
$$\mathbf{P} = (\mathbf{I}_K - K^{-1}\mathbf{1}_K \mathbf{1}_K^\top) \otimes \mathbf{I}_{p+1}, \quad \Lambda_1 = \max\{\varsigma_1 \kappa, 1\}\, c_1 \sqrt{\left(1 - \frac{1}{K}\right)\frac{2\log(pK)}{N}}, \quad \Lambda_2 = \max\{2\sqrt{2}\,\varsigma_1 \kappa, 1\}\left\{c_2 \sqrt{\left(1 - \frac{1}{K}\right)\frac{2\log(p)}{N}} + \sqrt{\frac{K - 1}{N}}\right\},$$
for constants c 1 , c 2 > 1 .
It is difficult to obtain a closed form for the conjugate of the SGL penalty, say, $\bar{P}(v) = \sup_{u \in \mathcal{C}\setminus\{0\}} \langle u, v\rangle / P(u)$. Instead, we use a regularized upper bound $\tilde{P}(v) \ge \bar{P}(v)$. Based on Lemma 2, we propose the theoretical tuning parameter value
$$\lambda = c_0 \sqrt{\frac{\log(pK)}{N}},$$ (9)
where $c_0$ is some given constant satisfying $\lambda > \tau \Lambda_1 + (1 - \tau)\Lambda_2$.
Before we can derive an error bound for the estimator in (5), we impose two additional assumptions.
(A2) For the true parameter value $\vartheta^*$, there is an $(s_e, s_g)$-sparse structure in the coefficients $B^*$, with element-wise and group-wise support sets
$$\mathcal{E} = \big\{(j, k) \in \{1, \dots, p\} \times \{1, \dots, K\} \mid b_{jk}^* \ne 0\big\} \quad \text{and} \quad \mathcal{G} = \big\{j \in \{1, \dots, p\} \mid \beta_j^* \ne \mathbf{0}_K\big\}$$
with cardinality $|\mathcal{E}| = s_e$ and $|\mathcal{G}| = s_g$, respectively.
(A3) There exist some positive constants $\varsigma_2$ and $\varsigma_3$ such that
$$\varsigma_2^2 = \max_{\gamma \in \mathcal{V}} \frac{\|\mathrm{diag}\{\mathrm{vec}(\mathbf{E})\}(\mathbf{Z} \otimes \mathbf{I}_K)\gamma\|_2^2}{N \|\gamma\|_2^2} \quad \text{and} \quad \varsigma_3^2 = \min_{\gamma \in \mathcal{U}} \frac{\gamma^\top H(\vartheta^*)\gamma}{\gamma^\top \gamma}$$
with $\mathcal{V} = \big\{v \in \mathbb{R}^{K(p+1)} \mid 0 < \|v\|_0 \le s_e + K\big\}$ and
$$\mathcal{U} = \left\{\delta \in \mathbb{R}^{K(p+1)} \,\middle|\, \frac{\tau}{1 - \tau}\|\delta_{\mathcal{E}_+}\|_1 + \sum_{j \in \mathcal{G}_+} \|\delta_j\|_2 \ \ge\ C_0\left(\frac{\tau}{1 - \tau}\|\delta_{\mathcal{E}^c}\|_1 + \sum_{j \notin \mathcal{G}_+} \|\delta_j\|_2\right)\right\},$$
where $C_0 = (c_0 - 1)/(c_0 + 1)$, $\mathcal{E}^c$ is the complement of $\mathcal{E}$, $\mathcal{E}_+ = \mathcal{E} \cup \{l = 1 + (k - 1)(p + 1) \mid k = 1, \dots, K\}$, and $\mathcal{G}_+ = \mathcal{G} \cup \{0\}$.
Under the choice of λ given in (9), we show the L 2 -consistency of the estimator in (5).
Theorem 2.
Suppose that conditions (A1)–(A3) hold. Then, with $\lambda = c_0\sqrt{\log(pK)/N}$ in (5), we have that
$$\|\hat{\vartheta} - \vartheta^*\|_2 \le \big\{C_1\sqrt{s_e + K} + C_2\sqrt{s_g + 1}\big\}\sqrt{\frac{\log(pK)}{N}}$$
with probability at least $1 - 2(Kp)^{2(s_e + K)(1 - c_3^2)}$, where $C_1 = 2\varsigma_3^{-2}\{c_0\tau + (\sqrt{2} + 2c_3)\varsigma_2\}$ and $C_2 = 2\varsigma_3^{-2} c_0 (1 - \tau)$.
Remark 3.
The sub-Gaussian distribution assumption (A1) is common in high-dimensional scenarios. This assumption characterizes the tail behavior of a collection of random variables including Gaussian, Bernoulli, and bounded variables as special cases. Assumption (A2) describes structural sparsity at two levels. The element-wise size $s_e < pK$ is the size of the underlying generative model, and the group-wise size $s_g < p$ is the size of the signal covariate set. Both $s_e$ and $s_g$ are allowed to depend on the sample size $N$. As a result, the dimension $p$ is allowed to increase with the sample size $N$. Assumption (A3) guarantees that the relevant eigenvalues remain positive in this sparse scenario.
Remark 4.
In practice, the tuning parameters $\lambda$ and $\tau$ in (7) are commonly chosen by M-fold cross validation. That is, we choose the pair $(\tau, \lambda)$ with the highest prediction accuracy over the held-out sub-data sets $\mathcal{D}_m$, specifically maximizing
$$CV(\tau, \lambda) = \sum_{m=1}^M \sum_{i \in \mathcal{D}_m} \mathbb{1}\{y_i = \hat{y}_i(\tau, \lambda)\},$$
where $\hat{y}_i(\tau, \lambda) = \operatorname*{argmax}_{k \in \mathcal{Y}} Z_i^\top \hat{\theta}_k(\tau, \lambda)$.
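A minimal sketch of this cross-validation scheme is given below; `fit_sgl_mgdwd` is a hypothetical solver of (5) and is assumed to return the fitted $(\hat{\alpha}, \hat{B})$.

```python
import numpy as np

def cv_select(X, y, taus, lambdas, fit_sgl_mgdwd, M=5, seed=0):
    """Choose (tau, lambda) maximizing M-fold cross-validated accuracy.

    fit_sgl_mgdwd(X_train, y_train, tau, lam) is a hypothetical solver of (5)
    returning (a_hat, B_hat); labels y are assumed to lie in {1, ..., K}.
    """
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, M, size=len(y))  # random fold assignment
    best_pair, best_correct = None, -1
    for tau in taus:
        for lam in lambdas:
            correct = 0
            for m in range(M):
                train, test = folds != m, folds == m
                a_hat, B_hat = fit_sgl_mgdwd(X[train], y[train], tau, lam)
                y_hat = np.argmax(a_hat + X[test] @ B_hat, axis=1) + 1
                correct += int(np.sum(y_hat == y[test]))
            if correct > best_correct:
                best_pair, best_correct = (tau, lam), correct
    return best_pair
```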

2.4. Computational Algorithm

In this section, we propose an efficient algorithm to solve problem (5). Our approach uses the proximal algorithm (see [29]) for solving high dimensional regularization problems. In two main steps, this approach obtains a solution to the constrained optimization problem by applying the proximal operator to the solution to the unconstrained problem.
Since regularization is not needed for the intercept terms $\alpha = (a_1, \dots, a_K)^\top$, they can be separated from the coefficients in $B$. The empirical MgDWD loss of (8) is given by
$$L(\vartheta) = \frac{1}{N}\sum_{i=1}^N E_i^\top \phi_q(F_i) = \frac{1}{N}\mathrm{tr}\big\{\mathbf{E}^\top \phi_q(\mathbf{F})\big\} = \frac{1}{N}\mathrm{vec}\{\mathbf{E}\}^\top \mathrm{vec}\big\{\phi_q(\mathbf{F})\big\},$$
where $\mathbf{F} = (f_{ik})_{N \times K} = \mathbf{Z}\Theta = \mathbf{1}_N \alpha^\top + \mathbf{X}B$. Various properties of the loss function $L(\vartheta)$ follow below.
Lemma 3.
The loss function $L(\vartheta)$ has Lipschitz continuous partial derivatives. In particular, for $S(\alpha) = \frac{\partial L(\vartheta)}{\partial \alpha} = \frac{1}{N}\{\mathbf{E} \circ \phi_q'(\mathbf{F})\}^\top \mathbf{1}_N$ and any $u, v \in \mathbb{R}^K$, we have that
$$\|S(u) - S(v)\|_2 \le \frac{n_{\max}}{N}\frac{(q + 1)^2}{q}\|u - v\|_2,$$
where $n_{\max}$ is the largest group sample size. For $S(B) = \frac{\partial L(\vartheta)}{\partial B} = \frac{1}{N}\mathbf{X}^\top\{\mathbf{E} \circ \phi_q'(\mathbf{F})\}$ and any $U, V \in \mathbb{R}^{p \times K}$, we have that
$$\|\mathrm{vec}\{S(U) - S(V)\}\|_2 \le \frac{\max_k \|\mathrm{diag}(\mathbf{e}_k)\mathbf{X}\|_2^2}{N}\frac{(q + 1)^2}{q}\|\mathrm{vec}\{U - V\}\|_2,$$
where $\mathbf{e}_k$ is the $k$-th column of $\mathbf{E}$ and indicates the observations belonging to the $k$-th group.
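In matrix form, the two gradients in Lemma 3 and the corresponding Lipschitz constants can be computed directly; the sketch below is our own code, with the derivative of $\phi_q$ written inline and the spectral norm $\|\mathrm{diag}(\mathbf{e}_k)\mathbf{X}\|_2$ computed from the rows of class $k$.

```python
import numpy as np

def mgdwd_gradients(alpha, B, X, E, q=1.0):
    """Gradients S(alpha), S(B) of the empirical MgDWD loss and the Lipschitz constants.

    X is N x p, E is the N x K one-hot indicator matrix, alpha has length K, B is p x K.
    """
    N, K = E.shape
    Q = q / (q + 1.0)
    F = alpha[None, :] + X @ B                                   # N x K margin matrix
    dphi = np.where(F <= Q, -1.0, -(Q / np.maximum(F, Q)) ** (q + 1))
    G = E * dphi                                                 # E o phi_q'(F)
    S_alpha = G.sum(axis=0) / N                                  # (1/N) {E o phi_q'(F)}^T 1_N
    S_B = X.T @ G / N                                            # (1/N) X^T {E o phi_q'(F)}
    Lq = (q + 1.0) ** 2 / q                                      # Lipschitz constant of phi_q'
    L_alpha = Lq * E.sum(axis=0).max() / N                       # n_max (q+1)^2 / (N q)
    L_B = Lq * max(np.linalg.norm(X[E[:, k] > 0], 2) ** 2 for k in range(K)) / N
    return S_alpha, S_B, L_alpha, L_B
```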
Hence, following the majorization–minimization scheme, we can majorize the empirical MgDWD loss $L(\vartheta)$ by a quadratic function, that is,
$$L(\vartheta) \le L(\vartheta') + S(\alpha')^\top(\alpha - \alpha') + \frac{L_\alpha}{2}\|\alpha - \alpha'\|_2^2 + \mathrm{vec}\{S(B')\}^\top \mathrm{vec}\{B - B'\} + \frac{L_B}{2}\|\mathrm{vec}\{B - B'\}\|_2^2,$$
for any $(\alpha, B)$ and current point $(\alpha', B')$, where $L_\alpha$ and $L_B$ denote the Lipschitz constants in Lemma 3. Instead of minimizing $L(\vartheta)$ directly, we apply gradient descent to minimize its surrogate upper bound function. The gradient descent updates are given by
$$\tilde{\alpha} = \alpha - \frac{q}{(q + 1)^2\, n_{\max}}\{\mathbf{E} \circ \phi_q'(\mathbf{F})\}^\top \mathbf{1}_N,$$ (10)
$$\tilde{B} = B - \frac{q}{(q + 1)^2\, \max_k \|\mathrm{diag}(\mathbf{e}_k)\mathbf{X}\|_2^2}\,\mathbf{X}^\top\{\mathbf{E} \circ \phi_q'(\mathbf{F})\},$$ (11)
that is, $\tilde{\alpha} = \alpha - L_\alpha^{-1}S(\alpha)$ and $\tilde{B} = B - L_B^{-1}S(B)$.
Next, we address the problem’s constraints and regularization simultaneously by applying the proximal operator. For $\alpha$, it is clear that
$$\alpha^{\mathrm{new}} = \operatorname*{argmin}_{\alpha^\top \mathbf{1}_K = 0} \|\alpha - \tilde{\alpha}\|_2^2 = \mathbf{P}_K \tilde{\alpha},$$ (12)
where $\mathbf{P}_K = \mathbf{I}_K - K^{-1}\mathbf{1}_K \mathbf{1}_K^\top$. For $B = (\beta_1, \dots, \beta_p)^\top$, the minimization problem can be expressed as
$$B^{\mathrm{new}} = \operatorname*{argmin}_{B\mathbf{1}_K = \mathbf{0}_p}\ \frac{1}{2}\|\mathrm{vec}\{B - \tilde{B}\}\|_2^2 + \frac{\lambda_1}{L_B}\|\mathrm{vec}\{B\}\|_1 + \frac{\lambda_2}{L_B}\|\mathrm{vec}\{B\}\|_{1,2} = \operatorname*{argmin}_{B\mathbf{1}_K = \mathbf{0}_p}\ \sum_{j=1}^p \left\{\frac{1}{2}\|\beta_j - \tilde{\beta}_j\|_2^2 + \frac{\lambda_1}{L_B}\|\beta_j\|_1 + \frac{\lambda_2}{L_B}\|\beta_j\|_2\right\},$$ (13)
which implies that we can implement minimization for p groups in parallel. The following theorem provides the solution to (13).
Theorem 3.
Let $\rho_1, \rho_2 \ge 0$ and $\tilde{\beta} \in \mathbb{R}^K$. Then the constrained regularization problem
$$\min_{\beta \in \mathbb{R}^K}\ \frac{1}{2}\|\beta - \tilde{\beta}\|_2^2 + \rho_1\|\beta\|_1 + \rho_2\|\beta\|_2 \quad \text{s.t.} \quad \beta^\top \mathbf{1}_K = 0$$
has a solution of the form
$$\beta^\star = \left(1 - \frac{\rho_2}{\|\mathbf{P}_K(\tilde{\beta} - \rho_1 u)\|_2}\right)_+ \mathbf{P}_K(\tilde{\beta} - \rho_1 u)$$ (14)
for some $u \in \partial\|\beta^\star\|_1$.
In the special case with $\rho_2 = 0$, the constrained regularization problem in Theorem 3 reduces to the constrained lasso problem, with solution $\check{\beta} = \mathbf{P}_K(\tilde{\beta} - \rho_1 u)$. Combined with (14), the proximal operator $U$, given by
$$\beta^\star = U(\check{\beta}, \rho_2) = \left(1 - \frac{\rho_2}{\|\check{\beta}\|_2}\right)_+ \check{\beta},$$ (15)
can be introduced to realize the group sparsity of $\check{\beta}$.
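The operator $U$ amounts to group-wise soft thresholding; a short sketch (our code), applied here row-wise to a coefficient matrix:

```python
import numpy as np

def group_soft_threshold(B, rho2):
    """Operator U of (15), applied to each row beta_j of B:
    scale by (1 - rho2 / ||beta_j||_2)_+ , zeroing rows with small norm."""
    norms = np.sqrt((B ** 2).sum(axis=1, keepdims=True))
    scale = np.maximum(1.0 - rho2 / np.maximum(norms, 1e-12), 0.0)
    return scale * B
```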
For the standard (unconstrained) lasso problem, the subgradient $u$ has a closed form, giving $\check{\beta} = \tilde{\beta} - \rho_1 u = S(\tilde{\beta}, \rho_1)$, with $S(u, v) = \mathrm{sign}(u) \circ (|u| - v)_+$. However, under the sum-to-zero constraint, the naive solution $\mathbf{P}_K S(\tilde{\beta}, \rho_1)$ is misleading in that it satisfies the constraint but does not achieve shrinkage, let alone loss function minimization. The term $\mathbf{P}_K u$ is suggestive of the intersection between the subdifferential set $\partial\|\beta\|_1$ and the constraint set $\{\beta \in \mathbb{R}^K \mid \beta^\top \mathbf{1}_K = 0\}$; in this sense, $\check{\beta}$ might not have a closed form. Here we consider using coordinate descent to solve the constrained lasso problem. For some fixed coordinate $m$, since $\beta^\top \mathbf{1}_K = 0$, we have that $b_m = -\sum_{l \ne m} b_l$. Rewriting the objective function of the lasso-constrained problem in a coordinate-wise form, we obtain
$$\sum_{l=1}^K \left\{\frac{1}{2}(b_l - \tilde{b}_l)^2 + \rho_1 |b_l|\right\} = \left\{b_k - \frac{\tilde{b}_k - \tilde{b}_m}{2} + \frac{1}{2}\sum_{l \ne k, m} b_l\right\}^2 + \rho_1\left\{|b_k| + \Big|b_k + \sum_{l \ne k, m} b_l\Big|\right\} + \frac{1}{4}\left(\tilde{b}_k + \tilde{b}_m + \sum_{l \ne k, m} b_l\right)^2 + \sum_{l \ne k, m} \left\{\frac{1}{2}(b_l - \tilde{b}_l)^2 + \rho_1 |b_l|\right\}.$$ (16)
Next, Theorem 4 provides the solution to the optimization problem (16).
Theorem 4.
Suppose that $t, s \in \mathbb{R}$ and $\varrho \ge 0$. Then the regularization problem
$$\min_{b \in \mathbb{R}}\ \frac{1}{2}(b - t)^2 + \varrho\{|b| + |b + s|\}$$
has solution
$$b^\star = \begin{cases} t, & |t| < C(s, t), \\ \mathrm{sign}(t)\,C(s, t), & C(s, t) \le |t| \le C(s, t) + 2\varrho, \\ \mathrm{sign}(t)(|t| - 2\varrho), & |t| > C(s, t) + 2\varrho, \end{cases} \;=\; t - S\{t, C(s, t)\} + S\big[S\{t, C(s, t)\}, 2\varrho\big],$$
where $C(s, t) = \frac{1 - \mathrm{sign}(s)\,\mathrm{sign}(t)}{2}|s|$.
By Theorem 4, for a given $m \in \{1, \dots, K\}$, the coordinate-wise minimizer for any $k \ne m$ can be expressed through the proximal operator
$$b_k^{\mathrm{new}} = T(t, s, \rho_1) = t - S\{t, C(s, t)\} + S\big[S\{t, C(s, t)\}, 2\rho_1\big],$$ (17)
with $s = \sum_{l \ne k, m} b_l$ and $t = (\tilde{b}_k - \tilde{b}_m - s)/2$. If we fix $m$ during the iteration, then the shrinkage of $b_m$ will only be indirectly reflected in the other $b_k$. We therefore let $m$ change with $k$ in the coordinate-wise minimization process to ensure that every coordinate can be equally shrunk. We summarize our proposed algorithm in Algorithm 1.
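The scalar operators $S$ and $T$ are equally short to implement; the sketch below (our code) follows Theorem 4, with the final threshold taken as $2\varrho$.

```python
import numpy as np

def soft(u, v):
    """Soft-thresholding S(u, v) = sign(u) * (|u| - v)_+ ."""
    return np.sign(u) * np.maximum(np.abs(u) - v, 0.0)

def coord_prox_T(t, s, rho):
    """Minimizer of 0.5 * (b - t)^2 + rho * (|b| + |b + s|) from Theorem 4."""
    C = 0.5 * (1.0 - np.sign(s) * np.sign(t)) * np.abs(s)
    return t - soft(t, C) + soft(soft(t, C), 2.0 * rho)
```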
Algorithm 1: Proximal gradient descent algorithm for SGL-MgDWD.
Input: $\lambda_1$, $\lambda_2$.
Initialization: $\alpha^{(0)} = \mathbf{0}_K$, $B^{(0)} = \mathbf{O}_{p \times K}$, $l = 0$.
1: repeat
2:     Update $\alpha$ according to (10) and (12): $\alpha^{(l+1)} = \mathbf{P}_K\{\alpha^{(l)} - L_\alpha^{-1}S(\alpha^{(l)})\}$.
3:     Update $\tilde{B}$ according to (11): $\tilde{B} = B^{(l)} - L_B^{-1}S(B^{(l)})$.
4:     Set $B^{(l+1)} \leftarrow \tilde{B}$.
5:     repeat
6:         for $m = 1$ to $K$ do
7:             for $k$ in $\{1, \dots, K\} \setminus \{m\}$ do
8:                 Update $(t, s)$: $s = \sum_{r=1}^K b_r^{(l+1)} - b_k^{(l+1)} - b_m^{(l+1)}$, $t = (\tilde{b}_k - \tilde{b}_m - s)/2$.
9:                 Update $b_k^{(l+1)}$ according to (17) and $b_m^{(l+1)}$ by the constraint: $b_k^{(l+1)} = T(t, s, L_B^{-1}\lambda_1)$, $b_m^{(l+1)} = -s - b_k^{(l+1)}$.
10:            end for
11:        end for
12:    until $B^{(l+1)}$ converges.
13:    Update $B^{(l+1)}$ according to (15), applied row-wise: $B^{(l+1)} = U(B^{(l+1)}, L_B^{-1}\lambda_2)$.
14:    Set $l \leftarrow l + 1$.
15: until a convergence criterion is met.
Output: $\alpha^{(l)}$ and $B^{(l)}$.

3. Numerical Analysis

In the following section, we use both simulated and real data sets to evaluate the finite sample properties of our proposed method. We compare the finite sample performance of SGL-MgDWD with L 1 -regularized multinomial logistic regression ( L 1 -logistic).

3.1. Simulation Studies

The data are generated from the following model. Consider the K-category classification problem where $\pi_k = K^{-1}$ and $g_k(X)$ is the density function of a normal distribution with mean vector $\mu_k = (\mu_{1k}, \mu_{2k}, \mathbf{0}_{p-2}^\top)^\top$ and covariance matrix $\mathbf{I}_p$, where $(\mu_{1k}, \mu_{2k}) = (2\cos(\pi r_k), 2\sin(\pi r_k))$ with $r_k = 2(k - 1)/K$, for $k = 1, \dots, K$. In this model, only the first two variables contribute to the classification and their corresponding parameter vectors $\beta_1$ and $\beta_2$ form two groups of coefficients. The true model has the sparsity structure $(s_e, s_g) = (2K, 2)$ out of a total of $K(p + 1)$ coefficients. We set the sample size for each category to $n_k = 50, 100, 200$, and $400$, and the number of classes to $K = 5$ and $11$. We consider dimensionality $p = 100$ and $1000$.
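For reference, data from this simulation model can be generated with a few lines of Python; this is a sketch under the stated design, and the helper name is ours.

```python
import numpy as np

def simulate_data(n_k, K, p, seed=0):
    """Draw n_k observations per class from N(mu_k, I_p), where only the first two
    coordinates of mu_k are non-zero and lie on a circle of radius 2."""
    rng = np.random.default_rng(seed)
    X_blocks, y_blocks = [], []
    for k in range(1, K + 1):
        r_k = 2.0 * (k - 1) / K
        mu = np.zeros(p)
        mu[0], mu[1] = 2.0 * np.cos(np.pi * r_k), 2.0 * np.sin(np.pi * r_k)
        X_blocks.append(rng.standard_normal((n_k, p)) + mu)
        y_blocks.append(np.full(n_k, k))
    return np.vstack(X_blocks), np.concatenate(y_blocks)
```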
In what follows, we compare the proposed SGL-MgDWD method with the OVR method based on SGL-MgDWD with K = 2 (OVR-SGL-gDWD). For SGL-MgDWD, logistic regression and OVR, the tuning parameter λ is optimized over a discrete set by minimizing the prediction error using 5-fold cross validation. In each simulation, we conduct 100 runs and use a testing set of equal size to evaluate each method’s performance using the following criteria:
  • Testing set accuracy, measuring the rate of correct classification;
  • Signal, the average number of correctly-selected element-wise and group-wise signals, that is, truly non-zero components with $\hat{b}_{jk} \ne 0$ and truly non-zero groups with $\hat{\beta}_j \ne \mathbf{0}_K$, denoted by the pair $(s_e^+, s_g^+)$;
  • Noise, the average number of incorrectly-selected element-wise and group-wise components, that is, components with $b_{jk}^* = 0$ and groups with $\beta_j^* = \mathbf{0}_K$ that are nonetheless selected, denoted by the pair $(n_e^+, n_g^+)$.
Simulation results are summarized in Table 1 and Table 2.
As shown in Table 1 and Table 2, the proposed SGL-MgDWD method performs better than the $L_1$-logistic and OVR methods. Specifically, in each scenario, predictions from the SGL-MgDWD method had higher accuracy relative to the other two methods. Similarly, the SGL-MgDWD method correctly selected the signal components of the model with fewer incorrectly-selected noise components, again relative to the $L_1$-logistic and OVR methods. These simulation results also demonstrate that test accuracy increases with increasing sample size $n_k$ and decreases with higher dimension $p$ at fixed $n_k$, which is consistent with the derived theoretical properties. All computations were performed with TensorFlow 2.3 on the CPU of a Threadripper 2950X at 4.1 GHz.

3.2. HIV Data Analysis

Symptomatic distal sensory polyneuropathy (sDSP) is a common debilitating condition among people with HIV. This condition leads to neuropathic pain and is associated with substantial comorbidities and increased health care costs. Plasma miRNA profiles show differences between HIV patients with and without sDSP, and several miRNA biomarkers are reported to be associated with the presence of sDSP in HIV patients (see [30]). The corresponding binary classification problem was analyzed in [30] using random forest classifiers. However, the HIV data set can be further classified into four classes. The HIV data set has 1715 miRNA measures for 40 patients and is partitioned into four groups ($K = 4$) with $n_k = 10$ patients in each category: non-HIV, HIV with no brain damage (HIVNBD), HIV with brain damage but stable (HIVBDS), and HIV with brain damage and unstable (HIVBDU). In the following analysis, we apply our proposed method to this classification problem. The primary aim was to identify critical miRNA biomarkers for each of the four groups. Beyond achieving a finer classification, this analysis is helpful in assessing related pathogenic effects for each patient group.
Given the small sample size of N = 40 , we chose the tuning parameter λ by maximizing leave-one-out cross validation accuracy. We fixed ( q , τ ) = ( 1 , 0.1 ) . Table 3 shows the signal for coefficient estimates obtained from the SGL-MgDWD method using the selected λ . We conclude that there are 22 critical miRNA biomarkers important to the classification problem. In particular, the biomarkers miR-25-star, miR-3171, miR-3924 and miR-4307 are not relevant to the non-HIV group; miR-4641, miR-4655-3p and miR-660 are not relevant to the HIVNBD group; miR-217 and miR-4683 are not relevant to the HIVBDS group; and miR-217 and miR-4307 are not relevant to the HIVBDU group.

Author Contributions

Conceptualization, L.K. and N.T.; Methodology, T.S., L.K. and N.T.; Formal Analysis, Y.L.; Data Curation, W.G.B., E.A. and C.P.; Writing—Review & Editing, Y.W., B.J. and L.K.; Supervision, B.J., L.K. and N.T. All authors have read and agreed to the published version of the manuscript.

Funding

A Canadian Institutes of Health Research Team Grant and Canadian HIV-Ageing Multidisciplinary Programmatic Strategy (CHAMPS) in NeuroHIV (Christopher Power) supported these studies. Bei Jiang and Linglong Kong were supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). Christopher Power and Linglong Kong were supported by Canada Research Chairs in Neurological Infection and Immunity and Statistical Learning, respectively. Niansheng Tang was supported by grants from the National Natural Science Foundation of China (grant number: 11671349) and the Key Projects of the National Natural Science Foundation of China (grant number: 11731011).

Acknowledgments

The authors are thankful for the invitation of the two guest editors, Farouk Nathoo and Ejaz Ahmed. This work has also benefited from two anonymous reviewers’ constructive comments and valuable feedback. The authors also thank Matthew Pietrosanu for his great help with editing.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs

Appendix A.1. Proof of Lemma 1

Proof. 
For simplicity, we write $p_j = \Pr(y = j \mid X)$ and $f_k = f_k(X)$. Using the Lagrange multiplier method, we define
$$\mathcal{L}(F) = E\left[\sum_{k=1}^K \mathbb{1}\{y = k\}\,\phi_q\{f_k(X)\} \,\middle|\, X = u\right] + \mu\,\mathbf{1}_K^\top F(X) = \sum_{k=1}^K \big\{p_k \phi_q(f_k) + \mu f_k\big\}.$$
Then for each $k$,
$$\frac{\partial \mathcal{L}(F)}{\partial f_k} = \phi_q'(f_k)\, p_k + \mu = 0$$ (A1)
with
$$\phi_q'(f_k) = \begin{cases} -1, & f_k \le Q, \\ -\left(\dfrac{Q}{f_k}\right)^{q+1}, & f_k > Q. \end{cases}$$
Without loss of generality, assume that $p_1 > p_2 \ge p_3 \ge \dots \ge p_{K-1} > p_K$. Note that $-1 \le \phi_q' < 0$, and so $p_k \ge -\phi_q'(f_k)\,p_k = \mu > 0$, with $\mu = p_k$ if and only if $f_k \le Q$.
If $\mu < p_K \le p_k$ for all $k$, then $\mu \ne p_k$ for every $k$, which implies that $f_k > Q$ for all $1 \le k \le K$. Hence, substituting $\phi_q'(f_k) = -(Q/f_k)^{q+1}$ into (A1) yields
$$f_k = Q\left(\frac{p_k}{\mu}\right)^{\frac{1}{q+1}} > Q > 0.$$
However, then $\sum_{k=1}^K f_k > 0$, contradicting the sum-to-zero constraint. Therefore, $\mu = p_K < p_k$ for $k < K$ and the result follows. □

Appendix A.2. Proof of Theorem 1

Lemma A1.
Under (C1), L ( ϑ ) exists, and it is convex on ϑ .
Proof. 
The existence of L ( ϑ ) will be satisfied if
E X | y | ϕ q ( Z θ k ) | | y = k = X | ϕ q ( Z θ k ) | g k ( X ) d X < .
We divide X into two disjoint subsets. Defining X k = { X X Z θ k > Q } , it is clear that
X k | ϕ q ( Z θ k ) | g k ( X ) d X ( q + 1 ) 1 X k g k ( X ) d X < .
Note that 0 < ϕ q ( u ) < ( 1 + q ) 1 < 1 when u > Q . On the other hand, for X k c = { X X Z θ k Q } ,
X k c | ϕ q ( Z θ k ) | g k ( X ) d X | 1 a k | + j = 1 p b j k X | x j | g k ( X ) d X < ,
if E X | y | x j | | y = k < for all k Y . This completes the proof of the existence of L ( ϑ ) .
Recall that
L ( ϑ ) = k = 1 K π k X ϕ q ( Z θ k ) g k ( X ) d X ,
where ϕ q ( u ) is a convex function of u, so its composition with the affine mapping u = Z θ k is still convex in θ k . Clearly, g k ( X ) , π k > 0 , so the non-negatively-weighted integral and sum both preserve convexity. □
Lemma A2.
Existence of minimizers of L ( ϑ ) on C = ϑ R K ( p + 1 ) | C ϑ = 0 K , where C = 1 K I p + 1 .
Proof. 
By Jensen’s inequality, for any ϑ C , we have that
L ( ϑ ) ϕ q k = 1 K π k E { Z θ k | y = k } .
Let μ = vec ( π k E { z j | y = k } ) j k , where μ 2 k = 1 K π k 2 1 2 K 1 2 > 0 . For some C > 0 , we have that
L ( ϑ ) ϕ q ( μ ϑ ) = 𝟙 { μ ϑ < Q } ( 1 μ ϑ ) + 𝟙 { μ ϑ Q } φ q ( μ ϑ ) 𝟙 { μ ϑ < Q } | 1 | μ ϑ | | = 𝟙 { μ ϑ < ( C + 1 ) } ( | μ ϑ | 1 ) + 𝟙 { ( C + 1 ) < μ ϑ < 1 } ( | μ ϑ | 1 ) + 𝟙 { 1 < μ ϑ < Q } ( 1 | μ ϑ | ) > 𝟙 { μ 2 ϑ 2 > C + 1 } C = 𝟙 ϑ 2 > C + 1 μ 2 C .
Note that 1 μ ϑ > 1 Q > 0 when μ ϑ < Q . By the Cauchy–Schwarz inequality, μ ϑ = | μ ϑ | μ 2 ϑ 2 .
Hence, if ϑ 2 > C + 1 μ 2 > 0 , then L ( ϑ ) > C > 0 . The contrapositive of this result implies the existence of a minimizer in the unconstrained problem. That is, the closed set ϑ C | L ( ϑ ) C is bounded for some large enough C. This guarantees the existence of a solution, as desired. □
Lemma A3.
Under (C1), S ( ϑ ) exists and
L ( ϑ ) ϑ = S ( ϑ ) .
Proof. 
The existence of S ( ϑ ) will follow if
X | ϕ q ( Z θ k ) z j | π k g k ( X ) d X π k X | z j | g k ( X ) d X <
for j = 1 , , p + 1 . Note that | ϕ q ( u ) | 1 when u > Q .
For every θ k j R , ϕ q ( Z θ k ) is a Lebesgue integrable function of X . For any u R , ϕ q ( u ) exists and | ϕ q ( u ) | 1 . Hence, by the Leibniz integral rule, we have that
θ j k X ϕ q ( Z θ k ) π k g k ( X ) d X = X ϕ q ( Z θ k ) θ j k π k g k ( X ) d X = X ϕ q ( Z θ k ) z j π k g k ( X ) d X
and for any l k ,
θ j l X ϕ q ( Z θ k ) π k g k ( X ) d X = 0 ,
which is sufficient to show that
L ( ϑ ) ϑ = S ( ϑ ) .
Lemma A4.
Suppose (C1) is satisfied. Then (C2) implies that b k 0 .
Proof. 
We can rewrite ϕ q ( u ) as
$$\phi_q(u) = \mathbb{1}\{u \le Q\}(1 - u) + \mathbb{1}\{u > Q\}(1 - Q)\left(\frac{Q}{u}\right)^q = \left[-\mathbb{1}\{u \le Q\} - \mathbb{1}\{u > Q\}\left(\frac{Q}{u}\right)^{q+1}\right] u + \mathbb{1}\{u \le Q\} + \mathbb{1}\{u > Q\}\left(\frac{Q}{u}\right)^q = \phi_q'(u)\, u + \mathbb{1}\{u \le Q\} + \mathbb{1}\{u > Q\}\left(\frac{Q}{u}\right)^q.$$
Then for any γ R p + 1 and its corresponding X k = { X X | Z γ > Q } , we have that
E { 𝟙 { y = k } ϕ q ( Z γ ) } = E 𝟙 { y = k } ϕ q ( Z γ ) Z γ + E 𝟙 { y = k , Z γ Q } + E 𝟙 { y = k , Z γ > Q } Q Z γ q = S k ( γ ) γ + Pr { y = k , X X k } + E 𝟙 { y = k , X X k } Q Z γ q = S k ( γ ) γ + π k 1 E 𝟙 { X X k } 1 Q Z γ q | y = k .
Let ϑ C be a local minimizer. It follows that P S ( ϑ ) = 0 and k = 1 K S k ( θ k ) θ k = S ( ϑ ) ϑ = 0 since ϑ = P ϑ and P = I K K 1 1 K 1 K I p + 1 . Therefore,
L ( ϑ ) = E 𝟙 { y = k } ϕ q ( Z θ k ) = k = 1 K π k 1 E 𝟙 { X X k } 1 Q Z θ k q | y = k = k = 1 K π k 1 Pr { X X k | y = k } E 1 Q Z θ k q | y = k , X X k .
For any γ R p + 1 and its corresponding X k = { X X | Z γ > Q } , we always have that
0 < E Q Z γ q | y = k , X X k < 1 .
If γ = 0 p + 1 , then X k = so that Pr { y = k , X X k } = π k and Pr { y = k , X X k } = 0 . If γ 1 Q and γ / 1 = 0 p , then X k = , giving the same conclusions as the previous case. If γ 1 > Q and γ / 1 = 0 p , then X k = X so that Pr { y = k , X X k } = 0 and Pr { y = k , X X k } = π k . Consequently, when 0 < Pr { X X k | y = k } < 1 , then neither X k nor X equal ⌀, so b k 0 follows.
Note that Pr { X X k | y = k } > 0 implies that Pr { 0 < Z γ Q | y = k } > 0 or Pr { Z γ 0 | y = k } > 0 , and so special attention should be paid to bounded random variables. □
Lemma A5.
Under (C1), H ( ϑ ) exists and
2 L ( ϑ ) ϑ ϑ = H ( ϑ ) .
Furthermore, H ( ϑ ) O K ( p + 1 ) when (C2) and (C3) hold.
Proof. 
The existence of H ( ϑ ) follows if its all entries are absolutely integrable, that is, for any j , k = 1 , , p + 1 ,
X | 𝟙 { Z θ k > Q } φ q ( Z θ k ) z j z l | π k g k ( X ) d X    ( q + q 1 + 2 ) X k c | z j z l | g k ( X ) d X    < .
Equivalently, the result follows if E X | y | z j z l | | y = k < for all k Y . Note that 0 < φ q ( u ) q + q 1 + 2 when u > Q .
Let η be a test function belonging to the Schwartz space D . Then η D with some support denoted by supp ( η ) .
Clearly, ϕ q ( u ) is not differentiable at Q but is Lipschitz continuous. Therefore, the measurable function S k ( θ k ) is a locally integrable function of θ k . Then the (regular) generalized functions S k ( θ k ) belong to the dual space of D .
For the distributional derivative of S k ( θ k ) with respect to θ j k , we have that
S k ( θ k ) θ j k , η ( θ j k ) = S k ( θ k ) , d η ( θ j k ) d θ j k R S k ( θ k ) η ( θ j k ) d θ j k max θ j k supp ( η ) | η ( θ j k ) | supp ( η ) | S k ( θ k ) | d θ j k <
implying that the function f ( θ j k , X ) = ϕ q ( Z θ k ) Z π k g k ( X ) η ( θ j k ) is integrable on R × X . Therefore, by Fubini’s Theorem,
S k ( θ k ) θ j k , η ( θ j k ) = S k ( θ k ) , d η ( θ j k ) d θ j k = X ϕ q ( Z θ k ) Z π k g k ( X ) , d η ( θ j k ) d θ j k d X = X ϕ q ( Z θ k ) θ j k Z π k g k ( X ) , η ( θ j k ) d X = E ϕ q ( Z θ k ) θ j k Z 𝟙 { y = k } , η ( θ j k ) ,
which implies that
S k ( θ k ) θ j k = E ϕ q ( Z θ k ) θ j k Z 𝟙 { y = k } .
Recall that ϕ q can be written as
ϕ q ( u ) = φ q ( u ) 𝟙 { u > Q } + ( 1 ) 𝟙 { u Q } = ( φ q ( u ) + 1 ) 𝟙 { u > Q } 1 ,
which contains a Schwartz product between the differentiable function φ q ( u ) and the generalized function 𝟙 { u > Q } . Note that
𝟙 { Z θ k > Q } = 𝟙 { z j > 0 , θ j k > c j k } + 𝟙 { z j 0 , θ j k c j k } = ( 2 𝟙 { z j > 0 } 1 ) 𝟙 { θ j k > c j k } + ( 1 𝟙 { z j > 0 } ) = sign ( z j ) 𝟙 { θ j k > c j k } + 𝟙 { z j 0 } ,
where c j k = ( Q l j z l θ l k ) / z j and
𝟙 { Z θ k > Q } θ j k + 0 = sign ( z j ) δ ( θ j k c j k ) = sign ( z j ) | z j | δ ( Z θ k Q ) = z j δ ( Z θ k Q ) ,
where δ ( x ) is the Dirac delta function and the distributional derivative of 𝟙 { x > 0 } . Recall that δ ( c x ) = δ ( x ) / | c | and f ( x ) δ ( x c ) = f ( c ) δ ( x c ) for some constant c and function f.
Thus, by the product rule for the distributional derivative of the Schwartz product,
ϕ q ( Z θ k ) θ j k = ( φ q ( Z θ k ) + 1 ) θ j k 𝟙 { Z θ k > Q } + ( φ q ( Z θ k ) + 1 ) 𝟙 { Z θ k > Q } θ j k = φ q ( Z θ k ) z j 𝟙 { Z θ k > Q } + ( φ q ( Z θ k ) + 1 ) z j δ ( Z θ k Q ) = φ q ( Z θ k ) z j 𝟙 { Z θ k > Q } .
Substituting the above expression, we obtain
S k ( θ k ) θ j k = E φ q ( Z θ k ) Z z j 𝟙 { Z θ k > Q } 𝟙 { y = k } .
Similarly, for l k , we have the distributional derivative
S k ( θ k ) θ j l = 0 .
Recall that the distributional derivative does not depend on the order of differentiation and agrees with the classical derivative whenever the latter exists. To summarize, we have that
H k ( θ k ) = 2 L ( ϑ ) θ k θ k = S k ( θ k ) θ k , H ( ϑ ) = k = 1 K H k ( θ k ) .
The H k ( θ k ) are symmetric matrices, so H ( ϑ ) is also symmetric.
In the sense of generalized functions, differentiation is a continuous operation with respect to convergence in D . Therefore, ϕ 0 = lim q 0 ϕ q = 𝟙 { u 0 } and ϕ 0 = lim q 0 ϕ q = δ ( u ) ; ϕ = lim q ϕ q = 𝟙 { u 1 } and ϕ = lim q ϕ q = δ ( u 1 ) , which coincides with results from the hinge loss.
Next, H ( ϑ ) O K ( p + 1 ) if and only if both H 1 ( θ 1 ) and its Schur complement k = 2 K H k ( θ k ) are both symmetric and positive definite. We can deduce that H ( ϑ ) O K ( p + 1 ) if and only if H k ( θ k ) O p + 1 for all k.
Note that there exists c > 0 such that φ q ( Z θ k ) c on X k . Then for any γ R p + 1 ,
γ H k ( θ k ) γ = π k X k φ q ( Z θ k ) ( Z γ ) 2 g k ( X ) d X c Pr { X X k , y = k } E { ( Z γ ) 2 | X X k , y = k } c Pr { X X k , y = k } γ 0 2 + γ 1 Var { X | X X k , y = k } γ 1 ,
which implies that γ H k ( θ k ) γ = 0 if and only if γ = 0 p + 1 when Var { X | X X k , y = k } is assumed to be non-singular. Assuming that Var { X | y = k } O implies that Var { X | X X k , y = k } O . □
Proof of Theorem 1
By Lemma A2, a minimizer ϑ C exists with b k 0 p (by Lemma A4) and H ( ϑ ) O K ( p + 1 ) (by Lemma A5). By the second-order Lagrange condition and the convexity of L ( ϑ ) (by Lemma A1), a minimizer of the population MgDWD loss is unique.
Recall from (A2) that
L ( ϑ ) = E 𝟙 { y = k } ϕ q ( Z θ k ) = k = 1 K π k 1 E 𝟙 { X X k } 1 Q Z θ k q | y = k = k = 1 K A ( k , q ) π k .
It follows that
0 E 𝟙 { X X k } 1 Q Z θ k q | y = k < E 𝟙 { Z γ > 1 + q 1 } + 𝟙 { Q < Z γ 1 + q 1 } 1 Q 1 + q 1 q | y = m = Pr Z γ > Q | y = m Pr Q < Z γ Q 1 | y = m Q 2 q 1
and
1 E 𝟙 { X X k } 1 Q Z θ k q | y = k > E 𝟙 { Z θ k > 1 + ϵ } 1 Q 1 + ϵ q | y = k sup ϵ > 0 1 Q 1 + ϵ q Pr Z θ k > 1 + ϵ | y = m 0 .
Consequently, 0 u ( k , q ) A ( k , q ) v ( k , q ) 1 .
Note that lim q ( 1 + ϵ ) q Q q = e 1 when ϵ = 0 and lim q ( 1 + ϵ ) q Q q = 0 when ϵ > 0 . The difference between these two results is attributed to pointwise convergence.
Let f m = 1 A ( k , m ) D with m = 1 , 2 , and η D . By Fubini’s theorem and the dominated convergence theorem,
lim m f m , η = lim m E 𝟙 { X X k } Q Z θ k q | y = k , η ( γ ) = lim m E 𝟙 { X X k } Q Z θ k q , η ( θ k ) | y = k = E lim m 𝟙 { X X k } Q Z θ k q , η ( θ k ) | y = k = 0 = 0 , η ( θ k ) .
Similarly,
lim m 0 f m , η = E lim m 0 𝟙 { X X k } Q Z θ k q , η ( γ ) | y = k = E 𝟙 { Z θ k > 0 } , η ( γ ) | y = k = E 𝟙 { Z θ k > 0 } | y = k , η ( θ k ) = Pr Z θ k > 0 | y = k , η ( θ k ) ,
hence
A ( k , ) = lim q A ( k , q ) = Pr X X k | y = k , and A ( k , 0 ) = lim q 0 A ( k , q ) = 1 .
As a result, A ( k , ) coincides with the population hinge/SVM loss and A ( k , 0 ) is independent of θ k . □

Appendix A.3. Proof of Lemma 2

Proof. 
By the definition of P ˜ ,
P ˜ P S ( ϑ ) = τ P S ( ϑ ) + ( 1 τ ) max j P K S ( α ) 2 , P K S ( β j ) 2 ,
where
P K S ( α ) = P K E ϕ q { F ( ϑ ) } 1 K = 1 N i = 1 N P K diag { E i } ϕ q ( F i ) , P K S ( β j ) = P K E ϕ q { F ( ϑ ) } x j = 1 N i = 1 N x i j P K diag { E i } ϕ q ( F i ) ,
P K = ( p 1 , , p K ) with p k = ( p l k ) = 𝟙 { l = k } K 1 , and
E { P K S ( α ) } = P K S ( α ) = 0 K , E { P K S ( β j ) } = P K S ( β j ) = 0 K .
Denoting
d i k = p k diag { E i } ϕ q ( F i ) = l = 1 K 𝟙 { y i = k } 1 K e i l ϕ q ( f i l ) ,
we have that | d i k | 1 K 1 . Note that the d i k are N i.i.d. random variables with
1 N i = 1 N E ( d i k ) = p k S ( α ) = 0 and 1 N i = 1 N E ( d i k x i j ) = p k S ( β j ) = 0 .
By Hoeffding’s inequality, we have that
Pr | p k S ( α ) | > c 1 1 1 K 2 log ( p K ) N 2 ( p K ) c 1 2 ,
where c 1 > 1 .
Regarding the d i k x i j , we have that
E exp { d i k x i j } E exp { ( 1 K 1 ) | x i j | } exp { 4 ( 1 K 1 ) 2 ς 1 2 κ 2 } ,
which implies that the d i k x i j are N independent sub-Gaussian random variables with variance proxy ( 1 K 1 ) 2 ς 1 2 κ 2 . Taking c 1 > 1 , we have that
Pr | p k S ( β j ) | > c 1 ς 1 κ 1 1 K 2 log ( p K ) N 2 ( p K ) c 1 2 .
Then by (A3) and (A4),
Pr max j | p k S ( α ) | , | p k S ( β j ) | > Λ 1 2 ( p K ) c 1 2
with
Λ 1 = max { ς 1 κ , 1 } c 1 1 1 K 2 log ( p K ) N .
Taking a union bound over the K p entries of P S ( β ) yields that
Pr P S ( ϑ ) Λ 1 = Pr max j , k | 1 N i = 1 N p k S ( α ) | , | 1 N i = 1 N p k S ( β j ) | Λ 1 2 K ( p + 1 ) ( K p ) c 1 2 .
On one hand,
P diag { E i } ϕ q ( F i ) 2 2 = ( E i K 1 ) ϕ q ( F i ) 2 2 l = 1 K ( e i l K 1 ) 2 · 1 = 1 K 1 ,
so for any γ R K ,
| γ P diag { E i } ϕ q ( F i ) | γ 2 1 1 K
and E { γ P diag { E i } ϕ q ( F i ) } = 0 . Applying Hoeffding’s lemma,
E exp { γ P K S ( α ) } = i = 1 N E exp 1 N γ P K diag { E i } ϕ q ( F i ) exp γ 2 2 2 N 1 1 K .
Applying a square root to Theorem 2.1 of [31] with c 2 > 1 , we have that
Pr P S ( α ) 2 K 1 N + c 2 1 1 K 2 log ( p ) N p c 2 2 .
On the other hand, since the x i j are N independent sub-Gaussian random variables with variance proxy ς 1 2 κ 2 ,
E exp { γ P S ( β j ) } = i = 1 N E exp x i j N γ P diag { E i } ϕ q ( F i ) i = 1 N E exp 1 1 K γ 2 N | x i j | = exp γ 2 2 2 1 1 K 8 ς 1 2 κ 2 N
and E { P K S ( β j ) } = 0 K . Similarly, we have that
Pr P S ( β j ) 2 2 2 ς 1 κ K 1 N + c 2 1 1 K 2 log ( p ) N p c 2 2
for a constant c 2 > 1 .
Therefore, by (A6) and (A7),
Pr max j P S ( α ) 2 , P S ( β j ) 2 Λ 2 p c 2 2
with
Λ 2 = max { 2 2 ς 1 κ , 1 } K 1 N + c 2 1 1 K 2 log ( p ) N .
Applying the union bound to (A5), it follows that
Pr P ˜ P S ( ϑ ) τ Λ 1 + ( 1 τ ) Λ 2 2 K ( p + 1 ) ( p K ) 1 c 1 2 + p 1 c 2 2 ,
and the desired result follows. □

Appendix A.4. Proof of Theorem 2

Lemma A6.
Suppose that λ = c 0 log ( p K ) N . Then ϑ ^ ϑ U , where
U = δ R K ( p + 1 ) | τ 1 τ δ E + 1 + j G + δ j 2 C 0 τ 1 τ δ E c 1 + j G δ j 2 ,
C 0 = ( c 0 1 ) ( c 0 + 1 ) , E c denotes the complement of E , E + = E { l = 1 + ( k 1 ) ( p + 1 ) | k = 1 , , K } , and G + = G { 0 } .
Proof. 
Since ϑ ^ = ϑ + δ is the minimizer, we have that
L ( ϑ ) + λ P ( β ) L ( ϑ ^ ) + λ P ( β ^ ) λ P ( β ) P ( β + δ ˜ ) L ( ϑ + δ ) L ( ϑ ) ,
where β is the vector ϑ without the a k components, replacing δ ˜ for δ . Then
P ( β ) P ( β + δ ˜ ) = τ β E 1 β E + δ ˜ E 1 δ ˜ E c 1 + ( 1 τ ) j G β j 2 j G β j + δ j 2 j G δ j 2 τ δ ˜ E 1 δ ˜ E c 1 + ( 1 τ ) j G δ j 2 j G δ j 2 τ δ E + 1 δ E c 1 + ( 1 τ ) j G + δ j 2 j G δ j 2 .
By the convexity of L,
L ( ϑ + δ ) L ( ϑ ) S ( ϑ ) , δ P ¯ { P S ( ϑ ) } P ( δ ) λ c 0 P ( δ ) .
Note that
P ( δ ) = τ δ E + 1 + δ E c 1 + ( 1 τ ) j G + δ j 2 + j G δ j 2 .
Combining the above results, we have that
λ P ( ϑ ) P ( ϑ + δ ) L ( ϑ + δ ) L ( ϑ ) ( c + 1 ) τ δ E + 1 + ( 1 τ ) j G + δ j 2 ( c 1 ) τ δ E c 1 + ( 1 τ ) j G δ j 2 τ 1 τ δ E + 1 + j G + δ j 2 C 0 τ 1 τ δ E c 1 + j G δ j 2 .
Lemma A7.
Assume that conditions (A1)–(A3) are satisfied. Then
sup v V | Δ L ( u , v ) E { Δ L ( u , v ) } | v 2 > Λ 3
with probability at most 2 ( K p ) 2 ( s e + K ) ( 1 c 3 2 ) , where
Λ 3 = ( 1 + 2 c 3 ) ς 2 2 ( s e + K ) log ( p K ) N
and Δ L ( u , v ) = L ( u + v ) L ( u ) for any u , v R K ( p + 1 ) and for some constant c 3 > 1 .
Proof. 
Given any u R K ( p + 1 ) and v V with V = v R K ( p + 1 ) | 0 < v 0 s e + K ,
Δ L ( u , v ) = 1 N i = 1 N E i ϕ q ( U + V ) Z i ϕ q U Z i = 1 N i = 1 N k = 1 K e i k ϕ q Z i ( u k + v k ) ϕ q Z i ( u k ) = 1 N i = 1 N d i ( u , v ) ,
where u = vec { U } , v = vec { V } .
The bounded gradient implies the Lipschitz continuity of ϕ q so that | ϕ q ( u + v ) ϕ q ( u ) | | v | . Since e i k { 0 , 1 } , we have that
| d i ( u , v ) | k = 1 K | e i k ϕ q Z i ( u k + v k ) ϕ q ( Z i u k ) | k = 1 K | e i k Z i v k | E i vec { V Z i } = v ( Z i I K ) E i .
Note that
i = 1 N v ( Z i I K ) E i 2 = diag { vec { E } } ( Z I K ) v 2 2 .
By Hoeffding’s inequality, we have that
Pr | 1 N i = 1 N d i ( u , v ) E 1 N i = 1 N d i ( u , v ) | > t 2 exp 2 N 2 t 2 4 diag { vec { E } } ( Z I K ) v 2 2 2 exp N t 2 2 ς 2 2 v 2 2 .
Thus Pr { R ( v ) > Λ 3 } 2 ( K p ) ( s e + K ) c 3 2 with
R ( v ) = | Δ L ( u , v ) E { Δ L ( u , v ) } | v 2 and Λ 3 = c 3 ς 2 2 ( s e + K ) log ( p K ) N .
Next, we consider covering V with ϵ -balls such that for any v 1 and v 2 in the same ball, | v 1 v 1 2 v 1 v 1 2 | ϵ , where ϵ is a small positive number. The number of ϵ -balls required to cover a m-dimensional unit ball is bounded by ( 2 ϵ + 1 ) m . Then for those v v 2 , we require a covering number of at most ( 3 ( K p ) / ϵ ) s e + K . Let N denote such an ϵ -net. We have that
Pr sup v N R ( v ) > Λ 3 3 K p ϵ s e + K 2 ( K p ) ( s e + K ) c 3 2 = 2 3 ϵ ( K p ) 1 c 3 2 s e + K .
Furthermore, for any v 1 , v 2 V ,
| R ( v 1 ) R ( v 2 ) | 2 N diag { vec { E } } ( Z I K ) v 1 v 1 2 v 1 v 1 2 1 2 N diag { vec { E } } ( Z I K ) v 1 v 1 2 v 1 v 1 2 2 2 ς 2 ϵ .
Therefore sup v V R ( v ) sup v N R ( v ) + 2 ς 2 ϵ . Taking ϵ = ( s e + K ) log ( p K ) 2 N , we have that
Pr sup v V R ( v ) > Λ 3 Pr sup v N R ( v ) > ( c 3 1 ) ς 1 2 ( s e + K ) log ( p K ) N 2 2 N ( s e + K ) log ( p K ) 3 ( K p ) 1 ( c 3 1 ) 2 s e + K 2 ( K p ) 2 ( c 3 1 ) 2 s e + K .
Setting c 3 = 1 + 2 c 4 and c 4 > 1 , we obtain the desired result that
Pr sup v V R ( v ) > ( 1 + 2 c 4 ) ς 2 2 ( s e + K ) log ( p K ) N 2 ( K p ) 2 ( s e + K ) ( 1 c 4 2 ) .
Proof of Theorem 2.
Consider a disjoint partition on the coordinate set δ = ϑ ^ ϑ , that is, δ = m = 1 M v m with v m V . Note that, each subvector v m has at most s e + K non-zero coordinates. Denote v 0 = 0 and u m = ϑ + l = 0 m 1 v l so that u 1 = ϑ and u M + v M = ϑ + δ . We have the decomposition
Δ L ( ϑ , δ ) = L ϑ + m = 1 M v m L ( ϑ ) = m = 1 M L ϑ + l = 0 m v l L ϑ + l = 0 m 1 v l = m = 1 M L ( u m + v m ) L ( u m ) = m = 1 M Δ L ( u m , v m ) .
By Lemma A7,
m = 1 M Δ L ( u m , v m ) m = 1 M E Δ L ( u m , v m ) Λ 3 v m 2 = E Δ L ( ϑ , δ ) Λ 3 δ 2
with high probability. By Lemma A5, L is twice differentiable so that
E Δ L ( ϑ , δ ) = 1 N i = 1 N E E i ϕ q F i ( ϑ + δ ) E E i ϕ q F i ( ϑ ) = L ( ϑ + δ ) L ( ϑ ) = S ( ϑ ) δ + 1 2 δ H ( ϑ ) δ + o ( δ 2 2 ) 0 + ς 3 2 2 δ 2 2 + o ( δ 2 2 ) .
Consequently, Δ L ( ϑ , δ ) is bounded below by ς 3 2 2 δ 2 2 Λ 3 δ 2 with high probability.
Note that
P ( β ) P ( β + δ ˜ ) τ δ E + 1 δ E c 1 + ( 1 τ ) j G + δ j 2 j G δ j 2 τ δ E + 1 + ( 1 τ ) j G + δ j 2 .
From (A8),
L ( ϑ ) + λ P ( β ) L ( ϑ ^ ) + λ P ( β ^ ) λ P ( β ) P ( β + δ ˜ ) L ( ϑ + δ ) L ( ϑ ) λ τ δ E + 1 + ( 1 τ ) j G + δ j 2 ς 3 2 2 δ 2 2 Λ 3 δ 2 .
Clearly, δ E + 1 s e + K δ E + 2 s e + K δ 2 and j G + δ j 2 s g + 1 δ 2 . We conclude that
ς 3 2 2 δ 2 2 λ τ δ E + 1 + ( 1 τ ) j G + δ j 2 + Λ 3 δ 2 δ 2 2 2 ς 3 2 λ τ s e + K + ( 1 τ ) s g + 1 + Λ 3 δ 2 ,
after which the desired result follows from straightforward algebraic manipulation. □

Appendix A.5. Proof of Lemma 3

Proof. 
Since
vec ( F ) = vec ( 1 N α + XB ) = α ( 1 N I K ) + vec ( B ) ( X I K ) ,
we have that
vec ( F ) α = α ( 1 N I K ) α = α α ( 1 N I K ) = I K ( 1 N I K ) = 1 N I K vec ( F ) vec ( B ) = vec ( B ) ( X I K ) vec ( B ) = I p K ( X I K ) = X I K .
The derivative with respect to α is
N S ( α ) = N L ( θ ) α = α vec { E } vec ϕ q ( F ) = vec ( F ) α ϕ q vec ( F ) vec ( F ) vec { E } = ( 1 N I K ) diag vec ϕ q ( F ) vec { E } = vec I K E ϕ q ( F ) 1 N = E ϕ q ( F ) 1 N .
Thus,
S ( α ) | v u 2 2 = S ( u ) S ( v ) 2 2 = N 2 ( 1 N I K ) vec E ϕ q { F ( α ) } | v u 2 2 N 2 1 N I K 2 2 vec E ϕ q { F ( α ) } | v u 2 2 = N 1 k = 1 K i = 1 N e i k 2 ϕ q { f i k ( u k ) } ϕ q { f i k ( v k ) } 2 N 1 k = 1 K i = 1 N e i k L q 2 ( u k v k ) 2 N 1 n max L q 2 u v 2 2 ,
where L q = ( q + 1 ) 2 q is the Lipschitz constant of ϕ q . We have that L α = n max N L q .
The derivative with respect to vec ( B ) is
N L ( θ ) vec ( B ) = vec ( B ) vec { E } vec ϕ q ( F ) = vec ( F ) vec ( B ) ϕ q vec ( F ) vec ( F ) vec { E } = ( X I K ) diag vec ϕ q ( F ) vec { E } = vec I K E ϕ q ( F ) X = vec E ϕ q ( F ) X .
Therefore, the derivative with respect to B is S ( B ) = N 1 X E ϕ q ( F ) . Note that
vec X E ϕ q ( F ) = ( I K X ) diag { vec ( E ) } vec { ϕ q ( F ) } = k = 1 K X diag ( e k ) vec { ϕ q ( F ) }
and
i = 1 N e i k X i ( u k v k ) 2 = diag ( e k ) X ( u k v k ) 2 2 diag ( e k ) X 2 2 u k v k 2 2 ;
thus
N 2 vec { S ( U ) S ( V ) } 2 2 = k = 1 K X diag ( e k ) ϕ q { f k ( b k ) } | v k u k 2 2 k = 1 K X diag ( e k ) 2 2 diag ( e k ) ϕ q { f k ( b k ) } | v k u k 2 2 k = 1 K diag ( e k ) X 2 2 i = 1 N e i k ϕ q { f i k ( u k ) } ϕ q { f i k ( v k ) } 2 L q 2 k = 1 K diag ( e k ) X 2 2 i = 1 N e i k X i ( u k v k ) 2 L q 2 k = 1 K diag ( e k ) X 2 4 u k v k 2 2 max k diag ( e k ) X 2 2 2 vec ( U V ) 2 2 .
We conclude that L B = L q N 1 max k diag ( e k ) X 2 2 . □

Appendix A.6. Proof of Theorem 3

Lemma A8.
The indicator function
δ R ( x ) = 0 , if x R , if x R ,
where R = { x R p 1 p x = 0 } , has subdifferential
δ R ( x ) = { g R p g = s 1 p , s R } , if x R , if x R .
Proof. 
Suppose that x R . Then g δ R ( x ) if and only if both
δ R ( y ) δ R ( x ) + g , y x for all y R and ω ( y x ) 0 .
Let z = y x . Then z R since 1 p ( y X ) = 0 . Thus, g z 0 . If g z = 0 , then g { g R p g = s 1 p , s R } . If there exists g δ R ( x ) satisfying g z < 0 for some z R , then z R , so we must have that g z > 0 . This is a contradiction.
Now, for any x R , we have that g δ R ( x ) if and only if both
δ R ( y ) δ R ( x ) + g , y x for all y R and ω ( x y ) .
For x R and y R , since z = x y R p and g z , it must be that g . □
Proof of Theorem 3.
It is sufficient to minimize the objective function
$$
G(\beta)=\tfrac{1}{2}\|\beta-\tilde{\beta}\|_2^2+\rho_1\|\beta\|_1+\rho_2\|\beta\|_2+\delta_R(\beta),
$$
where $R=\{x\in\mathbb{R}^K\mid\mathbf{1}_K^\top x=0\}$. Then the subdifferential of $G(\beta)$ is
$$
\partial G(\beta)=\beta-\tilde{\beta}+\rho_1\partial\|\beta\|_1+\rho_2\partial\|\beta\|_2+\partial\delta_R(\beta).
$$
For an optimal solution $\hat{\beta}\in R$, we have that $\mathbf{0}_K\in\partial G(\hat{\beta})$ if and only if there exist $u\in\partial\|\hat{\beta}\|_1$, $v\in\partial\|\hat{\beta}\|_2$ and $s\in\mathbb{R}$ (so that $s\mathbf{1}_K\in\partial\delta_R(\hat{\beta})$ by Lemma A8) such that $\hat{\beta}=\tilde{\beta}-\rho_1u-\rho_2v-s\mathbf{1}_K$. Since $\mathbf{1}_K^\top\hat{\beta}=0$, we have that $s=K^{-1}\mathbf{1}_K^\top(\tilde{\beta}-\rho_1u-\rho_2v)$, so
$$
\hat{\beta}=P_K(\tilde{\beta}-\rho_1u-\rho_2v).
$$
If $\hat{\beta}=\mathbf{0}_K$, then $|u_j|\le1$ for $j=1,\dots,K$, $\|v\|_2\le1$ and
$$
\|P_K(\tilde{\beta}-\rho_1u)\|_2=\rho_2\|P_Kv\|_2\le\rho_2\|P_K\|_2\|v\|_2=\rho_2\|v\|_2\le\rho_2.
$$
If $\hat{\beta}\neq\mathbf{0}_K$, then $\|u\|_\infty\le1$, $v=\hat{\beta}/\|\hat{\beta}\|_2$ and
$$
\hat{\beta}=P_K\Big(\tilde{\beta}-\rho_1u-\rho_2\frac{\hat{\beta}}{\|\hat{\beta}\|_2}\Big)
\quad\Longrightarrow\quad
\Big(1+\frac{\rho_2}{\|\hat{\beta}\|_2}\Big)\hat{\beta}=P_K(\tilde{\beta}-\rho_1u).
$$
Note that $\hat{\beta}=P_K\hat{\beta}$ since $\hat{\beta}\in R$. Taking the norm of both sides, we see that
$$
\Big(1+\frac{\rho_2}{\|\hat{\beta}\|_2}\Big)\|\hat{\beta}\|_2=\|P_K(\tilde{\beta}-\rho_1u)\|_2
\quad\Longrightarrow\quad
\|\hat{\beta}\|_2=\|P_K(\tilde{\beta}-\rho_1u)\|_2-\rho_2>0.
$$
Substituting this result back into the $\hat{\beta}\neq\mathbf{0}_K$ case, we have that
$$
\hat{\beta}=\Big(1-\frac{\rho_2}{\|P_K(\tilde{\beta}-\rho_1u)\|_2}\Big)P_K(\tilde{\beta}-\rho_1u).
$$
Combining the above two cases gives the desired result. □
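One convenient way to check whatever closed-form update Theorem 3 yields is to solve the same penalized projection problem with a generic convex solver. The minimal sketch below uses cvxpy (an assumed external dependency; the penalty values and the symbol beta_tilde are illustrative) to minimize exactly the objective $G(\beta)$ from the proof subject to $\mathbf{1}_K^\top\beta=0$.

```python
import cvxpy as cp
import numpy as np

# Numerically solve  min_beta 0.5||beta - beta_tilde||^2 + rho1 ||beta||_1 + rho2 ||beta||_2
# subject to 1^T beta = 0, i.e. the problem analyzed in the proof of Theorem 3.
K, rho1, rho2 = 4, 0.3, 0.5
beta_tilde = np.array([1.2, -0.4, 0.1, -0.6])

beta = cp.Variable(K)
objective = 0.5 * cp.sum_squares(beta - beta_tilde) \
    + rho1 * cp.norm1(beta) + rho2 * cp.norm(beta, 2)
problem = cp.Problem(cp.Minimize(objective), [cp.sum(beta) == 0])
problem.solve()
print(np.round(beta.value, 4), np.isclose(beta.value.sum(), 0.0, atol=1e-6))
```

The solver output can then be compared entry by entry with the closed-form update of Theorem 3.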

Appendix A.7. Proof of Theorem 4

Proof. 
Denote the objective function by
$$
G(b)=\tfrac{1}{2}(b-t)^2+\varrho\{|b|+|b+s|\}.
$$
When $s=0$, we obtain a lasso problem with
$$
\hat{b}=\operatorname{argmin}_{b\in\mathbb{R}}\Big\{\tfrac{1}{2}(b-t)^2+2\varrho|b|\Big\}=S(t,2\varrho).
$$
When $s\neq0$, the subdifferential of $G(b)$ is
$$
\partial G(b)=b-t+\varrho\{\partial|b|+\partial|b+s|\}.
$$
We see that $0\in\partial G(\hat{b})$ if and only if there exist $u\in\partial|\hat{b}|$ and $v\in\partial|\hat{b}+s|$ with
$$
\hat{b}=t-\varrho(u+v).
$$
If $\hat{b}=0$, then $|u|\le1$ and $v=\operatorname{sign}(s)$, hence
$$
\hat{b}=0\quad\text{if}\quad|t-\varrho\operatorname{sign}(s)|\le\varrho.
$$
If $s>0$, then $\operatorname{sign}(s)=1$ and $0\le t\le2\varrho$. If $s<0$, then $\operatorname{sign}(s)=-1$ and $-2\varrho\le t\le0$. Note that if $t\neq0$, then $\operatorname{sign}(s)=\operatorname{sign}(t)$, that is, $\operatorname{sign}(s)\operatorname{sign}(t)=1$.
When $\hat{b}=-s$, then $u=-\operatorname{sign}(s)$ and $|v|\le1$, hence
$$
\hat{b}=-s\quad\text{if}\quad|t+s+\varrho\operatorname{sign}(s)|\le\varrho.
$$
If $s>0$, then $\operatorname{sign}(s)=1$ and $-(s+2\varrho)\le t\le-s<0$. If $s<0$, then $\operatorname{sign}(s)=-1$ and $0<-s\le t\le-(s-2\varrho)$. Note that here $\operatorname{sign}(s)=-\operatorname{sign}(t)$, that is, $\operatorname{sign}(s)\operatorname{sign}(t)=-1$.
Let $C(s,t)=\frac{1-\operatorname{sign}(s)\operatorname{sign}(t)}{2}\,|s|\ge0$. We can summarize the two cases above as
$$
\hat{b}=\operatorname{sign}(t)\,C(s,t)\quad\text{if}\quad C(s,t)\le|t|\le C(s,t)+2\varrho.\tag{A9}
$$
If $\hat{b}\neq0,-s$, then $u=\operatorname{sign}(\hat{b})$ and $v=\operatorname{sign}(\hat{b}+s)$, thus
$$
\hat{b}=t-\varrho\{\operatorname{sign}(\hat{b})+\operatorname{sign}(\hat{b}+s)\}
\quad\text{and}\quad
\hat{b}+s=t+s-\varrho\{\operatorname{sign}(\hat{b})+\operatorname{sign}(\hat{b}+s)\}.
$$
If $\operatorname{sign}(\hat{b})=-\operatorname{sign}(\hat{b}+s)=1$, then $\hat{b}(\hat{b}+s)<0$, that is, $0<t<-s$; thus $\hat{b}=t>0$ if $0<t<-s$. If $\operatorname{sign}(\hat{b})=-\operatorname{sign}(\hat{b}+s)=-1$, then $\hat{b}(\hat{b}+s)<0$, that is, $-s<t<0$; thus $\hat{b}=t<0$ if $-s<t<0$. Rewriting the two cases above, we have that
$$
\hat{b}=t\quad\text{if}\quad0<|t|<C(s,t).\tag{A10}
$$
If $\operatorname{sign}(\hat{b})=\operatorname{sign}(\hat{b}+s)=1$, then
$$
\min\{\hat{b},\hat{b}+s\}>0
\iff t-2\varrho+\frac{s-|s|}{2}>0
\iff \operatorname{sign}(t)|t|>\frac{|s|-\operatorname{sign}(t)s}{2}+2\varrho=C(s,t)+2\varrho>0.
$$
Note that $t>0$ and $\operatorname{sign}(\hat{b})=\operatorname{sign}(t)$. If $\operatorname{sign}(\hat{b})=\operatorname{sign}(\hat{b}+s)=-1$, then
$$
\max\{\hat{b},\hat{b}+s\}<0
\iff t+2\varrho+\frac{s+|s|}{2}<0
\iff \operatorname{sign}(t)|t|<-\Big\{\frac{|s|-\operatorname{sign}(t)s}{2}+2\varrho\Big\}=-\{C(s,t)+2\varrho\}<0.
$$
Note that $t<0$ and $\operatorname{sign}(\hat{b})=\operatorname{sign}(t)$. Rewriting the two cases above, we have that
$$
\hat{b}=t-2\varrho\operatorname{sign}(t)\quad\text{if}\quad|t|>C(s,t)+2\varrho.\tag{A11}
$$
Summarizing (A9)–(A11),
$$
\hat{b}=
\begin{cases}
t, & |t|<C(s,t),\\[2pt]
\operatorname{sign}(t)\,C(s,t), & C(s,t)\le|t|\le C(s,t)+2\varrho,\\[2pt]
\operatorname{sign}(t)\,(|t|-2\varrho), & |t|>C(s,t)+2\varrho,
\end{cases}
$$
with $C(s,t)=\frac{1-\operatorname{sign}(s)\operatorname{sign}(t)}{2}\,|s|\ge0$. On one hand, when $s\neq0$, this can be written as
$$
\hat{b}=t-S\{t,C(s,t)\}+S\big[S\{t,C(s,t)\},\,2\varrho\big].
$$
On the other hand, when $s=0$, the same expression reduces to $\hat{b}=S(t,2\varrho)$ since $S(z,0)=z$. □
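The final expression is straightforward to implement and to test. The minimal sketch below codes $\hat{b}=t-S\{t,C(s,t)\}+S[S\{t,C(s,t)\},2\varrho]$ and compares it against a brute-force grid minimization of $G(b)$; the variable names are illustrative and rho stands for the penalty written $\varrho$ in the text.

```python
import numpy as np

# The closed form from the proof of Theorem 4, checked against a brute-force grid
# minimization of G(b) = 0.5 (b - t)^2 + rho (|b| + |b + s|).
def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def prox_fused(t, s, rho):
    C = 0.5 * (1.0 - np.sign(s) * np.sign(t)) * abs(s)
    return t - soft(t, C) + soft(soft(t, C), 2.0 * rho)

rng = np.random.default_rng(0)
for _ in range(5):
    t, s, rho = rng.normal(), rng.normal(), abs(rng.normal())
    grid = np.linspace(-10.0, 10.0, 400001)
    G = 0.5 * (grid - t) ** 2 + rho * (np.abs(grid) + np.abs(grid + s))
    print(round(prox_fused(t, s, rho), 3), round(float(grid[G.argmin()]), 3))
```

On random inputs the two printed values agree to the grid resolution, and setting s = 0 recovers the ordinary soft-thresholding solution S(t, 2ϱ).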

References

1. Haralick, R.M.; Shanmugam, K.; Dinstein, I.H. Textural features for image classification. IEEE Trans. Syst. Man Cybern. 1973, SMC-3, 610–621.
2. Wang, X.; Zhang, H.H.; Wu, Y. Multiclass probability estimation with support vector machines. J. Comput. Graph. Stat. 2019, 28, 586–595.
3. Hansen, J.H.; Hasan, T. Speaker recognition by machines and humans: A tutorial review. IEEE Signal Process. Mag. 2015, 32, 74–99.
4. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification; John Wiley & Sons: New York, NY, USA, 2012.
5. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer Science & Business Media: New York, NY, USA, 2009.
6. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
7. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: Cambridge, UK, 2000.
8. Marron, J.S.; Todd, M.J.; Ahn, J. Distance-weighted discrimination. J. Am. Stat. Assoc. 2007, 102, 1267–1271.
9. Qiao, X.; Zhang, H.H.; Liu, Y.; Todd, M.J.; Marron, J.S. Weighted distance weighted discrimination and its asymptotic properties. J. Am. Stat. Assoc. 2010, 105, 401–414.
10. Marron, J. Distance-weighted discrimination. Wiley Interdiscip. Rev. Comput. Stat. 2015, 7, 109–114.
11. Zhang, L.; Lin, X. Some considerations of classification for high dimension low-sample size data. Stat. Methods Med. Res. 2013, 22, 537–550.
12. Wang, B.; Zou, H. Another look at distance-weighted discrimination. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2018, 80, 177–198.
13. Liu, Y.; Zhang, H.H.; Wu, Y. Hard or soft classification? Large-margin unified machines. J. Am. Stat. Assoc. 2011, 106, 166–177.
14. Huang, H.; Liu, Y.; Du, Y.; Perou, C.M.; Hayes, D.N.; Todd, M.J.; Marron, J.S. Multiclass distance-weighted discrimination. J. Comput. Graph. Stat. 2013, 22, 953–969.
15. Wang, B.; Zou, H. A multicategory kernel distance weighted discrimination method for multiclass classification. Technometrics 2019, 61, 396–408.
16. Wang, B.; Zou, H. Sparse distance weighted discrimination. J. Comput. Graph. Stat. 2016, 25, 826–838.
17. Wang, L.; Shen, X. On L1-norm multiclass support vector machines: Methodology and theory. J. Am. Stat. Assoc. 2007, 102, 583–594.
18. Zhang, X.; Wu, Y.; Wang, L.; Li, R. Variable selection for support vector machines in moderately high dimensions. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2016, 78, 53–76.
19. Peng, B.; Wang, L.; Wu, Y. An error bound for L1-norm support vector machine coefficients in ultra-high dimension. J. Mach. Learn. Res. 2016, 17, 8279–8304.
20. Simon, N.; Friedman, J.; Hastie, T.; Tibshirani, R. A sparse-group lasso. J. Comput. Graph. Stat. 2013, 22, 231–245.
21. Friedman, J.; Hastie, T.; Tibshirani, R. A note on the group lasso and a sparse group lasso. arXiv 2010, arXiv:1001.0736.
22. Cai, T.T.; Zhang, A.; Zhou, Y. Sparse group lasso: Optimal sample complexity, convergence rate, and statistical inference. arXiv 2019, arXiv:1909.09851.
23. Yu, D.; Zhang, L.; Mizera, I.; Jiang, B.; Kong, L. Sparse wavelet estimation in quantile regression with multiple functional predictors. Comput. Stat. Data Anal. 2019, 136, 12–29.
24. He, Q.; Kong, L.; Wang, Y.; Wang, S.; Chan, T.A.; Holland, E. Regularized quantile regression under heterogeneous sparsity with application to quantitative genetic traits. Comput. Stat. Data Anal. 2016, 95, 222–239.
25. Huang, H. Large dimensional analysis of general margin based classification methods. arXiv 2019, arXiv:1901.08057.
26. Huang, H.; Yang, Q. Large scale analysis of generalization error in learning using margin based classification methods. arXiv 2020, arXiv:2007.10112.
27. Lam, X.Y.; Marron, J.; Sun, D.; Toh, K.C. Fast algorithms for large-scale generalized distance weighted discrimination. J. Comput. Graph. Stat. 2018, 27, 368–379.
28. Sun, D.; Toh, K.C.; Yang, L. A convergent 3-block semiproximal alternating direction method of multipliers for conic programming with 4-type constraints. SIAM J. Optim. 2015, 25, 882–915.
29. Parikh, N.; Boyd, S. Proximal algorithms. Found. Trends Optim. 2014, 1, 127–239.
30. Asahchop, E.L.; Branton, W.G.; Krishnan, A.; Chen, P.A.; Yang, D.; Kong, L.; Zochodne, D.W.; Brew, B.J.; Gill, M.J.; Power, C. HIV-associated sensory polyneuropathy and neuronal injury are associated with miRNA-455-3p induction. JCI Insight 2018, 3, e122450.
31. Hsu, D.; Kakade, S.; Zhang, T. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab. 2012, 17, 52.
Table 1. Simulation results for the SGL-MgDWD, L1-logistic, and OVR methods with K = 5. Time is measured relative to a baseline logistic regression model with K = 5, p = 100, and N = 50. Numbers in parentheses denote standard deviations.
n_k | p | Method | Test Accuracy | Signal (s_e+, s_g+) | Noise (n_e+, n_g+) | Time (SD)
50 | 100 | SGL-MgDWD | 0.980 | (9.99, 2) | (0, 0) | 1.150 (0.173)
 | | L1-logistic | 0.979 | (9.00, 2) | (116.98, 26.17) | 1.000 (0.153)
 | | OVR-SGL-gDWD | 0.912 | - | - | -
 | 1000 | SGL-MgDWD | 0.979 | (10, 2) | (6.96, 1.94) | 5.290 (0.166)
 | | L1-logistic | 0.966 | (10, 2) | (2793.65, 722.38) | 5.130 (0.063)
 | | OVR-SGL-gDWD | 0.740 | - | - | -
100 | 100 | SGL-MgDWD | 0.981 | (10, 2) | (0.07, 0.03) | 1.453 (0.155)
 | | L1-logistic | 0.980 | (8.82, 2) | (35.18, 3.98) | 1.258 (0.127)
 | | OVR-SGL-gDWD | 0.828 | - | - | -
 | 1000 | SGL-MgDWD | 0.980 | (10, 2) | (1.01, 0.25) | 4.863 (0.150)
 | | L1-logistic | 0.978 | (9.93, 2) | (1380.38, 192.37) | 4.703 (0.061)
 | | OVR-SGL-gDWD | 0.546 | - | - | -
200 | 100 | SGL-MgDWD | 0.980 | (10, 2) | (7.67, 2.08) | 1.776 (0.164)
 | | L1-logistic | 0.980 | (9.39, 2) | (13.1, 0.72) | 1.709 (0.175)
 | | OVR-SGL-gDWD | 0.934 | - | - | -
 | 1000 | SGL-MgDWD | 0.982 | (10, 2) | (1.09, 0.29) | 8.641 (0.186)
 | | L1-logistic | 0.981 | (9.79, 2) | (199.02, 2.51) | 2.505 (0.121)
 | | OVR-SGL-gDWD | 0.950 | - | - | -
400 | 100 | SGL-MgDWD | 0.981 | (10, 2) | (0.02, 0) | 2.792 (0.159)
 | | L1-logistic | 0.981 | (10, 2) | (4.72, 3.95) | 2.828 (0.115)
 | | OVR-SGL-gDWD | 0.979 | - | - | -
 | 1000 | SGL-MgDWD | 0.981 | (10, 2) | (4.72, 3.95) | 15.800 (0.221)
 | | L1-logistic | 0.981 | (9.6, 2) | (16.17, 0.02) | 17.915 (1.585)
 | | OVR-SGL-gDWD | 0.964 | - | - | -
Table 2. Simulation results for the SGL-MgDWD, L1-logistic, and OVR methods with K = 11. Time is measured relative to a baseline logistic regression model with K = 5, p = 100, and N = 50. Numbers in parentheses denote standard deviations.
n_k | p | Method | Test Accuracy | Signal (s_e+, s_g+) | Noise (n_e+, n_g+) | Time (SD)
50 | 100 | SGL-MgDWD | 0.735 | (21.41, 2) | (0.14, 0.02) | 1.661 (0.143)
 | | L1-logistic | 0.735 | (20.13, 2) | (337.77, 22.07) | 1.610 (0.110)
 | | OVR-SGL-gDWD | 0.647 | - | - | -
 | 1000 | SGL-MgDWD | 0.733 | (21.25, 2) | (0, 0) | 7.105 (0.205)
 | | L1-logistic | 0.566 | (20.67, 2) | (3805.97, 265.82) | 6.933 (0.205)
 | | OVR-SGL-gDWD | 0.382 | - | - | -
100 | 100 | SGL-MgDWD | 0.737 | (21.82, 2) | (0.06, 0.01) | 2.518 (0.099)
 | | L1-logistic | 0.721 | (20, 2) | (173.17, 5.81) | 2.418 (0.103)
 | | OVR-SGL-gDWD | 0.609 | - | - | -
 | 1000 | SGL-MgDWD | 0.737 | (21.88, 2) | (5.4, 0.77) | 12.371 (0.109)
 | | L1-logistic | 0.697 | (20.15, 2) | (1859.51, 9.04) | 12.279 (0.114)
 | | OVR-SGL-gDWD | 0.214 | - | - | -
200 | 100 | SGL-MgDWD | 0.738 | (22, 2) | (0, 0) | 5.191 (0.079)
 | | L1-logistic | 0.730 | (20, 2) | (50.7, 0.08) | 4.246 (0.100)
 | | OVR-SGL-gDWD | 0.609 | - | - | -
 | 1000 | SGL-MgDWD | 0.738 | (21.98, 2) | (0.23, 0.04) | 21.950 (0.241)
 | | L1-logistic | 0.730 | (20, 2) | (523.08, 1.07) | 22.158 (0.163)
 | | OVR-SGL-gDWD | 0.490 | - | - | -
400 | 100 | SGL-MgDWD | 0.740 | (22, 2) | (0, 0) | 7.025 (0.172)
 | | L1-logistic | 0.738 | (20, 2) | (3.71, 3.48) | 7.997 (0.122)
 | | OVR-SGL-gDWD | 0.709 | - | - | -
 | 1000 | SGL-MgDWD | 0.738 | (22, 2) | (0.68, 0.11) | 38.301 (0.200)
 | | L1-logistic | 0.734 | (20, 2) | (38.84, 35.37) | 41.059 (2.064)
 | | OVR-SGL-gDWD | 0.556 | - | - | -
Table 3. Signs of the coefficient estimates obtained from the SGL-MgDWD method with (q, τ) = (1, 0.1) for the HIV data set. The symbols “+” and “-” denote positive and negative coefficient estimates, respectively, while “0” denotes a zero coefficient (i.e., an irrelevant variable).
Variable | Non-HIV | HIVNBD | HIVBDS | HIVBDU
Intercept | + | + | - | +
miR-255b | - | + | - | +
miR-217 | + | - | 0 | 0
miR-25-star | 0 | + | + | -
miR-3136-5p | - | - | + | -
miR-3152-3p | + | - | - | +
miR-3159 | - | - | - | +
miR-3171 | 0 | + | - | -
miR-33b | - | - | - | +
miR-34c-3p | - | - | + | +
miR-3545-5p | - | + | - | +
miR-3654 | - | - | - | +
miR-3924 | 0 | - | + | -
miR-4307 | 0 | - | + | 0
miR-4474-5p | - | + | + | +
miR-4526 | + | - | - | -
miR-4641 | + | 0 | - | -
miR-4655-3p | + | 0 | - | -
miR-4680-5p | - | - | + | -
miR-4683 | - | - | 0 | +
miR-589 | - | + | + | -
miR-619 | + | - | - | +
miR-660 | + | 0 | - | +
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
