A Linearly Involved Generalized Moreau Enhancement of ℓ_{2,1}-Norm with Application to Weighted Group Sparse Classification

Abstract: This paper proposes a new group-sparsity-inducing regularizer to approximate the ℓ_{2,0} pseudo-norm. The regularizer is nonconvex and can be seen as a linearly involved generalized Moreau enhancement of the ℓ_{2,1}-norm. Moreover, the overall convexity of the corresponding group-sparsity-regularized least squares problem can be guaranteed. The model can handle general group configurations such as weighted group sparse problems and can be solved by a proximal splitting algorithm. Among the applications, considering that the bias of convex regularizers may lead to incorrect classification results, especially for unbalanced training sets, we apply the proposed model to the (weighted) group sparse classification problem. The proposed classifier can exploit the label, similarity, and locality information of samples, and it suppresses the bias of convex-regularizer-based classifiers. Experimental results demonstrate that the proposed classifier improves the performance of convex ℓ_{2,1}-regularizer-based methods, especially when the training data set is unbalanced. This paper enhances the potential applicability and effectiveness of using nonconvex regularizers in the framework of convex optimization.


Introduction
In recent decades, sparse reconstruction has become an active topic in many areas, such as signal processing, statistics, and machine learning [1]. By reconstructing a sparse solution from a linear measurement, we can obtain a certain expression of high-dimensional data as a vector with only a small number of nonzero entries. In practical applications, the data of interest can often be assumed to have a certain special structure. For example, in microarray analysis of gene expression [2,3], hyperspectral image unmixing [4-6], force identification in industrial applications [7], classification problems [8-13], etc., the solution of interest often possesses a group-sparsity structure, namely, the solution has a natural grouping of its coefficients and nonzero entries occur in only a few groups.
This paper focuses on the estimation of group sparse solutions, which is related to the Group LASSO (least absolute shrinkage and selection operator) [14]. Suppose x = [x_1^T, x_2^T, …, x_g^T]^T ∈ R^n is a group sparse signal, where x_i ∈ R^{n_i}, ∑_{i=1}^g n_i = n, and g is the number of groups. Just as the ℓ_0 pseudo-norm is used to evaluate sparsity, the group sparsity of x can be evaluated with the ℓ_{2,0} pseudo-norm, i.e., ‖x‖_{2,0} = ‖[‖x_1‖_2, ‖x_2‖_2, …, ‖x_g‖_2]^T‖_0, where ‖·‖_2 is the Euclidean norm and ‖·‖_0 is the ℓ_0 pseudo-norm, which counts the number of nonzero entries of a vector in R^g. The group sparse regularized least squares problem can be modeled as

minimize_{x ∈ R^n} (1/2)‖y − Ax‖_2^2 + λ‖x‖_{2,0},   (1)

where y ∈ R^m and A ∈ R^{m×n} are known, and λ > 0 is the regularization parameter. However, the employment of the pseudo-norm ℓ_{2,0} makes (1) NP-hard [15]. Most studies in applications replace the nonconvex regularizer ℓ_{2,0} with its tightest convex envelope ℓ_{2,1} [16] (or its weighted variants), and the following regularized least squares problem, known as the Group LASSO [14], has been proposed:

minimize_{x ∈ R^n} (1/2)‖y − Ax‖_2^2 + λ ∑_{i=1}^g w_i‖x_i‖_2,   (2)

where the weights w_i > 0 (i = 1, …, g) in the regularization term (this is the ℓ_{w,2,1}-norm of x, i.e., a separable weighted version [17] of the ℓ_{2,1}-norm ‖x‖_{2,1} = ∑_{i=1}^g ‖x_i‖_2) are used to adjust for group sizes, with w_i = √n_i in [14,18]. We give a simple but clear explanation in Appendix A to show the bias of the ℓ_{2,1}-norm caused by group size in the application of group sparse classification (GSC).
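For concreteness, the two group-sparsity measures above can be computed as follows. This is our own illustrative sketch; the contiguous grouping, the group sizes n_i, and the sample vector are hypothetical.

```python
import numpy as np

def group_norms(x, group_sizes):
    """Euclidean norm of each group of x (groups are contiguous blocks)."""
    idx = np.cumsum([0] + list(group_sizes))
    return np.array([np.linalg.norm(x[idx[i]:idx[i + 1]])
                     for i in range(len(group_sizes))])

def l20(x, group_sizes):
    """l_{2,0} pseudo-norm: the number of groups with nonzero Euclidean norm."""
    return int(np.count_nonzero(group_norms(x, group_sizes)))

def l21(x, group_sizes, w=None):
    """(Weighted) l_{2,1} norm: sum_i w_i * ||x_i||_2 (w_i = 1 if w is None)."""
    g = group_norms(x, group_sizes)
    return float(np.sum(g if w is None else np.asarray(w) * g))

# x has g = 3 groups of sizes 2, 3, 2; only the first group is active.
x = np.array([3.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(l20(x, [2, 3, 2]))  # 1 (one active group)
print(l21(x, [2, 3, 2]))  # 5.0 (= ||[3, 4]||_2)
```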
Although the convex optimization problem (2) has been used as a standard model in group sparse estimation applications, the convex regularizer ℓ_{w,2,1} does not necessarily promote group sparsity sufficiently, mainly because the ℓ_{w,2,1}-norm is only an approximation of the ℓ_{2,0} pseudo-norm within the severe restriction of convexity. To promote group sparsity more effectively than convex regularizers, nonconvex regularizers such as group SCAD (smoothly clipped absolute deviation) [3], group MCP (minimax concave penalty) [18,19], ℓ_{p,q} regularization (‖x‖_{p,q} := (∑_{i=1}^g ‖x_i‖_p^q)^{1/q}, 0 < q < 1 ≤ p) [20], iterative weighted group ℓ_1 minimization [21], and ℓ_{2,0} [22] have been used for group sparse estimation problems. However, they lose the overall convexity of the optimization problems (in [23], a nonconvex regularizer which can preserve the overall convexity was proposed, but the fidelity term of the optimization model is (1/2)‖y − x‖_2^2, limited to the case A = I_n, where I_n ∈ R^{n×n} is the identity matrix, and hence it cannot be applied to (1) for general A ∈ R^{m×n}), which means their algorithms have no guarantee of convergence to global minimizers of the overall cost functions.
In this paper, we propose a generalized weighted group sparse estimation model based on the linearly involved generalized-Moreau-enhanced (LiGME) approach [24], which uses a nonconvex regularizer while maintaining the overall convexity of the optimization problem. Our contributions can be summarized as follows:

• We present a convex regularized least squares model with a nonconvex group-sparsity-promoting regularizer based on LiGME. It can serve as a unified model for many types of group-sparsity-related applications.

• We illustrate the unfairness of the ℓ_{2,1} regularizer in unbalanced classification and then apply the proposed model to reduce this unfairness in GSC and weighted GSC (WGSC) [11].
The remainder of this paper is organized as follows. In Section 2, we give a brief review of the LiGME model and the WGSC method. In Section 3, we present our group sparse enhanced representation model and its mathematical properties. In Section 4, we apply the proposed model to group-sparsity-based classification problems. The conclusion is given in Section 5.
A preliminary short version of this paper was presented at a conference [25].

Preliminaries

Review of the Linearly Involved Generalized-Moreau-Enhanced (LiGME) Model
We first give a brief review of linearly involved generalized-Moreau-enhanced (LiGME) models, which are closely related to our method. Although the convex ℓ_1-norm (or nuclear norm) is the most frequently adopted regularizer for sparsity (or low-rank) pursuing problems, it tends to underestimate high-amplitude values (or large singular values) [26,27]. Convexity-preserving nonconvex regularizers have been widely explored in [24,28-33]; they promote sparsity (or low-rankness) more effectively than convex regularizers without losing the overall convexity. Among them, the generalized minimax concave (GMC) function in [31] does not rely on certain strong assumptions on the least squares term and has great potential for dealing with nonconvex variations of ‖·‖_1. Motivated by the GMC function, the LiGME model [24] provides a general framework for constructing linearly involved nonconvex regularizers for sparsity (or low-rank) regularized linear least squares while maintaining the overall convexity of the cost function.

Let (X, ‖·‖_X), (Y, ‖·‖_Y) and (Z, ‖·‖_Z) be finite-dimensional real Hilbert spaces. Let a function Ψ ∈ Γ_0(Z) be coercive with dom Ψ = Z. Here Γ_0(Z) is the set of proper (i.e., dom Ψ := {z ∈ Z | Ψ(z) < ∞} ≠ ∅), lower semicontinuous (i.e., lev_{≤a} Ψ := {z ∈ Z | Ψ(z) ≤ a} is closed for every a ∈ R), convex (i.e., Ψ(θz_1 + (1 − θ)z_2) ≤ θΨ(z_1) + (1 − θ)Ψ(z_2) for all z_1, z_2 ∈ Z and θ ∈ (0, 1)) functions. The generalized Moreau enhancement (GME) of Ψ is defined by

Ψ_B(z) := Ψ(z) − min_{v ∈ Z} [ Ψ(v) + (1/2)‖B(z − v)‖^2 ],

where B is a tuning matrix for the enhancement. Then the LiGME model is defined as the minimization of

J(x) := (1/2)‖y − Ax‖_Y^2 + λΨ_B(Lx),   (5)

where A ∈ B(X, Y), y ∈ Y, λ > 0, and L ∈ B(X, Z) is a given linear operator. Please note that GMC [31] can be seen as a special case of (5) with Ψ = ‖·‖_1 and L = Id, where Id is the identity operator. Model (5) can also be seen as an extension of [32,33].
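For intuition about the GME, here is our own scalar sketch (not from the paper): taking Ψ = |·| on R and a scalar tuning parameter B = b, the inner minimum is the Moreau envelope of |·|, i.e., a Huber function with parameter γ = 1/b², so Ψ_B becomes the minimax concave penalty. Unlike |x|, it saturates at γ/2 for large |x|, mimicking the behavior of ℓ_0.

```python
import numpy as np

def gme_abs(x, b):
    """GME of |.| on R with B = b: |x| - min_v (|v| + (b^2/2)(x - v)^2).
    The inner minimum is the Moreau envelope of |.|, a Huber function
    with gamma = 1/b^2, so the result is the minimax concave penalty."""
    gamma = 1.0 / b**2
    huber = np.where(np.abs(x) <= gamma, x**2 / (2 * gamma), np.abs(x) - gamma / 2)
    return np.abs(x) - huber

b = 1.0  # gamma = 1
print(gme_abs(np.array([0.5]), b))   # [0.375] = 0.5 - 0.125
print(gme_abs(np.array([10.0]), b))  # [0.5]: saturates at gamma/2
```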
Although the GME function Ψ_B is nonconvex in general unless B is the zero operator, the overall convexity of the cost function (5) can be achieved with B designed to satisfy the following convexity condition.

Proposition 1 ([24], Proposition 1). The cost function J in (5) is convex if

A*A − λL*B*BL ⪰ O_X,   (6)

where A* denotes the adjoint of A and O_X ∈ B(X, X) is the zero operator. In particular, when Ψ is a certain norm over the vector space Z, condition (6) also characterizes the convexity of J.

A method of designing B satisfying (6) for X = R^n is provided in [24]; see Proposition A1 in Appendix B. For any Ψ ∈ Γ_0(Z) that is coercive, even symmetric, and prox-friendly (even symmetry means Ψ ∘ (−Id) = Ψ; prox-friendliness means that Prox_{γΨ} is computable for every γ > 0) with dom Ψ = Z, [24] provides a proximal splitting algorithm (see Proposition A2 in Appendix B) with guaranteed convergence to a globally optimal solution of model (5) under the overall-convexity condition (6).
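As a numerical sanity check (ours, with a random problem instance and L = Id), condition (6) reduces to positive semidefiniteness of A^T A − λB^T B; the GMC-style design B = √(θ/λ)A with θ ∈ [0, 1] satisfies it by construction, since the matrix then equals (1 − θ)A^T A.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 12
A = rng.standard_normal((m, n))
lam, theta = 2.0, 0.9

# GMC-style design of the tuning matrix.
B = np.sqrt(theta / lam) * A

# Overall-convexity condition (6) with L = Id: A^T A - lam * B^T B >= O.
M = A.T @ A - lam * (B.T @ B)  # algebraically (1 - theta) * A^T A
min_eig = np.min(np.linalg.eigvalsh(M))
print(min_eig >= -1e-10)  # True: the condition holds
```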

Basic Idea of Weighted Group Sparse Classification (WGSC)
As a relatively simple but typical scenario for applying the idea proposed in this paper, we introduce the main idea of weighted group sparse classification (WGSC). Classification is one of the fundamental tasks in signal and image processing and pattern recognition. For a classification problem with g classes of subjects, the training samples form a dictionary matrix A = [A_1, A_2, …, A_g] ∈ R^{m×n}, where A_i = [a_{i1}, …, a_{in_i}] ∈ R^{m×n_i} is the subset of the training samples from subject i, a_{ij} is the j-th training sample from the i-th class, n_i is the number of training samples from class i, and n = ∑_{i=1}^g n_i is the total number of training samples. The aim is to correctly determine which class the input test sample y ∈ R^m belongs to. Although deep learning is very popular and powerful for classification tasks, it requires a very large-scale training set and considerable computational resources to train numerous parameters with complicated backpropagation. Wright et al. proposed sparse representation-based classification (SRC) [34] for face recognition. Under the assumption that samples of a specific subject lie in a linear subspace, a valid test sample y is expected to be approximated well by a linear combination of the training samples from the same class, which leads to a sparse representation coefficient over all training samples. Specifically, the test sample y is approximated by a linear combination of the dictionary atoms, i.e., y ≈ Ax, where x is the coefficient vector. A simple minimization model with sparse representation is minimize_{x ∈ R^n} (1/2)‖y − Ax‖_2^2 + λ‖x‖_0. In most SRC-based approaches, the ℓ_0 regularizer is relaxed to ℓ_1, and the model becomes the well-known LASSO model [35] in statistics.
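A minimal sketch (ours, not the authors' implementation) of solving the relaxed ℓ_1 model with a basic proximal gradient (ISTA) loop; the dictionary and the test vector are random stand-ins.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding: the proximity operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_lasso(A, y, lam, n_iter=2000):
    """Proximal gradient (ISTA) for min_x 0.5*||y - Ax||_2^2 + lam*||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz const. of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft(x - step * (A.T @ (A @ x - y)), step * lam)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 60))
x_true = np.zeros(60)
x_true[[3, 17, 42]] = [2.0, -1.5, 3.0]
y = A @ x_true
x_hat = ista_lasso(A, y, lam=0.1)
print(np.flatnonzero(np.abs(x_hat) > 0.5))  # the three true atoms should dominate
```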
The label information of the dictionary atoms is not used in the simple SRC model, hence the regression is based solely on the structure of each sample. When the subspaces spanned by different classes are not independent, SRC may represent the test image by training samples from multiple different classes. Considering the ideal situation where the test image should be approximated well only by the training samples from the correct class, the authors of [8-10] divided training samples into groups by prior label information and used group-sparsity regularizers. Naturally, the coefficient vector x then has the group structure x = [x_1^T, x_2^T, …, x_g^T]^T. This kind of group sparse classification (GSC) approach aims to represent the test image using the minimum number of groups, and thus an ideal model is (1), which is NP-hard. As stated in Section 1, a convex approximation of ℓ_{2,0}, i.e., the ℓ_{2,1}-norm, has been widely used as the best convex regularizer to incorporate the class labels.
More generally, the non-separable weighted ℓ_{2,1}-norm ‖W·‖_{2,1} has also been used as the regularizer in GSC [11,36,37]. For example, Tang et al. [11] proposed a weighted GSC (WGSC) model that involves the information of the similarity between the query sample and each class as well as the distance between the query sample and each training sample:

minimize_{x ∈ R^n} (1/2)‖y − Ax‖_2^2 + λ ∑_{i=1}^g w_i‖d_i ⊙ x_i‖_2,   (7)

where d_i ∈ R^{n_i} penalizes the distance between y and each training sample of the i-th class, w_i is set to assess the relative importance of training samples from the i-th class for representing the test sample, and ⊙ denotes element-wise multiplication. Specifically, the weights are computed by (9), where σ_1 and σ_2 are bandwidth parameters, r_i computes the distance from y to the individual subspace generated by A_i, and r_min denotes the minimum reconstruction error among {r_i}_{i=1}^g. The regularizer in (7) can be written as a non-separable weighted ℓ_{2,1}-norm ‖Wx‖_{2,1}, where W ∈ R^{n×n} is the diagonal matrix whose i-th diagonal block is w_i diag(d_i). For the aforementioned methods, after obtaining the optimal solution (denoted by x̂ = [x̂_1^T, x̂_2^T, …, x̂_g^T]^T), they assign y to the class that minimizes the class reconstruction residual defined by ‖y − A_i x̂_i‖_2.
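The final class-assignment rule can be sketched as follows (our own toy illustration; the block matrices and coefficients are hypothetical).

```python
import numpy as np

def classify_by_residual(y, A_blocks, x_blocks):
    """Assign y to the class i minimizing ||y - A_i x_i||_2."""
    residuals = [np.linalg.norm(y - Ai @ xi) for Ai, xi in zip(A_blocks, x_blocks)]
    return int(np.argmin(residuals)), residuals

# Toy example with two classes; the coefficients of class 0 reconstruct y exactly.
A_blocks = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[1.0], [1.0]])]
x_blocks = [np.array([2.0, 3.0]), np.array([0.1])]
y = np.array([2.0, 3.0])
label, res = classify_by_residual(y, A_blocks, x_blocks)
print(label)  # 0
```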
Although the ℓ_{2,1} regularizer and its weighted variants are widely used in GSC- and WGSC-based methods, they not only suppress the number of selected classes but also suppress significant nonzero coefficients within classes. The latter may lead to underestimation of high-amplitude elements and adversely affect the performance. Nonconvex regularizers such as ℓ_{2,p} (0 < p < 1) [37] and group MCP [38] make the corresponding optimization problems nonconvex. Therefore, we hope to use a regularizer which can reduce the bias and approximate ℓ_{2,0} better than ℓ_{2,1} while ensuring the overall convexity of the problem.
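The within-class suppression can be seen directly from the proximity operator of the ℓ_2-norm on one group (our own sketch): every surviving group norm is shrunk by the same threshold t, so high-amplitude groups are biased downward just as much as small ones, whereas an ℓ_{2,0}-like penalty would leave them untouched.

```python
import numpy as np

def prox_group(xg, t):
    """Prox of t * ||.||_2 on one group: group soft-thresholding."""
    nrm = np.linalg.norm(xg)
    return np.zeros_like(xg) if nrm <= t else (1.0 - t / nrm) * xg

t = 1.0
small = np.array([1.5, 0.0])    # group norm 1.5
large = np.array([30.0, 40.0])  # group norm 50
print(np.linalg.norm(prox_group(small, t)))  # 0.5: shrunk by t
print(np.linalg.norm(prox_group(large, t)))  # 49.0: shrunk by the same t
```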

GME of Weighted ℓ_{2,1}-Norm and Its Properties
Although the ℓ_{2,1}-norm (or its weighted variants) is the favored approach to approximating ℓ_{2,0} in the literature on group sparse estimation, it has a large bias and does not promote group sparsity as effectively as ℓ_{2,0}. Since GME provides an approach to approximate direct discrete measures (e.g., ℓ_0 for sparsity, matrix rank for low-rankness) better than their convex envelopes, we propose to use it for designing group-sparsity-pursuing regularizers.
In the following, we consider the GME of the non-separable weighted ℓ_{2,1}-norm ‖W·‖_{2,1}, where W ∈ R^{l×n} is not necessarily a diagonal matrix. This is because in some applications, such as the classification problems [11,36,37] stated in Section 2.2 and also heterogeneous feature selection [40], weights are introduced inside groups as well (i.e., the weight of every entry can be different) to improve the estimation accuracy. The GME of ‖W·‖_{2,1} with B ∈ R^{b×n} is well-defined (the lack of coercivity requires a slight modification from min to inf) as

(‖W·‖_{2,1})_B(x) := ‖Wx‖_{2,1} − inf_{v ∈ R^n} [ ‖Wv‖_{2,1} + (1/2)‖B(x − v)‖_2^2 ],

and therefore we can formulate

minimize_{x ∈ R^n} (1/2)‖y − Ax‖_2^2 + λ(‖W·‖_{2,1})_B(x).

However, we should remark that ‖W·‖_{2,1} ∈ Γ_0(R^n) is even symmetric but not necessarily coercive or prox-friendly. (As found in ([39], Proposition 24.14), for Ψ ∈ Γ_0(Z) and L ∈ B(X, Z) satisfying LL* = μId with some μ > 0, we have Prox_{Ψ∘L}(x) = x + μ^{−1}L*(Prox_{μΨ}(Lx) − Lx) for x ∈ X. In such a special case, if Ψ is prox-friendly, Ψ ∘ L is also prox-friendly. However, for general L ∈ B(X, Z) not necessarily satisfying such standard conditions, we have to discuss the prox-friendliness of Ψ ∘ L case by case.) Fortunately, by Proposition 3 below, if rank(W) = l and B can be expressed as B = B̃W for some B̃ ∈ R^{b×l}, we can show the useful relation

(‖W·‖_{2,1})_{B̃W} = (‖·‖_{2,1})_{B̃} ∘ W,

which implies that the GME (‖W·‖_{2,1})_B of ‖W·‖_{2,1} can be handled as the LiGME of ‖·‖_{2,1} with the linear operator W.

Proposition 3. For Ψ ∈ Γ_0(Z) which is coercive and B̃ ∈ B(Z, Z̃), assume L ∈ B(X, Z) has full row rank. Then for any x ∈ X, (Ψ ∘ L)_{B̃L}(x) = (Ψ_{B̃} ∘ L)(x).

Proof. By the definition of GME, (Ψ ∘ L)_{B̃L}(x) = Ψ(Lx) − inf_{v ∈ X} [Ψ(Lv) + (1/2)‖B̃(Lx − Lv)‖^2]. Since L has full row rank, it is surjective, i.e., {Lv | v ∈ X} = Z, and hence the infimum over v ∈ X coincides with inf_{z ∈ Z} [Ψ(z) + (1/2)‖B̃(Lx − z)‖^2], which is attained because Ψ is coercive. Therefore (Ψ ∘ L)_{B̃L}(x) = Ψ(Lx) − min_{z ∈ Z} [Ψ(z) + (1/2)‖B̃(Lx − z)‖^2] = Ψ_{B̃}(Lx). Thus, we obtain the conclusion.
In the rest of the paper, we focus on the LiGME model of the ℓ_{2,1}-norm.

LiGME of ℓ_{2,1}-Norm
For simplicity as well as effectiveness in application to GSC and WGSC, we focus on the LiGME model of ‖·‖_{2,1} with an invertible linear operator W:

minimize_{x ∈ R^n} (1/2)‖y − Ax‖_2^2 + λ(‖·‖_{2,1})_B(Wx).   (20)

In this case, to achieve the overall-convexity condition (6), we can simply design B ∈ R^{m×n}, in a way similar to ([31], (48)), as in the next proposition.

Proposition 4. For an invertible W ∈ R^{n×n}, if B = √(θ/λ)AW^{−1} with θ ∈ [0, 1], then for the LiGME model in (20), the overall-convexity condition (6) holds, since A^T A − λ(BW)^T(BW) = (1 − θ)A^T A ⪰ O.

Model (20) can be applied to many different applications that conform to the group-sparsity structure.
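A numerical sanity check of this design (ours, with random data and a hypothetical diagonal W): with B = √(θ/λ)AW^{−1}, the convexity-condition matrix A^T A − λ(BW)^T(BW) reduces algebraically to (1 − θ)A^T A, which is positive semidefinite for θ ∈ [0, 1].

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 10
A = rng.standard_normal((m, n))
W = np.diag(rng.uniform(0.5, 2.0, n))  # invertible weight matrix (hypothetical)
lam, theta = 1.5, 0.9

# Design of Proposition 4: B = sqrt(theta/lam) * A * W^{-1}.
B = np.sqrt(theta / lam) * A @ np.linalg.inv(W)

# Condition (6) with L = W: A^T A - lam * (B W)^T (B W) >= O.
M = A.T @ A - lam * (B @ W).T @ (B @ W)  # algebraically (1 - theta) * A^T A
print(np.min(np.linalg.eigvalsh(M)) >= -1e-8)  # True
```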

Proposed Algorithm for Group-Sparsity Based Classification
Since the ℓ_{2,1} regularizer in GSC is unfair to classes of different sizes (see Appendix A) while the ℓ_{2,0} regularizer is not, our purpose is to use a better approximation of ℓ_{2,0} as the regularizer. Therefore, we apply model (20) to group-sparsity-based classification. Following GSC, we can set W = I_n in (20).
Inspired by WGSC [11], which carefully designs weights to enforce the locality and similarity information of samples, we can also set the weight matrix W according to (9). The classification algorithm is summarized in Algorithm 1.
The ℓ_{2,1}-norm regularized least squares problem in WGSC can be solved by a proximal gradient method [41]. Compared with it, step 2 in Algorithm 1 for solving (20) requires only one additional computation of Prox_{γ‖·‖_{2,1}} at each update (see (10) with w_i = 1).
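Prox_{γ‖·‖_{2,1}} itself is inexpensive: it applies group soft-thresholding block by block. A minimal sketch (ours, assuming contiguous groups):

```python
import numpy as np

def prox_l21(x, group_sizes, gamma):
    """Proximity operator of gamma * ||.||_{2,1}: groupwise soft-thresholding."""
    out = np.empty_like(x)
    start = 0
    for n_i in group_sizes:
        xi = x[start:start + n_i]
        nrm = np.linalg.norm(xi)
        out[start:start + n_i] = 0.0 if nrm <= gamma else (1.0 - gamma / nrm) * xi
        start += n_i
    return out

x = np.array([3.0, 4.0, 0.3, 0.4])
print(prox_l21(x, [2, 2], 1.0))  # [2.4, 3.2, 0., 0.]
```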

Experiments
First, by setting W = I_n, we conduct experiments on a relatively simple dataset to investigate the influence of the bias of the ℓ_{2,1} regularizer on the classification problem (especially when the training set is unbalanced) and to verify the performance improvement of using (‖·‖_{2,1})_B as the regularizer. The USPS handwritten digit database [42] has 11,000 samples of digits "0" through "9" (1100 samples per class). The dimension of each sample is 16 × 16. In our classification experiments, we vectorized them into 256-D vectors. The number of training samples per class is not necessarily equal and varies from 5 to 50 (the size of the test set is fixed to 50 images per class).
We set W = I_n for the proposed model (20) (the initialization of W should be modified accordingly in Algorithm 1) and compared it with GSC (with the ℓ_{2,1} regularizer) [10]. We set B = √(θ/λ)A and fix θ = 0.9 to achieve the overall convexity of the proposed method, and set κ = . The initial estimate is set to (x^(0), u^(0), v^(0)) = (O_{n×1}, O_{n×1}, O_{n×1}), and the stopping criterion is either ‖(x^(k), u^(k), v^(k)) − (x^(k+1), u^(k+1), v^(k+1))‖_2 < 10^{−4} or the number of steps reaching 10,000. Figure 2 shows an example with an unbalanced training set (digits "0" through "4" have 5 samples per class and "5" through "9" have 25 samples per class). The input (an image of digit "0") was misclassified (as digit "6") by GSC but classified correctly by the proposed method. The obtained coefficient vectors of GSC and the proposed method (both with λ = 4) are illustrated, and some samples corresponding to nonzero coefficients are also displayed in Figure 2. It can be seen that the samples from digit "6" made the greatest contribution to the representation in GSC, while samples from "5" and "0" also made small contributions. In our method, samples from the correct class "0" made the largest contribution and led to the correct result. This is reasonable, because our method does not suppress high-value coefficients too much, whereas ℓ_{2,1} does. The strong suppression by ℓ_{2,1} prevented the coefficients of the correct class from being large enough, which easily led to misclassification.

Algorithm 1: The proposed group-sparsity enhanced classification algorithm.
Input: a training sample dictionary A ∈ R^{m×n} with class information, a test sample vector y ∈ R^m, parameters λ, σ_1 and σ_2.
Compute the weight matrix W by (9), then iterate the proximal splitting updates for (20) until the stopping criterion is fulfilled.

Table 1 summarizes the recognition accuracy of GSC and of the proposed method with W = I_n. The training set includes digits "0" through "4" with β samples per class and "5" through "9" with α samples per class. Through numerical experiments, we found that GSC with λ = λ_GSC = 1.5 and the proposed method with λ = λ_prop = 3 perform well on this dataset. We also tested the proposed method with λ_GSC, which did not degrade much compared with using λ_prop. We see that the GSC model degrades when the training set is unbalanced, and the proposed method outperforms GSC especially in that case. Next, we conduct experiments on a classic face dataset to verify the validity of the proposed linearly involved model with the weight matrix W set according to (9). The ORL Database of Faces [43] contains 400 images of 40 distinct subjects (10 images per subject) with variations in lighting, facial expression (open or closed eyes, smiling or not smiling) and facial details (glasses or no glasses). In our experiments, following [44], all images were downsampled from 112 × 92 to 16 × 16 and then formed into 256-D vectors. The number of training samples per class is not necessarily equal and varies from 4 to 8 (the test set is fixed to 2 images per class).
We compared the proposed model (20) (with W = I_n and with W by (9), respectively) with GSC [10] and WGSC [11]. To achieve the overall convexity, we set B = √(θ/λ)AW^{−1} with 0 ≤ θ ≤ 1 and fix θ = 0.9 for the proposed method. The settings of (ι, τ, κ), the initial estimate, and the stopping criterion are the same as in the previous experiment. When λ is set too small, the obtained coefficient vector is not group sparse; when σ_1 or σ_2 is set too small, the locality or similarity information plays a decisive role. We found that λ = 0.05 for the ℓ_{2,1}-regularizer-based methods (i.e., GSC and WGSC), λ = 0.2 for the proposed method, and σ_1 ∈ [2, 4], σ_2 ∈ [0.5, 2] for the weight-involved methods (i.e., WGSC and the proposed method with W by (9)) work well on this dataset.
Figure 3 shows a classification result of WGSC and the proposed method (W by (9)) (both with σ_1 = 4, σ_2 = 2) when the training set is unbalanced (20 subjects have 8 samples per class and the others have 6 samples per class). The input is an image of subject 10, which was misclassified as subject 8 by WGSC but classified correctly by the proposed method. Table 2 summarizes the recognition accuracy of GSC, the proposed method with W = I_n, WGSC, and the proposed method with W computed by (9). The training set is such that 20 subjects have β samples per class and the others have α samples per class. With the strategically designed matrix (9), WGSC achieves a significant improvement over GSC. The proposed method with W computed by (9) further improves the performance, especially when the training set is unbalanced.

Conclusions
In this paper, the potential applicability and effectiveness of using nonconvex regularizers within a convex optimization framework was explored. We proposed a generalized Moreau enhancement (GME) of the weighted ℓ_{2,1}-norm and analyzed its relationship with the linearly involved GME of the ℓ_{2,1}-norm. The proposed regularizer is nonconvex and promotes group sparsity more effectively than ℓ_{2,1} while maintaining the overall convexity of the regression model. The model can be used in many applications, and we applied it to classification problems. Our model makes use of the grouping structure given by class information and suppresses the tendency to underestimate high-amplitude coefficients. Experimental results showed that the proposed method is effective for image classification.

Figure 2. Estimated sparse coefficients x̂ by GSC and the proposed method, respectively.

Figure 3. An example of results by WGSC and the proposed method.

Table 1. Recognition results on the USPS database (α = max_i{n_i}, β = min_i{n_i}).

Table 2. Recognition results on the ORL database (α = max_i{n_i}, β = min_i{n_i}).