Abstract
This paper proposes a new group-sparsity-inducing regularizer to approximate the ℓ2,0 pseudo-norm. The regularizer is nonconvex and can be seen as a linearly involved generalized Moreau enhancement of the ℓ2,1-norm. Moreover, the overall convexity of the corresponding group-sparsity-regularized least squares problem can be achieved. The model can handle general group configurations such as weighted group sparse problems, and can be solved through a proximal splitting algorithm. Among the applications, considering that the bias of a convex regularizer may lead to incorrect classification results, especially for unbalanced training sets, we apply the proposed model to the (weighted) group sparse classification problem. The proposed classifier exploits the label, similarity, and locality information of samples, and it suppresses the bias of convex-regularizer-based classifiers. Experimental results demonstrate that the proposed classifier improves the performance of convex-regularizer-based methods, especially when the training data set is unbalanced. This paper enhances the potential applicability and effectiveness of using nonconvex regularizers in the framework of convex optimization.
1. Introduction
In recent decades, sparse reconstruction has become an active topic in many areas, such as signal processing, statistics, and machine learning [1]. By reconstructing a sparse solution from a linear measurement, we can obtain an expression of high-dimensional data as a vector with only a small number of nonzero entries. In practical applications, the data of interest can often be assumed to have a special structure. For example, in microarray analysis of gene expression [2,3], hyperspectral image unmixing [4,5,6], force identification in industrial applications [7], classification problems [8,9,10,11,12,13], etc., the solution of interest often possesses a group-sparsity structure, namely the solution has a natural grouping of its coefficients and nonzero entries occur in only a few groups.
This paper focuses on the estimation of group sparse solutions, which is related to the Group LASSO (least absolute shrinkage and selection operator) [14]. Suppose $x = [x_1^\top, x_2^\top, \dots, x_g^\top]^\top \in \mathbb{R}^n$ is a group sparse signal, where $x_i \in \mathbb{R}^{n_i}$ ($\sum_{i=1}^{g} n_i = n$) and $g$ is the number of groups. Just as the ℓ0 pseudo-norm is used for the evaluation of sparsity, the group sparsity of $x$ can be evaluated with the ℓ2,0 pseudo-norm, i.e., $\|x\|_{2,0} := \big\| \big(\|x_1\|_2, \|x_2\|_2, \dots, \|x_g\|_2\big)^\top \big\|_0$, where $\|\cdot\|_2$ is the Euclidean norm and $\|\cdot\|_0$ is the ℓ0 pseudo-norm, which counts the number of nonzero entries in a vector.
The group sparse regularized least squares problem can be modeled as
$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \;\; \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_{2,0}, \qquad (1)$$
where $A \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^m$ are known, and $\lambda > 0$ is the regularization parameter. However, the employment of the ℓ2,0 pseudo-norm makes (1) NP-hard [15]. Most studies in applications replace the nonconvex regularizer with its tightest convex envelope [16] (or its weighted variants), and the following regularized least squares problem has been proposed, known as the Group LASSO [14],
$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \;\; \frac{1}{2}\|y - Ax\|_2^2 + \lambda \sum_{i=1}^{g} w_i \|x_i\|_2, \qquad (2)$$
where the weights $w_i > 0$ ($i = 1, \dots, g$) in the regularization term
$$\|x\|_{w,2,1} := \sum_{i=1}^{g} w_i \|x_i\|_2 \qquad (3)$$
(this is the weighted ℓ2,1-norm of $x$, i.e., a separable weighted version [17] of the ℓ2,1-norm $\|x\|_{2,1} := \sum_{i=1}^{g} \|x_i\|_2$) are used to adjust for group sizes, with $w_i = \sqrt{n_i}$ in [14,18]. We give a simple but clear explanation in Appendix A to show the bias of the ℓ2,1-norm caused by group size in the application of group sparse classification (GSC).
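To make these group-sparsity measures concrete, the following minimal sketch (our own illustration, not code from the paper) evaluates the ℓ2,0, ℓ2,1, and weighted ℓ2,1 values of a vector partitioned into groups; the group sizes, weights, and test vector are arbitrary choices for the example.

```python
import numpy as np

def split_groups(x, sizes):
    """Split a stacked vector x into blocks x_1, ..., x_g of the given sizes."""
    return np.split(x, np.cumsum(sizes)[:-1])

def l20(x, sizes):
    """ell_{2,0}: number of groups with nonzero Euclidean norm."""
    return sum(np.linalg.norm(xi) > 0 for xi in split_groups(x, sizes))

def l21(x, sizes, w=None):
    """(Weighted) ell_{2,1}: sum of (weighted) group Euclidean norms."""
    groups = split_groups(x, sizes)
    w = np.ones(len(groups)) if w is None else np.asarray(w)
    return float(sum(wi * np.linalg.norm(xi) for wi, xi in zip(w, groups)))

# Example: 3 groups of sizes 2, 3, 2; only the first group is active.
x = np.array([3.0, -4.0, 0.0, 0.0, 0.0, 0.0, 0.0])
sizes = [2, 3, 2]
print(l20(x, sizes))                    # 1 (one nonzero group)
print(l21(x, sizes))                    # 5.0 (= ||(3, -4)||_2)
print(l21(x, sizes, w=np.sqrt(sizes)))  # ~7.07, Group-LASSO weights w_i = sqrt(n_i)
```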
Although the convex optimization problem (2) has been used as a standard model for group sparse estimation applications, the convex regularizer does not necessarily promote group sparsity sufficiently, mainly because the ℓ2,1-norm is only an approximation of the ℓ2,0 pseudo-norm within the severe restriction of convexity. To promote group sparsity more effectively than convex regularizers, nonconvex regularizers such as group SCAD (smoothly clipped absolute deviation) [3], group MCP (minimax concave penalty) [18,19], ℓp,q regularization [20], iterative weighted group minimization [21], and the ℓ0(ℓ2) penalty [22] have been used for group sparse estimation problems. However, they lose the overall convexity of the optimization problems, which means that their algorithms have no guarantee of convergence to global minimizers of the overall cost functions. (In [23], a nonconvex regularizer which can preserve the overall convexity was proposed, but the fidelity term of its optimization model is restricted to the case $A = I$, where $I$ is the identity matrix, and so it cannot be applied to (1) for general $A$.)
In this paper, we propose a generalized weighted group sparse estimation model based on the linearly involved generalized-Moreau-enhanced (LiGME) approach [24] that uses nonconvex regularizer while maintaining the overall convexity of the optimization problem. Our contributions can be summarized as follows:
- We show in Proposition 2 that the generalized Moreau enhancement (GME) of the weighted ℓ2,1-norm, i.e., $(\|\cdot\|_{w,2,1})_B$ (see (11)), can bridge the gap between the weighted ℓ2,1-norm and the ℓ2,0 pseudo-norm. For the non-separable weighted ℓ2,1-norm, i.e., $\|W\cdot\|_{2,1}$, its GME can be expressed as a LiGME of the ℓ2,1-norm in the case where the weight matrix $W$ has full row-rank.
- We present a convex regularized least squares model with a nonconvex group-sparsity-promoting regularizer based on LiGME. It can serve as a unified model for many types of group-sparsity-related applications.
- We illustrate the unfairness of the ℓ2,1 regularizer in unbalanced classification and then apply the proposed model to reduce this unfairness in GSC and weighted GSC (WGSC) [11].
The remainder of this paper is organized as follows. In Section 2, we give a brief review of the LiGME model and the WGSC method. In Section 3, we present our group-sparsity-enhanced representation model and its mathematical properties. In Section 4, we apply the proposed model to group-sparsity-based classification problems. The conclusion is given in Section 5.
A preliminary short version of this paper was presented at a conference [25].
2. Preliminaries
2.1. Review of Linearly Involved Generalized-Moreau-Enhanced (LiGME) Model
We first give a brief review of linearly involved generalized-Moreau-enhanced (LiGME) models, which are closely related to our method. Although the convex ℓ1-norm (or nuclear norm) is the most frequently adopted regularizer for sparsity (or low-rank) pursuing problems, it tends to underestimate high-amplitude values (or large singular values) [26,27]. Convexity-preserving nonconvex regularizers have been widely explored in [24,28,29,30,31,32,33]; they promote sparsity (or low-rankness) more effectively than convex regularizers without losing the overall convexity. Among them, the generalized minimax concave (GMC) function in [31] does not rely on strong assumptions on the least squares term and has great potential for dealing with nonconvex enhancements of the ℓ1-norm. Motivated by the GMC function, the LiGME model [24] provides a general framework for constructing linearly involved nonconvex regularizers for sparsity (or low-rank) regularized linear least squares problems while maintaining the overall convexity of the cost function.
Let $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{Z}$, and $\tilde{\mathcal{Z}}$ be finite-dimensional real Hilbert spaces. Let a function $\Psi \in \Gamma_0(\mathcal{Z})$ be coercive with $\min_{v \in \mathcal{Z}} \Psi(v) = 0$. Here $\Gamma_0(\mathcal{H})$ is the set of proper (i.e., $\operatorname{dom} f := \{x \in \mathcal{H} : f(x) < \infty\} \neq \emptyset$), lower semicontinuous (i.e., the level set $\{x \in \mathcal{H} : f(x) \leq \alpha\}$ is closed for every $\alpha \in \mathbb{R}$), convex (i.e., $f(\lambda x + (1-\lambda) z) \leq \lambda f(x) + (1-\lambda) f(z)$ for every $x, z \in \mathcal{H}$ and $\lambda \in (0,1)$) functions from $\mathcal{H}$ to $(-\infty, \infty]$; a function $f$ is called coercive if $\|x\| \to \infty \Rightarrow f(x) \to \infty$. For $f \in \Gamma_0(\mathcal{H})$, the proximity operator of $f$ is defined by
$$\operatorname{prox}_f : \mathcal{H} \to \mathcal{H} : x \mapsto \underset{v \in \mathcal{H}}{\operatorname{argmin}} \left[ f(v) + \frac{1}{2}\|x - v\|^2 \right].$$
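As a quick sanity check on this definition, the following minimal sketch (our own illustration) evaluates the proximity operator of the ℓ1-norm both by direct numerical minimization of the defining objective and by the well-known soft-thresholding closed form; the test point and $\gamma$ are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def prox_numeric(f, x, gamma):
    """prox_{gamma f}(x) computed directly from the definition:
    argmin_v f(v) + ||x - v||^2 / (2 * gamma)."""
    obj = lambda v: f(v) + np.sum((x - v) ** 2) / (2.0 * gamma)
    return minimize(obj, x0=np.zeros_like(x), method="Nelder-Mead").x

def soft_threshold(x, gamma):
    """Closed-form prox of gamma * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

x, gamma = np.array([1.5, -0.3, 0.7]), 0.5
print(prox_numeric(lambda v: np.sum(np.abs(v)), x, gamma))  # approx [1.0, 0.0, 0.2]
print(soft_threshold(x, gamma))                             # [1.0, -0.0, 0.2]
```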
The generalized Moreau enhancement (GME) of $\Psi \in \Gamma_0(\mathcal{Z})$ with a linear operator $B : \mathcal{Z} \to \tilde{\mathcal{Z}}$ is defined as
$$\Psi_B : \mathcal{Z} \to \mathbb{R} : z \mapsto \Psi(z) - \min_{v \in \mathcal{Z}} \left[ \Psi(v) + \frac{1}{2}\|B(z - v)\|^2 \right], \qquad (4)$$
where $B$ is a tuning operator for the enhancement. Then the LiGME model is defined as the minimization of
$$J_{\Psi_B \circ \mathfrak{L}} : \mathcal{X} \to \mathbb{R} : x \mapsto \frac{1}{2}\|y - Ax\|^2 + \mu \, \Psi_B(\mathfrak{L}x), \qquad (5)$$
where $A : \mathcal{X} \to \mathcal{Y}$ and $\mathfrak{L} : \mathcal{X} \to \mathcal{Z}$ are linear operators, $y \in \mathcal{Y}$, and $\mu > 0$.
Please note that GMC [31] can be seen as a special case of (5) with $\Psi = \|\cdot\|_1$ and $\mathfrak{L} = \mathrm{Id}$, where Id is the identity operator. Model (5) can also be seen as an extension of [32,33].
Although the GME function in (4) is not convex in general for $B \neq O$, where $O$ is the zero operator, the overall convexity of the cost function (5) can be achieved with $B$ designed to satisfy the following convexity condition.
Proposition 1
([24]). The cost function $J_{\Psi_B \circ \mathfrak{L}}$ in (5) is convex if
$$A^* A - \mu \, \mathfrak{L}^* B^* B \mathfrak{L} \succeq O. \qquad (6)$$
A method of designing $B$ satisfying (6) for a given $(A, \mathfrak{L}, \mu)$ is provided in [24]; see Proposition A1 in Appendix B. For any $\Psi \in \Gamma_0(\mathcal{Z})$ that is coercive, even symmetric and prox-friendly (even symmetry means $\Psi(-v) = \Psi(v)$ for every $v \in \mathcal{Z}$; prox-friendly means that $\operatorname{prox}_{\gamma\Psi}$ is computable for $\gamma > 0$), [24] provides a proximal splitting algorithm (see Proposition A2 in Appendix B) with guaranteed convergence to a globally optimal solution of model (5) under the overall-convexity condition (6).
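To illustrate the GME construction (4) numerically, the following sketch (our own illustration, not from [24]) evaluates the GME of the ℓ1-norm by solving the inner minimization with a generic solver; it reproduces (approximately) the fact that the GME with $B = O$ coincides with the original norm and that a larger $B$ flattens the penalty on large entries. The test point and the scaled-identity choice of $B$ are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def gme(psi, B, z):
    """GME (4) of psi at z: psi(z) - min_v [ psi(v) + 0.5 * ||B (z - v)||^2 ]."""
    inner = lambda v: psi(v) + 0.5 * np.sum((B @ (z - v)) ** 2)
    return psi(z) - minimize(inner, x0=np.copy(z), method="Nelder-Mead").fun

l1 = lambda v: np.sum(np.abs(v))
z = np.array([2.0, -0.4, 0.0])
for scale in [0.0, 0.5, 2.0]:        # B = scale * I
    B = scale * np.eye(len(z))
    # scale = 0 gives (approximately) ||z||_1 = 2.4; a larger scale saturates the
    # penalty on large entries, so the value decreases toward a small cap.
    print(scale, round(gme(l1, B, z), 3))
```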
2.2. Basic Idea of Weighted Group Sparse Classification (WGSC)
As a relatively simple but typical scenario for the application of the idea proposed in this paper, we introduce the main idea of weighted group sparse classification (WGSC). Classification is one of the fundamental tasks in signal and image processing and pattern recognition. For a classification problem with $g$ classes of subjects, the training samples form a dictionary matrix $A = [A_1, A_2, \dots, A_g] \in \mathbb{R}^{m \times n}$, where $A_i = [a_{i1}, a_{i2}, \dots, a_{i n_i}] \in \mathbb{R}^{m \times n_i}$ is the subset of the training samples from subject $i$, $a_{ij} \in \mathbb{R}^m$ is the $j$-th training sample from the $i$-th class, $n_i$ is the number of training samples from class $i$, and $n = \sum_{i=1}^{g} n_i$ is the number of total training samples. The aim is to correctly determine which class the input test sample belongs to. Although deep learning is very popular and powerful for classification tasks, it requires a very large-scale training set and substantial computational resources for training numerous parameters with complicated back-propagation.
Wright et al. proposed sparse representation-based classification (SRC) [34] for face recognition. Under the assumption that samples of a specific subject lie in a linear subspace, a valid test sample is expected to be well approximated by a linear combination of the training samples from the same class, which leads to a sparse representation coefficient over all training samples. Specifically, the test sample $y \in \mathbb{R}^m$ is approximated by a linear combination of the dictionary atoms, i.e., $y \approx Ax$, where $x \in \mathbb{R}^n$ is the coefficient vector. A simple minimization model with sparse representation can be $\min_{x} \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_0$. In most SRC-based approaches, the ℓ0 regularizer is relaxed to the ℓ1-norm, and the model becomes the well-known LASSO model [35] in statistics.
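For reference, the sketch below (our own illustration, not the SRC implementation of [34]) solves the relaxed model $\min_x \frac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1$ with the standard iterative soft-thresholding algorithm (ISTA); the data and parameters are random placeholders.

```python
import numpy as np

def ista_lasso(A, y, lam, n_iter=500):
    """ISTA for 0.5 * ||y - A x||_2^2 + lam * ||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - y))      # gradient step on the fidelity term
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 60))
x_true = np.zeros(60); x_true[[3, 17, 42]] = [2.0, -1.5, 1.0]
y = A @ x_true
x_hat = ista_lasso(A, y, lam=0.1)
print(np.flatnonzero(np.abs(x_hat) > 0.1))      # support close to {3, 17, 42}
```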
The label information of the dictionary atoms is not used in the simple SRC model, hence the regression is based solely on the structure of each sample. When the subspaces spanned by different classes are not independent, SRC may lead the test image to be represented by training samples from multiple different classes. Considering the ideal situation in which the test image should be well approximated only by the training samples from the correct class, the authors of [8,9,10] divided the training samples into groups by prior label information and used group-sparsity regularizers. Naturally, the coefficient vector has the group structure $x = [x_1^\top, x_2^\top, \dots, x_g^\top]^\top$, where $x_i \in \mathbb{R}^{n_i}$. This kind of group sparse classification (GSC) approach aims to represent the test image using the minimum number of groups, and thus an ideal model is (1), which is NP-hard. As stated in Section 1, a convex approximation of the ℓ2,0 pseudo-norm, i.e., the ℓ2,1-norm, has been widely used as the best convex regularizer to incorporate the class labels.
More generally, the non-separable weighted ℓ2,1-norm, i.e., $\|Wx\|_{2,1}$ with a (not necessarily diagonal) weight matrix $W$, has also been used as the regularizer in GSC [11,36,37]. For example, Tang et al. [11] proposed a weighted GSC (WGSC) model as follows, which involves the information of the similarity between the query sample and each class as well as the distance between the query sample and each training sample,
$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \;\; \frac{1}{2}\|y - Ax\|_2^2 + \lambda \sum_{i=1}^{g} \omega_i \, \|d_i \odot x_i\|_2, \qquad (7)$$
where $d_i \in \mathbb{R}^{n_i}$ penalizes the distance between $y$ and each training sample of the $i$-th class, $\omega_i > 0$ is set to assess the relative importance of the training samples from the $i$-th class for representing the test sample, and here ⊙ denotes element-wise multiplication. Specifically, the weights are computed from distances between the test sample and the training data (we refer to [11] for the explicit formulas): $\sigma_1$ and $\sigma_2$ therein are bandwidth parameters, $d_i$ involves the distance from $y$ to each training sample of class $i$, and $\omega_i$ involves the distance from $y$ to the individual subspace generated by $A_i$ as well as the minimum reconstruction error of $y$. The regularizer in (7) can be written as a non-separable weighted ℓ2,1-norm, i.e., $\|Wx\|_{2,1}$, where
$$W := \operatorname{diag}\big(\omega_1 d_1^\top, \omega_2 d_2^\top, \dots, \omega_g d_g^\top\big) \in \mathbb{R}^{n \times n} \qquad (9)$$
is the diagonal matrix whose diagonal entries are $\omega_i d_{ij}$ ($j = 1, \dots, n_i$, $i = 1, \dots, g$).
For the aforementioned methods, after obtaining the optimal solution (denoted by $\hat{x} = [\hat{x}_1^\top, \dots, \hat{x}_g^\top]^\top$), they assign $y$ to the class that minimizes the class reconstruction residual defined by $r_i(y) := \|y - A_i \hat{x}_i\|_2$.
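The class-assignment step above is common to all of these methods; a minimal sketch (our own illustration) is given below, where the coefficient blocks would normally come from the chosen regularized regression (GSC, WGSC, or the proposed model) and are hand-made placeholders here.

```python
import numpy as np

def classify_by_residual(y, A_groups, x_hat_groups):
    """Assign y to argmin_i || y - A_i x_hat_i ||_2 (class reconstruction residual)."""
    residuals = [np.linalg.norm(y - Ai @ xi) for Ai, xi in zip(A_groups, x_hat_groups)]
    return int(np.argmin(residuals)), residuals

rng = np.random.default_rng(1)
A_groups = [rng.standard_normal((8, 3)) for _ in range(2)]   # 2 classes, 3 samples each
y = A_groups[0] @ np.array([1.0, 0.0, 0.0])                  # test sample from class 0
x_hat_groups = [np.array([0.9, 0.1, 0.0]), np.zeros(3)]      # placeholder coefficients
label, res = classify_by_residual(y, A_groups, x_hat_groups)
print(label, np.round(res, 3))                               # expected label: 0
```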
Although the ℓ2,1 regularizer and its weighted variants are widely used in GSC- and WGSC-based methods, they not only suppress the number of selected classes, but also suppress significant nonzero coefficients within classes. The latter may lead to underestimation of high-amplitude elements and adversely affect the performance. Nonconvex regularizers, such as the one in [37] and group MCP [38], make the corresponding optimization problems nonconvex. Therefore, we hope to use a regularizer which can reduce the bias and approximate the ℓ2,0 pseudo-norm better than the ℓ2,1-norm while ensuring the overall convexity of the problem.
3. LiGME Model for Group Sparse Estimation
3.1. GME of the Weighted ℓ2,1-Norm and Its Properties
Although the ℓ2,1-norm (or its weighted variants) is the favored approach to approximate the ℓ2,0 pseudo-norm in the literature of group sparse estimation, it has a large bias and does not promote group sparsity as effectively as ℓ2,0. Since GME provides an approach to approximate direct discrete measures (e.g., the ℓ0 pseudo-norm for sparsity, the matrix rank for low-rankness) better than their convex envelopes, we propose to use it for designing group-sparsity-pursuing regularizers.
More generally, let us consider the GME of $\|\cdot\|_{w,2,1}$ in (3). Clearly, $\|\cdot\|_{w,2,1}$ is coercive, even symmetric and prox-friendly; its proximity operator can be computed group-wise by
$$\big(\operatorname{prox}_{\gamma \|\cdot\|_{w,2,1}}(x)\big)_i = \max\left\{1 - \frac{\gamma w_i}{\|x_i\|_2}, \, 0\right\} x_i \quad (i = 1, \dots, g),$$
where $x = [x_1^\top, x_2^\top, \dots, x_g^\top]^\top \in \mathbb{R}^n$ is a signal with group structure, $\gamma > 0$, and the $i$-th block is set to zero when $x_i = 0$.
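A minimal sketch of this group-wise soft-thresholding (our own illustration; the group sizes, weights, and $\gamma$ are arbitrary):

```python
import numpy as np

def prox_weighted_l21(x, sizes, w, gamma):
    """i-th block of prox_{gamma ||.||_{w,2,1}}(x):
    max(1 - gamma * w_i / ||x_i||_2, 0) * x_i  (and 0 when x_i = 0)."""
    out, start = np.zeros_like(x), 0
    for wi, ni in zip(w, sizes):
        xi = x[start:start + ni]
        norm = np.linalg.norm(xi)
        if norm > 0:
            out[start:start + ni] = max(1.0 - gamma * wi / norm, 0.0) * xi
        start += ni
    return out

x = np.array([3.0, 4.0, 0.2, -0.1, 1.0])
print(prox_weighted_l21(x, sizes=[2, 2, 1], w=[1.0, 1.0, 1.0], gamma=1.0))
# First group (norm 5) is shrunk by the factor 0.8; the second (norm ~0.22) and
# the third (norm 1) fall below the threshold and are set to zero.
```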
Actually, the GME of $\|\cdot\|_{w,2,1}$ with $B$ (see (4)):
$$\big(\|\cdot\|_{w,2,1}\big)_B(x) := \|x\|_{w,2,1} - \min_{v \in \mathbb{R}^n} \left[ \|v\|_{w,2,1} + \frac{1}{2}\|B(x - v)\|_2^2 \right], \qquad (11)$$
where $B$ is a matrix with $n$ columns and $x = [x_1^\top, \dots, x_g^\top]^\top \in \mathbb{R}^n$, can serve as a parametric bridge between $\|\cdot\|_{w,2,1}$ and $\|\cdot\|_{2,0}$.
Proposition 2.
(GME of $\|\cdot\|_{w,2,1}$ can bridge the gap between $\|\cdot\|_{w,2,1}$ and $\|\cdot\|_{2,0}$.) Let $B_\mu := \operatorname{diag}\big(\sqrt{w_1/\mu}\, I_{n_1}, \sqrt{w_2/\mu}\, I_{n_2}, \dots, \sqrt{w_g/\mu}\, I_{n_g}\big)$ for $\mu > 0$, where $w_i > 0$ is the weight in (3) for $i = 1, \dots, g$. Then, for any $x \in \mathbb{R}^n$,
$$\lim_{\mu \to 0^+} \frac{2}{\mu} \big(\|\cdot\|_{w,2,1}\big)_{B_\mu}(x) = \sum_{i=1}^{g} w_i \, \big\|\, \|x_i\|_2 \,\big\|_0 .$$
Together with the fact that $\big(\|\cdot\|_{w,2,1}\big)_{O} = \|\cdot\|_{w,2,1}$, where $O$ is the zero matrix, the regularization term $\big(\|\cdot\|_{w,2,1}\big)_{B_\mu}$ can serve as a parametric bridge between $\|\cdot\|_{w,2,1}$ and (a weighted version of) $\|\cdot\|_{2,0}$. As a special case, the GME of $\|\cdot\|_{2,1}$ can serve as a parametric bridge between $\|\cdot\|_{2,1}$ and $\|\cdot\|_{2,0}$.
Proof.
The regularization term satisfies $\big(\|\cdot\|_{w,2,1}\big)_{B_\mu}(x) = \sum_{i=1}^{g} \theta_i(x_i)$, where
$$\theta_i(x_i) := w_i \|x_i\|_2 - \min_{v_i \in \mathbb{R}^{n_i}} \left[ w_i \|v_i\|_2 + \frac{w_i}{2\mu} \|x_i - v_i\|_2^2 \right]$$
for $i = 1, \dots, g$. By ([39], Example 24.20), we obtain
$$\operatorname*{argmin}_{v_i \in \mathbb{R}^{n_i}} \left[ \|v_i\|_2 + \frac{1}{2\mu} \|x_i - v_i\|_2^2 \right] = \max\left\{1 - \frac{\mu}{\|x_i\|_2}, \, 0\right\} x_i .$$
Then, we obtain
$$\theta_i(x_i) = \begin{cases} w_i \left( \|x_i\|_2 - \dfrac{\|x_i\|_2^2}{2\mu} \right), & \|x_i\|_2 \leq \mu, \\[4pt] \dfrac{w_i \mu}{2}, & \|x_i\|_2 > \mu, \end{cases}$$
and
$$\lim_{\mu \to 0^+} \frac{2}{\mu} \theta_i(x_i) = w_i \, \big\|\, \|x_i\|_2 \,\big\|_0 .$$
□
Figure 1 illustrates simple examples of $\|\cdot\|_{2,1}$ and $\big(\|\cdot\|_{2,1}\big)_B$ in the one-group case. As we can see, $\big(\|\cdot\|_{2,1}\big)_B$ can approximate the ℓ2,0 pseudo-norm better than the ℓ2,1-norm does.
Figure 1.
Simple examples of two group sparse regularizers (one-group case): (a) the regularizer $\|\cdot\|_{2,1}$; (b) the regularizer $\big(\|\cdot\|_{2,1}\big)_B$.
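In the one-group case with $B = (1/\sqrt{\mu})I$, the GME of the Euclidean norm has the closed form of a minimax-concave-type penalty; the sketch below (our own illustration, with an arbitrary $\mu$) evaluates it against the plain norm, which is the qualitative behavior shown in Figure 1.

```python
import numpy as np

def gme_single_group(x, mu):
    """Closed form of (||.||_2)_B at x for B = (1/sqrt(mu)) I:
    ||x|| - ||x||^2 / (2*mu) if ||x|| <= mu, and mu/2 otherwise."""
    r = np.linalg.norm(x)
    return r - r * r / (2.0 * mu) if r <= mu else mu / 2.0

mu = 1.0
for t in [0.0, 0.25, 0.5, 1.0, 2.0, 5.0]:
    x = np.array([t])                              # one-dimensional group
    print(t, round(gme_single_group(x, mu), 3))    # 0.0, 0.219, 0.375, 0.5, 0.5, 0.5
# Unlike ||x||_2, the enhanced penalty saturates at mu/2 for large ||x||_2,
# so high-amplitude coefficients are not penalized further (less bias).
```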
Of course, as reviewed in Section 2.1, we can minimize the corresponding cost function (see (5)) with the algorithm (A3) in Proposition A2, under the overall-convexity condition (6).
In the following, we consider the GME of the non-separable weighted ℓ2,1-norm $\|W\cdot\|_{2,1}$, where $W$ is not necessarily a diagonal matrix. This is because in some applications, such as the classification problems [11,36,37] stated in Section 2.2 and also heterogeneous feature selection [40], weights are introduced inside groups as well (i.e., the weight of every entry can be different) to improve the estimation accuracy. The GME of $\|W\cdot\|_{2,1}$ with $B$ is well-defined (the lack of coercivity requires a slight modification from min to inf) as
$$\big(\|W\cdot\|_{2,1}\big)_B(x) := \|Wx\|_{2,1} - \inf_{v \in \mathbb{R}^n} \left[ \|Wv\|_{2,1} + \frac{1}{2}\|B(x - v)\|_2^2 \right],$$
and therefore we can formulate
$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \;\; \frac{1}{2}\|y - Ax\|_2^2 + \mu \big(\|W\cdot\|_{2,1}\big)_B(x).$$
However, we should remark that $\|W\cdot\|_{2,1}$ is even symmetric but not necessarily coercive or prox-friendly. (As found in ([39], Proposition 24.14), for $f \in \Gamma_0(\mathbb{R}^p)$ and $W$ satisfying $WW^\top = \nu I$ with some $\nu > 0$, we have $\operatorname{prox}_{f \circ W}(x) = x + \nu^{-1} W^\top \big(\operatorname{prox}_{\nu f}(Wx) - Wx\big)$ for $x \in \mathbb{R}^n$. In such a special case, if $f$ is prox-friendly, $f \circ W$ is also prox-friendly. However, for general $W$ not necessarily satisfying such standard conditions, we have to discuss the prox-friendliness of $\|W\cdot\|_{2,1}$ case by case.) Fortunately, by Proposition 3 below, if $W$ has full row-rank and $B$ can be expressed as $B = \tilde{B}W$ for some matrix $\tilde{B}$, we can show the useful relation
$$\big(\|W\cdot\|_{2,1}\big)_{\tilde{B}W} = \big(\|\cdot\|_{2,1}\big)_{\tilde{B}} \circ W,$$
which implies that the GME of $\|W\cdot\|_{2,1}$ can be handled as the LiGME of $\|\cdot\|_{2,1}$.
Proposition 3.
For $\Psi \in \Gamma_0(\mathbb{R}^p)$ which is coercive and $W \in \mathbb{R}^{p \times n}$, assume $W$ has full row-rank. Then for any $x \in \mathbb{R}^n$,
$$\big(\Psi \circ W\big)_B(x) = \Psi_{\tilde{B}}(Wx),$$
where $B := \tilde{B}W$ and $\tilde{B}$ is an arbitrary matrix with $p$ columns.
Proof.
On one hand, by the definition of GME, we have
$$\big(\Psi \circ W\big)_B(x) = \Psi(Wx) - \inf_{v \in \mathbb{R}^n} \left[ \Psi(Wv) + \frac{1}{2}\|\tilde{B}W(x - v)\|_2^2 \right] = \Psi(Wx) - \inf_{u \in \mathcal{R}(W)} \varphi(u),$$
where $\varphi$ is given by
$$\varphi(u) := \Psi(u) + \frac{1}{2}\|\tilde{B}(Wx - u)\|_2^2$$
and $\mathcal{R}(W) = \mathbb{R}^p$ by the full row-rank assumption. Therefore, $\big(\Psi \circ W\big)_B(x) = \Psi(Wx) - \min_{u \in \mathbb{R}^p} \varphi(u)$.
On the other hand, $\Psi_{\tilde{B}}(Wx) = \Psi(Wx) - \min_{u \in \mathbb{R}^p} \varphi(u)$ by definition. Thus, we obtain the conclusion. □
In the rest of the paper, we focus on the LiGME model of the ℓ2,1-norm.
3.2. LiGME of the ℓ2,1-Norm
For simplicity, as well as for effectiveness in the application to GSC and WGSC, we focus on the LiGME model of $\|\cdot\|_{2,1}$ with an invertible linear operator $\mathfrak{L}$,
$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \;\; \frac{1}{2}\|y - Ax\|_2^2 + \mu \big(\|\cdot\|_{2,1}\big)_B(\mathfrak{L}x). \qquad (20)$$
In this case, for achieving the overall-convexity condition (6), we can simply design $B$, in a way similar to ([31], (48)), as in the next proposition.
Proposition 4.
For an invertible $\mathfrak{L}$, let
$$B := \sqrt{\frac{\theta}{\mu}} \, A \mathfrak{L}^{-1} \quad \text{with } \theta \in [0, 1],$$
then for the LiGME model in (20), the overall-convexity condition (6) holds.
Proof.
By $\mu \, \mathfrak{L}^* B^* B \mathfrak{L} = \theta A^* A \preceq A^* A$ and Proposition 1, the overall convexity of (20) is ensured. □
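A minimal sketch of this construction (our own illustration, under the assumption that the overall-convexity condition takes the form $A^\top A - \mu\,\mathfrak{L}^\top B^\top B \mathfrak{L} \succeq O$ as in (6)); the matrices and parameters are random placeholders.

```python
import numpy as np

def design_B(A, L, mu, theta):
    """B = sqrt(theta / mu) * A * L^{-1}, as in Proposition 4, for an invertible L."""
    return np.sqrt(theta / mu) * A @ np.linalg.inv(L)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 12))
L = np.diag(rng.uniform(0.5, 2.0, size=12))    # e.g., a diagonal (invertible) weight matrix
mu, theta = 0.8, 0.9
B = design_B(A, L, mu, theta)

# Numerical check of (6): A^T A - mu * L^T B^T B L = (1 - theta) * A^T A >= O.
M = A.T @ A - mu * L.T @ B.T @ B @ L
print(np.min(np.linalg.eigvalsh(M)) >= -1e-8)  # True
```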
Model (20) can be applied to many different applications that conform to group-sparsity structure.
4. Application to Classification Problems
4.1. Proposed Algorithm for Group-Sparsity Based Classification
Since the ℓ2,1 regularizer in GSC is unfair for classes of different sizes (see Appendix A) while the ℓ2,0 regularizer is not, our purpose is to use a better approximation of $\|\cdot\|_{2,0}$ as the regularizer. Therefore, we apply model (20) to group-sparsity-based classification. Following GSC, we can set $\mathfrak{L} = I$ in (20).
Inspired by WGSC [11], which carefully designs weights to enforce the locality and similarity information of samples, we can also set $\mathfrak{L}$ to the weight matrix $W$ according to (9). The classification algorithm is summarized in Algorithm 1.
4.2. Experiments
First, by setting $\mathfrak{L} = I$, we conduct experiments on a relatively simple dataset to investigate the influence of the bias of the ℓ2,1 regularizer on the classification problem (especially when the training set is unbalanced), and to verify the performance improvement of using the GME regularizer $\big(\|\cdot\|_{2,1}\big)_B$. The USPS handwritten digit database [42] has 11,000 samples of digits “0” through “9” (1100 samples per class). Each sample is a $16 \times 16$ image; in our classification experiments, we vectorized them to 256-D vectors. The number of training samples for each class is not necessarily equal and varies from 5 to 50 (the size of the test set is fixed to 50 images per class).
We set $\mathfrak{L} = I$ (the initialization of the weight matrix in Algorithm 1 is modified accordingly) for the proposed model (20) and compared it with GSC (with the ℓ2,1 regularizer) [10]. We set $B$ as in Proposition 4 and fix $\theta$ to achieve the overall convexity of the proposed method; the remaining parameters are fixed empirically. The initial estimate is fixed in advance, and the stopping criterion is that either the relative change of the iterates falls below a fixed tolerance or the number of iterations reaches 10,000.
Figure 2 shows an example with an unbalanced training set (digits “0” through “4” have 5 samples per class and “5” through “9” have 25 samples per class). The input (an image of digit “0”) was misclassified (into digit “6”) by GSC, while it was classified correctly by the proposed method. The coefficient vectors obtained by GSC and by the proposed method are illustrated, and some samples corresponding to nonzero coefficients are also displayed in Figure 2. It can be seen that the samples from digit “6” made the greatest contribution to the representation in GSC, and samples from “5” and “0” also made small contributions. In our method, samples from the correct class “0” made the biggest contribution and led to the correct result. This is reasonable, because our method does not suppress high-value coefficients too much, whereas the ℓ2,1 regularizer does. The strong suppression by ℓ2,1 prevents the coefficients of the correct class from becoming large enough, which easily leads to misclassification.
| Algorithm 1: The proposed group-sparsity enhanced classification algorithm |
| Input: A matrix $A = [A_1, \dots, A_g]$ of training samples grouped by class information, a test sample vector $y$, and the model parameters. 1. Initialization: Set the initial point. Compute the weight matrix $W$ by (9) and set $\mathfrak{L} = W$ (or $\mathfrak{L} = I$ for the unweighted case). Choose $B$ satisfying the overall-convexity condition (6), e.g., as in Proposition 4. Choose the algorithm parameters satisfying (22). 2. For $k = 0, 1, 2, \dots$, compute the update (A3) until the stopping criterion is fulfilled. 3. Compute the class label of $y$ by the minimum class reconstruction residual $\min_i \|y - A_i \hat{x}_i\|_2$. Output: The class label corresponding to $y$. |
Figure 2.
Estimated sparse coefficients by GSC and proposed method respectively.
(For example, sufficiently large algorithm parameters can satisfy (22).)
Table 1 summarizes the recognition accuracy of GSC and the proposed method with $\mathfrak{L} = I$. The training set includes a varying number of samples per class for digits “0” through “4” and for digits “5” through “9” (see Table 1). Through numerical experiments, we found parameter settings for GSC and for the proposed method that perform well on this dataset. We also experimented with an alternative parameter setting for the proposed method, which did not degrade the performance much. We see that the GSC model degrades when the training set is unbalanced, and the proposed method outperforms GSC especially in such cases.
Table 1.
Recognition results on the USPS database.
Next, we conduct experiments on a classic face dataset to verify the validity of the proposed linearly involved model with the weight matrix set according to (9). The ORL Database of Faces [43] contains 400 images from 40 distinct subjects (10 images per subject) with variations in lighting, facial expressions (open or closed eyes, smiling or not smiling) and facial details (glasses or no glasses). In our experiments, following [44], all images were downsampled to $16 \times 16$ and then formed into 256-D vectors. The number of training samples for each class is not necessarily equal and varies from 4 to 8 (the test set is fixed to 2 images per class).
We compared the proposed model (20) (with $\mathfrak{L} = I$ and with $\mathfrak{L} = W$ by (9), respectively) with GSC [10] and WGSC [11]. In order to achieve the overall convexity, we set $B$ as in Proposition 4 and fix $\theta$ for the proposed method. The settings of the initial estimate and the stopping criterion are the same as those in the previous experiment. When the regularization parameter is assigned too small a value, the obtained coefficient vector is not group sparse; when the bandwidth parameter $\sigma_1$ or $\sigma_2$ is assigned too small a value, the information of locality or similarity plays a decisive role. We found parameter settings for the ℓ2,1-regularizer-based methods (i.e., GSC and WGSC), for the proposed method, and for the weight-involved methods (i.e., WGSC and the proposed method with $W$ by (9)) that work well on this dataset.
Figure 3 shows a classification result of WGSC and the proposed method (with $\mathfrak{L} = W$ by (9)) when the training set is unbalanced (20 subjects have 8 samples per class and the others have 6 samples per class). The input is an image of subject 10, which was misclassified into subject 8 by WGSC while classified correctly by the proposed method.
Figure 3.
An example of results by WGSC and proposed method.
Table 2 summarizes the recognition accuracy of GSC, the proposed method with $\mathfrak{L} = I$, WGSC, and the proposed method with $W$ computed by (9). In the training set, 20 subjects have one number of samples per class and the other subjects have another (see Table 2). With the strategically designed weight matrix (9), WGSC achieves a significant improvement over GSC. By using the proposed method with $W$ computed by (9), the performance can be further improved, especially when the training set is unbalanced.
Table 2.
Recognition results on the ORL database.
5. Conclusions
In this paper, the potential applicability and effectiveness of using nonconvex regularizers in a convex optimization framework was explored. We proposed a generalized Moreau enhancement (GME) of the weighted ℓ2,1-norm and analyzed its relationship with the linearly involved GME of the ℓ2,1-norm. The proposed regularizer is nonconvex and promotes group sparsity more effectively than the ℓ2,1-norm while maintaining the overall convexity of the regression model. The model can be used in many applications, and we applied it to classification problems. Our model makes use of the grouping structure given by class information and suppresses the tendency toward underestimation of high-amplitude coefficients. Experimental results showed that the proposed method is effective for image classification.
Author Contributions
Conceptualization, M.Y. and I.Y.; methodology, Y.C., M.Y. and I.Y.; software, Y.C.; writing-original draft, Y.C., writing-review and editing, M.Y. and I.Y. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by JSPS Grants-in-Aid grant number 18K19804 and by JST SICORP grant number JPMJSC20C6.
Data Availability Statement
Publicly available data sets were analyzed in this study. These data can be found here: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html##usps; https://cam-orl.co.uk/facedatabase.html; and http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html (accessed on 25 October 2021).
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| LASSO | Least Absolute Shrinkage and Selection Operator |
| SCAD | Smoothly Clipped Absolute Deviation |
| MCP | Minimax Concave Penalty |
| GMC | Generalized Minimax Concave |
| GME | Generalized Moreau Enhancement |
| LiGME | Linearly involved Generalized-Moreau-Enhanced (or Enhancement) |
| SRC | Sparse Representation-based Classification |
| GSC | Group Sparse Classification |
| WGSC | Weighted Group Sparse Classification |
Appendix A. The Bias of ℓ2,1 Regularizer in Group Sparse Classification
Using the ℓ2,1 regularizer in classification problems not only minimizes the number of selected classes, but also minimizes the ℓ2-norm of the coefficients within each class. The latter may adversely affect the classification result, since the optimal representation of a test sample by training samples of the correct subject may contain large coefficients. Moreover, in many classification applications, the number of training samples from different classes is not the same. We argue that the bias of the ℓ2,1 regularizer makes it unfair for classes of different sizes.
Example A1.
Suppose that a test sample $y$ can be represented by a combination of all samples from class $i$ without error, i.e., $y = A_i \alpha$ and $\alpha \neq 0$, where $A_i = [a_{i1}, \dots, a_{i n_i}]$ and $\alpha = [\alpha_1, \dots, \alpha_{n_i}]^\top \in \mathbb{R}^{n_i}$.
- (a)
- If the number of samples in this class is doubled by duplication, the training set of class $i$ becomes $\tilde{A}_i := [A_i, A_i]$. Obviously, $y$ can also be well represented by $\tilde{A}_i \tilde{\alpha}$, where $\tilde{\alpha} := [\alpha^\top/2, \alpha^\top/2]^\top$ (each coefficient is halved). However, $\|\tilde{\alpha}\|_2 = \|\alpha\|_2/\sqrt{2} < \|\alpha\|_2$. That is, the ℓ2,1 value of the first representation (before duplication) is greater than that of the second one (after duplication).
- (b)
- If the number of samples in this class is increased to $d n_i$ by copying $d$ times ($d > 1$), the training set of class $i$ becomes $\tilde{A}_i := [A_i, A_i, \dots, A_i]$ ($d$ copies). Obviously, $\tilde{\alpha} := [\alpha^\top/d, \dots, \alpha^\top/d]^\top$ gives a representation of $y$, where $y = \tilde{A}_i \tilde{\alpha}$. Then $\|\tilde{\alpha}\|_2 = \|\alpha\|_2/\sqrt{d}$.
Example A1 tells us that the group size affects the value of the ℓ2,1 regularizer. Even if the new training samples are only copies of the original samples (without adding any new information), the value of the ℓ2,1 regularizer decreases. Therefore, the ℓ2,1 regularizer is unfair for classes of different sizes: it tends to reject a class that has relatively few samples, because the corresponding coefficient vector is more likely to have a large regularizer value. Please note that the ℓ2,0 regularizer is independent of group size and does not have such unfairness.
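A short numerical check of Example A1 (our own illustration, with a random coefficient vector): duplicating the samples of a class lets the same test sample be represented with a strictly smaller group norm, i.e., a smaller ℓ2,1 value.

```python
import numpy as np

rng = np.random.default_rng(0)
A_i = rng.standard_normal((10, 4))       # training samples of class i (as columns)
alpha = rng.standard_normal(4)
y = A_i @ alpha                          # exact representation of the test sample

for d in [1, 2, 4]:                      # duplicate the class d times
    A_dup = np.hstack([A_i] * d)
    alpha_dup = np.concatenate([alpha / d] * d)
    assert np.allclose(A_dup @ alpha_dup, y)          # still an exact representation
    print(d, round(np.linalg.norm(alpha_dup), 4))     # equals ||alpha||_2 / sqrt(d)
```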
Appendix B. Parameter Tuning and Proximal Splitting Algorithm for LiGME Model
Proposition A1
([24], Proposition 2). For the model (5), a matrix $B$ satisfying the overall-convexity condition (6) can be constructed explicitly from $(A, \mathfrak{L}, \mu)$ via an eigendecomposition of a matrix built from $A$ and $\mathfrak{L}$; see [24] for the explicit construction.
Proposition A2
([24], Theorem 1). Consider the minimization of the cost function in (5) under the overall-convexity condition (6). Let a real Hilbert space $\mathcal{H}$ be the product space $\mathcal{H} := \mathcal{X} \times \mathcal{Z} \times \mathcal{Z}$ and define an operator $T_{\mathrm{LiGME}} : \mathcal{H} \to \mathcal{H}$ with parameters $(\sigma, \tau)$ by
Then the following holds:
- 1.
- $\operatorname{argmin} J_{\Psi_B \circ \mathfrak{L}} = \Xi\big(\operatorname{Fix}(T_{\mathrm{LiGME}})\big)$, where $\Xi : \mathcal{H} \to \mathcal{X} : (x, v, w) \mapsto x$.
- 2.
- Choose $(\sigma, \tau)$ satisfying (A1), where $\|\cdot\|_{\mathrm{op}}$ is the operator norm. Then $T_{\mathrm{LiGME}}$ is an averaged nonexpansive operator in the Hilbert space $\mathcal{H}$ equipped with a suitable inner product.
- 3.
- Assume the condition (A1) holds. Then, for any initial point $h_0 \in \mathcal{H}$, the sequence generated by the iteration (A3) converges to a point in $\operatorname{Fix}(T_{\mathrm{LiGME}})$, and its $\mathcal{X}$-component converges to a globally optimal solution of model (5).
References
- Theodoridis, S. Machine Learning: A Bayesian and Optimization Perspective; Academic Press: Cambridge, MA, USA, 2015. [Google Scholar]
- Ma, S.; Song, X.; Huang, J. Supervised group Lasso with applications to microarray data analysis. BMC Bioinform. 2007, 8, 60. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Wang, L.; Chen, G.; Li, H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics 2007, 23, 1486–1494. [Google Scholar] [CrossRef] [PubMed]
- Wang, X.; Zhong, Y.; Zhang, L.; Xu, Y. Spatial group sparsity regularized nonnegative matrix factorization for hyperspectral unmixing. IEEE Trans. Geosci. Remote Sens. 2017, 55, 6287–6304. [Google Scholar] [CrossRef]
- Drumetz, L.; Meyer, T.R.; Chanussot, J.; Bertozzi, A.L.; Jutten, C. Hyperspectral image unmixing with endmember bundles and group sparsity inducing mixed norms. IEEE Trans. Image Process. 2019, 28, 3435–3450. [Google Scholar] [CrossRef] [PubMed]
- Huang, J.; Huang, T.Z.; Zhao, X.L.; Deng, L.J. Nonlocal tensor-based sparse hyperspectral unmixing. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6854–6868. [Google Scholar] [CrossRef]
- Qiao, B.; Mao, Z.; Liu, J.; Zhao, Z.; Chen, X. Group sparse regularization for impact force identification in time domain. J. Sound Vib. 2019, 445, 44–63. [Google Scholar] [CrossRef]
- Majumdar, A.; Ward, R.K. Classification via group sparsity promoting regularization. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; pp. 861–864. [Google Scholar]
- Elhamifar, E.; Vidal, R. Robust classification using structured sparse representation. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 1873–1879. [Google Scholar]
- Huang, J.; Nie, F.; Huang, H.; Ding, C. Supervised and projected sparse coding for image classification. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, Bellevue, WA, USA, 14–18 July 2013. [Google Scholar]
- Tang, X.; Feng, G.; Cai, J. Weighted group sparse representation for undersampled face recognition. Neurocomputing 2014, 145, 402–415. [Google Scholar] [CrossRef]
- Rao, N.; Nowak, R.; Cox, C.; Rogers, T. Classification with the sparse group lasso. IEEE Trans. Signal Process. 2015, 64, 448–463. [Google Scholar] [CrossRef]
- Tan, S.; Sun, X.; Chan, W.; Qu, L.; Shao, L. Robust face recognition with kernelized locality-sensitive group sparsity representation. IEEE Trans. Image Process. 2017, 26, 4661–4668. [Google Scholar] [CrossRef]
- Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2006, 68, 49–67. [Google Scholar] [CrossRef]
- Natarajan, B.K. Sparse approximate solutions to linear systems. SIAM J. Comput. 1995, 24, 227–234. [Google Scholar] [CrossRef] [Green Version]
- Argyriou, A.; Foygel, R.; Srebro, N. Sparse prediction with the k-support norm. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Siem Reap, Cambodia, 13–16 December 2018; Volume 1, pp. 1457–1465. [Google Scholar]
- Deng, W.; Yin, W.; Zhang, Y. Group sparse optimization by alternating direction method. In Wavelets and Sparsity XV; International Society for Optics and Photonics: Bellingham, WA, USA, 2013; Volume 8858, p. 88580R. [Google Scholar]
- Huang, J.; Breheny, P.; Ma, S. A selective review of group selection in high-dimensional models. Stat. Sci. A Rev. J. Inst. Math. Stat. 2012, 27. [Google Scholar] [CrossRef]
- Breheny, P.; Huang, J. Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat. Comput. 2015, 25, 173–187. [Google Scholar] [CrossRef] [Green Version]
- Hu, Y.; Li, C.; Meng, K.; Qin, J.; Yang, X. Group sparse optimization via lp, q regularization. J. Mach. Learn. Res. 2017, 18, 960–1011. [Google Scholar]
- Jiang, L.; Zhu, W. Iterative Weighted Group Thresholding Method for Group Sparse Recovery. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 63–76. [Google Scholar] [CrossRef] [PubMed]
- Jiao, Y.; Jin, B.; Lu, X. Group Sparse Recovery via the ℓ0(ℓ2) Penalty: Theory and Algorithm. IEEE Trans. Signal Process. 2016, 65, 998–1012. [Google Scholar] [CrossRef]
- Chen, P.Y.; Selesnick, I.W. Group-sparse signal denoising: Non-convex regularization, convex optimization. IEEE Trans. Signal Process. 2014, 62, 3464–3478. [Google Scholar] [CrossRef] [Green Version]
- Abe, J.; Yamagishi, M.; Yamada, I. Linearly involved generalized Moreau enhanced models and their proximal splitting algorithm under overall convexity condition. Inverse Probl. 2020, 36, 035012. [Google Scholar] [CrossRef] [Green Version]
- Chen, Y.; Yamagishi, M.; Yamada, I. A Generalized Moreau Enhancement of ℓ2,1-norm and Its Application to Group Sparse Classification. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021. [Google Scholar]
- Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
- Larsson, V.; Olsson, C. Convex low rank approximation. Int. J. Comput. Vis. 2016, 120, 194–214. [Google Scholar] [CrossRef]
- Blake, A.; Zisserman, A. Visual Reconstruction; MIT Press: Cambridge, MA, USA, 1987. [Google Scholar]
- Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef] [Green Version]
- Nikolova, M.; Ng, M.K.; Tam, C.P. Fast nonconvex nonsmooth minimization methods for image restoration and reconstruction. IEEE Trans. Image Process. 2010, 19, 3073–3088. [Google Scholar] [CrossRef] [PubMed]
- Selesnick, I. Sparse regularization via convex analysis. IEEE Trans. Signal Process. 2017, 65, 4481–4494. [Google Scholar] [CrossRef]
- Yin, L.; Parekh, A.; Selesnick, I. Stable principal component pursuit via convex analysis. IEEE Trans. Signal Process. 2019, 67, 2595–2607. [Google Scholar] [CrossRef]
- Abe, J.; Yamagishi, M.; Yamada, I. Convexity-edge-preserving signal recovery with linearly involved generalized minimax concave penalty function. In Proceedings of the ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 4918–4922. [Google Scholar]
- Wright, J.; Yang, A.Y.; Ganesh, A.; Sastry, S.S.; Ma, Y. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 210–227. [Google Scholar] [CrossRef] [Green Version]
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
- Xu, Y.; Sun, Y.; Quan, Y.; Luo, Y. Structured sparse coding for classification via reweighted ℓ2,1 minimization. In CCF Chinese Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2015; pp. 189–199. [Google Scholar]
- Zheng, J.; Yang, P.; Chen, S.; Shen, G.; Wang, W. Iterative re-constrained group sparse face recognition with adaptive weights learning. IEEE Trans. Image Process. 2017, 26, 2408–2423. [Google Scholar] [CrossRef]
- Zhang, C.; Li, H.; Chen, C.; Qian, Y.; Zhou, X. Enhanced group sparse regularized nonconvex regression for face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020. [Google Scholar] [CrossRef]
- Bauschke, H.H.; Combettes, P.L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd ed.; Springer International Publishing: New York, NY, USA, 2017. [Google Scholar]
- Zhao, L.; Hu, Q.; Wang, W. Heterogeneous feature selection with multi-modal deep neural networks and sparse group lasso. IEEE Trans. Multimed. 2015, 17, 1936–1948. [Google Scholar] [CrossRef] [Green Version]
- Qin, Z.; Scheinberg, K.; Goldfarb, D. Efficient block-coordinate descent algorithms for the group lasso. Math. Program. Comput. 2013, 5, 143–169. [Google Scholar] [CrossRef]
- Hull, J.J. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 550–554. [Google Scholar] [CrossRef]
- Samaria, F.S.; Harter, A.C. Parameterisation of a stochastic model for human face identification. In Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, Sarasota, FL, USA, 5–7 December 1994; pp. 138–142. [Google Scholar]
- Cai, D.; He, X.; Hu, Y.; Han, J.; Huang, T. Learning a spatially smooth subspace for face recognition. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–7. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).