1. Introduction
The Lasso model performs automatic feature selection by shrinking some coefficients to exactly zero. This mechanism prunes irrelevant features while retaining statistically significant predictors, thereby constructing a streamlined model with enhanced interpretability [1]. Consequently, the Lasso model has emerged as one of the most widely adopted tools for linear regression and finds applications across diverse domains, including statistics and machine learning [2,3,4,5,6,7] and image and signal processing [8,9,10,11].
Specifically, consider $n$ random observations and $p$ fixed covariates. The high-dimensional linear model is defined as
$$y = X\beta^* + \varepsilon, \qquad (1)$$
where $y \in \mathbb{R}^n$ is the vector of observations, $\beta^* \in \mathbb{R}^p$ is the unknown parameter of interest, $X \in \mathbb{R}^{n \times p}$ is a deterministic design matrix (without loss of generality, we assume the columns $X_j$ are normalized for all $1 \le j \le p$), $\varepsilon \in \mathbb{R}^n$ is the noise vector with independently and identically distributed (i.i.d.) Gaussian entries with variance $\sigma^2$, i.e., $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$, where $I_n$ denotes the identity matrix in $\mathbb{R}^{n \times n}$. In general, we assume $n \ll p$, and our goal is to accurately estimate $\beta^*$.
The Lasso model is a type of linear regression model utilized for estimating $\beta^*$ and can be formulated as the following optimization problem [12]:
$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1, \qquad (2)$$
where $\frac{1}{2n}\|y - X\beta\|_2^2$ is the mean squared error (MSE) loss, the $\ell_1$ norm $\|\beta\|_1$ is the regularizer, and $\lambda > 0$ is a predefined tuning parameter that controls the level of regularization. The prediction performance of the Lasso model refers to how well it predicts outcomes or responses based on input features and depends on its ability to strike a balance between regularization and feature selection; it therefore plays a crucial role in understanding the model's effectiveness [12,13,14].
The $\ell_1$ norm in Equation (2) is the most popular regularization because of its outstanding ability to induce sparsity among convex regularization methods. However, it has been observed that this approach often underestimates the high-amplitude components of $\beta^*$ [15]. In contrast, non-convex regularizations in the Lasso model have also made significant progress [6,10,16,17]. For instance, the smoothly clipped absolute deviation (SCAD) penalty [18] and the minimax concave penalty (MCP) [15], as well as the $\ell_p$ ($0 < p < 1$) penalty and other non-convex regularization terms, can more accurately estimate high-amplitude components. However, due to their non-convex nature, the objective function is prone to getting stuck in local optima, which poses additional challenges to the solution process.
To combine the advantages of non-convex regularization with those of convex optimization, Selesnick et al. introduced the CNC strategy, which constructs a non-convex regularizer by subtracting a smoothed version from its convex sparse counterpart [19,20,21,22]. Under specific conditions, the resulting regularization ensures global convexity of the objective function. Owing to this global convexity, CNC sparse regularization can effectively avoid local optima and overcome the biased estimation associated with non-convex sparse regularization. As a result, it has gained widespread usage in image processing and machine learning applications [23,24,25,26,27,28]. However, to the best of the authors' knowledge, most of the research on CNC sparse regularization focuses on algorithm design and applications, while theoretical analysis is lacking. This motivates us to analyze the prediction performance of the Lasso model with CNC sparse regularization, thereby substantiating that CNC sparse regularization outperforms $\ell_1$ regularization.
In this paper, we consider the following Lasso model with CNC sparse regularization:
$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \psi_B(\beta), \qquad (3)$$
where the non-convex regularization term $\psi_B(\beta)$ is parameterized by a matrix $B$, and the global convexity of the objective function in (3) can be guaranteed by adjusting $B$.
Through rigorous theoretical analysis and comprehensive experimental evaluations, we demonstrate that the utilization of non-convex regularization significantly enhances the prediction performance of the Lasso model, enabling accurate estimation of unknown variables of interest. Our contributions can be summarized as follows
Theoretically, for the Lasso model with a specific CNC sparse regularization, we establish the conditions necessary to ensure global convexity of the objective function. Subsequently, by leveraging an oracle inequality, we derive an improved upper bound on prediction performance compared to that of the Lasso model with $\ell_1$ regularization.
Algorithmically, we derive an ADMM algorithm that ensures convergence to a critical point for the proposed Lasso model with CNC sparse regularization.
Empirically, we demonstrate that the proposed Lasso model with generalized minimax concave (GMC) regularization outperforms $\ell_1$ regularization in both synthetic data and MRI reconstruction experiments, owing to its utilization of non-convex regularization.
The subsequent sections of this paper are arranged as follows. Section 2 presents preliminaries and related work concerning CNC sparse regularization, prediction performance results, and convex optimization. In Section 3, we give a theoretical analysis of the prediction performance of the Lasso model with CNC sparse regularization and further propose an effective ADMM algorithm to solve it. In Section 4, we verify the superiority of the proposed model on synthetic and real datasets. Finally, the conclusions are summarized in Section 5.
Notation
Throughout this paper, the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms of $\beta \in \mathbb{R}^p$ are defined as $\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|$, $\|\beta\|_2 = (\sum_{j=1}^{p}\beta_j^2)^{1/2}$, and $\|\beta\|_\infty = \max_j |\beta_j|$; meanwhile, $\|\beta\|_0$ represents the number of non-zero elements in $\beta$, and $\mathrm{supp}(\beta)$ is the support of $\beta$. For any given set $T$, we use $T^c$ and $|T|$ to respectively represent its complement and cardinality. For a matrix $A$, we denote its maximum eigenvalue and minimum eigenvalue as $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$, respectively. The transpose and the pseudo-inverse of a matrix $X$ are denoted $X^{\top}$ and $X^{+}$, respectively. For any given subset $T$ of $\{1, \ldots, p\}$ and a matrix $X \in \mathbb{R}^{n \times p}$, we denote by $X_T$ the matrix obtained by excluding from $X$ all columns belonging to the complement of $T$. For the design matrix $X$, we denote the orthogonal projection onto the column space of $X_T$ by $\Pi_T$. In addition, for the sake of convenience, we denote the prediction loss of two vectors $\beta$ and $\beta'$ by $\frac{1}{n}\|X(\beta - \beta')\|_2^2$.
3. Lasso with CNC Sparse Regularization
In this section, we analyze the Lasso model with CNC sparse regularization both theoretically and algorithmically. In particular, we consider the following Lasso model with the non-separable non-convex GMC regularization:
$$\min_{\beta \in \mathbb{R}^p} \; \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \psi_B(\beta), \qquad (14)$$
where the GMC regularization $\psi_B(\beta)$ is defined as in (7), and the matrix parameter $B$ can influence the non-convexity of $\psi_B$.
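For concreteness, the following minimal NumPy sketch evaluates a GMC-type penalty numerically, assuming the standard definition $\psi_B(\beta) = \|\beta\|_1 - \min_v\{\|v\|_1 + \tfrac{1}{2}\|B(\beta - v)\|_2^2\}$; the exact form and scaling of (7) may differ, and the inner convex minimization is solved here by plain ISTA.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator: the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def gmc_penalty(beta, B, n_iter=200):
    """Evaluate psi_B(beta) = ||beta||_1 - min_v { ||v||_1 + 0.5*||B(beta - v)||_2^2 },
    assuming the standard GMC form; the inner convex problem is solved by ISTA."""
    BtB = B.T @ B
    L = np.linalg.eigvalsh(BtB).max() + 1e-12   # Lipschitz constant of the smooth part
    v = np.zeros_like(beta)
    for _ in range(n_iter):
        grad = -BtB @ (beta - v)                # gradient of 0.5*||B(beta - v)||^2 w.r.t. v
        v = soft_threshold(v - grad / L, 1.0 / L)
    smoothed_part = np.abs(v).sum() + 0.5 * np.linalg.norm(B @ (beta - v)) ** 2
    return np.abs(beta).sum() - smoothed_part

# toy usage
rng = np.random.default_rng(0)
beta = rng.normal(size=10)
B = 0.5 * rng.normal(size=(10, 10))
print(gmc_penalty(beta, B))
```

Note that when $B = 0$ the inner minimizer is $v = 0$ and the sketch returns $\|\beta\|_1$, consistent with the penalty reducing to the $\ell_1$ norm.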
3.1. Convex Condition
In this subsection, we explore the adjustment of GMC regularization to preserve the overall convexity of the Lasso model (
14).
Theorem 2. Let $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, and $\lambda > 0$. Define the objective function of (14) as
$$F(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \psi_B(\beta);$$
then $F$ is convex if
$$\frac{1}{n}X^{\top}X - \lambda B^{\top}B \succeq 0, \qquad (15)$$
and $F$ is strictly convex if the inequality in (15) holds strictly.
By constructing the matrix parameter $B$, we can ensure that the convexity conditions mentioned above are maintained. Similar to [23], when $X$ and $\lambda$ are specified, a straightforward method to determine the parameter $B$ is
$$B = \sqrt{\frac{\gamma}{\lambda n}}\, X, \qquad 0 \le \gamma \le 1. \qquad (16)$$
Then, $\lambda B^{\top}B = \frac{\gamma}{n} X^{\top}X$, which satisfies Theorem 2 when $0 \le \gamma \le 1$. The parameter $\gamma$ controls the non-convexity of $\psi_B$. When $\gamma = 0$, $B = 0$ and the penalty reduces to the $\ell_1$ norm. When $\gamma = 1$, (15) is satisfied with equality, resulting in a 'maximally' non-convex penalty.
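The convexity condition of Theorem 2 can be checked numerically. The sketch below builds $B$ with a (16)-style rule $B = \sqrt{\gamma/(\lambda n)}\,X$ (the exact scaling is an assumption) and verifies positive semi-definiteness of $\frac{1}{n}X^{\top}X - \lambda B^{\top}B$ for several values of $\gamma$.

```python
import numpy as np

def build_B(X, lam, gamma):
    """Candidate matrix parameter B = sqrt(gamma / (lam * n)) * X (a (16)-style choice;
    the exact scaling in the paper may differ).  gamma in [0, 1] tunes non-convexity."""
    n = X.shape[0]
    return np.sqrt(gamma / (lam * n)) * X

def objective_is_convex(X, B, lam, tol=1e-10):
    """Check the Theorem-2-style condition (1/n) X^T X - lam * B^T B >= 0 (PSD)."""
    n = X.shape[0]
    M = X.T @ X / n - lam * (B.T @ B)
    return np.linalg.eigvalsh(M).min() >= -tol

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 20))
lam = 0.1
for gamma in (0.0, 0.5, 0.8, 1.0, 1.5):   # the condition should fail beyond gamma = 1
    B = build_B(X, lam, gamma)
    print(gamma, objective_is_convex(X, B, lam))
```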
According to Theorem 2, we can deduce the following corollary, which is useful in our proof of algorithm convergence.
Corollary 1. For all $\rho \ge \lambda\,\lambda_{\max}(B^{\top}B)$, the function $\lambda \psi_B(\beta) + \frac{\rho}{2}\|\beta - z\|_2^2$ is convex for any fixed $z$, where $\lambda_{\max}(B^{\top}B)$ is the largest eigenvalue of $B^{\top}B$.
By replacing $\frac{1}{n}X^{\top}X$ with $\rho I$ in (A14), we can draw Corollary 1 from the proof of Theorem 2.
3.2. Prediction Performance
In this subsection, we analyze the prediction performance of the Lasso model with GMC regularization (14) and use an oracle inequality to obtain an upper bound on the prediction error that improves upon the corresponding bound for the Lasso model with $\ell_1$ regularization.
We consider a general Lasso model with GMC regularization (14); that is, the GMC regularization term is non-convex and does not necessarily satisfy the convexity condition given by Theorem 2. Due to the non-convexity, the global minimum of the proposed Lasso model may not be attainable. Hence, it becomes crucial to find a stationary point that satisfies suitable conditions. We define $\hat{\beta}$ as a stationary point of the objective function of (14) when it fulfills the first-order optimality condition.
Based on Definition 1, the compatibility factor enables us to establish a subsequent oracle inequality, which is applicable to the stationary points of the Lasso estimator with GMC regularization.
Theorem 3 (Oracle inequality of the Lasso model with GMC regularization).
Assume the compatibility factor of Definition 1 is positive. Let $\delta \in (0,1)$ be a fixed tolerance level. Let $\hat{\beta}$ be a stationary point of (14) and let $\lambda_{\min}$ denote the smallest eigenvalue of $\frac{1}{n}X^{\top}X - \lambda B^{\top}B$. Set the regularization parameter $\lambda$ on the order of $\sigma\sqrt{\log(p/\delta)/n}$; then, the estimation error satisfies the oracle bound (17) with probability at least $1 - \delta$, for any $\beta \in \mathbb{R}^p$ and any index set $T \subseteq \{1, \ldots, p\}$.
The oracle inequality holds for any $\hat{\beta}$ that meets the first-order optimality condition, regardless of whether the regularization is non-convex. This mild condition imposed on $\hat{\beta}$ represents a significant divergence from previous findings, such as Theorem 2 of [12], which applies to global minimizers but is challenging to guarantee when employing non-convex regularizations.
Theorem 3 enables the optimization of the error bound on the right-hand side of (17) by allowing the selection of suitable values for $\beta$ and $T$. For instance, choosing $\beta = \beta^*$ in (17) (hence an 'oracle'), the prediction-loss term vanishes; therefore, (17) can be rewritten as (18). Furthermore, if $T$ is set as the empty set, then the bound reduces to (19). In contrast, if $T$ is set as the support of $\beta^*$, then the bound becomes (20), which increases linearly with the sparsity level $\|\beta^*\|_0$.
We compare the prediction error bound of Theorem 3 with that of Theorem 1, which is obtained for the Lasso model with $\ell_1$ regularization. We note that the terms appearing in the bound of inequality (17) are bounded by the corresponding terms in the bound of Theorem 1. When $\beta^*$ contains larger coefficients, non-convex regularization leads to a more stringent bound, since $\psi_B(\beta^*) \le \|\beta^*\|_1$. On the other hand, the error bound in (17) includes $\lambda_{\min}$ in the denominator, which can make this part of the bound larger than the corresponding bound in Theorem 1. This gap can be mitigated by adjusting the matrix parameter $B$ in $\psi_B$, specifically by selecting a smaller value of $\gamma$ in (16). However, as $\gamma \to 0$, the non-convex GMC regularization $\psi_B$ also tends to the $\ell_1$ norm, eliminating the improvement brought by the non-convex regularization in the first term of the bound. This means that the overall error bound can be adjusted by tuning $\gamma$, thereby enabling a trade-off with the non-convexity of the regularization.
In summary, despite the non-convexity of the regularization term, we can ensure that any stationary point of the proposed Lasso model with GMC regularization possesses a strong statistical guarantee.
3.3. ADMM Algorithm
In this subsection, we consider using the ADMM algorithm to solve the Lasso model with GMC regularization (14). Boyd et al. introduced a generalized ADMM framework for addressing sparse optimization problems [42]. The basic idea is to transform an unconstrained optimization problem into a constrained one by splitting variables and then solve it iteratively. ADMM proves to be particularly efficient when the sparse regularization term possesses a closed-form proximal operator. The main challenge in solving (14) by ADMM lies in how to obtain the proximal operator of the GMC regularization.
By introducing an auxiliary variable $z$ and setting $z = \beta$, the augmented Lagrangian form of (14) can be written as (21), where $u$ represents the Lagrange multiplier and $\rho$ denotes the penalty parameter. We then update $\beta$, $z$, and $u$ in an alternating manner.
The objective function in Equation (22) is differentiable with respect to $\beta$. As a result, $\beta^{k+1}$ can be obtained by utilizing the first-order optimality condition, which yields the update (23). The $z$-update, in turn, requires solving the subproblem (24), which serves as the proximal operator of the GMC regularization $\psi_B$ but lacks a closed-form solution. Despite this limitation, the proximal gradient descent (PGD) algorithm can be employed to iteratively solve Equation (24).
Substituting the definition (7) of $\psi_B$ into Equation (24) allows us to reformulate it as (25). The update of $z$ in (25) can then be obtained through the PGD algorithm as in (26), where $\mu$ is the iteration step size. The main challenge of (26) lies in computing the gradient of the differentiable part of the objective.
According to Lemma 3 in [24], the last term of (25) is differentiable with respect to $z$. Furthermore, its gradient can be expressed as in (27). Note that (27) contains an $\ell_1$-regularization subproblem, which can be viewed as a proximal operator associated with the $\ell_1$ norm, i.e., (28), which can be easily addressed by employing the soft-thresholding function (13).
Then, by substituting (28) and (27) into (26) and rearranging, the update of $z$ can be summarized as (29). Finally, the ADMM algorithm for the Lasso model with convex non-convex sparse regularization is obtained by integrating Equations (23), (29), and (30); the resulting procedure is presented as Algorithm 1.
Algorithm 1 ADMM for solving (14)
Require: $y$, $X$, $B$, $\lambda$, $\rho$, and initial values $\beta^0$, $z^0$, $u^0$.
Ensure: $\hat{\beta}$.
while the stopping criterion is not met do
  update $\beta^{k+1}$ by (23);
  update $z^{k+1}$ by (29);
  update $u^{k+1}$ by (30).
end while
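As an illustration of Algorithm 1, the following NumPy sketch implements the ADMM iteration for the GMC-regularized Lasso under the assumptions made above (a $1/(2n)$-scaled loss, the splitting $z = \beta$, an inner PGD loop for the $z$-update with an ISTA sub-solver for the auxiliary variable $v$, and a Corollary-1-style choice of $\rho$); it is a sketch of the scheme, not the authors' reference implementation.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def inner_v(z, B, n_iter=50):
    """v(z) = argmin_v ||v||_1 + 0.5*||B(z - v)||_2^2, solved here by a few ISTA steps."""
    BtB = B.T @ B
    L = np.linalg.eigvalsh(BtB).max() + 1e-12
    v = np.zeros_like(z)
    for _ in range(n_iter):
        grad_v = -BtB @ (z - v)
        v = soft_threshold(v - grad_v / L, 1.0 / L)
    return v

def admm_gmc_lasso(X, y, lam, B, rho, n_admm=200, n_pgd=30):
    """Sketch of ADMM for min_beta (1/(2n))||y - X beta||_2^2 + lam * psi_B(beta),
    with the splitting z = beta.  The loss scaling and step sizes are assumptions."""
    n, p = X.shape
    beta, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
    A = X.T @ X / n + rho * np.eye(p)      # system matrix of the beta-update
    Xty_n = X.T @ y / n
    mu = 1.0 / (rho + lam * np.linalg.eigvalsh(B.T @ B).max() + 1e-12)  # PGD step size
    for _ in range(n_admm):
        # beta-update: first-order optimality of the smooth subproblem
        beta = np.linalg.solve(A, Xty_n + rho * z - u)
        # z-update: PGD on  lam*||z||_1 - lam*S_B(z) + (rho/2)*||z - (beta + u/rho)||_2^2
        w = beta + u / rho
        for _ in range(n_pgd):
            v = inner_v(z, B)
            grad = -lam * (B.T @ (B @ (z - v))) + rho * (z - w)
            z = soft_threshold(z - mu * grad, mu * lam)
        # dual (multiplier) update
        u = u + rho * (beta - z)
    return beta

# Toy usage with a (16)-style choice of B and a Corollary-1-style penalty parameter.
rng = np.random.default_rng(0)
n, p = 60, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:5] = 1.0
y = X @ beta_true + 0.1 * rng.normal(size=n)
lam, gamma = 0.05, 0.8
B = np.sqrt(gamma / (lam * n)) * X
rho = 1.1 * lam * np.linalg.eigvalsh(B.T @ B).max()   # keeps the z-subproblem convex
print(np.round(admm_gmc_lasso(X, y, lam, B, rho), 2))
```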
Furthermore, we provide the following convergence guarantee for Algorithm 1.
Theorem 4. Through proper selection of the penalty parameter ρ, the primal residual $r^k = \beta^k - z^k$ and the dual residual $s^k = \rho(z^{k-1} - z^k)$ generated by Algorithm 1 satisfy $r^k \to 0$ and $s^k \to 0$ as $k \to \infty$.
Theorem 4 illustrates that Algorithm 1 ultimately converges to satisfy both primal and dual feasibility conditions. Additionally, it confirms the equivalence between the augmented Lagrangian formulation (21) with fixed $z$ and $u$ values and the original Lasso problem (14). During each iteration of Algorithm 1, a stationary point $\beta^{k+1}$ is generated for the augmented Lagrangian formulation (21) when $z$ and $u$ are fixed, which indicates that the limit point of the iterates also serves as a stationary point of (14).
4. Numerical Experiment
In this section, we demonstrate the efficacy of the proposed Lasso model with GMC sparse regularization through numerical experiments conducted on synthetic and real-world data. All experiments are conducted in MATLAB R2020a on a PC equipped with a 2.5 GHz CPU and 16 GB memory.
4.1. Synthetic Data
The data in this experiment are simulated with $n$ samples and $p$ features, with a prescribed correlation between features $j$ and $j'$. The true vector $\beta^*$ consists of 800 non-zero entries, all equal to 1. The observations are obtained through the linear model $y = X\beta^* + \varepsilon$, where $\varepsilon$ is Gaussian noise whose variance is chosen to achieve a prescribed signal-to-noise ratio. The parameter $\lambda$ is expressed as a fraction of $\lambda_{\max}$, the maximum penalty for which the objective function yields the maximally sparse solution, i.e., $\hat{\beta} = 0$. The range of $\lambda$ values is then varied over a grid below $\lambda_{\max}$ in order to obtain a regularization path. For the GMC regularization, it is necessary to explicitly specify the matrix $B$; we set it by using (16) with a fixed value of $\gamma$.
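As a reference point, the following sketch generates synthetic data in the spirit of this setup; since the exact $n$, $p$, correlation level, noise variance, and $\gamma$ are not reproduced here, all numerical values below (including the number of non-zeros, reduced from 800 for brevity) are placeholders.

```python
import numpy as np

# Placeholder synthetic setup: n samples, p correlated features, k-sparse true vector.
rng = np.random.default_rng(42)
n, p, k = 200, 1000, 80                 # placeholders for the paper's (unspecified) sizes
corr = 0.5                              # placeholder correlation level
Sigma = corr ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # corr^{|i-j|}
Lchol = np.linalg.cholesky(Sigma)
X = rng.normal(size=(n, p)) @ Lchol.T   # rows ~ N(0, Sigma)
beta_true = np.zeros(p)
beta_true[rng.choice(p, size=k, replace=False)] = 1.0   # non-zero entries equal to 1
signal = X @ beta_true
sigma = np.linalg.norm(signal) / np.sqrt(n) / 10.0      # placeholder noise level
y = signal + sigma * rng.normal(size=n)
lam_max = np.linalg.norm(X.T @ y, np.inf) / n           # smallest lambda giving beta_hat = 0
lams = lam_max * np.logspace(-3, 0, 30)                 # a regularization path
```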
In this experiment, F1-score and root mean square error (RMSE) were chosen as evaluation metrics to assess the predictive performance of model (
14).
The F1 score is the harmonic mean of precision and recall and offers a more comprehensive perspective for assessing model performance than considering either metric alone. Specifically, the precision and recall are defined as
$$\mathrm{Precision} = \frac{|\mathrm{supp}(\hat{\beta}) \cap \mathrm{supp}(\beta^*)|}{|\mathrm{supp}(\hat{\beta})|}, \qquad \mathrm{Recall} = \frac{|\mathrm{supp}(\hat{\beta}) \cap \mathrm{supp}(\beta^*)|}{|\mathrm{supp}(\beta^*)|},$$
where $\hat{\beta}$ and $\beta^*$ respectively represent the estimated value and the true value. Then, the F1 score is defined as
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
The range of the F1 score is between 0 and 1: an F1 score of 1 indicates complete support recovery, meaning the model accurately estimates the support of the sparse vector, while an F1 score of 0 indicates a complete failure of support recovery. Thus, a higher F1 score signifies stronger predictive performance.
The RMSE is an important indicator in regression model evaluation and measures the magnitude of prediction errors. It is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2},$$
where $\hat{y}_i$ is the $i$-th predicted value, $y_i$ is the $i$-th actual observed value, and $n$ is the number of observations. A smaller RMSE value indicates a higher predictive capability and a better model fit.
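The two metrics can be computed as in the sketch below, assuming (as the text on support recovery suggests) that precision and recall are evaluated on the recovered support of $\hat{\beta}$.

```python
import numpy as np

def f1_support(beta_hat, beta_true, tol=1e-8):
    """F1 score of support recovery: harmonic mean of precision and recall,
    where 'positive' means a coefficient estimated/true as non-zero."""
    est = np.abs(beta_hat) > tol
    true = np.abs(beta_true) > tol
    tp = np.sum(est & true)
    precision = tp / max(est.sum(), 1)
    recall = tp / max(true.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def rmse(y_pred, y_obs):
    """Root mean square error between predicted and observed values."""
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_obs)) ** 2))
```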
We compared the GMC sparse regularization (14) with the traditional $\ell_1$ regularization and a representative non-convex regularization. To obtain typical conclusions, we conducted 100 Monte Carlo experiments using independent random noise realizations and calculated the average F1 and RMSE values. As shown in Figure 1a, when an appropriate $\lambda$ value is selected, the F1 score of the GMC sparse regularization reaches the maximum value of 1. In contrast, the F1 scores of the two competing regularizations are approximately 0.7 and 0.95, respectively, both lower than the performance of the GMC regularization. Additionally, it can be clearly observed from Figure 1b that, in the sparse vector prediction task, the minimum RMSE of the GMC sparse regularization is significantly lower than that of the competing regularizations. This result further validates the superior accuracy of the GMC sparse regularization in sparse vector prediction. On the other hand, Figure 1 shows that, when estimating sparse vectors, the difference between the $\lambda$ values corresponding to the maximum F1 score and the minimum RMSE is relatively small for the Lasso model based on GMC sparse regularization, whereas this difference is larger for the competing regularizations. This indicates that GMC sparse regularization (14) outperforms the $\ell_1$ and non-convex alternatives in terms of prediction performance when an appropriate $\lambda$ value is chosen.
Our findings highlight the exceptional predictive performance of the proposed model in estimating sparse vectors, surpassing the $\ell_1$ regularization approach employed by the standard Lasso model. Additionally, although the Lasso model with a non-convex regularization also performs well in prediction tasks, its non-convexity poses challenges for the algorithms used to minimize the objective function. In contrast, our model not only has excellent predictive ability but also ensures the overall convexity of the objective function, thereby providing convenient conditions for the design and implementation of optimization algorithms. In summary, the model proposed in this study has significant advantages in both predictive performance and optimization characteristics, effectively avoiding the computational difficulties brought by non-convex regularization methods.
4.2. MRI Reconstruction
The utilization of high-dimensional sparse linear regression serves as the foundation for numerous signal and image processing techniques, including MRI reconstruction [43,44,45]. MRI is a powerful medical imaging technique featuring high soft-tissue contrast and the advantage of being radiation-free, and it is widely used in clinical diagnosis and scientific research. In this experiment, we employ GMC sparse regularization (15) for MRI reconstruction and compare its performance against the same $\ell_1$ and non-convex regularizations as before.
To make the statement clear, we define the design matrix as the composition of a sampling template $P$ and the Fourier operator $F$, i.e., $X = PF$. We tested the different reconstruction models on three MRI images with variable-density and Cartesian sampling templates, respectively. For comparison, all images are set to a common resolution, with grayscale values ranging from 0 to 255. The regularization parameter $\lambda$, the penalty parameter $\rho$, and the iteration step size $\mu$ were set to fixed values. For the GMC regularization, it is necessary to explicitly specify the matrix $B$; we set it using (16) with a fixed value of $\gamma$.
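A minimal sketch of such a sampled-Fourier forward operator and its adjoint is given below; the binary mask used here is a placeholder rather than the variable-density or Cartesian templates of the experiments, and the operator is written in normalized (orthonormal FFT) form.

```python
import numpy as np

def make_partial_fourier(mask):
    """Forward/adjoint operators for X = P F: a 2-D orthonormal FFT followed by
    sampling with the binary template `mask` (a placeholder template)."""
    def forward(img):
        return mask * np.fft.fft2(img, norm="ortho")
    def adjoint(ksp):
        # real part taken because the images of interest are real-valued
        return np.real(np.fft.ifft2(mask * ksp, norm="ortho"))
    return forward, adjoint

# toy usage: random row (Cartesian-like) sampling of about 30% on a 64x64 image
rng = np.random.default_rng(0)
mask = np.zeros((64, 64))
mask[rng.choice(64, size=20, replace=False), :] = 1.0
A, At = make_partial_fourier(mask)
img = rng.normal(size=(64, 64))
print(At(A(img)).shape)
```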
In this experiment, we selected relative error (RE) and peak signal-to-noise ratio (PSNR) as evaluation metrics to quantitatively assess the quality and accuracy of reconstructed images.
The RE is defined as
$$\mathrm{RE} = \frac{\|\hat{x} - x\|_2}{\|x\|_2},$$
where $x$ and $\hat{x}$ are the original and reconstructed images, respectively.
The PSNR is defined as
$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{255^2}{\mathrm{MSE}}\right), \qquad \mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{x}_i - x_i)^2,$$
where $N$ is the number of pixels, and $\hat{x}$ and $x$ are the reconstructed and the original image, respectively.
For these two quantitative evaluation criteria, a lower RE value indicates a better reconstruction, while a higher PSNR value indicates a stronger reconstruction ability.
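For completeness, the two criteria can be computed as follows, assuming the common definitions $\mathrm{RE} = \|\hat{x} - x\|_2 / \|x\|_2$ and $\mathrm{PSNR} = 10\log_{10}(255^2/\mathrm{MSE})$ used above.

```python
import numpy as np

def relative_error(x_rec, x_orig):
    """RE = ||x_rec - x_orig||_2 / ||x_orig||_2 (Frobenius norm for 2-D images)."""
    return np.linalg.norm(x_rec - x_orig) / np.linalg.norm(x_orig)

def psnr(x_rec, x_orig, peak=255.0):
    """PSNR = 10 log10(peak^2 / MSE), with peak = 255 for 8-bit grayscale images."""
    mse = np.mean((np.asarray(x_rec, float) - np.asarray(x_orig, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```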
To further enhance the visual contrast, we calculated the difference between each reconstructed image and the original image. Furthermore, we magnified a small portion of each image to show more details. The reconstruction results of the three regularizations are shown in Figure 2, Figure 3 and Figure 4. It can be observed that the images reconstructed with $\ell_1$ regularization suffer from blurred edges and residual shadows, and the non-convex regularization model shows similar phenomena. In contrast, the images reconstructed with GMC sparse regularization are closer to the original images and exhibit higher reconstruction quality.
Additionally, we conducted a quantitative evaluation of the reconstruction results in Table 1. It is evident that, compared with the $\ell_1$ and non-convex regularizations, GMC sparse regularization achieves the best performance in MRI reconstruction, obtaining the lowest RE and the highest PSNR.
5. Conclusions
In this paper, we propose CNC sparse regularization as a valuable alternative to the $\ell_1$ regularization used in Lasso regression. This approach effectively addresses the issue of underestimation of high-amplitude components while ensuring the global convexity of the objective function. Our theoretical analysis demonstrates that the prediction error bound associated with CNC sparse regularization is smaller than that of $\ell_1$ regularization, which provides theoretical support for the practical application of CNC regularization. Additionally, we demonstrate that the Lasso model with CNC regularization exhibits superior performance on both synthetic and real-world datasets. These findings suggest its potential significance in future applications such as image denoising, image reconstruction, seismic reflection analysis, etc.
Additionally, given that the oracle inequality of the Lasso model relies on the restricted eigenvalue condition, future research directions include exploring theoretical guarantees under more relaxed assumptions, such as oracle inequalities based on weak correlation or unconstrained design matrices, experimentally verifying the practical effectiveness of the restricted eigenvalue condition (for example, by calculating the specific value of the restricted eigenvalue constant), and further analyzing its impact on model performance.