Generalized Penalized Constrained Regression: Sharp Guarantees in High Dimensions with Noisy Features

Alrashdi, Ayed M.; Alazmi, Meshari; Alrasheedi, Masad A.

doi:10.3390/math11173706

Open AccessArticle

Generalized Penalized Constrained Regression: Sharp Guarantees in High Dimensions with Noisy Features

by

Ayed M. Alrashdi

^1,*

,

Meshari Alazmi

² and

Masad A. Alrasheedi

³

¹

Department of Electrical Engineering, College of Engineering, University of Ha’il, Ha’il 81441, Saudi Arabia

²

Department of Information and Computer Science, College of Computer Science and Engineering, University of Ha’il, Ha’il 81411, Saudi Arabia

³

Department of Management Information Systems, College of Business Administration, Taibah University, Madinah 42353, Saudi Arabia

^*

Author to whom correspondence should be addressed.

Mathematics 2023, 11(17), 3706; https://doi.org/10.3390/math11173706

Submission received: 25 June 2023 / Revised: 3 August 2023 / Accepted: 24 August 2023 / Published: 28 August 2023

(This article belongs to the Section E2: Control Theory and Mechanics)

Download

Browse Figures

Versions Notes

Abstract

:

The generalized penalized constrained regression (G-PCR) is a penalized model for high-dimensional linear inverse problems with structured features. This paper presents a sharp error performance analysis of the G-PCR in the over-parameterized high-dimensional setting. The analysis is carried out under the assumption of a noisy or erroneous Gaussian features matrix. To assess the performance of the G-PCR problem, the study employs multiple metrics such as prediction risk, cosine similarity, and the probabilities of misdetection and false alarm. These metrics offer valuable insights into the accuracy and reliability of the G-PCR model under different circumstances. Furthermore, the derived results are specialized and applied to well-known instances of G-PCR, including

l_{1}

-norm penalized regression for sparse signal recovery and

l_{2}

-norm (ridge) penalization. These specific instances are widely utilized in regression analysis for purposes such as feature selection and model regularization. To validate the obtained results, the paper provides numerical simulations conducted on both real-world and synthetic datasets. Using extensive simulations, we show the universality and robustness of the results of this work to the assumed Gaussian distribution of the features matrix. We empirically investigate the so-called double descent phenomenon and show how optimal selection of the hyper-parameters of the G-PCR can help mitigate this phenomenon. The derived expressions and insights from this study can be utilized to optimally select the hyper-parameters of the G-PCR. By leveraging these findings, one can make well-informed decisions regarding the configuration and fine-tuning of the G-PCR model, taking into consideration the specific problem at hand as well as the presence of noisy features in the high-dimensional setting.

Keywords:

penalized regression; prediction risk; cosine similarity; probability of false alarm; double descent; over-parameterization; constrained ridge regression

MSC:

62J05; 62J07; 60G35; 62E20

1. Introduction

1.1. Notations and Definitions

To avoid confusion, we start by introducing the notations and definitions used throughout this paper. For any positive integer p, let

[p]

denote the set

{1, 2, \dots, p}

. Bold face lower case letters (e.g., θ) represent a column vector, and

θ_{i}

is its ith entry, while

{∥ θ ∥}_{q} = {(\sum_{i = 1}^{p} {| θ_{i} |}^{q})}^{\frac{1}{q}}

is its

l_{q}

-norm. The

l_{\infty}

-norm of a vector is defined as:

{∥ θ ∥}_{\infty}

= {max}_{i} | θ_{i} |

. Upper case bold letters such as X are used to indicate matrices, with

I_{p}

representing the

p \times p

identity matrix. The symbols

{(\cdot)}^{- 1}

and

{(\cdot)}^{⊤}

are the inversion and transpose operations, respectively. We use

P (\cdot)

and

E [\cdot]

to indicate the probability of an event and the expected value of a random variable, respectively. The notation

“ \overset{P}{⟶} ”

is used to represent convergence in probability. We write

X \sim p_{X}

to indicate that a random variable X is randomly distributed according to a probability mass (or density) function

p_{X}

. Particularly,

v \sim N (0, C_{v})

means that the random vector v has a normal distribution with

0

mean vector and covariance matrix

C_{v} = E [v v^{⊤}]

, where

0

is the zero vector. For

m \in N

, a function

ψ : R^{m} \to R

is said to be pseudo-Lipschitz of order

k \geq 1

if there exists a constant

L > 0

such that, for all,

x, y \in R^{m}

:

|ψ (x) - ψ (y)| \leq {L (1 + ∥ x ∥}_{2}^{k - 1} + {∥ y ∥}_{2}^{k - 1} {) ∥ x - y ∥}_{2}

.

A function

P : R^{p} \to R

is called separable if

P (x) = \sum_{j = 1}^{p} \tilde{P} (x_{j}) \forall x \in R^{p}

, where

\tilde{P} : R \to R

is a real-valued function. The notation

1_{{A}}

is the indicator function, which is defined as

1_{{A}} (x) = \{\begin{matrix} 1, & if x \in A, \\ 0, & otherwise . \end{matrix}

Finally, we need the following definitions:

The generalized Moreau envelope function of a proper convex function $h : R \to R$ is defined as

$M_{h} (a; b, c, d) = min_{c \leq x \leq d} \frac{1}{2} {(x - a)}^{2} + b h (x)$

(1)

for $a, b, c, d \in R$ , with $b \geq 0, c \leq 0$ and $d \geq 0$ . The generalized Moreau envelope given above is an extended version of the well-known Moreau–Yosida envelope function [1].
The minimizer of the above function is called the generalized proximal operator, which is given as

${prox}_{h} (a; b, c, d) = \arg min_{c \leq x \leq d} \frac{1}{2} {(x - a)}^{2} + b h (x) .$

1.2. Motivation

Suppose we observe a response vector y

\in R^{n}

and a data matrix X

\in R^{n \times p}

according to the linear model

y = X θ_{0} + ϵ,

(2)

where

θ_{0} \in R^{p}

is a vector of coefficients or parameters, or an unknown signal vector, and

ϵ \in R^{n}

is an error vector. This is also known as a linear inverse problem model [2]. The linear model in (2) appears in many practical problems in engineering and science [3,4]. For example, in statistics and machine learning [5,6,7], is the response vector (or the output data); X is often called the predictor matrix, features matrix, or design matrix, which collects the input data (or features);

θ_{0}

is the so called target vector, which is a vector of some weighting parameters or regression coefficients; and ϵ is a random noise term. In the context of compressed sensing [8,9], y represents the measured data, X is a sensing or measurement matrix,

θ_{0}

denotes a signal of interest (to be recovered), and ϵ is a random noise vector. In addition, in signal representation [10,11], y is a signal of interest, the matrix X denotes an over-complete dictionary of elementary atoms, the vector

θ_{0}

contains the representation coefficients of the signal y, and ϵ represents some approximation error. Moreover, in the field of wireless communications [12,13], y represents the received signal, X is the channel matrix between the transmitter and the receiver,

θ_{0}

is the transmitted signal vector, and ϵ is the additive thermal noise.

In the past, different computational algorithms have been proposed for recovering (estimating) the unknown vector

θ_{0}

. The simplest and most conspicuous approach is the ordinary least squares (OLS) estimator, which finds an estimate

\hat{θ}

of

θ_{0}

by minimizing the residual sum of squares (RSS), i.e.,

{\hat{θ}}_{OLS} = arg min_{θ \in R^{p}} {∥ y - X θ ∥}_{2}^{2} .

(3)

For the OLS estimator, it is required that

n \geq p

, i.e., X is a full column rank matrix. In this case, (3) has the following closed-form solution:

{\hat{θ}}_{OLS} = {(X^{⊤} X)}^{- 1} X^{⊤} y .

(4)

In many applications, most of the time, the number of parameters to be recovered p is greater than the number of available samples n, i.e.,

p > n

. This scenario is known in the literature as the over-parameterized regime [14] (This case,

n < p

, is also called the “compressed measurement” scenario in the compressive sensing context). Such inverse problems are known to be ill-posed unless the unknown vector

θ_{0}

is located in a manifold with a considerably lower dimension than the initial ambient dimension p. These vectors are called structured vectors [15]. Examples of structured vectors are vectors with a finite-alphabet structure, sparse and block-sparse structures, low-rankness, etc. [9].

Despite being a popular approach, the OLS estimator performs very poorly when applied to ill-posed or under-determined problems [16]. Thus, to solve ill-posed problems, penalization methods are often used. Examples of these methods include penalized least squares (PLS) [17], least absolute shrinkage and selection operator (LASSO) [18], truncated singular value decomposition (SVD) [19], etc.

For structured vectors, the most widely used approach is the penalized M-estimator [20], which finds an estimate

\hat{θ}

of the unknown vector

θ_{0}

by solving the convex optimization problem

\hat{θ} = arg min_{θ \in R^{p}} L (X θ - y) + α P (θ),

(5)

where

L : R^{n} \to R

is a convex loss function that determines how close the estimate

X \hat{θ}

is to the linear model

y = X θ_{0} + ϵ

. Furthermore,

P : R^{p} \to R

is a convex penalization function that enforces the specific structure (the a priori information) of the unknown vector

θ_{0}

, and

α > 0

is a penalization factor that is used to balance the two functions. In addition, we assume that

P

is separable, i.e.,

P (θ) = \sum_{j = 1}^{p} \tilde{P} (θ_{j})

. Examples of the most popular structure-inducing functions are:

$P (\cdot) = {∥ \cdot ∥}_{1}$ induces sparsity structure.
$P (\cdot) = {∥ \cdot ∥}_{★}$ encourages low-rankness structure, where ${∥ \cdot ∥}_{★}$ is the nuclear norm of a matrix, which is defined as the sum of its singular values.
$P (\cdot) = {∥ \cdot ∥}_{1, 2}$ induces block-sparsity structures, where ${∥ \cdot ∥}_{1, 2}$ is the mixed $l_{1, 2}$ -norm.
$P (\cdot) = {∥ \cdot ∥}_{\infty}$ promotes finite-alphabet (i.e., constant-amplitude) signals.

The choice of the loss function

L (\cdot)

depends on the noise distribution [3] as follows:

If the noise is Gaussian-distributed, then we choose $L (\cdot) = {(1 / 2) ∥ \cdot ∥}_{2}^{2}$ or $L (\cdot) = {∥ \cdot ∥}_{2}$ , which is related to the maximum likelihood estimation [10].
If the noise is sparse (e.g., Laplacian distributed), then one can select $L (\cdot) = {∥ \cdot ∥}_{1}$ .
If the noise is bounded, then a proper choice is $L (\cdot) = {∥ \cdot ∥}_{\infty}$ , and so on.

Different popular algorithms that correspond to different choices of

L (\cdot)

and

P (\cdot)

include:

OLS: $P (\cdot) = 0$ , as in (3);
$l_{2}$ -penalized LS or ridge regression: $min_{θ} {∥ X θ - y ∥}_{2}^{2} + α {∥ θ ∥}_{2}^{2}$ ;
$l_{1}$ -penalized LS or LASSO: $min_{θ} {∥ X θ - y ∥}_{2}^{2} + α {∥ θ ∥}_{1}$ ;
group LASSO: $min_{θ} {∥ X θ - y ∥}_{2}^{2} + α {∥ θ ∥}_{1, 2}$ ;
generalized least absolute deviation (LAD): $min_{θ} {∥ X θ - y ∥}_{1} + α P (θ)$ .

The above list is not exhaustive, and many regression, classification, and other statistical learning algorithms can be written in the form of (5).

1.3. Summary of Contributions and Related Work

Since the Gaussian noise is the most widely encountered noise in practice, we focus in this work on optimization problems involving an

ℓ_{2}

-norm squared loss and a general penalization function. We call these problems the Generalized Penalized Regression (G-PR), which solves the following convex optimization problem:

\hat{θ} = arg min_{θ \in R^{p}} {∥ X θ - y ∥}_{2}^{2} + α P (θ) .

(6)

In this paper, we provide high-dimensional analysis of a constrained version of the G-PR called the G-PCR (as in Equation (9)) based on the convex Gaussian min–max theorem (CGMT) [20]. This analysis includes studying its general error performance and specializing it to particular cases such as sparse and ridge linear regressions. The derived performance measures, such as the prediction risk and similarity and probability of misdetection, are then used to tune the involved hyper-parameters of the algorithm. Numerical simulations of both synthetic and real data are presented to support the theoretical analysis presented in this work.

Previous works on the high-dimensional performance characterization of convex optimization problems have a very rich history. There are early results that provided order-wise “loose bounds” of the error performance of several penalized regression problems, such as in [14,21,22,23,24,25]. However, the first results that provided a high-dimensional error analysis were derived using the approximate message passing (AMP) algorithm by Bayati et al. [26,27] for the unconstrained (standard) LASSO. Later, Ref. [28] extended the AMP framework to analyze the performance of more general loss functions.

A different approach that is based on the replica method was considered in [29,30] to analyze various problems in the compressed sensing setting.

In addition, another powerful high-dimensional tool called the random matrix theory (RMT) [31] was used in [32,33,34] to derive asymptotic error analysis of some optimization problems that possess closed-form solutions.

Recently, Thrampoulidis et al. developed a new high-dimensional analysis framework that is based on the convex Gaussian min–max theorem (CGMT). First, this framework was used in [35,36,37,38] to provide precise error analysis of the LASSO and square-root LASSO. Then, in [20], it was extended to obtain asymptotic error performance analysis of unconstrained penalized M-estimator regression problems. The first CGMT-based results on constrained regression models were derived in [39,40,41] for the box relaxation optimization (BRO) and its regularized variant. This BRO method is used to promote constant-amplitude structures. The authors in [42,43,44] extended the previous CGMT results to obtain sharp error performance characterization of constrained versions of the popular LASSO and Elastic-Net (EN) problems. These extended versions are called the Box-LASSO and Box-EN, respectively. Furthermore, the authors in [45,46] extended the above results to derive symbol error rate performance of a more general method called the sum of absolute values (SOAV) optimization and its constrained pair (Box-SOAV) for discrete-valued binary and sparse signal recovery.

Even though the focus of this paper is on regression problems, we should highlight that the CGMT framework was also applied to characterize the high-dimensional error performance of classification problems as in [47,48,49], phase retrieval problems [50,51], and various statistical learning problems [52,53,54].

In most of these works, the features matrix is considered to be fully known, but in practice, data are always noisy and contain different types of errors. This motivates the analysis considered in this paper to be performed under uncertainties in the design matrix (see Section 2.2). As compared to related work, such as [41,43,44] which considered the imperfect design matrix assumption, this work differs in multiple ways.

The proposed constrained G-PCR problem in (9) considers a general penalization function $P (\cdot)$ instead of the specific penalties used in previous works.
This work derives a general performance measure (Theorem 1) that is more broad and useful than the particular metrics previously taken into consideration, such as the mean square error (MSE), symbol error probability, etc.
This work generalizes these previous results, as they can be obtained as special cases of the results of this paper.
In Appendix B, we highlight the use of the same machinery developed in this work to analyze a closely related class of problems known as Square-Root Generalized Penalized Constrained Regression.

2. Problem Setup

2.1. Dataset Model

Consider the problem of estimating a scalar response

y_{i}

from a set of n independent training data samples

{(x_{i}, y_{i})}_{i = 1}^{n}

, where

x_{i} \in R^{p}

is the feature vector, following the linear model

\begin{matrix} y_{i} = θ_{0}^{⊤} x_{i} + ϵ_{i}, i \in [n], \end{matrix}

(7)

where

θ_{0} \in R^{p}

is an unknown structured target vector, and

{ϵ_{i}}_{i = 1}^{n}

denotes the noise samples with zero mean and variance

σ_{ϵ}^{2}

. Furthermore, the feature vectors

x_{i}

are assumed to independent and identically distributed (i.i.d.) random normal vectors with zero mean and covariance matrix

\frac{1}{p} I_{p}

.

The model in (7) can be compactly written as

y = X θ_{0} + ϵ,

where

y = {[y_{1}, y_{2}, \dots, y_{n}]}^{⊤}, X = {[x_{1}, x_{2}, \dots, x_{n}]}^{⊤},

and

ϵ = {[ϵ_{1}, ϵ_{2}, \dots, ϵ_{n}]}^{⊤}

.

2.2. Main Assumptions

Our study is based on the following set of assumptions:

The unknown target vector $θ_{0}$ is assumed to be a structured vector, with entries $Θ_{0}$ that are sampled i.i.d. from a probability distribution function $p_{Θ}$ , which has zero mean, and variance $E [Θ_{0}^{2}] = σ_{θ}^{2}$ , where $0 < σ_{θ}^{2} < \infty$ .
The noise variance $σ_{ϵ}^{2} < \infty$ is a fixed positive constant.
As discussed above, the data matrix $X \in R^{n \times p}$ is a Gaussian matrix with i.i.d. $N (0, \frac{1}{p})$ elements. The choice of the $1 / p$ as the variance level in X is commonly used in the literature; see [20,27]. This is done to ensure that ${∥ X θ ∥}_{2}^{2} / n$ and ${∥ θ ∥}_{2}^{2} / p$ are of the same order.

Furthermore, in this work, we assume that the data matrix, X, is not perfectly known, and we only have an erroneous copy of it,

\hat{X}

, which is given as:

\hat{X} = X + E,

(8)

where

\hat{X}

, and

E \in R^{n \times p}

are independent matrices which have i.i.d. entries drawn from

N (0, \frac{1 - σ_{e}^{2}}{p})

, and

N (0, \frac{σ_{e}^{2}}{p})

, respectively. (This uncertainty notion is widely encountered in practice. For example, it could be used to represent model mismatch, errors, and noises from the data collection process, noise in the sensors used to gather the measurements, etc.) Here, E represents the unknown error matrix, and

σ_{e}^{2} \in [0, 1]

is the variance of the error.

n and p grow to infinity with $\frac{n}{p} ⟶ ζ \in (0, \infty)$ .

2.3. Generalized Penalized Constrained Regression (G-PCR)

In this paper, we refer to (5) as the standard G-PR, but we analyze a modified version that we call the Generalized Penalized Constrained Regression (G-PCR), which solves the following optimization instead:

\hat{θ} = arg min_{θ \in V^{p}} {∥ \hat{X} θ - y ∥}_{2}^{2} + α P (θ),

(9)

where

V = [- L, U], and L, U \in R_{+} \cup {0} .

When compared to (6), the constraint

V

is used instead of

R

, and

\hat{X}

is used instead of X. This is due to the fact that X is not perfectly known and we only have its noisy estimate

\hat{X}

.

When we compare (9) to (6), we can see that there is only a slight difference between them, which is the constraint set

V

instead of

R

. This small change assures significant performance improvements in the algorithm in many practical applications, such as image and signal processing [55], wireless communications [41,43,56], etc. These improvements are shown for several cases in Section 4 and Section 5.

3. Sharp Asymptotics

3.1. Measures of Performance

This paper considers the following measures used to assess the high-dimensional performance of the G-PCR.

Prediction Risk: One of the most extensively used measures of performance is the prediction risk. For a given estimator $\hat{θ}$ , the prediction risk is defined as

$\begin{matrix} R (\hat{θ}, θ_{0}) : = \frac{1}{p} E_{x, y} [{|x^{⊤} (\hat{θ} - θ_{0})|}^{2}] = \frac{1}{p} {∥ \hat{θ} - θ_{0} ∥}_{2}^{2}, \end{matrix}$

(10)

where x and y are new test points following the linear model in (7) but are independent of the training data.
Similarity: Another metric that is used to quantify the degree of alignment between the target vector $θ_{0}$ and its estimate $\hat{θ}$ is the (dis)similarity. It is a measure of orientation rather than magnitude. It is defined as

$\begin{matrix} ϱ (\hat{θ}, θ_{0}) : = \frac{{\hat{θ}}^{⊤} θ_{0}}{∥ \hat{θ} ∥_{2} ∥ θ_{0} ∥_{2}} \in [- 1, 1] . \end{matrix}$

(11)

This similarity measure could also be thought of as the correlation between the estimated and true target vectors. Essentially, we desire estimates that maximize this similarity. Note that this metric is also known as the cosine similarity in machine learning literature, since

ϱ (\hat{θ}, θ_{0}) = cos (∠ (\hat{θ}, θ_{0}))

.

Note that these two measures are related as

R (\hat{θ}, θ_{0}) = \frac{1}{p} [∥ \hat{θ} ∥_{2}^{2} + ∥ θ_{0} ∥_{2}^{2} - 2 ∥ \hat{θ} ∥_{2} ∥ θ_{0} ∥_{2} ϱ (\hat{θ}, θ_{0})] .

3.2. High-Dimensional Performance Evaluation

In this subsection, we provide the main results of the paper, namely, the sharp analysis of the asymptotic performance of the G-PCR convex program. We start by analyzing the estimation performance via a general pseudo-Lipschitz function as in Theorem 1 below, which sharply characterizes the general asymptotic behavior of the error. Then, we use this theorem to compute particular performance measures such as the prediction risk, similarity, etc.

Theorem 1

(General Performance Metric). Consider the high-dimensional setup of Section 2.2, and let the assumptions therein hold. Moreover, let

\hat{θ}

be a minimizer of the G-PCR program in (9) for a fixed

α > 0

. Let

ψ : R \times R \to R

be a pseudo-Lipschitz function. Then, in the limit of

p \to \infty

, it holds that

\begin{matrix} \frac{1}{p} \sum_{j = 1}^{p} ψ ({\hat{θ}}_{j}, θ_{0, j}) \overset{P}{⟶} E [ψ ({prox}_{\tilde{P}} (Θ_{0} + \frac{H}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q_{★} γ_{★} \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U), Θ_{0})], \end{matrix}

(12)

where

(q_{★}, γ_{★})

is the unique optimal solution to the following objective,

\begin{matrix} \begin{matrix} sup_{q \geq 0} inf_{γ > 0} O_{\tilde{P}} (q, γ) & : = \frac{q \sqrt{ζ}}{2 γ} + \frac{q γ \sqrt{ζ}}{2} (σ_{ϵ}^{2} + σ_{e}^{2} σ_{θ}^{2}) - \frac{q}{2 γ \sqrt{ζ}} - \frac{q^{2}}{4} \\ + q γ \sqrt{ζ} (1 - σ_{e}^{2}) E [M_{\tilde{P}} (Θ_{0} + \frac{H}{γ \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q γ \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U)], \end{matrix} \end{matrix}

(13)

and the expectation is taken with respect to independent random variables

Θ_{0} \sim p_{Θ}

and

H \sim N (0, 1)

.

Proof.

The proof is given in Appendix A. □

Remark 1

(Choice of

ψ (\cdot, \cdot)

). The performance metric in Theorem 1 is computed in terms of evaluation of a pseudo-Lipschitz function,

ψ (\cdot, \cdot)

. As an example,

ψ (a, b) = {(a - b)}^{2}

can be used to compute the prediction risk, or

ψ (a, b) = | a - b |

can be used to evaluate the mean absolute error (MAE). We will appeal to this theorem later with various choices of

ψ (\cdot, \cdot)

to evaluate different performance measures on

\hat{θ}

.

Remark 2

(Optimal Solutions). Note that

q_{★}

and

γ_{★}

can be calculated by any search technique such as the golden-section search method and the ternary search [57].

Remark 3

(Decoupling Property). Theorem 1 shows that

\begin{matrix} \frac{1}{p} \sum_{j = 1}^{p} ψ ({\hat{θ}}_{j}, θ_{0, j}) \overset{P}{⟶} E [ψ (\hat{Θ}, Θ_{0})], \end{matrix}

(14)

where

\begin{matrix} \hat{Θ} : = {prox}_{\tilde{P}} (\underset{: = Y}{\underset{︸}{Θ_{0} + \frac{H}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}}}}; \frac{α}{q_{★} γ_{★} \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U) . \end{matrix}

(15)

This provides some insights for the structured signal recovery with the G-PCR optimization. From (14), it can be seen that the random variable

\hat{Θ}

shares the same statistical properties as the estimate

\hat{θ}

[46]. Thus, Equation (15) can be considered as a decoupled scalar version of the original system as depicted in Figure 1. Particularly, in the original system (Figure 1a), the true target vector

θ_{0}

is first mixed by the design matrix X, and then the additive white Gaussian noise (AWGN) vector ϵ is added to form the measurement vector y. On the other hand, in the decoupled system (Figure 1b), the unknown variable

Θ_{0}

is only mixed by the Gaussian vector

\frac{1}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}} H

, where

Y : = Θ_{0} + \frac{1}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}} H

. Furthermore, letting

B : = \frac{1}{q_{★} γ_{★} \sqrt{ζ} (1 - σ_{e}^{2})}

, it can be observed that the generalized proximal operator solution

\hat{Θ} = {prox}_{\tilde{P}} (Y; α B, - L, U) = \underset{- L \leq θ \leq U}{\arg \min} \frac{1}{2} {(Y - θ)}^{2} + α B \tilde{P} (θ)

has a decoupled scalar form of the original G-PCR in (9), which can be expressed as

\hat{θ} = \underset{- L \leq θ_{j} \leq U}{\arg \min} \frac{1}{2} ∥ y - \hat{X} θ ∥_{2}^{2} + α B P (θ), j \in [p]

up to a scaling of B. This suggests that, in the high-dimensional asymptotic setting, one can use the decoupled scalar system to characterize the probabilistic properties of the G-PCR recovery problem.

This decoupling property was also shown for similar problems, such as box-constrained sum of absolute values (Box-SOAV) optimization for sparse recovery [46], sparse logistic regression [58], and the approximate message passing (AMP) algorithm [59].

As a first application of Theorem 1, we provide a sharp high-dimensional performance evaluation of the prediction risk as given in the following corollary.

Corollary 1

(Prediction Risk). Under the same assumptions of Theorem 1, and for

Θ_{0} \sim p_{Θ}

that is independent of

H \sim N (0, 1)

, it holds that

\begin{matrix} \begin{matrix} R (\hat{θ}, θ_{0}) \overset{P}{⟶} & E_{Θ_{0}, H} [{({prox}_{\tilde{P}} (Θ_{0} + \frac{H}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q_{★} γ_{★} \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U) - Θ_{0})}^{2}] \\ = \frac{1}{1 - σ_{e}^{2}} (\frac{1}{γ_{★}^{2}} - σ_{θ}^{2} σ_{e}^{2} - σ_{ϵ}^{2}), \end{matrix} \end{matrix}

(16)

where

q_{★}

and

γ_{★}

are the unique optimal solutions to the objective function in (13).

Proof.

Using Theorem 1 with

ψ (a, b) = {(a - b)}^{2}

, we can obtain the above expression of the prediction risk. Details are deferred to Appendix A.2.4. □

Remark 4

(Optimal Hyper-parameters). Corollary 1 allows us to determine the optimal hyper-parameters, such as

α, L

, and U, that minimize the prediction risk. To do so, it is first required to estimate some variances, such as

σ_{θ}^{2}, σ_{ϵ}^{2}

and

σ_{e}^{2}

, from the available data. Those can be easily estimated by using existing algorithms such as [60,61].

It should be noted that this theoretical hyper-parameter optimal tuning as discussed above avoids the traditional time/data-consuming practice of cross-validation used to tune the hyper parameters.

The following corollary sharply characterizes the similarity measure defined earlier in (11).

Corollary 2

(Similarity). Under the same assumptions and settings of Theorem 1, and in the limit of

p \to \infty

, it holds that

ϱ (\hat{θ}, θ_{0}) \overset{P}{⟶} \frac{E_{Θ_{0}, H} [{prox}_{\tilde{P}} (Θ_{0} + \frac{H}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q_{★} γ_{★} \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U) Θ_{0}]}{\sqrt{σ_{θ}^{2} \cdot E_{Θ_{0}, H} [{prox}_{\tilde{P}}^{2} (Θ_{0} + \frac{H}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q_{★} γ_{★} \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U)]}},

(17)

where

q_{★}

and

γ_{★}

are the unique optimal solutions to (13).

Proof.

The proof follows from Theorem 1 and from the continuous mapping Theorem [62]. Details are given in Appendix A.2.5. □

In the subsequent sections, we consider various instants of (9), such as

l_{1}

-norm and

l_{2}

-norm penalization, to illustrate the application of the theoretical asymptotic expressions derived in this section.

4. Sparse Linear Regression

In this section, we study the performance of the G-PCR with an

l_{1}

-norm penalization. As indicated in the introduction, this penalty function is used to promote sparse solutions. In contemporary machine learning applications, it is common to encounter a significantly large number of features, p. To prevent the problem of over-fitting, it becomes crucial to engage in feature selection, which involves eliminating irrelevant variables from the regression model [18]. A popular technique for accomplishing this is by introducing an

l_{1}

-norm penalty to the loss function. This approach is widely adopted and used for feature selection tasks.

Therefore, we specialize Theorem 1 to analyze the asymptotic performance of the G-PCR with an

l_{1}

-norm penalization. Particularly, for an s-sparse vector, we study the performance of the following optimization problem:

\hat{θ} = \arg min_{- L \leq θ_{j} \leq U} ∥ {\hat{X} θ - y ∥}_{2}^{2} + α {∥ θ ∥}_{1}, j \in [p] .

(18)

(We say that a vector v

\in R^{p}

is an s-sparse vector if only s of its p elements are non-zero (on average), and most of its elements are zeros, where

s ≪ p

.)

4.1. Asymptotic Behavior of Sparse G-PCR

To analyze (18), we specialize Theorem 1 with

\tilde{P} (\cdot) = | \cdot |

. Then, the generalized proximal operator and Moreau envelope functions can be expressed, respectively, in the following closed forms:

{prox}_{| \cdot |} (x; b, c, d) : = η_{1} (x; b, c, d) = \{\begin{matrix} d & , if x \geq d + b \\ x - b & , if b < x < d + b \\ 0 & , if | x | \leq b \\ x + b & , if c - b < x < - b \\ c & , if x \leq c - b, \end{matrix}

(19)

and

M_{| \cdot |} (x; b, c, d) = \{\begin{matrix} \frac{1}{2} {(d - x)}^{2} + b d & , if x \geq d + b \\ b x - \frac{1}{2} b^{2} & , if b < x < d + b \\ \frac{1}{2} x^{2} & , if | x | \leq b \\ - b x - \frac{1}{2} b^{2} & , if c - b < x < - b \\ \frac{1}{2} {(c - x)}^{2} - b c & , if x \leq c - b . \end{matrix}

(20)

Note that this proximal operator is a generalization of the well-known soft-thresholding operator, i.e.,

η (x; b) = sign (x) ReLU (| x | - b)

, where the Rectified Linear Unit (

ReLU

) is defined as

ReLU (t) = max (0, t)

.

These expressions can be used to solve the scalar optimization in (13) of Theorem 1 and to simplify the similarity expression in (17). Specifically, (13) becomes

\begin{matrix} \begin{matrix} sup_{q \geq 0} inf_{γ > 0} O_{| \cdot |} (q, γ) & = \frac{q \sqrt{ζ}}{2 γ} + \frac{q γ \sqrt{ζ}}{2} (σ_{ϵ}^{2} + σ_{e}^{2} σ_{θ}^{2}) - \frac{q}{2 γ \sqrt{ζ}} - \frac{q^{2}}{4} \\ + q γ \sqrt{ζ} (1 - σ_{e}^{2}) E [M_{| \cdot |} (Θ_{0} + \frac{H}{γ \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q γ \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U)] . \end{matrix} \end{matrix}

(21)

Figure 2 illustrates the performance of the

l_{1}

-penalized G-PCR as a function of

α

for different levels of error variance

σ_{e}^{2}

. We generated the target vector

θ_{0}

randomly with elements in

{- 1, 0, + 1}

and

P (θ_{0, j} = - 1) = P (θ_{0, j} = + 1) = 0.2

. This means that the sparsity factor

ρ : = \frac{s}{p}

is

0.2

in these simulations. From these figures, we can also see that the theoretical expressions of the prediction risk and the similarity match the empirical simulations very well. Furthermore, it can be noted in Figure 2a that, for different values of

σ_{e}^{2}

, there exists an optimal value

α_{★}

that achieves the minimum possible prediction risk. Similarly, notice in Figure 2b the optimal

α_{★}

that maximizes the similarity metric. It can also be observed that increasing

α

beyond

α_{★}

reduces the similarity

ϱ

between

\hat{θ}

and

θ_{0}

. In these simulations, we set

- L = min (θ_{0}) = - 1

and

U = max (θ_{0}) = + 1

.

In Figure 3, we compare the unconstrained G-PR in (6) (which is equivalent to a standard LASSO formulation in this case) to the proposed G-PCR for an over-parameterized setting with

ζ = 0.85

. As we can see from this figure, the G-PCR clearly outperforms the unconstrained one in both metrics. Moreover, despite the fact that our theoretical results are assumed to be asymptotic in the problem dimensions (i.e.,

n \to \infty

and

p \to \infty

), we can see from all of the above figures that our rigorous results are accurate even for problems with a few hundred variables, e.g.,

p = 300

.

4.2. Support Recovery

In this section, we analyze the so-called support recovery of the sparse G-PCR. As discussed earlier, a sparse vector means that it has few non-zero elements. We define the support of

θ_{0}

as follows:

Ω : = {j \in [p] | θ_{0, j} \neq 0} \subseteq [p] .

Here, we are interested in computing the probability that an element on the support of

θ_{0}

has been recovered correctly. Let

\hat{θ}

be a solution to the optimization problem in (18). Let us fix

ξ > 0

as a user-predefined hard threshold based on whether an entry of

\hat{θ}

is decided to be on the support or not. Formally, we construct the following set as the estimate of the support given

\hat{θ}

:

{\hat{Ω}}_{ξ} : = {j \in [p] | | {\hat{θ}}_{j} | > ξ} .

In order to analyze the support recovery correctness, we consider the following error metrics, which are known as the probability of misdetection (MD) and the probability of false alarm (FA), respectively.

P_{MD} (ξ) = P (j \notin {\hat{Ω}}_{ξ} | j \in Ω), and P_{FA} (ξ) = P (j \in {\hat{Ω}}_{ξ} | j \notin Ω) .

In the following lemma, we study the asymptotic performance of both of these measures.

Lemma 1.

Let

\hat{θ}

be a solution to (18), and assume that

θ_{0}

is a sparse signal. Fix

α > 0

and

ξ > 0

. Then, in the limit of

p \to \infty

, it holds that

\begin{matrix} P_{MD} (ξ) \overset{P}{⟶} & P (|η_{1} (Θ_{0} + \frac{H}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q_{★} γ_{★} \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U)| \leq ξ), \end{matrix}

(22)

and

\begin{matrix} P_{FA} (ξ) \overset{P}{⟶} & P (|η_{1} (\frac{H}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q_{★} γ_{★} \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U)| \geq ξ), \end{matrix}

(23)

where

η_{1} (\cdot; \cdot, \cdot, \cdot)

is as defined in (19),

(q_{★}, γ_{★})

is the unique optimal solution to (21), and the probabilities are taken with respect to the randomness of

Θ_{0}

and

H \sim N (0, 1)

.

Proof.

The proof can be obtained from Theorem 1 with some approximations of these metrics to Lipschitz functions. Details are omitted for briefness. See [39] for a similar proof. □

Next, we give an example to illustrate this lemma.

Example: Sparse-Binary Target Vectors

For an s-sparse target vector, define

ρ : = \frac{s}{p} \in (0, 1]

as the sparsity factor. Then, as an example, let us assume that each element

θ_{0, j}

, for

j \in [p]

, is i.i.d. drawn from the following distribution (this model has been widely adopted in the relevant literature; see, for example, [59,63,64]):

\begin{matrix} p_{Θ} (θ) = (1 - ρ) δ_{0} (θ) + ρ δ_{0} (θ - E), \end{matrix}

(24)

for some

E > 0

, and

δ_{0} (\cdot)

indicates a Dirac delta function (i.e., a point-mass distribution). In other words, the elements of

θ_{0}

are zero with probability

1 - ρ

, and the non-zero elements all have the value

E

. Figure 4 illustrates this distribution.

For a

Θ_{0}

that follows the distribution in (24), and for

ξ \in (0, E)

, with

L = 0

and

U = E

, the error measures in Lemma 1 simplify to the following:

\begin{matrix} P_{MD} \overset{P}{⟶} Φ ((ξ - E) γ_{★} \sqrt{ζ (1 - σ_{e}^{2})} + \frac{α}{q_{★} \sqrt{1 - σ_{e}^{2}}}), \end{matrix}

(25)

and

\begin{matrix} P_{FA} \overset{P}{⟶} 1 - Φ (ξ γ_{★} \sqrt{ζ (1 - σ_{e}^{2})} + \frac{α}{q_{★} \sqrt{1 - σ_{e}^{2}}}), \end{matrix}

(26)

where

Φ (t) = \frac{1}{\sqrt{2 π}} \int_{- \infty}^{t} e^{- u^{2}} d u

is the cumulative distribution function (CDF) of the standard normal distribution.

Figure 5 shows the accuracy of the above-derived theoretical expressions as compared to empirical simulations for the considered sparse-binary vector example.

5. G-PCR with $ℓ_{2}^{2}$ -Norm Penalization

Even though it does not promote a particular structure,

l_{2}

-norm penalization is used in many signal processing, statistics, and machine learning applications to stabilize the model when we have ill-conditioned or under-determined systems [17]. Adding this penalization will shrink all the coefficients toward zero and hence decrease the variance of the resultant model; therefore, it can be used to avoid over-fitting. Within the Bayesian framework, the incorporation of this penalization implies that the regression coefficients are assumed to follow a Gaussian distribution. This assumption is often justifiable in numerous applications, in which the regression coefficients are typically taken from a random process. In this section, we provide high-dimensional asymptotic performance analysis of the G-PCR with

l_{2}^{2}

-norm penalization; that is:

\hat{θ} = \arg min_{- L \leq θ_{j} \leq U} ∥ \hat{X} θ {- y ∥}_{2}^{2} + α {∥ θ ∥}_{2}^{2}, j \in [p] .

(27)

To analyze (27), we use Theorem 1. However, here the generalized proximal operator and Moreau envelope functions of

\tilde{P} (\cdot) = {(\cdot)}^{2}

can be expressed, respectively, in the following closed-forms:

{prox}_{{(\cdot)}^{2}} (x; b, c, d) = \{\begin{matrix} \frac{x}{1 + 2 b}, & if c \leq x \leq d \\ c, & if x < c \\ d, & if x > d, \end{matrix}

(28)

and

M_{{(\cdot)}^{2}} (x; b, c, d) = \{\begin{matrix} \frac{b x^{2}}{1 + 2 b}, & if c \leq x \leq d \\ \frac{1}{2} {(x - c)}^{2} + b c^{2}, & if x < c \\ \frac{1}{2} {(x - d)}^{2} + b d^{2}, & if x > d . \end{matrix}

(29)

Letting

b = \frac{α}{q γ \sqrt{ζ} (1 - σ_{e}^{2})}

and

λ = γ \sqrt{ζ (1 - σ_{e}^{2})}

, the scalar optimization in (13) of Theorem 1 reduces to:

\begin{matrix} \begin{matrix} sup_{q \geq 0} inf_{γ > 0} O_{{(\cdot)}^{2}} (q, γ) & = \frac{q \sqrt{ζ}}{2 γ} + \frac{q γ \sqrt{ζ}}{2} (σ_{ϵ}^{2} + σ_{e}^{2} σ_{θ}^{2}) - \frac{q}{2 γ \sqrt{ζ}} - \frac{q^{2}}{4} \\ + q γ \sqrt{ζ} (1 - σ_{e}^{2}) {E [\frac{b}{1 + 2 b} {(Θ_{0} + \frac{H}{λ})}^{2} 1_{{- L \leq Θ_{0} + \frac{H}{λ} \leq U}} \\ + [\frac{1}{2} {(Θ_{0} + \frac{H}{λ} + L)}^{2} + b L^{2}] 1_{{Θ_{0} + \frac{H}{λ} \leq - L}} \\ + [\frac{1}{2} {(Θ_{0} + \frac{H}{λ} - U)}^{2} + b U^{2}] 1_{{Θ_{0} + \frac{H}{λ} \geq U}}]}, \end{matrix} \end{matrix}

(30)

where

1_{{\cdot}}

is the indicator function.

In the same manner, we can simplify the similarity expression in (17) using the closed-form expression of the generalized proximal operator of the

l_{2}^{2}

-norm in (28). The prediction risk and the similarity metric are given by (16) and (17), respectively. However, now

(q_{★}, γ_{★})

is the unique solution to (30). To illustrate the ideal let us consider the next examples.

5.1. Numerical Illustration

As stated at the beginning of this section,

l_{2}

-norm penalization can be used for Gaussian distributed target vectors. Therefore, as a first illustration, let us assume that

θ_{0, j}

\sim N (0, 1) \forall j \in [p]

. Figure 6 depicts the risk/similarity performance of the G-PCR with an

l_{2}

-norm penalization for several levels of the penalization factor

α

. It also shows that the G-PCR outperforms the unconstrained G-PR, which is equivalent to a ridge regression formulation here. Again, Figure 6a illustrates that there exists an optimal value

α_{★}

that minimizes prediction risk, while Figure 6b shows that there is an optimal value of the penalization factor

α_{★}

that gives the maximum similarity. Both figures show the high accuracy of the derived asymptotic expressions as compared to Monte Carlo simulations.

5.2. Binary Target Vector Estimation

Let us assume that

θ_{0} = {\pm 1}^{p}

, i.e., it takes only one of two possible values,

+ 1

or

- 1

, with equal probability, i.e.,

\begin{matrix} p_{Θ} (θ) = \frac{1}{2} [δ_{0} (θ - 1) + δ_{0} (θ + 1)] . \end{matrix}

(31)

Such vectors are widely encountered in many practical applications, such as the detection of wireless communication signals [12,41]. We use (27) as our estimation method, with

L = U = 1

. For this vector,

σ_{θ}^{2} = 1

; i.e., the covariance matrix of

θ_{0}

is

C_{θ} = I_{p}

.

The task of estimating

θ_{0}

here is equivalent to a binary classification task, with the two classes being

+ 1

and

- 1

. After obtaining the estimates using (27), we can map (decode) them to the relative class using the following link function:

\bar{θ} = sign (\hat{θ}) .

We can use the prediction risk and similarity to measure the performance. However, a more suitable performance measure for this kind of target vector is the so-called “classification error rate”, which is defined as:

\begin{matrix} C_{err} : = \frac{1}{p} \sum_{j = 1}^{p} 1_{{{\bar{θ}}_{j} \neq θ_{0, j}}} . \end{matrix}

(32)

The next lemma derives an asymptotic expression for this metric.

Lemma 2.

Let

\hat{θ}

be a solution to (27), and assume that

θ_{0} = {\pm 1}^{p}

, with a PMF

p_{Θ}

that follows (31). Fix

ζ > 0

. Then, in the limit of

p \to \infty

, it holds that

\begin{matrix} C_{err} \overset{P}{⟶} 1 - Φ (γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}), \end{matrix}

(33)

where

γ_{★}

is the optimal solution of (30) in γ.

Proof.

The proof is similar to that of Lemma 1. Please refer to [39,47]. Details are skipped for brevity. □

Figure 7 illustrates the sharpness of Lemma 2 as compared to empirical simulations. Similar to previous figures, this figure demonstrates the perfect match between numerical simulations and theory.

5.3. Unpenalized Regression

When

α = 0

in (9), we have an optimization problem with no penalization. The resulting algorithm is known as the box relaxation optimization (BRO) [39], which has been extensively studied in the literature. It is used to promote boundedness structure. In fact, when

L = U

, the BRO is equivalent to an

l_{\infty}

-norm penalization. Setting

α = 0

in (30), and after some mapping of the involved variables, we can obtain the same results as in [39] for binary vectors.

6. Additional Numerical Experiments

In this section, we provide additional numerical experiments to validate our results. These experiments are performed on synthetic data beyond the Gaussian ensemble and on real data as well. In addition, we empirically discuss the double descent phenomenon.

6.1. Synthetic Data: Universality of the Gaussian Design

Theorem 1 assumes that the elements of matrix X are i.i.d. Gaussian distributed. However, we expect the asymptotic results derived in this paper (prediction risk, similarity, etc.) to be robust and hold for a larger class of random matrices. Rigorous proofs are presented in [65,66,67,68,69], where the asymptotic prediction is shown to have a universal limit (as

p \to \infty

) with respect to random matrices with i.i.d. entries.

To validate the above claims, see Figure 8, where we plotted the prediction risk and the similarity for a sparse-Gaussian target vector with i.i.d. entries that follow the distribution:

θ_{0, j} \sim (1 - ρ) δ_{0} + ρ N (0, 2)

. We used the G-PCR with an

l_{1}

-norm in (18) to obtain the estimates. In addition to the Gaussian design matrix, we simulated the performance using other random matrices with i.i.d. entries drawn from a uniform distribution

\sqrt{\frac{3}{p}} U [- 1, 1]

, an exponential distribution

\frac{1}{\sqrt{p}} Exp (1)

, and from a Poisson distribution

\frac{1}{\sqrt{p}} Poiss (1)

. Note that the normalization of these matrices is used to satisfy the high-dimensionality assumptions in Section 2.2. From this figure, we can see that the behavior seems to be nearly identical for all distributions, suggesting that our results enjoy a universality property.

6.2. Real-World Data

In previous section, we showed the robustness of our results to the distribution of the i.i.d. entries of the data matrix

X

. In this section, we take it a step further and consider real-world datasets instead of the synthetic data discussed earlier. These datasets are essentially not random and do not have i.i.d. elements. However, as seen in the numerical simulations below, they match our theoretical results to a great extent.

As an illustration, we present in Figure 9 the outcomes of these simulations for three real datasets. Each of these datasets consists of a small number of samples (n) and a high-dimensional feature space (p), which is consistent with the over-parameterized setting (

p > n

). These datasets are mainly for detecting several diseases and cancer samples. We generated the target vector,

θ_{0}

, randomly with entries following the distribution

θ_{0, j} \sim (1 - ρ) δ_{0} + ρ N (0, 1)

. The noise vector ϵ was generated using i.i.d.

N (0, 0.2)

elements. We generated the observations as

y = X θ_{0} + ϵ

. The G-PCR with

l_{1}

-norm penalty in (18) is used to obtain

\hat{θ}

with

\hat{X} = X + E

.

The three figures correspond to the following datasets:

Figure 9a: For this figure, we used breast cancer data [70] (available at: https://github.com/kivancguckiran/microarray-data (accessed on 27 May 2023)). This dataset has been used in [71] for DNA microarray gene expression classification using the LASSO. It consists of 22,215 gene expressions (features) and 118 samples. From this matrix, we took a sub-matrix X of aspect ratio $ζ = 0.75$ . We standardized all columns of matrix X to have mean 0 and variance 1.
Figure 9b: In this figure, glioma disease data [72] were used (available at: https://github.com/kivancguckiran/microarray-data (accessed on 27 May 2023)). This dataset includes 54,613 features and 180 samples. Similar to the breast cancer data, the sub-matrix X with the same aspect ratio $ζ$ was selected and standardized.
Figure 9c: The dataset used in this figure includes colon cancer data [73] (available at: http://www.weizmann.ac.il/mcb/UriAlon/download/downloadable-data (accessed on 27 May 2023)). This dataset was used in [74] for a sparse-group LASSO model. It includes 2000 genes and 62 samples (22 normal tissues and 40 colon tumor tissues). Similar to the previous datasets, we selected a sub-matrix X with aspect ratio $ζ$ and standardized it.

For all figures, we can see that the agreement between theory and simulations is remarkably good.

6.3. Double Descent Phenomenon

In Figure 10, we plotted the prediction risk as a function of

ζ

for different choices of the penalization factor

α

. As can be seen, for an arbitrary choice of

α

, the prediction risk of the G-PCR first decreases for small values of

ζ

, then increases until it reaches a peak known as the interpolation peak. After that, the prediction risk decreases monotonically with respect to

ζ

. This is known as the double descent phenomenon [75]. On the other hand, optimal values of the penalization factor

α_{★}

always guarantee that the prediction risk decreases with more training samples being used (i.e., with increasing

ζ

). This emphasizes the important role of the optimal tuning of

α

to mitigate the double descent phenomenon and to give the best performance.

7. Conclusions

In this paper, we studied the high-dimensional error performance of the generalized penalized constrained regression (G-PCR) optimization with noisy features. Several analytical expressions were derived to measure the performance, such as the prediction risk, similarity, probability of misdetection, and probability of false alarm. Different popular instances of this optimization, such as

l_{1}

-norm penalized regression and

l_{2}

-norm penalization, were considered. We presented numerical simulations to validate these expressions based on both synthetic and real data. These results can be used to tune the involved hyper-parameters efficiently.

Furthermore, we empirically investigated the so-called double descent phenomenon and showed that optimal penalization can mitigate its effect. We also illustrated through several simulations the universality of our results beyond the assumed Gaussian distribution.

Finally, we note that numerical simulations have shown that our rigorous results are accurate even for problems with a few hundred variables, despite the fact that these results are assumed to be asymptotic in the problem dimensions.

Author Contributions

Conceptualization, A.M.A.; Methodology, A.M.A.; Software, A.M.A.; Validation, A.M.A.; Formal analysis, A.M.A.; Investigation, A.M.A. and M.A.; Resources, M.A.; Data curation, A.M.A. and M.A.A.; Writing—original draft, A.M.A.; Writing—review & editing, A.M.A., M.A. and M.A.A.; Visualization, A.M.A. and M.A.; Supervision, A.M.A.; Project administration, M.A.A.; Funding acquisition, M.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Deputyship for Research & Innovation, Ministry of Education, Saudi Arabia through project number 445-9-196.

Data Availability Statement

The data presented in this study are available within the article.

Acknowledgments

The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia for funding this research work through project number 445-9-196. Also, the authors would like to extend their appreciation to Taibah University for its supervision support.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of the Main Results

In this appendix, we provide an outline of the proof of the high-dimensional analysis of the prediction risk of the considered G-PCR learning algorithm. Our main analysis framework is the convex Gaussian min–max theorem (CGMT). For the reader’s convenience, we firstly recall the CGMT.

Appendix A.1. Main Analysis Framework: CGMT

The CGMT is an extension of Gordon’s comparison lemma [76]. Gordon’s lemma was used in the analysis of some high-dimensional inference problems, such as the study of sharp phase-transitions in noiseless compressed sensing. The CGMT was initiated first in [36] and further developed in [20]. It uses convexity to compare the min–max values of two Gaussian processes.

To illustrate the main ideas of the CGMT, let us first consider the following doubly indexed Gaussian random processes:

\begin{matrix} 𝒳_{r, w} : = w^{⊤} G r + Ξ (r, w), \end{matrix}

(A1a)

\begin{matrix} 𝒴_{r, w} : = {∥ r ∥}_{2} h_{1}^{⊤} w + {∥ w ∥}_{2} h_{2}^{⊤} r + Ξ (r, w), \end{matrix}

(A1b)

where

G \in R^{n \times p}, h_{1} \in R^{n}, h_{2} \in R^{p}

, they all have i.i.d. standard Gaussian elements, and

Ξ : R^{p} \times R^{n} \to R

. For these two processes, consider the following min–max optimization programs, which are referred to as the primal optimization (PO) and the auxiliary optimization (AO):

\begin{matrix} F (G) : = min_{r \in S_{r}} max_{w \in S_{w}} 𝒳_{r, w}, \end{matrix}

(A2a)

\begin{matrix} f (h_{1}, h_{2}) : = min_{r \in S_{r}} max_{w \in S_{w}} 𝒴_{r, w}, \end{matrix}

(A2b)

where the sets

S_{r} \subset R^{p}

and

S_{w} \subset R^{n}

are assumed to be compact and convex sets. In addition, if the function

Ξ (r, w)

is continuous and convex–concave on

S_{r} \times S_{w}

, then, according to the CGMT formulation in Theorem 6 in [20], for any

χ \in R

and

μ > 0

:

P (| F (G) - χ | > μ) \leq 2 P (| f (h_{1}, h_{2}) - χ | > μ) .

(A3)

The above result states that if we can show that the optimal AO cost is

f (h_{1}, h_{2}) \overset{P}{⟶} c_{★}

asymptotically,

c_{★} \in R

, then it can be concluded that the optimal PO cost is

F (G) \overset{P}{⟶} c_{★}

. The premise is that it is usually much easier to analyze the AO instead of the PO. In addition, the CGMT (Theorem 6.1(iii) in [20]) shows that concentration of the optimal solution to the AO problem implies concentration of the optimal solution of the PO around the same value. In other words, if minimizers of (A2b) satisfy that

∥ {\hat{r}}_{f} (h_{1}, h_{2}) ∥_{2} \overset{P}{⟶} ν_{★}

, where

ν_{★} > 0

, then the same holds true for minimizers of (A2a), i.e.,

∥ {\hat{r}}_{F} {(G) ∥}_{2} \overset{P}{⟶} ν_{★}

. In addition, we make use of the following corollary that holds true in the high-dimensional asymptotic regime.

Corollary A1

(Asymptotic CGMT [20]). Using the same notations and assumptions as in the above discussion, let

S \subset S_{r}

and

S^{c} : = S_{r} / S

. Define

F_{S^{c}} (G)

and

f_{S^{c}} (h_{1}, h_{2})

as the optimal costs in (A2a) and (A2b), respectively, given that we now constrain the optimization over

r \in S^{c}

. Suppose there exist constants

\bar{J} < {\bar{J}}_{S^{c}}

, such that

f_{S^{c}} (h_{1}, h_{2}) \overset{P}{⟶} {\bar{J}}_{S^{c}}

and

f (h_{1}, h_{2}) \overset{P}{⟶} \bar{J}

. Then,

\begin{matrix} lim_{p \to \infty} P ({\hat{r}}_{F (G)} \in S) = 1 . \end{matrix}

(A4)

For more details about the framework of CGMT, the reader is advised to see [20].

Next, we use the CGMT to provide a proof outline of the general error asymptotic behavior provided in Theorem 1.

Appendix A.2. Sharp Analysis of the G-PCR

Appendix A.2.1. Primal and Auxiliary Problems of the G-PCR

To obtain the main asymptotic result using CGMT, we first need to rewrite the G-PCR learning problem in (9) as a PO problem. For convenience, define the vector

r : = θ - θ_{0}

, and the following set

B : = \{r_{j} \in R | - L - θ_{0, j} \leq r_{j} \leq U - θ_{0, j}, j \in [p]\};

(A5)

then, the problem in (9) can be reformulated as

\hat{r} : = \arg min_{r \in B^{p}} ∥ \hat{X} r + E θ_{0} - ϵ ∥_{2}^{2} + α P (r + θ_{0}) .

(A6)

Introducing the Convex Conjugate: Any convex function

h : R^{n} \to R

can be expressed in terms of its convex conjugate

h^{★} : R^{n} \to R

as:

h (t) = sup_{\bar{w} \in R^{n}} {\bar{w}}^{⊤} t - h^{★} (\bar{w}) = sup_{w \in R^{n}} \sqrt{p} w^{⊤} t - h^{★} (\sqrt{p} w) .

Using the above definition, we can express the

l_{2}^{2}

-norm loss function in (A6) as

{∥ t ∥}_{2}^{2} = sup_{w \in R^{n}} \sqrt{p} w^{⊤} t - \frac{p}{4} {∥ w ∥}_{2}^{2} .

(A7)

Hence, (A6) becomes equivalent to the following:

\begin{matrix} min_{r \in B^{p}} sup_{w \in R^{n}} \sqrt{p} w^{⊤} \hat{X} r + \sqrt{p} w^{⊤} E θ_{0} - \sqrt{p} w^{⊤} ϵ - \frac{p}{4} {∥ w ∥}_{2}^{2} + α P (r + θ_{0}) . \end{matrix}

(A8)

To apply the CGMT, we need the optimization sets to be compact. This is true for

S_{r} = B^{p}

, but

S_{w} = R^{n}

is not. This issue can be treated in a similar way to the method in Appendix A in [20]. We introduce an artificial compact set

S_{w} = {w \in R^{n} {| ∥ w ∥}_{2} \leq R_{w}}

for a sufficiently large constant

R_{w} > 0

that is independent of p. The optimization problem is unaffected by this constraint set asymptotically. After that, we obtain

\begin{matrix} min_{r \in B^{p}} sup_{w \in S_{w}} \sqrt{p (1 - σ_{e}^{2})} w^{⊤} \tilde{X} r + σ_{e} \sqrt{p} w^{⊤} \tilde{E} θ_{0} - \sqrt{p} w^{⊤} ϵ - \frac{p}{4} {∥ w ∥}_{2}^{2} + α P (r + θ_{0}), \end{matrix}

(A9)

where

\tilde{X}

and

\tilde{E}

are independent Gaussian matrices that have i.i.d.

N (0, 1 / p)

elements each. Now, the above problem is in the format of a PO with

Ξ (r, w) = σ_{e} \sqrt{p} w^{⊤} \tilde{E} θ_{0} - \sqrt{p} w^{⊤} ϵ - \frac{p}{4} {∥ w ∥}_{2}^{2} + α P (r + θ_{0}) .

Therefore, the corresponding AO problem is

\begin{matrix} min_{r \in B^{p}} sup_{w \in S w} \sqrt{1 - σ_{e}^{2}} ∥ r ∥_{2} h_{1}^{⊤} w + \sqrt{1 - σ_{e}^{2}} ∥ w ∥_{2} h_{2}^{⊤} r + Ξ (r, w), \end{matrix}

(A10)

where

h_{1} \sim N (0, I_{n})

and

h_{2} \sim N (0, I_{p})

are independent standard Gaussian vectors.

Appendix A.2.2. Simplifying the Auxiliary Problem

The next step is to reduce (simplify) the AO into a scalar problem, i.e., a problem that has only scalar variables. To do so, first, let

\begin{matrix} \tilde{h} = \sqrt{1 - σ_{e}^{2}} ∥ r ∥_{2} h_{1} - \sqrt{p} ϵ + \sqrt{p σ_{e}^{2}} \tilde{E} θ_{0} . \end{matrix}

(A11)

Using standard probability theory results, one can show that

\tilde{h} \sim N (0, C_{\tilde{h}})

with a covariance matrix that is given by

C_{\tilde{h}} = ((1 - σ_{e}^{2}) {∥ r ∥}_{2}^{2} + p σ_{ϵ}^{2} + σ_{e}^{2} {∥ θ_{0} ∥}_{2}^{2}) I_{n} .

Thus, the AO in (A10) holds that

\begin{matrix} min_{r \in B^{p}} sup_{w \in S_{w}} {\tilde{h}}^{⊤} w + \sqrt{1 - σ_{e}^{2}} ∥ w ∥_{2} h_{2}^{⊤} r - \frac{p}{4} {∥ w ∥}_{2}^{2} + α P (r + θ_{0}) . \end{matrix}

(A12)

In order to further simplify the AO, we fix the norm of w to

q : = {∥ w ∥}_{2}

. In this case, one can simply optimize over the direction of w, which reduces the AO problem to

min_{r \in B^{p}} sup_{q \geq 0} q ∥ \tilde{h} ∥_{2} + \sqrt{1 - σ_{e}^{2}} q h_{2}^{⊤} r - \frac{p q^{2}}{4} + α P (r + θ_{0}) .

(A13)

Moreover, to have the proper convergence, we have to normalize the above cost function by factor of

\frac{1}{p}

. Then, we obtain

\begin{matrix} sup_{q \geq 0} min_{r \in B^{p}} q \sqrt{\frac{1}{p} [(1 - σ_{e}^{2}) {∥ r ∥}_{2}^{2} + p σ_{ϵ}^{2} + σ_{e}^{2} ∥ θ_{0} ∥_{2}^{2}]} \frac{{∥ g ∥}_{2}}{\sqrt{p}} + \sqrt{1 - σ_{e}^{2}} q \frac{1}{p} h_{2}^{⊤} r - \frac{q^{2}}{4} + \frac{α}{p} P (r + θ_{0}), \end{matrix}

(A14)

where

g \sim N (0, I_{n})

. Note the change in the order of the min–sup, which can be justified using (Appendix A in [20]). Next, we wish to write the above optimization as a separable problem by using the following (

for u \geq 0

):

\sqrt{u} = inf_{γ > 0} \frac{γ u}{2} + \frac{1}{2 γ} .

(A15)

Note that the optimal solution to (A15) is

\hat{γ} = \frac{1}{\sqrt{u}}

. Using this identity with

\begin{matrix} u = (1 - σ_{e}^{2}) \frac{1}{p} {∥ r ∥}_{2}^{2} + σ_{ϵ}^{2} + σ_{e}^{2} \frac{1}{p} {∥ θ_{0} ∥}_{2}^{2}, \end{matrix}

(A16)

we can write the problem in (A14) as

\begin{matrix} \begin{matrix} sup_{q \geq 0} inf_{γ > 0} & \frac{{q ∥ g ∥}_{2}}{2 γ \sqrt{p}} + \frac{{γ q ∥ g ∥}_{2}}{2 \sqrt{p}} (σ_{ϵ}^{2} + \frac{σ_{e}^{2}}{p} ∥ θ_{0} ∥_{2}^{2}) - \frac{q^{2}}{4} \\ + min_{r \in B^{p}} \frac{{γ q ∥ g ∥}_{2}}{2 \sqrt{p}} \frac{(1 - σ_{e}^{2})}{p} {∥ r ∥}_{2}^{2} + q \sqrt{1 - σ_{e}^{2}} \frac{h_{2}^{⊤} r}{p} + \frac{α}{p} P (r + θ_{0}) . \end{matrix} \end{matrix}

(A17)

By the weak law of large numbers (WLLN), we can show that

\frac{{∥ g ∥}_{2}}{\sqrt{p}} \overset{P}{⟶} \sqrt{ζ}

and

\frac{1}{p} ∥ θ_{0} ∥_{2}^{2} \overset{P}{⟶} σ_{θ}^{2}

. Now, let us work with the initial variable θ rather than r; then, the above optimization problem converges to

\begin{matrix} \begin{matrix} sup_{q \geq 0} inf_{γ > 0} & \frac{q \sqrt{ζ}}{2 γ} + \frac{q γ \sqrt{ζ}}{2} (σ_{ϵ}^{2} + σ_{e}^{2} σ_{θ}^{2}) - \frac{q^{2}}{4} \\ + \frac{1}{p} \sum_{j = 1}^{p} min_{- L \leq θ_{j} \leq U} \{\frac{q γ \sqrt{ζ}}{2} (1 - σ_{e}^{2}) {(θ_{j} - θ_{0, j})}^{2} + \sqrt{1 - σ_{e}^{2}} q h_{2, j} (θ_{j} - θ_{0, j}) + α \tilde{P} (θ_{j})\} . \end{matrix} \end{matrix}

(A18)

Completing the squares in

θ_{j}

in the last minimization of the above problem, and using the fact that

\frac{1}{p} h_{2}^{⊤} θ_{0} \overset{P}{⟶} 0

, we obtain

\begin{matrix} \begin{matrix} sup_{q \geq 0} inf_{γ > 0} & \frac{q \sqrt{ζ}}{2 γ} + \frac{q γ \sqrt{ζ}}{2} (σ_{ϵ}^{2} + σ_{e}^{2} σ_{θ}^{2}) - \frac{q^{2}}{4} - \frac{1}{p} \sum_{j = 1}^{p} \frac{q}{2 γ \sqrt{ζ}} h_{2, j}^{2} \\ + q γ \sqrt{ζ} (1 - σ_{e}^{2}) \frac{1}{p} \sum_{j = 1}^{p} min_{- L \leq θ_{j} \leq U} \frac{1}{2} {(θ_{j} - (θ_{0, j} + \frac{h_{2, j}}{γ \sqrt{ζ (1 - σ_{e}^{2})}}))}^{2} + \frac{α}{q γ \sqrt{ζ} (1 - σ_{e}^{2})} \tilde{P} (θ_{j}) . \end{matrix} \end{matrix}

(A19)

Note that the last summation term in (A19) can be expressed by the generalized Moreau envelope function

M_{\tilde{P}} (\cdot)

defined in (1). Hence, we obtain the following problem:

\begin{matrix} \begin{matrix} sup_{q \geq 0} inf_{γ > 0} & \frac{q \sqrt{ζ}}{2 γ} + \frac{q γ \sqrt{ζ}}{2} (σ_{ϵ}^{2} + σ_{e}^{2} σ_{θ}^{2}) - \frac{q^{2}}{4} - \frac{1}{p} \sum_{j = 1}^{p} \frac{q}{2 γ \sqrt{ζ}} h_{2, j}^{2} \\ + q γ \sqrt{ζ} (1 - σ_{e}^{2}) \frac{1}{p} \sum_{j = 1}^{p} M_{\tilde{P}} (θ_{0, j} + \frac{h_{2, j}}{γ \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q γ \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U) . \end{matrix} \end{matrix}

(A20)

Next, by the WLLN,

\frac{1}{p} \sum_{j = 1}^{p} h_{2, j}^{2} \overset{P}{⟶} 1

, and for all

q > 0 and γ > 0

, we have

\begin{matrix} \frac{1}{p} \sum_{j = 1}^{p} & M_{\tilde{P}} (θ_{0, j} + \frac{h_{2, j}}{γ \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q γ \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U) \\ \overset{P}{⟶} E [M_{\tilde{P}} (Θ_{0} + \frac{H}{γ \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q γ \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U)], \end{matrix}

where the expectation is taken with respect to the independent scalar random variables

Θ_{0} \sim p_{Θ}

and

H \sim N (0, 1)

.

Finally, (A20) converges to the following scalar problem:

\begin{matrix} \begin{matrix} sup_{q \geq 0} inf_{γ > 0} & \frac{q \sqrt{ζ}}{2 γ} + \frac{q γ \sqrt{ζ}}{2} (σ_{ϵ}^{2} + σ_{e}^{2} σ_{θ}^{2}) - \frac{q^{2}}{4} - \frac{q}{2 γ \sqrt{ζ}} \\ + q γ \sqrt{ζ} (1 - σ_{e}^{2}) E [M_{\tilde{P}} (Θ_{0} + \frac{H}{γ \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q γ \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U)] . \end{matrix} \end{matrix}

(A21)

Appendix A.2.3. General Performance Metric: Proof of Theorem 1

Now that we derived the scalar optimization problem, we proceed to prove Theorem 1. Recall that in the process of scalarizing the AO, we introduced the generalized Moreau envelope function in (A20). It can be shown that the optimizer of this function gives the AO solution in

θ

. Let

(q_{★}, γ_{★})

be the unique solution to (A21). Then, the AO solution can be presented as

{\hat{θ}}_{j}^{AO} = \hat{Θ} : = {prox}_{\tilde{P}} (Θ_{0} + \frac{H}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q_{★} γ_{★} \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U), j \in [p],

where H is a standard normal random variable and

Θ_{0} \sim p_{Θ}

independent of H.

The last step is to show the convergence of any pseudo-Lipschitz function

ψ (\cdot, \cdot)

. Using the weak law of large numbers and the fact that the elements of

θ_{0}

are i.i.d. sampled from a density

p_{θ}

, we obtain

\begin{matrix} \frac{1}{p} \sum_{j = 1}^{p} ψ ({\hat{θ}}_{j}^{AO}, θ_{0, j}) \overset{P}{⟶} E [ψ (\hat{Θ}, Θ_{0})], \end{matrix}

(A22)

where the expectation is taken over

H \sim N (0, 1)

and

Θ_{0} \sim p_{Θ}

independent of H. To use the CGMT (Corollary A1), we introduce the following set:

S_{η} = \{v \in R^{p} | | \frac{1}{p} \sum_{j = 1}^{p} ψ (v_{j}, θ_{0, j}) - E [ψ (\hat{Θ}, Θ_{0})] | < η\},

for

η > 0

.

The convergence result in (A22) establishes that

{lim}_{p \to \infty} P ({\hat{θ}}^{AO} \in S_{η}) = 1

. Hence, using the CGMT (Corollary A1),

{lim}_{p \to \infty} P (\hat{θ} \in S_{η}) = 1

, where

\hat{θ}

is the solution to the original G-PCR in (9). This concludes the proof of Theorem 1.

Appendix A.2.4. Prediction Risk Analysis: Proof of Corollary 1

The objective of this part is to analyze the prediction risk of the G-PCR asymptotically. To begin with, for any

η > 0

, define the following set:

{\overset{˘}{S}}_{η} = \{r \in R^{p} | | \frac{1}{p} {∥ r ∥}_{2}^{2} - \frac{1}{1 - σ_{e}^{2}} (\frac{1}{γ_{★}^{2}} - σ_{θ}^{2} σ_{e}^{2} - σ_{ϵ}^{2}) | < η\},

where

γ_{★}

is the solution to (A21). Recall from (A16) that

{\hat{γ}}_{p} = \frac{1}{\sqrt{\hat{u}}}

and

\hat{u} = (1 - σ_{e}^{2}) \frac{1}{p} {∥ \tilde{r} ∥}_{2}^{2}

+ σ_{ϵ}^{2} + σ_{e}^{2} \frac{1}{p} {∥ θ_{0} ∥}_{2}^{2} .

Hence,

\frac{1}{p} {∥ \tilde{r} ∥}_{2}^{2} = \frac{1}{1 - σ_{e}^{2}} (\frac{1}{{\hat{γ}}_{p}^{2}} - \frac{σ_{e}^{2}}{p} ∥ θ_{0} ∥_{2}^{2} - σ_{ϵ}^{2}),

where

\tilde{r}

is the optimal solution to (A14) and

{\hat{γ}}_{p}

is the solution to (A17). Using the uniform convergence of the cost functions, we can show that

{\hat{γ}}_{p} \overset{P}{⟶} γ_{★}

. Hence, using the WLLN,

\frac{1}{p} {∥ θ_{0} ∥}_{2}^{2} \overset{P}{⟶} σ_{θ}^{2}

and, therefore,

\begin{matrix} R (\tilde{θ}, θ_{0}) = \frac{1}{p} ∥ \tilde{r} ∥_{2}^{2} = \frac{1}{p} {∥ \tilde{θ} - θ_{0} ∥}_{2}^{2} & \overset{P}{⟶} \frac{1}{1 - σ_{e}^{2}} (\frac{1}{γ_{★}^{2}} - σ_{θ}^{2} σ_{e}^{2} - σ_{ϵ}^{2}) . \end{matrix}

(A23)

Remember that

\tilde{r} = \tilde{θ} - θ_{0}

, where

\tilde{θ}

is the AO solution in θ. From Equation (A23), we can see that, for all

η > 0

,

R (\tilde{θ}, θ_{0}) \in {\overset{˘}{S}}_{η}

, with probability approaching 1. Then, an application of the CGMT yields that

R (\hat{θ}, θ_{0}) \in {\overset{˘}{S}}_{η}

with high probability.

Furthermore, Corollary 1 can also be proven as an immediate result of Theorem 1 with

ψ (a, b) = {(a - b)}^{2}

therein. Hence,

\begin{matrix} R (\hat{θ}, θ_{0}) \overset{P}{⟶} & E_{Θ_{0}, H} [{({prox}_{\tilde{P}} (Θ_{0} + \frac{H}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q_{★} γ_{★} \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U) - Θ_{0})}^{2}] . \end{matrix}

(A24)

Combining the results in (A23) and (A24) concludes the proof of Corollary 1.

Appendix A.2.5. Similarity Analysis: Proof of Corollary 2

The proof of Corollary 2 is based on the CGMT to derive asymptotic predictions of the numerator and the denominator of the similarity expression

ϱ (\hat{θ}, θ_{0})

in (11) separately, and then to use the continuous mapping Theorem [62] to arrive at the desired result. For the sake of brevity, we only highlight the main steps of the proof.

The similarity expression in (11) can be rewritten as

\begin{matrix} ϱ (\hat{θ}, θ_{0}) = \frac{\frac{1}{p} \sum_{j = 1}^{p} {\hat{θ}}_{j} θ_{0, j}}{\sqrt{\frac{1}{p} \sum_{j = 1}^{p} {\hat{θ}}_{j}^{2}} \cdot \sqrt{\frac{1}{p} \sum_{j = 1}^{p} θ_{★, j}^{2}}} . \end{matrix}

(A25)

For the numerator, we use Theorem 1 with

ψ (a, b) = a \cdot b

, to obtain the following convergence:

Numerator = \frac{1}{p} \sum_{j = 1}^{p} {\hat{θ}}_{j} θ_{0, j} \overset{P}{⟶} E_{Θ_{0}, H} [{prox}_{\tilde{P}} (Θ_{0} + \frac{H}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q_{★} γ_{★} \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U) Θ_{0}] .

The denominator consists of two terms, each of which converges as well. For the first term,

\sqrt{\frac{1}{p} \sum_{j = 1}^{p} {\hat{θ}}_{j}^{2}}

, use Theorem 1 with

ψ (a, a) = a^{2}

and the continuous mapping Theorem to obtain

\sqrt{\frac{1}{p} \sum_{j = 1}^{p} {\hat{θ}}_{j}^{2}} \overset{P}{⟶} \sqrt{E_{Θ_{0}, H} [{prox}_{\tilde{P}}^{2} (Θ_{0} + \frac{H}{γ_{★} \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q_{★} γ_{★} \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U)]} .

For the second term in the denominator,

\sqrt{\frac{1}{p} \sum_{j = 1}^{p} θ_{★, j}^{2}}

, using the WLLN, we have

\sqrt{\frac{1}{p} \sum_{j = 1}^{p} θ_{★, j}^{2}} \overset{P}{⟶} \sqrt{E [Θ_{0}^{2}]} = \sqrt{σ_{θ}^{2}} .

Putting together all of the above convergence results coupled with an application of the continuous mapping Theorem [62], we obtain the asymptotic expression of the similarity measure in (17).

Appendix B. A Note on the Square-Root Generalized Penalized Constrained Regression

Sqrt G-PCR Learning Algorithm

Let us consider the following optimization problem:

\hat{θ} = \arg min_{θ \in V^{p}} ∥ \hat{X} θ - y ∥_{2} + \frac{α}{\sqrt{p}} P (θ),

(A26)

where

V = [- L, U], and L, U \in R_{+} \cup {0} .

Problems of the above type are known as regularized square-root regression problems [77]. Here, instead of the

l_{2}^{2}

-norm squared loss in (9), there is a non-squared

l_{2}

-norm loss. This leads to optimization problems with a loss function that is not separable. Examples of this algorithm include the square-root LASSO [24] and the square-root group LASSO [78]. Please see [20,64,77] for the motivations for using the non-squared loss. Furthermore, the scaling of the penalization factor

α

by a factor of

\frac{1}{\sqrt{p}}

is just for convergence issues of the analysis of the CGMT (see [20] for further justification). The analysis of the above optimization, which we call the Square-root G-PCR (Sqrt G-PCR) is very similar to the one provided in the previous sections of this paper for the G-PCR problem. The only difference, however, is that, instead of (A7), we have

∥ t ∥_{2} = max_{∥ w ∥_{2} \leq 1} w^{⊤} t .

(A27)

Following the same analysis (with some normalization adjustments) as in Appendix A, but using (A27) instead of (A7), we finally arrive at the following deterministic scalar max–min optimization problem:

\begin{matrix} \begin{matrix} sup_{0 \leq q \leq 1} inf_{γ > 0} {\tilde{O}}_{\tilde{P}} (q, γ) : & = \frac{q \sqrt{ζ}}{2 γ} + \frac{q γ \sqrt{ζ}}{2} (σ_{ϵ}^{2} + σ_{e}^{2} σ_{β}^{2}) - \frac{q}{2 γ \sqrt{ζ}} \\ + q γ \sqrt{ζ} (1 - σ_{e}^{2}) E [M_{\tilde{P}} (Θ_{0} + \frac{H}{γ \sqrt{ζ (1 - σ_{e}^{2})}}; \frac{α}{q γ \sqrt{ζ} (1 - σ_{e}^{2})}, - L, U)] . \end{matrix} \end{matrix}

(A28)

Comparing

{\tilde{O}}_{\tilde{P}} (q, γ)

to

O {(q, γ)}_{\tilde{P}}

in (13), we can see two main differences, which are the absence of the

- \frac{q^{2}}{4}

term and the presence of the constraint

0 \leq q \leq 1

in

{\tilde{O}}_{\tilde{P}} (q, γ)

.

This means that the prediction risk and the similarity of the Sqrt G-PCR in (A26) converge to the same asymptotic limits in (16) and (17), respectively, but now with

q_{★}

and

γ_{★}

, which are solutions to (A28) instead of (13).

References

Parikh, N.; Boyd, S. Proximal algorithms. Found. Trends Optim. 2014, 1, 127–239. [Google Scholar] [CrossRef]
Tarantola, A. Inverse Problem Theory and Methods for Model Parameter Estimation; SIAM: Philadelphia, PA, USA, 2005. [Google Scholar]
Kailath, T.; Sayed, A.H.; Hassibi, B. Linear Estimation; Prentice Hall: Hoboken, NJ, USA, 2000. [Google Scholar]
Groetsch, C.W.; Groetsch, C. Inverse Problems in the Mathematical Sciences; Springer: Berlin/Heidelberg, Germany, 1993; Volume 52. [Google Scholar]
Bishop, C.M. Pattern recognition. Mach. Learn. 2006, 128, 1–58. [Google Scholar]
Rencher, A.C.; Schaalje, G.B. Linear Models in Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2008. [Google Scholar]
Dhar, V. Data science and prediction. Commun. ACM 2013, 56, 64–73. [Google Scholar] [CrossRef]
Donoho, D.L. Compressed sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306. [Google Scholar] [CrossRef]
Duarte, M.F.; Eldar, Y.C. Structured compressed sensing: From theory to applications. IEEE Trans. Signal Process. 2011, 59, 4053–4085. [Google Scholar] [CrossRef]
Poor, H.V. An Introduction to Signal Detection and Estimation; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1998. [Google Scholar]
Fadili, J.M.; Bullmore, E. Penalized partially linear models using sparse representations with an application to fMRI time series. IEEE Trans. Signal Process. 2005, 53, 3436–3448. [Google Scholar] [CrossRef]
Goldsmith, A. Wireless Communications; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar]
Marzetta, T.L.; Yang, H. Fundamentals of Massive MIMO; Cambridge University Press: Cambridge, UK, 2016. [Google Scholar]
Candes, E.; Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat. 2007, 35, 2313–2351. [Google Scholar]
Bach, F. Structured sparsity-inducing norms through submodular functions. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–9 December 2010; Volume 23. [Google Scholar]
Aster, R.C.; Borchers, B.; Thurber, C.H. Parameter Estimation and Inverse Problems; Elsevier: Amsterdam, The Netherlands, 2018. [Google Scholar]
McDonald, G.C. Ridge regression. Wiley Interdiscip. Rev. Comput. Stat. 2009, 1, 93–100. [Google Scholar] [CrossRef]
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar] [CrossRef]
Varah, J.M. Pitfalls in the numerical solution of linear ill-posed problems. SIAM J. Sci. Stat. Comput. 1983, 4, 164–176. [Google Scholar] [CrossRef]
Thrampoulidis, C.; Abbasi, E.; Hassibi, B. Precise error analysis of regularized M-estimators in high dimensions. IEEE Trans. Inf. Theory 2018, 64, 5592–5628. [Google Scholar] [CrossRef]
Bickel, P.J.; Ritov, Y.; Tsybakov, A.B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 2009, 37, 1705–1732. [Google Scholar] [CrossRef]
Negahban, S.; Yu, B.; Wainwright, M.J.; Ravikumar, P.K. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 7–10 December 2009; pp. 1348–1356. [Google Scholar]
Wainwright, M.J. Sharp thresholds for high-dimensional and noisy sparsity recovery using -constrained quadratic programming (Lasso). IEEE Trans. Inf. Theory 2009, 55, 2183–2202. [Google Scholar] [CrossRef]
Belloni, A.; Chernozhukov, V.; Wang, L. Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 2011, 98, 791–806. [Google Scholar] [CrossRef]
Li, Y.H.; Hsieh, Y.P.; Zerbib, N.; Cevher, V. A geometric view on constrained M-estimators. arXiv 2015, arXiv:1506.08163. [Google Scholar]
Bayati, M.; Montanari, A. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inf. Theory 2011, 57, 764–785. [Google Scholar] [CrossRef]
Bayati, M.; Montanari, A. The LASSO risk for Gaussian matrices. IEEE Trans. Inf. Theory 2012, 58, 1997–2017. [Google Scholar] [CrossRef]
Donoho, D.; Montanari, A. High dimensional robust m-estimation: Asymptotic variance via approximate message passing. Probab. Theory Relat. Fields 2016, 166, 935–969. [Google Scholar] [CrossRef]
Rangan, S.; Goyal, V.; Fletcher, A.K. Asymptotic analysis of map estimation via the replica method and compressed sensing. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 7–10 December 2009; Volume 22. [Google Scholar]
Kabashima, Y.; Wadayama, T.; Tanaka, T. Statistical mechanical analysis of a typical reconstruction limit of compressed sensing. In Proceedings of the 2010 IEEE International Symposium on Information Theory, Austin, TX, USA, 13–18 June 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 1533–1537. [Google Scholar]
Couillet, R.; Debbah, M. Random Matrix Methods for Wireless Communications; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
Karoui, N.E. Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: Rigorous results. arXiv 2013, arXiv:1311.2445. [Google Scholar]
Liao, Z.; Couillet, R. Random matrices meet machine learning: A large dimensional analysis of ls-svm. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2397–2401. [Google Scholar]
El Karoui, N. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Relat. Fields 2018, 170, 95–175. [Google Scholar] [CrossRef]
Stojnic, M. Recovery thresholds for ℓ₁ optimization in binary compressed sensing. In Proceedings of the 2010 IEEE International Symposium on Information Theory, Austin, TX, USA, 13–18 June 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 1593–1597. [Google Scholar]
Stojnic, M. A framework to characterize performance of lasso algorithms. arXiv 2013, arXiv:1303.7291. [Google Scholar]
Thrampoulidis, C.; Oymak, S.; Hassibi, B. Regularized Linear Regression: A Precise Analysis of the Estimation Error. In Proceedings of the COLT, Paris, France, 3–6 July 2015; pp. 1683–1709. [Google Scholar]
Thrampoulidis, C.; Panahi, A.; Guo, D.; Hassibi, B. Precise error analysis of the lasso. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 3467–3471. [Google Scholar]
Thrampoulidis, C.; Xu, W.; Hassibi, B. Symbol error rate performance of box-relaxation decoders in massive MIMO. IEEE Trans. Signal Process. 2018, 66, 3377–3392. [Google Scholar] [CrossRef]
Atitallah, I.B.; Thrampoulidis, C.; Kammoun, A.; Al-Naffouri, T.Y.; Hassibi, B.; Alouini, M.S. Ber analysis of regularized least squares for bpsk recovery. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 4262–4266. [Google Scholar]
Alrashdi, A.M.; Kammoun, A.; Muqaibel, A.H.; Al-Naffouri, T.Y. Asymptotic Performance of Box-RLS Decoders under Imperfect CSI with Optimized Resource Allocation. IEEE Open J. Commun. Soc. 2022, 3, 2051–2075. [Google Scholar] [CrossRef]
Atitallah, I.B.; Thrampoulidis, C.; Kammoun, A.; Al-Naffouri, T.Y.; Alouini, M.S.; Hassibi, B. The BOX-LASSO with application to GSSK modulation in massive MIMO systems. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1082–1086. [Google Scholar]
Alrashdi, A.M.; Alrashdi, A.E.; Alghadhban, A.; Eleiwa, M.A. Optimum GSSK Transmission in Massive MIMO Systems Using the Box-LASSO Decoder. IEEE Access 2022, 10, 15845–15859. [Google Scholar] [CrossRef]
Alrashdi, A.M.; Atitallah, I.B.; Al-Naffouri, T.Y. Precise performance analysis of the box-elastic net under matrix uncertainties. IEEE Signal Process. Lett. 2019, 26, 655–659. [Google Scholar] [CrossRef]
Hayakawa, R.; Hayashi, K. Binary vector reconstruction via discreteness-aware approximate message passing. In Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, 12–15 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1783–1789. [Google Scholar]
Hayakawa, R.; Hayashi, K. Asymptotic Performance of Discrete-Valued Vector Reconstruction via Box-Constrained Optimization With Sum of ℓ₁ Regularizers. IEEE Trans. Signal Process. 2020, 68, 4320–4335. [Google Scholar] [CrossRef]
Deng, Z.; Kammoun, A.; Thrampoulidis, C. A model of double descent for high-dimensional binary linear classification. Inf. Inference J. IMA 2022, 11, 435–495. [Google Scholar] [CrossRef]
Kini, G.R.; Thrampoulidis, C. Analytic study of double descent in binary classification: The impact of loss. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 2527–2532. [Google Scholar]
Salehi, F.; Abbasi, E.; Hassibi, B. The performance analysis of generalized margin maximizers on separable data. In Proceedings of the International Conference on Machine Learning, Virtual Online, 13–18 July 2020; PMLR: Mc Kees Rocks, PA, USA, 2020; pp. 8417–8426. [Google Scholar]
Dhifallah, O.; Thrampoulidis, C.; Lu, Y.M. Phase retrieval via linear programming: Fundamental limits and algorithmic improvements. In Proceedings of the 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 3–6 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1071–1077. [Google Scholar]
Salehi, F.; Abbasi, E.; Hassibi, B. A precise analysis of phasemax in phase retrieval. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 976–980. [Google Scholar]
Bosch, D.; Panahi, A.; Hassibi, B. Precise Asymptotic Analysis of Deep Random Feature Models. arXiv 2023, arXiv:2302.06210. [Google Scholar]
Dhifallah, O.; Lu, Y.M. A precise performance analysis of learning with random features. arXiv 2020, arXiv:2008.11904. [Google Scholar]
Dhifallah, O.; Lu, Y.M. Phase transitions in transfer learning for high-dimensional perceptrons. Entropy 2021, 23, 400. [Google Scholar] [CrossRef]
Ting, M.; Raich, R.; Hero, A.O., III. Sparse image reconstruction for molecular imaging. IEEE Trans. Image Process. 2009, 18, 1215–1227. [Google Scholar] [CrossRef]
Gui, G.; Peng, W.; Wang, L. Improved sparse channel estimation for cooperative communication systems. Int. J. Antennas Propag. 2012, 2012, 476509. [Google Scholar] [CrossRef]
Luenberger, D.G.; Ye, Y. Linear and Nonlinear Programming; Springer: Berlin/Heidelberg, Germany, 1984; Volume 2. [Google Scholar]
Salehi, F.; Abbasi, E.; Hassibi, B. The impact of regularization on high-dimensional logistic regression. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
Donoho, D.L.; Maleki, A.; Montanari, A. Message-passing algorithms for compressed sensing. Proc. Natl. Acad. Sci. USA 2009, 106, 18914–18919. [Google Scholar] [CrossRef]
Hayakawa, R. Noise variance estimation using asymptotic residual in compressed sensing. arXiv 2020, arXiv:2009.13678. [Google Scholar]
Suliman, M.A.; Alrashdi, A.M.; Ballal, T.; Al-Naffouri, T.Y. SNR estimation in linear systems with Gaussian matrices. IEEE Signal Process. Lett. 2017, 24, 1867–1871. [Google Scholar] [CrossRef]
Kobayashi, H.; Mark, B.L.; Turin, W. Probability, Random Processes, and Statistical Analysis: Applications to Communications, Signal Processing, Queueing Theory and Mathematical Finance; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
Donoho, D.L.; Maleki, A.; Montanari, A. The noise-sensitivity phase transition in compressed sensing. IEEE Trans. Inf. Theory 2011, 57, 6920–6941. [Google Scholar] [CrossRef]
Thrampoulidis, C.; Abbasi, E.; Hassibi, B. Lasso with non-linear measurements is equivalent to one with linear measurements. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, USA, 7–12 December 2015; pp. 3420–3428. [Google Scholar]
Abbasi, E.; Salehi, F.; Hassibi, B. Universality in learning from linear measurements. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
Hu, H.; Lu, Y.M. Universality laws for high-dimensional learning with random features. IEEE Trans. Inf. Theory 2022, 69, 1932–1964. [Google Scholar] [CrossRef]
Han, Q.; Shen, Y. Universality of regularized regression estimators in high dimensions. arXiv 2022, arXiv:2206.07936. [Google Scholar]
Dudeja, R.; Bakhshizadeh, M. Universality of linearized message passing for phase retrieval with structured sensing matrices. IEEE Trans. Inf. Theory 2022, 68, 7545–7574. [Google Scholar] [CrossRef]
Gerace, F.; Krzakala, F.; Loureiro, B.; Stephan, L.; Zdeborová, L. Gaussian Universality of Perceptrons with Random Labels. arXiv 2023, arXiv:2205.13303. [Google Scholar]
Chin, K.; DeVries, S.; Fridlyand, J.; Spellman, P.T.; Roydasgupta, R.; Kuo, W.L.; Lapuk, A.; Neve, R.M.; Qian, Z.; Ryder, T.; et al. Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell 2006, 10, 529–541. [Google Scholar] [CrossRef]
Güçkiran, K.; Cantürk, İ.; Özyilmaz, L. DNA microarray gene expression data classification using SVM, MLP, and RF with feature selection methods relief and LASSO. Süleyman Demirel Üniv. Fen Bilim. Enstitüsü Derg. 2019, 23, 126–132. [Google Scholar] [CrossRef]
Sun, L.; Hui, A.M.; Su, Q.; Vortmeyer, A.; Kotliarov, Y.; Pastorino, S.; Passaniti, A.; Menon, J.; Walling, J.; Bailey, R.; et al. Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain. Cancer Cell 2006, 9, 287–300. [Google Scholar] [CrossRef]
Alon, U.; Barkai, N.; Notterman, D.A.; Gish, K.; Ybarra, S.; Mack, D.; Levine, A.J. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 1999, 96, 6745–6750. [Google Scholar] [CrossRef] [PubMed]
Li, J.; Dong, W.; Meng, D. Grouped gene selection of cancer via adaptive sparse group lasso based on conditional mutual information. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017, 15, 2028–2038. [Google Scholar] [CrossRef] [PubMed]
Belkin, M.; Hsu, D.; Ma, S.; Mandal, S. Reconciling modern machine-learning practice and the classical bias—Variance trade-off. Proc. Natl. Acad. Sci. USA 2019, 116, 15849–15854. [Google Scholar] [CrossRef] [PubMed]
Gordon, Y. On Milman’s inequality and random subspaces which escape through a mesh in ℝⁿ. In Geometric Aspects of Functional Analysis; Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 1988; pp. 84–106. [Google Scholar]
Chu, H.T.; Toh, K.C.; Zhang, Y. On Regularized Square-root Regression Problems: Distributionally Robust Interpretation and Fast Computations. J. Mach. Learn. Res. 2022, 23, 13885–13923. [Google Scholar]
Bunea, F.; Lederer, J.; She, Y. The group square-root lasso: Theoretical properties and fast algorithms. IEEE Trans. Inf. Theory 2013, 60, 1313–1325. [Google Scholar] [CrossRef]

Figure 1. A system model comparison between the original G-PCR recovery algorithm and its scalar decoupled version. (a) Original system. (b) Scalar decoupled system.

Figure 2. Performance of the G-PCR vs. the penalization factor for a sparse linear regression. The parameters are set as follows:

p = 300, ζ = 1.5, ρ = 0.2, σ_{ϵ}^{2} = 0.1, L = 1, and U = 1

. The simulation results are averaged over 50 independent Monte Carlo trials. (a) The prediction risk. (b) The cosine similarity.

Figure 2. Performance of the G-PCR vs. the penalization factor for a sparse linear regression. The parameters are set as follows:

p = 300, ζ = 1.5, ρ = 0.2, σ_{ϵ}^{2} = 0.1, L = 1, and U = 1

. The simulation results are averaged over 50 independent Monte Carlo trials. (a) The prediction risk. (b) The cosine similarity.

Figure 3. Performance comparison between the G-PCR and G-PR. For the numerical simulations, the results are averaged over 100 independent trials, with

p = 128, ζ = 0.85, ρ = 0.1, σ_{ϵ}^{2} = 0.2

,

σ_{e}^{2} = 0.05, L = 1, and U = 1

. (a) The prediction risk. (b) The cosine similarity.

Figure 3. Performance comparison between the G-PCR and G-PR. For the numerical simulations, the results are averaged over 100 independent trials, with

p = 128, ζ = 0.85, ρ = 0.1, σ_{ϵ}^{2} = 0.2

,

σ_{e}^{2} = 0.05, L = 1, and U = 1

. (a) The prediction risk. (b) The cosine similarity.

Figure 4. Probability mass function (PMF) of a sparse-binary distribution.

Figure 5. Support recovery performance of the G-PCR versus the penalization factor for a sparse-binary signal recovery. The parameters are set as follows:

E = 1, p = 300, ζ = 0.85, ρ = 0.2

,

σ_{ϵ}^{2} = 0.05, ξ = 0.1, L = 0, and U = 1

. The simulations are averaged over 100 independent Monte Carlo trials. (a) Probability of misdetection. (b) Probability of false alarm.

Figure 5. Support recovery performance of the G-PCR versus the penalization factor for a sparse-binary signal recovery. The parameters are set as follows:

E = 1, p = 300, ζ = 0.85, ρ = 0.2

,

σ_{ϵ}^{2} = 0.05, ξ = 0.1, L = 0, and U = 1

. The simulations are averaged over 100 independent Monte Carlo trials. (a) Probability of misdetection. (b) Probability of false alarm.

Figure 6. Performance of the

l_{2}^{2}

-norm G-PCR vs. the penalization factor for a Gaussian target vector. The parameters are set as follows:

p = 500, ζ = 1.3, σ_{θ}^{2} = 1, σ_{e}^{2} = 0.01, σ_{ϵ}^{2} = 0.01, L = 1,

and

U = 1

. The results are averaged over 50 independent realizations. (a) The prediction risk. (b) The cosine similarity.

Figure 6. Performance of the

l_{2}^{2}

-norm G-PCR vs. the penalization factor for a Gaussian target vector. The parameters are set as follows:

p = 500, ζ = 1.3, σ_{θ}^{2} = 1, σ_{e}^{2} = 0.01, σ_{ϵ}^{2} = 0.01, L = 1,

and

U = 1

. The results are averaged over 50 independent realizations. (a) The prediction risk. (b) The cosine similarity.

Figure 7. Classification error rate (

C_{err}

) of the G-PCR vs.the penalization factor for a binary target vector. The parameters are set as follows:

p = 500, ζ = 0.85, σ_{e}^{2} = 0.01, σ_{ϵ}^{2} = 0.02, and L = U = 1

. The results are averaged over 100 independent Monte Carlo trials.

Figure 7. Classification error rate (

C_{err}

) of the G-PCR vs.the penalization factor for a binary target vector. The parameters are set as follows:

p = 500, ζ = 0.85, σ_{e}^{2} = 0.01, σ_{ϵ}^{2} = 0.02, and L = U = 1

. The results are averaged over 100 independent Monte Carlo trials.

Figure 8. Performance the G-PCR for a sparse-Gaussian target vector. We set the parameters as

p = 180, ζ = 0.95, σ_{e}^{2} = 0.1, σ_{ϵ}^{2} = 0.2, ρ = 0.1, and L = U = \sqrt{2}

. The results are averaged over 50 independent trials. (a) The prediction risk. (b) The similarity.

Figure 8. Performance the G-PCR for a sparse-Gaussian target vector. We set the parameters as

p = 180, ζ = 0.95, σ_{e}^{2} = 0.1, σ_{ϵ}^{2} = 0.2, ρ = 0.1, and L = U = \sqrt{2}

. The results are averaged over 50 independent trials. (a) The prediction risk. (b) The similarity.

Figure 9. Prediction risk as a function of the penalization factor

α

. Here, the data matrix X is a standardized real dataset. We used a sparse-Gaussian vector

θ_{0}

, and we generated the observations as

y = X θ_{0} + ϵ

. The G-PCR with

l_{1}

-norm penalty is used to obtain

\hat{θ}

with

\hat{X} = X + E

. The parameters are set as

ζ = 0.75, ρ = 0.1, σ_{ϵ}^{2} = 0.2, σ_{e}^{2} = 0.1

, and

L = U = 1

. The results are averaged over 200 independent trials. (a) Breast cancer data. (b) Glioma disease data. (c) Colon cancer data.

Figure 9. Prediction risk as a function of the penalization factor

α

. Here, the data matrix X is a standardized real dataset. We used a sparse-Gaussian vector

θ_{0}

, and we generated the observations as

y = X θ_{0} + ϵ

. The G-PCR with

l_{1}

-norm penalty is used to obtain

\hat{θ}

with

\hat{X} = X + E

. The parameters are set as

ζ = 0.75, ρ = 0.1, σ_{ϵ}^{2} = 0.2, σ_{e}^{2} = 0.1

, and

L = U = 1

. The results are averaged over 200 independent trials. (a) Breast cancer data. (b) Glioma disease data. (c) Colon cancer data.

Figure 10. Prediction risk as a function of the aspect ratio

ζ

. We used G-PCR with an

l_{1}

-norm penalty and sparse-binary vector with

ρ = 0.2, σ_{e}^{2} = 0.1

,

L = 0

, and

U = 1

. (a)

σ_{ϵ}^{2} = 0.1

. (b)

σ_{ϵ}^{2} = 0.3

. Illustration of the double descent and how optimal penalization can mitigate it.

Figure 10. Prediction risk as a function of the aspect ratio

ζ

. We used G-PCR with an

l_{1}

-norm penalty and sparse-binary vector with

ρ = 0.2, σ_{e}^{2} = 0.1

,

L = 0

, and

U = 1

. (a)

σ_{ϵ}^{2} = 0.1

. (b)

σ_{ϵ}^{2} = 0.3

. Illustration of the double descent and how optimal penalization can mitigate it.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Alrashdi, A.M.; Alazmi, M.; Alrasheedi, M.A. Generalized Penalized Constrained Regression: Sharp Guarantees in High Dimensions with Noisy Features. Mathematics 2023, 11, 3706. https://doi.org/10.3390/math11173706

AMA Style

Alrashdi AM, Alazmi M, Alrasheedi MA. Generalized Penalized Constrained Regression: Sharp Guarantees in High Dimensions with Noisy Features. Mathematics. 2023; 11(17):3706. https://doi.org/10.3390/math11173706

Chicago/Turabian Style

Alrashdi, Ayed M., Meshari Alazmi, and Masad A. Alrasheedi. 2023. "Generalized Penalized Constrained Regression: Sharp Guarantees in High Dimensions with Noisy Features" Mathematics 11, no. 17: 3706. https://doi.org/10.3390/math11173706

APA Style

Alrashdi, A. M., Alazmi, M., & Alrasheedi, M. A. (2023). Generalized Penalized Constrained Regression: Sharp Guarantees in High Dimensions with Noisy Features. Mathematics, 11(17), 3706. https://doi.org/10.3390/math11173706

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Generalized Penalized Constrained Regression: Sharp Guarantees in High Dimensions with Noisy Features

Abstract

1. Introduction

1.1. Notations and Definitions

1.2. Motivation

1.3. Summary of Contributions and Related Work

2. Problem Setup

2.1. Dataset Model

2.2. Main Assumptions

2.3. Generalized Penalized Constrained Regression (G-PCR)

3. Sharp Asymptotics

3.1. Measures of Performance

3.2. High-Dimensional Performance Evaluation

4. Sparse Linear Regression

4.1. Asymptotic Behavior of Sparse G-PCR

4.2. Support Recovery

Example: Sparse-Binary Target Vectors

5. G-PCR with ℓ 2 2 -Norm Penalization

5.1. Numerical Illustration

5.2. Binary Target Vector Estimation

5.3. Unpenalized Regression

6. Additional Numerical Experiments

6.1. Synthetic Data: Universality of the Gaussian Design

6.2. Real-World Data

6.3. Double Descent Phenomenon

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A. Proof of the Main Results

Appendix A.1. Main Analysis Framework: CGMT

Appendix A.2. Sharp Analysis of the G-PCR

Appendix A.2.1. Primal and Auxiliary Problems of the G-PCR

Appendix A.2.2. Simplifying the Auxiliary Problem

Appendix A.2.3. General Performance Metric: Proof of Theorem 1

Appendix A.2.4. Prediction Risk Analysis: Proof of Corollary 1

Appendix A.2.5. Similarity Analysis: Proof of Corollary 2

Appendix B. A Note on the Square-Root Generalized Penalized Constrained Regression

Sqrt G-PCR Learning Algorithm

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

5. G-PCR with $ℓ_{2}^{2}$ -Norm Penalization