Minimax Rates of ℓp-Losses for High-Dimensional Linear Errors-in-Variables Models over ℓq-Balls

In this paper, the high-dimensional linear regression model is considered, where the covariates are measured with additive noise. Unlike most existing methods, which assume that the true covariates are fully observed, the results in this paper require only that the corrupted covariate matrix is observed. Using information-theoretic tools, the minimax rates of convergence for estimation are investigated in terms of the ℓp (1 ≤ p < ∞)-losses under a general sparsity assumption on the underlying regression parameter and some regularity conditions on the observed covariate matrix. The established lower and upper bounds on the minimax risks agree up to constant factors when p = 2, and together they provide the information-theoretic limits of estimating a sparse vector in the high-dimensional linear errors-in-variables model. An estimator for the underlying parameter is also proposed and shown to be minimax optimal in the ℓ2-loss.


Introduction
In various fields of applied sciences and engineering, such as machine learning [1], a fundamental problem is to estimate an underlying parameter β* ∈ R^d of a linear regression model y_i = ⟨X_i·, β*⟩ + e_i, for i = 1, 2, . . . , n, (1) where {(X_i·, y_i)}_{i=1}^n are i.i.d. observations (X_i· ∈ R^d) and e ∈ R^n is the random noise. In matrix form, model (1) can be written as y = Xβ* + e, where X = (X_1·, . . . , X_n·)^T ∈ R^{n×d} and y, e ∈ R^n. The covariates X_i· (i = 1, 2, . . . , n) are usually assumed to be fully observed in standard formulations. However, this assumption is far from reality since, in general, measurement error cannot be avoided. In many real-world applications, due to the lack of observation or instrumental constraints, the collected data, such as remote sensing data, are often perturbed and noisy [2]. It has been shown in [3] that misleading inference results may be obtained if a method designed for clean data is applied naively to noisy data. Therefore, it is more realistic to explore the case where only corrupted versions of the true covariates X_i· are observed; see, e.g., [4]. This is known as the measurement error model in the literature.
Estimation in the presence of measurement errors has attracted interest for a long time. Bickel and Ritov [5] first studied linear measurement error models and proposed an efficient estimator. Then, Stefanski and Carroll [6] investigated generalized linear measurement error models and constructed consistent estimators. Extensive results have also been established on parameter estimation and variable selection in both parametric and nonparametric settings; see [7,8] and references therein. It should be noted that these results are only applicable to classical low-dimensional (i.e., n ≥ d) statistical models.
In the past two decades, high-dimensional statistical models, where the number of observations is much less than the number of predictors (i.e., n ≪ d), have received much attention and achieved fruitful results in a wide range of research areas; see [9,10] for a detailed review. Most existing results are only suitable for models with clean data, while some researchers have begun to focus on the measurement error case. For example, Loh and Wainwright studied the high-dimensional sparse linear regression model with corrupted covariates. Though the proposed estimator involves solving a nonconvex optimization problem, they proved that the global and stationary points are statistically consistent; see [11,12]. Datta and Zou [13] proposed the Convex Conditioned Lasso (CoCoLasso), which enjoys the convexity benefits of the Lasso in both estimation and algorithm and can handle a class of corrupted datasets, including the cases of additive or multiplicative measurement error. Li et al. [14] investigated a general nonconvex estimation method from statistical and computational aspects, and the results can be immediately applied to corrupted errors-in-variables linear regression. Apart from studying statistical convergence rates and designing efficient algorithms for certain estimators, it is also fundamental to understand the information-theoretic limitations of statistical inference, against which computationally efficient procedures can be benchmarked. Such fundamental limits are usually studied via minimax rates, which seek an estimator that minimizes the worst-case loss and can thus reveal gaps between the performance of a computationally efficient algorithm and that of an optimal one. The minimax rate is typically analyzed from two aspects, namely information-theoretic lower bounds and statistical upper bounds. On the information-theoretic side, the Kullback-Leibler (KL) divergence is commonly used to provide lower bounds [15].
Recently, in [16], Loh provided a detailed review of a variety of techniques for deriving information-theoretic lower bounds for minimax estimation and learning, focusing on problem settings including community recovery, parameter and function estimation, and online learning for multi-armed bandits. On the statistical side, a specific estimator is typically constructed to derive upper bounds; see, e.g., [17,18]. For high-dimensional linear regression with additive errors, Loh and Wainwright [19] established minimax rates of convergence for estimating the unknown parameter in the ℓ2-loss. The proposed estimator was also shown to be minimax optimal in the additive error case under the ℓ2-loss, assuming that the true parameter is exactly sparse, that is, β* has at most s ≪ d nonzero elements, which is also known as the exact sparsity assumption.
However, this exact sparsity assumption may sometimes be too restrictive in real applications. For example, in image processing, it is a standard phenomenon that wavelet coefficients of images usually exhibit exponential decay but are rarely exactly 0 (see, e.g., [20]). Other high-dimensional applications include compressed sensing [21], genomic analysis [22], signal processing [23], and so on, where it is not suitable to impose an exact sparsity assumption on the underlying parameter. Hence, it is necessary to investigate minimax rates of estimation when the exact sparsity assumption does not hold.
Our main purpose in the present study is to investigate the more general situation in which the coefficients of the true parameter are not exactly zero, and to provide minimax rates of convergence for estimation in sparse linear regression with additive errors. More precisely, we assume that for fixed q ∈ (0, 1], the ℓq-norm of β*, defined as ‖β*‖_q := (∑_{j=1}^d |β*_j|^q)^{1/q}, is bounded from above (for q = 0, ‖β*‖_0 counts the nonzero entries of β*). Note that this assumption reduces to the exact sparsity assumption when q = 0. When q ∈ (0, 1], this type of sparsity is known as soft sparsity. The exact sparsity assumption has been widely used for statistical inference, while the soft sparsity assumption has attracted relatively little attention apart from the works [24][25][26]. Specifically, under both exact and soft sparsity assumptions, Raskutti et al. [24] and Ye and Zhang [26] provided minimax rates of convergence for estimation in high-dimensional linear regression, respectively; Wang et al. [25] developed optimal rates of convergence and proposed an adaptive ℓq-aggregation strategy via model mixing which attains the established optimal rate automatically. It is worth noting that the results in [24][25][26] are all obtained for clean data and cannot be applied to the errors-in-variables model. This is a fundamental difference from our present study. The main contributions of this paper are as follows. Assuming that the regression parameter is softly sparse, on the information-theoretic side we establish lower bounds on the minimax risks for ℓp (1 ≤ p < ∞)-losses by virtue of the mutual information; these bounds hold for any estimator for the model, regardless of the specific method. On the statistical side, we propose an estimator which can be solved efficiently and then provide upper bounds on the ℓ2-loss between the estimator and the true parameter. Moreover, the lower and upper bounds agree up to constant factors when p = 2, implying that the proposed estimator is minimax optimal in the ℓ2-loss.
The remainder of this paper is organized as follows. In Section 2, we provide background on the errors-in-variables linear regression model and some regularity conditions on the observed covariate matrix. In Section 3, we establish our main results on lower and upper bounds on the minimax risks for ℓp (1 ≤ p < ∞)-losses over ℓq-balls. Conclusions and future work are discussed in Section 4.
We end this section by introducing some notation for future reference. We use the Greek lowercase letter β to denote vectors. All vectors are column vectors, following classical mathematical convention. A vector β is supported on S if and only if S = {i ∈ {1, 2, . . . , d} : β_i ≠ 0}, and S is called the support of β, denoted by supp(β), namely supp(β) := {i ∈ {1, 2, . . . , d} : β_i ≠ 0}.

Problem Setup
In this section, we begin with a precise formulation of the problem and then impose some regularity assumptions on the observed matrix.
Recall the standard linear regression model (1). One of the main types of measurement error is additive error. Specifically, for each i = 1, 2, . . . , n, we observe Z_i· = X_i· + W_i·, where W_i· ∈ R^d is a random vector independent of X_i· with mean 0 and known covariance matrix Σ_w. When the noise covariance Σ_w is unknown, there are methods to estimate it from the observed data; see, e.g., [4]. For example, a simple method is to estimate Σ_w from blank independent observations of the noise. Specifically, suppose that one independently observes a matrix W_0 ∈ R^{n×d} consisting of n i.i.d. noise vectors. Then Σ̂_w = (1/n) W_0^T W_0 can be used as the estimate of Σ_w. Some more sophisticated variants of this method are also provided in [4].
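A minimal numerical sketch of this plug-in estimate (the dimensions and variable names here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 20
sigma_w = 0.5

# Blank observations: n i.i.d. draws of pure noise W_i ~ N(0, sigma_w^2 I_d)
W0 = rng.normal(scale=sigma_w, size=(n, d))

# Plug-in estimate: Sigma_w_hat = (1/n) W0^T W0
Sigma_w_hat = W0.T @ W0 / n

# The estimate should be close to the true covariance sigma_w^2 I_d
err = np.linalg.norm(Sigma_w_hat - sigma_w**2 * np.eye(d), ord=2)
print(err)
```

With n much larger than d, the operator-norm error above shrinks at the familiar √(d/n) rate, which is why blank noise observations suffice in practice.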
Throughout this paper, we assume that for i = 1, 2, . . . , n, the vectors X_i·, W_i·, and e are Gaussian with mean 0 and covariance matrices σ_x^2 I_d (σ_x > 0), σ_w^2 I_d, and σ_e^2 I_n, respectively, and we write σ_z^2 = σ_x^2 + σ_w^2 for simplicity. Following the previous works [11,12], we fix i ∈ {1, 2, . . . , n} and write Σ_x for the covariance matrix of X_i· (i.e., Σ_x = σ_x^2 I_d). As has been discussed in [11], an unbiased and suitable choice of the surrogate pair (Γ̂, Υ̂) for the additive error case is given by Γ̂ := Z^T Z/n − Σ_w and Υ̂ := Z^T y/n. Under the high-dimensional scenario (n ≪ d), the matrix Γ̂, which estimates Σ_x in the corrupted case, is never positive semidefinite. To be specific, the matrix Z^T Z has rank at most n, and the positive definite matrix Σ_w is then subtracted to obtain Γ̂; consequently, Γ̂ necessarily has negative eigenvalues, regardless of the amount of noise. However, this does not affect the present results. In particular, though the indefiniteness of Γ̂ leads to a nonconvex optimization problem in estimating β* (cf. (14)) as well as in deriving the upper bound, a weaker condition (cf. Assumption 2) allows further analysis.
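The sign issue is easy to see numerically. The sketch below (with illustrative dimensions, assuming the surrogate Γ̂ = Z^T Z/n − Σ_w from [11]) verifies that Γ̂ must have negative eigenvalues whenever n < d:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 200          # high-dimensional regime: n << d
sigma_x, sigma_w = 1.0, 0.5

X = rng.normal(scale=sigma_x, size=(n, d))
W = rng.normal(scale=sigma_w, size=(n, d))
Z = X + W               # corrupted covariates

# Surrogate gram matrix from [11]: Gamma_hat = Z^T Z / n - Sigma_w
Gamma_hat = Z.T @ Z / n - sigma_w**2 * np.eye(d)

# Z^T Z / n is PSD with rank at most n < d, so subtracting the positive
# definite Sigma_w = sigma_w^2 I forces at least d - n eigenvalues of
# Gamma_hat to equal exactly -sigma_w^2 < 0.
eigvals = np.linalg.eigvalsh(Gamma_hat)
print(eigvals.min())
```

This is exactly the rank argument in the text: no amount of data in the n ≪ d regime can make Γ̂ positive semidefinite.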
Instead of assuming the regression parameter β* is exactly sparse (i.e., |supp(β*)| ≪ d), we use a more general notion to characterize the sparsity of β*. Specifically, we assume that for q ∈ [0, 1] and a radius R_q > 0, the parameter lies in the ℓq-ball B_q(R_q) := {β ∈ R^d : ∑_{j=1}^d |β_j|^q ≤ R_q}, with the convention that for q = 0 the sum counts the number of nonzero entries of β. The use of the ℓq-ball is a common and popular way to measure the degree of sparsity (strictly speaking, the above sets are not true "balls", as they fail to be convex when q ∈ [0, 1)). Note that β ∈ B_0(R_0) corresponds to the case that β is exactly sparse, while for q ∈ (0, 1], β ∈ B_q(R_q) corresponds to weak sparsity, which enforces a certain decay rate on the ordered entries of β. Throughout this paper, let q ∈ [0, 1] be fixed, and assume that β* ∈ B_q(R_q) unless otherwise specified. Moreover, without loss of generality, we assume that ‖β*‖_2 = 1 and define S_2(1) := {β ∈ R^d : ‖β‖_2 = 1}, i.e., the ℓ2 unit sphere. Then it follows that β* ∈ B_q(R_q) ∩ S_2(1).
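A small numerical illustration of the two sparsity regimes (assuming the usual convention that the q = 0 "norm" counts nonzeros): a weakly sparse vector with polynomially decaying entries has a bounded ℓ1-ball measure even though no entry is exactly zero.

```python
import numpy as np

def lq_ball_measure(beta, q):
    """Return ||beta||_q^q for q in (0, 1], or the number of nonzeros for q = 0."""
    if q == 0:
        return np.count_nonzero(beta)
    return np.sum(np.abs(beta) ** q)

# Exactly sparse vector: membership in B_0(R_0) counts the support size
beta_hard = np.array([3.0, -1.0, 0.0, 0.0, 0.0])
print(lq_ball_measure(beta_hard, 0))

# Weakly sparse vector: entries decay as j^(-2) but are never exactly 0,
# so beta_soft lies in no B_0(R_0) with R_0 < d, yet for q = 1 the measure
# sum_j j^(-2) stays below pi^2/6 for every dimension d
d = 10_000
beta_soft = 1.0 / np.arange(1, d + 1) ** 2
print(lq_ball_measure(beta_soft, 1.0))
```

This is the sense in which B_q(R_q) with q ∈ (0, 1] "endows a decay rate" rather than forcing entries to vanish.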
In order to estimate the regression parameter, one usually considers an estimator β̂ : R^{n×d} × R^n → R^d, which is a measurable function of the observed data {(Z_i·, y_i)}_{i=1}^n. Then, for the purpose of assessing the estimation quality of β̂, it is typical to introduce a loss function L(β̂, β*), which represents the loss incurred by the estimator β̂ when the true parameter is β* ∈ B_q(R_q) ∩ S_2(1). Finally, in the minimax formalism, we aim to choose an estimator that minimizes the worst-case loss inf_{β̂} sup_{β* ∈ B_q(R_q) ∩ S_2(1)} L(β̂, β*). Specifically, in this paper, we consider the ℓp-losses for p ∈ [1, +∞), namely L(β̂, β*) = ‖β̂ − β*‖_p. We then impose some regularity conditions on the observed matrix Z, which facilitate the analysis of the minimax rates. The first assumption requires that the columns of Z are bounded from above in ℓ2-norm.

Assumption 1 (Column normalization).
There exists a constant 0 < κ_c < +∞ such that ‖Z_·j‖_2/√n ≤ κ_c for all j = 1, 2, . . . , d. The second assumption imposes a lower bound on the restricted eigenvalue of the surrogate gram matrix Γ̂, which is, in other words, a lower bound on the restricted curvature.
Assumption 2 (Restricted eigenvalue condition). There exist a constant κ_l > 0 and a function τ_l(n, d) such that for all β ∈ B_q(2R_q), β^T Γ̂ β ≥ κ_l ‖β‖_2^2 − τ_l(n, d).

Remark 1. (i)
Note that though we focus on the random design case in this article, Assumptions 1 and 2 are stated in deterministic form. This choice makes them applicable to both fixed and random design matrices. Specifically, previous studies have shown that Assumptions 1 and 2 are satisfied by a wide range of random matrices with high probability; see, e.g., [11,14,27]. Meanwhile, Assumptions 1 and 2 make it possible to analyze the fixed design case, in which the matrices are usually chosen by researchers with suitable constants, i.e., κ_c in Assumption 1 and κ_l, τ_l in Assumption 2. This deterministic form of the regularity condition on the design matrix is also commonly adopted in modern high-dimensional statistics and machine learning; see, e.g., [11,12,14,28].
(ii) For the Gaussian model, we assumed that for i = 1, 2, . . . , n, the vectors X_i· and W_i· are independent Gaussian with mean 0 and covariance matrices σ_x^2 I_d and σ_w^2 I_d, respectively, so the observed covariate Z_i· is also Gaussian with mean 0 and covariance matrix Σ_z = (σ_x^2 + σ_w^2) I_d. Recalling that σ_z^2 = σ_x^2 + σ_w^2, one has Z_i· ∼ N(0, σ_z^2 I_d). Furthermore, since the observations are i.i.d., each column Z_·j (j = 1, . . . , d) has i.i.d. elements, and thus Z_·j ∼ N(0, σ_z^2 I_n).
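Remark 1(ii) can be checked numerically: since Z_·j ∼ N(0, σ_z^2 I_n), the rescaled column norm ‖Z_·j‖_2/√n concentrates around σ_z = √(σ_x^2 + σ_w^2), which is the mechanism by which Assumption 1 holds with high probability under this Gaussian model (a simulation sketch with illustrative dimensions, not part of the paper's proofs):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4000, 100
sigma_x, sigma_w = 1.0, 0.5
sigma_z = np.hypot(sigma_x, sigma_w)   # sqrt(sigma_x^2 + sigma_w^2)

# Corrupted design Z = X + W with isotropic Gaussian rows
Z = rng.normal(scale=sigma_x, size=(n, d)) + rng.normal(scale=sigma_w, size=(n, d))

# Column normalization: ||Z_.j||_2 / sqrt(n) concentrates around sigma_z,
# so Assumption 1 holds with kappa_c slightly above sigma_z
col_norms = np.linalg.norm(Z, axis=0) / np.sqrt(n)
print(col_norms.min(), col_norms.max())
```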

Main Results
In this section, we turn to our main results on lower and upper bounds on the minimax risks. We first derive intermediate results under Assumptions 1 and 2, and then turn to probabilistic consequences by virtue of Remark 1(ii) and (iii) on the conditions that guarantee Assumptions 1 and 2.
Let P_β denote the distribution of y in the linear regression model with additive errors when β is given and Z is observed. The following lemma provides the KL divergence between the distributions induced by two different parameters β, β′ ∈ B_q(R_q). The KL divergence plays a key role in establishing the information-theoretic lower bound. Recall that for two distributions P and Q having densities with respect to some base measure µ, the KL divergence is defined by D(P‖Q) = ∫ log(dP/dQ) dP.

Lemma 1.
In the additive error setting, the KL divergence between the distributions induced by β, β′ ∈ S_2(1) is given by D(P_β ‖ P_{β′}) = σ_x^4 ‖Z(β − β′)‖_2^2 / (2 σ_z^4 σ̄^2), where σ̄^2 := σ_x^2 σ_w^2/σ_z^2 + σ_e^2.

Proof. For each i = 1, 2, . . . , n fixed, by the model setting, (y_i, Z_i·) is jointly Gaussian with mean 0. Some elementary algebra to compute the covariances gives Var(y_i) = β^T Σ_x β + σ_e^2, Cov(Z_i·) = Σ_z, and Cov(y_i, Z_i·) = Σ_x β. Then it follows from standard results on the conditional distribution of Gaussian variables that y_i | Z_i· ∼ N(⟨Z_i·, Σ_z^{-1} Σ_x β⟩, σ_β^2), (2) where σ_β^2 := β^T(Σ_x − Σ_x Σ_z^{-1} Σ_x)β + σ_e^2, and σ_{β′}^2 is given analogously. Now assume that σ_e and σ_w are not both 0; otherwise, the conclusion holds trivially. Since P_β is a product distribution of y_i | Z_i· over all i = 1, 2, . . . , n, it follows from (2) and the KL divergence formula for univariate Gaussians that D(P_β ‖ P_{β′}) = (n/2) log(σ_{β′}^2/σ_β^2) + ∑_{i=1}^n [σ_β^2 + ⟨Z_i·, Σ_z^{-1} Σ_x (β − β′)⟩^2]/(2σ_{β′}^2) − n/2. (3) Since Σ_x = σ_x^2 I_d, Σ_z = σ_z^2 I_d, and ‖β‖_2 = ‖β′‖_2 = 1 by the assumptions, we immediately arrive at σ_β^2 = σ_{β′}^2 = σ_x^2 σ_w^2/σ_z^2 + σ_e^2 = σ̄^2.

Substituting this equality into (3) yields that D(P_β ‖ P_{β′}) = σ_x^4 ‖Z(β − β′)‖_2^2 / (2 σ_z^4 σ̄^2).

The proof is complete.

Proposition 1.
In the additive error setting, suppose that the observed matrix Z satisfies Assumption 1 with 0 < κ_c < +∞. Then, for any p ∈ [1, +∞), there exists a constant c_{q,p} depending only on q and p such that, with probability at least 1/2, the minimax ℓp-loss over the ℓq-ball is lower bounded as follows. Proof. For positive numbers δ > 0 and ε > 0, let M_p(δ) denote the cardinality of a maximal δ-packing of the ball B_q(R_q) in the ℓp metric, with elements {β^1, β^2, . . . , β^M}, and let N_2(ε) denote the minimal cardinality of an ε-covering of B_q(R_q) in the ℓ2-norm. We follow the standard technique in [30] to reduce the lower bound on estimation to a multi-way hypothesis testing problem, where B ∈ R^d is a random variable uniformly distributed over the packing set {β^1, β^2, . . . , β^M}, and β̃ is an estimator taking values in the packing set. It then follows from Fano's inequality [30] that the testing error is at least 1 − (I(y; B) + log 2)/log M_p(δ), where I(y; B) is the mutual information between the random variable B and the observation vector y ∈ R^n. It now remains to upper bound the mutual information I(y; B). Following the procedure of [30], the mutual information is upper bounded as I(y; B) ≤ log N_2(ε) + max_{β,β′} D(P_β ‖ P_{β′}).
Let absconv_q(Z/√n) denote the ℓq-convex hull of the rescaled columns of the observed matrix Z, where the normalization factor 1/√n is used for convenience. Since Z satisfies Assumption 1, [31] [Lemma 4] is applicable, and we conclude that there exists a set {Zβ^1, Zβ^2, . . . , Zβ^N} such that for every Zβ ∈ absconv_q(Z/√n), there exist some index i and some constant c > 0 such that ‖Z(β − β^i)‖_2/√n ≤ c κ_c ε. Combining this inequality with Lemma 1 and (7), one has that the mutual information is upper bounded accordingly. Thus, we obtain by (6) a lower bound (8) on the testing error. It remains to choose the packing and covering radii (i.e., δ and ε, respectively) such that (8) is strictly above zero, say bounded below by 1/2. Suppose that we choose the pair (δ, ε) such that log M_p(δ) ≥ 6 log N_2(ε).
As long as N_2(ε) ≥ 2, the testing error is bounded below by 1/2, as desired. It remains to determine values of the pair (δ, ε) satisfying (9). By [31] [Lemma 3], if ε is chosen suitably in terms of n, σ^2, κ_c^2, and a constant depending only on q, then (9a) is satisfied. Thus, we can choose ε satisfying (11). In addition, it follows from [31] [Lemma 3] that if δ is chosen to satisfy (12) for some constant U_{q,p} depending only on q and p, then (9b) holds. Combining (11) and (12), and then combining the resulting inequality with (10) and (5), we obtain that there exists a constant c_{q,p} depending only on q and p such that the claimed lower bound holds. The proof is complete.
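The skeleton of the argument above is the standard Fano method; in the paper's notation, the chain of inequalities used in the proof can be recapped as (a summary of the steps already stated, not a new result):

```latex
% Reduction from estimation to testing over a \delta-packing
% \{\beta^1,\dots,\beta^M\} of B_q(R_q):
\inf_{\widehat\beta}\,\sup_{\beta^*}\,
  \mathbb{P}\bigl[\|\widehat\beta-\beta^*\|_p \ge \delta/2\bigr]
  \;\ge\; \inf_{\widetilde\beta}\,\mathbb{P}\bigl[\widetilde\beta \ne B\bigr]
% Fano's inequality in terms of the mutual information I(y;B):
  \;\ge\; 1 - \frac{I(y;B) + \log 2}{\log M_p(\delta)},
% and the covering-based upper bound on the mutual information:
\qquad
I(y;B) \;\le\; \log N_2(\epsilon)
  + \max_{\beta,\beta'} D\bigl(\mathbb{P}_\beta \,\|\, \mathbb{P}_{\beta'}\bigr).
```

Choosing (δ, ε) so that log M_p(δ) dominates the mutual information term then yields the constant-probability lower bound.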
Note that the probability 1/2 in Proposition 1 is merely a standard convention; it may be made arbitrarily close to 2/3 by choosing the universal constants suitably. Specifically, noting Equation (10), as long as N_2(ε) is sufficiently large, the probability can be made arbitrarily close to 2/3. The requirement that N_2(ε) be sufficiently large can be met by choosing the universal constants L_{q,2} and c in view of Equation (11).

Proposition 2.
In the additive error setting, suppose that for a universal constant c_1, Γ̂ satisfies Assumption 2 with κ_l > 0 and τ_l(n, d) ≤ c_1 R_q (log d/n)^{1−q/2}. Then there exist universal constants (c_2, c_3) and a constant c_q depending only on q such that, with probability at least 1 − c_2 exp(−c_3 log d), the minimax ℓ2-loss over the ℓq-ball is upper bounded as in (13). Proof. It suffices to find an estimator of β* with a small ℓ2-norm estimation error with high probability. We consider the estimator β̂ ∈ arg min_{β ∈ B_q(R_q)} { (1/2) β^T Γ̂ β − Υ̂^T β }. (14) It is worth noting that (14) involves solving a nonconvex optimization problem when q ∈ [0, 1), while a near-global solution can be obtained efficiently by the algorithm proposed in [14]. Since β* ∈ B_q(R_q) ∩ S_2(1), it follows from the optimality of β̂ that (1/2) β̂^T Γ̂ β̂ − Υ̂^T β̂ ≤ (1/2) β*^T Γ̂ β* − Υ̂^T β*. Define ∆̂ := β̂ − β*, so that ∆̂ ∈ B_q(2R_q). It then follows that ∆̂^T Γ̂ ∆̂ ≤ 2⟨∆̂, Υ̂ − Γ̂ β*⟩.
This inequality, together with the assumption that Γ̂ satisfies Assumption 2, implies (15). It then follows from [11] [Lemma 2] that there exist universal constants (c_2, c_3, c_4) such that, with probability at least 1 − c_2 exp(−c_3 log d), the deviation bound (16) holds. Combining (15) and (16), one obtains an inequality relating ‖∆̂‖_2 and ‖∆̂‖_1. Introduce the shorthand σ := σ_z(σ_w + σ_e). Recall that ∆̂ ∈ B_q(2R_q); the norm comparison inequality for ℓq-balls in [24] then bounds ‖∆̂‖_1 in terms of ‖∆̂‖_2 and R_q. Therefore, by solving the resulting inequality with ‖∆̂‖_2 viewed as the indeterminate, we arrive at the conclusion that there exists a constant c_q depending only on q such that (13) holds with probability at least 1 − c_2 exp(−c_3 log d). The proof is complete.
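The bias-correcting role of the surrogate pair can be seen in a toy simulation. The sketch below is an unconstrained simplification that ignores the ℓq-ball constraint in (14) and uses illustrative dimensions; it is not the paper's algorithm. The naive least-squares fit that treats Z as clean is attenuated by the factor σ_x^2/σ_z^2, while solving Γ̂β = Υ̂ with the corrected surrogate pair removes this bias:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 10
sigma_x, sigma_w, sigma_e = 1.0, 0.5, 0.1

beta_star = np.zeros(d)
beta_star[:3] = [0.8, -0.5, 0.33]
beta_star /= np.linalg.norm(beta_star)     # enforce ||beta*||_2 = 1

X = rng.normal(scale=sigma_x, size=(n, d))
W = rng.normal(scale=sigma_w, size=(n, d))
e = rng.normal(scale=sigma_e, size=n)
y = X @ beta_star + e
Z = X + W                                   # only (Z, y) is observed

# Naive least squares on the corrupted design: converges to
# (sigma_x^2 / sigma_z^2) * beta_star, i.e., it is shrunk toward zero
beta_naive = np.linalg.solve(Z.T @ Z / n, Z.T @ y / n)

# Corrected estimator: solve Gamma_hat beta = Upsilon_hat
# with the surrogate pair (Z^T Z/n - Sigma_w, Z^T y/n)
Gamma_hat = Z.T @ Z / n - sigma_w**2 * np.eye(d)
Upsilon_hat = Z.T @ y / n
beta_corr = np.linalg.solve(Gamma_hat, Upsilon_hat)

err_naive = np.linalg.norm(beta_naive - beta_star)
err_corr = np.linalg.norm(beta_corr - beta_star)
print(err_naive, err_corr)
```

Here n > d keeps Γ̂ positive definite; in the n ≪ d regime of the paper, the same correction is used inside the constrained nonconvex program (14) instead.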

Remark 2. (i)
The lower and upper bounds on the minimax risks depend on the triple (R_q, n, d), the error level, and structural properties of the observed matrix Z, as shown in Propositions 1 and 2. Specifically, setting p = 2 in Proposition 1, the lower and upper bounds agree up to constant factors independent of the triple (R_q, n, d), establishing the optimal minimax rate in the additive error case.
(ii) Note that when p = 2 and q = 0 (i.e., the exactly sparse case), the minimax rate scales as Θ(R_0 log d / n). In the high-dimensional regime where d/R_0 ∼ d^γ for some constant γ > 0, this rate is equivalent (up to constant factors) to R_0 log(d/R_0)/n, which recaptures the same scaling as in [19].
(iii) The assumption that τ_l(n, d) ≤ c_1 R_q (log d/n)^{1−q/2} in Proposition 2 is not unreasonable. It has been shown in [11] [Lemma 1] that it is satisfied with high probability for the high-dimensional linear errors-in-variables model.
The following two theorems give probabilistic consequences in view of the conditions that ensure Assumptions 1 and 2. The proofs follow by applying Propositions 1 and 2 together with Remark 1(ii) and (iii), respectively, as well as elementary probability theory.

Funding:
The APC was funded by the Research Achievement Cultivation Fund of the School of Mathematics, Northwest University.