Transfer Learning for Moderate–Dimensional Ridge-Regularized Robust Linear Regression

Lyu, Lingfeng; Guo, Xiao; Liu, Zongqi

doi:10.3390/e28050543


by Lingfeng Lyu, Xiao Guo * and Zongqi Liu

Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei 230026, China

* Author to whom correspondence should be addressed.

Entropy 2026, 28(5), 543; https://doi.org/10.3390/e28050543

Submission received: 18 March 2026 / Revised: 7 May 2026 / Accepted: 9 May 2026 / Published: 11 May 2026

(This article belongs to the Section Information Theory, Probability and Statistics)


Abstract

This paper studies transfer learning for ridge-regularized robust linear regression in the moderate–dimensional regime, where the number of predictors is of the same order as the sample size and the regression coefficients are not assumed to be sparse. We propose Trans-RR, which combines a robust ridge estimator from a source study with a robust ridge correction based on the target study. Under mild assumptions, we characterize the asymptotic estimation error of the proposed estimator and show that leveraging source data can substantially improve estimation accuracy relative to the traditional single-study ridge-regularized robust estimator. To guard against negative transfer when the source study is not sufficiently informative, we further propose an adaptive aggregation of Trans-RR with the single-task estimator that selects the mixing weight by cross-validation. Simulation studies and a real-data analysis support the theory and illustrate the transition between positive and negative transfer as the discrepancy between the source and target studies varies.

Keywords: moderate dimension; non-sparse; robust regression; transfer learning

1. Introduction

Modern statistical analyses often involve several related datasets collected from different studies, populations, or experiments. A basic question is how to use these datasets together to improve prediction and estimation for a study of interest. Transfer learning addresses this question by borrowing useful information from related tasks. It is now standard in machine learning and has been successful in applications such as natural language processing, remote sensing, and computer vision [1,2,3]. In statistics, it has also become an important tool for improving performance in multi-study problems.

In many contemporary applications, regression problems arise in a moderate–dimensional regime, where the number of predictors is of the same order as the sample size and sparsity is often not a reasonable structural assumption. At the same time, heavy-tailed errors or outlying observations may substantially affect estimation accuracy. This setting arises naturally in multisite metabolomics studies, where many metabolites are measured simultaneously, between-cohort heterogeneity is often present, and outliers or other data contamination can be an important practical concern [4]. These features make transfer learning particularly challenging: related source studies may contain useful information, but effective borrowing requires methods that are robust to contamination and suitable for moderate–dimensional, non-sparse settings.

Existing theoretical work on transfer learning covers several important settings. For linear regression, ref. [5] studied data-enriched regression in a fixed-dimensional setting, and ref. [6] analyzed linear models with a shared low-dimensional representation across tasks. In the high-dimensional sparse regime, ref. [7] considered transfer learning with proxy data, while ref. [8] established prediction and estimation guarantees for sparse linear regression. For high-dimensional generalized linear models, refs. [9,10] developed transfer learning methods with theoretical guarantees. Transfer learning has also been studied for nonparametric classification [11], nonparametric regression [12], and settings with unreliable source data [13]. However, these works do not apply to the moderate–dimensional robust setting considered here.

Robust regression has, in contrast, been extensively studied in the single-study setting. For classical M-estimation, a substantial body of work has established asymptotic results when $p / n \to 0$ while $p \to \infty$, including [14,15,16,17,18,19]. When $p / n \to κ \in (0, 1)$, robust regression has a qualitatively different asymptotic behavior [20]. When $p / n \to κ > 0$, ref. [21] proposed a ridge-regularized robust estimator to address the nonexistence of the ordinary robust estimator. However, these results do not directly extend to the transfer learning framework when related source data are available.

Motivated by these gaps, we study transfer learning under a moderate–dimensional robust linear model with one target study and one related source study. We allow the predictor dimension to be of the same order as the target and source sample sizes, do not impose sparsity assumptions on the regression coefficients, and permit heavy-tailed errors. Within this framework, our goal is to leverage information from the source study to improve the estimation performance of traditional single-task approaches.

Our main contribution is to develop and analyze a transfer learning method for moderate–dimensional ridge-regularized robust linear regression. First, we propose Trans-RR, a transfer learning procedure for ridge-regularized robust linear regression. It combines a robust ridge fit on the source study with a robust ridge correction on the target study and is designed for non-sparse coefficients. Second, we derive an asymptotic characterization of the $\ell_{2}$ risk of the resulting estimator under mild assumptions on the design and error distributions. The theory shows how auxiliary source data can improve estimation accuracy relative to the single-study ridge-regularized robust estimator, while also clarifying the possibility of negative transfer. Third, we propose an adaptive aggregation of Trans-RR with the single-task estimator that selects the mixing weight by cross-validation, providing a data-driven safeguard against negative transfer. Fourth, we conduct simulation studies and a real-data analysis to examine the practical performance of the proposed methods, including their sensitivity to tuning choices and to the identity–covariance assumption underlying the theory.

The rest of the paper is organized as follows. Section 2 introduces the model setup and the proposed algorithm. Section 3 presents the technical assumptions, the theoretical results, and the adaptive aggregation against negative transfer. Section 4 presents simulation studies to evaluate the performance of the proposed methods. Section 5 applies the proposed methods to a real-data example. The proofs of the main theoretical result as well as the lemmas are included in Appendix D and Appendix E.

Notation. Denote by $I_{m}$ the $m \times m$ identity matrix. Let $0_{m} \in R^{m}$ and $1_{m} \in R^{m}$ be the vectors of zeros and ones, respectively. For a vector $v = (v_{1}, \dots, v_{m})^{⊤}$, the $\ell_{2}$ and $\ell_{\infty}$ norms are $∥ v ∥ = (\sum_{i = 1}^{m} v_{i}^{2})^{1 / 2}$ and $∥ v ∥_{\infty} = \max_{1 \leq k \leq m} | v_{k} |$, respectively. For an $m \times m$ matrix $A = (a_{i j})_{1 \leq i, j \leq m}$, denote by $λ_{\max} (A)$ and $λ_{\min} (A)$ the maximum and minimum eigenvalues of $A$, respectively. The spectral norm of $A$ is defined as $∥ A ∥ = \{λ_{\max} (A^{⊤} A)\}^{1 / 2}$.

2. Methodology

2.1. Problem Setup

We consider a transfer learning problem with one target study and one related source study. In the target study, we observe $n$ samples $(x_{i}, y_{i})$ with $x_{i} \in R^{p}$ and $y_{i} \in R$, $i = 1, \dots, n$, generated from

y_{i} = x_{i}^{⊤} β_{0} + ϵ_{i},

(1)

where $ϵ_{i}$, $i = 1, \dots, n$, are independently distributed errors and $β_{0} \in R^{p}$ is the unknown regression parameter of interest.

In addition to the target data, we observe $n_{1}$ samples $(x_{i}^{(1)}, y_{i}^{(1)})$, $i = 1, \dots, n_{1}$, from the source study satisfying

y_{i}^{(1)} = (x_{i}^{(1)})^{⊤} w_{0} + ϵ_{i}^{(1)},

(2)

where $w_{0} \in R^{p}$ is the regression parameter for the source study and $ϵ_{i}^{(1)}$, $i = 1, \dots, n_{1}$, are independently distributed errors. Throughout, both $ϵ_{i}$ and $ϵ_{i}^{(1)}$ are allowed to be heavy-tailed.

We work in the moderate–dimensional regime, where $p$ is of the same order as both $n$ and $n_{1}$, with $p / n \to κ \in (0, \infty)$ and $p / n_{1} \to κ_{1} \in (0, \infty)$. We also do not impose sparsity assumptions on $β_{0}$ or $w_{0}$. Let $δ_{0} = β_{0} - w_{0}$ denote the source–target discrepancy and let $h = ∥ δ_{0} ∥$ measure its size. Smaller values of $h$ correspond to a more informative source study and hence a greater potential for useful transfer.

2.2. Trans-RR Algorithm

Based on this setup, we now introduce the proposed transfer learning algorithm, referred to as Trans-RR. Following the general two-stage strategy used in [7,8,9], the core of the procedure consists of two estimation steps. The first step estimates the source coefficient vector $w_{0}$ from the source data. The second step estimates the source–target discrepancy $δ_{0} = β_{0} - w_{0}$ from the target data, after which the two estimates are combined. Algorithm 1 summarizes the procedure.

Algorithm 1: Trans-RR algorithm

Input: target data $\{(x_{i}, y_{i})\}_{i = 1}^{n}$ and source data $\{(x_{i}^{(1)}, y_{i}^{(1)})\}_{i = 1}^{n_{1}}$
Output: the estimated coefficient vector $\hat{β}$
Step 1. Compute

\hat{w} = \underset{w \in R^{p}}{\arg\min} [\frac{1}{n_{1}} \sum_{i = 1}^{n_{1}} \tilde{ρ} \{y_{i}^{(1)} - (x_{i}^{(1)})^{⊤} w\} + \frac{τ_{1}}{2} ∥ w ∥^{2}]

(3)

for some constant $τ_{1}$.
Step 2. Compute

\hat{δ} = \underset{δ \in R^{p}}{\arg\min} [\frac{1}{n} \sum_{i = 1}^{n} ρ \{y_{i} - x_{i}^{⊤} (\hat{w} + δ)\} + \frac{τ}{2} ∥ δ ∥^{2}]

(4)

for some constant $τ$.
Step 3. Let

\hat{β} = \hat{w} + \hat{δ} .

(5)

Step 4. Output $\hat{β}$.

The idea behind the construction is straightforward. Step 1 computes a ridge-regularized robust estimator from the source study. Step 2 then estimates the discrepancy relative to the source-stage fit by solving a second ridge-regularized robust regression problem on the target study. The final estimator is obtained by combining these two pieces, namely $\hat{β} = \hat{w} + \hat{δ}$. The main difference between our procedure and those of [7,8,9] is that we use ridge ($\ell_{2}$) regularization in both steps, whereas they use $\ell_{1}$ regularization. This choice is motivated by our diffuse-coefficient setting: the regression parameters $β_{0}$ and $w_{0}$ have many small coordinates and are not well approximated by sparse vectors. In this setting, lasso-based methods are not well-suited to the problem, whereas ridge penalization is natural. Another difference is that we use robust loss functions rather than the quadratic loss, which makes the procedure less sensitive to outliers and heavy-tailed errors.
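The two stages of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes NumPy, takes the pseudo-Huber loss for both $ρ$ and $\tilde{ρ}$, and solves each ridge-regularized problem by plain gradient descent with step size $1 / L$, where $L$ bounds the Lipschitz constant of the gradient (this relies on $ψ^{'} \leq 1$ for the pseudo-Huber loss). The names `robust_ridge` and `trans_rr`, the iteration count, and the default penalties are hypothetical choices.

```python
import numpy as np


def psi_pseudo_huber(r, delta=1.35):
    """Derivative of the pseudo-Huber loss; bounded by delta in absolute value."""
    return r / np.sqrt(1.0 + (r / delta) ** 2)


def robust_ridge(X, y, tau, offset=None, n_iter=500):
    """Minimise (1/n) sum_i rho(y_i - x_i'(offset + b)) + (tau/2) ||b||^2 over b.

    With offset=None this is the source-stage problem (3); with
    offset=w_hat it is the target-stage correction problem (4).
    """
    n, p = X.shape
    offset = np.zeros(p) if offset is None else offset
    resid0 = y - X @ offset
    # psi' <= 1, so the gradient is Lipschitz with constant at most
    # lambda_max(X'X)/n + tau; 1/L is then a safe step size.
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + tau)
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ psi_pseudo_huber(resid0 - X @ b) / n + tau * b
        b = b - step * grad
    return b


def trans_rr(X, y, X1, y1, tau=1.0, tau1=1.0):
    """Return (beta_hat, w_hat) following Steps 1-3 of Algorithm 1."""
    w_hat = robust_ridge(X1, y1, tau1)                   # Step 1: source fit
    delta_hat = robust_ridge(X, y, tau, offset=w_hat)    # Step 2: correction
    return w_hat + delta_hat, w_hat                      # Step 3: combine
```

Gradient descent is used here only because it needs no solver beyond NumPy; since both objectives are strongly convex for positive penalties, any standard smooth convex optimizer could be substituted.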

Remark 1.

When robustness to heavy-tailed errors or outliers is needed, Huber-type loss functions are natural candidates for $ρ$ and $\tilde{ρ}$. Specific choices of $ρ$ and $\tilde{ρ}$ under our theoretical framework are discussed in Section 3. The regularization parameters $τ$ and $τ_{1}$ may be selected by standard data-driven tuning methods such as cross-validation.

3. Theoretical Results

This section introduces the assumptions for the analysis and then presents the main asymptotic error results for Trans-RR.

3.1. Technical Assumptions

We study the estimation error of the estimator in Algorithm 1 under the following assumptions. We state the assumptions separately for the target study (Assumption 1) and the source study (Assumption 2), since the two stages are based on different samples. The two sets of conditions are largely parallel.

Assumption 1.

(a): $p / n \to κ \in (0, \infty)$ .
(b): Suppose ρ is an even and convex function. Assume that $ψ = ρ^{'}$ is bounded and $ψ^{'}$ is Lipschitz and bounded. Moreover, we assume that $sign (ψ (x)) = sign (x)$ and that $ρ (x) \geq ρ (0) = 0$ for all $x \in R$ .
(c): Assume that there exist independent variables $λ_{i}$ ’s and $X_{i}$ ’s such that $x_{i} = λ_{i} X_{i}$ . Suppose that $X_{i}$ ’s are i.i.d. with independent entries, and they have mean $0_{p}$ and $cov (X_{i}) = I_{p}$ . Suppose there exist $c_{n}$ and $C_{n}$ that vary with n, where $1 / c_{n} = O (polyLog (n))$ and $C_{n}$ is bounded in n, such that for any convex 1-Lipschitz function G of $X_{i}$ , $P (| G (X_{i}) - m_{G} | > t) \leq C_{n} \exp (- c_{n} t^{2})$ holds for all $t > 0$ , where $m_{G}$ is the median of $G (X_{i})$ . We require the same assumption to hold for the columns of the $n \times p$ design matrix $X$ . Additionally, we assume that the coordinates of $X_{i}$ have moments of all orders, and the k-th moment of the entries of $X_{i}$ is assumed to be uniformly bounded independently of n and p for all k.
(d): Suppose that $λ_{i}$ ’s are independent, with $E (λ_{i}^{2}) = 1$ , $E (λ_{i}^{4})$ being bounded, and $\sup_{1 \leq i \leq n} | λ_{i} |$ growing at most like $C_{λ} {(\log n)}^{k}$ for some k. $λ_{i}$ ’s may have finitely many possible distributions.
(e): Suppose that $ϵ_{i}$ ’s are independent and are also independent of $X_{i}$ ’s and $λ_{i}$ ’s. They may have finitely many possible distributions, each with a density that is differentiable, symmetric, and unimodal. If $f_{i}$ is the density of one such distribution, we assume that $\lim_{x \to \infty} x f_{i} (x) = 0$ .
(f): The fraction of occurrences for each possible combination of distributions for $(ϵ_{i}, λ_{i})$ has a limit as $n \to \infty$ .
(g): There exist constants $C_{β}$ and $e > 1 / 3$ such that $∥ β_{0} ∥ \leq C_{β}$ and $∥ β_{0} ∥_{\infty} = O (n^{- e})$ .

Assumption 2.

(a): $p / n_{1} \to κ_{1} \in (0, \infty)$ .
(b): $\tilde{ρ}$ and $\tilde{ψ}$ satisfy Assumption 1(b).
(c): $x_{i}^{(1)}$ , $X_{i}^{(1)}$ ’s and $λ_{i}^{(1)}$ ’s satisfy Assumption 1(c).
(d): $λ_{i}^{(1)}$ ’s satisfy Assumption 1(d).
(e): $ϵ_{i}^{(1)}$ ’s, $X_{i}^{(1)}$ ’s and $λ_{i}^{(1)}$ ’s satisfy Assumption 1(e).
(f): $λ_{i}^{(1)}$ ’s and $ϵ_{i}^{(1)}$ ’s satisfy Assumption 1(f).
(g): $∥ w_{0} ∥_{2}$ remains bounded. Furthermore, $∥ w_{0} ∥_{\infty} = O (n_{1}^{- e})$ , where $e > 1 / 3$ .

For Assumptions 1(b) and 2(b), it is quite common in robust statistics to require $ψ$ to be bounded. For example, the Huber loss

ρ_{H} (x) = \begin{cases} \frac{x^{2}}{2} & \text{if } | x | \leq δ, \\ δ \cdot (| x | - \frac{1}{2} δ) & \text{otherwise}, \end{cases}

is chosen to grow linearly as $| x | \to \infty$, which reduces the influence of outliers on the resulting regression estimator. Although the Huber loss does not fully satisfy the assumptions because its derivative $ψ_{H} = ρ_{H}^{'}$ is not differentiable at $| x | = δ$, these assumptions hold for a smoothed approximation such as

ρ_{η} (x) = \begin{cases} \frac{x^{2}}{2} & \text{if } | x | \leq δ - η, \\ (δ - \frac{η}{2}) \cdot | x | + \frac{(δ - | x |)^{3}}{6 η} + C_{ρ} & \text{if } | x | \in (δ - η, δ), \\ (δ - \frac{η}{2}) \cdot | x | + C_{ρ} & \text{if } | x | \geq δ, \end{cases}

(6)

where $C_{ρ} = - η^{2} / 6 + η δ / 2 - δ^{2} / 2$. The corresponding $ψ_{η}$ is given by

ψ_{η} (x) = \begin{cases} x & \text{if } | x | \leq δ - η, \\ sign (x) \cdot \{δ - \frac{η}{2} - \frac{(δ - | x |)^{2}}{2 η}\} & \text{if } | x | \in (δ - η, δ), \\ sign (x) \cdot (δ - \frac{η}{2}) & \text{if } | x | \geq δ . \end{cases}

(7)

Another example that fully satisfies the assumptions is the pseudo-Huber loss function from [22,23], defined by

ρ_{P} (x) = δ^{2} (\sqrt{1 + \frac{x^{2}}{δ^{2}}} - 1) .

(8)
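For concreteness, the smoothed Huber loss (6), its derivative (7), and the pseudo-Huber loss (8) can be transcribed directly. The sketch below is a plain-Python illustration; the function names are hypothetical, and the defaults $δ = 1.35$ and $η = 0.1$ are simply the values used in the simulations of Section 4.

```python
import math


def rho_smooth_huber(x, delta=1.35, eta=0.1):
    """Smoothed Huber loss rho_eta from (6)."""
    a = abs(x)
    c_rho = -eta ** 2 / 6 + eta * delta / 2 - delta ** 2 / 2
    if a <= delta - eta:
        return 0.5 * x * x                       # quadratic core
    if a < delta:
        # cubic blend that keeps the second derivative continuous
        return (delta - eta / 2) * a + (delta - a) ** 3 / (6 * eta) + c_rho
    return (delta - eta / 2) * a + c_rho         # linear tails


def psi_smooth_huber(x, delta=1.35, eta=0.1):
    """Derivative psi_eta from (7); bounded by delta - eta/2."""
    a = abs(x)
    s = math.copysign(1.0, x)
    if a <= delta - eta:
        return x
    if a < delta:
        return s * (delta - eta / 2 - (delta - a) ** 2 / (2 * eta))
    return s * (delta - eta / 2)


def rho_pseudo_huber(x, delta=1.35):
    """Pseudo-Huber loss rho_P from (8)."""
    return delta ** 2 * (math.sqrt(1 + (x / delta) ** 2) - 1)


def psi_pseudo_huber(x, delta=1.35):
    """Derivative of rho_P; smooth and bounded by delta."""
    return x / math.sqrt(1 + (x / delta) ** 2)
```

The constant $C_{ρ}$ is what makes the three pieces of (6) join continuously at $| x | = δ - η$ and $| x | = δ$, which can be checked numerically with these functions.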

The assumptions on $x_{i}$’s and $x_{i}^{(1)}$’s, in particular that they have mean $0_{p}$ and covariance matrix $I_{p}$, are common in the study of M-estimators for linear models. These assumptions have been used in the low-dimensional regime $p / n \to 0$ studied in [14,15,24], in the moderate–dimensional regime $p / n \to κ \in (0, 1)$ considered in [20], and in the regime $p / n \to κ > 0$ analyzed in [21,25].

The concentration assumption on $X_{i}$’s and $X_{i}^{(1)}$’s is weaker than the Gaussian assumptions often imposed in robust statistics. This assumption has also been studied in [21,26] and holds for a broad class of distributions. Corollary 4.10 in [27] demonstrates that our assumptions are satisfied if $X_{i}$ has independent entries bounded by $1 / (2 \sqrt{c_{1}})$ for some $c_{1} > 0$. Additionally, Theorem 2.7 in [27] shows that the assumptions hold when $X_{i}$ has independent entries with density $f_{k}$, $1 \leq k \leq p$, such that $f_{k} (x) = \exp (- u_{k} (x))$ and $u_{k}^{''} (x) \geq \sqrt{c_{2}}$ for some $c_{2} > 0$. In particular, this condition holds when $X_{i}$ has i.i.d. $N (0, 1)$ entries, where $c_{2} = 1$. As will be seen in the proof, the functions $G$ that arise in our analysis are either linear functions or square roots of quadratic forms. A similar discussion applies to the $X_{i}^{(1)}$’s.

The introduction of $λ_{i}$ and $λ_{i}^{(1)}$, as also considered in [20,21], is used to induce a nonspherical geometry on the predictors. Although the assumption $E (λ_{i}^{2}) = 1$ can be relaxed to the requirement that $E (λ_{i}^{2})$ be uniformly bounded, it remains statistically important because it guarantees that $cov (x_{i}) = I_{p}$ in all the models we consider. This construction shows that many models can share the same covariance $cov (x_{i})$ while exhibiting substantially different estimation errors for $\hat{β}$. This contrasts with the low-dimensional setting studied in [28], where $cov (x_{i})$ is the key quantity for robust regression. A similar discussion applies to $λ_{i}^{(1)}$’s and $x_{i}^{(1)}$’s.

In Assumptions 1(e) and 2(e), no moment restriction is imposed on the $ϵ_{i}$’s and $ϵ_{i}^{(1)}$’s. For instance, smooth symmetric log-concave densities satisfy all of these assumptions; see [29,30]. Furthermore, the Cauchy distribution also satisfies these conditions; see Theorem 1.6 in [31]. This makes the framework compatible with heavy-tailed errors, which are of particular interest in robust regression.

Assumptions 1(g) and 2(g) impose a non-sparse structure on $β_{0}$ and $w_{0}$, meaning that these vectors cannot be well approximated by sparse vectors in $\ell_{2}$ norm. This setting is common in moderate–dimensional statistics and contrasts with the sparse regime, where only a small fraction of coefficients are substantial. Under these assumptions, the proposed Trans-RR estimator may outperform lasso-based methods.

We now turn to the target-stage result and the resulting error characterization for Trans-RR.

3.2. Asymptotic Characterization of Estimation Error

Our main theorem characterizes the asymptotic $\ell_{2}$ error of the Trans-RR estimator. Recall that $\hat{β}$ is defined in (5), and let $τ > 0$ be fixed as $n$ and $p$ vary. To state the result, let $prox (c ρ)$ denote the proximal mapping of the function $c ρ$; see [32]. It is given by

prox (c ρ) (x) = \underset{y \in R}{\arg\min} \{c ρ (y) + \frac{1}{2} (x - y)^{2}\}, x \in R .

(9)

When $ρ$ is differentiable, $prox (c ρ) (x)$ is the unique $y \in R$ satisfying $y + c ψ (y) = x$, with $ψ = ρ^{'}$. The value $prox (c ρ) (x)$ can be viewed as a shrinkage of $x$ toward the minimizer of $ρ$, with the amount of shrinkage depending on $c$ and $ρ$. The proximal mapping is a standard object in convex analysis and convex optimization (see [33] for a review of its analytic properties and an introduction to proximal gradient algorithms). As explained in [34], the system in Theorem 1 can be reformulated in terms of $prox ((c_{ρ} (κ) ρ)^{*})$, where $f^{*} (u) = \sup_{y \in R} \{u y - f (y)\}$ denotes the Fenchel–Legendre conjugate of $f$.
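Because $y + c ψ (y)$ is strictly increasing in $y$ when $ρ$ is convex and differentiable, and because $sign (ψ (y)) = sign (y)$ under Assumption 1(b), the solution of $y + c ψ (y) = x$ lies between $0$ and $x$ and can be found by bisection. The following plain-Python sketch illustrates this; the function name `prox`, its arguments, and the use of the pseudo-Huber $ψ$ as the default are assumptions made for illustration.

```python
import math


def psi_pseudo_huber(y, delta=1.35):
    """Derivative of the pseudo-Huber loss (8)."""
    return y / math.sqrt(1.0 + (y / delta) ** 2)


def prox(x, c, psi=psi_pseudo_huber, n_iter=200):
    """Evaluate prox(c*rho)(x) by solving y + c*psi(y) = x with bisection.

    Assumes psi is nondecreasing with sign(psi(y)) = sign(y), so that the
    root is bracketed by 0 and x.
    """
    lo, hi = (0.0, x) if x >= 0 else (x, 0.0)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mid + c * psi(mid) < x:
            lo = mid      # root lies to the right of mid
        else:
            hi = mid      # root lies at or to the left of mid
    return 0.5 * (lo + hi)
```

For the quadratic loss $ρ (x) = x^{2} / 2$, where $ψ (y) = y$, the same routine reproduces the closed form $x / (1 + c)$, which gives a quick check of the implementation.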

Under Assumptions 1 and 2, the estimation error admits the following limit.

Theorem 1.

Under Assumptions 1 and 2, conditional on the source-stage estimator $\hat{w}$, which is independent of the target sample, we have $∥ \hat{β} - β_{0} ∥ \to r_{ρ} (κ)$ in probability, where $r_{ρ} (κ)$ is deterministic for the given value of $\hat{w}$. Let $W_{i} = ϵ_{i} + r_{ρ} (κ) λ_{i} Z_{i}$, where $Z_{i}$ is a standard normal random variable independent of $ϵ_{i}$ and $λ_{i}$. Then there exists a constant $c_{ρ} (κ)$ such that

\begin{aligned} \lim_{n \to \infty} \frac{1}{n} \sum_{i = 1}^{n} E \{[prox (c_{ρ} (κ) λ_{i}^{2} ρ)]^{'} (W_{i})\} &= 1 - κ + τ c_{ρ} (κ), \\ κ [\lim_{n \to \infty} \frac{1}{n} \sum_{i = 1}^{n} E (\frac{[W_{i} - prox (c_{ρ} (κ) λ_{i}^{2} ρ) (W_{i})]^{2}}{λ_{i}^{2}})] + τ^{2} ∥ β_{0} - \hat{w} ∥^{2} c_{ρ}^{2} (κ) &= κ^{2} r_{ρ}^{2} (κ) . \end{aligned}

(10)

The proof of Theorem 1 is given in Appendix E. Here and below, the dependence of $r_{ρ} (κ)$ and $c_{ρ} (κ)$ on $\hat{w}$ is suppressed for notational simplicity.

Theorem 1 shows that the asymptotic error $r_{ρ} (κ)$ depends on the source study only through the discrepancy $∥ β_{0} - \hat{w} ∥$. To investigate when positive transfer occurs, Section 4.2 numerically computes $r_{ρ}$ as a function of $∥ β_{0} - \hat{w} ∥$ under the smoothed Huber loss (6) in three simulation cases. The resulting curves are monotonically increasing across the displayed range. By the triangle inequality,

∥ β_{0} - \hat{w} ∥ \leq ∥ β_{0} - w_{0} ∥ + ∥ \hat{w} - w_{0} ∥ .

The right-hand side is the sum of the population gap between the source and target coefficients and the source-stage estimation error. Transfer is therefore expected to help when the source-stage estimator is accurate and close to the target coefficient, and to hurt when either the population gap $∥ β_{0} - w_{0} ∥$ or the source-stage estimation error $∥ \hat{w} - w_{0} ∥$ is large. In practice, Section 3.3 develops an adaptive Trans-RR estimator to avoid negative transfer.

Unlike several recent transfer learning analyses, such as [7,8,9], our theory does not impose direct structural restrictions on $δ_{0}$, the difference between the target and source coefficients. Theorem 1 also shows that the performance of $\hat{β}$ depends on the distribution of the $λ_{i}$’s in the representation $x_{i} = λ_{i} X_{i}$ from Assumption 1(c). Thus, in the moderate–dimensional regime, the geometry of the target predictors encoded by $λ_{i}$ materially affects the estimation error. This again contrasts with low-dimensional robust regression, where $cov (x_{i})$ is the dominant quantity.

When $λ_{i}^{2} = 1$ for all $i$ and the errors $ϵ_{i}$ are i.i.d., Theorem 1 simplifies as follows.

Corollary 1.

Under the same assumptions as in Theorem 1, if $λ_{i}^{2} = 1$ for all $i$ and the errors $ϵ_{i}$ are i.i.d., then, conditional on $\hat{w}$, we have $∥ \hat{β} - β_{0} ∥ \to r_{ρ} (κ)$ in probability, where $r_{ρ} (κ)$ is deterministic for the given value of $\hat{w}$. Let $w = ϵ + r_{ρ} (κ) z$, where $ϵ$ has the same distribution as the $ϵ_{i}$’s and $z$ is a standard normal random variable independent of $ϵ$. Then there exists a constant $c_{ρ} (κ)$ such that

\begin{aligned} E \{[prox (c_{ρ} (κ) ρ)]^{'} (w)\} &= 1 - κ + τ c_{ρ} (κ), \\ κ E ([w - prox (c_{ρ} (κ) ρ) (w)]^{2}) + τ^{2} ∥ β_{0} - \hat{w} ∥^{2} c_{ρ}^{2} (κ) &= κ^{2} r_{ρ}^{2} (κ) . \end{aligned}

Corollary 1 shows that, under a homogeneous target design, the general characterization in Theorem 1 reduces to a simpler scalar system. This special case is useful for interpretation and will also serve as a convenient benchmark in the simulation study.
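To see how the two equations of the scalar system interact, it can help to work through a case with a closed form. The sketch below is an illustration only and deliberately steps outside Assumption 1(b): it takes the quadratic loss $ρ (x) = x^{2} / 2$, whose $ψ$ is unbounded, and assumes errors with finite variance $σ^{2}$. In that case $prox (c ρ) (w) = w / (1 + c)$, so the first equation reduces to $1 / (1 + c) = 1 - κ + τ c$ and the second becomes linear in $r^{2}$ because $E (w^{2}) = σ^{2} + r^{2}$. The function name and argument names are hypothetical.

```python
import math


def quadratic_loss_limit(kappa, tau, sigma2, disc):
    """Solve the scalar system of Corollary 1 for rho(x) = x^2 / 2.

    kappa  : limit of p / n
    tau    : target-stage ridge penalty (> 0)
    sigma2 : error variance (assumed finite for this illustration)
    disc   : the discrepancy ||beta_0 - w_hat||
    Returns (c, r2) with r2 the squared asymptotic l2 error.
    """
    # First equation: tau*c^2 + (1 - kappa + tau)*c - kappa = 0, positive root,
    # written in a form that avoids cancellation when tau is small.
    a = 1.0 - kappa + tau
    c = 2.0 * kappa / (a + math.sqrt(a * a + 4.0 * tau * kappa))
    # Second equation: kappa*s*(sigma2 + r2) + (tau*disc*c)^2 = kappa^2 * r2,
    # where s = (c / (1 + c))^2 comes from w - prox(w) = c*w / (1 + c).
    s = (c / (1.0 + c)) ** 2
    r2 = (kappa * s * sigma2 + (tau * disc * c) ** 2) / (kappa ** 2 - kappa * s)
    return c, r2
```

With $τ$ close to zero and $κ < 1$, the formula returns approximately $κ σ^{2} / (1 - κ)$, the classical least-squares risk, and for fixed $τ > 0$ the value of $r^{2}$ increases with the discrepancy, in line with the qualitative behavior described for the robust losses.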

Remark 2.

The limits on the left-hand side of (10) exist because Assumption 1(f) guarantees convergence of the proportions associated with each pair $(L (ϵ_{i}), L (λ_{i}))$, where $L (ϵ_{i})$ and $L (λ_{i})$ denote the laws of $ϵ_{i}$ and $λ_{i}$. For the second equation in (10), the ratio can be interpreted through the identity

\frac{[x - prox (c_{ρ} (κ) λ^{2} ρ) (x)]^{2}}{λ^{2}} = c_{ρ}^{2} (κ) λ^{2} ψ^{2} (prox (c_{ρ} (κ) λ^{2} ρ) (x)),

whose right-hand side is well defined even when $λ = 0$. Equivalently, (10) can be written as

\begin{aligned} \lim_{n \to \infty} \frac{1}{n} \sum_{i = 1}^{n} E \{[prox (c_{ρ} (κ) λ_{i}^{2} ρ)]^{'} (W_{i})\} &= 1 - κ + τ c_{ρ} (κ), \\ κ [\lim_{n \to \infty} \frac{1}{n} \sum_{i = 1}^{n} E \{c_{ρ}^{2} (κ) λ_{i}^{2} ψ^{2} (prox (c_{ρ} (κ) λ_{i}^{2} ρ) (W_{i}))\}] + τ^{2} ∥ β_{0} - \hat{w} ∥^{2} c_{ρ}^{2} (κ) &= κ^{2} r_{ρ}^{2} (κ) . \end{aligned}

This representation shows that the expectation in (10) is well defined, and Assumption 1(f) ensures that the relevant limits exist.

3.3. Adaptive Aggregation Against Negative Transfer

We now develop an adaptive aggregation of Trans-RR with the single-task estimator to address negative transfer. Specifically, let $\hat{β}_{st}$ denote the single-task ridge-regularized robust estimator on the target sample,

\hat{β}_{st} = \underset{β \in R^{p}}{\arg\min} [\frac{1}{n} \sum_{i = 1}^{n} ρ (y_{i} - x_{i}^{⊤} β) + \frac{τ_{st}}{2} ∥ β ∥^{2}]

(11)

for some constant $τ_{st}$. Theorem 1 and the discussion above show that negative transfer may happen when the target–source discrepancy $∥ β_{0} - w_{0} ∥$ or the source-stage estimation error is too large. We therefore propose the adaptive Trans-RR estimator to avoid negative transfer, defined for $θ \in [0, 1]$ by

\hat{β}_{ada} (θ) = θ \hat{β} + (1 - θ) \hat{β}_{st} .

(12)

Intuitively, $\hat{β}_{ada} (θ)$ recovers Trans-RR at $θ = 1$ and the single-task estimator at $θ = 0$. Including both endpoints allows the procedure to fall back to either Trans-RR or the single-task estimator, while interior values allow partial transfer. We select the mixing weight $θ$ from data by cross-validation and write $\hat{β}_{ada} := \hat{β}_{ada} (\hat{θ})$ for the resulting estimator.

We tune the ridge penalties before selecting $θ$. The source penalty $τ_{1}$ is tuned by cross-validation on the source sample, yielding $\hat{w}$ via (3). With $\hat{w}$ fixed, the target penalty $τ$ is tuned by cross-validation on the target sample through (4). The single-task penalty $τ_{st}$ is tuned by cross-validation on the target sample. To select $θ$, we then use a $K$-fold partition $\{V_{k}\}_{k = 1}^{K}$ of the target sample, drawn independently of the partitions used to tune $τ$ and $τ_{st}$. Here, $V_{k}$ denotes the validation index set of fold $k$.

For each $k = 1, \dots, K$, let $\hat{β}^{(- k)}$ be the output of Algorithm 1 applied to the full source data and $\{(x_{i}, y_{i}) : i \notin V_{k}\}$ at the tuned $(τ_{1}, τ)$. Let $\hat{β}_{st}^{(- k)}$ be the solution of (11) on $\{(x_{i}, y_{i}) : i \notin V_{k}\}$ at the tuned $τ_{st}$. We reuse $τ_{1}$, $τ$, and $τ_{st}$ across all $K$ folds rather than re-tuning per fold, which would multiply the penalty-tuning cost by $K$. We then choose

\hat{θ} = \underset{θ \in Θ}{\arg\min} \frac{1}{n} \sum_{k = 1}^{K} \sum_{i \in V_{k}} L (y_{i} - x_{i}^{⊤} [θ \hat{β}^{(- k)} + (1 - θ) \hat{β}_{st}^{(- k)}]),

(13)

where $L : R \to R_{\geq 0}$ is a validation loss and $Θ \subset [0, 1]$ is a finite candidate set. A natural default is the absolute-error loss $L (t) = | t |$, which matches the criterion used to tune the ridge penalties. We refer to $\hat{β}_{ada}$ as the adaptive Trans-RR estimator, denoted Trans-RR-Ada in the following numerical experiments. Algorithm 2 summarizes the procedure.

Algorithm 2: Adaptive Trans-RR algorithm

Input: target data $\{(x_{i}, y_{i})\}_{i = 1}^{n}$, source data $\{(x_{i}^{(1)}, y_{i}^{(1)})\}_{i = 1}^{n_{1}}$, fold count $K$, candidate set $Θ \subset [0, 1]$, validation loss $L$
Output: the adaptive estimator $\hat{β}_{ada}$ and the selected weight $\hat{θ}$
Step 1. Tune $τ_{1}$, $τ$, and $τ_{st}$ by cross-validation on the corresponding samples. Compute $\hat{β}$ from Algorithm 1 and $\hat{β}_{st}$ from (11).
Step 2. Draw a $K$-fold partition $\{V_{k}\}_{k = 1}^{K}$ of the target indices, independent of the partitions used in Step 1.
Step 3. For each $k = 1, \dots, K$, let $\hat{β}^{(- k)}$ be the output of Algorithm 1 applied to the full source data and $\{(x_{i}, y_{i}) : i \notin V_{k}\}$ at the tuned $(τ_{1}, τ)$. Let $\hat{β}_{st}^{(- k)}$ be the solution of (11) on $\{(x_{i}, y_{i}) : i \notin V_{k}\}$ at the tuned $τ_{st}$.
Step 4. Compute

\hat{θ} = \underset{θ \in Θ}{\arg\min} \frac{1}{n} \sum_{k = 1}^{K} \sum_{i \in V_{k}} L (y_{i} - x_{i}^{⊤} [θ \hat{β}^{(- k)} + (1 - θ) \hat{β}_{st}^{(- k)}]) .

Step 5. Output $\hat{β}_{ada} = \hat{θ} \hat{β} + (1 - \hat{θ}) \hat{β}_{st}$ and $\hat{θ}$.
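Step 4 of Algorithm 2 is a one-dimensional grid search once the leave-fold-out estimates are available. The following sketch illustrates it under stated assumptions: it uses NumPy, takes the absolute-error validation loss $L (t) = | t |$, and expects the fold-wise estimates $\hat{β}^{(- k)}$ and $\hat{β}_{st}^{(- k)}$ to have been computed beforehand by whatever solver is in use. The function name `select_theta` and its argument layout are hypothetical.

```python
import numpy as np


def select_theta(X, y, folds, beta_tr_folds, beta_st_folds, thetas):
    """Grid search for the mixing weight in criterion (13) with L(t) = |t|.

    folds          : list of index arrays V_k (validation indices of fold k)
    beta_tr_folds  : list of Trans-RR estimates fitted without fold k
    beta_st_folds  : list of single-task estimates fitted without fold k
    thetas         : finite candidate set, a sequence of values in [0, 1]
    Returns (theta_hat, cv_loss).
    """
    n = len(y)
    best_theta, best_loss = None, np.inf
    for theta in thetas:
        total = 0.0
        for V, b_tr, b_st in zip(folds, beta_tr_folds, beta_st_folds):
            b = theta * b_tr + (1.0 - theta) * b_st
            total += np.abs(y[V] - X[V] @ b).sum()   # validation loss on fold k
        if total / n < best_loss:
            best_theta, best_loss = theta, total / n
    return best_theta, best_loss
```

A typical call would pass `thetas = np.linspace(0.0, 1.0, 11)` so that both endpoints, pure Trans-RR and pure single-task estimation, are among the candidates; the grid resolution is a free choice not fixed by the procedure.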

3.4. Applicability and Limitations

Theorem 1 and Corollary 1 are derived under Assumptions 1 and 2, which include three structural conditions: identity covariance $cov (x_{i}) = I_{p}$, the moderate–dimensional regime $p / n \to κ \in (0, \infty)$, and a twice-differentiable robust loss. Under these assumptions, the asymptotic $\ell_{2}$ estimation error of Trans-RR equals the deterministic limit $r_{ρ} (κ)$, in agreement with the numerical results of Section 4. When any of these assumptions fails, Theorem 1 no longer applies.

The adaptive aggregation of Section 3.3, by contrast, is constructed without invoking these structural assumptions. Its mixing weight $\hat{θ}$ is selected by cross-validation on the target sample and serves as a data-driven safeguard against negative transfer. Section 4.4 provides numerical support under AR(1)-correlated predictors across all three noise cases: a transition between positive and negative transfer is observed near $h = 1$, and Trans-RR-Ada continues to track the better of the two base estimators.

4. Simulation

In this section, we conduct numerical studies to support the theoretical results. We set the dimension of both target and source data to be $p \in \{200, 400, 800\}$. We set $n = p, p / 4$ and $n_{1} = 2 p, p / 2$, corresponding to moderate–dimensional settings with $κ = 1, 4$ and $κ_{1} = 1 / 2, 2$. To generate data, we set $x_{i} = λ_{i} X_{i}$ and $x_{i}^{(1)} = λ_{i}^{(1)} X_{i}^{(1)}$, where $X_{i}$ and $X_{i}^{(1)}$ have i.i.d. $N (0, 1)$ entries. We consider three different cases for the choices of $λ_{i}$’s, $ϵ_{i}$’s, $λ_{i}^{(1)}$’s and $ϵ_{i}^{(1)}$’s:

Case I: $λ_{i} = 1$ for $i = 1, \dots, n$ and $λ_{j}^{(1)} = 1$ for $j = 1, \dots, n_{1}$ . The target errors $ϵ_{i}$ are i.i.d. $N (0, 1)$ , and the source errors $ϵ_{j}^{(1)}$ are i.i.d. $N (0, 2^{2})$ .
Case II: The variables $λ_{i}$ and $λ_{j}^{(1)}$ are i.i.d. $Unif (0, \sqrt{3})$ , while $ϵ_{i}$ and $ϵ_{j}^{(1)}$ are i.i.d. $Cauchy (0, 1)$ and $Cauchy (0, 2)$ , respectively.
Case III: In both the target and source studies, half of the observations are generated as in Case $I$ and the other half are generated as in Case $II$ .

Case I is a standard Gaussian setup for linear regression. Case II features a non-Gaussian design and heavy-tailed errors. Case III is a mixture of the two cases and is designed to test the effectiveness of our theoretical results under non-identically distributed $x_{i}$’s and $ϵ_{i}$’s.
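The three designs can be generated with a short routine. The sketch below is one possible reading of the case descriptions, assuming NumPy: `noise_scale` is 1 for the target study and 2 for the source study (the standard deviation of the Gaussian errors and the scale of the Cauchy errors), and in Case III the first half of the observations follows Case I and the remainder follows Case II. The function names, the ordering of the two halves, and the handling of odd sample sizes are illustrative assumptions.

```python
import numpy as np


def simulate_case(n, p, case, noise_scale, rng):
    """Return (x, eps, lam) with rows x_i = lam_i * X_i for Cases I-III."""
    X = rng.standard_normal((n, p))                 # i.i.d. N(0, 1) entries
    # Number of observations drawn as in Case I (Gaussian) vs Case II (heavy).
    n_gauss = {"I": n, "II": 0, "III": n // 2}[case]
    n_heavy = n - n_gauss
    lam = np.concatenate([
        np.ones(n_gauss),                           # Case I: lambda_i = 1
        rng.uniform(0.0, np.sqrt(3.0), n_heavy),    # Case II: Unif(0, sqrt(3))
    ])
    eps = np.concatenate([
        noise_scale * rng.standard_normal(n_gauss),  # N(0, noise_scale^2)
        noise_scale * rng.standard_cauchy(n_heavy),  # Cauchy(0, noise_scale)
    ])
    return lam[:, None] * X, eps, lam


def make_study(n, p, coef, case, noise_scale, rng):
    """Generate (x, y) from the linear model y_i = x_i' coef + eps_i."""
    x, eps, _ = simulate_case(n, p, case, noise_scale, rng)
    return x, x @ coef + eps
```

The upper endpoint $\sqrt{3}$ of the uniform distribution gives $E (λ_{i}^{2}) = 1$, so that $cov (x_{i}) = I_{p}$ holds in all three cases, as required by Assumption 1(d).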

4.1. Validity of Theoretical Results

We first evaluate the validity of the proposed scalar $r_{ρ}$ in Theorem 1. For each setting, we generate $β^{*}$ and $w^{*}$ once with i.i.d. $Unif (0, 1)$ entries and set $β_{0} = β^{*} / \sqrt{n}$ and $w_{0} = w^{*} / \sqrt{n}$. This construction yields diffuse coefficients whose Euclidean norms remain bounded as $n$ grows. These coefficient vectors are fixed across the 1000 replications, while the target and source samples are regenerated in each replicate. In each replicate, we first compute $\hat{w}$ from the source sample and then obtain $\hat{β}$ by applying Algorithm 1, using the smoothed Huber loss (6) for both $\tilde{ρ}$ and $ρ$ with parameters $δ = 1.35$ and $η = 0.1$. We fix $τ = τ_{1} = 1$ and repeat each setup 1000 times.

Figure 1 presents boxplots of the estimation error $∥ \hat{β} - β_{0} ∥^{2}$ for cases I–III and $κ = 1, 4$. The red point in each boxplot marks the theoretical value $r_{ρ}^{2}$, obtained by numerically solving the system in Theorem 1 under the corresponding simulation specification. We observe that the empirical distribution of $∥ \hat{β} - β_{0} ∥^{2}$ is centered close to this value, and its dispersion decreases as $n$ and $p$ become larger. Table 1 shows the mean and standard deviation (SD) of $∥ \hat{β} - β_{0} ∥^{2}$ (denoted as $\hat{r}^{2}$) and the corresponding $r_{ρ}^{2}$ for each setup. As dimensionality increases, that is, as $κ$ increases from 1 to 4, both the mean error and its variability increase, indicating that estimation becomes more difficult in more challenging moderate–dimensional regimes. The average estimation error also grows with heavier-tailed errors, highlighting the difficulty of estimation under such conditions. Results under case III demonstrate that Theorem 1 is effective in handling non-identical $x_{i}$’s and $ϵ_{i}$’s. Overall, the findings in Figure 1 and Table 1 align well with the theoretical predictions of Theorem 1.

4.2. Theoretical Estimation Error Curves

We take $ρ$ to be our recommended smoothed Huber loss (6) with $(δ, η) = (1.35, 0.1)$ and fix $κ = 1$. We numerically solve the associated scalar system while varying the discrepancy term $∥β_{0} - \hat{w}∥$ that enters the second equation of Theorem 1. Figure 2 plots the resulting curves of $r_{ρ}$ for five values of $τ$ under Cases I–III. In all three cases, the curves are monotonically increasing over the displayed range, so a larger $∥β_{0} - \hat{w}∥$ corresponds to a larger asymptotic estimation error in this setting.

4.3. Comparison with Existing Methods

To evaluate when transfer is beneficial, we compare our method with several competing procedures across the three scenarios described above. We set $p = 400$, $n = p$ and $n_{1} = 2 p$. We generate $β_{0} = β^{*} / ∥β^{*}∥$, where $β^{*} = (β_{1}^{*}, \dots, β_{p}^{*})^{⊤}$ has i.i.d. $\mathrm{Unif}(0, 1)$ entries. To control the transfer strength $h = ∥δ_{0}∥$, we set

$$δ_{0} = \exp(c_{d}) \cdot 1_{p} / \sqrt{p}, \quad c_{d} \in \{-2.0, -1.5, -1.0, -0.5, 0, 0.5, 1.0\},$$

and define the source coefficient by $w_{0} = β_{0} - δ_{0}$. Since $∥1_{p} / \sqrt{p}∥ = 1$, this construction gives $h = \exp(c_{d})$. By varying $c_{d}$ from $-2.0$ to $1.0$, we obtain h ranging from approximately $0.135$ to $2.718$, which covers a wide range of source-target similarity. For each value of $c_{d}$, the pair $(β_{0}, w_{0})$ is fixed across the 500 replications, and only the data are regenerated.
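The seven discrepancy levels follow directly from the construction of $δ_{0}$. The snippet below is a small check, with the helper names `discrepancy_vector` and `norm` chosen for illustration only, that the norm of $δ_{0}$ equals $\exp(c_{d})$ for $p = 400$.

```python
import math

c_d_grid = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0]

def discrepancy_vector(c_d, p):
    """delta_0 = exp(c_d) * 1_p / sqrt(p), so that ||delta_0|| = exp(c_d)."""
    return [math.exp(c_d) / math.sqrt(p)] * p

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(t * t for t in v))

h_values = [norm(discrepancy_vector(c, 400)) for c in c_d_grid]
print([round(h, 3) for h in h_values])
# prints [0.135, 0.223, 0.368, 0.607, 1.0, 1.649, 2.718]
```

These are the same h values that index the sensitivity tables in Appendix B.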

We compare four ridge-type methods (Single-RR, Trans-RR, Trans-RR-Ada, and Pooled-RR) and two lasso baselines (Single-Lasso and Trans-Lasso):

Single-RR: The single-task estimator ${\hat{β}}_{s t}$ in (11), fit to the target sample alone.
Trans-RR: The two-stage estimator $\hat{β} = \hat{w} + \hat{δ}$ in (5), computed by Algorithm 1.
Trans-RR-Ada: The adaptive aggregate ${\hat{β}}_{ada}$ in (12), computed by Algorithm 2 with $K = 5$, absolute-error loss $L(t) = |t|$, and weight grid $Θ = \{0, 0.1, \dots, 0.9, 1\}$.
Pooled-RR: The same robust ridge fit applied to the concatenation of the source and target samples.
Single-Lasso: The lasso on the target sample, with its regularization parameter chosen by five-fold cross-validation.
Trans-Lasso: The two-stage transfer-lasso of [8], in which a cross-validated source-stage lasso estimates $w_{0}$ and a cross-validated target-stage lasso fits the residual $y_{i} - x_{i}^{⊤} \hat{w}$ on the target sample.

The four ridge-type methods all use the smoothed Huber loss (6) with $(δ, η) = (1.35, 0.1)$. Each ridge penalty is tuned by five-fold cross-validation under the mean absolute error criterion, over the grid $\{3^{a} : a = -2, -3/2, \dots, 2\}$. For Trans-RR, the source penalty $τ_{1}$ in (3) is tuned on the source sample, yielding $\hat{w}$ as the minimizer of (3) at the tuned $τ_{1}$. With $\hat{w}$ held fixed, the target penalty $τ$ in (4) is then tuned on the target sample.
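The tuning protocol can be sketched as follows. The grid is taken from the text. The selection routine is only a generic K-fold skeleton: the callable `cv_error` is a hypothetical placeholder for "fit the robust ridge estimator on the training folds and return the mean absolute error on the validation fold", which is not reproduced here.

```python
import random

def ridge_grid():
    """Geometric grid {3^a : a = -2, -3/2, ..., 2}: nine values from 1/9 to 9."""
    return [3.0 ** (-2.0 + 0.5 * k) for k in range(9)]

def kfold_indices(n, k, rng):
    """Randomly partition range(n) into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[j::k] for j in range(k)]

def cv_select(grid, n, k, cv_error, rng):
    """Return the grid value with the smallest average validation error.

    cv_error(tau, train_idx, valid_idx) is a user-supplied (hypothetical)
    callable that fits on train_idx and scores on valid_idx.
    """
    folds = kfold_indices(n, k, rng)

    def average_error(tau):
        total = 0.0
        for j, valid in enumerate(folds):
            train = [i for m, f in enumerate(folds) if m != j for i in f]
            total += cv_error(tau, train, valid)
        return total / k

    return min(grid, key=average_error)

# Toy criterion standing in for a real fit: the error is smallest at tau = 1.
best_tau = cv_select(ridge_grid(), 10, 5,
                     lambda tau, train, valid: abs(tau - 1.0),
                     random.Random(0))
```

For Trans-RR the routine would be called twice, first on the source sample for $τ_{1}$ and then on the target sample for $τ$ with $\hat{w}$ held fixed, following the order described above.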

Performance is summarized by the relative estimation error $∥\hat{β} - β_{0}∥^{2} / ∥β_{0}∥^{2}$. Figure 3 presents boxplots of these errors on a logarithmic scale, for varying values of h across Cases I–III. We report all six estimators under Case I. In Cases II and III, the noise is heavy-tailed, and the lasso methods are not designed for robustness. Their fits failed to converge to stable estimates, so we restrict these two cases to the ridge-type methods.

Under Case I, Trans-RR compares favorably with both lasso baselines, consistent with the non-sparse structure assumed throughout the paper. Among the ridge-type methods, Trans-RR outperforms Pooled-RR for small and moderate h across all three cases. Pooled-RR fits the source and target observations together as if they shared one coefficient, so its error reflects two mismatches: the gap between $β_{0}$ and $w_{0}$, and the difference between source and target noise levels. As h decreases, the gap between Pooled-RR and Trans-RR narrows, in line with the theoretical expectation that smaller h indicates greater similarity between the two domains.

More importantly, the comparison with Single-RR reveals the transition between positive and negative transfer. When h is small, Trans-RR achieves the lowest relative error among the ridge-type methods. As h grows, its advantage shrinks and eventually reverses. In our experiments, this turnover occurs near $h = 1$. This is consistent with the numerical evidence in Section 3, since transfer is more favorable when the source-stage estimator is closer to the target coefficient. The same qualitative pattern appears in Cases II and III, where the transition near $h = 1$ persists under heavy-tailed errors and heterogeneous designs.

Trans-RR-Ada provides a data-driven safeguard against this negative transfer. Figure 3 shows that it tracks the better of Single-RR and Trans-RR for every value of h. For small h, Trans-RR-Ada essentially coincides with Trans-RR, while for large h it coincides with Single-RR. At the transition $h = 1$, where Single-RR and Trans-RR are comparable, Trans-RR-Ada stays close to the better one.
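The weight selection behind this safeguard can be illustrated with a short sketch. Algorithm 2 is not reproduced in this section, so the following is an assumption-laden simplification: it takes held-out predictions from the two estimators as given (how they are refit across folds is left out) and picks the weight on the grid $Θ$ that minimizes the mean absolute validation error. The function name `select_theta` is hypothetical. Because the model is linear, the prediction of the aggregate $θ \hat{β} + (1 - θ) {\hat{β}}_{st}$ is the same convex combination of the two predictions.

```python
def select_theta(y, pred_trans, pred_single, thetas=None):
    """Pick the mixing weight minimizing mean |y - (theta*trans + (1-theta)*single)|.

    y, pred_trans, pred_single: held-out responses and the corresponding
    predictions from Trans-RR and Single-RR (assumed to be precomputed).
    """
    if thetas is None:
        thetas = [k / 10 for k in range(11)]   # grid {0, 0.1, ..., 1}

    def mean_abs_error(theta):
        return sum(abs(yi - (theta * a + (1.0 - theta) * b))
                   for yi, a, b in zip(y, pred_trans, pred_single)) / len(y)

    return min(thetas, key=mean_abs_error)

# When the transfer predictions are exact, all weight goes to Trans-RR.
y_valid = [1.0, 2.0, 3.0]
theta_hat = select_theta(y_valid, pred_trans=y_valid, pred_single=[0.0, 0.0, 0.0])
```

A weight near 1 reproduces Trans-RR and a weight near 0 reproduces Single-RR, which matches the behavior described for small and large h.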

4.4. Robustness to Non-Identity Covariance

The asymptotic theory in Section 3 assumes $\mathrm{cov}(x_{i}) = I_{p}$. To examine whether our methods remain effective under non-identity covariance, we re-run the comparison of Section 4.3 under the AR(1) covariance $Σ_{jk} = ρ^{|j - k|}$ with $ρ = 0.6$ (here $ρ$ denotes the autocorrelation parameter, not the loss function), across all three noise cases I, II, and III. All other settings, including the four ridge-type methods (Single-RR, Trans-RR, Trans-RR-Ada, and Pooled-RR) and the tuning protocol, match Section 4.3.
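For concreteness, an AR(1) design with this covariance can be generated without any matrix factorization by using the stationary autoregressive recursion. The sketch below is illustrative: the function names are hypothetical, and whether the scale factors $λ_{i}$ are applied on top of the correlated rows is not stated in this section, so they are omitted.

```python
import math
import random

def ar1_cov(p, rho):
    """Covariance matrix with entries rho^{|j-k|}."""
    return [[rho ** abs(j - k) for k in range(p)] for j in range(p)]

def ar1_row(p, rho, rng):
    """One Gaussian row with covariance rho^{|j-k|}.

    Uses x_1 ~ N(0, 1) and x_j = rho * x_{j-1} + sqrt(1 - rho^2) * z_j,
    which keeps every coordinate at unit variance.
    """
    x = [rng.gauss(0.0, 1.0)]
    innovation_sd = math.sqrt(1.0 - rho * rho)
    for _ in range(1, p):
        x.append(rho * x[-1] + innovation_sd * rng.gauss(0.0, 1.0))
    return x

Sigma = ar1_cov(4, 0.6)
row = ar1_row(4, 0.6, random.Random(1))
```

With $ρ = 0.6$, neighboring predictors have correlation 0.6 and predictors three positions apart have correlation $0.6^{3} = 0.216$.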

Figure 4 reports the resulting boxplots. The qualitative pattern of the comparison in Section 4.3 persists in all three cases. For small h, Trans-RR achieves the lowest error among the ridge-type methods. As h grows, its advantage shrinks and eventually reverses, with the transition again occurring near $h = 1$. Trans-RR-Ada tracks the better of Single-RR and Trans-RR across all values of h. The negative-transfer transition and the safeguard role of Trans-RR-Ada are therefore not specific to the identity–covariance assumption underlying Theorem 1. This suggests that Trans-RR may remain effective under non-identity covariance.

4.5. Sensitivity to Tuning Choices

The comparison in Figure 3 fixes the smoothed Huber loss with $(δ, η) = (1.35, 0.1)$ and selects each ridge penalty by five-fold cross-validation under the mean absolute error criterion, on a common geometric grid of nine values from $1/9$ to 9. This subsection examines whether the qualitative findings of that comparison are stable under perturbations to four tuning choices: the smoothed Huber parameters $(δ, η)$, the cross-validation criterion, the robust loss family, and the ridge penalty grid. Each perturbation re-runs the full simulation with $M = 500$ replications per setting.

4.5.1. Choice of $(δ, η)$

The default $δ = 1.35$ is the standard Huber tuning. With the smoothing parameter $η = 0.1$, the smoothed loss closely approximates the unsmoothed Huber loss while keeping $ρ_{η}$ twice continuously differentiable, as required by Assumption 1(b). In the sensitivity experiment, we vary $δ$ over $\{1.0, 1.35, 2.0\}$ and $η$ over $\{0.05, 0.10, 0.20\}$, giving nine $(δ, η)$ pairs, and re-run the full simulation across the three cases and seven discrepancy values h for every pair. Within each (case, h) combination, varying the $(δ, η)$ pair over the nine settings changes each method’s mean error by less than $7\%$ (median $1.5\%$) of its mean. The four-method ranking matches the default $(1.35, 0.10)$ ranking in 171 of the $3 \times 7 \times 9 = 189$ (case, h, $(δ, η)$) cells. The qualitative findings are therefore stable under this perturbation. Table 2 displays the four method means as a $(δ, η)$ heatmap at $h = 1$ in each of the three cases. Within each $3 \times 3$ grid, the entries vary only slightly and the four-method ordering is the same in every cell, illustrating the two findings stated above. Appendix B reports the heatmaps at the other six values of h, where the same pattern holds.

4.5.2. Choice of Cross-Validation Criterion

The default selects each ridge penalty by five-fold cross-validation under the mean absolute error criterion. We re-run the simulation with every cross-validation loss changed to mean squared error, keeping the smoothed Huber loss for estimation. Table 3 compares the mean estimation errors under the two criteria across all cases and values of h.

Under Case I, the two criteria yield nearly identical mean errors and the same four-method ranking at every h. Under Cases II and III, the mean errors of all four methods become substantially larger under MSE-based cross-validation, and two qualitative findings of Section 4.3 no longer hold. First, Trans-RR has a larger mean error than Single-RR at every h in both heavy-tailed cases, so the range on which transfer helps disappears entirely. Second, Trans-RR-Ada no longer approximates min(Single-RR, Trans-RR) in most settings of the heavy-tailed cases. This may result from the heavy tails of Cauchy errors, under which MSE-based cross-validation is more sensitive to extreme residuals than MAE-based cross-validation. MAE-CV is therefore the appropriate default for heavy-tailed errors, and the conclusions of Section 4.3 are specific to this choice.
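The difference in sensitivity between the two criteria is easy to see on a toy set of validation residuals. The numbers below are made up for illustration and are not taken from the simulations; they only show how a single extreme residual, of the kind Cauchy errors produce, moves each criterion.

```python
def mae(residuals):
    """Mean absolute error of a list of residuals."""
    return sum(abs(r) for r in residuals) / len(residuals)

def mse(residuals):
    """Mean squared error of a list of residuals."""
    return sum(r * r for r in residuals) / len(residuals)

clean = [0.5, -0.5, 1.0, -1.0]
contaminated = clean + [100.0]          # one extreme validation residual

mae_inflation = mae(contaminated) / mae(clean)   # 20.6 / 0.75, roughly 27x
mse_inflation = mse(contaminated) / mse(clean)   # 2000.5 / 0.625, roughly 3200x
```

Under MSE the single outlier dominates the criterion, so the selected penalty can be driven by a handful of validation points rather than by the bulk of the data.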

4.5.3. Choice of Robust Loss

To assess the sensitivity to the specific form of the robust loss, we replace the smoothed Huber loss in (6) with the pseudo-Huber loss

$$ρ_{P}(t; δ) = δ^{2} \left(\sqrt{1 + (t / δ)^{2}} - 1\right)$$

at the same $δ = 1.35$. Table 4 compares the mean estimation errors under the two losses across all cases and values of h.
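The pseudo-Huber loss and its derivative are simple to code directly from the formula. The sketch below uses hypothetical function names. It also makes the two regimes visible: the loss behaves like $t^{2} / 2$ near zero and grows like $δ |t|$ for large residuals, so its derivative is bounded by $δ$.

```python
import math

def pseudo_huber(t, delta=1.35):
    """rho_P(t; delta) = delta^2 * (sqrt(1 + (t/delta)^2) - 1)."""
    return delta ** 2 * (math.sqrt(1.0 + (t / delta) ** 2) - 1.0)

def pseudo_huber_psi(t, delta=1.35):
    """Derivative of rho_P in t: t / sqrt(1 + (t/delta)^2), bounded by delta."""
    return t / math.sqrt(1.0 + (t / delta) ** 2)

small = pseudo_huber(1e-3)        # close to t^2 / 2 = 5e-7
slope = pseudo_huber_psi(1e6)     # close to delta = 1.35
```

The bounded derivative is what limits the influence of heavy-tailed errors on the fit.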

The mean errors are nearly identical to those under the smoothed Huber loss across all 21 settings. The four-method ranking matches the default in nearly every setting, and the few mismatches involve methods whose mean errors are essentially tied. The negative-transfer transition stays at the same value of h in every case, and Trans-RR-Ada continues to track min(Single-RR, Trans-RR) as closely as under the default. The qualitative findings of Section 4.3 are insensitive to this choice, consistent with Theorem 1.

4.5.4. Choice of Ridge Penalty Grid

The four ridge penalties $τ_{1}$, $τ$, $τ_{st}$, and $τ_{p}$ are tuned by five-fold cross-validation over a common geometric grid of nine values from $1/9$ to 9. We probe the sensitivity to this grid in two complementary ways: extending the grid to wider values, and forcing all four penalties to a common fixed value with no cross-validation.

We first extend the grid to thirteen values from $1/27$ to 27 on the same geometric scale, which contains the default grid as a strict subset. Table 5 compares the mean estimation errors under the default and the wide grid across all cases and values of h. The impact on estimation is negligible: each method’s mean error barely changes across settings, the four-method ranking is unchanged except at settings with near-tied means, the negative-transfer transition does not move, and Trans-RR-Ada continues to track min(Single-RR, Trans-RR) as closely as under the default. The default grid is therefore wide enough on both sides, and the qualitative findings of Section 4.3 do not depend on the choice of upper or lower endpoint.

We next force all four ridge penalties to a common fixed value $τ \in \{1/3, 1, 3, 9\}$, removing the ridge cross-validation entirely. Trans-RR-Ada’s mixing weight $θ$ is still selected by five-fold cross-validation on the target sample (Algorithm 2). Table 6 reports the mean estimation errors at the four fixed $τ$ values across all cases and values of h. At every fixed $τ$, the mean error of Trans-RR grows monotonically with h in all three cases, consistent with the theoretical risk curves of Figure 2. The absolute levels and the four-method ranking do vary substantially across the four $τ$ values, but the qualitative dependence on h predicted by the theory is preserved at every $τ$.

The Trans-RR-Ada safeguard nevertheless remains effective: its mean error stays close to min(Single-RR, Trans-RR) in every one of the 84 settings, just as under cross-validated penalties. The agreement between Trans-RR-Ada and min(Single-RR, Trans-RR) is therefore robust to a misspecified ridge penalty.

5. Real Data Analysis

We evaluate the proposed transfer procedure on the near-infrared (NIR) spectral dataset from the 2002 International Diffuse Reflectance Conference (IDRC) “Shootout” competition. The data consist of pharmaceutical tablet measurements collected by two spectrometers, with ASSAY as the response. For each instrument, the dataset contains a training sample of size 460 and an external test sample of size 155. Each spectrum is recorded at 650 wavelengths, yielding a moderate–dimensional regression problem.

Let X and $X_{1}$ denote the spectra from the two instruments. We consider two transfer directions. In Direction A, X is the target domain and $X_{1}$ is the source domain. In Direction B, the roles are reversed. In each repetition, we randomly split the training sample of the target instrument into two parts of sizes 160 and 300. The subset of size 160 is used as the target training sample. The remaining 300 tablets are matched with their measurements from the other instrument to form the source training sample. When X is the target domain, the target fit is constructed from 160 observations in X, and the source fit is constructed from the corresponding 300 observations in $X_{1}$. The same scheme is applied symmetrically in the reverse direction. The external test sample of the target instrument is used for evaluation. We repeat this procedure 20 times to reduce Monte Carlo variability.
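The splitting scheme amounts to a random partition of the 460 training tablets. A minimal sketch, with a hypothetical function name and tablets identified by their row index, is:

```python
import random

def split_target_training(n_train=460, n_target=160, rng=None):
    """Randomly split tablet indices into a target part and a source part.

    The first group is used with the target instrument's spectra; the
    remaining tablets are used with the other instrument's spectra.
    """
    rng = rng or random.Random()
    idx = list(range(n_train))
    rng.shuffle(idx)
    target_idx = sorted(idx[:n_target])
    source_idx = sorted(idx[n_target:])
    return target_idx, source_idx

# 20 repetitions with different seeds, as in the repeated-split design.
splits = [split_target_training(rng=random.Random(seed)) for seed in range(20)]
```

Because the two index sets are disjoint, no tablet contributes to both the target fit and the source fit within a repetition.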

The preprocessing used in both transfer directions has two stages. First, we thin the spectrum by retaining every fourth wavelength of the original 650, yielding $p = 163$ predictors. Two considerations motivate this thinning. Adjacent NIR wavelengths record highly overlapping absorption signals, so a four-fold thinning preserves nearly all the spectral information; this is standard practice in NIR chemometrics. A step of four also gives $p = 163 \approx n = 160$, the same $p / n \approx 1$ setting as in our simulations.
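The thinning step is a strided selection of wavelengths. The sketch below assumes 0-based indexing and a starting offset of 0; the function name `thin` and the `offset` argument are illustrative.

```python
def thin(spectrum, step=4, offset=0):
    """Keep every `step`-th wavelength, starting at position `offset`."""
    return spectrum[offset::step]

wavelength_positions = list(range(650))     # stand-in for one 650-point spectrum
kept = thin(wavelength_positions)           # positions 0, 4, 8, ..., 648
p = len(kept)                               # 163
```

Starting at position 0, the retained positions are $0, 4, \dots, 648$, which gives $648 / 4 + 1 = 163$ predictors.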

Second, for each domain, predictors are whitened using its own training sample,

$$\tilde{X} = (X - \hat{μ}) \hat{Σ}^{-1/2},$$

where $\hat{μ}$ and $\hat{Σ}$ are estimated from that training sample, and the same transformation is applied to the associated test sample. Whitening decorrelates the highly collinear NIR predictors so that the sample second-moment matrix of $\tilde{X}$ approximates $I_{p}$, which is the design assumption underlying Theorem 1. The response is centered using the mean of the corresponding training response, and the same centering constant is used for the associated test response. All preprocessing parameters are thus estimated only on training data and carried over to the test set, avoiding information leakage.

We compare six methods: Single-RR, Trans-RR, the adaptive Trans-RR estimator ${\hat{β}}_{ada}$ from Section 3.3 (denoted Trans-RR-Ada in the table), Pooled-RR, Single-Lasso, and Trans-Lasso. For the ridge-type methods, we use the smoothed Huber loss in (6) with $(δ, η) = (1.35, 0.1)$. All tuning parameters are selected by five-fold cross-validation with mean absolute error as the validation criterion, from the common grid

$$G = \{10^{-5}, 10^{-4.5}, 10^{-4}, \dots, 10^{1}\}.$$

Performance is measured by the test-set RMSE,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} (y_{i} - \hat{y}_{i})^{2}},$$

where $y_{i}$ and $\hat{y}_{i}$ are on the original ASSAY scale (predictions are uncentered before evaluation) and $n_{test} = 155$ in both transfer directions.
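The evaluation step, including the uncentering of predictions, can be written as a short function. This is a sketch with hypothetical argument names: `pred_centered` holds predictions on the centered scale and `train_mean` is the training-response mean used for centering.

```python
import math

def rmse(y_true, pred_centered, train_mean):
    """Test-set RMSE on the original scale.

    Predictions are made on the centered response, so the training mean
    is added back before comparing with the raw test responses.
    """
    y_hat = [v + train_mean for v in pred_centered]
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_hat)]
    return math.sqrt(sum(squared) / len(y_true))

# Toy check: a constant prediction at the training mean, errors of +1 and -1.
example = rmse([10.0, 12.0], [0.0, 0.0], 11.0)   # sqrt((1 + 1) / 2) = 1.0
```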

Table 7 reports the average RMSE and its standard deviation over the 20 repetitions, and Figure 5 displays the corresponding distributional information. Trans-RR achieves the smallest mean RMSE in both transfer directions, with relatively small variability across the 20 splits. The adaptive variant Trans-RR-Ada matches Trans-RR within $0.03$ in mean RMSE in both directions, indicating that the source and target are close enough for transfer to be uniformly beneficial on this dataset. Pooled-RR ranks third behind the two transfer-ridge methods and is uniformly dominated by Trans-RR, suggesting that naive pooling fails to account for cross-instrument differences. The lasso methods are less competitive overall, especially Single-Lasso. This is consistent with the non-sparse structure of NIR spectra, where relevant information is distributed over broad wavelength regions rather than concentrated on a small subset of predictors.

We further checked the procedure for four starting offsets of the every-fourth-wavelength thinning and for the same procedure without whitening. Across all five resulting preprocessing variants and both transfer directions, Trans-RR and Trans-RR-Ada are the top two methods by mean RMSE, and their mean RMSE values differ by at most $0.05$. Mean RMSE for all six methods in each variant and direction is reported in Appendix C. The results, together with these robustness checks, therefore support the effectiveness of the proposed Trans-RR method.

6. Discussion

This paper introduces Trans-RR, a robust transfer-learning approach for moderate–dimensional linear regression. It extends transfer-learning ideas of [7,8] to a setting with non-sparse coefficients and heavy-tailed errors, without relying on sparsity assumptions or moment restrictions on the errors. The theory and the numerical results show that negative transfer can occur when the source study is not sufficiently informative for the target study. To guard against this, we also propose an adaptive aggregation of Trans-RR with the single-task estimator that selects the mixing weight by cross-validation. The theoretical results, simulation studies, and real-data analysis support the effectiveness of both procedures. A natural direction for further study is to identify the choice of loss function that minimizes the asymptotic estimation error in our framework.

Author Contributions

Conceptualization, L.L. and X.G.; Methodology, L.L., X.G. and Z.L.; Software, L.L.; Validation, L.L.; Formal analysis, L.L.; Data curation, L.L.; Writing—original draft, L.L.; Writing—review & editing, L.L., X.G. and Z.L.; Supervision, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 12471267.

Data Availability Statement

Publicly available datasets were analyzed in this study. The near-infrared spectral data are from the 2002 International Diffuse Reflectance Conference (IDRC) “Shootout” competition and can be downloaded from https://www.eigenvector.com/data/tablets/index.html, accessed on 7 May 2026. The code presented in this study is available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Notation

For ease of reference, Table A1 collects the main symbols used throughout the paper, grouped by topic. Additional proof-level notation is introduced at the beginning of Appendix D.

Table A1. Summary of the main symbols used in the paper.

Symbol	Meaning
Dimensions and asymptotic regime
p, n, $n_{1}$	predictor dimension; target and source sample sizes
$κ, κ_{1} \in (0, \infty)$	limits of $p / n$ and $p / n_{1}$
Target study (1) ( $i = 1, \dots, n$ ) and source study (2) ( $i = 1, \dots, n_{1}$ )
$x_{i} = λ_{i} X_{i}$ , $x_{i}^{(1)} = λ_{i}^{(1)} X_{i}^{(1)}$	i-th target/source predictor, with components $X_{i}, X_{i}^{(1)} \in R^{p}$ and scales $λ_{i}, λ_{i}^{(1)} \in R$ (Assumptions 1(c) and 2(c))
$y_{i}, ϵ_{i}$ , $y_{i}^{(1)}, ϵ_{i}^{(1)}$	target/source response and error ( $ϵ_{i}$ may be heavy-tailed)
$β_{0}, w_{0} \in R^{p}$	target/source regression coefficient (non-sparse)
$ρ, ψ = ρ^{'}$ , $\tilde{ρ}, \tilde{ψ} = {\tilde{ρ}}^{'}$	target-stage/source-stage loss and its derivative
$τ, τ_{s t}, τ_{1} > 0$	target-stage ridge for Trans-RR and for ${\hat{β}}_{s t}$ , source-stage ridge for $\hat{w}$
$\hat{w}$ , $\hat{δ}$ , $\hat{β} = \hat{w} + \hat{δ}$ , ${\hat{β}}_{s t}$	source-stage, Step 2, Trans-RR (5), and single-task (11) estimators
Source–target discrepancy
$δ_{0} = β_{0} - w_{0}$ , $h = ∥ δ_{0} ∥$	discrepancy vector and its magnitude
Adaptive aggregation (Section 3.3)
${\hat{β}}_{ada} (θ) = θ \hat{β} + (1 - θ) {\hat{β}}_{s t}$ , ${\hat{β}}_{ada}$	adaptive Trans-RR estimator and its instance at $θ = \hat{θ}$ , (12)
$θ \in [0, 1]$ , $\hat{θ}$ , $Θ \subset [0, 1]$	mixing weight, its cross-validation choice, and candidate set
K, $\{V_{k}\}_{k = 1}^{K}$ , $L : R \to R_{\geq 0}$	fold count, target-sample fold partition, validation loss in (13)
Smoothed Huber family (choices for $ρ$ and $\tilde{ρ}$ )
$δ > 0$ , $η \in (0, δ]$	Huber transition and smoothing parameters (scalar $δ$ distinct from vector $δ_{0}$ )
$ρ_{H}$ , $ρ_{η}$ , $ρ_{P}$	Huber, smoothed Huber (6), and Pseudo-Huber (8) losses
Asymptotic quantities (Theorem 1)
$r_{ρ} (κ), c_{ρ} (κ)$	asymptotic value of $∥ \hat{β} - β_{0} ∥$ and companion scalar in the fixed-point system
$Z_{i}$ , $W_{i} = ϵ_{i} + r_{ρ} (κ) λ_{i} Z_{i}$	standard normal (independent of $ϵ_{i}, λ_{i}$ ) and auxiliary random variable
$prox (c ρ)$	proximal mapping: $prox (c ρ) (x) = \arg\min_{y} \{c ρ (y) + (x - y)^{2} / 2\}$
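As an illustration of the proximal mapping in the last row of Table A1, the following sketch evaluates it numerically for the pseudo-Huber loss, chosen here only because its derivative has a closed form. For a convex differentiable $ρ$, the minimizer y solves $y + c ψ(y) = x$ with $ψ = ρ^{'}$, and the left-hand side is increasing in y, so bisection applies. The function names and the fixed iteration count are assumptions of the example.

```python
import math

def psi(t, delta=1.35):
    """Derivative of the pseudo-Huber loss: t / sqrt(1 + (t/delta)^2)."""
    return t / math.sqrt(1.0 + (t / delta) ** 2)

def prox(x, c, delta=1.35, iterations=100):
    """prox(c*rho)(x): solve y + c*psi(y) = x by bisection.

    The root lies between 0 and x because psi has the same sign as its
    argument, so y + c*psi(y) is 0 at y = 0 and at least |x| in size at y = x.
    """
    lo, hi = (0.0, x) if x >= 0 else (x, 0.0)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid + c * psi(mid, delta) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

y = prox(2.0, 1.5)   # shrinks 2.0 toward zero
```

The output always lies between 0 and x, which reflects the shrinkage performed by the proximal mapping.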

Appendix B. Additional (δ, η) Heatmaps

This appendix reports the $(δ, η)$ heatmaps at the six values of h not shown in Table 2 of the main text. The pattern at every h matches the one at $h = 1$. Each method’s mean error varies by less than $7\%$ across the nine $(δ, η)$ pairs, and the four-method ranking matches the default $(1.35, 0.10)$ ranking in the large majority of cells. The few rank swaps that do occur happen at small h in Cases II and III, where Pooled-RR, Trans-RR, and Trans-RR-Ada have nearly tied mean errors.

Table A2. Sensitivity of relative estimation error to the smoothed Huber parameters $(δ, η)$ at $h = 0.135$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications.

Case I (Gaussian Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.653/0.560/0.561/0.604	0.654/0.561/0.562/0.604	0.656/0.562/0.563/0.606
$1.35$	0.644/0.549/0.550/0.600	0.645/0.550/0.550/0.600	0.646/0.551/0.551/0.601
$2.00$	0.641/0.537/0.538/0.594	0.640/0.537/0.539/0.594	0.641/0.538/0.539/0.593
Case II (Cauchy Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.884/0.819/0.821/0.809	0.884/0.819/0.821/0.808	0.884/0.819/0.821/0.808
$1.35$	0.883/0.817/0.819/0.812	0.883/0.816/0.819/0.811	0.883/0.816/0.818/0.811
$2.00$	0.890/0.821/0.823/0.822	0.889/0.821/0.823/0.821	0.889/0.821/0.823/0.821
Case III (Mixture Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.779/0.695/0.694/0.706	0.780/0.695/0.695/0.706	0.780/0.695/0.695/0.706
$1.35$	0.781/0.693/0.692/0.705	0.781/0.693/0.692/0.706	0.781/0.692/0.692/0.705
$2.00$	0.794/0.701/0.701/0.714	0.793/0.700/0.701/0.714	0.792/0.700/0.700/0.713

Table A3. Sensitivity of relative estimation error to the smoothed Huber parameters $(δ, η)$ at $h = 0.223$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications.

Case I (Gaussian Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.653/0.576/0.577/0.634	0.654/0.577/0.578/0.635	0.656/0.578/0.579/0.635
$1.35$	0.644/0.565/0.565/0.628	0.645/0.565/0.566/0.628	0.646/0.566/0.567/0.629
$2.00$	0.641/0.554/0.556/0.624	0.640/0.555/0.556/0.624	0.641/0.555/0.556/0.624
Case II (Cauchy Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.884/0.828/0.830/0.824	0.884/0.828/0.830/0.825	0.884/0.828/0.831/0.825
$1.35$	0.883/0.826/0.829/0.829	0.883/0.826/0.829/0.829	0.883/0.825/0.828/0.828
$2.00$	0.890/0.829/0.832/0.838	0.889/0.830/0.832/0.837	0.889/0.829/0.831/0.836
Case III (Mixture Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.779/0.709/0.709/0.729	0.780/0.709/0.709/0.729	0.780/0.710/0.709/0.730
$1.35$	0.781/0.708/0.707/0.728	0.781/0.707/0.707/0.729	0.781/0.707/0.706/0.728
$2.00$	0.794/0.715/0.715/0.737	0.793/0.715/0.715/0.737	0.792/0.714/0.714/0.736

Table A4. Sensitivity of relative estimation error to the smoothed Huber parameters $(δ, η)$ at $h = 0.368$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications.

Case I (Gaussian Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.653/0.598/0.600/0.683	0.654/0.599/0.601/0.684	0.656/0.601/0.602/0.686
$1.35$	0.644/0.588/0.589/0.678	0.645/0.588/0.589/0.678	0.646/0.589/0.591/0.678
$2.00$	0.641/0.579/0.580/0.677	0.640/0.579/0.580/0.677	0.641/0.579/0.580/0.677
Case II (Cauchy Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.884/0.842/0.844/0.854	0.884/0.842/0.844/0.854	0.884/0.843/0.846/0.853
$1.35$	0.883/0.840/0.843/0.855	0.883/0.840/0.843/0.855	0.883/0.839/0.842/0.856
$2.00$	0.890/0.843/0.847/0.864	0.889/0.843/0.847/0.864	0.889/0.842/0.846/0.863
Case III (Mixture Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.779/0.729/0.729/0.769	0.780/0.729/0.730/0.769	0.780/0.729/0.730/0.770
$1.35$	0.781/0.728/0.728/0.770	0.781/0.727/0.727/0.770	0.781/0.727/0.727/0.770
$2.00$	0.794/0.737/0.738/0.777	0.793/0.736/0.737/0.778	0.792/0.735/0.736/0.777

Table A5. Sensitivity of relative estimation error to the smoothed Huber parameters $(δ, η)$ at $h = 0.607$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications.

Case I (Gaussian Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.653/0.625/0.626/0.767	0.654/0.626/0.627/0.767	0.656/0.627/0.629/0.768
$1.35$	0.644/0.615/0.616/0.764	0.645/0.616/0.617/0.764	0.646/0.617/0.618/0.764
$2.00$	0.641/0.607/0.608/0.764	0.640/0.607/0.608/0.764	0.641/0.607/0.609/0.764
Case II (Cauchy Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.884/0.862/0.864/0.897	0.884/0.862/0.864/0.897	0.884/0.863/0.865/0.897
$1.35$	0.883/0.860/0.863/0.898	0.883/0.860/0.862/0.898	0.883/0.859/0.862/0.898
$2.00$	0.890/0.865/0.868/0.903	0.889/0.864/0.868/0.903	0.889/0.864/0.868/0.902
Case III (Mixture Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.779/0.755/0.755/0.835	0.780/0.755/0.755/0.835	0.780/0.755/0.755/0.836
$1.35$	0.781/0.754/0.754/0.834	0.781/0.754/0.754/0.834	0.781/0.753/0.754/0.834
$2.00$	0.794/0.764/0.764/0.841	0.793/0.763/0.764/0.840	0.792/0.762/0.762/0.840

Table A6. Sensitivity of relative estimation error to the smoothed Huber parameters $(δ, η)$ at $h = 1.649$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications.

Case I (Gaussian Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.653/0.797/0.653/0.987	0.654/0.795/0.654/0.987	0.656/0.796/0.657/0.987
$1.35$	0.644/0.792/0.645/0.990	0.645/0.791/0.645/0.990	0.646/0.792/0.646/0.990
$2.00$	0.641/0.788/0.641/0.999	0.640/0.789/0.641/0.999	0.641/0.787/0.641/0.998
Case II (Cauchy Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.884/1.024/0.888/1.005	0.884/1.023/0.888/1.005	0.884/1.023/0.888/1.005
$1.35$	0.883/1.021/0.888/1.005	0.883/1.021/0.887/1.005	0.883/1.020/0.887/1.005
$2.00$	0.890/1.029/0.894/1.006	0.889/1.027/0.893/1.006	0.889/1.026/0.893/1.006
Case III (Mixture Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.779/0.934/0.781/0.997	0.780/0.935/0.781/0.996	0.780/0.936/0.781/0.996
$1.35$	0.781/0.943/0.783/0.998	0.781/0.941/0.783/0.998	0.781/0.940/0.783/0.998
$2.00$	0.794/0.955/0.796/1.003	0.793/0.954/0.795/1.003	0.792/0.952/0.794/1.002

Table A7. Sensitivity of relative estimation error to the smoothed Huber parameters $(δ, η)$ at $h = 2.718$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications.

Case I (Gaussian Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.653/1.617/0.653/1.231	0.654/1.618/0.654/1.228	0.656/1.620/0.656/1.228
$1.35$	0.644/1.637/0.644/1.262	0.645/1.639/0.645/1.261	0.646/1.635/0.646/1.254
$2.00$	0.641/1.617/0.641/1.312	0.640/1.613/0.640/1.309	0.641/1.611/0.641/1.304
Case II (Cauchy Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.884/1.882/0.884/1.118	0.884/1.882/0.884/1.119	0.884/1.890/0.884/1.118
$1.35$	0.883/1.902/0.883/1.122	0.883/1.898/0.883/1.123	0.883/1.900/0.883/1.121
$2.00$	0.890/1.873/0.890/1.137	0.889/1.869/0.889/1.139	0.889/1.864/0.889/1.136
Case III (Mixture Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.779/1.975/0.779/1.169	0.780/1.971/0.780/1.171	0.780/1.993/0.780/1.169
$1.35$	0.781/1.992/0.781/1.183	0.781/1.985/0.781/1.181	0.781/1.971/0.781/1.178
$2.00$	0.794/1.999/0.794/1.209	0.793/2.001/0.793/1.208	0.792/2.009/0.792/1.205

Appendix C. Robustness Checks for the Real-Data Analysis

In Section 5, we report robustness checks across two preprocessing variants: the four starting offsets of the every-fourth-wavelength thinning, and the same procedure without the whitening step (predictors still mean-centered per instrument). Table A8 reports the per-offset mean RMSE and standard deviation; Table A9 compares the whitened and the unwhitened analysis. Each entry reports the mean RMSE (with standard deviation in parentheses) over the 20 repeated random splits.

Table A8. Per-offset prediction performance on the NIR spectral dataset. The four offsets correspond to the four starting indices of the every-fourth-wavelength thinning, each yielding $p = 163$ predictors.

Direction A: Target = X
Method	Offset 0	Offset 1	Offset 2	Offset 3
Trans-RR	4.6230 (0.1732)	4.5555 (0.1619)	4.7143 (0.1660)	4.5511 (0.1149)
Trans-RR-Ada	4.6294 (0.1861)	4.5569 (0.1703)	4.7155 (0.1602)	4.5669 (0.1202)
Pooled-RR	5.0952 (0.0812)	4.9644 (0.1462)	5.0982 (0.0971)	4.9367 (0.0769)
Trans-Lasso	5.5668 (0.3650)	5.5799 (0.5213)	5.5239 (0.2604)	5.3689 (0.3004)
Single-RR	6.2666 (2.2628)	6.3039 (2.0647)	6.4540 (2.4723)	6.4131 (2.3509)
Single-Lasso	8.0672 (2.8904)	8.1289 (3.5097)	8.5205 (4.3280)	8.7014 (3.3436)
Direction B: Target = $X_{1}$
Method	Offset 0	Offset 1	Offset 2	Offset 3
Trans-RR	4.7933 (0.2736)	4.7063 (0.1320)	4.6329 (0.2049)	4.6457 (0.1895)
Trans-RR-Ada	4.8211 (0.3757)	4.7405 (0.1403)	4.6222 (0.1583)	4.6447 (0.1779)
Pooled-RR	5.4909 (0.1335)	5.2180 (0.1444)	5.4916 (0.1550)	5.1584 (0.1728)
Trans-Lasso	5.6803 (0.3131)	5.9048 (0.5452)	5.5392 (0.4612)	5.6729 (0.3458)
Single-RR	6.8272 (2.2674)	6.9625 (2.5627)	6.6887 (2.2085)	6.6730 (2.4256)
Single-Lasso	8.5607 (3.3739)	9.0675 (3.7103)	7.8504 (2.7262)	7.9018 (2.8987)

Table A9. Comparison of mean RMSE between the whitened analysis (default) and the analysis without whitening, in both transfer directions.

	Direction A		Direction B
Method	Whitened	Unwhitened	Whitened	Unwhitened
Trans-RR	4.6230 (0.1732)	4.6281 (0.2237)	4.7933 (0.2736)	4.9253 (0.2064)
Trans-RR-Ada	4.6294 (0.1861)	4.6225 (0.2182)	4.8211 (0.3757)	4.9386 (0.1926)
Pooled-RR	5.0952 (0.0812)	5.0615 (0.0959)	5.4909 (0.1335)	5.0178 (0.1401)
Trans-Lasso	5.5668 (0.3650)	4.8776 (0.3020)	5.6803 (0.3131)	5.0792 (0.2623)
Single-RR	6.2666 (2.2628)	4.8845 (0.3038)	6.8272 (2.2674)	5.2476 (0.3971)
Single-Lasso	8.0672 (2.8904)	4.9933 (0.3261)	8.5607 (3.3739)	5.2374 (0.3037)

Appendix D. Assumptions

Notation. Let $polyLog (p)$ denote a power of $\log (p)$. Denote by $I_{m}$ the $m \times m$ identity matrix. When no confusion arises, we also write $I$ for $I_{p}$. Let $0_{m} \in R^{m}$ and $1_{m} \in R^{m}$ be the vectors of zeros and ones, respectively. For an $m \times m$ matrix $A = \{a_{i j}\}_{1 \leq i, j \leq m}$, denote by $λ_{\max} (A)$ and $λ_{\min} (A)$ the maximum and minimum eigenvalues of $A$, respectively. The $L_{2}$ norm of $A$ is defined as $∥ A ∥ = \{λ_{\max} (A^{⊤} A)\}^{1 / 2}$. We call $\hat{Σ} = n^{- 1} \sum_{i = 1}^{n} x_{i} x_{i}^{⊤}$ the sample covariance matrix of the $x_{i}$’s when the $x_{i}$’s are known to have zero mean. We say that $X \leq Y$ in $L_{k}$ if $E (| X |^{k}) \leq E (| Y |^{k})$. We use the notation $(a, b)$ to represent either the interval $(a, b)$ or $(b, a)$, as in some cases we need to localize quantities within intervals defined by two values, a and b, without knowing in advance whether $a < b$ or $b < a$. We denote by $X$ the $n \times p$ design matrix whose i-th row is $x_{i}^{⊤}$, and by $X_{(i)}$ the matrix obtained by removing the i-th row of $X$. We write $a \land b$ for $\min (a, b)$ and $a \lor b$ for $\max (a, b)$. For two symmetric matrices A and B, $A ⪰ B$ means that $A - B$ is positive semi-definite. For a random variable W, we define $∥ W ∥_{L_{k}} = \{E (| W |^{k})\}^{1 / k}$. For sequences of random variables $W_{n}, Z_{n}$, we use the notation $W_{n} = O_{L_{k}} (Z_{n})$ and $W_{n} = o_{L_{k}} (Z_{n})$ when $∥ W_{n} ∥_{L_{k}} = O (∥ Z_{n} ∥_{L_{k}})$ and $∥ W_{n} ∥_{L_{k}} = o (∥ Z_{n} ∥_{L_{k}})$, respectively. For a vector $v = (v_{1}, \dots, v_{m})^{⊤}$, the $L_{2}$ norm is $∥ v ∥ = (\sum_{i = 1}^{m} v_{i}^{2})^{1 / 2}$, whereas $∥ v ∥_{\infty} = \max_{1 \leq k \leq m} | v (k) |$, where $v (k) = v_{k}$ denotes the k-th coordinate of $v$. For a function f from $R$ to $R$, denote $∥ f ∥_{\infty} = \sup_{x \in R} | f (x) |$.

In this appendix, we refine the assumptions in the main text into the forms needed for the proof of Theorem 1. The proof is divided into three parts, and each part uses a slightly different set of technical conditions. In particular, the assumptions on

β_{0}

are specified below according to the needs of the different proof subsections, whereas the source-side conditions remain as in Assumption 2.

Assumptions under which the whole proof goes through

M1. $p / n \to κ \in (0, \infty)$ .
M2. There exist constants $C_{β}$ and $e > 1 / 3$ such that $∥ β_{0} ∥ \leq C_{β}$ and $∥ β_{0} ∥_{\infty} = O (n^{- e})$ .
M3. Suppose $ρ$ is an even function. Assume that $ψ = ρ^{'}$ is bounded and $ψ^{'}$ is Lipschitz and bounded. Moreover, we assume that $sign (ψ (x)) = sign (x)$ and that $ρ (x) \geq ρ (0) = 0$ for all $x \in R$ .
M4. Assume that there exist independent variables $λ_{i}$ ’s and $X_{i}$ ’s such that $x_{i} = λ_{i} X_{i}$ . Suppose that $X_{i}$ ’s are i.i.d. with independent entries, and they have mean $0_{p}$ and $cov (X_{i}) = I_{p}$ . Suppose there exist $c_{n}$ and $C_{n}$ that vary with n, where $1 / c_{n} = O (polyLog (n))$ and $C_{n}$ is bounded in n, such that for any convex 1-Lipschitz function G of $X_{i}$ , $P (| G (X_{i}) - m_{G} | > t) \leq C_{n} \exp (- c_{n} t^{2})$ holds for all $t > 0$ , where $m_{G}$ is the median of $G (X_{i})$ . We require the same assumption to hold for the columns of the $n \times p$ design matrix $X$ . Additionally, we assume that the coordinates of $X_{i}$ have moments of all orders, and the k-th moment of the entries of $X_{i}$ is assumed to be uniformly bounded independently of n and p for all k. Also, for any $1 \leq k \leq p$ , the vectors $Θ_{k} = (X_{1} (k), \dots, X_{n} (k))$ in $R^{n}$ satisfy: for any 1-Lipschitz (with respect to Euclidean norm) convex function G, if $m_{G (Θ_{k})}$ is a median of $G (Θ_{k})$ , for any $t > 0, P (| G (Θ_{k}) - m_{G (Θ_{k})} | > t) \leq C_{n} \exp (- c_{n} t^{2})$ , $C_{n}$ and $c_{n}$ can vary with n. As above, we assume that $1 / c_{n} = O (polyLog (n))$ .
M5. $λ_{i}$ ’s are independent, with $E (λ_{i}^{2}) = 1$ , $E (λ_{i}^{4})$ being bounded, and $\sup_{1 \leq i \leq n} | λ_{i} |$ growing at most like $C_{λ} {(\log n)}^{k}$ for some k. $λ_{i}$ ’s may have finitely many possible distributions.
M6. Suppose that $ϵ_{i}$ ’s are independent and independent of $X_{i}$ ’s and $λ_{i}$ ’s. They may have finitely many possible distributions, each with a density that is differentiable, symmetric, and unimodal. Furthermore, for any $r \in R$ , if $z \sim N (0, 1)$ , independent of $ϵ_{i}$ , $ϵ_{i} + r z$ has a differentiable density $f_{i, r}$ which is increasing on $(- \infty, 0)$ and decreasing on $(0, \infty)$ . $\lim_{x \to \infty} x f_{i} (x) = 0$ .
M7. $ϵ_{i}$ ’s can have different distributions. Similarly, $λ_{i}$ ’s can have different distributions. The fraction of occurrences for each possible combination of distributions for $(ϵ_{i}, λ_{i})$ has a limit as $n \to \infty$ .
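As a concrete illustration of Assumption M3, the following minimal sketch checks its conditions numerically on a grid. It assumes a pseudo-Huber loss as a stand-in for $ρ$; this choice, the scale `d = 1.35`, and the function names are illustrative assumptions and need not coincide with the smoothed Huber loss used in the main text.

```python
import numpy as np

def rho(x, d=1.35):
    """Pseudo-Huber loss (assumed stand-in for the paper's rho)."""
    return d**2 * (np.sqrt(1.0 + (x / d) ** 2) - 1.0)

def psi(x, d=1.35):
    """First derivative of rho; its absolute value stays below d."""
    return x / np.sqrt(1.0 + (x / d) ** 2)

def dpsi(x, d=1.35):
    """Second derivative of rho; takes values in (0, 1]."""
    return (1.0 + (x / d) ** 2) ** -1.5

x = np.linspace(-50.0, 50.0, 200001)
checks = {
    "rho is even": bool(np.allclose(rho(x), rho(-x))),
    "rho >= rho(0) = 0": bool(rho(0.0) == 0.0 and np.all(rho(x) >= 0.0)),
    "psi bounded": bool(np.max(np.abs(psi(x))) <= 1.35),
    "sign(psi(x)) = sign(x)": bool(np.array_equal(np.sign(psi(x)), np.sign(x))),
    "psi' bounded": bool(np.max(dpsi(x)) <= 1.0),
    # Grid slopes of psi' stay bounded, consistent with psi' being Lipschitz.
    "psi' Lipschitz on grid": bool(np.max(np.abs(np.diff(dpsi(x)) / np.diff(x))) < 1.0),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

A grid check of this kind is only a sanity test; for the pseudo-Huber loss each property can also be verified analytically from the closed forms of `psi` and `dpsi`.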

First part of the proof (Appendix E.3)

O1. $p / n \to κ \in (0, \infty)$ .
O2. $ρ$ is twice differentiable, convex, and non-linear. $ψ = ρ^{'}$ . Note that $ψ^{'} \geq 0$ since $ρ$ is convex. We assume that $sign (ψ (x)) = sign (x)$ and $ρ \geq 0 = ρ (0)$ .
O3. $\sup_{x} | ψ (x) | \leq C polyLog (n)$ and $∥ ψ^{2} ∥_{\infty} \leq C$ for some constant C. Furthermore, $ψ^{'}$ is assumed to be $L (n)$ -Lipschitz with $L (n) \leq C n^{α}, α \geq 0$ . We also assume that $∥ ψ^{'} ∥_{\infty} \leq C polyLog (n)$ .
O4. Assume that there exist independent variables $λ_{i}$ ’s and $X_{i}$ ’s such that $x_{i} = λ_{i} X_{i}$ . Suppose that $X_{i}$ ’s are i.i.d. with independent entries, and they have mean $0_{p}$ and $cov (X_{i}) = I_{p}$ . Suppose there exist $c_{n}$ and $C_{n}$ that vary with n, where $1 / c_{n} = O (polyLog (n))$ and $C_{n}$ is bounded in n, such that for any convex 1-Lipschitz function G of $X_{i}$ , $P (| G (X_{i}) - m_{G} | > t) \leq C_{n} \exp (- c_{n} t^{2})$ holds for all $t > 0$ , where $m_{G}$ is the median of $G (X_{i})$ . We require the same assumption to hold for the columns of the $n \times p$ design matrix $X$ . Additionally, we assume that the coordinates of $X_{i}$ have moments of all orders, and the k-th moment of the entries of $X_{i}$ is assumed to be uniformly bounded independently of n and p for all k.
O5. ${X_{i}}_{i = 1}^{n}$ and ${λ_{i}}_{i = 1}^{n}$ are independent of ${ϵ_{i}}_{i = 1}^{n}$ . $ϵ_{i}$ ’s are independent of each other.
O6. $\sup_{1 \leq i \leq n} | λ_{i} | ≜ L_{n} = O_{L_{k}} (polyLog (n))$ and $λ_{i}$ ’s are independent. Moreover, $E (λ_{i}^{2}) = 1$ .
O7. $1 - 2 α > 0$ and $∥ β_{0} ∥ = O (polyLog (n))$ .

Second part of the proof (Appendix E.4)

P1. $X_{i}$ ’s have independent entries. Furthermore, for any $1 \leq k \leq p$ , the vectors $Θ_{k} = (X_{1} (k), \dots, X_{n} (k))$ in $R^{n}$ satisfy: for any 1-Lipschitz (with respect to Euclidean norm) convex function G, if $m_{G (Θ_{k})}$ is a median of $G (Θ_{k})$ , for any $t > 0, P (| G (Θ_{k}) - m_{G (Θ_{k})} | > t) \leq C_{n} \exp (- c_{n} t^{2})$ , $C_{n}$ and $c_{n}$ can vary with n. As above, we assume that $1 / c_{n} = O (polyLog (n))$ .
P2. $∥ ψ^{'} ∥_{\infty} = O (1)$ .
P3. $∥ β_{0} ∥_{\infty} = O (n^{- e})$ , where $e > 0$ . Furthermore, $∥ β_{0} ∥_{2} \leq C$ , where C is a constant independent of p and n. $e$ satisfies $α + 1 / 4 - e < 0$ .
P4. $1 / 2 - 2 α > 0$ and $\min (1 / 2, e) - α - 1 / 4 > 0$ . The latter implies that $\min (1 / 2, e) - α > 0$ .

Last part of the proof (Appendix E.5)

F1. $ϵ_{i}$ ’s may have different distributions; however, they may only come from finitely many distributions. Furthermore, for any $r \in R$ , if $z \sim N (0, 1)$ , independent of $ϵ_{i}$ , $ϵ_{i} + r z$ has a differentiable density $f_{i, r}$ which is increasing on $(- \infty, 0)$ and decreasing on $(0, \infty)$ . $\lim_{x \to \infty} x f_{i} (x) = 0$ .
F2. ${∥ ψ ∥}_{\infty} = O (1)$ . $ψ^{'}$ has Lipschitz constant $L (n)$ . Furthermore, ${L (n) ∥ ψ ∥}_{\infty} = O (1)$ .
F3. $α < 1 / 6$ and $α + 1 / 3 < 2 \min (1 / 2, e)$ .
F4. There exists a constant C such that $E (λ_{i}^{4}) \leq C$ .
F5. $λ_{i}$ ’s may have different distributions. The fraction of occurrences for each possible combination of distributions for $(ϵ_{i}, λ_{i})$ has a limit as $n \to \infty$ .
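The exponent conditions in O7, P3, P4, and F3 jointly restrict the pair $(α, e)$. The small helper below is a sketch that simply evaluates those inequalities for given values; the function names are hypothetical and the example values are illustrative only.

```python
def exponent_conditions(alpha: float, e: float) -> dict:
    """Evaluate the inequalities on (alpha, e) stated in O7, P3, P4 and F3."""
    m = min(0.5, e)
    return {
        "O7: 1 - 2*alpha > 0": 1 - 2 * alpha > 0,
        "P3: alpha + 1/4 - e < 0": alpha + 0.25 - e < 0,
        "P4: 1/2 - 2*alpha > 0": 0.5 - 2 * alpha > 0,
        "P4: min(1/2, e) - alpha - 1/4 > 0": m - alpha - 0.25 > 0,
        "F3: alpha < 1/6": alpha < 1 / 6,
        "F3: alpha + 1/3 < 2*min(1/2, e)": alpha + 1 / 3 < 2 * m,
    }

def all_hold(alpha: float, e: float) -> bool:
    """True when every listed exponent condition is satisfied."""
    return all(exponent_conditions(alpha, e).values())

# alpha = 0 corresponds to a Lipschitz constant L(n) that does not grow with n.
for alpha, e in [(0.0, 0.5), (0.0, 0.2), (0.3, 0.5)]:
    print(alpha, e, all_hold(alpha, e))
```

With $α = 0$ and $e = 1 / 2$ all six inequalities hold, whereas $e = 0.2$ violates P3 and $α = 0.3$ violates P4 and F3.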

Appendix E. Proof for Theorem 1

We call

F (δ) = \frac{1}{n} \sum_{i = 1}^{n} ρ {ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) + x_{i}^{⊤} (δ_{0} - δ)} + \frac{τ}{2} {∥ δ ∥}^{2} .

\hat{δ}

is defined as the solution of

f (\hat{δ}) = 0, with \nabla F (δ) = f (δ) = \frac{1}{n} \sum_{i = 1}^{n} - x_{i} ψ \{ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) + x_{i}^{⊤} (δ_{0} - δ)\} + τ δ .

We further define

{\tilde{ϵ}}_{i} = ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}), R_{i} = {\tilde{ϵ}}_{i} + x_{i}^{⊤} (δ_{0} - \hat{δ}), S = \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (R_{i}) x_{i} x_{i}^{⊤}, c_{τ} = \frac{1}{n} tr {(S + τ I)}^{- 1} .

(A1)

Appendix E.1. Preliminaries

Lemma A1.

Under Assumptions P3–P4, for any

l = 1, \dots, p

, where

w (l)

denotes the lth coordinate of a vector

w

,

| \hat{w} (l) - w_{0} (l) | = O_{L_{k}} (polyLog (n_{1}) n_{1}^{- 1 / 2} + n_{1}^{- e}) = O_{L_{k}} (\frac{polyLog (n_{1})}{n_{1}^{1 / 2} \land n_{1}^{e}}) .

In particular, if we further assume

n = O (n_{1})

, we have

| \hat{w} (l) - w_{0} (l) | = O_{L_{k}} (\frac{polyLog (n)}{\sqrt{n} \land n^{e}}) .

Proof.

By Proposition 3.12 of El Karoui [21],

|w_{l} - w_{0} (l)| = O_{L_{k}} (polyLog (n_{1}) n_{1}^{- 1 / 2} + {∥ w_{0} ∥}_{\infty}),

where

w_{l}

denotes the analog of

b_{p}

defined in Appendix 4 of El Karoui [21]. Its explicit construction is rather involved and is not needed here; we only use

w_{l}

as an intermediate quantity in the argument. Under Assumption P3,

∥ w_{0} ∥_{\infty} = O (n_{1}^{- e})

, hence

|w_{l} - w_{0} (l)| = O_{L_{k}} (polyLog (n_{1}) n_{1}^{- 1 / 2} + n_{1}^{- e}) = O_{L_{k}} (\frac{polyLog (n_{1})}{n_{1}^{1 / 2} \land n_{1}^{e}}) .

(A2)

Moreover, by Theorem 3.20 of El Karoui [21],

\hat{w} (l) - w_{l} = O_{L_{k}} (\frac{polyLog (n_{1}) n_{1}^{α}}{{[n_{1}^{1 / 2} \land n_{1}^{e}]}^{2}}) .

(A3)

Let

m : = \min (1 / 2, e)

. Then

{[n_{1}^{1 / 2} \land n_{1}^{e}]}^{2} = n_{1}^{2 m}

and

\frac{polyLog (n_{1}) n_{1}^{α}}{{[n_{1}^{1 / 2} \land n_{1}^{e}]}^{2}} = polyLog (n_{1}) n_{1}^{- m} n_{1}^{α - m} .

Assumption P4 gives

m > α

, thus

\frac{polyLog (n_{1}) n_{1}^{α}}{{[n_{1}^{1 / 2} \land n_{1}^{e}]}^{2}} = o (polyLog (n_{1}) n_{1}^{- m}),

and hence (A3) yields

\hat{w} (l) - w_{l} = o_{L_{k}} (polyLog (n_{1}) n_{1}^{- m}) .

(A4)

By Minkowski’s inequality (the triangle inequality in $L_{k}$),

∥ \hat{w} (l) - w_{0} (l) ∥_{L_{k}} \leq ∥ \hat{w} (l) - w_{l} ∥_{L_{k}} + ∥ w_{l} - w_{0} (l) ∥_{L_{k}} .

Using (A2) and (A4), we obtain

∥ \hat{w} (l) - w_{0} (l) ∥_{L_{k}} = O (\frac{polyLog (n_{1})}{n_{1}^{1 / 2} \land n_{1}^{e}}),

which is equivalent to

| \hat{w} (l) - w_{0} (l) | = O_{L_{k}} (\frac{polyLog (n_{1})}{n_{1}^{1 / 2} \land n_{1}^{e}}) .

Assume

n = O (n_{1})

, i.e.,

n_{1} \geq n / C

for some

C > 0

and all large n. Since

polyLog (t)

denotes a power of

\log t

, write

polyLog (t) = {(\log t)}^{q}

for some fixed

q \geq 0

(up to a constant factor). Consider

f (t) : = {(\log t)}^{q} t^{- m}

. For

t \geq \exp (q / m)

,

f^{'} (t) = \frac{{(\log t)}^{q - 1}}{t^{m + 1}} (q - m \log t) \leq 0,

so f is decreasing for all sufficiently large t. Hence, for all sufficiently large n and all

n_{1} \geq n / C

,

\frac{polyLog (n_{1})}{\sqrt{n_{1}} \land n_{1}^{e}} = \frac{{(\log n_{1})}^{q}}{n_{1}^{m}} = f (n_{1}) \leq f (n / C) = \frac{{(\log (n / C))}^{q}}{{(n / C)}^{m}} \leq C^{m} \frac{{(\log n)}^{q}}{n^{m}} ≍ \frac{polyLog (n)}{\sqrt{n} \land n^{e}},

where we use

a_{n} ≍ b_{n}

to denote that

a_{n} = O (b_{n})

and

b_{n} = O (a_{n})

. Therefore, we have

| \hat{w} (l) - w_{0} (l) | = O_{L_{k}} (\frac{polyLog (n)}{\sqrt{n} \land n^{e}}) .

□
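The monotonicity step at the end of the proof can be checked numerically. The sketch below evaluates $f (t) = (\log t)^{q} t^{- m}$ on a geometric grid starting at $\exp (q / m)$ and also checks the final comparison between $f (n / C)$ and $C^{m} f (n)$; the values of `q`, `m`, `C`, and `n` are arbitrary illustrative choices, with $C \geq 1$ assumed.

```python
import math

def f(t: float, q: float, m: float) -> float:
    """f(t) = (log t)^q * t^(-m), the rate function from the proof of Lemma A1."""
    return math.log(t) ** q * t ** (-m)

q, m = 3.0, 0.5
t0 = math.exp(q / m)                       # f is non-increasing from here on
grid = [t0 * 1.5 ** k for k in range(40)]  # geometric grid t0, 1.5*t0, ...
values = [f(t, q, m) for t in grid]
is_decreasing = all(a >= b for a, b in zip(values, values[1:]))

# With n1 >= n / C and C >= 1: f(n1) <= f(n / C) <= C^m * f(n).
C, n = 4.0, 10**6
lhs, rhs = f(n / C, q, m), C**m * f(n, q, m)
print(is_decreasing, lhs <= rhs)
```

The second comparison uses $\log (n / C) \leq \log n$, which is where $C \geq 1$ enters.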

Proposition A1.

Let $δ_{1}$ and $δ_{2}$ be two vectors in $R^{p}$. Then, when $ρ$ is convex and twice differentiable,

∥ δ_{1} - δ_{2} ∥ \leq \frac{1}{τ} ∥ f (δ_{1}) - f (δ_{2}) ∥ .

(A5)

Proof.

We have by definition

f (δ_{1}) - f (δ_{2}) = τ (δ_{1} - δ_{2}) + \frac{1}{n} \sum_{i = 1}^{n} x_{i} [ψ {{\tilde{ϵ}}_{i} + x_{i}^{⊤} (δ_{0} - δ_{2})} - ψ {{\tilde{ϵ}}_{i} + x_{i}^{⊤} (δ_{0} - δ_{1})}] .

By the mean value theorem, we have

ψ \{{\tilde{ϵ}}_{i} + x_{i}^{⊤} (δ_{0} - δ_{2})\} - ψ \{{\tilde{ϵ}}_{i} + x_{i}^{⊤} (δ_{0} - δ_{1})\} = ψ^{'} (γ_{i}^{*}) x_{i}^{⊤} (δ_{1} - δ_{2}),

where $γ_{i}^{*}$ is in the interval $({\tilde{ϵ}}_{i} + x_{i}^{⊤} (δ_{0} - δ_{1}), {\tilde{ϵ}}_{i} + x_{i}^{⊤} (δ_{0} - δ_{2}))$.

Hence,

f (δ_{1}) - f (δ_{2}) = τ (δ_{1} - δ_{2}) + \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (γ_{i}^{*}) x_{i} x_{i}^{⊤} (δ_{1} - δ_{2}) = (S_{δ_{1}, δ_{2}} + τ I_{p}) (δ_{1} - δ_{2}),

where

S_{δ_{1}, δ_{2}} = \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (γ_{i}^{*}) x_{i} x_{i}^{⊤} .

This shows that

δ_{1} - δ_{2} = {(S_{δ_{1}, δ_{2}} + τ I_{p})}^{- 1} {f (δ_{1}) - f (δ_{2})} .

Since

ρ

is convex,

ψ^{'} = ρ^{″}

is non-negative and

S_{δ_{1}, δ_{2}}

is positive semi-definite. In the semi-definite order, we have

S_{δ_{1}, δ_{2}} + τ I_{p} ⪰ τ I_{p}

. In particular,

∥ δ_{1} - δ_{2} ∥ \leq \frac{1}{τ} ∥ f (δ_{1}) - f (δ_{2}) ∥ .

□

Proposition A1 yields the following lemma.

Lemma A2.

For any

δ_{1}

,

∥ \hat{δ} - δ_{1} ∥ \leq \frac{1}{τ} ∥ f (δ_{1}) ∥ .

The lemma follows from Equation (A5) applied with $δ_{2} = \hat{δ}$, since by definition

f (\hat{δ}) = 0

.
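Inequality (A5), on which Lemma A2 rests, holds for any pair of vectors and can therefore be checked directly without solving for $\hat{δ}$. The sketch below does so on simulated data, assuming a pseudo-Huber score as a stand-in for $ψ$ and Gaussian predictors; all sizes, the seed, and the Cauchy draw used for ${\tilde{ϵ}}_{i}$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, tau, d = 200, 100, 0.5, 1.35        # illustrative sizes and ridge level
X = rng.standard_normal((n, p))           # rows play the role of x_i
eps_tilde = rng.standard_cauchy(n)        # stands in for the shifted errors
delta0 = rng.standard_normal(p) / np.sqrt(p)

def psi(x):
    """Pseudo-Huber score (assumed stand-in for the paper's psi)."""
    return x / np.sqrt(1.0 + (x / d) ** 2)

def f(delta):
    """Estimating function f(delta) = -(1/n) X^T psi(residual) + tau * delta."""
    r = eps_tilde + X @ (delta0 - delta)
    return -(X.T @ psi(r)) / n + tau * delta

# (A5): ||d1 - d2|| <= ||f(d1) - f(d2)|| / tau for arbitrary d1, d2.
ratios = []
for _ in range(100):
    d1, d2 = rng.standard_normal(p), rng.standard_normal(p)
    ratios.append(tau * np.linalg.norm(d1 - d2) / np.linalg.norm(f(d1) - f(d2)))
max_ratio = max(ratios)
print(max_ratio)  # should not exceed 1
```

The ratio stays below one because the positive semi-definite term $S_{δ_{1}, δ_{2}}$ can only increase $∥ f (δ_{1}) - f (δ_{2}) ∥$ relative to $τ ∥ δ_{1} - δ_{2} ∥$.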

Appendix E.2. On $∥ \hat{δ} ∥$ and $∥ \hat{δ} - δ_{0} ∥$

Lemma A3.

Define

q_{n} (b) = n^{- 1} \sum_{i = 1}^{n} x_{i} ψ {{\tilde{ϵ}}_{i} + x_{i}^{⊤} b}

,

q_{n} \in R^{p}

.

If

D_{ψ}

is the

n \times n

diagonal matrix with

(i, i)

-entry

ψ {{\tilde{ϵ}}_{i} + x_{i}^{⊤} δ_{0}}

,

∥ \hat{δ} ∥ \leq \frac{1}{τ} ∥ q_{n} (δ_{0}) ∥ = \frac{1}{τ} \sqrt{\frac{1}{n^{2}} 1^{⊤} D_{ψ} X X^{⊤} D_{ψ} 1},

and if

D_{ψ (ξ_{i})}

is the

n \times n

diagonal matrix with

(i, i)

-entry

ψ ({\tilde{ϵ}}_{i})

,

∥ \hat{δ} - δ_{0} ∥ \leq ∥ δ_{0} ∥ + \frac{1}{τ} ∥ q_{n} (0) ∥ = ∥ δ_{0} ∥ + \frac{1}{τ} \sqrt{\frac{1}{n^{2}} 1^{⊤} D_{ψ (ξ_{i})} X X^{⊤} D_{ψ (ξ_{i})} 1} .

Also,

∥ q_{n} (δ_{0}) ∥^{2} \leq \frac{1^{⊤} D_{ψ}^{2} 1}{n} {∥ X X^{⊤} / n ∥}_{2},

where

{∥ A ∥}_{2}

denotes the largest singular value of the matrix

A

.

Therefore, under Assumptions O1–O6,

\begin{matrix} E (∥ \hat{δ} ∥^{2}) \leq \frac{1}{τ^{2}} \frac{p}{n} C^{2} polyLog (n), \\ E (∥ \hat{δ} ∥^{4}) \leq \frac{1}{τ^{4}} C polyLog (n) . \end{matrix}

Similarly, for any finite k,

E (∥ \hat{δ} - δ_{0} ∥_{2}^{k}) \leq C_{k} [∥ δ_{0} ∥^{k} + polyLog (n) / τ^{k}] .

In the case

k = 2

, we have the more precise bound

E (∥ \hat{δ} - δ_{0} ∥^{2}) \leq 2 [∥ δ_{0} ∥^{2} + \frac{p / n}{τ^{2}} \frac{1}{n} \sum_{i = 1}^{n} E {ψ^{2} ({\tilde{ϵ}}_{i})}] .

Noting that [21] has shown that $E (∥ \hat{w} - w_{0} ∥) = O (1)$, it follows that $E (∥ \hat{δ} + \hat{w} - δ_{0} - w_{0} ∥)$ is bounded by $K polyLog (n) / τ^{k}$.

Proof.

Recall that

f (δ) = \frac{1}{n} \sum_{i = 1}^{n} - x_{i} ψ {{\tilde{ϵ}}_{i} + x_{i}^{⊤} (δ_{0} - δ)} + τ δ

. Applying Lemma A2 with

δ_{1} = 0

we have

∥ \hat{δ} - 0 ∥ \leq \frac{1}{τ} ∥ f (0) ∥ = \frac{1}{τ} ∥ \frac{1}{n} \sum_{i = 1}^{n} - x_{i} ψ {{\tilde{ϵ}}_{i} + x_{i}^{⊤} δ_{0}} ∥,

which gives the first inequality.

Using

δ_{1} = δ_{0}

we have

∥ \hat{δ} - δ_{0} ∥ \leq \frac{1}{τ} ∥ f (δ_{0}) ∥ = \frac{1}{τ} ∥ \frac{1}{n} \sum_{i = 1}^{n} - x_{i} ψ ({\tilde{ϵ}}_{i}) + τ δ_{0} ∥,

which gives the second inequality.

We note that under our assumptions, according to Lemma 3.38 from [21], we have

∥ X X^{⊤} / n ∥_{2} = O_{L_{k}} (polyLog (n))

and

\frac{1}{n} \sum_{i = 1}^{n} ψ^{2} {{\tilde{ϵ}}_{i} + x_{i}^{⊤} δ_{0}} \leq \frac{1}{n} \sum_{i = 1}^{n} {∥ ψ^{2} ∥}_{\infty} = O (1),

which gives all the results about

L_{k}

bounds.

For the last result, we note that

∥ q_{n} (0) ∥^{2} = q_{n} (0)^{⊤} q_{n} (0) = \frac{1}{n^{2}} \sum_{i, j} x_{i}^{⊤} x_{j} ψ ({\tilde{ϵ}}_{i}) ψ ({\tilde{ϵ}}_{j}) .

It implies that

E (∥ q_{n} (0) ∥^{2}) = \frac{1}{n^{2}} \sum_{i = 1}^{n} E (∥ x_{i} ∥^{2}) E \{ψ^{2} ({\tilde{ϵ}}_{i})\} .

Because

E (∥ x_{i} ∥^{2}) = p

, we can conclude that

E (∥ q_{n} (0) ∥^{2}) = \frac{p}{n} \frac{1}{n} \sum_{i = 1}^{n} E {ψ^{2} ({\tilde{ϵ}}_{i})} .

Together with the bound

∥ \hat{δ} - δ_{0} ∥^{2} \leq 2 ∥ δ_{0} ∥^{2} + \frac{2}{τ^{2}} {∥ q_{n} (0) ∥}^{2},

it implies the last result about

k = 2

. □
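The deterministic inequality $∥ q_{n} (δ_{0}) ∥^{2} \leq n^{- 1} 1^{⊤} D_{ψ}^{2} 1 \, ∥ X X^{⊤} / n ∥_{2}$ from Lemma A3 can be verified on simulated data. The sketch below assumes a pseudo-Huber score as a stand-in for $ψ$, Gaussian predictors, and heavy-tailed draws for ${\tilde{ϵ}}_{i}$; these choices are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, d = 150, 75, 1.35                      # illustrative sizes
X = rng.standard_normal((n, p))              # rows play the role of x_i
eps_tilde = rng.standard_t(df=2, size=n)     # stands in for the shifted errors
delta0 = rng.standard_normal(p) / np.sqrt(p)

def psi(x):
    """Pseudo-Huber score (assumed stand-in for the paper's psi)."""
    return x / np.sqrt(1.0 + (x / d) ** 2)

psi_vals = psi(eps_tilde + X @ delta0)       # diagonal entries of D_psi
q = X.T @ psi_vals / n                       # q_n(delta0)

lhs = q @ q                                              # ||q_n(delta0)||^2
quad = psi_vals @ (X @ X.T) @ psi_vals / n**2            # (1/n^2) 1' D X X' D 1
rhs = (psi_vals @ psi_vals / n) * np.linalg.norm(X @ X.T / n, 2)
print(lhs, quad, rhs)
```

Here `np.linalg.norm(A, 2)` returns the largest singular value of the matrix `A`, matching the definition of $∥ \cdot ∥_{2}$ in the lemma.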

Appendix E.3. Leave-One-Observation-Out

In this subsection, we approximate $\hat{δ}$ by ${\hat{δ}}_{(i)}$ via the leave-one-observation-out method.

We consider the situation where we leave the i-th observation,

(x_{i}, ϵ_{i})

, out. By definition,

{\hat{δ}}_{(i)} = \underset{δ \in R^{p}}{arg min} F_{i} (δ), where F_{i} (δ) = \frac{1}{n} \sum_{j \neq i} ρ_{j} {{\tilde{ϵ}}_{j} + x_{j}^{⊤} δ_{0} - x_{j}^{⊤} δ} + \frac{τ}{2} {∥ δ ∥}^{2} .

We call

f_{i} (δ) = - \frac{1}{n} \sum_{j \neq i} x_{j} ψ_{j} \{{\tilde{ϵ}}_{j} + x_{j}^{⊤} δ_{0} - x_{j}^{⊤} δ\} + τ δ = f (δ) + \frac{1}{n} x_{i} ψ \{{\tilde{ϵ}}_{i} + x_{i}^{⊤} δ_{0} - x_{i}^{⊤} δ\} .

We have

f_{i} ({\hat{δ}}_{(i)}) = 0 .

We call

{\tilde{r}}_{j, (i)} = {\tilde{ϵ}}_{j} - x_{j}^{⊤} ({\hat{δ}}_{(i)} - δ_{0}) and S_{i} = \frac{1}{n} \sum_{j \neq i} ψ_{j}^{'} ({\tilde{r}}_{j, (i)}) x_{j} x_{j}^{⊤} .

Consider

{\tilde{δ}}_{i} = {\hat{δ}}_{(i)} + \frac{1}{n} {(S_{i} + τ I)}^{- 1} x_{i} ψ \{prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})\} ≜ {\hat{δ}}_{(i)} + η_{i},

where

\begin{matrix} c_{i} = \frac{1}{n} x_{i}^{⊤} {(S_{i} + τ I)}^{- 1} x_{i}, \\ η_{i} = \frac{1}{n} {(S_{i} + τ I)}^{- 1} x_{i} ψ {prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})} . \end{matrix}

(A6)

Appendix E.3.1. Deterministic Bounds

Proposition A2.

We have

∥ \hat{δ} - {\tilde{δ}}_{i} ∥ \leq \frac{1}{τ} ∥ R_{i} ∥,

where

R_{i} = \frac{1}{n} \sum_{j \neq i} [ψ_{j}^{'} {γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i})} - ψ_{j}^{'} ({\tilde{r}}_{j, (i)})] x_{j} x_{j}^{⊤} η_{i},

and

γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i})

is in the (“unordered”) interval

({\tilde{r}}_{j, (i)}, {\tilde{r}}_{j, (i)} - x_{j}^{⊤} η_{i})

.

Proof.

Recall that

y_{i} = ϵ_{i} + x_{i}^{⊤} w_{0} + x_{i}^{⊤} δ_{0}

.

Since

f_{i} ({\hat{δ}}_{(i)}) = 0

, and

{\tilde{δ}}_{i} = {\hat{δ}}_{(i)} + η_{i}

, we have

\begin{matrix} f ({\tilde{δ}}_{i}) & = & f ({\tilde{δ}}_{i}) - f_{i} ({\hat{δ}}_{(i)}) \\ & = & - \frac{1}{n} \sum_{j = 1}^{n} x_{j} ψ_{j} \{{\tilde{ϵ}}_{j} + x_{j}^{⊤} δ_{0} - x_{j}^{⊤} {\tilde{δ}}_{i}\} + τ {\tilde{δ}}_{i} + \frac{1}{n} \sum_{j \neq i} x_{j} ψ_{j} \{{\tilde{ϵ}}_{j} + x_{j}^{⊤} δ_{0} - x_{j}^{⊤} {\hat{δ}}_{(i)}\} - τ {\hat{δ}}_{(i)} \\ & = & - \frac{1}{n} x_{i} ψ \{{\tilde{ϵ}}_{i} + x_{i}^{⊤} δ_{0} - x_{i}^{⊤} {\tilde{δ}}_{i}\} + τ η_{i} \\ & & + \frac{1}{n} \sum_{j \neq i} x_{j} [ψ_{j} \{{\tilde{ϵ}}_{j} + x_{j}^{⊤} δ_{0} - x_{j}^{⊤} {\hat{δ}}_{(i)}\} - ψ_{j} \{{\tilde{ϵ}}_{j} + x_{j}^{⊤} δ_{0} - x_{j}^{⊤} ({\hat{δ}}_{(i)} + η_{i})\}] . \end{matrix}

By the mean value theorem, we have

\begin{matrix} ψ_{j} {{\tilde{ϵ}}_{j} + x_{j}^{⊤} δ_{0} - x_{j}^{⊤} {\hat{δ}}_{(i)}} - ψ_{j} {{\tilde{ϵ}}_{j} + x_{j}^{⊤} δ_{0} - x_{j}^{⊤} ({\hat{δ}}_{(i)} + η_{i})} \\ = ψ_{j}^{'} ({\tilde{r}}_{j, (i)}) x_{j}^{⊤} η_{i} + [ψ_{j}^{'} {γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i})} - ψ_{j}^{'} ({\tilde{r}}_{j, (i)})] x_{j}^{⊤} η_{i}, \end{matrix}

where

γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i})

is in the (“unordered”) interval

({\tilde{r}}_{j, (i)}, {\tilde{r}}_{j, (i)} - x_{j}^{⊤} η_{i})

.

Therefore, if

R_{i}

is defined as above, we have

\begin{matrix} \frac{1}{n} \sum_{j \neq i} x_{j} [ψ_{j} {{\tilde{ϵ}}_{j} + x_{j}^{⊤} δ_{0} - x_{j}^{⊤} {\hat{δ}}_{(i)}} - ψ_{j} {{\tilde{ϵ}}_{j} + x_{j}^{⊤} δ_{0} - x_{j}^{⊤} ({\hat{δ}}_{(i)} + η_{i})}] \\ = & \frac{1}{n} \sum_{j \neq i} ψ_{j}^{'} ({\tilde{r}}_{j, (i)}) x_{j} x_{j}^{⊤} η_{i} + R_{i} \\ = & S_{i} η_{i} + R_{i} . \end{matrix}

The previous simplifications yield

f ({\tilde{δ}}_{i}) = - \frac{1}{n} x_{i} ψ {{\tilde{ϵ}}_{i} + x_{i}^{⊤} δ_{0} - x_{i}^{⊤} {\tilde{δ}}_{i}} + (S_{i} + τ I) η_{i} + R_{i} .

Since by definition,

η_{i} = n^{- 1} {(S_{i} + τ I)}^{- 1} x_{i} ψ {prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})}

, we have

(S_{i} + τ I) η_{i} = \frac{1}{n} x_{i} ψ {prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})} .

Also, by definition we have

{\tilde{ϵ}}_{i} + x_{i}^{⊤} δ_{0} - x_{i}^{⊤} {\tilde{δ}}_{i} = {\tilde{r}}_{i, (i)} - c_{i} ψ {prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})} .

When

ρ

is differentiable,

x - c ψ {prox (c ρ) (x)} = prox (c ρ) (x)

. Therefore,

{\tilde{ϵ}}_{i} + x_{i}^{⊤} δ_{0} - x_{i}^{⊤} {\tilde{δ}}_{i} = prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})

and

- \frac{1}{n} x_{i} ψ \{{\tilde{ϵ}}_{i} + x_{i}^{⊤} δ_{0} - x_{i}^{⊤} {\tilde{δ}}_{i}\} + (S_{i} + τ I) η_{i} = - \frac{1}{n} x_{i} [ψ \{prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})\} - ψ \{prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})\}] = 0 .

Therefore,

f ({\tilde{δ}}_{i}) = R_{i}

. Applying Lemma A2 we have

∥ \hat{δ} - {\tilde{δ}}_{i} ∥ \leq \frac{1}{τ} ∥ R_{i} ∥ .

□
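The proof relies on the identity $x - c ψ \{prox (c ρ) (x)\} = prox (c ρ) (x)$, and Lemma A4 uses the bound $| ψ \{prox (c ρ) (x)\} | \leq | ψ (x) |$. The sketch below checks both numerically, assuming a pseudo-Huber score as a stand-in for $ψ$ and computing the proximal map by bisection on its first-order condition; the function names and test values are illustrative assumptions.

```python
import math

D = 1.35  # scale of the assumed pseudo-Huber loss

def psi(y: float) -> float:
    """Pseudo-Huber score (assumed stand-in for the paper's psi)."""
    return y / math.sqrt(1.0 + (y / D) ** 2)

def prox(c: float, x: float, iters: int = 200) -> float:
    """prox(c rho)(x): the unique root y of y + c * psi(y) = x.

    The map y -> y + c * psi(y) is strictly increasing and the root lies
    between 0 and x, so plain bisection is sufficient."""
    lo, hi = min(0.0, x), max(0.0, x)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + c * psi(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

results = []
for c in (0.1, 1.0, 3.0):
    for x in (-5.0, -0.3, 0.0, 2.0, 10.0):
        y = prox(c, x)
        identity_gap = abs(x - c * psi(y) - y)          # should be ~0
        shrinks = abs(psi(y)) <= abs(psi(x)) + 1e-12    # Lemma A4-type bound
        results.append((identity_gap, shrinks))
print(max(gap for gap, _ in results), all(ok for _, ok in results))
```

Bisection is used here only because it is simple to verify; a Newton iteration on the same equation would serve equally well.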

i.: On $R_{i}$

Next, we provide a bound for

R_{i}

.

Lemma A4.

We have

∥ η_{i} ∥ \leq \frac{1}{\sqrt{n} τ} \frac{∥ x_{i} ∥}{\sqrt{n}} | ψ ({\tilde{r}}_{i, (i)}) |,

and

∥ R_{i} ∥ \leq ∥ \hat{Σ} ∥_{2} \sup_{j \neq i} | ψ_{j}^{'} \{γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i})\} - ψ_{j}^{'} ({\tilde{r}}_{j, (i)}) | \frac{1}{\sqrt{n} τ} \frac{∥ x_{i} ∥}{\sqrt{n}} | ψ ({\tilde{r}}_{i, (i)}) | .

Proof.

We have

R_{i} = \frac{1}{n} \sum_{j \neq i} [ψ_{j}^{'} {γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i})} - ψ_{j}^{'} ({\tilde{r}}_{j, (i)})] x_{j} x_{j}^{⊤} η_{i} .

Note that

S = n^{- 1} \sum_{j \neq i} [ψ_{j}^{'} {γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i})} - ψ_{j}^{'} ({\tilde{r}}_{j, (i)})] x_{j} x_{j}^{⊤}

can be written as

S = n^{- 1} X^{⊤} D X

, where

D

is a diagonal matrix with

(j, j)

-entry

[ψ_{j}^{'} {γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i})} - ψ_{j}^{'} ({\tilde{r}}_{j, (i)})]

.

Using the properties of the matrix norm $∥ \cdot ∥_{2}$, we have $∥ S ∥_{2} \leq ∥ \hat{Σ} ∥_{2} ∥ D ∥_{2}$, which implies that

∥ R_{i} ∥ \leq ∥ \hat{Σ} ∥_{2} \sup_{j \neq i} | ψ_{j}^{'} \{γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i})\} - ψ_{j}^{'} ({\tilde{r}}_{j, (i)}) | ∥ η_{i} ∥,

where $\hat{Σ} = n^{- 1} \sum_{j = 1}^{n} x_{j} x_{j}^{⊤}$ is the sample covariance matrix.

We now bound

∥ η_{i} ∥

. Note that

∥ η_{i} ∥ \leq \frac{1}{\sqrt{n} τ} \frac{∥ x_{i} ∥}{\sqrt{n}} | ψ {prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})} | .

Using Lemma A-1 in [25], we see that

| ψ \{prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})\} | \leq | ψ ({\tilde{r}}_{i, (i)}) | .

□

Under our assumptions, we have

| ψ ({\tilde{r}}_{i, (i)}) | \leq ∥ ψ ∥_{\infty} \leq C polyLog (n)

and later in the proof of Lemma A6, we will show that

\sup_{i} \frac{∥ x_{i} ∥}{\sqrt{n}} = \sup_{i} \frac{| λ_{i} | ∥ X_{i} ∥}{\sqrt{n}} = O_{L_{k}} (\sup_{i} | λ_{i} |) .

ii.: On $γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i})$ and related quantities

We now show how to control

n^{- 1 / 2} \sup_{j \neq i} | ψ_{j}^{'} {γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i})} - ψ_{j}^{'} ({\tilde{r}}_{j, (i)}) |

.

Lemma A5.

Suppose that

ψ^{'}

is

L (n)

-Lipschitz. Then

\sup_{j \neq i} | ψ_{j}^{'} {γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i})} - ψ_{j}^{'} ({\tilde{r}}_{j, (i)}) | \leq L (n) \sup_{j \neq i} | x_{j}^{⊤} η_{i} | .

Proof.

By definition, we have

| γ^{★} (x_{j}, {\hat{δ}}_{(i)}, η_{i}) - {\tilde{r}}_{j, (i)} | \leq | x_{j}^{⊤} η_{i} | .

Therefore, the bound follows, using the fact that

ψ_{j}^{'}

is

L (n)

-Lipschitz. □

Appendix E.3.2. Stochastic Aspects

Recall that

x_{j}^{⊤} η_{i} = ψ {prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})} \frac{1}{n} x_{j}^{⊤} {(S_{i} + τ I)}^{- 1} x_{i} .

Therefore, we can bound

∥ R_{i} ∥

by

∥ R_{i} ∥ \leq [\sup_{j \neq i} \frac{| x_{j}^{⊤} {(S_{i} + τ I)}^{- 1} x_{i} |}{n}] \frac{L (n)}{\sqrt{n} τ} \frac{∥ x_{i} ∥}{\sqrt{n}} ∥ \hat{Σ} ∥_{2} ∥ ψ ∥_{\infty}^{2} .

i.: On $\sup_{j \neq i} | x_{j}^{⊤} {(S_{i} + τ I)}^{- 1} x_{i} |$

Now we control

x_{j}^{⊤} {(S_{i} + τ I)}^{- 1} x_{i} / n

.

Lemma A6.

Suppose

x_{i}

’s are independent and satisfy O4; suppose that

λ_{i}

’s satisfy O6. Then

\sup_{j \neq i} | x_{j}^{⊤} {(S_{i} + τ I)}^{- 1} x_{i} / n | \leq \frac{1}{\sqrt{n}} \sup_{j \neq i} \frac{∥ X_{j} ∥}{τ \sqrt{n}} polyLog (n)

in

L_{k}

, for any finite k. Note that under Assumption O4, for any finite k,

\sup_{j \neq i} | ∥ X_{j} ∥ / \sqrt{n} | = O_{L_{k}} (1) .

Proof.

Note that

| x_{j}^{⊤} {(S_{i} + τ I)}^{- 1} x_{i} / n | = | λ_{i} λ_{j} | | X_{j}^{⊤} {(S_{i} + τ I)}^{- 1} X_{i} / n | .

Denote

X_{(i)} = (X_{1}, \dots, X_{i - 1}, X_{i + 1}, \dots, X_{n})

, and

v_{j, (i)} = {(S_{i} + τ I)}^{- 1} X_{j}

. Then the map

F_{j} (X_{i}) = X_{j}^{⊤} {(S_{i} + τ I)}^{- 1} X_{i} = X_{i}^{⊤} v_{j, (i)}

is linear in

X_{i}

and it is Lipschitz with Lipschitz constant

{X_{j}^{⊤} {(S_{i} + τ I)}^{- 2} X_{j}}^{1 / 2} \leq ∥ X_{j} ∥ / τ

. Therefore, using Lemma B-2 in [25] and the fact that

X_{i}

has mean 0, we have

\sup_{j \neq i} | X_{j}^{⊤} {(S_{i} + τ I)}^{- 1} X_{i} / n | | X_{(i)} \leq \frac{1}{\sqrt{n}} \sup_{j \neq i} \frac{∥ X_{j} ∥}{τ \sqrt{n}} polyLog (n) / c_{n}^{1 / 2}

in

L_{k}

, when

\sup_{j \neq i} ∥ X_{j} ∥ / (τ n^{1 / 2}) = O_{L_{k}} (1)

.

To prove this, using the fact that

X_{j} \to ∥ X_{j} ∥ / n^{1 / 2}

is

n^{- 1 / 2}

-Lipschitz, we have

\sup_{j \neq i} | ∥ X_{j} ∥ / \sqrt{n} - m_{∥ X_{j} ∥ / \sqrt{n}} | \leq polyLog (n) / \sqrt{n c_{n}} in \sqrt{L_{2 k}} .

Note that

E (∥ X_{j} ∥) \leq \{E (∥ X_{j} ∥^{2})\}^{1 / 2} = p^{1 / 2}

, so

m_{∥ X_{j} ∥ / \sqrt{n}}

is of order 1. Therefore, by Assumption O4 on

c_{n}

we have

\sup_{j \neq i} | ∥ X_{j} ∥ / \sqrt{n} | = O_{L_{k}} (1) .

Now Assumption O6, namely $\sup_{i} | λ_{i} | = O_{L_{k}} (polyLog (n))$, guarantees that the bounds we announced are valid. □

Consequences

We have the following result.

Proposition A3.

Under Assumptions O1–O6, we have

∥ R_{i} ∥ = O_{L_{k}} (\frac{L (n) ∥ ψ ∥_{\infty}^{2}}{n τ} polyLog (n)) .

Furthermore, the same bound holds for

\sup_{1 \leq i \leq n} ∥ R_{i} ∥

.

Proof.

By aggregating all the intermediate results, using Hölder’s inequality and the fact that $∥ \hat{Σ} ∥_{2} = O_{L_{k}} (polyLog (n))$, shown in Lemma 3.38 of [21], we finish the proof. □

We can now state and prove the following result. Recall that

{\tilde{δ}}_{i} = {\hat{δ}}_{(i)} + \frac{1}{n} {(S_{i} + τ I_{p})}^{- 1} x_{i} ψ \{prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})\} ≜ {\hat{δ}}_{(i)} + η_{i} .

Theorem A1.

Under Assumptions O1–O7, we have, for any fixed k, when τ is held fixed and

L (n) \leq C n^{α}

,

\sup_{1 \leq i \leq n} ∥ \hat{δ} - {\tilde{δ}}_{i} ∥ = O_{L_{k}} (\frac{polyLog (n)}{n^{1 - α}}) .

In particular, we have

\forall 1 \leq i \leq n, E (∥ \hat{δ} - {\tilde{δ}}_{i} ∥^{2}) = O (\frac{polyLog (n)}{n^{2 - 2 α}}) .

Also,

\sup_{1 \leq i \leq n} \sup_{j \neq i} | {\tilde{r}}_{j, (i)} - R_{j} | = O_{L_{k}} (\frac{polyLog (n)}{n^{1 / 2 - α}}) .

Finally,

\sup_{1 \leq i \leq n} | R_{i} - prox (c_{i} ρ) ({\tilde{r}}_{i, (i)}) | = O_{L_{k}} (\frac{polyLog (n)}{n^{1 / 2 - α}}) .

(A7)

Proof.

The first two results are direct consequences of the previous propositions.

The third result follows from

\begin{matrix} \sup_{j \neq i} | {\tilde{r}}_{j, (i)} - R_{j} | & = & \sup_{j \neq i} | x_{j}^{⊤} (\hat{δ} - {\hat{δ}}_{(i)}) | \\ & \leq & \sup_{j \neq i} | x_{j}^{⊤} (\hat{δ} - {\tilde{δ}}_{i}) | + \sup_{j \neq i} | x_{j}^{⊤} ({\tilde{δ}}_{i} - {\hat{δ}}_{(i)}) | \\ & \leq & (\sup_{1 \leq j \leq n} \frac{∥ x_{j} ∥}{\sqrt{n}}) \sqrt{n} ∥ \hat{δ} - {\tilde{δ}}_{i} ∥ + \sup_{j \neq i} | x_{j}^{⊤} η_{i} |, \end{matrix}

and the fact that

\sup_{1 \leq j \leq n} ∥ x_{j} ∥ / n^{1 / 2} = O_{L_{k}} (polyLog (n))

under our assumptions. Control of the first term follows from the results on

∥ \hat{δ} - {\tilde{δ}}_{i} ∥

. Control of the second term follows from Lemma A6 and the assumption that

ψ

is bounded by

C polyLog (n)

.

For the last result, recall that

R_{i} = {\tilde{ϵ}}_{i} + x_{i}^{⊤} (δ_{0} - \hat{δ}) = {\tilde{ϵ}}_{i} + x_{i}^{⊤} δ_{0} - x_{i}^{⊤} {\tilde{δ}}_{i} - x_{i}^{⊤} (\hat{δ} - {\tilde{δ}}_{i}) .

Given the definition of

{\tilde{δ}}_{i}

, we have

x_{i}^{⊤} {\tilde{δ}}_{i} = x_{i}^{⊤} {\hat{δ}}_{(i)} + c_{i} ψ {prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})} .

By the property of the proximal operator, if $y = prox (c ρ) (x)$, then $y + c ψ (y) = x$. Hence,

{\tilde{ϵ}}_{i} + x_{i}^{⊤} δ_{0} - x_{i}^{⊤} {\tilde{δ}}_{i} = {\tilde{r}}_{i, (i)} - c_{i} ψ [prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})] = prox (c_{i} ρ) ({\tilde{r}}_{i, (i)}) .

Therefore, we have

\sup_{i} | R_{i} - prox (c_{i} ρ) ({\tilde{r}}_{i, (i)}) | = \sup_{i} | x_{i}^{⊤} (\hat{δ} - {\tilde{δ}}_{i}) |,

which is controlled in the previous results. □
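To see the leave-one-observation-out approximation at work, the sketch below compares $\hat{δ}$ with the uncorrected ${\hat{δ}}_{(i)}$ and with the corrected ${\tilde{δ}}_{i} = {\hat{δ}}_{(i)} + η_{i}$ on one simulated dataset. It assumes a pseudo-Huber loss as a stand-in for $ρ$, Gaussian predictors with $λ_{i} = 1$, and a simple majorize-minimize solver; none of these choices, nor the sizes used, are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, tau, d = 300, 150, 1.0, 1.35    # illustrative sizes, not the paper's settings
X = rng.standard_normal((n, p))       # rows play the role of x_i (lambda_i = 1)
eps_tilde = rng.standard_normal(n)    # stands in for the shifted errors
delta0 = rng.standard_normal(p) / np.sqrt(p)
y = eps_tilde + X @ delta0            # residual at delta is y - X @ delta

def psi(x):
    return x / np.sqrt(1.0 + (x / d) ** 2)

def dpsi(x):
    return (1.0 + (x / d) ** 2) ** -1.5

def solve(Xs, ys, iters=400):
    """Find the zero of -(1/n) Xs^T psi(ys - Xs delta) + tau * delta.

    Majorize-minimize step: since 0 < psi' <= 1, the matrix
    Xs^T Xs / n + tau * I dominates the Hessian, so every step decreases
    the objective (a convenient solver choice, not the paper's)."""
    M_inv = np.linalg.inv(Xs.T @ Xs / n + tau * np.eye(p))
    delta = np.zeros(p)
    for _ in range(iters):
        grad = -(Xs.T @ psi(ys - Xs @ delta)) / n + tau * delta
        delta = delta - M_inv @ grad
    return delta

def prox(c, x, iters=200):
    """prox(c rho)(x): the root y of y + c * psi(y) = x, by bisection."""
    lo, hi = min(0.0, x), max(0.0, x)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + c * psi(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

i = 0                                  # index of the left-out observation
keep = np.arange(n) != i
delta_hat = solve(X, y)                # full-sample estimator
delta_loo = solve(X[keep], y[keep])    # leave-one-out estimator (same 1/n scaling)

r_loo = y[keep] - X[keep] @ delta_loo                  # tilde r_{j,(i)}, j != i
S_i = (X[keep].T * dpsi(r_loo)) @ X[keep] / n
A_inv_xi = np.linalg.solve(S_i + tau * np.eye(p), X[i])
c_i = X[i] @ A_inv_xi / n
r_ii = y[i] - X[i] @ delta_loo                         # tilde r_{i,(i)}
eta_i = A_inv_xi * psi(prox(c_i, r_ii)) / n
delta_tilde = delta_loo + eta_i

err_plain = np.linalg.norm(delta_hat - delta_loo)
err_corrected = np.linalg.norm(delta_hat - delta_tilde)
grad_norm = np.linalg.norm(-(X.T @ psi(y - X @ delta_hat)) / n + tau * delta_hat)
print(err_plain, err_corrected)
```

In line with Theorem A1, the corrected error should be of smaller order than the uncorrected one, since the correction $η_{i}$ removes the first-order effect of adding observation i back.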

On the limiting variance of $∥ \hat{δ} ∥^{2}$ and $∥ \hat{δ} - δ_{0} ∥^{2}$

Proposition A4.

Under Assumptions O1–O7,

var (∥ \hat{δ} ∥^{2}) \to 0 as n \to \infty .

Therefore,

∥ \hat{δ} ∥^{2}

has a deterministic equivalent in probability and in

L_{2}

.

More precisely, we have

var (∥ \hat{δ} ∥^{2}) = O (\frac{polyLog (n)}{n^{1 - 2 α}}) .

The same type of results are true for

var (∥ \hat{δ} - δ_{0} ∥^{2})

and

var (∥ \hat{β} - β_{0} ∥^{2})

provided that

∥ δ_{0} ∥ = O (polyLog (n))

.

Proof.

We use the Efron–Stein inequality [35]: if W is a function of n independent random variables, and

W_{(i)}

is any function of all those random variables except the i-th,

var (W) \leq \sum_{i = 1}^{n} var (W - W_{(i)}) \leq \sum_{i = 1}^{n} E ({(W - W_{(i)})}^{2}) .

We apply this inequality to

W = ∥ \hat{δ} ∥^{2}

and

W_{(i)} = {∥ {\hat{δ}}_{(i)} ∥}^{2}

. We first note that

E (| ∥ \hat{δ} ∥^{2} - ∥ {\hat{δ}}_{(i)} ∥^{2} |^{2}) \leq 2 [E (| ∥ \hat{δ} ∥^{2} - ∥ {\tilde{δ}}_{i} ∥^{2} |^{2}) + E (| ∥ {\tilde{δ}}_{i} ∥^{2} - ∥ {\hat{δ}}_{(i)} ∥^{2} |^{2})] .

For the first term, we have

| ∥ \hat{δ} ∥^{2} - ∥ {\tilde{δ}}_{i} ∥^{2} |^{2} = [(\hat{δ} - {\tilde{δ}}_{i})^{⊤} (\hat{δ} + {\tilde{δ}}_{i})]^{2}

and

{(\hat{δ} - {\tilde{δ}}_{i})}^{⊤} (\hat{δ} + {\tilde{δ}}_{i}) = 2 {(\hat{δ} - {\tilde{δ}}_{i})}^{⊤} \hat{δ} - {∥ \hat{δ} - {\tilde{δ}}_{i} ∥}^{2}

. Therefore, by the Cauchy–Schwarz inequality, we have

| ∥ \hat{δ} ∥^{2} - ∥ {\tilde{δ}}_{i} ∥^{2} |^{2} = O_{L_{1}} (∥ \hat{δ} - {\tilde{δ}}_{i} ∥^{4}) + \sqrt{O_{L_{1}} (polyLog (n)) {∥ \hat{δ} - {\tilde{δ}}_{i} ∥}^{4}},

since

E (∥ \hat{δ} ∥^{k})

exists and is bounded by

k polyLog (n) / τ^{k}

.

Using the results of Theorem A1 we have

E (| ∥ \hat{δ} ∥^{2} - ∥ {\tilde{δ}}_{i} {∥^{2} |}^{2}) = O (\frac{polyLog (n)}{n^{2 - 2 α}}) = o (n^{- 1}),

provided that

α < 1 / 2

.

For the second term, by definition, we have

\begin{matrix} ∥ {\tilde{δ}}_{i} ∥^{2} - {∥ {\hat{δ}}_{(i)} ∥}^{2} = & \frac{2}{n} {\hat{δ}}_{(i)}^{⊤} {(S_{i} + τ I)}^{- 1} x_{i} ψ (prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})) \\ + \frac{1}{n^{2}} x_{i}^{⊤} {(S_{i} + τ I)}^{- 2} x_{i} ψ^{2} (prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})) . \end{matrix}

Since

∥ {\hat{δ}}_{(i)} ∥^{2}

and

S_{i}

are independent of

x_{i}

, and

∥ {(S_{i} + τ I)}^{- 1} ∥_{2} \leq τ^{- 1}

, we have

{\hat{δ}}_{(i)}^{⊤} {(S_{i} + τ I)}^{- 1} x_{i} = O_{L_{2}} (| λ_{i} | ∥ {\hat{δ}}_{(i)} ∥ / c_{n}^{1 / 2})

. Recall also that

\sup_{i} {∥ ψ ∥}_{\infty} = O (polyLog (n))

. Therefore, both terms are of order

O_{L_{2}} (polyLog (n) / n c_{n}^{1 / 2})

.

We can now conclude that

E (| ∥ {\tilde{δ}}_{i} ∥^{2} - ∥ {\hat{δ}}_{(i)} {∥^{2} |}^{2}) = O (\frac{polyLog (n)}{n^{2}}) .

Therefore, we have

var (∥ \hat{δ} ∥^{2}) = O (\frac{polyLog (n)}{n^{1 - 2 α}}) = o (1) .

This shows that

∥ \hat{δ} ∥^{2}

has a deterministic equivalent in probability and in

L_{2}

.

For the second part of the result, similarly, we write that

E (| ∥ \hat{δ} - δ_{0} ∥^{2} - ∥ {\hat{δ}}_{(i)} - δ_{0} ∥^{2} |^{2}) \leq 2 [E (| ∥ \hat{δ} - δ_{0} ∥^{2} - ∥ {\tilde{δ}}_{i} - δ_{0} ∥^{2} |^{2}) + E (| ∥ {\tilde{δ}}_{i} - δ_{0} ∥^{2} - ∥ {\hat{δ}}_{(i)} - δ_{0} ∥^{2} |^{2})] .

Using the fact that

| ∥ \hat{δ} - δ_{0} ∥^{2} - ∥ {\tilde{δ}}_{i} - δ_{0} {∥^{2} |}^{2} = {[{(\hat{δ} - {\tilde{δ}}_{i})}^{⊤} (\hat{δ} + {\tilde{δ}}_{i} - 2 δ_{0})]}^{2}

and

{(\hat{δ} - {\tilde{δ}}_{i})}^{⊤} (\hat{δ} + {\tilde{δ}}_{i} - 2 δ_{0}) = 2 {(\hat{δ} - {\tilde{δ}}_{i})}^{⊤} (\hat{δ} - δ_{0}) - {∥ \hat{δ} - {\tilde{δ}}_{i} ∥}^{2}

, by the Cauchy–Schwarz inequality we have

| ∥ \hat{δ} - δ_{0} ∥^{2} - ∥ {\tilde{δ}}_{i} - δ_{0} ∥^{2} |^{2} = O_{L_{1}} (∥ \hat{δ} - {\tilde{δ}}_{i} ∥^{4}) + \sqrt{O_{L_{1}} (polyLog (n)) {∥ \hat{δ} - {\tilde{δ}}_{i} ∥}^{4}},

since

E (∥ \hat{δ} - δ_{0} ∥^{k})

exists and is bounded by

k polyLog (n) / τ^{k}

following from Assumption O7 and Lemma A3.

Using the results of Theorem A1 we have

E (| ∥ \hat{δ} - δ_{0} ∥^{2} - ∥ {\tilde{δ}}_{i} - δ_{0} {∥^{2} |}^{2}) = O (\frac{polyLog (n)}{n^{2 - 2 α}}) = o (n^{- 1}),

provided that

α < 1 / 2

.

Similarly, by definition we have

\begin{matrix} ∥ {\tilde{δ}}_{i} - δ_{0} ∥^{2} - {∥ {\hat{δ}}_{(i)} - δ_{0} ∥}^{2} = & \frac{2}{n} {({\hat{δ}}_{(i)} - δ_{0})}^{⊤} {(S_{i} + τ I)}^{- 1} x_{i} ψ (prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})) \\ + \frac{1}{n^{2}} x_{i}^{⊤} {(S_{i} + τ I)}^{- 2} x_{i} ψ^{2} (prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})) . \end{matrix}

Since

{({\hat{δ}}_{(i)} - δ_{0})}^{⊤} {(S_{i} + τ I)}^{- 1} x_{i} = O_{L_{2}} (| λ_{i} | ∥ {\hat{δ}}_{(i)} - δ_{0} ∥ / c_{n}^{1 / 2})

, both terms are of order

O_{L_{2}} (polyLog (n) / n c_{n}^{1 / 2})

.

Similarly, we have

var (∥ \hat{δ} - δ_{0} ∥^{2}) = O (\frac{polyLog (n)}{n^{1 - 2 α}}) = o (1) .

The result for

var (∥ \hat{β} - β_{0} ∥^{2})

follows from the fact that

var (∥ \hat{w} - w_{0} ∥^{2}) = o (1)

in Proposition 3.10 of [21].

Setting

\tilde{W} = {∥ \hat{δ} + \hat{w} - δ_{0} - w_{0} ∥}^{2}

and

{\tilde{W}}_{(i)} = {∥ {\hat{δ}}_{(i)} + \hat{w} - δ_{0} - w_{0} ∥}^{2}

, we similarly have

E (| ∥ \hat{δ} + \hat{w} - δ_{0} - w_{0} ∥^{2} - ∥ {\hat{δ}}_{(i)} + \hat{w} - δ_{0} - w_{0} {∥^{2} |}^{2}) = o (1) .

Since

E (∥ \hat{w} - w_{0} ∥) = O (1)

,

E (∥ \hat{δ} + \hat{w} - δ_{0} - w_{0} ∥)

is bounded by

K polyLog (n) / τ^{k}

. □

Appendix E.4. Leaving Out a Predictor

Let

V

be the

n \times (p - 1)

matrix corresponding to the first

(p - 1)

columns of the design matrix

X

. We call

v_{i}

in

R^{p - 1}

the vector corresponding to the first

p - 1

entries of

x_{i}

, i.e.,

v_{i}^{⊤} = (x_{i} (1), \dots, x_{i} (p - 1))

. Denote

X_{i} = {(X_{i} (1), \dots, X_{i} (p))}^{⊤}

. We call

X (p)

the vector in

R^{n}

with j-th entry

x_{j} (p)

, i.e., the p-th entry of the vector

x_{j}

. When this does not create problems, we also use the standard notation

x_{j, p}

for

x_{j} (p)

. We further denote

δ_{0} = {(γ_{0}^{⊤}, δ_{0} (p))}^{⊤}

.

Call

\hat{γ}

the solution of

\hat{γ} = \underset{γ \in R^{p - 1}}{arg min} \frac{1}{n} \sum_{i = 1}^{n} ρ {ϵ_{i} + v_{i}^{⊤} (w_{0, - p} - {\hat{w}}_{- p}) - v_{i}^{⊤} (γ - γ_{0})} + \frac{τ}{2} {∥ γ ∥}^{2} .

(A8)

Note that

[\begin{matrix} \hat{γ} \\ 0 \end{matrix}]

is the solution of the original optimization problem (5) when

x_{i} (p)

is replaced by 0.

Approximation to $\hat{δ}$ via leave-one-predictor-out

We use the notations and partitions

x_{i} = [\begin{matrix} v_{i} \\ x_{i} (p) \end{matrix}], \hat{δ} = [\begin{matrix} {\hat{δ}}_{- p} \\ \hat{δ} (p) \end{matrix}] .

(A9)

Naturally,

\hat{γ}

satisfies

- \frac{1}{n} \sum_{i = 1}^{n} v_{i} ψ {ϵ_{i} + v_{i}^{⊤} (w_{0, - p} - {\hat{w}}_{- p}) - v_{i}^{⊤} (\hat{γ} - γ_{0})} + τ \hat{γ} = 0_{p - 1} .

(A10)

Denote

r_{i, [p]} = ϵ_{i} + v_{i}^{⊤} (w_{0, - p} - {\hat{w}}_{- p}) - v_{i}^{⊤} (\hat{γ} - γ_{0}) .

i.e., the residuals based on

p - 1

predictors.

Recall that

- \frac{1}{n} \sum_{i = 1}^{n} x_{i} ψ {ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) - x_{i}^{⊤} (\hat{δ} - δ_{0})} + τ \hat{δ} = 0_{p},

(A11)

and

R_{i} = ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) - x_{i}^{⊤} (\hat{δ} - δ_{0}) .

Taking the difference between (A10) and (A11), we have

\frac{1}{n} \sum_{i}^{n} \{x_{i} ψ (R_{i}) - [\begin{matrix} v_{i} \\ 0 \end{matrix}] ψ (r_{i, [p]})\} - τ (\hat{δ} - [\begin{matrix} \hat{γ} \\ 0 \end{matrix}]) = 0_{p} .

Note that this p-dimensional equation can be separated into a scalar and a vector equation, namely,

\begin{matrix} \frac{1}{n} \sum_{i}^{n} \{x_{i} (p) ψ (R_{i})\} - τ \hat{δ} (p) = 0, \\ \frac{1}{n} \sum_{i}^{n} v_{i} \{ψ (R_{i}) - ψ (r_{i, [p]})\} - τ ({\hat{δ}}_{- p} - \hat{γ}) = 0_{p - 1} . \end{matrix}

Using a first-order Taylor expansion of

ψ (R_{i})

around

r_{i, [p]}

and noting that

R_{i} - r_{i, [p]} = v_{i}^{⊤} (\hat{γ} - {\hat{δ}}_{- p}) + x_{i} (p) {w_{0} (p) - \hat{w} (p) + δ_{0} (p) - \hat{δ} (p)}

, we can transform the first equation above into

\frac{1}{n} \sum_{i}^{n} x_{i} (p) (ψ (r_{i, [p]}) + ψ^{'} (r_{i, [p]}) [v_{i}^{⊤} (\hat{γ} - {\hat{δ}}_{- p}) + x_{i} (p) {w_{0} (p) - \hat{w} (p) + δ_{0} (p) - \hat{δ} (p)}]) - τ \hat{δ} (p) ≃ 0 .

This gives the near equivalence

\hat{δ} (p) ≃ \frac{\frac{1}{n} \sum x_{i} (p) (ψ (r_{i, [p]}) + ψ^{'} (r_{i, [p]}) [v_{i}^{⊤} (\hat{γ} - {\hat{δ}}_{- p}) + x_{i} (p) {w_{0} (p) - \hat{w} (p) + δ_{0} (p)}])}{\frac{1}{n} \sum x_{i}^{2} (p) ψ^{'} (r_{i, [p]}) + τ} .

Working similarly on the second equation involving

v_{i}

, we have

\frac{1}{n} \sum_{i}^{n} ψ^{'} (r_{i, [p]}) v_{i} (R_{i} - r_{i, [p]}) - τ ({\hat{δ}}_{- p} - \hat{γ}) ≃ 0_{p - 1} .

Since

R_{i} - r_{i, [p]} = v_{i}^{⊤} (\hat{γ} - {\hat{δ}}_{- p}) + x_{i} (p) {w_{0} (p) - \hat{w} (p) + δ_{0} (p) - \hat{δ} (p)}

, the above equation can be transformed into

[\frac{1}{n} \sum_{i}^{n} ψ^{'} (r_{i, [p]}) v_{i} v_{i}^{⊤}] (\hat{γ} - {\hat{δ}}_{- p}) + {w_{0} (p) - \hat{w} (p) + δ_{0} (p) - \hat{δ} (p)} \frac{1}{n} \sum_{i}^{n} ψ^{'} (r_{i, [p]}) v_{i} x_{i} (p) - τ ({\hat{δ}}_{- p} - \hat{γ}) ≃ 0_{p - 1} .

Denote

u_{p} = \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (r_{i, [p]}) v_{i} x_{i} (p), and S_{p} = \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (r_{i, [p]}) v_{i} v_{i}^{⊤},

we see that

\hat{γ} - {\hat{δ}}_{- p} ≃ - {w_{0} (p) - \hat{w} (p) + δ_{0} (p) - \hat{δ} (p)} {(S_{p} + τ I)}^{- 1} u_{p}

. Using the above approximation in the equation for

\hat{δ} (p)

, we can write

\hat{δ} (p) ≃ \frac{\frac{1}{n} \sum x_{i} (p) ψ (r_{i, [p]}) + {w_{0} (p) - \hat{w} (p) + δ_{0} (p)} {\frac{1}{n} \sum x_{i}^{2} (p) ψ^{'} (r_{i, [p]}) - u_{p}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p}}}{\frac{1}{n} \sum x_{i}^{2} (p) ψ^{'} (r_{i, [p]}) - u_{p}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p} + τ} .

Denote

ξ_{n} ≜ \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{2} (p) ψ^{'} (r_{i, [p]}) - u_{p}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p},

and

N_{p} ≜ \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} x_{i} (p) ψ (r_{i, [p]}) .

We have

(ξ_{n} + τ) \hat{δ} (p) ≃ \frac{1}{\sqrt{n}} N_{p} + {w_{0} (p) - \hat{w} (p) + δ_{0} (p)} ξ_{n} .

Also we have

{\hat{δ}}_{- p} ≃ \hat{γ} + {w_{0} (p) - \hat{w} (p) + δ_{0} (p) - \hat{δ} (p)} {(S_{p} + τ I)}^{- 1} u_{p} .

Thus, we construct an approximation to

\hat{δ}

. As a summary, we introduce the following definitions:

Definition A1.

We call the residuals corresponding to this optimization problem

{r_{i, [p]}}_{i = 1}^{n}

, in other words

r_{i, [p]} = ϵ_{i} + v_{i}^{⊤} (w_{0, - p} - {\hat{w}}_{- p}) - v_{i}^{⊤} (\hat{γ} - γ_{0}) .

We call

u_{p} = \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (r_{i, [p]}) v_{i} x_{i} (p), and S_{p} = \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (r_{i, [p]}) v_{i} v_{i}^{⊤} .

Note that

u_{p} \in R^{p - 1}

and

S_{p}

is

(p - 1) \times (p - 1)

. We call

ξ_{n} ≜ \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{2} (p) ψ^{'} (r_{i, [p]}) - u_{p}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p},

and

N_{p} ≜ \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} x_{i} (p) ψ (r_{i, [p]}) .

We consider

b_{p} ≜ {w_{0} (p) - \hat{w} (p) + δ_{0} (p)} \frac{ξ_{n}}{τ + ξ_{n}} + \frac{1}{\sqrt{n}} \frac{N_{p}}{τ + ξ_{n}} .

Note that when

ξ_{n} > 0

, we have

b_{p} - δ_{0} (p) = w_{0} (p) - \hat{w} (p) + \frac{n^{- 1 / 2} N_{p} - τ b_{p}}{ξ_{n}} .

We call

\tilde{b} = [\begin{matrix} \hat{γ} \\ δ_{0} (p) - \hat{w} (p) + w_{0} (p) \end{matrix}] + {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} [\begin{matrix} - {(S_{p} + τ I)}^{- 1} u_{p} \\ 1 \end{matrix}] .

Appendix E.4.1. Deterministic Aspects

Proposition A5.

We have

∥ \hat{δ} - \tilde{b} ∥ \leq \frac{1}{τ} | b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p) | \sup_{1 \leq i \leq n} | d_{i, p} | ∥ \hat{Σ} ∥_{2} \sqrt{∥ {(S_{p} + τ I)}^{- 1} u_{p} ∥^{2} + 1}

(A12)

where

d_{i, p} = [ψ^{'} (γ_{i, p}^{*}) - ψ^{'} (r_{i, [p]})]

and

γ_{i, p}^{*}

is in the interval

(ϵ_{i} + v_{i}^{⊤} (w_{0, - p} - {\hat{w}}_{- p}) - v_{i}^{⊤} (\hat{γ} - γ_{0}), ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) - x_{i}^{⊤} (\tilde{b} - δ_{0}))

. Furthermore,

∥ {(S_{p} + τ I)}^{- 1} u_{p} ∥^{2} \leq \frac{1}{n τ} \sum_{i = 1}^{n} x_{i}^{2} (p) ψ^{'} (r_{i, [p]}) = \frac{1}{n τ} \sum_{i = 1}^{n} λ_{i}^{2} ψ^{'} (r_{i, [p]}) X_{i}^{2} (p) .

(A13)

As in Lemma A2, we have

∥ \hat{δ} - \tilde{b} ∥ \leq \frac{1}{τ} ∥ f (\tilde{b}) ∥,

where

f (\tilde{b}) = - \frac{1}{n} \sum_{i = 1}^{n} x_{i} ψ {ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) - x_{i}^{⊤} (\tilde{b} - δ_{0})} + τ \tilde{b} .

(A14)

We note furthermore that, by definition of

\hat{γ}

,

g (\hat{γ}) = - \frac{1}{n} \sum_{i = 1}^{n} v_{i} ψ {ϵ_{i} + v_{i}^{⊤} (w_{0, - p} - {\hat{w}}_{- p}) - v_{i}^{⊤} (\hat{γ} - γ_{0})} + τ \hat{γ} = 0_{p - 1} .

Proof.

i. Work on the first $p - 1$ coordinates of $f (\tilde{b})$

Denote

f_{p - 1} (δ)

the first

p - 1

coordinates of

f (δ)

. Denote

{\hat{γ}}_{e x t}

the p-dimensional vector whose first

p - 1

coordinates are

\hat{γ}

and last coordinate is

δ_{0} (p) - \hat{w} (p) + w_{0} (p)

, i.e.,

{\hat{γ}}_{e x t} = [\begin{matrix} \hat{γ} \\ δ_{0} (p) - \hat{w} (p) + w_{0} (p) \end{matrix}] .

For a vector

v

, we use the notation

v_{- k}

to denote the vector obtained by removing the k-th coordinate of

v

.

Note that

\begin{matrix} f_{p - 1} (\tilde{b}) & = & f_{p - 1} (\tilde{b}) - g (\hat{γ}) \\ = & - \frac{1}{n} \sum_{i = 1}^{n} v_{i} [ψ {ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) + x_{i}^{⊤} (δ_{0} - \tilde{b})} - ψ {ϵ_{i} + v_{i}^{⊤} (w_{0, - p} - {\hat{w}}_{- p}) + v_{i}^{⊤} (γ_{0} - \hat{γ})}] \\ + τ ({\tilde{b}}_{- p} - \hat{γ}) . \end{matrix}

By the mean value theorem, for

γ_{i, p}^{*}

in the interval

(ϵ_{i} + v_{i}^{⊤} (w_{0, - p} - {\hat{w}}_{- p}) - v_{i}^{⊤} (\hat{γ} - γ_{0}), ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) - x_{i}^{⊤} (\tilde{b} - δ_{0}))

, we have

\begin{matrix} ψ {ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) + x_{i}^{⊤} (δ_{0} - \tilde{b})} - ψ {ϵ_{i} + v_{i}^{⊤} (w_{0, - p} - {\hat{w}}_{- p}) + v_{i}^{⊤} (γ_{0} - \hat{γ})} \\ = & ψ^{'} (γ_{i, p}^{*}) x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b}) \\ = & ψ^{'} (r_{i, [p]}) x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b}) + {ψ^{'} (γ_{i, p}^{*}) - ψ^{'} (r_{i, [p]})} x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b}) . \end{matrix}

Denote

\begin{matrix} d_{i, p} & = ψ^{'} (γ_{i, p}^{*}) - ψ^{'} (r_{i, [p]}), \\ δ_{i, p} & = {ψ^{'} (γ_{i, p}^{*}) - ψ^{'} (r_{i, [p]})} x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b}), \\ R_{p} & = - \frac{1}{n} \sum_{i = 1}^{n} v_{i} [{ψ^{'} (γ_{i, p}^{*}) - ψ^{'} (r_{i, [p]})} x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b})] . \end{matrix}

Therefore, we have

f_{p - 1} (\tilde{b}) = - \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (r_{i, [p]}) v_{i} x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b}) + τ ({\tilde{b}}_{- p} - \hat{γ}) + R_{p} ≜ A_{p} + R_{p} .

Note that by definition,

\begin{matrix} {\hat{γ}}_{e x t} - \tilde{b} & = {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} [\begin{matrix} {(S_{p} + τ I)}^{- 1} u_{p} \\ - 1 \end{matrix}], \\ {\tilde{b}}_{- p} - \hat{γ} & = - {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} {(S_{p} + τ I)}^{- 1} u_{p} . \end{matrix}

Therefore, we have

x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b}) = {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} {v_{i}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p} - x_{i} (p)}

, and

A_{p} = - {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} [\frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (r_{i, [p]}) v_{i} \{v_{i}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p} - x_{i} (p)\} + τ {(S_{p} + τ I)}^{- 1} u_{p}] .

By the definition of

S_{p}

and

u_{p}

, we have

A_{p} = - {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} {S_{p} {(S_{p} + τ I)}^{- 1} u_{p} - u_{p} + τ {(S_{p} + τ I)}^{- 1} u_{p}} = 0_{p - 1},

since

S_{p} {(S_{p} + τ I)}^{- 1} + τ {(S_{p} + τ I)}^{- 1} = I

.

Therefore, we conclude that

f_{p - 1} (\tilde{b}) = R_{p} .

ii. Work on the last coordinate of $f (\tilde{b})$

Denote

{[f (\tilde{b})]}_{p}

the last coordinate of

f (\tilde{b})

. We have shown that

\begin{matrix} ψ {ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) + x_{i}^{⊤} (δ_{0} - \tilde{b})} - ψ {ϵ_{i} + v_{i}^{⊤} (w_{0, - p} - {\hat{w}}_{- p}) + v_{i}^{⊤} (γ_{0} - \hat{γ})} \\ = & ψ^{'} (r_{i, [p]}) x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b}) + {ψ^{'} (γ_{i, p}^{*}) - ψ^{'} (r_{i, [p]})} x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b}) . \end{matrix}

Recall that

\begin{matrix} r_{i, [p]} & = ϵ_{i} + v_{i}^{⊤} (w_{0, - p} - {\hat{w}}_{- p}) - v_{i}^{⊤} (\hat{γ} - γ_{0}), \\ δ_{i, p} & = {ψ^{'} (γ_{i, p}^{*}) - ψ^{'} (r_{i, [p]})} x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b}) . \end{matrix}

We note that

\begin{matrix} ψ {ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) + x_{i}^{⊤} (δ_{0} - \tilde{b})} \\ = & ψ (r_{i, [p]}) + ψ^{'} (r_{i, [p]}) x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b}) + δ_{i, p} \\ = & ψ (r_{i, [p]}) + ψ^{'} (r_{i, [p]}) {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} {v_{i}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p} - x_{i} (p)} + δ_{i, p} . \end{matrix}

Therefore, we have

\begin{matrix} {[f (\tilde{b})]}_{p} + \frac{1}{n} \sum_{i = 1}^{n} x_{i} (p) δ_{i, p} \\ = & - \frac{1}{n} \sum_{i = 1}^{n} x_{i} (p) [ψ (r_{i, [p]}) + ψ^{'} (r_{i, [p]}) {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} {v_{i}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p} - x_{i} (p)}] + τ \tilde{b} (p), \\ = & - \frac{1}{n} \sum_{i = 1}^{n} x_{i} (p) ψ (r_{i, [p]}) - {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} u_{p}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p} \\ + {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (r_{i, [p]}) x_{i}^{2} (p) + τ b_{p}, \\ = & - \{\frac{1}{n} \sum_{i = 1}^{n} x_{i} (p) ψ (r_{i, [p]}) - τ b_{p}\} \\ + {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} \{\frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (r_{i, [p]}) x_{i}^{2} (p) - u_{p}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p}\}, \\ = & - (\frac{1}{\sqrt{n}} N_{p} - τ b_{p}) + {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} ξ_{n}, \\ = & 0 . \end{matrix}

We conclude that

\begin{matrix} {[f (\tilde{b})]}_{p} & = & - \frac{1}{n} \sum_{i = 1}^{n} x_{i} (p) δ_{i, p} \\ = & - \frac{1}{n} \sum_{i = 1}^{n} x_{i} (p) {ψ^{'} (γ_{i, p}^{*}) - ψ^{'} (r_{i, [p]})} x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b}) . \end{matrix}

iii. Representation of $f (\tilde{b})$

Aggregating all the results we have obtained so far, we see that

\begin{matrix} f (\tilde{b}) = & - \frac{1}{n} \sum_{i = 1}^{n} d_{i, p} x_{i} x_{i}^{⊤} ({\hat{γ}}_{e x t} - \tilde{b}) \\ = & - {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} \{\frac{1}{n} \sum_{i = 1}^{n} d_{i, p} x_{i} x_{i}^{⊤}\} [\begin{matrix} {(S_{p} + τ I)}^{- 1} u_{p} \\ - 1 \end{matrix}], \end{matrix}

(A15)

which implies (A12).

For

∥ {(S_{p} + τ I)}^{- 1} u_{p} ∥^{2}

, denote

D_{ψ^{'} (r_{\cdot, [p]})}

the diagonal matrix with

(i, i)

entry

ψ^{'} (r_{i, [p]})

. We have

u_{p} = \frac{1}{n} V^{⊤} D_{ψ^{'} (r_{\cdot, [p]})} X (p) .

Therefore,

\begin{matrix} ∥ {(S_{p} + τ I)}^{- 1} u_{p} ∥^{2} = u_{p}^{⊤} {(S_{p} + τ I)}^{- 2} u_{p} \leq \frac{1}{τ} u_{p}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p} \\ = & \frac{1}{τ} \frac{X {(p)}^{⊤}}{\sqrt{n}} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} \frac{D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} V}{\sqrt{n}} {(\frac{V^{⊤} D_{ψ^{'} (r_{\cdot, [p]})} V}{n} + τ I)}^{- 1} \frac{V^{⊤} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2}}{\sqrt{n}} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} \frac{X (p)}{\sqrt{n}}, \end{matrix}

where the inequality uses the fact that the eigenvalues of {(S_{p} + τ I)}^{- 1} are at most 1 / τ. Note that

\frac{D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} V}{\sqrt{n}} {(\frac{V^{⊤} D_{ψ^{'} (r_{\cdot, [p]})} V}{n} + τ I)}^{- 1} \frac{V^{⊤} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2}}{\sqrt{n}} ⪯ I .

So we have

∥ {(S_{p} + τ I)}^{- 1} u_{p} ∥^{2} \leq \frac{1}{n τ} X {(p)}^{⊤} D_{ψ^{'} (r_{\cdot, [p]})} X (p) = \frac{1}{n τ} \sum_{i = 1}^{n} x_{i}^{2} (p) ψ^{'} (r_{i, [p]}) .

□
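The construction of b_p and \tilde{b} can be tried numerically. The sketch below is a simplified illustration under several assumptions that are ours and not the paper's: the source error is switched off (\hat{w} = w_0), λ_i = 1, Gaussian design and noise, the pseudo-Huber loss ρ (r) = \sqrt{1 + r^{2}} - 1, and a plain gradient-descent solver. All function and variable names are hypothetical. It computes \hat{δ}, \hat{γ}, u_p, S_p, ξ_n, N_p, b_p and \tilde{b}, and then evaluates both sides of ∥ \hat{δ} - \tilde{b} ∥ \leq ∥ f (\tilde{b}) ∥ / τ and of (A13).

```python
import numpy as np

def psi(r):
    """psi = rho' for the pseudo-Huber loss sqrt(1 + r^2) - 1 (illustrative choice)."""
    return r / np.sqrt(1.0 + r**2)

def dpsi(r):
    """psi', which is non-negative and bounded by 1 for this loss."""
    return (1.0 + r**2) ** -1.5

def grad(X, y, tau, d):
    """Gradient of (1/n) sum rho(y_i - x_i' d) + (tau/2) ||d||^2 at d."""
    return -X.T @ psi(y - X @ d) / X.shape[0] + tau * d

def robust_ridge(X, y, tau, iters=500):
    """Gradient descent with step 1/L, L an upper bound on the Hessian norm."""
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / X.shape[0] + tau)
    d = np.zeros(X.shape[1])
    for _ in range(iters):
        d -= step * grad(X, y, tau, d)
    return d

rng = np.random.default_rng(0)
n, p, tau = 400, 100, 1.0
X = rng.standard_normal((n, p))
delta0 = rng.standard_normal(p) / np.sqrt(p)
y = X @ delta0 + rng.standard_normal(n)        # eps_i + x_i' delta0 (w_hat = w_0)

delta_hat = robust_ridge(X, y, tau)            # full p-dimensional problem
V, xp, c = X[:, :-1], X[:, -1], delta0[-1]     # c = delta0(p) because w_hat = w_0
gamma_hat = robust_ridge(V, y - xp * c, tau)   # problem (A8): x_i(p) replaced by 0
r = y - xp * c - V @ gamma_hat                 # residuals r_{i,[p]}

u_p = V.T @ (dpsi(r) * xp) / n
S_p = (V.T * dpsi(r)) @ V / n
Sinv_u = np.linalg.solve(S_p + tau * np.eye(p - 1), u_p)
xi_n = np.mean(xp**2 * dpsi(r)) - u_p @ Sinv_u
N_p = xp @ psi(r) / np.sqrt(n)
b_p = c * xi_n / (tau + xi_n) + N_p / (np.sqrt(n) * (tau + xi_n))
b_tilde = np.append(gamma_hat - (b_p - c) * Sinv_u, b_p)

gap = np.linalg.norm(delta_hat - b_tilde)                   # ||delta_hat - b_tilde||
grad_bound = np.linalg.norm(grad(X, y, tau, b_tilde)) / tau  # ||f(b_tilde)|| / tau
a13_lhs, a13_rhs = Sinv_u @ Sinv_u, np.mean(xp**2 * dpsi(r)) / tau
```

Comparing `delta_hat[-1]` with `b_p` for increasing n gives a rough empirical impression of the quality of the leave-one-predictor-out approximation in this simplified setting.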

Appendix E.4.2. Stochastic Aspects

Assume that

X (p) = {(X_{1} (p), \dots, X_{n} (p))}^{⊤}

is independent of

{V_{i}, ϵ_{i}}_{i = 1}^{n}

, where

V_{i} = v_{i} / λ_{i}

. This is consistent with Assumption P1.

To bound

∥ \hat{δ} - \tilde{b} ∥

, using Equation (A13) we have

∥ {(S_{p} + τ I)}^{- 1} u_{p} ∥^{2} \leq \frac{1}{n τ} \sum_{i = 1}^{n} λ_{i}^{2} {∥ ψ^{'} ∥}_{\infty} X_{i}^{2} (p),

and

∥ {(S_{p} + τ I)}^{- 1} u_{p} ∥^{2} \leq \frac{\sup_{i} {∥ ψ^{'} ∥}_{\infty}}{τ} \frac{1}{n} \sum_{i = 1}^{n} λ_{i}^{2} X_{i}^{2} (p) .

Therefore, under Assumptions O3, O4 and O6, we have for any fixed k and at fixed

τ

,

∥ {(S_{p} + τ I)}^{- 1} u_{p} ∥^{2} = O_{L_{k}} (polyLog (n)) .

It guarantees that

{∥ [\begin{matrix} {(S_{p} + τ I)}^{- 1} u_{p} \\ - 1 \end{matrix}] ∥}^{2} \leq (1 + ∥ {(S_{p} + τ I)}^{- 1} u_{p} ∥^{2}) = O_{L_{k}} (polyLog (n)) .

Thus

∥ \hat{δ} - \tilde{b} ∥ = O_{L_{k}} (\frac{1}{τ} polyLog (n) | b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p) | \sup_{1 \leq i \leq n} | d_{i, p} | ∥ \hat{Σ} ∥_{2}) .

Recall that Lemma 3.38 from [21] guarantees that

∥ \hat{Σ} ∥ = O_{L_{k}} (polyLog (n))

under Assumptions O1–O7. Also we will show in Proposition A6 that

| b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p) | = O_{L_{k}} {polyLog (n) (n^{- 1 / 2} \lor n^{- e})}

and in Proposition A8 that

\sup_{1 \leq i \leq n} | d_{i, p} | = {polyLog (n) (n^{α - 1 / 2} \lor n^{α - e})}

to show that

M_{1}

is small.

On $b_{p} - δ_{0} (p)$

Recall the notations

\begin{matrix} N_{p} & = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} x_{i} (p) ψ (r_{i, [p]}) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} λ_{i} ψ (r_{i, [p]}) X_{i} (p), \\ ξ_{n} & = \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{2} (p) ψ^{'} (r_{i, [p]}) - u_{p}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p} . \end{matrix}

Under our assumptions, we have

E (X_{i}) = 0_{p}

and

cov (X_{i}) = I_{p}

, and

X (p)

is independent of

{r_{i, [p]}}_{i = 1}^{n}

.

Proposition A6.

We have

| b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p) | \leq \frac{1}{\sqrt{n} τ} | N_{p} | + ∥ δ_{0} ∥_{\infty} + | \hat{w} (p) - w_{0} (p) | .

Furthermore, under Assumptions O1–O7 and P1,

N_{p} = O_{L_{k}} (polyLog (n))

and therefore, when τ is fixed,

| b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p) | = O_{L_{k}} (polyLog (n) n^{- 1 / 2} + ∥ δ_{0} ∥_{\infty} + | \hat{w} (p) - w_{0} (p) |) .

Note that Lemma A1 has shown that

| \hat{w} (p) - w_{0} (p) | = O_{L_{k}} (\frac{polyLog (n)}{\sqrt{n} \land n^{e}}) .

(A16)

Under Assumption P3,

n ∥ δ_{0} ∥_{\infty}^{2} polyLog (n) n^{2 α - 1 / 2} \to 0

, and therefore we have

| b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p) | = O_{L_{k}} (\frac{polyLog (n)}{\sqrt{n} \land n^{e}}) .

(A17)

Proof.

From the definition of

b_{p}

, we have, when

ξ_{n} > 0

,

b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p) = \frac{1}{\sqrt{n}} \frac{N_{p}}{τ + ξ_{n}} - \frac{τ}{τ + ξ_{n}} {δ_{0} (p) + w_{0} (p) - \hat{w} (p)} .

We will see later that

ξ_{n} \geq 0

in Lemma A7. It follows that

| b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p) | \leq \frac{1}{\sqrt{n} τ} | N_{p} | + | δ_{0} (p) + w_{0} (p) - \hat{w} (p) | .

Using independence of

X (p)

and

{V_{i}, ϵ_{i}}_{i = 1}^{n}

, and the fact that

E (X_{i}) = 0_{p}

, we have

E (N_{p}^{2}) = \frac{1}{n} \sum_{i = 1}^{n} E {X_{i}^{2} (p)} E {λ_{i}^{2} ψ^{2} (r_{i, [p]})},

whether the right-hand side is finite or not. Using the bounds on

\max λ_{i}^{2}

and

\sup_{i} {∥ ψ ∥}_{\infty}

, we have

E (N_{p}^{2}) \leq \frac{1}{n} \sum_{i = 1}^{n} E {X_{i}^{2} (p)} {∥ ψ ∥}_{\infty}^{2} E {λ_{i}^{2}} = O (polyLog (n)) .

Similarly, it is clear that

N_{p} = O_{L_{k}} (polyLog (n)) .

Therefore, we have

| b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p) | = \frac{1}{\sqrt{n} τ} O_{L_{k}} (polyLog (n)) + \sup_{1 \leq k \leq p} | δ_{0} (k) | + | \hat{w} (p) - w_{0} (p) | .

□

On $ξ_{n}$

Write

ξ_{n}

in matrix form. Let

D_{ψ^{'} (r_{\cdot, [p]})}

be the diagonal matrix with

(i, i)

entry

ψ^{'} (r_{i, [p]})

. Denote

X (p)

the last column of the design matrix

X

. Then we have

ξ_{n} = \frac{1}{n} X {(p)}^{⊤} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} M D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} X (p),

where

M = I_{n} - \frac{D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} V}{\sqrt{n}} {(\frac{1}{n} V^{⊤} D_{ψ^{'} (r_{\cdot, [p]})} V + τ I)}^{- 1} \frac{V^{⊤} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2}}{\sqrt{n}} .

(A18)

Lemma A7.

We have

ξ_{n} \geq 0 .

Furthermore, under Assumptions O1–O7 and P1, if

D_{λ_{i}}

is the diagonal matrix with

(i, i)

entry

λ_{i}

,

| ξ_{n} - \frac{1}{n} tr (D_{λ_{i}} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} M D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} D_{λ_{i}}) | = O_{L_{k}} (\sup_{1 \leq i \leq n} λ_{i}^{2} ψ^{'} (r_{i, [p]}) / \sqrt{n c_{n}}) .

Proof.

When

τ > 0

, all the eigenvalues of

M

are positive. Indeed, if the singular values of

n^{- 1 / 2} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} V

are denoted by

σ_{i}

, the eigenvalues of

M

are

τ / (σ_{i}^{2} + τ)

, together with 1 on the orthogonal complement of the column space of that matrix.

Therefore, since

ξ_{n} = n^{- 1} v^{⊤} M v

with

v = D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} X (p)

, we have

ξ_{n} \geq 0

.
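The spectral description of M and the non-negativity of ξ_n can be checked numerically. In the sketch below, random positive weights stand in for ψ^{'} (r_{i, [p]}) and a Gaussian vector stands in for X (p); these stand-ins, the dimensions, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, tau = 60, 20, 0.5
V = rng.standard_normal((n, p - 1))
w = rng.uniform(0.1, 1.0, size=n)       # stand-ins for psi'(r_{i,[p]}) >= 0
xp = rng.standard_normal(n)             # stand-in for the last column X(p)

A = np.sqrt(w)[:, None] * V / np.sqrt(n)                 # D^{1/2} V / sqrt(n)
M = np.eye(n) - A @ np.linalg.solve(A.T @ A + tau * np.eye(p - 1), A.T)

# Eigenvalues of M versus tau / (sigma_i^2 + tau), padded with ones.
eig_M = np.sort(np.linalg.eigvalsh((M + M.T) / 2))
sigma = np.linalg.svd(A, compute_uv=False)               # p - 1 singular values
expected = np.sort(np.concatenate([tau / (sigma**2 + tau),
                                   np.ones(n - (p - 1))]))

# xi_n in matrix form and from its original definition.
v = np.sqrt(w) * xp
xi_matrix = v @ M @ v / n
u_p = V.T @ (w * xp) / n
S_p = (V.T * w) @ V / n
xi_direct = np.mean(w * xp**2) - u_p @ np.linalg.solve(S_p + tau * np.eye(p - 1), u_p)
```

Both expressions for ξ_n agree up to rounding, and every eigenvalue of M lies in (0, 1].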

Since

M

is symmetric and has eigenvalues between 0 and 1, using Lemma V.1.5 in [36], we have

0 ⪯ D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} M D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} ⪯ D_{ψ^{'} (r_{\cdot, [p]})} .

Under Assumption P1, the matrix

M

is independent of

X (p)

.

D_{ψ^{'} (r_{\cdot, [p]})}

is also independent of

X (p)

. By definition, x_{i} (p) = λ_{i} X_{i} (p), so the last column of the design matrix equals

D_{λ_{i}} {(X_{1} (p), \dots, X_{n} (p))}^{⊤}

.

Under Assumption P1,

X (p)

satisfies the necessary concentration assumptions. Using Lemma 3.37 in [21], we have

\begin{matrix} | \frac{1}{n} X {(p)}^{⊤} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} M D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} X (p) - \frac{1}{n} tr (D_{λ_{i}} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} M D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} D_{λ_{i}}) | \\ = O_{L_{k}} (\frac{1}{\sqrt{n c_{n}}} \sup_{1 \leq i \leq n} λ_{i}^{2} ψ^{'} (r_{i, [p]})) . \end{matrix}

□

About $\frac{1}{n} tr (D_{λ_{i}} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} M D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} D_{λ_{i}})$

Lemma A8.

Denote

S_{p} = \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (r_{i, [p]}) v_{i} v_{i}^{⊤}, and S_{p} (i) = S_{p} - \frac{1}{n} ψ^{'} (r_{i, [p]}) v_{i} v_{i}^{⊤} .

Denote

\begin{matrix} c_{τ, p} & = \frac{1}{n} tr {{(S_{p} + τ I)}^{- 1}}, \\ ζ_{i} & = \frac{1}{n} v_{i}^{⊤} {S_{p} (i) + τ I}^{- 1} v_{i} - λ_{i}^{2} c_{τ, p} . \end{matrix}

(A19)

Then we have under Assumptions O1–O7 and P1, if

M

is the matrix defined in Lemma A7,

| \frac{1}{n} tr (I_{n} - M) - \{\frac{1}{n} tr (D_{λ_{i}} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} M D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} D_{λ_{i}})\} c_{τ, p} | \leq \sup_{i} | ζ_{i} | \cdot \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (r_{i, [p]}) .

We also have

\frac{1}{n} tr (I_{n} - M) = \frac{p - 1}{n} - τ c_{τ, p} .

Proof.

Denote

d_{i, i} = ψ^{'} (r_{i, [p]}) / n

. By using the Sherman–Morrison–Woodbury formula (see, e.g., [37], p. 19),

\begin{matrix} M_{i, i} & = 1 - d_{i, i} v_{i}^{⊤} {(V^{⊤} D_{ψ^{'} (r_{\cdot, [p]})} V / n + τ I)}^{- 1} v_{i}, \\ = 1 - d_{i, i} \frac{v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i}}{1 + d_{i, i} v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i}}, \\ = \frac{1}{1 + d_{i, i} v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i}} . \end{matrix}

Recall that we are interested in

\frac{1}{n} \sum_{i} λ_{i}^{2} ψ^{'} (r_{i, [p]}) M_{i, i} = \frac{1}{n} tr (D_{λ_{i}} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} M D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} D_{λ_{i}}) .

By the property of trace, we have

\begin{matrix} tr (I_{n} - M) & = tr {{(S_{p} + τ I)}^{- 1} S_{p}} \\ = p - 1 - τ tr {{(S_{p} + τ I)}^{- 1}} = p - 1 - n τ c_{τ, p} . \end{matrix}

This shows the second result of the lemma.
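Both the leave-one-out expression for M_{i, i} and the trace identity above are exact algebraic statements, so they can be verified on random inputs. The weights standing in for ψ^{'} (r_{i, [p]}), the dimensions, and the helper name `m_ii_leave_one_out` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, tau = 50, 15, 0.7
V = rng.standard_normal((n, p - 1))
w = rng.uniform(0.1, 1.0, size=n)        # stand-ins for psi'(r_{i,[p]})
I = np.eye(p - 1)

S_p = (V.T * w) @ V / n
A = np.sqrt(w)[:, None] * V / np.sqrt(n)
M = np.eye(n) - A @ np.linalg.solve(S_p + tau * I, A.T)

def m_ii_leave_one_out(i):
    """1 / (1 + d_ii v_i' (S_p(i) + tau I)^{-1} v_i), with d_ii = psi'(r_i) / n."""
    d_ii = w[i] / n
    S_i = S_p - d_ii * np.outer(V[i], V[i])              # S_p(i)
    q = V[i] @ np.linalg.solve(S_i + tau * I, V[i])
    return 1.0 / (1.0 + d_ii * q)

M_diag_loo = np.array([m_ii_leave_one_out(i) for i in range(n)])

c_tau_p = np.trace(np.linalg.inv(S_p + tau * I)) / n
trace_lhs = np.trace(np.eye(n) - M)
trace_rhs = (p - 1) - n * tau * c_tau_p
```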

The first result follows from the fact that

\begin{matrix} tr (I_{n} - M) & = \sum_{i} (1 - M_{i, i}) \\ = \sum_{i} d_{i, i} \frac{v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i}}{1 + d_{i, i} v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i}} . \end{matrix}

(A20)

With our definitions, we have, since

λ_{i}^{2} c_{τ, p} + ζ_{i} = n^{- 1} v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i}

,

\begin{matrix} \frac{1}{n} tr (I_{n} - M) = & (\frac{1}{n} \sum_{i} λ_{i}^{2} ψ^{'} (r_{i, [p]}) M_{i, i}) c_{τ, p} + \frac{1}{n} \sum_{i} ψ^{'} (r_{i, [p]}) \frac{ζ_{i}}{1 + d_{i, i} v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i}} . \end{matrix}

It immediately follows that

| \frac{1}{n} tr (I_{n} - M) - (\frac{1}{n} \sum_{i} λ_{i}^{2} ψ^{'} (r_{i, [p]}) M_{i, i}) c_{τ, p} | \leq \sup_{i} | ζ_{i} | \cdot \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (r_{i, [p]}) .

□

Controlling $ζ_{i}$

Lemma A9.

Suppose we can find

{r_{j, [p]}^{(i)}}_{j \neq i}

independent of

(λ_{i}, V_{i})

and

K_{n}

such that

\sup_{i} \sup_{j \neq i} | ψ_{j}^{'} (r_{j, [p]}^{(i)}) - ψ_{j}^{'} (r_{j, [p]}) | \leq K_{n} .

Then

\sup_{i} | ζ_{i} | = O_{L_{k}} [\{\frac{1}{τ^{2}} K_{n} {∥ \hat{Σ} ∥}_{2} + \frac{polyLog (n)}{τ \sqrt{n c_{n}}} + \frac{1}{n τ}\} polyLog (n)],

provided that

K_{n}

has

3 k

uniformly bounded moments.

Proof.

Denote

{AM}_{i, p} = \frac{1}{n} \sum_{j \neq i} ψ^{'} (r_{j, [p]}^{(i)}) v_{j} v_{j}^{⊤} .

Then, using the fact that

A^{- 1} - B^{- 1} = A^{- 1} (B - A) B^{- 1}

, we have

∥ {(S_{p} (i) + τ I)}^{- 1} - {({AM}_{i, p} + τ I)}^{- 1} ∥ \leq \frac{1}{τ^{2}} K_{n} {∥ \hat{Σ} ∥}_{2},

since

∥ n^{- 1} \sum_{i} v_{i} v_{i}^{⊤} ∥ \leq ∥ \hat{Σ} ∥_{2}

. In particular, we have

| \frac{1}{n} v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i} - \frac{1}{n} v_{i}^{⊤} {({AM}_{i, p} + τ I)}^{- 1} v_{i} | \leq \frac{∥ v_{i} ∥^{2}}{n} \frac{1}{τ^{2}} K_{n} {∥ \hat{Σ} ∥}_{2} .
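The resolvent identity and the resulting operator-norm bound can be illustrated as follows. This is a simplified sketch: two sets of non-negative weights stand in for ψ^{'} (r_{j, [p]}) and ψ^{'} (r_{j, [p]}^{(i)}), the sums run over all n observations rather than over j \neq i, and the sample second-moment matrix of the v_j plays the role of the bounding matrix. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, tau = 80, 25, 0.4
V = rng.standard_normal((n, p - 1))
w = rng.uniform(0.2, 1.0, size=n)                 # stand-ins for psi'(r_{j,[p]})
w_alt = w + rng.uniform(-0.05, 0.05, size=n)      # stand-ins for psi'(r_{j,[p]}^{(i)})
K_n = np.max(np.abs(w_alt - w))                   # uniform bound on the weight change

I = np.eye(p - 1)
A_mat = (V.T * w) @ V / n + tau * I
B_mat = (V.T * w_alt) @ V / n + tau * I
A_inv, B_inv = np.linalg.inv(A_mat), np.linalg.inv(B_mat)

# Resolvent identity A^{-1} - B^{-1} = A^{-1} (B - A) B^{-1}.
identity_gap = np.max(np.abs((A_inv - B_inv) - A_inv @ (B_mat - A_mat) @ B_inv))

# Operator-norm bound: ||A^{-1} - B^{-1}|| <= K_n ||V'V / n|| / tau^2.
lhs = np.linalg.norm(A_inv - B_inv, 2)
rhs = K_n * np.linalg.norm(V.T @ V / n, 2) / tau**2
```

The bound uses only that both weighted matrices are positive semi-definite, so that each inverse has operator norm at most 1 / τ.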

Since

{AM}_{i, p}

is independent of

(λ_{i}, V_{i})

, we can use Lemma 3.37 in [21] and since

v_{i} = λ_{i} V_{i}

, we have

\sup_{1 \leq i \leq n} | \frac{1}{n} v_{i}^{⊤} {({AM}_{i, p} + τ I)}^{- 1} v_{i} - \frac{λ_{i}^{2}}{n} tr {{({AM}_{i, p} + τ I)}^{- 1}} | = O_{L_{k}} (\frac{polyLog (n)}{τ \sqrt{n c_{n}}} \sup_{1 \leq i \leq n} λ_{i}^{2}),

by using the fact that

λ_{\max} ({({AM}_{i, p} + τ I)}^{- 1}) \leq τ^{- 1}

.

Using the operator norm bound we gave above, we also have

| \frac{1}{n} tr {{({AM}_{i, p} + τ I)}^{- 1}} - \frac{1}{n} tr {{(S_{p} (i) + τ I)}^{- 1}} | \leq \frac{1}{τ^{2}} K_{n} {∥ \hat{Σ} ∥}_{2} \frac{p}{n} .

We can now conclude that

\begin{matrix} \sup_{1 \leq i \leq n} | \frac{1}{n} v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i} - \frac{λ_{i}^{2}}{n} tr {{(S_{p} (i) + τ I)}^{- 1}} | \\ = O [\{\frac{1}{τ^{2}} K_{n} {∥ \hat{Σ} ∥}_{2} \sup_{1 \leq i \leq n} (\frac{p}{n} + \frac{∥ v_{i} ∥^{2}}{n}) + \frac{polyLog (n)}{τ \sqrt{n c_{n}}}\} (\sup_{1 \leq i \leq n} λ_{i}^{2} \lor 1)] . \end{matrix}

So under O1 and O4,

\sup_{1 \leq i \leq n} {∥ v_{i} ∥}^{2} / n = O_{L_{k}} (1)

and finally

\begin{matrix} \sup_{1 \leq i \leq n} | \frac{1}{n} v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i} - \frac{λ_{i}^{2}}{n} tr {{(S_{p} (i) + τ I)}^{- 1}} | \\ = O_{L_{k}} \{(\frac{1}{τ^{2}} K_{n} {∥ \hat{Σ} ∥}_{2} + \frac{polyLog (n)}{τ \sqrt{n c_{n}}}) (\sup_{1 \leq i \leq n} λ_{i}^{2} \lor 1)\} . \end{matrix}

Control of

n^{- 1} tr {{(S_{p} (i) + τ I)}^{- 1}} - n^{- 1} tr {{(S_{p} + τ I)}^{- 1}}

Using the Sherman–Morrison–Woodbury formula, we have

{(S_{p} (i) + τ I)}^{- 1} - {(S_{p} + τ I)}^{- 1} = \frac{ψ^{'} (r_{i, [p]})}{n} \frac{{(S_{p} (i) + τ I)}^{- 1} v_{i} v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1}}{1 + \frac{ψ^{'} (r_{i, [p]})}{n} v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i}} .

Take the trace of both sides, and we have

0 \leq tr {{(S_{p} (i) + τ I)}^{- 1}} - tr {{(S_{p} + τ I)}^{- 1}} \leq \frac{1}{τ},

since

v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 2} v_{i} \leq τ^{- 1} v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i}

.

Therefore,

0 \leq \frac{1}{n} tr {{(S_{p} (i) + τ I)}^{- 1}} - \frac{1}{n} tr {{(S_{p} + τ I)}^{- 1}} \leq \frac{1}{n τ} .
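The two-sided bound on the trace difference holds for every rank-one downdate, which the following sketch checks for all i on random inputs. The weights standing in for ψ^{'} (r_{i, [p]}) and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, tau = 70, 30, 0.3
V = rng.standard_normal((n, p - 1))
w = rng.uniform(0.1, 1.0, size=n)                 # stand-ins for psi'(r_{i,[p]})
I = np.eye(p - 1)

S_p = (V.T * w) @ V / n
tr_full = np.trace(np.linalg.inv(S_p + tau * I))

gaps = []
for i in range(n):
    # S_p(i): remove the i-th rank-one term from S_p.
    S_i = S_p - (w[i] / n) * np.outer(V[i], V[i])
    gaps.append(np.trace(np.linalg.inv(S_i + tau * I)) - tr_full)
gaps = np.array(gaps)   # tr{(S_p(i) + tau I)^{-1}} - tr{(S_p + tau I)^{-1}} for each i
```

Dividing `gaps` by n gives the quantity bounded by 1 / (n τ) in the last display.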

We can now conclude that

\sup_{i} | ζ_{i} | = O_{L_{k}} [\{\frac{1}{τ^{2}} K_{n} {∥ \hat{Σ} ∥}_{2} + \frac{polyLog (n)}{τ \sqrt{n c_{n}}} + \frac{1}{n τ}\} (\sup_{1 \leq i \leq n} λ_{i}^{2} \lor 1)],

provided we can use Hölder’s inequality. This, in turn, requires

K_{n}

to have

3 k

moments. □

Control of $K_{n}$

A natural choice for

{r_{j, [p]}^{(i)}}_{j \neq i}

defined in Lemma A9 is to use a leave-one-out estimator of

\hat{γ}

, where the i-th observation (and hence

v_{i}

) is removed. Hence, all the work performed in Theorem A1 can be used here.

Lemma A10.

Suppose we use for

r_{i, [p]}^{(i)}

the residuals we would obtain by using a leave-one-out estimator of

\hat{γ}

, i.e., excluding

(v_{i}, ϵ_{i})

from problem (A8).

With the notations of Lemma A9, we have under Assumptions O1–O7 and P1,

K_{n} = O_{L_{k}} (n^{2 α - 1 / 2} polyLog (n)) .

In particular, for any fixed τ,

\sup_{i} | ζ_{i} | = O_{L_{k}} (n^{2 α - 1 / 2} polyLog (n)) .

Proof.

Denote

δ_{n} (i)

random variables such that

\sup_{j \neq i} | r_{j, [p]}^{(i)} - r_{j, [p]} | \leq δ_{n} (i) .

Applying Theorem A1 with

R_{j} = r_{j, [p]}

and

{\tilde{r}}_{j, (i)} = r_{j, [p]}^{(i)}

, we have

\sup_{i} δ_{n} (i) = O_{L_{k}} (n^{2 α - 1 / 2} polyLog (n)) .

The control of

K_{n}

follows by using the fact that

ψ^{'}

is

C n^{α}

-Lipschitz. □

Corollary A1.

Recall that in (A6) and (A1)

c_{i} = \frac{1}{n} x_{i}^{⊤} {(S_{i} + τ I)}^{- 1} x_{i}, c_{τ} = \frac{1}{n} tr {(S + τ I)}^{- 1} .

Then under Assumptions O1–O7 and P1, we have

\sup_{i} | c_{i} - λ_{i}^{2} c_{τ} | = O_{L_{k}} (n^{2 α - 1 / 2} polyLog (n)) .

Proof.

We have shown that

\sup_{i} | \frac{1}{n} v_{i}^{⊤} {(S_{p} (i) + τ I)}^{- 1} v_{i} - λ_{i}^{2} c_{τ, p} | = O_{L_{k}} (n^{2 α - 1 / 2} polyLog (n)) .

Recall that

c_{τ} = \frac{1}{n} tr [{\{\frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (R_{i}) x_{i} x_{i}^{⊤} + τ I\}}^{- 1}] .

We see that

c_{τ}

is analogous to

c_{τ, p}

when we use all p predictors, rather than the first

(p - 1)

of them.

Indeed,

c_{i}

in (A6) is defined, in the notation of the proof of Lemma A9 as an analog of

n^{- 1} v_{i}^{⊤} {({AM}_{i, p} + τ I)}^{- 1} v_{i}

, with the role of

{r_{j, [p]}^{(i)}}_{j \neq i}

being played by the residuals obtained by the leave-one-out estimator of

\hat{γ}

, excluding

(x_{i}, y_{i})

from the problem. Lemma A9 in connection with Theorem A2 shows that

\sup_{i} | n^{- 1} v_{i}^{⊤} {({AM}_{i, p} + τ I)}^{- 1} v_{i} - λ_{i}^{2} c_{τ, p} | = O_{L_{k}} (n^{2 α - 1 / 2} polyLog (n))

. Passing from the

p - 1

dimensional version of this result, i.e., Lemma A9, to the p-dimensional version gives the approximation stated in the corollary.

We therefore have

\sup_{i} | c_{i} - λ_{i}^{2} c_{τ} | = O_{L_{k}} (n^{2 α - 1 / 2} polyLog (n)) .

□

Further results on $ξ_{n}$ and $b_{p}$

Proposition A7.

Under Assumptions O1–O7 and P1, we have

| c_{τ, p} (ξ_{n} + τ) - \frac{p - 1}{n} | = O_{L_{k}} (\frac{polyLog (n)}{n^{1 / 2 - 2 α}}) .

(A21)

Furthermore, under Assumptions O1–O7 and P1–P3, since

∥ δ_{0} ∥_{\infty} = O_{L_{k}} (n^{- e})

,

\begin{matrix} {(\frac{p}{n})}^{2} n E [{b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)}^{2}] = & \frac{1}{n} \sum_{i = 1}^{n} E [{c_{τ, p} λ_{i} ψ (r_{i, [p]})}^{2}] \\ + n τ^{2} {δ_{0} (p) - \hat{w} (p) + w_{0} (p)}^{2} E (c_{τ, p}^{2}) + o (1) . \end{matrix}

(A22)

Proof.

For Equation (A21):

By Lemma A8, we have

\frac{p - 1}{n} - τ c_{τ, p} = \frac{1}{n} tr (I_{n} - M) \geq 0 .

The latter quantity was approximated in Lemma A8 by

\frac{1}{n} tr (D_{λ_{i}} D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} M D_{ψ^{'} (r_{\cdot, [p]})}^{1 / 2} D_{λ_{i}}) c_{τ, p},

which approximates

ξ

as in Lemma A7. This gives the result of Equation (A21), by simply keeping track of the approximation errors we make at each step.

For Equation (A22):

Recall that by definition:

\sqrt{n} [(τ + ξ_{n}) b_{p} - ξ_{n} δ_{0} (p) + ξ_{n} {\hat{w} (p) - w_{0} (p)}] = N_{p} = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} λ_{i} ψ (r_{i, [p]}) X_{i} (p) .

Therefore, we have

\begin{matrix} c_{τ, p} \sqrt{n} (τ + ξ_{n}) {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} = & \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} c_{τ, p} λ_{i} ψ (r_{i, [p]}) X_{i} (p) \\ - c_{τ, p} \sqrt{n} τ {δ_{0} (p) - \hat{w} (p) + w_{0} (p)} . \end{matrix}

Note that

c_{τ, p} λ_{i} ψ (r_{i, [p]})

is independent of

X_{i} (p)

, and

X_{i} (p)

’s are independent with mean zero and variance one. We have

\begin{matrix} E [c_{τ, p}^{2} n {(τ + ξ_{n})}^{2} {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)}^{2}] = & \frac{1}{n} \sum_{i = 1}^{n} E [{c_{τ, p} λ_{i} ψ (r_{i, [p]})}^{2}] \\ + n τ^{2} {δ_{0} (p) - \hat{w} (p) + w_{0} (p)}^{2} E (c_{τ, p}^{2}) . \end{matrix}

(A23)

Recall that Proposition A6 gives that

| b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p) | \leq \frac{1}{\sqrt{n} τ} | N_{p} | + ∥ δ_{0} ∥_{\infty} + | \hat{w} (p) - w_{0} (p) | .

Under Assumption P3,

n ∥ δ_{0} ∥_{\infty}^{2} polyLog (n) n^{2 α - 1 / 2} \to 0

. Together with Equations (A21) and (A16), the LHS of (A23) can be written as

\begin{matrix} E [c_{τ, p}^{2} {(τ + ξ_{n})}^{2} n {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)}^{2}] \\ = & E [{\{c_{τ, p} (τ + ξ_{n}) - \frac{p - 1}{n} + \frac{p - 1}{n}\}}^{2} n {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)}^{2}] \\ = & {(\frac{p}{n})}^{2} n E [{b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)}^{2}] + o (1) . \end{matrix}

This implies Equation (A22). □

On $d_{i, p}$

Recall that

d_{i, p} = ψ^{'} (γ_{i, p}^{*}) - ψ^{'} (r_{i, [p]}),

where

γ_{i, p}^{*}

lies in the interval

(r_{i, [p]}, r_{i, [p]} + ν_{i})

, with

\begin{matrix} ν_{i} & = {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} x_{i}^{⊤} [\begin{matrix} {(S_{p} + τ I)}^{- 1} u_{p} \\ - 1 \end{matrix}] \\ = {b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)} π_{i} . \end{matrix}

We have the following result.

Proposition A8.

Under Assumptions O1–O7 and P1–P3, we have, at fixed τ,

\sup_{i} | d_{i, p} | = O_{L_{k}} (\frac{polyLog (n) n^{α}}{n^{1 / 2} \land n^{e}}) .

Proof.

Note that we can write

π_{i} = x_{i}^{⊤} [\begin{matrix} {(S_{p} + τ I)}^{- 1} u_{p} \\ - 1 \end{matrix}] = v_{i}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p} - x_{i} (p) .

Recall that

u_{p} = n^{- 1} V^{⊤} D_{ψ^{'} (r_{\cdot, [p]})} X (p)

. We can also write it as

u_{p} = \frac{1}{n} V^{⊤} D_{λ_{i}^{2} ψ^{'} (r_{\cdot, [p]})} X (p) .

Using the independence of

X (p)

with

{(V_{i}, ϵ_{i})}_{i = 1}^{n}

, and the concentration assumptions on

X (p)

in Assumption P1, according to Lemma 3.36 in [21], we have

\sup_{i} | v_{i}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p} | = O_{L_{k}} (\frac{polyLog (n)}{c_{n}^{1 / 2}} \sup_{i} ∥ \frac{1}{n} D_{λ_{i}^{2} ψ^{'} (r_{\cdot, [p]})} V {(S_{p} + τ I)}^{- 1} V_{i} ∥),

where we view

v_{i}^{⊤} {(S_{p} + τ I)}^{- 1} u_{p}

as a linear form in

V_{i}

. Note that we have absorbed the

\sup_{i} | λ_{i} |

in the

polyLog (n)

term.

We can write

{∥ \frac{1}{n} D_{λ_{i}^{2} ψ^{'} (r_{\cdot, [p]})} V {(S_{p} + τ I)}^{- 1} V_{i} ∥}^{2} = \frac{1}{n} V_{i}^{⊤} {(S_{p} + τ I)}^{- 1} \frac{V^{⊤} D_{λ_{i}^{2} ψ^{'} (r_{\cdot, [p]})}^{2} V}{n} {(S_{p} + τ I)}^{- 1} V_{i} .

We notice that

S_{p} = n^{- 1} V^{⊤} D_{λ_{i}^{2} ψ^{'} (r_{\cdot, [p]})} V

. Hence,

\frac{V^{⊤} D_{λ_{i}^{2} ψ^{'} (r_{\cdot, [p]})}^{2} V}{n} ⪯ {∥ D_{λ_{i}^{2} ψ^{'} (r_{\cdot, [p]})} ∥}_{2} S_{p} .

Therefore, we conclude that

\frac{1}{n} V_{i}^{⊤} {(S_{p} + τ I)}^{- 1} \frac{V^{⊤} D_{λ_{i}^{2} ψ^{'} (r_{\cdot, [p]})}^{2} V}{n} {(S_{p} + τ I)}^{- 1} V_{i} \leq \frac{∥ V_{i} ∥^{2}}{n τ} {∥ D_{λ_{i}^{2} ψ^{'} (r_{\cdot, [p]})} ∥}_{2} = \frac{∥ V_{i} ∥^{2}}{n τ} \sup_{i} λ_{i}^{2} ψ^{'} (r_{i, [p]}) .

Note that

\sup_{i} | x_{i} (p) | = O_{L_{k}} (polyLog (n) / \sqrt{c_{n}})

under Assumptions O4, O6 and P1, according to Appendix 7 in [21]. Therefore, we have

\begin{matrix} \sup_{i} | π_{i} | & = O_{L_{k}} [\frac{polyLog (n)}{c_{n}^{1 / 2}} \{1 + \sqrt{\sup_{i} λ_{i}^{2} ψ^{'} (r_{i, [p]}) \sup_{i} \frac{∥ V_{i} ∥^{2}}{n τ}}\}] \\ = O_{L_{k}} [\frac{polyLog (n)}{c_{n}^{1 / 2}} \{1 + \sqrt{\sup_{i} λ_{i}^{2} ψ^{'} (r_{i, [p]})}\}] \\ = O_{L_{k}} {polyLog (n)} . \end{matrix}

Recall that

| b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p) | = O_{L_{k}} (polyLog (n) n^{- 1 / 2} + ∥ δ_{0} ∥_{\infty} + | \hat{w} (p) - w_{0} (p) |)

and by Lemma A1 we have

| \hat{w} (p) - w_{0} (p) | = O_{L_{k}} (\frac{polyLog (n)}{\sqrt{n} \land n^{e}}) .

We have

\sup_{i} | ν_{i} | = O_{L_{k}} (\frac{polyLog (n)}{\sqrt{n} \land n^{e}}) .

As in [21], under our assumption that

ψ^{'}

is

C n^{α}

-Lipschitz, we see that

\sup_{i} | d_{i, p} | = O_{L_{k}} (\frac{polyLog (n) n^{α}}{n^{1 / 2} \land n^{e}}) .

□

Appendix E.4.3. Final Conclusions

Gathering all the results, we have the following theorem.

Theorem A2.

Under Assumptions O1–O7 and P1–P3, we have, for any fixed τ,

∥ \hat{δ} - \tilde{b} ∥ \leq O_{L_{k}} (\frac{polyLog (n) n^{α}}{{(n^{1 / 2} \land n^{e})}^{2}}) .

In particular,

\begin{matrix} \sqrt{n} ({\hat{δ}}_{p} - b_{p}) & = O_{L_{k}} (\frac{polyLog (n) n^{α + 1 / 2}}{{(n^{1 / 2} \land n^{e})}^{2}}), \\ \sup_{i} | x_{i}^{⊤} (\hat{δ} - \tilde{b}) | & = O_{L_{k}} (\frac{polyLog (n) n^{α + 1 / 2}}{{(n^{1 / 2} \land n^{e})}^{2}}), \\ \sup_{i} | R_{i} - r_{i, [p]} | & = O_{L_{k}} (\{\frac{polyLog (n)}{\sqrt{n} \land n^{e}}\} \lor \{\frac{polyLog (n) n^{α + 1 / 2}}{{(n^{1 / 2} \land n^{e})}^{2}}\}) . \end{matrix}

Proof.

The first two results are direct consequences of all our results, using the key bound on

∥ \hat{δ} - \tilde{b} ∥

of Proposition A6.

The third result is a direct consequence of the fact that

\sup_{i} ∥ X_{i} ∥ / n^{1 / 2} = O_{L_{k}} (1)

, which was shown in the proof of Lemma A6.

The last result follows from the fact that

R_{i} - r_{i, [p]} = x_{i}^{⊤} (\hat{δ} - \tilde{b}) - ν_{i}

. The result on

ν_{i}

is given in the proof of Proposition A8. □

Recall Equation (A22)

\begin{matrix} {(\frac{p}{n})}^{2} n E [{b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)}^{2}] = & \frac{1}{n} \sum_{i = 1}^{n} E [{c_{τ, p} λ_{i} ψ (r_{i, [p]})}^{2}] \\ + n τ^{2} {δ_{0} (p) - \hat{w} (p) + w_{0} (p)}^{2} E (c_{τ, p}^{2}) + o (1) . \end{matrix}

and Equation (A17)

| b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p) | = O_{L_{k}} (\frac{polyLog (n)}{\sqrt{n} \land n^{e}}) .

We have

\begin{matrix} {(\frac{p}{n})}^{2} n E [{{\hat{δ}}_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)}^{2}] \\ = {(\frac{p}{n})}^{2} n E [{{\hat{δ}}_{p} - b_{p} + b_{p} - δ_{0} (p) + \hat{w} (p) - w_{0} (p)}^{2}] \\ = \frac{1}{n} \sum_{i = 1}^{n} E [{c_{τ, p} λ_{i} ψ (r_{i, [p]})}^{2}] + n τ^{2} {δ_{0} (p) - \hat{w} (p) + w_{0} (p)}^{2} E (c_{τ, p}^{2}) + o (A), \end{matrix}

where

A = \frac{1}{n} \sum_{i = 1}^{n} E [{c_{τ, p} λ_{i} ψ (r_{i, [p]})}^{2}] + n τ^{2} {δ_{0} (p) - \hat{w} (p) + w_{0} (p)}^{2} E (c_{τ, p}^{2})

.

We note that the index p in Equation (A22) and the previous theorem plays no particular role, and similar results hold when p is replaced by any k in the range

1 \leq k \leq p

. Summing over all the coordinates, we have under Assumptions O1–O7 and P1–P4,

\begin{matrix} {(\frac{p}{n})}^{2} E (∥ \hat{δ} - δ_{0} + \hat{w} - w_{0} ∥^{2}) = & \frac{1}{n} \sum_{k = 1}^{p} (\frac{1}{n} \sum_{i = 1}^{n} E [{c_{τ, k} λ_{i} ψ (r_{i, [k]})}^{2}]) \\ + τ^{2} \sum_{k = 1}^{p} {δ_{0} (k) - \hat{w} (k) + w_{0} (k)}^{2} E (c_{τ, k}^{2}) + o (B), \end{matrix}

where

B = \frac{1}{n} \sum_{k = 1}^{p} (\frac{1}{n} \sum_{i = 1}^{n} E [{c_{τ, k} λ_{i} ψ (r_{i, [k]})}^{2}]) + τ^{2} \sum_{k = 1}^{p} {δ_{0} (k) - \hat{w} (k) + w_{0} (k)}^{2} E (c_{τ, k}^{2})

.

Note that

0 \leq c_{τ, k} \leq κ / τ

,

n^{- 1} \sum_{i = 1}^{n} {∥ ψ^{2} ∥}_{\infty} \leq C

from Assumption O3, and

∥ δ_{0} ∥^{2} = O (1)

. We have

B = O (1)

. Therefore, we have

\begin{matrix} {(\frac{p}{n})}^{2} E (∥ \hat{δ} - δ_{0} + \hat{w} - w_{0} ∥^{2}) = & \frac{1}{n} \sum_{k = 1}^{p} (\frac{1}{n} \sum_{i = 1}^{n} E [{c_{τ, k} λ_{i} ψ (r_{i, [k]})}^{2}]) \\ + τ^{2} \sum_{k = 1}^{p} {δ_{0} (k) - \hat{w} (k) + w_{0} (k)}^{2} E (c_{τ, k}^{2}) + o (1) . \end{matrix}

On $c_{τ, k}$ and $c_{τ}$

We now show that

c_{τ, k}

and

c_{τ}

are close to each other.

Proposition A9.

We have, under Assumptions O1–O7 and P1–P3,

\sup_{1 \leq k \leq p} | c_{τ, k} - c_{τ} | = O_{L_{k}} ([\frac{polyLog (n) n^{α}}{\sqrt{n} \land n^{e}}] \lor [\frac{polyLog (n) n^{2 α + 1 / 2}}{{(n^{1 / 2} \land n^{e})}^{2}}] \lor \frac{polyLog (n)}{n}) .

Of course, we also have

0 \leq c_{τ} \leq p / (n τ)

and

0 \leq c_{τ, k} \leq p / (n τ)

.

Proof.

Recall that

S = \frac{1}{n} \sum_{i = 1}^{n} ψ^{'} (R_{i}) x_{i} x_{i}^{⊤} .

With

Γ = n^{- 1} \sum_{i = 1}^{n} ψ^{'} (R_{i}) v_{i} v_{i}^{⊤}

and

a = n^{- 1} \sum_{i = 1}^{n} ψ^{'} (R_{i}) x_{i}^{2} (p)

, we have

S = (\begin{matrix} Γ & v \\ v & a \end{matrix}) .

According to Lemma 3.40 in [21], we have, since

c_{τ} = n^{- 1} tr {{(S + τ I_{p})}^{- 1}}

,

| c_{τ} - \frac{1}{n} tr {{(Γ + τ I_{p - 1})}^{- 1}} | \leq \frac{1}{n} \frac{1 + a / τ}{τ} .

We also have

a = \frac{1}{n} \sum_{i = 1}^{n} λ_{i}^{2} X_{i}^{2} (p) ψ^{'} (R_{i}) \leq polyLog (n) \frac{1}{n} \sum_{i = 1}^{n} λ_{i}^{2} X_{i}^{2} (p) = O_{L_{k}} (polyLog (n)) .

Since

ψ^{'}

is

C n^{α}

-Lipschitz and

\sup_{i} | R_{i} - r_{i, [p]} | = O_{L_{k}} (\{\frac{polyLog (n)}{\sqrt{n} \land n^{e}}\} \lor \{\frac{polyLog (n) n^{α + 1 / 2}}{{(n^{1 / 2} \land n^{e})}^{2}}\}),

we have

\sup_{i} | ψ^{'} (R_{i}) - ψ^{'} (r_{i, [p]}) | = O_{L_{k}} (\{\frac{polyLog (n) n^{α}}{\sqrt{n} \land n^{e}}\} \lor \{\frac{polyLog (n) n^{2 α + 1 / 2}}{{(n^{1 / 2} \land n^{e})}^{2}}\}) .

Hence, using arguments similar to those in the proof of Lemma A9, we have

| \frac{1}{n} tr {{(Γ + τ I_{p - 1})}^{- 1}} - \frac{1}{n} tr {{(S_{p} + τ I_{p})}^{- 1}} | = O_{L_{k}} ([\frac{polyLog (n) n^{α}}{\sqrt{n} \land n^{e}}] \lor [\frac{polyLog (n) n^{2 α + 1 / 2}}{{(n^{1 / 2} \land n^{e})}^{2}}]) .

Since

c_{τ, p} = n^{- 1} tr {{(S_{p} + τ I_{p})}^{- 1}}

, the result follows immediately.

We note that p did not play a particular role in the proof and hence taking the sup over those indices only adds a

polyLog (n)

term. Hence the result holds for

\sup_{1 \leq k \leq p} | c_{τ, k} - c_{τ} |

. □

We are now ready to prove the last proposition of this section, which will help us obtain the second equation of our system.

Proposition A10.

\begin{matrix} {(\frac{p}{n})}^{2} E (∥ \hat{δ} - δ_{0} + \hat{w} - w_{0} ∥^{2}) = & \frac{p}{n} \frac{1}{n} \sum_{i = 1}^{n} E [{c_{τ} λ_{i} ψ (prox (c_{τ} λ_{i}^{2} ρ) ({\tilde{r}}_{i, (i)}))}^{2}] \\ + τ^{2} {∥ δ_{0} + w_{0} - \hat{w} ∥}^{2} E (c_{τ}^{2}) + o (1) . \end{matrix}

(A24)

Furthermore, when all

λ_{i}

’s are non-zero,

\frac{1}{n} \sum_{i = 1}^{n} E [{c_{τ} λ_{i} ψ (prox (c_{τ} λ_{i}^{2} ρ) ({\tilde{r}}_{i, (i)}))}^{2}] = \frac{1}{n} \sum_{i = 1}^{n} E (\frac{{{\tilde{r}}_{i, (i)} - prox (c_{τ} λ_{i}^{2} ρ) ({\tilde{r}}_{i, (i)})}^{2}}{λ_{i}^{2}}) .

Proof.

In light of Proposition A9 and Assumption P3, which guarantees that

∥ δ_{0} ∥^{2}

is uniformly bounded in p and n, we have

\sum_{k = 1}^{p} {δ_{0} (k) - \hat{w} (k) + w_{0} (k)}^{2} E (c_{τ, k}^{2}) = {∥ δ_{0} + w_{0} - \hat{w} ∥}^{2} E (c_{τ}^{2}) + o (1) .

Using Theorem A2 and the bound on

∥ ψ^{'} ∥_{\infty}

in Assumption O3, we have

\frac{1}{p} \sum_{k = 1}^{p} E [{c_{τ, k} λ_{i} ψ (r_{i, [k]})}^{2}] = \frac{1}{p} \sum_{k = 1}^{p} E [{c_{τ, k} λ_{i} ψ (R_{i})}^{2}] + o (1) .

With the help of Proposition A9, we have

\frac{1}{p} \sum_{k = 1}^{p} E [{c_{τ, k} λ_{i} ψ (R_{i})}^{2}] = \frac{1}{p} \sum_{k = 1}^{p} E [{c_{τ} λ_{i} ψ (R_{i})}^{2}] + o (1) = E [{c_{τ} λ_{i} ψ (R_{i})}^{2}] + o (1) .

In light of Equation (A7), we have

\frac{1}{n} \sum_{i = 1}^{n} E {{(c_{τ} λ_{i} ψ (R_{i}))}^{2}} = \frac{1}{n} \sum_{i = 1}^{n} E ({[c_{τ} λ_{i} ψ {prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})}]}^{2}) + o (1) .

Lemma A11 gives the computation of the derivative of

prox (c ρ) (x)

with respect to c, which allows us to bound the error

| ψ {prox (c_{i} ρ) (x)} - ψ {prox (c_{τ} λ_{i}^{2} ρ) (x)} |

. In light of this, by using the fact that

\sup_{i} | c_{i} - λ_{i}^{2} c_{τ} | = O_{L_{k}} (n^{2 α - 1 / 2} polyLog (n))

in Corollary A1, we can re-express the previous equation as

\frac{1}{n} \sum_{i = 1}^{n} E {{(c_{τ} λ_{i} ψ (R_{i}))}^{2}} = \frac{1}{n} \sum_{i = 1}^{n} E ({[c_{τ} λ_{i} ψ {prox (c_{τ} λ_{i}^{2} ρ) ({\tilde{r}}_{i, (i)})}]}^{2}) + o (1) .

When all

λ_{i}

’s are non-zero, we have

\frac{1}{n} \sum_{i = 1}^{n} E {{(c_{τ} λ_{i} ψ (R_{i}))}^{2}} = \frac{1}{n} \sum_{i = 1}^{n} E (\frac{{[c_{τ} λ_{i}^{2} ψ {prox (c_{τ} λ_{i}^{2} ρ) ({\tilde{r}}_{i, (i)})}]}^{2}}{λ_{i}^{2}}) + o (1) .

Finally, since by definition,

\forall x \in R, x = prox (c ρ) (x) + c ψ (prox (c ρ) (x)),

we have

\frac{1}{n} \sum_{i = 1}^{n} E (\frac{{[c_{τ} λ_{i}^{2} ψ {prox (c_{τ} λ_{i}^{2} ρ) ({\tilde{r}}_{i, (i)})}]}^{2}}{λ_{i}^{2}}) = \frac{1}{n} \sum_{i = 1}^{n} E (\frac{{[{\tilde{r}}_{i, (i)} - prox (c_{τ} λ_{i}^{2} ρ) ({\tilde{r}}_{i, (i)})]}^{2}}{λ_{i}^{2}}) .

□

Lemma A11

(El Karoui [21], Lemma 3.32). Suppose x is a real number and ρ is twice differentiable and convex. Then, for

c > 0

, we have

\frac{\partial}{\partial c} prox (c ρ) (x) = - \frac{ψ (prox (c ρ) (x))}{1 + c ψ^{'} (prox (c ρ) (x))},

and

\frac{\partial}{\partial c} ρ (prox (c ρ) (x)) = - \frac{ψ^{2} (prox (c ρ) (x))}{1 + c ψ^{'} (prox (c ρ) (x))} .

In particular, for fixed x, the map

c \mapsto ρ (prox (c ρ) (x))

is decreasing in c.
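
The two derivative formulas of Lemma A11 can be checked numerically. The sketch below is only an illustration under assumptions that are not part of the paper: it takes the pseudo-Huber loss ρ(y) = sqrt(1 + y^2) - 1 as a convenient twice differentiable convex example, computes prox(cρ)(x) by bisection on the defining relation y + c ψ(y) = x, and compares central finite differences in c with the closed forms. All function names are hypothetical.

```python
import math

def rho(y):
    """Pseudo-Huber loss, an assumed example of a smooth convex rho."""
    return math.sqrt(1.0 + y * y) - 1.0

def psi(y):
    """psi = rho'."""
    return y / math.sqrt(1.0 + y * y)

def dpsi(y):
    """psi' = rho''; strictly positive for this loss."""
    return (1.0 + y * y) ** -1.5

def prox(c, x, iters=200):
    """prox(c rho)(x): the unique y with y + c * psi(y) = x, by bisection."""
    lo, hi = min(0.0, x), max(0.0, x)  # the root lies between 0 and x
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + c * psi(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dprox_dc(c, x):
    """First formula of the lemma: -psi(y) / (1 + c psi'(y)), y = prox(c rho)(x)."""
    y = prox(c, x)
    return -psi(y) / (1.0 + c * dpsi(y))

def drho_prox_dc(c, x):
    """Second formula of the lemma: -psi(y)^2 / (1 + c psi'(y))."""
    y = prox(c, x)
    return -psi(y) ** 2 / (1.0 + c * dpsi(y))

def central_diff(f, c, h=1e-5):
    """Central finite difference of f at c."""
    return (f(c + h) - f(c - h)) / (2.0 * h)

c, x = 0.7, 1.3
fd_prox = central_diff(lambda t: prox(t, x), c)
fd_rho = central_diff(lambda t: rho(prox(t, x)), c)
# Both gaps should be tiny if the closed forms are right.
print(abs(fd_prox - dprox_dc(c, x)), abs(fd_rho - drho_prox_dc(c, x)))
```

The second derivative is never positive, which is the monotonicity statement at the end of the lemma.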

Appendix E.5. Last Steps of the Proof

Appendix E.5.1. On the Asymptotic Behavior of ${\tilde{r}}_{i, (i)}$

We have the following result.

Lemma A12.

Under Assumptions O1–O7 and P1–P4, as n and p tend to infinity,

{\tilde{r}}_{i, (i)} = ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) - x_{i}^{⊤} ({\hat{δ}}_{(i)} - δ_{0})

behaves like

ϵ_{i} + λ_{i} \sqrt{E (∥ \hat{β} - β_{0} ∥^{2})} Z_{i}

, where

Z_{i} \sim N (0, 1)

is independent of

ϵ_{i}

and

λ_{i}

, in the sense of weak convergence.

Furthermore, if

i \neq j

,

{\tilde{r}}_{i, (i)}

and

{\tilde{r}}_{j, (j)}

are asymptotically (pairwise) independent. The same is true for

({\tilde{r}}_{i, (i)}, λ_{i})

and

({\tilde{r}}_{j, (j)}, λ_{j})

.

Proof.

First part.

In this part, we will show that

X_{i}^{⊤} ({\hat{δ}}_{(i)} - δ_{0} + \hat{w} - w_{0})

is asymptotically

N (0, E (∥ \hat{β} - β_{0} ∥^{2}))

. Recall that

{\hat{δ}}_{(i)}

is independent of

X_{i}

and that

E (X_{i}) = 0

,

cov (X_{i}) = I

and that, for any finite k, the first k moments of its entries are bounded uniformly in n.

We have shown in Proposition A4 that

var (∥ \hat{β} - β_{0} ∥^{2}) \to 0

. In light of Lemma A3, we also know that

E (∥ \hat{β} - β_{0} ∥^{2})

is uniformly bounded. Furthermore, in the proof of Proposition A4, we showed that

E (∥ \hat{δ} + \hat{w} - δ_{0} - w_{0} ∥^{2} - ∥ {\hat{δ}}_{(i)} + \hat{w} - δ_{0} - w_{0} ∥^{2}) \to 0

.

We use a simple generalization of the standard Lindeberg-Feller theorem (see e.g., [38]). Indeed, if

a_{n, p} (k)

are random variables with

\sqrt{\sum_{k = 1}^{p} a_{n, p} {(k)}^{2}} = A_{n}

,

E (A_{n}^{2})

remains bounded in n, and

a_{n, p} (k)

’s are independent of

X_{i}

, we see that: a) if

z \sim N (0, I_{p})

, independent of

a_{n, p} (k)

, then

a_{n, p}^{⊤} z \sim A_{n} N

where

N \sim N (0, 1)

and independent of

A_{n}

(conditionally and unconditionally on

a_{n, p}

); b) Theorem 2.1.5 and its proof in [38] hold provided that

\sum_{k = 1}^{p} E (| a_{n, p} (k) |^{3}) = o (1)

. The proof simply needs to be started conditionally on

a_{n, p} (k)

, and the final moment bounds are then taken unconditionally. This very mild generalization gives, if

ϕ

is a

C^{3}

function, with bounded 2nd and 3rd derivatives,

\begin{matrix} \forall ϵ > 0, | E {ϕ (a_{n, p}^{⊤} X_{i})} - E {ϕ (A_{n} N)} | \\ \leq K [ϵ ∥ ϕ^{(3)} ∥_{\infty} E \{\sum_{k = 1}^{p} a_{n, p} {(k)}^{2}\} + \frac{∥ ϕ^{(2)} ∥_{\infty}}{ϵ} \sum_{k = 1}^{p} E (| a_{n, p} (k) |^{3})], \end{matrix}

where K is a constant depending on the second and third absolute moments of the entries of

X_{i}

. It is therefore independent of n and p under our assumptions on

X_{i}

.

We can apply this result to

a_{n, p} (k) = {\hat{δ}}_{(i)} (k) - δ_{0} (k) + \hat{w} (k) - w_{0} (k)

. Recall that we have shown that

\hat{δ} (k) - δ_{0} (k) + \hat{w} (k) - w_{0} (k) = O_{L_{k}} (\frac{polyLog (n)}{n^{1 / 2} \land n^{e}})

for each coordinate k. The same arguments apply also to

{\hat{δ}}_{(i)} (k)

, the kth coordinate of

{\hat{δ}}_{(i)}

. We have

E (| {\hat{δ}}_{(i)} (k) - δ_{0} (k) + \hat{w} (k) - w_{0} (k) |^{3}) = O \{\frac{polyLog (n)}{{(n^{1 / 2} \land n^{e})}^{3}}\} .

Provided that

e > 1 / 3

, we have

E (\sum_{k = 1}^{p} {| {\hat{δ}}_{(i)} (k) - δ_{0} (k) + \hat{w} (k) - w_{0} (k) |}^{3}) = O \{\frac{polyLog (n) n}{{(n^{1 / 2} \land n^{e})}^{3}}\} = o (1) .
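
To see where the condition on e enters, the exponent in the last bound can be tracked explicitly. The following worked step ignores the polyLog factor and uses only the fact that the sum has p terms with p of the same order as n:

```latex
\frac{n}{\left(n^{1/2} \wedge n^{e}\right)^{3}}
  = n^{\,1 - 3 \min(1/2,\, e)}
  = \begin{cases}
      n^{-1/2}, & e \geq 1/2, \\
      n^{\,1 - 3e}, & e < 1/2,
    \end{cases}
```

so the exponent is negative exactly when e > 1/3, and a polyLog factor cannot offset a negative power of n.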

Therefore, in connection with Corollary 2.1.9 in [38],

X_{i}^{⊤} ({\hat{δ}}_{(i)} - δ_{0} + \hat{w} - w_{0})

behaves asymptotically like

∥ {\hat{δ}}_{(i)} - δ_{0} + \hat{w} - w_{0} ∥ N

in the sense of weak convergence.

Since

∥ {\hat{δ}}_{(i)} - δ_{0} + \hat{w} - w_{0} ∥ - E ∥ {\hat{δ}}_{(i)} - δ_{0} + \hat{w} - w_{0} ∥ \to 0

in probability and the expectations are uniformly bounded, Slutsky’s lemma gives that

X_{i}^{⊤} ({\hat{δ}}_{(i)} - δ_{0} + \hat{w} - w_{0}) behaves like E (∥ {\hat{δ}}_{(i)} - δ_{0} + \hat{w} - w_{0} ∥) N

asymptotically in the sense of weak convergence. Using the fact that

E (∥ \hat{δ} - δ_{0} + \hat{w} - w_{0} ∥^{2} - ∥ {\hat{δ}}_{(i)} - δ_{0} + \hat{w} - w_{0} ∥^{2}) \to 0

, another application of Slutsky’s lemma yields

X_{i}^{⊤} ({\hat{δ}}_{(i)} - δ_{0} + \hat{w} - w_{0}) behaves like E (∥ \hat{δ} - δ_{0} + \hat{w} - w_{0} ∥) N

in the sense of weak convergence.

We note that the same reasoning applies when replacing

a_{n, p} (k) = {\hat{δ}}_{(i)} (k) - δ_{0} (k) + \hat{w} (k) - w_{0} (k)

by

{\tilde{a}}_{n, p} (k) = λ_{i} {{\hat{δ}}_{(i)} (k) - δ_{0} (k) + \hat{w} (k) - w_{0} (k)}

, provided that

λ_{i}

has 3 moments. It shows that

x_{i}^{⊤} ({\hat{δ}}_{(i)} - δ_{0} + \hat{w} - w_{0}) behaves like λ_{i} E (∥ \hat{β} - β_{0} ∥) N .

This shows the first part of the lemma, since

E (∥ \hat{β} - β_{0} ∥) = \sqrt{E (∥ \hat{β} - β_{0} ∥^{2})} + o (1)

by Proposition A4.
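
The Gaussian approximation in the first part can be illustrated with a small simulation. This is a sketch under assumptions that are not taken from the paper: the coefficient vector is a hypothetical deterministic vector with equal coordinates and norm 0.5 (so its sum of cubed coordinates is small), and the entries of the design row are uniform on [-sqrt(3), sqrt(3)], which have mean zero and variance one but are not Gaussian. The standardized linear form should then have roughly unit variance and put roughly 68% of its mass in [-1, 1].

```python
import math
import random

def simulate_linear_form(p=100, reps=3000, norm_a=0.5, seed=0):
    """Draw reps copies of a^T X / ||a|| for a delocalized a and non-Gaussian X."""
    rng = random.Random(seed)
    a_k = norm_a / math.sqrt(p)      # equal coordinates, so sum_k |a(k)|^3 = norm_a^3 / sqrt(p)
    half_width = math.sqrt(3.0)      # Uniform(-sqrt(3), sqrt(3)) has mean 0 and variance 1
    draws = []
    for _ in range(reps):
        s = sum(a_k * rng.uniform(-half_width, half_width) for _ in range(p))
        draws.append(s / norm_a)     # standardize by ||a||
    return draws

def summarize(draws):
    """Return the sample mean, sample variance and the fraction of draws in [-1, 1]."""
    n = len(draws)
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / n
    within_one = sum(1 for d in draws if abs(d) <= 1.0) / n
    return mean, var, within_one

mean, var, within_one = summarize(simulate_linear_form())
print(mean, var, within_one)
```

In the lemma the coefficient vector is random and independent of the design row; conditioning on it reduces to the deterministic case simulated here.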

Second part.

For the second part, we use a leave-two-out approach. More precisely, we use the approximation

{\tilde{r}}_{i, (i)} = ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) - x_{i}^{⊤} ({\hat{δ}}_{(i)} - δ_{0}) = ϵ_{i} + x_{i}^{⊤} (w_{0} - \hat{w}) - x_{i}^{⊤} ({\hat{δ}}_{(i j)} - δ_{0}) + o_{L_{k}} (1),

and similarly for

{\tilde{r}}_{j, (j)}

, which follows from Theorem A1. Here

{\hat{δ}}_{(i j)}

is computed by solving Problem (5) without

(x_{i}, y_{i})

and

(x_{j}, y_{j})

. It is clear that

{\tilde{r}}_{i, (i)}

and

{\tilde{r}}_{j, (j)}

are asymptotically independent conditional on

X_{(i j)}

, i.e., the design matrix without the i-th and j-th rows.

Similarly, we have

E (\sum_{k = 1}^{p} {| {\hat{δ}}_{(i j)} (k) - δ_{0} (k) + \hat{w} (k) - w_{0} (k) |}^{3}) = O \{\frac{polyLog (n) n}{{(n^{1 / 2} \land n^{e})}^{3}}\} = o (1) .

Therefore, we also have

\sum_{k = 1}^{p} {| {\hat{δ}}_{(i j)} (k) - δ_{0} (k) + \hat{w} (k) - w_{0} (k) |}^{3} = o_{p} (1) .

Note that

{\hat{δ}}_{(i j)}

depends only on

{X_{(i j)}, ϵ_{(i j)}}

. We call

P_{(i j)}

the joint probability measure

P_{(i j)} = \prod_{k \neq (i, j)} P_{x_{k}, ϵ_{k}}

, i.e., probability computed with respect to all our random variables except

(x_{i}, ϵ_{i})

and

(x_{j}, ϵ_{j})

.

So we have found an event

E_{(i j)}^{n}

, which depends only on

(X_{(i j)}, ϵ_{(i j)})

, such that

P_{(i j)} (E_{(i j)}^{n}) \to 1

and

\sum_{k = 1}^{p} {| {\hat{δ}}_{(i j)} (k) - δ_{0} (k) + \hat{w} (k) - w_{0} (k) |}^{3} = o (1)

when

(X_{(i j)}, {ϵ_{k}}_{k \neq (i, j)}) \in E_{(i j)}^{n}

. By treating

a_{n, p}

’s as deterministic quantities, the arguments we gave above then imply that, when

(X_{(i j)}, ϵ_{(i j)}) \in E_{(i j)}^{n}

,

X_{i}^{⊤} ({\hat{δ}}_{(i j)} - δ_{0} + \hat{w} - w_{0}) | (X_{(i j)}, ϵ_{(i j)}) behaves like (∥ {\hat{δ}}_{(i j)} - δ_{0} ∥) N .

We now use characteristic function arguments. Let

α_{i} = X_{i}^{⊤} ({\hat{δ}}_{(i j)} - δ_{0} + \hat{w} - w_{0})

and

α_{j} = X_{j}^{⊤} ({\hat{δ}}_{(i j)} - δ_{0} + \hat{w} - w_{0})

.

Let

(w_{1}, w_{2}) \in R^{2}

be fixed and

χ (w_{1}, w_{2}) = E \{e^{i (w_{1} α_{i} + w_{2} α_{j})}\} = E \{e^{i (w_{1} α_{i} + w_{2} α_{j})} (1_{E_{(i j)}^{n}} + 1_{{[E_{(i j)}^{n}]}^{c}})\} .

(A25)

Since

P (E_{(i j)}^{n}) = P_{(i j)} (E_{(i j)}^{n}) \to 1

, we can just focus on

E {e^{i (w_{1} α_{i} + w_{2} α_{j})} 1_{E_{(i j)}^{n}}}

, since the modulus of the functions we are integrating is bounded by 1.

Now, we have

E \{e^{i (w_{1} α_{i} + w_{2} α_{j})} 1_{E_{(i j)}^{n}}\} = E [1_{E_{(i j)}^{n}} E \{e^{i (w_{1} α_{i} + w_{2} α_{j})} | X_{(i j)}, ϵ_{(i j)}\}],

since

1_{E_{(i j)}^{n}}

is a deterministic function of

(X_{(i j)}, ϵ_{(i j)})

. Independence of

X_{i}

and

X_{j}

gives

E \{e^{i (w_{1} α_{i} + w_{2} α_{j})} | X_{(i j)}, ϵ_{(i j)}\} = E (e^{i w_{1} α_{i}} | X_{(i j)}, ϵ_{(i j)}) E (e^{i w_{2} α_{j}} | X_{(i j)}, ϵ_{(i j)}) .

Also, the conditional Gaussian approximation established above implies that

1_{E_{(i j)}^{n}} \{E (e^{i (w_{1} α_{i} + w_{2} α_{j})} | X_{(i j)}, ϵ_{(i j)}) - e^{- (w_{1}^{2} / 2 + w_{2}^{2} / 2) {∥ {\hat{δ}}_{(i j)} - δ_{0} ∥}^{2}}\} \to 0

in

P_{(i j)}

-probability.

So we conclude that

E \{1_{E_{(i j)}^{n}} e^{i (w_{1} α_{i} + w_{2} α_{j})}\} - E \{1_{E_{(i j)}^{n}} e^{- (w_{1}^{2} / 2 + w_{2}^{2} / 2) {∥ {\hat{δ}}_{(i j)} - δ_{0} ∥}^{2}}\} \to 0 .

Since

P (E_{(i j)}^{n}) \to 1

and

∥ {\hat{δ}}_{(i j)} - δ_{0} ∥^{2}

is asymptotically deterministic by arguments similar to those in the proof of Proposition A4, we have

E \{1_{E_{(i j)}^{n}} e^{- (w_{1}^{2} / 2 + w_{2}^{2} / 2) {∥ {\hat{δ}}_{(i j)} - δ_{0} ∥}^{2}}\} - e^{- (w_{1}^{2} / 2 + w_{2}^{2} / 2) E (∥ {\hat{δ}}_{(i j)} - δ_{0} ∥^{2})} \to 0 .

Therefore,

E \{e^{i (w_{1} α_{i} + w_{2} α_{j})}\} - E (e^{i w_{1} α_{i}}) E (e^{i w_{2} α_{j}}) \to 0 .

This shows that

α_{i}

and

α_{j}

are asymptotically independent. It easily follows that

{\tilde{r}}_{i, (i)}

and

{\tilde{r}}_{j, (j)}

are asymptotically independent.

The same reasoning applies to

({\tilde{r}}_{i, (i)}, λ_{i})

and

({\tilde{r}}_{j, (j)}, λ_{j})

, since

{\hat{δ}}_{(i j)}

is independent of

λ_{i}

and

λ_{j}

under Assumption O6. □

Appendix E.5.2. On the Asymptotic Behavior of $c_{τ}$

We now show that

c_{τ} = n^{- 1} tr {{(S + τ I_{p})}^{- 1}}

is asymptotically deterministic. This requires several steps.

Lemma A13.

We work under Assumptions O1–O7, P1–P4 and F2–F4. Consider the random function

g_{n} (x) = \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{1 + x λ_{i}^{2} ψ^{'} {prox (x λ_{i}^{2} ρ) ({\tilde{r}}_{i, (i)})}}, defined for x \geq 0 .

Let

B > 0

be in

R_{+}

. We have, for any

(x, y) \in R_{+}^{2}

, and

0 \leq x \leq B

,

0 \leq y \leq B

,

\sup_{(x, y) : | x - y | \leq η, 0 \leq x \leq B, 0 \leq y \leq B} | g_{n} (x) - g_{n} (y) | \leq η \frac{1}{n} \sum_{i = 1}^{n} (λ_{i}^{2} ∥ ψ^{'} ∥_{\infty} + B L (n) λ_{i}^{4} {∥ ψ ∥}_{\infty}) .

In particular, under Assumptions P2 and F3–F4, we have for

C

a constant independent of n and p,

\Pr^{*} (\sup_{(x, y) : | x - y | \leq η, 0 \leq x \leq B, 0 \leq y \leq B} | g_{n} (x) - g_{n} (y) | > δ) \leq \frac{η}{δ} C .

(A26)

Hence,

g_{n}

is stochastically equicontinuous on

[0, B]

for any

B > 0

given.

We used the notation

\Pr^{*}

above to denote outer probability and avoid a discussion of potential measure-theoretic issues associated with taking a supremum over an uncountable collection of random variables (see e.g., van der Vaart [39] (Sect. 18.2)).

Proof.

Consider the function defined for

x \geq 0

,

h_{u}^{(i)} (x) = \frac{1}{1 + x λ_{i}^{2} ψ^{'} {prox (x λ_{i}^{2} ρ) (u)}} = \frac{\partial}{\partial u} prox (x λ_{i}^{2} ρ) (u) .

The last equality comes from Lemma 3.33 from El Karoui [21].

Since

ψ^{'}

is non-negative,

\forall u, | h_{u}^{(i)} (x) - h_{u}^{(i)} (y) | \leq | x λ_{i}^{2} ψ^{'} (prox (x λ_{i}^{2} ρ) (u)) - y λ_{i}^{2} ψ^{'} (prox (y λ_{i}^{2} ρ) (u)) | \land 1 .

Thus, since

x, y \geq 0

, for all u,

\begin{matrix} | h_{u}^{(i)} (x) - h_{u}^{(i)} (y) | \leq & λ_{i}^{2} | x - y | ψ^{'} (prox (x λ_{i}^{2} ρ) (u)) \\ + λ_{i}^{2} y | ψ^{'} (prox (x λ_{i}^{2} ρ) (u)) - ψ^{'} (prox (y λ_{i}^{2} ρ) (u)) | . \end{matrix}

In particular, if

| x - y | \leq η

, and

x \lor y \leq B

, with

x, y \geq 0

, for all u,

\begin{matrix} \sup_{y : | x - y | \leq η; x \lor y \leq B} | h_{u}^{(i)} (x) - h_{u}^{(i)} (y) | \leq & λ_{i}^{2} η ψ^{'} {prox (x λ_{i}^{2} ρ) (u)} \\ + B λ_{i}^{2} \sup_{y : | x - y | \leq η, x \lor y \leq B} | ψ^{'} {prox (x λ_{i}^{2} ρ) (u)} - ψ^{'} {prox (y λ_{i}^{2} ρ) (u)} | . \end{matrix}

Under Assumption O3,

ψ^{'}

is

L (n)

-Lipschitz. Therefore, for

x_{i} = x λ_{i}^{2}

,

y_{i} = y λ_{i}^{2} \geq 0

, we have

\forall u, | ψ^{'} {prox (x_{i} ρ) (u)} - ψ^{'} {prox (y_{i} ρ) (u)} | \leq L (n) | prox (x_{i} ρ) (u) - prox (y_{i} ρ) (u) | .

Recall that according to Lemma A11,

\frac{\partial}{\partial x} prox (x ρ) (u) = - \frac{ψ {prox (x ρ) (u)}}{1 + x ψ^{'} {prox (x ρ) (u)}} .

Hence we have

\sup_{x} | \frac{\partial}{\partial x} prox (x ρ) (u) | \leq {∥ ψ ∥}_{\infty} .

We conclude that

\forall u, | ψ^{'} {prox (x_{i} ρ) (u)} - ψ^{'} {prox (y_{i} ρ) (u)} | \leq {L (n) {∥ ψ ∥}_{\infty} | x_{i} - y_{i} |} \land {2 ∥ ψ ∥}_{\infty} .

We therefore have, for

x \lor y \leq B

and

x, y \geq 0

,

\forall u, \sup_{y : | x - y | \leq η} | h_{u}^{(i)} (x) - h_{u}^{(i)} (y) | \leq λ_{i}^{2} η ψ^{'} {prox (x λ_{i}^{2} ρ) (u)} + B λ_{i}^{4} L (n) {∥ ψ ∥}_{\infty} η .

Thus, for

x, y \geq 0

,

\forall u, \sup_{(x, y) : | x - y | \leq η, x \lor y \leq B} | h_{u}^{(i)} (x) - h_{u}^{(i)} (y) | \leq λ_{i}^{2} η ∥ ψ^{'} ∥_{\infty} + B λ_{i}^{4} L (n) {∥ ψ ∥}_{\infty} η .

Since the right-hand side does not depend on u, we have

\sup_{u} \sup_{(x, y) : | x - y | \leq η, x \lor y \leq B} | h_{u}^{(i)} (x) - h_{u}^{(i)} (y) | \leq λ_{i}^{2} η ∥ ψ^{'} ∥_{\infty} + B λ_{i}^{4} L (n) {∥ ψ ∥}_{\infty} η .

Naturally,

g_{n} (x)

can be written as

g_{n} (x) = \frac{1}{n} \sum_{i = 1}^{n} h_{{\tilde{r}}_{i, (i)}}^{(i)} (x) .

Therefore, for any

x, y \geq 0

,

| g_{n} (x) - g_{n} (y) | \leq \frac{1}{n} \sum_{i = 1}^{n} | h_{{\tilde{r}}_{i, (i)}}^{(i)} (x) - h_{{\tilde{r}}_{i, (i)}}^{(i)} (y) | .

The bound we have obtained above on

\sup_{u} | h_{u}^{(i)} (x) - h_{u}^{(i)} (y) |

when x and y are sufficiently close to one another can now be used. This shows that for x given, if

x, y \geq 0

,

| x - y | \leq η

, and

x \lor y \leq B

, we have

\sup_{(x, y) : | x - y | \leq η, 0 \leq x \leq B, 0 \leq y \leq B} | g_{n} (x) - g_{n} (y) | \leq η \frac{1}{n} \sum_{i = 1}^{n} (λ_{i}^{2} ∥ ψ^{'} ∥_{\infty} + B L (n) λ_{i}^{4} {∥ ψ ∥}_{\infty}) .

Under Assumptions P2 and F3–F4, all the terms on the right-hand side are bounded in

L_{1}

. We can now take expectations and obtain the result in

L_{1}

. □

Lemma A14.

Let us call

G_{n} (x) = E (g_{n} (x))

. Let

B > 0

be given. For any given

x_{0} \leq B

,

g_{n} (x_{0}) - G_{n} (x_{0}) = o_{L_{2}} (1) .

Under Assumptions O1–O7, P1–P4 and F1–F5, we have

E^{*} (\sup_{0 \leq x \leq B} | g_{n} (x) - G_{n} (x) |) \to 0 .

Proof.

Under Assumptions F1 and F5, we can divide the index set

{1, \dots, n}

into a finite number K of subsets

A_{1}, \dots, A_{K}

, in which

{(x_{i}, ϵ_{i})}_{i \in A_{j}}

play a symmetric role. Hence,

var (g_{n} (x_{0}))

can be expressed as a sum of variances and covariances of finitely many functions of finitely many random variables

(λ_{i}, {\tilde{r}}_{i, (i)})

: for those random variables, we just need to pick a representative in each subset

{A_{j}}_{j = 1}^{K}

.

Since

ψ^{'}

is Lipschitz,

g_{n}

is an average of bounded continuous functions of the random variables of interest to us.

Asymptotic pairwise independence of

(λ_{i}, {\tilde{r}}_{i, (i)})

’s implies that

var (g_{n} (x_{0})) = o (1),

which gives the first result.

Now we pick

ϵ > 0

. By the stochastic equicontinuity of

g_{n}

and the bound in (A26), we can find

x_{1}, \dots, x_{K}

, independent of n, such that for all

x \in [0, B]

, there exists l such that when n is large enough,

E (| g_{n} (x) - g_{n} (x_{l}) |) \leq ϵ .

Notice that

| g_{n} (x) - G_{n} (x) | \leq | g_{n} (x) - g_{n} (x_{l}) | + | g_{n} (x_{l}) - G_{n} (x_{l}) | + | G_{n} (x_{l}) - G_{n} (x) | .

We immediately get

E^{*} (\sup_{0 \leq x \leq B} | g_{n} (x) - G_{n} (x) |) \leq 2 ϵ + E (\sup_{1 \leq l \leq K} | g_{n} (x_{l}) - G_{n} (x_{l}) |) .

Because K is finite, the fact that for all l,

g_{n} (x_{l}) - G_{n} (x_{l}) = o_{L_{2}} (1)

implies that

\sup_{1 \leq l \leq K} | g_{n} (x_{l}) - G_{n} (x_{l}) | = o_{L_{2}} (1)

. In particular, if n is sufficiently large, we have

E (\sup_{1 \leq l \leq K} | g_{n} (x_{l}) - G_{n} (x_{l}) |) \leq ϵ .

This gives the result. □

Lemma A15.

Assume O1–O7, P1–P4 and F1–F5. Recall that

c_{τ} = n^{- 1} tr {{(S + τ I_{p})}^{- 1}}

. Call as before

g_{n} (x) = \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{1 + x λ_{i}^{2} ψ^{'} {prox (x λ_{i}^{2} ρ) ({\tilde{r}}_{i, (i)})}} .

Then

c_{τ}

is a near solution of

\frac{p}{n} - τ x - 1 + g_{n} (x) = 0,

i.e.,

p / n - τ c_{τ} - 1 + g_{n} (c_{τ}) = o_{L_{k}} (1)

, when

3 α - 1 / 2 < 0

.

Asymptotically, near solutions of

δ_{n} (x) ≜ \frac{p}{n} - τ x - 1 + g_{n} (x) = 0,

are close to solutions of

Δ_{n} (x) = \frac{p}{n} - τ x - 1 + E {g_{n} (x)} = 0 .

More precisely, call

T_{n, ϵ} = {x : | Δ_{n} (x) | \leq ϵ}

. We note that

T_{n, ϵ} \subseteq (0, p / (n τ) + ϵ / τ)

. For any given ϵ, as

n \to \infty

, near solutions of

δ_{n} (x) = 0

belong to

T_{n, ϵ}

with high probability.

Our assumptions concerning the possible distributions of

ϵ_{i}

’s, specifically Assumption F1, imply that as

n \to \infty

, there is a unique solution to

Δ_{n} (x) = 0

.

Hence

c_{τ}

is asymptotically deterministic.
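
The empirical equation in Lemma A15 can be solved numerically once the inputs are fixed. The sketch below is an illustration under assumptions that are not part of the paper: the loss is the pseudo-Huber loss ρ(y) = sqrt(1 + y^2) - 1, all λ_i equal 1, and the leave-one-out residuals are replaced by simulated Gaussian stand-ins. Since δ_n(0) = p/n > 0 and δ_n(p/(nτ)) = g_n(p/(nτ)) - 1 ≤ 0, a sign-change bisection on [0, p/(nτ)] locates a root. All names are hypothetical.

```python
import math
import random

def psi(y):
    """psi = rho' for the assumed pseudo-Huber loss."""
    return y / math.sqrt(1.0 + y * y)

def dpsi(y):
    """psi' = rho''."""
    return (1.0 + y * y) ** -1.5

def prox(c, x, iters=80):
    """prox(c rho)(x): the unique y with y + c * psi(y) = x, by bisection."""
    lo, hi = min(0.0, x), max(0.0, x)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + c * psi(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def g_n(x, lam, r):
    """(1/n) sum_i 1 / (1 + x lam_i^2 psi'(prox(x lam_i^2 rho)(r_i)))."""
    total = 0.0
    for li, ri in zip(lam, r):
        c = x * li * li
        total += 1.0 / (1.0 + c * dpsi(prox(c, ri)))
    return total / len(r)

def delta_n(x, ratio, tau, lam, r):
    """delta_n(x) = p/n - tau x - 1 + g_n(x), with ratio standing for p/n."""
    return ratio - tau * x - 1.0 + g_n(x, lam, r)

def solve_near_root(ratio, tau, lam, r, iters=50):
    """Bisection on [0, ratio / tau], keeping delta_n(lo) > 0 >= delta_n(hi)."""
    lo, hi = 0.0, ratio / tau
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if delta_n(mid, ratio, tau, lam, r) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(1)
n, ratio, tau = 300, 0.5, 1.0
lam = [1.0] * n                                   # assumed: all lambda_i equal to 1
r_tilde = [rng.gauss(0.0, 1.5) for _ in range(n)] # simulated stand-ins for the residuals
x_star = solve_near_root(ratio, tau, lam, r_tilde)
print(x_star, delta_n(x_star, ratio, tau, lam, r_tilde))
```

Bisection only needs the sign change at the endpoints, so the sketch does not rely on monotonicity of δ_n; uniqueness of the root of the limiting equation is a separate statement of the lemma.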

Proof.

Note that

g_{n} (x) \leq 1

.

Let

δ_{n} (x)

be the function

δ_{n} (x) = \frac{p}{n} - τ x - 1 + g_{n} (x),

and

Δ_{n} (x) = E {δ_{n} (x)}

. Call

x_{n}

a solution of

δ_{n} (x_{n}) = 0

and

x_{n, 0}

a solution of

Δ_{n} (x_{n, 0}) = 0

.

Since

0 \leq g_{n} (x) \leq 1

, we have

x_{n} \leq p / (n τ)

, for otherwise,

δ_{n} (x) < 0

. The same argument shows that if

x > (p / n + ϵ) / τ

, then

δ_{n} (x) < - ϵ

and

x \notin T_{n, ϵ}

. Similarly, near solutions of

δ_{n} (x) = 0

must be less than or equal to

(p / n + ϵ) / τ

.
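
For completeness, the localization step can be written out in one line. It uses only the bounds 0 ≤ g_n ≤ 1, and hence 0 ≤ E{g_n} ≤ 1, so the same inequality holds for both functions:

```latex
\delta_{n}(x) = \frac{p}{n} - \tau x - 1 + g_{n}(x) \;\leq\; \frac{p}{n} - \tau x ,
\qquad\text{so}\qquad
x > \frac{p/n + \epsilon}{\tau} \;\Longrightarrow\; \delta_{n}(x) < -\epsilon ,
```

and likewise with δ_n replaced by Δ_n.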

Proof of the fact that $c_{τ}$ is such that $δ_{n} (c_{τ}) = o (1)$

Recall that in the notation of Lemma A8, we have

\frac{p - 1}{n} - τ c_{τ, p} = \frac{1}{n} tr (I_{n} - M) .

According to (A20),

\frac{1}{n} tr (I_{n} - M) = 1 - \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{1 + ψ^{'} (r_{i, [p]}) \frac{1}{n} v_{i}^{⊤} {(S_{p} (i) + τ I_{p})}^{- 1} v_{i}} .

According to Lemmas A8–A10, we have

\sup_{i} | \frac{1}{n} v_{i}^{⊤} {(S_{p} (i) + τ I_{p})}^{- 1} v_{i} - λ_{i}^{2} c_{τ, p} | = O_{L_{k}} (\frac{polyLog (n)}{n^{1 / 2 - 2 α}}) .

When

x \geq 0

and

y \geq 0

,

| 1 / (1 + x) - 1 / (1 + y) | \leq | x - y | \land 1

. Hence, we have

\begin{matrix} | \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{1 + ψ^{'} (r_{i, [p]}) \frac{1}{n} v_{i}^{⊤} {(S_{p} (i) + τ I_{p})}^{- 1} v_{i}} - \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{1 + ψ^{'} (r_{i, [p]}) λ_{i}^{2} c_{τ, p}} | \\ \leq \sup_{1 \leq i \leq n} | \frac{1}{n} v_{i}^{⊤} {(S_{p} (i) + τ I_{p})}^{- 1} v_{i} - λ_{i}^{2} c_{τ, p} | {∥ ψ^{'} ∥}_{\infty} . \end{matrix}
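The elementary bound invoked in this step can be checked directly. The short derivation below is a sketch, assuming $ψ^{'} \geq 0$ (as holds for a convex ρ); the symbols $a_{i}$ and $b_{i}$ are shorthand introduced here, not the paper's notation, for $\frac{1}{n} v_{i}^{⊤} {(S_{p} (i) + τ I_{p})}^{- 1} v_{i}$ and $λ_{i}^{2} c_{τ, p}$, respectively.

```latex
% For x, y >= 0:
\left| \frac{1}{1+x} - \frac{1}{1+y} \right|
  = \frac{|x - y|}{(1+x)(1+y)}
  \le |x - y| ,
\qquad
\left| \frac{1}{1+x} - \frac{1}{1+y} \right| \le 1
  \quad \text{since both fractions lie in } (0, 1] .

% Applied with x = \psi'(r_{i,[p]}) a_i and y = \psi'(r_{i,[p]}) b_i:
|x - y| = \psi'(r_{i,[p]}) \, |a_i - b_i|
  \le \|\psi'\|_{\infty} \, \sup_{1 \le i \le n} |a_i - b_i| .
```

Averaging over i preserves the bound, which gives the displayed inequality.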

We conclude that

p / n - τ c_{τ, p} - 1 + \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{1 + λ_{i}^{2} c_{τ, p} ψ^{'} (r_{i, [p]})} = O_{L_{k}} (n^{- 1 / 2 + 2 α} polyLog (n)) .

Exactly the same computations can be performed for

c_{τ}

. We have established that

p / n - τ c_{τ} - 1 + \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{1 + λ_{i}^{2} c_{τ} ψ^{'} (R_{i})} = O_{L_{k}} (n^{- 1 / 2 + 2 α} polyLog (n)) .

Now we have seen in Theorem A1 that

\sup_{i} | R_{i} - prox (c_{i} ρ) ({\tilde{r}}_{i, (i)}) | = O_{L_{k}} (n^{- 1 / 2 + α} polyLog (n)) .

By the assumptions on

ψ^{'}

, this implies that

\sup_{i} | ψ^{'} (R_{i}) - ψ^{'} {prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})} | = O_{L_{k}} (n^{- 1 / 2 + α} polyLog (n)) .

We have furthermore noted that

\sup_{i} | c_{i} - λ_{i}^{2} c_{τ} | = O_{L_{k}} (n^{- 1 / 2 + 2 α} polyLog (n))

in Corollary A1. Using Lemma A11, we can write

| prox (c_{i} ρ) ({\tilde{r}}_{i, (i)}) - prox (λ_{i}^{2} c_{τ} ρ) ({\tilde{r}}_{i, (i)}) | \leq {∥ ψ ∥}_{\infty} | c_{i} - λ_{i}^{2} c_{τ} |

and hence

| ψ^{'} {prox (c_{i} ρ) ({\tilde{r}}_{i, (i)})} - ψ^{'} {prox (λ_{i}^{2} c_{τ} ρ) ({\tilde{r}}_{i, (i)})} | = O_{L_{k}} ({∥ ψ ∥}_{\infty} n^{- 1 / 2 + 3 α} polyLog (n)) .

Gathering all these results, we have

| ψ^{'} (R_{i}) - ψ^{'} (prox (λ_{i}^{2} c_{τ} ρ) ({\tilde{r}}_{i, (i)})) | = O_{L_{k}} \{({∥ ψ ∥}_{\infty} + 1) n^{- 1 / 2 + 3 α} polyLog (n)\} .

So we have shown that

δ_{n} (c_{τ}) = O_{L_{k}} (n^{- 1 / 2 + 3 α} polyLog (n))

.

Final details

By Lemma A14, we have

δ_{n} (x) - Δ_{n} (x) = o_{p} (1)

for any given x. In our case, using the notation of this lemma,

B = p / (n τ) + η / τ

, for

η > 0

given.

This implies that for any given

ϵ > 0

,

\sup_{x \in (0, p / (n τ) + η / τ)} | δ_{n} (x) - Δ_{n} (x) | < ϵ,

with high probability when n is large enough. Therefore, for any

ϵ > 0

, if

x_{n}

is a solution to

δ_{n} (x) = 0

,

| Δ_{n} (x_{n}) | < ϵ with high probability .

This means that

x_{n} \in T_{n, ϵ}

with high probability. The same reasoning applies to near solutions of

δ_{n} (x) = 0

, which must belong to

T_{n, ϵ}

as

n \to \infty

with high probability for any given

ϵ > 0

. Note that

T_{n, ϵ}

is compact because it is bounded and closed, using the fact that

G_{n} = E (g_{n})

is continuous.

If

T_{n, 0}

were reduced to a single point, we would have established the asymptotically deterministic character of

c_{τ}

.

Given our work on the asymptotic behavior of

{\tilde{r}}_{i, (i)}

and our assumptions on

ϵ_{i}

’s, we see that Lemma 3.39 from El Karoui [21] applies to

\lim_{n \to \infty} Δ_{n} (x)

under Assumption F1. Therefore,

T_{n, 0}

is reduced to a single point as

n \to \infty

and

c_{τ}

is asymptotically deterministic. □
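The root-finding problem in Lemma A15 can be illustrated numerically. The sketch below is not the paper's procedure: it assumes the plain Huber loss (rather than the smoothed Huber loss used in the simulations), takes $λ_{i} = 1$, uses standard normal draws as a stand-in for ${\tilde{r}}_{i, (i)}$, and sets $p / n = 1$ and $τ = 1$. The function names are hypothetical. It relies only on the bracketing shown in the proof, namely that $δ_{n} (0) > 0$ and $δ_{n} (p / (n τ)) \leq 0$.

```python
import numpy as np

def g_n(x, r, k=1.35):
    """Empirical g_n(x) for the plain Huber loss with lambda_i = 1 (assumption).

    For Huber, psi'(u) = 1 if |u| < k and 0 otherwise, and prox(x*rho)(t)
    falls in the quadratic zone exactly when |t| <= k * (1 + x).
    """
    in_quadratic_zone = np.abs(r) <= k * (1.0 + x)
    return np.mean(np.where(in_quadratic_zone, 1.0 / (1.0 + x), 1.0))

def delta_n(x, r, p_over_n, tau, k=1.35):
    """delta_n(x) = p/n - tau*x - 1 + g_n(x)."""
    return p_over_n - tau * x - 1.0 + g_n(x, r, k)

def near_root(r, p_over_n, tau, k=1.35, iters=60):
    """Bisection on [0, p/(n*tau)], the interval on which delta_n changes sign."""
    lo, hi = 0.0, p_over_n / tau
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if delta_n(mid, r, p_over_n, tau, k) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
r = rng.standard_normal(5000)  # stand-in for the residuals (assumption)
x_hat = near_root(r, p_over_n=1.0, tau=1.0)
```

Because the empirical $g_{n}$ for the plain Huber loss has small jumps, the bisection returns a near solution rather than an exact root, which matches the notion used in the lemma.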

Proof of Theorem 1.

Notice that

\frac{\partial}{\partial t} prox (c ρ) (t) = {prox}^{'} (c ρ) (t) = \frac{1}{1 + c ψ^{'} (prox (c ρ) (t))} .

Therefore,

Δ_{n}

can be written as

Δ_{n} (x) = \frac{p}{n} - τ x - 1 + \frac{1}{n} \sum_{i = 1}^{n} E \{{prox}^{'} (x λ_{i}^{2} ρ) ({\tilde{r}}_{i, (i)})\} .

Hence, the limiting root of

Δ_{n} (x) = 0

satisfies the first fixed-point equation in Theorem 1. Since Lemma A15 shows that

c_{τ}

is asymptotically arbitrarily close to this root, the first equation follows. The second equation comes from (A24). Theorem 1 is proved, with

c_{ρ} (κ)

being the limit of

c_{τ}

. □
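The derivative identity for the proximal map used at the start of this proof can be checked numerically. The sketch below assumes the pseudo-Huber loss of Table 4 with $δ = 1.35$ as a convenient smooth example; the proximal map is computed by bisection on $u + c ψ (u) = t$, and all function names are hypothetical.

```python
import math

DELTA = 1.35  # pseudo-Huber scale, as in Table 4

def psi(u):
    """Derivative of the pseudo-Huber loss rho_P(u; DELTA)."""
    return u / math.sqrt(1.0 + (u / DELTA) ** 2)

def psi_prime(u):
    """Second derivative of the pseudo-Huber loss."""
    return (1.0 + (u / DELTA) ** 2) ** -1.5

def prox(c, t, iters=200):
    """Solve u + c * psi(u) = t by bisection; the left side is increasing in u."""
    lo, hi = -abs(t), abs(t)  # the solution has the sign of t and |u| <= |t|
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + c * psi(mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def prox_derivative_formula(c, t):
    """Right-hand side of the identity: 1 / (1 + c * psi'(prox(c*rho)(t)))."""
    return 1.0 / (1.0 + c * psi_prime(prox(c, t)))

def prox_derivative_numeric(c, t, h=1e-5):
    """Central finite-difference approximation of d/dt prox(c*rho)(t)."""
    return (prox(c, t + h) - prox(c, t - h)) / (2.0 * h)
```

The two functions should agree up to finite-difference error for any $c \geq 0$ and real t, which is what implicit differentiation of $u + c ψ (u) = t$ predicts.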

References

1. Chen, M. Analysis on transfer learning models and applications in natural language processing. Highlights Sci. Eng. Technol. 2022, 16, 446–452.
2. Ma, Y.; Chen, S.; Ermon, S.; Lobell, D.B. Transfer learning in environmental remote sensing. Remote Sens. Environ. 2024, 301, 113924.
3. Gopalakrishnan, K.; Khaitan, S.K.; Choudhary, A.; Agrawal, A. Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Constr. Build. Mater. 2017, 157, 322–330.
4. Liu, D.; Luo, J.; Johnson, B.; Chew, H.; Blais, J.; Deik, A.; Paul, F.; Hanson, R.L.; Crandall, J.P.; Sun, Y.; et al. Modeling blood metabolite homeostatic levels reduces sample heterogeneity across cohorts. Proc. Natl. Acad. Sci. USA 2024, 121, e2307430121.
5. Chen, A.; Owen, A.B.; Shi, M. Data enriched linear regression. Electron. J. Stat. 2015, 9, 1078–1112.
6. Tripuraneni, N.; Jin, C.; Jordan, M. Provable meta-learning of linear representations. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 10434–10443.
7. Bastani, H. Predicting with proxies: Transfer learning in high dimension. Manag. Sci. 2021, 67, 2964–2984.
8. Li, S.; Cai, T.T.; Li, H. Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. J. R. Stat. Soc. Ser. B Stat. Methodol. 2022, 84, 149–173.
9. Tian, Y.; Feng, Y. Transfer learning under high-dimensional generalized linear models. J. Am. Stat. Assoc. 2023, 118, 2684–2697.
10. Li, S.; Zhang, L.; Cai, T.T.; Li, H. Estimation and inference for high-dimensional generalized linear models with knowledge transfer. J. Am. Stat. Assoc. 2024, 119, 1274–1285.
11. Cai, T.T.; Wei, H. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. Ann. Stat. 2021, 49, 100–128.
12. Cai, T.T.; Pu, H. Transfer learning for nonparametric regression: Non-asymptotic minimax analysis and adaptive procedure. arXiv 2024, arXiv:2401.12272.
13. Fan, J.; Gao, C.; Klusowski, J.M. Robust transfer learning with unreliable source data. arXiv 2023, arXiv:2310.04606.
14. Huber, P.J. Robust regression: Asymptotics, conjectures and Monte Carlo. Ann. Stat. 1973, 1, 799–821.
15. Portnoy, S. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency. Ann. Stat. 1984, 12, 1298–1309.
16. Portnoy, S. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. II. Normal approximation. Ann. Stat. 1985, 13, 1403–1417.
17. Portnoy, S. Asymptotic behavior of the empiric distribution of M-estimated residuals from a regression model with many parameters. Ann. Stat. 1986, 14, 1152–1170.
18. Portnoy, S. A central limit theorem applicable to robust regression estimators. J. Multivar. Anal. 1987, 22, 24–50.
19. Mammen, E. Asymptotics with increasing dimension for robust regression with applications to the bootstrap. Ann. Stat. 1989, 17, 382–400.
20. El Karoui, N.; Bean, D.; Bickel, P.J.; Lim, C.; Yu, B. On robust regression with high-dimensional predictors. Proc. Natl. Acad. Sci. USA 2013, 110, 14557–14562.
21. El Karoui, N. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Relat. Fields 2018, 170, 95–175.
22. Charbonnier, P.; Blanc-Feraud, L.; Aubert, G.; Barlaud, M. Deterministic edge-preserving regularization in computed imaging. IEEE Trans. Image Process. 1997, 6, 298–311.
23. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003.
24. Yohai, V.J.; Maronna, R.A. Asymptotic behavior of M-estimators for the linear model. Ann. Stat. 1979, 7, 258–268.
25. El Karoui, N. Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: Rigorous results. arXiv 2013, arXiv:1311.2445.
26. El Karoui, N. Concentration of measure and spectra of random matrices: Applications to correlation matrices, elliptical distributions and beyond. Ann. Appl. Probab. 2009, 19, 2362–2405.
27. Ledoux, M. The Concentration of Measure Phenomenon; American Mathematical Society: Providence, RI, USA, 2001; Volume 89.
28. Huber, P.J.; Ronchetti, E.M. Robust Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2009.
29. Karlin, S. Total Positivity; Stanford University Press: Stanford, CA, USA, 1968.
30. Ibragimov, I.A. On the composition of unimodal distributions. Teor. Veroyatnost. Primen. 1956, 1, 283–288.
31. Dharmadhikari, S.; Joag-Dev, K. Unimodality, Convexity, and Applications; Probability and Mathematical Statistics; Academic Press, Inc.: Boston, MA, USA, 1988.
32. Moreau, J.J. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. Fr. 1965, 93, 273–299.
33. Beck, A.; Teboulle, M. Gradient-based algorithms with applications to signal-recovery problems. In Convex Optimization in Signal Processing and Communications; Palomar, D.P., Eldar, Y.C., Eds.; Cambridge University Press: Cambridge, UK, 2010; pp. 42–88.
34. Bean, D.; Bickel, P.J.; El Karoui, N.; Yu, B. Optimal M-estimation in high-dimensional regression. Proc. Natl. Acad. Sci. USA 2013, 110, 14563–14568.
35. Efron, B.; Stein, C. The jackknife estimate of variance. Ann. Stat. 1981, 9, 586–596.
36. Bhatia, R. Matrix Analysis; Springer: New York, NY, USA, 1997.
37. Johnson, C.R.; Horn, R.A. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985.
38. Stroock, D.W. Probability Theory: An Analytic View; Cambridge University Press: Cambridge, UK, 2010.
39. van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 2000.

Figure 1. Boxplot of

∥ \hat{β} - β_{0} ∥^{2}

over 1000 simulations. The red point in each boxplot represents the theoretical value

r_{ρ}^{2}

from Theorem 1. Panels from top to bottom are for

κ = 1, 4

, respectively, while panels from left to right are for cases

I, II, III

, respectively.

Figure 2. Theoretical curves of

r_{ρ}

as a function of

∥ β_{0} - \hat{w} ∥

for five values of

τ

under cases

I

–

III

, obtained by numerically solving Corollary 1. The three panels correspond to cases

I

,

II

, and

III

, respectively.

Figure 3. Boxplots of relative estimation errors (log scale) across 500 replications for varying

∥ β_{0} - w_{0} ∥

under cases

I

–

III

, with

p = 400

and

n = 400

. Case

I

includes all six methods, while cases

II

and

III

include only the four ridge-type procedures. Panels from top to bottom are for cases

I, II, III

, respectively.

Figure 4. Robustness check under AR(1) correlated predictors. Boxplots of relative estimation errors (log scale) across 500 replications for varying

∥ β_{0} - w_{0} ∥

under cases

I

–

III

with

x_{i} \sim N (0, Σ_{ρ})

, where

Σ_{ρ, j k} = ρ^{| j - k |}

and

ρ = 0.6

. Panels from top to bottom are for cases

I

,

II

, and

III

, respectively. The four ridge-type procedures, Single-RR, Trans-RR, Trans-RR-Ada, and Pooled-RR are shown.
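The AR(1) design described in the caption of Figure 4 can be generated as follows. This is a minimal sketch, assuming a Cholesky-based sampler and the dimensions $p = n = 400$ borrowed from the caption of Figure 3 for illustration; the function names are hypothetical and the paper does not state how its predictors were drawn.

```python
import numpy as np

def ar1_covariance(p, rho):
    """Covariance matrix with entries rho ** |j - k|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def sample_ar1_predictors(n, p, rho, rng):
    """Draw n rows x_i ~ N(0, Sigma_rho) using a Cholesky factor of Sigma_rho."""
    chol = np.linalg.cholesky(ar1_covariance(p, rho))
    # If z ~ N(0, I_p), then chol @ z has covariance chol @ chol.T = Sigma_rho.
    return rng.standard_normal((n, p)) @ chol.T

rng = np.random.default_rng(0)
X = sample_ar1_predictors(n=400, p=400, rho=0.6, rng=rng)
```

With ρ = 0.6 the matrix is positive definite, so the Cholesky factorization exists.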

Figure 5. Distribution of RMSE across the 20 repeated random splits for each of the six methods, in the two transfer directions. Boxes show the inter-quartile range, whiskers extend to 1.5 × IQR, and circles mark splits beyond that.

Table 1. Mean and SD (in parentheses) of

∥ \hat{β} - β_{0} ∥^{2}

(denoted as

{\hat{r}}^{2}

) and the corresponding

r_{ρ}^{2}

over 1000 simulations. Rows are indexed by

(p, n)

, with

n_{1} = 2 p

when

κ = 1

and

n_{1} = p / 2

when

κ = 4

.

( $p, n$ )	Case $I$		Case $II$		Case $III$
	${\hat{r}}^{2}$	$r_{ρ}^{2}$	${\hat{r}}^{2}$	$r_{ρ}^{2}$	${\hat{r}}^{2}$	$r_{ρ}^{2}$
$κ = 1$
(200, 200)	0.3653 (0.0318)	0.3649	0.7163 (0.0738)	0.7204	0.4683 (0.0478)	0.4685
(400, 400)	0.3472 (0.0208)	0.3477	0.6970 (0.0549)	0.6923	0.5076 (0.0368)	0.5065
(800, 800)	0.3603 (0.0151)	0.3598	0.7206 (0.0374)	0.7212	0.4989 (0.0238)	0.4986
$κ = 4$
(200, 50)	1.3415 (0.0734)	1.3419	2.8222 (0.3311)	2.8216	2.2410 (0.2407)	2.2415
(400, 100)	1.3565 (0.0531)	1.3544	2.4896 (0.2363)	2.4996	1.8087 (0.1590)	1.8102
(800, 200)	1.5261 (0.0427)	1.5247	2.7226 (0.1693)	2.7219	2.0342 (0.1153)	2.0301

Table 2. Sensitivity of relative estimation error to the smoothed Huber parameters

(δ, η)

at the transition discrepancy

h = 1

. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over

M = 500

replications. The three blocks correspond to the three error distributions of Section 4.3.

Case I (Gaussian Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.653/0.655/0.651/0.883	0.654/0.656/0.652/0.883	0.656/0.658/0.654/0.883
$1.35$	0.644/0.647/0.642/0.882	0.645/0.647/0.642/0.882	0.646/0.648/0.643/0.882
$2.00$	0.641/0.642/0.637/0.885	0.640/0.642/0.637/0.885	0.641/0.643/0.637/0.885
Case II (Cauchy Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.884/0.893/0.884/0.952	0.884/0.892/0.884/0.952	0.884/0.893/0.884/0.952
$1.35$	0.883/0.893/0.884/0.951	0.883/0.892/0.883/0.951	0.883/0.892/0.884/0.951
$2.00$	0.890/0.899/0.891/0.951	0.889/0.899/0.890/0.951	0.889/0.899/0.890/0.951
Case III (Mixture Errors)
$δ ∖ η$	$0.05$	$0.10$	$0.20$
$1.00$	0.779/0.786/0.779/0.921	0.780/0.787/0.779/0.921	0.780/0.787/0.779/0.921
$1.35$	0.781/0.788/0.781/0.920	0.781/0.787/0.780/0.920	0.781/0.787/0.780/0.921
$2.00$	0.794/0.801/0.793/0.924	0.793/0.800/0.792/0.924	0.792/0.800/0.791/0.923

Table 3. Sensitivity of relative estimation error to the cross-validation criterion. All settings are as in Figure 3, except that every cross-validation loss (used to select

τ_{1}

,

τ

,

τ_{st}

,

τ_{p}

, and

θ

) is changed from MAE to MSE. Each entry reports the default (MAE-CV) and the MSE-CV mean estimation error, in the format “default/MSE-CV”, over

M = 500

replications.

Case	h	Single-RR	Trans-RR	Trans-RR-Ada	Pooled-RR
$I$	$0.135$	0.645/0.644	0.550/0.546	0.550/0.546	0.600/0.601
	$0.223$	0.645/0.644	0.565/0.561	0.566/0.561	0.628/0.629
	$0.368$	0.645/0.644	0.588/0.585	0.589/0.585	0.678/0.681
	$0.607$	0.645/0.644	0.616/0.614	0.617/0.615	0.764/0.770
	$1.000$	0.645/0.644	0.647/0.645	0.642/0.641	0.882/0.888
	$1.649$	0.645/0.644	0.791/0.793	0.645/0.644	0.990/0.991
	$2.718$	0.645/0.644	1.639/1.646	0.645/0.644	1.261/1.479
$II$	$0.135$	0.883/1.772	0.816/2.582	0.819/1.895	0.811/1.324
	$0.223$	0.883/1.772	0.826/2.597	0.829/1.904	0.829/1.326
	$0.368$	0.883/1.772	0.840/2.618	0.843/1.915	0.855/1.353
	$0.607$	0.883/1.772	0.860/2.640	0.862/1.920	0.898/1.401
	$1.000$	0.883/1.772	0.892/2.745	0.883/1.963	0.951/1.497
	$1.649$	0.883/1.772	1.021/2.994	0.887/2.020	1.005/1.679
	$2.718$	0.883/1.772	1.898/3.827	0.883/2.108	1.123/2.223
$III$	$0.135$	0.781/1.108	0.693/1.319	0.692/1.094	0.706/0.944
	$0.223$	0.781/1.108	0.707/1.334	0.707/1.102	0.729/0.966
	$0.368$	0.781/1.108	0.727/1.358	0.727/1.114	0.770/0.993
	$0.607$	0.781/1.108	0.754/1.395	0.754/1.138	0.834/1.054
	$1.000$	0.781/1.108	0.787/1.473	0.780/1.161	0.920/1.155
	$1.649$	0.781/1.108	0.941/1.749	0.783/1.194	0.998/1.368
	$2.718$	0.781/1.108	1.985/2.646	0.781/1.274	1.181/1.961

Table 4. Sensitivity of relative estimation error to the choice of robust loss. All settings are as in Figure 3, except that the smoothed Huber loss is replaced by the pseudo-Huber loss

ρ_{P} (t; δ) = δ^{2} (\sqrt{1 + {(t / δ)}^{2}} - 1)

with

δ = 1.35

(the smoothing parameter

η

is no longer needed). Each entry reports the default (smoothed Huber) and the pseudo-Huber mean estimation error, in the format “default/pseudo-Huber”, over

M = 500

replications.

Case	h	Single-RR	Trans-RR	Trans-RR-Ada	Pooled-RR
$I$	$0.135$	0.645/0.645	0.550/0.545	0.550/0.546	0.600/0.595
	$0.223$	0.645/0.645	0.565/0.561	0.566/0.562	0.628/0.625
	$0.368$	0.645/0.645	0.588/0.585	0.589/0.586	0.678/0.675
	$0.607$	0.645/0.645	0.616/0.615	0.617/0.616	0.764/0.763
	$1.000$	0.645/0.645	0.647/0.647	0.642/0.643	0.882/0.883
	$1.649$	0.645/0.645	0.791/0.797	0.645/0.645	0.990/0.995
	$2.718$	0.645/0.645	1.639/1.642	0.645/0.645	1.261/1.299
$II$	$0.135$	0.883/0.891	0.816/0.828	0.819/0.830	0.811/0.823
	$0.223$	0.883/0.891	0.826/0.836	0.829/0.839	0.829/0.839
	$0.368$	0.883/0.891	0.840/0.850	0.843/0.852	0.855/0.865
	$0.607$	0.883/0.891	0.860/0.869	0.862/0.872	0.898/0.905
	$1.000$	0.883/0.891	0.892/0.900	0.883/0.891	0.951/0.954
	$1.649$	0.883/0.891	1.021/1.023	0.887/0.895	1.005/1.007
	$2.718$	0.883/0.891	1.898/1.871	0.883/0.891	1.123/1.129
$III$	$0.135$	0.781/0.793	0.693/0.703	0.692/0.703	0.706/0.713
	$0.223$	0.781/0.793	0.707/0.717	0.707/0.718	0.729/0.738
	$0.368$	0.781/0.793	0.727/0.740	0.727/0.740	0.770/0.777
	$0.607$	0.781/0.793	0.754/0.765	0.754/0.765	0.834/0.842
	$1.000$	0.781/0.793	0.787/0.798	0.780/0.791	0.920/0.925
	$1.649$	0.781/0.793	0.941/0.950	0.783/0.794	0.998/1.002
	$2.718$	0.781/0.793	1.985/1.972	0.781/0.793	1.181/1.199
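For reference, the derivatives of the pseudo-Huber loss that enter the estimating equations can be written out explicitly. This is standard calculus applied to the formula in the caption of Table 4; the subscript P on ψ is notation introduced here for illustration.

```latex
\rho_P(t;\delta) = \delta^{2}\left(\sqrt{1 + (t/\delta)^{2}} - 1\right),
\qquad
\psi_P(t) = \rho_P'(t) = \frac{t}{\sqrt{1 + (t/\delta)^{2}}},
\qquad
\psi_P'(t) = \left(1 + (t/\delta)^{2}\right)^{-3/2} .

% Consequences:
|\psi_P(t)| < \delta, \qquad 0 < \psi_P'(t) \le 1,
\qquad
\rho_P(t;\delta) \approx \tfrac{1}{2} t^{2} \ \ (|t| \ll \delta),
\qquad
\rho_P(t;\delta) \approx \delta |t| \ \ (|t| \gg \delta) .
```

The loss is therefore smooth everywhere, quadratic near the origin, and linear in the tails, so no separate smoothing parameter is required.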

Table 5. Sensitivity of relative estimation error to the ridge-penalty cross-validation grid. The default grid contains 9 values from

1 / 9

to 9 on a geometric scale. The wide grid extends this to 13 values from

1 / 27

to 27 on the same geometric scale, and contains the default grid as a strict subset. All other settings are as in Figure 3, with

M = 500

replications per cell. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error.

Case	h	Default Grid (9 pts, [1/9, 9])	Wide Grid (13 pts, [1/27, 27])
$I$	$0.135$	0.645/0.550/0.550/0.600	0.645/0.550/0.550/0.600
$I$	$0.223$	0.645/0.565/0.566/0.628	0.645/0.565/0.566/0.628
$I$	$0.368$	0.645/0.588/0.589/0.678	0.645/0.589/0.590/0.678
$I$	$0.607$	0.645/0.616/0.617/0.764	0.645/0.618/0.619/0.764
$I$	$1.000$	0.645/0.647/0.642/0.882	0.645/0.648/0.642/0.883
$I$	$1.649$	0.645/0.791/0.645/0.990	0.645/0.791/0.645/0.990
$I$	$2.718$	0.645/1.639/0.645/1.261	0.645/1.639/0.645/1.261
$II$	$0.135$	0.883/0.816/0.819/0.811	0.887/0.824/0.826/0.811
$II$	$0.223$	0.883/0.826/0.829/0.829	0.887/0.835/0.838/0.829
$II$	$0.368$	0.883/0.840/0.843/0.855	0.887/0.850/0.852/0.856
$II$	$0.607$	0.883/0.860/0.862/0.898	0.887/0.869/0.871/0.900
$II$	$1.000$	0.883/0.892/0.883/0.951	0.887/0.896/0.888/0.955
$II$	$1.649$	0.883/1.021/0.887/1.005	0.887/1.021/0.891/1.006
$II$	$2.718$	0.883/1.898/0.883/1.123	0.887/1.899/0.887/1.120
$III$	$0.135$	0.781/0.693/0.692/0.706	0.782/0.694/0.694/0.706
$III$	$0.223$	0.781/0.707/0.707/0.729	0.782/0.709/0.709/0.729
$III$	$0.368$	0.781/0.727/0.727/0.770	0.782/0.731/0.731/0.770
$III$	$0.607$	0.781/0.754/0.754/0.834	0.782/0.758/0.758/0.834
$III$	$1.000$	0.781/0.787/0.780/0.920	0.782/0.789/0.781/0.922
$III$	$1.649$	0.781/0.941/0.783/0.998	0.782/0.941/0.783/0.999
$III$	$2.718$	0.781/1.985/0.781/1.181	0.782/1.985/0.782/1.180
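The two grids described in the caption of Table 5 can be reconstructed from the stated endpoints and sizes. The sketch below is illustrative: the caption gives only the number of points, the endpoints, and the geometric spacing, from which a common ratio of $\sqrt{3}$ follows for both grids; the variable names are hypothetical.

```python
import numpy as np

# 9 geometrically spaced values from 1/9 to 9: exponents -2, -1.5, ..., 2 of base 3.
default_grid = 3.0 ** np.linspace(-2.0, 2.0, 9)

# 13 geometrically spaced values from 1/27 to 27: exponents -3, -2.5, ..., 3 of base 3.
wide_grid = 3.0 ** np.linspace(-3.0, 3.0, 13)

# Every default value should also appear in the wide grid (up to rounding).
is_subset = all(np.isclose(wide_grid, v).any() for v in default_grid)
```

Because both grids share the same ratio and the wide grid extends the exponent range by one unit on each side, the default grid is contained in the wide grid, as the caption states.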

Table 6. Sensitivity of relative estimation error to the ridge penalty value when cross-validation tuning is disabled. All four ridge penalties (Single-RR’s

τ_{st}

, Trans-RR’s

τ_{1}

and

τ

, Pooled-RR’s

τ_{p}

) are forced to a common fixed value from

{1 / 3, 1, 3, 9}

. Trans-RR-Ada’s mixing weight

θ

is still selected by 5-fold cross-validation on the target sample (Algorithm 2). All other settings are as in Figure 3, with

M = 500

replications per cell. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error.

Case	h	τ = 1/3	τ = 1	τ = 3	τ = 9
$I$	$0.135$	0.742/0.769/0.687/0.619	0.628/0.523/0.525/0.605	0.742/0.620/0.620/0.773	0.881/0.814/0.814/0.906
$I$	$0.223$	0.742/0.781/0.696/0.645	0.628/0.538/0.538/0.625	0.742/0.631/0.632/0.783	0.881/0.820/0.820/0.910
$I$	$0.368$	0.742/0.804/0.709/0.694	0.628/0.565/0.561/0.661	0.742/0.651/0.652/0.801	0.881/0.829/0.829/0.917
$I$	$0.607$	0.742/0.851/0.728/0.790	0.628/0.616/0.596/0.726	0.742/0.687/0.690/0.833	0.881/0.846/0.846/0.929
$I$	$1.000$	0.742/0.953/0.745/0.990	0.628/0.714/0.629/0.848	0.742/0.752/0.740/0.890	0.881/0.876/0.876/0.951
$I$	$1.649$	0.742/1.185/0.743/1.423	0.628/0.913/0.628/1.074	0.742/0.866/0.742/0.986	0.881/0.924/0.881/0.986
$I$	$2.718$	0.742/1.739/0.742/2.321	0.628/1.287/0.628/1.436	0.742/1.033/0.742/1.113	0.881/0.987/0.881/1.030
$II$	$0.135$	2.026/2.443/2.014/1.143	0.978/0.964/0.930/0.789	0.861/0.778/0.784/0.850	0.924/0.877/0.878/0.936
$II$	$0.223$	2.026/2.459/2.019/1.167	0.978/0.979/0.938/0.804	0.861/0.787/0.793/0.857	0.924/0.881/0.882/0.939
$II$	$0.368$	2.026/2.490/2.027/1.211	0.978/1.005/0.953/0.833	0.861/0.803/0.810/0.870	0.924/0.888/0.889/0.943
$II$	$0.607$	2.026/2.551/2.037/1.297	0.978/1.055/0.972/0.885	0.861/0.832/0.836/0.894	0.924/0.900/0.903/0.952
$II$	$1.000$	2.026/2.681/2.045/1.475	0.978/1.152/0.988/0.983	0.861/0.885/0.865/0.937	0.924/0.923/0.922/0.968
$II$	$1.649$	2.026/2.971/2.042/1.848	0.978/1.342/0.984/1.161	0.861/0.977/0.864/1.008	0.924/0.959/0.927/0.994
$II$	$2.718$	2.026/3.615/2.034/2.569	0.978/1.663/0.978/1.420	0.861/1.096/0.861/1.093	0.924/1.000/0.925/1.022
$III$	$0.135$	1.286/1.442/1.246/0.827	0.784/0.713/0.704/0.686	0.798/0.693/0.694/0.808	0.902/0.844/0.844/0.920
$III$	$0.223$	1.286/1.456/1.253/0.852	0.784/0.729/0.715/0.705	0.798/0.703/0.704/0.817	0.902/0.849/0.849/0.923
$III$	$0.368$	1.286/1.484/1.265/0.900	0.784/0.756/0.735/0.737	0.798/0.721/0.724/0.833	0.902/0.857/0.857/0.929
$III$	$0.607$	1.286/1.541/1.282/0.994	0.784/0.808/0.763/0.797	0.798/0.754/0.759/0.861	0.902/0.872/0.873/0.940
$III$	$1.000$	1.286/1.661/1.294/1.187	0.784/0.909/0.789/0.910	0.798/0.815/0.799/0.912	0.902/0.898/0.899/0.959
$III$	$1.649$	1.286/1.933/1.290/1.600	0.784/1.109/0.786/1.115	0.798/0.920/0.799/0.996	0.902/0.941/0.903/0.990
$III$	$2.718$	1.286/2.559/1.286/2.420	0.784/1.466/0.784/1.426	0.798/1.064/0.798/1.103	0.902/0.993/0.902/1.026

Table 7. Prediction performance on the NIR spectral dataset over 20 repeated random splits. Each entry reports the average RMSE, with the standard deviation in parentheses.

Method	Direction A	Direction B
Trans-RR	4.6230 (0.1732)	4.7933 (0.2736)
Trans-RR-Ada	4.6294 (0.1861)	4.8211 (0.3757)
Pooled-RR	5.0952 (0.0812)	5.4909 (0.1335)
Trans-Lasso	5.5668 (0.3650)	5.6803 (0.3131)
Single-RR	6.2666 (2.2628)	6.8272 (2.2674)
Single-Lasso	8.0672 (2.8904)	8.5607 (3.3739)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lyu, L.; Guo, X.; Liu, Z. Transfer Learning for Moderate–Dimensional Ridge-Regularized Robust Linear Regression. Entropy 2026, 28, 543. https://doi.org/10.3390/e28050543
