Article

Transfer Learning for Moderate–Dimensional Ridge-Regularized Robust Linear Regression

Department of Statistics and Finance, School of Management, University of Science and Technology of China, Hefei 230026, China
*
Author to whom correspondence should be addressed.
Entropy 2026, 28(5), 543; https://doi.org/10.3390/e28050543
Submission received: 18 March 2026 / Revised: 7 May 2026 / Accepted: 9 May 2026 / Published: 11 May 2026
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

This paper studies transfer learning for ridge-regularized robust linear regression in the moderate–dimensional regime, where the number of predictors is of the same order as the sample size and the regression coefficients are not assumed to be sparse. We propose Trans-RR, which combines a robust ridge estimator from a source study with a robust ridge correction based on the target study. Under mild assumptions, we characterize the asymptotic estimation error of the proposed estimator and show that leveraging source data can substantially improve estimation accuracy relative to the traditional single-study ridge-regularized robust estimator. To guard against negative transfer when the source study is not sufficiently informative, we further propose an adaptive aggregation of Trans-RR with the single-task estimator that selects the mixing weight by cross-validation. Simulation studies and a real-data analysis support the theory and illustrate the transition between positive and negative transfer as the discrepancy between the source and target studies varies.

1. Introduction

Modern statistical analyses often involve several related datasets collected from different studies, populations, or experiments. A basic question is how to use these datasets together to improve prediction and estimation for a study of interest. Transfer learning addresses this question by borrowing useful information from related tasks. It is now standard in machine learning and has been successful in applications such as natural language processing, remote sensing, and computer vision [1,2,3]. In statistics, it has also become an important tool for improving performance in multi-study problems.
In many contemporary applications, regression problems arise in a moderate–dimensional regime, where the number of predictors is of the same order as the sample size and sparsity is often not a reasonable structural assumption. At the same time, heavy-tailed errors or outlying observations may substantially affect estimation accuracy. This setting arises naturally in multisite metabolomics studies, where many metabolites are measured simultaneously, between-cohort heterogeneity is often present, and outliers or other data contamination can be an important practical concern [4]. These features make transfer learning particularly challenging: related source studies may contain useful information, but effective borrowing requires methods that are robust to contamination and suitable for moderate–dimensional, non-sparse settings.
Existing theoretical work on transfer learning covers several important settings. For linear regression, ref. [5] studied data-enriched regression in a fixed-dimensional setting, and ref. [6] analyzed linear models with a shared low-dimensional representation across tasks. In the high-dimensional sparse regime, ref. [7] considered transfer learning with proxy data, while ref. [8] established prediction and estimation guarantees for sparse linear regression. For high-dimensional generalized linear models, refs. [9,10] developed transfer learning methods with theoretical guarantees. Transfer learning has also been studied for nonparametric classification [11], nonparametric regression [12], and settings with unreliable source data [13]. However, these works do not apply to the moderate–dimensional robust setting considered here.
Robust regression has, in contrast, been extensively studied in the single-study setting. For classical M-estimation, a substantial body of work has established asymptotic results when $p/n \to 0$ while $p \to \infty$, including [14,15,16,17,18,19]. When $p/n \to \kappa \in (0,1)$, robust regression has a qualitatively different asymptotic behavior [20]. When $p/n \to \kappa > 0$, ref. [21] proposed a ridge-regularized robust estimator to address the nonexistence of the ordinary robust estimator. However, these results do not directly extend to the transfer learning framework when related source data are available.
Motivated by these gaps, we study transfer learning under a moderate–dimensional robust linear model with one target study and one related source study. We allow the predictor dimension to be of the same order as the target and source sample sizes, do not impose sparsity assumptions on the regression coefficients, and permit heavy-tailed errors. Within this framework, our goal is to leverage information from the source study to improve the estimation performance of traditional single-task approaches.
Our main contribution is to develop and analyze a transfer learning method for moderate–dimensional ridge-regularized robust linear regression. First, we propose Trans-RR, a transfer learning procedure for ridge-regularized robust linear regression. It combines a robust ridge fit on the source study with a robust ridge correction on the target study and is designed for non-sparse coefficients. Second, we derive an asymptotic characterization of the $\ell_2$ risk of the resulting estimator under mild assumptions on the design and error distributions. The theory shows how auxiliary source data can improve estimation accuracy relative to the single-study ridge-regularized robust estimator, while also clarifying the possibility of negative transfer. Third, we propose an adaptive aggregation of Trans-RR with the single-task estimator that selects the mixing weight by cross-validation, providing a data-driven safeguard against negative transfer. Fourth, we conduct simulation studies and a real-data analysis to examine the practical performance of the proposed methods, including their sensitivity to tuning choices and to the identity–covariance assumption underlying the theory.
The rest of the paper is organized as follows. Section 2 introduces the model setup and the proposed algorithm. Section 3 presents the technical assumptions, the theoretical results, and the adaptive aggregation against negative transfer. Section 4 presents simulation studies to evaluate the performance of the proposed methods. Section 5 applies the proposed methods to a real-data example. The proofs of the main theoretical result as well as the lemmas are included in Appendix D and Appendix E.
Notation. Denote by $I_m$ the $m \times m$ identity matrix. Let $0_m \in \mathbb{R}^m$ and $1_m \in \mathbb{R}^m$ be the vectors of zeros and ones, respectively. For a vector $v = (v_1, \ldots, v_m)^\top$, the $\ell_2$ norm is $\|v\| = (\sum_{i=1}^{m} v_i^2)^{1/2}$, whereas the sup norm is $\|v\|_\infty = \max_{1 \le k \le m} |v_k|$. For an $m \times m$ matrix $A = \{a_{ij}\}_{1 \le i,j \le m}$, denote by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ the maximum and minimum eigenvalues of $A$, respectively. The $L_2$ norm of $A$ is defined as $\|A\| = \{\lambda_{\max}(A^\top A)\}^{1/2}$.

2. Methodology

2.1. Problem Setup

We consider a transfer learning problem with one target study and one related source study. In the target study, we observe $n$ samples $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, $i = 1, \ldots, n$, generated from
$$y_i = x_i^\top \beta_0 + \epsilon_i, \tag{1}$$
where $\epsilon_i$, $i = 1, \ldots, n$, are independently distributed errors and $\beta_0 \in \mathbb{R}^p$ is the unknown regression parameter of interest.
In addition to the target data, we observe $n_1$ samples $(x_i^{(1)}, y_i^{(1)})$, $i = 1, \ldots, n_1$, from the source study satisfying
$$y_i^{(1)} = (x_i^{(1)})^\top w_0 + \epsilon_i^{(1)}, \tag{2}$$
where $w_0 \in \mathbb{R}^p$ is the regression parameter for the source study and $\epsilon_i^{(1)}$, $i = 1, \ldots, n_1$, are independently distributed errors. Throughout, both $\epsilon_i$ and $\epsilon_i^{(1)}$ are allowed to be heavy-tailed.
We work in the moderate–dimensional regime, where $p$ is of the same order as both $n$ and $n_1$, with $p/n \to \kappa \in (0, \infty)$ and $p/n_1 \to \kappa_1 \in (0, \infty)$. We also do not impose sparsity assumptions on $\beta_0$ or $w_0$. Let $\delta_0 = \beta_0 - w_0$ denote the source–target discrepancy and let $h = \|\delta_0\|$ measure its size. Smaller values of $h$ correspond to a more informative source study and hence a greater potential for useful transfer.

2.2. Trans-RR Algorithm

Based on this setup, we now introduce the proposed transfer learning algorithm, referred to as Trans-RR. Following the general two-stage strategy used in [7,8,9], the core of the procedure consists of two estimation steps. The first step estimates the source coefficient vector $w_0$ from the source data. The second step estimates the source–target discrepancy $\delta_0 = \beta_0 - w_0$ from the target data, after which the two estimates are combined. Algorithm 1 summarizes the procedure.
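Algorithm 1: Trans-RR algorithm
Input: target data $\{(x_i, y_i)\}_{i=1}^{n}$ and source data $\{(x_i^{(1)}, y_i^{(1)})\}_{i=1}^{n_1}$
Output: the estimated coefficient vector $\hat{\beta}$
 Step 1. Compute
$$\hat{w} = \arg\min_{w \in \mathbb{R}^p} \frac{1}{n_1} \sum_{i=1}^{n_1} \tilde{\rho}\big(y_i^{(1)} - (x_i^{(1)})^\top w\big) + \frac{\tau_1}{2} \|w\|^2 \tag{3}$$
 for some constant $\tau_1$.
 Step 2. Compute
$$\hat{\delta} = \arg\min_{\delta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \rho\big(y_i - x_i^\top (\hat{w} + \delta)\big) + \frac{\tau}{2} \|\delta\|^2 \tag{4}$$
 for some constant $\tau$.
 Step 3. Let
$$\hat{\beta} = \hat{w} + \hat{\delta}. \tag{5}$$
 Step 4. Output $\hat{\beta}$.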
The idea behind the construction is straightforward. Step 1 computes a ridge-regularized robust estimator from the source study. Step 2 then estimates the discrepancy relative to the source-stage fit by solving a second ridge-regularized robust regression problem on the target study. The final estimator is obtained by combining these two pieces, namely $\hat{\beta} = \hat{w} + \hat{\delta}$. The main difference between our procedure and those of [7,8,9] is that we use ridge ($\ell_2$) regularization in both steps, whereas they use $\ell_1$ regularization. This choice is motivated by our diffuse-coefficient setting: the regression parameters $\beta_0$ and $w_0$ have many small coordinates and are not well approximated by sparse vectors. In this setting, lasso-based methods are not well-suited to the problem, whereas ridge penalization is natural. Another difference is that we use robust loss functions rather than the quadratic loss, which makes the procedure less sensitive to outliers and heavy-tailed errors.
Remark 1.
When robustness to heavy-tailed errors or outliers is needed, Huber-type loss functions are natural candidates for $\rho$ and $\tilde{\rho}$. Specific choices of $\rho$ and $\tilde{\rho}$ under our theoretical framework are discussed in Section 3. The regularization parameters $\tau$ and $\tau_1$ may be selected by standard data-driven tuning methods such as cross-validation.
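The three steps of Algorithm 1 can be sketched in a few lines of standard-library Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: it uses the pseudo-Huber loss for both stages, plain gradient descent with a fixed step size, and hypothetical helper names (`robust_ridge`, `trans_rr`, `simulate`) together with toy Gaussian data chosen for the demonstration.

```python
import math
import random

def psi(r, delta=1.35):
    """Derivative of the pseudo-Huber loss; |psi| is bounded by delta."""
    return r / math.sqrt(1.0 + (r / delta) ** 2)

def robust_ridge(X, y, tau, offset=None, steps=500, lr=0.1):
    """Gradient descent on (1/n) sum_i rho(y_i - x_i'(offset + b)) + (tau/2)||b||^2."""
    n, p = len(X), len(X[0])
    if offset is None:
        offset = [0.0] * p
    b = [0.0] * p
    for _ in range(steps):
        grad = [tau * bj for bj in b]            # gradient of the ridge term
        for xi, yi in zip(X, y):
            fitted = sum(xj * (oj + bj) for xj, oj, bj in zip(xi, offset, b))
            g = psi(yi - fitted) / n
            for j in range(p):
                grad[j] -= g * xi[j]             # gradient of the robust loss term
        b = [bj - lr * gj for bj, gj in zip(b, grad)]
    return b

def trans_rr(Xs, ys, Xt, yt, tau1, tau):
    w_hat = robust_ridge(Xs, ys, tau1)                    # Step 1: source fit
    delta_hat = robust_ridge(Xt, yt, tau, offset=w_hat)   # Step 2: target correction
    return [w + d for w, d in zip(w_hat, delta_hat)]      # Step 3: combine

def simulate(rng, m, coef, sd):
    """Toy Gaussian design and noise (an assumption made for this demo only)."""
    X = [[rng.gauss(0.0, 1.0) for _ in coef] for _ in range(m)]
    y = [sum(a * b for a, b in zip(xi, coef)) + rng.gauss(0.0, sd) for xi in X]
    return X, y

rng = random.Random(0)
w0 = [1.0, -1.0, 0.5, 0.5, 2.0]
beta0 = [w + 0.05 for w in w0]                   # small source-target discrepancy
Xs, ys = simulate(rng, 200, w0, 0.5)
Xt, yt = simulate(rng, 40, beta0, 0.5)
beta_hat = trans_rr(Xs, ys, Xt, yt, tau1=0.01, tau=1.0)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(beta_hat, beta0)))
```

Passing the source-stage fit as `offset` keeps the two stages structurally identical, so the same solver serves both (3) and (4). A production implementation would likely use a second-order or proximal solver rather than fixed-step gradient descent.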

3. Theoretical Results

This section introduces the assumptions for the analysis and then presents the main asymptotic error results for Trans-RR.

3.1. Technical Assumptions

We study the estimation error of the estimator in Algorithm 1 under the following assumptions. We state the assumptions separately for the target study (Assumption 1) and the source study (Assumption 2), since the two stages are based on different samples. The two sets of conditions are largely parallel.
Assumption 1.
(a) $p/n \to \kappa \in (0, \infty)$.
(b) Suppose $\rho$ is an even and convex function. Assume that $\psi = \rho'$ is bounded and $\psi'$ is Lipschitz and bounded. Moreover, we assume that $\operatorname{sign}(\psi(x)) = \operatorname{sign}(x)$ and that $\rho(x) \ge \rho(0) = 0$ for all $x \in \mathbb{R}$.
(c) Assume that there exist independent variables $\lambda_i$’s and $X_i$’s such that $x_i = \lambda_i X_i$. Suppose that $X_i$’s are i.i.d. with independent entries, and they have mean $0_p$ and $\operatorname{cov}(X_i) = I_p$. Suppose there exist $c_n$ and $C_n$ that vary with $n$, where $1/c_n = O(\operatorname{polyLog}(n))$ and $C_n$ is bounded in $n$, such that for any convex 1-Lipschitz function $G$ of $X_i$, $P(|G(X_i) - m_G| > t) \le C_n \exp(-c_n t^2)$ holds for all $t > 0$, where $m_G$ is the median of $G(X_i)$. We require the same assumption to hold for the columns of the $n \times p$ design matrix $X$. Additionally, we assume that the coordinates of $X_i$ have moments of all orders, and the $k$-th moment of the entries of $X_i$ is assumed to be uniformly bounded independently of $n$ and $p$ for all $k$.
(d) Suppose that $\lambda_i$’s are independent, with $E(\lambda_i^2) = 1$, $E(\lambda_i^4)$ being bounded, and $\sup_{1 \le i \le n} |\lambda_i|$ growing at most like $C_\lambda (\log n)^k$ for some $k$. $\lambda_i$’s may have finitely many possible distributions.
(e) Suppose that $\epsilon_i$’s are independent and are also independent of $X_i$’s and $\lambda_i$’s. They may have finitely many possible distributions, each with a density that is differentiable, symmetric, and unimodal. If $f_i$ is the density of one such distribution, we assume that $\lim_{x \to \infty} x f_i(x) = 0$.
(f) The fraction of occurrences for each possible combination of distributions for $(\epsilon_i, \lambda_i)$ has a limit as $n \to \infty$.
(g) There exist constants $C_\beta$ and $e > 1/3$ such that $\|\beta_0\| \le C_\beta$ and $\|\beta_0\|_\infty = O(n^{-e})$.
Assumption 2.
(a) $p/n_1 \to \kappa_1 \in (0, \infty)$.
(b) $\tilde{\rho}$ and $\tilde{\psi}$ satisfy Assumption 1(b).
(c) $x_i^{(1)}$, $X_i^{(1)}$’s and $\lambda_i^{(1)}$’s satisfy Assumption 1(c).
(d) $\lambda_i^{(1)}$’s satisfy Assumption 1(d).
(e) $\epsilon_i^{(1)}$’s, $X_i^{(1)}$’s and $\lambda_i^{(1)}$’s satisfy Assumption 1(e).
(f) $\lambda_i^{(1)}$’s and $\epsilon_i^{(1)}$’s satisfy Assumption 1(f).
(g) $\|w_0\|_2$ remains bounded. Furthermore, $\|w_0\|_\infty = O(n_1^{-e})$, where $e > 1/3$.
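For Assumptions 1(b) and 2(b), it is quite common in robust statistics to require $\psi$ to be bounded. For example, the Huber loss
$$\rho_H(x) = \begin{cases} \dfrac{x^2}{2} & \text{if } |x| \le \delta, \\ \delta \cdot \big(|x| - \tfrac{1}{2}\delta\big) & \text{otherwise,} \end{cases}$$
is chosen to grow linearly to infinity, which reduces the influence of outliers on the resulting regression estimator. Although the Huber loss does not fully satisfy the assumptions because it is not twice differentiable at $|x| = \delta$, these assumptions hold for a smoothed approximation such as
$$\rho_\eta(x) = \begin{cases} \dfrac{x^2}{2} & \text{if } |x| \le \delta - \eta, \\ \big(\delta - \tfrac{\eta}{2}\big) \cdot |x| + \dfrac{(\delta - |x|)^3}{6\eta} + C_\rho & \text{if } |x| \in (\delta - \eta, \delta), \\ \big(\delta - \tfrac{\eta}{2}\big) \cdot |x| + C_\rho & \text{if } |x| \ge \delta, \end{cases} \tag{6}$$
where $C_\rho = -\eta^2/6 + \eta\delta/2 - \delta^2/2$. The corresponding $\psi_\eta$ is given by
$$\psi_\eta(x) = \begin{cases} x & \text{if } |x| \le \delta - \eta, \\ \operatorname{sign}(x) \cdot \Big[\big(\delta - \tfrac{\eta}{2}\big) - \dfrac{(\delta - |x|)^2}{2\eta}\Big] & \text{if } |x| \in (\delta - \eta, \delta), \\ \operatorname{sign}(x) \cdot \big(\delta - \tfrac{\eta}{2}\big) & \text{if } |x| \ge \delta. \end{cases}$$
Another example that fully satisfies the assumptions is the pseudo-Huber loss function from [22,23], defined by
$$\rho_P(x) = \delta^2 \left( \sqrt{1 + \frac{x^2}{\delta^2}} - 1 \right).$$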
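The loss functions above translate directly into code. The sketch below transcribes the smoothed Huber loss, its derivative, and the pseudo-Huber loss, with the constant $C_\rho = -\eta^2/6 + \eta\delta/2 - \delta^2/2$ fixed by continuity at $|x| = \delta - \eta$. The function names and the default values $\delta = 1.35$, $\eta = 0.1$ (the values used later in the simulations) are choices made for this illustration.

```python
import math

def smoothed_huber(x, delta=1.35, eta=0.1):
    """Smoothed Huber loss: quadratic core, cubic blend, linear tails."""
    a = abs(x)
    slope = delta - eta / 2.0
    c_rho = -eta ** 2 / 6.0 + eta * delta / 2.0 - delta ** 2 / 2.0
    if a <= delta - eta:
        return 0.5 * x * x
    if a < delta:
        return slope * a + (delta - a) ** 3 / (6.0 * eta) + c_rho
    return slope * a + c_rho

def smoothed_huber_psi(x, delta=1.35, eta=0.1):
    """Derivative of the smoothed Huber loss; bounded by delta - eta/2."""
    a = abs(x)
    slope = delta - eta / 2.0
    if a <= delta - eta:
        return x
    sign = math.copysign(1.0, x)
    if a < delta:
        return sign * (slope - (delta - a) ** 2 / (2.0 * eta))
    return sign * slope

def pseudo_huber(x, delta=1.35):
    """Pseudo-Huber loss: smooth everywhere, approximately delta*|x| for large |x|."""
    return delta ** 2 * (math.sqrt(1.0 + (x / delta) ** 2) - 1.0)
```

Evaluating `smoothed_huber` just below and just above the two knots $|x| = \delta - \eta$ and $|x| = \delta$ gives a quick numerical check that the pieces join continuously.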
The assumptions on $x_i$’s and $x_i^{(1)}$’s, in particular that they have mean $0_p$ and covariance matrix $I_p$, are common in the study of M-estimators for linear models. These assumptions have been used in the low-dimensional regime $p/n \to 0$ studied in [14,15,24], in the moderate–dimensional regime $p/n \to \kappa \in (0,1)$ considered in [20], and in the regime $p/n \to \kappa > 0$ analyzed in [21,25].
The concentration assumption on $X_i$’s and $X_i^{(1)}$’s is weaker than the Gaussian assumptions often imposed in robust statistics. This assumption has also been studied in [21,26] and holds for a broad class of distributions. Corollary 4.10 in [27] demonstrates that our assumptions are satisfied if $X_i$ has independent entries bounded by $1/(2\sqrt{c_1})$ for some $c_1 > 0$. Additionally, Theorem 2.7 in [27] shows that the assumptions hold when $X_i$ has independent entries with density $f_k$, $1 \le k \le p$, such that $f_k(x) = \exp(-u_k(x))$ and $u_k''(x) \ge c_2$ for some $c_2 > 0$. In particular, this condition holds when $X_i$ has i.i.d. $N(0,1)$ entries, where $c_2 = 1$. As will be seen in the proof, the functions $G$ that arise in our analysis are either linear functions or square roots of quadratic forms. A similar discussion applies to the $X_i^{(1)}$’s.
The introduction of $\lambda_i$ and $\lambda_i^{(1)}$, as also considered in [20,21], is used to induce a nonspherical geometry on the predictors. Although the assumption $E(\lambda_i^2) = 1$ can be relaxed to the requirement that $E(\lambda_i^2)$ be uniformly bounded, it remains statistically important because it guarantees that $\operatorname{cov}(x_i) = I_p$ in all the models we consider. This construction shows that many models can share the same covariance $\operatorname{cov}(x_i)$ while exhibiting substantially different estimation errors for $\hat{\beta}$. This contrasts with the low-dimensional setting studied in [28], where $\operatorname{cov}(x_i)$ is the key quantity for robust regression. A similar discussion applies to $\lambda_i^{(1)}$’s and $x_i^{(1)}$’s.
In Assumptions 1(e) and 2(e), no moment restriction is imposed on the $\epsilon_i$’s and $\epsilon_i^{(1)}$’s. For instance, smooth symmetric log-concave densities satisfy all of these assumptions; see [29,30]. Furthermore, the Cauchy distribution also satisfies these conditions; see Theorem 1.6 in [31]. This makes the framework compatible with heavy-tailed errors, which are of particular interest in robust regression.
Assumptions 1(g) and 2(g) impose a non-sparse structure on $\beta_0$ and $w_0$, meaning that these vectors cannot be well approximated by sparse vectors in $\ell_2$ norm. This setting is common in moderate–dimensional statistics and contrasts with the sparse regime, where only a small fraction of coefficients are substantial. Under these assumptions, the proposed Trans-RR estimator may outperform lasso-based methods.
We now turn to the target-stage result and the resulting error characterization for Trans-RR.

3.2. Asymptotic Characterization of Estimation Error

Our main theorem characterizes the asymptotic $\ell_2$ error of the Trans-RR estimator. Recall that $\hat{\beta}$ is defined in (5), and let $\tau > 0$ be fixed as $n$ and $p$ vary. To state the result, let $\operatorname{prox}(c\rho)$ denote the proximal mapping of the function $c\rho$; see [32]. It is given by
$$\operatorname{prox}(c\rho)(x) = \arg\min_{y \in \mathbb{R}} \Big\{ c\rho(y) + \frac{1}{2}(x - y)^2 \Big\}, \quad x \in \mathbb{R}.$$
When $\rho$ is differentiable, $\operatorname{prox}(c\rho)(x)$ is the unique $y \in \mathbb{R}$ satisfying $y + c\psi(y) = x$, with $\psi = \rho'$. $\operatorname{prox}(c\rho)(x)$ can be viewed as a shrinkage of $x$ toward the minimizer of $\rho$, with the amount of shrinkage depending on $c$ and $\rho$. The proximal mapping is a standard object in convex analysis and convex optimization (see [33] for a review of its analytic properties and an introduction to proximal gradient algorithms). As explained in [34], the system in Theorem 1 can be reformulated in terms of $\operatorname{prox}((c_\rho(\kappa)\rho)^*)$, where $f^*(u) = \sup_{y \in \mathbb{R}} \{uy - f(y)\}$ denotes the Fenchel–Legendre conjugate of $f$.
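Because $y \mapsto y + c\psi(y)$ is increasing when $\rho$ is convex, and $\operatorname{sign}(\psi(y)) = \operatorname{sign}(y)$ places the root between $0$ and $x$, the proximal mapping can be evaluated by bisection. The following sketch assumes the pseudo-Huber $\psi$ as the default; the names `prox` and `psi_pseudo_huber` are hypothetical, and any other $\psi$ satisfying Assumption 1(b) can be passed in.

```python
import math

def psi_pseudo_huber(y, delta=1.35):
    """Derivative of the pseudo-Huber loss (an assumed default choice of psi)."""
    return y / math.sqrt(1.0 + (y / delta) ** 2)

def prox(x, c, psi=psi_pseudo_huber, iters=200):
    """Return the y solving y + c * psi(y) = x, found by bisection.

    Since g(y) = y + c * psi(y) is increasing with g(0) = 0 and |g(x)| >= |x|,
    the root always lies between 0 and x.
    """
    lo, hi = (0.0, x) if x >= 0 else (x, 0.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + c * psi(mid) < x:
            lo = mid          # g(mid) too small: root is to the right
        else:
            hi = mid          # g(mid) large enough: root is to the left
    return 0.5 * (lo + hi)

y = prox(2.0, 1.5)
residual = y + 1.5 * psi_pseudo_huber(y) - 2.0   # should be numerically zero
```

For the quadratic loss, where $\psi(y) = y$, the same routine reproduces the closed form $x/(1+c)$, which is a convenient sanity check: `prox(3.0, 2.0, psi=lambda t: t)` is approximately `1.0`.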
Under Assumptions 1 and 2, the estimation error admits the following limit.
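Theorem 1.
Under Assumptions 1 and 2, conditional on the source-stage estimator $\hat{w}$, which is independent of the target sample, we have $\|\hat{\beta} - \beta_0\| \to r_\rho(\kappa)$ in probability, where $r_\rho(\kappa)$ is deterministic for the given value of $\hat{w}$. Let $W_i = \epsilon_i + r_\rho(\kappa) \lambda_i Z_i$, where $Z_i$ is a standard normal random variable independent of $\epsilon_i$ and $\lambda_i$. Then there exists a constant $c_\rho(\kappa)$ such that
$$\begin{cases} \displaystyle \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} E\Big[ \big(\operatorname{prox}\{c_\rho(\kappa) \lambda_i^2 \rho\}\big)'(W_i) \Big] = 1 - \kappa + \tau c_\rho(\kappa), \\[2ex] \displaystyle \kappa \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} E\left[ \frac{\big[W_i - \operatorname{prox}\{c_\rho(\kappa) \lambda_i^2 \rho\}(W_i)\big]^2}{\lambda_i^2} \right] + \tau^2 \|\beta_0 - \hat{w}\|^2 c_\rho^2(\kappa) = \kappa^2 r_\rho^2(\kappa). \end{cases} \tag{10}$$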
The proof of Theorem 1 is given in Appendix E. Here and below, the dependence of $r_\rho(\kappa)$ and $c_\rho(\kappa)$ on $\hat{w}$ is suppressed for notational simplicity.
Theorem 1 shows that the asymptotic error $r_\rho(\kappa)$ depends on the source study only through the discrepancy $\|\beta_0 - \hat{w}\|$. To investigate when positive transfer occurs, Section 4.2 numerically computes $r_\rho$ as a function of $\|\beta_0 - \hat{w}\|$ under the smoothed Huber loss (6) in three simulation cases. The resulting curves are monotonically increasing across the displayed range. By the triangle inequality,
$$\|\beta_0 - \hat{w}\| \le \|\beta_0 - w_0\| + \|\hat{w} - w_0\|.$$
The right-hand side is the sum of the population gap between the source and target coefficients and the source-stage estimation error. Transfer is therefore expected to help when the source-stage estimator is accurate and close to the target coefficient, and to hurt when either the population gap $\|\beta_0 - w_0\|$ or the source-stage estimation error $\|\hat{w} - w_0\|$ is large. In practice, Section 3.3 develops an adaptive Trans-RR estimator to avoid negative transfer.
Unlike several recent transfer learning analyses, such as [7,8,9], our theory does not impose direct structural restrictions on $\delta_0$, the difference between the target and source coefficients. Theorem 1 also shows that the performance of $\hat{\beta}$ depends on the distribution of the $\lambda_i$’s in the representation $x_i = \lambda_i X_i$ from Assumption 1(c). Thus, in the moderate–dimensional regime, the geometry of the target predictors encoded by $\lambda_i$ materially affects the estimation error. This again contrasts with low-dimensional robust regression, where $\operatorname{cov}(x_i)$ is the dominant quantity.
When $\lambda_i^2 = 1$ for all $i$ and the errors $\epsilon_i$ are i.i.d., Theorem 1 simplifies as follows.
Corollary 1.
Under the same assumptions as in Theorem 1, if $\lambda_i^2 = 1$ for all $i$ and the errors $\epsilon_i$ are i.i.d., then, conditional on $\hat{w}$, we have $\|\hat{\beta} - \beta_0\| \to r_\rho(\kappa)$ in probability, where $r_\rho(\kappa)$ is deterministic for the given value of $\hat{w}$. Let $w = \epsilon + r_\rho(\kappa) z$, where $\epsilon$ has the same distribution as the $\epsilon_i$’s and $z$ is a standard normal random variable independent of $\epsilon$. Then there exists a constant $c_\rho(\kappa)$ such that
$$\begin{cases} E\Big[ \big(\operatorname{prox}\{c_\rho(\kappa) \rho\}\big)'(w) \Big] = 1 - \kappa + \tau c_\rho(\kappa), \\[1ex] \kappa E\Big[ \big(w - \operatorname{prox}\{c_\rho(\kappa) \rho\}(w)\big)^2 \Big] + \tau^2 \|\beta_0 - \hat{w}\|^2 c_\rho^2(\kappa) = \kappa^2 r_\rho^2(\kappa). \end{cases}$$
Corollary 1 shows that, under a homogeneous target design, the general characterization in Theorem 1 reduces to a simpler scalar system. This special case is useful for interpretation and will also serve as a convenient benchmark in the simulation study.
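To see how the scalar system behaves, it helps to work through a case where it can be solved by hand. The sketch below uses the quadratic loss $\rho(x) = x^2/2$ with finite error variance $\sigma^2$. This is an illustrative assumption only: the quadratic loss has unbounded $\psi$ and therefore falls outside Assumption 1(b), so the result is not covered by Theorem 1. It is chosen because $\operatorname{prox}(c\rho)(x) = x/(1+c)$ is explicit. The first equation then reads $1/(1+c) = 1 - \kappa + \tau c$, and the second becomes $\kappa \, (c/(1+c))^2 (\sigma^2 + r^2) + \tau^2 \|\beta_0 - \hat{w}\|^2 c^2 = \kappa^2 r^2$, which is linear in $r^2$. The function name and arguments are hypothetical.

```python
import math

def quadratic_loss_limit(kappa, tau, sigma2, gap):
    """Solve the two-equation scalar system for rho(x) = x^2 / 2 (illustration only).

    kappa  : limit of p / n
    tau    : target-stage ridge penalty (tau >= 0; tau > 0 if kappa >= 1)
    sigma2 : error variance E[eps^2], assumed finite here
    gap    : the discrepancy ||beta_0 - w_hat||
    Returns (c, r).
    """
    # First equation: tau*c^2 + (1 - kappa + tau)*c - kappa = 0, positive root,
    # written in a form that avoids cancellation for small tau.
    b = 1.0 - kappa + tau
    c = 2.0 * kappa / (b + math.sqrt(b * b + 4.0 * tau * kappa))
    # E[(w - prox(w))^2] = shrink * (sigma2 + r^2) because w - prox(w) = c*w/(1+c).
    shrink = (c / (1.0 + c)) ** 2
    r2 = (kappa * shrink * sigma2 + (tau * gap * c) ** 2) / (kappa ** 2 - kappa * shrink)
    return c, math.sqrt(r2)

c_small, r_small = quadratic_loss_limit(kappa=1.0, tau=1.0, sigma2=1.0, gap=0.5)
c_large, r_large = quadratic_loss_limit(kappa=1.0, tau=1.0, sigma2=1.0, gap=1.0)
```

In this special case $r^2$ is an increasing function of the discrepancy, since the discrepancy enters only through the numerator. As $\tau \to 0$ with $\kappa < 1$ and zero discrepancy, the formula reduces to $r^2 = \kappa\sigma^2/(1-\kappa)$, the familiar least-squares risk.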
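Remark 2.
The limits on the left-hand side of (10) exist because Assumption 1(f) guarantees convergence of the proportions associated with each pair $(\mathcal{L}(\epsilon_i), \mathcal{L}(\lambda_i))$, where $\mathcal{L}(\epsilon_i)$ and $\mathcal{L}(\lambda_i)$ denote the laws of $\epsilon_i$ and $\lambda_i$. For the second equation in (10), the ratio can be interpreted through the identity
$$\frac{\big[x - \operatorname{prox}\{c_\rho(\kappa) \lambda^2 \rho\}(x)\big]^2}{\lambda^2} = c_\rho^2(\kappa) \lambda^2 \psi^2\big(\operatorname{prox}\{c_\rho(\kappa) \lambda^2 \rho\}(x)\big),$$
which is well defined when $\lambda = 0$. Equivalently, (10) can be written as
$$\begin{cases} \displaystyle \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} E\Big[ \big(\operatorname{prox}\{c_\rho(\kappa) \lambda_i^2 \rho\}\big)'(W_i) \Big] = 1 - \kappa + \tau c_\rho(\kappa), \\[2ex] \displaystyle \kappa \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} E\Big[ c_\rho^2(\kappa) \lambda_i^2 \psi^2\big(\operatorname{prox}\{c_\rho(\kappa) \lambda_i^2 \rho\}(W_i)\big) \Big] + \tau^2 \|\beta_0 - \hat{w}\|^2 c_\rho^2(\kappa) = \kappa^2 r_\rho^2(\kappa). \end{cases}$$
This representation shows that the expectation in (10) is well defined, and Assumption 1(f) ensures that the relevant limits exist.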

3.3. Adaptive Aggregation Against Negative Transfer

We now develop an adaptive aggregation of Trans-RR with the single-task estimator to address negative transfer. Specifically, let $\hat{\beta}_{st}$ denote the single-task ridge-regularized robust estimator on the target sample,
$$\hat{\beta}_{st} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \rho(y_i - x_i^\top \beta) + \frac{\tau_{st}}{2} \|\beta\|^2 \tag{11}$$
for some constant $\tau_{st}$. Theorem 1 and the discussion above show that negative transfer may happen when the target–source discrepancy $\|\beta_0 - w_0\|$ or the source-stage estimation error is too large. We therefore propose the adaptive Trans-RR estimator to avoid negative transfer, defined for $\theta \in [0, 1]$ by
$$\hat{\beta}_{\mathrm{ada}}(\theta) = \theta \hat{\beta} + (1 - \theta) \hat{\beta}_{st}. \tag{12}$$
Intuitively, $\hat{\beta}_{\mathrm{ada}}(\theta)$ recovers Trans-RR at $\theta = 1$ and the single-task estimator at $\theta = 0$. Including both endpoints allows the procedure to fall back to either Trans-RR or the single-task estimator, while interior values allow partial transfer. We select the mixing weight $\theta$ from data by cross-validation and write $\hat{\beta}_{\mathrm{ada}} := \hat{\beta}_{\mathrm{ada}}(\hat{\theta})$ for the resulting estimator.
We tune the ridge penalties before selecting $\theta$. The source penalty $\tau_1$ is tuned by cross-validation on the source sample, yielding $\hat{w}$ via (3). With $\hat{w}$ fixed, the target penalty $\tau$ is tuned by cross-validation on the target sample through (4). The single-task penalty $\tau_{st}$ is tuned by cross-validation on the target sample. To select $\theta$, we then use a $K$-fold partition $\{V_k\}_{k=1}^{K}$ of the target sample, drawn independently of the partitions used to tune $\tau$ and $\tau_{st}$. Here, $V_k$ denotes the validation index set of fold $k$.
For each $k = 1, \ldots, K$, let $\hat{\beta}^{(k)}$ be the output of Algorithm 1 applied to the full source data and $\{(x_i, y_i) : i \notin V_k\}$ at the tuned $(\tau_1, \tau)$. Let $\hat{\beta}_{st}^{(k)}$ be the solution of (11) on $\{(x_i, y_i) : i \notin V_k\}$ at the tuned $\tau_{st}$. We reuse $\tau_1$, $\tau$, and $\tau_{st}$ across all $K$ folds rather than re-tuning per fold, which would multiply the penalty-tuning cost by $K$. We then choose
$$\hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in V_k} L\Big( y_i - x_i^\top \big[ \theta \hat{\beta}^{(k)} + (1 - \theta) \hat{\beta}_{st}^{(k)} \big] \Big),$$
where $L : \mathbb{R} \to \mathbb{R}_{\ge 0}$ is a validation loss and $\Theta \subset [0, 1]$ is a finite candidate set. A natural default is the absolute-error loss $L(t) = |t|$, which matches the criterion used to tune the ridge penalties. We refer to $\hat{\beta}_{\mathrm{ada}}$ as the adaptive Trans-RR estimator, denoted Trans-RR-Ada in the following numerical experiments. Algorithm 2 summarizes the procedure.
Algorithm 2: Adaptive Trans-RR algorithm
Input: target data $\{(x_i, y_i)\}_{i=1}^{n}$, source data $\{(x_i^{(1)}, y_i^{(1)})\}_{i=1}^{n_1}$, fold count $K$,
    candidate set $\Theta \subset [0, 1]$, validation loss $L$
Output: the adaptive estimator $\hat{\beta}_{\mathrm{ada}}$ and the selected weight $\hat{\theta}$
 Step 1. Tune $\tau_1$, $\tau$, and $\tau_{st}$ by cross-validation on the corresponding samples.
 Compute $\hat{\beta}$ from Algorithm 1 and $\hat{\beta}_{st}$ from (11).
 Step 2. Draw a $K$-fold partition $\{V_k\}_{k=1}^{K}$ of the target indices, independent of the
 partitions used in Step 1.
 Step 3. For each $k = 1, \ldots, K$, let $\hat{\beta}^{(k)}$ be the output of Algorithm 1 applied to the
 full source data and $\{(x_i, y_i) : i \notin V_k\}$ at the tuned $(\tau_1, \tau)$. Let $\hat{\beta}_{st}^{(k)}$ be the
 solution of (11) on $\{(x_i, y_i) : i \notin V_k\}$ at the tuned $\tau_{st}$.
 Step 4. Compute
$$\hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in V_k} L\Big( y_i - x_i^\top \big[ \theta \hat{\beta}^{(k)} + (1 - \theta) \hat{\beta}_{st}^{(k)} \big] \Big).$$
 Step 5. Output $\hat{\beta}_{\mathrm{ada}} = \hat{\theta} \hat{\beta} + (1 - \hat{\theta}) \hat{\beta}_{st}$ and $\hat{\theta}$.
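Step 4 of the adaptive procedure is a one-dimensional grid search once the fold-wise estimators are available. The sketch below assumes they have already been computed and are supplied as a list of tuples, one per fold, holding the validation rows, validation responses, the fold's Trans-RR fit, and the fold's single-task fit. The function name `select_theta` and this data layout are hypothetical conveniences, not part of the paper.

```python
def select_theta(folds, theta_grid, loss=abs):
    """Pick the mixing weight minimizing the pooled validation loss.

    folds      : list of (X_val, y_val, beta_trans, beta_single), one entry per fold,
                 where both estimators were fit WITHOUT the fold's validation rows
    theta_grid : finite candidate set inside [0, 1]
    loss       : validation loss L; absolute error by default
    """
    n = sum(len(y_val) for _, y_val, _, _ in folds)
    best_theta, best_score = None, float("inf")
    for theta in theta_grid:
        total = 0.0
        for X_val, y_val, b_tr, b_st in folds:
            # Convex combination of the two fold-wise estimators.
            mix = [theta * a + (1.0 - theta) * b for a, b in zip(b_tr, b_st)]
            for xi, yi in zip(X_val, y_val):
                total += loss(yi - sum(xj * mj for xj, mj in zip(xi, mix)))
        score = total / n
        if score < best_score:            # ties resolve to the smaller theta
            best_theta, best_score = theta, score
    return best_theta, best_score

# Toy check: responses generated exactly by the transfer fit, so theta = 1 wins.
X_val = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
folds = [(X_val, [1.0, 2.0, 3.0], [1.0, 2.0], [0.0, 0.0])]
grid = [i / 10 for i in range(11)]
theta_hat, cv_score = select_theta(folds, grid)
```

Because the fold-wise fits do not depend on $\theta$, they are computed once and reused for every candidate weight, so the search over $\Theta$ adds only the cost of evaluating predictions.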

3.4. Applicability and Limitations

Theorem 1 and Corollary 1 are derived under Assumptions 1 and 2, which include three structural conditions: identity covariance $\operatorname{cov}(x_i) = I_p$, the moderate–dimensional regime $p/n \to \kappa \in (0, \infty)$, and a twice-differentiable robust loss. Under these assumptions, the asymptotic $\ell_2$ estimation error of Trans-RR equals the deterministic limit $r_\rho(\kappa)$, in agreement with the numerical results of Section 4. When any of these assumptions fails, Theorem 1 no longer applies.
The adaptive aggregation of Section 3.3, by contrast, is constructed without invoking these structural assumptions. Its mixing weight $\hat{\theta}$ is selected by cross-validation on the target sample and serves as a data-driven safeguard against negative transfer. Section 4.4 provides numerical support under AR(1)-correlated predictors across all three noise cases: a transition between positive and negative transfer is observed near $h = 1$, and Trans-RR-Ada continues to track the better of the two base estimators.

4. Simulation

In this section, we conduct numerical studies to support the theoretical results. We set the dimension of both target and source data to be $p \in \{200, 400, 800\}$. We set $n = p, p/4$ and $n_1 = 2p, p/2$, corresponding to moderate–dimensional settings with $\kappa = 1, 4$ and $\kappa_1 = 1/2, 2$. To generate data, we set $x_i = \lambda_i X_i$ and $x_i^{(1)} = \lambda_i^{(1)} X_i^{(1)}$, where $X_i$ and $X_i^{(1)}$ have i.i.d. $N(0, 1)$ entries. We consider three different cases for the choices of $\lambda_i$’s, $\epsilon_i$’s, $\lambda_i^{(1)}$’s and $\epsilon_i^{(1)}$’s:
  • Case I: $\lambda_i = 1$ for $i = 1, \ldots, n$ and $\lambda_j^{(1)} = 1$ for $j = 1, \ldots, n_1$. The target errors $\epsilon_i$ are i.i.d. $N(0, 1)$, and the source errors $\epsilon_j^{(1)}$ are i.i.d. $N(0, 2^2)$.
  • Case II: The variables $\lambda_i$ and $\lambda_j^{(1)}$ are i.i.d. $\mathrm{Unif}(0, \sqrt{3})$, while $\epsilon_i$ and $\epsilon_j^{(1)}$ are i.i.d. $\mathrm{Cauchy}(0, 1)$ and $\mathrm{Cauchy}(0, 2)$, respectively.
  • Case III: In both the target and source studies, half of the observations are generated as in Case I and the other half are generated as in Case II.
Case I is a standard Gaussian setup for linear regression. Case II features a non-Gaussian design and heavy-tailed errors. Case III is a mixture of the two cases and is designed to test the effectiveness of our theoretical results under non-identical $x_i$’s and $\epsilon_i$’s.
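The three designs can be generated with the standard library alone. In the sketch below, the upper endpoint of the uniform law in Case II is taken to be $\sqrt{3}$, an assumption chosen so that $E(\lambda_i^2) = 1$ as Assumption 1(d) requires. Cauchy draws use the inverse-CDF formula, and the function names and the `source` flag (which switches to the source-study noise scale) are hypothetical.

```python
import math
import random

def cauchy(rng, scale):
    """Draw from Cauchy(0, scale) via the inverse CDF."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))

def simulate_case(rng, m, coef, case, source=False):
    """Generate m observations (x_i, y_i) with x_i = lambda_i * X_i.

    case   : "I", "II", or "III"
    source : if True, use the source noise scale (2 instead of 1)
    """
    noise_scale = 2.0 if source else 1.0
    X, y = [], []
    for i in range(m):
        # Case III: first half of the sample as in Case I, second half as in Case II.
        heavy = case == "II" or (case == "III" and i >= m // 2)
        if heavy:
            lam = rng.uniform(0.0, math.sqrt(3.0))   # assumed endpoint, gives E[lam^2] = 1
            eps = cauchy(rng, noise_scale)
        else:
            lam = 1.0
            eps = rng.gauss(0.0, noise_scale)
        xi = [lam * rng.gauss(0.0, 1.0) for _ in coef]
        X.append(xi)
        y.append(sum(a * b for a, b in zip(xi, coef)) + eps)
    return X, y

rng = random.Random(1)
X_t, y_t = simulate_case(rng, 10, [0.1] * 4, "III")                # target sample
X_s, y_s = simulate_case(rng, 20, [0.1] * 4, "III", source=True)   # source sample
```

Which observations form each half in Case III is not specified in the text; splitting by index is one convenient choice and does not affect the proportions that enter the theory.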

4.1. Validity of Theoretical Results

We first evaluate the validity of the proposed scalar $r_\rho$ in Theorem 1. For each setting, we generate $\beta$ and $w$ once with i.i.d. $\mathrm{Unif}(0, 1)$ entries and set $\beta_0 = \beta/\sqrt{n}$ and $w_0 = w/\sqrt{n}$. This construction yields diffuse coefficients whose Euclidean norms remain bounded as $n$ grows. These coefficient vectors are fixed across the 1000 replications, while the target and source samples are regenerated in each replicate. In each replicate, we first compute $\hat{w}$ from the source sample and then obtain $\hat{\beta}$ by applying Algorithm 1, using the smoothed Huber loss (6) for both $\tilde{\rho}$ and $\rho$ with parameters $\delta = 1.35$ and $\eta = 0.1$. We fix $\tau = \tau_1 = 1$.
Figure 1 presents boxplots of the estimation error $\|\hat{\beta} - \beta_0\|^2$ for Cases I–III and $\kappa = 1, 4$. The red point in each boxplot marks the theoretical value $r_\rho^2$, obtained by numerically solving the system in Theorem 1 under the corresponding simulation specification. We observe that the empirical distribution of $\|\hat{\beta} - \beta_0\|^2$ is centered close to this value, and its dispersion decreases as $n$ and $p$ become larger. Table 1 shows the mean and standard deviation (SD) of $\|\hat{\beta} - \beta_0\|^2$ (denoted as $\hat{r}^2$) and the corresponding $r_\rho^2$ for each setup. As dimensionality increases, that is, as $\kappa$ increases from 1 to 4, both the mean error and its variability increase, indicating that estimation becomes more difficult in more challenging moderate–dimensional regimes. The average estimation error also grows with heavier-tailed errors, highlighting the difficulty of estimation under such conditions. Results under Case III demonstrate that Theorem 1 is effective in handling non-identical $x_i$’s and $\epsilon_i$’s. Overall, the findings in Figure 1 and Table 1 align well with the theoretical predictions of Theorem 1.

4.2. Theoretical Estimation Error Curves

We take $\rho$ to be our recommended smoothed Huber loss (6) with $(\delta, \eta) = (1.35, 0.1)$ and fix $\kappa = 1$. We numerically solve the associated scalar system while varying the discrepancy term $\|\beta_0 - \hat{w}\|$ that enters the second equation of Theorem 1. Figure 2 plots the resulting curves of $r_\rho$ for five values of $\tau$ under Cases I–III. In all three cases, the curves are monotonically increasing over the displayed range, so a larger $\|\beta_0 - \hat{w}\|$ corresponds to a larger asymptotic estimation error in this setting.

4.3. Comparison with Existing Methods

To evaluate when transfer is beneficial, we compare our method with several competing procedures across the three scenarios described above. We set $p = 400$, $n = p$ and $n_1 = 2p$. We generate $\beta_0 = \beta/\|\beta\|$, where $\beta = (\beta_1, \ldots, \beta_p)^\top$ has i.i.d. $\mathrm{Unif}(0, 1)$ entries. To control the transfer strength $h = \|\delta_0\|$, we set
$$\delta_0 = \exp(c_d) \cdot 1_p / \sqrt{p}, \quad c_d \in \{-2.0, -1.5, -1.0, -0.5, 0, 0.5, 1.0\},$$
and define the source coefficient by $w_0 = \beta_0 - \delta_0$. By varying $c_d$ from $-2.0$ to $1.0$, we obtain $h$ ranging from approximately $0.135$ to $2.718$, providing a comprehensive evaluation across different levels of source–target similarity. For each value of $c_d$, the pair $(\beta_0, w_0)$ is fixed across the 500 replications, and only the data are regenerated.
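The stated range of $h$ follows because a constant vector with entries $\exp(c_d)/\sqrt{p}$ has Euclidean norm exactly $\exp(c_d)$, whatever the value of $p$. A short check, with hypothetical helper names:

```python
import math

def discrepancy(p, c_d):
    """delta_0 = exp(c_d) * 1_p / sqrt(p): a constant vector of length p."""
    return [math.exp(c_d) / math.sqrt(p)] * p

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

c_grid = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0]
h_values = [l2_norm(discrepancy(400, c_d)) for c_d in c_grid]
# h_values[k] equals exp(c_grid[k]) up to rounding, so h runs from exp(-2) to exp(1).
```

Spacing $c_d$ evenly therefore spaces $h$ evenly on a logarithmic scale, which matches the logarithmic axis used for the error plots.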
We compare four ridge-type methods (Single-RR, Trans-RR, Trans-RR-Ada, and Pooled-RR) and two lasso baselines (Single-Lasso and Trans-Lasso):
  • Single-RR: The single-task estimator β ^ s t in (11), fit to the target sample alone.
  • Trans-RR: The two-stage estimator β ^ = w ^ + δ ^ in (5), computed by Algorithm 1.
  • Trans-RR-Ada: The adaptive aggregate β ^ ada in (12), computed by Algorithm 2 with K = 5 , absolute-error loss L ( t ) = | t | , and weight grid Θ = { 0 , 0.1 , , 0.9 , 1 } .
  • Pooled-RR: The same robust ridge fit applied to the concatenation of the source and target samples.
  • Single-Lasso: The lasso on the target sample, with its regularization parameter chosen by f i v e -fold cross-validation.
  • Trans-Lasso: The two-stage transfer-lasso of [8], in which a cross-validated source-stage lasso estimates w 0 and a cross-validated target-stage lasso fits the residual y i x i w ^ on the target sample.
The four ridge-type methods all use the smoothed Huber loss (6) with $(\delta, \eta) = (1.35, 0.1)$. Each ridge penalty is tuned by five-fold cross-validation under the mean absolute error criterion, over the grid $\{3^a : a = -2, -3/2, \dots, 2\}$. For Trans-RR, the source penalty $\tau_1$ in (3) is tuned on the source sample, yielding $\hat{w}$ as the minimizer of (3) at the tuned $\tau_1$. With $\hat{w}$ held fixed, the target penalty $\tau$ in (4) is then tuned on the target sample.
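The two-stage structure of Trans-RR (source fit, then a ridge correction fitted to target residuals) can be illustrated with the following minimal sketch. Several pieces are assumptions rather than the paper's implementation: the objective is taken to be $(1/n)\sum_i \rho(y_i - x_i^\top b) + (\tau/2)\|b\|^2$, the pseudo-Huber loss stands in for the smoothed Huber loss (6), plain gradient descent is used as the solver, and the names `robust_ridge` and `trans_rr` are hypothetical.

```python
import numpy as np

def psi_pseudo_huber(t, delta=1.35):
    """Derivative of the pseudo-Huber loss delta^2 * (sqrt(1 + (t/delta)^2) - 1)."""
    return t / np.sqrt(1.0 + (t / delta) ** 2)

def robust_ridge(X, y, tau, n_iter=2000):
    """Gradient descent on (1/n) sum rho(y_i - x_i'b) + (tau/2) ||b||^2.

    The normalization of the objective is an assumption of this sketch.
    """
    n, p = X.shape
    # Step size 1/L, where L bounds the gradient's Lipschitz constant (psi' <= 1).
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + tau)
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ psi_pseudo_huber(y - X @ b) / n + tau * b
        b -= step * grad
    return b

def trans_rr(X_src, y_src, X_tgt, y_tgt, tau1, tau):
    """Two-stage estimate w_hat + delta_hat (hypothetical helper)."""
    w_hat = robust_ridge(X_src, y_src, tau1)                      # source stage
    delta_hat = robust_ridge(X_tgt, y_tgt - X_tgt @ w_hat, tau)   # target stage on residuals
    return w_hat + delta_hat
```

In practice `tau1` and `tau` would each be chosen by cross-validation over the penalty grid, with `w_hat` held fixed while `tau` is tuned.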
Performance is summarized by the relative estimation error $\|\hat{\beta} - \beta_0\|^2 / \|\beta_0\|^2$. Figure 3 presents boxplots of these errors on a logarithmic scale, for varying values of $h$ across Cases I–III. We report all six estimators under Case I. In Cases II and III, the noise is heavy-tailed, and the lasso methods are not designed for robustness. Their fits failed to converge to stable estimates, so we restrict these two cases to the ridge-type methods.
Under Case I , Trans-RR compares favorably with both lasso baselines, consistent with the non-sparse structure assumed throughout the paper. Among the ridge-type methods, Trans-RR outperforms Pooled-RR for small and moderate h across all three cases. Pooled-RR fits the source and target observations together as if they shared one coefficient, so its error reflects two mismatches: the gap between β 0 and w 0 , and the difference between source and target noise levels. As h decreases, the gap between Pooled-RR and Trans-RR narrows, in line with the theoretical expectation that smaller h indicates greater similarity between the two domains.
More importantly, the comparison with Single-RR reveals the transition between positive and negative transfer. When h is small, Trans-RR achieves the lowest relative error among the ridge-type methods. As h grows, its advantage shrinks and eventually reverses. In our experiments, this turnover occurs near h = 1 . This is consistent with the numerical evidence in Section 3, since transfer is more favorable when the source-stage estimator is closer to the target coefficient. The same qualitative pattern appears in Cases II and III , where the transition near h = 1 persists under heavy-tailed errors and heterogeneous designs.
Trans-RR-Ada provides a data-driven safeguard against this negative transfer. Figure 3 shows that it tracks the better of Single-RR and Trans-RR for every value of h. For small h, Trans-RR-Ada essentially coincides with Trans-RR, while for large h it coincides with Single-RR. At the transition h = 1 , where Single-RR and Trans-RR are comparable, Trans-RR-Ada stays close to the better one.
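The weight selection behind Trans-RR-Ada can be sketched as follows. This is not Algorithm 2 itself, whose details are given elsewhere in the paper. It is a minimal illustration assuming that both estimators are refit on each training fold, and `fit_trans` and `fit_single` are hypothetical callables standing in for the Trans-RR and Single-RR fits.

```python
import numpy as np

def select_theta(X, y, fit_trans, fit_single, K=5, seed=0):
    """Pick theta in {0, 0.1, ..., 1} by K-fold CV with absolute-error loss.

    fit_trans and fit_single are hypothetical callables
    (X_train, y_train) -> coefficient vector.
    """
    n = X.shape[0]
    thetas = np.linspace(0.0, 1.0, 11)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    cv_loss = np.zeros(len(thetas))
    for val_idx in folds:
        train = np.setdiff1d(np.arange(n), val_idx)
        pred_trans = X[val_idx] @ fit_trans(X[train], y[train])
        pred_single = X[val_idx] @ fit_single(X[train], y[train])
        for j, th in enumerate(thetas):
            # Mixing predictions equals predicting with the mixed coefficient vector.
            mix = th * pred_trans + (1.0 - th) * pred_single
            cv_loss[j] += np.abs(y[val_idx] - mix).sum()
    return thetas[int(np.argmin(cv_loss))]
```

A selected weight near 1 reproduces Trans-RR and a weight near 0 reproduces Single-RR, which matches the behaviour described above for small and large $h$.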

4.4. Robustness to Non-Identity Covariance

The asymptotic theory in Section 3 assumes $\mathrm{cov}(x_i) = I_p$. To verify that our methods remain effective under non-identity covariance, we re-run the comparison of Section 4.3 under the AR(1) covariance $\Sigma_{jk} = \rho^{|j-k|}$ with $\rho = 0.6$, across all three noise cases I, II, and III. All other settings, including the four ridge-type methods Single-RR, Trans-RR, Trans-RR-Ada, and Pooled-RR and the tuning protocol, match Section 4.3.
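For concreteness, an AR(1) covariance and a design with that row covariance can be generated as below. This is a sketch under the assumption of Gaussian rows. The paper's designs may include additional scale factors, and the function names are hypothetical.

```python
import numpy as np

def ar1_covariance(p, rho=0.6):
    """Matrix with entries rho^{|j-k|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def sample_ar1_design(n, p, rho, rng):
    """Draw n Gaussian rows whose covariance is the AR(1) matrix."""
    chol = np.linalg.cholesky(ar1_covariance(p, rho))  # Sigma = chol @ chol.T
    return rng.standard_normal((n, p)) @ chol.T
```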
Figure 4 reports the resulting boxplots. The qualitative pattern of the i.i.d. comparison persists in all three cases. For small h, Trans-RR achieves the lowest error among the ridge-type methods. As h grows, its advantage shrinks and eventually reverses, with the transition again occurring near h = 1 . Trans-RR-Ada tracks the better of Single-RR and Trans-RR across all values of h. The negative-transfer transition and the safeguard role of Trans-RR-Ada are therefore not specific to the identity–covariance assumption underlying Theorem 1. This suggests that Trans-RR may remain effective under non-identity covariance.

4.5. Sensitivity to Tuning Choices

The comparison in Figure 3 fixes the smoothed Huber loss with $(\delta, \eta) = (1.35, 0.1)$ and selects each ridge penalty by five-fold cross-validation under the mean absolute error criterion, on a common geometric grid of nine values from $1/9$ to $9$. This subsection examines whether the qualitative findings of that comparison are stable under perturbations to four tuning choices: the smoothed Huber parameters $(\delta, \eta)$, the cross-validation criterion, the robust loss family, and the ridge penalty grid. Each perturbation re-runs the full simulation with $M = 500$ replications per setting.

4.5.1. Choice of ( δ , η )

The default δ = 1.35 is the standard Huber tuning. The smoothing parameter η = 0.1 closely approximates the unsmoothed Huber loss and keeps ρ η twice continuously differentiable, as required by Assumption 1(b). In the sensitivity experiment, we vary δ over { 1.0 , 1.35 , 2.0 } and η over { 0.05 , 0.10 , 0.20 } , giving nine ( δ , η ) pairs, and re-run the full simulation across the three cases and seven discrepancy values h for every pair. Within each ( case , h ) combination, varying the ( δ , η ) pair over the nine settings changes each method’s mean error by less than 7 % (median 1.5 % ) of its mean. The four-method ranking matches the default ( 1.35 , 0.10 ) ranking in 171 of the 189 ( case , h , ( δ , η ) ) cells. The qualitative findings are therefore stable under this perturbation. Table 2 displays the four method means as a ( δ , η ) heatmap at h = 1 in each of the three cases. Within each 3 × 3 grid, the entries vary only slightly and the four-method ordering is the same in every cell, illustrating the two findings stated above. Appendix B reports the heatmaps at the other six values of h, where the same pattern holds.

4.5.2. Choice of Cross-Validation Criterion

The default selects each ridge penalty by five-fold cross-validation under the mean absolute error criterion. We re-run the simulation with every cross-validation loss changed to mean squared error, keeping the smoothed Huber loss for estimation. Table 3 compares the mean estimation errors under the two criteria across all cases and values of $h$.
Under Case I , the two criteria yield nearly identical mean errors and the same four-method ranking at every h. Under Cases II and III , the mean errors of all four methods become substantially larger, and two qualitative findings of Section 4.3 no longer hold. First, Trans-RR has a larger mean error than Single-RR at every h in both heavy-tailed cases, so the range on which transfer helps disappears entirely. Second, Trans-RR-Ada no longer approximates min(Single-RR, Trans-RR) in most settings of the heavy-tailed cases. This may result from the heavy tails of Cauchy errors, under which MSE-based cross-validation is more sensitive to extreme residuals than MAE-based cross-validation. MAE-CV is therefore the appropriate default for heavy-tailed errors, and the conclusions of Section 4.3 are specific to this choice.

4.5.3. Choice of Robust Loss

In order to assess the sensitivity to the specific form of the robust loss, we replace the smoothed Huber loss in (6) with the pseudo-Huber loss $\rho_P(t; \delta) = \delta^2 \left( \sqrt{1 + (t/\delta)^2} - 1 \right)$ at the same $\delta = 1.35$. Table 4 compares the mean estimation errors under the two losses across all cases and values of $h$.
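The pseudo-Huber loss and its derivative are simple to evaluate directly. The short sketch below (function names are illustrative) shows the two regimes that make the loss robust: it is approximately quadratic near zero and approximately linear with slope $\delta$ for large residuals.

```python
import math

def pseudo_huber(t, delta=1.35):
    """rho_P(t; delta) = delta^2 * (sqrt(1 + (t/delta)^2) - 1)."""
    return delta ** 2 * (math.sqrt(1.0 + (t / delta) ** 2) - 1.0)

def pseudo_huber_psi(t, delta=1.35):
    """Derivative of rho_P with respect to t; bounded in absolute value by delta."""
    return t / math.sqrt(1.0 + (t / delta) ** 2)
```

Since the derivative is bounded by $\delta$, a single extreme residual has limited influence on the fitted coefficients.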
The mean errors are nearly identical to those under the smoothed Huber loss across all 21 settings. The four-method ranking matches the default in nearly every setting, and the few mismatches involve methods whose mean errors are essentially tied. The negative-transfer transition stays at the same value of h in every case, and Trans-RR-Ada continues to track min(Single-RR, Trans-RR) as closely as under the default. The qualitative findings of Section 4.3 are insensitive to this choice, consistent with Theorem 1.

4.5.4. Choice of Ridge Penalty Grid

The four ridge penalties $\tau_1$, $\tau$, $\tau_{st}$, and $\tau_p$ are tuned by five-fold cross-validation over a common geometric grid of nine values from $1/9$ to $9$. We probe the sensitivity to this grid in two complementary ways: extending the grid to wider values, and forcing all four penalties to a common fixed value with no cross-validation.
We first extend the grid to thirteen values from 1 / 27 to 27 on the same geometric scale, which contains the default grid as a strict subset. Table 5 compares the mean estimation errors under the default and the wide grid across all cases and values of h. The impact on estimation is negligible: each method’s mean error barely changes across settings, the four-method ranking is unchanged except at settings with near-tied means, the negative-transfer transition does not move, and Trans-RR-Ada continues to track min(Single-RR, Trans-RR) as closely as under the default. The default grid is therefore wide enough on both sides, and the qualitative findings of Section 4.3 do not depend on the choice of upper or lower endpoint.
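The two grids can be written out explicitly. The snippet below is a small illustration (variable names are assumptions) confirming that the wide grid extends the default one by two values on each side.

```python
# Default grid: 3^a for a = -2, -3/2, ..., 2 (nine values from 1/9 to 9).
default_grid = [3.0 ** (a / 2) for a in range(-4, 5)]

# Wide grid: same geometric spacing, a = -3, -5/2, ..., 3 (thirteen values from 1/27 to 27).
wide_grid = [3.0 ** (a / 2) for a in range(-6, 7)]

# Every default value also appears in the wide grid.
is_strict_subset = set(default_grid) < set(wide_grid)
```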
We next force all four ridge penalties to a common fixed value $\tau \in \{1/3, 1, 3, 9\}$, removing the ridge cross-validation entirely. Trans-RR-Ada's mixing weight $\theta$ is still selected by five-fold cross-validation on the target sample (Algorithm 2). Table 6 reports the mean estimation errors at the four fixed $\tau$ values across all cases and values of $h$. At every fixed $\tau$, the mean error of Trans-RR grows monotonically with $h$ in all three cases, consistent with the theoretical risk curves of Figure 2. The absolute levels and the four-method ranking do vary substantially across the four $\tau$ values, but the qualitative dependence on $h$ predicted by the theory is preserved at every $\tau$.
The Trans-RR-Ada safeguard nevertheless remains effective: its mean error stays close to min(Single-RR, Trans-RR) in every one of the 84 settings, just as under cross-validated penalties. The agreement between Trans-RR-Ada and min(Single-RR, Trans-RR) is therefore robust to a misspecified ridge penalty.

5. Real Data Analysis

We evaluate the proposed transfer procedure on the near-infrared (NIR) spectral dataset from the 2002 International Diffuse Reflectance Conference (IDRC) “Shootout” competition. The data consist of pharmaceutical tablet measurements collected by two spectrometers, with ASSAY as the response. For each instrument, the dataset contains a training sample of size 460 and an external test sample of size 155. Each spectrum is recorded at 650 wavelengths, yielding a moderate–dimensional regression problem.
Let X and X 1 denote the spectra from the two instruments. We consider two transfer directions. In Direction A, X is the target domain and X 1 is the source domain. In Direction B, the roles are reversed. In each repetition, we randomly split the training sample of the target instrument into two parts of sizes 160 and 300. The subset of size 160 is used as the target training sample. The remaining 300 tablets are matched with their measurements from the other instrument to form the source training sample. When X is the target domain, the target fit is constructed from 160 observations in X, and the source fit is constructed from the corresponding 300 observations in X 1 . The same scheme is applied symmetrically in the reverse direction. The external test sample of the target instrument is used for evaluation. We repeat this procedure 20 times to reduce Monte Carlo variability.
The preprocessing used in both transfer directions has two stages. First, we thin the spectrum by retaining every fourth wavelength of the original 650, yielding $p = 163$ predictors. Two considerations motivate this thinning. Adjacent NIR wavelengths record highly overlapping absorption signals, so a four-fold thinning preserves nearly all the spectral information. This is standard practice in NIR chemometrics. A step of four also gives $p = 163 \approx n = 160$, the same $p/n \approx 1$ setting as in our simulations.
Second, for each domain, predictors are whitened using its own training sample,
$\tilde{X} = (X - \hat{\mu}) \hat{\Sigma}^{-1/2},$
where $\hat{\mu}$ and $\hat{\Sigma}$ are estimated from that training sample, and the same transformation is applied to the associated test sample. Whitening decorrelates the highly collinear NIR predictors so that the sample second-moment matrix of $\tilde{X}$ approximates $I_p$, which is the design assumption underlying Theorem 1. The response is centered using the mean of the corresponding training response, and the same centering constant is used for the associated test response. All preprocessing parameters are thus estimated only on training data and carried over to the test set, avoiding information leakage.
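The two preprocessing stages can be sketched as follows. This is only an illustration with hypothetical function names. In particular, the text does not state how $\hat{\Sigma}^{-1/2}$ is computed when the sample covariance is rank-deficient (as can happen when $p$ exceeds the training size), so the pseudo-inverse square root used here, along with the $1/n$ normalization, is an assumption.

```python
import numpy as np

# Stage 1: keep every fourth wavelength, starting at offset 0 (163 of 650 columns).
keep_cols = np.arange(0, 650, 4)

def fit_whitener(X_train, rel_tol=1e-10):
    """Estimate centering and whitening from training rows only.

    Eigen-directions with (numerically) zero variance are dropped, i.e. a
    pseudo-inverse square root is used; this choice is an assumption.
    """
    mu = X_train.mean(axis=0)
    Xc = X_train - mu
    cov = Xc.T @ Xc / X_train.shape[0]
    vals, vecs = np.linalg.eigh(cov)
    keep = vals > rel_tol * vals.max()
    inv_sqrt_vals = np.zeros_like(vals)
    inv_sqrt_vals[keep] = 1.0 / np.sqrt(vals[keep])
    inv_sqrt = (vecs * inv_sqrt_vals) @ vecs.T   # V diag(d^{-1/2}) V'
    return mu, inv_sqrt

def apply_whitener(X, mu, inv_sqrt):
    """Apply the training-set transformation to any sample (train or test)."""
    return (X - mu) @ inv_sqrt
```

Fitting the transformation on the training rows and then calling `apply_whitener` on the test rows mirrors the leakage-free protocol described above.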
We compare six methods: Single-RR, Trans-RR, the adaptive Trans-RR estimator β ^ ada from Section 3.3 (denoted Trans-RR-Ada in the table), Pooled-RR, Single-Lasso, and Trans-Lasso. For the ridge-type methods, we use the smoothed Huber loss in (6) with ( δ , η ) = ( 1.35 , 0.1 ) . All tuning parameters are selected by 5-fold cross-validation with mean absolute error as the validation criterion, from the common grid
G = { 10 5 , 10 4.5 , 10 4 , , 10 1 } .
Performance is measured by the test-set RMSE,
$\mathrm{RMSE} = \sqrt{\frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} (y_i - \hat{y}_i)^2},$
where $y_i$ and $\hat{y}_i$ are on the original ASSAY scale (predictions are uncentered before evaluation) and $n_{\mathrm{test}} = 155$ in both transfer directions.
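As a small illustration of the evaluation step, the following sketch adds the training-response mean back to centered predictions and then computes the RMSE. The helper names are hypothetical.

```python
import math

def uncenter(pred_centered, train_mean):
    """Map predictions on the centered scale back to the original response scale."""
    return [v + train_mean for v in pred_centered]

def rmse(y_true, y_pred):
    """Root mean squared error over the test sample."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)
```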
Table 7 reports the average RMSE and its standard deviation over the 20 repetitions, and Figure 5 displays the corresponding distributional information. Trans-RR achieves the smallest mean RMSE in both transfer directions, with relatively small variability across the 20 splits. The adaptive variant Trans-RR-Ada matches Trans-RR within 0.03 in mean RMSE in both directions, indicating that the source and target are close enough for transfer to be uniformly beneficial on this dataset. Pooled-RR ranks third behind the two transfer-ridge methods and is uniformly dominated by Trans-RR, suggesting that naive pooling fails to account for cross-instrument differences. The lasso methods are less competitive overall, especially Single-Lasso. This is consistent with the non-sparse structure of NIR spectra, where relevant information is distributed over broad wavelength regions rather than concentrated on a small subset of predictors.
We further checked the procedure for four starting offsets of the every-fourth-wavelength thinning and for the same procedure without whitening. Across all five resulting preprocessing variants and both transfer directions, Trans-RR and Trans-RR-Ada are the top two methods by mean RMSE, and their mean RMSE values differ by at most 0.05 . Mean RMSE for all six methods in each variant and direction is reported in Appendix C. The results, together with these robustness checks, therefore support the effectiveness of the proposed Trans-RR method.

6. Discussion

This paper introduces Trans-RR, a robust transfer-learning approach for moderate–dimensional linear regression. It extends transfer-learning ideas of [7,8] to a setting with non-sparse coefficients and heavy-tailed errors, without relying on sparsity assumptions or moment restrictions on the errors. The theory and the numerical results show that negative transfer can occur when the source study is not sufficiently informative for the target study. To guard against this, we also propose an adaptive aggregation of Trans-RR with the single-task estimator that selects the mixing weight by cross-validation. The theoretical results, simulation studies, and real-data analysis support the effectiveness of both procedures. A natural direction for further study is to identify the choice of loss function that minimizes the asymptotic estimation error in our framework.

Author Contributions

Conceptualization, L.L. and X.G.; Methodology, L.L., X.G. and Z.L.; Software, L.L.; Validation, L.L.; Formal analysis, L.L.; Data curation, L.L.; Writing—original draft, L.L.; Writing—review & editing, L.L., X.G. and Z.L.; Supervision, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 12471267.

Data Availability Statement

Publicly available datasets were analyzed in this study. The near-infrared spectral data are from the 2002 International Diffuse Reflectance Conference (IDRC) “Shootout” competition and can be downloaded from https://www.eigenvector.com/data/tablets/index.html, accessed on 7 May 2026. The code presented in this study is available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Notation

For ease of reference, Table A1 collects the main symbols used throughout the paper, grouped by topic. Additional proof-level notation is introduced at the beginning of Appendix D.
Table A1. Summary of the main symbols used in the paper.

| Symbol | Meaning |
|---|---|
| Dimensions and asymptotic regime | |
| $p$, $n$, $n_1$ | predictor dimension; target and source sample sizes |
| $\kappa, \kappa_1 \in (0, \infty)$ | limits of $p/n$ and $p/n_1$ |
| Target study (1) ($i = 1, \dots, n$) and source study (2) ($i = 1, \dots, n_1$) | |
| $x_i = \lambda_i X_i$, $x_i^{(1)} = \lambda_i^{(1)} X_i^{(1)}$ | $i$-th target/source predictor, with components $X_i, X_i^{(1)} \in \mathbb{R}^p$ and scales $\lambda_i, \lambda_i^{(1)} \in \mathbb{R}$ (Assumptions 1(c) and 2(c)) |
| $y_i$, $\epsilon_i$, $y_i^{(1)}$, $\epsilon_i^{(1)}$ | target/source response and error ($\epsilon_i$ may be heavy-tailed) |
| $\beta_0, w_0 \in \mathbb{R}^p$ | target/source regression coefficient (non-sparse) |
| $\rho$, $\psi = \rho'$, $\tilde{\rho}$, $\tilde{\psi} = \tilde{\rho}'$ | target-stage/source-stage loss and its derivative |
| $\tau, \tau_{st}, \tau_1 > 0$ | target-stage ridge for Trans-RR and for $\hat{\beta}_{st}$, source-stage ridge for $\hat{w}$ |
| $\hat{w}$, $\hat{\delta}$, $\hat{\beta} = \hat{w} + \hat{\delta}$, $\hat{\beta}_{st}$ | source-stage, Step 2, Trans-RR (5), and single-task (11) estimators |
| Source–target discrepancy | |
| $\delta_0 = \beta_0 - w_0$, $h = \|\delta_0\|$ | discrepancy vector and its magnitude |
| Adaptive aggregation (Section 3.3) | |
| $\hat{\beta}_{\mathrm{ada}}(\theta) = \theta \hat{\beta} + (1 - \theta) \hat{\beta}_{st}$, $\hat{\beta}_{\mathrm{ada}}$ | adaptive Trans-RR estimator and its instance at $\theta = \hat{\theta}$, (12) |
| $\theta \in [0, 1]$, $\hat{\theta}$, $\Theta \subset [0, 1]$ | mixing weight, its cross-validation choice, and candidate set |
| $K$, $\{V_k\}_{k=1}^{K}$, $L : \mathbb{R} \to \mathbb{R}_{\ge 0}$ | fold count, target-sample fold partition, validation loss in (13) |
| Smoothed Huber family (choices for $\rho$ and $\tilde{\rho}$) | |
| $\delta > 0$, $\eta \in (0, \delta]$ | Huber transition and smoothing parameters (scalar $\delta$ distinct from vector $\delta_0$) |
| $\rho_H$, $\rho_\eta$, $\rho_P$ | Huber, smoothed Huber (6), and Pseudo-Huber (8) losses |
| Asymptotic quantities (Theorem 1) | |
| $r_\rho(\kappa)$, $c_\rho(\kappa)$ | asymptotic value of $\|\hat{\beta} - \beta_0\|$ and companion scalar in the fixed-point system |
| $Z_i$, $W_i = \epsilon_i + r_\rho(\kappa) \lambda_i Z_i$ | standard normal (independent of $\epsilon_i$, $\lambda_i$) and auxiliary random variable |
| $\mathrm{prox}(c\rho)$ | proximal mapping: $\mathrm{prox}(c\rho)(x) = \arg\min_y \{ c\rho(y) + (x - y)^2 / 2 \}$ |
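The proximal mapping in the last row of Table A1 rarely has a closed form for robust losses, but it is easy to evaluate numerically. The sketch below is an illustration, not the paper's code: it solves the first-order condition $y + c\,\psi(y) = x$ by bisection, assuming $\psi = \rho'$ is nondecreasing with $\psi(0) = 0$. The function name `prox` and its arguments are hypothetical.

```python
def prox(x, c, psi, n_iter=200):
    """Evaluate argmin_y { c * rho(y) + (x - y)^2 / 2 } via its first-order condition.

    Solves y + c * psi(y) = x by bisection. Assumes psi = rho' is nondecreasing
    with psi(0) = 0, so the root lies between 0 and x.
    """
    lo, hi = (0.0, x) if x >= 0 else (x, 0.0)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mid + c * psi(mid) < x:
            lo = mid   # root is to the right of mid
        else:
            hi = mid   # root is at or to the left of mid
    return 0.5 * (lo + hi)
```

For the quadratic loss $\rho(y) = y^2/2$, where $\psi(y) = y$, the mapping reduces to $x / (1 + c)$, which gives a quick sanity check for the routine.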

Appendix B. Additional (δ, η) Heatmaps

This appendix reports the ( δ , η ) heatmaps at the six values of h not shown in Table 2 of the main text. The pattern at every h matches the one at h = 1 . Each method’s mean error varies by less than 7 % across the nine ( δ , η ) pairs, and the four-method ranking matches the default ( 1.35 , 0.10 ) ranking in the large majority of cells. The few rank swaps that do occur happen at small h in Cases II and III , where Pooled-RR, Trans-RR, and Trans-RR-Ada have nearly tied mean errors.
Table A2. Sensitivity of relative estimation error to the smoothed Huber parameters $(\delta, \eta)$ at $h = 0.135$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications.

Case I (Gaussian Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.653/0.560/0.561/0.604 | 0.654/0.561/0.562/0.604 | 0.656/0.562/0.563/0.606 |
| 1.35 | 0.644/0.549/0.550/0.600 | 0.645/0.550/0.550/0.600 | 0.646/0.551/0.551/0.601 |
| 2.00 | 0.641/0.537/0.538/0.594 | 0.640/0.537/0.539/0.594 | 0.641/0.538/0.539/0.593 |

Case II (Cauchy Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.884/0.819/0.821/0.809 | 0.884/0.819/0.821/0.808 | 0.884/0.819/0.821/0.808 |
| 1.35 | 0.883/0.817/0.819/0.812 | 0.883/0.816/0.819/0.811 | 0.883/0.816/0.818/0.811 |
| 2.00 | 0.890/0.821/0.823/0.822 | 0.889/0.821/0.823/0.821 | 0.889/0.821/0.823/0.821 |

Case III (Mixture Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.779/0.695/0.694/0.706 | 0.780/0.695/0.695/0.706 | 0.780/0.695/0.695/0.706 |
| 1.35 | 0.781/0.693/0.692/0.705 | 0.781/0.693/0.692/0.706 | 0.781/0.692/0.692/0.705 |
| 2.00 | 0.794/0.701/0.701/0.714 | 0.793/0.700/0.701/0.714 | 0.792/0.700/0.700/0.713 |
Table A3. Sensitivity of relative estimation error to the smoothed Huber parameters $(\delta, \eta)$ at $h = 0.223$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications.

Case I (Gaussian Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.653/0.576/0.577/0.634 | 0.654/0.577/0.578/0.635 | 0.656/0.578/0.579/0.635 |
| 1.35 | 0.644/0.565/0.565/0.628 | 0.645/0.565/0.566/0.628 | 0.646/0.566/0.567/0.629 |
| 2.00 | 0.641/0.554/0.556/0.624 | 0.640/0.555/0.556/0.624 | 0.641/0.555/0.556/0.624 |

Case II (Cauchy Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.884/0.828/0.830/0.824 | 0.884/0.828/0.830/0.825 | 0.884/0.828/0.831/0.825 |
| 1.35 | 0.883/0.826/0.829/0.829 | 0.883/0.826/0.829/0.829 | 0.883/0.825/0.828/0.828 |
| 2.00 | 0.890/0.829/0.832/0.838 | 0.889/0.830/0.832/0.837 | 0.889/0.829/0.831/0.836 |

Case III (Mixture Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.779/0.709/0.709/0.729 | 0.780/0.709/0.709/0.729 | 0.780/0.710/0.709/0.730 |
| 1.35 | 0.781/0.708/0.707/0.728 | 0.781/0.707/0.707/0.729 | 0.781/0.707/0.706/0.728 |
| 2.00 | 0.794/0.715/0.715/0.737 | 0.793/0.715/0.715/0.737 | 0.792/0.714/0.714/0.736 |
Table A4. Sensitivity of relative estimation error to the smoothed Huber parameters $(\delta, \eta)$ at $h = 0.368$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications.

Case I (Gaussian Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.653/0.598/0.600/0.683 | 0.654/0.599/0.601/0.684 | 0.656/0.601/0.602/0.686 |
| 1.35 | 0.644/0.588/0.589/0.678 | 0.645/0.588/0.589/0.678 | 0.646/0.589/0.591/0.678 |
| 2.00 | 0.641/0.579/0.580/0.677 | 0.640/0.579/0.580/0.677 | 0.641/0.579/0.580/0.677 |

Case II (Cauchy Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.884/0.842/0.844/0.854 | 0.884/0.842/0.844/0.854 | 0.884/0.843/0.846/0.853 |
| 1.35 | 0.883/0.840/0.843/0.855 | 0.883/0.840/0.843/0.855 | 0.883/0.839/0.842/0.856 |
| 2.00 | 0.890/0.843/0.847/0.864 | 0.889/0.843/0.847/0.864 | 0.889/0.842/0.846/0.863 |

Case III (Mixture Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.779/0.729/0.729/0.769 | 0.780/0.729/0.730/0.769 | 0.780/0.729/0.730/0.770 |
| 1.35 | 0.781/0.728/0.728/0.770 | 0.781/0.727/0.727/0.770 | 0.781/0.727/0.727/0.770 |
| 2.00 | 0.794/0.737/0.738/0.777 | 0.793/0.736/0.737/0.778 | 0.792/0.735/0.736/0.777 |
Table A5. Sensitivity of relative estimation error to the smoothed Huber parameters $(\delta, \eta)$ at $h = 0.607$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications.

Case I (Gaussian Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.653/0.625/0.626/0.767 | 0.654/0.626/0.627/0.767 | 0.656/0.627/0.629/0.768 |
| 1.35 | 0.644/0.615/0.616/0.764 | 0.645/0.616/0.617/0.764 | 0.646/0.617/0.618/0.764 |
| 2.00 | 0.641/0.607/0.608/0.764 | 0.640/0.607/0.608/0.764 | 0.641/0.607/0.609/0.764 |

Case II (Cauchy Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.884/0.862/0.864/0.897 | 0.884/0.862/0.864/0.897 | 0.884/0.863/0.865/0.897 |
| 1.35 | 0.883/0.860/0.863/0.898 | 0.883/0.860/0.862/0.898 | 0.883/0.859/0.862/0.898 |
| 2.00 | 0.890/0.865/0.868/0.903 | 0.889/0.864/0.868/0.903 | 0.889/0.864/0.868/0.902 |

Case III (Mixture Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.779/0.755/0.755/0.835 | 0.780/0.755/0.755/0.835 | 0.780/0.755/0.755/0.836 |
| 1.35 | 0.781/0.754/0.754/0.834 | 0.781/0.754/0.754/0.834 | 0.781/0.753/0.754/0.834 |
| 2.00 | 0.794/0.764/0.764/0.841 | 0.793/0.763/0.764/0.840 | 0.792/0.762/0.762/0.840 |
Table A6. Sensitivity of relative estimation error to the smoothed Huber parameters $(\delta, \eta)$ at $h = 1.649$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications.

Case I (Gaussian Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.653/0.797/0.653/0.987 | 0.654/0.795/0.654/0.987 | 0.656/0.796/0.657/0.987 |
| 1.35 | 0.644/0.792/0.645/0.990 | 0.645/0.791/0.645/0.990 | 0.646/0.792/0.646/0.990 |
| 2.00 | 0.641/0.788/0.641/0.999 | 0.640/0.789/0.641/0.999 | 0.641/0.787/0.641/0.998 |

Case II (Cauchy Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.884/1.024/0.888/1.005 | 0.884/1.023/0.888/1.005 | 0.884/1.023/0.888/1.005 |
| 1.35 | 0.883/1.021/0.888/1.005 | 0.883/1.021/0.887/1.005 | 0.883/1.020/0.887/1.005 |
| 2.00 | 0.890/1.029/0.894/1.006 | 0.889/1.027/0.893/1.006 | 0.889/1.026/0.893/1.006 |

Case III (Mixture Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.779/0.934/0.781/0.997 | 0.780/0.935/0.781/0.996 | 0.780/0.936/0.781/0.996 |
| 1.35 | 0.781/0.943/0.783/0.998 | 0.781/0.941/0.783/0.998 | 0.781/0.940/0.783/0.998 |
| 2.00 | 0.794/0.955/0.796/1.003 | 0.793/0.954/0.795/1.003 | 0.792/0.952/0.794/1.002 |
Table A7. Sensitivity of relative estimation error to the smoothed Huber parameters $(\delta, \eta)$ at $h = 2.718$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications.

Case I (Gaussian Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.653/1.617/0.653/1.231 | 0.654/1.618/0.654/1.228 | 0.656/1.620/0.656/1.228 |
| 1.35 | 0.644/1.637/0.644/1.262 | 0.645/1.639/0.645/1.261 | 0.646/1.635/0.646/1.254 |
| 2.00 | 0.641/1.617/0.641/1.312 | 0.640/1.613/0.640/1.309 | 0.641/1.611/0.641/1.304 |

Case II (Cauchy Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.884/1.882/0.884/1.118 | 0.884/1.882/0.884/1.119 | 0.884/1.890/0.884/1.118 |
| 1.35 | 0.883/1.902/0.883/1.122 | 0.883/1.898/0.883/1.123 | 0.883/1.900/0.883/1.121 |
| 2.00 | 0.890/1.873/0.890/1.137 | 0.889/1.869/0.889/1.139 | 0.889/1.864/0.889/1.136 |

Case III (Mixture Errors)

| δ \ η | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.779/1.975/0.779/1.169 | 0.780/1.971/0.780/1.171 | 0.780/1.993/0.780/1.169 |
| 1.35 | 0.781/1.992/0.781/1.183 | 0.781/1.985/0.781/1.181 | 0.781/1.971/0.781/1.178 |
| 2.00 | 0.794/1.999/0.794/1.209 | 0.793/2.001/0.793/1.208 | 0.792/2.009/0.792/1.205 |

Appendix C. Robustness Checks for the Real-Data Analysis

In Section 5, we report robustness checks across two preprocessing variants: the four starting offsets of the every-fourth-wavelength thinning, and the same procedure without the whitening step (predictors still mean-centered per instrument). Table A8 reports the per-offset mean RMSE and standard deviation; Table A9 compares the whitened and the unwhitened analysis. Each entry reports the mean RMSE (with standard deviation in parentheses) over the 20 repeated random splits.
Table A8. Per-offset prediction performance on the NIR spectral dataset. The four offsets correspond to the four starting indices of the every-fourth-wavelength thinning, each yielding $p = 163$ predictors.

Direction A: Target = $X$

| Method | Offset 0 | Offset 1 | Offset 2 | Offset 3 |
|---|---|---|---|---|
| Trans-RR | 4.6230 (0.1732) | 4.5555 (0.1619) | 4.7143 (0.1660) | 4.5511 (0.1149) |
| Trans-RR-Ada | 4.6294 (0.1861) | 4.5569 (0.1703) | 4.7155 (0.1602) | 4.5669 (0.1202) |
| Pooled-RR | 5.0952 (0.0812) | 4.9644 (0.1462) | 5.0982 (0.0971) | 4.9367 (0.0769) |
| Trans-Lasso | 5.5668 (0.3650) | 5.5799 (0.5213) | 5.5239 (0.2604) | 5.3689 (0.3004) |
| Single-RR | 6.2666 (2.2628) | 6.3039 (2.0647) | 6.4540 (2.4723) | 6.4131 (2.3509) |
| Single-Lasso | 8.0672 (2.8904) | 8.1289 (3.5097) | 8.5205 (4.3280) | 8.7014 (3.3436) |

Direction B: Target = $X_1$

| Method | Offset 0 | Offset 1 | Offset 2 | Offset 3 |
|---|---|---|---|---|
| Trans-RR | 4.7933 (0.2736) | 4.7063 (0.1320) | 4.6329 (0.2049) | 4.6457 (0.1895) |
| Trans-RR-Ada | 4.8211 (0.3757) | 4.7405 (0.1403) | 4.6222 (0.1583) | 4.6447 (0.1779) |
| Pooled-RR | 5.4909 (0.1335) | 5.2180 (0.1444) | 5.4916 (0.1550) | 5.1584 (0.1728) |
| Trans-Lasso | 5.6803 (0.3131) | 5.9048 (0.5452) | 5.5392 (0.4612) | 5.6729 (0.3458) |
| Single-RR | 6.8272 (2.2674) | 6.9625 (2.5627) | 6.6887 (2.2085) | 6.6730 (2.4256) |
| Single-Lasso | 8.5607 (3.3739) | 9.0675 (3.7103) | 7.8504 (2.7262) | 7.9018 (2.8987) |
Table A9. Comparison of mean RMSE between the whitened analysis (default) and the analysis without whitening, in both transfer directions.

| Method | Direction A, Whitened | Direction A, Unwhitened | Direction B, Whitened | Direction B, Unwhitened |
|---|---|---|---|---|
| Trans-RR | 4.6230 (0.1732) | 4.6281 (0.2237) | 4.7933 (0.2736) | 4.9253 (0.2064) |
| Trans-RR-Ada | 4.6294 (0.1861) | 4.6225 (0.2182) | 4.8211 (0.3757) | 4.9386 (0.1926) |
| Pooled-RR | 5.0952 (0.0812) | 5.0615 (0.0959) | 5.4909 (0.1335) | 5.0178 (0.1401) |
| Trans-Lasso | 5.5668 (0.3650) | 4.8776 (0.3020) | 5.6803 (0.3131) | 5.0792 (0.2623) |
| Single-RR | 6.2666 (2.2628) | 4.8845 (0.3038) | 6.8272 (2.2674) | 5.2476 (0.3971) |
| Single-Lasso | 8.0672 (2.8904) | 4.9933 (0.3261) | 8.5607 (3.3739) | 5.2374 (0.3037) |

Appendix D. Assumptions

Notation. Let $\mathrm{polyLog}(p)$ denote a power of $\log(p)$. Denote by $I_m$ the $m \times m$ identity matrix. When this creates no ambiguity, we also write $I$ for $I_p$. Let $0_m \in \mathbb{R}^m$ and $1_m \in \mathbb{R}^m$ be the vectors of zeros and ones, respectively. For an $m \times m$ matrix $A = \{a_{ij}\}_{1 \le i, j \le m}$, denote by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ the maximum and minimum eigenvalues of $A$, respectively. The $L_2$ norm of $A$ is defined as $\|A\| = \{\lambda_{\max}(A^\top A)\}^{1/2}$. We call $\hat{\Sigma} = n^{-1} \sum_{i=1}^{n} x_i x_i^\top$ the sample covariance matrix of the $x_i$'s when the $x_i$'s are known to have zero mean. We say that $X \le Y$ in $L_k$ if $E(|X|^k) \le E(|Y|^k)$. We use the notation $(a, b)$ to represent either the interval $(a, b)$ or $(b, a)$, as in some cases we need to localize quantities within intervals defined by two values, $a$ and $b$, without knowing in advance whether $a < b$ or $b < a$. We denote by $X$ the $n \times p$ design matrix whose $i$-th row is $x_i^\top$, and by $X_{(i)}$ the matrix obtained by removing the $i$-th row of $X$. We write $a \wedge b$ for $\min(a, b)$ and $a \vee b$ for $\max(a, b)$. For two symmetric matrices $A$ and $B$, $A \succeq B$ means that $A - B$ is positive semi-definite. For a random variable $W$, we define $\|W\|_{L_k} = \{E(|W|^k)\}^{1/k}$. For sequences of random variables $W_n$, $Z_n$, we write $W_n = O_{L_k}(Z_n)$ and $W_n = o_{L_k}(Z_n)$ when $\|W_n\|_{L_k} = O(\|Z_n\|_{L_k})$ and $\|W_n\|_{L_k} = o(\|Z_n\|_{L_k})$, respectively. For a vector $v = (v_1, \dots, v_m)^\top$, the $L_2$ norm is $\|v\| = (\sum_{i=1}^{m} v_i^2)^{1/2}$, whereas $\|v\|_\infty = \max_{1 \le k \le m} |v_k|$. For a function $f$ from $\mathbb{R}$ to $\mathbb{R}$, denote $\|f\|_\infty = \sup_{x \in \mathbb{R}} |f(x)|$.
In this appendix, we refine the assumptions in the main text into the forms needed for the proof of Theorem 1. The proof is divided into three parts, and each part uses a slightly different set of technical conditions. In particular, the assumptions on β 0 are specified below according to the needs of the different proof subsections, whereas the source-side conditions remain as in Assumption 2.
  • Assumptions under which the whole proof goes through
  • M1. $p/n \to \kappa \in (0, \infty)$.
  • M2. There exist constants $C_\beta$ and $e > 1/3$ such that $\|\beta_0\| \le C_\beta$ and $\|\beta_0\|_\infty = O(n^{-e})$.
  • M3. Suppose $\rho$ is an even function. Assume that $\psi = \rho'$ is bounded and $\psi'$ is Lipschitz and bounded. Moreover, we assume that $\mathrm{sign}(\psi(x)) = \mathrm{sign}(x)$ and that $\rho(x) \ge \rho(0) = 0$ for all $x \in \mathbb{R}$.
  • M4. Assume that there exist independent variables $\lambda_i$'s and $X_i$'s such that $x_i = \lambda_i X_i$. Suppose that the $X_i$'s are i.i.d. with independent entries, and they have mean $0_p$ and $\mathrm{cov}(X_i) = I_p$. Suppose there exist $c_n$ and $C_n$ that vary with $n$, where $1/c_n = O(\mathrm{polyLog}(n))$ and $C_n$ is bounded in $n$, such that for any convex 1-Lipschitz function $G$ of $X_i$, $P(|G(X_i) - m_G| > t) \le C_n \exp(-c_n t^2)$ holds for all $t > 0$, where $m_G$ is the median of $G(X_i)$. We require the same assumption to hold for the columns of the $n \times p$ design matrix $X$. Additionally, we assume that the coordinates of $X_i$ have moments of all orders, and the $k$-th moment of the entries of $X_i$ is assumed to be uniformly bounded independently of $n$ and $p$ for all $k$. Also, for any $1 \le k \le p$, the vectors $\Theta_k = (X_1(k), \dots, X_n(k))$ in $\mathbb{R}^n$ satisfy: for any 1-Lipschitz (with respect to the Euclidean norm) convex function $G$, if $m_{G(\Theta_k)}$ is a median of $G(\Theta_k)$, then for any $t > 0$, $P(|G(\Theta_k) - m_{G(\Theta_k)}| > t) \le C_n \exp(-c_n t^2)$, where $C_n$ and $c_n$ can vary with $n$. As above, we assume that $1/c_n = O(\mathrm{polyLog}(n))$.
  • M5. The $\lambda_i$'s are independent, with $E(\lambda_i^2) = 1$, $E(\lambda_i^4)$ being bounded, and $\sup_{1 \le i \le n} |\lambda_i|$ growing at most like $C_\lambda (\log n)^k$ for some $k$. The $\lambda_i$'s may have finitely many possible distributions.
  • M6. Suppose that the $\epsilon_i$'s are independent and independent of the $X_i$'s and $\lambda_i$'s. They may have finitely many possible distributions, each with a density that is differentiable, symmetric, and unimodal. Furthermore, for any $r \in \mathbb{R}$, if $z \sim N(0, 1)$ is independent of $\epsilon_i$, then $\epsilon_i + rz$ has a differentiable density $f_{i,r}$ which is increasing on $(-\infty, 0)$ and decreasing on $(0, \infty)$. Moreover, $\lim_{x \to \infty} x f_i(x) = 0$.
  • M7. The $\epsilon_i$'s can have different distributions. Similarly, the $\lambda_i$'s can have different distributions. The fraction of occurrences for each possible combination of distributions for $(\epsilon_i, \lambda_i)$ has a limit as $n \to \infty$.
  • O1. $p/n \to \kappa \in (0, \infty)$.
  • O2. $\rho$ is twice differentiable, convex, and non-linear. $\psi = \rho'$. Note that $\psi' \ge 0$ since $\rho$ is convex. We assume that $\mathrm{sign}(\psi(x)) = \mathrm{sign}(x)$ and $\rho \ge 0 = \rho(0)$.
  • O3. $\sup_x |\psi(x)| \le C\,\mathrm{polyLog}(n)$ and $\|\psi\|_\infty^2 \le C$ for some constant $C$. Furthermore, $\psi'$ is assumed to be $L(n)$-Lipschitz with $L(n) \le C n^{\alpha}$, $\alpha \ge 0$. We also assume that $\|\psi'\|_\infty \le C\,\mathrm{polyLog}(n)$.
  • O4. Assume that there exist independent variables $\lambda_i$'s and $X_i$'s such that $x_i = \lambda_i X_i$. Suppose that the $X_i$'s are i.i.d. with independent entries, and that they have mean $0_p$ and $\mathrm{cov}(X_i) = I_p$. Suppose there exist $c_n$ and $C_n$ that may vary with $n$, where $1/c_n = O(\mathrm{polyLog}(n))$ and $C_n$ is bounded in $n$, such that for any convex 1-Lipschitz function $G$ of $X_i$, $P(|G(X_i) - m_G| > t) \le C_n \exp(-c_n t^2)$ holds for all $t > 0$, where $m_G$ is the median of $G(X_i)$. We require the same assumption to hold for the columns of the $n \times p$ design matrix $X$. Additionally, we assume that the coordinates of $X_i$ have moments of all orders, and that the $k$-th moment of the entries of $X_i$ is uniformly bounded independently of $n$ and $p$ for all $k$.
  • O5. $\{X_i\}_{i=1}^{n}$ and $\{\lambda_i\}_{i=1}^{n}$ are independent of $\{\epsilon_i\}_{i=1}^{n}$. The $\epsilon_i$'s are independent of each other.
  • O6.  sup 1 i n | λ i | L n = O L k ( polyLog ( n ) ) and λ i ’s are independent. Moreover, E ( λ i 2 ) = 1 .
  • O7. $1 - 2\alpha > 0$ and $\|\beta_0\| = O(\mathrm{polyLog}(n))$.
  • P1. The $X_i$'s have independent entries. Furthermore, for any $1 \le k \le p$, the vectors $\Theta_k = (X_1(k), \ldots, X_n(k))$ in $\mathbb{R}^n$ satisfy: for any 1-Lipschitz (with respect to the Euclidean norm) convex function $G$, if $m_{G(\Theta_k)}$ is a median of $G(\Theta_k)$, then for any $t > 0$, $P(|G(\Theta_k) - m_{G(\Theta_k)}| > t) \le C_n \exp(-c_n t^2)$, where $C_n$ and $c_n$ can vary with $n$. As above, we assume that $1/c_n = O(\mathrm{polyLog}(n))$.
  • P2.  ψ = O ( 1 ) .
  • P3. $\|\beta_0\|_\infty = O(n^{-e})$, where $e > 0$. Furthermore, $\|\beta_0\|_2 \le C$, where $C$ is a constant independent of $p$ and $n$. The exponent $e$ satisfies $\alpha + 1/4 - e < 0$.
  • P4. $1/2 - 2\alpha > 0$ and $\min(1/2, e) - \alpha - 1/4 > 0$. The latter implies that $\min(1/2, e) - \alpha > 0$.
  • F1. The $\epsilon_i$'s may have different distributions; however, they may only come from finitely many distributions. Furthermore, for any $r \in \mathbb{R}$, if $z \sim N(0, 1)$ is independent of $\epsilon_i$, then $\epsilon_i + r z$ has a differentiable density $f_{i,r}$ which is increasing on $(-\infty, 0)$ and decreasing on $(0, \infty)$. Finally, $\lim_{x \to \infty} x f_i(x) = 0$.
  • F2.  ψ = O ( 1 ) . ψ has Lipschitz constant L ( n ) . Furthermore, L ( n ) ψ = O ( 1 ) .
  • F3. $\alpha < 1/6$ and $\alpha + 1/3 < 2 \min(1/2, e)$.
  • F4. There exists a constant $C$ such that $E(\lambda_i^4) \le C$.
  • F5. The $\lambda_i$'s may have different distributions. The fraction of occurrences for each possible combination of distributions for $(\epsilon_i, \lambda_i)$ has a limit as $n \to \infty$.

Appendix E. Proof for Theorem 1

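We call
$$ F(\delta) = \frac{1}{n} \sum_{i=1}^{n} \rho\{\epsilon_i + x_i^\top (w_0 - \hat{w}) + x_i^\top (\delta_0 - \delta)\} + \frac{\tau}{2} \|\delta\|^2 . $$
$\hat{\delta}$ is defined as the solution of
$$ f(\hat{\delta}) = 0 \quad \text{with} \quad \nabla F(\delta) = f(\delta) = -\frac{1}{n} \sum_{i=1}^{n} x_i \psi\{\epsilon_i + x_i^\top (w_0 - \hat{w}) + x_i^\top (\delta_0 - \delta)\} + \tau \delta . $$
We further define
$$ \tilde{\epsilon}_i = \epsilon_i + x_i^\top (w_0 - \hat{w}), \quad R_i = \tilde{\epsilon}_i + x_i^\top (\delta_0 - \hat{\delta}), \quad S = \frac{1}{n} \sum_{i=1}^{n} \psi'(R_i) x_i x_i^\top, \quad c_\tau = \frac{1}{n} \mathrm{tr}\,(S + \tau I)^{-1} . $$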

Appendix E.1. Preliminaries

Lemma A1. 
Under Assumptions P3–P4, for any $l = 1, \ldots, p$, where $w(l)$ denotes the $l$-th coordinate of a vector $w$,
$$ |\hat{w}(l) - w_0(l)| = O_{L_k}\!\left( \mathrm{polyLog}(n_1) \{ n_1^{-1/2} + n_1^{-e} \} \right) = O_{L_k}\!\left( \frac{\mathrm{polyLog}(n_1)}{n_1^{1/2} \wedge n_1^{e}} \right) . $$
In particular, if we further assume $n = O(n_1)$, we have
$$ |\hat{w}(l) - w_0(l)| = O_{L_k}\!\left( \frac{\mathrm{polyLog}(n)}{\sqrt{n} \wedge n^{e}} \right) . $$
Proof. 
By Proposition 3.12 of El Karoui [21],
$$ w_l - w_0(l) = O_{L_k}\!\left( \mathrm{polyLog}(n_1) \{ n_1^{-1/2} + \|w_0\|_\infty \} \right) , $$
where $w_l$ denotes the analog of $b_p$ defined in Appendix 4 of El Karoui [21]. Its explicit construction is rather involved and is not needed here; we only use $w_l$ as an intermediate quantity in the argument. Under Assumption P3, $\|w_0\|_\infty = O(n_1^{-e})$, hence
$$ w_l - w_0(l) = O_{L_k}\!\left( \mathrm{polyLog}(n_1) \{ n_1^{-1/2} + n_1^{-e} \} \right) = O_{L_k}\!\left( \frac{\mathrm{polyLog}(n_1)}{n_1^{1/2} \wedge n_1^{e}} \right) . $$
Moreover, by Theorem 3.20 of El Karoui [21],
$$ \hat{w}(l) - w_l = O_{L_k}\!\left( \frac{\mathrm{polyLog}(n_1)\, n_1^{\alpha}}{[n_1^{1/2} \wedge n_1^{e}]^2} \right) . $$
Let $m := \min(1/2, e)$. Then $[n_1^{1/2} \wedge n_1^{e}]^2 = n_1^{2m}$ and
$$ \frac{\mathrm{polyLog}(n_1)\, n_1^{\alpha}}{[n_1^{1/2} \wedge n_1^{e}]^2} = \frac{\mathrm{polyLog}(n_1)}{n_1^{m}}\, n_1^{\alpha - m} . $$
Assumption P4 gives $m > \alpha$, thus
$$ \frac{\mathrm{polyLog}(n_1)\, n_1^{\alpha}}{[n_1^{1/2} \wedge n_1^{e}]^2} = o\!\left( \frac{\mathrm{polyLog}(n_1)}{n_1^{m}} \right) , $$
and hence (A3) yields
$$ \hat{w}(l) - w_l = o_{L_k}\!\left( \frac{\mathrm{polyLog}(n_1)}{n_1^{m}} \right) . $$
By the triangle inequality (Minkowski's inequality in $L_k$),
$$ \|\hat{w}(l) - w_0(l)\|_{L_k} \le \|\hat{w}(l) - w_l\|_{L_k} + \|w_l - w_0(l)\|_{L_k} . $$
Using (A2) and (A4), we obtain
$$ \|\hat{w}(l) - w_0(l)\|_{L_k} = O\!\left( \frac{\mathrm{polyLog}(n_1)}{n_1^{1/2} \wedge n_1^{e}} \right) , $$
which is equivalent to
$$ |\hat{w}(l) - w_0(l)| = O_{L_k}\!\left( \frac{\mathrm{polyLog}(n_1)}{n_1^{1/2} \wedge n_1^{e}} \right) . $$
Assume $n = O(n_1)$, i.e., $n_1 \ge n / C$ for some $C > 0$ and all large $n$. Since $\mathrm{polyLog}(t)$ denotes a power of $\log t$, write $\mathrm{polyLog}(t) = (\log t)^q$ for some fixed $q \ge 0$ (up to a constant factor). Consider $f(t) := (\log t)^q / t^{m}$. For $t \ge \exp(q/m)$,
$$ f'(t) = \frac{(\log t)^{q-1}}{t^{m+1}} \, (q - m \log t) \le 0 , $$
so $f$ is decreasing for all sufficiently large $t$. Hence, for all sufficiently large $n$ and all $n_1 \ge n / C$,
$$ \frac{\mathrm{polyLog}(n_1)}{\sqrt{n_1} \wedge n_1^{e}} = \frac{(\log n_1)^q}{n_1^{m}} = f(n_1) \le f(n/C) = \frac{(\log (n/C))^q}{(n/C)^{m}} \le C^{m} \frac{(\log n)^q}{n^{m}} \asymp \frac{\mathrm{polyLog}(n)}{\sqrt{n} \wedge n^{e}} , $$
where we use $a_n \asymp b_n$ to denote that $a_n = O(b_n)$ and $b_n = O(a_n)$. Therefore, we have
$$ |\hat{w}(l) - w_0(l)| = O_{L_k}\!\left( \frac{\mathrm{polyLog}(n)}{\sqrt{n} \wedge n^{e}} \right) . $$
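The monotonicity step at the end of this proof can be checked numerically. The following is a minimal sketch, assuming illustrative values $q = 3$ and $m = 1/2$ (these are not taken from the paper); it evaluates $f(t) = (\log t)^q / t^m$ and its closed-form derivative on a grid starting at the threshold $\exp(q/m)$.

```python
import numpy as np

def f(t, q, m):
    """Rate function f(t) = (log t)^q / t^m from the proof."""
    return np.log(t) ** q / t ** m

def f_prime(t, q, m):
    """Closed-form derivative (log t)^(q-1) * (q - m log t) / t^(m+1)."""
    return np.log(t) ** (q - 1) * (q - m * np.log(t)) / t ** (m + 1)

# Illustrative values only (assumption): any q >= 0 and m > 0 would do.
q, m = 3.0, 0.5
t0 = np.exp(q / m)                      # threshold beyond which f decreases
grid = np.linspace(t0, 50 * t0, 2000)

# f should be non-increasing on the grid, and f' should be non-positive.
is_decreasing = bool(np.all(np.diff(f(grid, q, m)) <= 0))
derivative_nonpositive = bool(np.all(f_prime(grid, q, m) <= 1e-15))
print(is_decreasing, derivative_nonpositive)
```

The small tolerance in the derivative check only absorbs floating-point rounding at the threshold itself, where the derivative vanishes.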
Proposition A1. 
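Let $\delta_1$ and $\delta_2$ be two vectors in $\mathbb{R}^p$. Then, when $\rho$ is convex and twice differentiable,
$$ \|\delta_1 - \delta_2\| \le \frac{1}{\tau} \|f(\delta_1) - f(\delta_2)\| . $$
Proof. 
We have by definition
$$ f(\delta_1) - f(\delta_2) = \tau (\delta_1 - \delta_2) + \frac{1}{n} \sum_{i=1}^{n} x_i \left[ \psi\{\tilde{\epsilon}_i + x_i^\top (\delta_0 - \delta_2)\} - \psi\{\tilde{\epsilon}_i + x_i^\top (\delta_0 - \delta_1)\} \right] . $$
By the mean value theorem, we have
$$ \psi\{\tilde{\epsilon}_i + x_i^\top (\delta_0 - \delta_2)\} - \psi\{\tilde{\epsilon}_i + x_i^\top (\delta_0 - \delta_1)\} = \psi'(\gamma_i^{*}) \, x_i^\top (\delta_1 - \delta_2) , $$
where $\gamma_i^{*}$ is in the interval $\left( \tilde{\epsilon}_i + x_i^\top (\delta_0 - \delta_1), \ \tilde{\epsilon}_i + x_i^\top (\delta_0 - \delta_2) \right)$.
Hence,
$$ f(\delta_1) - f(\delta_2) = \tau (\delta_1 - \delta_2) + \frac{1}{n} \sum_{i=1}^{n} \psi'(\gamma_i^{*}) x_i x_i^\top (\delta_1 - \delta_2) = (S_{\delta_1, \delta_2} + \tau I_p)(\delta_1 - \delta_2) , $$
where
$$ S_{\delta_1, \delta_2} = \frac{1}{n} \sum_{i=1}^{n} \psi'(\gamma_i^{*}) x_i x_i^\top . $$
This shows that
$$ \delta_1 - \delta_2 = (S_{\delta_1, \delta_2} + \tau I_p)^{-1} \{ f(\delta_1) - f(\delta_2) \} . $$
Since $\rho$ is convex, $\psi' = \rho''$ is non-negative and $S_{\delta_1, \delta_2}$ is positive semi-definite. In the semi-definite order, we have $S_{\delta_1, \delta_2} + \tau I_p \succeq \tau I_p$, so the largest singular value of $(S_{\delta_1, \delta_2} + \tau I_p)^{-1}$ is at most $1/\tau$. In particular,
$$ \|\delta_1 - \delta_2\| \le \frac{1}{\tau} \|f(\delta_1) - f(\delta_2)\| . $$
Proposition A1 yields the following lemma.
Lemma A2. 
For any $\delta_1$,
$$ \|\hat{\delta} - \delta_1\| \le \frac{1}{\tau} \|f(\delta_1)\| . $$
The lemma follows from Equation (A5) applied with $\delta_2 = \hat{\delta}$, since $f(\hat{\delta}) = 0$ by definition.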
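The inequality of Proposition A1 holds deterministically, so it can be checked on simulated data. The sketch below is only an illustration under stated assumptions: it takes $\rho(x) = \log\cosh(x)$ (convex and twice differentiable, with $\psi = \tanh$), Gaussian predictors, and arbitrary sizes `n`, `p` and ridge level `tau`; none of these choices comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, tau = 200, 100, 0.5              # illustrative sizes (assumption)

X = rng.standard_normal((n, p))        # rows play the role of x_i
eps_tilde = rng.standard_normal(n)     # stands in for the shifted errors
delta0 = rng.standard_normal(p) / np.sqrt(p)

def f(delta):
    """f(delta) = -(1/n) sum_i x_i psi(eps_i + x_i'(delta0 - delta)) + tau * delta,
    with psi = tanh, i.e. rho = log cosh (an assumed loss)."""
    resid = eps_tilde + X @ (delta0 - delta)
    return -X.T @ np.tanh(resid) / n + tau * delta

# Largest observed value of ||d1 - d2|| / ((1/tau) ||f(d1) - f(d2)||).
worst_ratio = 0.0
for _ in range(200):
    d1, d2 = rng.standard_normal(p), rng.standard_normal(p)
    lhs = np.linalg.norm(d1 - d2)
    rhs = np.linalg.norm(f(d1) - f(d2)) / tau
    worst_ratio = max(worst_ratio, lhs / rhs)

print(worst_ratio)                     # should never exceed 1
```

Taking `d2` to be a root of `f` turns the same inequality into the bound of Lemma A2.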

Appendix E.2. On δ ^ and δ ^ δ 0

Lemma A3. 
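Define $q_n(b) = n^{-1} \sum_{i=1}^{n} x_i \psi\{\tilde{\epsilon}_i + x_i^\top b\}$, $q_n \in \mathbb{R}^p$.
If $D_\psi$ is the $n \times n$ diagonal matrix with $(i, i)$-entry $\psi\{\tilde{\epsilon}_i + x_i^\top \delta_0\}$,
$$ \|\hat{\delta}\| \le \frac{1}{\tau} \|q_n(\delta_0)\| = \frac{1}{\tau} \sqrt{ \frac{1}{n^2} 1^\top D_\psi X X^\top D_\psi 1 } , $$
and if $D_{\psi(\xi_i)}$ is the $n \times n$ diagonal matrix with $(i, i)$-entry $\psi(\tilde{\epsilon}_i)$,
$$ \|\hat{\delta} - \delta_0\| \le \|\delta_0\| + \frac{1}{\tau} \|q_n(0)\| = \|\delta_0\| + \frac{1}{\tau} \sqrt{ \frac{1}{n^2} 1^\top D_{\psi(\xi_i)} X X^\top D_{\psi(\xi_i)} 1 } . $$
Also,
$$ \|q_n(\delta_0)\|^2 \le \frac{1^\top D_\psi^2 1}{n} \, \| X X^\top / n \|_2 , $$
where $\|A\|_2$ denotes the largest singular value of the matrix $A$.
Therefore, under Assumptions O1–O6,
$$ E(\|\hat{\delta}\|^2) \le \frac{1}{\tau^2} \frac{p}{n} C^2 \, \mathrm{polyLog}(n), \qquad E(\|\hat{\delta}\|^4) \le \frac{1}{\tau^4} C \, \mathrm{polyLog}(n) . $$
Similarly, for any finite $k$,
$$ E(\|\hat{\delta} - \delta_0\|_2^{k}) \le C_k \left[ \|\delta_0\|^{k} + \mathrm{polyLog}(n) / \tau^{k} \right] . $$
In the case $k = 2$, we have the more precise bound
$$ E(\|\hat{\delta} - \delta_0\|^2) \le 2 \left[ \|\delta_0\|^2 + \frac{p/n}{\tau^2} \, \frac{1}{n} \sum_{i=1}^{n} E\{\psi^2(\tilde{\epsilon}_i)\} \right] . $$
Since [21] has shown that $E(\|\hat{w} - w_0\|) = O(1)$, $E(\|\hat{\delta} + \hat{w} - \delta_0 - w_0\|)$ is bounded by $K \, \mathrm{polyLog}(n) / \tau^{k}$.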
Proof. 
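Recall that $f(\delta) = -\frac{1}{n} \sum_{i=1}^{n} x_i \psi\{\tilde{\epsilon}_i + x_i^\top (\delta_0 - \delta)\} + \tau \delta$. Applying Lemma A2 with $\delta_1 = 0$, we have
$$ \|\hat{\delta} - 0\| \le \frac{1}{\tau} \|f(0)\| = \frac{1}{\tau} \left\| \frac{1}{n} \sum_{i=1}^{n} x_i \psi\{\tilde{\epsilon}_i + x_i^\top \delta_0\} \right\| , $$
which gives the first inequality.
Using $\delta_1 = \delta_0$, we have
$$ \|\hat{\delta} - \delta_0\| \le \frac{1}{\tau} \|f(\delta_0)\| = \frac{1}{\tau} \left\| -\frac{1}{n} \sum_{i=1}^{n} x_i \psi(\tilde{\epsilon}_i) + \tau \delta_0 \right\| , $$
which gives the second inequality.
We note that under our assumptions, according to Lemma 3.38 from [21], we have
$$ \| X X^\top / n \|_2 = O_{L_k}(\mathrm{polyLog}(n)) $$
and
$$ \frac{1}{n} \sum_{i=1}^{n} \psi^2\{\tilde{\epsilon}_i + x_i^\top \delta_0\} \le \frac{1}{n} \sum_{i=1}^{n} \|\psi\|_\infty^2 = O(1) , $$
which gives all the results about $L_k$ bounds.
For the last result, we note that
$$ \|q_n(0)\|^2 = q_n(0)^\top q_n(0) = \frac{1}{n^2} \sum_{i, j} x_i^\top x_j \, \psi(\tilde{\epsilon}_i) \psi(\tilde{\epsilon}_j) . $$
It implies that
$$ E \|q_n(0)\|^2 = \frac{1}{n^2} \sum_{i=1}^{n} E(\|x_i\|^2) \, E\{\psi^2(\tilde{\epsilon}_i)\} . $$
Because $E(\|x_i\|^2) = p$, we can conclude that
$$ E(\|q_n(0)\|^2) = \frac{p}{n} \, \frac{1}{n} \sum_{i=1}^{n} E\{\psi^2(\tilde{\epsilon}_i)\} . $$
Together with the bound
$$ \|\hat{\delta} - \delta_0\|^2 \le 2 \|\delta_0\|^2 + \frac{2}{\tau^2} \|q_n(0)\|^2 , $$
it implies the last result about $k = 2$. □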
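The second-moment identity for $q_n(0)$ used at the end of this proof can be illustrated by simulation. The sketch below is a Monte Carlo check under simplifying assumptions that are not taken from the paper: standard Gaussian predictors, $\psi = \tanh$, and inputs to $\psi$ drawn independently of the predictors (the variable `e` stands in for $\tilde{\epsilon}_i$).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 60, 30, 4000              # illustrative sizes (assumption)

sq_norms = np.empty(reps)
psi_sq = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, p))    # independent entries, mean 0, variance 1
    e = rng.standard_normal(n)         # drawn independently of X (assumption)
    psi = np.tanh(e)
    q = X.T @ psi / n                  # q_n(0) = (1/n) sum_i x_i psi(e_i)
    sq_norms[r] = q @ q
    psi_sq[r] = np.mean(psi ** 2)

lhs = sq_norms.mean()                  # Monte Carlo estimate of E ||q_n(0)||^2
rhs = (p / n) * psi_sq.mean()          # (p/n) * (1/n) sum_i E psi^2(e_i)
rel_gap = abs(lhs - rhs) / rhs
print(lhs, rhs, rel_gap)
```

The cross terms $i \neq j$ average out because the predictors have mean zero and are independent across observations, which is why only the diagonal terms survive in the expectation.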

Appendix E.3. Leave-One-Observation-Out

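In this subsection, we approximate $\hat{\delta}$ by $\hat{\delta}_{(i)}$ via the leave-one-observation-out method.
We consider the situation where we leave the $i$-th observation, $(x_i, \epsilon_i)$, out. By definition,
$$ \hat{\delta}_{(i)} = \arg\min_{\delta \in \mathbb{R}^p} F_i(\delta), \quad \text{where} \quad F_i(\delta) = \frac{1}{n} \sum_{j \ne i} \rho_j\{\tilde{\epsilon}_j + x_j^\top \delta_0 - x_j^\top \delta\} + \frac{\tau}{2} \|\delta\|^2 . $$
We call
$$ f_i(\delta) = -\frac{1}{n} \sum_{j \ne i} x_j \psi_j\{\tilde{\epsilon}_j + x_j^\top \delta_0 - x_j^\top \delta\} + \tau \delta = f(\delta) + \frac{1}{n} x_i \psi\{\tilde{\epsilon}_i + x_i^\top \delta_0 - x_i^\top \delta\} . $$
We have
$$ f_i(\hat{\delta}_{(i)}) = 0 . $$
We call
$$ \tilde{r}_{j,(i)} = \tilde{\epsilon}_j - x_j^\top (\hat{\delta}_{(i)} - \delta_0) \quad \text{and} \quad S_i = \frac{1}{n} \sum_{j \ne i} \psi_j'(\tilde{r}_{j,(i)}) x_j x_j^\top . $$
Consider
$$ \tilde{\delta}_i = \hat{\delta}_{(i)} + \frac{1}{n} (S_i + \tau I)^{-1} x_i \, \psi\{\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})\} \triangleq \hat{\delta}_{(i)} + \eta_i , $$
where
$$ c_i = \frac{1}{n} x_i^\top (S_i + \tau I)^{-1} x_i, \qquad \eta_i = \frac{1}{n} (S_i + \tau I)^{-1} x_i \, \psi\{\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})\} . $$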

Appendix E.3.1. Deterministic Bounds

Proposition A2. 
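We have
$$ \|\hat{\delta} - \tilde{\delta}_i\| \le \frac{1}{\tau} \|\mathcal{R}_i\| , $$
where
$$ \mathcal{R}_i = \frac{1}{n} \sum_{j \ne i} \left[ \psi_j'\{\gamma(x_j, \hat{\delta}_{(i)}, \eta_i)\} - \psi_j'(\tilde{r}_{j,(i)}) \right] x_j x_j^\top \eta_i , $$
and $\gamma(x_j, \hat{\delta}_{(i)}, \eta_i)$ is in the ("unordered") interval $(\tilde{r}_{j,(i)}, \ \tilde{r}_{j,(i)} - x_j^\top \eta_i)$. Here $\mathcal{R}_i$ is a vector in $\mathbb{R}^p$ and should not be confused with the scalar residual $R_i$.
Proof. 
Recall that $y_i = \epsilon_i + x_i^\top w_0 + x_i^\top \delta_0$.
Since $f_i(\hat{\delta}_{(i)}) = 0$ and $\tilde{\delta}_i = \hat{\delta}_{(i)} + \eta_i$, we have
$$ \begin{aligned} f(\tilde{\delta}_i) &= f(\tilde{\delta}_i) - f_i(\hat{\delta}_{(i)}) \\ &= -\frac{1}{n} \sum_{j=1}^{n} x_j \psi_j\{\tilde{\epsilon}_j + x_j^\top \delta_0 - x_j^\top \tilde{\delta}_i\} + \tau \tilde{\delta}_i + \frac{1}{n} \sum_{j \ne i} x_j \psi_j\{\tilde{\epsilon}_j + x_j^\top \delta_0 - x_j^\top \hat{\delta}_{(i)}\} - \tau \hat{\delta}_{(i)} \\ &= -\frac{1}{n} x_i \psi\{\tilde{\epsilon}_i + x_i^\top \delta_0 - x_i^\top \tilde{\delta}_i\} + \tau \eta_i + \frac{1}{n} \sum_{j \ne i} x_j \left[ \psi_j\{\tilde{\epsilon}_j + x_j^\top \delta_0 - x_j^\top \hat{\delta}_{(i)}\} - \psi_j\{\tilde{\epsilon}_j + x_j^\top \delta_0 - x_j^\top (\hat{\delta}_{(i)} + \eta_i)\} \right] . \end{aligned} $$
By the mean value theorem, we have
$$ \psi_j\{\tilde{\epsilon}_j + x_j^\top \delta_0 - x_j^\top \hat{\delta}_{(i)}\} - \psi_j\{\tilde{\epsilon}_j + x_j^\top \delta_0 - x_j^\top (\hat{\delta}_{(i)} + \eta_i)\} = \psi_j'(\tilde{r}_{j,(i)}) \, x_j^\top \eta_i + \left[ \psi_j'\{\gamma(x_j, \hat{\delta}_{(i)}, \eta_i)\} - \psi_j'(\tilde{r}_{j,(i)}) \right] x_j^\top \eta_i , $$
where $\gamma(x_j, \hat{\delta}_{(i)}, \eta_i)$ is in the ("unordered") interval $(\tilde{r}_{j,(i)}, \ \tilde{r}_{j,(i)} - x_j^\top \eta_i)$.
Therefore, if $\mathcal{R}_i$ is defined as above, we have
$$ \frac{1}{n} \sum_{j \ne i} x_j \left[ \psi_j\{\tilde{\epsilon}_j + x_j^\top \delta_0 - x_j^\top \hat{\delta}_{(i)}\} - \psi_j\{\tilde{\epsilon}_j + x_j^\top \delta_0 - x_j^\top (\hat{\delta}_{(i)} + \eta_i)\} \right] = \frac{1}{n} \sum_{j \ne i} \psi_j'(\tilde{r}_{j,(i)}) x_j x_j^\top \eta_i + \mathcal{R}_i = S_i \eta_i + \mathcal{R}_i . $$
The previous simplifications yield
$$ f(\tilde{\delta}_i) = -\frac{1}{n} x_i \psi\{\tilde{\epsilon}_i + x_i^\top \delta_0 - x_i^\top \tilde{\delta}_i\} + (S_i + \tau I) \eta_i + \mathcal{R}_i . $$
Since by definition $\eta_i = n^{-1} (S_i + \tau I)^{-1} x_i \psi\{\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})\}$, we have
$$ (S_i + \tau I) \eta_i = \frac{1}{n} x_i \psi\{\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})\} . $$
Also, by definition we have
$$ \tilde{\epsilon}_i + x_i^\top \delta_0 - x_i^\top \tilde{\delta}_i = \tilde{r}_{i,(i)} - c_i \psi\{\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})\} . $$
When $\rho$ is differentiable, $x - c \, \psi\{\mathrm{prox}(c \rho)(x)\} = \mathrm{prox}(c \rho)(x)$. Therefore, $\tilde{\epsilon}_i + x_i^\top \delta_0 - x_i^\top \tilde{\delta}_i = \mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})$ and
$$ -\frac{1}{n} x_i \psi\{\tilde{\epsilon}_i + x_i^\top \delta_0 - x_i^\top \tilde{\delta}_i\} + (S_i + \tau I) \eta_i = \frac{1}{n} x_i \left[ \psi\{\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})\} - \psi\{\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})\} \right] = 0 . $$
Therefore, $f(\tilde{\delta}_i) = \mathcal{R}_i$. Applying Lemma A2, we have
$$ \|\hat{\delta} - \tilde{\delta}_i\| \le \frac{1}{\tau} \|\mathcal{R}_i\| . $$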
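The proximal identity $x - c\,\psi\{\mathrm{prox}(c\rho)(x)\} = \mathrm{prox}(c\rho)(x)$ used in this proof can be illustrated numerically. The sketch below assumes $\rho(x) = \log\cosh(x)$, so that $\psi = \tanh$, and an arbitrary value of $c$; the helper `prox` is hypothetical and simply solves $y + c\,\psi(y) = x$ by bisection. It also checks the contraction $|\psi(\mathrm{prox}(c\rho)(x))| \le |\psi(x)|$ for this particular loss.

```python
import numpy as np

def psi(y):
    """psi = rho' for the assumed loss rho(y) = log cosh(y)."""
    return np.tanh(y)

def prox(x, c, iters=80):
    """prox(c*rho)(x): the root y of y + c*psi(y) = x, found by bisection.

    The map y -> y + c*psi(y) is increasing and the root lies between 0 and x.
    """
    x = np.asarray(x, dtype=float)
    lo, hi = np.minimum(x, 0.0), np.maximum(x, 0.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        too_big = mid + c * psi(mid) > x
        hi = np.where(too_big, mid, hi)
        lo = np.where(too_big, lo, mid)
    return 0.5 * (lo + hi)

x = np.linspace(-5.0, 5.0, 201)
c = 1.7                                 # illustrative value (assumption)
y = prox(x, c)

# Identity: x - c*psi(prox(x)) equals prox(x), up to the bisection tolerance.
identity_gap = np.max(np.abs(x - c * psi(y) - y))
# Contraction: |psi(prox(x))| <= |psi(x)|, since |prox(x)| <= |x| and psi is odd and increasing.
shrinks_psi = bool(np.all(np.abs(psi(y)) <= np.abs(psi(x)) + 1e-12))
print(identity_gap, shrinks_psi)
```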
i. On $\mathcal{R}_i$
Next, we provide a bound for $\mathcal{R}_i$.
Lemma A4. 
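We have
$$ \|\eta_i\| \le \frac{1}{\sqrt{n}\, \tau} \, \frac{\|x_i\|}{\sqrt{n}} \, |\psi(\tilde{r}_{i,(i)})| , $$
and
$$ \|\mathcal{R}_i\| \le \|\hat{\Sigma}\|_2 \, \sup_{j \ne i} \left| \psi_j'\{\gamma(x_j, \hat{\delta}_{(i)}, \eta_i)\} - \psi_j'(\tilde{r}_{j,(i)}) \right| \, \frac{1}{\sqrt{n}\, \tau} \, \frac{\|x_i\|}{\sqrt{n}} \, |\psi(\tilde{r}_{i,(i)})| . $$
Proof. 
We have
$$ \mathcal{R}_i = \frac{1}{n} \sum_{j \ne i} \left[ \psi_j'\{\gamma(x_j, \hat{\delta}_{(i)}, \eta_i)\} - \psi_j'(\tilde{r}_{j,(i)}) \right] x_j x_j^\top \eta_i . $$
Note that $S = n^{-1} \sum_{j \ne i} [\psi_j'\{\gamma(x_j, \hat{\delta}_{(i)}, \eta_i)\} - \psi_j'(\tilde{r}_{j,(i)})] x_j x_j^\top$ can be written as $S = n^{-1} X^\top D X$, where $D$ is a diagonal matrix with $(j, j)$-entry $[\psi_j'\{\gamma(x_j, \hat{\delta}_{(i)}, \eta_i)\} - \psi_j'(\tilde{r}_{j,(i)})]$.
Using the properties of the matrix norm $\|\cdot\|_2$, we have $\|S\|_2 \le \|\hat{\Sigma}\|_2 \|D\|_2$, which implies that
$$ \|\mathcal{R}_i\| \le \|\hat{\Sigma}\|_2 \, \sup_{j \ne i} \left| \psi_j'\{\gamma(x_j, \hat{\delta}_{(i)}, \eta_i)\} - \psi_j'(\tilde{r}_{j,(i)}) \right| \, \|\eta_i\| , $$
where $\hat{\Sigma} = n^{-1} \sum_{j=1}^{n} x_j x_j^\top$ is the sample covariance matrix.
We now bound $\|\eta_i\|$. Note that
$$ \|\eta_i\| \le \frac{1}{\sqrt{n}\, \tau} \, \frac{\|x_i\|}{\sqrt{n}} \, \left| \psi\{\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})\} \right| . $$
Using Lemma A-1 in [25], we see that
$$ \left| \psi\{\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})\} \right| \le |\psi(\tilde{r}_{i,(i)})| . $$
Under our assumptions, we have
$$ |\psi(\tilde{r}_{i,(i)})| \le \|\psi\|_\infty \le C \, \mathrm{polyLog}(n) $$
and later, in the proof of Lemma A6, we will show that
$$ \sup_i \frac{\|x_i\|}{\sqrt{n}} = \sup_i |\lambda_i| \frac{\|X_i\|}{\sqrt{n}} = O_{L_k}\!\left( \sup_i |\lambda_i| \right) . $$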
ii.
On γ ( x j , δ ^ ( i ) , η i ) and related quantities
We now show how to control n 1 / 2 sup j i | ψ j { γ ( x j , δ ^ ( i ) , η i ) } ψ j ( r ˜ j , ( i ) ) | .
Lemma A5. 
Suppose that $\psi'$ is $L(n)$-Lipschitz. Then
$$ \sup_{j \ne i} \left| \psi_j'\{\gamma(x_j, \hat{\delta}_{(i)}, \eta_i)\} - \psi_j'(\tilde{r}_{j,(i)}) \right| \le L(n) \sup_{j \ne i} |x_j^\top \eta_i| . $$
Proof. 
By definition, we have
$$ |\gamma(x_j, \hat{\delta}_{(i)}, \eta_i) - \tilde{r}_{j,(i)}| \le |x_j^\top \eta_i| . $$
Therefore, the bound follows from the fact that $\psi_j'$ is $L(n)$-Lipschitz. □

Appendix E.3.2. Stochastic Aspects

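Recall that
$$ x_j^\top \eta_i = \psi\{\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})\} \, \frac{1}{n} x_j^\top (S_i + \tau I)^{-1} x_i . $$
Therefore, we can bound $\|\mathcal{R}_i\|$ by
$$ \|\mathcal{R}_i\| \le \sup_{j \ne i} \frac{|x_j^\top (S_i + \tau I)^{-1} x_i|}{n} \, \frac{L(n)}{\sqrt{n}\, \tau} \, \frac{\|x_i\|}{\sqrt{n}} \, \|\hat{\Sigma}\|_2 \, \|\psi\|_\infty^2 . $$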
i.
On sup j i | x j ( S i + τ I ) 1 x i |
Now we control x j ( S i + τ I ) 1 x i / n .
Lemma A6. 
Suppose the $x_i$'s are independent and satisfy O4; suppose that the $\lambda_i$'s satisfy O6. Then
$$ \sup_{j \ne i} \left| x_j^\top (S_i + \tau I)^{-1} x_i / n \right| \le \frac{1}{\sqrt{n}} \sup_{j \ne i} \frac{\|X_j\|}{\tau \sqrt{n}} \, \mathrm{polyLog}(n) $$
in $L_k$, for any finite $k$. Note that under Assumption O4, for any finite $k$,
$$ \sup_{j \ne i} \left| \|X_j\| / \sqrt{n} \right| = O_{L_k}(1) . $$
Proof. 
Note that
$$ \left| x_j^\top (S_i + \tau I)^{-1} x_i / n \right| = |\lambda_i \lambda_j| \left| X_j^\top (S_i + \tau I)^{-1} X_i / n \right| . $$
Denote $X_{(i)} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$ and $v_{j,(i)} = (S_i + \tau I)^{-1} X_j$. Then the map $F_j(X_i) = X_j^\top (S_i + \tau I)^{-1} X_i = X_i^\top v_{j,(i)}$ is linear in $X_i$ and Lipschitz with Lipschitz constant $\{X_j^\top (S_i + \tau I)^{-2} X_j\}^{1/2} \le \|X_j\| / \tau$. Therefore, using Lemma B-2 in [25] and the fact that $X_i$ has mean 0, we have
$$ \sup_{j \ne i} \left| X_j^\top (S_i + \tau I)^{-1} X_i / n \right| \ \Big| \ X_{(i)} \le \frac{1}{\sqrt{n}} \sup_{j \ne i} \frac{\|X_j\|}{\tau \sqrt{n}} \, \mathrm{polyLog}(n) / c_n^{1/2} $$
in $L_k$, when $\sup_{j \ne i} \|X_j\| / (\tau n^{1/2}) = O_{L_k}(1)$.
To prove this, using the fact that $X_j \mapsto \|X_j\| / n^{1/2}$ is $n^{-1/2}$-Lipschitz, we have
$$ \sup_{j \ne i} \left| \|X_j\| / \sqrt{n} - m_{\|X_j\| / \sqrt{n}} \right| \le \mathrm{polyLog}(n) / \sqrt{n c_n} \quad \text{in } L_{2k} . $$
Note that $E(\|X_j\|) \le \{E(\|X_j\|^2)\}^{1/2} = p^{1/2}$, so $m_{\|X_j\| / \sqrt{n}}$ is of order 1. Therefore, by Assumption O4 on $c_n$, we have
$$ \sup_{j \ne i} \left| \|X_j\| / \sqrt{n} \right| = O_{L_k}(1) . $$
Now Assumption O6, which gives $\sup_i |\lambda_i| = O_{L_k}(\mathrm{polyLog}(n))$, guarantees that the bounds we announced are valid. □
  • Consequences
We have the following result.
Proposition A3. 
Under Assumptions O1–O6, we have
$$ \|\mathcal{R}_i\| = O_{L_k}\!\left( \frac{[L(n)] \, \|\psi\|_\infty^2}{n \tau} \, \mathrm{polyLog}(n) \right) . $$
Furthermore, the same bound holds for $\sup_{1 \le i \le n} \|\mathcal{R}_i\|$.
Proof. 
By aggregating all the intermediate results, using Hölder's inequality and the fact that $\|\hat{\Sigma}\|_2 = O_{L_k}(\mathrm{polyLog}(n))$ shown in Lemma 3.38 from [21], we finish the proof. □
We can now state and prove the following result. Recall that
$$ \tilde{\delta}_i = \hat{\delta}_{(i)} + \frac{1}{n} (S_i + \tau I_p)^{-1} x_i \, \psi\{\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})\} \triangleq \hat{\delta}_{(i)} + \eta_i . $$
Theorem A1. 
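Under Assumptions O1–O7, we have, for any fixed $k$, when $\tau$ is held fixed and $L(n) \le C n^{\alpha}$,
$$ \sup_{1 \le i \le n} \|\hat{\delta} - \tilde{\delta}_i\| = O_{L_k}\!\left( \frac{\mathrm{polyLog}(n)}{n^{1 - \alpha}} \right) . $$
In particular, we have
$$ \forall \, 1 \le i \le n, \quad E(\|\hat{\delta} - \tilde{\delta}_i\|^2) = O\!\left( \frac{\mathrm{polyLog}(n)}{n^{2 - 2\alpha}} \right) . $$
Also,
$$ \sup_{1 \le i \le n} \sup_{j \ne i} |\tilde{r}_{j,(i)} - R_j| = O_{L_k}\!\left( \frac{\mathrm{polyLog}(n)}{n^{1/2 - \alpha}} \right) . $$
Finally,
$$ \sup_{1 \le i \le n} \left| R_i - \mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)}) \right| = O_{L_k}\!\left( \frac{\mathrm{polyLog}(n)}{n^{1/2 - \alpha}} \right) . $$
Proof. 
The first two results are direct consequences of the previous propositions.
The third result follows from
$$ \sup_{j \ne i} |\tilde{r}_{j,(i)} - R_j| = \sup_{j \ne i} |x_j^\top (\hat{\delta} - \hat{\delta}_{(i)})| \le \sup_{j \ne i} |x_j^\top (\hat{\delta} - \tilde{\delta}_i)| + \sup_{j \ne i} |x_j^\top (\tilde{\delta}_i - \hat{\delta}_{(i)})| \le \sup_{1 \le j \le n} \frac{\|x_j\|}{\sqrt{n}} \, \sqrt{n} \, \|\hat{\delta} - \tilde{\delta}_i\| + \sup_{j \ne i} |x_j^\top \eta_i| , $$
and the fact that $\sup_{1 \le j \le n} \|x_j\| / n^{1/2} = O_{L_k}(\mathrm{polyLog}(n))$ under our assumptions. Control of the first term follows from the results on $\|\hat{\delta} - \tilde{\delta}_i\|$. Control of the second term follows from Lemma A6 and the assumption that $\|\psi\|_\infty$ is bounded by $C \, \mathrm{polyLog}(n)$.
For the last result, recall that
$$ R_i = \tilde{\epsilon}_i + x_i^\top (\delta_0 - \hat{\delta}) = \tilde{\epsilon}_i + x_i^\top \delta_0 - x_i^\top \tilde{\delta}_i - x_i^\top (\hat{\delta} - \tilde{\delta}_i) . $$
Given the definition of $\tilde{\delta}_i$, we have
$$ x_i^\top \tilde{\delta}_i = x_i^\top \hat{\delta}_{(i)} + c_i \psi\{\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})\} . $$
By the property of the proximal operator that $y = \mathrm{prox}(c \rho)(x)$ satisfies $y + c \psi(y) = x$, we have
$$ \tilde{\epsilon}_i + x_i^\top \delta_0 - x_i^\top \tilde{\delta}_i = \tilde{r}_{i,(i)} - c_i \psi[\mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)})] = \mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)}) . $$
Therefore, we have
$$ \sup_i \left| R_i - \mathrm{prox}(c_i \rho)(\tilde{r}_{i,(i)}) \right| = \sup_i |x_i^\top (\hat{\delta} - \tilde{\delta}_i)| , $$
which is controlled by the previous results. □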
  • On the limiting variance of δ ^ 2 and δ ^ δ 0 2
Proposition A4. 
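Under Assumptions O1–O7,
$$ \mathrm{var}(\|\hat{\delta}\|^2) \to 0 \quad \text{as } n \to \infty . $$
Therefore, $\|\hat{\delta}\|^2$ has a deterministic equivalent in probability and in $L_2$.
More precisely, we have
$$ \mathrm{var}(\|\hat{\delta}\|^2) = O\!\left( \frac{\mathrm{polyLog}(n)}{n^{1 - 2\alpha}} \right) . $$
The same type of results holds for $\mathrm{var}(\|\hat{\delta} - \delta_0\|^2)$ and $\mathrm{var}(\|\hat{\beta} - \beta_0\|^2)$, provided that $\|\delta_0\| = O(\mathrm{polyLog}(n))$.
Proof. 
We use the Efron–Stein inequality [35]: if $W$ is a function of $n$ independent random variables, and $W_{(i)}$ is any function of all those random variables except the $i$-th,
$$ \mathrm{var}(W) \le \sum_{i=1}^{n} \mathrm{var}(W - W_{(i)}) \le \sum_{i=1}^{n} E\left( (W - W_{(i)})^2 \right) . $$
We apply this inequality to $W = \|\hat{\delta}\|^2$ and $W_{(i)} = \|\hat{\delta}_{(i)}\|^2$. We first note that
$$ E\left( \left| \|\hat{\delta}\|^2 - \|\hat{\delta}_{(i)}\|^2 \right|^2 \right) \le 2 \left[ E \left| \|\hat{\delta}\|^2 - \|\tilde{\delta}_i\|^2 \right|^2 + E \left| \|\tilde{\delta}_i\|^2 - \|\hat{\delta}_{(i)}\|^2 \right|^2 \right] . $$
For the first term, we have $\left| \|\hat{\delta}\|^2 - \|\tilde{\delta}_i\|^2 \right|^2 = [(\hat{\delta} - \tilde{\delta}_i)^\top (\hat{\delta} + \tilde{\delta}_i)]^2$ and $(\hat{\delta} - \tilde{\delta}_i)^\top (\hat{\delta} + \tilde{\delta}_i) = 2 (\hat{\delta} - \tilde{\delta}_i)^\top \hat{\delta} - \|\hat{\delta} - \tilde{\delta}_i\|^2$. Therefore, by the Cauchy–Schwarz inequality, we have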
| δ ^ 2 δ ˜ i 2 | 2 = O L 1 ( δ ^ δ ˜ i 4 ) + O L 1 ( polyLog ( n ) ) δ ^ δ ˜ i 4 ,
since E ( δ ^ k ) exists and is bounded by k polyLog ( n ) / τ k .
Using the results of Theorem A1 we have
E | δ ^ 2 δ ˜ i 2 | 2 = O polyLog ( n ) n 2 2 α = o ( n 1 ) ,
provided that α < 1 / 2 .
For the second term, by definition, we have
δ ˜ i 2 δ ^ ( i ) 2 = 2 n δ ^ ( i ) ( S i + τ I ) 1 x i ψ ( prox ( c i ρ ) ( r ˜ i , ( i ) ) ) + 1 n 2 x i ( S i + τ I ) 2 x i ψ 2 ( prox ( c i ρ ) ( r ˜ i , ( i ) ) ) .
Since δ ^ ( i ) 2 and S i are independent of x i , and ( S i + τ I ) 1 2 τ 1 , we have δ ^ ( i ) ( S i + τ I ) 1 x i = O L 2 ( | λ i | δ ^ ( i ) / c n 1 / 2 ) . Recall also that sup i ψ = O ( polyLog ( n ) ) . Therefore, both terms are of order O L 2 ( polyLog ( n ) / n c n 1 / 2 ) .
We can now conclude that
E | δ ˜ i 2 δ ^ ( i ) 2 | 2 = O polyLog ( n ) n 2 .
Therefore, we have
var ( δ ^ 2 ) = O polyLog ( n ) n 1 2 α = o ( 1 ) .
This shows that δ ^ 2 has a deterministic equivalent in probability and in L 2 .
For the second part of the result, similarly, we write that
E ( | δ ^ δ 0 2 δ ^ ( i ) δ 0 2 | 2 ) = 2 E | δ ^ δ 0 2 δ ˜ i δ 0 2 | 2 + E | δ ˜ i δ 0 2 δ ^ ( i ) δ 0 2 | 2 .
Using the fact that | δ ^ δ 0 2 δ ˜ i δ 0 2 | 2 = [ ( δ ^ δ ˜ i ) ( δ ^ + δ ˜ i 2 δ 0 ) ] 2 and ( δ ^ δ ˜ i ) ( δ ^ + δ ˜ i 2 δ 0 ) = 2 ( δ ^ δ ˜ i ) ( δ ^ δ 0 ) δ ^ δ ˜ i 2 , by the Cauchy–Schwarz inequality we have
| δ ^ δ 0 2 δ ˜ i δ 0 2 | 2 = O L 1 ( δ ^ δ ˜ i 4 ) + O L 1 ( polyLog ( n ) ) δ ^ δ ˜ i 4 ,
since E ( δ ^ δ 0 k ) exists and is bounded by k polyLog ( n ) / τ k following from Assumption O7 and Lemma A3.
Using the results of Theorem A1 we have
E | δ ^ δ 0 2 δ ˜ i δ 0 2 | 2 = O polyLog ( n ) n 2 2 α = o ( n 1 ) ,
provided that α < 1 / 2 .
Similarly, by definition we have
δ ˜ i δ 0 2 δ ^ ( i ) δ 0 2 = 2 n ( δ ^ ( i ) δ 0 ) ( S i + τ I ) 1 x i ψ ( prox ( c i ρ ) ( r ˜ i , ( i ) ) ) + 1 n 2 x i ( S i + τ I ) 2 x i ψ 2 ( prox ( c i ρ ) ( r ˜ i , ( i ) ) ) .
Since ( δ ^ ( i ) δ 0 ) ( S i + τ I ) 1 x i = O L 2 ( | λ i | δ ^ ( i ) δ 0 / c n 1 / 2 ) , both terms are of order O L 2 ( polyLog ( n ) / n c n 1 / 2 ) .
Similarly, we have
var ( δ ^ δ 0 2 ) = O polyLog ( n ) n 1 2 α = o ( 1 ) .
The results for var ( β ^ β 0 2 ) are simply followed by the fact that var ( w ^ w 0 2 ) = o ( 1 ) in Proposition 3.10 of [21].
By assuming W ˜ = δ ^ + w ^ δ 0 w 0 2 and W ˜ ( i ) = δ ^ ( i ) + w ^ δ 0 w 0 2 , we can similarly have
E | δ ^ + w ^ δ 0 w 0 2 δ ^ ( i ) + w ^ δ 0 w 0 2 | 2 = o ( 1 ) .
Note that E ( w ^ w 0 ) = O ( 1 ) , we have E ( δ ^ + w ^ δ 0 w 0 ) is bounded by K polyLog ( n ) / τ k . □
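The Efron–Stein inequality that drives this proof can be illustrated on a toy statistic. The sketch below is a Monte Carlo example under assumptions of our own choosing, unrelated to the estimator in the paper: $W$ is the squared sample mean of $n$ independent standard Gaussian variables and $W_{(i)}$ is the squared mean computed without the $i$-th variable.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 20, 20000                    # illustrative sizes (assumption)

X = rng.standard_normal((reps, n))     # each row is one sample of n independent variables
W = X.mean(axis=1) ** 2                # toy statistic: squared sample mean

# Leave-one-out means: column i holds the mean of the row without entry i.
loo_means = (X.sum(axis=1, keepdims=True) - X) / (n - 1)
W_loo = loo_means ** 2                 # W_(i) depends on every variable except the i-th

var_W = W.var()
# Right-hand side of Efron-Stein: sum over i of E (W - W_(i))^2.
efron_stein_bound = np.mean((W[:, None] - W_loo) ** 2, axis=0).sum()
print(var_W, efron_stein_bound)
```

In the proof above the same mechanism is applied with $W = \|\hat{\delta}\|^2$, where each leave-one-out difference is controlled through the intermediate vector $\tilde{\delta}_i$.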

Appendix E.4. Leaving Out a Predictor

Let $V$ be the $n \times (p - 1)$ matrix corresponding to the first $(p - 1)$ columns of the design matrix $X$. We call $v_i \in \mathbb{R}^{p-1}$ the vector corresponding to the first $p - 1$ entries of $x_i$, i.e., $v_i = (x_i(1), \ldots, x_i(p-1))^\top$. Denote $X_i = (X_i(1), \ldots, X_i(p))^\top$. We call $X(p)$ the vector in $\mathbb{R}^n$ with $j$-th entry $x_j(p)$, i.e., the $p$-th entry of the vector $x_j$. When no confusion arises, we also write $x_{j,p}$ for $x_j(p)$. We further denote $\delta_0 = (\gamma_0^\top, \delta_0(p))^\top$.
Call $\hat{\gamma}$ the solution of
$$ \hat{\gamma} = \arg\min_{\gamma \in \mathbb{R}^{p-1}} \frac{1}{n} \sum_{i=1}^{n} \rho\{\epsilon_i + v_i^\top (w_{0,-p} - \hat{w}_{-p}) - v_i^\top (\gamma - \gamma_0)\} + \frac{\tau}{2} \|\gamma\|^2 , $$
where $w_{-p}$ denotes the vector $w$ with its $p$-th coordinate removed.
Note that $(\hat{\gamma}^\top, 0)^\top$ is the solution of the original optimization problem (5) when $x_i(p)$ is replaced by 0.
  • Approximation to δ ^ via leave-one-predictor-out
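We use the notations and partitions
$$ x_i = \begin{pmatrix} v_i \\ x_i(p) \end{pmatrix}, \qquad \hat{\delta} = \begin{pmatrix} \hat{\delta}_{-p} \\ \hat{\delta}(p) \end{pmatrix} . $$
Naturally, $\hat{\gamma}$ satisfies
$$ -\frac{1}{n} \sum_{i=1}^{n} v_i \psi\{\epsilon_i + v_i^\top (w_{0,-p} - \hat{w}_{-p}) - v_i^\top (\hat{\gamma} - \gamma_0)\} + \tau \hat{\gamma} = 0_{p-1} . $$
Denote
$$ r_{i,[p]} = \epsilon_i + v_i^\top (w_{0,-p} - \hat{w}_{-p}) - v_i^\top (\hat{\gamma} - \gamma_0) , $$
i.e., the residuals based on $p - 1$ predictors.
Recall that
$$ -\frac{1}{n} \sum_{i=1}^{n} x_i \psi\{\epsilon_i + x_i^\top (w_0 - \hat{w}) - x_i^\top (\hat{\delta} - \delta_0)\} + \tau \hat{\delta} = 0_p , $$
and
$$ R_i = \epsilon_i + x_i^\top (w_0 - \hat{w}) - x_i^\top (\hat{\delta} - \delta_0) . $$
Taking the difference between (A10) and (A11), we have
$$ \frac{1}{n} \sum_{i \le n} \left[ x_i \psi(R_i) - \begin{pmatrix} v_i \\ 0 \end{pmatrix} \psi(r_{i,[p]}) \right] - \tau \left[ \hat{\delta} - \begin{pmatrix} \hat{\gamma} \\ 0 \end{pmatrix} \right] = 0_p . $$
Note that this $p$-dimensional equation can be separated into a scalar and a vector equation, namely,
$$ \frac{1}{n} \sum_{i \le n} x_i(p) \psi(R_i) - \tau \hat{\delta}(p) = 0, \qquad \frac{1}{n} \sum_{i \le n} v_i \left[ \psi(R_i) - \psi(r_{i,[p]}) \right] - \tau (\hat{\delta}_{-p} - \hat{\gamma}) = 0_{p-1} . $$
Using a first-order Taylor expansion of $\psi(R_i)$ around $r_{i,[p]}$ and noting that $R_i - r_{i,[p]} = v_i^\top (\hat{\gamma} - \hat{\delta}_{-p}) + x_i(p) \{ w_0(p) - \hat{w}(p) + \delta_0(p) - \hat{\delta}(p) \}$, we can transform the first equation above into
$$ \frac{1}{n} \sum_{i \le n} x_i(p) \left[ \psi(r_{i,[p]}) + \psi'(r_{i,[p]}) \left\{ v_i^\top (\hat{\gamma} - \hat{\delta}_{-p}) + x_i(p) \{ w_0(p) - \hat{w}(p) + \delta_0(p) - \hat{\delta}(p) \} \right\} \right] - \tau \hat{\delta}(p) \simeq 0 . $$
This gives the near equivalence
$$ \hat{\delta}(p) \simeq \frac{ \frac{1}{n} \sum_{i \le n} x_i(p) \left[ \psi(r_{i,[p]}) + \psi'(r_{i,[p]}) \left\{ v_i^\top (\hat{\gamma} - \hat{\delta}_{-p}) + x_i(p) \{ w_0(p) - \hat{w}(p) + \delta_0(p) \} \right\} \right] }{ \frac{1}{n} \sum_{i \le n} x_i^2(p) \psi'(r_{i,[p]}) + \tau } . $$
Working similarly on the second equation involving $v_i$, we have
$$ \frac{1}{n} \sum_{i \le n} \psi'(r_{i,[p]}) v_i (R_i - r_{i,[p]}) - \tau (\hat{\delta}_{-p} - \hat{\gamma}) \simeq 0_{p-1} . $$
Since $R_i - r_{i,[p]} = v_i^\top (\hat{\gamma} - \hat{\delta}_{-p}) + x_i(p) \{ w_0(p) - \hat{w}(p) + \delta_0(p) - \hat{\delta}(p) \}$, the above equation can be transformed into
$$ \frac{1}{n} \sum_{i \le n} \psi'(r_{i,[p]}) v_i v_i^\top (\hat{\gamma} - \hat{\delta}_{-p}) + \{ w_0(p) - \hat{w}(p) + \delta_0(p) - \hat{\delta}(p) \} \frac{1}{n} \sum_{i \le n} \psi'(r_{i,[p]}) v_i x_i(p) - \tau (\hat{\delta}_{-p} - \hat{\gamma}) \simeq 0_{p-1} . $$
Denote
$$ u_p = \frac{1}{n} \sum_{i=1}^{n} \psi'(r_{i,[p]}) v_i x_i(p), \quad \text{and} \quad S_p = \frac{1}{n} \sum_{i=1}^{n} \psi'(r_{i,[p]}) v_i v_i^\top ; $$
we see that $\hat{\gamma} - \hat{\delta}_{-p} \simeq -\{ w_0(p) - \hat{w}(p) + \delta_0(p) - \hat{\delta}(p) \} (S_p + \tau I)^{-1} u_p$. Using the above approximation in the equation for $\hat{\delta}(p)$, we can write
$$ \hat{\delta}(p) \simeq \frac{ \frac{1}{n} \sum_{i \le n} x_i(p) \psi(r_{i,[p]}) + \{ w_0(p) - \hat{w}(p) + \delta_0(p) \} \left\{ \frac{1}{n} \sum_{i \le n} x_i^2(p) \psi'(r_{i,[p]}) - u_p^\top (S_p + \tau I)^{-1} u_p \right\} }{ \frac{1}{n} \sum_{i \le n} x_i^2(p) \psi'(r_{i,[p]}) - u_p^\top (S_p + \tau I)^{-1} u_p + \tau } . $$
Denote
$$ \xi_n \triangleq \frac{1}{n} \sum_{i=1}^{n} x_i^2(p) \psi'(r_{i,[p]}) - u_p^\top (S_p + \tau I)^{-1} u_p , $$
and
$$ N_p \triangleq \frac{1}{\sqrt{n}} \sum_{i=1}^{n} x_i(p) \psi(r_{i,[p]}) . $$
We have
$$ (\xi_n + \tau) \, \hat{\delta}(p) \simeq \frac{1}{\sqrt{n}} N_p + \{ w_0(p) - \hat{w}(p) + \delta_0(p) \} \xi_n . $$
Also we have
$$ \hat{\delta}_{-p} \simeq \hat{\gamma} + \{ w_0(p) - \hat{w}(p) + \delta_0(p) - \hat{\delta}(p) \} (S_p + \tau I)^{-1} u_p . $$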
Thus, we construct an approximation to δ ^ . As a summary, we introduce the following definitions:
Definition A1. 
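We call the residuals corresponding to this optimization problem $\{r_{i,[p]}\}_{i=1}^{n}$; in other words,
$$ r_{i,[p]} = \epsilon_i + v_i^\top (w_{0,-p} - \hat{w}_{-p}) - v_i^\top (\hat{\gamma} - \gamma_0) . $$
We call
$$ u_p = \frac{1}{n} \sum_{i=1}^{n} \psi'(r_{i,[p]}) v_i x_i(p), \quad \text{and} \quad S_p = \frac{1}{n} \sum_{i=1}^{n} \psi'(r_{i,[p]}) v_i v_i^\top . $$
Note that $u_p \in \mathbb{R}^{p-1}$ and $S_p$ is $(p - 1) \times (p - 1)$. We call
$$ \xi_n \triangleq \frac{1}{n} \sum_{i=1}^{n} x_i^2(p) \psi'(r_{i,[p]}) - u_p^\top (S_p + \tau I)^{-1} u_p , $$
and
$$ N_p \triangleq \frac{1}{\sqrt{n}} \sum_{i=1}^{n} x_i(p) \psi(r_{i,[p]}) . $$
We consider
$$ b_p \triangleq \{ w_0(p) - \hat{w}(p) + \delta_0(p) \} \frac{\xi_n}{\tau + \xi_n} + \frac{1}{\sqrt{n}} \frac{N_p}{\tau + \xi_n} . $$
Note that when $\xi_n > 0$, we have
$$ b_p - \delta_0(p) = w_0(p) - \hat{w}(p) + \frac{n^{-1/2} N_p - \tau b_p}{\xi_n} . $$
We call
$$ \tilde{b} = \begin{pmatrix} \hat{\gamma} \\ \delta_0(p) - \hat{w}(p) + w_0(p) \end{pmatrix} + \{ b_p - \delta_0(p) + \hat{w}(p) - w_0(p) \} \begin{pmatrix} -(S_p + \tau I)^{-1} u_p \\ 1 \end{pmatrix} . $$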
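The two algebraic rearrangements of $b_p$ (the fixed-point form stated in Definition A1 and the centered form used in the proof of Proposition A6) follow from the definition by elementary algebra. The sketch below checks both on arbitrary scalar inputs; the numerical values are placeholders, not quantities computed from any data.

```python
import numpy as np

rng = np.random.default_rng(3)
n, tau = 50, 0.7                        # illustrative values (assumption)
xi_n = rng.uniform(0.1, 2.0)            # any strictly positive value
N_p = rng.standard_normal()
w0_p, w_hat_p, delta0_p = rng.standard_normal(3)

shift = w0_p - w_hat_p + delta0_p
# Definition: b_p = shift * xi_n / (tau + xi_n) + n^{-1/2} N_p / (tau + xi_n).
b_p = shift * xi_n / (tau + xi_n) + N_p / np.sqrt(n) / (tau + xi_n)

# Fixed-point form: b_p - delta0(p) = w0(p) - w_hat(p) + (n^{-1/2} N_p - tau b_p) / xi_n.
gap_fixed_point = abs((b_p - delta0_p)
                      - (w0_p - w_hat_p + (N_p / np.sqrt(n) - tau * b_p) / xi_n))

# Centered form: b_p - shift = n^{-1/2} N_p / (tau + xi_n) - tau / (tau + xi_n) * shift.
gap_centered = abs((b_p - shift)
                   - (N_p / np.sqrt(n) / (tau + xi_n) - tau / (tau + xi_n) * shift))
print(gap_fixed_point, gap_centered)
```

Both gaps are zero up to rounding, which reflects that the forms are rearrangements of $(\tau + \xi_n) b_p = n^{-1/2} N_p + \{w_0(p) - \hat{w}(p) + \delta_0(p)\} \xi_n$.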

Appendix E.4.1. Deterministic Aspects

Proposition A5. 
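We have
$$ \|\hat{\delta} - \tilde{b}\| \le \frac{1}{\tau} \, |b_p - \delta_0(p) + \hat{w}(p) - w_0(p)| \, \sup_{1 \le i \le n} |d_{i,p}| \, \|\hat{\Sigma}\|_2 \, \sqrt{ \|(S_p + \tau I)^{-1} u_p\|^2 + 1 } , $$
where $d_{i,p} = [\psi'(\gamma_{i,p}) - \psi'(r_{i,[p]})]$ and $\gamma_{i,p}$ is in the interval $\left( \epsilon_i + v_i^\top (w_{0,-p} - \hat{w}_{-p}) - v_i^\top (\hat{\gamma} - \gamma_0), \ \epsilon_i + x_i^\top (w_0 - \hat{w}) - x_i^\top (\tilde{b} - \delta_0) \right)$. Furthermore,
$$ \|(S_p + \tau I)^{-1} u_p\|^2 \le \frac{1}{n \tau} \sum_{i=1}^{n} x_i^2(p) \psi'(r_{i,[p]}) = \frac{1}{n \tau} \sum_{i=1}^{n} \lambda_i^2 \psi'(r_{i,[p]}) X_i^2(p) . $$
As in Lemma A2, we have
$$ \|\hat{\delta} - \tilde{b}\| \le \frac{1}{\tau} \|f(\tilde{b})\| , $$
where
$$ f(\tilde{b}) = -\frac{1}{n} \sum_{i=1}^{n} x_i \psi\{\epsilon_i + x_i^\top (w_0 - \hat{w}) - x_i^\top (\tilde{b} - \delta_0)\} + \tau \tilde{b} . $$
We note furthermore that, by definition of $\hat{\gamma}$,
$$ g(\hat{\gamma}) = -\frac{1}{n} \sum_{i=1}^{n} v_i \psi\{\epsilon_i + v_i^\top (w_{0,-p} - \hat{w}_{-p}) - v_i^\top (\hat{\gamma} - \gamma_0)\} + \tau \hat{\gamma} = 0_{p-1} . $$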
Proof. 
i. Work on the first p 1 coordinates of f ( b ˜ )
Denote f p 1 ( δ ) the first p 1 coordinates of f ( δ ) . Denote γ ^ e x t the p-dimensional vector whose first p 1 coordinates are γ ^ and last coordinate is δ 0 ( p ) , i.e.,
γ ^ e x t = γ ^ δ 0 ( p ) w ^ ( p ) + w 0 ( p ) .
For a vector v , we use the notation v k to denote the vector obtained by removing the k-th coordinate of v .
Note that
f p 1 ( b ˜ ) = f p 1 ( b ˜ ) g ( γ ^ ) = 1 n i = 1 n v i ψ { ϵ i + x i ( w 0 w ^ ) + x i ( δ 0 b ˜ ) } ψ { ϵ i + v i ( w 0 , p w ^ p ) + v i ( γ 0 γ ^ ) } + τ ( b ˜ p γ ^ ) .
By the mean value theorem, for γ i , p in the interval ( ϵ i + v i ( w 0 w ^ ) v i ( w 0 , p w ^ p ) , ϵ i + x i ( w 0 w ^ ) x i ( b ˜ δ 0 ) ) , we have
ψ { ϵ i + x i ( w 0 w ^ ) + x i ( δ 0 b ˜ ) } ψ { ϵ i + v i ( w 0 , p w ^ p ) + v i ( γ 0 γ ^ ) } = ψ ( γ i , p ) x i ( γ ^ e x t b ˜ ) = ψ ( r i , [ p ] ) x i ( γ ^ e x t b ˜ ) + { ψ ( γ i , p ) ψ ( r i , [ p ] ) } x i ( γ ^ e x t b ˜ ) .
Denote
d i , p = ψ ( γ i , p ) ψ ( r i , [ p ] ) , δ i , p = { ψ ( γ i , p ) ψ ( r i , [ p ] ) } x i ( γ ^ e x t b ˜ ) , R p = 1 n i = 1 n v i { ψ ( γ i , p ) ψ ( r i , [ p ] ) } x i ( γ ^ e x t b ˜ ) .
Therefore, we have
f p 1 ( b ˜ ) = 1 n i = 1 n ψ ( r i , [ p ] ) v i x i ( γ ^ e x t b ˜ ) + τ ( b ˜ p γ ^ ) + R p A p + R p .
Note that by definition,
γ ^ e x t b ˜ = { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } ( S p + τ I ) 1 u p 1 , b ˜ p γ ^ = { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } ( S p + τ I ) 1 u p .
Therefore, we have x i ( γ ^ e x t b ˜ ) = { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } { v i ( S p + τ I ) 1 u p x i ( p ) } , and
A p = { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } 1 n i = 1 n ψ ( r i , [ p ] ) v i v i ( S p + τ I ) 1 u p x i ( p ) + τ ( S p + τ I ) 1 u p .
By the definition of S p and u p , we have
A p = { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } { S p ( S p + τ I ) 1 u p u p + τ ( S p + τ I ) 1 u p } = 0 p 1 ,
since S p ( S p + τ I ) 1 + τ ( S p + τ I ) 1 = I .
Therefore, we conclude that
f p 1 ( b ˜ ) = R p .
ii. Work on the last coordinate of f ( b ˜ )
Denote [ f ( δ ˜ ) ] p the last coordinate of f ( δ ˜ ) . We have shown that
ψ { ϵ i + x i ( w 0 w ^ ) + x i ( δ 0 b ˜ ) } ψ { ϵ i + v i ( w 0 , p w ^ p ) + v i ( γ 0 γ ^ ) } = ψ ( r i , [ p ] ) x i ( γ ^ e x t b ˜ ) + { ψ ( γ i , p ) ψ ( r i , [ p ] ) } x i ( γ ^ e x t b ˜ ) .
Recall that
r i , [ p ] = ϵ i + v i ( w 0 , p w ^ p ) v i ( γ ^ γ 0 ) , δ i , p = { ψ ( γ i , p ) ψ ( r i , [ p ] ) } x i ( γ ^ e x t b ˜ ) .
We note that
ψ { ϵ i + x i ( w 0 w ^ ) + x i ( δ 0 b ˜ ) } = ψ ( r i , [ p ] ) + ψ ( r i , [ p ] ) x i ( γ ^ e x t b ˜ ) + δ i , p = ψ ( r i , [ p ] ) + ψ ( r i , [ p ] ) { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } { v i ( S p + τ I ) 1 u p x i ( p ) } + δ i , p .
Therefore, we have
[ f ( δ ˜ ) ] p + 1 n i = 1 n x i ( p ) δ i , p = 1 n i = 1 n x i ( p ) ψ ( r i , [ p ] ) + ψ ( r i , [ p ] ) { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } { v i ( S p + τ I ) 1 u p x i ( p ) } + τ b ˜ ( p ) , = 1 n i = 1 n x i ( p ) ψ ( r i , [ p ] ) { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } u p ( S p + τ I ) 1 u p + { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } 1 n i = 1 n ψ ( r i , [ p ] ) x i 2 ( p ) + τ b p , = 1 n i = 1 n x i ( p ) ψ ( r i , [ p ] ) τ b p + { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } 1 n i = 1 n ψ ( r i , [ p ] ) x i 2 ( p ) u p ( S p + τ I ) 1 u p , = 1 n N p τ b p + { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } ξ n , = 0 .
We conclude that
[ f ( δ ˜ ) ] p = 1 n i = 1 n x i ( p ) δ i , p = 1 n i = 1 n x i ( p ) { ψ ( γ i , p ) ψ ( r i , [ p ] ) } x i ( γ ^ e x t b ˜ ) .
iii. Representation of f ( b ˜ )
Aggregating all the results we have obtained so far, we see that
f ( b ˜ ) = 1 n i = 1 n d i , p x i x i ( γ ^ e x t b ˜ ) = { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } 1 n i = 1 n d i , p x i x i ( S p + τ I ) 1 u p 1 ,
which implies (A12).
For $\|(S_p + \tau I)^{-1} u_p\|^2$, denote by $D_{\psi'(r_{\cdot,[p]})}$ the diagonal matrix with $(i, i)$ entry $\psi'(r_{i,[p]})$. We have
$$ u_p = \frac{1}{n} V^\top D_{\psi'(r_{\cdot,[p]})} X(p) . $$
Therefore, writing $D = D_{\psi'(r_{\cdot,[p]})}$ for brevity,
$$ \|(S_p + \tau I)^{-1} u_p\|^2 = \frac{X(p)^\top}{\sqrt{n}} D^{1/2} \left[ \frac{D^{1/2} V}{\sqrt{n}} \left( \frac{V^\top D V}{n} + \tau I \right)^{-2} \frac{V^\top D^{1/2}}{\sqrt{n}} \right] D^{1/2} \frac{X(p)}{\sqrt{n}} . $$
Note that, since $s / (s + \tau)^2 \le 1/\tau$ for every $s \ge 0$,
$$ \frac{D^{1/2} V}{\sqrt{n}} \left( \frac{V^\top D V}{n} + \tau I \right)^{-2} \frac{V^\top D^{1/2}}{\sqrt{n}} \preceq \frac{1}{\tau} I . $$
So we have
$$ \|(S_p + \tau I)^{-1} u_p\|^2 \le \frac{1}{n \tau} X(p)^\top D_{\psi'(r_{\cdot,[p]})} X(p) = \frac{1}{n \tau} \sum_{i=1}^{n} x_i^2(p) \psi'(r_{i,[p]}) . $$
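The bound on $\|(S_p + \tau I)^{-1} u_p\|^2$ stated in Proposition A5, and the cancellation $S_p (S_p + \tau I)^{-1} u_p - u_p + \tau (S_p + \tau I)^{-1} u_p = 0$ used earlier in the proof, are both deterministic and can be checked on simulated inputs. The sketch below assumes Gaussian predictors, $\psi = \tanh$, and placeholder residuals `r` standing in for $r_{i,[p]}$; these choices are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, tau = 80, 40, 0.3                 # illustrative sizes (assumption)

X = rng.standard_normal((n, p))
V, x_last = X[:, :-1], X[:, -1]         # first p-1 predictors and the p-th predictor
r = rng.standard_normal(n)              # placeholder for the residuals r_{i,[p]}
d = 1.0 - np.tanh(r) ** 2               # psi'(r) for psi = tanh, always non-negative

u_p = V.T @ (d * x_last) / n            # (1/n) sum_i psi'(r_i) v_i x_i(p)
S_p = (V.T * d) @ V / n                 # (1/n) sum_i psi'(r_i) v_i v_i'
z = np.linalg.solve(S_p + tau * np.eye(p - 1), u_p)   # (S_p + tau I)^{-1} u_p

lhs = z @ z                                           # squared norm of z
rhs = np.sum(x_last ** 2 * d) / (n * tau)             # (1/(n tau)) sum_i x_i(p)^2 psi'(r_i)
cancel_gap = np.linalg.norm(S_p @ z - u_p + tau * z)  # should vanish up to rounding
print(lhs, rhs, cancel_gap)
```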

Appendix E.4.2. Stochastic Aspects

Assume that X ( p ) = ( X 1 ( p ) , , X n ( p ) ) is independent of { V i , ϵ i } i = 1 n , where V i = V i / λ i . This is consistent with Assumption P1.
To bound δ ^ b ˜ , using Equation (A13) we have
( S p + τ I ) 1 u p 2 1 n τ i = 1 n λ i 2 ψ X i 2 ( p ) ,
and
( S p + τ I ) 1 u p 2 sup i ψ τ 1 n i = 1 n λ i 2 X i 2 ( p ) .
Therefore, under Assumptions O3, O4 and O6, we have for any fixed k and at fixed τ ,
( S p + τ I ) 1 u p 2 = O L k ( polyLog ( n ) ) .
It guarantees that
( S p + τ I ) 1 u p 1 2 ( 1 + ( S p + τ I ) 1 u p 2 ) = O L k ( polyLog ( n ) ) .
Thus
δ ^ b ˜ = O L k 1 τ polyLog ( n ) | b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) | sup 1 i n | d i , p | ^ 2 .
Recall that Lemma 3.38 from [21] guarantees that ^ = O L k ( polyLog ( n ) ) under assumption O1–O7. Also we will show in Proposition A6 that | b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) | = O L k { polyLog ( n ) ( n 1 / 2 n e ) } and in Proposition A8 that sup 1 i n | d i , p | = { polyLog ( n ) ( n α 1 / 2 n α e ) } to show that M 1 is small.
  • On b p δ 0 ( p )
Recall the notations
N p = 1 n i = 1 n x i ( p ) ψ ( r i , [ p ] ) = 1 n i = 1 n λ i ψ ( r i , [ p ] ) X i ( p ) , ξ n = 1 n i = 1 n x i 2 ( p ) ψ ( r i , [ p ] ) u p ( S p + τ I ) 1 u p .
Under our assumptions, we have E ( X i ) = 0 p and cov ( X i ) = I p , and X ( p ) is independent of { r i , [ p ] } i = 1 n .
Proposition A6. 
We have
$$|b_p - \delta_0(p) + \hat{w}(p) - w_0(p)| \le \frac{1}{\sqrt{n}\,\tau}\,|N_p| + \|\delta_0\|_\infty + |\hat{w}(p) - w_0(p)| .$$
Furthermore, under Assumptions O1–O7 and P1, $N_p = O_{L_k}(\mathrm{polyLog}(n))$ and therefore, when $\tau$ is fixed,
$$|b_p - \delta_0(p) + \hat{w}(p) - w_0(p)| = O_{L_k}\big(\mathrm{polyLog}(n)\,n^{-1/2} + \|\delta_0\|_\infty + |\hat{w}(p) - w_0(p)|\big).$$
Note that Lemma A1 has shown that
$$|\hat{w}(p) - w_0(p)| = O_{L_k}\!\left(\frac{\mathrm{polyLog}(n)}{\sqrt{n}\wedge n^{e}}\right).$$
Under Assumption P3, n δ 0 2 polyLog ( n ) n 2 α 1 / 2 0 , and therefore we have
$$|b_p - \delta_0(p) + \hat{w}(p) - w_0(p)| = O_{L_k}\!\left(\frac{\mathrm{polyLog}(n)}{\sqrt{n}\wedge n^{e}}\right).$$
Proof. 
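From the definition of $b_p$, we have, when $\xi_n > 0$,
$$b_p - \delta_0(p) + \hat{w}(p) - w_0(p) = \frac{1}{\sqrt{n}}\,\frac{N_p}{\tau+\xi_n} - \frac{\tau}{\tau+\xi_n}\,\{\delta_0(p) + w_0(p) - \hat{w}(p)\}.$$
We will see later, in Lemma A7, that $\xi_n \ge 0$. Since then $\tau/(\tau+\xi_n) \le 1$ and $1/(\tau+\xi_n) \le 1/\tau$, it follows that
$$|b_p - \delta_0(p) + \hat{w}(p) - w_0(p)| \le \frac{1}{\sqrt{n}\,\tau}\,|N_p| + |\delta_0(p) + w_0(p) - \hat{w}(p)| .$$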
Using independence of X ( p ) and { V i , ϵ i } i = 1 n , and the fact that E ( X i ) = 0 p , we have
$$\mathrm{E}(N_p^2) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{E}\{X_i^2(p)\}\,\mathrm{E}\{\lambda_i^2\,\psi^2(r_{i,[p]})\},$$
whether the right-hand side is finite or not. Using the bounds on max λ i 2 and sup i ψ , we have
E ( N p 2 ) 1 n i = 1 n E { X i 2 ( p ) } ψ E { λ i 2 } = O ( 1 ) = O ( polyLog ( n ) ) .
Similarly, it is clear that
N p = O L k ( polyLog ( n ) ) .
Therefore, we have
| b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) | = 1 n τ O L k ( polyLog ( n ) ) + sup 1 k p | δ 0 ( k ) | + | w ^ ( p ) w 0 ( p ) | .
  • On ξ n
Write $\xi_n$ in matrix form. Let $D_{\psi'(r_{\cdot,[p]})}$ be the diagonal matrix with $(i,i)$ entry $\psi'(r_{i,[p]})$. Denote by $X(p)$ the last column of the design matrix $X$. Then we have
$$\xi_n = \frac{1}{n}\,X(p)^\top D_{\psi'(r_{\cdot,[p]})}^{1/2}\, M\, D_{\psi'(r_{\cdot,[p]})}^{1/2}\, X(p),$$
where
$$M = I_n - D_{\psi'(r_{\cdot,[p]})}^{1/2}\,\frac{V}{\sqrt{n}}\left(\frac{1}{n}V^\top D_{\psi'(r_{\cdot,[p]})} V + \tau I\right)^{-1}\frac{V^\top}{\sqrt{n}}\, D_{\psi'(r_{\cdot,[p]})}^{1/2} .$$
Lemma A7. 
We have
$$\xi_n \ge 0 .$$
Furthermore, under Assumptions  O1–O7   and  P1, if D λ i is the diagonal matrix with ( i , i ) entry λ i ,
| ξ n 1 n tr D λ i D ψ ( r · , [ p ] ) 1 / 2 M D ψ ( r · , [ p ] ) 1 / 2 D λ i | = O L k sup 1 i n λ i 2 ψ ( r i , [ p ] ) / n c n .
Proof. 
When $\tau > 0$, all the eigenvalues of $M$ are positive. Indeed, if the singular values of $n^{-1/2} D_{\psi'(r_{\cdot,[p]})}^{1/2} V$ are denoted by $\sigma_i$, the eigenvalues of $M$ are $\tau/(\sigma_i^2+\tau)$, with any remaining eigenvalues equal to 1.
Therefore, since $\xi_n = n^{-1} v^\top M v$ with $v = D_{\psi'(r_{\cdot,[p]})}^{1/2} X(p)$, we have $\xi_n \ge 0$.
Since $M$ is symmetric and has eigenvalues between 0 and 1, using Lemma V.1.5 in [36], we have
$$0 \preceq D_{\psi'(r_{\cdot,[p]})}^{1/2}\, M\, D_{\psi'(r_{\cdot,[p]})}^{1/2} \preceq D_{\psi'(r_{\cdot,[p]})} .$$
Under Assumption P1, the matrix M is independent of X ( p ) . D ψ ( r · , [ p ] ) is also independent of X ( p ) . By definition, X ( p ) = D λ i X ( p ) .
Under Assumption P1, $X(p)$ satisfies the necessary concentration assumptions. Using Lemma 3.37 in [21], we have
| 1 n X ( p ) D ψ ( r · , [ p ] ) 1 / 2 M D ψ ( r · , [ p ] ) 1 / 2 X ( p ) 1 n tr D λ i D ψ ( r · , [ p ] ) 1 / 2 M D ψ ( r · , [ p ] ) 1 / 2 D λ i | = O L k 1 n c n sup 1 i n λ i 2 ψ ( r i , [ p ] ) .
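The eigenvalue claim used in the proof of Lemma A7 can be checked numerically. The sketch below is only an illustration under assumed inputs: the Gaussian matrix `V`, the positive weights `d` (standing in for $\psi'(r_{i,[p]})$) and the vector `x` (standing in for $X(p)$) are hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, tau = 30, 8, 0.7                  # q plays the role of p - 1
V = rng.standard_normal((n, q))         # hypothetical design block
d = rng.uniform(0.1, 2.0, size=n)       # stand-in for psi'(r_{i,[p]}) >= 0
x = rng.standard_normal(n)              # stand-in for X(p)

# W = n^{-1/2} D^{1/2} V, so that M = I_n - W (W^T W + tau I)^{-1} W^T
W = np.sqrt(d)[:, None] * V / np.sqrt(n)
M = np.eye(n) - W @ np.linalg.solve(W.T @ W + tau * np.eye(q), W.T)

# Eigenvalues predicted by the proof: tau / (sigma_i^2 + tau), plus n - q ones
sigma = np.linalg.svd(W, compute_uv=False)
predicted = np.sort(np.concatenate([tau / (sigma**2 + tau), np.ones(n - q)]))
actual = np.sort(np.linalg.eigvalsh(M))

# xi_n = v^T M v / n with v = D^{1/2} x is a quadratic form in a PSD matrix
v = np.sqrt(d) * x
xi_n = v @ M @ v / n
```

Because every eigenvalue lies in $(0, 1]$, the quadratic form `xi_n` is non-negative and bounded by `x @ (d * x) / n`, which mirrors the two-sided matrix inequality in the proof.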
  • About 1 n tr D λ i D ψ ( r · , [ p ] ) 1 / 2 M D ψ ( r · , [ p ] ) 1 / 2 D λ i
Lemma A8. 
Denote
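$$S_p = \frac{1}{n}\sum_{i=1}^{n}\psi'(r_{i,[p]})\,v_i v_i^\top, \quad\text{and}\quad S_p(i) = S_p - \frac{1}{n}\,\psi'(r_{i,[p]})\,v_i v_i^\top .$$
Denote
$$c_{\tau,p} = \frac{1}{n}\,\mathrm{tr}\{(S_p+\tau I)^{-1}\}, \qquad \zeta_i = \frac{1}{n}\,v_i^\top\{S_p(i)+\tau I\}^{-1}v_i - \lambda_i^2\,c_{\tau,p} .$$
Then we have under Assumptions O1–O7 and P1, if $M$ is the matrix defined in Lemma A7,
$$\left|\frac{1}{n}\,\mathrm{tr}(I_n - M) - \frac{1}{n}\,\mathrm{tr}\!\left(D_{\lambda_i} D_{\psi'(r_{\cdot,[p]})}^{1/2}\, M\, D_{\psi'(r_{\cdot,[p]})}^{1/2} D_{\lambda_i}\right) c_{\tau,p}\right| \le \sup_i|\zeta_i| \cdot \frac{1}{n}\sum_{i=1}^{n}\psi'(r_{i,[p]}) .$$
We also have
$$\frac{1}{n}\,\mathrm{tr}(I_n - M) = \frac{p-1}{n} - \tau\,c_{\tau,p} .$$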
Proof. 
Denote $d_{i,i} = \psi'(r_{i,[p]})/n$. By using the Sherman–Morrison–Woodbury formula (see, e.g., [37], p. 19),
$$M_{i,i} = 1 - d_{i,i}\,v_i^\top\!\left(V^\top D_{\psi'(r_{\cdot,[p]})}V/n + \tau I\right)^{-1}\! v_i = 1 - \frac{d_{i,i}\,v_i^\top\{S_p(i)+\tau I\}^{-1}v_i}{1 + d_{i,i}\,v_i^\top\{S_p(i)+\tau I\}^{-1}v_i} = \frac{1}{1 + d_{i,i}\,v_i^\top\{S_p(i)+\tau I\}^{-1}v_i} .$$
Recall that we are interested in
$$\frac{1}{n}\sum_i \lambda_i^2\,\psi'(r_{i,[p]})\,M_{i,i} = \frac{1}{n}\,\mathrm{tr}\!\left(D_{\lambda_i} D_{\psi'(r_{\cdot,[p]})}^{1/2}\, M\, D_{\psi'(r_{\cdot,[p]})}^{1/2} D_{\lambda_i}\right).$$
By the property of trace, we have
$$\mathrm{tr}(I_n - M) = \mathrm{tr}\{(S_p+\tau I)^{-1}S_p\} = p - 1 - \tau\,\mathrm{tr}\{(S_p+\tau I)^{-1}\} = p - 1 - n\tau\,c_{\tau,p} .$$
This shows the second result of the lemma.
The first result follows from the fact that
$$\mathrm{tr}(I_n - M) = \sum_i (1 - M_{i,i}) = \sum_i \frac{d_{i,i}\,v_i^\top\{S_p(i)+\tau I\}^{-1}v_i}{1 + d_{i,i}\,v_i^\top\{S_p(i)+\tau I\}^{-1}v_i} .$$
With our definitions, we have, since $\lambda_i^2 c_{\tau,p} + \zeta_i = n^{-1} v_i^\top (S_p(i)+\tau I)^{-1} v_i$,
$$\frac{1}{n}\,\mathrm{tr}(I_n - M) = \left[\frac{1}{n}\sum_i \lambda_i^2\,\psi'(r_{i,[p]})\,M_{i,i}\right] c_{\tau,p} + \frac{1}{n}\sum_i \frac{\psi'(r_{i,[p]})\,\zeta_i}{1 + d_{i,i}\,v_i^\top\{S_p(i)+\tau I\}^{-1}v_i} .$$
It immediately follows that
$$\left|\frac{1}{n}\,\mathrm{tr}(I_n - M) - \left[\frac{1}{n}\sum_i \lambda_i^2\,\psi'(r_{i,[p]})\,M_{i,i}\right] c_{\tau,p}\right| \le \sup_i|\zeta_i| \cdot \frac{1}{n}\sum_{i=1}^{n}\psi'(r_{i,[p]}) .$$
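The two algebraic identities behind Lemma A8 (the trace identity and the leave-one-out form of $M_{i,i}$) can be verified on a small example. This is a sketch with hypothetical inputs: `V` is a random matrix whose rows play the role of $v_i$, and `psi_prime` is an arbitrary positive vector standing in for $\psi'(r_{i,[p]})$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, tau = 25, 6, 0.5                        # q plays the role of p - 1
V = rng.standard_normal((n, q))               # rows are the v_i (hypothetical)
psi_prime = rng.uniform(0.1, 1.5, size=n)     # stand-in for psi'(r_{i,[p]})

S = (V.T * psi_prime) @ V / n                 # S_p = (1/n) sum psi'_i v_i v_i^T
R = np.linalg.inv(S + tau * np.eye(q))        # (S_p + tau I)^{-1}
W = np.sqrt(psi_prime)[:, None] * V / np.sqrt(n)
M = np.eye(n) - W @ R @ W.T

# Trace identity: tr(I_n - M) = (p - 1) - tau * tr{(S_p + tau I)^{-1}}
lhs_trace = np.trace(np.eye(n) - M)
rhs_trace = q - tau * np.trace(R)

# Leave-one-out form: M_ii = 1 / (1 + d_ii v_i^T (S_p(i) + tau I)^{-1} v_i)
d = psi_prime / n
loo_diag = np.empty(n)
for i in range(n):
    S_i = S - d[i] * np.outer(V[i], V[i])     # S_p(i): i-th term removed
    quad = V[i] @ np.linalg.solve(S_i + tau * np.eye(q), V[i])
    loo_diag[i] = 1.0 / (1.0 + d[i] * quad)
```

The leave-one-out form is what allows $M_{i,i}$ to be expressed through quantities that are (approximately) independent of the $i$-th observation.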
  • Controlling ζ i
Lemma A9. 
Suppose we can find { r i , [ p ] ( i ) } j i independent of ( λ i , V i ) and K n such that
sup i sup j i | ψ j ( r j , [ p ] ( i ) ) ψ j ( r j , [ p ] ) | K n .
Then
sup i | ζ i | = O L k 1 τ 2 K n ^ 2 + polyLog ( n ) τ n c n + 1 n τ polyLog ( n ) ,
provided that K n has 3 k uniformly bounded moments.
Proof. 
Denote
AM i , p = 1 n j i ψ ( r j , [ p ] ( i ) ) v j v j .
Then, using the fact that $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$, we have
( S p ( i ) + τ I ) 1 ( AM i , p + τ I ) 1 1 τ 2 K n ^ 2 ,
since n 1 i v i v i ^ 2 . In particular, we have
| 1 n v i ( S p ( i ) + τ I ) 1 v i 1 n v i ( AM i , p + τ I ) 1 v i | v i 2 n 1 τ 2 K n ^ 2 .
Since AM i , p is independent of ( λ i , V i ) , we can use Lemma 3.37 in [21] and since v = λ i V i , we have
sup 1 i n | 1 n v i ( AM i , p + τ I ) 1 v i λ i 2 n tr { ( AM i , p + τ I ) 1 } | = O L k polyLog ( n ) τ n c n sup 1 i n λ i 2 ,
by using the fact that λ max ( ( AM i , p + τ I ) 1 ) τ 1 .
Using the operator norm bound we gave above, we also have
| 1 n tr { ( AM i , p + τ I ) 1 } 1 n tr { ( S p ( i ) + τ I ) 1 } | 1 τ 2 K n ^ 2 p n .
We can now conclude that
sup 1 i n | 1 n v i ( S p ( i ) + τ I ) 1 v i λ i 2 n tr { ( S p ( i ) + τ I ) 1 } | = O 1 τ 2 K n ^ 2 sup 1 i n p n + v i 2 n + polyLog ( n ) τ n c n sup 1 i n λ i 2 1 .
So under O1 and O4, sup 1 i n v i 2 / n = O L k ( 1 ) and finally
sup 1 i n | 1 n v i ( S p ( i ) + τ I ) 1 v i λ i 2 n tr { ( S p ( i ) + τ I ) 1 } | = O L k 1 τ 2 K n ^ 2 + polyLog ( n ) τ n c n sup 1 i n λ i 2 1 .
Control of n 1 tr { ( S p ( i ) + τ I ) 1 } n 1 tr { ( S p + τ I ) 1 }
Using the Sherman–Morrison–Woodbury formula, we have
$$(S_p(i)+\tau I)^{-1} - (S_p+\tau I)^{-1} = \frac{\psi'(r_{i,[p]})}{n}\;\frac{(S_p(i)+\tau I)^{-1}v_i v_i^\top (S_p(i)+\tau I)^{-1}}{1 + \frac{\psi'(r_{i,[p]})}{n}\,v_i^\top (S_p(i)+\tau I)^{-1} v_i} .$$
Take the trace of both sides, and we have
$$0 \le \mathrm{tr}\{(S_p(i)+\tau I)^{-1}\} - \mathrm{tr}\{(S_p+\tau I)^{-1}\} \le \frac{1}{\tau},$$
since $v_i^\top (S_p(i)+\tau I)^{-2} v_i \le \tau^{-1}\, v_i^\top (S_p(i)+\tau I)^{-1} v_i$.
Therefore,
$$0 \le \frac{1}{n}\,\mathrm{tr}\{(S_p(i)+\tau I)^{-1}\} - \frac{1}{n}\,\mathrm{tr}\{(S_p+\tau I)^{-1}\} \le \frac{1}{n\tau} .$$
We can now conclude that
sup i | ζ i | = O L k 1 τ 2 K n ^ 2 + polyLog ( n ) τ n c n + 1 n τ sup 1 i n λ i 2 1 .
provided we can use Hölder’s inequality. This, in turn, requires $K_n$ to have $3k$ moments. □
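The rank-one trace bound used in this proof, namely that removing one observation changes $\mathrm{tr}\{(S_p+\tau I)^{-1}\}$ by at most $1/\tau$, is easy to confirm numerically. The following sketch uses hypothetical random inputs (`V` and `psi_prime` are not from the paper) and only illustrates the inequality.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q, tau = 40, 10, 0.3
V = rng.standard_normal((n, q))               # rows play the role of v_i
psi_prime = rng.uniform(0.0, 2.0, size=n)     # non-negative weights
I = np.eye(q)

S = (V.T * psi_prime) @ V / n
full_trace = np.trace(np.linalg.inv(S + tau * I))

# gap_i = tr{(S_p(i) + tau I)^{-1}} - tr{(S_p + tau I)^{-1}}
gaps = np.empty(n)
for i in range(n):
    S_i = S - psi_prime[i] / n * np.outer(V[i], V[i])
    gaps[i] = np.trace(np.linalg.inv(S_i + tau * I)) - full_trace
```

Dividing `gaps` by `n` gives the $1/(n\tau)$ bound on the difference of normalized traces.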
  • Control of K n
A natural choice for { r i , [ p ] ( i ) } j i defined in Lemma A9 is to use a leave-one-out estimator of γ ^ , where the i-th observation (and hence v i ) is removed. Hence, all the work performed in Theorem A1 can be used here.
Lemma A10. 
Suppose we use for r i , [ p ] ( i ) the residuals we would obtain by using a leave-one-out estimator of γ ^ , i.e., excluding ( v i , ϵ i ) from problem (A8).
With the notations of Lemma A9, we have under Assumptions  O1–O7   and  P1,
K n = O L k ( n 2 α 1 / 2 polyLog ( n ) ) .
In particular, for any fixed τ,
sup i | ζ i | = O L k ( n 2 α 1 / 2 polyLog ( n ) ) .
Proof. 
Denote δ n ( i ) random variables such that
sup j i | r j , [ p ] ( i ) r j , [ p ] | δ n ( i ) .
Applying Theorem A1 with R j = r j , [ p ] and r ˜ j , ( i ) = r j , [ p ] ( i ) , we have
sup i δ n ( i ) = O L k ( n 2 α 1 / 2 polyLog ( n ) ) .
The control of K n follows by using the fact that ψ is C n α -Lipschitz. □
Corollary A1. 
Recall that in (A6) and (A1)
c i = 1 n x i ( S i + τ I ) 1 x i , c τ = 1 n tr ( S + τ I ) 1 .
Then under Assumptions  O1–O7   and  P1, we have
sup i | c i λ i 2 c τ | = O L k ( n 2 α 1 / 2 polyLog ( n ) ) .
Proof. 
We have shown that
sup i | 1 n v i ( S p ( i ) + τ I ) 1 v i λ i 2 c τ , p | = O L k ( n 2 α 1 / 2 polyLog ( n ) ) .
Recall that
c τ = 1 n tr 1 n i = 1 n ψ ( R i ) x i x i + τ I 1 .
We see that c τ is analogous to c τ , p when we use all the data, rather than ( p 1 ) of them.
Indeed, c i in (A6) is defined, in the notation of the proof of Lemma A9 as an analog of n 1 v i ( AM i , p + τ I ) 1 v i , with the role of { r i , [ p ] ( i ) } j i being played by the residuals obtained by the leave-one-out estimator of γ ^ , excluding ( x i , y i ) from the problem. Lemma A9 in connection with Theorem A2 shows that sup i | n 1 v i ( AM i , p + τ I ) 1 v i λ i 2 c τ , p | = O L k ( n 2 α 1 / 2 polyLog ( n ) ) . Passing from the p 1 dimensional version of this result, i.e., Lemma A9, to the p-dimensional version gives the approximation stated in the corollary.
We therefore have
sup i | c i λ i 2 c τ | = O L k ( n 2 α 1 / 2 polyLog ( n ) ) .
  • Further results on ξ n and b p
Proposition A7. 
Under Assumptions  O1–O7   and  P1, we have
| c τ , p ( ξ n + τ ) p 1 n | = O L k polyLog ( n ) n 1 / 2 2 α .
Furthermore, under Assumptions  O1–O7   and  P1–P3, since δ 0 = O L k ( n e ) ,
p n 2 n E [ { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } 2 ] = 1 n i = 1 n E { c τ , p λ i ψ ( r i , [ p ] ) } 2 + n τ 2 { δ 0 ( p ) w ^ ( p ) + w 0 ( p ) } 2 E ( c τ , p 2 ) + o ( 1 ) .
Proof.  
For Equation (A21):
By Lemma A8, we have
$$\frac{p-1}{n} - \tau\,c_{\tau,p} = \frac{1}{n}\,\mathrm{tr}(I_n - M) \ge 0 .$$
The latter quantity was approximated in Lemma A8 by
1 n tr D λ i D ψ ( r · , [ p ] ) 1 / 2 M D ψ ( r · , [ p ] ) 1 / 2 D λ i c τ , p ,
in which the trace term approximates $\xi_n$ by Lemma A7. This gives the result of Equation (A21), by simply keeping track of the approximation errors we make at each step.
For Equation (A22):
Recall that by definition:
n ( τ + ξ n ) b p ξ n δ 0 ( p ) + ξ n { w ^ ( p ) w 0 ( p ) } = N p = 1 n i = 1 n λ i ψ ( r i , [ p ] ) X i ( p ) .
Therefore, we have
c τ , p n ( τ + ξ n ) { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } = 1 n i = 1 n c τ , p λ i ψ ( r i , [ p ] ) X i ( p ) c τ , p n τ { δ 0 ( p ) w ^ ( p ) + w 0 ( p ) } .
Note that c τ , p λ i ψ ( r i , [ p ] ) is independent of X i ( p ) , and X i ( p ) ’s are independent with mean zero and variance one. We have
E c τ , p 2 n ( τ + ξ n ) 2 { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } 2 = 1 n i = 1 n E { c τ , p λ i ψ ( r i , [ p ] ) } 2 + n τ 2 { δ 0 ( p ) w ^ ( p ) + w 0 ( p ) } 2 E ( c τ , p 2 ) .
Recall that Proposition A6 gives that
| b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) | 1 n τ | N p | + δ 0 + | w ^ ( p ) w 0 ( p ) | .
Under Assumption P3, n δ 0 2 polyLog ( n ) n 2 α 1 / 2 0 . Together with Equations (A21) and (A16), the LHS of (A23) can be written as
E c τ , p 2 ( τ + ξ n ) 2 n { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } 2 = E c τ , p ( τ + ξ n ) p 1 n + p 1 n 2 n { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } 2 = p n 2 n E [ { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } 2 ] + o ( 1 ) .
This implies Equation (A22). □
  • On d i , p
Recall that
$$d_{i,p} = \psi'(\gamma_{i,p}) - \psi'(r_{i,[p]}),$$
where $\gamma_{i,p}$ lies in the interval $(r_{i,[p]},\, r_{i,[p]} + \nu_i)$, with
$$\nu_i = \{b_p - \delta_0(p) + \hat{w}(p) - w_0(p)\}\; x_i^\top \begin{bmatrix} (S_p+\tau I)^{-1}u_p \\ -1 \end{bmatrix} = \{b_p - \delta_0(p) + \hat{w}(p) - w_0(p)\}\,\pi_i .$$
We have the following result.
Proposition A8. 
Under Assumptions  O1–O7   and  P1–P3, we have, at fixed τ,
sup i | d i , p | = O L k polyLog ( n ) n α n 1 / 2 n e .
Proof. 
Note that we can write
$$\pi_i = x_i^\top \begin{bmatrix} (S_p+\tau I)^{-1}u_p \\ -1 \end{bmatrix} = v_i^\top (S_p+\tau I)^{-1}u_p - x_i(p) .$$
Recall that u p = n 1 V D ψ ( r · , [ p ] ) X ( p ) . We can also write it as
u p = 1 n V D λ i 2 ψ ( r · , [ p ] ) X ( p ) .
Using the independence of X ( p ) with { ( V i , ϵ i ) } i = 1 n , and the concentration assumptions on X ( p ) in Assumption P1, according to Lemma 3.36 in [21], we have
sup i | v i ( S p + τ I ) 1 u p | = O L k polyLog ( n ) c n 1 / 2 sup i 1 n D λ i 2 ψ ( r · , [ p ] ) V ( S p + τ I ) 1 V i ,
where we view v i ( S p + τ I ) 1 u p as a linear form in V i . Note that we have absorbed the sup i | λ i | in the polyLog ( n ) term.
We can write
1 n D λ i 2 ψ ( r · , [ p ] ) V ( S p + τ I ) 1 V i = 1 n V i ( S p + τ I ) 1 V D λ i 2 ψ ( r · , [ p ] ) 2 V n ( S p + τ I ) 1 V i .
We notice that S p = n 1 V D λ i 2 ψ ( r · , [ p ] ) V . Hence,
V D λ i 2 ψ ( r · , [ p ] ) 2 V n D λ i 2 ψ ( r · , [ p ] ) 2 S p .
Therefore, we conclude that
1 n V i ( S p + τ I ) 1 V D λ i 2 ψ ( r · , [ p ] ) 2 V n ( S p + τ I ) 1 V i V i 2 n τ D λ i 2 ψ ( r · , [ p ] ) 2 = V i 2 n τ sup i λ i 2 ψ ( r i , [ p ] ) .
Note that sup i | x i ( p ) | = O L k ( polyLog ( n ) / c n ) under Assumption O4, O6 and P1, according to Appendix 7 in [21]. Therefore, we have
sup i | π i | = O L k polyLog ( n ) c n 1 / 2 1 + sup i λ i 2 ψ ( r i , [ p ] ) sup i V i 2 n τ = O L k polyLog ( n ) c n 1 / 2 1 + sup i λ i 2 ψ ( r i , [ p ] ) = O L k { polyLog ( n ) } .
Recall that | b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) | = O L k ( polyLog ( n ) n 1 / 2 + δ 0 + | w ^ ( p ) w 0 ( p ) | ) and by Lemma A1 we have
| w ^ ( p ) w 0 ( p ) | = O L k polyLog ( n ) n n e .
We have
sup i | ν i | = O L k polyLog ( n ) n n e .
Under our assumption that $\psi'$ is $Cn^{\alpha}$-Lipschitz, and following the argument in [21], we see that
sup i | d i , p | = O L k polyLog ( n ) n α n 1 / 2 n e .

Appendix E.4.3. Final Conclusions

Gathering all the results, we have the following Theorem.
Theorem A2. 
Under Assumptions  O1–O7   and  P1–P3, we have, for any fixed τ,
δ ^ b ˜ O L k polyLog ( n ) n α ( n 1 / 2 n e ) 2 .
In particular,
n ( δ ^ p b p ) = O L k polyLog ( n ) n α + 1 / 2 ( n 1 / 2 n e ) 2 , sup i | x i ( δ ^ b ˜ ) | = O L k polyLog ( n ) n α + 1 / 2 ( n 1 / 2 n e ) 2 , sup i | R i r i , [ p ] | = O L k polyLog ( n ) n n e polyLog ( n ) n α + 1 / 2 ( n 1 / 2 n e ) 2 .
Proof. 
The first two results are direct consequences of all our results, using the key bound on δ ^ b ˜ of Proposition A6.
The third result is a direct consequence of the fact that sup i X i / n 1 / 2 = O L k ( 1 ) , which was shown in the proof of Lemma A6.
The last result follows from the fact that R i r i , [ p ] = x i ( δ ^ b ˜ ) ν i . The result on ν i is given in the proof of Proposition A8. □
Recall Equation (A22)
p n 2 n E [ { b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } 2 ] = 1 n i = 1 n E { c τ , p λ i ψ ( r i , [ p ] ) } 2 + n τ 2 { δ 0 ( p ) w ^ ( p ) + w 0 ( p ) } 2 E ( c τ , p 2 ) + o ( 1 ) .
and Equation (A17)
| b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) | = O L k polyLog ( n ) n n e .
We have
p n 2 n E [ { δ ^ p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } 2 ] = p n 2 n E [ { δ ^ p b p + b p δ 0 ( p ) + w ^ ( p ) w 0 ( p ) } 2 ] = 1 n i = 1 n E { c τ , p λ i ψ ( r i , [ p ] ) } 2 + n τ 2 { δ 0 ( p ) w ^ ( p ) + w 0 ( p ) } 2 E ( c τ , p 2 ) + o ( A ) ,
where A = 1 n i = 1 n E { c τ , p λ i ψ ( r i , [ p ] ) } 2 + n τ 2 { δ 0 ( p ) w ^ ( p ) + w 0 ( p ) } 2 E ( c τ , p 2 ) .
We note that the index p in Equation (A22) and the previous theorem plays no particular role, and similar results hold when p is replaced by any k in the range 1 k p . Summing over all the coordinates, we have under Assumptions O1–O7 and P1–P4,
p n 2 E ( δ ^ δ 0 + w ^ w 0 2 ) = 1 n k = 1 p 1 n i = 1 n E { c τ , k λ i ψ ( r i , [ k ] ) } 2 + τ 2 k = 1 p { δ 0 ( k ) w ^ ( k ) + w 0 ( k ) } 2 E ( c τ , k 2 ) + o ( B ) ,
where B = 1 n k = 1 p 1 n i = 1 n E { c τ , k λ i ψ ( r i , [ k ] ) } 2 + τ 2 k = 1 p { δ 0 ( k ) w ^ ( k ) + w 0 ( k ) } 2 E ( c τ , k 2 ) .
Note that 0 c τ , k κ / τ , n 1 i = 1 n ψ 2 C from Assumption O3, and δ 0 2 = O ( 1 ) . We have B = O ( 1 ) . Therefore, we have
p n 2 E ( δ ^ δ 0 + w ^ w 0 2 ) = 1 n k = 1 p 1 n i = 1 n E { c τ , k λ i ψ ( r i , [ k ] ) } 2 + τ 2 k = 1 p { δ 0 ( k ) w ^ ( k ) + w 0 ( k ) } 2 E ( c τ , k 2 ) + o ( 1 ) .
  • On c τ , k and c τ
We now show that c τ , k and c τ are close to each other.
Proposition A9. 
We have, under Assumptions  O1–O7   and  P1–P3,
sup 1 k p | c τ , k c τ | = O L k polyLog ( n ) n α n n e polyLog ( n ) n 2 α + 1 / 2 ( n 1 / 2 n e ) 2 polyLog ( n ) n .
Of course, we also have 0 c τ p / ( n τ ) and 0 c τ , k p / ( n τ ) .
Proof. 
Recall that
S = 1 n i = 1 n ψ ( R i ) x i x i .
Denote Γ = n 1 i = 1 n ψ ( R i ) v v and a = n 1 i = 1 n ψ ( R i ) x i 2 ( p ) , we have
S = Γ v v a .
According to Lemma 3.40 in [21], we have, since c τ = n 1 tr { ( S + τ I p ) 1 } ,
| c τ 1 n tr { ( Γ + τ I p 1 ) 1 } | 1 n 1 + a / τ τ .
We also have
a = 1 n i = 1 n λ i 2 X i 2 ( p ) ψ ( R i ) polyLog ( n ) 1 n i = 1 n λ i 2 X i 2 ( p ) = O L k ( polyLog ( n ) ) .
Since ψ is C n α -Lipschitz and
sup i | R i r i , [ p ] | = O L k polyLog ( n ) n n e polyLog ( n ) n α + 1 / 2 ( n 1 / 2 n e ) 2 ,
we have
sup i | ψ ( R i ) ψ ( r i , [ p ] ) | = O L k polyLog ( n ) n α n n e polyLog ( n ) n 2 α + 1 / 2 ( n 1 / 2 n e ) 2 ,
Hence, using arguments similar to those in the proof of Lemma A9, we have
| 1 n tr { ( Γ + τ I p 1 ) 1 } 1 n tr { ( S p + τ I p ) 1 } | = O L k polyLog ( n ) n α n n e polyLog ( n ) n 2 α + 1 / 2 ( n 1 / 2 n e ) 2 .
Since c τ , k = n 1 tr { ( S p + τ I p ) 1 } , the result follows immediately.
We note that $p$ did not play a particular role in the proof and hence taking the sup over those indices only adds a $\mathrm{polyLog}(n)$ term. Hence the result holds for $\sup_{1\le k\le p}|c_{\tau,k} - c_\tau|$. □
We are now ready to prove the last proposition of this section, which will help us obtain the second equation of our system.
Proposition A10. 
p n 2 E ( δ ^ δ 0 + w ^ w 0 2 ) = p n 1 n i = 1 n E { c τ λ i ψ ( prox ( c τ λ i 2 ρ ) ( r ˜ i , ( i ) ) ) } 2 + τ 2 δ 0 + w 0 w ^ 2 E ( c τ 2 ) + o ( 1 ) .
Furthermore, when all λ i ’s are non-zero,
1 n i = 1 n E { c τ λ i ψ ( prox ( c τ λ i 2 ρ ) ( r ˜ i , ( i ) ) ) } 2 = 1 n i = 1 n E { r ˜ i , ( i ) prox ( c τ λ i 2 ρ ) ( r ˜ i , ( i ) ) } 2 λ i 2 .
Proof. 
In light of Proposition A9 and Assumption P3 which guarantees that δ 0 2 is uniformly bounded in p and n, we have
k = 1 p { δ 0 ( k ) w ^ ( k ) + w 0 ( k ) } 2 E ( c τ , k 2 ) = δ 0 + w 0 w ^ 2 E ( c τ 2 ) + o ( 1 ) .
Using Theorem A2 and the bound on ψ in Assumption O3, we have
1 p k = 1 p E { c τ , k λ i ψ ( r i , [ k ] ) } 2 = 1 p k = 1 p E { c τ , k λ i ψ ( R i ) } 2 + o ( 1 ) .
With the help of Proposition A9, we have
1 p k = 1 p E { c τ , k λ i ψ ( R i ) } 2 = 1 p k = 1 p E { c τ λ i ψ ( R i ) } 2 + o ( 1 ) = E { c τ λ i ψ ( R i ) } 2 + o ( 1 ) .
In light of Equation (A7), we have
1 n i = 1 n E { ( c τ λ i ψ ( R i ) ) 2 } = 1 n i = 1 n E [ c τ λ i ψ { prox ( c i ρ ) ( r ˜ i , ( i ) ) } ] 2 + o ( 1 ) .
Lemma A11 gives the computation of the derivative of prox ( c ρ ) ( x ) with respect to c, which allows us to bound the error | ψ { prox ( c i ρ ) ( x ) } ψ { prox ( c τ λ i 2 ρ ) ( x ) } | . In light of this, by using the fact that sup i | c i λ i 2 c τ | = O L k ( n 2 α 1 / 2 polyLog ( n ) ) in Corollary A1, we can re-express the previous equation as
1 n i = 1 n E { ( c τ λ i ψ ( R i ) ) 2 } = 1 n i = 1 n E [ c τ λ i ψ { prox ( c τ λ i 2 ρ ) ( r ˜ i , ( i ) ) } ] 2 + o ( 1 ) .
When all λ i ’s are non-zero, we have
1 n i = 1 n E { ( c τ λ i ψ ( R i ) ) 2 } = 1 n i = 1 n E [ c τ λ i 2 ψ { prox ( c τ λ i 2 ρ ) ( r ˜ i , ( i ) ) } ] 2 λ i 2 + o ( 1 ) .
Finally, since by definition,
$$\forall x \in \mathbb{R}, \quad x = \mathrm{prox}(c\rho)(x) + c\,\psi(\mathrm{prox}(c\rho)(x)),$$
we have
1 n i = 1 n E [ c τ λ i 2 ψ { prox ( c τ λ i 2 ρ ) ( r ˜ i , ( i ) ) } ] 2 λ i 2 = 1 n i = 1 n E [ r ˜ i , ( i ) prox ( c τ λ i 2 ρ ) ( r ˜ i , ( i ) ) ] 2 λ i 2 .
Lemma A11 
(El Karoui [21], Lemma 3.32). Suppose $x$ is a real number and $\rho$ is twice differentiable and convex. Then, for $c > 0$, we have
$$\frac{\partial}{\partial c}\,\mathrm{prox}(c\rho)(x) = -\frac{\psi(\mathrm{prox}(c\rho)(x))}{1 + c\,\psi'(\mathrm{prox}(c\rho)(x))},$$
and
$$\frac{\partial}{\partial c}\,\rho(\mathrm{prox}(c\rho)(x)) = -\frac{\psi^2(\mathrm{prox}(c\rho)(x))}{1 + c\,\psi'(\mathrm{prox}(c\rho)(x))} .$$
In particular, at fixed $x$, $\rho(\mathrm{prox}(c\rho)(x))$ is decreasing in $c$.
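The proximal identities above can be checked numerically for a concrete loss. The sketch below assumes $\rho(t) = \log\cosh(t)$, so that $\psi = \tanh$; this choice of loss, the bisection solver and the function names are illustrative assumptions and are not taken from the paper or from [21].

```python
import numpy as np

def psi(t):                       # psi = rho' for the assumed rho = log cosh
    return np.tanh(t)

def psi_prime(t):
    return 1.0 - np.tanh(t) ** 2

def rho(t):                       # log cosh(t), written in a stable form
    return np.logaddexp(t, -t) - np.log(2.0)

def prox(c, x, iters=200):
    """Solve y + c * psi(y) = x by bisection; the root lies between 0 and x."""
    lo, hi = min(0.0, x), max(0.0, x)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + c * psi(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c, x, h = 1.3, 2.0, 1e-5
y = prox(c, x)

# Defining identity: x = prox + c * psi(prox)
residual = y + c * psi(y) - x

# Central finite differences in c, compared with the closed forms of Lemma A11
dprox_dc = (prox(c + h, x) - prox(c - h, x)) / (2 * h)
drho_dc = (rho(prox(c + h, x)) - rho(prox(c - h, x))) / (2 * h)
dprox_formula = -psi(y) / (1.0 + c * psi_prime(y))
drho_formula = -psi(y) ** 2 / (1.0 + c * psi_prime(y))
```

Bisection is used instead of Newton's method because, for large `c`, Newton steps on this equation can overshoot and oscillate, whereas the map $y \mapsto y + c\,\psi(y)$ is increasing and always brackets the root between 0 and `x`.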

Appendix E.5. Last Steps of the Proof

Appendix E.5.1. On the Asymptotic Behavior of r ˜ i,(i)

We have the following result.
Lemma A12. 
Under Assumptions O1–O7 and P1–P4, as $n$ and $p$ tend to infinity, $\tilde{r}_{i,(i)} = \epsilon_i + x_i^\top(w_0 - \hat{w}) - x_i^\top(\hat{\delta}_{(i)} - \delta_0)$ behaves like $\epsilon_i + \lambda_i\sqrt{\mathrm{E}(\|\hat{\beta} - \beta_0\|^2)}\,Z_i$, where $Z_i \sim N(0,1)$ is independent of $\epsilon_i$ and $\lambda_i$, in the sense of weak convergence.
Furthermore, if i j , r ˜ i , ( i ) and r ˜ j , ( j ) are asymptotically (pairwise) independent. The same is true for ( r ˜ i , ( i ) , λ i ) and ( r ˜ j , ( j ) , λ j ) .
Proof. 
First part.
In this part, we will show that X i ( δ ^ ( i ) δ 0 + w ^ w 0 ) is asymptotically N ( 0 , E ( β ^ β 0 2 ) ) . Recall that δ ^ ( i ) is independent of X i and that E ( X i ) = 0 , cov ( X i ) = I and that, for any finite k, the first k moments of its entries are bounded uniformly in n.
We have shown in Proposition A4 that $\mathrm{var}(\|\hat{\beta} - \beta_0\|^2) \to 0$. In light of Lemma A3, we also know that $\mathrm{E}(\|\hat{\beta} - \beta_0\|^2)$ is uniformly bounded. Furthermore, in the proof of Proposition A4, we showed that $\mathrm{E}(\|\hat{\delta} + \hat{w} - \delta_0 - w_0\|^2 - \|\hat{\delta}_{(i)} + \hat{w} - \delta_0 - w_0\|^2) \to 0$.
We use a simple generalization of the standard Lindeberg-Feller theorem (see e.g., [38]). Indeed, if a n , p ( k ) are random variables with k = 1 p a n , p ( k ) 2 = A n , E ( A n 2 ) remains bounded in n, and a n , p ( k ) ’s are independent of X i , we see that: a) if z N ( 0 , I p ) , independent of a n , p ( k ) , then a n , p z A n N where N N ( 0 , 1 ) and independent of A n (conditionally and unconditionally on a n , p ); b) Theorem 2.1.5 and its proof in [38] hold provided that i = 1 n E ( | a n , p ( k ) | 3 ) = o ( 1 ) . The proof simply needs to be started conditionally on a n , p ( k ) , and the final moment bounds are then taken unconditionally. This very mild generalization gives, if ϕ is a C 3 function, with bounded 2nd and 3rd derivatives,
ϵ > 0 , | E { ϕ ( a n , p X i ) } E { ϕ ( A n N ) } | K ϵ ϕ ( 3 ) E k = 1 p a n , p ( k ) 2 + ϕ ( 2 ) ϵ k = 1 p E ( | a n , p ( k ) | 3 ) ,
where K is a constant depending on the second and third absolute moments of the entries of X i . It is therefore independent of n and p under our assumptions on X i .
We can apply this result to a n , p ( k ) = δ ^ ( i ) ( k ) δ 0 ( k ) + w ^ ( k ) w 0 ( k ) . Recall that we have shown that
$$\hat{\delta}(k) - \delta_0(k) + \hat{w}(k) - w_0(k) = O_{L_k}\!\left(\frac{\mathrm{polyLog}(n)}{n^{1/2}\wedge n^{e}}\right)$$
for each coordinate $k$. The same arguments apply also to $\hat{\delta}_{(i)}(k)$, the $k$th coordinate of $\hat{\delta}_{(i)}$. We have
$$\mathrm{E}\big(|\hat{\delta}_{(i)}(k) - \delta_0(k) + \hat{w}(k) - w_0(k)|^3\big) = O\!\left(\frac{\mathrm{polyLog}(n)}{(n^{1/2}\wedge n^{e})^3}\right).$$
Provided that $e > 1/3$, we have
$$\mathrm{E}\left(\sum_{k=1}^{p} |\hat{\delta}_{(i)}(k) - \delta_0(k) + \hat{w}(k) - w_0(k)|^3\right) = O\!\left(\frac{\mathrm{polyLog}(n)\, n}{(n^{1/2}\wedge n^{e})^3}\right) = o(1) .$$
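The role of the condition $e > 1/3$ can be read off from the exponent of $n$ in the last bound. The helper below is a hypothetical illustration (the function name is not from the paper); it ignores the $\mathrm{polyLog}(n)$ factor and treats $p$ as proportional to $n$.

```python
def third_moment_exponent(e):
    """Exponent of n in n / (n^{1/2} ∧ n^e)^3, ignoring polyLog(n) factors."""
    return 1.0 - 3.0 * min(0.5, e)

# The bound is o(1) exactly when the exponent is negative, i.e. when e > 1/3.
exponents = {e: third_moment_exponent(e) for e in (0.25, 1.0 / 3.0, 0.4, 0.9)}
```

For $e \ge 1/2$ the minimum is attained by $n^{1/2}$ and the exponent stays at $-1/2$, so a faster source rate does not improve this particular bound further.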
Therefore, in connection with Corollary 2.1.9 in [38], X i ( δ ^ ( i ) δ 0 + w ^ w 0 ) behaves asymptotically like δ ^ ( i ) δ 0 + w ^ w 0 N in the sense of weak convergence.
Since δ ^ ( i ) δ 0 + w ^ w 0 E δ ^ ( i ) δ 0 + w ^ w 0 0 in probability and the expectations are uniformly bounded, Slutsky’s lemma gives that
X i ( δ ^ ( i ) δ 0 + w ^ w 0 ) behaves like E ( δ ^ ( i ) δ 0 + w ^ w 0 ) N
asymptotically in the sense of weak convergence. Using the fact that E ( δ ^ δ 0 + w ^ w 0 2 δ ^ ( i ) δ 0 + w ^ w 0 2 ) 0 , another application of Slutsky’s lemma yields
X i ( δ ^ ( i ) δ 0 + w ^ w 0 ) behaves like E ( δ ^ δ 0 + w ^ w 0 ) N
in the sense of weak convergence.
We note that the same reasoning applies when replacing a n , p ( k ) = δ ^ ( i ) ( k ) δ 0 ( k ) + w ^ ( k ) w 0 ( k ) by a ˜ n , p ( k ) = λ i { δ ^ ( i ) ( k ) δ 0 ( k ) + w ^ ( k ) w 0 ( k ) } , provided that λ i has 3 moments. It shows that
X i ( δ ^ ( i ) δ 0 + w ^ w 0 ) behaves like λ i E ( β ^ β 0 ) N .
This shows the first part of the lemma, since E ( β ^ β 0 ) = E ( β ^ β 0 2 ) + o ( 1 ) by Proposition A4.
Second part.
For the second part, we use a leave-two-out approach. More precisely, we use the approximation
r ˜ i , ( i ) = ϵ i + x i ( w 0 w ^ ) x i ( δ ^ ( i ) δ 0 ) = ϵ i + x i ( w 0 w ^ ) x i ( δ ^ ( i j ) δ 0 ) + o L k ( 1 ) ,
and similarly for r ˜ j , ( j ) , which follows from Theorem A1. Here δ ^ ( i j ) is computed by solving Problem (5) without ( x i , y i ) and ( x j , y j ) . It is clear that r ˜ i , ( i ) and r ˜ j , ( j ) are asymptotically independent conditional on X ( i j ) , i.e., the design matrix without the i-th and j-th rows.
Similarly, we have
E k = 1 p | δ ^ ( i j ) ( k ) δ 0 ( k ) + w ^ ( k ) w 0 ( k ) | 3 = O polyLog ( n ) n ( n 1 / 2 n e ) 3 = o ( 1 ) .
Therefore, we also have
k = 1 p | δ ^ ( i j ) ( k ) δ 0 ( k ) + w ^ ( k ) w 0 ( k ) | 3 = o p ( 1 ) .
Note that δ ^ ( i j ) depends only on { X ( i j ) , ϵ ( i j ) } . We call P ( i j ) the joint probability measure P ( i j ) = k ( i , j ) P x k , ϵ k , i.e., probability computed with respect to all our random variables except ( x i , ϵ i ) and ( x j , ϵ j ) .
So we have found E ( i j ) n , which depends only on ( X ( i j ) , ϵ ( i j ) ) , such that P ( i j ) ( E ( i j ) n ) 1 and k = 1 p | δ ^ ( i j ) ( k ) δ 0 ( k ) + w ^ ( k ) w 0 ( k ) | 3 = o ( 1 ) when ( X ( i j ) , { ϵ k } k ( i , j ) ) E ( i j ) n . By treating a n , p ’s as deterministic quantities, the arguments we gave above then imply that, when ( X ( i j ) , ϵ ( i j ) ) E ( i j ) n ,
X i ( δ ^ ( i j ) δ 0 + w ^ w 0 ) | ( X ( i j ) , ϵ ( i j ) ) behaves like ( δ ^ ( i j ) δ 0 ) N .
We now use characteristic function arguments. Let α i = X i ( δ ^ ( i j ) δ 0 + w ^ w 0 ) and α j = X j ( δ ^ ( i j ) δ 0 + w ^ w 0 ) .
Let $(w_1, w_2) \in \mathbb{R}^2$ be fixed and
$$\chi(w_1, w_2) = \mathrm{E}\!\left[e^{i(w_1\alpha_i + w_2\alpha_j)}\right] = \mathrm{E}\!\left[e^{i(w_1\alpha_i + w_2\alpha_j)}\left(1_{E_{(ij)}^n} + 1_{[E_{(ij)}^n]^c}\right)\right].$$
Since P ( E ( i j ) n ) = P ( i j ) ( E ( i j ) n ) 1 , we can just focus on E { e i ( w 1 α i + w 2 α j ) 1 E ( i j ) n } , since the modulus of the functions we are integrating is bounded by 1.
Now, we have
E e i ( w 1 α i + w 2 α j ) 1 E ( i j ) n = E 1 E ( i j ) n E e i ( w 1 α i + w 2 α j ) | X ( i j ) , ϵ ( i j ) ,
since 1 E ( i j ) n is a deterministic function of ( X ( i j ) , ϵ ( i j ) ) . Independence of X i and X j gives
E e i ( w 1 α i + w 2 α j ) | X ( i j ) , ϵ ( i j ) = E e i w 1 α i | X ( i j ) , ϵ ( i j ) E e i w 2 α j | X ( i j ) , ϵ ( i j ) .
Also, the conditional Gaussian approximation established above implies that
1 E ( i j ) n E e i ( w 1 α i + w 2 α j ) | X ( i j ) , ϵ ( i j ) e ( w 1 2 / 2 + w 2 2 / 2 ) δ ^ ( i j ) δ 0 2 0
in P ( i j ) -probability.
So we conclude that
E 1 E ( i j ) n e i ( w 1 α i + w 2 α j ) E 1 E ( i j ) n e ( w 1 2 / 2 + w 2 2 / 2 ) δ ^ ( i j ) δ 0 2 0 .
Since P ( E ( i j ) n ) 1 and δ ^ ( i j ) δ 0 2 is asymptotically deterministic by arguments similar to those in the proof of Proposition A4, we have
E 1 E ( i j ) n e ( w 1 2 / 2 + w 2 2 / 2 ) δ ^ ( i j ) δ 0 2 e ( w 1 2 / 2 + w 2 2 / 2 ) E ( δ ^ ( i j ) δ 0 2 ) 0 .
Therefore,
E e i ( w 1 α i + w 2 α j ) E e i w 1 α i E e i w 2 α j 0 .
This shows that α i and α j are asymptotically independent. It easily follows that r ˜ i , ( i ) and r ˜ j , ( j ) are asymptotically independent.
The same reasoning applies to ( r ˜ i , ( i ) , λ i ) and ( r ˜ j , ( j ) , λ j ) , since δ ^ ( i j ) is independent of λ i and λ j under Assumption O6. □

Appendix E.5.2. On the Asymptotic Behavior of c τ

We now show that $c_\tau = n^{-1}\,\mathrm{tr}\{(S+\tau I_p)^{-1}\}$ is asymptotically deterministic. This requires several steps.
Lemma A13. 
We work under Assumptions O1–O7, P1–P4 and F2–F4. Consider the random function
$$g_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{1 + x\lambda_i^2\,\psi'\{\mathrm{prox}(x\lambda_i^2\rho)(\tilde{r}_{i,(i)})\}}, \quad\text{defined for } x \ge 0 .$$
Let B > 0 be in R + . We have, for any ( x , y ) R + 2 , and 0 x B , 0 y B ,
sup ( x , y ) : | x y | η , 0 x B , 0 y B | g n ( x ) g n ( y ) | η 1 n i = 1 n ( λ i 2 ψ + B L ( n ) λ i 4 ψ ) .
In particular, under Assumptions P2 and F3–F4, we have for C a constant independent of n and p,
Pr sup ( x , y ) : | x y | η , 0 x B , 0 y B | g n ( x ) g n ( y ) | > δ η δ C .
Hence, $g_n$ is stochastically equicontinuous on $[0, B]$ for any given $B > 0$.
We used the notation Pr above to denote outer probability and avoid a discussion of potential measure theoretic issues associated with taking a supremum over a noncountable collection of random variables (see e.g., van der Vaart [39] (Sect. 18.2)).
Proof. 
Consider the function defined for x 0 ,
$$h_u^{(i)}(x) = \frac{1}{1 + x\lambda_i^2\,\psi'\{\mathrm{prox}(x\lambda_i^2\rho)(u)\}} = \frac{\partial}{\partial u}\,\mathrm{prox}(x\lambda_i^2\rho)(u) .$$
The last equality comes from Lemma 3.33 from El Karoui [21].
Since $\psi'$ is non-negative,
u , | h u ( i ) ( x ) h u ( i ) ( y ) | | x λ i 2 ψ ( prox ( x λ i 2 ρ ) ( u ) ) y λ i 2 ψ ( prox ( y λ i 2 ρ ) ( u ) ) | 1 .
Thus, since x , y 0 , for all u,
| h u ( i ) ( x ) h u ( i ) ( y ) | λ i 2 | x y | ψ ( prox ( x λ i 2 ρ ) ( u ) ) + λ i 2 y | ψ ( prox ( x λ i 2 ρ ) ( u ) ) ψ ( prox ( y λ i 2 ρ ) ( u ) ) | .
In particular, if | x y | η , and x y B , with x , y 0 , for all u,
sup y : | x y | η ; x y B | h u ( i ) ( x ) h u ( i ) ( y ) | λ i 2 η ψ { prox ( x λ i 2 ρ ) ( u ) } + B λ i 2 sup y : | x y | η , x y B | ψ { prox ( x λ i 2 ρ ) ( u ) } ψ { prox ( y λ i 2 ρ ) ( u ) } | .
Under Assumption O3, ψ is L ( n ) -Lipschitz. Therefore, for x i = x λ i 2 , y i = y λ i 2 0 , we have
u , | ψ { prox ( x i ρ ) ( u ) } ψ { prox ( y i ρ ) ( u ) } | L ( n ) | prox ( x i ρ ) ( u ) prox ( y i ρ ) ( u ) | .
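Recall that according to Lemma A11,
$$\frac{\partial}{\partial x}\,\mathrm{prox}(x\rho)(u) = -\frac{\psi\{\mathrm{prox}(x\rho)(u)\}}{1 + x\,\psi'\{\mathrm{prox}(x\rho)(u)\}} .$$
Hence we have
$$\sup_x \left|\frac{\partial}{\partial x}\,\mathrm{prox}(x\rho)(u)\right| \le \|\psi\|_\infty .$$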
We conclude that
u , | ψ { prox ( x i ρ ) ( u ) } ψ { prox ( y i ρ ) ( u ) } | { L ( n ) ψ | x i y i | } 2 ψ .
We therefore have, for x y B and x , y 0 ,
u , sup y : | x y | η | h u ( i ) ( x ) h u ( i ) ( y ) | λ i 2 η ψ { prox ( x λ i 2 ρ ) ( u ) } + B λ i 4 L ( n ) ψ η .
Thus, for x , y 0 ,
u , sup ( x , y ) : | x y | η , x y B | h u ( i ) ( x ) h u ( i ) ( y ) | λ i 2 η ψ + B λ i 4 L ( n ) ψ η .
Since the right-hand side does not depend on u, we have
sup u sup ( x , y ) : | x y | η , x y B | h u ( i ) ( x ) h u ( i ) ( y ) | λ i 2 η ψ + B λ i 4 L ( n ) ψ η .
Naturally, g n ( x ) can be written as
g n ( x ) = 1 n i = 1 n h r ˜ i , ( i ) ( i ) ( x ) .
Therefore, for any x , y 0 ,
| g n ( x ) g n ( y ) | 1 n i = 1 n | h r ˜ i , ( i ) ( i ) ( x ) h r ˜ i , ( i ) ( i ) ( y ) | .
The bound we have obtained above on sup u | h u ( i ) ( x ) h u ( i ) ( y ) | when x and y are sufficiently close to one another can now be used. This shows that for x given, if x , y 0 , | x y | η , and x y B , we have
sup ( x , y ) : | x y | η , 0 x B , 0 y B | g n ( x ) g n ( y ) | η 1 n i = 1 n ( λ i 2 ψ + B L ( n ) λ i 4 ψ ) .
Under Assumptions P2 and F3–F4, all the terms on the right-hand side are bounded in L 1 . We can now take expectations and obtain the result in L 1 . □
Lemma A14. 
Let us call $G_n(x) = \mathrm{E}(g_n(x))$. Let $B > 0$ be given. For any given $x_0 \le B$,
$$g_n(x_0) - G_n(x_0) = o_{L_2}(1) .$$
Under Assumptions O1–O7, P1–P4 and F1–F5, we have
$$\mathrm{E}\left(\sup_{0 \le x \le B} |g_n(x) - G_n(x)|\right) \to 0 .$$
Proof. 
Under Assumptions F1 and F5, we can divide the index set $\{1, \dots, n\}$ into finitely many subsets $A_1, \dots, A_K$, within each of which the $(x_i, \epsilon_i)$, $i \in A_j$, play a symmetric role. Hence, $\mathrm{var}(g_n(x_0))$ can be expressed as a sum of variances and covariances of finitely many functions of finitely many random variables $(\lambda_i, \tilde{r}_{i,(i)})$: for those random variables, we just need to pick a representative in each subset $\{A_j\}_{j=1}^{K}$.
Since ψ is Lipschitz, g n is an average of bounded continuous functions of the random variables of interest to us.
Asymptotic pairwise independence of ( λ i , r ˜ i , ( i ) ) ’s implies that
var ( g n ( x 0 ) ) = o ( 1 ) .
and therefore gives the first result.
Now we pick ϵ > 0 . By the stochastic equicontinuity of g n and the bound in (A26), we can find x 1 , , x K , independent of n, such that for all x [ 0 , B ] , there exists l such that when n is large enough,
$$\mathrm{E}(|g_n(x) - g_n(x_l)|) \le \epsilon .$$
Notice that
$$|g_n(x) - G_n(x)| \le |g_n(x) - g_n(x_l)| + |g_n(x_l) - G_n(x_l)| + |G_n(x_l) - G_n(x)| .$$
We immediately get
$$\mathrm{E}\left(\sup_{0 \le x \le B} |g_n(x) - G_n(x)|\right) \le 2\epsilon + \mathrm{E}\left(\sup_{1 \le l \le K} |g_n(x_l) - G_n(x_l)|\right).$$
Because $K$ is finite, the fact that for all $l$, $g_n(x_l) - G_n(x_l) = o_{L_2}(1)$ implies that $\sup_{1 \le l \le K} |g_n(x_l) - G_n(x_l)| = o_{L_2}(1)$. In particular, if $n$ is sufficiently large, we have
$$\mathrm{E}\left(\sup_{1 \le l \le K} |g_n(x_l) - G_n(x_l)|\right) \le \epsilon .$$
This gives the result. □
Lemma A15. 
Assume O1–O7, P1–P4 and F1–F5. Recall that $c_\tau = n^{-1}\operatorname{tr}\{(S+\tau I_p)^{-1}\}$. Call as before
$$g_n(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{1+x\lambda_i^2\,\psi'\{\operatorname{prox}(x\lambda_i^2\rho)(\tilde r_{i,(i)})\}}.$$
Then $c_\tau$ is a near solution of
$$\frac{p}{n} - \tau x - 1 + g_n(x) = 0,$$
i.e., $p/n - \tau c_\tau - 1 + g_n(c_\tau) = o_{L_k}(1)$, when $3\alpha - 1/2 < 0$.
Asymptotically, near solutions of
$$\delta_n(x) \triangleq \frac{p}{n} - \tau x - 1 + g_n(x) = 0$$
are close to solutions of
$$\Delta_n(x) = \frac{p}{n} - \tau x - 1 + \mathbb{E}\{g_n(x)\} = 0.$$
More precisely, call $T_{n,\epsilon} = \{x : |\Delta_n(x)| \le \epsilon\}$. We note that $T_{n,\epsilon} \subseteq (0,\ p/(n\tau)+\epsilon/\tau)$. For any given $\epsilon$, as $n\to\infty$, near solutions of $\delta_n(x) = 0$ belong to $T_{n,\epsilon}$ with high probability.
Our assumptions concerning the possible distributions of the $\epsilon_i$'s, specifically Assumption F1, imply that as $n\to\infty$, there is a unique solution to $\Delta_n(x) = 0$.
Hence $c_\tau$ is asymptotically deterministic.
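To make the near-solution statement concrete, here is a minimal numerical sketch of locating a root of $\delta_n(x) = 0$ by bisection on $[0, p/(n\tau)]$. Everything specific in it is an assumption made for illustration and not taken from the paper: the loss is a pseudo-Huber loss with scale 1.35, all $\lambda_i$ are set to 1, the leave-one-out residuals $\tilde r_{i,(i)}$ are replaced by simulated standard Gaussian draws, and the function names are hypothetical.

```python
import math
import random

HUBER_SCALE = 1.35  # illustrative pseudo-Huber scale (assumption)

def psi(z):
    """psi = rho' for rho(z) = d^2 * (sqrt(1 + (z/d)^2) - 1), with d = HUBER_SCALE."""
    return z / math.sqrt(1.0 + (z / HUBER_SCALE) ** 2)

def psi_prime(z):
    """Second derivative rho''; it lies in (0, 1] for this loss."""
    return (1.0 + (z / HUBER_SCALE) ** 2) ** -1.5

def prox(c, t, iters=80):
    """prox(c*rho)(t): the unique z solving z + c*psi(z) = t, found by bisection."""
    lo, hi = min(0.0, t), max(0.0, t)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + c * psi(mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def g_n(x, resid, lam):
    """Empirical average g_n(x) as defined in Lemma A15."""
    total = 0.0
    for r, l in zip(resid, lam):
        c = x * l * l
        total += 1.0 / (1.0 + c * psi_prime(prox(c, r)))
    return total / len(resid)

def delta_n(x, resid, lam, p_over_n, tau):
    """delta_n(x) = p/n - tau*x - 1 + g_n(x)."""
    return p_over_n - tau * x - 1.0 + g_n(x, resid, lam)

def solve_delta_n(resid, lam, p_over_n, tau, iters=60):
    """Bisection on [0, p/(n*tau)]: delta_n(0) = p/n > 0 and delta_n(p/(n*tau)) <= 0."""
    lo, hi = 0.0, p_over_n / tau
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if delta_n(mid, resid, lam, p_over_n, tau) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(0)
n = 200
resid = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for the residuals
lam = [1.0] * n                                  # stand-in for the lambda_i
root = solve_delta_n(resid, lam, p_over_n=1.0, tau=1.0)
print(root, delta_n(root, resid, lam, 1.0, 1.0))
```

The bracket $[0, p/(n\tau)]$ reflects the bound $0 \le g_n(x) \le 1$ used in the proof; bisection only needs the sign change at the two endpoints, so it does not rely on $\delta_n$ being monotone.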
Proof. 
Note that $g_n(x) \le 1$.
Let $\delta_n(x)$ be the function
$$\delta_n(x) = \frac{p}{n} - \tau x - 1 + g_n(x),$$
and $\Delta_n(x) = \mathbb{E}\{\delta_n(x)\}$. Call $x_n$ a solution of $\delta_n(x_n) = 0$ and $x_{n,0}$ a solution of $\Delta_n(x_{n,0}) = 0$.
Since $0 \le g_n(x) \le 1$, we have $x_n \le p/(n\tau)$, for otherwise $\delta_n(x) < 0$. The same argument shows that if $x > (p/n+\epsilon)/\tau$, then $\delta_n(x) < -\epsilon$ and $x \notin T_{n,\epsilon}$. Similarly, near solutions of $\delta_n(x) = 0$ must be less than or equal to $(p/n+\epsilon)/\tau$.
  • Proof of the fact that $c_\tau$ is such that $\delta_n(c_\tau) = o(1)$
Recall that in the notation of Lemma A8, we have
$$\frac{p-1}{n} - \tau c_{\tau,p} = \frac{1}{n}\operatorname{tr}(I_n - M).$$
According to (A20),
$$\frac{1}{n}\operatorname{tr}(I_n - M) = 1 - \frac{1}{n}\sum_{i=1}^{n}\frac{1}{1+\psi'(r_{i,[p]})\,\frac{1}{n}v_i^{\top}(S_p(i)+\tau I_p)^{-1}v_i}.$$
According to Lemmas A8–A10, we have
$$\sup_i\Big|\frac{1}{n}v_i^{\top}(S_p(i)+\tau I_p)^{-1}v_i - \lambda_i^2 c_{\tau,p}\Big| = O_{L_k}\Big(\frac{\operatorname{polyLog}(n)}{n^{1/2-2\alpha}}\Big).$$
When $x \ge 0$ and $y \ge 0$, $|1/(1+x) - 1/(1+y)| \le |x-y| \wedge 1$. Hence, we have
$$\Big|\frac{1}{n}\sum_{i=1}^{n}\frac{1}{1+\psi'(r_{i,[p]})\,\frac{1}{n}v_i^{\top}(S_p(i)+\tau I_p)^{-1}v_i} - \frac{1}{n}\sum_{i=1}^{n}\frac{1}{1+\psi'(r_{i,[p]})\,\lambda_i^2 c_{\tau,p}}\Big| \le \sup_{1\le i\le n}\Big|\frac{1}{n}v_i^{\top}(S_p(i)+\tau I_p)^{-1}v_i - \lambda_i^2 c_{\tau,p}\Big|\,\|\psi'\|_\infty.$$
We conclude that
$$\frac{p}{n} - \tau c_{\tau,p} - 1 + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{1+\lambda_i^2 c_{\tau,p}\,\psi'(r_{i,[p]})} = O_{L_k}\big(n^{-1/2+2\alpha}\operatorname{polyLog}(n)\big).$$
Exactly the same computations can be performed for $c_\tau$. We have established that
$$\frac{p}{n} - \tau c_{\tau} - 1 + \frac{1}{n}\sum_{i=1}^{n}\frac{1}{1+\lambda_i^2 c_{\tau}\,\psi'(R_i)} = O_{L_k}\big(n^{-1/2+2\alpha}\operatorname{polyLog}(n)\big).$$
Now we have seen in Theorem A1 that
$$\sup_i\big|R_i - \operatorname{prox}(c_i\rho)(\tilde r_{i,(i)})\big| = O_{L_k}\big(n^{-1/2+\alpha}\operatorname{polyLog}(n)\big).$$
By the assumptions on $\psi$, this implies that
$$\sup_i\big|\psi'(R_i) - \psi'\{\operatorname{prox}(c_i\rho)(\tilde r_{i,(i)})\}\big| = O_{L_k}\big(n^{-1/2+\alpha}\operatorname{polyLog}(n)\big).$$
We have furthermore noted that $\sup_i|c_i - \lambda_i^2 c_\tau| = O_{L_k}(n^{-1/2+2\alpha}\operatorname{polyLog}(n))$ in Corollary A1. Using Lemma A11, we can write
$$\big|\operatorname{prox}(c_i\rho)(\tilde r_{i,(i)}) - \operatorname{prox}(\lambda_i^2 c_\tau\rho)(\tilde r_{i,(i)})\big| \le \|\psi\|_\infty\,|c_i - \lambda_i^2 c_\tau|$$
and hence
$$\big|\psi'\{\operatorname{prox}(c_i\rho)(\tilde r_{i,(i)})\} - \psi'\{\operatorname{prox}(\lambda_i^2 c_\tau\rho)(\tilde r_{i,(i)})\}\big| = O_{L_k}\big(\|\psi\|_\infty\, n^{-1/2+3\alpha}\operatorname{polyLog}(n)\big).$$
Gathering all these results, we have
$$\big|\psi'(R_i) - \psi'\{\operatorname{prox}(\lambda_i^2 c_\tau\rho)(\tilde r_{i,(i)})\}\big| = O_{L_k}\big\{(\|\psi\|_\infty+1)\, n^{-1/2+3\alpha}\operatorname{polyLog}(n)\big\}.$$
So we have shown that $\delta_n(c_\tau) = O_{L_k}(n^{-1/2+3\alpha}\operatorname{polyLog}(n))$.
  • Final details
By Lemma A14, we have $\delta_n(x) - \Delta_n(x) = o_p(1)$ for any given $x$. In our case, using the notation of this lemma, $B = p/(n\tau) + \eta/\tau$, for $\eta > 0$ given.
This implies that for any given $\epsilon > 0$,
$$\sup_{x\in(0,\ p/(n\tau)+\eta/\tau)}|\delta_n(x) - \Delta_n(x)| < \epsilon,$$
with high probability when $n$ is large enough. Therefore, for any $\epsilon > 0$, if $x_n$ is a solution to $\delta_n(x) = 0$,
$$|\Delta_n(x_n)| < \epsilon \quad \text{with high probability}.$$
This means that $x_n \in T_{n,\epsilon}$ with high probability. The same reasoning applies to near solutions of $\delta_n(x) = 0$, which must belong to $T_{n,\epsilon}$ as $n\to\infty$ with high probability for any given $\epsilon > 0$. Note that $T_{n,\epsilon}$ is compact because it is bounded and closed, using the fact that $G_n = \mathbb{E}(g_n)$ is continuous.
If $T_{n,0}$ were reduced to a single point, we would have established the asymptotically deterministic character of $c_\tau$.
Given our work on the asymptotic behavior of $\tilde r_{i,(i)}$ and our assumptions on the $\epsilon_i$'s, we see that Lemma 3.39 from El Karoui [21] applies to $\lim_{n\to\infty}\Delta_n(x)$ under Assumption F1. Therefore, $T_{n,0}$ is reduced to a single point as $n\to\infty$ and $c_\tau$ is asymptotically deterministic. □
Proof of Theorem 1. 
Notice that
$$\frac{\partial}{\partial t}\operatorname{prox}(c\rho)(t) = \operatorname{prox}'(c\rho)(t) = \frac{1}{1+c\,\psi'(\operatorname{prox}(c\rho)(t))}.$$
Therefore, $\Delta_n$ can be written as
$$\Delta_n(x) = \frac{p}{n} - \tau x - 1 + \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big\{\operatorname{prox}'(x\lambda_i^2\rho)(\tilde r_{i,(i)})\big\}.$$
Hence, the limiting root of $\Delta_n(x) = 0$ satisfies the first fixed-point equation in Theorem 1. Since Lemma A15 shows that $c_\tau$ is asymptotically arbitrarily close to this root, the first equation follows. The second equation comes from (A24). Theorem 1 is proved, with $c_\rho(\kappa)$ being the limit of $c_\tau$. □
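The derivative identity at the start of this proof can be checked in one line. The sketch below assumes the standard Moreau definition of the proximal mapping [32] and that $\rho$ is convex and twice differentiable with $\psi = \rho'$; the shorthand $z(t)$ is introduced here only for illustration.

```latex
% z(t) := prox(c\rho)(t) = \arg\min_z \{ c\,\rho(z) + (z - t)^2 / 2 \}
% First-order condition of the minimization:
z(t) + c\,\psi\big(z(t)\big) = t .
% Differentiating both sides with respect to t:
z'(t)\,\big\{ 1 + c\,\psi'\big(z(t)\big) \big\} = 1
\quad\Longrightarrow\quad
z'(t) = \frac{1}{1 + c\,\psi'\big(z(t)\big)} \in (0, 1] .
```

Since $\psi' \ge 0$ for convex $\rho$, the derivative lies in $(0,1]$, which is consistent with the bound $0 \le g_n(x) \le 1$ used in the proof of Lemma A15.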

References

  1. Chen, M. Analysis on transfer learning models and applications in natural language processing. Highlights Sci. Eng. Technol. 2022, 16, 446–452.
  2. Ma, Y.; Chen, S.; Ermon, S.; Lobell, D.B. Transfer learning in environmental remote sensing. Remote Sens. Environ. 2024, 301, 113924.
  3. Gopalakrishnan, K.; Khaitan, S.K.; Choudhary, A.; Agrawal, A. Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Constr. Build. Mater. 2017, 157, 322–330.
  4. Liu, D.; Luo, J.; Johnson, B.; Chew, H.; Blais, J.; Deik, A.; Paul, F.; Hanson, R.L.; Crandall, J.P.; Sun, Y.; et al. Modeling blood metabolite homeostatic levels reduces sample heterogeneity across cohorts. Proc. Natl. Acad. Sci. USA 2024, 121, e2307430121.
  5. Chen, A.; Owen, A.B.; Shi, M. Data enriched linear regression. Electron. J. Stat. 2015, 9, 1078–1112.
  6. Tripuraneni, N.; Jin, C.; Jordan, M. Provable meta-learning of linear representations. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 10434–10443.
  7. Bastani, H. Predicting with proxies: Transfer learning in high dimension. Manag. Sci. 2021, 67, 2964–2984.
  8. Li, S.; Cai, T.T.; Li, H. Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. J. R. Stat. Soc. Ser. B Stat. Methodol. 2022, 84, 149–173.
  9. Tian, Y.; Feng, Y. Transfer learning under high-dimensional generalized linear models. J. Am. Stat. Assoc. 2023, 118, 2684–2697.
  10. Li, S.; Zhang, L.; Cai, T.T.; Li, H. Estimation and inference for high-dimensional generalized linear models with knowledge transfer. J. Am. Stat. Assoc. 2024, 119, 1274–1285.
  11. Cai, T.T.; Wei, H. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. Ann. Stat. 2021, 49, 100–128.
  12. Cai, T.T.; Pu, H. Transfer learning for nonparametric regression: Non-asymptotic minimax analysis and adaptive procedure. arXiv 2024, arXiv:2401.12272.
  13. Fan, J.; Gao, C.; Klusowski, J.M. Robust transfer learning with unreliable source data. arXiv 2023, arXiv:2310.04606.
  14. Huber, P.J. Robust regression: Asymptotics, conjectures and Monte Carlo. Ann. Stat. 1973, 1, 799–821.
  15. Portnoy, S. Asymptotic behavior of M-estimators of p regression parameters when $p^2/n$ is large. I. Consistency. Ann. Stat. 1984, 12, 1298–1309.
  16. Portnoy, S. Asymptotic behavior of M-estimators of p regression parameters when $p^2/n$ is large. II. Normal approximation. Ann. Stat. 1985, 13, 1403–1417.
  17. Portnoy, S. Asymptotic behavior of the empiric distribution of M-estimated residuals from a regression model with many parameters. Ann. Stat. 1986, 14, 1152–1170.
  18. Portnoy, S. A central limit theorem applicable to robust regression estimators. J. Multivar. Anal. 1987, 22, 24–50.
  19. Mammen, E. Asymptotics with increasing dimension for robust regression with applications to the bootstrap. Ann. Stat. 1989, 17, 382–400.
  20. El Karoui, N.; Bean, D.; Bickel, P.J.; Lim, C.; Yu, B. On robust regression with high-dimensional predictors. Proc. Natl. Acad. Sci. USA 2013, 110, 14557–14562.
  21. El Karoui, N. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Relat. Fields 2018, 170, 95–175.
  22. Charbonnier, P.; Blanc-Feraud, L.; Aubert, G.; Barlaud, M. Deterministic edge-preserving regularization in computed imaging. IEEE Trans. Image Process. 1997, 6, 298–311.
  23. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003.
  24. Yohai, V.J.; Maronna, R.A. Asymptotic behavior of M-estimators for the linear model. Ann. Stat. 1979, 7, 258–268.
  25. El Karoui, N. Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: Rigorous results. arXiv 2013, arXiv:1311.2445.
  26. El Karoui, N. Concentration of measure and spectra of random matrices: Applications to correlation matrices, elliptical distributions and beyond. Ann. Appl. Probab. 2009, 19, 2362–2405.
  27. Ledoux, M. The Concentration of Measure Phenomenon; American Mathematical Society: Providence, RI, USA, 2001; Volume 89.
  28. Huber, P.J.; Ronchetti, E.M. Robust Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2009.
  29. Karlin, S. Total Positivity; Stanford University Press: Stanford, CA, USA, 1968.
  30. Ibragimov, I.A. On the composition of unimodal distributions. Teor. Veroyatnost. Primen. 1956, 1, 283–288.
  31. Dharmadhikari, S.; Joag-Dev, K. Unimodality, Convexity, and Applications; Probability and Mathematical Statistics; Academic Press, Inc.: Boston, MA, USA, 1988.
  32. Moreau, J.J. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. Fr. 1965, 93, 273–299.
  33. Beck, A.; Teboulle, M. Gradient-based algorithms with applications to signal-recovery problems. In Convex Optimization in Signal Processing and Communications; Palomar, D.P., Eldar, Y.C., Eds.; Cambridge University Press: Cambridge, UK, 2010; pp. 42–88.
  34. Bean, D.; Bickel, P.J.; El Karoui, N.; Yu, B. Optimal M-estimation in high-dimensional regression. Proc. Natl. Acad. Sci. USA 2013, 110, 14563–14568.
  35. Efron, B.; Stein, C. The jackknife estimate of variance. Ann. Stat. 1981, 9, 586–596.
  36. Bhatia, R. Matrix Analysis; Springer: New York, NY, USA, 1997.
  37. Johnson, C.R.; Horn, R.A. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985.
  38. Stroock, D.W. Probability Theory: An Analytic View; Cambridge University Press: Cambridge, UK, 2010.
  39. van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: Cambridge, UK, 2000.
Figure 1. Boxplot of $\|\hat\beta - \beta_0\|^2$ over 1000 simulations. The red point in each boxplot represents the theoretical value $r_\rho^2$ from Theorem 1. Panels from top to bottom are for $\kappa = 1, 4$, respectively, while panels from left to right are for cases I, II, III, respectively.
Figure 2. Theoretical curves of $r_\rho$ as a function of $\|\beta_0 - \hat w\|$ for five values of $\tau$ under cases I–III, obtained by numerically solving Corollary 1. The three panels correspond to cases I, II, and III, respectively.
Figure 3. Boxplots of relative estimation errors (log scale) across 500 replications for varying $\|\beta_0 - w_0\|$ under cases I–III, with $p = 400$ and $n = 400$. Case I includes all six methods, while cases II and III include only the four ridge-type procedures. Panels from top to bottom are for cases I, II, III, respectively.
Figure 4. Robustness check under AR(1) correlated predictors. Boxplots of relative estimation errors (log scale) across 500 replications for varying $\|\beta_0 - w_0\|$ under cases I–III with $x_i \sim N(0, \Sigma_\rho)$, where $\Sigma_{\rho,jk} = \rho^{|j-k|}$ and $\rho = 0.6$. Panels from top to bottom are for cases I, II, and III, respectively. The four ridge-type procedures, Single-RR, Trans-RR, Trans-RR-Ada, and Pooled-RR, are shown.
Figure 5. Distribution of RMSE across the 20 repeated random splits for each of the six methods, in the two transfer directions. Boxes show the inter-quartile range, whiskers extend to 1.5 × IQR, and circles mark splits beyond that.
Table 1. Mean and SD (in parentheses) of $\|\hat\beta - \beta_0\|^2$ (denoted as $\hat r^2$) and the corresponding $r_\rho^2$ over 1000 simulations. Rows are indexed by $(p, n)$, with $n_1 = 2p$ when $\kappa = 1$ and $n_1 = p/2$ when $\kappa = 4$.
| $\kappa$ | $(p, n)$ | Case I: $\hat r^2$ | Case I: $r_\rho^2$ | Case II: $\hat r^2$ | Case II: $r_\rho^2$ | Case III: $\hat r^2$ | Case III: $r_\rho^2$ |
|---|---|---|---|---|---|---|---|
| 1 | (200, 200) | 0.3653 (0.0318) | 0.3649 | 0.7163 (0.0738) | 0.7204 | 0.4683 (0.0478) | 0.4685 |
| 1 | (400, 400) | 0.3472 (0.0208) | 0.3477 | 0.6970 (0.0549) | 0.6923 | 0.5076 (0.0368) | 0.5065 |
| 1 | (800, 800) | 0.3603 (0.0151) | 0.3598 | 0.7206 (0.0374) | 0.7212 | 0.4989 (0.0238) | 0.4986 |
| 4 | (200, 50) | 1.3415 (0.0734) | 1.3419 | 2.8222 (0.3311) | 2.8216 | 2.2410 (0.2407) | 2.2415 |
| 4 | (400, 100) | 1.3565 (0.0531) | 1.3544 | 2.4896 (0.2363) | 2.4996 | 1.8087 (0.1590) | 1.8102 |
| 4 | (800, 200) | 1.5261 (0.0427) | 1.5247 | 2.7226 (0.1693) | 2.7219 | 2.0342 (0.1153) | 2.0301 |
Table 2. Sensitivity of relative estimation error to the smoothed Huber parameters $(\delta, \eta)$ at the transition discrepancy $h = 1$. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error over $M = 500$ replications. The three blocks correspond to the three error distributions of Section 4.3.
Case I (Gaussian Errors)
| $\delta$ \ $\eta$ | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.653/0.655/0.651/0.883 | 0.654/0.656/0.652/0.883 | 0.656/0.658/0.654/0.883 |
| 1.35 | 0.644/0.647/0.642/0.882 | 0.645/0.647/0.642/0.882 | 0.646/0.648/0.643/0.882 |
| 2.00 | 0.641/0.642/0.637/0.885 | 0.640/0.642/0.637/0.885 | 0.641/0.643/0.637/0.885 |
Case II (Cauchy Errors)
| $\delta$ \ $\eta$ | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.884/0.893/0.884/0.952 | 0.884/0.892/0.884/0.952 | 0.884/0.893/0.884/0.952 |
| 1.35 | 0.883/0.893/0.884/0.951 | 0.883/0.892/0.883/0.951 | 0.883/0.892/0.884/0.951 |
| 2.00 | 0.890/0.899/0.891/0.951 | 0.889/0.899/0.890/0.951 | 0.889/0.899/0.890/0.951 |
Case III (Mixture Errors)
| $\delta$ \ $\eta$ | 0.05 | 0.10 | 0.20 |
|---|---|---|---|
| 1.00 | 0.779/0.786/0.779/0.921 | 0.780/0.787/0.779/0.921 | 0.780/0.787/0.779/0.921 |
| 1.35 | 0.781/0.788/0.781/0.920 | 0.781/0.787/0.780/0.920 | 0.781/0.787/0.780/0.921 |
| 2.00 | 0.794/0.801/0.793/0.924 | 0.793/0.800/0.792/0.924 | 0.792/0.800/0.791/0.923 |
Table 3. Sensitivity of relative estimation error to the cross-validation criterion. All settings are as in Figure 3, except that every cross-validation loss (used to select $\tau_1$, $\tau$, $\tau_{\mathrm{st}}$, $\tau_p$, and $\theta$) is changed from MAE to MSE. Each entry reports the default (MAE-CV) and the MSE-CV mean estimation error, in the format “default/MSE-CV”, over $M = 500$ replications.
| Case | $h$ | Single-RR | Trans-RR | Trans-RR-Ada | Pooled-RR |
|---|---|---|---|---|---|
| I | 0.135 | 0.645/0.644 | 0.550/0.546 | 0.550/0.546 | 0.600/0.601 |
| I | 0.223 | 0.645/0.644 | 0.565/0.561 | 0.566/0.561 | 0.628/0.629 |
| I | 0.368 | 0.645/0.644 | 0.588/0.585 | 0.589/0.585 | 0.678/0.681 |
| I | 0.607 | 0.645/0.644 | 0.616/0.614 | 0.617/0.615 | 0.764/0.770 |
| I | 1.000 | 0.645/0.644 | 0.647/0.645 | 0.642/0.641 | 0.882/0.888 |
| I | 1.649 | 0.645/0.644 | 0.791/0.793 | 0.645/0.644 | 0.990/0.991 |
| I | 2.718 | 0.645/0.644 | 1.639/1.646 | 0.645/0.644 | 1.261/1.479 |
| II | 0.135 | 0.883/1.772 | 0.816/2.582 | 0.819/1.895 | 0.811/1.324 |
| II | 0.223 | 0.883/1.772 | 0.826/2.597 | 0.829/1.904 | 0.829/1.326 |
| II | 0.368 | 0.883/1.772 | 0.840/2.618 | 0.843/1.915 | 0.855/1.353 |
| II | 0.607 | 0.883/1.772 | 0.860/2.640 | 0.862/1.920 | 0.898/1.401 |
| II | 1.000 | 0.883/1.772 | 0.892/2.745 | 0.883/1.963 | 0.951/1.497 |
| II | 1.649 | 0.883/1.772 | 1.021/2.994 | 0.887/2.020 | 1.005/1.679 |
| II | 2.718 | 0.883/1.772 | 1.898/3.827 | 0.883/2.108 | 1.123/2.223 |
| III | 0.135 | 0.781/1.108 | 0.693/1.319 | 0.692/1.094 | 0.706/0.944 |
| III | 0.223 | 0.781/1.108 | 0.707/1.334 | 0.707/1.102 | 0.729/0.966 |
| III | 0.368 | 0.781/1.108 | 0.727/1.358 | 0.727/1.114 | 0.770/0.993 |
| III | 0.607 | 0.781/1.108 | 0.754/1.395 | 0.754/1.138 | 0.834/1.054 |
| III | 1.000 | 0.781/1.108 | 0.787/1.473 | 0.780/1.161 | 0.920/1.155 |
| III | 1.649 | 0.781/1.108 | 0.941/1.749 | 0.783/1.194 | 0.998/1.368 |
| III | 2.718 | 0.781/1.108 | 1.985/2.646 | 0.781/1.274 | 1.181/1.961 |
Table 4. Sensitivity of relative estimation error to the choice of robust loss. All settings are as in Figure 3, except that the smoothed Huber loss is replaced by the pseudo-Huber loss $\rho_P(t;\delta) = \delta^2\big(\sqrt{1+(t/\delta)^2}-1\big)$ with $\delta = 1.35$ (the smoothing parameter $\eta$ is no longer needed). Each entry reports the default (smoothed Huber) and the pseudo-Huber mean estimation error, in the format “default/pseudo-Huber”, over $M = 500$ replications.
| Case | $h$ | Single-RR | Trans-RR | Trans-RR-Ada | Pooled-RR |
|---|---|---|---|---|---|
| I | 0.135 | 0.645/0.645 | 0.550/0.545 | 0.550/0.546 | 0.600/0.595 |
| I | 0.223 | 0.645/0.645 | 0.565/0.561 | 0.566/0.562 | 0.628/0.625 |
| I | 0.368 | 0.645/0.645 | 0.588/0.585 | 0.589/0.586 | 0.678/0.675 |
| I | 0.607 | 0.645/0.645 | 0.616/0.615 | 0.617/0.616 | 0.764/0.763 |
| I | 1.000 | 0.645/0.645 | 0.647/0.647 | 0.642/0.643 | 0.882/0.883 |
| I | 1.649 | 0.645/0.645 | 0.791/0.797 | 0.645/0.645 | 0.990/0.995 |
| I | 2.718 | 0.645/0.645 | 1.639/1.642 | 0.645/0.645 | 1.261/1.299 |
| II | 0.135 | 0.883/0.891 | 0.816/0.828 | 0.819/0.830 | 0.811/0.823 |
| II | 0.223 | 0.883/0.891 | 0.826/0.836 | 0.829/0.839 | 0.829/0.839 |
| II | 0.368 | 0.883/0.891 | 0.840/0.850 | 0.843/0.852 | 0.855/0.865 |
| II | 0.607 | 0.883/0.891 | 0.860/0.869 | 0.862/0.872 | 0.898/0.905 |
| II | 1.000 | 0.883/0.891 | 0.892/0.900 | 0.883/0.891 | 0.951/0.954 |
| II | 1.649 | 0.883/0.891 | 1.021/1.023 | 0.887/0.895 | 1.005/1.007 |
| II | 2.718 | 0.883/0.891 | 1.898/1.871 | 0.883/0.891 | 1.123/1.129 |
| III | 0.135 | 0.781/0.793 | 0.693/0.703 | 0.692/0.703 | 0.706/0.713 |
| III | 0.223 | 0.781/0.793 | 0.707/0.717 | 0.707/0.718 | 0.729/0.738 |
| III | 0.368 | 0.781/0.793 | 0.727/0.740 | 0.727/0.740 | 0.770/0.777 |
| III | 0.607 | 0.781/0.793 | 0.754/0.765 | 0.754/0.765 | 0.834/0.842 |
| III | 1.000 | 0.781/0.793 | 0.787/0.798 | 0.780/0.791 | 0.920/0.925 |
| III | 1.649 | 0.781/0.793 | 0.941/0.950 | 0.783/0.794 | 0.998/1.002 |
| III | 2.718 | 0.781/0.793 | 1.985/1.972 | 0.781/0.793 | 1.181/1.199 |
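As a small illustration of the pseudo-Huber loss named in the caption of Table 4, the sketch below evaluates the loss and its derivative and prints values showing its two regimes: roughly quadratic ($t^2/2$) near zero and roughly linear with slope $\delta$ for large $|t|$. The function names are hypothetical and the code is not the authors' implementation; only the formula and $\delta = 1.35$ come from the caption.

```python
import math

def pseudo_huber(t, delta=1.35):
    """rho_P(t; delta) = delta^2 * (sqrt(1 + (t/delta)^2) - 1)."""
    return delta ** 2 * (math.sqrt(1.0 + (t / delta) ** 2) - 1.0)

def pseudo_huber_psi(t, delta=1.35):
    """Derivative of rho_P in t; its absolute value never exceeds delta."""
    return t / math.sqrt(1.0 + (t / delta) ** 2)

# Near zero the loss behaves like t^2 / 2 ...
small = pseudo_huber(1e-3)
# ... and for large |t| it grows roughly linearly with slope delta.
slope_far = pseudo_huber(1e3) / 1e3

# Central finite difference as a sanity check of the derivative formula.
t0, h = 0.7, 1e-6
fd = (pseudo_huber(t0 + h) - pseudo_huber(t0 - h)) / (2 * h)

print(small, slope_far, fd, pseudo_huber_psi(t0))
```

Unlike the classical Huber loss, this loss is smooth everywhere, which is presumably why no separate smoothing parameter $\eta$ is needed in this comparison.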
Table 5. Sensitivity of relative estimation error to the ridge-penalty cross-validation grid. The default grid contains 9 values from $1/9$ to $9$ on a geometric scale. The wide grid extends this to 13 values from $1/27$ to $27$ on the same geometric scale, and contains the default grid as a strict subset. All other settings are as in Figure 3, with $M = 500$ replications per cell. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error.
| Case | $h$ | Default Grid (9 pts, [1/9, 9]) | Wide Grid (13 pts, [1/27, 27]) |
|---|---|---|---|
| I | 0.135 | 0.645/0.550/0.550/0.600 | 0.645/0.550/0.550/0.600 |
| I | 0.223 | 0.645/0.565/0.566/0.628 | 0.645/0.565/0.566/0.628 |
| I | 0.368 | 0.645/0.588/0.589/0.678 | 0.645/0.589/0.590/0.678 |
| I | 0.607 | 0.645/0.616/0.617/0.764 | 0.645/0.618/0.619/0.764 |
| I | 1.000 | 0.645/0.647/0.642/0.882 | 0.645/0.648/0.642/0.883 |
| I | 1.649 | 0.645/0.791/0.645/0.990 | 0.645/0.791/0.645/0.990 |
| I | 2.718 | 0.645/1.639/0.645/1.261 | 0.645/1.639/0.645/1.261 |
| II | 0.135 | 0.883/0.816/0.819/0.811 | 0.887/0.824/0.826/0.811 |
| II | 0.223 | 0.883/0.826/0.829/0.829 | 0.887/0.835/0.838/0.829 |
| II | 0.368 | 0.883/0.840/0.843/0.855 | 0.887/0.850/0.852/0.856 |
| II | 0.607 | 0.883/0.860/0.862/0.898 | 0.887/0.869/0.871/0.900 |
| II | 1.000 | 0.883/0.892/0.883/0.951 | 0.887/0.896/0.888/0.955 |
| II | 1.649 | 0.883/1.021/0.887/1.005 | 0.887/1.021/0.891/1.006 |
| II | 2.718 | 0.883/1.898/0.883/1.123 | 0.887/1.899/0.887/1.120 |
| III | 0.135 | 0.781/0.693/0.692/0.706 | 0.782/0.694/0.694/0.706 |
| III | 0.223 | 0.781/0.707/0.707/0.729 | 0.782/0.709/0.709/0.729 |
| III | 0.368 | 0.781/0.727/0.727/0.770 | 0.782/0.731/0.731/0.770 |
| III | 0.607 | 0.781/0.754/0.754/0.834 | 0.782/0.758/0.758/0.834 |
| III | 1.000 | 0.781/0.787/0.780/0.920 | 0.782/0.789/0.781/0.922 |
| III | 1.649 | 0.781/0.941/0.783/0.998 | 0.782/0.941/0.783/0.999 |
| III | 2.718 | 0.781/1.985/0.781/1.181 | 0.782/1.985/0.782/1.180 |
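The two grids described in the caption of Table 5 can be reproduced with a short sketch, under the assumption (not stated explicitly in the source) that both endpoints are included and the points are evenly spaced on a log scale. Under that reading both grids share the common ratio $\sqrt{3}$, which is what makes the default grid a subset of the wide one. The helper names are hypothetical.

```python
import math

def geometric_grid(lo, hi, num):
    """num points from lo to hi (inclusive) with a constant ratio between neighbours."""
    ratio = (hi / lo) ** (1.0 / (num - 1))
    return [lo * ratio ** k for k in range(num)]

def contains(grid, value):
    """Membership test that tolerates floating-point rounding."""
    return any(math.isclose(value, g, rel_tol=1e-9) for g in grid)

default_grid = geometric_grid(1.0 / 9.0, 9.0, 9)    # 9 values on [1/9, 9]
wide_grid = geometric_grid(1.0 / 27.0, 27.0, 13)    # 13 values on [1/27, 27]

# Every default value should reappear in the wide grid.
default_in_wide = all(contains(wide_grid, v) for v in default_grid)
# The fixed penalties 1/3, 1, 3, 9 used when tuning is disabled also lie on the default grid.
fixed_in_default = all(contains(default_grid, v) for v in (1.0 / 3.0, 1.0, 3.0, 9.0))

print(default_grid)
print(default_in_wide, fixed_in_default)
```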
Table 6. Sensitivity of relative estimation error to the ridge penalty value when cross-validation tuning is disabled. All four ridge penalties (Single-RR’s $\tau_{\mathrm{st}}$, Trans-RR’s $\tau_1$ and $\tau$, Pooled-RR’s $\tau_p$) are forced to a common fixed value from $\{1/3, 1, 3, 9\}$. Trans-RR-Ada’s mixing weight $\theta$ is still selected by 5-fold cross-validation on the target sample (Algorithm 2). All other settings are as in Figure 3, with $M = 500$ replications per cell. Each entry reports Single-RR/Trans-RR/Trans-RR-Ada/Pooled-RR mean estimation error.
| Case | $h$ | $\tau = 1/3$ | $\tau = 1$ | $\tau = 3$ | $\tau = 9$ |
|---|---|---|---|---|---|
| I | 0.135 | 0.742/0.769/0.687/0.619 | 0.628/0.523/0.525/0.605 | 0.742/0.620/0.620/0.773 | 0.881/0.814/0.814/0.906 |
| I | 0.223 | 0.742/0.781/0.696/0.645 | 0.628/0.538/0.538/0.625 | 0.742/0.631/0.632/0.783 | 0.881/0.820/0.820/0.910 |
| I | 0.368 | 0.742/0.804/0.709/0.694 | 0.628/0.565/0.561/0.661 | 0.742/0.651/0.652/0.801 | 0.881/0.829/0.829/0.917 |
| I | 0.607 | 0.742/0.851/0.728/0.790 | 0.628/0.616/0.596/0.726 | 0.742/0.687/0.690/0.833 | 0.881/0.846/0.846/0.929 |
| I | 1.000 | 0.742/0.953/0.745/0.990 | 0.628/0.714/0.629/0.848 | 0.742/0.752/0.740/0.890 | 0.881/0.876/0.876/0.951 |
| I | 1.649 | 0.742/1.185/0.743/1.423 | 0.628/0.913/0.628/1.074 | 0.742/0.866/0.742/0.986 | 0.881/0.924/0.881/0.986 |
| I | 2.718 | 0.742/1.739/0.742/2.321 | 0.628/1.287/0.628/1.436 | 0.742/1.033/0.742/1.113 | 0.881/0.987/0.881/1.030 |
| II | 0.135 | 2.026/2.443/2.014/1.143 | 0.978/0.964/0.930/0.789 | 0.861/0.778/0.784/0.850 | 0.924/0.877/0.878/0.936 |
| II | 0.223 | 2.026/2.459/2.019/1.167 | 0.978/0.979/0.938/0.804 | 0.861/0.787/0.793/0.857 | 0.924/0.881/0.882/0.939 |
| II | 0.368 | 2.026/2.490/2.027/1.211 | 0.978/1.005/0.953/0.833 | 0.861/0.803/0.810/0.870 | 0.924/0.888/0.889/0.943 |
| II | 0.607 | 2.026/2.551/2.037/1.297 | 0.978/1.055/0.972/0.885 | 0.861/0.832/0.836/0.894 | 0.924/0.900/0.903/0.952 |
| II | 1.000 | 2.026/2.681/2.045/1.475 | 0.978/1.152/0.988/0.983 | 0.861/0.885/0.865/0.937 | 0.924/0.923/0.922/0.968 |
| II | 1.649 | 2.026/2.971/2.042/1.848 | 0.978/1.342/0.984/1.161 | 0.861/0.977/0.864/1.008 | 0.924/0.959/0.927/0.994 |
| II | 2.718 | 2.026/3.615/2.034/2.569 | 0.978/1.663/0.978/1.420 | 0.861/1.096/0.861/1.093 | 0.924/1.000/0.925/1.022 |
| III | 0.135 | 1.286/1.442/1.246/0.827 | 0.784/0.713/0.704/0.686 | 0.798/0.693/0.694/0.808 | 0.902/0.844/0.844/0.920 |
| III | 0.223 | 1.286/1.456/1.253/0.852 | 0.784/0.729/0.715/0.705 | 0.798/0.703/0.704/0.817 | 0.902/0.849/0.849/0.923 |
| III | 0.368 | 1.286/1.484/1.265/0.900 | 0.784/0.756/0.735/0.737 | 0.798/0.721/0.724/0.833 | 0.902/0.857/0.857/0.929 |
| III | 0.607 | 1.286/1.541/1.282/0.994 | 0.784/0.808/0.763/0.797 | 0.798/0.754/0.759/0.861 | 0.902/0.872/0.873/0.940 |
| III | 1.000 | 1.286/1.661/1.294/1.187 | 0.784/0.909/0.789/0.910 | 0.798/0.815/0.799/0.912 | 0.902/0.898/0.899/0.959 |
| III | 1.649 | 1.286/1.933/1.290/1.600 | 0.784/1.109/0.786/1.115 | 0.798/0.920/0.799/0.996 | 0.902/0.941/0.903/0.990 |
| III | 2.718 | 1.286/2.559/1.286/2.420 | 0.784/1.466/0.784/1.426 | 0.798/1.064/0.798/1.103 | 0.902/0.993/0.902/1.026 |
Table 7. Prediction performance on the NIR spectral dataset over 20 repeated random splits. Each entry reports the average RMSE, with the standard deviation in parentheses.
| Method | Direction A | Direction B |
|---|---|---|
| Trans-RR | 4.6230 (0.1732) | 4.7933 (0.2736) |
| Trans-RR-Ada | 4.6294 (0.1861) | 4.8211 (0.3757) |
| Pooled-RR | 5.0952 (0.0812) | 5.4909 (0.1335) |
| Trans-Lasso | 5.5668 (0.3650) | 5.6803 (0.3131) |
| Single-RR | 6.2666 (2.2628) | 6.8272 (2.2674) |
| Single-Lasso | 8.0672 (2.8904) | 8.5607 (3.3739) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lyu, L.; Guo, X.; Liu, Z. Transfer Learning for Moderate–Dimensional Ridge-Regularized Robust Linear Regression. Entropy 2026, 28, 543. https://doi.org/10.3390/e28050543
