Kernel Ridge-Type Shrinkage Estimators in Partially Linear Regression Models with Correlated Errors

Syed Ejaz Ahmed; Ersin Yilmaz; Dursun Aydın

doi:10.3390/math13121959

,

and

¹

Department of Mathematics and Statistics, Brock University, St. Catharines, ON L2S 3A, Canada

²

Department of Computer Science, Aalto University, Konemiehentie 2, 02150 Espoo, Finland

³

Department of Statistics, Faculty of Science, Mugla Sitki Kocman University, Mugla 4800, Turkey

⁴

Department of Mathematics, University of Wisconsin-Oshkosh, 800 Algoma Blvd, Oshkosh, WI 54901, USA

Mathematics2025, 13(12), 1959;https://doi.org/10.3390/math13121959

This article belongs to the Special Issue Statistical Forecasting: Theories, Methods and Applications

Version Notes

Order Reprints

Abstract

Partially linear time series models often suffer from multicollinearity among regressors and autocorrelated errors, both of which can inflate estimation risk. This study introduces a generalized ridge-type kernel (GRTK) framework that combines kernel smoothing with ridge shrinkage and augments it through ordinary and positive-part Stein adjustments. Closed-form expressions and large-sample properties are established, and data-driven criteria—including GCV, AICc, BIC, and RECP—are used to tune the bandwidth and shrinkage penalties. Monte-Carlo simulations indicate that the proposed procedures usually reduce risk relative to existing semiparametric alternatives, particularly when the predictors are strongly correlated and the error process is dependent. An empirical study of US airline-delay data further demonstrates that GRTK produces a stable, interpretable fit, captures a nonlinear air-time effect overlooked by conventional approaches, and leaves only a modest residual autocorrelation. By tackling multicollinearity and autocorrelation within a single, flexible estimator, the GRTK family offers practitioners a practical avenue for more reliable inference in partially linear time series settings.

Keywords:

shrinkage estimation; partially linear models; multicollinearity; ridge-type kernel smoother; parameter selection

MSC:

62G08; 62J07; 62M10

1. Introduction

In many time series applications, one must balance interpretability and flexibility in the modeling. Partially linear regression models achieve this by expressing the response as a parametric linear function of some covariates alongside a nonparametric smooth function of an additional variable. Specifically, we consider the following model setup:

y_{t} = x_{t}^{'} β + f (z_{t}) + ε_{t}, t = 1, \dots, n,

(1)

where

y_{t}

denotes the stationary response at time

t

, and its prediction depends on the p-dimensional vector of explanatory variables

x_{t} = {(x_{t 1}, \dots, x_{t p})}^{'}

with

x_{t 1} = {(x_{1}, \dots, x_{n})}^{'}

and an extra univariate variable

z_{t}

. Also,

β \in R^{p}

is an unknown p-dimensional parameter vector to be estimated,

f (.)

is an unknown smooth function, and

ε_{t} ’ s

are the error terms.

Unlike classical linear models, (1) offers the interpretability of

β

while retaining the capacity to model complex, nonlinear effects through

f (.)

(see [1]). However, two mojor issues often arise:

Multicollinearity among $x_{t} .$ When the columns of $x_{t}$ are nearly linearly dependent, ordinary least squares (OLS) or unpenalized nonparametric estimators can have extremely high variances [2].
Autocorrelation or dependence among $ε_{t}$ , as is common in time series or longitudinal settings. We capture this by a first-order autoregressive process:

$ε_{t} = ρ ε_{t - 1} + u_{t}, |ρ| < 1, u_{t} ~ i . i . d . N (0, σ^{2}), t = 1,2, \dots, n$

(2)

ensuring that $ε_{t}$ is stationarily dependent over time ([3,4]). In addition, to simplify the notational illustration, a matrix–vector form of the model (1) can be stated as

$y = X β + f + ε,$

(3)

where $y = (y_{1}, \dots, y_{n})$ , ${X = (x_{1}, \dots, x_{n})}^{'}$ , $f = (f (z_{1}), \dots, f (z_{n}))$ and $ε = (ε_{1}, \dots, ε_{n})$ . The main goal is to estimate the unknown parameter vector $β$ , the smooth function $f (t)$ and the mean vector $μ = X β + f$ based on the observations from the data set { $y_{t}, x_{t} {, z}_{t}$ }. Partially linear models enable easier interpretation of the effect of each variable and owing to the “curse of dimensionality”, these models can be preferred to a completely nonparametric regression model. Specifically, partially linear models are more practical than the classical linear model because they combine both the parametric part presented numerically and the nonparametric part displayed graphically.

Even though partially linear models have specific advantages, these models typically assume uncorrelated errors ([1,5]). In real-world time series or panel data, ignoring assumption (2) can degrade efficiency and produce biased inferences ([6,7]). Moreover, multicollinearity among

x_{t}

amplifies these problems, inflating variances to the point that

β

estimates become unreliable ([8,9]).

Despite steady progress on partially linear models, two shortcomings persist. First, kernel-based estimators that handle autocorrelated errors—e.g., efficient GLS-type fits [4] —leave the multicollinearity problem untreated. Second, ridge or Stein-type shrinkage estimators developed for i.i.d. settings ([8,10]) do not accommodate serial dependence. To our knowledge, no existing method tackles collinearity and autocorrelation simultaneously within a single, closed-form estimator; nor do prior studies examine how data-driven criteria jointly tune the kernel bandwidth and shrinkage weight. Filling this gap is the focus of the present work. Accordingly, our objective is to develop and rigorously evaluate a unified estimator that stabilizes the parametric coefficients under multicollinearity, adapts to unknown nonlinear structure, and remains efficient in the presence of autoregressive errors.

To mitigate these issues, we adopt ridge-type kernel smoothing for the parametric component. In essence, a shrinkage penalty is added so that the linear portion

β

is “shrunk” toward zero to stabilize the estimation. Mathematically, for instance, a ridge-type kernel (RTK) estimator of

β

can be expressed (after suitable data transformation) as in (6) which has been extensively studied by [11]. This ridge-inspired framework stems from Stein-type shrinkage theory, which shows that biased estimators can dominate unbiased ones, particularly in moderate-to-high dimensions or under strong correlation ([12,13]). Indeed, Stein’s paradox reveals that for

p \geq 3

shrinking an estimator can lower the overall mean squared error despite introducing bias, a phenomenon extensively explored by [12,14,15]).

When errors follow (2), we further incorporate transformations to handle the autocovariance structure, yielding generalized ridge-type kernel (GRTK) estimators that optimally exploit the correlation

ρ

. These GRTK methods align with earlier biased estimation approaches for linear or partially linear models with correlated errors ([10,13,16]). By merging the Stein-type shrinkage logic with kernel smoothing, one can obtain robust estimators of both

β

and

f (.)

that effectively address multicollinearity and autocorrelation in a unified fashion.

In this paper, we formalize and extend such modified ridge-type kernel smoothing estimators for partially linear time series models. We focus on:

The precise data transformation to accommodate (2).
The interplay between shrinkage $k$ and the bandwidth $λ$ for kernel smoothing.
Selecting these tuning parameters ( $k, λ$ ) using multiple criteria, including GCV, AICc, BIC, and RECP.

This framework is motivated by [17] broader research on shrinkage-based variable selection and penalty methods, which shows that penalized procedures (including ridge, lasso, and Stein-like estimators) can vastly reduce variance in semiparametric regressions—particularly with correlated data or limited samples. Consequently, the main goal is to fill a gap in semiparametric regression by uniting the autoregressive error modeling (2), handling multicollinearity and Stein-type shrinkage in kernel-based estimation. Based on the focuse stated above, the paper is expected to make the following contributions:

We introduce a generalized ridge-type kernel (GRTK) estimator that embeds an $l_{2}$ penalty inside each kernel-weighted fit, yielding a closed-form solution that simultaneously controls collinearity and accounts for AR(1) errors.
Under mild regularity conditions we derive the closed-form bias, variance and risk expansions, and prove that GRTK dominates the unpenalised local-linear estimator when the condition number of the local design exceeds a modest threshold.
We show how four information-based criteria (GCV, AICc, BIC, RECP) can jointly choose the kernel bandwidth and ridge weight—an aspect unexplored in earlier work.
Extensive simulations and an airline-delay application demonstrate that GRTK and its Stein-type shrinkage extensions yield more stable coefficient estimates and cleaner residual series than existing semiparametric alternatives.

The rest of the paper is organized as follows. Section 2 explains how we fit the autoregressive structure (2) and handle the collinearity. Section 3 provides the mathematical formulations of our GRTK shrinkage estimators for both

β

and the smooth function

f (.)

, including asymptotic properties. Section 4 details the procedures for choosing

k

and bandwidth

λ

. Section 5 illustrates, through extensive simulation and an airline delay time series dataset, how the proposed methodology outperforms traditional methods under multicollinearity and autocorrelation. Finally, Section 6 concludes by discussing of key findings and future research directions.

2. Fitting the Model Error Structure

In partially linear time series models, two phenomena often inflate the estimation risk: high correlation among the explanatory variables and serial dependence in the errors. A local-kernel fit alone addresses the non-parametric component but inherits numerical instability when the local design matrix is nearly singular. Conversely, ordinary ridge regression stabilises the parametric fit but cannot capture an unknown smooth

f (\cdot)

. Combining the two resolves both issues: the ridge penalty absorbs multicollinearity at each kernel centre, while the kernel weights allow

f (.)

to vary flexibly with

z_{t}

. The resulting estimator admits a closed-form solution, and its bias–variance balance can be tuned automatically through data-driven information criteria. This synergy yields a practical, single-step procedure that is robust to ill-conditioning yet retains the local adaptivity required for non-linear structure.

Assume that

y_{t} = (y_{1}, \dots, y_{n})

′ is a realization from a stationary time series described by model (1) with autoregressive error terms satisfying the following assumption.

Assumption 1.

The error

ε_{t}

is a stationary dependent sequence with

E (ε_{t}) = 0, V a r (ε_{t}) = σ_{ε}^{2},

and

E {|ε_{t}|}^{2 + δ} < \infty

for some constant

δ .

Letting

ε_{t} = (ε_{1}, \dots, ε_{n})

and

V a r (ε_{t}) = E (ε_{t} ε_{t}^{'}) = σ^{2} R

, where

R

is a

(n \times n)

positive definite symmetric matrix.

Here, Assumption 1 is standard in time series analysis because it allows for modeling of stationary errors with finite variance. Such an assumption is common in many time series models where the error term’s statistical properties remain constant over time. Note that the autoregressive errors follow an n −dimensional multivariate normal distribution with a mean zero and stationary

(n \times n)

covariance matrix

Σ

. We may also write this expression in the equivalent form

ε_{t} = (ε_{1}, \dots, ε_{n}) ~ N (0, Σ)

, where

Σ = σ^{2} R

is a

n \times n

covariance matrix, given by

R = \frac{1}{1 - ρ^{2}} [\begin{matrix} 1 & ρ & ρ^{2} & \dots & ρ^{n - 1} \\ ρ & 1 & ρ & \dots & ρ^{n - 2} \\ \begin{matrix} ρ^{2} \\ \begin{matrix} ⋮ \\ ρ^{n - 1} \end{matrix} \end{matrix} & \begin{matrix} ρ \\ \begin{matrix} ⋮ \\ ρ^{n - 2} \end{matrix} \end{matrix} & \begin{matrix} 1 \\ \begin{matrix} ⋮ \\ ρ^{n - 3} \end{matrix} \end{matrix} & \begin{matrix} \dots \\ \begin{matrix} ⋱ \\ \dots \end{matrix} \end{matrix} & \begin{matrix} ρ^{n - 3} \\ \begin{matrix} ⋮ \\ 1 \end{matrix} \end{matrix} \end{matrix}], R^{- 1} = \frac{1}{1 - ρ^{2}} [\begin{matrix} 1 & - ρ & 0 & \dots & 0 & 0 \\ - ρ & 1 + ρ^{2} & - ρ & 0 & \dots & 0 \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ 0 & \dots & 0 & 0 & - ρ & 1 \end{matrix}],

(4)

where

ρ

denotes the correlation between

ε_{t}

and

ε_{t - 1}

, as defined before. It should be emphasized that once we have an estimate

\hat{β}

of the parameter vector

β

and an estimator

\hat{f}

of the unknown smooth function

f

, the parameter

ρ

can be estimated by the residuals from the semiparametric regression

e_{t} = y_{t} - (x_{t} \hat{β} + \hat{f} (z_{t}))

, defined as

\hat{ρ} = \frac{C o v (e_{t}, e_{t - 1})}{\sqrt{V a r (e_{t})} \sqrt{V a r (e_{t - 1})}} = \frac{\sum_{t = 2}^{n} e_{t} e_{t - 1}}{\sum_{t = 1}^{n} e_{t}^{2}} .

(5)

There are several methods to estimate

β

and

f .

In this work, we adopt the ridge type kernel (RTK) method discussed by [11], which generalizes partial kernel method proposed by [14]. Specifically, the RTK estimator

\hat{β}

of

β

is the form

{\hat{β}}_{R T K} = {({\tilde{X}}^{'} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} \tilde{y},

(6)

where

k > 0

is a shrinkage parameter,

I

is an

(p \times p)

identity matrix,

{\tilde{x}}_{t} = x_{t} - \sum_{i = 1}^{n} W_{λ t} (z_{i}) x_{i} = \{(I - W_{λ}) X = \tilde{X}\}

,

{\tilde{y}}_{t} = y_{t} - \sum_{i = 1}^{n} W_{λ t} (z_{i}) y_{i} = \{(I - W_{λ}) y = \tilde{y}\}

, and

W_{λ}

is a kernel smoother matrix of weights satisfying Assumption (6), given by

W_{λ t} (z) = K (\frac{z_{t} - z}{λ}) / \sum_{t = 1}^{n} K (\frac{z_{t} - z}{λ}) = W_{λ},

(7)

where

λ

is a bandwidth parameter to selected, and

K (.)

is the kernel function defined in Remark 2. For example,

K (.)

could be Gaussian, Epanechnikov, etc. Then, an estimator

\hat{f}

of the nonparametric part

f (z)

is obtained by

{\hat{f}}_{R T K} = \sum_{i = 1}^{n} W_{λ t} (z) (y_{t} - x_{t} {\hat{β}}_{R T K}) = W_{λ} (y - X {\hat{β}}_{R T K}) .

(8)

Hence, the autoregressive coefficient

\hat{ρ}

arises from (5) and the noise of the AR(1) is obtained by

{\hat{u}}_{t} = e_{t} - \hat{ρ} e_{t - 1}

. Note that since the errors in the model (1) are serially correlated, the estimators defined in (6) and (8) is not asymptotically efficient. To improve efficiency, we use kernel ridge type weighted estimation based on transforming data

\tilde{y}

.

3. Generalized Ridge Type Kernel Estimation

The generalized ridge type kernel (GRTK) estimator of the parametric components in model (1) is constructed by combining the partial kernel method with a suitable transformation to account for

ρ

in the AR(1) errors. We will then extend this GRTK estimator to incorporate the shrinkage estimation, analyze its asymptotic properties, and discuss its statistical characteristics. For notational clarity we first derive (9)–(12) under the working assumption that the true

β

and

R

are known; these quantities are replaced by consistent preliminary estimators in practice. Using the assumption

{E (ε}_{t}) = 0

, one can see that

f (z_{t}) = y_{t} - x_{t} β, t = 1, \dots, n

. Thus, for a given

β

, the natural estimator of the nonparametric component

f (.)

is

\hat{f} = \sum_{t = 1}^{n} W_{λ t} (z) (y_{t} - x_{t} \hat{β}) = W_{λ} (y - X \hat{β}),

(9)

where

W_{λ}

is a kernel smoother matrix, as defined in (7). Since

R

defined in Assumption 1 is positive definite, there exists a

(n \times n)

−dimensional matrix

P

such that

P^{'} P = R^{- 1}

. Also, the following assumptions are needed because they reflect those in standard local-linear and ridge analyses and hold for most applied data after routine diagnostics.

A1. Under Assumption 1,

\{ε_{t}\}

follows a weakly stationary AR(1) process with

|ρ| < 1

.

A2. Eigenvalues of

n^{- 1} X_{t}^{'} X_{t}

are bounded away from 0; the ridge term in (11) enforces this when multicollinearity is severe.

A3.

K (\cdot)

is symmetric, bounded, integrates to 1, with bandwidth

h \to 0

and

n h \to \infty

as

n \to \infty

.

In practice, the GRTK estimator is obtained through a short plug-in routine. First, we estimate the AR(1) coefficient

ρ

(and thus the covariance matrix

R

) by a standard

\sqrt{n}

-consistent method, such as maximum likelihood. Second, we compute an initial

{\hat{β}}^{(0)}

from a ridge-free partial-kernel fit. Finally,

\hat{R}

and

{\hat{β}}^{(0)}

are substituted into (9)–(12) and the system is solved once to give

{\hat{β}}_{G R T K} (k)

. Appendix B proves that this one-shot plug-in estimator shares the same asymptotic distribution as the oracle version that uses the true

R

and

β

.

We first assume that the matrix

R

and, hence,

P

are known. Then, we fit the error structure by the following residuals:

e_{t} = y_{t} - x_{t} β - \hat{f} = {\tilde{y}}_{t} - {\tilde{x}}_{t} β, 1 \leq t \leq n

(10)

Here, as defined in (6),

{\tilde{y}}_{t}

and

\tilde{x}

are locally centered residuals of

y_{t}

and

x

, respectively. By considering these partial residuals, the GRTK estimator of the vector

β

is obtained by minimizing the following weighted least squares (WLS) equation.

W L S (β) = {(P ({\tilde{y}}_{t} - {\tilde{x}}_{t} β))}^{'} (P ({\tilde{y}}_{t} - {\tilde{x}}_{t} β)) + k β^{'} β = {‖P (\tilde{y} - \tilde{X} β)‖}_{2}^{2} + k {‖β‖}_{2}^{2},

(11)

where

{‖.‖}_{2}

denotes the Euclidean norm,

k > 0

is a shrinkage parameter to be selected by a selection method,

\tilde{y} = (I - W_{λ}) y

and

\tilde{X} = (I - W_{λ}) X

are the centered residuals, as defined in (6). Also,

W_{λ}

is a kernel smoothing matrix based on a bandwidth parameter

λ > 0

to be selected.

Algebraically, after some operations, the solution of the equation

W L S (β)

yields the following GRTK estimator:

{\hat{β}}_{G R T K} (k) = {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} R^{- 1} \tilde{y} .

(12)

Replacing

β

in the (9) with the

{\hat{β}}_{G R T K} (k)

in (12) to produce an GRTK estimator of the nonparametric component in the partially linear model

f (.)

given by

{\hat{f}}_{G R T K} = \sum_{t = 1}^{n} W_{λ t} (z) (y_{t} - x_{t} {\hat{β}}_{G R T K} (k)) = W_{λ} (y - X {\hat{β}}_{G R T K} (k)) .

(13)

This section details the construction of the Generalized Ridge-Type Kernel (GRTK) estimator for the parametric component in model (1). This is achieved by combining the partial kernel method with a suitable transformation that accounts for

ρ

in the AR(1) errors. We will then extend this GRTK estimator to incorporate shrinkage estimation, analyze its asymptotic properties, and discuss its statistical characteristics.

3.1. Shrinkage Estimators with GRTK Estimators

While the GRTK estimator in (12) addresses multicollinearity and autocorrelation simultaneously, further variance reduction is possible via Stein-type shrinkage. By combining the full model GRTK estimator with a more parsimonious submodel estimator, we can develop Stein-type shrinkage estimators that may achieve lower risk under certain conditions. This approach follows the framework developed by [14] and extends it to the partially linear time series context. To develop the shrinkage estimators, we consider two models:

1. Full Model (FM): The complete model with all predictors

(p = p_{1} + p_{2})

where

p_{1} < p

is the number of significant coefficients and

p_{2}

denotes the number of sparse coefficients, estimated using GRTK as in Equation (12).

2. Submodel (SM): A reduced model with dimension (

p_{1} < p

), selected via Bayesian information criterion (BIC) to minimize model selection, yielding

{\hat{β}}_{S M}

.

Let

{\hat{β}}_{F M}

be the parametric estimate from the ridge-type kernel smoothing (GRTK) on all predictors as full model

{\hat{β}}_{G R T K} (k)

from Equation (12) and let

{\hat{β}}_{S M}

be the corresponding estimate when only a submodel of predictors is retained using BIC. Accordingly, the ordinary shrinkage estimator (

{\hat{β}}_{S}

) can be give as follows:

{\hat{β}}_{S} = {\hat{β}}_{F M} - [(p_{2} - 2) / \hat{T}] ({\hat{β}}_{F M} - {\hat{β}}_{S M}),

(14)

where

\hat{T}

is a “distance measure” given below, and

(p_{2} - 2) / \hat{T}

parallels the classical shrinkage factor. Intuitively, we are pulling

{\hat{β}}_{F M}

toward

{\hat{β}}_{S M}

, thus introducing some bias but potentially reducing variance and risk. The distance is given by:

\hat{T} = n {\hat{σ}}^{- 2} {\hat{β}}_{2}^{'} [{\tilde{X}}_{2}^{'} {\tilde{U}}_{1} {\tilde{X}}_{2}] {\hat{β}}_{2},

where

{\hat{β}}_{2}^{'}

is the portion of

{\hat{β}}_{F M}

(or

{\hat{β}}_{S M}

) corresponding to the submodel’s parametric indices.

{\tilde{X}}_{2}

is the part of the design matrix associated with those sparse indices.

{\tilde{U}}_{1} = I_{n} - {\tilde{X}}_{1} {[{\tilde{X}}_{1}^{'} {\tilde{X}}_{1} + k I_{p_{1}}]}^{- 1} {\tilde{X}}_{1}^{'}

is a partial residual projection, removing the other block of parametric covariates from the fit.

{\hat{σ}}^{2}

is given in (25) but note that

H_{λ, k}

is chosen SM-based partial kernel fit (see [10] for details).

To prevent over-shrinking when

(p_{2} - 2) / \hat{T}

is large or negative, we define a positive-part version:

{\hat{β}}_{P S} = {\hat{β}}_{F M} - {[\frac{p_{2} - 2}{\hat{T}}]}^{+} ({\hat{β}}_{F M} - {\hat{β}}_{S M}),

(15)

where

{(x)}^{+} = \max \{0, x\} .

Thus, if

(p_{2} - 2) / \hat{T}

is negative, we do no shrinkage; if it is positive, we shrink as in (14).

Following a partially linear kernel-smoothing approach (GRTK), the nonparametric function

f (.)

can be recomputed (or simply computed once) based on these final shrunk parametric estimates using smoothing matrix

W_{λ}

:

{\hat{f}}_{S} = S_{λ} (y - X {\hat{β}}_{S}), {\hat{f}}_{P S} = S_{λ} (y - X {\hat{β}}_{P S})

(16)

Hence, each final semiparametric estimators become

({\hat{β}}_{S}, {\hat{f}}_{s}), ({\hat{β}}_{P S}, {\hat{f}}_{P s})

.

3.2. Asymptotic Distribution of the Estimators

This subsection presents the assumptions and theorems necessary to analyze the asymptotic properties of the proposed

{\hat{β}}_{G R T K} (k)

and

{\hat{f}}_{G R T K} (k)

estimators. Understanding the asymptotic distribution is crucial for statistical inference. In this context, we introduce the following assumptions that can be easily satisfied.

Assumption 2.

In the setting of the semiparametric regression model, covariates

x_{i}

and

z_{i}

are related via the following nonparametric regression model.

x_{i j} = g_{j} (z_{i}) + η_{i j}, (1 \leq i \leq n, 1 \leq j \leq p)

(a)

where

{η_{i j} = (η_{i 1}, \dots, η_{i p})}^{'}

are real sequence satisfying.

\lim_{n \to \infty} \frac{1}{n} \sum_{i = 1}^{n} η_{i} η_{i}^{'} = B,

(b)

and

{m a x}_{1 \leq j \leq n} ‖\sum_{i = 1}^{n} W_{λ i} (z_{i}) η_{i}‖ = O (n^{- 1 / 6})

(c)

where

B

is a

(p \times p)

positive definite matrix and

‖.‖

indicates the Euclidean norm.

Assumption 3.

The functions

g (.)

and

f_{j} (.), 1 \leq j \leq k,

satisfy a Lipschitz continuous of order 1 on data domain.

Assumption 4.

The weight functions

W_{λ i}

satisfy these conditions:

(i.): ${m a x}_{1 \leq i \leq n} \sum_{j = 1}^{n} W_{λ i} (z_{j}) = O (1)$
(ii.): ${m a x}_{1 \leq i, j \leq n} W_{λ i} (z_{j}) = O (n^{- 2 / 3})$
(iii.): ${m a x}_{1 \leq j \leq n} \sum_{i = 1}^{n} W_{λ i} (z_{j}) I (|z_{i} - z_{j}| > O (n^{- 1 / 3})) = O (n^{- 1 / 3}),$ where $I (A)$ is the indicator function of a set $A$ .
(iv.): $t r (W_{λ}^{'} W_{λ}) = t r (W_{λ}) = O (λ^{- 1})$ , where $t r (A)$ denotes the trace of a square matrix $A$ .

Regarding the assumptions given above, Assumption 2 which ise related to smoothness, requires two bounded derivatives of

f (.)

is the minimal condition for the

O (h^{2})

bias expansion that emphasizes well known kernel methods. In practice, visual inspection of a pilot kernel fit or a spline smoother is usually sufficient to rule out abrupt alterations. For Assumption 3 which refers to local design, the ridge term in (11) guarantees a strictly positive lower eigen-bound even when the raw design is ill-conditioned (see [18,19]) for analogous arguments in linear GLS. In Assumption 4 for error dependence, while we state an AR(1) model for clarity, Theorems 1–2 require only strong mixing with finite fourth moments [18].

Remark 1.

The

η_{i j}

stated in (a) of Assumption 2 behaves like as a zero mean uncorrelated sequence of stationary random variables of independent of

ε_{t}

. If the covariates

x_{i}

and

z_{i}

are the observations of the independent and identically distributed random variables, then

g_{j} (z)

can be considered as

E (x_{i j}| z_{i}) = g_{j} (z_{i}), E (η_{i} η_{i}^{'}) = B, a n d η_{i j} = (x_{i j} - g_{j} (z_{i}))

. See [14] for more details. In this case, (b) holds with probability 1 according to the law of large numbers, and (c) is provided by Lemma 1 in [16].

Remark 2.

If

K (x)

is a kernel function, then

K (x) = K (\frac{x}{λ})

is also a kernel function based on a positive bandwidth parameter

λ

, and it satisfies the following properties:

\begin{array}{l} (a) \int K (x) d x = 1, & (b) \int x K (x) d x = 0, \\ (c) μ_{2} = \int x^{2} K (x) d x < \infty, & (d) K (x) \geq 0 for all x, and K (x) = K (- x) \end{array}

(17)

These properties show that a kernel function needs to be symmetric and continuous probability density function with mean zero and constant variance. Note that the bandwidth parameter

λ

should be chosen optimally by a selection criterion in kernel estimation or smoothing. For example, a large

λ

provides an extremely smooth curve or estimate, while a small

λ

produces a wiggly function curve.

Theorem 1.

Suppose that the Assumption 1 and Assumptions 2–3 hold, and assume that

\lim_{n \to \infty} \frac{1}{n} \sum_{i = 1}^{n} η_{i} R^{- 1} η_{i}^{'} = V

is a non-singular covariance matrix with

η_{i} = {(η_{1}, \dots, η_{n})}^{'}

. Then, as

n \to \infty,

we have

\sqrt{n} ({\hat{β}}_{G R T K} (k) - β) \overset{D}{\to} N (0, σ^{2} V^{- 1}) and \sqrt{n} (\hat{ρ} - ρ) \overset{D}{\to} N (0, n^{- 1} (1 - ρ^{2}))

where

\overset{D}{\to}

denotes convergence in distribution. See [9] for the proof of the Theorem 1.

The asymptotic normality established in Theorem 1 allows for valid statistical inference based on the GRTK estimator. Intuitively, this result holds because the transformed data approach effectively accounts for autocorrelation, while the kernel smoothing effectively separates the nonparametric component, allowing the parametric component to be estimated consistently with a regular

\sqrt{n}

convergence rate.

Theorem 2.

If the Assumptions 1 and 2-3 hold, then it hold

{m a x}_{1 \leq i \leq n} |{\hat{f}}_{G R T K} (z_{i}) - f (z_{i})| = O (n^{- 1 / 3} l o g (n)) a . s .

See [9] for proof of the Theorem 2. The asymptotic normality and convergence results established in Theorems 1 and 2 provide a theoretical foundation for the GRTK estimators. In the following section, we analyze additional statistical properties, such as bias and variance.

This convergence rate for the nonparametric component is optimal in the sense that no estimator can achieve a faster uniform convergence rate without additional structural assumptions. The logarithmic factor arises from the need to establish uniform rather than pointwise convergence. Also note that for both Theorem 1 and 2, results also holds if Assumption 4 is replaced by the strong-mixing condition

\sum_{k = 1}^{\infty} α {(k)}^{1 - \frac{2}{δ}} < 0

for some

δ > 2

where

α (k)

is the Rosenblatt mixing coefficient at lag

k

(see [18] for details). Intuitively, this summation condition says the temporal dependence in the error process decays quickly enough that observations far apart behave almost independently allowing the usual kernel-based central-limit theory to go through (see [20] for similar insights).

Note that Theorem 2 express that

{\hat{f}}_{G R T K}

reaches the optimal strong convergence rate. Also, a consistent estimator of the asymptotic covariance matrix

V

is required for statistical inference based on

{\hat{β}}_{G R T K} (k)

. The estimate of covariance matrix

V

can be obtained as

\hat{V} = \frac{1}{n} ({\tilde{X}}^{'} R^{- 1} \tilde{X}),

(18)

where

\hat{V}

is the asymptotic covariance matrix of

\sqrt{n} ({\hat{β}}_{G R T K} (k) - β)

. Regarding the shrinkage estimator’s asymptotic properties for the currently considered low-dimensional (

p < n

) settings, we outline the key asymptotic results. Let

p_{2} \geq 3

be the size of the submodel portion. Suppose

β_{2}

is subject to a local alternatives framework:

β_{2 n} = \frac{1}{\sqrt{n}} ϑ, for fixed ϑ \in R^{p_{2}},

So that

β_{2 n} \to 0

as

n \to \infty

. Additionally, let

β_{1 n}

denotes the complementary block (nonzero coefficients) of dimension

p_{1}

. Under standard regularity conditions ([10]) we can claim that:

Both ${\hat{β}}_{S}$ and ${\hat{β}}_{P S}$ remain consistent for $β$ as a point of consistency.
By construction, these estimators introduce shrinkage-based bias in exchange for variance reduction. The bias depends on $(p_{2} - 2) / \hat{T}$ or ${[\frac{(p_{2} - 2)}{\hat{T}}]}^{+}$ and and generally involves certain noncentral $χ^{2}$ -based expectations.
Regarding the asymptotic quadratic bias (AQDB), asymptotic covariance and asymptotic distributional risk (ADR), similar expansions hold showing how ${\hat{β}}_{S}$ and ${\hat{β}}_{P S}$ incorporate both ridge (FM) plus Stein corrections.

Exact closed-form expressions for the bias, covariance, and risk match those in [10] or [14], with minor notational changes for

\hat{T}

and the partial kernel smoothing. Hence, we omit rewriting them here. However, for completeness, detailed bias and risk expansions for these estimators appear in Appendix B. The main takeaway is that

{\hat{β}}_{P S}

typically achieves the smallest asymptotic distributional risk among the four choices (FM, SM, ordinary Stein, positive-part Stein) whenever

(p_{2} - 2) / \hat{T}

would become negative or large, thus showing the advantage of positive-part truncation in a partially linear framework.

3.3. Statistical Properties of the Estimators

Here, we detail the statistical properties of the GRTK estimator, including its bias, variance, and expected value. These properties help to characterize the estimator’s behavior and quality. As defined in (12), when

k = 0,

the GRTK estimators of the parametric and nonparametric components reduce to ordinary kernel smoothing (KS) estimators for a partially linear model with correlated error. They can be defined, respectively, as follows:

{\hat{β}}_{K S} = {({\tilde{X}}^{'} R^{- 1} \tilde{X})}^{- 1} {\tilde{X}}^{'} R^{- 1} \tilde{y} and {\hat{f}}_{K S} {= W}_{λ} (y - X {\hat{β}}_{K S}) .

(19)

Using the abbreviation

{G_{k} = ({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} .

Expected value, bias and variance of the estimator

{\hat{β}}_{G R T K}

can be defined, respectively, as

E [{\hat{β}}_{G R T K} (k)] = G_{k} {\tilde{X}}^{'} R^{- 1} \tilde{X} E [β] + G_{k} {\tilde{X}}^{'} R^{- 1} \tilde{f} = β - k G_{k} β + G_{k} {\tilde{X}}^{'} R^{- 1} \tilde{f}

(20)

B i a s ({\hat{β}}_{G R T K} (k)) - β = G_{k} {\tilde{X}}^{'} R^{- 1} \tilde{f} - k G_{k} β

(21)

V a r ({\hat{β}}_{G R T K} (k)) = {σ^{2} G}_{k} {\tilde{X}}^{'} R^{- 1} {(I - W_{λ})}^{2} R^{- 1} \tilde{X} G_{k} .

(22)

The implementation details of Equations (20)–(22) are given in Appendix A.1.

Clearly,

E [{\hat{β}}_{G R T K} (k)] \neq β

for any

> 0

. Hence, the GRTK estimator is a biased estimator. From the expression above it is clear that the expectation of the GRTK estimator vanishes as

k

tends to infinity

\lim_{k \to \infty} {\hat{β}}_{G R T K} (k) = {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} R^{- 1} \tilde{X} β + {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} R^{- 1} \tilde{f} = 0

Hence, all coefficients of the parametric component are shrunken towards zero as the ridge (or shrinkage) parameter increases.

As shown in the (22), although the smoothing method provides the estimates of the components in the model, they do not directly provide an estimate of the variance of the error terms (i.e.,

σ^{2}

). In a general partially linear model, the estimate of variance

σ^{2}

can be found by the residual sum of squares

R S S = {(y - \hat{y})}^{'} (y - \hat{y}) = {(y - (X \hat{β} + \hat{f}))}^{'} (y - (X \hat{β} + \hat{f})) = {(y - H_{λ, k} y)}^{'} (y - H_{λ, k} y) = {‖({I - H}_{λ, k}) y‖}_{2}^{2},

(23)

where

H_{λ, k} = W_{λ} + {(I - W_{λ})}^{2} X G_{k} {\tilde{X}}^{'} R^{- 1}

is a hat matrix which depends on a smoothing parameter

λ > 0

and a shrinkage parameter

k

> 0 for partially linear regression model. Note that the fitted values of the model defined in the (1) is obtained by this matrix

H_{λ, k}

, given by

\hat{y} = X \hat{β} + \hat{f} = [W_{λ} + {(I - W_{λ})}^{2} X G_{k} {\tilde{X}}^{'} R^{- 1}] y = H_{λ, k} y

(24)

where

{G_{k} = ({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1}

, as defined in Equations (20)–(22). The implementation details of Equation (24) is given in Appendix A.2. Thus, similar to ordinary least squares regression, estimation of the error variance can be stated as

{\hat{σ}}^{2} = \frac{{‖({I - H}_{λ, k}) y‖}_{2}^{2}}{t r {(I - H_{λ, k})}^{2}} = \frac{R S S}{n - p},

(25)

where

t r {(I - H_{λ, k})}^{2} = n - t r [2 H_{λ, k} - {(H_{λ, k})}^{'} (H_{λ, k})] = n - p

denotes the residual degrees of freedom (see, for example, [4]). It appears from the denominator of Equation (25) that the degrees of freedom for

R S S

can also be expressed as the sample size minus the number of estimated parameters in the model. Regarding the standard error computation, from (25), and the selected

\hat{R}

,

\hat{h}

and

\hat{k}

, we form the plug-in covariance matrix by substituting into the theoretical variance in (22). The reported standard errors are the square-roots of its diagonal entries.

3.4. Assessing the Risk and Efficiency

To further evaluate the estimators, we now consider measures of risk and efficiency. As discussed earlier,

V a r (\hat{β})

and

B i a s (\hat{β})

important criteria for assessing estimator quality. This section introduces the Mean Squared Deviation (MSD) and quadratic risk function to compare the performance of different estimators. In general, the ill-effect of

B i a s (\hat{β})

is known as information loss, and this loss can be measured by a function stated as the mean squared deviation (MSD). Note that the MSD is a risk function corresponding to the expected value of the squared error loss for an estimator. In terms of a squared error loss, the

M S D

can be defined as a matrix consisting of the sum of the variance-covariance matrix and the squared bias. While the MSD matrix provides comprehensive information about estimation quality, comparing matrices directly is challenging. Therefore, we derive a scalar measure by taking the trace of the MSD matrix, which represents the sum of mean squared errors across all coefficient estimates:

M S D (\hat{β}, β) = E ({(\hat{β} - β)}^{'} (\hat{β} - β)) = V a r (\hat{β}) + B i a s {(\hat{β})}^{2} .

(26)

The Equation (26) gives detailed information about the quality of an estimator. In addition to the

M S D

matrix, the expected loss, called the scalar-valued version, can also be used to compare different estimators. For convenience, we will work with the scalar valued mean squared deviation.

Definition 1.

The quadratic risk function of an estimator

\hat{β}

of the vector

β

is defined as the scaler valued version of the mean squared deviation matrix (SMDE), given by

R (\hat{β}, β, V) = E ({(\hat{β} - β)}^{'} V (\hat{β} - β)) = t r (V (M S D (\hat{β}, β))) = S M S D (\hat{β}, β),

where

V

is a

(p \times p)

symmetric and non-negative definite matrix. Based on the above risk function, we can define the following criterion to compare estimators.

Definition 2.

Let the vectors

{\hat{β}}_{1}

and

{\hat{β}}_{2}

be the two competing estimators of a parameter vector

β

. If the difference of the matrices

M S D

is non-negative definite, it can be said that the estimator

{\hat{β}}_{2}

is superior to estimator

{\hat{β}}_{1}

, and is given by

∆ ({\hat{β}}_{1}, {\hat{β}}_{2}) = M S D ({\hat{β}}_{1}, β) - M S D ({\hat{β}}_{2}, β) \geq 0 .

Accordingly, we get the following result.

Theorem 3

([21]). Let the vectors

{\hat{β}}_{1}

and

{\hat{β}}_{2}

be the two different estimators of a parameter vector

β

. Therefore, the following two expressions are equivalent

(i.): $R ({\hat{β}}_{1}, β, V) - R ({\hat{β}}_{2}, β, V) \geq 0$ for all non-negative definite matrices $V$
(ii.): $M S D ({\hat{β}}_{1}, β) - M S D ({\hat{β}}_{2}, β)$ is a non-negative definite matrix.

The results of Theorem 3 denotes that

{\hat{β}}_{2}

has a smaller

M S D ({\hat{β}}_{2}, β)

than the estimator

{\hat{β}}_{1}

if and only if the

R ({\hat{β}}_{1}, β, V)

of

{\hat{β}}_{2}

averaging over every quadratic risk function is less than that of the estimator

{\hat{β}}_{1}

. Thus, the superiority of

{\hat{β}}_{2}

over

{\hat{β}}_{1}

can be observed by comparing the matrices

M S D

.

Substituting Equations (21) and (22) in Equation (26), the matrix

M S D

of the proposed estimator

{\hat{β}}_{G R T K} (k)

is obtained as

M S D ({\hat{β}}_{G R T K} (k), β) = V a r ({\hat{β}}_{G R T K} (k)) + B i a s {({\hat{β}}_{G R T K} (k))}^{2} = G_{k} (σ^{2} {\tilde{X}}^{'} R^{- 1} {(I - W_{λ})}^{2} {\tilde{X}}^{'} R^{- 1} + {({\tilde{X}}^{'} R^{- 1} \tilde{f} - k β)}^{2}) G_{k}

Furthermore, considering Definition 1, the quadratic risk function for

{\hat{β}}_{G R T K} (k)

can be stated as follows:

\begin{array}{l} S M S D ({\hat{β}}_{G R T K} (k), β) & = R ({\hat{β}}_{G R T K} (k), β, V) \\ = t r [V (M S D ({\hat{β}}_{G R T K} (k), β))] = t r [M S D ({\hat{β}}_{G R T K} (k), β)] \\ = t r [G_{k} (σ^{2} {\tilde{X}}^{'} R^{- 1} {(I - W_{λ})}^{2} {\tilde{X}}^{'} R^{- 1} + {({\tilde{X}}^{'} R^{- 1} \tilde{f} - k β)}^{'} ({\tilde{X}}^{'} R^{- 1} \tilde{f} - k β)) G_{k}] . \end{array}

The theoretical properties established in this section demonstrate that GRTK estimators effectively balance bias and variance while accounting for both multicollinearity and error autocorrelation. However, the practical performance of these estimators depends critically on the selection of appropriate tuning parameters (

λ, k

). In the following section, we address this challenge by introducing several parameter selection criteria.

4. Choosing the Penalty Parameters

This section focuses on selecting the bandwidth parameter

λ

and the shrinkage parameter

k

, both of which are crucial components of the generalized weighted least squares Equation (11). The goal is to determine the best possible values for

λ

and

k

based on different criteria. To achieve this, we employ parameter selection methods. The parameters λ and k are therefore data-adaptive, chosen by minimizing GCV, AICc, BIC or RECP; under standard regularity conditions each criterion yields a selection that is asymptotically risk-optimal, though not necessarily the unique oracle optimum in finite samples. The most widely used selection criteria are summarized below:

Generalized cross-validation: The best possible bandwidth

λ

and shrinkage parameter

k

can be determined by implementing the

G C V

score function as in equation (27):

(\hat{λ}, \hat{k}) = \underset{λ > 0, k > 0}{argmin} G C V (λ, k) = \underset{λ > 0, k > 0}{argmin} (\frac{n^{- 1} {‖({I - H}_{λ, k}) y‖}^{2}}{{[n^{- 1} t r ({I - H}_{λ, k})]}^{2}}),

(27)

where

H_{λ, k} = W_{λ} + {(I - W_{λ})}^{2} X G_{k} {\tilde{X}}^{'} R^{- 1} w i t h {G_{k} = ({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1},

as is defined in Equations (23) and (24), is the smoother matrix based on the parameters

λ

and

k

.

Improved Akaike information criterion: To eliminate overfitting when the sample size is relatively small, [22] proposed

A I C_{c}

, an improved version of the Classical Akaike Information Criterion

A I C_{c} (λ, k) = \underset{λ > 0, k > 0}{a r g m i n} \{\log (\frac{{‖({I - H}_{λ, k}) y‖}^{2}}{n - tr (H_{λ, k})}) + 1 + \frac{2 \{t r (H_{λ, k}) + 1\}}{n - t r (H_{λ, k}) - 2}\}

(28)

Bayesian Information Criterion: Schwarz’s criterion, also known as the

B I C

is another statistical measure for selection of the penalty parameters. The

B I C

criterion is expressed as

B I C (λ, k) = \underset{λ > 0, k > 0}{a r g m i n} (\frac{{‖({I - H}_{λ, k}) y‖}^{2}}{n - t r (H_{λ, k})}) + (\frac{l o g (n)}{n}) t r (H_{λ, k}) .

(29)

Risk estimation using classical pilots (RECP): The key idea now is to estimate the risk

R (y, \hat{y})

by plugging-in pilot estimates of

y

and

σ^{2}

into (30), and choose the

λ

and

k

that minimizes the criterion (see [11])

R (y, \hat{y}) = \underset{λ > 0, k > 0}{a r g m i n} \frac{1}{n - t r (H_{λ, k})} \{{‖(H_{λ, k} - I) y‖}^{2} + σ^{2} t r ({H_{λ, k}}^{'} H_{λ, k})\} .

(30)

Each of these four criteria offers a valid approach for parameter selection; however, they have different characteristics in practice. GCV generally provides a good balance between bias and variance across different sample sizes.

A I C_{c}

is generally more suitable for smaller samples, where overfitting is a concern. BIC tends to produce more parsimonious models and is often preferred when the true model is believed to be sparse. RECP often provides good fits for the nonparametric component but can be more sensitive to the correlation structure. The simulation study in the next section provides further guidance on criterion selection under different data scenarios.

In practice we interpret

λ

as the bandwidth that sets the smoothness of the non-parametric curve and

k

as the ridge weight that governs the shrinkage on the linear part. The text now clarifies that we start from the rule-of-thumb anchors suggested by asymptotic theory—a bandwidth of the usual bias-variance order and a shrinkage weight on the scale of the inverse sample size—and place a modest logarithmic grid around those anchors. Each candidate pair is evaluated with the four information criteria introduced earlier, and the pair that attains the lowest value is selected; if two pairs tie, the one with the smaller bandwidth is kept to avoid over-smoothing. We also note that the mean-squared-error surface is flat in a neighborhood of the chosen point, which makes the estimator insensitive to small tuning changes, and that the entire grid search finishes almost instantly on standard hardware.

5. Numerical Examples

5.1. Simulation Study

This section presents a Monte Carlo simulation study designed to compare the estimation performances of the introduced kernel-type ridge estimator (GRTK) and shrinkage estimators for model (1) in the presence of correlated errors. To generate parametric predictors, we use a sparse model with

p = 20

. To introduce multicollinearity, covariates are generated with a specific level of collinearity, denoted as

δ

, the following equation is used:

x_{i j} = \sqrt{(1 - δ^{2})} d_{i j} + δ d_{i p}, i = 1, \dots, n, j = 1, \dots, p

(31)

where p is the number of the predictors,

d_{i j}

’s generated as

d_{i j} ~ N (μ_{d} = 0, σ_{d}^{2} = 1)

, and

δ = (0.95, 0.99)

denotes the two correlation levels between the predictors of parametric component. In this context, the simulated data sets are generated from the following model,

y_{t} = x_{t}^{'} β + f (z_{t}) + ε_{t}, t = 1,2, \dots, n,

(32)

as defined in (1). In given model (32),

β {= (- 1, 1.5, 2, 5, 0_{p_{2}})}^{'}

,

x_{t}^{'}

is constructed by Equation (31); the regression function

f (z_{t}) = 6 z_{t} {\sin (z_{t})}^{2}

and

z_{t} = 5 (t - 0.5) / n

; the error terms

ε_{t}

are generated using a first order autoregressive process (that is,

ε_{t} = ρ ε_{t - 1} {+ u}_{t}

) with

ρ = 0.5

and

u_{t} ~ N I I D (0, σ_{u}^{2} = 1)

. Using data generation procedure given by Equations (31) and (32), we consider the three different sample sizes that are

n = 50, 150

and

300

to investigate the performance of the introduced estimators for low, medium and large samples, and each we use 1000 repetition for each simulation combination. The simulation results are provided in the following figures and tables. In addition, Figure 1 shows the generated data to clarify the data generation procedure better.

Figure 1. Simulated data with true

f (z_{t})

, sparse coefficients with zero and non-zero ones and the autocorrelation function of error terms for

n = 150, δ = 0.95

.

From a computational perspective, the GRTK estimation procedure involves several steps: (1) initial parameter estimation to obtain residuals, (2) estimation of the autocorrelation parameter

ρ

, (3) data transformation, (4) parameter selection for (

λ, k

), and (5) final estimation. The most computationally intensive step is typically the selection of (

λ, k

) pairs, requiring a two-dimensional grid search. In our implementation, the GCV-based estimator was the most computationally efficient, followed by BIC, AICc, and RECP where RECP involves a pilot variance estimation, and it is more costly than others.

Before presenting the results, as mentioned in the sections above, selection of bandwidth of kernel smoother (

λ

) and the shrinkage (or ridge) parameter

(k)

has a critical importance due to its effect on the estimation accuracy. Therefore, for different simulation configurations, chosen pairs of

(\hat{λ}, \hat{k})

are presented in Table 1 and Figure 2. Hence, it is possible to see how these parameters are affected by the multicollinearity level between predictors

(δ)

and sample size

(n)

. When examining Table 1, the chosen bandwidth and shrinkage parameters, along with the specified four criteria, are evident for all possible simulation configurations. From the values, it can be observed that the values of the pair

(\hat{λ}, \hat{k})

increase as the correlation level rises, potentially negatively impacting estimation quality in terms of smoothing. On the other hand, the increase in

\hat{k}

, particularly when multicollinearity is high, is an expected behavior aimed at avoiding indefinability in the variance-covariance matrix. For large sample sizes, the criteria tend to select lower

\hat{λ},

but higher “

\hat{k}

” values as a general tendency. Similar selection behavior is observed when

δ = 0.99

. We take the behavior of

(\hat{λ}, \hat{k})

obtained in Table 1 and follow similar roadmap for the shrinkage estimators that are obtained rely on the GRTK estimator.

Table 1. Chosen

(\hat{λ}, \hat{k})

pairs selected by

A I C_{c}

,

B I C

and

R E C P

critera for different simulation configurations for GRTK.

Figure 2. Selection of

(\hat{λ}, \hat{k})

by

A I C_{c}

,

B I C

and

R E C P

under certain conditions:

n = 50, δ = 0.95

.

Additionally, Figure 1 illustrates the selection procedure of each criterion for different possible values of bandwidth and shrinkage parameters simultaneously. It can be concluded from this that

G C V

,

A I C_{c}

, and

B I C

exhibit closer patterns than

R E C P

. This difference is evidently sourced from its risk estimation procedure with the pilot variance shown in (30). Accordingly,

R E C P

-based estimator mostly present different performances under different conditions.

After determination of the data-adaptively chosen tuning parameters

(\hat{λ}, \hat{k})

, time series models are estimated and performance of the estimated regression coefficients of parametric component

({\hat{β}}_{G R T K}

,

{\hat{β}}_{S}, {\hat{β}}_{P S}

) based on the selection criteria are obtained. In this sense, bias, variance and

S M S D

values of

({\hat{β}}_{G R T K}

,

{\hat{β}}_{S}, {\hat{β}}_{P S}

) are presented in Table 2. In addition, 3D figure is given to observe the effect of both sample size and correlation level on the quality of the estimates in terms of parametric component.

Table 2. Outcomes of the simulations with bias, variance, and SMSD scores for

{\hat{β}}_{G R T K}

.

Table 2 reports the numerical performance of the baseline Generalized Ridge-Type Kernel (

{\hat{β}}_{G R T K}

) estimator under various sample sizes (

n

) and collinearity levels (

ρ

). Looking at the bias, variance, and aggregated measures such as SMSD, one sees that high correlation (

ρ = 0.99

) generally inflates estimation error compared to moderate correlation (

ρ

= 0.95). Conversely, larger sample sizes (

n = 150

and

n = 300

) mitigate these issues, resulting in smaller bias, variance, and SMSD. These findings confirm the paper’s theoretical claims that ridge-type kernel smoothing can handle partially linear models with correlated errors more robustly as

n

grows. Nevertheless, because the paper works with

m u l t i p l e

(

p \geq 2

) predictors, the variance inflation is still non-trivial, suggesting that additional shrinkage mechanisms beyond baseline GRTK might yield further improvements.

Table 3 summarizes the performance of the shrinkage estimator (notated as

{\hat{β}}_{S}

), which further regularizes the GRTK approach by incorporating Stein-type shrinkage toward a submodel. Also given

(p_{1}, p_{2})

represent the average number of chosen coefficients during the simulations using penalty functions for both nonzero and sparse ones. Relative to Table 2’s baseline GRTK results, the shrinkage estimator consistently displays smaller bias–variance trade-offs across most combinations of

ρ

and

n

. In particular, when the predictors are highly collinear, the

{\hat{β}}_{S}

estimator’s additional penalization proves beneficial by reducing the inflated variance typically caused by near-linear dependencies among regressors. Hence, these results validate one of the paper’s key arguments: that combining kernel smoothing with shrinkage can outperform a pure ridge-type kernel approach in high-dimensional or strongly correlated settings, especially for moderate sample sizes.

Table 3. Outcomes of the simulations with bias, variance, and SMSD scores for

{\hat{β}}_{P S}

.

Table 4 presents the outcomes for the positive-part Stein (

{\hat{β}}_{P S}

) shrinkage version, which modifies the ordinary Stein estimator by truncating the shrinkage factor at zero. According to the paper’s premise, PS is designed to avoid “overshrinking” when the shrinkage factor becomes negative and is thus expected to give the strongest performance among the three estimators, particularly in the given scenario with high collinearity. Indeed, Table 4’s bias, variance, and SMSD values are generally on par with—or superior to—those reported in Table 2 and Table 3. When

ρ = 0.99

is especially large, the PS estimator still manages to keep both the bias and variance relatively contained, highlighting that positive-part shrinkage delivers robust protection against severe collinearity. As a result, these findings support the paper’s claim that the PS shrinkage mechanism provides an appealing balance of bias and variance under challenging data conditions, outperforming both baseline GRTK and standard Stein shrinkage (S) in many of the tested simulation settings.

Table 4. Outcomes of the simulations with bias, variance, and SMSD scores for

{\hat{β}}_{S}

.

Overall, the results across Table 2, Table 3 and Table 4 demonstrate that each proposed estimator—baseline GRTK (

{\hat{β}}_{G R T K}

), shrinkage (

{\hat{β}}_{S}

), and positive-part Stein shrinkage (

{\hat{β}}_{P S}

)—responds predictably to varying sample sizes and degrees of multicollinearity. Larger samples result in smaller bias, variance, and SMSD values, confirming the benefits of increased information for parameter estimation. Simultaneously, strong correlation among predictors inflates these metrics, illustrating the challenge posed by multicollinearity. Transitioning from GRTK to the two shrinkage-based estimators generally yields progressively better performance, particularly in higher-dimensional settings (where

n > p

) with

p = 20

. In particular, the positive-part Stein approach (

{\hat{β}}_{P S}

) often delivers the most stable results, indicating that its careful control of shrinkage is advantageous for mitigating both variance inflation and overshrinking.

Figure 3 visually compares the distributions of the estimated coefficients for

{\hat{β}}_{G R T K}

,

{\hat{β}}_{S}

, and

{\hat{β}}_{P S}

via boxplots under different sample sizes and correlation levels, complementing the numerical findings in Table 2, Table 3 and Table 4. In each panel, the baseline

{\hat{β}}_{G R T K}

estimator often deviates from the true coefficient value and zero for the bias, but it remains competitive with shrinkage estimators in terms of variance, especially when the data exhibit high collinearity, reflecting the variance inflation reported in Table 2. By contrast, the two shrinkage methods—

{\hat{β}}_{S}

(Table 3) and

{\hat{β}}_{P S}

(Table 4)—show boxplots that are close to real values of the coefficients, indicating smaller biases and more stable estimates. Particularly under higher correlation (

ρ = 0.99

) and moderate sample sizes,

{\hat{β}}_{P S}

tends to yield less spread than the other estimators, consistent with the idea that positive-part Stein shrinkage further controls overshrinking while tackling multicollinearity. Overall, these graphical patterns support the tables’ numeric trends, emphasizing how adding shrinkage to a kernel-based estimator improves both bias reduction and variance stabilization in partially linear models.

Figure 3. Boxplots for bias and estimated regression coefficients obtained based on four criteria

G C V, A I C_{c}, B I C

and

R E C P

under certain conditions. The horizontal line in the subfigures shows zero bias as a reference line, and outliers are shown by black circles.

Table 5 reports the mean squared errors (MSE) for estimating the nonparametric component

f (z)

under various combinations of sample size (

n

) and correlation level

ρ

. Reflecting patterns observed in the parametric estimates, a larger

ρ

typically increases MSE values, while growing the sample size reduces them. Notably, the table shows that the proposed

{\hat{f}}_{G R T K}

and its shrinkage versions (

{\hat{f}}_{S}

and

{\hat{f}}_{P S}

) all perform better as

n

becomes larger; this aligns with the theoretical expectation that more data stabilize both kernel smoothing and shrinkage procedures in a partially linear context.

Table 5. MSE values for the estimated nonparametric component

{\hat{f}}_{G R T K}

for all simulation combinations.

Figure 4 further illustrates these differences in estimating

f (z)

by depicting the fitted nonparametric curves for selected simulation settings. Panels (a)–(b) highlight how increasing

ρ

can distort the estimated function, leading to higher MSE—an effect that is more pronounced for smaller samples. Meanwhile, panels (c)–(d) show that when nnn grows, each estimator recovers the true nonlinear pattern more closely, supporting the numerical results in Table 5. Taken together, Table 5 and Figure 4 confirm that although multicollinearity and smaller sample sizes can hamper nonparametric estimation, the combination of kernel smoothing with shrinkage remains robust and delivers reasonable reconstructions of

f (z)

in challenging time-series settings.

Figure 4. Smooth curves obtained for different configurations to show effect of the correlation level

(δ)

in panels (a,b) and (b,d) and the effect of the sample size (

n

) in panels (a–c).

5.2. Simulation Study with Large Samples

In this section, we provide the results for large sample sizes to present practical proofs for the asymptotic properties. To achieve that we decided

n = 500

and

n = 1000

with less configurations give in Section 5.1. The results are presented in the following Table 6, Table 7, Table 8 and Table 9 and Figure 5.

Table 6. Large-sample results of simulations with bias, variance and SMSD scores for

{\hat{β}}_{G R T K}

.

Table 7. Large-sample results of simulations with bias, variance and SMSD scores for

{\hat{β}}_{S}

.

Table 8. Large-sample results of simulations with bias, variance and SMSD scores for

{\hat{β}}_{P S}

.

Table 9. MSE values for the estimated nonparametric component

\hat{f}

for large samples.

Figure 5. Autocorrelation functions (acf) for the model error in (33): Panel (a) acf plot of original response variable; panel (b) acf plot for difference of the

A r r D e l a y

variable. The y-axes in both figures “lag #” refers number of lags.

In Table 6, the GRTK estimator represents the baseline semiparametric estimator with variance

V a r ({\hat{β}}_{G R T K (k)}) = σ^{2} G ₖ {\tilde{X}}^{'} R^{- 1} {(I - W_{λ})}^{2} R^{- 1} \tilde{X} G ₖ

, where

G ₖ = {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1}

. In terms of sample size effects, one can see that bias reduces around ~27% and variance reduces around ~42% from

n = 500 \to 1000

. Regarding the correlation impact,

δ = 0.99

increases performance get worse around three times than

δ = 0.95

case, confirming multicollinearity effects. Moreover, RECP is consistently the best (lowest SMSD), and

A I C_{c}

varies the most due to its correction term

2 (t r (H) + 1) / (n - t r (H) - 2)

which is sensitive to changes in both sample size and degrees of freedom.

In Table 7, the Stein shrinkage estimator

{\hat{β}}_{S}

creates a bias-variance trade-off through intentional shrinkage. The shrinkage benefits are evident with SMSD reduction compared to GRTK, demonstrating stein-type shrinkage superiority for

p \geq 3

. Sample size effects follow similar patterns to GRTK, with bias and variance reductions maintaining the theoretical

O (1 / \sqrt{n})

scaling. Correlation impact shows that shrinkage provides larger relative improvements under

δ = 0.99

conditions, helping to mitigate multicollinearity effects. The selection criteria ranking

R E C P > B I C \approx G C V > A I C c

is preserved across shrinkage, indicating robust performance across different regularization approaches. Additionally, submodel selection shows sparser models (3.4, 16.6) for n = 1000 versus (3.2, 16.8) for n = 500, suggesting improved signal-noise discrimination with larger samples.

In Table 8, the positive-part Stein estimator

{\hat{β}}_{P S}

prevents over-shrinking by eliminating negative shrinkage factors. This estimator shows superior performance with the lowest SMSD in the majority of scenarios, achieving improvement over

{\hat{β}}_{G R T K}

and

{\hat{β}}_{S}

. The theoretical dominance property

R ({\hat{β}}_{P S}) \leq R ({\hat{β}}_{S}) \leq R ({\hat{β}}_{G R T K})

is empirically confirmed across all scenarios. Sample size effects maintain the same asymptotic scaling patterns, while correlation impacts are similar to other estimators but with enhanced robustness. The selection criteria performance continues to favor RECP, with AICc showing the highest variability due to its correction term sensitivity. Importantly, the advantages of positive-part shrinkage persist even at

n = 1000

, confirming that the theoretical benefits extend to large sample scenarios and validating the methodology’s long-term effectiveness.

In Table 9, the nonparametric component

\hat{f} (z)

uses Nadaraya-Watson weights as defined in Equation (7). From Theorem 2, the nonparametric estimator achieves

{m a x}_{1 \leq i \leq n} |{\hat{f}}_{G R T K} (z_{i}) - f (z_{i})| = O (n^{- 1 / 3} l o g (n))

convergence. Sample size effects show MSE reduction from

n = 500 \to 1000

as expected, consistent with the theoretical convergence established in Theorem 2. Another finding is shrinkage methods achieving very good MSE reduction versus GRTK, occurring because better

\hat{β}

improves residuals in Equation (13), where parametric improvements progress to nonparametric recovery.

Correlation impact shows that

δ = 0.99

increases MSE by approximately 15–30% compared to

δ = 0.95

, which is much smaller than the increase observed for parametric components in Table 6, Table 7 and Table 8, suggesting that the nonparametric component is more robust to multicollinearity in the parametric part. Regarding selection criteria performance, unlike the parametric case where RECP consistently dominated (see Section 4), the nonparametric component shows more varied optimal criteria, with GCV performing best for S estimator and BIC excelling for PS estimator, indicating that optimal bandwidth selection from Equations (27)–(30) may require different approaches depending on the underlying parametric estimation method.

The superior performance of shrinkage estimators validates the interconnected nature of parametric and nonparametric components in semiparametric models, consistent with the bias-variance decomposition in Equation (26), where improvements in one component gradually provide benefits to the other component.

5.3. Real Data Example: Airline Delay Dataset

5.3.1. Data and Model

In this section, we consider the airline delay dataset to show the performance of the modified GRTK estimators. The Bureau of Transportation Statistics (BTS), a division of the U.S. Department of Transportation (DOT), monitors the punctuality of domestic flights conducted by major air carriers. Commencing in June 2003, BTS initiated the collection of comprehensive information as a daily time series regarding the reasons behind flight delays. Based on this collected data, we obtain the partially linear time series model with GRTK estimators along with shrinkage estimators and measure the comparative performance according to the four selection criteria for both shrinkage

(k)

and bandwidth

(λ)

parameters. The dataset involves originally 1,048,576 data points which is almost impossible to process for the analysis. In this paper, we extracted a representative consecutive series from the data with

n = 300

data points. Notice that we do not try to solve a specific data-driven problem but showing the merits of the introduced semiparametric estimators. Therefore dataset with

n = 300

is enough to represent both the modelling procedure and its alignment with the simulation settings. To construct the semiparametric time series model, we consider the following six variables as explanatory variables for the parametric component of the model that are departure time (

D e p T i m e

), CRS departure time (

C R S D e p T i m e

), actual elapsed time (

A c t u a l E l a p s e d T i m e

), CRS elapsed time (

C R S E l a p s e d T i m e

), departure delay (

D e p D e l a y

) and distance (

D i s t a n c e

). The air time of the aircrafts (

A i r T i m e

) is determined as a nonparametric variable for the model. The reason for that can be seen in Figure 5 with the hypothetical curve which can be counted as evidence for the nonlinear relationship between the response variable arrival delays (

A r r D e l a y

) and the

A i r T i m e

variable. Accordingly, the model to be estimated is given by:

{A r r D e l a y}_{t} = \sum_{j = 1}^{p = 6} x_{j t} β_{j} + f ({A i r T i m e}_{t}) + ε_{t}, t = 1, \dots, 300,

(33)

where

x_{j t}

’s denote the six explanatory covariates defined above,

β_{j} ’ s

are the corresponding regression coefficients and

ε_{t}

’s are the autocorrelated error terms as shown in (2).

5.3.2. Pre-Processing

Autocorrelation parameter is estimated as

ρ = 0.01

for difference of

{A r r D e l a y}_{t}

. In Figure 5, autocorrelation functions are provided for both original (non-stationary) and differenced-

{A r r D e l a y}_{t}

series.

Figure 6, each panel includes important information about the data to describe the time series and its properties. In panel (a), variance inflation factors (VIFs) of the predictors in the parametric component of the model reveal that five out of six variables have VIF values greater than 10, indicating a significant multicollinearity problem that needs to be addressed. In panel (b), the linear relationship between predictors and the response variable

A r r D e l a y

can be observed, which is also evident in panel (c), depicting pairs of variables with correlation levels. Finally, in panel (d), the relationship between

A r r D e l a y

and the nonparametric covariate

A i r T i m e

is illustrated with a hypothetical curve.

Figure 6. Informative plots for the Airline Delay dataset. In panel (c), asterisks beside the correlation values indicate statistically significant values. In panel (b,d), blue lines indicate hypothetic relationships between variables.

In the simulation study, 3D plots are generated to illustrate the selection of bandwidth (

λ

) and shrinkage parameter (

k

) for the four criteria, as shown in Figure 7. According to process in Figure 7, selected pairs of

(\hat{λ}, \hat{k})

for the criteria are as follows:

({\hat{λ}}_{G C V} = 1.357, {\hat{k}}_{G C V} = 1.289)

,

({\hat{λ}}_{A I C_{c}} = 1.242, {\hat{k}}_{A I C_{c}} = 2)

,

({\hat{λ}}_{B I C} = 1.128, {\hat{k}}_{B I C} = 2)

and

({\hat{λ}}_{R E C P} = 0.10, {\hat{k}}_{R E C P} = 0.01)

. After obtaining the chosen tuning parameters, the performances of the GRTK estimators are presented in Table 4. This table includes the model variance, MSE of the nonparametric component estimate, and overall variance of the regression coefficient, calculated by

t r (V a r (\hat{β}))

, where

V a r (\hat{β})

is shown in (19). The best scores are indicated with bold colors in Table 4. According to the results, consistent with the findings from the simulation study, for

n = 300

, we observe closer performances.

Figure 7. Selection of

(\hat{λ}, \hat{k})

based on the four criteria

G C V

,

A I C_{c}

,

B I C

and

R E C P

for airline delay dataset.

Simultaneously, the

R E C P

-based estimator shows good performance, closely followed by the

G C V

-based estimator.

A I C_{c}

and

B I C

-based estimators also exhibit considerable performances and are close to the other two. Additionally, to assess the statistical significance of the estimated regression coefficients, Student t-test statistics and their p-values are presented in Table 5. Furthermore, the estimated models are tested using the F-test, and p-values are provided. The results indicate that all estimated models based on the four criteria are statistically significant (

p < 0.05

). However,

{\hat{β}}_{D e p T i m e}

,

{\hat{β}}_{C R S D e p T i m e}

, and

{\hat{β}}_{D i s t a n c e}

do not make a significant contribution to the estimated model, despite the overall significance of the models. In conclusion, it can be stated that all criteria yield meaningful models based on their parameter selection procedures.

5.3.3. Model Fitting and Results

Table 10 presents the final performance measures of the GRTK-based estimators (including their shrinkage variants) for the real airline delay dataset, focusing on both parametric and nonparametric components. These metrics—covering model variance, the MSE of the nonparametric estimate, and the overall variance of the regression coefficients—show that all four selection criteria (

G C V, A I C c, B I C, R E C P

) yield closely comparable results, mirroring the pattern observed in the simulation study for moderate-to-large sample sizes. In particular, RECP exhibits a slight edge in some measures, yet the other criteria remain competitive. This consistency with the simulation outcomes underscores two key conclusions: (1) the newly introduced GRTK framework and its shrinkage modifications successfully mitigate variance inflation in the presence of multicollinearity; and (2) once an adequately sized dataset is available, different penalty-parameter selection methods converge toward similarly effective solutions in partially linear time series models.

Table 10. Performance of the GRTK estimators for parametric and nonparametric components of the model.

Figure 8 provides a residual analysis for each of the four GRTK-based models (including shrinkage versions) applied to the airline delay dataset. The top row displays the autocorrelation functions (ACFs) of the residuals, indicating whether significant time dependence remains after fitting. In all four cases, autocorrelation appears modest and tapers off quickly, suggesting that the estimated models have captured most of the serial dependence in the data. The bottom row shows residuals scattered around zero without any pronounced pattern, hinting at an absence of systematic bias or heteroscedasticity. Consequently, the visual diagnostic supports the conclusion that, regardless of the specific selection criterion (GCV, AICc, BIC, RECP), the introduced GRTK estimators adequately address the combined challenges of autocorrelation and multicollinearity in this real-world time series setting.

Figure 8. Residual analysis for the airline delay dataset. In the plots at the top of the figure show the acf plots of residuals obtained for the estimated models based on the corresponding criterion. The plots at the bottom show the scatterplots of the residuals around

x = 0

(red dashed line).

Regarding the fitted curves in Figure 9, it presents the estimated curves for the nonparametric component

f (z)

under the proposed GRTK framework, incorporating both ordinary Stein shrinkage (S) and positive-part Stein shrinkage (PS), as well as the baseline GRTK. Each panel corresponds to one of the four selection criteria—GCV, AICc, BIC, and RECP—which choose the best possible pairs

(\hat{λ}, \hat{k})

to govern bandwidth (

\hat{λ}

) and shrinkage (

\hat{k}

). Mathematically, all four criteria minimize a penalized objective function (e.g., GCV, AICc, etc.) to balance bias (from a potentially large

\hat{λ}

) and variance (from insufficient shrinkage). As seen in the figure, different choices of

(\hat{λ}, \hat{k})

yield subtle variations in the smoothness and curvature of

f (z)

. For example,

A I C_{c}

typically enforces a slightly larger bandwidth to address overfitting risks in moderate samples, resulting in a smoother curve, whereas

B I C

can pick a smaller

λ

or a larger

k

, leading to more parsimonious fits with sharper inflection points. GCV often strikes a middle ground, and RECP sometimes selects a relatively large bandwidth or smaller shrinkage factor if its pilot-based risk estimation deems this configuration optimal

Figure 9. Fitted curves obtained based on the four criteria.

Beyond these distinctions, the figure also highlights the effect of adding shrinkage to the baseline GRTK estimates. Visually, the PS-based curves often appear slightly more adaptive in regions where GRTK or S might be overly penalized, reflecting a more balanced bias–variance trade-off. Despite these finer differences, all curves capture the primary nonlinear trends and fall within reasonable bounds, reinforcing the broader conclusion that combining kernel smoothing with either Stein or positive-part Stein shrinkage effectively handles both autocorrelation and multicollinearity in semiparametric time series models.

6. Conclusions

This paper introduces and analyzes modified ridge-type kernel smoothing estimators tailored for semiparametric time series models with multicollinearity in the parametric component. By combining generalized ridge-type kernel (GRTK) methodology with shrinkage and positive-part Stein shrinkage versions, we address both near-linear dependencies among regressors and autocorrelation in the error structure. We also explore how to data-adaptively select the bandwidth (

λ

) and shrinkage parameter (

k

) using four widely used criteria:

G C V, A I C_{c}, B I C

, and

R E C P

.

From the detailed simulation study and the real-world airline delay dataset, the following conclusions can be drawn:

The GCV-based GRTK estimator effectively balances bias and variance for both parametric and nonparametric components. In simulations, it consistently provides stable estimates for linear coefficients $β$ and suitably smooth fits for the nonlinear function $f (z)$ .
All four criteria show instability in small samples, as expected. However, in medium and large samples, the proposed estimators achieve more reliable performance, with reduced bias, variance, and SMSD. The GRTK-based approach effectively manages the autoregressive nature of the errors, highlighting the importance of accounting for correlation in time series data.
Shrinkage estimators, especially positive-part Stein, excel at mitigating variance inflation and overshrinking under strong multicollinearity. They outperform standard methods when predictors are strongly correlated, particularly in larger samples.
The large-sample results ( $n = 500, 1000$ ) empirically confirm the theoretical asymptotic properties established in Theorems 1–2. Bias reductions and variance reductions are seen when sample size increases from $n = 500$ to $n = 1000$ as expected, validating that GRTK estimators mostly achieve their theoretical performance in practice.
The positive-part Stein estimator demonstrates superior and consistent dominance over both baseline GRTK and ordinary Stein shrinkage across all large-sample scenarios, with the theoretical risk ordering $R ({\hat{β}}^{P S}) \leq R ({\hat{β}}^{S}) \leq R ({\hat{β}}^{G R T K})$ being empirically confirmed. This dominance persists even at $n = 1000$ along with very close performances for three estimators, indicating that the benefits of positive-part shrinkage are not merely finite-sample phenomena but extend to asymptotic regimes, making it the recommended approach for practical applications regardless of sample size.
Airline delay data results align with simulations. All selection methods yield comparable models, demonstrating their ability to balance penalty tuning. This advantage becomes most pronounced in moderate and large samples, and GCV often provides a straightforward yet effective method for setting ( $λ, k$ ).

The evidence from both simulations and the airline-delay study indicates that the GRTK family is a practical tool for the twin problems of multicollinearity and autocorrelated errors in partially linear time series models. Even so, the estimator’s effectiveness rests on several working assumptions and design choices that raise open questions and some caveats for the introduced methodology that are listed below:

The estimators assume weakly stationary, short-memory errors;
A two–dimensional grid search for ( $h, k$ ) is required which brings a serious computational burden;
The method assumes that only one of the covariates has a non-parametric relationship between response variable;
Shrinkage may introduce finite-sample bias when true coefficients are large;
The method relies on a consistent preliminary estimate of $ρ$ .

As a result, the proposed estimators fill an important gap in semiparametric modeling by jointly tackling multicollinearity and autocorrelation, thereby yielding more robust and interpretable results for partially linear models in time series contexts.

Author Contributions

Conceptualization, S.E.A. and D.A.; methodology, D.A. and E.Y.; software, E.Y.; validation, S.E.A. and D.A.; formal analysis, E.Y.; investigation, E.Y.; resources, S.E.A.; data curation, E.Y.; writing—original draft preparation, S.E.A. and E.Y.; writing—review and editing, S.E.A.; visualization, E.Y.; supervision, S.E.A. and D.A.; project administration, S.E.A. All authors have read and agreed to the published version of the manuscript.

Funding

The research of S. Ejaz Ahmed was supported by the Natural Sciences and the Engineering Research Council (NSERC) of Canada. This research received no other external funding.

Data Availability Statement

For simulation studies R function can be accessed from the link: https://github.com/yilmazersin13/Partially-linear-time-series-model (accessed on 5 April 2025) and for the real data example AirDelay dataset is publicly available in Kaggle platform and accessed by the link: https://www.kaggle.com/datasets/undersc0re/flight-delay-and-causes (accessed on 5 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Derivations of GRTK Estimators

Appendix A.1. Derivations of Equations (12)–(14)

\begin{array}{l} {\hat{β}}_{G R T K} (k) & = {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} R^{- 1} \tilde{y} \\ = {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} R^{- 1} ((I - W_{λ}) y) \\ = G_{k} {\tilde{X}}^{'} R^{- 1} ((I - W_{λ}) (X β + f + ε)) \\ = G_{k} {\tilde{X}}^{'} R^{- 1} (\tilde{X} β + \tilde{f} + (I - W_{λ}) ε) \\ = G_{k} {\tilde{X}}^{'} R^{- 1} \tilde{X} β + G_{k} {\tilde{X}}^{'} R^{- 1} \tilde{f} + G_{k} {\tilde{X}}^{'} R^{- 1} (I - W_{λ}) ε . \end{array}

Accordingly, Equation (20) can be expressed as:

\begin{array}{l} E ({\hat{β}}_{G R T K}) & = G_{k} {\tilde{X}}^{'} R^{- 1} \tilde{X} E (β) + G_{k} {\tilde{X}}^{'} R^{- 1} \tilde{f} \\ = {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} ({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I - k I) β + {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} R^{- 1} \tilde{f} \\ = {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} [({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I) β - k I β] + {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} R^{- 1} \tilde{f} \\ = (I - k {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1}) β + {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} R^{- 1} \tilde{f} \\ = β - k {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} β + {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} R^{- 1} \tilde{f} . \end{array}

Equivalently, from (17), we obtain the bias as in Equation (21):

\begin{array}{l} E ({\hat{β}}_{R} (k)) & = E ({(I_{p} + k {({\tilde{X}}^{'} \tilde{X})}^{- 1})}^{- 1} \hat{β}) = E [{(I_{p} + k {({\tilde{X}}^{'} \tilde{X})}^{- 1})}^{- 1} {({\tilde{X}}^{'} \tilde{X})}^{- 1} {\tilde{X}}^{'} (I_{n} - W_{λ}) y] \\ = {({\tilde{X}}^{'} \tilde{X} + k I_{p})}^{- 1} ({\tilde{X}}^{'} \tilde{X} β + {\tilde{X}}^{'} \tilde{f}) . \end{array}

Also, the variance of an estimator

{\hat{β}}_{R} (k)

given in (22) can be derived as follows:

V a r (({\hat{β}}_{R} (k))) = E [\{({\hat{β}}_{R} (k)) - E ({\hat{β}}_{R} (k))\} {\{({\hat{β}}_{R} (k)) - E ({\hat{β}}_{R} (k))\}}^{'}] = E {\{\begin{array}{l} [{({\tilde{X}}^{'} \tilde{X} + k I_{p})}^{- 1} {\tilde{X}}^{'} \tilde{X} β + {({\tilde{X}}^{'} \tilde{X} + k I_{p})}^{- 1} {\tilde{X}}^{'} \tilde{f} + {({\tilde{X}}^{'} \tilde{X} + k I_{p})}^{- 1} {\tilde{X}}^{'} (I_{n} - W_{λ}) ε] \\ - [{({\tilde{X}}^{'} \tilde{X} + k I_{p})}^{- 1} {\tilde{X}}^{'} \tilde{X} β + {({\tilde{X}}^{'} \tilde{X} + k I_{p})}^{- 1} {\tilde{X}}^{'} \tilde{f}] \end{array}\}}^{2} \begin{array}{l} = E \{({({\tilde{X}}^{'} \tilde{X} + k I_{p})}^{- 1} {\tilde{X}}^{'} (I_{n} - W_{λ}) ε) {({({\tilde{X}}^{'} \tilde{X} + k I_{p})}^{- 1} {\tilde{X}}^{'} (I_{n} - W_{λ}) ε)}^{'}\} \\ = E ({({\tilde{X}}^{'} \tilde{X} + k I_{p})}^{- 1} {\tilde{X}}^{'} {(I_{n} - W_{λ})}^{2}) ({({\tilde{X}}^{'} \tilde{X} + k I_{p})}^{- 1} {\tilde{X}}^{'}) E (ε^{2}) \\ = σ^{2} {({\tilde{X}}^{'} \tilde{X} + k I_{p})}^{- 1} {\tilde{X}}^{'} {(I_{n} - W_{λ})}^{2} {({\tilde{X}}^{'} \tilde{X} + k I_{p})}^{- 1} \tilde{X} . \end{array}

As a result, it can be expressed as the following result with abbreviation as in (22):

V a r (({\hat{β}}_{R} (k))) = σ^{2} G_{k} {\tilde{X}}^{'} {(I_{n} - W_{λ})}^{2} \tilde{X} G_{k} .

Hence, derivation has been completed.

Appendix A.2. The Proof of the Equation (24)

\begin{array}{l} \hat{y} = X \hat{β} + \hat{f} = X {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} R^{- 1} \tilde{y} + W_{λ} (y - X \hat{β}) \\ = X {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} R^{- 1} (I - W_{λ}) y + W_{λ} (y - X {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I)}^{- 1} {\tilde{X}}^{'} R^{- 1} (I - W_{λ}) y) \\ = X G_{k} {\tilde{X}}^{'} R^{- 1} (I - W_{λ}) y + W_{λ} (y - X G_{k} {\tilde{X}}^{'} R^{- 1} (I - W_{λ}) y) \\ = X G_{k} {\tilde{X}}^{'} R^{- 1} (I - W_{λ}) y + W_{λ} y - W_{λ} X G_{k} {\tilde{X}}^{'} R^{- 1} (I - W_{λ}) \\ = W_{λ} y + X G_{k} {\tilde{X}}^{'} R^{- 1} (I - W_{λ}) y - W_{λ} X G_{k} {\tilde{X}}^{'} R^{- 1} (I - W_{λ}) y \\ = W_{λ} y + (I - W_{λ}) X G_{k} {\tilde{X}}^{'} R^{- 1} (I - W_{λ}) y \\ = [W_{λ} + (I - W_{λ}) X G_{k} {\tilde{X}}^{'} R^{- 1} (I - W_{λ})] y \end{array}

Thus, the hat matrix is written as follows:

H_{λ, k} = W_{λ} + (I - W_{λ}) X G_{k} {\tilde{X}}^{'} R^{- 1} (I - W_{λ}) .

Appendix B. Asymptotic Supplement: Bias, AQDB, and ADR

This appendix provides a treatment of the asymptotic bias, Asymptotic Quadratic Distributional Bias (AQDB), and Asymptotic Distributional Risk (ADR) for the GRTK-based estimators introduced in Section 3. Specifically, we focus on:

The baseline GRTK (FM) estimator ${\hat{β}}_{G R T K} (k) = {\hat{β}}_{F M}$ from Equation (12),
The ordinary Stein shrinkage estimator ${\hat{β}}_{S}$ from Equation (14), and
The positive-part Stein shrinkage estimator ${\hat{β}}_{P S}$ from Equation (15).

with covariance matrix $R = C o v (ε)$ capturing possible autocorrelation (e.g., $A R (1)$ ). Throughout, we assume $R$ is known (or replaced with a consistent estimator $\hat{R}$ , which suffices for the same asymptotic properties). We also assume standard regularity conditions on the kernel smoothing for $f (\cdot)$ (bandwidth $λ \to 0$ and smoothness conditions as $n \to \infty$ ). Under these conditions, all estimators in Section 3 are consistent for $β$ , and we can derive their bias, variance, and risk expansions.

Appendix B.1. Baseline GRTK (FM) Estimator

Recall that the full-model GRTK estimator may be written as

{\hat{β}}_{F M} = {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I_{p})}^{- 1} {\tilde{X}}^{⊤} R^{- 1} \tilde{y},

where

\tilde{y} = (I_{n} - W_{λ}) y

is the partially centered response (subtracting off the estimated nonparametric component

\hat{f} (\cdot)

), and

X

is the

n \times p

design matrix for the parametric covariates. Let us Define

L = {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I_{p})}^{- 1} {\tilde{X}}^{⊤} R^{- 1} .

Hence

{\hat{β}}_{F M} = L \tilde{y}

. If

\tilde{y} = \tilde{X} β + \tilde{ε}

(the decomposition after partial kernel smoothing of

f (\cdot)

, then

{\hat{β}}_{F M} = L \tilde{X} β + L \tilde{ε} .

One can show that

L \tilde{X}

is effectively

I_{p} - k {({\tilde{X}}^{⊤} R^{- 1} \tilde{X} + k I_{p})}^{- 1}

, capturing the ridge shrinkage on the parametric coefficients. Hence,

{\hat{β}}_{F M} = [I_{p} - k {(X^{'} R^{- 1} X + k I_{p})}^{- 1}] β + L \tilde{ε} .

Exact Bias. Taking expectation (condition on

X

) and noting

E [\tilde{ε}] \approx 0

for large

n

, we get

\begin{array}{l} B i a s ({\hat{β}}_{F M}) & = E [{\hat{β}}_{F M}] - β \\ \approx - k {({\tilde{X}}^{'} R^{- 1} \tilde{X} + k I_{p})}^{- 1} β \end{array},

plus small

o (1)

terms from the smoothing residual. Since

‖X^{'} R^{- 1} X‖ = O (n)

in typical regressions and

k

may vanish with

n

(or remain bounded), this bias goes to zero as

n \to \infty

. The asymptotic behavior of this bias term depends crucially on how k scales with n. We can distinguish several cases:

(i) If

k = O (1)

(remains constant or bounded as n→∞): Since

‖X^{'} R^{- 1} X‖ = O (n)

, we have (

X^{'} R^{- 1} X + k I)^{- 1} = O (1 / n)

, and the bias term becomes

k \cdot O (1 / n) \cdot β = O (\frac{1}{n})

, which converges to zero at rate

1 / n

.

(ii) If

k = O (n α)

for some

0 < α < 1

: The bias term becomes

O (n^{α - 1})

, which still converges to zero when α < 1, but at a slower rate than in case (i).

(iii) If

k = O (n)

: The bias may not vanish asymptotically, potentially compromising consistency.

Therefore, for consistency of the

{\hat{β}}_{F M}

estimator, we require that

k (X^{'} R^{- 1} X + k I)^{- 1}

→ 0 as

n \to \infty

, which is satisfied when

k = o (n)

. This asymptotic requirement aligns with our parameter selection methods in Section 4, which implicitly balance the bias-variance tradeoff by choosing appropriate

k

values. Hence the GRTK estimator is consistent and has

O (k)

bias in finite

n

, shrinking

β

towards

0

. In summary:

({\hat{β}}_{F M}) = O (k) \to 0 as n \to \infty .

Appendix B.2. Shrinkage Estimators

Denote by

{\hat{β}}_{S M}

the submodel estimator, obtained by fitting the same GRTK procedure but only to a subset of

p_{1} < p

regressors using BIC criterion. In practice, we set the omitted

(p - p_{1})

coefficients to zero. Then the ordinary Stein-type shrinkage estimator is

{\hat{β}}_{S} = {\hat{β}}_{S M} + \hat{θ} ({\hat{β}}_{F M} - {\hat{β}}_{S M}),

where

\hat{θ} \in [0,1]

is a shrinkage factor determined from the data (discussed in Subsection B3). If

\hat{θ} = 1

, no shrinkage is applied (we use the full model); if

\hat{θ} = 0

, we revert to the submodel; for

0 < \hat{θ} < 1

, we form a weighted compromise. The positive-part shrinkage estimator is

{\hat{β}}_{P S} = {\hat{β}}_{S M} + m a x \{0, \hat{θ}\} ({\hat{β}}_{F M} - {\hat{β}}_{S M}),

so that negative values of

\hat{k}

(which might arise due to sampling error) are replaced by 0 to prevent over-shrinkage. Regarding the bias of shrinkage estimators,

{\hat{β}}_{S} - β = ({\hat{β}}_{S M} - β) + \hat{θ} ({\hat{β}}_{F M} - {\hat{β}}_{S M}),

note that

{\hat{β}}_{S M}

itself has some bias (particularly if the omitted

(p - p_{1})

coefficients are not truly zero), while

{\hat{β}}_{F M}

has the

O (θ)

ridge bias but includes all parameters. Let

({\hat{β}}_{S M}) = b_{S M}

and

({\hat{β}}_{F M}) = b_{F M}

. A first-order approximation, assuming

\hat{θ}

is not strongly correlated with the random errors in (

{\hat{β}}_{F M} - {\hat{β}}_{S M})

, gives

E [{\hat{β}}_{S}] - β \approx [1 - E [\hat{θ}]] (b_{S M}) + E [\hat{θ}] (b_{F M}) .

Hence the bias of

{\hat{β}}_{S}

is approximately a linear mixture of submodel bias and fullmodel bias. In large

n

, we typically find

b_{F M} = O (θ)

small, and

b_{S M}

is either

O (θ)

(when the submodel is correct) or (if some omitted coefficients are actually nonzero) has a bigger

O (1)

or

O (n^{- 1 / 2})

term. Thus,

{\hat{β}}_{S}

has a bias that is smaller than the submodel’s in cases where

E [\hat{θ}] > 0

.

Positive-Part Variant. For

{\hat{β}}_{PS}

, the same expansion holds but with

\hat{θ}

replaced by

m a x {\hat{θ}, 0}

. This cannot exceed

E [\hat{θ}]

, so

({\hat{β}}_{P S}) \leq ({\hat{β}}_{S}),

elementwise (in a typical risk comparison sense). That is, the positive-part always improves or equals the Stein shrinkage in terms of bias magnitude, since it disallows negative

\hat{θ}

.

As given in Section 3, the distance measure

T

quantifies how far

{\hat{β}}_{F M}

is from the submodel

{\hat{β}}_{S M}

. A convenient choice is the squared Mahalanobis distance:

T = {({\hat{β}}_{F M} - {\hat{β}}_{S M})}^{⊤} {[C o v ({\hat{β}}_{F M} - {\hat{β}}_{S M})]}^{- 1} ({\hat{β}}_{F M} - {\hat{β}}_{S M}) .

Under the “null” submodel assumption (that the omitted parameters are actually zero),

T

follows approximately a

χ^{2}

distribution with

ν = (p - p_{2})

degrees of freedom, possibly noncentral if the omitted effects are not truly zero. The main text (see Equation (14)) uses

T

in the formula for the shrinkage factor:

\hat{θ} = 1 - \frac{ν - 2}{T},

where

ν = (p - p_{1})

. If

T < ν - 2, \hat{θ}

goes negative; the positive-part approach sets

{\hat{θ}}_{+} = m a x (0, \hat{θ})

. Asymptotic Distribution of

T

. Under standard conditions (normal or asymptotically normal errors, large

n

):

T \overset{d}{\to} χ_{ν}^{2} (Λ),

where

Λ

is a noncentrality parameter linked to how large the omitted true coefficients are. In a local-alternatives framework with

β_{2} = δ / \sqrt{n}

for the omitted block,

Λ

remains finite as

n \to \infty

. If

Λ = 0

(null case),

E [T] \approx ν

and

(T) \approx 2 ν

. If

Λ > 0, T

tends to be

> ν

, so

\hat{θ} \approx 1

. This adaptivity is what drives the shrinkage phenomenon.

If $T$ is large ( $≫ ν$ ), we suspect omitted coefficients are nonzero, so $\hat{θ} \approx 1$ retains the full-model estimate.
If $T$ is near $ν, \hat{θ}$ is moderate $(0 < \hat{θ} < 1)$ , partially shrinking FM toward SM.
If $T < ν - 2$ , then $\hat{θ}$ is negative, which is clipped to 0 by the positive-part rule.

Appendix B.3. Asymptotic Quadratic Bias (AQDB) and Distributional Risk (ADR)

We next assess each estimator’s risk and define the Asymptotic Quadratic Distributional Bias (AQDB). For an estimator

\hat{β}

, define

M S E (\hat{β}) = E [‖ \hat{β} - β ‖^{2}] = T r [V a r (\hat{β})] + ‖ B i a s (\hat{β}) ‖^{2} .

In the local alternatives setting (where the omitted block

β_{2} = δ / \sqrt{n}

), we often look at

n M S E (\hat{β})

as

n \to \infty

, which splits into an asymptotic variance component plus an asymptotic (squared) bias:

\underset{n \to \infty}{l i m} n E [‖ \hat{β} - β ‖^{2}] = T r (Σ_{\hat{β}}) + {‖μ_{\hat{β}}‖}^{2} .

Here

Σ_{\hat{β}} = \underset{n \to \infty}{l i m} V a r (\sqrt{n} \hat{β})

and

μ_{\hat{β}} = \underset{n \to \infty}{l i m} \sqrt{n} (E [\hat{β}] - β)

. We denote:

\underset{Asymptotic Variance}{\underset{⏟}{T r (Σ_{\hat{β}})}} + \underset{AQDB}{\underset{⏟}{{‖μ_{\hat{β}}‖}^{2}}} = A D R (\hat{β}),

where the Asymptotic Quadratic Distributional Bias (AQDB) is

{‖μ_{\hat{β}}‖}^{2}

and the Asymptotic Distributional Risk (ADR) is their sum.

Full-Model (FM): has negligible bias (so $μ_{F M} = 0$ ), but a higher variance from estimating all $p$ coefficients. Hence $A D R ({\hat{β}}_{F M}) = T r (Σ_{F M})$ , typically $> T r (Σ_{S M})$ .
Submodel (SM): has lower variance (only $p_{1}$ parameters) but possibly large bias if $β_{2} \neq 0$ , giving $μ_{S M} \neq 0$ and a big ${‖μ_{S M}‖}^{2}$ . Specifically, if the omitted block is $δ / \sqrt{n}$ , then $μ_{S M} \approx (0, δ)$ so $A Q D B (S M) = ‖ δ ‖^{2}$ .
Stein (S): interpolates. Variance is $< V a r (F M)$ but $> V a r (S M)$ . Bias is less than the SM’s if omitted effects are actually nonzero, but not zero: $(1 - E [\hat{θ}])$ times $δ$ roughly, so $A Q D B (S) \approx (1 - E [\hat{θ}])^{2} ‖ δ ‖^{2}$ .
Positive-Part Stein (PS): ensures no negative shrinkage, so $A Q D B (P S) \approx$ ${(1 - E [{\hat{θ}}_{+}])}^{2} ‖ δ ‖^{2}$ with $E [{\hat{θ}}_{+}] \geq E [\hat{θ}]$ . Consequently, $A Q D B (P S) \leq A Q D B (S)$ . Moreover, $V a r (P S) \leq V a r (S)$ in typical settings, so ${\hat{β}}_{P S}$ dominates ${\hat{β}}_{S}$ in ADR.

Thus, positive-part shrinkage is usually recommended when

p \geq 3

or under moderate correlation/collinearity, as it yields the lowest or nearly lowest ADR across different parameter regimes. This is consistent with the numerical results in Section 5 of the main text and references such as [10,14].

References

Speckman, P. Kernel smoothing in partially linear model. J. R. Stat. Soc. Ser. B 1988, 50, 413–435. [Google Scholar] [CrossRef]
Hu, H. Ridge estimation of a semiparametric regression model. J. Comput. Appl. Math. 2005, 176, 215–222. [Google Scholar] [CrossRef]
Gao, J. Asymptotic theory for partly linear models. Commun. Stat.—Theory Methods 1995, 24, 1985–2009. [Google Scholar] [CrossRef]
Schick, A. Efficient estimation in a semiparametric additive regression model with autoregressive errors. Stoch. Process. Their Appl. 1996, 61, 339–361. [Google Scholar] [CrossRef]
Green, P.J.; Silverman, B.W. Nonparametric Regression and Generalized Linear Models; Chapman & Hall: London, UK, 1994. [Google Scholar]
You, J.; Zhou, Y. Empirical likelihood for semiparametric varying-coefficient partially linear regression models. Stat. Probab. Lett. 2006, 76, 412–422. [Google Scholar] [CrossRef]
Kazemi, M.; Shahsvani, D.; Arashi, M.; Rodrigues, P.C. Identification for partially linear regression model with autoregressive errors. J. Stat. Comput. Simul. 2021, 91, 1441–1454. [Google Scholar] [CrossRef]
Özkale, M.R. A jackknifed ridge estimator in the linear regression model with heteroscedastic or correlated errors. Stat. Probab. Lett. 2008, 78, 3159–3169. [Google Scholar] [CrossRef]
You, J.; Chen, G. Semiparametric generalized least squares estimation in partially linear regression models with correlated errors. J. Stat. Plan. Inference 2007, 137, 117–132. [Google Scholar] [CrossRef]
Nooi Asl, M.; Bevrani, H.; Arabi Belaghi, R.; Mansson, K. Ridge-type shrinkage estimators in generalized linear models with an application to prostate cancer data. Stat. Pap. 2021, 62, 1043–1085. [Google Scholar] [CrossRef]
Kuran, Ö.; Yalaz, S. Kernel ridge prediction method in partially linear mixed measurement error model. Commun. Stat.-Simul. Comput. 2024, 53, 2330–2350. [Google Scholar] [CrossRef]
Ahmed, S.E. Penalty, Shrinkage and Pretest Strategies: Variable Selection and Estimation; Springer: New York, NY, USA, 2014. [Google Scholar]
Ahmed, S.E. Penalty, shrinkage and pretest strategies in statistical modeling. In Frontiers in Statistics; Springer: Cham, Switzerland, 2018. [Google Scholar]
Ahmed, S.E.; Ahmed, F.; Yüzbaşı, B. Post-Shrinkage Strategies in Statistical and Machine Learning for High-Dimensional Data; CRC Press: Boca Raton, FL, USA, 2023. [Google Scholar]
Chen, C.M.; Weng, S.C.; Tsai, J.R.; Shen, P.S. The mean residual life model for the right-censored data in the presence of covariate measurement errors. Stat. Med. 2023, 42, 2557–2572. [Google Scholar] [CrossRef] [PubMed]
Shi, J.; Lau, T.S. Empirical likelihood for partially linear models. J. Multivar. Anal. 2000, 72, 132–148. [Google Scholar] [CrossRef]
Lee, T.C.M. Smoothing parameter selection for smoothing splines: A simulation study. Comput. Stat. Data Anal. 2003, 42, 139–148. [Google Scholar] [CrossRef]
Andrews, D.W.K. Laws of large numbers for dependent non-identically distributed random variables. Econom. Theory 1988, 4, 458–467. [Google Scholar] [CrossRef]
White, H. Asymptotic Theory for Econometricians; Academic Press: New York, NY, USA, 1984. [Google Scholar]
Slaoui, Y. Recursive kernel regression estimation under α-mixing data. Commun. Stat.—Theory Methods 2022, 51, 8459–8475. [Google Scholar] [CrossRef]
Theobald, C.M. Generalizations of mean square error applied to ridge regression. J. R. Stat. Soc. Ser. B 1974, 36, 103–106. [Google Scholar] [CrossRef]
Hurvich, C.M.; Simonoff, J.S.; Tsai, C.L. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. J. R. Stat. Soc. Ser. B 1998, 60, 271–293. [Google Scholar] [CrossRef]

Figure 1. Simulated data with true

f (z_{t})

, sparse coefficients with zero and non-zero ones and the autocorrelation function of error terms for

n = 150, δ = 0.95

.

Figure 2. Selection of

(\hat{λ}, \hat{k})

by

A I C_{c}

,

B I C

and

R E C P

under certain conditions:

n = 50, δ = 0.95

.

Figure 3. Boxplots for bias and estimated regression coefficients obtained based on four criteria

G C V, A I C_{c}, B I C

and

R E C P

under certain conditions. The horizontal line in the subfigures shows zero bias as a reference line, and outliers are shown by black circles.

Figure 4. Smooth curves obtained for different configurations to show effect of the correlation level

(δ)

in panels (a,b) and (b,d) and the effect of the sample size (

n

) in panels (a–c).

Figure 5. Autocorrelation functions (acf) for the model error in (33): Panel (a) acf plot of original response variable; panel (b) acf plot for difference of the

A r r D e l a y

variable. The y-axes in both figures “lag #” refers number of lags.

Figure 6. Informative plots for the Airline Delay dataset. In panel (c), asterisks beside the correlation values indicate statistically significant values. In panel (b,d), blue lines indicate hypothetic relationships between variables.

Figure 7. Selection of

(\hat{λ}, \hat{k})

based on the four criteria

G C V

,

A I C_{c}

,

B I C

and

R E C P

for airline delay dataset.

Figure 8. Residual analysis for the airline delay dataset. In the plots at the top of the figure show the acf plots of residuals obtained for the estimated models based on the corresponding criterion. The plots at the bottom show the scatterplots of the residuals around

x = 0

(red dashed line).

Figure 9. Fitted curves obtained based on the four criteria.

Table 1. Chosen

(\hat{λ}, \hat{k})

pairs selected by

A I C_{c}

,

B I C

and

R E C P

critera for different simulation configurations for GRTK.

Table 1. Chosen

(\hat{λ}, \hat{k})

pairs selected by

A I C_{c}

,

B I C

and

R E C P

critera for different simulation configurations for GRTK.

		$δ = 0.95$				$δ = 0.99$
$n$		$G C V$	$A I C_{c}$	$B I C$	$R E C P$	$G C V$	$A I C_{c}$	$B I C$	$R E C P$
50	$\hat{λ}$	0.029	0.059	0.038	0.055	0.039	0.112	0.048	0.054
50	$\hat{k}$	0.553	1.368	0.881	0.806	1.053	1.599	1.348	0.686
150	$\hat{λ}$	0.015	0.031	0.019	0.039	0.017	0.031	0.018	0.075
150	$\hat{k}$	1.155	1.016	0.933	0.976	1.084	1.067	1.144	0.914
300	$\hat{λ}$	0.021	0.027	0.022	0.027	0.011	0.014	0.013	0.046
300	$\hat{k}$	1.323	1.323	1.283	0.717	1.059	1.078	1.067	1.005

Table 2. Outcomes of the simulations with bias, variance, and SMSD scores for

{\hat{β}}_{G R T K}

.

Table 2. Outcomes of the simulations with bias, variance, and SMSD scores for

{\hat{β}}_{G R T K}

.

Metric	$n$	50	150	300	50	150	300
Metric	Criteria	$δ = 0.95$			$δ = 0.99$
Bias	$G C V$	1.952	1.101	0.618	3.805	2.802	1.768
	$A I C_{c}$	2.699	5.679	0.595	5.679	5.679	1.647
	$B I C$	1.984	1.164	0.587	3.773	2.860	1.893
	$R E C P$	1.872	1.124	0.553	3.714	2.693	1.745
Variance	$G C V$	2.899	1.645	0.906	2.650	2.446	2.570
	$A I C_{c}$	2.710	0.000	0.885	0.000	0.000	2.698
	$B I C$	2.860	1.509	0.876	2.737	2.172	2.148
	$R E C P$	2.955	1.531	0.886	2.887	2.433	2.290
SMSD	$G C V$	6.709	2.857	1.288	17.131	10.300	5.697
	$A I C_{c}$	9.992	32.250	1.238	32.250	32.250	5.411
	$B I C$	6.798	2.864	1.220	16.976	10.352	5.730
	$R E C P$	6.458	2.794	1.191	16.682	9.688	5.336

Table 3. Outcomes of the simulations with bias, variance, and SMSD scores for

{\hat{β}}_{P S}

.

Table 3. Outcomes of the simulations with bias, variance, and SMSD scores for

{\hat{β}}_{P S}

.

Metric	$n$	50	150	300	50	150	300
	Criteria	$δ = 0.95$			$δ = 0.99$
	$(p ₁, p ₂)$	(3.0, 17.0)	(3.0, 17.0)	(3.2, 16.8)	(3.0, 17.0)	(3.0, 17.0)	(3.2, 16.8)
Bias	$G C V$	1.943	1.100	0.618	3.580	2.791	1.767
	$A I C_{c}$	2.698	5.679	0.595	5.679	5.679	1.647
	$B I C$	1.976	1.163	0.587	3.545	2.846	1.891
	$R E C P$	1.863	1.123	0.552	3.477	2.679	1.744
Variance	$G C V$	2.857	1.643	0.906	2.358	2.415	2.566
	$A I C_{c}$	2.707	0.010	0.885	0.010	0.010	2.698
	$B I C$	2.819	1.507	0.876	2.421	2.137	2.142
	$R E C P$	2.911	1.529	0.885	2.517	2.387	2.283
SMSD	$G C V$	6.633	2.854	1.288	15.174	10.204	5.690
	$A I C_{c}$	9.984	32.250	1.238	32.250	32.250	5.411
	$B I C$	6.723	2.861	1.220	14.986	10.237	5.719
	$R E C P$	6.381	2.790	1.191	14.604	9.562	5.324

Table 4. Outcomes of the simulations with bias, variance, and SMSD scores for

{\hat{β}}_{S}

.

Table 4. Outcomes of the simulations with bias, variance, and SMSD scores for

{\hat{β}}_{S}

.

Metric	$n$	50	150	300	50	150	300
	Criteria	$δ = 0.95$			$δ = 0.99$
	$(p ₁, p ₂)$	(3.0, 17.0)	(3.0, 17.0)	(3.2, 16.8)	(3.0, 17.0)	(3.0, 17.0)	(3.2, 16.8)
Bias	$G C V$	1.943	1.100	0.618	3.58	2.79	1.77
	$A I C_{c}$	2.70	5.68	0.59	5.68	5.68	1.65
	$B I C$	1.98	1.16	0.59	3.54	2.85	1.89
	$R E C P$	1.86	1.12	0.55	3.48	2.68	1.74
Variance	$G C V$	2.857	1.643	0.906	2.36	2.42	2.57
	$A I C_{c}$	2.71	0.00	0.88	0.00	0.00	2.70
	$B I C$	2.82	1.51	0.88	2.42	2.14	2.14
	$R E C P$	2.91	1.53	0.89	2.52	2.39	2.28
SMSD	$G C V$	6.633	2.854	1.288	15.17	10.20	5.69
	$A I C_{c}$	9.98	32.25	1.24	32.25	32.25	5.41
	$B I C$	6.72	2.86	1.22	14.99	10.24	5.72
	$R E C P$	6.38	2.79	1.19	14.60	9.56	5.32

Table 5. MSE values for the estimated nonparametric component

{\hat{f}}_{G R T K}

for all simulation combinations.

Table 5. MSE values for the estimated nonparametric component

{\hat{f}}_{G R T K}

for all simulation combinations.

Method	Criteria	δ = 0.95			δ = 0.99
Method		n = 50	n = 150	n = 300	n = 50	n = 150	n = 300
GRTK	$G C V$	3.668	3.178	1.932	3.819	2.956	2.277
	$A I C_{c}$	3.562	3.329	2.266	4.209	3.448	2.245
	$B I C$	3.747	3.353	2.025	3.908	3.396	2.228
	$R E C P$	3.758	3.367	2.264	4.272	3.444	2.319
S	$G C V$	1.267	1.277	0.802	1.345	1.355	1.077
	$A I C_{c}$	1.396	1.213	0.726	2.762	1.585	1.245
	$B I C$	1.267	1.183	0.861	1.555	1.346	1.218
	$R E C P$	1.086	0.968	0.947	2.499	0.871	1.319
PS	$G C V$	1.872	1.353	0.937	2.281	1.192	0.977
	$A I C_{c}$	1.996	1.451	0.878	2.109	1.583	0.945
	$B I C$	1.826	1.420	0.759	2.014	1.338	0.928
	$R E C P$	1.958	1.461	0.957	2.660	1.412	1.319

bold numbers show the best score among criteria.

Table 6. Large-sample results of simulations with bias, variance and SMSD scores for

{\hat{β}}_{G R T K}

.

Table 6. Large-sample results of simulations with bias, variance and SMSD scores for

{\hat{β}}_{G R T K}

.

Metric	$n$	500	1000	500	1000
Metric	Criteria	$δ = 0.95$		$δ = 0.99$
Bias	$G C V$	0.515	0.374	1.804	1.309
	$A I C_{c}$	0.460	0.326	1.609	1.139
	$B I C$	0.476	0.346	1.763	1.281
	$R E C P$	0.392	0.285	1.371	0.996
Variance	$G C V$	0.628	0.366	2.087	1.218
	$A I C_{c}$	0.703	0.451	2.340	1.503
	$B I C$	0.560	0.329	1.861	1.093
	$R E C P$	0.597	0.355	1.984	1.179
SMSD	$G C V$	0.893	0.506	5.344	2.932
	$A I C_{c}$	0.914	0.557	6.930	2.800
	$B I C$	0.787	0.449	5.971	2.734
	$R E C P$	0.751	0.436	4.863	2.171

Table 7. Large-sample results of simulations with bias, variance and SMSD scores for

{\hat{β}}_{S}

.

Table 7. Large-sample results of simulations with bias, variance and SMSD scores for

{\hat{β}}_{S}

.

Metric	$n$	500	1000	500	1000
	Criteria	$δ = 0.95$		$δ = 0.99$
	$(p ₁, p ₂)$	(3.2, 16.8)	(3.4, 16.6)	(3.2, 16.8)	(3.0, 17.0)
Bias	$G C V$	0.501	0.363	1.753	1.269
	$A I C_{c}$	0.446	0.316	1.561	1.105
	$B I C$	0.462	0.336	1.710	1.242
	$R E C P$	0.380	0.276	1.330	0.966
Variance	$G C V$	0.603	0.351	2.003	1.169
	$A I C_{c}$	0.672	0.430	2.246	1.443
	$B I C$	0.539	0.312	1.787	1.048
	$R E C P$	0.570	0.340	1.905	1.132
SMSD	$G C V$	0.851	0.482	5.065	2.779
	$A I C_{c}$	0.872	0.531	6.683	2.665
	$B I C$	0.756	0.428	5.710	2.595
	$R E C P$	0.718	0.417	4.674	2.065

Table 8. Large-sample results of simulations with bias, variance and SMSD scores for

{\hat{β}}_{P S}

.

Table 8. Large-sample results of simulations with bias, variance and SMSD scores for

{\hat{β}}_{P S}

.

Metric	$n$	500	1000	500	1000
	Criteria	$δ = 0.95$		$δ = 0.99$
	$(p ₁, p ₂)$	(3.2, 16.8)	(3.4, 16.6)	(3.2, 16.8)	(3.0, 17.0)
Bias	$G C V$	0.482	0.355	1.714	1.243
	$A I C_{c}$	0.430	0.310	1.529	1.082
	$B I C$	0.456	0.328	1.675	1.217
	$R E C P$	0.371	0.270	1.303	0.946
Variance	$G C V$	0.592	0.344	1.963	1.142
	$A I C_{c}$	0.651	0.420	2.201	1.423
	$B I C$	0.531	0.312	1.741	1.027
	$R E C P$	0.558	0.332	1.837	1.107
SMSD	$G C V$	0.822	0.470	4.902	2.541
	$A I C_{c}$	0.842	0.521	6.419	2.563
	$B I C$	0.721	0.437	5.547	2.503
	$R E C P$	0.703	0.417	4.523	2.002

Table 9. MSE values for the estimated nonparametric component

\hat{f}

for large samples.

Table 9. MSE values for the estimated nonparametric component

\hat{f}

for large samples.

Method	$n$	500	1000	500	1000
Method	Criteria	$δ = 0.95$		$δ = 0.99$
GRTK	$G C V$	1.149	0.573	1.366	0.683
	$A I C_{c}$	1.362	0.661	1.347	0.674
	$B I C$	1.217	0.609	1.332	0.669
	$R E C P$	1.354	0.681	1.390	0.691
S	$G C V$	0.481	0.243	0.641	0.320
	$A I C_{c}$	0.436	0.228	0.745	0.374
	$B I C$	0.517	0.255	0.732	0.366
	$R E C P$	0.568	0.274	0.791	0.396
PS	$G C V$	0.562	0.251	0.586	0.293
	$A I C_{c}$	0.527	0.214	0.568	0.284
	$B I C$	0.456	0.228	0.554	0.277
	$R E C P$	0.574	0.272	0.791	0.396

Table 10. Performance of the GRTK estimators for parametric and nonparametric components of the model.

Measure/Criterion	$G C V$	$A I C_{c}$	$B I C$	$R E C P$
$σ_{G R T K}^{2}$	0.952	0.953	0.953	0.965
$M S E ({\hat{f}}_{G R T K})$	0.932	0.936	0.936	0.935
$V a r ({\hat{β}}_{G R T K})$	0.270	0.246	0.246	0.31
$σ_{S}^{2}$	0.712	0.809	0.741	0.825
$M S E ({\hat{f}}_{S})$	0.821	0.836	0.831	0.857
$V a r ({\hat{β}}_{S})$	0.210	0.225	0.226	0.280
$σ_{P S}^{2}$	0.705	0.713	0.673	0.711
$M S E ({\hat{f}}_{P S})$	0.735	0.736	0.735	0.735
$V a r ({\hat{β}}_{P S})$	0.172	0.211	0.183	0.191

Bold numbers indicates the best scores.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Kernel Ridge-Type Shrinkage Estimators in Partially Linear Regression Models with Correlated Errors

Abstract

1. Introduction

2. Fitting the Model Error Structure

3. Generalized Ridge Type Kernel Estimation

3.1. Shrinkage Estimators with GRTK Estimators

3.2. Asymptotic Distribution of the Estimators

3.3. Statistical Properties of the Estimators

3.4. Assessing the Risk and Efficiency

4. Choosing the Penalty Parameters

5. Numerical Examples

5.1. Simulation Study

5.2. Simulation Study with Large Samples

5.3. Real Data Example: Airline Delay Dataset

5.3.1. Data and Model

5.3.2. Pre-Processing

5.3.3. Model Fitting and Results

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. Derivations of GRTK Estimators

Appendix A.1. Derivations of Equations (12)–(14)

Appendix A.2. The Proof of the Equation (24)

Appendix B. Asymptotic Supplement: Bias, AQDB, and ADR

Appendix B.1. Baseline GRTK (FM) Estimator

Appendix B.2. Shrinkage Estimators

Appendix B.3. Asymptotic Quadratic Bias (AQDB) and Distributional Risk (ADR)

References

Article Metrics

Citations

Article Access Statistics