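Multiple linear regression is one of the most widely used predictive and inferential models across a broad range of scientific disciplines, including economics, engineering, medicine, and the social sciences. The model relates a scalar response random variable $Y$ to a set of explanatory random variables $X_1, \dots, X_p$, where $p$ denotes the number of covariates, through the linear model
$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon,$$
where $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)^\top$ is the unknown parameter vector and $\varepsilon$ is a random error term with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 < \infty$. For a sample of size $n$ drawn from $(Y, X_1, \dots, X_p)$, let $\mathbf{y} = (y_1, \dots, y_n)^\top$ denote the observed vector of responses and $\mathbf{X}$ be the $n \times (p+1)$ design matrix with rows $\mathbf{x}_i^\top = (x_{i0}, x_{i1}, \dots, x_{ip})$, where $x_{i0} = 1$ for all $i = 1, \dots, n$.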
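A common objective in regression analysis is to estimate $\boldsymbol{\beta}$ via estimators with a low mean squared error (MSE); the MSE of an estimator is explicitly defined in (3). This is typically achieved by minimising a loss functional $\mathcal{L}(\boldsymbol{\beta})$ measuring the discrepancy between the dependent variable and its linear predictor. The standard choice is the $\ell_2$ loss, which leads to the well-known Ordinary Least Squares (OLS) estimator, denoted by $\hat{\boldsymbol{\beta}}^{\mathrm{OLS}}$, which is obtained by minimising the residual sum of squares (RSS):
$$\hat{\boldsymbol{\beta}}^{\mathrm{OLS}} = \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^{p+1}} \mathrm{RSS}(\boldsymbol{\beta}).$$
Here, $\mathrm{RSS}(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 = \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2$. The OLS estimator has a closed-form solution as follows:
$$\hat{\boldsymbol{\beta}}^{\mathrm{OLS}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y},$$
provided that $\mathbf{X}^\top \mathbf{X}$ is invertible. Note that $\mathbf{X}^\top \mathbf{X}$ is a symmetric matrix that is always positive semi-definite but not necessarily positive definite; positive definiteness would guarantee the existence of $(\mathbf{X}^\top \mathbf{X})^{-1}$. The OLS estimator is often called the Best Linear Unbiased Estimator (BLUE) according to the Gauss–Markov Theorem (Gauss 1821; Markov 1912), which guarantees it has the lowest variance among all unbiased linear estimators. Furthermore, if the error term is normally distributed, OLS coincides with the maximum likelihood estimator, allowing for exact finite-sample inference (Seber and Lee 2003).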
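The closed-form OLS solution can be illustrated with a minimal NumPy sketch. The function name `ols_closed_form` and the toy data are hypothetical and not part of the paper or of any package; the sketch solves the normal equations directly and refuses to proceed when $\mathbf{X}^\top \mathbf{X}$ is singular, which is the case the invertibility condition excludes.

```python
import numpy as np

def ols_closed_form(X, y):
    """OLS via the normal equations; X must have full column rank."""
    gram = X.T @ X  # symmetric and positive semi-definite by construction
    if np.linalg.matrix_rank(gram) < gram.shape[0]:
        raise np.linalg.LinAlgError("X^T X is singular; OLS is not unique")
    # Solve (X^T X) b = X^T y instead of forming the inverse explicitly.
    return np.linalg.solve(gram, X.T @ y)

# Hypothetical toy data: y = 1 + 2 * x exactly, so OLS recovers (1, 2).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # first column is the intercept
y = 1.0 + 2.0 * x
beta_ols = ols_closed_form(X, y)
```

Solving the linear system rather than inverting $\mathbf{X}^\top \mathbf{X}$ is a common numerical choice; it does not change the estimator.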
In this paper, we introduce our novel regression method, Parity Regression (PR), and outline three primary contributions.
First, we propose the PR estimator, which, rather than minimising the global empirical risk, ensures that prediction errors are fairly distributed across all model parameters, and we support it with a rigorous theoretical characterisation.
Second, we empirically show that our estimator outperforms OLS, as well as existing penalised and shrinkage estimators, both on synthetic simulations and real-world datasets.
Third, to facilitate the reproduction of results and practical application, we have made the proposed estimator publicly available via the R package savvyPR on CRAN.
1 Literature Review
Our review of the literature begins with Stein’s paradox (James and Stein 1961; Stein 1956), which marked a fundamental shift in statistical thinking by demonstrating that shrinkage can systematically improve estimation accuracy under the MSE criterion. By showing that deliberate introduction of bias may reduce the overall estimation error through a variance–bias trade-off, it provided a conceptual foundation for modern regularisation techniques. Although shrinkage in its classical form emerged after Tikhonov regularisation (Tikhonov et al. 1943), which laid the groundwork for penalised regression, the underlying principle is closely related. The common thread linking penalised regression and shrinkage, despite their origins in different applications, is that a controlled introduction of bias can substantially reduce the estimator variability, thereby yielding an estimator that outperforms the natural unbiased estimator.
We begin by setting aside the estimation of the regression parameter vector $\boldsymbol{\beta}$ and instead consider the problem of estimating the population mean vector $\boldsymbol{\mu}$, thereby clarifying the foundations of the shrinkage principle arising from Stein’s paradox. This paradox, introduced in the seminal papers of James and Stein (1961) and Stein (1956), fundamentally challenges the established statistical paradigm. Its core premise is that, although unbiased estimators possess robust theoretical properties, they may still be strictly suboptimal when efficiency is assessed using the MSE criterion. Recall that the MSE of a generic estimator $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}$ is defined as
$$\mathrm{MSE}(\hat{\boldsymbol{\theta}}) = \mathbb{E}\left[\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|_2^2\right]. \tag{3}$$
The estimator proposed in (James and Stein 1961; Stein 1956), commonly referred to as the James–Stein estimator, demonstrates that the sample mean vector $\bar{\mathbf{x}}$ is a suboptimal estimator of the population mean vector $\boldsymbol{\mu}$. Assuming a multivariate normal sampling distribution, an estimator that strictly dominates the sample mean in terms of MSE can be constructed via multiplicative shrinkage, $c\,\bar{\mathbf{x}}$, where $c$ represents the theoretically optimal shrinkage intensity. This estimator is often termed the oracle shrinkage estimator, as it still depends on unknown population parameters and is therefore not fully data-driven. In practice, substituting these unknown population parameters with sample estimates gives a fully data-driven counterpart, often referred to as a bona fide shrinkage estimator. For example, the James–Stein estimator derived in James and Stein (1961) is given by
where
and
, with
$\mathbf{S}$ denoting the sample covariance matrix estimator; note that $\|\cdot\|_p$ denotes the usual $p$-norm. For a comprehensive treatment of mean vector shrinkage estimation, the reader is referred to (Asimit et al., forthcoming-c; Bodnar et al. 2022).
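For intuition, the following sketch implements the classical textbook form of the James–Stein estimator for a single draw $\mathbf{x} \sim N_p(\boldsymbol{\mu}, \sigma^2 \mathbf{I})$ with known $\sigma^2$ and $p \geq 3$. This simplified setting is an assumption made for illustration and is not necessarily the data-driven expression used in the paper, which also involves the sample covariance matrix. The function name `james_stein` and the simulation settings are hypothetical.

```python
import numpy as np

def james_stein(x, sigma2=1.0):
    """Classical James-Stein estimate of mu from one draw x ~ N_p(mu, sigma2 * I)."""
    p = x.size
    if p < 3:
        raise ValueError("James-Stein shrinkage requires p >= 3")
    # Multiplicative shrinkage factor applied to the raw observation.
    shrink = 1.0 - (p - 2) * sigma2 / np.dot(x, x)
    return shrink * x

# Hypothetical Monte Carlo comparison of squared-error loss.
rng = np.random.default_rng(0)
p, reps = 10, 5000
mu = np.full(p, 0.5)
draws = rng.normal(loc=mu, scale=1.0, size=(reps, p))
mse_raw = np.mean(np.sum((draws - mu) ** 2, axis=1))
mse_js = np.mean([np.sum((james_stein(x) - mu) ** 2) for x in draws])
```

In this simulation the average squared error of the shrunken estimate falls below that of the raw observation, in line with the dominance result described above.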
Beyond yielding mean vector estimators with strictly reduced estimation error, Stein’s paradox establishes a generalised shrinkage principle that extends well beyond the confines of high-dimensional mean estimation. In particular, this principle can be applied to the estimation of the regression parameter vector $\boldsymbol{\beta}$. We therefore provide a succinct review of shrinkage estimators, which deliberately introduce bias in order to reduce the overall MSE through a variance–bias trade-off, a concept that lies at the core of Cross Validation (CV) in statistics and machine learning.
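As a brief worked illustration of this trade-off (a sketch under stated assumptions, not a result taken from the paper), the MSE of any estimator splits into a variance term and a squared-bias term. Assume that $\bar{\mathbf{x}}$ is the mean of $n$ i.i.d. observations with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, and take the multiplicative form $c\,\bar{\mathbf{x}}$ with a fixed scalar $c$ as the shrinkage estimator. The decomposition then gives the oracle intensity explicitly:

```latex
\begin{align*}
\mathrm{MSE}(\hat{\boldsymbol{\theta}})
  &= \mathbb{E}\bigl\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\bigr\|_2^2
   = \operatorname{tr}\bigl(\operatorname{Cov}(\hat{\boldsymbol{\theta}})\bigr)
     + \bigl\|\mathbb{E}[\hat{\boldsymbol{\theta}}] - \boldsymbol{\theta}\bigr\|_2^2, \\
\mathrm{MSE}(c\,\bar{\mathbf{x}})
  &= \frac{c^2}{n}\operatorname{tr}(\boldsymbol{\Sigma})
     + (1 - c)^2 \,\|\boldsymbol{\mu}\|_2^2, \\
c^{*}
  &= \arg\min_{c}\, \mathrm{MSE}(c\,\bar{\mathbf{x}})
   = \frac{\|\boldsymbol{\mu}\|_2^2}{\|\boldsymbol{\mu}\|_2^2 + \operatorname{tr}(\boldsymbol{\Sigma})/n}.
\end{align*}
```

Since $c^{*} < 1$ whenever $\operatorname{tr}(\boldsymbol{\Sigma}) > 0$, some shrinkage always lowers the MSE in this setting. However, $c^{*}$ depends on the unknown $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, which is why it is an oracle quantity.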
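An alternative to OLS is Ridge Regression (RR), introduced by Hoerl and Kennard (1970), which is designed to mitigate overfitting by shrinking the regression parameters and is particularly useful in the presence of multicollinearity or ill-conditioning (i.e., when $\mathbf{X}^\top \mathbf{X}$ possesses zero or near-zero eigenvalues). RR minimises the $\ell_2$-penalised RSS as follows:
$$\hat{\boldsymbol{\beta}}^{\mathrm{RR}}(\lambda) = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \right\}, \tag{4}$$
where $\lambda \geq 0$ is a tuning parameter controlling the strength of the penalty. The solution admits the closed-form expression
$$\hat{\boldsymbol{\beta}}^{\mathrm{RR}}(\lambda) = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y},$$
where $\mathbf{I}$ is the identity matrix and $\lambda > 0$ guarantees that $(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1}$ exists. The penalisation term reduces estimation error, particularly when some eigenvalues of $\mathbf{X}^\top \mathbf{X}$ are zero or close to zero.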
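The ridge closed form can be sketched in a few lines of NumPy. The helper name `ridge_closed_form` and the nearly collinear toy design are hypothetical; the sketch also assumes that every coefficient is penalised, including an intercept if one is present in the design matrix.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate (X^T X + lam * I)^{-1} X^T y for lam > 0."""
    p = X.shape[1]
    # Adding lam * I lifts every eigenvalue of X^T X by lam, so the system is solvable.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical design with two nearly collinear columns.
rng = np.random.default_rng(1)
z = rng.normal(size=50)
X = np.column_stack([z, z + 1e-3 * rng.normal(size=50)])
y = X @ np.array([1.0, 1.0]) + rng.normal(scale=0.1, size=50)

# The Euclidean norm of the ridge estimate shrinks as the penalty grows.
norms = [np.linalg.norm(ridge_closed_form(X, y, lam)) for lam in (0.01, 1.0, 100.0)]
```

The decreasing sequence in `norms` reflects the shrinkage towards the origin that the penalty induces.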
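By standard duality arguments, (4) is equivalent to the constrained formulation
$$\hat{\boldsymbol{\beta}}^{\mathrm{RR}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_2^2 \leq t,$$
where $t \geq 0$ controls the size of the constraint set.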
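RR is a particular case of Tikhonov regularisation (Tikhonov et al. 1943), a broader framework for addressing ill-posed estimation problems. Specifically, for a penalty function $P(\boldsymbol{\beta})$, the Tikhonov estimator is defined as
$$\hat{\boldsymbol{\beta}}^{\mathrm{Tik}}(\lambda) = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda P(\boldsymbol{\beta}) \right\}.$$
When $P(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_2^2$, the Tikhonov estimator reduces to the RR estimator. Since, whenever $\mathbf{X}^\top \mathbf{X}$ is invertible,
$$\hat{\boldsymbol{\beta}}^{\mathrm{RR}}(\lambda) = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{X}\, \hat{\boldsymbol{\beta}}^{\mathrm{OLS}},$$
RR is a shrinkage estimator that increasingly biases the estimates towards the origin as $\lambda$ grows.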
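Hoerl and Kennard (1970) showed that there exists an oracle estimator $\hat{\boldsymbol{\beta}}^{\mathrm{RR}}(\lambda^{*})$, with $\lambda^{*} > 0$, such that
$$\mathrm{MSE}\bigl(\hat{\boldsymbol{\beta}}^{\mathrm{RR}}(\lambda^{*})\bigr) < \mathrm{MSE}\bigl(\hat{\boldsymbol{\beta}}^{\mathrm{OLS}}\bigr),$$
demonstrating that RR can outperform OLS when $\lambda$ is suitably chosen. CV provides a practical method for selecting a bona fide estimate of $\lambda^{*}$; however, the in-sample optimal choice may not be close to the out-of-sample (OOS) optimal choice, which may increase the estimation error of the linear model.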
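A minimal sketch of how CV might select the ridge penalty is given below, assuming plain K-fold CV on mean squared prediction error over a fixed grid. The helper names `ridge` and `cv_select_lambda`, the grid, and the simulated data are hypothetical, and the procedure is not claimed to match the one used in the paper or in any particular package.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_select_lambda(X, y, lambdas, n_folds=5, seed=0):
    """Return the grid value with the smallest K-fold mean squared prediction error."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    scores = []
    for lam in lambdas:
        errs = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            beta = ridge(X[train], y[train], lam)  # fit on the training folds only
            errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
        scores.append(np.mean(errs))
    return lambdas[int(np.argmin(scores))], scores

# Hypothetical simulated data.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -1.0]) + rng.normal(size=60)
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam, cv_scores = cv_select_lambda(X, y, grid)
```

The selected value depends on the random fold assignment, which is one source of the additional variability mentioned above.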
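When $P(\boldsymbol{\beta}) = \|\boldsymbol{\beta}\|_1$ with $\|\boldsymbol{\beta}\|_1 = \sum_{j} |\beta_j|$, the Tikhonov estimator reduces to the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani 1996) and Basis Pursuit Denoising (Chen and Donoho 1994). The $\ell_1$-norm penalty induces sparsity by driving certain coefficients exactly to zero, thereby performing explicit variable selection and regularisation. Owing to the non-differentiability of the $\ell_1$ penalty, LASSO does not admit a closed-form solution; however, it is equivalent to the constrained optimisation problem
$$\hat{\boldsymbol{\beta}}^{\mathrm{LASSO}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_1 \leq t.$$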
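Because no closed form is available, LASSO is usually computed iteratively. The sketch below uses cyclic coordinate descent with soft-thresholding, one standard choice and not necessarily the solver used in the paper. It assumes the penalised objective $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1$, no all-zero columns in the design, and a fixed number of sweeps; the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise ||y - X b||_2^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)  # squared column norms, assumed non-zero
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual that excludes the contribution of column j.
            r_j = y - X @ beta + X[:, j] * beta[j]
            # Threshold is lam / 2 because the squared loss carries no 1/2 factor.
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta

# With an orthonormal design, the solution is soft-thresholding of y itself.
beta_lasso = lasso_cd(np.eye(3), np.array([3.0, 0.2, -1.0]), lam=1.0)
```

In the orthonormal example the small middle coefficient is set exactly to zero while the others are pulled towards zero by $\lambda/2$, which illustrates how the $\ell_1$ penalty performs selection and shrinkage together.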
Although both LASSO and RR regularise the model, they differ fundamentally in their mechanisms. LASSO induces sparsity by setting certain parameters exactly to zero, thereby selecting a subset of predictors and enhancing interpretability. In contrast, RR shrinks the parameters continuously towards zero without eliminating any of them entirely.
RR and LASSO are examples of penalised regression methods that can be interpreted as shrinkage estimators, although they are not constructed explicitly as shrinkage procedures. In contrast, there exists a broad class of estimators that directly shrink the OLS estimator towards a specified target. Such shrinkage estimators are typically simple, admitting closed-form expressions that are designed to optimise the theoretical MSE. Ideally, the corresponding oracle optimal shrinkage estimator is available in closed form, with its plug-in counterpart serving as a bona fide estimator, although CV may alternatively be employed. While both approaches exhibit distinct computational and theoretical trade-offs during implementation, this practical distinction remains largely underexplored in the existing literature.
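The Liu estimator (Liu) (Liu 1993) is a shrinkage estimator that directly shrinks the OLS estimator towards the target $(\mathbf{X}^\top \mathbf{X} + \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}$. Specifically, it modifies the OLS estimator as follows:
$$\hat{\boldsymbol{\beta}}^{\mathrm{Liu}}(d) = (\mathbf{X}^\top \mathbf{X} + \mathbf{I})^{-1} \bigl(\mathbf{X}^\top \mathbf{y} + d\, \hat{\boldsymbol{\beta}}^{\mathrm{OLS}}\bigr),$$
where $0 < d < 1$ is a shrinkage parameter. Under certain conditions, Liu (1993) showed that there exists an optimal $d^{*}$ such that
$$\mathrm{MSE}\bigl(\hat{\boldsymbol{\beta}}^{\mathrm{Liu}}(d^{*})\bigr) < \mathrm{MSE}\bigl(\hat{\boldsymbol{\beta}}^{\mathrm{OLS}}\bigr).$$
Hence, the oracle optimal shrinkage estimator $\hat{\boldsymbol{\beta}}^{\mathrm{Liu}}(d^{*})$ admits a closed-form expression. Nevertheless, in practice, standard software implementations typically select $d$ via CV, similarly to the RR estimator. Finally, note that $\hat{\boldsymbol{\beta}}^{\mathrm{Liu}}(1) = \hat{\boldsymbol{\beta}}^{\mathrm{OLS}}$.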
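The following sketch assumes the commonly stated form of the Liu (1993) estimator, $(\mathbf{X}^\top \mathbf{X} + \mathbf{I})^{-1}(\mathbf{X}^\top \mathbf{y} + d\, \hat{\boldsymbol{\beta}}^{\mathrm{OLS}})$, and requires $\mathbf{X}^\top \mathbf{X}$ to be invertible so that the OLS estimator exists. The function name `liu_estimator` and the simulated data are hypothetical.

```python
import numpy as np

def liu_estimator(X, y, d):
    """Liu-type estimate (X^T X + I)^{-1} (X^T y + d * beta_ols)."""
    gram = X.T @ X
    beta_ols = np.linalg.solve(gram, X.T @ y)  # needs invertible X^T X
    eye = np.eye(gram.shape[0])
    return np.linalg.solve(gram + eye, X.T @ y + d * beta_ols)

# Hypothetical simulated data.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=40)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_liu = liu_estimator(X, y, d=0.5)
```

Under this form, $d = 1$ returns the OLS estimate and $d = 0$ returns the ridge estimate with unit penalty, so intermediate values of $d$ interpolate linearly between the two.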
Liu (2003) extended this framework by proposing a two-parameter Liu estimator to address multicollinearity more effectively. The objective is for this estimator to inherit the stabilising properties of RR, which is specifically designed to accommodate ill-conditioned design matrices where $\mathbf{X}^\top \mathbf{X}$ exhibits zero or near-zero eigenvalues, a hallmark of multicollinearity. The two-parameter Liu estimator introduces an additional parameter $k$ to provide finer control over the shrinkage effect, while retaining the adjustment governed by $d$. It is defined as
Liu (
2003) showed that for any
, there exists an optimal
such that
Although this estimator offers greater flexibility, both parameters must typically be selected via CV in practical implementations. Joint optimisation of $(k, d)$ entails a considerable computational burden, and the additional estimation variability may increase the overall MSE. Thus, despite its appealing theoretical guarantees, the two-parameter Liu estimator is less practical for empirical applications, and we therefore exclude it from our current implementation.
The remainder of the paper is organised as follows. Section 2 presents the main theoretical results. Section 3 reports an informative simulation study, while Section 4 provides a comprehensive real-data analysis. Concluding remarks are given in Section 5. All proofs and supporting technical details are collected in three appendices. Appendix A contains the proofs of all theoretical results. Appendix B provides additional details on the data-generating process underlying the simulation study. Appendix C includes further information on the datasets used in Section 4.