
A High-Dimensional Counterpart for the Ridge Estimator in Multicollinear Situations

Mohammad Arashi, Mina Norouzirad, Mahdi Roozbeh and Naushad Mamode Khan

1. Department of Statistics, Faculty of Mathematical Sciences, Ferdowsi University of Mashhad, Mashhad P.O. Box 9177948974, Iran
2. Department of Statistics, University of Pretoria, Pretoria 0002, South Africa
3. Department of Statistics, Faculty of Mathematical Sciences, Shahrood University of Technology, Shahrood P.O. Box 3619995181, Iran
4. Department of Statistics, Faculty of Mathematics, Statistics and Computer Sciences, Semnan University, Semnan P.O. Box 3514799422, Iran
5. Department of Economics and Statistics, University of Mauritius, Réduit 80837, Mauritius
* Author to whom correspondence should be addressed.
Mathematics 2021, 9(23), 3057; https://doi.org/10.3390/math9233057
Submission received: 10 November 2021 / Revised: 23 November 2021 / Accepted: 24 November 2021 / Published: 28 November 2021
(This article belongs to the Special Issue Advances of Functional and High-Dimensional Data Analysis)

Abstract

The ridge regression estimator is a commonly used procedure to deal with multicollinear data. This paper proposes an alternative estimation procedure for high-dimensional multicollinear data. The proposal yields a continuous estimator that includes the ridge estimator as a particular case. We study its asymptotic performance for the growing dimension, i.e., $p \to \infty$ when $n$ is fixed. Under some mild regularity conditions, we prove the proposed estimator's consistency and derive its asymptotic properties. Some Monte Carlo simulation experiments are executed to assess its performance, and an implementation on a high-dimensional genetic dataset illustrates its use.

1. Introduction

Consider the multiple regression model given by
$$ Y = X\beta + \epsilon, \tag{1} $$
where $Y = (y_1, \ldots, y_n)^\top$ is a vector of $n$ responses, $X = (x_1, \ldots, x_n)^\top$ is an $n \times p$ design matrix with $i$th predictor $x_i \in \mathbb{R}^p$, $\beta = (\beta_1, \ldots, \beta_p)^\top$ is the coefficient vector, and $\epsilon$ is an $n$-vector of unobserved errors. Further, we shall assume $E(\epsilon) = 0$ and $E(\epsilon \epsilon^\top) = \sigma^2 I_n$, $\sigma^2 > 0$.
When $p < n$, the ordinary least squares (LS) estimator of $\beta$ is given by
$$ \hat\beta = \arg\min_{\beta \in \mathbb{R}^p} S(\beta), \qquad S(\beta) = (Y - X\beta)^\top (Y - X\beta), $$
which has the closed form
$$ \hat\beta = (X^\top X)^{-1} X^\top Y. \tag{2} $$
However, in the high-dimensional (HD) case, $p > n$, the LS estimator cannot be obtained because $X^\top X$ is rank deficient. The ridge regression (RR) estimator of Hoerl and Kennard [1], which builds on the regularization of Tikhonov [2], nevertheless still exists. The rationale is to add a positive value $k > 0$ to the eigenvalues of $X^\top X$ so as to estimate the parameters efficiently via $\hat\beta^{Ridge} = (X^\top X + k I_p)^{-1} X^\top Y$. Refer to Saleh et al. [3] for the theory and application of the RR approach. Projecting $\beta$ onto the row space of $X$ is a well-described remedy. Wang et al. [4] used this technique and proposed a high-dimensional LS estimator as a limiting case of the RR, while Bühlmann [5] also used the projection method and developed a bias correction of the RR estimator to propose a bias-corrected RR estimator for the high-dimensional setting. Shao and Deng [6] used the same method, proposed thresholding the RR estimator when the projection vector is sparse, in the sense that many of its components are small, and demonstrated consistency. Dicker [7] studied the minimax property of the RR estimator and derived its asymptotic risk for the growing dimension, i.e., $p \to \infty$. Although the RR estimator is well studied in high-dimensional problems, it has a counterpart that has not yet been considered in high dimensions.
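To make the rank deficiency concrete, here is a minimal R sketch (ours, not from the paper) showing that $X^\top X$ is singular when $p > n$, while the RR estimator remains computable for any $k > 0$:

```r
set.seed(1)
n <- 20; p <- 50; k <- 2
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5))
Y <- X %*% beta + rnorm(n)

qr(crossprod(X))$rank   # rank(X'X) = 20 < p = 50, so X'X is singular

# The ridge estimator exists for any k > 0:
beta_ridge <- solve(crossprod(X) + k * diag(p), crossprod(X, Y))
```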

An Existing Two-Parameter Biased Estimator

It is well known that the RR estimator is an efficient approach for multicollinear situations, and many authors have since developed ridge-type estimators to overcome the issue of multicollinearity. One drawback of the RR estimator is that it is a non-linear function of the tuning parameter. Hence, Liu [8] developed a similar estimator that is linear in its tuning parameter, obtained via the following optimization problem for the case $p < n$:
$$ \min_{\beta \in \mathbb{R}^p} \; S(\beta) + (d\hat\beta - \beta)^\top (d\hat\beta - \beta). \tag{3} $$
The solution to the optimization problem (3) has the form
$$ \hat\beta^{Liu} = (X^\top X + I_p)^{-1} (X^\top Y + d\hat\beta), \tag{4} $$
where $d \in (0, 1)$ is termed the biasing parameter.
Combining the advantages of the RR and Liu estimators, Ozkale and Kaciranlar [9] proposed a two-parameter estimator by solving the following optimization problem:
$$ \min_{\beta \in \mathbb{R}^p} \; S(\beta) + k\left[ (d\hat\beta - \beta)^\top (d\hat\beta - \beta) - c \right], \tag{5} $$
where $c$ is a constant, and $k$ is the Lagrangian multiplier. The resulting two-parameter ridge estimator has the form
$$ \hat\beta(k, d) = (X^\top X + k I_p)^{-1} (X^\top Y + k d \hat\beta). \tag{6} $$
The above estimator has several advantages and simplifies to the LS, RR, and Liu estimators as limiting cases (see Figure 1). It can also be interpreted as a restricted estimator under stochastic prior information about $\beta$.
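For illustration, the following short R sketch computes the two-parameter estimator (6) in the classical case $p < n$ and recovers the limiting cases of Figure 1; the helper name two_param_est is ours, not from the paper:

```r
two_param_est <- function(X, Y, k, d) {
  XtX <- crossprod(X)                       # X'X
  beta_ls <- solve(XtX, crossprod(X, Y))    # LS estimator (2), needs p < n
  solve(XtX + k * diag(ncol(X)), crossprod(X, Y) + k * d * beta_ls)
}
# Limiting cases: d = 1 returns the LS estimator, d = 0 the ridge estimator
# with parameter k, and k = 1 the Liu estimator with biasing parameter d.
```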
With growing dimension $p$, $p > n$, the LS estimator (2) cannot be obtained, so it is not possible to use the two-parameter ridge estimator in Equation (6). Hence, developing a high-dimensional two-parameter version of this estimator and studying its asymptotic performance is interesting and worthwhile. Therefore, in this paper, we propose a high-dimensional version of Ozkale and Kaciranlar's estimator and derive its asymptotic properties. The paper's organization is as follows: In Section 2, a high-dimensional two-parameter estimator is proposed, and its asymptotic characteristics are discussed. Section 3 describes the generalized cross-validation criterion for choosing the parameters. In Section 4, some simulation experiments are presented to assess the novel estimator's statistical and computational performance, and an application to the acute myeloid leukemia (AML) data is illustrated in the same section. The conclusion is presented in the last section.

2. The Proposed Estimator

In this section, we develop an HD estimator and establish its asymptotic properties. To show that a component depends on $p$, we use the subscript $p$, and we particularly consider scenarios in which $p \to \infty$ while $n$ is fixed. This is termed "large $p$, fixed $n$", which is more general than scenarios with $p/n \to \rho \in (0, \infty)$, a common assumption in high-dimensional settings.
Consider a diverging number of variables, in which $p$ is allowed to tend to infinity. This covers the high-dimensional case $p > n$. Under this setting, the inverse of $X^\top X$ does not exist; however, the RR estimator is still valid and applicable. Further, the Liu estimator cannot be obtained. As a remedy, one can use the Moore–Penrose inverse of $X^\top X$, a particular case of the generalized inverse. Wang and Leng [10] showed that $(X^\top X)^{-1} X^\top$ is the Moore–Penrose inverse of $X$ for $p < n$, and that $X^\top (X X^\top)^{-1}$ is the Moore–Penrose inverse of $X$ when $p > n$. This gives, for any $p, n > 0$,
$$ (X^\top X + s I_p)^{-1} X^\top = X^\top (X X^\top + s I_n)^{-1}, \tag{7} $$
where $s$ is an arbitrary nonnegative constant.
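Identity (7) is easy to verify numerically; the following R snippet (ours, not from the paper) checks it for a random design with $p > n$:

```r
set.seed(2)
n <- 10; p <- 40; s <- 3
X <- matrix(rnorm(n * p), n, p)
lhs <- solve(crossprod(X) + s * diag(p), t(X))      # (X'X + sI_p)^{-1} X'
rhs <- t(X) %*% solve(tcrossprod(X) + s * diag(n))  # X'(XX' + sI_n)^{-1}
max(abs(lhs - rhs))                                 # ~ 1e-15, identity holds
```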
Multiplying both sides of (7) by $Y$ and letting $s \to 0$ reveals that the LS estimator can be represented as
$$ \hat\beta = \lim_{s \to 0} (X^\top X + s I_p)^{-1} X^\top Y = \lim_{s \to 0} X^\top (X X^\top + s I_n)^{-1} Y = X^\top (X X^\top)^{-1} Y. \tag{8} $$
Now, for the HD case, substitute (8) into (6) to obtain
$$ \hat\beta^{HD} = (X^\top X + k_p I_p)^{-1} \left( X^\top Y + k_p d_p X^\top (X X^\top)^{-1} Y \right) = (X^\top X + k_p I_p)^{-1} \left( X^\top + k_p d_p X^+ \right) Y, \tag{9} $$
where $X^+ = X^\top (X X^\top)^{-1}$ is the Moore–Penrose inverse of $X$.
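A minimal R sketch of the proposed estimator (9), assuming $p > n$ so that $X^+ = X^\top (X X^\top)^{-1}$; the function name hd_two_param is ours:

```r
hd_two_param <- function(X, Y, k, d) {
  p <- ncol(X)
  Xplus <- t(X) %*% solve(tcrossprod(X))   # Moore-Penrose inverse X+ for p > n
  solve(crossprod(X) + k * diag(p),
        crossprod(X, Y) + k * d * Xplus %*% Y)
}
```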
We impose the following regularity conditions for studying the asymptotic performance of the estimator $\hat\beta^{HD}$ given by (9).
(A1) $1/k_p = o(1)$. There exists a constant $0 \le \delta < 0.5$, such that each component of $X$ is $O(k_p^{\delta})$.
(A2) $d_p = o(1)$. There exists a constant $0 \le \eta < 0.5$, such that each component of $X^+$ is $O(d_p^{-\eta})$.
(A3) For sufficiently large $p$, there is a vector $b_{p \times 1}$, such that $\beta = X^\top X b$, and there exists a constant $\varepsilon > 0$, such that each component of $b_{p \times 1}$ is $O(1/p^{\varepsilon + 1.5})$ and $k_p = o(p^{\varepsilon} a_p)$, with $a_p = o(1)$. (An example of such a choice is $k_p = \sqrt{p}$ and $\varepsilon = 0.5 + \delta$.)
(A4) For sufficiently large $p$, there exists a constant $\delta > 0$, such that each component of $\beta$ is $O(p^{-2\delta})$ and $1/d_p = o(p^{\delta})$. Further, $k_p^{\delta - 1} = o(d_p)$.
Assumption (A3) is adopted from Luo [11]. Let $\hat\beta^{HD} = (\hat\beta_1^{HD}, \ldots, \hat\beta_p^{HD})^\top$.
Theorem 1. 
Assume (A1) and (A2). Then, $\operatorname{var}(\hat\beta_i^{HD}) = o(1)$ for all $i = 1, \ldots, p$.
Proof. 
For the proof, refer to Appendix A. □
Theorem 2. 
Assume (A1)–(A3). Further, suppose $\lambda_{ip} = O(k_p)$, where $\lambda_{ip} > 0$ is the $i$th eigenvalue of $X^\top X$. Then, $\operatorname{bias}(\hat\beta_i^{HD}) = o(1)$ for all $i = 1, 2, \ldots, p$.
Proof. 
For the proof, refer to Appendix A. □
Using Theorems 1 and 2, it can be verified that the HD estimator $\hat\beta^{HD}$ is a consistent estimator of $\beta$ as $p \to \infty$.
The following result reveals the asymptotic distribution of this estimator as $p \to \infty$.
Theorem 3. 
Assume $1/k_p = o(1)$, and for sufficiently large $p$, there exists a constant $\delta > 0$, such that each component of $\beta$ is $O(1/p^{2+\delta})$. Let $k_p = o(p^{\delta})$ and $\lambda_{ip} = o(k_p)$. Furthermore, suppose that $\epsilon \sim N_n(0, \sigma^2 I_n)$, $\sigma^2 > 0$. Then,
$$ \frac{1}{d_p}\left( \hat\beta^{HD} - \beta \right) \xrightarrow{\;D\;} N\left( 0, \sigma^2 X^+ (X^+)^\top \right) \quad \text{as } p \to \infty. $$
Proof. 
For the proof, refer to Appendix A. □

3. Generalized Cross-Validation

As noted, the estimator $\hat\beta^{HD}$ depends on both the ridge parameter $k_p$ and the Liu parameter $d_p$, which must be tuned in practice. To do this, we use the generalized cross-validation (GCV) criterion. The GCV is used to choose the ridge and Liu parameters by minimizing an estimate of the unobservable risk function
$$ R(\beta; \hat\beta^{HD}) = \frac{1}{n} E\left[ \left( E(Y) - \hat Y^{HD}(k_p, d_p) \right)^\top \left( E(Y) - \hat Y^{HD}(k_p, d_p) \right) \right] = \frac{1}{n} E\left\| E(Y) - X \hat\beta^{HD}(k_p, d_p) \right\|^2, \tag{10} $$
where
$$ \hat Y^{HD}(k_p, d_p) = X \hat\beta^{HD} = X (X^\top X + k_p I_p)^{-1} \left( X^\top + k_p d_p X^+ \right) Y = H(k_p, d_p) Y, \tag{11} $$
with $H(k_p, d_p) = X (X^\top X + k_p I_p)^{-1} (X^\top + k_p d_p X^+)$ termed the hat matrix of $Y$.
Following [12], and noting that $E(Y) = X\beta$ while $\hat Y^{HD}(k_p, d_p) = H(k_p, d_p)(X\beta + \epsilon)$, it is straightforward to show that
$$ E\left[ R(\beta; \hat\beta^{HD}) \right] = \frac{1}{n} \left\| \left( I_n - H(k_p, d_p) \right) X\beta \right\|^2 + \frac{\sigma^2}{n} \operatorname{tr}\left( H(k_p, d_p) H(k_p, d_p)^\top \right) = \nu_1^2(k_p, d_p) + \sigma^2 \nu_2(k_p, d_p), $$
where $\nu_1^2(k_p, d_p) = \frac{1}{n} \| (I_n - H(k_p, d_p)) X\beta \|^2$ and $\nu_2(k_p, d_p) = \frac{1}{n} \operatorname{tr}( H(k_p, d_p) H(k_p, d_p)^\top )$.
The GCV function is then defined as
$$ \operatorname{GCV}(\hat\beta^{HD}) = \frac{\frac{1}{n} \left\| \left( I_n - H(k_p, d_p) \right) Y \right\|^2}{\left( 1 - \frac{1}{n} \operatorname{tr} H(k_p, d_p) \right)^2} = \frac{\frac{1}{n} \left\| \left( I_n - H(k_p, d_p) \right) Y \right\|^2}{\left( 1 - \mu_1(k_p, d_p) \right)^2}, \tag{12} $$
where $\mu_1(k_p, d_p) = \frac{1}{n} \operatorname{tr} H(k_p, d_p)$.
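In practice, the GCV criterion (12) can be minimized over a grid of candidate $(k_p, d_p)$ values. The R sketch below, assuming data X and Y are in hand with $p > n$, is one possible implementation with grid choices of our own:

```r
gcv <- function(X, Y, k, d) {
  n <- nrow(X); p <- ncol(X)
  Xplus <- t(X) %*% solve(tcrossprod(X))
  # Hat matrix H(k, d) = X (X'X + kI)^{-1} (X' + k d X+)
  H <- X %*% solve(crossprod(X) + k * diag(p), t(X) + k * d * Xplus)
  rss <- sum(((diag(n) - H) %*% Y)^2) / n
  mu1 <- sum(diag(H)) / n
  rss / (1 - mu1)^2                        # GCV value, Equation (12)
}
grid <- expand.grid(k = 10^seq(-2, 3, length.out = 25),
                    d = seq(0.01, 0.99, length.out = 25))
scores <- mapply(gcv, k = grid$k, d = grid$d, MoreArgs = list(X = X, Y = Y))
k_opt <- grid$k[which.min(scores)]
d_opt <- grid$d[which.min(scores)]
```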
The following theorem extends the GCV theorem proposed by Akdeniz and Roozbeh [13].
Theorem 4. 
According to the definition of the GCV, we have
$$ \frac{E\left[ R(\beta; \hat\beta^{HD}) \right] - E\left[ \operatorname{GCV}(\hat\beta^{HD}) \right] + \sigma^2}{E\left[ R(\beta; \hat\beta^{HD}) \right]} = 1 - \frac{1}{\left( 1 - \mu_1(k_p, d_p) \right)^2} + \frac{1}{D(k_p, d_p)} \times \frac{\sigma^2 \mu_1^2(k_p, d_p)}{\left( 1 - \mu_1(k_p, d_p) \right)^2}, $$
where $D(k_p, d_p) = \nu_1^2(k_p, d_p) + \sigma^2 \nu_2(k_p, d_p)$, and consequently,
$$ \left| \frac{E\left[ R(\beta; \hat\beta^{HD}) \right] - E\left[ \operatorname{GCV}(\hat\beta^{HD}) \right] + \sigma^2}{E\left[ R(\beta; \hat\beta^{HD}) \right]} \right| < \frac{1}{\left( 1 - \mu_1(k_p, d_p) \right)^2} \left( 2\mu_1(k_p, d_p) + \frac{\mu_1^2(k_p, d_p)}{\nu_2(k_p, d_p)} \right), $$
whenever $0 < \mu_1(k_p, d_p) < 1$.
Proof. 
For the proof, refer to Appendix A. □

4. Numerical Investigations

In this section, we assess the performance of the proposed HD estimator $\hat\beta^{HD}$ through a simulation study along with the analysis of real data.

4.1. Simulation

Here, we consider the multiple regression model with varying squared multiple correlation coefficient $R^2$ and error distribution, given by the following relation:
$$ Y = c X \beta + \sigma \epsilon, $$
where $\beta = (\beta_1^\top, 0^\top)^\top$, $\beta_1$ is the active set, and its dimension is $p_1 = 0.4p$. The components of $\beta_1$ are taken as the absolute values of draws from a normal distribution with mean 0 and standard deviation 5. The remaining $p - p_1$ components are zero.
In this example, motivated by McDonald and Galarneau [14], the explanatory variables are generated by
$$ x_j = \sqrt{1 - \rho^2}\, z_j + \rho z_{p+1}, \quad j = 1, \ldots, p, $$
where the $z_j$s are independent standard normal pseudo-random vectors, and $\rho$ is specified such that the correlation between any two explanatory variables is $\rho^2$. Similar to Zhu et al. [15], the variance is set to $\sigma^2 = 6.83$, and two different error distributions are taken for $\epsilon$: (1) the standard normal $N_n(0, I_n)$, and (2) the standard $t$ with 5 degrees of freedom, $t_n(0, I_n, 5)$. The constant $c$ is also varied to control the signal-to-noise ratio; it is set to 0.5, 1, and 2, with corresponding $R^2 = 20\%$, $50\%$, and $80\%$. $R^2$ represents the proportion of the variance of the dependent variable that is explained by the independent variable or variables in a regression model.
We consider $\rho \in \{0.8, 0.95\}$; the sample size and the number of covariates are set to $n \in \{30, 50, 100\}$ and $p \in \{256, 512, 1024\}$, respectively. Following regularity conditions (A1)–(A4), we set $k_p = \sqrt{p}$. For $\delta = 0.25 = 1/4$, we take $d_p = p^{-1/5}$, which guarantees (A4). We then simulate $\hat\beta^{HD}$ and $\hat\beta^{Ridge}$ 100 times using Equation (9) and $\hat\beta^{Ridge} = (X^\top X + k_p I_p)^{-1} X^\top Y$.
For comparison purposes, the quadratic bias (QB) and mean squared error (MSE) are computed according to
$$ \operatorname{QB}(\hat\beta^*) = \left( \frac{1}{100} \sum_{j=1}^{100} (\hat\beta_j^* - \beta) \right)^\top \left( \frac{1}{100} \sum_{j=1}^{100} (\hat\beta_j^* - \beta) \right) \quad \text{and} \quad \operatorname{MSE}(\hat\beta^*) = \frac{1}{100} \sum_{j=1}^{100} (\hat\beta_j^* - \beta)^\top (\hat\beta_j^* - \beta), $$
respectively, where $\hat\beta^*$ is one of $\hat\beta^{HD}$ or $\hat\beta^{Ridge}$.
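For concreteness, the following condensed R sketch reproduces one simulation cell with normal errors; the seed and cell are our own choices, and hd_two_param is the sketch from Section 2:

```r
set.seed(2021)
n <- 30; p <- 256; rho <- 0.8; c0 <- 0.5; sigma <- sqrt(6.83)
kp <- sqrt(p); dp <- p^(-1/5)
Z <- matrix(rnorm(n * (p + 1)), n, p + 1)
X <- sqrt(1 - rho^2) * Z[, 1:p] + rho * Z[, p + 1]  # McDonald-Galarneau design
p1 <- round(0.4 * p)                                # active-set size, rounded
beta <- c(abs(rnorm(p1, 0, 5)), rep(0, p - p1))     # sparse coefficient vector
reps <- replicate(100, {
  Y <- c0 * X %*% beta + sigma * rnorm(n)
  cbind(hd_two_param(X, Y, kp, dp),                             # HD estimator
        solve(crossprod(X) + kp * diag(p), crossprod(X, Y)))    # ridge
})
err_hd <- reps[, 1, ] - beta; err_rr <- reps[, 2, ]  - beta
c(QB_hd  = sum(rowMeans(err_hd)^2),  QB_ridge  = sum(rowMeans(err_rr)^2),
  MSE_hd = mean(colSums(err_hd^2)),  MSE_ridge = mean(colSums(err_rr^2)))
```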

4.2. Review of Results

In Theorem 2, the condition under which the proposed $\hat\beta^{HD}$ is asymptotically unbiased was investigated based on the eigenvalues of $X^\top X$. Here, we numerically analyze the bias of this estimator by comparing it with the ridge estimator with respect to the parameters of the model. For this purpose, the difference in QB is reported in Table 1 by evaluating
$$ \operatorname{diff} = \operatorname{QB}(\hat\beta^{HD}) - \operatorname{QB}(\hat\beta^{Ridge}). $$
If diff is positive, then the quadratic bias of the proposed estimator is larger than that of the ridge estimator.
To compare the MSEs, we use the relative mean squared error (RMSE) given by
$$ \operatorname{RMSE} = \frac{\operatorname{MSE}(\hat\beta^{Ridge})}{\operatorname{MSE}(\hat\beta^{HD})}. $$
The results are reported in Table 2. If $\operatorname{RMSE} > 1$, then the proposed estimator has a smaller MSE than the ridge estimator.
Based on the results of Table 1 and Table 2, the following conclusions are made:
(1)
The performance of the estimators is affected by the number of observations ($n$), the number of variables ($p$), the signal-to-noise ratio ($c$), and the degree of multicollinearity ($\rho$).
(2)
As the degree of multicollinearity $\rho$ increases, the QB of the proposed estimator increases for $c = 0.5$ and 1 under both error distributions, but its MSE decreases dramatically, as seen from the increasing RMSE.
(3)
The signal-to-noise ratio reflects the effect of $\beta$ in the model. Lower values (less than 1) are a sign of model sparsity; when $c$ is small, the proposed estimator performs better than the ridge. This is evidence that our estimator is a better candidate as an alternative in sparse models in the MSE sense. However, the QB increases for large $c$ values, which forces the model to overestimate the parameters.
(4)
As $p$ increases, although the proposed estimator is superior to the ridge in sparse models (small $c$ values), its efficiency decreases. This is more evident when the ratio $p/n$ becomes larger. This may read as poor performance, but our estimator is still preferred in high dimensions for sparse models.
(5)
Obviously, as $n$ increases, so does the RMSE; however, the QB becomes very large, owing to the complicated form of the proposed estimator. It must be noted that this does not contradict the results of Theorem 2, since the simulation scheme does not obey the regularity conditions.
(6)
There is evidence of robustness to the distribution tail for sparse models; i.e., the QB and RMSE are the same for both the normal and $t$ distributions. However, as $c$ increases, the QB of the proposed estimator explodes for the heavier-tailed distribution. This may be seen as a disadvantage of the proposed estimator, but even for large values of $c$, the RMSE stays the same, evidence of relatively small variance under the heavier-tailed distribution.

4.3. AML Data Analysis

This section assesses the performance of the proposed estimator using the mean prediction error (MPE) and MSE criteria on a data set adopted from Metzeler et al. [16], in which information on 79 patients was collected. The data can be accessed from the Gene Expression Omnibus (GEO) data repository (http://www.ncbi.nlm.nih.gov/geo/ (accessed on 1 January 2021)) of the National Center for Biotechnology Information (NCBI), under GEO accession number GSE12417. We only use the data set that was used as a test set. It contains gene expression data for 79 adult patients with cytogenetically normal acute myeloid leukemia (CN-AML), showing heterogeneous treatment outcomes. Following Sill et al. [17], we reduce the total of 54,675 gene expression features measured with the Affymetrix HG-U133 Plus 2.0 microarray technology to the top $p \in \{1000, 2000\}$ features with the largest variance across all 79 samples. We consider overall survival time in months as the response variable. The condition number of the design matrix for the AML data set is approximately 1095.80, evidence of severe multicollinearity among the columns of the design matrix (see [18], p. 298). To find the optimum values of $k$ and $d$, denoted by $k_{opt}$ and $d_{opt}$, for practical purposes we use the GCV given by Equation (12). Hence, we use the following formulas:
$$ \hat\beta^{HD*} = (X^\top X + k_{opt} I_p)^{-1} \left( X^\top Y + k_{opt} d_{opt} X^\top (X X^\top)^{-1} Y \right), \qquad \hat\beta^{Ridge*} = (X^\top X + k_{opt} I_p)^{-1} X^\top Y. $$
To compute the MPE and MSE, we divide the whole data set into a training set ($\mathcal{T} = (X_{train}, Y_{train})$) and a validation set ($\mathcal{V} = (X_{valid}, Y_{valid})$), comprising $70\%$ and $30\%$ of the samples, respectively. Then, the measures are evaluated using
$$ \operatorname{MPE}_{boot}(\hat\beta^*) = \frac{1}{N.boot} \sum_{j=1}^{N.boot} \left( X_{valid} \hat\beta_j^{train*} - Y_{valid} \right)^\top \left( X_{valid} \hat\beta_j^{train*} - Y_{valid} \right), $$
$$ \operatorname{MSE}_{boot}(\hat\beta^*) = \frac{1}{N.boot} \sum_{j=1}^{N.boot} \left( \hat\beta_j^{train*} - \hat\beta^{HD*} \right)^\top \left( \hat\beta_j^{train*} - \hat\beta^{HD*} \right), $$
where $N.boot$ stands for the number of bootstrap samples, $\hat\beta^*$ is one of the proposed and ridge estimators, and $\hat\beta^{HD*}$ is the assumed true parameter obtained by Equation (9) from the whole data set. The relative measures are then given by
$$ \operatorname{RMPE}_{boot} = \frac{\operatorname{MPE}_{boot}(\hat\beta^{Ridge*})}{\operatorname{MPE}_{boot}(\hat\beta^{HD*})}, \qquad \operatorname{RMSE}_{boot} = \frac{\operatorname{MSE}_{boot}(\hat\beta^{Ridge*})}{\operatorname{MSE}_{boot}(\hat\beta^{HD*})}. $$
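A hedged R sketch of this bootstrap scheme is given below; the split indices and helper names are ours, and k_opt and d_opt are assumed to come from the GCV search sketched in Section 3:

```r
N.boot <- 200
tr <- sample(seq_len(nrow(X)), size = round(0.7 * nrow(X)))   # 70% training
Xtr <- X[tr, ]; Ytr <- Y[tr]; Xva <- X[-tr, ]; Yva <- Y[-tr]  # 30% validation
beta_star <- hd_two_param(X, Y, k_opt, d_opt)  # assumed "true" parameter, (9)
boot_out <- replicate(N.boot, {
  b <- sample(seq_len(nrow(Xtr)), replace = TRUE)   # bootstrap the training set
  bhat <- hd_two_param(Xtr[b, ], Ytr[b], k_opt, d_opt)
  c(MPE = sum((Xva %*% bhat - Yva)^2), MSE = sum((bhat - beta_star)^2))
})
rowMeans(boot_out)   # MPE_boot and MSE_boot for the HD estimator
```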
The results are tabulated in Table 3 for $N.boot = 200$ bootstrap samples. The following conclusions are obtained from Table 3:
(1)
Using the GCV, the proposed estimator is consistently superior to the ridge estimator with respect to the RMSE and RMPE criteria.
(2)
Similar to the simulation results, with growing $p$, the MSE of the proposed estimator increases compared to the ridge estimator. However, as $p$ gets larger, the mean prediction error becomes smaller, which shows its superiority for prediction purposes.
Further, Figure 2 depicts the MSE and MPE values for both the HD and ridge estimators for the case $p = 1000$. It is obvious that the high-dimensional estimator performs better than the ridge. For the case $p = 2000$, we obtained similar results.

5. Conclusions

In this note, we proposed a high-dimensional two-parameter ridge estimator as an alternative to the conventional ridge and Liu estimators, and discussed its asymptotic properties. Simulation and real-data experiments show that this estimator is efficient in high-dimensional problems and can potentially overcome multicollinearity. Additionally, the proposed high-dimensional ridge estimator yields superior performance in the mean squared error sense.

Author Contributions

Conceptualization, M.A. and N.M.K.; methodology, M.A. and N.M.K.; validation, M.A., M.N., M.R. and N.M.K.; formal analysis, M.A., M.N., M.R. and N.M.K.; investigation, M.A., M.N., M.R. and N.M.K.; resources, M.N.; writing—original draft preparation, M.A.; writing—review and editing, M.A., M.N., M.R. and N.M.K.; visualization, M.A., M.N. and M.R.; supervision, M.A. and N.M.K.; project administration, M.A., M.N., M.R. and N.M.K.; funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was based upon research supported, in part, by the visiting professor program, University of Pretoria, and the National Research Foundation (NRF) of South Africa, SARChI Research Chair UID: 71199; Reference: IFR170227223754 grant No. 109214. The work of M. Norouzirad and M. Roozbeh is based on the research supported in part by the Iran National Science Foundation (INSF) (grant number 97018318). The opinions expressed and conclusions arrived at are those of the authors and are not necessarily attributed to the NRF.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this article may be simulated in R, using the stated seed value and parameter values. The real data set is available at http://www.ncbi.nlm.nih.gov/geo/ (accessed on 1 January 2021).

Acknowledgments

We would like to sincerely thank the two anonymous reviewers for their constructive comments, which led us to add many details to the paper and improve the presentation.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of the Main Results

Proof of Theorem 1. 
By definition, we have
$$ \operatorname{var}(\hat\beta^{HD}) = \sigma^2 (X^\top X + k_p I_p)^{-1} \left( X^\top + k_p d_p X^+ \right) \left( X^\top + k_p d_p X^+ \right)^\top (X^\top X + k_p I_p)^{-1} = \sigma^2 \left( \frac{X^\top X}{k_p} + I_p \right)^{-1} \left( \frac{X^\top}{k_p} + d_p X^+ \right) \left( \frac{X^\top}{k_p} + d_p X^+ \right)^\top \left( \frac{X^\top X}{k_p} + I_p \right)^{-1}. $$
By (A1), $X^\top / k_p = O(1) k_p^{\delta - 1} = o(1)$ and $X^\top X / k_p + I_p \to I_p$. By (A2), $d_p X^+ = O(1) d_p^{1 - \eta} = o(1)$. Hence, $\operatorname{var}(\hat\beta_i^{HD}) \to 0$ as $p \to \infty$, and the proof is complete. □
Proof of Theorem 2. 
By definition
$$ E(\hat\beta^{HD}) = (X^\top X + k_p I_p)^{-1} \left( X^\top + k_p d_p X^+ \right) X \beta = \left( \frac{X^\top X}{k_p} + I_p \right)^{-1} \left( \frac{X^\top X}{k_p} + d_p X^+ X \right) \beta = \left( \frac{X^\top X}{k_p} + I_p \right)^{-1} \left( \frac{X^\top X}{k_p} \beta + d_p X^+ X \beta \right). $$
Under (A2), $d_p X^+ X = o(1)$. The proof is complete using Theorem 2 of Luo [11]. □
Proof of Theorem 3. 
We have
$$ \frac{1}{d_p} (\hat\beta^{HD} - \beta) = \frac{1}{d_p} \left[ (X^\top X + k_p I_p)^{-1} \left( X^\top + k_p d_p X^+ \right) (X\beta + \epsilon) - \beta \right] = \left( \frac{X^\top X}{k_p} + I_p \right)^{-1} \left( \frac{X^\top}{k_p d_p} + X^+ \right) \epsilon + \frac{1}{d_p} \left[ \left( \frac{X^\top X}{k_p} + I_p \right)^{-1} \left( \frac{X^\top X}{k_p} + d_p X^+ X \right) - I_p \right] \beta. $$
By (A1), $X^\top X / k_p + I_p \to I_p$; by (A2), $d_p X^+ X = o(1)$; and by (A4), $X^\top / (k_p d_p) = o(1)$. Hence,
$$ \frac{1}{d_p} (\hat\beta^{HD} - \beta) \to X^+ \epsilon. $$
The proof is complete. □
Proof of Theorem 4. 
It is straightforward to verify that
$$ E\left[ \operatorname{GCV}(\hat\beta^{HD}) \right] = \frac{\nu_1^2(k_p, d_p) + \sigma^2 \left( 1 - 2\mu_1(k_p, d_p) + \nu_2(k_p, d_p) \right)}{\left( 1 - \mu_1(k_p, d_p) \right)^2}. $$
Hence,
$$ E\left[ R(\beta; \hat\beta^{HD}) \right] - E\left[ \operatorname{GCV}(\hat\beta^{HD}) \right] = E\left[ R(\beta; \hat\beta^{HD}) \right] \left( 1 - \frac{1}{\left( 1 - \mu_1(k_p, d_p) \right)^2} \right) - \frac{\sigma^2 \left( 1 - 2\mu_1(k_p, d_p) \right)}{\left( 1 - \mu_1(k_p, d_p) \right)^2}, $$
which leads to the required result. □

References

  1. Hoerl, A.E.; Kennard, R.W. Ridge regression: Biased estimation for non-orthogonal problems. Technometrics 1970, 12, 55–67.
  2. Tikhonov, A.N. Solution of incorrectly formulated problems and the regularization method. Sov. Math. Dokl. 1963, 4, 1035–1038.
  3. Saleh, A.K.M.E.; Arashi, M.; Kibria, B.M.G. Theory of Ridge Regression Estimation with Applications; John Wiley: Hoboken, NJ, USA, 2019.
  4. Wang, X.; Dunson, D.; Leng, C. No penalty no tears: Least squares in high-dimensional models. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1814–1822.
  5. Bühlmann, P. Statistical significance in high-dimensional linear models. Bernoulli 2013, 19, 1212–1242.
  6. Shao, J.; Deng, X. Estimation in high-dimensional linear models with deterministic design matrices. Ann. Stat. 2012, 40, 812–831.
  7. Dicker, L.H. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli 2016, 22, 1–37.
  8. Liu, K. A new class of biased estimate in linear regression. Commun. Stat. Theory Methods 1993, 22, 393–402.
  9. Ozkale, M.R.; Kaciranlar, S. The restricted and unrestricted two-parameter estimators. Commun. Stat. Theory Methods 2007, 36, 2707–2725.
  10. Wang, X.; Leng, C. High dimensional ordinary least squares projection for screening variables. J. R. Stat. Soc. Ser. B 2015.
  11. Luo, J. The discovery of mean square error consistency of ridge estimator. Stat. Probab. Lett. 2010, 80, 343–347.
  12. Amini, M.; Roozbeh, M. Optimal partial ridge estimation in restricted semiparametric regression models. J. Multivar. Anal. 2015, 136, 26–40.
  13. Akdeniz, F.; Roozbeh, M. Generalized difference-based weighted mixed almost unbiased ridge estimator in partially linear models. Stat. Pap. 2019, 60, 1717–1739.
  14. McDonald, G.C.; Galarneau, D.I. A Monte Carlo Evaluation of Some Ridge-Type Estimators. J. Am. Stat. Assoc. 1975, 70, 407–416.
  15. Zhu, L.P.; Li, L.; Li, R.; Zhu, L.X. Model-free feature screening for ultrahigh dimensional data. J. Am. Stat. Assoc. 2011, 106, 1464–1475.
  16. Metzeler, K.H.; Hummel, M.; Bloomfield, C.D.; Spiekermann, K.; Braess, J.; Sauerl, M.C.; Heinecke, A.; Radmacher, M.; Marcucci, G.; Whitman, S.P.; et al. An 86 Probe Set Gene Expression Signature Predicts Survival in Cytogenetically Normal Acute Myeloid Leukemia. Blood 2008, 112, 4193–4201.
  17. Sill, M.; Hielscher, T.; Becker, N.; Zucknick, M. c060: Extended Inference for Lasso and Elastic-Net Regularized Cox and Generalized Linear Models; R Package Version 0.2-4; 2014. Available online: http://CRAN.R-project.org/package=c060 (accessed on 1 January 2021).
  18. Montgomery, D.C.; Peck, E.A.; Vining, G.G. Introduction to Linear Regression Analysis, 5th ed.; Wiley: Hoboken, NJ, USA, 2012.
Figure 1. Special limiting cases.
Figure 2. Box-plot of the MSE and MPE values for p = 1000 in the AML data.
Table 1. The difference (diff) between quadratic biases of the high-dimensional and ridge estimators.

                          ρ = 0.8                       ρ = 0.95
  p      c     n      N(0, I_n)   t_n(0, I_n, 5)    N(0, I_n)   t_n(0, I_n, 5)
  256    0.5   30       5.7657       5.7643          10.0535      10.1134
               50       6.4911       6.4941          11.4722      11.5088
               100     17.8314      17.8493          30.1008      30.4137
         1     30      22.9671     487.4169          39.6556     459.3473
               50      25.8621     522.6298          45.1138     480.5501
               100     70.8693     798.6551         118.1326     676.1919
         2     30      91.7526    2413.9746         158.0026    2256.1664
               50     103.3996    2587.7382         179.8509    2357.4111
               100    283.0549    3922.0057         470.4114    3259.2211
  512    0.5   30       3.1943       3.2012           6.5528       6.6001
               50       4.4800       4.4781           9.5861       9.6151
               100     10.2121      10.2489          20.1828      20.3366
         1     30      12.7657     926.7540          26.0911     916.2663
               50      17.8861    1009.3595          38.0549     969.4353
               100     40.7254    1192.4455          79.9094    1095.9628
         2     30      51.0605    4621.0862         104.2892    4555.0569
               50      71.5157    5029.3107         151.9461    4809.2595
               100    162.7616    5920.6337         318.7343    5397.7878
  1024   0.5   30       1.7594       1.7584           3.7384       3.7410
               50       3.9188       3.9345           9.2523       9.3437
               100      5.1236       5.1189          12.6469      12.6455
         1     30       7.0318    1637.6798          14.8960    1636.5664
               50      15.6758    1804.8548          36.9649    1763.7468
               100     20.4564    1940.6091          50.2993    1856.0197
         2     30      28.1221    8181.4255          59.5312    8167.9835
               50      62.7157    9008.4246         147.8715    8781.1968
               100     81.7756    9682.7404         147.8715    9229.7803
Table 2. The relative MSE (RMSE) of the high-dimensional and ridge estimators.

                          ρ = 0.8                       ρ = 0.95
  p      c     n      N(0, I_n)   t_n(0, I_n, 5)    N(0, I_n)   t_n(0, I_n, 5)
  256    0.5   30      1.0050       1.0050           1.0140       1.0139
               50      1.0058       1.0058           1.0161       1.0160
               100     1.0222       1.0222           1.0543       1.0539
         1     30      1.0032       1.0032           1.0179       1.0178
               50      1.0039       1.0039           1.0209       1.0209
               100     1.0221       1.0220           1.0883       1.0876
         2     30      0.9816       0.9816           0.9852       0.9851
               50      0.9793       0.9793           0.9829       0.9829
               100     0.9434       0.9435           0.9587       0.9584
  512    0.5   30      1.0011       1.0011           1.0031       1.0031
               50      1.0016       1.0016           1.0048       1.0048
               100     1.0041       1.0041           1.0119       1.0119
         1     30      1.0004       1.0004           1.0029       1.0029
               50      1.0007       1.0007           1.0048       1.0048
               100     1.0023       1.0023           1.0139       1.0139
         2     30      0.9948       0.9948           0.9924       0.9924
               50      0.9929       0.9929           0.9895       0.9895
               100     0.9843       0.9843           0.9810       0.9809
  1024   0.5   30      1.0003       1.0003           1.0009       1.0009
               50      1.0007       1.0007           1.0022       1.0022
               100     1.0009       1.0009           1.0031       1.0031
         1     30      1.0001       1.0001           1.0006       1.0006
               50      1.0002       1.0002           1.0017       1.0017
               100     1.0003       1.0003           1.0025       1.0025
         2     30      0.9984       0.9984           0.9973       0.9973
               50      0.9964       0.9964           0.9933       0.9933
               100     0.9954       0.9954           0.9910       0.9911
Table 3. RMPE and RMSE values for 200 bootstrapped samples in the analysis of the AML data.

  Criterion       p = 1000    p = 2000
  RMPE_boot       1.001981    1.002278
  RMSE_boot       1.046073    1.039997
