2.1. Model Selection Estimators in Low Dimensions
This subsection explains why “sensible model selection estimators, including variable selection estimators,” produce fitted values (predictions) similar to those of the full OLS model when $n$ is much larger than $p$. The result in Equation (4), that the residuals from the model selection model and the full OLS model are highly correlated, was a property of OLS and Mallows's $C_p$ criterion, not of any underlying model, but linearity forces the fitted values to be highly correlated. Hence the result holds whenever OLS is consistent and the population model is linear, so it applies to weighted least squares, AR(p) time series, serially correlated errors, et cetera. In particular, the cases do not need to be iid from some distribution. Since the correlation gets arbitrarily close to 1, the model selection estimator and the full OLS estimator are estimating the same population parameter $\beta$, but it is possible that the model selection estimator picks the full OLS model with probability going to one.
Consider the OLS regression of $Y$ on a constant and the predictors $x_1, \dots, x_p$. Let $I$ index the variables in the model, so $i \in I$ means that $x_i$ was selected. The full model $F$ uses all $p$ predictors and the constant, with $I = F$. Let $\boldsymbol{r}$ be the residuals from the full OLS model, and let $\boldsymbol{r}_I$ be the residuals from model $I$ that uses the predictors indexed by $I$. Suppose model $I$ uses $k$ predictors including a constant. Ref. [5] proved that the model $I$ with $k$ predictors that minimizes the [6] $C_p(I)$ criterion maximizes $\mbox{cor}(\boldsymbol{r}, \boldsymbol{r}_I)$, that $\mbox{cor}(\boldsymbol{r}, \boldsymbol{r}_I) = \sqrt{SSE/SSE(I)}$ where $SSE$ and $SSE(I)$ are the residual sums of squares of the full model and of model $I$, and that, under linearity, $\mbox{cor}(\boldsymbol{r}, \boldsymbol{r}_I) \to 1$ forces $\mbox{cor}(\mbox{ESP}(I), \mbox{ESP}(F)) \to 1$, where ESP denotes the estimated sufficient predictor (the fitted values). Thus a small value of $C_p(I)$ implies that $\mbox{cor}(\boldsymbol{r}, \boldsymbol{r}_I)$ is close to one.
Let the model $I_{min}$ minimize the $C_p$ criterion among the models considered. Then $C_p(I_{min}) \leq C_p(F)$, and if PLS or PCR is selected using model selection with the $C_p$ criterion (on models $I_j$ with $I_j$ corresponding to the $j$-component regression), then $\mbox{cor}(\boldsymbol{r}, \boldsymbol{r}_{I_{min}})$ is again high. Hence the correlation of ESP(I) and ESP(F) will typically also be high. (For PCR, the following variant should work better: take the first component to be the principal component with the highest squared correlation with $Y$, the second component to be the principal component with the second highest squared correlation, and so on.)
Machine learning methods for the multiple linear regression model can be incorporated as follows. Let $k$ be the number of predictors selected by lasso, including a constant. Standardize the predictors to have unit sample variance and run the method. Let model $I$ contain the variables corresponding to the predictor variables that have the largest importance according to the method. Fit the OLS model $I$ to these predictors and a constant. If $C_p(I) \leq 2k$, use model $I$; otherwise, use the full OLS model. Many variants are possible. In low dimensions, comparisons between methods like lasso, PCR, PLS, and envelopes might use prediction intervals, the amount of dimension reduction, and standard errors if available.
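A minimal sketch of this type of procedure for multiple linear regression, written in Python with numpy and scikit-learn. The use of LassoCV with 5-fold cross validation, measuring importance by absolute lasso coefficients, and the $C_p(I) \leq 2k$ cutoff are illustrative assumptions, not necessarily the exact choices intended in the text.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_then_ols(X, y):
    """Pick a submodel I with lasso, refit it with OLS, and keep it if Cp(I) <= 2k."""
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # unit sample variance predictors
    lasso = LassoCV(cv=5).fit(Z, y)
    active = np.flatnonzero(lasso.coef_)               # predictors kept by lasso
    k = active.size + 1                                # number of predictors including the constant

    def ols_sse(Zsub):
        A = np.column_stack([np.ones(n), Zsub])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return resid @ resid, beta

    sse_full, beta_full = ols_sse(Z)                   # full OLS model with all p predictors
    mse_full = sse_full / (n - p - 1)
    sse_I, beta_I = ols_sse(Z[:, active])              # OLS applied to submodel I
    cp_I = sse_I / mse_full - n + 2 * k                # Mallows's Cp for submodel I
    if cp_I <= 2 * k:
        return active, beta_I                          # use submodel I
    return np.arange(p), beta_full                     # otherwise use the full OLS model
```

Here the active set plays the role of model $I$, and the returned coefficients are the OLS refit rather than the shrunken lasso fit.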
If the above procedure is used, then model selection estimators, such as forward selection and lasso variable selection with the $C_p$ criterion, produce predictions that are similar to those of the OLS full model if $n \gg p$. Other model selection criteria, such as $k$-fold cross validation, tend to behave like $C_p$ in low dimensions, but getting bounds like Equation (4) may be difficult. Empirically, variable selection estimators and model selection estimators often do not select the full model. Equation (4) suggests that “weak” predictors will often be omitted, as long as $\mbox{cor}(\boldsymbol{r}, \boldsymbol{r}_I)$ stays high. (If the predictors are not orthogonal, “weak” might mean the predictor is not very useful given that the other predictors are in the model.)
It is common in the model selection literature to assume, for the full model, that there is a model $S$ such that $\beta_i \neq 0$ for $i \in S$, and $\beta_i = 0$ for $i \notin S$. Then model $I$ underfits unless $S \subseteq I$. If $S \not\subseteq I$, then an “important” predictor has been left out of the model. Then, under model $I$, $\mbox{cor}(\boldsymbol{r}, \boldsymbol{r}_I)$ will not converge to 1 as $n \to \infty$, and for large enough $n$, $C_p(I)$ will be large. Thus $P(S \subseteq I_{min}) \to 1$ as $n \to \infty$. Hence $P(S \subseteq I_{min}) \to 1$ as $n \to \infty$ when $I_{min}$ corresponds to the set of predictors selected by a variable selection method such as forward selection or lasso variable selection. Thus the probability that the model selection estimator underfits goes to zero as $n \to \infty$ if $p$ is fixed, the full model is one of the models considered, and the $C_p$ criterion is used, as noted by [7].
For real data, an important question in variable selection is whether $\beta_i = 0$ for $i \notin S$ is a reasonable assumption. If $\boldsymbol{X}$ has full rank $p$, then having $\beta_i$ equal to zero for 20 decimal places may not be reasonable. See, for example, [8,9,10]. Then the probability that the variable selection estimator chooses the full model goes to one if the probability of underfitting goes to 0 as $n \to \infty$. If $\hat{\boldsymbol{\beta}}_I$ is $k \times 1$, use zero padding to form the $p \times 1$ vector $\hat{\boldsymbol{\beta}}_{I,0}$ from $\hat{\boldsymbol{\beta}}_I$ by adding 0s corresponding to the omitted variables.
For example, suppose $p = 4$, where $x_1$ corresponds to the constant, and suppose only $x_1$ and $x_2$ are active, so $S = \{1, 2\}$ and $\beta_3 = \beta_4 = 0$. This population model has two active predictors. Then the $2^{p-1} = 8$ possible subsets of $\{1, 2, 3, 4\}$ that contain the constant are $\{1\}$, $\{1,2\}$, $\{1,3\}$, $\{1,4\}$, $\{1,2,3\}$, $\{1,2,4\}$, $\{1,3,4\}$, and $\{1,2,3,4\}$. There are four subsets, $\{1,2\}$, $\{1,2,3\}$, $\{1,2,4\}$, and $\{1,2,3,4\}$, such that $S \subseteq I$. Let $\hat{\boldsymbol{\beta}}_{I_j}$ be the estimator applied to the predictors in subset $I_j$, and let $\hat{\boldsymbol{\beta}}_{I_j,0}$ be the corresponding zero padded estimator. If the variable selection method chooses model $I_{min} = I_j$, then the observed variable selection estimator is $\hat{\boldsymbol{\beta}}_{VS} = \hat{\boldsymbol{\beta}}_{I_j,0}$. As a statistic, $\hat{\boldsymbol{\beta}}_{VS} = \hat{\boldsymbol{\beta}}_{I_j,0}$ with probabilities $\pi_{jn} = P(I_{min} = I_j)$ for $j = 1, \dots, J$, where there are $J$ subsets, e.g., $J = 2^{p-1} = 8$. Theory for the variable selection estimator $\hat{\boldsymbol{\beta}}_{VS}$ is complicated. See [7] for models such as multiple linear regression and GLMs, and [11] for Cox proportional hazards regression.
2.2. Sparse Fitted Models
A fitted or population regression model is sparse if $a$ of the predictors are active (have nonzero $\hat{\beta}_i$ or $\beta_i$), where the number of active predictors $a$ is small compared with the sample size $n$. Otherwise, the model is nonsparse. A high-dimensional population full regression model is abundant or dense if the regression information is spread out among the $p$ predictors (nearly all of the predictors are active). Hence an abundant model is a nonsparse model. Under the above definitions, most classical low-dimensional models use sparse fitted models, and statisticians have over one hundred years of experience with such models.
The literature for high-dimensional sparse regression models often assumes that (i) the population model is sparse, that (ii) the fitted model $I$ is sparse, where $I$ uses $k$ predictors including a constant, and that (iii) $P(S \subseteq I) \to 1$ as $n \to \infty$. When these assumptions hold, the population model is sparse, the fitted model is sparse, and the bound in Equation (3) depends on $k$ rather than $p$, which can be small. Getting rid of assumption (i) and the assumption that $P(S \subseteq I) \to 1$ greatly increases the applicability of variable selection estimators, such as forward selection, lasso, and the elastic net, for high-dimensional data, even if $p$ is huge. As argued in the following paragraphs, the sparse fitted model often fits the data well, and often the sparse fitted model gives a good estimator of the quantities needed for prediction.
A sparse fitted model transforms a high-dimensional problem into a low-dimensional problem, and the sparse fitted model can be checked with the goodness of fit diagnostics available for that low-dimensional model. If the predictors used by the sparse fitted regression model are $\boldsymbol{x}_I$, and if the regression model depends on the predictors only through the sufficient predictor $SP = \boldsymbol{x}_I^T \boldsymbol{\beta}_I$, then a useful diagnostic is the response plot of the estimated sufficient predictor $ESP = \boldsymbol{x}_I^T \hat{\boldsymbol{\beta}}_I$ versus the response $Y$ on the vertical axis. If there is goodness of fit, then the sparse fitted model tends to be useful regardless of whether the population model is sparse or nonsparse. Data splitting may be needed for valid inference such as hypothesis testing.
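As a concrete illustration of the response plot for a multiple linear regression fit, the following Python sketch (assuming numpy and matplotlib, with a hypothetical index set idx of selected predictors) plots the ESP against $Y$ with the identity line added; for a good fit, the points should scatter about that line.

```python
import numpy as np
import matplotlib.pyplot as plt

def response_plot(X, y, idx):
    """Fit OLS of y on a constant and the selected predictors X[:, idx],
    then plot ESP versus y as a goodness of fit diagnostic."""
    A = np.column_stack([np.ones(len(y)), X[:, idx]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    esp = A @ beta                                   # estimated sufficient predictor
    plt.scatter(esp, y, s=10)
    lims = [min(esp.min(), y.min()), max(esp.max(), y.max())]
    plt.plot(lims, lims, "k--")                      # identity line
    plt.xlabel("ESP")
    plt.ylabel("Y")
    plt.show()
```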
Suppose the cases $(\boldsymbol{x}_i^T, Y_i)^T$ are iid for $i = 1, \dots, n$. Then $Y_1, \dots, Y_n$ are iid, resulting in a valid sparse fitted model regardless of whether the population model is sparse or nonsparse. This null model omits all of the predictors. For high-dimensional data, a reasonable goal is to find a model that greatly outperforms the null model.
The sparse fitted model using a small number of predictors is often useful when there are one or more strong predictors. The following [12] theorem gives two more situations where a sparse fitted model can greatly outperform the null model. The population models in Theorem 1 can be sparse or nonsparse. The high-dimensional multiple linear regression literature often assumes that the cases are iid from a multivariate normal distribution, and that the population model is sparse. Let $\boldsymbol{\beta} = \boldsymbol{\Sigma}_{\boldsymbol{x}}^{-1} \boldsymbol{\Sigma}_{\boldsymbol{x}Y}$. For multiple linear regression, note that the population model is nonsparse unless most of the entries of $\boldsymbol{\Sigma}_{\boldsymbol{x}}^{-1} \boldsymbol{\Sigma}_{\boldsymbol{x}Y}$ are zero.
Theorem 1. Suppose the cases are iid from some distribution.
(a) If the joint distribution of $(Y, \boldsymbol{x}^T)^T$ is multivariate normal, then $Y \mid \boldsymbol{x}$ follows a multiple linear regression model, but so does $Y \mid \boldsymbol{x}_I$ for any subset $\boldsymbol{x}_I$ of the predictors, where $\boldsymbol{\beta}_I = \boldsymbol{\Sigma}_{\boldsymbol{x}_I}^{-1} \boldsymbol{\Sigma}_{\boldsymbol{x}_I Y}$, $\alpha_I = \mu_Y - \boldsymbol{\mu}_{\boldsymbol{x}_I}^T \boldsymbol{\beta}_I$, and $\sigma_I^2 = \mbox{Var}(Y) - \boldsymbol{\beta}_I^T \boldsymbol{\Sigma}_{\boldsymbol{x}_I Y}$. (b) If the response $Y$ is binary, then $Y \mid \boldsymbol{x}^T \boldsymbol{b} \sim \mbox{binomial}(1, \rho(\boldsymbol{x}^T \boldsymbol{b}))$ where $\rho(\boldsymbol{x}^T \boldsymbol{b}) = P(Y = 1 \mid \boldsymbol{x}^T \boldsymbol{b})$. Hence every linear combination of the predictors satisfies a binary regression model.
2.3. PCA-PLS
Another technique is to use PCA for dimension reduction. Let $W_1, \dots, W_p$ be the PCA linear combinations ($W_i = \hat{\boldsymbol{e}}_i^T \boldsymbol{x}$, where $\hat{\boldsymbol{e}}_i$ is the $i$th eigenvector of the sample covariance or correlation matrix of the predictors) ordered with respect to the largest eigenvalues. Then use $W_1, \dots, W_k$ in the regression or classification model, where $k$ is chosen in some manner. This method can be used for models with $m$ response variables $Y_1, \dots, Y_m$. See, for example, [13,14,15,16].
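A minimal Python sketch of this kind of PCA dimension reduction for a regression model, assuming scikit-learn; standardizing the predictors first and the particular choice of $k$ are assumptions left to the user.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def pca_regression(X, y, k):
    """Regress y on the first k PCA linear combinations W_1, ..., W_k."""
    Z = StandardScaler().fit_transform(X)       # standardize the predictors
    W = PCA(n_components=k).fit_transform(Z)    # components ordered by largest eigenvalues
    return LinearRegression().fit(W, y)         # OLS on the k components and a constant
```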
Consider a low- or high-dimensional regression or classification method with a univariate response variable $Y$. Let $W_{(1)}, \dots, W_{(p)}$ be the linear combinations ordered with respect to the highest squared correlations $\mbox{cor}^2(W_{(i)}, Y)$, where the squared correlations are estimated by the sample correlations. From a model selection viewpoint, using $W_{(1)}, \dots, W_{(k)}$ should work much better than using the linear combinations ordered by the largest eigenvalues. Also, the PLS components should be used instead of the PCA components, since the PLS components are chosen to be fairly highly correlated with $Y$. See Equation (2). Ref. [17] (pp. 71–72) shows that an equivalent way to compute the $k$-component PLS estimator is to maximize a sum of squared covariances between $Y$ and the components under some constraints. If the predictors are standardized to have unit sample variance, then this method becomes a correlation optimization problem. Ref. [18] uses the PLS components as predictors for nonlinear regression, but the above model selection viewpoint is new.
From canonical correlation analysis (CCA), if the cases are iid, then CCA maximizes $\mbox{cor}(Y, \boldsymbol{x}^T \boldsymbol{b})$ over $\boldsymbol{b}$. This optimization problem is equivalent to maximizing $\mbox{cor}^2(Y, \boldsymbol{x}^T \boldsymbol{b})$, which has a maximum at $\boldsymbol{b} \propto \boldsymbol{\Sigma}_{\boldsymbol{x}}^{-1} \boldsymbol{\Sigma}_{\boldsymbol{x}Y}$. See [19] (pp. 168, 282). Hence PLS is a lot like CCA for a univariate response, but with more constraints, and PLS can be computed in high dimensions. From the dimension reduction literature, if $Y$ depends on $\boldsymbol{x}$ only through $\boldsymbol{x}^T \boldsymbol{\beta}$, then under the assumption of “linearly related predictors,” $\hat{\boldsymbol{\beta}}_{OLS}$ estimates $c \boldsymbol{\beta}$ for some constant $c$, which is often nonzero. See, for example, [20] (p. 432).
The above results suggest computing lasso for multiple linear regression, finding the number of predictors $k$ chosen by lasso, and taking the $k$ linear combinations $W_{(1)}, \dots, W_{(k)}$. An SC scree plot of $i$ versus $\mbox{cor}^2(W_{(i)}, Y)$ behaves like a scree plot of $i$ versus the eigenvalues. Hence quantities like the proportion $\sum_{i=1}^{k} \mbox{cor}^2(W_{(i)}, Y) / \sum_{i=1}^{p} \mbox{cor}^2(W_{(i)}, Y)$ are of interest for choosing $k$, and scree plot techniques could be adapted to choose $k$. Many other possibilities exist, and there are many possibilities for models with $m$ response variables $Y_1, \dots, Y_m$. See [18] for some ideas.
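A minimal Python sketch of an SC scree plot, assuming the candidate linear combinations (for example, PLS or PCA components) are the columns of a matrix W; the squared sample correlations with $Y$ are plotted in decreasing order, so the plot can be read like an eigenvalue scree plot when choosing $k$.

```python
import numpy as np
import matplotlib.pyplot as plt

def sc_scree_plot(W, y):
    """Plot i versus the ordered squared correlations cor^2(W_(i), Y)."""
    sq_corr = np.array([np.corrcoef(W[:, j], y)[0, 1] ** 2 for j in range(W.shape[1])])
    order = np.argsort(sq_corr)[::-1]            # highest squared correlation first
    plt.plot(np.arange(1, len(order) + 1), sq_corr[order], "o-")
    plt.xlabel("i")
    plt.ylabel("squared correlation with Y")
    plt.show()
    return order                                 # ordering of the linear combinations
```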
Another useful technique is to eliminate weak predictors before finding the linear combinations. By Equation (3), the sample covariance matrix of the predictors may not be close to the population covariance matrix in high dimensions, e.g., if $p > n$. For example, the sample eigenvectors $\hat{\boldsymbol{e}}_i$ tend to be poor estimators of the population eigenvectors $\boldsymbol{e}_i$ of $\boldsymbol{\Sigma}_{\boldsymbol{x}}$. An exception is when the correlation $\mbox{Cor}(\hat{\boldsymbol{e}}_i^T \boldsymbol{x}, \boldsymbol{e}_i^T \boldsymbol{x})$ is close to 1 for the leading eigenvectors. See [21]. One possibility is to take the $j$ predictors with the highest squared correlations with $Y$. The SC scree plot is useful. Then do lasso (meant for the multiple linear regression model) to further reduce the number of predictors. Here $j$ should be proportional to $n$, for example, $j = \lceil n/J \rceil$ for a small positive integer $J$. When $n$ is small, spurious correlations can be a problem as follows: if the actual correlation is near 0, the sample size $n$ may need to be large before the sample correlation is near 0. For more on the importance of eliminating weak predictors and high-dimensional variable selection, see, for example, [22,23,24,25,26].
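A minimal Python sketch of this screening step, assuming $j = \lceil n/J \rceil$ with a hypothetical default of $J = 10$ and scikit-learn's LassoCV for the second stage.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def screen_then_lasso(X, y, J=10):
    """Keep the j = ceil(n/J) predictors most correlated with y, then run lasso on them."""
    n, p = X.shape
    j = min(p, int(np.ceil(n / J)))
    sq_corr = np.array([np.corrcoef(X[:, m], y)[0, 1] ** 2 for m in range(p)])
    keep = np.argsort(sq_corr)[::-1][:j]         # j predictors with the highest squared correlations
    lasso = LassoCV(cv=5).fit(X[:, keep], y)     # lasso further reduces the predictor set
    active = keep[np.flatnonzero(lasso.coef_)]
    return active, lasso
```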
2.4. Stack Low-Dimensional Estimators into a Vector
Another technique is to stack low-dimensional estimators into a vector: for example, sample means, sample variances, and elements from an estimated covariance matrix such as $\hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}Y}$. Using $\hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}Y}$ can give information about a multivariate regression. Let $F$ denote the model that uses the full stacked vector, and let $I$ denote the model that uses a subvector of the stacked vector. Then the sample covariance matrix of the full stacked vector is easy to compute in high dimensions but is singular if its dimension exceeds $n$, while the sample covariance matrix of a low-dimensional subvector of $k$ elements is easy to compute in high dimensions and is nonsingular if $n \geq Jk$ with $J \geq 10$. Often, the theory for $F$ uses the full stacked vector while the theory for $I$ uses the subvector. (Values of $J$ much larger than 10 may be needed if some of the variables are skewed.)
Then the following simple testing method reduces a possibly high-dimensional problem to a low-dimensional problem. Consider testing $H_0: \boldsymbol{A}\boldsymbol{\theta} = \boldsymbol{0}$ versus $H_1: \boldsymbol{A}\boldsymbol{\theta} \neq \boldsymbol{0}$, where $\boldsymbol{A}$ is a constant matrix and the hypothesis test is equivalent to testing $H_0: \boldsymbol{B}\boldsymbol{\eta} = \boldsymbol{0}$ versus $H_1: \boldsymbol{B}\boldsymbol{\eta} \neq \boldsymbol{0}$, where $\boldsymbol{B}$ is a constant matrix. For example, tests such as $H_0: \theta_i = 0$ or $H_0: \theta_i - \theta_j = 0$ are often of interest.
The marginal maximum likelihood estimator (MMLE) and one component partial least squares (OPLS) estimators stack low-dimensional estimators into a vector. In low dimensions, the OLS estimators are $\hat{\boldsymbol{\beta}}_{OLS} = \hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}}^{-1} \hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}Y}$ and $\hat{\alpha}_{OLS} = \overline{Y} - \hat{\boldsymbol{\beta}}_{OLS}^T \overline{\boldsymbol{x}}$. For a multiple linear regression model with iid cases, $\hat{\boldsymbol{\beta}}_{OLS}$ is a consistent estimator of $\boldsymbol{\beta}_{OLS} = \boldsymbol{\Sigma}_{\boldsymbol{x}}^{-1} \boldsymbol{\Sigma}_{\boldsymbol{x}Y}$ under mild regularity conditions, while $\hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}Y}$ is a consistent estimator of $\boldsymbol{\Sigma}_{\boldsymbol{x}Y}$.
Refs. [27,28] showed that $\hat{\boldsymbol{\eta}}_{OPLS} = \hat{\lambda} \hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}Y}$ estimates $\boldsymbol{\eta}_{OPLS} = \lambda \boldsymbol{\Sigma}_{\boldsymbol{x}Y}$, where
$$\lambda = \frac{\boldsymbol{\Sigma}_{\boldsymbol{x}Y}^T \boldsymbol{\Sigma}_{\boldsymbol{x}Y}}{\boldsymbol{\Sigma}_{\boldsymbol{x}Y}^T \boldsymbol{\Sigma}_{\boldsymbol{x}} \boldsymbol{\Sigma}_{\boldsymbol{x}Y}}$$
and $\lambda \neq 0$ for $\boldsymbol{\Sigma}_{\boldsymbol{x}Y} \neq \boldsymbol{0}$. If $\boldsymbol{\Sigma}_{\boldsymbol{x}Y} = \boldsymbol{0}$, then $\boldsymbol{\beta}_{OLS} = \boldsymbol{\eta}_{OPLS} = \boldsymbol{0}$. Let $\boldsymbol{\eta} = \boldsymbol{\Sigma}_{\boldsymbol{x}Y}$. Testing $H_0: \boldsymbol{A}\boldsymbol{\eta}_{OPLS} = \boldsymbol{0}$ versus $H_1: \boldsymbol{A}\boldsymbol{\eta}_{OPLS} \neq \boldsymbol{0}$ is equivalent to testing $H_0: \boldsymbol{A}\boldsymbol{\eta} = \boldsymbol{0}$ versus $H_1: \boldsymbol{A}\boldsymbol{\eta} \neq \boldsymbol{0}$, where $\boldsymbol{A}$ is a $g \times p$ constant matrix.
The marginal maximum likelihood estimator (marginal least squares estimator) is due to [22,23]. This estimator computes the marginal regression of $Y$ on $x_i$, resulting in the estimator $(\hat{\alpha}_{i,M}, \hat{\beta}_{i,M})$ for $i = 1, \dots, p$. Then $\hat{\boldsymbol{\eta}}_{MMLE} = (\hat{\beta}_{1,M}, \dots, \hat{\beta}_{p,M})^T$. For multiple linear regression, the marginal estimators are the simple linear regression (SLR) estimators, and $\hat{\beta}_{i,M} = \widehat{\mbox{Cov}}(x_i, Y)/\widehat{\mbox{Var}}(x_i)$. Hence $\hat{\boldsymbol{\eta}}_{MMLE} = [\mbox{diag}(\hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}})]^{-1} \hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}Y}$. If the $w_i$ are the predictors standardized to have unit sample variances, then $\hat{\boldsymbol{\eta}}_{MMLE}(\boldsymbol{w}) = \boldsymbol{I}^{-1} \hat{\boldsymbol{\Sigma}}_{\boldsymbol{w}Y} = \hat{\boldsymbol{\Sigma}}_{\boldsymbol{w}Y}$, where $(\boldsymbol{w})$ denotes that $Y$ was regressed on $\boldsymbol{w}$, and $\boldsymbol{I}$ is the $p \times p$ identity matrix. Hence the SC scree plot is closely related to the MMLE for multiple linear regression with standardized predictors.
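A minimal numpy sketch of these stacked estimators: $\hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}Y}$, the OPLS estimator $\hat{\lambda}\hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}Y}$, and the MMLE slopes for multiple linear regression, following the formulas above; the function name and return layout are illustrative.

```python
import numpy as np

def stacked_estimators(X, y):
    """Compute Sigma_xY hat, the OPLS estimator lambda_hat * Sigma_xY hat, and the MMLE slopes.
    Assumes X has at least two columns."""
    n = len(y)
    xc = X - X.mean(axis=0)
    yc = y - y.mean()
    sigma_xy = xc.T @ yc / (n - 1)               # stacked covariances Cov(x_i, Y)
    sigma_x = np.cov(X, rowvar=False)            # sample covariance matrix of the predictors
    lam = (sigma_xy @ sigma_xy) / (sigma_xy @ sigma_x @ sigma_xy)
    eta_opls = lam * sigma_xy                    # one component PLS coefficient vector
    mmle = sigma_xy / np.var(X, axis=0, ddof=1)  # SLR slopes Cov(x_i, Y)/Var(x_i)
    return sigma_xy, eta_opls, mmle
```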
Consider a subset of $k$ distinct elements from a covariance matrix $\boldsymbol{\Sigma} = (\sigma_{ij})$. Stack the elements into a vector, and let each vector have the same ordering. For example, the largest subset of distinct elements corresponds to $\mbox{vech}(\boldsymbol{\Sigma})$, the $p(p+1)/2 \times 1$ vector of the diagonal and below-diagonal elements of $\boldsymbol{\Sigma}$. For random variables $x_1, \dots, x_p$, use notation such as $\overline{x}_i$ for the sample mean of $x_i$, $\hat{\sigma}_{ij}$ for the sample covariance of $x_i$ and $x_j$, and $\sigma_{ij} = \mbox{Cov}(x_i, x_j)$. Let $\hat{\boldsymbol{c}}$ be the vector of the stacked $\hat{\sigma}_{ij}$ and $\boldsymbol{c}$ the vector of the stacked $\sigma_{ij}$. For general vectors of elements, the ordering of the vectors will all be the same, and the vectors will be denoted by symbols such as $\boldsymbol{c}$ and $\hat{\boldsymbol{c}}$.
Ref. [29] proved that $\sqrt{n}(\hat{\boldsymbol{c}} - \boldsymbol{c}) \xrightarrow{D} N_k(\boldsymbol{0}, \boldsymbol{\Sigma}_{\boldsymbol{c}})$ if $\boldsymbol{c}$ is a $k \times 1$ vector. The theorem may be a special case of the [30] theory for the multivariate linear regression estimator when there are no predictors. Ref. [31] also gave similar large sample theory, for example, for $\mbox{vech}(\boldsymbol{S})$, but the proof in [29] and the estimator of the asymptotic covariance matrix are much simpler. Also see [32].
The following [12] large sample theory for $\hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}Y}$ is also a special case. Let $\boldsymbol{\eta} = \boldsymbol{\Sigma}_{\boldsymbol{x}Y}$, and let $\hat{\boldsymbol{\eta}}$ and $\boldsymbol{\Sigma}_{\boldsymbol{z}}$ be defined below, where $\boldsymbol{z}_i = (\boldsymbol{x}_i - \boldsymbol{\mu}_{\boldsymbol{x}})(Y_i - \mu_Y)$. Then low-order moments are needed for $\hat{\boldsymbol{\Sigma}}_{\boldsymbol{z}}$ to be a consistent estimator of $\boldsymbol{\Sigma}_{\boldsymbol{z}}$.
Theorem 2. Assume the cases $(\boldsymbol{x}_i^T, Y_i)^T$ are iid. Assume the expectations $E(x_{ij}^2)$, $E(Y_i^2)$, and $E(x_{ij}^2 Y_i^2)$ exist for $j = 1, \dots, p$. Let $\boldsymbol{\mu}_{\boldsymbol{x}} = E(\boldsymbol{x})$ and $\mu_Y = E(Y)$. Let $\boldsymbol{z}_i = (\boldsymbol{x}_i - \boldsymbol{\mu}_{\boldsymbol{x}})(Y_i - \mu_Y)$ with sample mean $\overline{\boldsymbol{z}}_n$. Let $\boldsymbol{\eta} = \boldsymbol{\Sigma}_{\boldsymbol{x}Y}$. Then (a) $\sqrt{n}(\overline{\boldsymbol{z}}_n - \boldsymbol{\eta}) \xrightarrow{D} N_p(\boldsymbol{0}, \boldsymbol{\Sigma}_{\boldsymbol{z}})$. (b) Let $\hat{\boldsymbol{\eta}} = \hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}Y} = \frac{1}{n-1} \sum_{i=1}^n (\boldsymbol{x}_i - \overline{\boldsymbol{x}})(Y_i - \overline{Y})$ and $\boldsymbol{\Sigma}_{\boldsymbol{z}} = \mbox{Cov}(\boldsymbol{z}_i)$. Then $\sqrt{n}(\hat{\boldsymbol{\eta}} - \boldsymbol{\eta}) \xrightarrow{D} N_p(\boldsymbol{0}, \boldsymbol{\Sigma}_{\boldsymbol{z}})$.
(c) Let $\boldsymbol{A}$ be a full rank $g \times p$ constant matrix with $g \leq p$, assume $H_0: \boldsymbol{A}\boldsymbol{\eta} = \boldsymbol{0}$ is true, and assume $\hat{\boldsymbol{\Sigma}}_{\boldsymbol{z}} \xrightarrow{P} \boldsymbol{\Sigma}_{\boldsymbol{z}}$. Then $n(\boldsymbol{A}\hat{\boldsymbol{\eta}})^T [\boldsymbol{A}\hat{\boldsymbol{\Sigma}}_{\boldsymbol{z}}\boldsymbol{A}^T]^{-1} \boldsymbol{A}\hat{\boldsymbol{\eta}} \xrightarrow{D} \chi^2_g$. This method of hypothesis testing does not depend on whether the population model is sparse or abundant, and it does not need data splitting for valid inference. Data splitting with sparse fitted models can also be used for high-dimensional hypothesis testing. See, for example, [12]. Ref. [29] also provides the theory for the OPLS estimator and MMLE for multiple linear regression where heterogeneity is possible and where the predictors may have been standardized to have unit sample variances.
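A minimal numpy/scipy sketch of the Wald-type test of $H_0: \boldsymbol{A}\boldsymbol{\Sigma}_{\boldsymbol{x}Y} = \boldsymbol{0}$ suggested by Theorem 2 (c), with $\boldsymbol{\Sigma}_{\boldsymbol{z}}$ estimated by the sample covariance matrix of the $\boldsymbol{z}_i$; the particular estimator of $\boldsymbol{\Sigma}_{\boldsymbol{z}}$ and the function interface are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

def test_A_sigma_xy(X, y, A, alpha=0.05):
    """Wald-type test of H0: A Sigma_xY = 0 using n (A eta_hat)' [A Sz_hat A']^{-1} (A eta_hat)."""
    n = len(y)
    z = (X - X.mean(axis=0)) * (y - y.mean())[:, None]   # z_i = (x_i - xbar)(Y_i - ybar)
    eta_hat = z.mean(axis=0)                              # estimates Sigma_xY
    Sz_hat = np.cov(z, rowvar=False)                      # estimates Sigma_z
    Aeta = A @ eta_hat
    stat = n * Aeta @ np.linalg.solve(A @ Sz_hat @ A.T, Aeta)
    g = A.shape[0]
    pval = chi2.sf(stat, df=g)                            # chi^2_g reference distribution
    return stat, pval, pval < alpha
```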
The MMLE for multiple linear regression is often used for variable selection if the predictors have been standardized to have unit sample variances. Then the $k$ predictors with the largest $|\hat{\beta}_{i,M}|$ correspond to the $k$ predictors with the largest squared correlations with $Y$. Hence this method can be used to select $k$ in Section 2.4.
2.5. Alternative Dispersion Estimators
Let $\hat{\boldsymbol{D}}$ be a symmetric positive semi-definite dispersion matrix such as a sample covariance matrix $\boldsymbol{S}$ or a sample correlation matrix $\boldsymbol{R}$. When $\hat{\boldsymbol{D}}$ is singular or ill conditioned, some common techniques are to replace $\hat{\boldsymbol{D}}$ with a symmetric positive definite matrix such as $\hat{\boldsymbol{D}} + \delta \boldsymbol{I}_p$, where the constant $\delta > 0$, or $\mbox{diag}(\hat{\boldsymbol{D}})$. Regularized estimators are also used.
For $p > n$, a simple way to regularize a $p \times p$ correlation matrix $\boldsymbol{R}$ is to use
$$\boldsymbol{R}_\delta = \frac{\boldsymbol{R} + \delta \boldsymbol{I}_p}{1 + \delta},$$
where $\delta > 0$ and the $(i,j)$th element of $\boldsymbol{R}_\delta$ is $r_{ij}/(1+\delta)$ for $i \neq j$. Note that each correlation $r_{ij}$ is divided by the same factor $1 + \delta$. If $\lambda_i$ is the $i$th eigenvalue of $\boldsymbol{R}$, then $(\lambda_i + \delta)/(1 + \delta)$ is the $i$th eigenvalue of $\boldsymbol{R}_\delta$. The eigenvectors of $\boldsymbol{R}$ and $\boldsymbol{R}_\delta$ are the same since if $\boldsymbol{R}\boldsymbol{e}_i = \lambda_i \boldsymbol{e}_i$, then $\boldsymbol{R}_\delta \boldsymbol{e}_i = [(\lambda_i + \delta)/(1 + \delta)] \boldsymbol{e}_i$. Note that $\boldsymbol{R}_\delta = (1 - \gamma)\boldsymbol{R} + \gamma \boldsymbol{I}_p$, where $\gamma = \delta/(1 + \delta)$. See [33,34].
Following [35], the condition number of a symmetric positive definite $p \times p$ matrix $\boldsymbol{A}$ is $\mbox{cond}(\boldsymbol{A}) = \lambda_1(\boldsymbol{A})/\lambda_p(\boldsymbol{A})$, where $\lambda_1(\boldsymbol{A}) \geq \cdots \geq \lambda_p(\boldsymbol{A}) > 0$ are the eigenvalues of $\boldsymbol{A}$. Note that $\mbox{cond}(\boldsymbol{A}) \geq 1$. A well conditioned matrix has condition number $\mbox{cond}(\boldsymbol{A}) \leq c$ for some number $c$ such as 50 or 500. Hence $\boldsymbol{R}_\delta$ is nonsingular for $\delta > 0$, and well conditioned if $\delta \geq (\lambda_1 - c\lambda_p)/(c - 1)$, or if $\delta \geq \lambda_1/(c - 1)$ when $\lambda_p = 0$. Taking the smallest such $\delta$ suggests using
$$\delta = \max\left(0, \frac{\lambda_1 - c\lambda_p}{c - 1}\right).$$
The matrix can be further regularized by setting the $(i,j)$th element of $\boldsymbol{R}_\delta$ to 0 if its absolute value is less than $\tau$, where $\tau$ should be less than 0.5. Denote the resulting matrix by $\boldsymbol{R}_{\delta,\tau}$. We suggest using a small value of $\tau$. Note that $\boldsymbol{R}_{\delta,0} = \boldsymbol{R}_\delta$. Using $\tau > 0$ is known as thresholding. We recommend computing $\boldsymbol{R}_\delta$ and $\boldsymbol{R}_{\delta,\tau}$ for $c =$ 50, 100, 200, 300, 400, and 500. Compute $\boldsymbol{R}$ if it is nonsingular. Note that a regularized covariance matrix can be found using $\boldsymbol{S}_\delta = \boldsymbol{D}\boldsymbol{R}_\delta\boldsymbol{D}$, where $\boldsymbol{D} = \mbox{diag}(\hat{\sigma}_1, \dots, \hat{\sigma}_p)$ and $\hat{\sigma}_i$ is the sample standard deviation of the $i$th variable.
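A minimal numpy sketch of this regularization, assuming $\boldsymbol{R}_\delta = (\boldsymbol{R} + \delta \boldsymbol{I}_p)/(1 + \delta)$ with $\delta$ chosen as the smallest value giving condition number at most $c$.

```python
import numpy as np

def regularize_corr(R, c=50.0):
    """Return R_delta = (R + delta I)/(1 + delta) with condition number at most c."""
    p = R.shape[0]
    lam = np.linalg.eigvalsh(R)                       # eigenvalues of R in ascending order
    lam1, lamp = lam[-1], max(lam[0], 0.0)            # largest and smallest eigenvalue (clipped at 0)
    delta = max(0.0, (lam1 - c * lamp) / (c - 1.0))   # smallest delta with cond(R_delta) <= c
    R_delta = (R + delta * np.eye(p)) / (1.0 + delta)
    return R_delta, delta
```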
A common type of regularization of a covariance matrix $\boldsymbol{S}$ is to use $\mbox{diag}(\boldsymbol{S})$, where the $(i,j)$th element of $\mbox{diag}(\boldsymbol{S})$ is $s_{ii}$ if $i = j$ and 0 otherwise. The corresponding correlation matrix is the identity matrix, and Mahalanobis distances using the identity matrix correspond to Euclidean distances. These estimators tend to use too much regularization, and underfit. Note that as $\delta \to \infty$, $\boldsymbol{R}_\delta \to \boldsymbol{I}_p$, and $\boldsymbol{I}_p$ has condition number 1. Note that $\mbox{diag}(\boldsymbol{S})$ corresponds to using $\delta = \infty$ in Equation (9).
For the population correlation matrix $\boldsymbol{\rho}$ and the population precision matrix $\boldsymbol{\Sigma}^{-1}$, the literature often claims that most of the population correlations $\rho_{ij}$ are zero, so that the population matrix is sparse, and that the regularized estimator is a good estimator of the population matrix. Assume that the regularized estimator estimates a population dispersion matrix $\boldsymbol{D}$. Note that this assumption always holds when $\boldsymbol{D}$ is defined to be the population analog of the regularized estimator. Note that $\boldsymbol{R}_\delta$ estimates $\boldsymbol{\rho}_\delta = (\boldsymbol{\rho} + \delta \boldsymbol{I}_p)/(1 + \delta)$ since $\boldsymbol{R}$ estimates $\boldsymbol{\rho}$, where $\delta$ is fixed. However, by Equation (3), the estimator $\boldsymbol{R}$ tends to not be good in high dimensions.
Consider testing $H_0: \boldsymbol{\theta} = \boldsymbol{\theta}_0$ versus $H_1: \boldsymbol{\theta} \neq \boldsymbol{\theta}_0$, where a $g \times 1$ statistic $T_n$ satisfies $\sqrt{n}(T_n - \boldsymbol{\theta}) \xrightarrow{D} N_g(\boldsymbol{0}, \boldsymbol{\Sigma})$. If $\hat{\boldsymbol{\Sigma}} \xrightarrow{P} \boldsymbol{\Sigma}$ and $H_0$ is true, then $D_n^2 = n(T_n - \boldsymbol{\theta}_0)^T \hat{\boldsymbol{\Sigma}}^{-1} (T_n - \boldsymbol{\theta}_0) \xrightarrow{D} \chi^2_g$ as $n \to \infty$. Then a Wald-type test rejects $H_0$ if $D_n^2 > \chi^2_{g,1-\delta}$, where $P(X \leq \chi^2_{g,1-\delta}) = 1 - \delta$ if $X \sim \chi^2_g$, a chi-square distribution with $g$ degrees of freedom. Note that $D_n^2$ is a squared Mahalanobis distance.
It is common to implement a Wald-type test using $D_n^2 = n(T_n - \boldsymbol{\theta}_0)^T \hat{\boldsymbol{C}}^{-1} (T_n - \boldsymbol{\theta}_0)$, where the $g \times g$ symmetric positive definite matrix $\hat{\boldsymbol{C}} \xrightarrow{P} \boldsymbol{C} \neq \boldsymbol{\Sigma}$ as $n \to \infty$ if $H_0$ is true. Hence $\hat{\boldsymbol{C}}$ is the wrong dispersion matrix, and $D_n^2$ does not have a $\chi^2_g$ limiting distribution when $H_0$ is true. Ref. [36] showed how to bootstrap Wald tests with the wrong dispersion matrix. The bootstrap tests often became conservative as $g$ increased to $n$. For some of these tests, the $m$ out of $n$ bootstrap, which draws a sample of size $m$ without replacement from the $n$ cases, works better than the nonparametric bootstrap. Sampling without replacement is also known as subsampling and the delete-$d$ jackknife. For some methods, better high-dimensional tests are reviewed by [37].
Using a high-dimensional dispersion estimator with considerable outlier resistance is another useful technique. Let $\boldsymbol{W}$ be a data matrix, where the rows $\boldsymbol{z}_i^T$ correspond to the cases. For example, $\boldsymbol{z}_i = \boldsymbol{x}_i$ or $\boldsymbol{z}_i = (\boldsymbol{x}_i^T, Y_i)^T$. One of the simplest outlier detection methods uses the Euclidean distances $D_i$ of the $\boldsymbol{z}_i$ from the coordinatewise median $\mbox{MED}(\boldsymbol{W})$. Concentration type steps compute the weighted median $\mbox{MED}_j$: the coordinatewise median computed from the “half set” of cases $\boldsymbol{z}_i$ with $D_i \leq \mbox{MED}(D_1, \dots, D_n)$, where the distances are computed from the previous weighted median $\mbox{MED}_{j-1}$. We often use $j = 0$ (no concentration type steps) or a few concentration type steps. Let $D_i = D_i(\mbox{MED}_j)$. Let $w_i = 1$ if $D_i \leq \mbox{MED}(D_1, \dots, D_n) + k\,\mbox{MAD}(D_1, \dots, D_n)$, where the constant $k \geq 0$, and let $w_i = 0$, otherwise. Using $k \geq 0$ ensures that at least half of the cases get weight 1. This weighting corresponds to the weighting that would be used in a one-sided metrically trimmed mean (Huber-type skipped mean) of the distances. Here, the sample median absolute deviation is $\mbox{MAD}(D_1, \dots, D_n) = \mbox{MED}(|D_i - \mbox{MED}(D_1, \dots, D_n)|, i = 1, \dots, n)$, where $\mbox{MED}(D_1, \dots, D_n)$ is the sample median of $D_1, \dots, D_n$.
Let the covmb2 set $B$ of at least $n/2$ cases correspond to the cases with weight $w_i = 1$. Then the [38] (p. 120) covmb2 estimator $(T, \boldsymbol{C})$ is the sample mean and sample covariance matrix applied to the cases in set $B$. If the cases in $B$ are used, then
$$T = \frac{\sum_{i=1}^n w_i \boldsymbol{z}_i}{\sum_{i=1}^n w_i} \quad \mbox{and} \quad \boldsymbol{C} = \frac{\sum_{i=1}^n w_i (\boldsymbol{z}_i - T)(\boldsymbol{z}_i - T)^T}{\sum_{i=1}^n w_i - 1}.$$
This estimator was built for speed, applications, and outlier resistance. In low dimensions, the population dispersion matrix is the population covariance matrix of a spherically truncated distribution. In high dimensions, spherical truncation is still used, but the sample weighted median varies about the population weighted median by Equation (3).
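A minimal numpy sketch of the covmb2 weighting with no concentration type steps; the cutoff constant k = 5 used here is an illustrative default, and the weighted mean and covariance are computed from the cases in the set B.

```python
import numpy as np

def covmb2(W, k=5.0):
    """Weighted mean and covariance of the cases close (in Euclidean distance)
    to the coordinatewise median: a covmb2-type estimator with no concentration steps."""
    med = np.median(W, axis=0)                   # coordinatewise median MED(W)
    D = np.linalg.norm(W - med, axis=1)          # Euclidean distances from MED(W)
    medD = np.median(D)
    madD = np.median(np.abs(D - medD))           # sample MAD of the distances
    weights = D <= medD + k * madD               # at least half of the cases get weight 1
    B = W[weights]
    T = B.mean(axis=0)                           # sample mean of the cases in B
    C = np.cov(B, rowvar=False)                  # sample covariance of the cases in B
    return T, C, weights
```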
A useful application is to apply high- (and low-) dimensional methods to the cases that get weight 1. If the $i$th case is $\boldsymbol{z}_i = (\boldsymbol{x}_i^T, Y_i)^T$, then this application can be used if all of the variables are continuous. For a variant, let the continuous predictors from $\boldsymbol{x}_i$ be denoted by $\boldsymbol{u}_i$ for $i = 1, \dots, n$. Apply the covmb2 estimator to the $\boldsymbol{u}_i$, and then run the method on the $m$ cases corresponding to the covmb2 set $B$ indices $i_1, \dots, i_m$, where $m \geq n/2$. If the estimator has large sample theory “conditional” on the predictors $\boldsymbol{x}$, then typically the same theory applies for the “robust estimator” since the response variables were not used to select the cases in $B$. These two applications can be used for regression, classification, neural networks, et cetera.
Another method to get an outlier-resistant dispersion estimator is to use the following identity. If $X$ and $Y$ are random variables, then
$$\mbox{Cov}(X, Y) = \frac{\mbox{Var}(X + Y) - \mbox{Var}(X - Y)}{4}.$$
Then replace $\mbox{Var}(X + Y)$ and $\mbox{Var}(X - Y)$ by $\hat{\sigma}^2_R(X + Y)$ and $\hat{\sigma}^2_R(X - Y)$, where $\hat{\sigma}_R$ is a robust estimator of scale or standard deviation applied to the sums or the differences of the observed pairs. Hence the resulting robust covariance estimator is
$$\widehat{\mbox{Cov}}_R(X, Y) = \frac{\hat{\sigma}^2_R(X + Y) - \hat{\sigma}^2_R(X - Y)}{4}.$$
In low dimensions, the Olive (2017) [38] RMVN or RFCH estimator of dispersion can be used.
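A minimal numpy sketch of this identity-based estimator; the scaled median absolute deviation used for $\hat{\sigma}_R$ is one possible robust scale estimator and is an assumption, not necessarily the choice used in the text.

```python
import numpy as np

def mad_scale(u):
    """Scaled median absolute deviation: a robust estimator of the standard deviation."""
    return 1.4826 * np.median(np.abs(u - np.median(u)))

def robust_cov(x, y):
    """Outlier-resistant covariance via Cov(X, Y) = [Var(X + Y) - Var(X - Y)]/4,
    with Var replaced by a squared robust scale estimator."""
    return (mad_scale(x + y) ** 2 - mad_scale(x - y) ** 2) / 4.0
```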