Abstract
Consider regression models where the response variable Y only depends on the vector of predictors $x$ through the sufficient predictor $SP = \alpha + \beta^T x$. Let the covariance vector $\Sigma_{xY} = \text{Cov}(x, Y)$. Assume the cases $(x_i^T, Y_i)^T$ are independent and identically distributed random vectors for $i = 1, \dots, n$. Then for many such regression models, $\Sigma_{xY} = 0$ if and only if $\beta = 0$ where 0 is the vector of zeroes. The test of $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ is equivalent to the high dimensional one sample test $H_0: \mu_w = 0$ versus $H_1: \mu_w \neq 0$ applied to $w_i = (x_i - \mu_x)(Y_i - \mu_Y)$ where $\mu_w = E(w_i) = \Sigma_{xY}$ and the expected values $\mu_x = E(x)$ and $\mu_Y = E(Y)$. Since $\mu_x$ and $\mu_Y$ are unknown, the test of $H_0: \Sigma_{xY} = 0$ versus $H_1: \Sigma_{xY} \neq 0$ is implemented by applying the one sample test to $\hat{w}_i = (x_i - \overline{x})(Y_i - \overline{Y})$ for $i = 1, \dots, n$. This test has milder regularity conditions than its few competitors. For the multiple linear regression one component partial least squares and marginal maximum likelihood estimators, the test can be adapted to test $H_0: \beta_O = 0$ versus $H_1: \beta_O \neq 0$ where $\beta_O$ corresponds to a subset O of the predictors.
1. Introduction
This section reviews regression models where the response variable Y depends on the $p \times 1$ vector of predictors $x$ only through the sufficient predictor $SP = \alpha + \beta^T x$. Then there are n cases $(x_i^T, Y_i)^T$ for $i = 1, \dots, n$. For the regression models, the conditioning and subscripts, such as i, will often be suppressed. This paper gives a high dimensional test for $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ where 0 is the $p \times 1$ vector of zeroes.
A useful multiple linear regression (MLR) model is
$$Y_i = \alpha + x_{i1}\beta_1 + \dots + x_{ip}\beta_p + e_i = \alpha + x_i^T\beta + e_i \quad (1)$$
for $i = 1, \dots, n$. Assume that the $e_i$ are independent and identically distributed (iid) with expected value $E(e_i) = 0$ and variance $V(e_i) = \sigma^2$. In matrix form, this model is $Y = X\phi + e$, where $Y$ is an $n \times 1$ vector of dependent variables, $X$ is an $n \times (p+1)$ matrix with ith row $(1, x_i^T)$, $\phi = (\alpha, \beta^T)^T$ is a $(p+1) \times 1$ vector, and e is an $n \times 1$ vector of unknown errors. Also $E(e) = 0$ and $\text{Cov}(e) = \sigma^2 I_n$ where $I_n$ is the $n \times n$ identity matrix.
For a multiple linear regression model with heterogeneity, assume model (1) holds with $E(e) = 0$ and $\text{Cov}(e) = \Sigma_e$, an $n \times n$ positive definite matrix. Under regularity conditions, the ordinary least squares (OLS) estimator can be shown to be a consistent estimator of $\beta$.
For estimation with ordinary least squares, let the covariance matrix of x be $\text{Cov}(x) = \Sigma_x = E[(x - \mu_x)(x - \mu_x)^T]$ and the vector $\Sigma_{xY} = \text{Cov}(x, Y) = E[(x - \mu_x)(Y - \mu_Y)]$. Let
$$\hat{\Sigma}_x = \frac{1}{n-1}\sum_{i=1}^n (x_i - \overline{x})(x_i - \overline{x})^T \quad \text{and} \quad \hat{\Sigma}_{xY} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \overline{x})(Y_i - \overline{Y}).$$
For a multiple linear regression model with iid cases, $\hat{\beta}_{OLS} = \hat{\Sigma}_x^{-1}\hat{\Sigma}_{xY}$ is a consistent estimator of $\beta_{OLS} = \Sigma_x^{-1}\Sigma_{xY}$ under mild regularity conditions, while $\hat{\Sigma}_{xY}$ is a consistent estimator of $\Sigma_{xY}$.
Ref. [1] showed that the one component partial least squares (OPLS) estimator $\hat{\beta}_{OPLS} = \hat{\lambda}\hat{\Sigma}_{xY}$ estimates $\beta_{OPLS} = \lambda\Sigma_{xY}$ where
$$\lambda = \frac{\Sigma_{xY}^T\Sigma_{xY}}{\Sigma_{xY}^T\Sigma_x\Sigma_{xY}} \quad \text{and} \quad \hat{\lambda} = \frac{\hat{\Sigma}_{xY}^T\hat{\Sigma}_{xY}}{\hat{\Sigma}_{xY}^T\hat{\Sigma}_x\hat{\Sigma}_{xY}}$$
for $\Sigma_{xY} \neq 0$. If $\Sigma_{xY} = 0$, then $\beta_{OPLS} = 0$. Also see [2,3,4]. Ref. [5] derived the large sample theory for $\hat{\Sigma}_{xY}$ and OPLS under milder regularity conditions than those in the previous literature. Ref. [6] showed that for iid cases, these results still hold for multiple linear regression models with heterogeneity.
The marginal maximum likelihood estimator (MMLE or marginal least squares estimator) is due to [7,8]. This estimator computes the marginal regression of Y on $x_j$, such as Poisson regression, resulting in the estimator $(\hat{\alpha}_{jM}, \hat{\beta}_{jM})$ for $j = 1, \dots, p$. Then $\hat{\beta}_{MMLE} = (\hat{\beta}_{1M}, \dots, \hat{\beta}_{pM})^T$. For multiple linear regression, the marginal estimators are the simple linear regression (SLR) estimators. Hence $\hat{\beta}_{MMLE} = [\text{diag}(\hat{\Sigma}_x)]^{-1}\hat{\Sigma}_{xY}$. If the $x_j$ are the predictors that are scaled or standardized to have unit sample variances, then $\hat{\beta}_{MMLE} = \hat{\Sigma}_{xY} = (\hat{\beta}_{SLR}(Y \sim x_1), \dots, \hat{\beta}_{SLR}(Y \sim x_p))^T$ where $Y \sim t$ denotes that Y was regressed on t.
where denotes that Y was regressed on t. Ref. [6] derived large sample theory for the MMLE for multiple linear regression models, including models with heterogeneity.
For Poisson regression and related models, the response variable Y is a nonnegative count variable. A useful Poisson regression (PR) model is $Y|SP \sim \text{Poisson}(\exp(SP))$. This model has $E(Y|SP) = V(Y|SP) = \exp(SP)$. The quasi-Poisson regression model has $E(Y|SP) = \exp(SP)$ and $V(Y|SP) = \phi\exp(SP)$ where the dispersion parameter $\phi > 0$. Note that this model and the Poisson regression model have the same conditional mean function, and the conditional variance functions are the same if $\phi = 1$.
Some notation is needed for the negative binomial regression model. If Y has a (generalized) negative binomial distribution, $Y \sim NB(\mu, \kappa)$, then the probability mass function (pmf) of Y is
$$P(Y = y) = \frac{\Gamma(y + \kappa)}{\Gamma(\kappa)\,\Gamma(y + 1)}\left(\frac{\kappa}{\mu + \kappa}\right)^{\kappa}\left(\frac{\mu}{\mu + \kappa}\right)^{y}$$
for $y = 0, 1, 2, \dots$ where $\mu > 0$ and $\kappa > 0$. Then $E(Y) = \mu$ and $V(Y) = \mu + \mu^2/\kappa$.
The negative binomial regression model states that $Y_1, \dots, Y_n$ are independent random variables with
$$Y_i | SP_i \sim NB(\exp(SP_i), \kappa).$$
This model has $E(Y_i|SP_i) = \exp(SP_i)$ and $V(Y_i|SP_i) = \exp(SP_i)\left(1 + \frac{\exp(SP_i)}{\kappa}\right)$.
Following Ref. [9] (p. 560), as $\kappa \to \infty$, it can be shown that the negative binomial regression model converges to the Poisson regression model.
Let the log transformation $Z = \log(Y)$ if $Y \geq 1$ and $Z = 0$ if $Y = 0$. This transformation often results in a linear model with heterogeneity:
$$Z_i = \alpha + x_i^T\beta + e_i,$$
where the $e_i$ are independent with expected value 0 and variance $\sigma_i^2$. For Poisson regression, the minimum chi-square estimator is the weighted least squares estimator from the regression of $Z_i$ on $x_i$ with weights $w_i = Y_i$. See [9] (pp. 611–612).
If the regression model for Y depends on x only through $SP = \alpha + \beta^T x$, and if the predictors are iid from a large class of elliptically contoured distributions, then [10,11] showed that, under regularity conditions, $\beta_{OLS} = c\beta$ for some constant c. Hence $\Sigma_{xY} = \Sigma_x\beta_{OLS} = c\Sigma_x\beta$. Thus $\Sigma_{xY} = cd\beta$ if $\Sigma_x = dI_p$ where $d > 0$ and $I_p$ is the $p \times p$ identity matrix. If $c \neq 0$ in this case, then $\Sigma_{xY} = 0$ implies that $\beta = 0$. The constant c is typically nonzero unless the model has a lot of symmetry about the distribution of SP. Simulation with $\beta \neq 0$ can be difficult if the population values of c and d are unknown. Results from [12] (p. 89) suggest a rough approximation for c for the Poisson regression model. Results from [13] suggest that for binary logistic regression, a rough approximation is $\beta \approx \hat{\beta}_{OLS}/MSE$ where MSE is the mean square error from the OLS regression.
Ref. [14] has an interesting result for the multiple linear regression model (1). Assume that the cases $(x_i^T, Y_i)^T$ are iid with appropriate moments and nonsingular $\Sigma_x$. Let $w_i = (x_i - \mu_x)(Y_i - \mu_Y)$. Then testing $H_0: \beta_{OLS} = 0$ versus $H_1: \beta_{OLS} \neq 0$ is equivalent to testing $H_0: \mu_w = 0$ versus $H_1: \mu_w \neq 0$ with $\mu_w = E(w_i) = \Sigma_{xY}$ where $\beta_{OLS} = \Sigma_x^{-1}\Sigma_{xY}$, and a one sample test can be applied to the $\hat{w}_i = (x_i - \overline{x})(Y_i - \overline{Y})$.
Ref. [14] notes that there are only a few high dimensional analogs of the low dimensional multiple linear regression F-test for $H_0: \beta = 0$ versus $H_1: \beta \neq 0$. See [15,16,17,18]. The assumptions on the predictors in these four papers are very strong.
This paper uses the above test for $H_0: \mu_w = 0$, which is equivalent to a test for $H_0: \Sigma_{xY} = 0$. The resulting test is not limited to OLS for multiple linear regression with iid errors. As shown below and in the following paragraph, the test can be used for multiple linear regression when heterogeneity is present, and the test can also be used for many regression models that depend on the predictors only through $SP = \alpha + \beta^T x$. Suppose $\beta_D = D\Sigma_{xY}$ where D is a positive definite matrix. Then $\beta_D = 0$ if and only if $\Sigma_{xY} = 0$. Then $D = \lambda I_p$ for OPLS, $D = \Sigma_x^{-1}$ for OLS, and $D = [\text{diag}(\Sigma_x)]^{-1}$ for the MMLE. The k-component partial least squares estimator can be found by regressing Y on a constant and on $x^T\eta_j$ for $j = 1, \dots, k$ where $\eta_j = \Sigma_x^{j-1}\Sigma_{xY}$. See [19]. Hence $\beta_{kPLS} = 0$ if $\Sigma_{xY} = 0$. Thus if the cases are iid, then using the $\hat{w}_i$ gives tests for $\Sigma_{xY} = 0$, $\beta_{OPLS} = 0$, $\beta_{OLS} = 0$, $\beta_{MMLE} = 0$, and $\beta_{kPLS} = 0$. For multiple linear regression with heterogeneity, $\hat{\Sigma}_{xY}$ is still a consistent estimator of $\Sigma_{xY}$. Hence the test can be used when the constant variance assumption is violated.
Under iid cases with $\beta = 0$, if the response variables depend on the $x_i$ only through $SP = \alpha + \beta^T x_i = \alpha$, then the $Y_i$ are iid and do not depend on x, and thus satisfy a multiple linear regression model with $\beta = 0$. For a parametric regression, such as a generalized linear model, assume $Y|SP \sim D(m(SP), \gamma)$ where D is the parametric distribution and m is a real valued function. For example, D could be the negative binomial distribution with $\mu = m(SP) = \exp(SP)$ and parameter $\kappa$. If $\beta = 0$, then the $Y_i$ are iid $\sim D(m(\alpha), \gamma)$. Typically, if $\beta \neq 0$, then $\Sigma_{xY} \neq 0$, and the test can have good power. An exception is when there is a lot of symmetry, which rarely occurs with real data. For example, suppose $Y = m(\beta^T x) + e$ where the iid errors are independent of the predictors, $x \sim N_p(0, I_p)$, and the function m is symmetric about 0, e.g., $m(t) = t^2$. Then $\Sigma_{xY} = 0$ even if $\beta \neq 0$.
If $H_0: \Sigma_{xY} = 0$ is true, then $\mu_w = 0$, and the $w_i$ have expected value 0. Then apply a high dimensional one sample test on the $\hat{w}_i$. Note that the sample mean
$$\overline{\hat{w}} = \frac{1}{n}\sum_{i=1}^n \hat{w}_i = \frac{n-1}{n}\,\hat{\Sigma}_{xY}.$$
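The identity above is easy to verify numerically. The following R sketch (with arbitrary simulated data, so all names are illustrative) forms the $\hat{w}_i$ and checks that their sample mean equals $(n-1)\hat{\Sigma}_{xY}/n$.
- n <- 100; p <- 10
- x <- matrix(rnorm(n*p), n, p); y <- rnorm(n)
- w <- sweep(x, 2, colMeans(x)) * (y - mean(y)) #rows are the hat-w_i
- max(abs(colMeans(w) - (n-1)*as.vector(cov(x,y))/n)) #approximately 0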
Section 2.1 reviews and derives some results for the one sample test that will be used. Section 2.2 reviews some two sample tests. Section 2.3 gives theory for the test given in the above paragraph.
2. Materials and Methods
2.1. A High Dimensional One Sample Test
This section reviews and derives some results for the one sample test that will be used. Suppose $x_1, \dots, x_n$ are iid random vectors with $E(x_i) = \mu$ and covariance matrix $\text{Cov}(x_i) = \Sigma$. Then the test $H_0: \mu = 0$ versus $H_1: \mu \neq 0$ is equivalent to the test $H_0: \theta = 0$ versus $H_1: \theta > 0$ where $\theta = \mu^T\mu = \|\mu\|^2$. A U-statistic for estimating $\theta$ is
$$T_n = \frac{1}{n(n-1)}\sum_{i \neq j} x_i^T x_j = \overline{x}^T\overline{x} - \frac{\text{tr}(S_n)}{n} \quad (7)$$
where $S_n$ is the sample covariance matrix and tr() is the trace function. See, for example, [20].
To see that the last equality holds, note that
$$\text{tr}(S_n) = \frac{1}{n-1}\left[\sum_{i=1}^n x_i^T x_i - n\,\overline{x}^T\overline{x}\right].$$
Now
$$\sum_{i \neq j} x_i^T x_j = \left(\sum_{i=1}^n x_i\right)^T\left(\sum_{j=1}^n x_j\right) - \sum_{i=1}^n x_i^T x_i = n^2\,\overline{x}^T\overline{x} - \sum_{i=1}^n x_i^T x_i.$$
Thus
$$T_n = \frac{n}{n-1}\,\overline{x}^T\overline{x} - \frac{1}{n(n-1)}\sum_{i=1}^n x_i^T x_i.$$
Thus
$$T_n = \overline{x}^T\overline{x} - \frac{1}{n}\,\text{tr}(S_n).$$
Next, we derive a simple test. Let the variance $V(x_i^T x_j) = \sigma_D^2$ for $i \neq j$. Let $m = \lfloor n/2 \rfloor$ be the integer part of $n/2$. So $\lfloor 100/2 \rfloor = \lfloor 101/2 \rfloor = 50$. Let the iid random variables $D_i = x_{2i-1}^T x_{2i}$ for $i = 1, \dots, m$. Note that $E(D_i) = \theta$ and $V(D_i) = \sigma_D^2$. Let $S_D^2$ be the sample variance of the $D_i$:
$$S_D^2 = \frac{1}{m-1}\sum_{i=1}^m (D_i - \overline{D})^2.$$
The following new theorem follows from the univariate central limit theorem.
Theorem 1.
Assume $x_1, \dots, x_n$ are iid, $E(x_i) = \mu$, and the variance $V(x_i^T x_j) = \sigma_D^2$ for $i \neq j$. Let $\overline{D}$ and $S_D^2$ be defined as above. Then
(a) $$\sqrt{m}\,\frac{\overline{D} - \theta}{S_D} \xrightarrow{D} N(0, 1)$$
as $m \to \infty$.
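A minimal R sketch of the resulting test, assuming the simple one sided rejection rule reject $H_0$ if $\sqrt{m}\,\overline{D}/S_D > z_{1-\alpha}$ (the helper name simpleD is ours):
- simpleD <- function(x){ #Theorem 1 test of H0: theta = 0
- n <- nrow(x); m <- floor(n/2)
- D <- rowSums(x[seq(1, 2*m, by=2), , drop=FALSE] * x[seq(2, 2*m, by=2), , drop=FALSE]) #D_i = x_{2i-1}^T x_{2i}
- Z <- sqrt(m)*mean(D)/sd(D) #approximately N(0,1) under H0
- c(stat = Z, rpval = 1 - pnorm(Z)) #right tailed since theta > 0 under H1
- }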
The following theorem derives the variance under simpler regularity conditions than those in the literature, and the new proof of the theorem is also simpler.
Theorem 2.
Assume $x_1, \dots, x_n$ are iid, $E(x_i) = \mu$, and the variance $V(x_i^T x_j) = \sigma_D^2$ for $i \neq j$. Let $D_{ij} = x_i^T x_j$ for $i \neq j$. Let $[n(n-1)]^2 = a + b + c$ where $a = 4n(n-1)(n-2)$, $b = n(n-1)(n-2)(n-3)$, and $c = 2n(n-1)$. Then
(a) $$V(T_n) = \frac{2\sigma_D^2}{n(n-1)} + \frac{4(n-2)}{n(n-1)}\,\mu^T\Sigma\mu.$$
(b) If $H_0: \mu = 0$ is true, then $\mu^T\Sigma\mu = 0$ and
$$V(T_n) = \frac{2\sigma_D^2}{n(n-1)}.$$
Proof.
(a) To find the variance with $T_n$ from Equation (7), let $D_{ij} = x_i^T x_j$, and note that
$$T_n = \frac{1}{n(n-1)}\sum_{i \neq j} D_{ij}.$$
Then
$$V(T_n) = \frac{1}{[n(n-1)]^2}\sum_{i \neq j}\sum_{k \neq l}\text{Cov}(D_{ij}, D_{kl}). \quad (8)$$
Let $\theta = \mu^T\mu$. The covariances are of 3 types. First, if $\{i, j\} = \{k, l\}$ with $i \neq j$, then $\text{Cov}(D_{ij}, D_{kl}) = V(D_{ij}) = \sigma_D^2$. Second, if $i, j, k, l$ are distinct with $i \neq j$ and $k \neq l$, then $D_{ij}$ and $D_{kl}$ are independent with $\text{Cov}(D_{ij}, D_{kl}) = 0$. Third, there are terms where exactly three of the four subscripts are distinct, which have $\text{Cov}(D_{ij}, D_{ik})$ where $i \neq j$, $i \neq k$, and $j \neq k$, or the analogous terms with the common subscript in the other positions. These covariance terms are all equal to the same number since
$$\text{Cov}(D_{ij}, D_{ik}) = E[(x_j^T x_i)(x_i^T x_k)] - \theta^2 = \mu^T(\Sigma + \mu\mu^T)\mu - (\mu^T\mu)^2 = \mu^T\Sigma\mu.$$
The number of ways to get three distinct subscripts is
$$a = [n(n-1)]^2 - b - c = 4n(n-1)(n-2)$$
since a is the number of terms on the right hand side of (8) with exactly three distinct subscripts, b is the number of terms where $i, j, k, l$ are distinct with $i \neq j$ and $k \neq l$, and c is the number of terms where $\{i, j\} = \{k, l\}$ with $i \neq j$. [Note that $n(n-1)$ terms have i and j distinct. Half of these terms have $i < j$ and half have $i > j$. Similarly, $n(n-1)$ terms have $k, l$ distinct, and half of the terms have $k < l$, while half of the terms have $k > l$.] Thus
$$V(T_n) = \frac{c\,\sigma_D^2 + a\,\mu^T\Sigma\mu + b \cdot 0}{[n(n-1)]^2}.$$
This calculation was adapted from [21] (pp. 336–337). Thus
$$V(T_n) = \frac{2\sigma_D^2}{n(n-1)} + \frac{4(n-2)}{n(n-1)}\,\mu^T\Sigma\mu.$$
(b) Now $D_{ij} = x_i^T x_j$ where $x_i$ and $x_j$ are iid. Hence the result follows from (a): under $H_0$, $\mu = 0$ and thus $\mu^T\Sigma\mu = 0$. □
Note that $T_n$ is the sample mean of the $n(n-1)$ distinct, identically distributed $D_{ij}$ for $i \neq j$. When $\mu = 0$, Theorem 2 proves that the $D_{ij}$ are uncorrelated. Hence when $H_0$ is true, $T_n$ satisfies $V(T_n) = 2\sigma_D^2/[n(n-1)]$ (Theorem 2b). Ref. [14] (p. 2024) showed that $\sigma_D^2 = \text{tr}(\Sigma^2) + 2\mu^T\Sigma\mu$. Plugging this value into $V(T_n)$ (Theorem 2a) gives the [22] result
$$V(T_n) = \frac{2\,\text{tr}(\Sigma^2)}{n(n-1)} + \frac{4}{n}\,\mu^T\Sigma\mu.$$
Note that $\gamma = \mu^T\Sigma\mu$ can be consistently estimated as follows. Let $m = \lfloor n/3 \rfloor$. Let $D_1 = x_1^T x_2$, $E_1 = x_1^T x_3$, $D_2 = x_4^T x_5$, $E_2 = x_4^T x_6$, …, $D_m = x_{3m-2}^T x_{3m-1}$, $E_m = x_{3m-2}^T x_{3m}$. Then $\hat{\gamma}$ is the sample covariance of the $(D_i, E_i)$ where $i = 1, \dots, m$. Note that a consistent estimator of $V(T_n)$ is then $\frac{2S_D^2}{n(n-1)} + \frac{4(n-2)}{n(n-1)}\hat{\gamma}$.
Let $\hat{\sigma}_D^2$ and $\hat{V}(T_n) = 2\hat{\sigma}_D^2/[n(n-1)]$ be consistent estimators of $\sigma_D^2$ and $V(T_n)$, respectively. Then ref. [22,23,24,25], and others proved that under mild regularity conditions when $H_0$ is true,
$$Z_n = \frac{T_n}{\sqrt{\hat{V}(T_n)}} \xrightarrow{D} N(0, 1).$$
Under regularity conditions when $H_0$ is true, ref. [25] proved that a t distribution approximation for the test statistic holds as $p \to \infty$ for fixed n, where the degrees of freedom depend on n.
A consistent estimator of $V(T_n)$ needs a consistent estimator of $\sigma_D^2$. Let $D_{ij} = x_i^T x_j$ for $i \neq j$. Then one estimator is $S_D^2$ from Theorem 1. An estimator nearly the same as the one used by [25] is
$$\hat{\sigma}_D^2 = \frac{1}{n(n-1)}\sum_{i \neq j}(D_{ij} - T_n)^2.$$
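The following R sketch computes $T_n$, $\hat{\sigma}_D^2$, and $Z_n$ for a one sample problem; it mirrors the hdomni listing in the Discussion, and the helper name onesamp is ours.
- onesamp <- function(x){ #one sample test of H0: mu = 0 using T_n
- n <- nrow(x); k <- n*(n-1)
- a <- colSums(x)
- Tn <- (sum(a^2) - sum(x^2))/k #T_n from (7)
- D <- tcrossprod(x) #D_ij = x_i^T x_j
- s2 <- (sum((D - Tn)^2) - sum((diag(D) - Tn)^2))/k #hat sigma_D^2 over i != j
- Z <- Tn/sqrt(2*s2/k) #approximately N(0,1) under H0 by Theorem 2(b)
- c(Tn = Tn, Z = Z, rpval = 1 - pnorm(Z))
- }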
Note that $\sigma_D$ can be proportional to p since $\sigma_D$ is the standard deviation of $D_{ij} = x_i^T x_j = \sum_{k=1}^p x_{ik}x_{jk}$, a sum of p random variables. Thus to have good asymptotic power against all alternatives, we likely need $\theta = \mu^T\mu \to \infty$ as $p \to \infty$. When $H_0$ is false, $T_n$ tends to have more power than $\overline{D}$ since $V(\overline{D}) \approx (n-1)V(T_n)$. Suppose $\mu = \delta 1$ where the constant $\delta \neq 0$ and 1 is the $p \times 1$ vector of ones. Then $\theta = p\delta^2$, and the test using $T_n$ may have good power for small $|\delta|$ or for large p.
For computing $\hat{V}(T_n)$, a question is whether to use an estimator of $\sigma_D^2$ or of $\text{tr}(\Sigma^2)$. Let the $(i, j)$th element of $\Sigma$ be $\sigma_{ij}$ with $\sigma_{ii} = \sigma_i^2$. Let $\|\Sigma\|_F$ be the Frobenius norm of $\Sigma$, and $\|a\|$ be the Euclidean norm of vector a. Let $\text{vec}(\Sigma)$ be the vector formed by stacking the columns of $\Sigma$ into a $p^2 \times 1$ vector. Then $\text{tr}(\Sigma^2) = \|\Sigma\|_F^2 = \|\text{vec}(\Sigma)\|^2$. There is a level-power tradeoff. Using $\hat{\sigma}_D^2$ is good for controlling the level = P(type I error) when $H_0$ is true. Since $\sigma_D^2 = \text{tr}(\Sigma^2) + 2\mu^T\Sigma\mu$, the parameter $\text{tr}(\Sigma^2)$ can be much smaller than $\sigma_D^2$, and using a good estimator of $\text{tr}(\Sigma^2)$ may result in better power.
In high dimensions, it is often very difficult to estimate a $p \times 1$ vector such as $\mu$ when $p > n$. This result is a form of “the curse of dimensionality.” If a consistent estimator $\hat{\mu}$ of $\mu$ is available, then the squared norm $\|\hat{\mu}\|^2$ can be used to estimate $\theta$. Hence estimators that use many parameters, such as plug in estimators of $\text{tr}(\Sigma^2)$, are likely to be poor. The two parameter estimator of $\text{tr}(\Sigma^2)$ likely has more variability than $\hat{\sigma}_D^2$ when $H_0$ is true, and better estimators of $\text{tr}(\Sigma^2)$ are needed. In simulations, the $\text{tr}(\Sigma^2)$ estimator was often negative. Let the modified estimator equal the $\text{tr}(\Sigma^2)$ estimator if that estimator is positive, and equal $\hat{\sigma}_D^2$ otherwise. In limited simulations, this modified estimator did about as well as $\hat{\sigma}_D^2$. Obtaining an estimator of $\text{tr}(\Sigma^2)$ that clearly outperforms $\hat{\sigma}_D^2$ would improve the omnibus test, but is beyond the scope of this paper.
We also considered replacing the $x_i$ by the $u_i = S(x_i)$ where the spatial sign function $S(x) = x/\|x\|$ if $x \neq 0$, and $S(0) = 0$ otherwise. This function projects the nonzero $x_i$ onto the unit p-dimensional hypersphere centered at 0. Let $T_n(S)$ denote the statistic $T_n$ computed from an iid sample $u_1, \dots, u_n$. Since the $u_i$ are iid if the $x_i$ are iid, use $T_n(S)$ to test $H_0: \mu_u = 0$ versus $H_1: \mu_u \neq 0$ where $\mu_u = E[S(x)]$. In general, $\mu_u \neq \mu$, but $\mu_u = 0 = \mu$ can occur if the $x_i$ have a lot of symmetry about 0. In particular, $\mu_u = 0$ if the $x_i$ are iid from an elliptically contoured distribution with $\mu = 0$. The test based on the statistic $T_n(S)$ can be useful if the first or second moments of the $x_i$ do not exist, for example if the $x_i$ are iid from a multivariate Cauchy distribution. These results may be useful for understanding papers such as [26].
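A minimal R sketch of the rowwise spatial sign transform, to be combined with a one sample test such as the onesamp sketch above:
- spatialsign <- function(x){ #projects nonzero rows onto the unit hypersphere
- d <- sqrt(rowSums(x^2))
- d[d == 0] <- 1 #S(0) = 0: dividing a zero row by 1 keeps it zero
- x/d #R recycles d down the columns, scaling row i by 1/d[i]
- }
- #Example: onesamp(spatialsign(x)) tests H0: E[S(x)] = 0.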
The nonparametric bootstrap draws a bootstrap data set $x_1^*, \dots, x_n^*$ with replacement from the $x_i$ and computes $T_n^*$ by applying $T_n$ on the bootstrap data set. This process is repeated B times to get a bootstrap sample $T_{n,1}^*, \dots, T_{n,B}^*$. For the statistic $T_n$, the nonparametric bootstrap fails in high dimensions because terms like $x_i^T x_i$ need to be avoided, and the nonparametric bootstrap has replicates: the proportion of cases in the bootstrap sample that are not replicates is about 0.632. The m out of n bootstrap draws a sample of size m without replacement from the n cases, and worked well in simulations. Sampling without replacement is also known as subsampling and the delete d jackknife where $d = n - m$.
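A minimal sketch of the m out of n bootstrap for $T_n$, reusing the onesamp helper from above; the default choice m = ⌊n/2⌋ is an illustrative assumption since the value of m used in the simulations is not restated here.
- moutn <- function(x, B = 1000, m = floor(nrow(x)/2)){
- n <- nrow(x)
- replicate(B, onesamp(x[sample(n, m), , drop=FALSE])["Tn"]) #no replicated cases
- }
- #A shorth-type interval from the B statistics then gives the bootstrap "test" as in [30].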
2.2. Three High Dimensional Two Sample Tests
If $(x_i^T, y_i^T)^T$ come in correlated pairs, a high dimensional analog of the paired t test applies the one sample test on the $d_i = x_i - y_i$.
Now suppose there are two independent random samples $x_1, \dots, x_{n_1}$ and $y_1, \dots, y_{n_2}$ from two populations or groups, and that it is desired to test $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$ where $\mu_1$ and $\mu_2$ are $p \times 1$ vectors. Let $\mu_d = \mu_1 - \mu_2$. Let $S_i$ be the sample covariance matrix of sample i and let $\Sigma_i$ be the population covariance matrix of sample i for $i = 1, 2$.
A simple test takes $n = \min(n_1, n_2)$ and $d_i = x_i - y_i$ for $i = 1, \dots, n$. Then apply the one sample test from Theorem 2 to the $d_i$. This paired test might work well in high dimensions because of the superior power of the Theorem 2 test, but in low dimensions, it is known that there are better tests.
Let the $y_i$ be the sample that has $n_2 = \max(n_1, n_2)$ cases. Then let
$$d_i = x_i - \sqrt{\frac{n_1}{n_2}}\,y_i + \frac{1}{\sqrt{n_1 n_2}}\sum_{j=1}^{n_1} y_j - \frac{1}{n_2}\sum_{j=1}^{n_2} y_j$$
for $i = 1, \dots, n_1$. Note that $d_i = x_i - y_i$ if $n_1 = n_2$. Ref. [27] (pp. 177–178) proved that $E(d_i) = \mu_1 - \mu_2$, that $d_i$ and $d_j$ are uncorrelated for $i \neq j$, that $\text{Cov}(d_i) = \Sigma_1 + \frac{n_1}{n_2}\Sigma_2$, and that the $d_i$ are iid for normal samples. Ref. [25] showed that the one sample test can be applied to these $d_i$, where the subscript d denotes that the one sample test was computed using the $d_i$.
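A minimal R sketch of this transformation (the helper name andersonD is ours); the output rows are the $d_i$, to which a one sample test can be applied.
- andersonD <- function(x, y){ #x: n1 x p, y: n2 x p with n1 <= n2
- n1 <- nrow(x); n2 <- nrow(y); r <- sqrt(n1/n2)
- shift <- colSums(y[1:n1, , drop=FALSE])/sqrt(n1*n2) - colMeans(y)
- t(t(x - r*y[1:n1, , drop=FALSE]) + shift) #adds shift to each row
- }
- #Example: onesamp(andersonD(x, y)) tests H0: mu_1 = mu_2.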
Note that $H_0: \mu_1 = \mu_2$ holds if and only if
$$\|\mu_1 - \mu_2\|^2 = \mu_1^T\mu_1 + \mu_2^T\mu_2 - 2\mu_1^T\mu_2 = 0.$$
These terms can be estimated by $T_{n,1}$ and $T_{n,2}$, where $T_{n,i}$ is the one sample test statistic applied to sample i, and
$$\widehat{\mu_1^T\mu_2} = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} x_i^T y_j.$$
Let $T_{CQ} = T_{n,1} + T_{n,2} - 2\,\widehat{\mu_1^T\mu_2}$, which estimates $\|\mu_1 - \mu_2\|^2$. Let $V_{CQ}$ be the variance of $T_{CQ}$ when $H_0$ is true. Assume $\hat{V}_{CQ}$ is a consistent estimator of $V_{CQ}$. Under $H_0$ and additional regularity conditions, ref. [22] showed that
$$\frac{T_{CQ}}{\sqrt{\hat{V}_{CQ}}} \xrightarrow{D} N(0, 1),$$
and that
$$V_{CQ} = \frac{2\,\text{tr}(\Sigma_1^2)}{n_1(n_1-1)} + \frac{2\,\text{tr}(\Sigma_2^2)}{n_2(n_2-1)} + \frac{4\,\text{tr}(\Sigma_1\Sigma_2)}{n_1 n_2}.$$
Let $\sigma_{D1}^2 = V(x_i^T x_j)$ where $i \neq j$, $\sigma_{D2}^2 = V(y_i^T y_j)$ where $i \neq j$, and $\sigma_{D12}^2 = V(x_i^T y_j)$.
Ref. [22] showed that the three trace terms of $V_{CQ}$ can be estimated separately.
Ref. [28], using arguments similar to Theorem 2, showed
$$\sigma_{D1}^2 = \text{tr}(\Sigma_1^2) + 2\mu_1^T\Sigma_1\mu_1, \quad \sigma_{D2}^2 = \text{tr}(\Sigma_2^2) + 2\mu_2^T\Sigma_2\mu_2,$$
$$\text{and} \quad \sigma_{D12}^2 = \text{tr}(\Sigma_1\Sigma_2) + \mu_1^T\Sigma_2\mu_1 + \mu_2^T\Sigma_1\mu_2.$$
Thus $\sigma_{D1}^2 = \text{tr}(\Sigma_1^2)$, $\sigma_{D2}^2 = \text{tr}(\Sigma_2^2)$, and $\sigma_{D12}^2 = \text{tr}(\Sigma_1\Sigma_2)$ if $\mu_1 = \mu_2 = 0$. Hence, in that case,
$$V_{CQ} = \frac{2\sigma_{D1}^2}{n_1(n_1-1)} + \frac{2\sigma_{D2}^2}{n_2(n_2-1)} + \frac{4\sigma_{D12}^2}{n_1 n_2}.$$
If $\mu_1 = \mu_2 = 0$, then the trace terms equal the $\sigma_D^2$ terms, and the formula with the $\sigma_D^2$ estimators worked well in simulations. Note that $\sigma_{D1}^2$, $\sigma_{D2}^2$, and $\sigma_{D12}^2$ can be estimated as in Section 2.1. Let $n = \min(n_1, n_2)$, and let $C_i = x_i^T y_i$ for $i = 1, \dots, n$. Let $S_C^2$ be the sample variance of the $C_i$. Another estimator of $\sigma_{D12}^2$ is
$$\hat{\sigma}_{D12}^2 = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\left(x_i^T y_j - \widehat{\mu_1^T\mu_2}\right)^2.$$
2.3. Theory for Testing $H_0: \Sigma_{xY} = 0$
Consider tests of the form $H_0: A\Sigma_{xY} = 0$ versus $H_1: A\Sigma_{xY} \neq 0$. The omnibus test uses $A = I_p$ and tests $H_0: \Sigma_{xY} = 0$ versus $H_1: \Sigma_{xY} \neq 0$.
Let $w_i = (x_i - \mu_x)(Y_i - \mu_Y)$ and $\hat{w}_i = (x_i - \overline{x})(Y_i - \overline{Y})$ for $i = 1, \dots, n$. Then under mild regularity conditions, $Z_w \xrightarrow{D} N(0, 1)$ by Section 2.1, where w indicates that the test was applied to the $\hat{w}_i$. Ref. [14] showed that substituting the $\hat{w}_i$ for the $w_i$ does not change the limiting null distribution, and used the test for multiple linear regression in their simulations.
Let $x = (x_I^T, x_O^T)^T$, $\beta = (\beta_I^T, \beta_O^T)^T$, and $\hat{w}_{i,O} = (x_{i,O} - \overline{x}_O)(Y_i - \overline{Y})$. Then testing $H_0: \Sigma_{x_OY} = 0$ uses the one sample test on the $\hat{w}_{i,O}$, as in the sketch below. This test is equivalent to testing $\beta_{OPLS,O} = 0$ and $\beta_{MMLE,O} = 0$. Note that data splitting could be used to select O. For multiple linear regression and the MMLE and OPLS estimators, these tests are high dimensional analogs of the OLS partial F tests for testing whether a reduced model is good. If $\beta_O = 0$, then I corresponds to the predictors in the reduced model while O corresponds to the predictors out of the reduced model.
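A minimal sketch of the subset test, assuming the hdomni function from the Discussion returns its right tailed p-value as rpval (the index set O below is hypothetical):
- O <- c(3, 7, 12) #hypothetical predictors out of the reduced model
- hdomni(x[, O, drop=FALSE], y)$rpval #one sample test on the hat-w_{i,O}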
In low dimensions, important tests for regression include (a) $H_0: \beta_j = 0$ (the Wald tests for MLR), (b) $H_0: \beta = 0$ (the Anova F test for MLR), and (c) $H_0: \beta_O = 0$ (the partial F test for MLR). The above paragraph shows how to do these high dimensional tests for the multiple linear regression OPLS and MMLE estimators, with or without heterogeneity. Data splitting is not needed if O is known. Note that (a) corresponds to testing with $O = \{j\}$ while (c) corresponds to testing with a general O.
The next subsection reviews competitors for the above tests when k is small compared to n.
2.4. Theory for Certain A
This subsection reviews some large sample theory for $\hat{\Sigma}_{xY}$ and OPLS for the multiple linear regression model, including some high dimensional tests for low dimensional quantities such as $\beta_O$ where O contains k predictors. These tests depend on iid cases, but not on linearity or the constant variance assumption. Hence the tests are useful for multiple linear regression with heterogeneity.
The following [5] theorem gives the large sample theory for $\hat{\Sigma}_{xY}$. Ref. [6] gave alternative proofs. This theory needs $\Sigma_w = \text{Cov}(w_i)$ to exist. Let $w_i = (x_i - \mu_x)(Y_i - \mu_Y)$ and let $\hat{w}_i = (x_i - \overline{x})(Y_i - \overline{Y})$, with $\hat{\Sigma}_w$ the sample covariance matrix of the $\hat{w}_i$. Then low order moments are needed for $\hat{\Sigma}_w$ to be a consistent estimator of $\Sigma_w$.
Theorem 3.
Assume the cases $(x_i^T, Y_i)^T$ are iid. Assume the relevant low order moments exist. Let $\mu_x = E(x)$ and $\mu_Y = E(Y)$. Let $w_i = (x_i - \mu_x)(Y_i - \mu_Y)$ with sample mean $\overline{w}$, and let $\Sigma_w = \text{Cov}(w_i)$. Then (a)
$$\sqrt{n}\,(\hat{\Sigma}_{xY} - \Sigma_{xY}) \xrightarrow{D} N_p(0, \Sigma_w).$$
(b) Let $\hat{w}_i = (x_i - \overline{x})(Y_i - \overline{Y})$ with sample mean $\overline{\hat{w}}$. Then $\sqrt{n}\,(\overline{w} - \Sigma_{xY}) \xrightarrow{D} N_p(0, \Sigma_w)$. Hence $\sqrt{n}\,(\overline{\hat{w}} - \Sigma_{xY}) \xrightarrow{D} N_p(0, \Sigma_w)$.
(c) Let A be a $k \times p$ full rank constant matrix with $k \leq p$, assume $H_0: A\Sigma_{xY} = 0$ is true, and assume $\hat{\Sigma}_w \xrightarrow{P} \Sigma_w$. Then
$$n\,(A\hat{\Sigma}_{xY})^T[A\hat{\Sigma}_w A^T]^{-1}(A\hat{\Sigma}_{xY}) \xrightarrow{D} \chi^2_k.$$
For the following theorem, consider a subset of k distinct elements from the population quantities or from their estimators. Stack the elements into a vector, and let each vector have the same ordering. For example, the largest subset of distinct elements corresponds to the entire stacked vector. For random variables $z_i$, use notation such as $\overline{z}$ for the sample mean of the $z_i$. For general vectors of elements, the ordering of the vectors will all be the same. Let $\overline{z}$ be the sample mean of the $z_i$. Assuming that $\Sigma_z = \text{Cov}(z_i)$ exists, the multivariate central limit theorem gives
$$\sqrt{n}\,(\overline{z} - E(z_i)) \xrightarrow{D} N_k(0, \Sigma_z).$$
The following [6] theorem provides large sample theory for $\hat{\Sigma}_{xZ}$ and related quantities. We use z to avoid confusion with the w used in Theorem 3. Note that the Z are dummy variables and could be replaced by $Y_1, \dots, Y_m$ to get information about m response variables. Testing with several response variables could likely be done by applying the one sample test to the stacked vectors built from the $\hat{w}_i$ for $Y_1$, …, $Y_m$, assuming the relevant moments exist and iid cases.
Theorem 4.
Assume the cases are iid and that $\Sigma_z$ exists. Using the above notation with $z_i$ a $k \times 1$ vector,
(i) .
(ii) .
(iii) and .
2.5. Testing $H_0: A\Sigma_{xY} = 0$
As noted by [5], the following simple testing method reduces a possibly high dimensional problem to a low dimensional problem. Testing $H_0: A\Sigma_{xY} = 0$ versus $H_1: A\Sigma_{xY} \neq 0$ is equivalent to testing $H_0: A\mu_w = 0$ versus $H_1: A\mu_w \neq 0$ where A is a $k \times p$ constant matrix. Let $\Sigma_w$ be the asymptotic covariance matrix of $\sqrt{n}\,\overline{\hat{w}}$. In high dimensions where $n < p$, we can’t get a good nonsingular estimator of $\Sigma_w$, but we can get good nonsingular estimators of the covariance matrix of the $k \times 1$ vectors $A w_i$ with $n \geq Jk$ where $J \geq 10$. Here $u$ denotes the k predictors that are in the hypothesis test. (Values of J much larger than 10 may be needed if some of the k predictors and/or Y are skewed.) Simply apply Theorem 3 to the predictors u used in the hypothesis test, and thus use the sample covariance matrix of the vectors $\hat{w}_i(u) = (u_i - \overline{u})(Y_i - \overline{Y})$. Hence we can test low dimensional hypotheses even when $p > n$. In particular, testing $H_0: \Sigma_{uY} = 0$ is equivalent to testing $H_0: \beta_O = 0$ for the OPLS and MMLE estimators where $u = x_O$.
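A minimal R sketch of this low dimensional test with $A = I_k$, using the chi-square limit of Theorem 3(c); the helper name lowdimtest is ours.
- lowdimtest <- function(u, y){ #u: n x k matrix of the k test predictors, k << n
- n <- nrow(u)
- w <- sweep(u, 2, colMeans(u)) * (y - mean(y)) #hat-w_i(u)
- wbar <- colMeans(w)
- stat <- n * sum(wbar * solve(cov(w), wbar)) #n wbar' Sigma_w^{-1} wbar
- c(stat = stat, pval = pchisq(stat, df = ncol(u), lower.tail = FALSE))
- }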
2.6. High Dimensional Outlier Detection
High dimensional outlier detection is important. This subsection follows [29] closely. See [29,30] for examples and simulations. Let W be a data matrix, where the rows $w_i^T$ correspond to cases. For example, W = X. One of the simplest outlier detection methods uses the Euclidean distances $D_i = \|w_i - \text{MED}(W)\|$ of the $w_i$ from the coordinatewise median $\text{MED}(W)$. Concentration type steps compute the weighted median: the coordinatewise median computed from the “half set” of cases $w_i$ with the smallest distances $D_i$. We often used 0 (no concentration type steps) or a few concentration type steps. Let $D_i$ be the distance from the final (weighted) coordinatewise median. Let $W_i = 1$ if $D_i \leq \text{MED}(D_1, \dots, D_n) + k\,\text{MAD}(D_1, \dots, D_n)$ where $k \geq 0$ and $k = 5$ is the default choice. Let $W_i = 0$, otherwise. Using $k \geq 0$ insures that at least half of the cases get weight 1. This weighting corresponds to the weighting that would be used in a one sided metrically trimmed mean (Huber type skipped mean) of the distances. Here, the sample median absolute deviation is $\text{MAD}(D_1, \dots, D_n) = \text{MED}(|D_i - \text{MED}(D_1, \dots, D_n)|,\ i = 1, \dots, n)$ where $\text{MED}(D_1, \dots, D_n)$ is the sample median of $D_1, \dots, D_n$.
Let the covmb2 set B of at least $n/2$ cases correspond to the cases with weight $W_i = 1$. Then the covmb2 estimator $(T, C)$ is the sample mean and sample covariance matrix applied to the cases in set B. Hence
$$T = \frac{\sum_{i=1}^n W_i w_i}{\sum_{i=1}^n W_i} \quad \text{and} \quad C = \frac{\sum_{i=1}^n W_i (w_i - T)(w_i - T)^T}{\sum_{i=1}^n W_i - 1}.$$
This estimator was built for speed, applications, and outlier resistance.
Another method to get an outlier resistant estimator is to use the following identity. If X and Y are random variables, then
$$\text{Cov}(X, Y) = \frac{\text{Var}(X + Y) - \text{Var}(X - Y)}{4}.$$
Then replace each variance by $\hat{\sigma}^2$ where $\hat{\sigma}$ is a robust estimator of scale or standard deviation applied to X + Y or X − Y. We used a sample median absolute deviation scaled to be consistent at the normal distribution. Hence
$$\widehat{\text{Cov}}(X, Y) = \frac{\hat{\sigma}^2(X + Y) - \hat{\sigma}^2(X - Y)}{4}.$$
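A minimal R sketch of this identity with a robust scale; here the built-in mad() (MAD scaled by 1.4826) stands in for the scale estimator, which is an assumption about the exact choice used.
- rcov <- function(x, y) (mad(x + y)^2 - mad(x - y)^2)/4
- #rcov(x, y) is close to cov(x, y) for clean data, but resists outliers.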
The function ddplot5 plots the Euclidean distances from the coordinatewise median versus the Euclidean distances from the covmb2 location estimator. Typically the plotted points in this DD plot cluster about the identity line, and outliers appear in the upper right corner of the plot with a gap between the bulk of the data and the outliers.
The function rcovxy makes the classical and three robust estimators of $\Sigma_{xY}$, and makes a scatterplot matrix of the four estimated sufficient predictors and Y. Only two robust estimators are made if p is large.
3. Results
Example 1.
The [31] data was collected from districts in Prussia in 1843. Let Y = the number of women married to civilians in the district, with a constant and predictors $x_1$ = the population of the district in 1843, $x_2$ = the number of married civilian men in the district, $x_3$ = the number of married men in the military in the district, and $x_4$ = the number of women married to husbands in the military in the district. Sometimes the person conducting the survey would not count a spouse if the spouse was not at home. Hence Y and $x_2$ are highly correlated but not equal. Similarly, $x_3$ and $x_4$ are highly correlated but not equal. We expect $\Sigma_{xY} \neq 0$. Let the omnibus test statistic be applied to the $\hat{w}_i$. Then the hypotheses $H_0: \Sigma_{xY} = 0$, $H_0: \beta_{OPLS} = 0$, and $H_0: \beta_{MMLE} = 0$ are all rejected. The classical F-test also rejects $H_0$ with p-value = 0.
Example 2.
The [32] pottery data has pottery shards of Roman earthware produced between the second century B.C. and the fourth century A.D. Often the pottery was stamped by the manufacturer. A chemical analysis was done for the chemicals (variables), and the types of pottery were 1-Arretine, 2-not-Arretine, 3-North Italian, 4-Central Italian, and 5-questionable origin. Let the binary response variable Y = 1 for type 1 and Y = 0 for types 2–5. The omnibus test had a two sided p-value of 0.0319 and the more correct right tailed p-value of 0.016. The chi-square logistic regression test for $H_0: \beta = 0$ had p-value = 0.0002, but the GLM did not converge.
3.1. One Sample Tests
In the simulations, we examined five one sample tests. The first “test” used the m out of n bootstrap to compute $T_n$ from subsamples with m < n. We used the shorth bootstrap confidence interval described in [30] (ch. 2). This “test” has not been proven to have level $\alpha$. The second test computed the usual t confidence interval
$$\overline{D} \pm t_{m-1, 1-\alpha/2}\,\frac{S_D}{\sqrt{m}}$$
for $\theta$ based on the $D_i$ from Theorem 1. The third and fourth tests used Theorem 2(b) and $Z_n = T_n/\sqrt{\hat{V}(T_n)} \xrightarrow{D} N(0, 1)$ if the variance estimator is consistent when $H_0$ is true. The third test used $\hat{\sigma}_D^2$, while the fourth test used $S_D^2$ based on Theorem 1. These two tests computed intervals (“confidence intervals for 0”)
$$T_n \pm \text{cutoff}\,\sqrt{\hat{V}(T_n)}.$$
The tests 2–4 use the same cutoff so that the average interval lengths are more comparable. The fifth test used the Theorem 2 test applied to the spatial sign vectors $u_i = S(x_i)$.
The simulation used four distribution types for $x = Az + \delta 1$, where 1 is the $p \times 1$ vector of ones so that $\mu = \delta 1$. Type 1 used $z \sim N_p(0, I_p)$, type 2 used a mixture distribution, type 3 used a multivariate t distribution, and type 4 used a multivariate lognormal distribution. The covariance matrix type depended on the matrix A. Type 1 used $A = I_p$, while types 2 and 3 used matrices A giving correlated predictors, with cor($x_i, x_j$) for $i \neq j$ depending on a constant $\psi$. We used $\delta = 0$ to simulate the level, and $\delta > 0$ chosen so at least one test had good power. The simulation used 5000 runs, the 4 x distributions, and the 3 matrices A. For the third A, a fixed value of $\psi$ was used.
Table 1 and Table 2 summarize some simulation results. There are two lines for each simulation scenario. The first line gives the simulated power = proportion of times $H_0$ was rejected. The second line gives the average length of the confidence interval for 0, where $H_0$ is rejected if 0 is not in the confidence interval. When $\delta = 0$, observed coverage between 0.04 and 0.06 suggests that coverage = power = level is close to the nominal value 0.05. For larger $\delta$, we want the coverage near 1 for good power. See [28] for more simulations.
Table 1.
One sample tests, covtyp = 1, cov = observed type I error for $\delta = 0$ and power for $\delta > 0$. Boldface for good performance not including spatial.
Table 2.
One sample tests, covtyp = 2, p = 10,000, cov = observed type I error for $\delta = 0$ and power for $\delta > 0$. Boldface for good performance not including spatial.
The bootstrap test corresponds to the boot column, and the tests using $\overline{D}$, $\hat{\sigma}_D^2$, and $S_D^2$ correspond to the next three columns. The last column corresponds to the spatial sign test. This test tends to have much shorter lengths because of the transformation of the data. The test using $\overline{D}$ has simple large sample theory, but low power compared to the other methods. This test’s length is approximately $\sqrt{n-1}$ times the length of that corresponding to $T_n$ with $\hat{\sigma}_D^2$. The bootstrap test was sometimes conservative, with observed coverage less than 0.04 when $\delta = 0$. For xtype = 4 and $\delta = 0$, $H_0: \mu_u = 0$ was not true for the spatial test. Hence the coverage for the spatial test was sometimes higher than 0.06 for this scenario. For $\delta = 0$, the test with $S_D^2$ sometimes had coverage less than 0.04, while the test with $\hat{\sigma}_D^2$ sometimes had coverage greater than 0.06. In the simulations, the spatial test often performed well, but typically $\mu_u \neq \mu$, which makes the spatial test harder to use. For testing $H_0: \mu = 0$, the test with $\hat{\sigma}_D^2$ appeared to perform better than the three competitors.
3.2. Two Sample Tests
In the simulations, we examined three two sample tests. The first “test” used the m out of n bootstrap, with $m_i < n_i$, to bootstrap the [22] test that estimates $\|\mu_1 - \mu_2\|^2$. The second test was the “paired test” with $n = \min(n_1, n_2)$ and $d_i = x_i - y_i$ for $i = 1, \dots, n$. Then apply the one sample test from Theorem 2 to the $d_i$. The third test was the [25] Li test. Both of the last two tests used the one sample test applied to the $d_i = x_i - y_i$ or to the transformed $d_i$ of Section 2.2.
The simulation used four distribution types where $x$ and $y$ had the same distribution under $H_0$, with $\mu_1 = 0$ and $\mu_2 = \delta 1$. Type 1 used $z \sim N_p(0, I_p)$, type 2 used a mixture distribution, type 3 used a multivariate t distribution, and type 4 used a multivariate lognormal distribution. The covariance matrix type depended on the matrix A.
For the covariance types, $A = I_p$ for covtyp = 1, a second matrix A was used for covtyp = 2, and A = diag(1, 2, ..., p) for covtyp = 3. Table 3 shows some results. Two lines were used for each simulation scenario, with coverages on the first line and lengths on the second line. When $n_1 = n_2$, the paired test and Li test gave the same results. When $n_1/n_2$ was not near 1, the Li test had better power and shorter length. Increasing $\delta$ could greatly increase the length for the bootstrap test, but the coverage would be 1. Improving the one sample test would improve the Li test, but the Li test performed well in simulations.
Table 3.
Two sample tests, covtyp = 1, cov = observed type I error for $\delta = 0$ and power for $\delta > 0$. Boldface for better performance.
3.3. Theorem 3 Tests
We illustrate Theorem 3 and Section 2.5 for Poisson regression and negative binomial regression. This simulation is similar to that done by [6] for multiple linear regression with and without heterogeneity. Let $u$ be the vector of the k nontrivial predictors. Let $\beta_j = 1$ for $j = 1, \dots, k$ and $\beta_j = 0$ for $j > k$. Hence $SP = \alpha + \beta^T x$ with $\beta = (1, \dots, 1, 0, \dots, 0)^T$ with k ones and p − k zeros. Here $\beta$ is the Poisson regression parameter vector or the negative binomial regression parameter vector. Let $Z = \log(Y)$ if $Y \geq 1$ and $Z = 0$ if $Y = 0$. Then a multiple linear regression model with heterogeneity is $Z = \alpha_Z + \beta_Z^T x + e$ where the $e_i$ are independent with expected value 0 and variance $\sigma_i^2$. Since the cases are iid, the OLS estimator is consistent for $\beta_Z = \Sigma_x^{-1}\Sigma_{xZ}$ because $\beta_{OLS} = \Sigma_x^{-1}\Sigma_{xZ}$. Thus $\beta_Z$ has the first k values equal to a nonzero constant and p − k zeros.
Let $\hat{\Sigma}_{x_jZ}$ be the jth element of $\hat{\Sigma}_{xZ}$. Then the Theorem 3 large sample confidence interval (CI) could be computed for each $j = 1, \dots, p$. If 0 is not in the confidence interval, then $H_0: \Sigma_{x_jZ} = 0$ and $H_0: \beta_{E,j} = 0$ are both rejected for estimators E = OPLS and MMLE for the multiple linear regression model with Z. In the simulations, the maximum observed undercoverage was nonnegligible. Hence the program has the option to replace the cutoff by a slightly larger cutoff for smaller n. This correction factor was used in the simulations for the nominal 95% CIs, where the correction factor uses a cutoff that is between the cutoff for a 95% CI and the cutoff that would be used for a 97.5% CI. The nominal coverage was 0.95 with $\alpha = 0.05$. Observed coverage between 0.94 and 0.96 suggests coverage is close to the nominal value. Ref. [33] noted that weighted least squares tests tend to reject too often (liberal tests with undercoverage).
To summarize the confidence intervals, the average length of the confidence intervals over 5000 runs was computed. Then the minimum, mean, and maximum of the average lengths were computed. The proportion of times each confidence interval contained zero was computed. These proportions were the observed coverages of the confidence intervals. Then the minimum observed coverage was found. The percentages of the observed coverages that were ≥ 0.9, 0.92, 0.93, 0.94, and 0.96 were also recorded. The test of $H_0: \beta_O = 0$ was also done where $H_0$ was true. The coverage of the test was recorded, and a correction factor was not used. Negative binomial regression and Poisson regression were used, where $\kappa = \infty$ indicates that Poisson regression was used.
Table 4 illustrates Theorem 3(a), where Table 4 replaces Y with Z. For Table 4, confidence intervals were made for the $\Sigma_{x_jZ}$ for $j = 1, \dots, p$, and the coverage was the proportion of the 5000 CIs that contained 0. Here $\Sigma_{x_1Z} \neq 0$ since k = 1, but $\Sigma_{x_jZ} = 0$ for $j > 1$. The first two lines of Table 4 correspond to Poisson regression. The confidence interval for $\Sigma_{x_1Z}$ never contained 0; hence, the minimum coverage was 0 with observed power 1. The proportion of the remaining CIs that had coverage ≥ 0.94 was 0.9898 (98/99 CIs). Hence this was also the proportion of CIs with coverage ≥ 0.92 and ≥ 0.93. The proportion of CIs that had coverage ≥ 0.96 was 0.8081 (80/99 CIs). The typical coverage was near 0.965; hence, the correction factor was slightly too large. The test of $H_0: \beta_O = 0$ did not use a correction factor, and coverage was 0.9438. The minimum average CI length was 0.4166, the sample mean of the average CI lengths was 0.4187, and the maximum average length was 0.4875, corresponding to the CI for $\Sigma_{x_1Z}$. The second two lines and below for Table 4 were for the negative binomial regression with finite $\kappa$. For $\kappa$ = 1000 and 10,000, the simulations were very similar to those for $\kappa = \infty$. Using Y instead of Z gave similar results with longer lengths.
Table 4.
Cov(x,Z), n = 100, p = 100, k = 1, want cov > 0.94 except for mincov and cov96.
3.4. Omnibus Test
Multiple Linear Regression
For this simulation, the x were generated as in Section 3.1, and then $Y = \alpha + \beta^T x + e$ where $\beta = \delta 1$. Hence $H_0$ is true when $\delta = 0$. The one sample test was applied on the $\hat{w}_i$ using $\hat{\sigma}_D^2$ and using $S_D^2$. The zero mean iid errors were iid from five distributions: (i) N(0,1), (ii) a t distribution, (iii) EXP(1) − 1, (iv) a uniform distribution, and (v) 0.9 N(0,1) + 0.1 N(0,100). Only distribution (iii) is not symmetric. With 5000 runs, we would like the coverage to be between 0.04 and 0.06 when $\delta = 0$. In Table 5, the coverage was a bit high when $S_D^2$ was used (second to last column) instead of $\hat{\sigma}_D^2$ (fourth column). Power near 0.95 was good for the larger values of $\delta$.
Table 5.
Omnibus test for multiple linear regression, cov = observed type I error for $\delta = 0$ and power for $\delta > 0$. Boldface for good performance.
Poisson Regression
For this simulation, the $x_i$ were generated in a manner similar to Section 3.1 where the $z_i$ were from a multivariate normal distribution. Let $\beta = \delta(1, \dots, 1, 0, \dots, 0)^T$ where there were k 1’s and p − k 0’s. Then the $\beta$ were scaled such that the sufficient predictor had a fixed spread when $\delta \neq 0$. Hence the population Poisson regression was fairly strong for larger $\delta$ and rather weak for smaller $\delta$. Table 6 shows that using $\hat{\sigma}_D^2$ controlled the nominal level 0.05 better than using $S_D^2$. As p got larger, the power performance could decrease. See line 8 of Table 6.
Table 6.
Omnibus test for Poisson regression, cov = observed type I error for $\delta = 0$ and power for $\delta > 0$. Boldface for good performance.
Sample R code for the above two tables is shown below.
- source("http://parker.ad.siu.edu/Olive/slpack.txt")
- mlrcovxysim(n=100,p=500,nruns=5000,xtype=3,etype=2,delta=0)
- prcovxysim(n=500,p=100,k=100,nruns=5000,psi=0,delta=0)
4. Discussion
The omnibus test is resistant to model misspecification. For example, (a) the constant variance multiple linear regression model could be assumed when there is heterogeneity, and (b) for count data, a multiple linear regression model, a negative binomial regression model, or a quasi-Poisson regression model may fit the data much better than the count model actually chosen. The test can also be used in low dimensions when the MLE fails to converge.
Based on the simulations and the theory, (a) the omnibus test and one sample test will not have good power against all alternatives unless $\theta$ grows as $p \to \infty$. (b) The omnibus test and one sample test tended to have simulated observed level near the nominal level (control the type I error) if $\hat{\sigma}_D^2$ was used, but the omnibus test could be conservative if n was small, both for multiple linear regression and for Poisson regression in the simulations. Sometimes the variance estimator exploded if p was large or if $H_0$ was false. (c) The omnibus test and one sample test have little outlier resistance. Thus it is important to check for outliers before performing the tests. (d) Both tests worked fairly well in the simulations, and Ref. [14] used similar settings in their simulations for multiple linear regression.
Right tail tests should be used for $H_1: \theta > 0$ since they have more power, but two tail tests are easier to explain and compare. Ref. [14] used a statistic of the form $T_n$ plus an extra term, with an estimator of its null variance. This statistic can also be used for an omnibus test when $H_0$ holds. The extra term was used to increase power and is likely a good idea, but better formulas for the variance of the statistic may be needed.
Ref. [28] has many references for high dimensional one and two sample tests. For classification with two groups, let $\Sigma$ be the pooled covariance matrix. Then $\Sigma^{-1}(\mu_1 - \mu_2) = 0$ if and only if $\mu_1 = \mu_2$, which can be tested with a two sample test. For the importance of $\Sigma^{-1}(\mu_1 - \mu_2)$ in discriminant analysis, see, for example, [34].
Let the “fail to reject region” be the complement of the rejection region. Often the fail to reject region is a confidence region for the parameter or parameter vector of interest, where a confidence interval is a special case of a confidence region. In high dimensions, the length or volume of the fail to reject region does not necessarily converge to 0 as $n \to \infty$, and the volume could diverge to ∞ if p grows quickly enough. For the one sample test, the test based on the fail to reject region for $T_n$ has much more power than a test based on a confidence region for $\mu$.
Simulations were done in R. See [35]. The collection of [30] R functions slpack, available from (http://parker.ad.siu.edu/Olive/slpack.txt, accessed on 28 October 2025), has some useful functions for the inference. The function hdomni does the omnibus test. The relevant R code is shown below.
- hdomni <- function(x, y, alpha=0.05){
- n <- nrow(x)
- k <- n*(n-1)
- xx <- scale(x,scale=F) #centered but not scaled
- v <- xx*c(y-mean(y)) #rows are the hat-w_i
- a <- apply(v,2,sum) #column sums
- Thd <- (t(a)%*%a - sum(v^2))/k #1 by 1 matrix
- Thd <- as.double(Thd) #so the test statistic Thd=Tn is a scalar
- sscp <- v%*%t(v) #matrix of the D_ij = w_i^T w_j
- ss <- sscp - Thd
- ss <- ss^2
- vw1 <- (sum(ss) - sum(diag(ss)))/k #hat sigma_D^2 over i != j
- Vohat <- 2*vw1/k #estimated variance of Tn under H0
- Z <- Thd/sqrt(Vohat)
- pval <- 2*pnorm(-abs(Z)) #two tail pvalue
- rpval <- 1-pnorm(Z) #right tail pvalue
- list(Tn=Thd, Z=Z, pval=pval, rpval=rpval)
- }
The function hdhot1sim3 was used to simulate the five one sample tests, and was used for Table 1 and Table 2. The function hdhot1sim4 added the test using the modified estimator of $\text{tr}(\Sigma^2)$. The function hdhot2sim simulates the two sample tests, which apply the fast paired test on the $d_i = x_i - y_i$ for $i = 1, \dots, \min(n_1, n_2)$, the [25] test, and the two sample [22] test based on subsampling with $m_i < n_i$ for i = 1, 2. See Table 3. Proofs for Theorems 3 and 4 were not given, but are available from preprints of the corresponding published papers from (http://parker.ad.siu.edu/Olive/preprints.htm, accessed on 28 October 2025).
For Table 4, the function nbinroplssimz was used to create negative binomial regression data sets for finite $\kappa$, while the function PRoplssimz was used to create the Poisson regression data sets corresponding to $\kappa = \infty$. The functions without the z do not use the $Z = \log(Y)$ transformation.
For the omnibus test, the function mlrcovxysim was used for multiple linear regression, while the function prcovxysim was used for Poisson regression.
The spatial sign vectors have some outlier resistance. If the predictor variables are all continuous, the covmb2 and ddplot5 functions are useful for detecting outliers in high dimensions. See [30] (section 1.4.3). Ref. [36] gave estimators for the variance of U-statistics.
Author Contributions
Conceptualization, A.M.A., P.A.Q. and D.J.O.; methodology, A.M.A., P.A.Q. and D.J.O.; writing-original draft preparation, D.J.O. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The data sets are available from (http://parker.ad.siu.edu/Olive/sldata.txt, accessed on 28 October 2025).
Acknowledgments
The authors thank the editors and referees for their work.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| CI | confidence interval |
| iid | independent and identically distributed |
| MDPI | Multidisciplinary Digital Publishing Institute |
| MLR | Multiple Linear Regression |
| MMLE | marginal maximum likelihood estimator |
| OLS | ordinary least squares |
| OPLS | one component partial least squares |
| SP | sufficient predictor |
References
- Cook, R.D.; Helland, I.S.; Su, Z. Envelopes and partial least squares regression. J. Roy. Stat. Soc. B 2013, 75, 851–877. [Google Scholar] [CrossRef]
- Basa, J.; Cook, R.D.; Forzani, L.; Marcos, M. Asymptotic distribution of one-component partial least squares regression estimators in high dimensions. Can. J. Stat. 2024, 52, 118–130. [Google Scholar] [CrossRef]
- Cook, R.D.; Forzani, L. Partial Least Squares Regression: And Related Dimension Reduction Methods; Chapman and Hall/CRC: Boca Raton, FL, USA, 2024. [Google Scholar]
- Wold, H. Soft modelling by latent variables: The non-linear partial least squares (NIPALS) approach. J. Appl. Prob. 1975, 12, 117–142. [Google Scholar] [CrossRef]
- Olive, D.J.; Zhang, L. One component partial least squares, high dimensional regression, data splitting, and the multitude of models. Commun. Stat. Theory Methods 2025, 54, 130–145. [Google Scholar] [CrossRef]
- Olive, D.J.; Alshammari, A.A.; Pathiranage, K.G.; Hettige, L.A.W. Testing with the one component partial least squares and the marginal maximum likelihood estimators. Commun. Stat. Theory Methods 2025. [Google Scholar] [CrossRef]
- Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. Roy. Stat. Soc. B 2008, 70, 849–911. [Google Scholar] [CrossRef]
- Fan, J.; Song, R. Sure independence screening in generalized linear models with np-dimensionality. Ann. Stat. 2010, 38, 3567–3604. [Google Scholar] [CrossRef]
- Agresti, A. Categorical Data Analysis, 2nd ed.; Wiley: Hoboken, NJ, USA, 2002. [Google Scholar]
- Li, K.C.; Duan, N. Regression analysis under link violation. Ann. Stat. 1989, 17, 1009–1052. [Google Scholar] [CrossRef]
- Chen, C.H.; Li, K.C. Can SIR be as popular as multiple linear regression? Stat. Sinica 1998, 8, 289–316. [Google Scholar]
- Cameron, A.C.; Trivedi, P.K. Regression Analysis of Count Data; Cambridge University Press: Cambridge, UK, 1998. [Google Scholar]
- Haggstrom, G.W. Logistic regression and discriminant analysis by ordinary least squares. J. Bus. Econ. Stat. 1983, 1, 229–238. [Google Scholar] [CrossRef]
- Zhao, A.; Li, C.; Li, R.; Zhang, Z. Testing high-dimensional regression coefficients in linear models. Ann. Stat. 2024, 52, 2034–2058. [Google Scholar] [CrossRef]
- Cui, H.; Guo, W.; Zhong, W. Test for high-dimensional regression coefficients using refitted cross-validation variance estimation. Ann. Stat. 2018, 46, 958–988. [Google Scholar] [CrossRef]
- Goeman, J.J.; van de Geer, S.A.; van Houwelingen, H.C. Testing against a high dimensional alternative. J. R. Stat. Soc. B 2006, 68, 477–493. [Google Scholar] [CrossRef]
- Lan, W.; Wang, H.; Tsai, C.-L. Testing covariates in high-dimensional regression. Ann. Inst. Statist. Math. 2014, 66, 279–301. [Google Scholar] [CrossRef]
- Zhong, P.-S.; Chen, S.X. Tests for high dimensional regression coefficients with factorial designs. J. Amer. Stat. Assoc. 2011, 106, 260–274. [Google Scholar] [CrossRef]
- Helland, I.S. Partial least squares regression and statistical models. Scand. J. Stat. 1990, 17, 97–114. [Google Scholar]
- Park, J.; Ayyala, D.N. A test for the mean vector in large dimension and small samples. J. Stat. Plan. Inf. 2013, 143, 929–943. [Google Scholar] [CrossRef]
- Lehmann, E.L. Nonparametrics: Statistical Methods Based on Ranks; Holden-Day: San Francisco, CA, USA, 1975. [Google Scholar]
- Chen, S.X.; Qin, Y.L. A two sample test for high-dimensional data with applications to gene-set testing. Ann. Stat. 2010, 38, 808–835. [Google Scholar] [CrossRef]
- Srivastava, M.S.; Du, M. A test for the mean vector with fewer observations than the dimension. J. Mult. Anal. 2008, 99, 386–402. [Google Scholar] [CrossRef]
- Bai, Z.D.; Saranadasa, H. Effects of high dimension: By an example of a two sample problem. Stat. Sinica 1996, 6, 311–329. [Google Scholar]
- Li, J. Finite sample t-tests for high-dimensional means. J. Mult. Anal. 2023, 196, 105183. [Google Scholar] [CrossRef] [PubMed]
- Wang, L.; Peng, B.; Li, R. A high-dimensional nonparametric multivariate test for mean vector. J. Am. Stat. Assoc. 2015, 110, 1658–1669. [Google Scholar] [CrossRef]
- Anderson, T.W. An Introduction to Multivariate Statistical Analysis, 2nd ed.; Wiley: New York, NY, USA, 1984. [Google Scholar]
- Abid, A.M. Some Simple High Dimensional One and Two Sample Tests. Ph.D. Thesis, Southern Illinois University, Carbondale, IL, USA, 2025. Available online: http://parker.ad.siu.edu/Olive/sAhlam.pdf (accessed on 28 October 2025).
- Olive, D.J. Some useful techniques for high dimensional statistics. Stats 2025, 8, 60. [Google Scholar] [CrossRef]
- Olive, D.J. Prediction and Statistical Learning, Online Course Notes. 2025. Available online: http://parker.ad.siu.edu/Olive/slearnbk.htm (accessed on 28 October 2025).
- Hebbler, B. Statistics of Prussia. J. Roy. Stat. Soc. A 1847, 10, 154–186. [Google Scholar] [CrossRef]
- Wisseman, S.U.; Hopke, P.K.; Schindler-Kaudelka, E. Multielemental and multivariate analysis of Italian terra sigillata in the world heritage museum, university of Illinois at Urbana-Champaign. Archeomaterials 1987, 1, 101–107. [Google Scholar]
- Pötscher, B.M.; Preinerstorfer, D. How reliable are bootstrap-based heteroskedasticity robust tests? Econ. Theory 2023, 39, 789–847. [Google Scholar] [CrossRef]
- Wang, Y.; Wu, Z.; Wang, C. High dimensional discriminant analysis under weak sparsity. Commun. Stat. Theory Methods 2025, 54, 2657–2674. [Google Scholar] [CrossRef]
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2024. [Google Scholar]
- Xu, T.; Zhu, R.; Shao, X. On variance estimation of random forests with infinite-order U-statistics. Electr. J. Stat. 2024, 18, 2135–2207. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).