Abstract
Consider regression models where the response variable Y only depends on the vector of predictors $x$ through the sufficient predictor $SP = \alpha + \beta^T x$. Let the covariance vector $\Sigma_{xY} = \text{Cov}(x, Y)$. Assume the cases $(x_i^T, Y_i)^T$ are independent and identically distributed random vectors for $i = 1, \dots, n$. Then for many such regression models, $\Sigma_{xY} = 0$ if and only if $\beta = 0$ where 0 is the vector of zeroes. The test of $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ is equivalent to the high dimensional one sample test $H_0: \mu_w = 0$ versus $H_1: \mu_w \neq 0$ applied to $w_i = (x_i - \mu_x)(Y_i - \mu_Y)$ where $\mu_w = E(w_i) = \Sigma_{xY}$ and the expected values $\mu_x = E(x)$ and $\mu_Y = E(Y)$. Since $\mu_x$ and $\mu_Y$ are unknown, the test of $H_0: \Sigma_{xY} = 0$ versus $H_1: \Sigma_{xY} \neq 0$ is implemented by applying the one sample test to $\hat{w}_i = (x_i - \overline{x})(Y_i - \overline{Y})$ for $i = 1, \dots, n$. This test has milder regularity conditions than its few competitors. For the multiple linear regression one component partial least squares and marginal maximum likelihood estimators, the test can be adapted to test $H_0: \beta_O = 0$ versus $H_1: \beta_O \neq 0$ where $\beta_O$ corresponds to a subset O of the predictors.
1. Introduction
This section reviews regression models where the response variable Y depends on the $p \times 1$ vector of predictors $x$ only through the sufficient predictor $SP = \alpha + \beta^T x$. Then there are n cases $(x_i^T, Y_i)^T$ for $i = 1, \dots, n$. For the regression models, the conditioning and subscripts, such as i, will often be suppressed. This paper gives a high dimensional test for $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ where 0 is the $p \times 1$ vector of zeroes.
A useful multiple linear regression (MLR) model is
$$Y_i = \alpha + x_{i1}\beta_1 + \dots + x_{ip}\beta_p + e_i = \alpha + x_i^T\beta + e_i \quad (1)$$
for $i = 1, \dots, n$. Assume that the $e_i$ are independent and identically distributed (iid) with expected value $E(e_i) = 0$ and variance $V(e_i) = \sigma^2$. In matrix form, this model is $Y = X\phi + e$, where $Y$ is an $n \times 1$ vector of dependent variables, $X$ is an $n \times (p+1)$ matrix with ith row $(1, x_i^T)$, $\phi = (\alpha, \beta^T)^T$ is a $(p+1) \times 1$ vector, and e is an $n \times 1$ vector of unknown errors. Also $E(e) = 0$ and $\text{Cov}(e) = \sigma^2 I_n$ where $I_n$ is the $n \times n$ identity matrix.
For a multiple linear regression model with heterogeneity, assume model (1) holds with $E(e) = 0$ and $\text{Cov}(e) = \Sigma_e$, an $n \times n$ positive definite matrix. Under regularity conditions, the ordinary least squares (OLS) estimator can be shown to be a consistent estimator of $\beta$.
For estimation with ordinary least squares, let the covariance matrix of x be $\text{Cov}(x) = \Sigma_x = E[(x - \mu_x)(x - \mu_x)^T]$ and the vector $\Sigma_{xY} = \text{Cov}(x, Y) = E[(x - \mu_x)(Y - \mu_Y)]$. Let
$$\hat{\Sigma}_x = \frac{1}{n-1}\sum_{i=1}^n (x_i - \overline{x})(x_i - \overline{x})^T \quad \text{and} \quad \hat{\Sigma}_{xY} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \overline{x})(Y_i - \overline{Y}).$$
For a multiple linear regression model with iid cases, $\hat{\beta}_{OLS} = \hat{\Sigma}_x^{-1}\hat{\Sigma}_{xY}$ is a consistent estimator of $\beta_{OLS} = \Sigma_x^{-1}\Sigma_{xY}$ under mild regularity conditions, while $\hat{\Sigma}_{xY}$ is a consistent estimator of $\Sigma_{xY}$.
Ref. [1] showed that the one component partial least squares (OPLS) estimator $\hat{\beta}_{OPLS} = \hat{\lambda}\hat{\Sigma}_{xY}$ estimates $\beta_{OPLS} = \lambda\Sigma_{xY}$ where
$$\lambda = \frac{\Sigma_{xY}^T\Sigma_{xY}}{\Sigma_{xY}^T\Sigma_x\Sigma_{xY}} \quad \text{and} \quad \hat{\lambda} = \frac{\hat{\Sigma}_{xY}^T\hat{\Sigma}_{xY}}{\hat{\Sigma}_{xY}^T\hat{\Sigma}_x\hat{\Sigma}_{xY}}$$
for $\Sigma_{xY} \neq 0$. If $\Sigma_{xY} = 0$, then $\beta_{OPLS} = 0$. Also see [2,3,4]. Ref. [5] derived the large sample theory for $\hat{\Sigma}_{xY}$ and OPLS under milder regularity conditions than those in the previous literature. Ref. [6] showed that for iid cases, these results still hold for multiple linear regression models with heterogeneity.
The marginal maximum likelihood estimator (MMLE or marginal least squares estimator) is due to [7,8]. This estimator computes the marginal regression of Y on $x_j$, such as Poisson regression, resulting in the estimator $(\hat{\alpha}_{jM}, \hat{\beta}_{jM})$ for $j = 1, \dots, p$. Then $\hat{\beta}_{MMLE} = (\hat{\beta}_{1M}, \dots, \hat{\beta}_{pM})^T$. For multiple linear regression, the marginal estimators are the simple linear regression (SLR) estimators. Hence $\hat{\beta}_{MMLE} = [\text{diag}(\hat{\Sigma}_x)]^{-1}\hat{\Sigma}_{xY}$. If the $x_j$ are the predictors that are scaled or standardized to have unit sample variances, then $\hat{\beta}_{MMLE} = \hat{\Sigma}_{xY} = (\hat{\beta}_{SLR}(Y \sim x_1), \dots, \hat{\beta}_{SLR}(Y \sim x_p))^T$ where $Y \sim t$ denotes that Y was regressed on t.
where denotes that Y was regressed on t. Ref. [6] derived large sample theory for the MMLE for multiple linear regression models, including models with heterogeneity.
For Poisson regression and related models, the response variable Y is a nonnegative count variable. A useful Poisson regression (PR) model is $Y|SP \sim \text{Poisson}(\exp(SP))$. This model has $E(Y|SP) = V(Y|SP) = \exp(SP)$. The quasi-Poisson regression model has $E(Y|SP) = \exp(SP)$ and $V(Y|SP) = \phi\exp(SP)$ where the dispersion parameter $\phi > 0$. Note that this model and the Poisson regression model have the same conditional mean function, and the conditional variance functions are the same if $\phi = 1$.
Some notation is needed for the negative binomial regression model. If Y has a (generalized) negative binomial distribution, $Y \sim NB(\mu, \kappa)$, then the probability mass function (pmf) of Y is
$$P(Y = y) = \frac{\Gamma(y + \kappa)}{\Gamma(\kappa)\,\Gamma(y + 1)}\left(\frac{\kappa}{\mu + \kappa}\right)^{\kappa}\left(\frac{\mu}{\mu + \kappa}\right)^{y}$$
for $y = 0, 1, 2, \dots$ where $\mu > 0$ and $\kappa > 0$. Then $E(Y) = \mu$ and $V(Y) = \mu + \mu^2/\kappa$.
The negative binomial regression model states that $Y_1, \dots, Y_n$ are independent random variables with
$$Y_i | SP_i \sim NB(\exp(SP_i), \kappa).$$
This model has $E(Y_i|SP_i) = \exp(SP_i)$ and $V(Y_i|SP_i) = \exp(SP_i)\left(1 + \frac{\exp(SP_i)}{\kappa}\right)$.
Following Ref. [9] (p. 560), as $\kappa \to \infty$, it can be shown that the negative binomial regression model converges to the Poisson regression model.
Let the log transformation $Z = \log(Y)$ if $Y \geq 1$ and $Z = 0$ if $Y = 0$. This transformation often results in a linear model with heterogeneity:
$$Z_i = \alpha + x_i^T\beta + e_i,$$
where the $e_i$ are independent with expected value 0 and variance $\sigma_i^2$. For Poisson regression, the minimum chi-square estimator is the weighted least squares estimator from the regression of $Z_i$ on $x_i$ with weights $w_i = Y_i$. See [9] (pp. 611–612).
If the regression model for Y depends on x only through $SP = \alpha + \beta^T x$, and if the predictors are iid from a large class of elliptically contoured distributions, then [10,11] showed that, under regularity conditions, $\beta_{OLS} = c\beta$ for some constant c. Hence $\Sigma_{xY} = \Sigma_x\beta_{OLS} = c\Sigma_x\beta$. Thus $\Sigma_{xY} = cd\beta$ if $\Sigma_x = dI_p$ where $d > 0$ and $I_p$ is the $p \times p$ identity matrix. If $c \neq 0$ in this case, then $\Sigma_{xY} = 0$ implies that $\beta = 0$. The constant c is typically nonzero unless the model has a lot of symmetry about the distribution of SP. Simulation with $\beta \neq 0$ can be difficult if the population values of c and d are unknown. Results from [12] (p. 89) suggest a rough approximation for c for the Poisson regression model. Results from [13] suggest that for binary logistic regression, a rough approximation is $\beta \approx \hat{\beta}_{OLS}/MSE$ where MSE is the mean square error from the OLS regression.
Ref. [14] has an interesting result for the multiple linear regression model (1). Assume that the cases $(x_i^T, Y_i)^T$ are iid with appropriate moments and nonsingular $\Sigma_x$. Let $w_i = (x_i - \mu_x)(Y_i - \mu_Y)$. Then testing $H_0: \beta_{OLS} = 0$ versus $H_1: \beta_{OLS} \neq 0$ is equivalent to testing $H_0: \mu_w = 0$ versus $H_1: \mu_w \neq 0$ with $\mu_w = E(w_i) = \Sigma_{xY}$ where $\beta_{OLS} = \Sigma_x^{-1}\Sigma_{xY}$, and a one sample test can be applied to the $\hat{w}_i = (x_i - \overline{x})(Y_i - \overline{Y})$.
Ref. [14] notes that there are only a few high dimensional analogs of the low dimensional multiple linear regression F-test for $H_0: \beta = 0$ versus $H_1: \beta \neq 0$. See [15,16,17,18]. The assumptions on the predictors in these four papers are very strong.
This paper uses the above test for $H_0: \mu_w = 0$, which is equivalent to a test for $H_0: \Sigma_{xY} = 0$. The resulting test is not limited to OLS for multiple linear regression with iid errors. As shown below and in the following paragraph, the test can be used for multiple linear regression when heterogeneity is present, and the test can also be used for many regression models that depend on the predictors only through $SP = \alpha + \beta^T x$. Suppose $\beta_D = D\Sigma_{xY}$ where D is a positive definite matrix. Then $\beta_D = 0$ if and only if $\Sigma_{xY} = 0$. Then $D = \lambda I_p$ for OPLS, $D = \Sigma_x^{-1}$ for OLS, and $D = [\text{diag}(\Sigma_x)]^{-1}$ for the MMLE. The k-component partial least squares estimator can be found by regressing Y on a constant and on $x^T\eta_j$ for $j = 1, \dots, k$ where $\eta_j = \Sigma_x^{j-1}\Sigma_{xY}$. See [19]. Hence $\beta_{kPLS} = 0$ if $\Sigma_{xY} = 0$. Thus if the cases are iid, then using the $\hat{w}_i$ gives tests for $\Sigma_{xY} = 0$, $\beta_{OPLS} = 0$, $\beta_{OLS} = 0$, $\beta_{MMLE} = 0$, and $\beta_{kPLS} = 0$. For multiple linear regression with heterogeneity, $\hat{\Sigma}_{xY}$ is still a consistent estimator of $\Sigma_{xY}$. Hence the test can be used when the constant variance assumption is violated.
Under iid cases with $\beta = 0$, if the response variables depend on the $x_i$ only through $SP = \alpha + \beta^T x_i = \alpha$, then the $Y_i$ are iid and do not depend on x, and thus satisfy a multiple linear regression model with $\beta = 0$. For a parametric regression, such as a generalized linear model, assume $Y|SP \sim D(m(SP), \gamma)$ where D is the parametric distribution and m is a real valued function. For example, D could be the negative binomial distribution with $\mu = m(SP) = \exp(SP)$ and parameter $\kappa$. If $\beta = 0$, then the $Y_i$ are iid $\sim D(m(\alpha), \gamma)$. Typically, if $\beta \neq 0$, then $\Sigma_{xY} \neq 0$, and the test can have good power. An exception is when there is a lot of symmetry, which rarely occurs with real data. For example, suppose $Y = m(\beta^T x) + e$ where the iid errors are independent of the predictors, $x \sim N_p(0, I_p)$, and the function m is symmetric about 0, e.g., $m(t) = t^2$. Then $\Sigma_{xY} = 0$ even if $\beta \neq 0$.
If $H_0: \Sigma_{xY} = 0$ is true, then $\mu_w = 0$, and the $w_i$ have expected value 0. Then apply a high dimensional one sample test on the $\hat{w}_i$. Note that the sample mean
$$\overline{\hat{w}} = \frac{1}{n}\sum_{i=1}^n \hat{w}_i = \frac{n-1}{n}\,\hat{\Sigma}_{xY}.$$
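The identity above is easy to verify numerically. The following R sketch (with arbitrary simulated data, so all names are illustrative) forms the $\hat{w}_i$ and checks that their sample mean equals $(n-1)\hat{\Sigma}_{xY}/n$.
- n <- 100; p <- 10
- x <- matrix(rnorm(n*p), n, p); y <- rnorm(n)
- w <- sweep(x, 2, colMeans(x)) * (y - mean(y)) #rows are the hat-w_i
- max(abs(colMeans(w) - (n-1)*as.vector(cov(x,y))/n)) #approximately 0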
Section 2.1 reviews and derives some results for the one sample test that will be used. Section 2.2 reviews some two sample tests. Section 2.3 gives theory for the test given in the above paragraph.
2. Materials and Methods
2.1. A High Dimensional One Sample Test
This section reviews and derives some results for the one sample test that will be used. Suppose $x_1, \dots, x_n$ are iid random vectors with $E(x_i) = \mu$ and covariance matrix $\text{Cov}(x_i) = \Sigma$. Then the test $H_0: \mu = 0$ versus $H_1: \mu \neq 0$ is equivalent to the test $H_0: \theta = 0$ versus $H_1: \theta > 0$ where $\theta = \mu^T\mu = \|\mu\|^2$. A U-statistic for estimating $\theta$ is
$$T_n = \frac{1}{n(n-1)}\sum_{i \neq j} x_i^T x_j = \overline{x}^T\overline{x} - \frac{\text{tr}(S_n)}{n} \quad (7)$$
where $S_n$ is the sample covariance matrix and tr() is the trace function. See, for example, [20].
To see that the last equality holds, note that
$$\text{tr}(S_n) = \frac{1}{n-1}\left[\sum_{i=1}^n x_i^T x_i - n\,\overline{x}^T\overline{x}\right].$$
Now
$$\sum_{i \neq j} x_i^T x_j = \left(\sum_{i=1}^n x_i\right)^T\left(\sum_{j=1}^n x_j\right) - \sum_{i=1}^n x_i^T x_i = n^2\,\overline{x}^T\overline{x} - \sum_{i=1}^n x_i^T x_i.$$
Thus
$$T_n = \frac{n}{n-1}\,\overline{x}^T\overline{x} - \frac{1}{n(n-1)}\sum_{i=1}^n x_i^T x_i.$$
Thus
$$T_n = \overline{x}^T\overline{x} - \frac{1}{n}\,\text{tr}(S_n).$$
Next, we derive a simple test. Let the variance $V(x_i^T x_j) = \sigma_D^2$ for $i \neq j$. Let $m = \lfloor n/2 \rfloor$ be the integer part of $n/2$. So $\lfloor 100/2 \rfloor = \lfloor 101/2 \rfloor = 50$. Let the iid random variables $D_i = x_{2i-1}^T x_{2i}$ for $i = 1, \dots, m$. Note that $E(D_i) = \theta$ and $V(D_i) = \sigma_D^2$. Let $S_D^2$ be the sample variance of the $D_i$:
$$S_D^2 = \frac{1}{m-1}\sum_{i=1}^m (D_i - \overline{D})^2.$$
The following new theorem follows from the univariate central limit theorem.
Theorem 1.
Assume $x_1, \dots, x_n$ are iid, $E(x_i) = \mu$, and the variance $V(x_i^T x_j) = \sigma_D^2$ for $i \neq j$. Let $\overline{D}$ and $S_D^2$ be defined as above. Then
(a) $$\sqrt{m}\,\frac{\overline{D} - \theta}{S_D} \xrightarrow{D} N(0, 1)$$
as $m \to \infty$.
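A minimal R sketch of the resulting test, assuming the simple one sided rejection rule reject $H_0$ if $\sqrt{m}\,\overline{D}/S_D > z_{1-\alpha}$ (the helper name simpleD is ours):
- simpleD <- function(x){ #Theorem 1 test of H0: theta = 0
- n <- nrow(x); m <- floor(n/2)
- D <- rowSums(x[seq(1, 2*m, by=2), , drop=FALSE] * x[seq(2, 2*m, by=2), , drop=FALSE]) #D_i = x_{2i-1}^T x_{2i}
- Z <- sqrt(m)*mean(D)/sd(D) #approximately N(0,1) under H0
- c(stat = Z, rpval = 1 - pnorm(Z)) #right tailed since theta > 0 under H1
- }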
The following theorem derives the variance under simpler regularity conditions than those in the literature, and the new proof of the theorem is also simpler.
Theorem 2.
Assume $x_1, \dots, x_n$ are iid, $E(x_i) = \mu$, and the variance $V(x_i^T x_j) = \sigma_D^2$ for $i \neq j$. Let $D_{ij} = x_i^T x_j$ for $i \neq j$. Let $[n(n-1)]^2 = a + b + c$ where $a = 4n(n-1)(n-2)$, $b = n(n-1)(n-2)(n-3)$, and $c = 2n(n-1)$. Then
(a) $$V(T_n) = \frac{2\sigma_D^2}{n(n-1)} + \frac{4(n-2)}{n(n-1)}\,\mu^T\Sigma\mu.$$
(b) If $H_0: \mu = 0$ is true, then $\mu^T\Sigma\mu = 0$ and
$$V(T_n) = \frac{2\sigma_D^2}{n(n-1)}.$$
Proof.
(a) To find the variance with $T_n$ from Equation (7), let $D_{ij} = x_i^T x_j$, and note that
$$T_n = \frac{1}{n(n-1)}\sum_{i \neq j} D_{ij}.$$
Then
$$V(T_n) = \frac{1}{[n(n-1)]^2}\sum_{i \neq j}\sum_{k \neq l}\text{Cov}(D_{ij}, D_{kl}). \quad (8)$$
Let $\theta = \mu^T\mu$. The covariances are of 3 types. First, if $\{i, j\} = \{k, l\}$ with $i \neq j$, then $\text{Cov}(D_{ij}, D_{kl}) = V(D_{ij}) = \sigma_D^2$. Second, if $i, j, k, l$ are distinct with $i \neq j$ and $k \neq l$, then $D_{ij}$ and $D_{kl}$ are independent with $\text{Cov}(D_{ij}, D_{kl}) = 0$. Third, there are terms where exactly three of the four subscripts are distinct, which have $\text{Cov}(D_{ij}, D_{ik})$ where $i \neq j$, $i \neq k$, and $j \neq k$, or the analogous terms with the common subscript in the other positions. These covariance terms are all equal to the same number since
$$\text{Cov}(D_{ij}, D_{ik}) = E[(x_j^T x_i)(x_i^T x_k)] - \theta^2 = \mu^T(\Sigma + \mu\mu^T)\mu - (\mu^T\mu)^2 = \mu^T\Sigma\mu.$$
The number of ways to get three distinct subscripts is
$$a = [n(n-1)]^2 - b - c = 4n(n-1)(n-2)$$
since a is the number of terms on the right hand side of (8) with exactly three distinct subscripts, b is the number of terms where $i, j, k, l$ are distinct with $i \neq j$ and $k \neq l$, and c is the number of terms where $\{i, j\} = \{k, l\}$ with $i \neq j$. [Note that $n(n-1)$ terms have i and j distinct. Half of these terms have $i < j$ and half have $i > j$. Similarly, $n(n-1)$ terms have $k, l$ distinct, and half of the terms have $k < l$, while half of the terms have $k > l$.] Thus
$$V(T_n) = \frac{c\,\sigma_D^2 + a\,\mu^T\Sigma\mu + b \cdot 0}{[n(n-1)]^2}.$$
This calculation was adapted from [21] (pp. 336–337). Thus
$$V(T_n) = \frac{2\sigma_D^2}{n(n-1)} + \frac{4(n-2)}{n(n-1)}\,\mu^T\Sigma\mu.$$
(b) Now $D_{ij} = x_i^T x_j$ where $x_i$ and $x_j$ are iid. Hence the result follows from (a): under $H_0$, $\mu = 0$ and thus $\mu^T\Sigma\mu = 0$. □
Note that $T_n$ is the sample mean of the $n(n-1)$ distinct, identically distributed $D_{ij}$ for $i \neq j$. When $\mu = 0$, Theorem 2 proves that the $D_{ij}$ are uncorrelated. Hence when $H_0$ is true, $T_n$ satisfies $V(T_n) = 2\sigma_D^2/[n(n-1)]$ (Theorem 2b). Ref. [14] (p. 2024) showed that $\sigma_D^2 = \text{tr}(\Sigma^2) + 2\mu^T\Sigma\mu$. Plugging this value into $V(T_n)$ (Theorem 2a) gives the [22] result
$$V(T_n) = \frac{2\,\text{tr}(\Sigma^2)}{n(n-1)} + \frac{4}{n}\,\mu^T\Sigma\mu.$$
Note that $\gamma = \mu^T\Sigma\mu$ can be consistently estimated as follows. Let $m = \lfloor n/3 \rfloor$. Let $D_1 = x_1^T x_2$, $E_1 = x_1^T x_3$, $D_2 = x_4^T x_5$, $E_2 = x_4^T x_6$, …, $D_m = x_{3m-2}^T x_{3m-1}$, $E_m = x_{3m-2}^T x_{3m}$. Then $\hat{\gamma}$ is the sample covariance of the $(D_i, E_i)$ where $i = 1, \dots, m$. Note that a consistent estimator of $V(T_n)$ is then $\frac{2S_D^2}{n(n-1)} + \frac{4(n-2)}{n(n-1)}\hat{\gamma}$.
Let $\hat{\sigma}_D^2$ and $\hat{V}(T_n) = 2\hat{\sigma}_D^2/[n(n-1)]$ be consistent estimators of $\sigma_D^2$ and $V(T_n)$, respectively. Then ref. [22,23,24,25], and others proved that under mild regularity conditions when $H_0$ is true,
$$Z_n = \frac{T_n}{\sqrt{\hat{V}(T_n)}} \xrightarrow{D} N(0, 1).$$
Under regularity conditions when $H_0$ is true, ref. [25] proved that a t distribution approximation for the test statistic holds as $p \to \infty$ for fixed n, where the degrees of freedom depend on n.
A consistent estimator of $V(T_n)$ needs a consistent estimator of $\sigma_D^2$. Let $D_{ij} = x_i^T x_j$ for $i \neq j$. Then one estimator is $S_D^2$ from Theorem 1. An estimator nearly the same as the one used by [25] is
$$\hat{\sigma}_D^2 = \frac{1}{n(n-1)}\sum_{i \neq j}(D_{ij} - T_n)^2.$$
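The following R sketch computes $T_n$, $\hat{\sigma}_D^2$, and $Z_n$ for a one sample problem; it mirrors the hdomni listing in the Discussion, and the helper name onesamp is ours.
- onesamp <- function(x){ #one sample test of H0: mu = 0 using T_n
- n <- nrow(x); k <- n*(n-1)
- a <- colSums(x)
- Tn <- (sum(a^2) - sum(x^2))/k #T_n from (7)
- D <- tcrossprod(x) #D_ij = x_i^T x_j
- s2 <- (sum((D - Tn)^2) - sum((diag(D) - Tn)^2))/k #hat sigma_D^2 over i != j
- Z <- Tn/sqrt(2*s2/k) #approximately N(0,1) under H0 by Theorem 2(b)
- c(Tn = Tn, Z = Z, rpval = 1 - pnorm(Z))
- }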
Note that $\sigma_D$ can be proportional to p since $\sigma_D$ is the standard deviation of $D_{ij} = x_i^T x_j = \sum_{k=1}^p x_{ik}x_{jk}$, a sum of p random variables. Thus to have good asymptotic power against all alternatives, we likely need $\theta = \mu^T\mu \to \infty$ as $p \to \infty$. When $H_0$ is false, $T_n$ tends to have more power than $\overline{D}$ since $V(\overline{D}) \approx (n-1)V(T_n)$. Suppose $\mu = \delta 1$ where the constant $\delta \neq 0$ and 1 is the $p \times 1$ vector of ones. Then $\theta = p\delta^2$, and the test using $T_n$ may have good power for small $|\delta|$ or for large p.
For computing $\hat{V}(T_n)$, a question is whether to use an estimator of $\sigma_D^2$ or of $\text{tr}(\Sigma^2)$. Let the $(i, j)$th element of $\Sigma$ be $\sigma_{ij}$ with $\sigma_{ii} = \sigma_i^2$. Let $\|\Sigma\|_F$ be the Frobenius norm of $\Sigma$, and $\|a\|$ be the Euclidean norm of vector a. Let $\text{vec}(\Sigma)$ be the vector formed by stacking the columns of $\Sigma$ into a $p^2 \times 1$ vector. Then $\text{tr}(\Sigma^2) = \|\Sigma\|_F^2 = \|\text{vec}(\Sigma)\|^2$. There is a level-power tradeoff. Using $\hat{\sigma}_D^2$ is good for controlling the level = P(type I error) when $H_0$ is true. Since $\sigma_D^2 = \text{tr}(\Sigma^2) + 2\mu^T\Sigma\mu$, the parameter $\text{tr}(\Sigma^2)$ can be much smaller than $\sigma_D^2$, and using a good estimator of $\text{tr}(\Sigma^2)$ may result in better power.
In high dimensions, it is often very difficult to estimate a $p \times 1$ vector such as $\mu$ when $p > n$. This result is a form of “the curse of dimensionality.” If a consistent estimator $\hat{\mu}$ of $\mu$ is available, then the squared norm $\|\hat{\mu}\|^2$ can be used to estimate $\theta$. Hence estimators that use many parameters, such as plug in estimators of $\text{tr}(\Sigma^2)$, are likely to be poor. The two parameter estimator of $\text{tr}(\Sigma^2)$ likely has more variability than $\hat{\sigma}_D^2$ when $H_0$ is true, and better estimators of $\text{tr}(\Sigma^2)$ are needed. In simulations, the $\text{tr}(\Sigma^2)$ estimator was often negative. Let the modified estimator equal the $\text{tr}(\Sigma^2)$ estimator if that estimator is positive, and equal $\hat{\sigma}_D^2$ otherwise. In limited simulations, this modified estimator did about as well as $\hat{\sigma}_D^2$. Obtaining an estimator of $\text{tr}(\Sigma^2)$ that clearly outperforms $\hat{\sigma}_D^2$ would improve the omnibus test, but is beyond the scope of this paper.
We also considered replacing the $x_i$ by the $u_i = S(x_i)$ where the spatial sign function $S(x) = x/\|x\|$ if $x \neq 0$, and $S(0) = 0$ otherwise. This function projects the nonzero $x_i$ onto the unit p-dimensional hypersphere centered at 0. Let $T_n(S)$ denote the statistic $T_n$ computed from an iid sample $u_1, \dots, u_n$. Since the $u_i$ are iid if the $x_i$ are iid, use $T_n(S)$ to test $H_0: \mu_u = 0$ versus $H_1: \mu_u \neq 0$ where $\mu_u = E[S(x)]$. In general, $\mu_u \neq \mu$, but $\mu_u = 0 = \mu$ can occur if the $x_i$ have a lot of symmetry about 0. In particular, $\mu_u = 0$ if the $x_i$ are iid from an elliptically contoured distribution with $\mu = 0$. The test based on the statistic $T_n(S)$ can be useful if the first or second moments of the $x_i$ do not exist, for example if the $x_i$ are iid from a multivariate Cauchy distribution. These results may be useful for understanding papers such as [26].
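A minimal R sketch of the rowwise spatial sign transform, to be combined with a one sample test such as the onesamp sketch above:
- spatialsign <- function(x){ #projects nonzero rows onto the unit hypersphere
- d <- sqrt(rowSums(x^2))
- d[d == 0] <- 1 #S(0) = 0: dividing a zero row by 1 keeps it zero
- x/d #R recycles d down the columns, scaling row i by 1/d[i]
- }
- #Example: onesamp(spatialsign(x)) tests H0: E[S(x)] = 0.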
The nonparametric bootstrap draws a bootstrap data set $x_1^*, \dots, x_n^*$ with replacement from the $x_i$ and computes $T_n^*$ by applying $T_n$ on the bootstrap data set. This process is repeated B times to get a bootstrap sample $T_{n,1}^*, \dots, T_{n,B}^*$. For the statistic $T_n$, the nonparametric bootstrap fails in high dimensions because terms like $x_i^T x_i$ need to be avoided, and the nonparametric bootstrap has replicates: the proportion of cases in the bootstrap sample that are not replicates is about 0.632. The m out of n bootstrap draws a sample of size m without replacement from the n cases, and worked well in simulations. Sampling without replacement is also known as subsampling and the delete d jackknife where $d = n - m$.
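A minimal sketch of the m out of n bootstrap for $T_n$, reusing the onesamp helper from above; the default choice m = ⌊n/2⌋ is an illustrative assumption since the value of m used in the simulations is not restated here.
- moutn <- function(x, B = 1000, m = floor(nrow(x)/2)){
- n <- nrow(x)
- replicate(B, onesamp(x[sample(n, m), , drop=FALSE])["Tn"]) #no replicated cases
- }
- #A shorth-type interval from the B statistics then gives the bootstrap "test" as in [30].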
2.2. Three High Dimensional Two Sample Tests
If $(x_i^T, y_i^T)^T$ come in correlated pairs, a high dimensional analog of the paired t test applies the one sample test on the $d_i = x_i - y_i$.
Now suppose there are two independent random samples $x_1, \dots, x_{n_1}$ and $y_1, \dots, y_{n_2}$ from two populations or groups, and that it is desired to test $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$ where $\mu_1$ and $\mu_2$ are $p \times 1$ vectors. Let $\mu_d = \mu_1 - \mu_2$. Let $S_i$ be the sample covariance matrix of sample i and let $\Sigma_i$ be the population covariance matrix of sample i for $i = 1, 2$.
A simple test takes $n = \min(n_1, n_2)$ and $d_i = x_i - y_i$ for $i = 1, \dots, n$. Then apply the one sample test from Theorem 2 to the $d_i$. This paired test might work well in high dimensions because of the superior power of the Theorem 2 test, but in low dimensions, it is known that there are better tests.
Let the $y_i$ be the sample that has $n_2 = \max(n_1, n_2)$ cases. Then let
$$d_i = x_i - \sqrt{\frac{n_1}{n_2}}\,y_i + \frac{1}{\sqrt{n_1 n_2}}\sum_{j=1}^{n_1} y_j - \frac{1}{n_2}\sum_{j=1}^{n_2} y_j$$
for $i = 1, \dots, n_1$. Note that $d_i = x_i - y_i$ if $n_1 = n_2$. Ref. [27] (pp. 177–178) proved that $E(d_i) = \mu_1 - \mu_2$, that $d_i$ and $d_j$ are uncorrelated for $i \neq j$, that $\text{Cov}(d_i) = \Sigma_1 + \frac{n_1}{n_2}\Sigma_2$, and that the $d_i$ are iid for normal samples. Ref. [25] showed that the one sample test can be applied to these $d_i$, where the subscript d denotes that the one sample test was computed using the $d_i$.
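A minimal R sketch of this transformation (the helper name andersonD is ours); the output rows are the $d_i$, to which a one sample test can be applied.
- andersonD <- function(x, y){ #x: n1 x p, y: n2 x p with n1 <= n2
- n1 <- nrow(x); n2 <- nrow(y); r <- sqrt(n1/n2)
- shift <- colSums(y[1:n1, , drop=FALSE])/sqrt(n1*n2) - colMeans(y)
- t(t(x - r*y[1:n1, , drop=FALSE]) + shift) #adds shift to each row
- }
- #Example: onesamp(andersonD(x, y)) tests H0: mu_1 = mu_2.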
Note that $H_0: \mu_1 = \mu_2$ holds if and only if
$$\|\mu_1 - \mu_2\|^2 = \mu_1^T\mu_1 + \mu_2^T\mu_2 - 2\mu_1^T\mu_2 = 0.$$
These terms can be estimated by $T_{n,1}$ and $T_{n,2}$, where $T_{n,i}$ is the one sample test statistic applied to sample i, and
$$\widehat{\mu_1^T\mu_2} = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} x_i^T y_j.$$
Let $T_{CQ} = T_{n,1} + T_{n,2} - 2\,\widehat{\mu_1^T\mu_2}$, which estimates $\|\mu_1 - \mu_2\|^2$. Let $V_{CQ}$ be the variance of $T_{CQ}$ when $H_0$ is true. Assume $\hat{V}_{CQ}$ is a consistent estimator of $V_{CQ}$. Under $H_0$ and additional regularity conditions, ref. [22] showed that
$$\frac{T_{CQ}}{\sqrt{\hat{V}_{CQ}}} \xrightarrow{D} N(0, 1),$$
and that
$$V_{CQ} = \frac{2\,\text{tr}(\Sigma_1^2)}{n_1(n_1-1)} + \frac{2\,\text{tr}(\Sigma_2^2)}{n_2(n_2-1)} + \frac{4\,\text{tr}(\Sigma_1\Sigma_2)}{n_1 n_2}.$$
Let $\sigma_{D1}^2 = V(x_i^T x_j)$ where $i \neq j$, $\sigma_{D2}^2 = V(y_i^T y_j)$ where $i \neq j$, and $\sigma_{D12}^2 = V(x_i^T y_j)$.
Ref. [22] showed that the three trace terms of $V_{CQ}$ can be estimated separately.
Ref. [28], using arguments similar to Theorem 2, showed
$$\sigma_{D1}^2 = \text{tr}(\Sigma_1^2) + 2\mu_1^T\Sigma_1\mu_1, \quad \sigma_{D2}^2 = \text{tr}(\Sigma_2^2) + 2\mu_2^T\Sigma_2\mu_2,$$
$$\text{and} \quad \sigma_{D12}^2 = \text{tr}(\Sigma_1\Sigma_2) + \mu_1^T\Sigma_2\mu_1 + \mu_2^T\Sigma_1\mu_2.$$
Thus $\sigma_{D1}^2 = \text{tr}(\Sigma_1^2)$, $\sigma_{D2}^2 = \text{tr}(\Sigma_2^2)$, and $\sigma_{D12}^2 = \text{tr}(\Sigma_1\Sigma_2)$ if $\mu_1 = \mu_2 = 0$. Hence, in that case,
$$V_{CQ} = \frac{2\sigma_{D1}^2}{n_1(n_1-1)} + \frac{2\sigma_{D2}^2}{n_2(n_2-1)} + \frac{4\sigma_{D12}^2}{n_1 n_2}.$$
If $\mu_1 = \mu_2 = 0$, then the trace terms equal the $\sigma_D^2$ terms, and the formula with the $\sigma_D^2$ estimators worked well in simulations. Note that $\sigma_{D1}^2$, $\sigma_{D2}^2$, and $\sigma_{D12}^2$ can be estimated as in Section 2.1. Let $n = \min(n_1, n_2)$, and let $C_i = x_i^T y_i$ for $i = 1, \dots, n$. Let $S_C^2$ be the sample variance of the $C_i$. Another estimator of $\sigma_{D12}^2$ is
$$\hat{\sigma}_{D12}^2 = \frac{1}{n_1 n_2}\sum_{i=1}^{n_1}\sum_{j=1}^{n_2}\left(x_i^T y_j - \widehat{\mu_1^T\mu_2}\right)^2.$$
2.3. Theory for Testing $H_0: \Sigma_{xY} = 0$
Consider tests of the form $H_0: A\Sigma_{xY} = 0$ versus $H_1: A\Sigma_{xY} \neq 0$. The omnibus test uses $A = I_p$ and tests $H_0: \Sigma_{xY} = 0$ versus $H_1: \Sigma_{xY} \neq 0$.
Let $w_i = (x_i - \mu_x)(Y_i - \mu_Y)$ and $\hat{w}_i = (x_i - \overline{x})(Y_i - \overline{Y})$ for $i = 1, \dots, n$. Then under mild regularity conditions, $Z_w \xrightarrow{D} N(0, 1)$ by Section 2.1, where w indicates that the test was applied to the $\hat{w}_i$. Ref. [14] showed that substituting the $\hat{w}_i$ for the $w_i$ does not change the limiting null distribution, and used the test for multiple linear regression in their simulations.
Let $x = (x_I^T, x_O^T)^T$, $\beta = (\beta_I^T, \beta_O^T)^T$, and $\hat{w}_{i,O} = (x_{i,O} - \overline{x}_O)(Y_i - \overline{Y})$. Then testing $H_0: \Sigma_{x_OY} = 0$ uses the one sample test on the $\hat{w}_{i,O}$, as in the sketch below. This test is equivalent to testing $\beta_{OPLS,O} = 0$ and $\beta_{MMLE,O} = 0$. Note that data splitting could be used to select O. For multiple linear regression and the MMLE and OPLS estimators, these tests are high dimensional analogs of the OLS partial F tests for testing whether a reduced model is good. If $\beta_O = 0$, then I corresponds to the predictors in the reduced model while O corresponds to the predictors out of the reduced model.
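A minimal sketch of the subset test, assuming the hdomni function from the Discussion returns its right tailed p-value as rpval (the index set O below is hypothetical):
- O <- c(3, 7, 12) #hypothetical predictors out of the reduced model
- hdomni(x[, O, drop=FALSE], y)$rpval #one sample test on the hat-w_{i,O}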
In low dimensions, important tests for regression include (a) $H_0: \beta_j = 0$ (the Wald tests for MLR), (b) $H_0: \beta = 0$ (the Anova F test for MLR), and (c) $H_0: \beta_O = 0$ (the partial F test for MLR). The above paragraph shows how to do these high dimensional tests for the multiple linear regression OPLS and MMLE estimators, with or without heterogeneity. Data splitting is not needed if O is known. Note that (a) corresponds to testing with $O = \{j\}$ while (c) corresponds to testing with a general O.
The next subsection reviews competitors for the above tests when k is small compared to n.
2.4. Theory for Certain A
This subsection reviews some large sample theory for $\hat{\Sigma}_{xY}$ and OPLS for the multiple linear regression model, including some high dimensional tests for low dimensional quantities such as $\beta_O$ where O contains k predictors. These tests depend on iid cases, but not on linearity or the constant variance assumption. Hence the tests are useful for multiple linear regression with heterogeneity.
The following [5] theorem gives the large sample theory for $\hat{\Sigma}_{xY}$. Ref. [6] gave alternative proofs. This theory needs $\Sigma_w = \text{Cov}(w_i)$ to exist. Let $w_i = (x_i - \mu_x)(Y_i - \mu_Y)$ and let $\hat{w}_i = (x_i - \overline{x})(Y_i - \overline{Y})$, with $\hat{\Sigma}_w$ the sample covariance matrix of the $\hat{w}_i$. Then low order moments are needed for $\hat{\Sigma}_w$ to be a consistent estimator of $\Sigma_w$.
Theorem 3.
Assume the cases $(x_i^T, Y_i)^T$ are iid. Assume the relevant low order moments exist. Let $\mu_x = E(x)$ and $\mu_Y = E(Y)$. Let $w_i = (x_i - \mu_x)(Y_i - \mu_Y)$ with sample mean $\overline{w}$, and let $\Sigma_w = \text{Cov}(w_i)$. Then (a)
$$\sqrt{n}\,(\hat{\Sigma}_{xY} - \Sigma_{xY}) \xrightarrow{D} N_p(0, \Sigma_w).$$
(b) Let $\hat{w}_i = (x_i - \overline{x})(Y_i - \overline{Y})$ with sample mean $\overline{\hat{w}}$. Then $\sqrt{n}\,(\overline{w} - \Sigma_{xY}) \xrightarrow{D} N_p(0, \Sigma_w)$. Hence $\sqrt{n}\,(\overline{\hat{w}} - \Sigma_{xY}) \xrightarrow{D} N_p(0, \Sigma_w)$.
(c) Let A be a $k \times p$ full rank constant matrix with $k \leq p$, assume $H_0: A\Sigma_{xY} = 0$ is true, and assume $\hat{\Sigma}_w \xrightarrow{P} \Sigma_w$. Then
$$n\,(A\hat{\Sigma}_{xY})^T[A\hat{\Sigma}_w A^T]^{-1}(A\hat{\Sigma}_{xY}) \xrightarrow{D} \chi^2_k.$$
For the following theorem, consider a subset of k distinct elements from the population quantities or from their estimators. Stack the elements into a vector, and let each vector have the same ordering. For example, the largest subset of distinct elements corresponds to the entire stacked vector. For random variables $z_i$, use notation such as $\overline{z}$ for the sample mean of the $z_i$. For general vectors of elements, the ordering of the vectors will all be the same. Let $\overline{z}$ be the sample mean of the $z_i$. Assuming that $\Sigma_z = \text{Cov}(z_i)$ exists, the multivariate central limit theorem gives
$$\sqrt{n}\,(\overline{z} - E(z_i)) \xrightarrow{D} N_k(0, \Sigma_z).$$
The following [6] theorem provides large sample theory for $\hat{\Sigma}_{xZ}$ and related quantities. We use z to avoid confusion with the w used in Theorem 3. Note that the Z are dummy variables and could be replaced by $Y_1, \dots, Y_m$ to get information about m response variables. Testing with several response variables could likely be done by applying the one sample test to the stacked vectors built from the $\hat{w}_i$ for $Y_1$, …, $Y_m$, assuming the relevant moments exist and iid cases.
Theorem 4.
Assume the cases are iid and that $\Sigma_z$ exists. Using the above notation with $z_i$ a $k \times 1$ vector,
(i) .
(ii) .
(iii) and .
2.5. Testing $H_0: A\Sigma_{xY} = 0$
As noted by [5], the following simple testing method reduces a possibly high dimensional problem to a low dimensional problem. Testing $H_0: A\Sigma_{xY} = 0$ versus $H_1: A\Sigma_{xY} \neq 0$ is equivalent to testing $H_0: A\mu_w = 0$ versus $H_1: A\mu_w \neq 0$ where A is a $k \times p$ constant matrix. Let $\Sigma_w$ be the asymptotic covariance matrix of $\sqrt{n}\,\overline{\hat{w}}$. In high dimensions where $n < p$, we can’t get a good nonsingular estimator of $\Sigma_w$, but we can get good nonsingular estimators of the covariance matrix of the $k \times 1$ vectors $A w_i$ with $n \geq Jk$ where $J \geq 10$. Here $u$ denotes the k predictors that are in the hypothesis test. (Values of J much larger than 10 may be needed if some of the k predictors and/or Y are skewed.) Simply apply Theorem 3 to the predictors u used in the hypothesis test, and thus use the sample covariance matrix of the vectors $\hat{w}_i(u) = (u_i - \overline{u})(Y_i - \overline{Y})$. Hence we can test low dimensional hypotheses even when $p > n$. In particular, testing $H_0: \Sigma_{uY} = 0$ is equivalent to testing $H_0: \beta_O = 0$ for the OPLS and MMLE estimators where $u = x_O$.
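A minimal R sketch of this low dimensional test with $A = I_k$, using the chi-square limit of Theorem 3(c); the helper name lowdimtest is ours.
- lowdimtest <- function(u, y){ #u: n x k matrix of the k test predictors, k << n
- n <- nrow(u)
- w <- sweep(u, 2, colMeans(u)) * (y - mean(y)) #hat-w_i(u)
- wbar <- colMeans(w)
- stat <- n * sum(wbar * solve(cov(w), wbar)) #n wbar' Sigma_w^{-1} wbar
- c(stat = stat, pval = pchisq(stat, df = ncol(u), lower.tail = FALSE))
- }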
2.6. High Dimensional Outlier Detection
High dimensional outlier detection is important. This subsection follows [29] closely. See [29,30] for examples and simulations. Let W be a data matrix, where the rows $w_i^T$ correspond to cases. For example, W = X. One of the simplest outlier detection methods uses the Euclidean distances $D_i = \|w_i - \text{MED}(W)\|$ of the $w_i$ from the coordinatewise median $\text{MED}(W)$. Concentration type steps compute the weighted median: the coordinatewise median computed from the “half set” of cases $w_i$ with the smallest distances $D_i$. We often used 0 (no concentration type steps) or a few concentration type steps. Let $D_i$ be the distance from the final (weighted) coordinatewise median. Let $W_i = 1$ if $D_i \leq \text{MED}(D_1, \dots, D_n) + k\,\text{MAD}(D_1, \dots, D_n)$ where $k \geq 0$ and $k = 5$ is the default choice. Let $W_i = 0$, otherwise. Using $k \geq 0$ insures that at least half of the cases get weight 1. This weighting corresponds to the weighting that would be used in a one sided metrically trimmed mean (Huber type skipped mean) of the distances. Here, the sample median absolute deviation is $\text{MAD}(D_1, \dots, D_n) = \text{MED}(|D_i - \text{MED}(D_1, \dots, D_n)|,\ i = 1, \dots, n)$ where $\text{MED}(D_1, \dots, D_n)$ is the sample median of $D_1, \dots, D_n$.
Let the covmb2 set B of at least $n/2$ cases correspond to the cases with weight $W_i = 1$. Then the covmb2 estimator $(T, C)$ is the sample mean and sample covariance matrix applied to the cases in set B. Hence
$$T = \frac{\sum_{i=1}^n W_i w_i}{\sum_{i=1}^n W_i} \quad \text{and} \quad C = \frac{\sum_{i=1}^n W_i (w_i - T)(w_i - T)^T}{\sum_{i=1}^n W_i - 1}.$$
This estimator was built for speed, applications, and outlier resistance.
Another method to get an outlier resistant estimator is to use the following identity. If X and Y are random variables, then
$$\text{Cov}(X, Y) = \frac{\text{Var}(X + Y) - \text{Var}(X - Y)}{4}.$$
Then replace each variance by $\hat{\sigma}^2$ where $\hat{\sigma}$ is a robust estimator of scale or standard deviation applied to X + Y or X − Y. We used a sample median absolute deviation scaled to be consistent at the normal distribution. Hence
$$\widehat{\text{Cov}}(X, Y) = \frac{\hat{\sigma}^2(X + Y) - \hat{\sigma}^2(X - Y)}{4}.$$
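A minimal R sketch of this identity with a robust scale; here the built-in mad() (MAD scaled by 1.4826) stands in for the scale estimator, which is an assumption about the exact choice used.
- rcov <- function(x, y) (mad(x + y)^2 - mad(x - y)^2)/4
- #rcov(x, y) is close to cov(x, y) for clean data, but resists outliers.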
The function ddplot5 plots the Euclidean distances from the coordinatewise median versus the Euclidean distances from the covmb2 location estimator. Typically the plotted points in this DD plot cluster about the identity line, and outliers appear in the upper right corner of the plot with a gap between the bulk of the data and the outliers.
The function rcovxy makes the classical and three robust estimators of $\Sigma_{xY}$, and makes a scatterplot matrix of the four estimated sufficient predictors and Y. Only two robust estimators are made if p is large.
3. Results
Example 1.
The [31] data was collected from districts in Prussia in 1843. Let Y = the number of women married to civilians in the district, with a constant and predictors $x_1$ = the population of the district in 1843, $x_2$ = the number of married civilian men in the district, $x_3$ = the number of married men in the military in the district, and $x_4$ = the number of women married to husbands in the military in the district. Sometimes the person conducting the survey would not count a spouse if the spouse was not at home. Hence Y and $x_2$ are highly correlated but not equal. Similarly, $x_3$ and $x_4$ are highly correlated but not equal. We expect $\Sigma_{xY} \neq 0$. Let the omnibus test statistic be applied to the $\hat{w}_i$. Then the hypotheses $H_0: \Sigma_{xY} = 0$, $H_0: \beta_{OPLS} = 0$, and $H_0: \beta_{MMLE} = 0$ are all rejected. The classical F-test also rejects $H_0$ with p-value = 0.
Example 2.
The [32] pottery data has pottery shards of Roman earthware produced between the second century B.C. and the fourth century A.D. Often the pottery was stamped by the manufacturer. A chemical analysis was done for the chemicals (variables), and the types of pottery were 1-Arretine, 2-not-Arretine, 3-North Italian, 4-Central Italian, and 5-questionable origin. Let the binary response variable Y = 1 for type 1 and Y = 0 for types 2–5. The omnibus test had a two sided p-value of 0.0319 and the more correct right tailed p-value of 0.016. The chi-square logistic regression test for $H_0: \beta = 0$ had p-value = 0.0002, but the GLM did not converge.
3.1. One Sample Tests
In the simulations, we examined five one sample tests. The first “test” used the m out of n bootstrap to compute $T_n$ from subsamples with m < n. We used the shorth bootstrap confidence interval described in [30] (ch. 2). This “test” has not been proven to have level $\alpha$. The second test computed the usual t confidence interval
$$\overline{D} \pm t_{m-1, 1-\alpha/2}\,\frac{S_D}{\sqrt{m}}$$
for $\theta$ based on the $D_i$ from Theorem 1. The third and fourth tests used Theorem 2(b) and $Z_n = T_n/\sqrt{\hat{V}(T_n)} \xrightarrow{D} N(0, 1)$ if the variance estimator is consistent when $H_0$ is true. The third test used $\hat{\sigma}_D^2$, while the fourth test used $S_D^2$ based on Theorem 1. These two tests computed intervals (“confidence intervals for 0”)
$$T_n \pm \text{cutoff}\,\sqrt{\hat{V}(T_n)}.$$
The tests 2–4 use the same cutoff so that the average interval lengths are more comparable. The fifth test used the Theorem 2 test applied to the spatial sign vectors $u_i = S(x_i)$.
The simulation used four distribution types for $x = Az + \delta 1$, where 1 is the $p \times 1$ vector of ones so that $\mu = \delta 1$. Type 1 used $z \sim N_p(0, I_p)$, type 2 used a mixture distribution, type 3 used a multivariate t distribution, and type 4 used a multivariate lognormal distribution. The covariance matrix type depended on the matrix A. Type 1 used $A = I_p$, while types 2 and 3 used matrices A giving correlated predictors, with cor($x_i, x_j$) for $i \neq j$ depending on a constant $\psi$. We used $\delta = 0$ to simulate the level, and $\delta > 0$ chosen so at least one test had good power. The simulation used 5000 runs, the 4 x distributions, and the 3 matrices A. For the third A, a fixed value of $\psi$ was used.
Table 1 and Table 2 summarize some simulation results. There are two lines for each simulation scenario. The first line gives the simulated power = proportion of times $H_0$ was rejected. The second line gives the average length of the confidence interval for 0, where $H_0$ is rejected if 0 is not in the confidence interval. When $\delta = 0$, observed coverage between 0.04 and 0.06 suggests that coverage = power = level is close to the nominal value 0.05. For larger $\delta$, we want the coverage near 1 for good power. See [28] for more simulations.
Table 1.
One sample tests, covtyp = 1, cov = observed type I error for $\delta = 0$ and power for $\delta > 0$. Boldface for good performance not including spatial.
Table 2.
One sample tests, covtyp = 2, p = 10,000, cov = observed type I error for $\delta = 0$ and power for $\delta > 0$. Boldface for good performance not including spatial.
The bootstrap test corresponds to the boot column, and the tests using $\overline{D}$, $\hat{\sigma}_D^2$, and $S_D^2$ correspond to the next three columns. The last column corresponds to the spatial sign test. This test tends to have much shorter lengths because of the transformation of the data. The test using $\overline{D}$ has simple large sample theory, but low power compared to the other methods. This test’s length is approximately $\sqrt{n-1}$ times the length of that corresponding to $T_n$ with $\hat{\sigma}_D^2$. The bootstrap test was sometimes conservative, with observed coverage less than 0.04 when $\delta = 0$. For xtype = 4 and $\delta = 0$, $H_0: \mu_u = 0$ was not true for the spatial test. Hence the coverage for the spatial test was sometimes higher than 0.06 for this scenario. For $\delta = 0$, the test with $S_D^2$ sometimes had coverage less than 0.04, while the test with $\hat{\sigma}_D^2$ sometimes had coverage greater than 0.06. In the simulations, the spatial test often performed well, but typically $\mu_u \neq \mu$, which makes the spatial test harder to use. For testing $H_0: \mu = 0$, the test with $\hat{\sigma}_D^2$ appeared to perform better than the three competitors.
3.2. Two Sample Tests
In the simulations, we examined three two sample tests. The first “test” used the m out of n bootstrap, with $m_i < n_i$, to bootstrap the [22] test that estimates $\|\mu_1 - \mu_2\|^2$. The second test was the “paired test” with $n = \min(n_1, n_2)$ and $d_i = x_i - y_i$ for $i = 1, \dots, n$. Then apply the one sample test from Theorem 2 to the $d_i$. The third test was the [25] Li test. Both of the last two tests used the one sample test applied to the $d_i = x_i - y_i$ or to the transformed $d_i$ of Section 2.2.
The simulation used four distribution types where $x$ and $y$ had the same distribution under $H_0$, with $\mu_1 = 0$ and $\mu_2 = \delta 1$. Type 1 used $z \sim N_p(0, I_p)$, type 2 used a mixture distribution, type 3 used a multivariate t distribution, and type 4 used a multivariate lognormal distribution. The covariance matrix type depended on the matrix A.
For the covariance types, $A = I_p$ for covtyp = 1, a second matrix A was used for covtyp = 2, and A = diag(1, 2, ..., p) for covtyp = 3. Table 3 shows some results. Two lines were used for each simulation scenario, with coverages on the first line and lengths on the second line. When $n_1 = n_2$, the paired test and Li test gave the same results. When $n_1/n_2$ was not near 1, the Li test had better power and shorter length. Increasing $\delta$ could greatly increase the length for the bootstrap test, but the coverage would be 1. Improving the one sample test would improve the Li test, but the Li test performed well in simulations.
Table 3.
Two sample tests, covtyp = 1, cov = observed type I error for $\delta = 0$ and power for $\delta > 0$. Boldface for better performance.
3.3. Theorem 3 Tests
We illustrate Theorem 3 and Section 2.5 for Poisson regression and negative binomial regression. This simulation is similar to that done by [6] for multiple linear regression with and without heterogeneity. Let $u$ be the vector of the k nontrivial predictors. Let $\beta_j = 1$ for $j = 1, \dots, k$ and $\beta_j = 0$ for $j > k$. Hence $SP = \alpha + \beta^T x$ with $\beta = (1, \dots, 1, 0, \dots, 0)^T$ with k ones and p − k zeros. Here $\beta$ is the Poisson regression parameter vector or the negative binomial regression parameter vector. Let $Z = \log(Y)$ if $Y \geq 1$ and $Z = 0$ if $Y = 0$. Then a multiple linear regression model with heterogeneity is $Z = \alpha_Z + \beta_Z^T x + e$ where the $e_i$ are independent with expected value 0 and variance $\sigma_i^2$. Since the cases are iid, the OLS estimator is consistent for $\beta_Z = \Sigma_x^{-1}\Sigma_{xZ}$ because $\beta_{OLS} = \Sigma_x^{-1}\Sigma_{xZ}$. Thus $\beta_Z$ has the first k values equal to a nonzero constant and p − k zeros.
Let $\hat{\Sigma}_{x_jZ}$ be the jth element of $\hat{\Sigma}_{xZ}$. Then the Theorem 3 large sample confidence interval (CI) could be computed for each $j = 1, \dots, p$. If 0 is not in the confidence interval, then $H_0: \Sigma_{x_jZ} = 0$ and $H_0: \beta_{E,j} = 0$ are both rejected for estimators E = OPLS and MMLE for the multiple linear regression model with Z. In the simulations, the maximum observed undercoverage was nonnegligible. Hence the program has the option to replace the cutoff by a slightly larger cutoff for smaller n. This correction factor was used in the simulations for the nominal 95% CIs, where the correction factor uses a cutoff that is between the cutoff for a 95% CI and the cutoff that would be used for a 97.5% CI. The nominal coverage was 0.95 with $\alpha = 0.05$. Observed coverage between 0.94 and 0.96 suggests coverage is close to the nominal value. Ref. [33] noted that weighted least squares tests tend to reject too often (liberal tests with undercoverage).
To summarize the confidence intervals, the average length of the confidence intervals over 5000 runs was computed. Then the minimum, mean, and maximum of the average lengths were computed. The proportion of times each confidence interval contained zero was computed. These proportions were the observed coverages of the confidence intervals. Then the minimum observed coverage was found. The percentages of the observed coverages that were ≥ 0.9, 0.92, 0.93, 0.94, and 0.96 were also recorded. The test of $H_0: \beta_O = 0$ was also done where $H_0$ was true. The coverage of the test was recorded, and a correction factor was not used. Negative binomial regression and Poisson regression were used, where $\kappa = \infty$ indicates that Poisson regression was used.
Table 4 illustrates Theorem 3(a), where Table 4 replaces Y with Z. For Table 4, confidence intervals were made for the $\Sigma_{x_jZ}$ for $j = 1, \dots, p$, and the coverage was the proportion of the 5000 CIs that contained 0. Here $\Sigma_{x_1Z} \neq 0$ since k = 1, but $\Sigma_{x_jZ} = 0$ for $j > 1$. The first two lines of Table 4 correspond to Poisson regression. The confidence interval for $\Sigma_{x_1Z}$ never contained 0; hence, the minimum coverage was 0 with observed power 1. The proportion of the remaining CIs that had coverage ≥ 0.94 was 0.9898 (98/99 CIs). Hence this was also the proportion of CIs with coverage ≥ 0.92 and ≥ 0.93. The proportion of CIs that had coverage ≥ 0.96 was 0.8081 (80/99 CIs). The typical coverage was near 0.965; hence, the correction factor was slightly too large. The test of $H_0: \beta_O = 0$ did not use a correction factor, and coverage was 0.9438. The minimum average CI length was 0.4166, the sample mean of the average CI lengths was 0.4187, and the maximum average length was 0.4875, corresponding to the CI for $\Sigma_{x_1Z}$. The second two lines and below for Table 4 were for the negative binomial regression with finite $\kappa$. For $\kappa$ = 1000 and 10,000, the simulations were very similar to those for $\kappa = \infty$. Using Y instead of Z gave similar results with longer lengths.
Table 4.
Cov(x,Z), n = 100, p = 100, k = 1, want cov > 0.94 except for mincov and cov96.
3.4. Omnibus Test
Multiple Linear Regression
For this simulation, the x were generated as in Section 3.1, and then $Y = \alpha + \beta^T x + e$ where $\beta = \delta 1$. Hence $H_0$ is true when $\delta = 0$. The one sample test was applied on the $\hat{w}_i$ using $\hat{\sigma}_D^2$ and using $S_D^2$. The zero mean iid errors were iid from five distributions: (i) N(0,1), (ii) a t distribution, (iii) EXP(1) − 1, (iv) a uniform distribution, and (v) 0.9 N(0,1) + 0.1 N(0,100). Only distribution (iii) is not symmetric. With 5000 runs, we would like the coverage to be between 0.04 and 0.06 when $\delta = 0$. In Table 5, the coverage was a bit high when $S_D^2$ was used (second to last column) instead of $\hat{\sigma}_D^2$ (fourth column). Power near 0.95 was good for the larger values of $\delta$.
Table 5.
Omnibus test for multiple linear regression, cov = observed type I error for $\delta = 0$ and power for $\delta > 0$. Boldface for good performance.
Poisson Regression
For this simulation, the $x_i$ were generated in a manner similar to Section 3.1 where the $z_i$ were from a multivariate normal distribution. Let $\beta = \delta(1, \dots, 1, 0, \dots, 0)^T$ where there were k 1’s and p − k 0’s. Then the $\beta$ were scaled such that the sufficient predictor had a fixed spread when $\delta \neq 0$. Hence the population Poisson regression was fairly strong for larger $\delta$ and rather weak for smaller $\delta$. Table 6 shows that using $\hat{\sigma}_D^2$ controlled the nominal level 0.05 better than using $S_D^2$. As p got larger, the power performance could decrease. See line 8 of Table 6.
Table 6.
Omnibus test for Poisson regression, cov = observed type I error for $\delta = 0$ and power for $\delta > 0$. Boldface for good performance.
Sample R code for the above two tables is shown below.
- source("http://parker.ad.siu.edu/Olive/slpack.txt")
- mlrcovxysim(n=100,p=500,nruns=5000,xtype=3,etype=2,delta=0)
- prcovxysim(n=500,p=100,k=100,nruns=5000,psi=0,delta=0)
4. Discussion
The omnibus test is resistant to model misspecification. For example, (a) the constant variance multiple linear regression model could be assumed when there is heterogeneity, and (b) for count data, a multiple linear regression model, a negative binomial regression model, or a quasi-Poisson regression model may fit the data much better than the count model actually chosen. The test can also be used in low dimensions when the MLE fails to converge.
Based on the simulations and the theory, (a) the omnibus test and one sample test will not have good power against all alternatives unless $\theta$ grows as $p \to \infty$. (b) The omnibus test and one sample test tended to have simulated observed level near the nominal level (control the type I error) if $\hat{\sigma}_D^2$ was used, but the omnibus test could be conservative if n was small, both for multiple linear regression and for Poisson regression in the simulations. Sometimes the variance estimator exploded if p was large or if $H_0$ was false. (c) The omnibus test and one sample test have little outlier resistance. Thus it is important to check for outliers before performing the tests. (d) Both tests worked fairly well in the simulations, and Ref. [14] used similar settings in their simulations for multiple linear regression.
Right tail tests should be used for $H_1: \theta > 0$ since they have more power, but two tail tests are easier to explain and compare. Ref. [14] used a statistic of the form $T_n$ plus an extra term, with an estimator of its null variance. This statistic can also be used for an omnibus test when $H_0$ holds. The extra term was used to increase power and is likely a good idea, but better formulas for the variance of the statistic may be needed.
Ref. [28] has many references for high dimensional one and two sample tests. For classification with two groups, let $\Sigma$ be the pooled covariance matrix. Then $\Sigma^{-1}(\mu_1 - \mu_2) = 0$ if and only if $\mu_1 = \mu_2$, which can be tested with a two sample test. For the importance of $\Sigma^{-1}(\mu_1 - \mu_2)$ in discriminant analysis, see, for example, [34].
Let the “fail to reject region” be the complement of the rejection region. Often the fail to reject region is a confidence region for the parameter or parameter vector of interest, where a confidence interval is a special case of a confidence region. In high dimensions, the length or volume of the fail to reject region does not necessarily converge to 0 as $n \to \infty$, and the volume could diverge to ∞ if p grows quickly enough. For the one sample test, the test based on the fail to reject region for $T_n$ has much more power than a test based on a confidence region for $\mu$.
Simulations were done in R. See [35]. The collection of [30] R functions slpack, available from (http://parker.ad.siu.edu/Olive/slpack.txt, accessed on 28 October 2025), has some useful functions for the inference. The function hdomni does the omnibus test. The relevant R code is shown below.
- hdomni <- function(x, y, alpha=0.05){
- n <- nrow(x)
- k <- n*(n-1)
- xx <- scale(x,scale=F) #centered but not scaled
- v <- xx*c(y-mean(y)) #rows are the hat-w_i
- a <- apply(v,2,sum) #column sums
- Thd <- (t(a)%*%a - sum(v^2))/k #1 by 1 matrix
- Thd <- as.double(Thd) #so the test statistic Thd=Tn is a scalar
- sscp <- v%*%t(v) #matrix of the D_ij = w_i^T w_j
- ss <- sscp - Thd
- ss <- ss^2
- vw1 <- (sum(ss) - sum(diag(ss)))/k #hat sigma_D^2 over i != j
- Vohat <- 2*vw1/k #estimated variance of Tn under H0
- Z <- Thd/sqrt(Vohat)
- pval <- 2*pnorm(-abs(Z)) #two tail pvalue
- rpval <- 1-pnorm(Z) #right tail pvalue
- list(Tn=Thd, Z=Z, pval=pval, rpval=rpval)
- }
The function hdhot1sim3 was used to simulate the five one sample tests, and was used for Table 1 and Table 2. The function hdhot1sim4 added the test using the modified estimator of $\text{tr}(\Sigma^2)$. The function hdhot2sim simulates the two sample tests, which apply the fast paired test on the $d_i = x_i - y_i$ for $i = 1, \dots, \min(n_1, n_2)$, the [25] test, and the two sample [22] test based on subsampling with $m_i < n_i$ for i = 1, 2. See Table 3. Proofs for Theorems 3 and 4 were not given, but are available from preprints of the corresponding published papers from (http://parker.ad.siu.edu/Olive/preprints.htm, accessed on 28 October 2025).
For Table 4, the function nbinroplssimz was used to create negative binomial regression data sets for finite $\kappa$, while the function PRoplssimz was used to create the Poisson regression data sets corresponding to $\kappa = \infty$. The functions without the z do not use the $Z = \log(Y)$ transformation.
For the omnibus test, the function mlrcovxysim was used for multiple linear regression, while the function prcovxysim was used for Poisson regression.
The spatial sign vectors have some outlier resistance. If the predictor variables are all continuous, the covmb2 and ddplot5 functions are useful for detecting outliers in high dimensions. See [30] (section 1.4.3). Ref. [36] gave estimators for the variance of U-statistics.
Author Contributions
Conceptualization, A.M.A., P.A.Q. and D.J.O.; methodology, A.M.A., P.A.Q. and D.J.O.; writing-original draft preparation, D.J.O. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The data sets are available from (http://parker.ad.siu.edu/Olive/sldata.txt, accessed on 28 October 2025).
Acknowledgments
The authors thank the editors and referees for their work.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| CI | confidence interval |
| iid | independent and identically distributed |
| MDPI | Multidisciplinary Digital Publishing Institute |
| MLR | Multiple Linear Regression |
| MMLE | marginal maximum likelihood estimator |
| OLS | ordinary least squares |
| OPLS | one component partial least squares |
| SP | sufficient predictor |
References
- Cook, R.D.; Helland, I.S.; Su, Z. Envelopes and partial least squares regression. J. Roy. Stat. Soc. B 2013, 75, 851–877. [Google Scholar] [CrossRef]
- Basa, J.; Cook, R.D.; Forzani, L.; Marcos, M. Asymptotic distribution of one-component partial least squares regression estimators in high dimensions. Can. J. Stat. 2024, 52, 118–130. [Google Scholar] [CrossRef]
- Cook, R.D.; Forzani, L. Partial Least Squares Regression: And Related Dimension Reduction Methods; Chapman and Hall/CRC: Boca Raton, FL, USA, 2024. [Google Scholar]
- Wold, H. Soft modelling by latent variables: The non-linear partial least squares (NIPALS) approach. J. Appl. Prob. 1975, 12, 117–142. [Google Scholar] [CrossRef]
- Olive, D.J.; Zhang, L. One component partial least squares, high dimensional regression, data splitting, and the multitude of models. Commun. Stat. Theory Methods 2025, 54, 130–145. [Google Scholar] [CrossRef]
- Olive, D.J.; Alshammari, A.A.; Pathiranage, K.G.; Hettige, L.A.W. Testing with the one component partial least squares and the marginal maximum likelihood estimators. Commun. Stat. Theory Methods 2025. [Google Scholar] [CrossRef]
- Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. Roy. Stat. Soc. B 2008, 70, 849–911. [Google Scholar] [CrossRef]
- Fan, J.; Song, R. Sure independence screening in generalized linear models with np-dimensionality. Ann. Stat. 2010, 38, 3567–3604. [Google Scholar] [CrossRef]
- Agresti, A. Categorical Data Analysis, 2nd ed.; Wiley: Hoboken, NJ, USA, 2002. [Google Scholar]
- Li, K.C.; Duan, N. Regression analysis under link violation. Ann. Stat. 1989, 17, 1009–1052. [Google Scholar] [CrossRef]
- Chen, C.H.; Li, K.C. Can SIR be as popular as multiple linear regression? Stat. Sinica 1998, 8, 289–316. [Google Scholar]
- Cameron, A.C.; Trivedi, P.K. Regression Analysis of Count Data; Cambridge University Press: Cambridge, UK, 1998. [Google Scholar]
- Haggstrom, G.W. Logistic regression and discriminant analysis by ordinary least squares. J. Bus. Econ. Stat. 1983, 1, 229–238. [Google Scholar] [CrossRef]
- Zhao, A.; Li, C.; Li, R.; Zhang, Z. Testing high-dimensional regression coefficients in linear models. Ann. Stat. 2024, 52, 2034–2058. [Google Scholar] [CrossRef]
- Cui, H.; Guo, W.; Zhong, W. Test for high-dimensional regression coefficients using refitted cross-validation variance estimation. Ann. Stat. 2018, 46, 958–988. [Google Scholar] [CrossRef]
- Goeman, J.J.; van de Geer, S.A.; van Houwelingen, H.C. Testing against a high dimensional alternative. J. R. Stat. Soc. B 2006, 68, 477–493. [Google Scholar] [CrossRef]
- Lan, W.; Wang, H.; Tsai, C.-L. Testing covariates in high-dimensional regression. Ann. Inst. Statist. Math. 2014, 66, 279–301. [Google Scholar] [CrossRef]
- Zhong, P.-S.; Chen, S.X. Tests for high dimensional regression coefficients with factorial designs. J. Amer. Stat. Assoc. 2011, 106, 260–274. [Google Scholar] [CrossRef]
- Helland, I.S. Partial least squares regression and statistical models. Scand. J. Stat. 1990, 17, 97–114. [Google Scholar]
- Park, J.; Ayyala, D.N. A test for the mean vector in large dimension and small samples. J. Stat. Plan. Inf. 2013, 143, 929–943. [Google Scholar] [CrossRef]
- Lehmann, E.L. Nonparametrics: Statistical Methods Based on Ranks; Holden-Day: San Francisco, CA, USA, 1975. [Google Scholar]
- Chen, S.X.; Qin, Y.L. A two sample test for high-dimensional data with applications to gene-set testing. Ann. Stat. 2010, 38, 808–835. [Google Scholar] [CrossRef]
- Srivastava, M.S.; Du, M. A test for the mean vector with fewer observations than the dimension. J. Mult. Anal. 2008, 99, 386–402. [Google Scholar] [CrossRef]
- Bai, Z.D.; Saranadasa, H. Effects of high dimension: By an example of a two sample problem. Stat. Sinica 1996, 6, 311–329. [Google Scholar]
- Li, J. Finite sample t-tests for high-dimensional means. J. Mult. Anal. 2023, 196, 105183. [Google Scholar] [CrossRef] [PubMed]
- Wang, L.; Peng, B.; Li, R. A high-dimensional nonparametric multivariate test for mean vector. J. Am. Stat. Assoc. 2015, 110, 1658–1669. [Google Scholar] [CrossRef]
- Anderson, T.W. An Introduction to Multivariate Statistical Analysis, 2nd ed.; Wiley: New York, NY, USA, 1984. [Google Scholar]
- Abid, A.M. Some Simple High Dimensional One and Two Sample Tests. Ph.D. Thesis, Southern Illinois University, Carbondale, IL, USA, 2025. Available online: http://parker.ad.siu.edu/Olive/sAhlam.pdf (accessed on 28 October 2025).
- Olive, D.J. Some useful techniques for high dimensional statistics. Stats 2025, 8, 60. [Google Scholar] [CrossRef]
- Olive, D.J. Prediction and Statistical Learning, Online Course Notes. 2025. Available online: http://parker.ad.siu.edu/Olive/slearnbk.htm (accessed on 28 October 2025).
- Hebbler, B. Statistics of Prussia. J. Roy. Stat. Soc. A 1847, 10, 154–186. [Google Scholar] [CrossRef]
- Wisseman, S.U.; Hopke, P.K.; Schindler-Kaudelka, E. Multielemental and multivariate analysis of Italian terra sigillata in the world heritage museum, university of Illinois at Urbana-Champaign. Archeomaterials 1987, 1, 101–107. [Google Scholar]
- Pötscher, B.M.; Preinerstorfer, D. How reliable are bootstrap-based heteroskedasticity robust tests? Econ. Theory 2023, 39, 789–847. [Google Scholar] [CrossRef]
- Wang, Y.; Wu, Z.; Wang, C. High dimensional discriminant analysis under weak sparsity. Commun. Stat. Theory Methods 2025, 54, 2657–2674. [Google Scholar] [CrossRef]
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2024. [Google Scholar]
- Xu, T.; Zhu, R.; Shao, X. On variance estimation of random forests with infinite-order U-statistics. Electr. J. Stat. 2024, 18, 2135–2207. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).