Abstract
We consider a k-nearest neighbor-based nonparametric lack-of-fit test of constant regression in the presence of heteroscedastic variances. The asymptotic distribution of the test statistic is derived under the null and local alternatives for a fixed number of nearest neighbors. Advantages of our test compared to classical methods include: (1) the response variable can be discrete or continuous, regardless of whether the conditional distribution is symmetric, and can have variance depending on the predictor, which gives our test broad applicability to data from many practical fields; (2) this approach does not need nonlinear regression function estimation, which often affects the power for moderate sample sizes; (3) our test statistic achieves the parametric standardizing rate, which gives more power than smoothing-based nonparametric methods for moderate sample sizes. Our numerical simulations show that the proposed test is powerful and has noticeably better performance than some well-known tests when the data are generated from high frequency alternatives or are binary. The test is illustrated with an application to gene expression data and an assessment of the Richards growth curve fit to COVID-19 data.
1. Introduction
Nonparametric lack-of-fit tests where constant regression is assumed for the null hypothesis have been considered by many authors. The order selection test [1], the rank-based order selection test [2], and the Bayes sum test [3] are among the top few that are intuitive and easy to compute. A classical textbook review of the extensive efforts in nonparametric lack-of-fit tests based on smoothing methods is available in Reference [4]. Hart [2] extended the order selection method of Reference [1] to a rank-based test under the constant variance assumption, so that the test statistic is relatively insensitive to misspecification of distributional assumptions. These two order selection tests show excellent performance under low frequency alternatives. However, they may have low power under high frequency alternatives.
In another paper, Hart proposed several new tests based on Laplace approximations to better handle the high frequency alternatives [3]. In particular, one test with overall good power is the Bayes sum test. It is a modified cusum statistic with a better use of the sample Fourier coefficients arranged in the order of increasing frequency. Two versions of approximating the critical values were given in Reference [3], one based on normally generated data, and the other based on bootstrap resampling of the residuals under the null hypothesis of constant regression. It is interesting to note that, even though the response variable may not be from the normal distribution, the normal approximation approach tends to give even higher power than the bootstrap approach. An explanation for this is that the Bayes sum test starts with the canonical model that the estimators of the Fourier coefficients are normally distributed, and here the sample Fourier coefficients are approximately normally distributed for large sample sizes. Thus, the Bayes sum test works well for large sample sizes and is more powerful than the order selection test and the rank-based order selection test.
A major motivation for the current work is that practical data may have variances that vary with the covariate, whereas the order selection (OS), rank-based order selection (ROS), and Bayes sum tests were derived for homoscedastic regression problems. The scale parameter of the error term is assumed to be a constant in these three tests. Even in such a case, different estimators of the scale parameter may be used assuming either the null or alternative hypothesis is true.
To deal with the presence of heteroscedasticity for testing the no-effect null hypothesis, Chen et al. [5] proposed another test statistic in addition to bootstrapping the [6] version of the order selection test. The approximate sampling distribution of that test statistic was obtained using the wild bootstrap method. In the case of heteroscedasticity, it was shown in Reference [5] that the asymptotic distribution of the [6] version of the order selection test depends on the unknown variance function of the errors. Moreover, they showed that their statistic is more robust than that of Reference [6] to heteroscedasticity and has better level accuracy. It was further shown in Reference [5] that the wild bootstrap technique has an overall good performance in terms of level accuracy and power properties in the case of heteroscedasticity.
Other consistent nonparametric lack-of-fit tests using smoothing techniques have been proposed (cf. References [7,8,9,10,11,12,13,14]). Some of them are difficult to compute, in addition to requiring complicated conditions that are hard to justify. All of the aforementioned methods require the response variable to be continuous.
In this paper, we consider a nonparametric lack-of-fit test of constant regression in the presence of heteroscedastic variances. This test has better power for data from high frequency alternatives than the four tests reviewed above. In addition, our test can also be applied to discrete data. The test statistic is derived using the k-nearest neighbor augmentation defined through the ranks of the predictor. This idea was first proposed in Reference [15] for the analysis of covariance model, and further used in Reference [16] for a diagnostic test and in Reference [17] for a test of independence between a response variable and a covariate in the presence of treatments. A test statistic was defined in Reference [16] for a lack-of-fit test in the present regression setting. The authors considered each distinct covariate value as a factor level. Then, they augmented the observed data to construct what they called an artificial balanced one-way ANOVA (see Section 2.1 for further description of the augmentation). This way of constructing test statistics has great potential to gain power over smoothing-based methods. However, we found that the asymptotic variance estimator of the test statistic in Reference [16] seriously underestimates the true variance for intermediate sample sizes. As a consequence, regardless of the error distribution, their test has highly inflated type I error rates when k is small and becomes very conservative when k gets large.
In this paper, we present a very different asymptotic variance formula for the test statistic. In the special case of homoscedastic variance, our derived asymptotic variance contains one more term (a function of k) than that in Reference [16]. This explains the unstable behavior of the type I error pattern of their test. On the other hand, our test has consistent type I error rates across different sample sizes and different k values, and they are very close to the nominal alpha levels.
In Section 2, we state the hypotheses and define the test statistic as a difference of two quadratic forms, both of which estimate a common quantity but one under the null hypothesis and the other under the alternatives. Then, the asymptotic distribution of the test statistic is obtained under the null and the local alternatives for a fixed number of nearest neighbors. Moreover, we consider the idea of the Least Squares Cross-Validation (LSCV) procedure of Reference [18] to estimate the number of nearest neighbors. In Section 3, we present simulation studies with data generated having symmetric normal, light-tailed uniform, heavy-tailed T, and asymmetric heteroscedastic error distributions. The numerical results show that our test has encouragingly better performance in terms of type I error and power compared to the existing tests. In addition to the simulation comparisons, we present in Section 4 an application to gene expression data from patients undergoing radical prostatectomy and an application to assess COVID-19 model fit. A summary is given in Section 5. Technical proofs are provided in Appendix A.
2. Theoretical Results
2.1. The Hypotheses and Test Statistic
Let , , be an independent and identically distributed random sample. Let and denote the marginal probability density function and cumulative distribution function of , respectively. Denote Var and .
We wish to test the hypotheses:
This formulation works for both continuous and categorical response variable Y. For simplicity in the presentation, we assume that there are no duplicated observations for each value of covariate X. If there are duplicated observations, we can use the middle ranks to take care of this issue. In regression settings, the nonlinear conditional mean regression is often estimated through pooling observations from neighbors by one of the smoothing methods, such as loess, smoothing spline, kernel estimation, etc. For smoothing spline or kernel method, the number of observations in a window essentially needs to go to infinity as the sample size goes to infinity. The k-nearest neighbor approach is a popular method for classification, but the theory for a fixed k is very difficult for general regression. In this work, we use a fixed number of k-nearest neighbors in the data augmentation to help define a statistic for conducting a lack-of-fit test. This augmentation is done for each unique value of the predictor by generating a cell that contains k values of the response Y whose corresponding x values are among the k closest to in rank. We consider k to be an odd number for convenience so that the augmentation contains half of the (k− 1) values symmetrically on each side of when is an inner point. Let c denote an index defined by the covariate value , where and let denote the empirical distribution of X. We make the augmentation for each cell by selecting pairs of observations whose covariate values are among the k closest to in rank in addition to . Let denote the set of indices for the covariate values used in the augmented cell . Thus, for any pair to be selected in the augmentation of the cell , the difference between the ranks of and is no more than if is an interior point whose rank is between and , i.e., . For whose rank is less than or greater than , the difference between the ranks of and is no more than . 
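The rank-based augmentation described above can be sketched in code. This is an illustrative implementation, not the paper's own: the function name and array layout are mine, and the boundary convention (clamping the rank window so every cell keeps exactly k members) is an assumption consistent with the description of edge cells.

```python
import numpy as np

def augment_cells(x, y, k=3):
    """Rank-based k-nearest-neighbor augmentation (illustrative sketch).

    For each covariate value, form a cell containing the k response values
    whose x-ranks are closest to the rank of that covariate value.  An
    interior point takes (k - 1)/2 neighbors on each side of its rank; a
    boundary point takes the k nearest ranks available.
    """
    n = len(x)
    order = np.argsort(x)                 # indices sorted by covariate value
    y_sorted = np.asarray(y)[order]       # responses in x-rank order
    half = (k - 1) // 2                   # k is assumed odd
    cells = []
    for r in range(n):                    # r = 0-based rank of the cell center
        lo = min(max(r - half, 0), n - k) # clamp the window inside the data
        cells.append(y_sorted[lo:lo + k])
    return np.array(cells)                # n cells, each with k responses
```

For example, with five observations and k = 3, the cell centered at the smallest x holds the responses of the three smallest-ranked covariates, while an interior cell holds its own response plus one rank-neighbor on each side.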
This idea was first proposed in [15] and further used in References [16,17] for different problems. A test statistic was derived in Reference [16] for lack-of-fit testing in the present regression setting by considering each distinct covariate value as a factor level. Then, the observed data were augmented by considering a window around each that contains the nearest covariate values to construct what the authors called an artificial balanced one-way ANOVA. Similar augmentation was considered in Reference [17] when there is more than one treatment. Their results cannot be applied here since the asymptotic variance calculation is ill-defined when there is no treatment factor, as in our lack-of-fit setting.
Let , denote the augmented response values in cell under the null hypothesis. Define to be the indicator function that the difference between the ranks of and is no more than . Let and denote the average between-cell and within-cell variations defined as the following:
where , Note that and can be easily calculated since they resemble the mean squares statistics for an ANOVA model. The calculation is on the augmented data. In most cases in the literature, is used for constructing the test statistic when has fixed degrees of freedom. However, in our case, the degrees of freedom for is , which goes to infinity. Therefore, the statistic typically used in this case is (see Reference [19]), which involves showing that converges in distribution to normality and converges in probability to a constant. With augmented data, it is complicated to show that converges in probability. So, we define the following difference-based
as our test statistic instead of using -based one, where is a variance estimator for given later in (9). This test statistic is similar to that proposed in Reference [16], but with a different variance estimator.
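Since the between-cell and within-cell variations resemble the mean squares of a balanced one-way ANOVA computed on the augmented data, they can be sketched as below. This is a sketch under that reading only; the paper's exact statistics (with their rank-based weights) are elided from this extraction and may differ in normalization.

```python
import numpy as np

def mean_squares(cells):
    """Between-cell and within-cell average variations on augmented data.

    `cells` is an (n, k) array: n cells, each holding k augmented
    responses.  The formulas mirror the usual balanced one-way ANOVA
    mean squares; the paper's exact normalization is assumed, not quoted.
    """
    n, k = cells.shape
    cell_means = cells.mean(axis=1)       # one mean per augmented cell
    grand_mean = cells.mean()
    # between-cell variation: spread of cell means around the grand mean
    msb = k * np.sum((cell_means - grand_mean) ** 2) / (n - 1)
    # within-cell variation: spread of responses around their cell mean
    msw = np.sum((cells - cell_means[:, None]) ** 2) / (n * (k - 1))
    return msb, msw
```

Under the null of constant regression both quantities estimate the same variance, so their difference is the natural building block for the test statistic.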
To express and in terms of the original data, we can write
2.2. Asymptotic Distribution of the Test Statistic under the Null Hypothesis
Even though the test statistic is easy to calculate, the derivation of the asymptotic distribution is challenging since the augmented data in neighboring cells are correlated. In this subsection, we derive the asymptotic distribution of the test statistic using a different strategy than that proposed in Reference [16]. We first simplify it by finding its projection. Specifically, define
where . Then, we project onto the space:
of the form , where are constants, , and is some function that is possibly nonlinear. This projection will help us to split into two terms, one of which includes a summation over c and the other over c and for :
where and . Then, is in the space defined in (3) and where
and
Note that the term in (5) is closely related to the expected covariance between every pair of response values with correlation induced by their dependence on . The in (6) serves as a weight function which associates the response locally with the empirical distribution function of X. The term in (5) is more intuitive than for evaluating the lack-of-fit. However, cannot be calculated from the sample since is unknown. On the other hand, can be directly obtained from the sample.
We assume the following condition to obtain the result under the null hypothesis:
Assumption 1.
For all x, suppose that is differentiable, and the fourth conditional central moment of given is uniformly bounded.
The advantage of using a small or fixed k instead of a large k can be seen here. Even though is a quadratic form, only nearby cells have correlated observations due to the fixed number of nearest neighbors augmentation. On the other hand, when the number of nearest neighbors tends to infinity, the augmented data in many more cells will be correlated; therefore, might diverge, and the derivation of the asymptotic distribution will require unnecessarily strong conditions on the magnitude of the correlation. It is straightforward to show that with a small or fixed k. Hence, is asymptotically negligible. We state this result in Lemma 1 below.
Lemma 1
(Projection of ). Let be as defined in (4). If the Assumption 1 is satisfied, then
where the notation denotes convergence in probability.
To obtain the asymptotic distribution of the test statistic under the null hypothesis, we work with
where is defined in (6). We first give the large sample behavior of the variance of this term.
Theorem 1.
Under Assumption 1, exists and
where
and are the ranks of and among the covariate values .
To estimate the asymptotic variance, let be the rank of among all covariate values. Then, it is readily seen that a consistent estimator of under is
where is the sample variance based on the augmented observations for the cell determined by , i.e.,
Note that are bounded counts and (7) is a clean quadratic form as defined in Reference [20]. The Central Limit Theorem for clean quadratic forms (Proposition 3.2) in Reference [20] can be applied to obtain the following result. We omit the details of the proof.
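The cell-level sample variances that enter the plug-in variance estimator can be computed directly from the augmented cells. Only this building block is sketched here; how the cell variances are combined in (9) depends on rank-based weights that are elided from this extraction, so that combination is not reproduced.

```python
import numpy as np

def cell_variances(cells):
    """Sample variance of the k augmented responses in each cell.

    `cells` is an (n, k) array of augmented responses.  These cell-level
    variances are the building blocks of the plug-in variance estimator;
    their exact combination in (9) is not reproduced here.
    """
    # ddof=1 gives the usual unbiased sample variance within each cell
    return cells.var(axis=1, ddof=1)
```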
Theorem 2.
2.3. Results under Local or Fixed Alternatives
In this subsection, we consider the theoretical properties of the test under fixed or local alternatives in which the conditional expectation of Y given X is . Let be a univariate function of x.
Under a fixed alternative, can be expressed as
where is the conditional expectation of Y given X under the null hypothesis.
For local alternatives, consider the sequence of conditional expectations that approach in the order of :
Both alternatives are valid for either discrete or continuous response variable and allow the data to have different conditional variance under the alternative hypotheses from that under the null. For example, if has a Poisson distribution with mean under the alternative, then the variance is instead of .
Suppose , are observed data under either the fixed alternatives in (10) or the local alternatives in (11). Let be the augmented response values. Note that is equal to the observed response variable whose covariate value is one of the following:
Then, can be written as , where includes the conditional mean under the null hypothesis and departure from the null. Note that satisfies the null hypothesis and can be viewed as the augmented data for if are under the fixed alternative in (10) or for if are under the local alternatives in (11). In either case, the conditional mean given satisfies the null hypothesis but with equal to under the alternative hypotheses. For convenience, define to be the function evaluated at the covariate value for augmented observation . Let , , , , , and . Denote and to be the average between-cell variations and the average within-cell variations under the alternative hypotheses, respectively.
Under the local alternatives,
and
In this case, the numerator of the test statistic can be written as
where
Similarly, under the fixed alternatives,
where , are given in (13)–(17).
The following additional condition is needed for the result under the alternative hypotheses:
Assumption 2.
Suppose that has bounded support , and is locally Lipschitz continuous on : for each , there exists an such that is Lipschitz continuous on the neighborhood . Further, we assume that the fourth central moments of are uniformly bounded.
Before we give the asymptotic distribution of the test statistic under the alternatives, we state the following results which are valid under both the local and fixed alternative hypotheses.
Lemma 2.
Under Assumptions 1 and 2, as ,
where , , and are defined in (15), (16), and (17), respectively.
The proof of Lemma 2 is given in Appendix A. From this Lemma and Equations (12) and (18), we can see that and are the major terms that provide power under the alternative hypotheses. We state the results separately for fixed and local alternatives.
Theorem 3.
Note that in Theorem 1 and in Theorem 3 share the same formula, except that in needs to be calculated under the alternatives in (11). For example, if Y given X has a Bernoulli distribution, then the conditional variance of Y given X under the local alternatives in (11) is , which is different from that under the null hypothesis .
Theorem 4.
For the fixed alternative in (10), under Assumptions 1 and 2, the power of the test using statistic goes to one as .
The proofs of Theorems 3 and 4 are given in Appendix A.
In heteroscedastic regression, it is common in the literature to write with independent of . In this formulation, the entire error term is uncorrelated with . In the ideal case that there is no lack-of-fit, such a model is reasonable. However, when there is a lack-of-fit because a wrong regression function is specified, the error term still contains some systematic information of . Then, it is possible that the error resulting from the specified regression function is still correlated with .
2.4. Selection of the Number of Nearest Neighbors
The number of nearest neighbors k in the test statistic specifies the number of values augmented in each cell. Our theory requires it to be a small, finite odd integer. In simulations, we have found that the type I error remains close to the nominal level for different small k values and stays stable across a broad range of sample sizes and error distributions. Under the alternative hypothesis, different k may lead to different power for our test statistic. This section discusses how to select the parameter k.
Under the alternative hypothesis, our k-nearest neighbor augmentation is parallel to regression using a local constant based on k-nearest neighbors. For a continuous response variable, Hardle et al. [18] suggested the Least Squares Cross-Validation (LSCV) method for smoothing parameter (bandwidth) selection in kernel regression estimation. Chen et al. [5] recommended using the one-sided cross-validation procedure of Reference [21] to select smoothing parameter (bandwidth) for hypothesis testing. The number of nearest neighbors k in our setting has a similar role as the smoothing parameter in kernel regression.
For a categorical response variable, Holmes et al. [22] proposed an approach to select the parameter k in the k-nearest neighbor (KNN) classification algorithm using likelihood-based inference. Choosing k in this method can be considered as a generalized linear model variable-selection problem. In particular, for multinomial data , , where denotes the class label of the ith observation, and is a vector of p predictor variables, they considered the probability model
where denotes the data with the ith observation deleted, is a single regression parameter, and is the difference between the proportion of observations in class and that in class within the k-nearest neighbors of , i.e.,
where the notation denotes that the summation is over the k-nearest neighbors of in the set , and the neighbors are defined based on the Euclidean distance. The prediction for a new point is given by the most common class in the k-nearest neighbors of . Afterwards, the value that maximizes the profile pseudolikelihood is chosen to estimate the parameter k. However, this method is only valid when the response variable is categorical and the nearest neighbors are defined using the Euclidean distance.
In our case, the response variable can be continuous or categorical, and our nearest neighbors are defined through ranks. So, we do not recommend using our test statistic with an estimate of k obtained by the aforementioned procedures. We consider an alternative method to estimate k which uses ranks to define nearest neighbors and can be applied in both the categorical and continuous response cases. Here, we adopt the idea of the Least Squares Cross-Validation (LSCV) procedure of Reference [18] to select the parameter k. Different from Reference [18], where the regression function is estimated using kernel estimation, we consider k-nearest neighbor estimates with the neighbors defined through the ranks of the predictor variable. In the case of a categorical response variable with Q classes, we re-code the response variable to have integer values from 1 to Q. To estimate the class for the response variable, we use the majority vote (the most common value) from the k-nearest neighbors. For tied situations where multiple classes achieve the same highest frequency, one of them is randomly assigned to be the estimated response. In the case of a continuous response variable, the regression function is estimated by the average of the k-nearest neighbors.
In a leave-one-out procedure, for each , we eliminate and use the rest of the observations to estimate the regression function which then is used to predict the response value Y at . Here are our steps:
- 1.
- Find the observation in such that the absolute difference between this observation and is minimized. Denote . Then, is the closest to .
- 2.
- Find the k-nearest neighbors of in terms of ranks. We use the corresponding values such that to obtain the leave-one-out estimate of the regression function at . That is, where the Mode is defined as the most frequently observed value in a set of numbers. In the case where the most frequently observed values are not unique, one of them is randomly selected.
- 3.
- Repeat steps 1 and 2 for to obtain all leave-one-out estimates.
Then, define the leave-one-out Least Squares Cross-Validation error as
Finally, the number of nearest neighbors is estimated by
where the set consists of small odd integers.
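The three-step leave-one-out procedure and the LSCV criterion above can be sketched as follows. This is an illustrative implementation: the function name is mine, the boundary convention for the rank window is an assumption, and the candidate set K is the small-odd-integer set discussed in the text.

```python
import numpy as np
from collections import Counter

def lscv_select_k(x, y, K=(3, 5, 7, 9), categorical=False, seed=0):
    """Leave-one-out LSCV choice of the number of nearest neighbors.

    Neighbors are defined through the ranks of the predictor; K holds
    small odd integers, as the theory requires.  For a categorical
    response the leave-one-out estimate is the mode of the k neighbors
    (ties broken at random); for a continuous response, their average.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    errors = {}
    for k in K:
        sse = 0.0
        for i in range(n):
            xr, yr = np.delete(x, i), np.delete(y, i)   # leave (x_i, y_i) out
            order = np.argsort(xr)
            # step 1: rank position of the remaining value closest to x_i
            p = int(np.argmin(np.abs(xr[order] - x[i])))
            # step 2: k nearest in rank, window clamped at the boundaries
            lo = min(max(p - (k - 1) // 2, 0), (n - 1) - k)
            neigh = yr[order][lo:lo + k]
            if categorical:
                counts = Counter(neigh).most_common()
                best = [v for v, c in counts if c == counts[0][1]]
                pred = rng.choice(best)                  # random tie-break
            else:
                pred = neigh.mean()
            sse += (y[i] - pred) ** 2
        errors[k] = sse / n           # step 3: LSCV error for this k
    return min(errors, key=errors.get), errors
```

A run over simulated data returns the candidate k with the smallest leave-one-out least squares error along with the full error profile.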
When the response variable is categorical, the estimate of k from this algorithm depends on how well the covariate values from different classes are separated and how many observations are in each class. For large class sizes, it is very possible that the resulting estimate is much greater than 10 if we leave unconstrained. However, our theory requires k to be a finite, positive, and odd integer.
In the continuous case with k-nearest neighbor estimation, the average of a large proportion of Y values is used to approximate the response variable if a large k value is utilized. As a consequence, a bigger k tends to give a larger least squares error when the regression function is under the alternative hypothesis. This is especially true when the regression function has substantial curvature, such as under high frequency alternatives. On the other hand, a larger k tends to give a smaller least squares error when the data were generated under the constant regression null hypothesis.
In either case, the smallest value for k is 3 (note: corresponds to the case of no data augmentation). To keep the least squares error minimized under the alternative hypothesis and reasonable under the null hypothesis, we recommend letting contain a few small integer values. For example, , which is a safe choice for both moderate and large sample sizes.
Figure 1 shows the typical pattern of as a function of k for when the response variable was generated as (1) ; (2) ; (3) ; and (4) , where and are i.i.d .
Figure 1.
Typical patterns of versus k in continuous data.
3. Monte Carlo Simulation Studies
In this section, we present the results of simulation studies investigating the type I error and power performance of our test. The test has a parameter k specifying the number of nearest neighbors for data augmentation. The inference for our test requires k to be a small odd positive integer. We report the results for and 5, denoted as and , respectively, so that the user has an idea of how the test behaves with a given k. Furthermore, we report the results of our test with k selected from 3 and 5 using the method in Section 2.4, denoted as . For applied to each generated data set, the value of k is selected using in (20), and our test with parameter is used to obtain the p-value.
For comparison, we also report the corresponding results for the test of Reference [16], the order selection (OS) test of Reference [1], the rank-based test (ROS) of Reference [2], the bootstrap order selection test (BOS) of Reference [5], and the Bayes sum test of Reference [3]. As argued in Section 7.1 of Reference [4], evenly spaced design points should be used for calculation of these four test statistics even when they are unevenly spaced. So, the generated covariate values in increasing order were replaced by evenly spaced design points on for all four tests. For BOS, we apply the wild bootstrap algorithm of Reference [5] based on the residuals and use their test statistic with 1000 bootstrap samples for each replication. For the Bayes sum test, we use the statistic that has been reported to have good power from a comprehensive simulation study in Reference [3]. For approximating the p-values of the Bayes sum test, Hart [3] gave two versions of the approximation, one assuming normality (BN) and one using the bootstrap (BB). For BN, a random sample of the same sample size as the data was generated from the standard normal distribution, and the Bayes sum test statistic was calculated from the data so generated, regardless of the actual distribution of the response variable. The process was independently repeated 10,000 times, and the p-value was obtained based on the empirical distribution of these 10,000 values. For BB, the bootstrap samples were drawn from the empirical distribution of the residuals , , rather than the normal distribution, and the p-value approximation was carried out similarly. The scale parameter for a given data set in both BB and BN statistics was estimated by , as was suggested in Reference [3]. It was reported in Reference [3] that the results obtained using the normality assumption were in basic agreement with those obtained using the bootstrap. So, we only report the simulation results for BN.
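The BN approximation just described can be sketched generically: the null reference distribution of a statistic is built from repeated standard normal samples of the same size, regardless of the actual response distribution. Here `stat_fn` is a placeholder for whichever statistic is being calibrated, not a function from the paper, and an upper-tail rejection region is assumed.

```python
import numpy as np

def normal_approx_pvalue(stat_fn, y_obs, n_rep=10_000, seed=0):
    """Monte Carlo p-value under the normality approximation (BN).

    `stat_fn` maps a response sample to a scalar test statistic.  The
    reference distribution is formed from standard normal samples of the
    same size as the data, mirroring the BN scheme described in the text
    (illustrative sketch; `stat_fn` is a placeholder).
    """
    rng = np.random.default_rng(seed)
    t_obs = stat_fn(y_obs)
    n = len(y_obs)
    # empirical null distribution from normally generated pseudo-samples
    t_null = np.array([stat_fn(rng.standard_normal(n)) for _ in range(n_rep)])
    return float(np.mean(t_null >= t_obs))   # upper-tail empirical p-value
```

The BB version would instead resample `stat_fn` inputs from the empirical distribution of the null residuals; only the sampling line changes.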
The values for the covariate X were independently generated from Uniform. First, we consider the performance of different tests under the . The data were generated from
where the error terms were independently generated from one of the four error distributions:
- 1.
- 2.
- 3.
- 4.

Table 1. Percent of rejection at the 0.05 level under for data generated from Model . OS: order selection test of Reference [1]; BN: Bayes sum test of Reference [3] with normal approximation; BOS: order selection test with wild bootstrap of Reference [5]; ROS: rank-based test of Reference [2]; WAI3, WAI5, WAI7: test of Reference [16] with , 5, and 7, respectively; : the proposed test with , 5, 7, and from (20). Highly inflated empirical type I errors are marked in red.

Table 2. Percent of rejection under high frequency alternatives to with and sample size . The legends of the tests are the same as in Table 1. WAI is not included since it could not keep its type I error under control. Note: BN also has a highly elevated type I error in the heteroscedastic case.
The empirical type I error rates (in percentage) under are reported in Table 1. It can be seen that the test of Reference [16] with or 5 generally has inflated type I error, which is particularly serious for smaller sample sizes. For , their test has type I error close to 0.05 when the error distribution is Normal or T(5)/30 but is about twice the significance level in the heteroscedastic case. Its performance for is better than with the other values but is still inflated in the heteroscedastic case. The order selection test of Reference [1] and the proposed test with different k-values have better type I error control. Among the three k-values, a larger k pulls more observations around each covariate value as pseudo replicates, which could make the test less sensitive to curvature departures from the null hypothesis. Hence, we recommend choosing k between 3 and 5.
Next, we consider the performance of the tests with data generated from nonlinear models. The response values were independently generated according to the following four models for , with the moderate sample size of in all cases:
- Model : ,
- Model : ,
- Model : , and
- Model : ,
where q in Models – represents the frequency. We considered and . The case with is a higher frequency alternative compared to those reported in Reference [3]. The data for the error term in each model were independently generated with one of the four error distributions listed earlier. Model serves as the null model to obtain the type I error rates for all tests. For each error distribution, the data were generated from Models – with sample size , 2000 times, and the rejection rates (in percentage) at significance level 0.05 are reported in Table 2.
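The simulation setup can be illustrated with a stand-in model. The exact regression functions of Models – are not reproduced in this extraction, so the mean function m(x) = C·cos(2πqx), the amplitude C, and the standard normal errors below are assumptions for illustration only; q plays the role of the frequency, and the other error distributions would be swapped in at the marked line.

```python
import numpy as np

def generate_sample(n=50, q=6, C=1.0, seed=0):
    """Generate one sample from a hypothetical high-frequency alternative.

    m(x) = C * cos(2*pi*q*x) is a stand-in mean function (assumption, not
    the paper's model): q controls the frequency of the alternative and C
    the signal-to-noise ratio.  Covariates are Uniform(0, 1).
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=n)
    eps = rng.standard_normal(n)   # swap in uniform / T / heteroscedastic errors
    y = C * np.cos(2 * np.pi * q * x) + eps
    return x, y
```

Setting C = 0 recovers the constant-regression null, so the same generator serves for both type I error and power estimates.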
It can be seen that the type I error estimates for all tests were below or close to the nominal level 0.05 for all models with homoscedastic errors. For the heteroscedastic regression model, the variance of the error depends on the covariate, while the conditional mean of the response variable given the covariate is a constant under Model . In this case, all the tests tend to be liberal.
The rows to in Table 2 show the power comparison for the different combinations between Models – and the four types of the error distribution. The powers of our test with () are higher than all other tests in all cases. BN has power close to our test. OS, ROS, and BOS fall far behind. The low power performance of BOS in the case of high frequency alternatives was mentioned in Reference [5], and they suggested (without details) to use smoothing squared residuals to deal with that.
It is noticeable that the power of our test is 1 for Models and for all different types of the error distribution and very close to 1 for Models and . In addition, the power for OS was slightly higher than that for ROS in all cases.
Models and are similar, except that Model has a lower signal-to-noise ratio than Model . With the lower signal-to-noise ratio, the power for ROS, OS, and BOS drops drastically. To take a closer look at the numerical performance of all tests under local alternatives, we considered the model , with and Uniform. The empirical power curves are given in Figure 2. It is obvious that our test has consistently higher power than the other tests.
Figure 2.
Empirical power of the tests for data generated from with sample size and Uniform. Due to small values of C, the signal to noise ratio is low. WAI is not included since it could not keep its type I error under control for this sample size and the uniform error distribution.
The discussion above concerns high frequency alternatives with and moderate sample size . When the sample size increases while the frequency stays the same, the power of each test also increases. For a sample size of 100, the empirical power is 1 for all of the compared tests BN, OS, ROS, and under Models –. Under Model , OS and ROS have power slightly below 1 in the case of uniform errors; the remaining tests have power close to 1. Similarly, for lower frequency alternatives, for example, when and , all these tests have power close to 1.
To examine how the power of these tests changes with the sample size, we generated data with model , where , Uniform, for . The empirical power of these tests is presented in Figure 3, where is our test with selected from and 5 based on (20). It is obvious that the proposed test consistently has the highest power over all the sample sizes considered.
Figure 3.
Empirical power for different sample sizes. The data were generated from , where Uniform. As the sample size N increases, the power of all tests increases; the power of approaches 1 much faster than that of the other tests. WAI is not included since it could not keep its type I error under control for these sample sizes and the uniform error distribution.
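The qualitative behavior in Figure 3 — power rising to 1 with the sample size — can be reproduced with a small Monte Carlo sketch. The statistic below is a simple von Neumann ratio comparing successive differences of the responses (ordered by the covariate) to the overall variance; it is a stand-in for illustration only, not the test statistic of this paper, and the cosine alternative and all constants are assumptions.

```python
import numpy as np

def vn_test(x, y, z_crit=-1.645):
    """Toy lack-of-fit stand-in: order y by x and compare lag-1 squared
    differences to the overall variance (a von Neumann ratio). Under
    constant regression the ratio is near 1; a smooth trend shrinks it."""
    ys = y[np.argsort(x)]
    n = len(ys)
    vn = np.sum(np.diff(ys) ** 2) / (2.0 * np.sum((ys - ys.mean()) ** 2))
    return (vn - 1.0) * np.sqrt(n) < z_crit  # one-sided, roughly 5% level

def empirical_power(n, c, q, n_rep=300, seed=0):
    """Monte Carlo rejection rate for y = c*cos(2*pi*q*x) + e,
    x ~ U(0,1), e ~ U(-0.5, 0.5); c = 0 gives the type I error rate."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_rep):
        x = rng.uniform(0.0, 1.0, n)
        y = c * np.cos(2.0 * np.pi * q * x) + rng.uniform(-0.5, 0.5, n)
        hits += vn_test(x, y)
    return hits / n_rep
```

As expected for a consistent test, the rejection rate stays near the nominal level when c = 0 and climbs toward 1 as c or n grows.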
Even though BN showed performance comparable to our test in many cases, its running time is much longer than that of . In particular, the average running time over 10,000 runs on BEOCAT cluster machines for is 0.03 s, while that for BN is 9.7 s. So, is more than 300 times faster than BN.
4. Applications to Real Data
4.1. Application to Gene Expression Data from Patients Undergoing Radical Prostatectomy
In this subsection, we present an application of our test to gene expression data from patients undergoing radical prostatectomy, used to predict the behavior of prostate cancer. This data set was collected between 1995 and 1997 at the Brigham and Women’s Hospital from 52 tumor and 50 normal prostate samples using oligonucleotide microarrays containing probes for 12,600 genes and expressed sequence tags (the data are available at https://www.ncbi.nlm.nih.gov/gds/?linkname=pubmed_gds&from_uid=12086878, accessed on 11 July 2021). The data show heterogeneity and have a binary response variable, the patient outcome (tumor or normal). Applying our test to the expression data from each gene, we identified 980 genes that are significantly associated with the response variable after Bonferroni correction (dividing the significance level by 12,600). In comparison, Singh et al. [23] used a permutation test to identify important genes and found 456 genes whose expression values are significantly correlated with patient outcome. Note that the significance declared in Reference [23] is at the 0.001 level without any multiple comparison adjustment. Ours is at the same significance level but with Bonferroni control, a very conservative multiple comparison adjustment. Even with such conservative control, we identified more than twice as many genes as Reference [23]. It is worth mentioning that our test was developed under very general assumptions that are expected to hold for the microarray data here. These results suggest that our test is much more powerful than the permutation test of Reference [23]. Furthermore, we performed k-nearest neighbor (KNN) classification on the data for the top i genes (the i genes with the smallest p-values, ) to predict the patient outcomes, with leave-one-out cross validation (LOOCV) as the validation method.
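Before turning to classification, note that the Bonferroni step above is simple to state in code: with m simultaneous tests, each raw p-value is compared to the significance level divided by m. A minimal sketch (the p-values are made up for illustration):

```python
def bonferroni_significant(pvals, alpha=0.001):
    """Indices of hypotheses rejected under Bonferroni control:
    compare each raw p-value to alpha / m, where m = len(pvals)."""
    cutoff = alpha / len(pvals)
    return [i for i, p in enumerate(pvals) if p < cutoff]

# With four hypothetical genes at level 0.001, the cutoff is 0.00025
print(bonferroni_significant([1e-9, 0.5, 2e-7, 0.01]))  # → [0, 2]
```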
The parameter k in KNN was estimated with the training part of the data in the LOOCV procedure by the profile pseudolikelihood method of Reference [22]. The leave-one-out accuracy curve for increasing numbers of selected top genes is shown in Figure 4. We note that these genes were selected individually; this simple application of our test is not meant to find the combination of genes with the best classification accuracy. Even so, the top genes found with our test give good LOOCV accuracy.
Figure 4.
The leave-one-out accuracy curve with increasing number of selected genes.
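The LOOCV accuracy computation behind Figure 4 can be sketched as follows. This is a generic majority-vote k-NN classifier with Euclidean distance — it does not implement the profile pseudolikelihood choice of k from Reference [22] — and the two-class synthetic data are assumptions for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    d = np.linalg.norm(X_train - x, axis=1)
    return np.bincount(y_train[np.argsort(d)[:k]]).argmax()

def loocv_accuracy(X, y, k):
    """Leave-one-out cross-validated accuracy of a k-NN classifier:
    each point is predicted from all the others."""
    n = len(y)
    idx = np.arange(n)
    correct = sum(
        knn_predict(X[idx != i], y[idx != i], X[i], k) == y[i]
        for i in range(n)
    )
    return correct / n

# Two well-separated synthetic classes, 40 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(3.0, 1.0, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
acc = loocv_accuracy(X, y, k=5)
```

With classes this well separated, the leave-one-out accuracy is close to 1; in the gene application, the curve is traced out by repeating this computation as top-ranked genes are added.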
4.2. Application to Assess Richards Growth Curve Fit for the COVID Cases and Deaths
In this application, we would like to assess if the popular Richards growth model can fit the COVID-19 cases or deaths for the U.S.
The Richards growth curve model has recently been adapted for real-time prediction of disease outbreaks in epidemiology. Below is the form of the Richards curve that was used to fit the COVID-19 outbreak in Reference [24]:
where , , and are real numbers, and is a positive real number.
Using this parameterization of the Richards curve, Lee et al. [24] stated that, given a progression constant with , the flat time point is given by
They predicted the posterior means of the flat time points for the U.S. to be 30 May, 16 July, 30 August, and 15 October when the corresponding ’s are chosen as 0.9, 0.99, 0.999, and 0.9999, respectively. However, as of the end of 2020, the number of confirmed COVID-19 cases was still climbing. This is evidence that the Richards curve does not fit the COVID-19 infection growth well.
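To make the flat time point concrete, the sketch below uses one common Richards parameterization, f(t) = a(1 + ξ exp(−k(t − τ)))^(−1/ξ); this form and the closed-form flat time solving f(t) = γa are assumptions for illustration and may differ in detail from the parameterization of Reference [24].

```python
import math

def richards(t, a, xi, k, tau):
    """Assumed Richards curve: f(t) = a*(1 + xi*exp(-k*(t - tau)))**(-1/xi)."""
    return a * (1.0 + xi * math.exp(-k * (t - tau))) ** (-1.0 / xi)

def flat_time_point(gamma, a, xi, k, tau):
    """Time at which the assumed curve reaches the fraction gamma of its
    final size a (0 < gamma < 1), i.e., the solution of f(t) = gamma*a.
    Solving gives t = tau - log((gamma**(-xi) - 1)/xi) / k."""
    return tau - math.log((gamma ** (-xi) - 1.0) / xi) / k
```

Larger γ pushes the flat time point later, mirroring the ordering of the four predicted dates quoted above.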
We downloaded the cumulative number of confirmed cases of U.S. COVID-19 historical data from https://covidtracking.com/data/download/national-history.csv on 12 July 2021. This website stopped tracking COVID in late March 2021. We also downloaded the daily number of death counts for U.S. COVID-19 data from the CSSE data repository of Johns Hopkins University at https://github.com/CSSEGISandData/COVID-19 on the same day. The data contains death counts up to 11 July 2021.
To fit a Richards curve to each data set, we only included days on which the cumulative count was at least seven and removed the last ten days of data, so that those ten days’ counts could be used as an additional out-of-sample assessment of the model fit. A separate Richards curve was fitted to the daily cumulative confirmed cases and to the deaths. Figure 5 shows the nonlinear least squares fitted Richards growth curves and the observed counts; the fitted curves also include predictions for the ten days beyond the observed days. Table 3 gives the correlation coefficients between the observed and estimated counts. Even though the correlations between the fitted curves and the observed numbers are over 99%, the p-values of the lack-of-fit tests are both essentially 0. The Richards curve appears better at predicting the number of confirmed cases than the number of deaths. The pandemic progression is evidently more complicated than the Richards curve allows, due to changes in social distancing, stay-at-home policies, and vaccine availability. U.S. holidays, such as Thanksgiving and Christmas, also contributed drastically to the increase in counts in late November. The parameter predicts the epidemic size. For the total number of deaths, the model predicted an epidemic size of 594,331; for the confirmed cases, the predicted size is around 27.9 million. Both numbers underestimate the true epidemic size. As of 12 July 2021, U.S. confirmed cases exceeded 33.8 million, and the number of U.S. deaths was 606,000. These numbers are much larger than the model predicted sizes, suggesting that the Richards curve cannot model the confirmed cases or deaths adequately.
Figure 5.
Observed U.S. COVID-19 mortality and confirmed cases with the Richards growth curve estimates.
Table 3.
Parameters of the Richards growth models, between the observed and model estimated counts, and the p-values of the lack-of-fit tests.
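A nonlinear least squares fit of this kind can be sketched with `scipy.optimize.curve_fit`. The parameterization, the synthetic "cumulative counts", the starting values, and the bounds below are all assumptions for illustration, not the fit reported in Table 3; the last ten points are held out, as in the text.

```python
import numpy as np
from scipy.optimize import curve_fit

def richards(t, a, xi, k, tau):
    # Assumed Richards parameterization (see the discussion in the text)
    return a * (1.0 + xi * np.exp(-k * (t - tau))) ** (-1.0 / xi)

# Synthetic cumulative counts from a known curve plus noise
rng = np.random.default_rng(1)
t = np.arange(0.0, 120.0)
y = richards(t, 5e4, 1.0, 0.08, 60.0) + rng.normal(0.0, 200.0, t.size)

# Hold out the last ten days as an out-of-sample check
t_fit, y_fit = t[:-10], y[:-10]
popt, _ = curve_fit(
    richards, t_fit, y_fit,
    p0=[4e4, 1.0, 0.1, 50.0],
    bounds=([1e3, 0.01, 0.001, 0.0], [1e6, 10.0, 1.0, 200.0]),
)
r = np.corrcoef(y, richards(t, *popt))[0, 1]  # correlation, fitted vs observed
```

A correlation near 1 between fitted and observed counts does not by itself establish an adequate fit — which is exactly why the formal lack-of-fit test is informative here.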
5. Conclusions
In this paper, we derived the asymptotic distribution of a nonparametric lack-of-fit test of constant regression in the presence of heteroscedastic variances. We considered a test statistic obtained by augmenting a small number of k-nearest neighbors defined through the ranks of the predictor variable. The test statistic is the studentized difference of two quadratic forms. Both quadratic forms estimate a common quantity under the null hypothesis, but they converge to different quantities under any alternative. The asymptotic distribution of the difference was also given in Reference [16], but with a biased asymptotic variance. We derived the correct form of the asymptotic distribution of the test statistic under both the null hypothesis and local alternatives. In addition, we provided a procedure to choose the parameter k based on the least squares cross-validation idea used in k-nearest neighbor regression. Our test has several advantages. It provides a unified framework for testing lack-of-fit of a given regression function when the response is either a discrete or a continuous random variable and the covariate is continuous, which makes it convenient for unified inference and applications. No distributional assumptions on the data are needed, which makes the test widely applicable to practical data. The fixed number of nearest neighbors used in the augmentation ensures good power to detect both low and high frequency alternatives even for moderate sample sizes, and the test statistic achieves the parametric standardizing rate. The test statistic is also easy and fast to calculate. Our simulation studies show that the test is more powerful than some well known competing procedures when data are generated under high frequency alternatives. The results in this paper therefore offer a useful tool for lack-of-fit testing.
Author Contributions
Conceptualization, M.M.G. and H.W.; methodology, M.M.G. and H.W.; software, M.S. and H.W.; validation, M.M.G., M.S., H.W., and S.W.; formal analysis, M.S. and H.W.; investigation, M.M.G. and H.W.; data curation, H.W.; writing—original draft preparation, M.M.G. and H.W.; writing—review and editing, H.W. and S.W.; visualization, H.W.; supervision, H.W. and S.W.; project administration, H.W.; funding acquisition, H.W. and S.W. All authors have read and agreed to the published version of the manuscript.
Funding
This work was partially supported by two grants #246077 and #499650 from the Simons Foundation.
Institutional Review Board Statement
This study used publicly available data; no institutional review was needed.
Informed Consent Statement
Not applicable.
Data Availability Statement
The COVID-19 data used in the study are available at https://covidtracking.com/data/download/national-history.csv and https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv. The data analyzed in this article were downloaded on 12 July 2021.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Appendix A.1. Proof of Theorem 1
We can write . It is clear that since by the definition of in (5),
Therefore, we only need to consider to obtain . Let . Then,
Let be the order statistic for so that is the rank of among . Then,
To find the conditional expectation, without loss of generality, assume that , so that . Let
where the upper spacing, and the lower spacing, from . Applying Taylor’s expansion twice, we can write
From the properties of spacings in Reference [25], we have
Therefore, for and , we have
if (or symmetrically ), then
Collecting terms from (A2) and (A3), we have
Now, consider the conditional variance. Note that, when , the term in is a constant. Therefore,
where the last equality is due to the fact that the indicator functions involving and are conditionally independent when and neither , is or . Plugging (A2) through (A4) into the right-hand side of the equation above, we obtain
Putting (A4) and (A5) into (A1), we have
Next, we will show that the limit of exists. Note that , where
It is clear that and are both at least 1. Therefore, is nonnegative. Consequently, is a summation of nonnegative terms.
Under Assumption 1, the conditional variance of given is uniformly bounded (i.e., there exists a constant such that for all j). We have
If we replace the summation in (A6) over the original sample index by the summation over the ranks and denoting
then we have
The right-hand side of the inequality (A7) converges to
which is finite for finite C and fixed (note that, in our augmentation, k is a finite odd integer with minimum value of 3). Note that is the summation of nonnegative terms (with probability 1) due to the fact that . Hence, the limit of exists as a result of the Comparison Test in calculus.
The convergence of is due to the Dominated Convergence Theorem after noticing that the expectation of (A8) is finite. Applying the Dominated Convergence Theorem to , we get . This completes the proof.
Appendix A.2. Lemma A1 and Its Proof
The following lemma will be needed in the proof of Lemma 2.
Lemma A1.
For locally Lipschitz continuous function on a bounded support , we have
uniformly in , for a given .
Sketch Proof of Lemma A1.
Recall that and are the marginal probability density function and cumulative distribution function of , respectively. Let be independent Exponential random variables with mean 1, and be independent Uniform random variables on . Without loss of generality, assume that are ordered. Define , for . Then, from the properties of spacings on page 406 of Reference [25], there exists an such that and for . For ,
where is some positive constant.
Note that the random variables and are independent, , and has distribution. Therefore,
and
where the inequality in (A10) is due to the fact that . Due to (A9) and (A10) and by Theorem 14.4.1 in Reference [26], we have
Consequently, for in the same cell,
where the last equality in (A11) is due to since are included in the same cell.
Appendix A.3. Proof of Lemma 2
The proof of each part is given separately.
Sketch Proof of Lemma 2 Part (i).
From (15), we have
By Lemma A1 and Assumption 2,
where . Therefore, can be written as
Denote and . Then, we can write
First, we will show that
therefore, the first term in (A15) is . Note that , and is a function of . Therefore, we have
and
Denote the first term and second term in (A18) as and , respectively. Let , for all i. Then,
where the equality in (A19) is due to the fact that and are independent when . Similarly,
Consider individual summands in (A20) and (A21). By the Cauchy-Schwarz inequality and Assumptions 1 and 2,
Similarly,
Note that can only be used to augment at most cells. That is, if the rank of is r, then cannot be used to augment cells whose x values have ranks not in the set of positive integers . Therefore, the summation over c in (A20) and that over c and in (A21) each contain no more than terms. As a result, the two terms and are ; therefore,
Due to (A17) and (A22), the proof of (A16) is completed by applying Theorem 14.4-1 in Reference [26].
Sketch Proof of Lemma 2 Part (ii).
From (16), we have
By Lemma A1, we have . Thus,
therefore, is . This completes the proof. □
Sketch Proof of Lemma 2 Part (iii).
From (17), we have
By Hölder’s inequality,
Next, we show that
We can write
Note that
where the last equality in (A28) is due to the fact that is uniformly bounded by Assumption 1, and the summation over i in (A28) contains only k terms. Denote , for all i. Then,
where the equality in (A30) is due to the fact that and are uniformly bounded by Assumption 1, and the summation over c in (A29) and that over c and in (A30) each contain no more than terms.
From (A28) and (A30), we have
Due to (A28) and (A31) and by Theorem 14.4-1 in Reference [26], we have
Similarly, it can be shown that the second term in (A27) is ; therefore, the proof of (A26) is completed.
From (A24)–(A26),
This completes the proof. □
Appendix A.4. Sketch Proof of Theorem 3
The proof of the existence of is similar to that for in Theorem 1. Now, we show that
From (12), we have
where through are defined in (13)–(17). The and are the average between-cell and within-cell variations for augmented observations with as the response. Note that the conditional mean of given satisfies the null hypothesis. But is equal to . Theorem 2 implies that
By Lemma 2, we have
Thus, we only need to consider to obtain the asymptotic mean under the alternatives.
Note that are i.i.d. since are i.i.d. From (A13) and (A14), we can write in (14) as
where is the sample variance of . By the Weak Law of Large Numbers,
as and k stays fixed.
From (A35)–(A37), we have
From (A33), (A34), and (A38) and by applying Slutsky’s Theorem, we have
Theorem 3 then follows immediately since it is readily seen that under the local alternative (11).
Appendix A.5. Sketch Proof of Theorem 4
From the proof of Theorem 3, we know (see (A34)), where is defined similarly as in Theorem 1 but with calculated under the fixed alternative hypothesis.
We also know (see (A37)), where is given in (19). Hence, .
Compared to , the remaining three terms in (18) involving , , or are all negligible. This is because Lemma 2 implies that
Putting all terms together as in Equation (18) and applying Slutsky’s Theorem, we know that
Since and the test rejects the null hypothesis at significance level if , the power of the test is approximately
due to (A39). Essentially, the asymptotic mean of the test statistic diverges to infinity; hence, the power goes to one.
References
- Eubank, R.L.; Hart, J.D. Testing goodness-of-fit in regression via order selection criteria. Ann. Statist. 1992, 20, 1412–1425. [Google Scholar] [CrossRef]
- Hart, J. Smoothing-inspired lack-of-fit tests based on ranks. In Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen; Institute of Mathematical Statistics: Beachwood, OH, USA, 2008; Volume 1, pp. 138–155. [Google Scholar]
- Hart, J. Frequentist-Bayes lack-of-fit tests based on Laplace approximations. J. Stat. Theory Pract. 2009, 3, 681–704. [Google Scholar] [CrossRef]
- Hart, J. Nonparametric Smoothing and Lack-of-Fit Test; Springer: New York, NY, USA, 1997. [Google Scholar]
- Chen, C.-F.; Hart, J.D.; Wang, S. Bootstrapping the order selection test. J. Nonparametr. Stat. 2001, 13, 851–882. [Google Scholar] [CrossRef]
- Kuchibhatla, M.; Hart, J.D. Smoothing-based lack-of-fit tests: Variations on a theme. J. Nonparametr. Stat. 1996, 7, 1–22. [Google Scholar] [CrossRef]
- Lee, B.J. A Nonparametric Model Specification Test Using a Kernel Regression Method. Ph.D. Thesis, University of Wisconsin, Madison, WI, USA, 1988. [Google Scholar]
- Yatchew, A.J. Nonparametric regression tests based on least square. Econom. Theory 1992, 8, 435–451. [Google Scholar] [CrossRef]
- Eubank, R.L.; Spiegelman, C.H. Testing the goodness of fit of a linear model via nonparametric regression techniques. J. Am. Stat. Assoc. 1990, 85, 387–392. [Google Scholar] [CrossRef]
- Hardle, W.; Mammen, E. Comparing nonparametric versus parametric regression fits. Ann. Statist. 1993, 21, 1926–1947. [Google Scholar] [CrossRef]
- Zheng, J.X. A consistent test of functional form via nonparametric estimation techniques. J. Econom. 1996, 75, 263–289. [Google Scholar] [CrossRef]
- Horowitz, J.Z.; Spokoiny, V.G. An adaptive, rate-optimal test of a parametric mean-regression model against a nonparametric alternative. Econometrica 2001, 69, 599–631. [Google Scholar] [CrossRef]
- Guerre, E.; Lavergne, P. Data-driven rate-optimal specification testing in regression models. Ann. Statist. 2005, 33, 840–870. [Google Scholar] [CrossRef] [Green Version]
- Song, W.; Du, J. A note on testing the regression functions via nonparametric smoothing. Can. J. Stat. 2011, 39, 108–125. [Google Scholar] [CrossRef]
- Wang, L.; Akritas, M. Testing for covariate effects in the fully nonparametric analysis of covariance model. J. Am. Stat. Assoc. 2006, 101, 722–736. [Google Scholar] [CrossRef]
- Wang, L.; Akritas, M.G.; Van Keilegom, I. An ANOVA-type nonparametric diagnostic test for heteroscedastic regression models. J. Nonparametr. Stat. 2008, 20, 365–382. [Google Scholar] [CrossRef]
- Wang, H.; Tolos, S.; Wang, S. A distribution free test to detect general dependence between a response variable and a covariate in the presence of heteroscedastic treatment effects. Can. J. Stat. 2010, 38, 408–433. [Google Scholar] [CrossRef]
- Hardle, W.; Hall, P.; Marron, J.S. How far are automatically chosen regression smoothing parameters from their optimum? J. Am. Stat. Assoc. 1988, 83, 86–95. [Google Scholar]
- Wang, H.; Akritas, M. Asymptotically distribution free tests in heteroscedastic unbalanced high dimensional anova. Stat. Sin. 2011, 21, 1341–1377. [Google Scholar] [CrossRef] [Green Version]
- de Jong, P. A central limit theorem for generalized quadratic forms. Probab. Theory Relat. Fields 1987, 75, 261–277. [Google Scholar] [CrossRef]
- Hart, J.D.; Yi, S. One-sided cross-validation. J. Am. Stat. Assoc. 1998, 93, 620–631. [Google Scholar] [CrossRef]
- Holmes, C.C.; Adams, N.M. Likelihood inference in nearest-neighbour classification models. Biometrika 2003, 90, 99–112. [Google Scholar] [CrossRef] [Green Version]
- Singh, D.; Febbo, P.G.; Ross, K.; Jackson, D.G.; Manola, J.; Ladd, C.; Tamayo, P.; Renshaw, A.A.; D’Amico, A.V.; Richie, J.P.; et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002, 1, 203–209. [Google Scholar] [CrossRef] [Green Version]
- Lee, S.Y.; Lei, B.; Mallick, B.; Samy, A.M. Estimation of covid-19 spread curves integrating global data and borrowing information. PLoS ONE 2020, 15, e0236860. [Google Scholar] [CrossRef] [PubMed]
- Pyke, R. Spacings (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 1965, 27, 395–449. [Google Scholar]
- Bishop, Y.M.; Fienberg, S.E.; Holland, P.W. Discrete Multivariate Analysis; Springer: New York, NY, USA, 2007. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).