Abstract
We consider a k-nearest neighbor-based nonparametric lack-of-fit test of constant regression in the presence of heteroscedastic variances. The asymptotic distribution of the test statistic is derived under the null and local alternatives for a fixed number of nearest neighbors. Advantages of our test compared to classical methods include: (1) the response variable can be discrete or continuous, regardless of whether the conditional distribution is symmetric, and can have variance depending on the predictor, which gives our test broad applicability to data from many practical fields; (2) this approach does not need nonlinear regression function estimation, which often affects the power for moderate sample sizes; (3) our test statistic achieves the parametric standardizing rate, which gives more power than smoothing-based nonparametric methods for moderate sample sizes. Our numerical simulations show that the proposed test is powerful and has noticeably better performance than some well-known tests when the data are generated from high frequency alternatives or are binary. The test is illustrated with an application to gene expression data and an assessment of the Richards growth curve fit to COVID-19 data.
1. Introduction
Nonparametric lack-of-fit tests where constant regression is assumed for the null hypothesis have been considered by many authors. The order selection test [1], the rank-based order selection test [2], and the Bayes sum test [3] are among the top few that are intuitive and easy to compute. A classical textbook review of the extensive efforts in nonparametric lack-of-fit tests based on smoothing methods is available in Reference [4]. Hart [2] extended the order selection method of Reference [1] to a rank-based test under the constant variance assumption, so that the test statistic is relatively insensitive to misspecification of distributional assumptions. These two order selection tests show excellent performance under low frequency alternatives. However, they may have low power under high frequency alternatives.
In another paper, Hart proposed several new tests based on Laplace approximations to better handle the high frequency alternatives [3]. In particular, one test with overall good power is the Bayes sum test. It is a modified cusum statistic with a better use of the sample Fourier coefficients arranged in the order of increasing frequency. Two versions of approximating the critical values were given in Reference [3], one based on normally generated data, and the other based on bootstrap resampling of the residuals under the null hypothesis of constant regression. It is interesting to note that, even though the response variable may not be from the normal distribution, the normal approximation approach tends to give even higher power than the bootstrap approach. An explanation for this is that the Bayes sum test starts with the canonical model that the estimators of the Fourier coefficients are normally distributed, and here the sample Fourier coefficients are approximately normally distributed for large sample sizes. Thus, the Bayes sum test works well for large sample sizes and is more powerful than the order selection test and the rank-based order selection test.
A major motivation for the current work is that practical data may have variances that vary with the covariate, whereas the order selection (OS), rank-based order selection (ROS), and Bayes sum tests were derived for homoscedastic regression problems. The scale parameter of the error term is assumed to be a constant in these three tests. Even in such a case, different estimators of the scale parameter may be used assuming either the null or alternative hypothesis is true.
To deal with the presence of heteroscedasticity for testing the no-effect null hypothesis, Chen et al. [5] proposed another test statistic in addition to bootstrapping the [6] version of the order selection test. The approximate sampling distribution of that test statistic was obtained using the wild bootstrap method. In the case of heteroscedasticity, it was shown in Reference [5] that the asymptotic distribution of the [6] version of the order selection test depends on the unknown variance function of the errors. Moreover, they showed that their statistic is more robust than that of Reference [6] to heteroscedasticity and has better level accuracy. It was further shown in Reference [5] that the wild bootstrap technique has an overall good performance in terms of level accuracy and power properties in the case of heteroscedasticity.
Other consistent nonparametric lack-of-fit tests using smoothing techniques have been proposed (cf. References [7,8,9,10,11,12,13,14]). Some of them are difficult to compute, in addition to requiring complicated conditions that are hard to justify. All of the aforementioned methods require the response variable to be continuous.
In this paper, we consider a nonparametric lack-of-fit test of constant regression in the presence of heteroscedastic variances. This test has better power for data from high frequency alternatives than the four tests reviewed above. In addition, our test can also be applied to discrete data. The test statistic is derived using the k-nearest neighbor augmentation defined through the ranks of the predictor. This idea was first proposed in Reference [15] for the analysis of covariance model, and further used in Reference [16] for a diagnostic test and in Reference [17] for a test of independence between a response variable and a covariate in the presence of treatments. A test statistic was defined in Reference [16] for a lack-of-fit test in the present regression setting. The authors considered each distinct covariate value as a factor level. Then, they augmented the observed data to construct what they called an artificial balanced one-way ANOVA (see Section 2.1 for further description of the augmentation). This way of constructing test statistics has great potential to gain power over smoothing-based methods. However, we found that the asymptotic variance estimator of the test statistic in Reference [16] seriously underestimates the true variance for intermediate sample sizes. As a consequence, regardless of the error distribution, their test has highly inflated type I error rates when k is small and becomes very conservative when k gets large.
In this paper, we present a very different asymptotic variance formula for the test statistic. In the special case of homoscedastic variance, our derived asymptotic variance contains one more term (a function of k) than that in Reference [16]. This explains the unstable behavior of the type I error pattern of their test. On the other hand, our test has consistent type I error rates across different sample sizes and different k values, and they are very close to the nominal alpha levels.
In Section 2, we state the hypotheses and define the test statistic as a difference of two quadratic forms, both of which estimate a common quantity but one under the null hypothesis and the other under the alternatives. Then, the asymptotic distribution of the test statistic is obtained under the null and the local alternatives for a fixed number of nearest neighbors. Moreover, we consider the idea of the Least Squares Cross-Validation (LSCV) procedure of Reference [18] to estimate the number of nearest neighbors. In Section 3, we present simulation studies with data generated having symmetric normal, light-tailed uniform, heavy-tailed T, and asymmetric heteroscedastic error distributions. The numerical results show that our test has encouragingly better performance in terms of type I error and power compared to the existing tests. In addition to the simulation comparisons, we present in Section 4 an application to gene expression data from patients undergoing radical prostatectomy and an application to assess COVID-19 model fit. A summary is given in Section 5. Technical proofs are provided in Appendix A.
2. Theoretical Results
2.1. The Hypotheses and Test Statistic
Let , , be an independent and identically distributed random sample. Let and denote the marginal probability density function and cumulative distribution function of , respectively. Denote Var and .
We wish to test the hypotheses:
This formulation works for both continuous and categorical response variable Y. For simplicity in the presentation, we assume that there are no duplicated observations for each value of covariate X. If there are duplicated observations, we can use the middle ranks to take care of this issue. In regression settings, the nonlinear conditional mean regression is often estimated through pooling observations from neighbors by one of the smoothing methods, such as loess, smoothing spline, kernel estimation, etc. For smoothing spline or kernel method, the number of observations in a window essentially needs to go to infinity as the sample size goes to infinity. The k-nearest neighbor approach is a popular method for classification, but the theory for a fixed k is very difficult for general regression. In this work, we use a fixed number of k-nearest neighbors in the data augmentation to help define a statistic for conducting a lack-of-fit test. This augmentation is done for each unique value of the predictor by generating a cell that contains k values of the response Y whose corresponding x values are among the k closest to in rank. We consider k to be an odd number for convenience so that the augmentation contains half of the (k− 1) values symmetrically on each side of when is an inner point. Let c denote an index defined by the covariate value , where and let denote the empirical distribution of X. We make the augmentation for each cell by selecting pairs of observations whose covariate values are among the k closest to in rank in addition to . Let denote the set of indices for the covariate values used in the augmented cell . Thus, for any pair to be selected in the augmentation of the cell , the difference between the ranks of and is no more than if is an interior point whose rank is between and , i.e., . For whose rank is less than or greater than , the difference between the ranks of and is no more than . 
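The rank-based augmentation described above can be sketched in code. This is an illustrative implementation, not the paper's own: the function name and array layout are mine, and the boundary convention (clamping the rank window so every cell keeps exactly k members) is an assumption consistent with the description of edge cells.

```python
import numpy as np

def augment_cells(x, y, k=3):
    """Rank-based k-nearest-neighbor augmentation (illustrative sketch).

    For each covariate value, form a cell containing the k response values
    whose x-ranks are closest to the rank of that covariate value.  An
    interior point takes (k - 1)/2 neighbors on each side of its rank; a
    boundary point takes the k nearest ranks available.
    """
    n = len(x)
    order = np.argsort(x)                 # indices sorted by covariate value
    y_sorted = np.asarray(y)[order]       # responses in x-rank order
    half = (k - 1) // 2                   # k is assumed odd
    cells = []
    for r in range(n):                    # r = 0-based rank of the cell center
        lo = min(max(r - half, 0), n - k) # clamp the window inside the data
        cells.append(y_sorted[lo:lo + k])
    return np.array(cells)                # n cells, each with k responses
```

For example, with five observations and k = 3, the cell centered at the smallest x holds the responses of the three smallest-ranked covariates, while an interior cell holds its own response plus one rank-neighbor on each side.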
This idea was first proposed in [15] and further used in References [16,17] for different problems. A test statistic was derived in Reference [16] for lack-of-fit testing in the present regression setting by considering each distinct covariate value as a factor level. Then, the observed data were augmented by considering a window around each that contains the nearest covariate values to construct what the authors called an artificial balanced one-way ANOVA. Similar augmentation was considered in Reference [17] when there is more than one treatment. Their results cannot be applied here since the asymptotic variance calculation is ill-defined when there is no treatment factor, as in our lack-of-fit setting.
Let , denote the augmented response values in cell under the null hypothesis. Define to be the indicator function that the difference between the ranks of and is no more than . Let and denote the average between-cell and within-cell variations defined as the following:
where , Note that and can be easily calculated since they resemble the mean squares statistics for an ANOVA model. The calculation is on the augmented data. In most cases in the literature, is used for constructing the test statistic when has fixed degrees of freedom. However, in our case, the degrees of freedom for is , which goes to infinity. Therefore, the statistic typically used in this case is (see Reference [19]), which involves showing that converges in distribution to normality and converges in probability to a constant. With augmented data, it is complicated to show that converges in probability. So, we define the following difference-based
as our test statistic instead of using -based one, where is a variance estimator for given later in (9). This test statistic is similar to that proposed in Reference [16], but with a different variance estimator.
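Since the between-cell and within-cell variations resemble the mean squares of a balanced one-way ANOVA computed on the augmented data, they can be sketched as below. This is a sketch under that reading only; the paper's exact statistics (with their rank-based weights) are elided from this extraction and may differ in normalization.

```python
import numpy as np

def mean_squares(cells):
    """Between-cell and within-cell average variations on augmented data.

    `cells` is an (n, k) array: n cells, each holding k augmented
    responses.  The formulas mirror the usual balanced one-way ANOVA
    mean squares; the paper's exact normalization is assumed, not quoted.
    """
    n, k = cells.shape
    cell_means = cells.mean(axis=1)       # one mean per augmented cell
    grand_mean = cells.mean()
    # between-cell variation: spread of cell means around the grand mean
    msb = k * np.sum((cell_means - grand_mean) ** 2) / (n - 1)
    # within-cell variation: spread of responses around their cell mean
    msw = np.sum((cells - cell_means[:, None]) ** 2) / (n * (k - 1))
    return msb, msw
```

Under the null of constant regression both quantities estimate the same variance, so their difference is the natural building block for the test statistic.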
To express and in terms of the original data, we can write
2.2. Asymptotic Distribution of the Test Statistic under the Null Hypothesis
Even though the test statistic is easy to calculate, the derivation of the asymptotic distribution is challenging since the augmented data in neighboring cells are correlated. In this subsection, we derive the asymptotic distribution of the test statistic using a different strategy than that proposed in Reference [16]. We first simplify it by finding its projection. Specifically, define
where . Then, we project onto the space:
of the form , where are constants, , and is some function that is possibly nonlinear. This projection will help us to split into two terms, one of which includes a summation over c and the other over c and for :
where and . Then, is in the space defined in (3) and where
and
Note that the term in (5) is closely related to the expected covariance between every pair of response values with correlation induced by their dependence on . The in (6) serves as a weight function which associates the response locally with the empirical distribution function of X. The term in (5) is more intuitive than for evaluating the lack-of-fit. However, cannot be calculated from the sample since is unknown. On the other hand, can be directly obtained from the sample.
We assume the following condition to obtain the result under the null hypothesis:
Assumption 1.
For all x, suppose that is differentiable, and the fourth conditional central moment of given is uniformly bounded.
The advantage of using a small or fixed k instead of a large k can be seen here. Even though is a quadratic form, only nearby cells have correlated observations due to the fixed number of nearest neighbors augmentation. On the other hand, when the number of nearest neighbors tends to infinity, the augmented data in many more cells will be correlated; therefore, might diverge, and the derivation of the asymptotic distribution will require unnecessarily strong conditions on the magnitude of the correlation. It is straightforward to show that with a small or fixed k. Hence, is asymptotically negligible. We state this result in Lemma 1 below.
Lemma 1
(Projection of ). Let be as defined in (4). If the Assumption 1 is satisfied, then
where the notation denotes convergence in probability.
To obtain the asymptotic distribution of the test statistic under the null hypothesis, we work with
where is defined in (6). We first give the large sample behavior of the variance of this term.
Theorem 1.
Under Assumption 1, exists and
where
and are the ranks of and among the covariate values .
To estimate the asymptotic variance, let be the rank of among all covariate values. Then, it is readily seen that a consistent estimator of under is
where is the sample variance based on the augmented observations for the cell determined by , i.e.,
Note that are bounded counts and (7) is a clean quadratic form as defined in Reference [20]. The Central Limit Theorem for clean quadratic forms (Proposition 3.2) in Reference [20] can be applied to obtain the following result. We omit the details of the proof.
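The cell-level sample variances that enter the plug-in variance estimator can be computed directly from the augmented cells. Only this building block is sketched here; how the cell variances are combined in (9) depends on rank-based weights that are elided from this extraction, so that combination is not reproduced.

```python
import numpy as np

def cell_variances(cells):
    """Sample variance of the k augmented responses in each cell.

    `cells` is an (n, k) array of augmented responses.  These cell-level
    variances are the building blocks of the plug-in variance estimator;
    their exact combination in (9) is not reproduced here.
    """
    # ddof=1 gives the usual unbiased sample variance within each cell
    return cells.var(axis=1, ddof=1)
```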
Theorem 2.
2.3. Results under Local or Fixed Alternatives
In this subsection, we consider the theoretical properties of the test under fixed or local alternatives in which the conditional expectation of Y given X is . Let be a univariate function of x.
Under a fixed alternative, can be expressed as
where is the conditional expectation of Y given X under the null hypothesis.
For local alternatives, consider the sequence of conditional expectations that approach in the order of :
Both alternatives are valid for either discrete or continuous response variable and allow the data to have different conditional variance under the alternative hypotheses from that under the null. For example, if has a Poisson distribution with mean under the alternative, then the variance is instead of .
Suppose , are observed data under either the fixed alternatives in (10) or the local alternatives in (11). Let be the augmented response values. Note that is equal to the observed response variable whose covariate value is one of the following:
Then, can be written as , where includes the conditional mean under the null hypothesis and departure from the null. Note that satisfies the null hypothesis and can be viewed as the augmented data for if are under the fixed alternative in (10) or for if are under the local alternatives in (11). In either case, the conditional mean given satisfies the null hypothesis but with equal to under the alternative hypotheses. For convenience, define to be the function evaluated at the covariate value for augmented observation . Let , , , , , and . Denote and to be the average between-cell variations and the average within-cell variations under the alternative hypotheses, respectively.
Under the local alternatives,
and
In this case, the numerator of the test statistic can be written as
where
Similarly, under the fixed alternatives,
where , are given in (13)–(17).
The following additional condition is needed for the result under the alternative hypotheses:
Assumption 2.
Suppose that has bounded support , and is locally Lipschitz continuous on : for each , there exists an such that is Lipschitz continuous on the neighborhood . Further, we assume that the fourth central moments of are uniformly bounded.
Before we give the asymptotic distribution of the test statistic under the alternatives, we state the following results which are valid under both the local and fixed alternative hypotheses.
Lemma 2.
Under Assumptions 1 and 2, as ,
where , , and are defined in (15), (16), and (17), respectively.
The proof of Lemma 2 is given in Appendix A. From this Lemma and Equations (12) and (18), we can see that and are the major terms that provide power under the alternative hypotheses. We state the results separately for fixed and local alternatives.
Theorem 3.
Note that in Theorem 1 and in Theorem 3 share the same formula, except that in needs to be calculated under the alternatives in (11). For example, if Y given X has a Bernoulli distribution, then the conditional variance of Y given X under the local alternatives in (11) is , which is different from that under the null hypothesis .
Theorem 4.
For the fixed alternative in (10), under Assumptions 1 and 2, the power of the test using statistic goes to one as .
The proofs of Theorems 3 and 4 are given in Appendix A.
In heteroscedastic regression, it is common in the literature to write with independent of . In this formulation, the entire error term is uncorrelated with . In the ideal case that there is no lack-of-fit, such a model is reasonable. However, when there is a lack-of-fit because a wrong regression function is specified, the error term still contains some systematic information of . Then, it is possible that the error resulting from the specified regression function is still correlated with .
2.4. Selection of the Number of Nearest Neighbors
The number of nearest neighbors k in the test statistic specifies the number of values augmented in each cell. Our theory requires it to be a small, finite odd integer. In simulations, we have found that the type I error remains close to the nominal level for different small k values and stays stable across a broad range of sample sizes and error distributions. Under the alternative hypothesis, different k may lead to different power for our test statistic. This section discusses how to select the parameter k.
Under the alternative hypothesis, our k-nearest neighbor augmentation is parallel to regression using a local constant based on k-nearest neighbors. For a continuous response variable, Hardle et al. [18] suggested the Least Squares Cross-Validation (LSCV) method for smoothing parameter (bandwidth) selection in kernel regression estimation. Chen et al. [5] recommended using the one-sided cross-validation procedure of Reference [21] to select smoothing parameter (bandwidth) for hypothesis testing. The number of nearest neighbors k in our setting has a similar role as the smoothing parameter in kernel regression.
For a categorical response variable, Holmes et al. [22] proposed an approach to select the parameter k in the k-nearest neighbor (KNN) classification algorithm using likelihood-based inference. Choosing k in this method can be considered as a generalized linear model variable-selection problem. In particular, for multinomial data , , where denotes the class label of the ith observation, and is a vector of p predictor variables, they considered the probability model
where denotes the data with the ith observation deleted, is a single regression parameter, and is the difference between the proportion of observations in class and that in class within the k-nearest neighbors of , i.e.,
where the notation denotes that the summation is over the k-nearest neighbors of in the set , and the neighbors are defined based on the Euclidean distance. The prediction for a new point is given by the most common class in the k-nearest neighbors of . Afterwards, the value that maximizes the profile pseudolikelihood is chosen to estimate the parameter k. However, this method is only valid when the response variable is categorical and the nearest neighbors are defined using the Euclidean distance.
In our case, the response variable can be continuous or categorical, and our nearest neighbors are defined through ranks. So, we do not recommend using our test statistic with an estimate of k obtained by the aforementioned procedures. We consider an alternative method to estimate k which uses ranks to define nearest neighbors and can be applied in both the categorical and continuous response cases. Here, we adopt the idea of the Least Squares Cross-Validation (LSCV) procedure of Reference [18] to select the parameter k. Different from Reference [18], where the regression function is estimated using kernel estimation, we consider k-nearest neighbor estimates with the neighbors defined through the ranks of the predictor variable. In the case of a categorical response variable with Q classes, we re-code the response variable to have integer values from 1 to Q. To estimate the class for the response variable, we use the majority vote (the most common value) from the k-nearest neighbors. For tied situations where multiple classes achieve the same highest frequency, one of them is randomly assigned to be the estimated response. In the case of a continuous response variable, the regression function is estimated by the average of the k-nearest neighbors.
In a leave-one-out procedure, for each , we eliminate and use the rest of the observations to estimate the regression function which then is used to predict the response value Y at . Here are our steps:
- 1.
- Find the observation in such that the absolute difference between this observation and is minimized. Denote . Then, is the closest to .
- 2.
- Find the k-nearest neighbors of in terms of ranks. We use the corresponding values such that to obtain the leave-one-out estimate of the regression function at . That is, where the Mode is defined as the most frequently observed value in a set of numbers. In the case where the most frequently observed values are not unique, one of them is randomly selected.
- 3.
- Repeat steps 1 and 2 for to obtain all leave-one-out estimates.
Then, define the leave-one-out Least Squares Cross-Validation error as
Finally, the number of nearest neighbors is estimated by
where the set consists of small odd integers.
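The three-step leave-one-out procedure and the LSCV criterion above can be sketched as follows. This is an illustrative implementation: the function name is mine, the boundary convention for the rank window is an assumption, and the candidate set K is the small-odd-integer set discussed in the text.

```python
import numpy as np
from collections import Counter

def lscv_select_k(x, y, K=(3, 5, 7, 9), categorical=False, seed=0):
    """Leave-one-out LSCV choice of the number of nearest neighbors.

    Neighbors are defined through the ranks of the predictor; K holds
    small odd integers, as the theory requires.  For a categorical
    response the leave-one-out estimate is the mode of the k neighbors
    (ties broken at random); for a continuous response, their average.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    errors = {}
    for k in K:
        sse = 0.0
        for i in range(n):
            xr, yr = np.delete(x, i), np.delete(y, i)   # leave (x_i, y_i) out
            order = np.argsort(xr)
            # step 1: rank position of the remaining value closest to x_i
            p = int(np.argmin(np.abs(xr[order] - x[i])))
            # step 2: k nearest in rank, window clamped at the boundaries
            lo = min(max(p - (k - 1) // 2, 0), (n - 1) - k)
            neigh = yr[order][lo:lo + k]
            if categorical:
                counts = Counter(neigh).most_common()
                best = [v for v, c in counts if c == counts[0][1]]
                pred = rng.choice(best)                  # random tie-break
            else:
                pred = neigh.mean()
            sse += (y[i] - pred) ** 2
        errors[k] = sse / n           # step 3: LSCV error for this k
    return min(errors, key=errors.get), errors
```

A run over simulated data returns the candidate k with the smallest leave-one-out least squares error along with the full error profile.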
When the response variable is categorical, the estimate of k from this algorithm depends on how well the covariate values from different classes are separated and how many observations are in each class. For large class sizes, it is very possible that the resulting estimate is much greater than 10 if we leave unconstrained. However, our theory requires k to be a finite, positive, and odd integer.
In the continuous case with k-nearest neighbor estimation, the average of a large proportion of Y values is used to approximate the response variable if a large k value is utilized. As a consequence, a bigger k tends to give a larger least squares error when the regression function is under the alternative hypothesis. This is especially true when the regression function has substantial curvature, such as under high frequency alternatives. On the other hand, a larger k tends to give a smaller least squares error when the data were generated under the constant regression null hypothesis.
In either case, the smallest value for k is 3 (note: corresponds to the case of no data augmentation). To keep the least squares error minimized under the alternative hypothesis and reasonable under the null hypothesis, we recommend letting contain a few small integer values. For example, , which is a safe choice for both moderate and large sample sizes.
Figure 1 shows the typical pattern of as a function of k for when the response variable was generated as (1) ; (2) ; (3) ; and (4) , where and are i.i.d .
Figure 1.
Typical patterns of versus k in continuous data.
3. Monte Carlo Simulation Studies
In this section, we present the results of simulation studies investigating the type I error and power performance of our test. The test has a parameter k specifying the number of nearest neighbors for data augmentation. The inference for our test requires k to be a small odd positive integer. We report the results for and 5, denoted as and , respectively, so that the user has an idea of how the test behaves with a given k. Furthermore, we report the results of our test with k selected from 3 and 5 using the method in Section 2.4, denoted as . For applied to each generated data set, the value of k is selected using in (20), and our test with parameter is used to obtain the p-value.
For comparison, we also report the corresponding results for the test of Reference [16], the order selection (OS) test of Reference [1], the rank-based test (ROS) of Reference [2], the bootstrap order selection test (BOS) of Reference [5], and the Bayes sum test of Reference [3]. As argued in Section 7.1 of Reference [4], evenly spaced design points should be used for calculation of these four test statistics even when they are unevenly spaced. So, the generated covariate values in increasing order were replaced by evenly spaced design points on for all four tests. For BOS, we apply the wild bootstrap algorithm of Reference [5] based on the residuals and use their test statistic with 1000 bootstrap samples for each replication. For the Bayes sum test, we use the statistic that has been reported to have good power from a comprehensive simulation study in Reference [3]. For approximating the p-values of the Bayes sum test, Hart [3] gave two versions of the approximation, one assuming normality (BN) and one using the bootstrap (BB). For BN, a random sample of the same sample size as the data was generated from the standard normal distribution, and the Bayes sum test statistic was calculated from the data so generated, regardless of the actual distribution of the response variable. The process was independently repeated 10,000 times, and the p-value was obtained based on the empirical distribution of these 10,000 values. For BB, the bootstrap samples were drawn from the empirical distribution of the residuals , , rather than the normal distribution, and the p-value approximation was carried out similarly. The scale parameter for a given data set in both BB and BN statistics was estimated by , as was suggested in Reference [3]. It was reported in Reference [3] that the results obtained using the normality assumption were in basic agreement with those obtained using the bootstrap. So, we only report the simulation results for BN.
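The BN approximation just described can be sketched generically: the null reference distribution of a statistic is built from repeated standard normal samples of the same size, regardless of the actual response distribution. Here `stat_fn` is a placeholder for whichever statistic is being calibrated, not a function from the paper, and an upper-tail rejection region is assumed.

```python
import numpy as np

def normal_approx_pvalue(stat_fn, y_obs, n_rep=10_000, seed=0):
    """Monte Carlo p-value under the normality approximation (BN).

    `stat_fn` maps a response sample to a scalar test statistic.  The
    reference distribution is formed from standard normal samples of the
    same size as the data, mirroring the BN scheme described in the text
    (illustrative sketch; `stat_fn` is a placeholder).
    """
    rng = np.random.default_rng(seed)
    t_obs = stat_fn(y_obs)
    n = len(y_obs)
    # empirical null distribution from normally generated pseudo-samples
    t_null = np.array([stat_fn(rng.standard_normal(n)) for _ in range(n_rep)])
    return float(np.mean(t_null >= t_obs))   # upper-tail empirical p-value
```

The BB version would instead resample `stat_fn` inputs from the empirical distribution of the null residuals; only the sampling line changes.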
The values for the covariate X were independently generated from Uniform. First, we consider the performance of different tests under the . The data were generated from
where the error terms were independently generated from one of the four error distributions:
- 1.
- 2.
- 3.
- 4.

Table 1. Percent of rejection at the 0.05 level under for data generated from Model . OS: order selection test of Reference [1]; BN: Bayes sum test of Reference [3] with normal approximation; BOS: order selection test with wild bootstrap of Reference [5]; ROS: rank-based test of Reference [2]; WAI3, WAI5, WAI7: test of Reference [16] with , 5, and 7, respectively; : the proposed test with , 5, 7, and from (20). Highly inflated empirical type I errors are marked in red.

Table 2. Percent of rejection under high frequency alternatives to with and sample size . The legends of the tests are the same as in Table 1. WAI is not included since it could not keep its type I error under control. Note: BN also has a highly elevated type I error in the heteroscedastic case.
The empirical type I error rates (in percentage) under are reported in Table 1. It can be seen that the test of Reference [16] with or 5 generally has inflated type I error, which is particularly serious for smaller sample sizes. For , their test has type I error close to 0.05 when the error distribution is Normal or T(5)/30 but is about twice the significance level in the heteroscedastic case. Its performance for is better than with the other values but is still inflated in the heteroscedastic case. The order selection test of Reference [1] and the proposed test with different k-values have better type I error control. Among the three k-values, a larger k pulls more observations around each covariate value as pseudo replicates, which could make the test less sensitive to curvature departures from the null hypothesis. Hence, we recommend choosing k between 3 and 5.
Next, we consider the performance of the tests with data generated from nonlinear models. The response values were independently generated according to the following four models for , with the moderate sample size of in all cases:
- Model : ,
- Model : ,
- Model : , and
- Model : ,
where q in Models – represents the frequency. We considered and . The case with is a higher frequency alternative compared to those reported in Reference [3]. The data for the error term in each model were independently generated with one of the four error distributions listed earlier. Model serves as the null model to obtain the type I error rates for all tests. For each error distribution, the data were generated from Models – with sample size , 2000 times, and the rejection rates (in percentage) at significance level 0.05 are reported in Table 2.
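The simulation setup can be illustrated with a stand-in model. The exact regression functions of Models – are not reproduced in this extraction, so the mean function m(x) = C·cos(2πqx), the amplitude C, and the standard normal errors below are assumptions for illustration only; q plays the role of the frequency, and the other error distributions would be swapped in at the marked line.

```python
import numpy as np

def generate_sample(n=50, q=6, C=1.0, seed=0):
    """Generate one sample from a hypothetical high-frequency alternative.

    m(x) = C * cos(2*pi*q*x) is a stand-in mean function (assumption, not
    the paper's model): q controls the frequency of the alternative and C
    the signal-to-noise ratio.  Covariates are Uniform(0, 1).
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=n)
    eps = rng.standard_normal(n)   # swap in uniform / T / heteroscedastic errors
    y = C * np.cos(2 * np.pi * q * x) + eps
    return x, y
```

Setting C = 0 recovers the constant-regression null, so the same generator serves for both type I error and power estimates.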
It can be seen that the type I error estimates for all tests were below or close to the nominal level 0.05 for all models with homoscedastic errors. For the heteroscedastic regression model, the variance of the error depends on the covariate, while the conditional mean of the response variable given the covariate is a constant under Model . In this case, all the tests tend to be liberal.
The rows to in Table 2 show the power comparison for the different combinations between Models – and the four types of the error distribution. The powers of our test with () are higher than all other tests in all cases. BN has power close to our test. OS, ROS, and BOS fall far behind. The low power performance of BOS in the case of high frequency alternatives was mentioned in Reference [5], and they suggested (without details) to use smoothing squared residuals to deal with that.
It is noticeable that the power of our test is 1 for Models and for all different types of the error distribution and very close to 1 for Models and . In addition, the power for OS was slightly higher than that for ROS in all cases.
Models and are similar, except that Model has a lower signal-to-noise ratio than Model . With the lower signal-to-noise ratio, the power for ROS, OS, and BOS drops drastically. To take a closer look at the numerical performance of all tests under local alternatives, we considered the model , with and Uniform. The empirical power curves are given in Figure 2. It is obvious that our test has consistently higher power than the other tests.
Figure 2.
Empirical power of the tests for data generated from with sample size and Uniform. Due to small values of C, the signal to noise ratio is low. WAI is not included since it could not keep its type I error under control for this sample size and the uniform error distribution.
The discussion above concerns high frequency alternatives with and moderate sample size . When the sample size increases while the frequency stays the same, the power of each test also increases. For a sample size of 100, the empirical power is 1 for all of the compared tests BN, OS, ROS, and under Models –. Under Model , OS and ROS have power slightly below 1 in the case of uniform errors; the remaining tests have power close to 1. Similarly, for lower frequency alternatives, for example, when and , all these tests have power close to 1.
To examine how the power of these tests changes with the sample size, we generated data with model , where , Uniform, for . The empirical power of these tests is presented in Figure 3, where is our test with selected from and 5 based on (20). It is obvious that the proposed test consistently has the highest power over all the sample sizes considered.
Figure 3.
Empirical power for different sample sizes. The data were generated from , where Uniform. As the sample size N increases, the power of all tests increases; the power of approaches 1 much faster than that of the other tests. WAI is not included since it could not keep its type I error under control for these sample sizes and the uniform error distribution.
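The qualitative behavior in Figure 3 — power rising to 1 with the sample size — can be reproduced with a small Monte Carlo sketch. The statistic below is a simple von Neumann ratio comparing successive differences of the responses (ordered by the covariate) to the overall variance; it is a stand-in for illustration only, not the test statistic of this paper, and the cosine alternative and all constants are assumptions.

```python
import numpy as np

def vn_test(x, y, z_crit=-1.645):
    """Toy lack-of-fit stand-in: order y by x and compare lag-1 squared
    differences to the overall variance (a von Neumann ratio). Under
    constant regression the ratio is near 1; a smooth trend shrinks it."""
    ys = y[np.argsort(x)]
    n = len(ys)
    vn = np.sum(np.diff(ys) ** 2) / (2.0 * np.sum((ys - ys.mean()) ** 2))
    return (vn - 1.0) * np.sqrt(n) < z_crit  # one-sided, roughly 5% level

def empirical_power(n, c, q, n_rep=300, seed=0):
    """Monte Carlo rejection rate for y = c*cos(2*pi*q*x) + e,
    x ~ U(0,1), e ~ U(-0.5, 0.5); c = 0 gives the type I error rate."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_rep):
        x = rng.uniform(0.0, 1.0, n)
        y = c * np.cos(2.0 * np.pi * q * x) + rng.uniform(-0.5, 0.5, n)
        hits += vn_test(x, y)
    return hits / n_rep
```

As expected for a consistent test, the rejection rate stays near the nominal level when c = 0 and climbs toward 1 as c or n grows.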
Even though BN showed performance comparable to our test in many cases, its running time is much longer than that of . In particular, the average running time over 10,000 runs on BEOCAT cluster machines for is 0.03 s, while that for BN is 9.7 s. So, is more than 300 times faster than BN.
4. Applications to Real Data
4.1. Application to Gene Expression Data from Patients Undergoing Radical Prostatectomy
In this subsection, we present an application of our test to gene expression data from patients undergoing radical prostatectomy, used to predict the behavior of prostate cancer. This data set was collected between 1995 and 1997 at the Brigham and Women’s Hospital from 52 tumor and 50 normal prostate samples using oligonucleotide microarrays containing probes for 12,600 genes and expressed sequence tags (the data are available at https://www.ncbi.nlm.nih.gov/gds/?linkname=pubmed_gds&from_uid=12086878, accessed on 11 July 2021). The data show heterogeneity and have a binary response variable, the patient outcome (tumor or normal). Applying our test to the expression data from each gene, we identified 980 genes that are significantly associated with the response variable after Bonferroni correction (dividing the significance level by 12,600). In comparison, Singh et al. [23] used a permutation test to identify important genes and found 456 genes whose expression values are significantly correlated with patient outcome. Note that the significance declared in Reference [23] is at the 0.001 level without any multiple comparison adjustment. Ours is at the same significance level but with Bonferroni control, a very conservative multiple comparison adjustment. Even with such conservative control, we identified more than twice as many genes as Reference [23]. It is worth mentioning that our test was developed under very general assumptions that are expected to hold for the microarray data here. These results suggest that our test is much more powerful than the permutation test of Reference [23]. Furthermore, we performed k-nearest neighbor (KNN) classification on the data for the top i genes (the i genes with the smallest p-values, ) to predict the patient outcomes, with leave-one-out cross validation (LOOCV) as the validation method.
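Before turning to classification, note that the Bonferroni step above is simple to state in code: with m simultaneous tests, each raw p-value is compared to the significance level divided by m. A minimal sketch (the p-values are made up for illustration):

```python
def bonferroni_significant(pvals, alpha=0.001):
    """Indices of hypotheses rejected under Bonferroni control:
    compare each raw p-value to alpha / m, where m = len(pvals)."""
    cutoff = alpha / len(pvals)
    return [i for i, p in enumerate(pvals) if p < cutoff]

# With four hypothetical genes at level 0.001, the cutoff is 0.00025
print(bonferroni_significant([1e-9, 0.5, 2e-7, 0.01]))  # → [0, 2]
```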
The parameter k in KNN was estimated with the training part of the data in the LOOCV procedure by the profile pseudolikelihood method of Reference [22]. The leave-one-out accuracy curve for increasing numbers of selected top genes is shown in Figure 4. We note that these genes were selected individually; this simple application of our test is not meant to find the combination of genes with the best classification accuracy. Even so, the top genes found with our test give good LOOCV accuracy.
Figure 4.
The leave-one-out accuracy curve with increasing number of selected genes.
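The LOOCV accuracy computation behind Figure 4 can be sketched as follows. This is a generic majority-vote k-NN classifier with Euclidean distance — it does not implement the profile pseudolikelihood choice of k from Reference [22] — and the two-class synthetic data are assumptions for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    d = np.linalg.norm(X_train - x, axis=1)
    return np.bincount(y_train[np.argsort(d)[:k]]).argmax()

def loocv_accuracy(X, y, k):
    """Leave-one-out cross-validated accuracy of a k-NN classifier:
    each point is predicted from all the others."""
    n = len(y)
    idx = np.arange(n)
    correct = sum(
        knn_predict(X[idx != i], y[idx != i], X[i], k) == y[i]
        for i in range(n)
    )
    return correct / n

# Two well-separated synthetic classes, 40 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(3.0, 1.0, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
acc = loocv_accuracy(X, y, k=5)
```

With classes this well separated, the leave-one-out accuracy is close to 1; in the gene application, the curve is traced out by repeating this computation as top-ranked genes are added.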
4.2. Application to Assess Richards Growth Curve Fit for the COVID Cases and Deaths
In this application, we would like to assess if the popular Richards growth model can fit the COVID-19 cases or deaths for the U.S.
The Richards growth curve model has recently been adapted for real-time prediction of disease outbreaks in epidemiology. Below is the form of the Richards curve that was used to fit the COVID-19 outbreak in Reference [24]:
where , , and are real numbers, and is a positive real number.
Using this parameterization of the Richards curve, Lee et al. [24] stated that, given a progression constant with , the flat time point is given by
They predicted the posterior means of the flat time points for the U.S. to be 30 May, 16 July, 30 August, and 15 October when the corresponding ’s are chosen as 0.9, 0.99, 0.999, and 0.9999, respectively. However, as of the end of 2020, the number of confirmed COVID-19 cases was still climbing. This is evidence that the Richards curve does not fit the COVID-19 infection growth well.
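To make the flat time point concrete, the sketch below uses one common Richards parameterization, f(t) = a(1 + ξ exp(−k(t − τ)))^(−1/ξ); this form and the closed-form flat time solving f(t) = γa are assumptions for illustration and may differ in detail from the parameterization of Reference [24].

```python
import math

def richards(t, a, xi, k, tau):
    """Assumed Richards curve: f(t) = a*(1 + xi*exp(-k*(t - tau)))**(-1/xi)."""
    return a * (1.0 + xi * math.exp(-k * (t - tau))) ** (-1.0 / xi)

def flat_time_point(gamma, a, xi, k, tau):
    """Time at which the assumed curve reaches the fraction gamma of its
    final size a (0 < gamma < 1), i.e., the solution of f(t) = gamma*a.
    Solving gives t = tau - log((gamma**(-xi) - 1)/xi) / k."""
    return tau - math.log((gamma ** (-xi) - 1.0) / xi) / k
```

Larger γ pushes the flat time point later, mirroring the ordering of the four predicted dates quoted above.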
We downloaded the cumulative number of confirmed cases of U.S. COVID-19 historical data from https://covidtracking.com/data/download/national-history.csv on 12 July 2021. This website stopped tracking COVID in late March 2021. We also downloaded the daily number of death counts for U.S. COVID-19 data from the CSSE data repository of Johns Hopkins University at https://github.com/CSSEGISandData/COVID-19 on the same day. The data contains death counts up to 11 July 2021.
To fit a Richards curve to each data set, we only included days on which the cumulative count was at least seven and removed the last ten days of data, so that those ten days’ counts could be used as an additional out-of-sample assessment of the model fit. A separate Richards curve was fitted to the daily cumulative confirmed cases and to the deaths. Figure 5 shows the nonlinear least squares fitted Richards growth curves and the observed counts; the fitted curves also include predictions for the ten days beyond the observed days. Table 3 gives the correlation coefficients between the observed and estimated counts. Even though the correlations between the fitted curves and the observed numbers are over 99%, the p-values of the lack-of-fit tests are both essentially 0. The Richards curve appears better at predicting the number of confirmed cases than the number of deaths. The pandemic progression is evidently more complicated than the Richards curve allows, due to changes in social distancing, stay-at-home policies, and vaccine availability. U.S. holidays, such as Thanksgiving and Christmas, also contributed drastically to the increase in counts in late November. The parameter predicts the epidemic size. For the total number of deaths, the model predicted an epidemic size of 594,331; for the confirmed cases, the predicted size is around 27.9 million. Both numbers underestimate the true epidemic size. As of 12 July 2021, U.S. confirmed cases exceeded 33.8 million, and the number of U.S. deaths was 606,000. These numbers are much larger than the model predicted sizes, suggesting that the Richards curve cannot model the confirmed cases or deaths adequately.
Figure 5.
Observed U.S. COVID-19 mortality and confirmed cases with the Richards growth curve estimates.
Table 3.
Parameters of the Richards growth models, between the observed and model estimated counts, and the p-values of the lack-of-fit tests.
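A nonlinear least squares fit of this kind can be sketched with `scipy.optimize.curve_fit`. The parameterization, the synthetic "cumulative counts", the starting values, and the bounds below are all assumptions for illustration, not the fit reported in Table 3; the last ten points are held out, as in the text.

```python
import numpy as np
from scipy.optimize import curve_fit

def richards(t, a, xi, k, tau):
    # Assumed Richards parameterization (see the discussion in the text)
    return a * (1.0 + xi * np.exp(-k * (t - tau))) ** (-1.0 / xi)

# Synthetic cumulative counts from a known curve plus noise
rng = np.random.default_rng(1)
t = np.arange(0.0, 120.0)
y = richards(t, 5e4, 1.0, 0.08, 60.0) + rng.normal(0.0, 200.0, t.size)

# Hold out the last ten days as an out-of-sample check
t_fit, y_fit = t[:-10], y[:-10]
popt, _ = curve_fit(
    richards, t_fit, y_fit,
    p0=[4e4, 1.0, 0.1, 50.0],
    bounds=([1e3, 0.01, 0.001, 0.0], [1e6, 10.0, 1.0, 200.0]),
)
r = np.corrcoef(y, richards(t, *popt))[0, 1]  # correlation, fitted vs observed
```

A correlation near 1 between fitted and observed counts does not by itself establish an adequate fit — which is exactly why the formal lack-of-fit test is informative here.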
5. Conclusions
In this paper, we derived the asymptotic distribution of a nonparametric lack-of-fit test of constant regression in the presence of heteroscedastic variances. We considered a test statistic obtained by augmenting a small number of k-nearest neighbors defined through the ranks of the predictor variable. The test statistic is the studentized difference of two quadratic forms. Both quadratic forms estimate a common quantity under the null hypothesis, but they converge to different quantities under any alternative. The asymptotic distribution of the difference was also given in Reference [16], but with a biased asymptotic variance. We derived the correct form of the asymptotic distribution of the test statistic under both the null hypothesis and local alternatives. In addition, we provided a procedure to choose the parameter k based on the least squares cross-validation idea used in k-nearest neighbor regression. Our test has several advantages. It provides a unified framework for testing lack-of-fit of a given regression function when the response is either a discrete or a continuous random variable and the covariate is continuous, which makes it convenient for unified inference and applications. No distributional assumptions on the data are needed, which makes the test widely applicable to practical data. The fixed number of nearest neighbors used in the augmentation ensures good power to detect both low and high frequency alternatives even for moderate sample sizes, and the test statistic achieves the parametric standardizing rate. The test statistic is also easy and fast to calculate. Our simulation studies show that the test is more powerful than some well known competing procedures when data are generated under high frequency alternatives. The results in this paper therefore offer a useful tool for lack-of-fit testing.
Author Contributions
Conceptualization, M.M.G. and H.W.; methodology, M.M.G. and H.W.; software, M.S. and H.W.; validation, M.M.G., M.S., H.W., and S.W.; formal analysis, M.S. and H.W.; investigation, M.M.G. and H.W.; data curation, H.W.; writing—original draft preparation, M.M.G. and H.W.; writing—review and editing, H.W. and S.W.; visualization, H.W.; supervision, H.W. and S.W.; project administration, H.W.; funding acquisition, H.W. and S.W. All authors have read and agreed to the published version of the manuscript.
Funding
This work was partially supported by two grants #246077 and #499650 from the Simons Foundation.
Institutional Review Board Statement
This study used publicly available data; no institutional review was needed.
Informed Consent Statement
Not applicable.
Data Availability Statement
The COVID-19 data used in the study are available at https://covidtracking.com/data/download/national-history.csv and https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv. The data analyzed in this article were downloaded on 12 July 2021.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Appendix A.1. Proof of Theorem 1
We can write . It is clear that since by the definition of in (5),
Therefore, we only need to consider to obtain . Let . Then,
Let be the order statistic for so that is the rank of among . Then,
To find the conditional expectation, without loss of generality, assume that , so that . Let
where the upper spacing, and the lower spacing, from . Applying Taylor’s expansion twice, we can write
From the properties of spacings in Reference [25], we have
Therefore, for and , we have
if (or symmetrically ), then
Collecting terms from (A2) and (A3), we have
Now, consider the conditional variance. Note that, when , the term in is a constant. Therefore,
where the last equality is due to the fact that the indicator functions involving and are conditionally independent when and neither , is or . Plugging (A2) through (A4) into the right-hand side of the equation above, we obtain
Putting (A4) and (A5) into (A1), we have
Next, we will show that the limit of exists. Note that , where
It is clear that and are both at least 1. Therefore, is nonnegative. Consequently, is a summation of nonnegative terms.
Under Assumption 1, the conditional variance of given is uniformly bounded (i.e., there exists a constant such that for all j). We have
If we replace the summation in (A6) over the original sample index by the summation over the ranks and denoting
then we have
The right-hand side of the inequality (A7) converges to
which is finite for finite C and fixed (note that, in our augmentation, k is a finite odd integer with minimum value of 3). Note that is the summation of nonnegative terms (with probability 1) due to the fact that . Hence, the limit of exists as a result of the Comparison Test in calculus.
The convergence of is due to the Dominated Convergence Theorem after noticing that the expectation of (A8) is finite. Applying the Dominated Convergence Theorem to , we get . This completes the proof.
Appendix A.2. Lemma A1 and Its Proof
The following lemma will be needed in the proof of Lemma 2.
Lemma A1.
For locally Lipschitz continuous function on a bounded support , we have
uniformly in , for a given .
Sketch Proof of Lemma A1.
Recall that and are the marginal probability density function and cumulative distribution function of , respectively. Let be independent Exponential random variables with mean 1, and be independent Uniform random variables on . Without loss of generality, assume that are ordered. Define , for . Then, from the properties of spacings on page 406 of Reference [25], there exists an such that and for . For ,
where is some positive constant.
Note that the random variables and are independent, , and has distribution. Therefore,
and
where the inequality in (A10) is due to the fact that . Due to (A9) and (A10) and by Theorem 14.4.1 in Reference [26], we have
Consequently, for in the same cell,
where the last equality in (A11) is due to since are included in the same cell.
Appendix A.3. Proof of Lemma 2
The proof of each part is given separately.
Sketch Proof of Lemma 2 Part (i).
From (15), we have
By Lemma A1 and Assumption 2,
where . Therefore, can be written as
Denote and . Then, we can write
First, we will show that
therefore, the first term in (A15) is . Note that , and is a function of . Therefore, we have
and
Denote the first term and second term in (A18) as and , respectively. Let , for all i. Then,
where the equality in (A19) is due to the fact that and are independent when . Similarly,
Consider individual summands in (A20) and (A21). By the Cauchy-Schwarz inequality and Assumptions 1 and 2,
Similarly,
Note that can only be used to augment at most cells. That is, if the rank of is r, then cannot be used to augment cells whose x values have ranks not in the set of positive integers . Therefore, the summation over c in (A20) and that over c and in (A21) each contain no more than terms. As a result, the two terms and are ; therefore,
Due to (A17) and (A22), the proof of (A16) is completed by applying Theorem 14.4-1 in Reference [26].
Sketch Proof of Lemma 2 Part (ii).
From (16), we have
By Lemma A1, we have . Thus,
therefore, is . This completes the proof. □
Sketch Proof of Lemma 2 Part (iii).
From (17), we have
By Hölder’s inequality,
Next, we show that
We can write
Note that
where the last equality in (A28) is due to the fact that is uniformly bounded by Assumption 1, and the summation over i in (A28) contains only k terms. Denote , for all i. Then,
where the equality in (A30) is due to the fact that and are uniformly bounded by Assumption 1, and the summation over c in (A29) and that over c and in (A30) each contain no more than terms.
From (A28) and (A30), we have
Due to (A28) and (A31) and by Theorem 14.4-1 in Reference [26], we have
Similarly, it can be shown that the second term in (A27) is ; therefore, the proof of (A26) is completed.
From (A24)–(A26),
This completes the proof. □
Appendix A.4. Sketch Proof of Theorem 3
The proof of the existence of is similar to that for in Theorem 1. Now, we show that
From (12), we have
where through are defined in (13)–(17). The and are the average between-cell and within-cell variations for augmented observations with as the response. Note that the conditional mean of given satisfies the null hypothesis. But is equal to . Theorem 2 implies that
By Lemma 2, we have
Thus, we only need to consider to obtain the asymptotic mean under the alternatives.
Note that are i.i.d. since are i.i.d. From (A13) and (A14), we can write in (14) as
where is the sample variance of . By the Weak Law of Large Numbers,
as and k stays fixed.
From (A35)–(A37), we have
From (A33), (A34), and (A38) and by applying Slutsky’s Theorem, we have
Theorem 3 then follows immediately since it is readily seen that under the local alternative (11).
Appendix A.5. Sketch Proof of Theorem 4
From the proof of Theorem 3, we know (see (A34)), where is defined similarly as in Theorem 1 but with calculated under the fixed alternative hypothesis.
We also know (see (A37)), where is given in (19). Hence, .
Compared to , the remaining three terms in (18) involving , , or are all negligible. This is because Lemma 2 implies that
Putting all terms together as in Equation (18) and applying Slutsky’s Theorem, we know that
Since and the test rejects the null hypothesis at significance level if , the power of the test is approximately
due to (A39). Essentially, the asymptotic mean of the test statistic diverges to infinity; hence, the power goes to one.
References
- Eubank, R.L.; Hart, J.D. Testing goodness-of-fit in regression via order selection criteria. Ann. Statist. 1992, 20, 1412–1425. [Google Scholar] [CrossRef]
- Hart, J. Smoothing-inspired lack-of-fit tests based on ranks. In Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen; Institute of Mathematical Statistics: Beachwood, OH, USA, 2008; Volume 1, pp. 138–155. [Google Scholar]
- Hart, J. Frequentist-Bayes lack-of-fit tests based on Laplace approximations. J. Stat. Theory Pract. 2009, 3, 681–704. [Google Scholar] [CrossRef]
- Hart, J. Nonparametric Smoothing and Lack-of-Fit Test; Springer: New York, NY, USA, 1997. [Google Scholar]
- Chen, C.-F.; Hart, J.D.; Wang, S. Bootstrapping the order selection test. J. Nonparametr. Stat. 2001, 13, 851–882. [Google Scholar] [CrossRef]
- Kuchibhatla, M.; Hart, J.D. Smoothing-based lack-of-fit tests: Variations on a theme. J. Nonparametr. Stat. 1996, 7, 1–22. [Google Scholar] [CrossRef]
- Lee, B.J. A Nonparametric Model Specification Test Using a Kernel Regression Method. Ph.D. Thesis, University of Wisconsin, Madison, WI, USA, 1988. [Google Scholar]
- Yatchew, A.J. Nonparametric regression tests based on least square. Econom. Theory 1992, 8, 435–451. [Google Scholar] [CrossRef]
- Eubank, R.L.; Spiegelman, C.H. Testing the goodness of fit of a linear model via nonparametric regression techniques. J. Am. Stat. Assoc. 1990, 85, 387–392. [Google Scholar] [CrossRef]
- Hardle, W.; Mammen, E. Comparing nonparametric versus parametric regression fits. Ann. Statist. 1993, 21, 1926–1947. [Google Scholar] [CrossRef]
- Zheng, J.X. A consistent test of functional form via nonparametric estimation techniques. J. Econom. 1996, 75, 263–289. [Google Scholar] [CrossRef]
- Horowitz, J.Z.; Spokoiny, V.G. An adaptive, rate-optimal test of a parametric mean-regression model against a nonparametric alternative. Econometrica 2001, 69, 599–631. [Google Scholar] [CrossRef]
- Guerre, E.; Lavergne, P. Data-driven rate-optimal specification testing in regression models. Ann. Statist. 2005, 33, 840–870. [Google Scholar] [CrossRef] [Green Version]
- Song, W.; Du, J. A note on testing the regression functions via nonparametric smoothing. Can. J. Stat. 2011, 39, 108–125. [Google Scholar] [CrossRef]
- Wang, L.; Akritas, M. Testing for covariate effects in the fully nonparametric analysis of covariance model. J. Am. Stat. Assoc. 2006, 101, 722–736. [Google Scholar] [CrossRef]
- Wang, L.; Akritas, M.G.; Van Keilegom, I. An ANOVA-type nonparametric diagnostic test for heteroscedastic regression models. J. Nonparametr. Stat. 2008, 20, 365–382. [Google Scholar] [CrossRef]
- Wang, H.; Tolos, S.; Wang, S. A distribution free test to detect general dependence between a response variable and a covariate in the presence of heteroscedastic treatment effects. Can. J. Stat. 2010, 38, 408–433. [Google Scholar] [CrossRef]
- Hardle, W.; Hall, P.; Marron, J.S. How far are automatically chosen regression smoothing parameters from their optimum? J. Am. Stat. Assoc. 1988, 83, 86–95. [Google Scholar]
- Wang, H.; Akritas, M. Asymptotically distribution free tests in heteroscedastic unbalanced high dimensional anova. Stat. Sin. 2011, 21, 1341–1377. [Google Scholar] [CrossRef] [Green Version]
- de Jong, P. A central limit theorem for generalized quadratic forms. Probab. Theory Relat. Fields 1987, 75, 261–277. [Google Scholar] [CrossRef]
- Hart, J.D.; Yi, S. One-sided cross-validation. J. Am. Stat. Assoc. 1998, 93, 620–631. [Google Scholar] [CrossRef]
- Holmes, C.C.; Adams, N.M. Likelihood inference in nearest-neighbour classification models. Biometrika 2003, 90, 99–112. [Google Scholar] [CrossRef] [Green Version]
- Singh, D.; Febbo, P.G.; Ross, K.; Jackson, D.G.; Manola, J.; Ladd, C.; Tamayo, P.; Renshaw, A.A.; D’Amico, A.V.; Richie, J.P.; et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002, 1, 203–209. [Google Scholar] [CrossRef] [Green Version]
- Lee, S.Y.; Lei, B.; Mallick, B.; Samy, A.M. Estimation of covid-19 spread curves integrating global data and borrowing information. PLoS ONE 2020, 15, e0236860. [Google Scholar] [CrossRef] [PubMed]
- Pyke, R. Spacings (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 1965, 27, 395–449. [Google Scholar]
- Bishop, Y.M.; Fienberg, S.E.; Holland, P.W. Discrete Multivariate Analysis; Springer: New York, NY, USA, 2007. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).