A Bayesian Motivated Two-Sample Test Based on Kernel Density Estimates

A new nonparametric test of equality of two densities is investigated. The test statistic is an average of log-Bayes factors, each of which is constructed from a kernel density estimate. Prior densities for the bandwidths of the kernel estimates are required, and it is shown how to choose priors so that the log-Bayes factors can be calculated exactly. Critical values of the test statistic are determined by a permutation distribution, conditional on the data. An attractive property of the methodology is that a critical value of 0 leads to a test for which both type I and II error probabilities tend to 0 as sample sizes tend to ∞. Existing results on Kullback–Leibler loss of kernel estimates are crucial to obtaining these asymptotic results, and also imply that the proposed test works best with heavy-tailed kernels. Finite sample characteristics of the test are studied via simulation, and extensions to multivariate data are straightforward, as illustrated by an application to bivariate connectionist data.


Introduction
Ref. [1] proposed the use of cross-validation Bayes factors in the classic two-sample problem of comparing two distributions. Their basic idea is to randomly divide the data into two distinct parts, call them A and B, and to define two models based on kernel density estimates from part A. One model assumes that the two distributions are the same and the other allows them to be different. A Bayes factor comparing the two part A models is then defined from the part B data. In order to stabilize the Bayes factor, Ref. [1] suggest that a number of different random data splits be used, and the resulting log-Bayes factors averaged.
In the current paper we consider a special case of this approach in which the part A data consists of all the available observations save one. If the sample sizes of the two data sets are m and n, this entails that a total of m + n log-Bayes factors may be calculated. The average of these m + n quantities becomes the test statistic here considered, and is termed ALB.
Although ALB is an average of log-Bayes factors, it does not lead to a consistent Bayes test because each of the log-Bayes factors is based on just a single observation. Ref. [1] suppose that the validation set size grows to ∞, while in our case it remains of size 1. This results in ALB converging to the Kullback–Leibler divergence of the two densities, and not ∞ as in the case of [1]. We therefore use frequentist ideas to construct our test. The exact null distribution of ALB conditional on order statistics is obtained using permutations of the data. Doing so leads to a consistent frequentist test whose size is controlled exactly. The problem of bandwidth selection is dealt with by using leave-one-out likelihood cross-validation applied to the combination of the two data sets. This method is computationally efficient in that the resulting bandwidth is invariant to permutations of the combined data, and therefore has to be computed just once. Our methodology is easily extended to bivariate data, and we do so in a real data example.
Ref. [2] also use a permutation test based on kernel estimates for the two-sample problem, their statistic being based on an $L_2$ distance. Ref. [3] shows how other distances and divergences compare when applying them to the general k-sample problem, restricting their comparisons to the one-dimensional case. Our method mainly differs from these procedures by virtue of its Bayesian motivation. Existing methodology that most closely resembles ours is that of [4], who use a kernel-based marginal likelihood ratio to test goodness of fit of parametric models for a distribution. Their marginal likelihood employs a prior for a bandwidth, as does ours.

Methodology
We assume that $X = (X_1, \ldots, X_m)$ are independent and identically distributed (i.i.d.) from density $f$, and independently $Y = (Y_1, \ldots, Y_n)$ are i.i.d. from density $g$. We are interested in the problem of testing the null hypothesis that $f$ and $g$ are identical on the basis of the data $X$ and $Y$. Let $U = (U_1, \ldots, U_k)$ be an arbitrary set of $k$ scalar observations, and define a kernel density estimate by
$$\hat f_K(u \mid h, U) = \frac{1}{kh} \sum_{i=1}^{k} K\left(\frac{u - U_i}{h}\right),$$
where $K$ is the kernel and $h > 0$ the bandwidth.
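As a concrete illustration of the estimate just defined, the following minimal Python sketch evaluates a kernel density estimate and its leave-one-out version, which is the building block used repeatedly in what follows. It assumes the usual form of the estimate (an average of $K((u - U_i)/h)/h$ over the $k$ observations). The function names, the example data and the choice of a Gaussian kernel are illustrative assumptions, not part of the methodology.

```python
import math

def gaussian_kernel(x):
    """Standard normal density, used here only as an example kernel K."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def kde(u, h, data, kernel=gaussian_kernel):
    """Kernel density estimate at the point u with bandwidth h > 0."""
    k = len(data)
    return sum(kernel((u - x) / h) for x in data) / (k * h)

def kde_leave_one_out(i, h, data, kernel=gaussian_kernel):
    """Estimate at data[i] built from every observation except data[i]."""
    rest = data[:i] + data[i + 1:]
    return kde(data[i], h, rest, kernel)

sample = [-1.2, -0.3, 0.1, 0.8, 1.9]   # made-up observations
value = kde(0.0, 0.5, sample)          # estimate at u = 0 with h = 0.5
```

The leave-one-out version is what guarantees that each estimate is evaluated at a datum that played no role in constructing it.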

The Test Statistic
Let $Z_i = X_i$, $i = 1, \ldots, m$, and $Z_i = Y_{i-m}$, $i = m+1, \ldots, m+n$. Let $Z = (Z_1, \ldots, Z_{m+n})$ and $Z^{-i}$ be the vector $Z$ with all its components except $Z_i$, $i = 1, \ldots, m+n$. Furthermore, let $X^{-i}$ be all the components of $X$ except $X_i$, $i = 1, \ldots, m$, and $Y^{-j}$ all the components of $Y$ except $Y_j$, $j = 1, \ldots, n$. If we assume that $f$ is identical to $g$, then potential models for $f$ are $M_{0i} = \{\hat f_K(\cdot \mid h, Z^{-i}) : h > 0\}$, $i = 1, \ldots, m+n$. Suppose that $1 \le i \le m$. If we allow that $f$ and $g$ are different, then a model for the datum $Z_i$ is $M_{1i} = \{\hat f_K(\cdot \mid a, X^{-i}) : a > 0\}$. In this case a legitimate Bayes factor for comparing $M_{0i}$ and $M_{1i}$ on the basis of the datum $Z_i$ has the form
$$B_i = \frac{\int_0^\infty \pi(a)\, \hat f_K(Z_i \mid a, X^{-i})\, da}{\int_0^\infty \pi(h)\, \hat f_K(Z_i \mid h, Z^{-i})\, dh},$$
where $\pi$ is a prior density for the bandwidth and, mainly for convenience, we have assumed that the bandwidth priors are the same in all cases. Likewise, if $i = m+1, \ldots, m+n$, then $M_{1i} = \{\hat f_K(\cdot \mid b, Y^{-(i-m)}) : b > 0\}$ is a model for the datum $Z_i$, and a Bayes factor for comparing $M_{0i}$ and $M_{1i}$ is
$$B_i = \frac{\int_0^\infty \pi(b)\, \hat f_K(Z_i \mid b, Y^{-(i-m)})\, db}{\int_0^\infty \pi(h)\, \hat f_K(Z_i \mid h, Z^{-i})\, dh}.$$
When $m$ and $n$ are large, it is expected that $M_{1i}$ will be a good model for $f$ if $i = 1, \ldots, m$ and for $g$ if $i = m+1, \ldots, m+n$. Likewise, each of $M_{0i}$ will be a good model for the common density on the assumption that $f$ and $g$ are identical. However, none of $B_1, \ldots, B_{m+n}$ will be Bayes factors that can provide convincing evidence for either hypothesis simply because each one uses likelihoods based on a single datum. At first blush one might think that a solution to this problem is to take the average of the $m + n$ log-Bayes factors:
$$\mathrm{ALB} = \frac{1}{m+n} \sum_{i=1}^{m+n} \log B_i. \tag{1}$$
However, this results in a statistic that will consistently estimate 0 or a positive constant in the respective cases $f \equiv g$ or $f \not\equiv g$. In neither case does the statistic have the property of Bayes consistency, i.e., the property that the Bayes factor tends to 0 and ∞ when $f \equiv g$ and $f \not\equiv g$, respectively.
The discussion immediately above points out a fundamental fact that seems not to have been widely discussed: combining a large number of inconsistent Bayes factors does not necessarily lead to a consistent Bayes factor. A guiding principle in [1] was that of averaging log-Bayes factors from different random splits of the data with the aim of producing a more stable log-Bayes factor. However, in order for this practice to yield a consistent Bayes factor, it is important that each of the log-Bayes factors being averaged is consistent. Furthermore, to ensure this consistency, it is necessary that the sizes of both the training and validation sets tend to ∞ with the sample sizes m and n. Obviously this is not the case when the size of each validation set is just 1, as in the current paper.
An advantage of the approach proposed herein is that the practitioner does not have to choose the size of the training sets. The cost is that the resulting statistic does not have the property of Bayes consistency. We thus propose that the statistic be used in frequentist fashion. An appealing way of doing so is to use a permutation test, which (save for certain practical issues to be discussed) leads to a test with exact type I error probability for all $m > 1$ and $n > 1$. Let $Z_{(1)} < Z_{(2)} < \cdots < Z_{(m+n)}$ be the order statistics for the combined sample. Let $j = (j_1, \ldots, j_{m+n})$ be a random permutation of $1, \ldots, m+n$, and define $T(j)$ to be the statistic (1) when the X-sample is taken to be $Z_{j_1}, \ldots, Z_{j_m}$ and the Y-sample to be $Z_{j_{m+1}}, \ldots, Z_{j_{m+n}}$. It follows that, under the null hypothesis and conditional on the order statistics $Z_{(1)}, \ldots, Z_{(m+n)}$, the $(m+n)!$ values taken on by $T(\cdot)$ are equally likely. Therefore, if $t_{m,n}$ is a $1 - \alpha$ quantile of the empirical distribution of $T(\cdot)$, then the test that rejects $f \equiv g$ when $T \ge t_{m,n}$ will have an (unconditional) type I error probability of $\alpha$. As shown in Appendix A.3, under the null hypothesis ALB is negative with probability tending to 1 as $m, n \to \infty$, implying that for any $\alpha > 0$, $t_{m,n}$ will be negative for $m$ and $n$ large enough. From an evidentiary standpoint, it is nonsense to reject $H_0$ for a negative value of ALB. We therefore suggest using the critical value $\max(0, t_{m,n})$, which ensures that the test is sensible and has level $\alpha$.
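The permutation rule described above can be sketched in a few lines of Python. This is a minimal Monte Carlo approximation under stated assumptions: it samples random permutations instead of enumerating all $(m+n)!$ of them, and it works for any statistic passed in as `stat`. The function `mean_gap` is a hypothetical stand-in statistic (shifted so that it can be negative); it is not ALB, and all names and default values are illustrative.

```python
import math
import random

def permutation_critical_value(x, y, stat, n_perm=500, alpha=0.05, seed=0):
    """Monte Carlo version of the critical value max(0, t_{m,n})."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    m = len(x)
    values = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the combined sample
        values.append(stat(pooled[:m], pooled[m:]))
    values.sort()
    # Empirical (1 - alpha) quantile of the permuted statistics.
    t = values[math.ceil((1.0 - alpha) * n_perm) - 1]
    return max(0.0, t), values

def mean_gap(x, y):
    """Hypothetical stand-in statistic; the statistic of interest here is ALB."""
    return abs(sum(x) / len(x) - sum(y) / len(y)) - 0.5

rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(30)]
y = [rng.gauss(0.0, 1.0) for _ in range(30)]
critical, permuted = permutation_critical_value(x, y, mean_gap)
reject = mean_gap(x, y) >= critical
```

Flooring the quantile at 0 can only make the test more conservative, so the level of the test is not inflated by this step.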

The Effect of Using Scale Family Priors
Let $\pi_0$ be an arbitrary density with support $(0, \infty)$. A possible family of priors is one that contains all rescaled versions of $\pi_0$. For $b > 0$, using the prior $\pi(h) = \pi_0(h/b)/b$ and making the change of variable $h/b = u$ in the denominator of $B_i$, we have
$$\int_0^\infty \pi(h)\, \hat f_K(Z_i \mid h, Z^{-i})\, dh = \hat f_L(Z_i \mid b, Z^{-i}),$$
where the kernel $L$ is
$$L(u) = \int_0^\infty \frac{1}{v}\, K\left(\frac{u}{v}\right) \pi_0(v)\, dv. \tag{2}$$
So, by using this type of prior, each marginal likelihood comprising ALB becomes a kernel density estimate with bandwidth equal to the scale parameter of the prior. In one sense this is disappointing since it means that averaging kernel estimates with respect to a bandwidth prior does not actually sidestep the issue of choosing a smoothing parameter. One has simply traded bandwidth choice for choice of the prior's scale. However, it turns out that there is a quantifiable advantage to using a prior for the bandwidth of $K$. As detailed in Appendix A.2, likelihood cross-validation is often more efficient when applied to $\hat f_L$ rather than to $\hat f_K$. When using a scale family of priors, the result immediately above implies that
$$\mathrm{ALB} = \frac{1}{m+n} \log \left[ \frac{\prod_{i=1}^{m} \hat f_L(X_i \mid b, X^{-i}) \prod_{j=1}^{n} \hat f_L(Y_j \mid b, Y^{-j})}{\prod_{i=1}^{m+n} \hat f_L(Z_i \mid b, Z^{-i})} \right], \tag{3}$$
and so the proposed statistic is proportional to the log of a likelihood ratio. The two likelihoods are cross-validation likelihoods, and the numerator and denominator of the ratio correspond to the hypotheses of different and equal densities, respectively.

In practice one must select both the kernel $L$ and bandwidth $b$. For the moment we assume that $L$ is given. The denominator of $\exp((m+n)\mathrm{ALB})$ as a function of $b$ is the likelihood cross-validation criterion, as studied by [5], based on the combined sample. We propose using $b = \hat b$, the maximizer of this denominator. This bandwidth has the desirable property that it is invariant to the ordering of the data in the combined sample. Let $\mathrm{ALB}^*$ be the value of test statistic (1) for a permuted data set. One should use the principle that $\mathrm{ALB}^*$ is the same function of the permuted data as ALB is of the original data. So, in principle the bandwidth should be selected for every permuted data set, but because of the invariance of $\hat b$ to the ordering of the combined sample, this data-driven bandwidth equals $\hat b$ for every permuted data set. This results in a large computational savings relative to a procedure that selects the bandwidth differently for the X- and Y-samples. Using the same bandwidth under both null and alternative hypotheses also fits with the principle espoused by [6].
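To make the likelihood-ratio form of the statistic concrete, here is a minimal Python sketch that computes ALB as a scaled difference of leave-one-out log-likelihoods and selects the bandwidth by maximizing the combined-sample cross-validation likelihood. Several choices are assumptions made only for illustration: the kernel $L$ is taken to be a Student t density with 3 degrees of freedom, the maximization is done over a hypothetical grid of bandwidths rather than by a numerical optimizer, and the sample sizes and data are made up.

```python
import math
import random

def t3_kernel(u):
    """Student t density with 3 degrees of freedom (a heavy-tailed choice of L)."""
    return 2.0 / (math.pi * math.sqrt(3.0)) * (1.0 + u * u / 3.0) ** -2

def cv_loglik(b, data, kernel=t3_kernel):
    """Leave-one-out log-likelihood: sum over i of log f_L(data[i] | b, data without i)."""
    k = len(data)
    total = 0.0
    for i, xi in enumerate(data):
        s = sum(kernel((xi - xj) / b) for j, xj in enumerate(data) if j != i)
        total += math.log(s / ((k - 1) * b))
    return total

def select_bandwidth(pooled, grid):
    """Grid maximizer of the combined-sample cross-validation likelihood."""
    return max(grid, key=lambda b: cv_loglik(b, pooled))

def alb(x, y, b):
    """ALB as a scaled log ratio of cross-validation likelihoods."""
    pooled = x + y
    return (cv_loglik(b, x) + cv_loglik(b, y) - cv_loglik(b, pooled)) / len(pooled)

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(25)]
y = [rng.gauss(0.0, 2.0) for _ in range(25)]
grid = [0.1 * 1.3 ** k for k in range(15)]   # hypothetical bandwidth grid
b_hat = select_bandwidth(x + y, grid)        # computed once, reused for every permutation
statistic = alb(x, y, b_hat)
```

Because `select_bandwidth` sees only the pooled sample, relabelling observations as X or Y does not change `b_hat`, which is the source of the computational savings noted above.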
Concerning L, Ref. [5] showed that kernels must be relatively heavy-tailed in order for them to perform well with respect to likelihood cross-validation. In particular, he shows that likelihood cross-validation fails miserably as a method for choosing the bandwidth of a kde based on a Gaussian kernel. The tails of the kernel must be considerably heavier than those of a Gaussian density in order for likelihood cross-validation to be effective. Proposition A1 in the Appendix A.1 shows that under very general conditions L (as defined in (2)) has heavier tails than those of K. Therefore, the Bayesian notion of averaging commonly used kernel estimates with respect to a prior brings the resulting kernel estimate more in line with the conditions of [5]. This has a substantial benefit for our statistic inasmuch as we use a likelihood cross-validation bandwidth in its construction.
Consider the following kernel proposed by [5]:
$$L_0(u) = \frac{1}{\sqrt{8\pi e}\,\Phi(1)} \exp\left[-\frac{1}{2}\{\log(1 + |u|)\}^2\right], \quad -\infty < u < \infty,$$
where $\Phi$ is the standard normal cumulative distribution function. Suppose that a kde is defined using kernel $L_0$ and its bandwidth is chosen by likelihood cross-validation. Ref. [5] shows that, in general, this cross-validation bandwidth will be asymptotically optimal in a Kullback–Leibler sense. We will therefore use $L_0$ in all subsequent simulations. Results in Appendix A.2 provide a kernel $K$ and corresponding prior that produce $L_0$.
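The following Python sketch evaluates a heavy-tailed kernel of the kind discussed here and checks numerically that it integrates to 1. The specific expression coded for $L_0$, proportional to $\exp[-\{\log(1+|u|)\}^2/2]$ with normalizing constant $1/(\sqrt{8\pi e}\,\Phi(1))$, is an assumption of this sketch, as are the function names and the integration settings. The comparison with a Gaussian kernel is included only to illustrate how much heavier the tails are.

```python
import math

PHI_1 = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))   # standard normal cdf at 1
HALL_CONSTANT = 1.0 / (math.sqrt(8.0 * math.pi * math.e) * PHI_1)

def hall_kernel(u):
    """Assumed form of L_0: proportional to exp(-0.5 * log(1 + |u|)**2)."""
    return HALL_CONSTANT * math.exp(-0.5 * math.log1p(abs(u)) ** 2)

def gaussian_kernel(u):
    """Standard normal density, for comparison of tail behaviour."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def hall_total_mass(upper=12.0, steps=12000):
    """Integrate L_0 over the real line after substituting t = log(1 + u)."""
    h = upper / steps
    # In the t scale the integrand is 2 * C * exp(t - t**2 / 2) on [0, upper].
    f = lambda t: 2.0 * HALL_CONSTANT * math.exp(t - 0.5 * t * t)
    inner = sum(f(j * h) for j in range(1, steps))
    return h * (0.5 * f(0.0) + inner + 0.5 * f(upper))

mass = hall_total_mass()                                 # should be close to 1
tail_ratio = hall_kernel(6.0) / gaussian_kernel(6.0)     # large when tails are heavier
```

The substitution $t = \log(1 + u)$ is used because the tails decay too slowly for a naive truncated integral in $u$ to be accurate.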

Further Properties of ALB
In Appendix A.3 we show that the ALB test is consistent in the frequentist sense. In other words, for any alternative the power of an ALB test of fixed level tends to 1 as m and n tend to ∞.
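Interestingly, ALB has the property of being sharply bounded above. Let $\hat f_L(\cdot \mid b, U)$ denote the kernel estimate with kernel $L$ and bandwidth $b$ computed from data $U$, with a superscript $-i$ indicating that the $i$th observation has been left out. Using (3), ALB can be rewritten as follows:
$$\mathrm{ALB} = \frac{1}{m+n} \sum_{i=1}^{m} \log \frac{\hat f_L(X_i \mid b, X^{-i})}{\hat f_L(X_i \mid b, Z^{-i})} + \frac{1}{m+n} \sum_{j=1}^{n} \log \frac{\hat f_L(Y_j \mid b, Y^{-j})}{\hat f_L(Y_j \mid b, Z^{-(m+j)})}.$$
Because $L$ is nonnegative, $(m+n-1)\,\hat f_L(X_i \mid b, Z^{-i}) \ge (m-1)\,\hat f_L(X_i \mid b, X^{-i})$, so each term in the first sum is at most $\log[(m+n-1)/(m-1)]$. A similar bound applies for the other component of ALB, implying that
$$\mathrm{ALB} \le \frac{m}{m+n} \log\left(\frac{m+n-1}{m-1}\right) + \frac{n}{m+n} \log\left(\frac{m+n-1}{n-1}\right).$$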
Unless one of m and n is very small, the effective bound on ALB is log 2. This reinforces the fact that ALB does not have the property of Bayes consistency. While it is true that ALB is an average of log-Bayes factors, none of these Bayes factors can ever provide compelling evidence in favor of the alternative. To reiterate, this problem is overcome by employing ALB in frequentist fashion. While ALB can take on positive values when the null hypothesis is true, our proof of frequentist consistency shows that, under $H_0$, P(ALB < 0) → 1 as m, n → ∞. This implies that if 0 is used as a critical value, then the resulting test level tends to 0 as m, n → ∞. So, even though |ALB| does not tend to ∞, the sign of ALB provides compelling evidence for the hypotheses of interest when the sample sizes are large.
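A quick numerical look at the size of the upper bound may be helpful. The sketch below assumes the bound takes the form obtained by discarding, in each pooled leave-one-out estimate, the nonnegative kernel contributions of the other sample, namely $\frac{m}{m+n}\log\frac{m+n-1}{m-1} + \frac{n}{m+n}\log\frac{m+n-1}{n-1}$. This expression, the function name and the example sample sizes are assumptions of the sketch.

```python
import math

def alb_upper_bound(m, n):
    """Assumed upper bound on ALB for sample sizes m >= 2 and n >= 2."""
    total = m + n
    return ((m / total) * math.log((total - 1) / (m - 1))
            + (n / total) * math.log((total - 1) / (n - 1)))

balanced = alb_upper_bound(100, 100)    # close to log 2 for equal, moderately large samples
unbalanced = alb_upper_bound(20, 180)   # smaller than log 2 when the split is uneven
```

For large samples the bound behaves like the entropy of the split proportions $m/(m+n)$ and $n/(m+n)$, which is maximized at log 2 by an even split.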
The exact conditional distribution of ALB is known under the null hypothesis, as we use a permutation test. Nonetheless, it is of some interest to have an impression of the unconditional distribution of ALB. To this end, we randomly select two normal mixture densities that differ. The number of components $M$ in the first mixture is between 2 and 20 and chosen from a distribution such that the probability of $m$ is proportional to $m^{-1}$, $m = 2, \ldots, 20$ (in this paragraph $m$ denotes a number of mixture components, not the size of the X-sample). Given $M = m$, mixture weights are drawn from a Dirichlet distribution with all $m$ parameters equal to 1/2. Given $M = m$ and mixture weights, variances $\sigma_1^2, \ldots, \sigma_m^2$ of the normal components are a random sample from an inverse gamma distribution with both parameters equal to 1/2. Finally, means $\mu_1, \ldots, \mu_m$ of the normal components are such that $\mu_1, \ldots, \mu_m$ given $\sigma_1, \ldots, \sigma_m$ are independent with $\mu_j \mid \sigma_j \sim N(0, \sigma_j^2)$, $j = 1, \ldots, m$. The second normal mixture is independently selected using exactly the same mechanism. Random selection of densities in this manner for simulation studies has been proposed and explored in [7].
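The random mixture mechanism just described can be sketched in Python as follows. The sketch assumes that "both parameters equal to 1/2" for the inverse gamma distribution means shape 1/2 and scale 1/2, so that the reciprocal of a variance is gamma distributed with shape 1/2 and rate 1/2. The function names and the seed are illustrative.

```python
import math
import random

def random_normal_mixture(rng):
    """Draw (weights, means, standard deviations) of a random normal mixture."""
    sizes = list(range(2, 21))
    # P(M = m) proportional to 1/m for m = 2, ..., 20.
    m = rng.choices(sizes, weights=[1.0 / s for s in sizes])[0]
    # Dirichlet(1/2, ..., 1/2) weights via normalized Gamma(1/2, 1) draws.
    gammas = [rng.gammavariate(0.5, 1.0) for _ in range(m)]
    total = sum(gammas)
    weights = [g / total for g in gammas]
    # Inverse gamma(shape 1/2, scale 1/2): reciprocal of Gamma(shape 1/2, rate 1/2),
    # i.e. gammavariate with shape 0.5 and scale 2.0.
    variances = [1.0 / rng.gammavariate(0.5, 2.0) for _ in range(m)]
    sds = [math.sqrt(v) for v in variances]
    means = [rng.gauss(0.0, sd) for sd in sds]   # mu_j | sigma_j ~ N(0, sigma_j^2)
    return weights, means, sds

def sample_mixture(mixture, size, rng):
    """Draw `size` observations from a mixture returned by random_normal_mixture."""
    weights, means, sds = mixture
    labels = rng.choices(range(len(weights)), weights=weights, k=size)
    return [rng.gauss(means[j], sds[j]) for j in labels]

rng = random.Random(2024)
f_mix = random_normal_mixture(rng)
g_mix = random_normal_mixture(rng)        # second density, drawn independently
x = sample_mixture(f_mix, 100, rng)
y = sample_mixture(g_mix, 100, rng)
```

Sampling both data sets from `f_mix` instead would give a draw under the null hypothesis.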
We draw a sample of size 100 from each of the two randomly generated densities (so that m = n = 100), and then compute ALB. This procedure is replicated on the same two densities 100 times. After this, we repeat the whole procedure for nine more pairs of randomly selected densities. The results are seen in Figure 1. Save for case 3, the proportion of positive ALBs is nearly 1 in all cases.
We repeated a similar procedure for the null hypothesis setting. The simulation was exactly the same except that in each of the ten cases, only one density was generated, and a pair of independent samples (of size 100 each) was selected from this same density. The resulting ALB distributions can be seen in Figure 2. The proportions of cases where ALB < 0 for the 10 densities were, respectively, 0.89, 0.83, 0.83, 0.84, 0.85, 0.87, 0.91, 0.84, 0.84, and 0.76. These results are consistent with the fact that P(ALB < 0) tends to 1 with sample size. We feel that ALB has potential for screening variables in a binary classification problem. Since ALB is negative with high probability under $H_0$, we feel that 0 is a nicely interpretable cutoff for variable inclusion. However, we leave this topic for future research.

Simulations
We perform a small simulation study to investigate the size and power of our test. To explore the effect of the number of permutations, we generate 500 pairs of data sets, with one data set being a random sample of size m = 50 from a standard normal distribution, and the other a random sample of size n = 50 from a normal distribution with mean 0 and standard deviation 2. For each of the 500 pairs of data sets, the 95th percentile of ALBs is approximated using a range of different numbers (N) of permutations starting at 100 and increasing by a factor of 1.5 up to 3845. Results are indicated by the boxplots in Figure 3. The percentiles are centered at approximately the same value for all N. Not surprisingly, the variability of the percentiles becomes smaller as N increases. This implies a certain amount of mismatch between percentiles at N = 3845 and those at smaller N. The consequence of the mismatch just alluded to can be investigated by determining the true conditional and unconditional levels of tests based on small N. For the null case, two data sets, each of size 50, are generated from a common normal distribution. Since the distribution of ALB is invariant to location and scale in the null case, we use a standard normal without loss of generality. For each pair of data sets, the data are randomly permuted 338 times, which leads to 338 values of ALB. A second set of 3845 permutations is then performed, leading to 3845 more values of ALB. The proportion of ALBs from the second set that exceed the 95th percentile of the ALBs formed from the first set is then determined. This proportion is approximately equal to the conditional level of the test based on 338 permutations. This same procedure is used for each of 500 data sets, and the resulting distribution of approximate levels is shown in Figure 4.
The histogram is centered near 0.05, and 87% of the conditional levels are between 0.03 and 0.07. Furthermore, an approximation to the unconditional level is $\sum_{i=1}^{500} \hat\alpha_i / 500 = 0.053$, where $\hat\alpha_i$ is the approximate conditional level for the $i$th data set, $i = 1, \ldots, 500$. Based on these results, use of only 338 permutations is arguably adequate.
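Two small computations underlie the experiment above: the ladder of permutation counts N, and the approximate conditional level obtained by comparing a second, larger set of permuted statistics with the 95th percentile of a first set. The Python sketch below illustrates both. It assumes that non-integer counts are rounded up, which reproduces the values 338 and 3845 quoted in the text, and the function names are hypothetical.

```python
import math

def permutation_counts(start=100, factor=1.5, steps=10):
    """Numbers of permutations: start, start * factor, ..., rounded up (assumed)."""
    return [math.ceil(start * factor ** k) for k in range(steps)]

def approx_conditional_level(first_set, second_set, alpha=0.05):
    """Share of second-set statistics exceeding the (1 - alpha) percentile of the first set."""
    ordered = sorted(first_set)
    cutoff = ordered[math.ceil((1.0 - alpha) * len(ordered)) - 1]
    return sum(v > cutoff for v in second_set) / len(second_set)

counts = permutation_counts()
```

In the experiment, `first_set` would hold the 338 permuted ALB values and `second_set` the 3845 values from the second round of permutations.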
The same experiment is repeated except now the two data sets are drawn from different distributions, a standard normal and a normal with mean 0 and standard deviation 2. Results from this experiment are given in Figure 5. As in the null case, the conditional levels based on the use of 338 permutations are quite good. Eighty-eight percent of the levels are between 0.03 and 0.07, and the approximate unconditional level is 0.051.
The proportion of ALBs from permuted data sets that are larger than the ALB computed from the original data provides a p-value. The p-values obtained with our method (based on 3845 permutations) are compared to the p-values obtained with the Kolmogorov-Smirnov test and Bowman's two-sample test. Results are summarized in Figures 6 and 7. In 98% of the replications the K-S p-value was larger than the ALB p-value, and in 57% of the cases the Bowman p-value was equal to or larger than the ALB p-value. These results suggest that in this case our test has much better power than that of the Kolmogorov-Smirnov test and power at least comparable to that of Bowman's test.  The ALB p-value is less than, more than and equal to the Bowman p-value in 49%, 43% and 8% of cases, respectively.

A Bivariate Extension of the Two-Sample Test and Application to Connectionist Bench Data
Our method can be extended to the bivariate case by using a bivariate kernel density estimate. Assume now that $X = (X_1, \ldots, X_m)$ are independent and identically distributed from density $f$ and $Y = (Y_1, \ldots, Y_n)$ are independent and identically distributed from $g$, where $X_i$ and $Y_j$ are each bivariate observations, $i = 1, \ldots, m$, $j = 1, \ldots, n$.
A product kernel will be used, i.e., the bivariate kernel is the product of two univariate kernels $K$. For $k$ arbitrary bivariate observations $U = (U_1, \ldots, U_k)$, $U_i = (U_{i1}, U_{i2})$, $i = 1, \ldots, k$, and $u = (u_1, u_2)$, the kernel estimate is defined by
$$\hat f_K(u \mid h, U) = \frac{1}{k h_1 h_2} \sum_{i=1}^{k} K\left(\frac{u_1 - U_{i1}}{h_1}\right) K\left(\frac{u_2 - U_{i2}}{h_2}\right),$$
where $h = (h_1, h_2)$ is a two-vector of (positive) bandwidths. We will use the same sort of notation as before, i.e., $Z_i = X_i$, $i = 1, \ldots, m$, $Z_i = Y_{i-m}$, $i = m+1, \ldots, m+n$, $Z = (Z_1, \ldots, Z_{m+n})$ and $Z^{-i}$ is the object $Z$ with all its components except $Z_i$, $i = 1, \ldots, m+n$, with $X^{-i}$ and $Y^{-j}$ defined analogously. In this case the $i$th Bayes factor is defined as
$$B_i = \frac{\int_0^\infty\!\int_0^\infty \pi(h_1, h_2)\, \hat f_K(Z_i \mid h, X^{-i})\, dh_1\, dh_2}{\int_0^\infty\!\int_0^\infty \pi(h_1, h_2)\, \hat f_K(Z_i \mid h, Z^{-i})\, dh_1\, dh_2}, \quad i = 1, \ldots, m,$$
and similarly for $i = m+1, \ldots, m+n$. As before the test statistic is
$$\mathrm{ALB} = \frac{1}{m+n} \sum_{i=1}^{m+n} \log B_i.$$
This form may seem daunting, but reduces to a more familiar form if we take $\pi(h_1, h_2) = \pi_0(h_1/b_1)\,\pi_0(h_2/b_2)/(b_1 b_2)$. In this case, proceeding exactly as in Section 2, $B_i$ has the form
$$B_i = \frac{\hat f_L(Z_i \mid b, X^{-i})}{\hat f_L(Z_i \mid b, Z^{-i})}, \quad i = 1, \ldots, m,$$
and similarly for $i = m+1, \ldots, m+n$, where $b = (b_1, b_2)$ and $L$ is defined by (2).

We will analyze a subset of the connectionist bench data, which consist of measurements obtained after bouncing sonar waves off either rocks or metal cylinders. The data may be found at the UCI Machine Learning repository, Ref. [8]. There are 60 variables in the data set, with m = 111 and n = 97 measurements of each variable for the metal cylinders and rocks, respectively. Variable numbers (1 to 60) correspond to increasing aspect angles at which signals are bounced off either metal or rock, and each of the 60 numbers is an amount of energy within a particular frequency band, integrated over a certain period of time. We will apply our test to see if the first two variables (corresponding to the smallest aspect angles) have a different distribution for rocks than they do for metal cylinders. In our analysis $K$ is taken to be $\phi$, the standard normal density, and $\pi_0$ to be of the form (A1). In this event $L$ is a t-density with $\nu$ degrees of freedom. We will use $\nu = 3$, leading to a fairly heavy-tailed kernel, which is desirable for reasons discussed previously.
The data for each variable are inherently between 0 and 1, and bivariate kernel estimates display boundary effects along the lines x = 0 and y = 0, with the largest bias near the origin. We therefore use a reflection technique to reduce bias along these two lines. Suppose one has k observations (x 1 , y 1 ), . . . , (x k , y k ) on the unit square. Each observation (x i , y i ) is reflected to create three new observations: (x i , −y i ), (−x i , −y i ) and (−x i , y i ), i = 1, . . . , k. One then simply computes, at points in the unit square, a standard kernel density estimate from the data set of size 4k, and multiplies it by 4 to ensure integration to 1. The value of ALB is computed as described previously except that each leave-out estimate leaves out four values: the observation at which the estimate is evaluated plus its three reflected versions. In this way the kde is constructed from data that are independent of the value at which the kde is evaluated.
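The reflection device described above can be sketched in Python as follows. The sketch uses a product kernel built from a Student t density with 3 degrees of freedom, in line with the choice of L in this section. The bandwidths, the example points and the function names are hypothetical. Leaving out observation i before reflecting is the same as leaving out that observation together with its three reflected copies.

```python
import math

def t3_kernel(u):
    """Student t density with 3 degrees of freedom."""
    return 2.0 / (math.pi * math.sqrt(3.0)) * (1.0 + u * u / 3.0) ** -2

def reflect(points):
    """Each (x, y) plus its reflections about the lines x = 0 and y = 0."""
    out = []
    for x, y in points:
        out.extend([(x, y), (x, -y), (-x, -y), (-x, y)])
    return out

def reflected_kde(u, b, points, kernel=t3_kernel):
    """Product-kernel estimate at u = (u1, u2), u1, u2 >= 0, from the reflected data."""
    augmented = reflect(points)
    total = sum(kernel((u[0] - x) / b[0]) * kernel((u[1] - y) / b[1])
                for x, y in augmented)
    # Factor 4 restores integration to 1 over the positive quadrant.
    return 4.0 * total / (len(augmented) * b[0] * b[1])

def reflected_kde_leave_one_out(i, b, points, kernel=t3_kernel):
    """Estimate at points[i] omitting points[i] and its three reflections."""
    rest = points[:i] + points[i + 1:]
    return reflected_kde(points[i], b, rest, kernel)

pts = [(0.02, 0.05), (0.10, 0.03), (0.04, 0.20)]   # made-up observations
b = (0.05, 0.05)                                    # hypothetical bandwidths
density_at_first = reflected_kde_leave_one_out(0, b, pts)
```

By construction the reflected estimate is symmetric about both axes, so its slope across the lines x = 0 and y = 0 vanishes, which is what removes the leading boundary bias.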
Kernel density estimates for variables 1 and 2 in the form of heat maps are shown in Figures 8 and 9, and contours of the estimates are given in Figure 10. The latter figure suggests that the distributions for metal cylinders and rock are different. The value of ALB turned out to be 0.013, and an approximate p-value based on 10,000 permuted data sets was 0.0076. So, there is strong evidence of a difference between the rock and metal bivariate distributions. Interestingly, the proportion of negative ALBs among the 10,000 permutations was 0.9785. A kernel density estimate based on the 10,000 values of $\mathrm{ALB}^*$ is shown in Figure 11.

Conclusions and Future Work
We have proposed a new nonparametric test of the null hypothesis that two densities are equal. An attractive property of the test is that its critical values are defined by a permutation distribution, allaying essentially any concern about test validity. The fact that the statistic is an average of log-Bayes factors leads to another attractive property: a critical value of 0 leads to a test with type I error probability tending to 0 with sample size. A simulation study showed the new test to have much better power than the Kolmogorov-Smirnov test in a case where the two densities differed with respect to scale. An application to connectionist data illustrated the usefulness of our methodology for bivariate data.
Future work includes efforts to increase the speed of computing the test statistic and its permutation distribution, especially for large data sets. We are also interested in applying the new test to the problem of screening variables prior to performing binary classification. A common method of doing so is to compute a two-sample test statistic for each variable, and to then select variables whose statistics exceed some threshold. An inherent problem in this approach is objectively choosing a threshold. Results of the current paper suggest that 0 would be a natural and effective threshold for variable screening.

[…] the more diffuse, or noninformative, priors, i.e., those for which $\nu$ is small. (The mean and variance of (A1) exist for $\nu > 2$. At $\nu = 3$, the two are 1.382 and 1.090, respectively, and as $\nu \to \infty$ they converge to 1 and 0.)
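The quoted moments can be checked with a short Python computation. The explicit form of (A1) is not reproduced in this part of the text, so the sketch rests on an assumption: that (A1) is the bandwidth prior under which a Gaussian $K$ yields a $t_\nu$ kernel $L$, i.e., the distribution of $\sqrt{\nu/W}$ with $W$ chi-squared on $\nu$ degrees of freedom (the standard scale-mixture representation of the t distribution). The function names are hypothetical.

```python
import math

def bandwidth_prior_mean(nu):
    """E[sqrt(nu / W)] for W ~ chi-squared(nu); finite for nu > 1."""
    return math.sqrt(nu / 2.0) * math.gamma((nu - 1.0) / 2.0) / math.gamma(nu / 2.0)

def bandwidth_prior_variance(nu):
    """Var[sqrt(nu / W)], using E[nu / W] = nu / (nu - 2); finite for nu > 2."""
    return nu / (nu - 2.0) - bandwidth_prior_mean(nu) ** 2

mean_3 = bandwidth_prior_mean(3.0)
variance_3 = bandwidth_prior_variance(3.0)
```

Under this assumption the values at $\nu = 3$ agree with the 1.382 and 1.090 reported above, and both moments approach 1 and 0 as $\nu$ grows.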
The fact that the kernel $L$ is more heavy-tailed than $K$ in the previous example is not an isolated phenomenon, as indicated by the following proposition (which is straightforward to prove):
Proposition A1. If $\pi_0$ has support $(0, C)$ with $1 < C \le \infty$ and the tails of $K$ decay exponentially, then the tails of $L$ are heavier than those of $K$ in that $K(u)/L(u) \to 0$ as $u \to \infty$.
In principle, many different choices of π 0 and K could produce the same kernel L. Or, one might ask "given kernel K, what prior π 0 would produce a specified L?" When K is Gaussian, the latter question is answered by solving an integral equation. Unfortunately, doing so, at least in a general sense, exceeds our mathematical abilities. In the case where K is uniform, though, an elegant solution exists, as seen in the next section.

Appendix A.2. When K Is Uniform
In the special case where $K$ is uniform on the interval $(-1/2, 1/2)$, it is easy to check that, for all $u$,
$$L(u) = \int_{2|u|}^{\infty} \frac{\pi_0(v)}{v}\, dv. \tag{A2}$$
If $\pi_0$ has support $(0, \infty)$, then $L$ has support $(-\infty, \infty)$, and hence we see again that averaging kernels with respect to a prior leads to a more heavy-tailed kernel. Since our statistic ends up being a log-likelihood ratio based on kernel $L$, an interesting question is "what prior $\pi_0$ gives rise to a specified kernel $L$?" Taking $u \ge 0$ and differentiating, (A2) implies that
$$\pi_0(2u) = -u L'(u). \tag{A3}$$
When $L$ is decreasing on $[0, \infty)$ it follows that $\pi_0$ is a density. (Under mild tail conditions on $L$ and assuming that $L'(0+)$ exists finite, it is easy to show using integration by parts that (A3) integrates to 1 on $(0, \infty)$.) Suppose that a kde is defined using the Hall kernel $L_0$ and its bandwidth is chosen by likelihood cross-validation. Ref. [5] shows that, in general, this cross-validation bandwidth will be asymptotically optimal in a Kullback–Leibler sense. In contrast, using cross-validation to choose the bandwidth of a uniform kernel kde will produce a bandwidth that diverges to ∞ as the sample size tends to ∞.
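The relationship between a decreasing kernel $L$ and the prior that generates it from a uniform $K$ can be checked numerically. The Python sketch below assumes the relation $\pi_0(2u) = -uL'(u)$ for $u \ge 0$ and its inverse, $L(u) = \int_{2|u|}^{\infty} \pi_0(v)/v\,dv$. Purely for illustration it takes $L$ to be a Student t density with 3 degrees of freedom; any kernel decreasing on $[0, \infty)$ with a known derivative could be substituted. The function names, truncation point and step counts are hypothetical.

```python
import math

C3 = 2.0 / (math.pi * math.sqrt(3.0))

def t3_kernel(u):
    """Student t density with 3 degrees of freedom, used as the target L."""
    return C3 * (1.0 + u * u / 3.0) ** -2

def t3_kernel_derivative(u):
    """Analytic derivative of t3_kernel."""
    return -(4.0 * C3 * u / 3.0) * (1.0 + u * u / 3.0) ** -3

def prior_from_kernel(v, kernel_derivative=t3_kernel_derivative):
    """pi_0(v) = -(v / 2) * L'(v / 2), i.e. pi_0(2u) = -u L'(u) with v = 2u."""
    return -(v / 2.0) * kernel_derivative(v / 2.0)

def trapezoid(f, lower, upper, steps):
    """Composite trapezoidal rule on [lower, upper]."""
    h = (upper - lower) / steps
    inner = sum(f(lower + j * h) for j in range(1, steps))
    return h * (0.5 * f(lower) + inner + 0.5 * f(upper))

def kernel_from_prior(u, upper=2000.0, steps=200000):
    """Integral of pi_0(v) / v over v > 2|u|, truncated at `upper`; requires u != 0."""
    return trapezoid(lambda v: prior_from_kernel(v) / v, 2.0 * abs(u), upper, steps)

prior_mass = trapezoid(prior_from_kernel, 0.0, 2000.0, 200000)   # should be near 1
recovered = kernel_from_prior(0.7)                               # should be near t3_kernel(0.7)
```

The two checks correspond to the two claims in the text: the implied prior is a proper density, and mixing uniform kernels over it returns the kernel one started with.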
Using (A3), the prior that produces $L_0$, shown in Figure A1, is
$$\pi_0(v) = -\frac{v}{2}\, L_0'\left(\frac{v}{2}\right), \quad v > 0.$$
This shape for the bandwidth prior could be considered canonical inasmuch as L will be similarly shaped for kernels that are decreasing on (0, ∞).

Appendix A.3

Here we prove
R1. frequentist consistency of our test, and
R2. P(ALB < 0) → 1 as m, n → ∞.
Our proof uses the following assumptions.
A1. Under the null and alternative hypotheses the following integrals exist finite: […] When the alternative hypothesis is true, $f$ and $g$ are assumed to be different in the sense that the total variation distance, $\delta(f, g)$, is positive.
A2. The kernel $L$ in ALB (expression (3)) is the Hall kernel, $L_0$.
A3. The combined data likelihood cross-validation is maximized over an interval of the form $[(m+n)^{-1+\epsilon}, (m+n)^{-\epsilon}]$, where $\epsilon$ is an arbitrarily small positive constant. The maximizer of this cross-validation is denoted $\hat b_{m+n}$.
A4. The ratio $m/(m+n)$ tends to $\rho$, $0 < \rho < 1$, as $m, n$ tend to ∞.
A5. The densities $f$, $g$ and $\rho f(x) + (1 - \rho) g(x)$ satisfy the conditions of [5] that are needed for the asymptotic optimality of a likelihood cross-validation bandwidth.
A6. Under the null hypothesis, let $R_k(b)$ be the Kullback–Leibler risk of a kernel density estimate based on sample size $k$, kernel $L_0$ and bandwidth $b$. Then $R_k$ satisfies […] for positive constants $a$, $C_V$ and $C_B$ with $0 < a < 1$.

Before proceeding to the proof, remarks about assumption A6 are in order. This condition is needed only in proving R2, and represents a subset of the cases studied by [5]. It has been assumed merely to allow a more concise proof of R2, which remains true under more general conditions on $R_k$.
The critical values of a test with fixed size α > 0 will tend to 0 as m, n tend to ∞ so long as ALB tends to 0 in probability under the null hypothesis. Therefore, the power of the test will tend to 1 if we can show that ALB tends to a positive constant under the alternative. Our proof of consistency thus boils down to showing that, as m, n tend to ∞, ALB converges in probability to 0 and a positive number under the null and alternative hypotheses, respectively.
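For data $U = (U_1, \ldots, U_k)$, define
$$\mathrm{CV}(b \mid U) = \frac{1}{k} \sum_{i=1}^{k} \log \hat f_L(U_i \mid b, U^{-i}),$$
where $U^{-i}$ is $U$ with $U_i$ removed. The statistic ALB may then be written
$$\mathrm{ALB} = \frac{m}{m+n}\, \mathrm{CV}(\hat b \mid X) + \frac{n}{m+n}\, \mathrm{CV}(\hat b \mid Y) - \mathrm{CV}(\hat b \mid Z),$$
where $\hat b$ maximizes $\mathrm{CV}(b \mid Z)$ for $b \in [(m+n)^{-1+\epsilon}, (m+n)^{-\epsilon}]$. Now suppose that $U$ is a random sample from density $d$, $R_k(b)$ is the expectation of the Kullback–Leibler loss of $\hat f_L(\cdot \mid b, U)$ and define […] where $\int d(x) \log d(x)\, dx$ exists finite. Then if $d$ satisfies the conditions of [5] and $k \to \infty$, […] where $\epsilon$ is arbitrarily small. By the strong law of large numbers $Q(k)$ converges to 0 in probability. Furthermore, $\max_{b \in [k^{-1+\epsilon}, k^{-\epsilon}]} R_k(b)$ tends to 0 as $k \to \infty$. If the maximizer $\hat b$ of $\mathrm{CV}(b \mid U)$ is in $[k^{-1+\epsilon}, k^{-\epsilon}]$ it therefore follows that $\mathrm{CV}(\hat b \mid U)$ converges in probability to $\int d(x) \log d(x)\, dx$ as $k \to \infty$.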
In the null case, (A4) implies that […] To prove R2, we first observe that the bias component of the Kullback–Leibler risk $R_k(b)$ is free of sample size, and hence the first order term of (A5) is free of bias components. Along with A3 and A6, this implies that […] Using the fact that $\rho^a + (1 - \rho)^a - 1 > 0$, it now follows that P(ALB < 0) → 1 as m, n → ∞.