Abstract
With the development of modern data collection techniques, researchers often encounter high-dimensional data across various research fields. An important problem is to determine whether several groups of such high-dimensional data originate from the same population. To address this problem, this paper presents a novel k-sample test of equal distributions for high-dimensional data, utilizing the Maximum Mean Discrepancy (MMD). The test statistic is constructed using a V-statistic-based estimator of the squared MMD derived for several samples. The asymptotic null and alternative distributions of the test statistic are derived. To approximate the null distribution accurately, three simple methods are described. To evaluate the performance of the proposed test, two simulation studies and a real data example are presented, demonstrating the effectiveness and reliability of the test in practical applications.
Keywords:
multi-sample test; hypothesis testing; parametric bootstrap; random permutation; Welch–Satterthwaite χ²-approximation; chi-squared-type mixtures
MSC:
62H15
1. Introduction
Testing whether multiple samples follow the same distribution is a fundamental challenge in data analysis, with wide-ranging applications across diverse fields. Traditional nonparametric tests designed for comparing the distributions of two samples, such as the Wald–Wolfowitz runs test, the Mann–Whitney–Wilcoxon rank-sum test, and the Kolmogorov–Smirnov test utilizing the Empirical Distribution Function (EDF), are well established for univariate data [1]. Extending these tests to the multivariate setting has been the focus of extensive research. This has led to the development of novel tests based on multivariate runs, ranks, EDFs, distances, and projections, as pioneered by [2,3,4,5], among others.
In this paper, our primary focus is on addressing the multi-sample problem for equal distributions in high-dimensional data. In contemporary data analysis, high-dimensional datasets have become increasingly prevalent and easily accessible across various domains. For instance, Section 5 introduces a dataset derived from a keratoconus study involving corneal surfaces. This collaborative effort involves Ms. Nancy Tripoli and Dr. Kenneth L. Cohen from the Department of Ophthalmology at the University of North Carolina, Chapel Hill. The dataset comprises 150 corneal surfaces, each characterized by over 6000 height measurements. These surfaces are categorized into four distinct groups based on their corneal shapes, prompting the question of whether these groups share a common underlying distribution. Consequently, there is a pressing need to develop tests that can assess distributional equality in the context of high-dimensional data.
Mathematically, a multi-sample problem for high-dimensional data can be described as follows. Assume that we have the following k samples of observed random elements in $\mathbb{R}^p$:
$$ \mathbf{y}_{i1}, \ldots, \mathbf{y}_{i n_i} \overset{\text{i.i.d.}}{\sim} F_i, \quad i = 1, \ldots, k, \qquad (1) $$
where the data dimension p can be very large and $F_1, \ldots, F_k$ are unknown cumulative distribution functions. Of interest is to test whether the k distribution functions are the same:
$$ H_0: F_1 = \cdots = F_k \quad \text{vs.} \quad H_1: F_i \neq F_j \ \text{for some} \ 1 \le i < j \le k. \qquad (2) $$
When k = 2, a variety of distance-based tests designed for multivariate data, such as those proposed by [2,3,4,5], can potentially be applied to test (2) in the context of high-dimensional data. However, it has been demonstrated by [6] that these tests may lack the power to effectively detect differences in distribution scales in high dimensions. In response to this challenge, ref. [6] introduced a test based on interpoint distances, while [6] proposed a high-dimensional two-sample run test based on the shortest Hamiltonian path. Nevertheless, ref. [7] showed that these tests, along with a graph-based test by [8], are less effective in detecting differences in location. To address this limitation, ref. [7] introduced two asymptotic normal tests for both location and scale differences, based on interpoint distances. However, it is worth noting that these tests rely on a strong mixing condition and impose a natural ordering on the components of high-dimensional observations, limiting their applicability. Furthermore, these tests involve complex U-statistic estimators, making them computationally intensive. Several other approaches have also been proposed for high-dimensional distribution testing. Ref. [9] presented a high-dimensional permutation test using a symmetric measure of distance between data points, but it comes with high computational costs. Refs. [10,11] proposed tests based on projections, which are generally more effective in detecting low-dimensional distributional differences. Ref. [12] introduced a test based on energy distance and permutation, but it is less powerful in detecting scale differences. Refs. [13,14] proposed kernel two-sample tests based on the Maximum Mean Discrepancy (MMD). Ref. [14] demonstrated the equivalence between the energy-distance-based test and the kernel-based test, showing that the energy-distance-based test can be viewed as a kernel test utilizing a kernel induced by the interpoint distance. The MMD leverages the kernel trick to define a distance between the embeddings of distributions in Reproducing Kernel Hilbert Spaces (RKHSs). It is well suited for checking distribution differences among several high-dimensional samples and is applicable to various data types, such as vectors, strings, or graphs. Recently, there has been further investigation into unbiased and biased MMD-based two-sample tests for high-dimensional data by [15,16], respectively.
On the other hand, when k ≥ 3, there are limited tests available for testing (2). One such test is the energy test developed by [12]. This test statistic is obtained by directly summing all pairwise energy distances, with its null distribution approximated through permutation. However, this test has some drawbacks, including being time-consuming and yielding p-values that may vary when applied multiple times to the same dataset, as evidenced by Figure 3 in Section 5. Another approach is presented by [17], who extended the idea of MMD from the two-sample problem to the multi-sample problem for equal distributions. They constructed an MMD-based test statistic capable of detecting deviations from distribution homogeneity in several samples. However, the resulting test statistic and its null limit distribution are very complicated in form, which restricts their practical use. To address this limitation, ref. [18] developed a new MMD-based test for testing (2). This test statistic is constructed using U-statistics, making it easy to conduct and yielding accurate results.
In this paper, we maintain our focus on the MMD-based approach for testing (2). However, unlike [18], where a U-statistic technique is employed to construct the test statistic, we take a distinct path by constructing an L²-norm-based test statistic. To achieve this, we first employ a canonical feature map to transform the k original samples (1) into k induced samples (3). Concurrently, we transform the k-sample equal distribution testing problem (2) into a mean vector testing problem (4). This transformation facilitates the straightforward construction of an L²-norm-based test (5) for assessing the mean vector testing problem (4). Leveraging a kernel trick, we derive a formula for computing the L²-norm-based test statistic using the k original samples (1).
Additionally, this paper makes several other significant contributions. Firstly, akin to the work of [17,18], we extend the concept of MMD from two-sample problems to the domain of multi-sample problems with equal distributions. Secondly, we derive the asymptotic null and alternative distributions of the proposed test statistic. Thirdly, we offer three distinct approaches for approximating the null distribution of the MMD-based test statistic, utilizing parametric bootstrap, random permutation, and the Welch–Satterthwaite (W–S) χ²-approximation methods. Lastly, we examine two specific scenarios in our comprehensive simulation studies. In the first scenario, the samples have the same mean vector but different covariance matrices, while in the second scenario, the samples exhibit both distinct mean vectors and covariance matrices. Our simulation results demonstrate that the tests we propose effectively maintain precise control over size in both scenarios. However, in terms of empirical power, they outperform (underperform) the energy test introduced by [12] in the first (second) scenario. In other words, when the primary difference in the distributions of the samples lies in their covariance matrices, the new tests are the preferred choice in terms of statistical power.
The remainder of this paper is organized as follows: Section 2 presents the main results and Section 3 introduces three methods to implement our test. In Section 4, we provide two simulation studies. Section 5 showcases an application to the corneal surface data mentioned earlier. Concluding remarks can be found in Section 6. Technical proofs of the main results are included in Appendix A.
2. Main Results
2.1. MMD for Several Distributions
In this section, we show how the MMD can be defined for several distributions. Let be an RKHS associated with a characteristic reproducing kernel . For any and in , the inner product and -norm of are defined as and , respectively. Let be the canonical feature mapping associated with , i.e., . Using this feature mapping and setting for , we obtain the following k induced samples in the RKHS :
which are derived from the k original samples in (1). Define for , representing the mean embeddings of the k distributions , where .
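For concreteness, the following display is a minimal sketch of the objects involved, in notation introduced here (the feature map $\varphi$, the kernel $\mathcal{K}$, the induced observations $\mathbf{y}_{ij}^{*}$, and the mean embeddings $\boldsymbol{\mu}_i$ are our labels and may differ from the original symbols):
$$ \varphi(\mathbf{x}) = \mathcal{K}(\mathbf{x}, \cdot), \qquad \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle_{\mathcal{H}} = \mathcal{K}(\mathbf{x}, \mathbf{y}), \qquad \mathbf{y}_{ij}^{*} = \varphi(\mathbf{y}_{ij}), \quad j = 1, \ldots, n_i, \qquad \boldsymbol{\mu}_i = \mathbb{E}_{\mathbf{y} \sim F_i}\, \varphi(\mathbf{y}), \quad i = 1, \ldots, k, $$
where $n = n_1 + \cdots + n_k$ denotes the total sample size of the pooled sample.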
Ref. [13] established the MMD for two distributions in a separable metric space (see, e.g., Theorem 5 of [13]). Here, we extend it naturally for multi-sample distributions in . According to the MMD of [13], for any , where , “testing vs. ” based on the two samples and is equivalent to “testing vs. ” based on the two samples and . Therefore, testing (2) using the k original samples in (1) is equivalent to testing the following hypothesis using the k induced samples in (3):
To test (4), following [19], a natural L²-norm-based test statistic using (3) is given by
where and denote the group and grand sample means, respectively. Through some simple algebra, as given in Appendix A, we can express as
Let denote the weights of the k distributions such that and . Then, when we estimate using , the test statistic estimates the following quantity:
which can be naturally defined as the MMD of the k distributions with the weight vector . It is worth noting that the MMD for multiple distributions presented above is equivalent to the one derived by [18], offering a much simpler alternative compared to the formulation proposed by [17].
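As a hedged sketch of the quantities just described, assuming the weights are the sample proportions $w_i = n_i/n$ and writing $\bar{\mathbf{y}}_i^{*}$ and $\bar{\mathbf{y}}^{*}$ for the group and grand means of the induced samples (the symbol $T_n$ is ours):
$$ T_n \;=\; \sum_{i=1}^{k} n_i \bigl\| \bar{\mathbf{y}}_i^{*} - \bar{\mathbf{y}}^{*} \bigr\|_{\mathcal{H}}^{2}, \qquad \mathrm{MMD}^2(F_1, \ldots, F_k; \mathbf{w}) \;=\; \sum_{i=1}^{k} w_i \bigl\| \boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}}_{\mathbf{w}} \bigr\|_{\mathcal{H}}^{2}, \qquad \bar{\boldsymbol{\mu}}_{\mathbf{w}} = \sum_{i=1}^{k} w_i \boldsymbol{\mu}_i . $$
Under this reading, $T_n/n$ is the plug-in (V-statistic-type) estimator of the squared MMD obtained by replacing each $\boldsymbol{\mu}_i$ with the corresponding group sample mean.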
It is easy to justify that is indeed an MMD of the k distributions in the sense that if and only if . On the one hand, when , we have so that for any , we have , implying that for any , we have and hence . On the other hand, when , for any , we have so that for any , we have and hence . Therefore, we can use to test the equality of k distributions based on the k induced samples (3).
2.2. Computation of the Test Statistic
Notice that the k induced samples (3) are not directly computable, as the canonical feature mapping is defined only through the reproducing kernel. Fortunately, the reproducing kernel and its canonical feature mapping can be utilized via the following useful kernel trick: $\langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle_{\mathcal{H}} = \mathcal{K}(\mathbf{x}, \mathbf{y})$. Using this, we can express the inner product as follows:
Let and . Then, using (7), we have
Therefore, using (6), we can compute using any of the following useful expressions:
In other words, the value of the test statistic can be computed from the original k samples (1) using the above expressions.
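The kernel-trick computation above lends itself to a compact implementation. The sketch below assumes the statistic form $T_n = \sum_i n_i \|\bar{\mathbf{y}}_i^{*} - \bar{\mathbf{y}}^{*}\|_{\mathcal{H}}^2$ from the previous sketch; the function name and the pooled-sample ordering (group 1 first, then group 2, and so on) are our conventions, not the paper's.

```python
import numpy as np

def mmd_vstat(K, ns):
    """k-sample MMD-type statistic computed from the pooled Gram matrix.

    K  : (n, n) Gram matrix of the pooled sample, K[a, b] = kernel(z_a, z_b),
         with observations ordered group by group.
    ns : sequence of group sizes (n_1, ..., n_k).

    Assumed form: T = sum_i n_i * ||ybar*_i - ybar*||_H^2, which by the kernel
    trick reduces to T = sum_i (sum of K over group-i block)/n_i - (total sum of K)/n.
    """
    ns = np.asarray(ns)
    n = ns.sum()
    starts = np.concatenate(([0], np.cumsum(ns)[:-1]))   # first index of each group
    within = sum(K[s:s + m, s:s + m].sum() / m for s, m in zip(starts, ns))
    return within - K.sum() / n
```

Only the Gram matrix of the pooled sample is needed; the feature map itself is never evaluated.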
2.3. Asymptotic Null Distribution
To explore the null distribution of , we can rewrite as
where represents the weighted average of the mean embeddings of the k distributions. Additionally, we have
In the expression for , one can observe that the mean embeddings of the k distributions have been subtracted. Consequently, follows the same distribution as that of under the null hypothesis. Thus, studying the null distribution of is equivalent to studying the distribution of .
Similar to the proof of (6), we can express as follows:
Let denote the centered version of , defined as
where , , and and are independent copies of and , respectively. We can observe two useful properties: when , we have
When and are independent, we have
Using (12), we can express
Let
It is evident that and can be considered as centered versions of and , respectively. Consequently, by using (11) and (16), we can express
Utilizing (15) and performing some straightforward algebraic manipulations, we can express
Assuming that the centered kernel $\tilde{\mathcal{K}}$ is square-integrable, i.e., $\iint \tilde{\mathcal{K}}^2(\mathbf{x}, \mathbf{y})\, dF(\mathbf{x})\, dF(\mathbf{y}) < \infty$, we can express $\tilde{\mathcal{K}}$ using Mercer’s expansion:
$$ \tilde{\mathcal{K}}(\mathbf{x}, \mathbf{y}) = \sum_{r=1}^{\infty} \lambda_r \varphi_r(\mathbf{x}) \varphi_r(\mathbf{y}), \qquad (19) $$
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ represent the eigenvalues of $\tilde{\mathcal{K}}$, and $\varphi_1, \varphi_2, \ldots$ are the corresponding orthonormal eigenelements, satisfying
$$ \int \varphi_r(\mathbf{x}) \varphi_s(\mathbf{x})\, dF(\mathbf{x}) = \delta_{rs}, $$
where $\delta_{rs}$ equals 1 when $r = s$ and 0 otherwise. Now, let us introduce the following conditions:
- C1.
- We have $F_1 = \cdots = F_k = F$.
- C2.
- As $n \to \infty$, we have $n_i/n \to \rho_i \in (0, 1)$ for $i = 1, \ldots, k$.
- C3.
- is a reproducing kernel such that .
Condition C1 assumes that the null hypothesis is satisfied and the common distribution function is F. Condition C2 is a regularity condition for k-sample problems and it requires that the group sample sizes tend to ∞ proportionally. Condition C3 is required such that is square-integrable and expression (19) is valid.
In fact, under Condition C3, using (19) and the Cauchy–Schwarz inequality, we obtain the following results:
where . These inequalities hold due to the square-integrability assumption and the properties of Mercer’s expansion. Now, let us state the following theorem that establishes the asymptotic distribution of .
Theorem 1.
Under Conditions C1–C3, as , we have , where
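As a hedged reading of the limit's general form, assuming (as the proof in Appendix A suggests) independent chi-squared variables with k − 1 degrees of freedom and writing $\lambda_r$ for the Mercer eigenvalues in (19):
$$ \zeta \;=\; \sum_{r=1}^{\infty} \lambda_r A_r, \qquad A_1, A_2, \ldots \ \overset{\text{i.i.d.}}{\sim}\ \chi^2_{k-1}. $$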
It is worth highlighting that the limit null distribution of the proposed test statistic differs from the one derived in [18] (Theorem 1) and it offers a more straightforward alternative compared to the limit null distribution obtained by [17] (Theorem 1). This explains why the limit null distribution presented by [17] is not employed to approximate the null distribution of their test statistic. However, as demonstrated in Section 3, it is indeed feasible to utilize this distribution if desired.
2.4. Mean and Variance of
Theorem 2.
Under Condition C1, we have , and
where .
Note that under Condition C2, we have as . Then as , we have .
2.5. Asymptotic Power
In this subsection, we examine the asymptotic power of the proposed test under the following local alternative hypothesis:
where and are constant elements in the RKHS such that
and
It is important to note that when , the local hypothesis (24) simplifies into a fixed alternative hypothesis for . However, when , Equation (24) represents a strict local alternative hypothesis. In this case, as n → ∞, the strict local hypothesis converges toward the null hypothesis, and detecting it becomes exceedingly challenging. A test is typically considered root-n-consistent if it can detect a strict local alternative hypothesis with a probability approaching 1 as n → ∞. A root-n-consistent test is considered effective because it achieves the best possible detection rate for a local alternative hypothesis as n grows.
Theorem 3.
Assume that for all for some . Then, we have and , where .
Theorem 4.
Assume that for all for some . Then, under Condition C2 and the local alternative hypothesis (24), as , we have (a) ; (b) ; and (c)
where denotes the upper percentile of with ϵ being the given significance level, ; are defined in Condition C2; ; and denotes the cumulative distribution of .
Theorem 4 shows that the proposed test is indeed a root-n-consistent test.
3. Methods for Implementing the Proposed Test
In this section, we outline three different approaches for approximating the null distribution of the test statistic (8) in order to conduct the proposed test. These methods include a parametric bootstrap approach, a random permutation technique, and a χ²-approximation method. We evaluate and compare their performance in the next section.
3.1. Parametric Bootstrap Method
Theorem 1 reveals that the asymptotic null distribution of takes the form of a chi-squared-type mixture denoted as (22). The coefficients of this mixture are determined by the unknown eigenvalues of , where are independently and identically distributed according to the common distribution function F representing the k distributions when the null hypothesis is valid. Consequently, in order to estimate the asymptotic null distribution of , it is essential to consistently estimate . This consistency can be achieved by utilizing the empirical eigenvalues of the centered Gram matrix, as suggested by [20], to construct a reliable estimator for (22).
Let us recall that represents the total sample size. We pool the k samples (1) and denote it as
Under the null hypothesis, are independently and identically distributed from the common distribution F. Let represent the Gram matrix, where the th entry is defined as for . Additionally, let denote a vector of ones with dimensions , and denote the identity matrix of size . Then, the matrix is a projection matrix of size with rank .
Now, define , commonly referred to as the centered Gram matrix. Its th entry is given by
where . As n approaches infinity, for any fixed i and j, we can observe that, by the law of large numbers:
Let be all the non-zero eigenvalues of that can be obtained via an eigen-decomposition of . Set . Then, following [20], we can show that under Condition C3, the distribution of can be consistently estimated by
The parametric bootstrap method can be described as follows. Let us choose a large value for N, for example, . Using expression (28), we can obtain a sample of by independently generating a total of N times. Now, let represent the observed test statistic calculated using (8) based on the k samples (1). Using the parametric bootstrap method, we can conduct the proposed test by calculating the approximate p-value, which is given by , where is an indicator function that takes 1 when S is a true event and 0 otherwise.
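A minimal sketch of this parametric bootstrap, under the assumptions used in the earlier sketches: the statistic has the form computed by mmd_vstat, and its null distribution is approximated by the chi-squared-type mixture $\sum_r \hat{\lambda}_r \chi^2_{k-1}$, with $\hat{\lambda}_r$ the non-zero eigenvalues of the centered Gram matrix scaled by 1/n. The function name, the numerical threshold for "non-zero", and the default N are ours.

```python
import numpy as np

def bootstrap_pvalue(K, ns, T_obs, N=10_000, rng=None):
    """Parametric-bootstrap p-value for the k-sample MMD-type statistic.

    K     : (n, n) Gram matrix of the pooled sample.
    ns    : group sizes (n_1, ..., n_k).
    T_obs : observed test statistic (e.g., mmd_vstat(K, ns)).
    """
    rng = np.random.default_rng(rng)
    ns = np.asarray(ns)
    n, k = int(ns.sum()), len(ns)
    H = np.eye(n) - np.ones((n, n)) / n          # centering projection
    lam = np.linalg.eigvalsh(H @ K @ H / n)      # eigenvalues of the centered Gram matrix / n
    lam = lam[lam > 1e-10]                       # keep the numerically non-zero ones
    # N independent draws of the chi-squared-type mixture sum_r lam_r * chi2_{k-1}
    mix = rng.chisquare(df=k - 1, size=(N, lam.size)) @ lam
    return float(np.mean(mix >= T_obs))
```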
3.2. Random Permutation Method
We can also approximate the null distribution of using a random permutation method. Let represent a random permutation of the indices from the pooled sample (26). Consequently, the sequence
forms a permutation of the pooled sample (26). To create permuted samples, we utilize the first observations in the permuted pooled sample (29) as the first permuted sample, the next observations as the second permuted sample, and so on, until we obtain k permuted samples. These permuted samples are denoted as
The permuted test statistic, denoted as , is calculated using (8) but with the k samples (1) replaced by the k permuted samples (30).
The random permutation method proceeds as follows. Let N be a sufficiently large number, for instance, . Suppose we repeat the permutation process described above N times, resulting in N permuted test statistics denoted as . Then, we can use the empirical distribution of these permuted statistics to approximate the null distribution of the test statistic. Recall that represents the test statistic computed using (8) based on the k original samples (1). Following the random permutation method, the proposed test can be conducted by calculating the approximated p-value, given by .
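A sketch of the random permutation method, reusing mmd_vstat from the Section 2.2 sketch. Permuting the rows and columns of the pooled Gram matrix is equivalent to permuting the pooled observations before splitting them into groups of sizes $n_1, \ldots, n_k$, so the kernel never has to be re-evaluated; the function name and the default N are ours.

```python
import numpy as np

def permutation_pvalue(K, ns, T_obs, N=10_000, rng=None):
    """Random-permutation p-value for the k-sample MMD-type statistic."""
    rng = np.random.default_rng(rng)
    n = int(np.sum(ns))
    stats = np.empty(N)
    for b in range(N):
        idx = rng.permutation(n)                       # random relabelling of the pooled sample
        stats[b] = mmd_vstat(K[np.ix_(idx, idx)], ns)  # statistic on the permuted groups
    return float(np.mean(stats >= T_obs))
```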
3.3. Welch–Satterthwaite χ²-Approximation Method
The parametric bootstrap method and the random permutation method are effective for controlling size but can be computationally intensive, particularly with large total sample sizes. To address this issue, we can utilize the well-known Welch–Satterthwaite (W–S) χ²-approximation method [21,22]. This method is known to be reliable for approximating the distribution of a chi-squared-type mixture. Theorem 1 demonstrates that the asymptotic null distribution of the test statistic is a chi-squared-type mixture (22).
The core concept of the W–S χ²-approximation method is to approximate the null distribution of the test statistic using that of a random variable of the form $W = \beta \chi^2_d$, where β and d are unknown parameters. These parameters can be determined by matching the means and variances of W and of the quantity defined in (10), which has the same distribution as the test statistic under the null hypothesis. Specifically, the mean and variance of W are $\beta d$ and $2\beta^2 d$, respectively, while the mean and variance of the quantity in (10) are given in Theorem 2. Equating these means and variances, we obtain
$$ \beta = \frac{\mathrm{Var}_0}{2\,\mathrm{E}_0}, \qquad d = \frac{2\,\mathrm{E}_0^2}{\mathrm{Var}_0}, \qquad (31) $$
where E₀ and Var₀ denote the null mean and variance given in Theorem 2.
To implement the W–S -approximation method, we need to consistently estimate and based on the pooled sample (26) from the k samples (1). According to Theorem 2, these estimates can be obtained as follows:
where
with being defined in (27). Substituting these estimators into (31), we obtain
Let denote the upper 100α percentile of $\hat{\beta}\chi^2_{\hat{d}}$, where α is the given significance level, and let denote the observed test statistic computed using (8) based on the k samples (1). Then, through the W–S χ²-approximation method, the proposed test can be conducted by rejecting the null hypothesis when the observed test statistic exceeds this percentile or when the approximated p-value is less than α.
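A sketch of the W–S χ²-approximation. The paper estimates the null mean and variance via the estimators of Theorem 2; the stand-in below instead plugs in the eigenvalues of the centered Gram matrix, which targets the same limiting quantities under the mixture form assumed earlier (mean ≈ (k − 1)Σλ_r, variance ≈ 2(k − 1)Σλ_r²). All names are ours.

```python
import numpy as np
from scipy import stats

def ws_chi2_test(K, ns, T_obs):
    """Welch-Satterthwaite approximation: null law of the statistic ~ beta * chi2_d."""
    ns = np.asarray(ns)
    n, k = int(ns.sum()), len(ns)
    H = np.eye(n) - np.ones((n, n)) / n
    lam = np.linalg.eigvalsh(H @ K @ H / n)
    lam = lam[lam > 1e-10]
    mean_hat = (k - 1) * lam.sum()               # estimated null mean
    var_hat = 2 * (k - 1) * np.sum(lam ** 2)     # estimated null variance
    beta_hat = var_hat / (2 * mean_hat)          # beta = Var / (2 * Mean), as in (31)
    d_hat = 2 * mean_hat ** 2 / var_hat          # d = 2 * Mean^2 / Var, as in (31)
    pval = float(stats.chi2.sf(T_obs / beta_hat, df=d_hat))
    return beta_hat, d_hat, pval
```

The null hypothesis is then rejected when T_obs exceeds beta_hat times the upper 100α percentile of the chi-squared distribution with d_hat degrees of freedom, or equivalently when pval < α.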
4. Simulation Studies
In this section, we delve into intensive simulation studies aimed at assessing the performance of the test we propose when compared to the energy test introduced by [12], which we denote as . Our proposed test employs the parametric bootstrap, the random permutation, and the W–S χ²-approximation methods as described in Section 3. For simplicity, we refer to the resulting tests as , , and , respectively.
For simplicity, we opt for the Gaussian Radial Basis Function (RBF) kernel, denoted as , which is defined as follows:
Here, is referred to as the kernel width. It is worth noting that the Gaussian RBF kernel described above is bounded by 1, ensuring that Condition C3 is always met. Following the approach outlined in [20], we set to be equal to the median distance between observed vectors in the pooled sample (26).
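A sketch of the Gaussian RBF Gram matrix with the median-distance kernel width described above. Here the kernel is taken as exp(−‖x − y‖²/(2·width²)), which is one common convention and an assumption on our part; the function name is also ours.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_gram(X, width=None):
    """Gaussian RBF Gram matrix for the pooled sample X (n rows, p columns).

    If width is None, it is set to the median Euclidean distance between the
    observed vectors in the pooled sample (the median heuristic).
    """
    D = squareform(pdist(X, metric="euclidean"))             # (n, n) distance matrix
    if width is None:
        width = np.median(D[np.triu_indices_from(D, k=1)])   # median pairwise distance
    return np.exp(-(D ** 2) / (2.0 * width ** 2))
```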
We also employ the Average Relative Error (ARE) introduced by [23] to evaluate the overall effectiveness of a test in maintaining its nominal size. The ARE is calculated as follows: , where represents the empirical sizes observed across M different simulation settings. A smaller ARE value indicates better performance of a test in terms of size control. In this simulation study, we set the nominal size to .
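For reference, with $\hat{\alpha}_1, \ldots, \hat{\alpha}_M$ denoting the empirical sizes across the M settings and α the nominal size, the ARE criterion of [23] is usually written as follows (a sketch, in our notation):
$$ \mathrm{ARE} \;=\; \frac{100}{M} \sum_{j=1}^{M} \frac{\lvert \hat{\alpha}_j - \alpha \rvert}{\alpha}. $$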
4.1. Simulation 1
We set k = 3 for simplicity. We generate the samples (1) as follows. We set
where , while , are generated using the following three models:
- Model 1.
- .
- Model 2.
- with ; .
- Model 3.
- with .
It is important to note that and play pivotal roles as tuning parameters that govern the similarity of the distributions among the three generated samples. Specifically, when both and are set to zero (), the three samples generated from the three models exhibit identical distributions. If at least one of , where i can be 1 or 2, is non-zero, the samples still share the same mean vector but differ in their covariance matrices. Furthermore, it is worth mentioning that the test’s power increases as both and increase.
We set as and as , where denotes a matrix of ones with dimensions . To assess the performance of the considered tests across a range of dimensionality settings, we examine three cases: , , and . For each of these cases, we consider three sets of sample sizes : , , and . Additionally, we investigate three levels of correlation: , , and . These three values of correspond to samples with varying degrees of correlation, ranging from nearly uncorrelated to moderately correlated and highly correlated. Notably, correlation increases as grows. For simplicity, we set and across all three models.
In the case of the parametric bootstrap and random permutation methods, as well as the energy test, we use a total of replicates for computing the p-values at each simulation run, as described in Section 3.1 and Section 3.2. It is worth noting that the W–S χ²-approximation method, which does not require generating replicates, is the least time-consuming among the methods considered. The empirical sizes and powers are computed based on 1000 simulation runs.
Table 1 provides an overview of the empirical sizes of , , , and , with the last row displaying the associated ARE values for the three different values. Several observations can be made based on this table: Firstly, for nearly uncorrelated samples (), exhibits a slight tendency to be liberal, with an ARE value of that is marginally higher than the ARE values of the other three tests. Secondly, when the generated samples are moderately correlated () or highly correlated (), all four tests demonstrate fairly similar empirical sizes and ARE values, making them comparable in terms of size control. Finally, it is seen that the influence of sample sizes on size control is relatively minor, even though, in theory, a larger total sample size should result in better size control.
Table 1.
Empirical sizes (in %) of Simulation 1.
Figure 1 displays the empirical powers of all four tests in scenarios where all three generated samples have the same mean vectors, but they differ from each other in covariance matrices. Several conclusions can be drawn regarding these power values: Firstly, for , and , , , and exhibit similar empirical powers. This suggests that these three tests perform comparably, regardless of whether the data are nearly uncorrelated, moderately correlated, or highly correlated. Secondly, it is seen that under similar settings, as expected, the empirical powers of the tests generally increase with larger sample sizes. Finally, the empirical powers of consistently rank the lowest among all four tests. This indicates that is less powerful compared to the other three tests in these scenarios.
Figure 1.
Simulation 1. The empirical powers (in %) of , , , and under different cases of : 1. (10, 20, 30, 40, 1.2, 0.6), 2. (10, 80, 120, 160, 0.75, 0.375), 3. (10, 160, 240, 320, 0.65, 0.325), 4. (100, 20, 30, 40, 1.3, 0.65), 5. (100, 80, 120, 160, 0.85, 0.425), 6. (100, 160, 240, 320, 0.72, 0.36), 7. (500, 20, 30, 40, 1.65, 0.825), 8. (500, 80, 120, 160, 1, 0.5), 9. (500, 160, 240, 320, 0.8, 0.4).
4.2. Simulation 2
Admittedly, the MMD-based tests , , and may not always outperform the energy-distance-based test as they did in Simulation 1. To illustrate this point, we keep the same experimental framework as described in Simulation 1 but introduce a new collection of three models, defined as follows:
- Model 4.
- .
- Model 5.
- with .
- Model 6.
- with , .
Please be aware that , where i takes values 1 and 2 and r ranges from 1 to p, is adjusted to ensure that and across all three models. When both and are set to 0, the three generated samples follow the same distributions as in Simulation 1. Consequently, the empirical sizes of all four tests in Simulation 2 will be similar to those observed in Simulation 1. However, when at least one of (where i takes values 1 and 2) is non-zero, the three samples exhibit distinct mean vectors and covariance matrices, differing from those observed in Simulation 1. As a result, we focus on the empirical powers of all four tests based on Models 4–6.
Figure 2 presents the empirical powers of all four tests based on Models 4–6, offering several noteworthy insights. First, it is evident that , , and demonstrate similar empirical powers. This implies that these three tests exhibit comparable performance regardless of whether the generated data follow a normal or non-normal distribution. Second, it is also observed that under similar settings, the empirical powers of the tests generally increase with larger sample sizes. Lastly, when takes on values of 0.5 and 0.9, the empirical powers of surpass those of the other three tests. However, when , the empirical powers of all four tests are generally comparable. This indicates that, in the scenarios under consideration, demonstrates greater effectiveness compared to the other three tests when the correlation coefficient is relatively high.
Figure 2.
Simulation 2. The empirical powers (in %) of , , , and under different cases of : 1. (10, 20, 30, 40, 0.37, 0.185), 2. (10, 80, 120, 160, 0.195, 0.097), 3. (10, 160, 240, 320, 0.18, 0.09), 4. (100, 20, 30, 40, 0.142, 0.071), 5. (100, 80, 120, 160, 0.068, 0.034), 6. (100, 160, 240, 320, 0.044, 0.022), 7. (500, 20, 30, 40, 0.06, 0.03), 8. (500, 80, 120, 160, 0.028, 0.014), 9. (500, 160, 240, 320, 0.022, 0.011).
From these two simulation studies, it can be inferred that the proposed MMD-based tests , , and may outperform the energy-distance-based test when the differences in distributions are primarily in covariance matrices, while the reverse could be true when the differences in distribution involve both mean vectors and covariance matrices. Notably, the MMD-based test generally requires less computational effort compared to the bootstrap or permutation-based tests , , and .
5. Application to the Corneal Surface Data
The corneal surface data are briefly mentioned in Section 1. They were acquired during a keratoconus study, a collaborative project involving Ms. Nancy Tripoli and Dr. Kenneth L. Cohen from the Department of Ophthalmology at the University of North Carolina, Chapel Hill. This dataset comprises 150 observations with each corneal surface having more than 6000 height measurements. It can be categorized into four distinct groups: a group of 43 healthy corneas (referred to as the normal cornea group), a group of 14 corneas with unilateral suspect characteristics, a group of 21 corneas with suspect map features, and a group of 72 corneas clinically diagnosed with keratoconus. It is important to note that the corneal surfaces within the normal, unilateral suspect, and suspect map groups exhibit similar shapes, but they significantly differ from the corneal surfaces observed in the clinical keratoconus group (refer to Figure 1 in [24] for visualization).
In the process of reconstructing a corneal surface, ref. [24] utilized the Zernike regression model to fit the height measurements associated with the corneal surface. The height of the corneal surface at a specific radius r and angle is denoted as , while represents the height estimated through the fitted model within the predefined region of interest. This region of interest spans from to and from to , with being a predetermined positive constant. To naturally represent each corneal surface, a feature vector is constructed, consisting of values , where i ranges from 1 to K and j ranges from 1 to L. These values are obtained by evaluating the fitted corneal surface at a grid of points defined as and for and . For simplicity, we choose to set and , resulting in a feature vector with dimensions of 2000 for each corneal surface.
For simplicity, we put the fitted feature vectors for the complete corneal surface dataset collectively into a feature matrix with dimensions of . In this matrix, each row corresponds to a feature vector representing a corneal surface. Specifically, the initial 43 rows of the feature matrix correspond to observations from the normal group, sequentially followed by 14 rows from the unilateral suspect group, 21 rows from the suspect map group, and lastly, 72 rows from the clinical keratoconus group.
Our objective is to examine whether there are significant differences in the distributions among various corneal surface groups, referred to as multi-sample problems for the equality of distributions for high-dimensional data, given that the high-dimensional feature vectors represent the observations of the corneal surface data. In this application, we employ , , , and to address these problems. For both the parametric bootstrap and random permutation methods, as well as the energy test, we perform a total of N = 10,000 replicates to compute the associated p-values. For simplicity, we denote the normal, unilateral suspect, suspect map, and clinical keratoconus groups as NOR, UNI, SUS, and CLI, respectively.
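Putting the earlier sketches together, a hypothetical driver for the four-group comparison might look as follows. The feature-matrix file name, the variable names, and the reuse of gaussian_gram, mmd_vstat, bootstrap_pvalue, permutation_pvalue, and ws_chi2_test from the previous sketches are all our assumptions; only the group sizes (43, 14, 21, 72) and N = 10,000 come from the text.

```python
import numpy as np

# Hypothetical input: the 150 x 2000 feature matrix described above, with rows
# ordered as NOR (43), UNI (14), SUS (21), CLI (72).
X = np.load("corneal_features.npy")         # placeholder file name
ns = np.array([43, 14, 21, 72])

K = gaussian_gram(X)                        # median-heuristic RBF Gram matrix (Section 4)
T_obs = mmd_vstat(K, ns)                    # observed test statistic (Section 2.2)

p_pb = bootstrap_pvalue(K, ns, T_obs, N=10_000)      # parametric bootstrap (Section 3.1)
p_rp = permutation_pvalue(K, ns, T_obs, N=10_000)    # random permutation (Section 3.2)
_, _, p_ws = ws_chi2_test(K, ns, T_obs)              # W-S chi-squared approximation (Section 3.3)
print(p_pb, p_rp, p_ws)
```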
Table 2 displays the results obtained from the application of four statistical tests, namely , , , and , to assess the equality of distributions among different corneal surface groups. A careful examination of these results yields several noteworthy insights. To begin with, when considering the comparison among the corneal surface groups labeled “NOR vs. UNI vs. SUS vs. CLI”, it is evident that all four tests reject the null hypothesis. This rejection signifies that there exists at least one significant difference among the distributions of these four corneal surface groups. Consequently, further investigation is warranted to identify which specific group or groups differ from the others. Secondly, it is important to highlight the outcome for the comparison involving “NOR vs. UNI vs. SUS”. In this case, none of the four tests reject the null hypothesis at the significance level. This outcome indicates that the normal, unilateral suspect, and suspect map groups share a similar distribution pattern. Thirdly, across the remaining three comparisons, all four tests consistently reject the null hypothesis. These results align with the observations depicted in Figure 1 of [24], which illustrates the distinctiveness of corneal surfaces within the clinical keratoconus group when compared to the other three groups. Fourthly, when focusing on the comparison “NOR vs. UNI vs. SUS”, it is worth noting that the p-values obtained from , , and are quite similar, suggesting their comparable performance. This consistency in p-values is also reflected in the empirical sizes presented in Table 1. Lastly, when analyzing the cases involving CLI, it becomes evident that the p-values generated by consistently exhibit larger values compared to those produced by the other three tests. This discrepancy implies that may have a lower sensitivity in detecting distribution differences when compared to the other tests, indicating the potentially reduced statistical power in this real data example.
Table 2.
p-values (in %) for testing the distribution equality of corneal surface groups.
Notice that the test is bootstrap-based and the tests and are permutation-based. This means that their p-values are obtained by bootstrapping or permuting numerous random samples, as described in Section 3.1 and Section 3.2. Thus, the p-values of these tests are random, i.e., they differ from one run to another. However, the p-value of remains fixed. To investigate this, we applied , and to the case “NOR vs. UNI vs. SUS” 500 times. The boxplots of the corresponding p-values of the four tests are presented in Figure 3. It is evident that the p-values obtained from remain fixed, whereas those derived from , and exhibit variability. This contrast underscores the fact that p-values resulting from bootstrap-based or permutation-based tests indeed vary across runs.
Figure 3.
Boxplots of p-values of , , , and when applied to the case “NOR vs. UNI vs. SUS” of the corneal surface data 500 times.
6. Concluding Remarks
Testing whether multiple high-dimensional samples adhere to the same distribution is a common area of research interest. This paper introduces and investigates a novel MMD-based test to address this question. The null distribution of this test is approximated using three methods: parametric bootstrap, random permutation, and the W–S χ²-approximation approach. Results from two simulation studies and a real data application demonstrate that the proposed test exhibits effective size control and superior statistical power compared to the energy test introduced by [12] when the differences among sample distributions are primarily related to covariance matrices rather than mean vectors. Thus, the proposed test is generally well suited for conducting multi-sample equal distribution testing on high-dimensional data. We particularly recommend its use in scenarios where distribution differences are associated with covariance matrices. Conversely, when distribution differences predominantly pertain to means, the energy test is a more powerful choice. However, in practice, determining whether distribution differences are related to means or covariance matrices can be challenging. Therefore, we suggest considering both the new test and the energy test as viable options. Nevertheless, it is important to note that implementing the proposed test comes with certain challenges. Both the parametric bootstrap and random permutation methods can be computationally intensive, and their p-values vary across different applications. In contrast, the W–S χ²-approximation method offers computational efficiency and produces fixed p-values. However, its accuracy is limited, as it relies solely on matching two cumulants of the test statistic under the null hypothesis.
An intriguing question arises naturally: Can we enhance the accuracy of the proposed test by matching three cumulants of the test statistic? Recent work by [18] suggests that this is indeed possible. However, deriving the third cumulant of the test statistic presents a current challenge and requires further investigation. Another aspect to consider is the choice of kernel width. While the paper opts for simplicity by utilizing the median distance between observed vectors in the pooled sample, it is worth exploring the kernel width choice recommended by [18] to potentially enhance the test’s statistical power. These avenues for future research promise exciting developments and warrant further exploration.
Author Contributions
Conceptualization, J.-T.Z. and A.A.C.; methodology, J.-T.Z., T.Z. and Z.P.O.; software, Z.P.O. and T.Z.; validation, J.-T.Z. and T.Z.; formal analysis, Z.P.O.; investigation, J.-T.Z. and Z.P.O.; resources, Z.P.O.; data curation, Z.P.O.; writing—original draft preparation, J.-T.Z. and Z.P.O.; writing—review and editing, J.-T.Z. and T.Z.; visualization, A.A.C. and T.Z.; supervision, J.-T.Z.; project administration, J.-T.Z.; funding acquisition, J.-T.Z. All authors have read and agreed to the published version of the manuscript.
Funding
Zhang’s research was partially funded by the National University of Singapore academic research grant (22-5699-A0001).
Data Availability Statement
Publicly available datasets were analyzed in this study. This data can be found here: https://tandf.figshare.com/articles/dataset/Linear_hypothesis_testing_with_functional_data/6063026/1?file=10914914 (accessed on 20 October 2023).
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| MMD | Maximum Mean Discrepancy |
| EDF | Empirical Distribution Function |
| RKHS | Reproducing Kernel Hilbert Spaces |
| RBF | Radial Basis Function |
| ARE | Average Relative Error |
Appendix A. Technical Proofs
Proof of Theorem 1.
Under Condition C1, we have . Let . Under Condition C3, we have Mercer’s expansion (19). Through (14) and (20), we have
This, together with (20), implies that
Set . Under Condition C1, we have .
It follows that for a fixed , are i.i.d. with mean 0 and variance 1. For different r, are uncorrelated. Then, through (18) and (19), we have
where . Through (17), we have
where and .
Let denote the characteristic function of a random variable X. Set . Then, we have For any given , through the central limit theorem, under Conditions C2 and C3, as , we have . Set . Then, as , we have and are independently identically distributed. Set . Under Condition C2, we have . Thus, . Both and are idempotent matrices with rank . It follows that
which is a chi-squared distribution with degrees of freedom and are independent. It follows that as , we have and . Therefore, as , we have
It follows that
Let t be fixed. Under Condition C3 and (21), as , we have . Thus, for any given , there exist and , depending on and , such that as and , we have
For any fixed , through (A2), as , we have . Thus, there exists , depending on q and , such that as , we have
Recall that . Along the same lines as those for proving (A4), we can show that there exists , depending on and , such that as , we have
The convergence in distribution of to follows as we can let . □
Proof of Theorem 2.
Under Condition C1, let . Then, through (23), we have and . Thus, we have
Through (23) again, we have
□
Proof of Theorem 3.
First, through (12), we have for all , . Thus, we have
Then, through Theorem 2, we have
Finally, since are independent, we have
Furthermore, through the Cauchy–Schwarz inequality, we have
□
Proof of Theorem 4.
Under the given conditions, through Theorems 2 and 3, we have
and as ,
where are given in Condition C2. It follows that as , we have and . Thus, through the Markov inequality, as , we have
for all . Therefore, we have and (a) is proved. To prove (b), notice that through the central limit theorem, as , we have
Since are independent and , we have . To prove (c), notice that as , through (25), (a) and (b), we have
Thus, as , and through (A7), we have
where and . Thus, the theorem is proved. □
References
- Lehmann, E.L. Nonparametrics: Statistical Methods Based on Ranks; Springer: New York, NY, USA, 2006. [Google Scholar]
- Friedman, J.H.; Rafsky, L.C. Multivariate Generalizations of the Wald–Wolfowitz and Smirnov Two-Sample Tests. Ann. Stat. 1979, 7, 697–717. [Google Scholar] [CrossRef]
- Schilling, M.F. Multivariate Two-Sample Tests Based on Nearest Neighbors. J. Am. Stat. Assoc. 1986, 81, 799–806. [Google Scholar] [CrossRef]
- Baringhaus, L.; Franz, C. On a new multivariate two-sample test. J. Multivar. Anal. 2004, 88, 190–206. [Google Scholar] [CrossRef]
- Rosenbaum, P.R. An exact distribution-free test comparing two multivariate distributions based on adjacency. J. R. Stat. Soc. Ser. B 2005, 67, 515–530. [Google Scholar] [CrossRef]
- Biswas, M.; Mukhopadhyay, M.; Ghosh, A.K. A distribution-free two-sample run test applicable to high-dimensional data. Biometrika 2014, 101, 913–926. [Google Scholar] [CrossRef]
- Li, J. Asymptotic normality of interpoint distances for high-dimensional data with applications to the two-sample problem. Biometrika 2018, 105, 529–546. [Google Scholar] [CrossRef]
- Chen, H.; Friedman, J.H. A New Graph-Based Two-Sample Test for Multivariate and Object Data. J. Am. Stat. Assoc. 2017, 112, 397–409. [Google Scholar] [CrossRef]
- Hall, P.; Tajvidi, N. Permutation Tests for Equality of Distributions in High-Dimensional Settings. Biometrika 2002, 89, 359–374. [Google Scholar] [CrossRef]
- Wei, S.; Lee, C.; Wichers, L.; Marron, J.S. Direction-Projection-Permutation for High-Dimensional Hypothesis Tests. J. Comput. Graph. Stat. 2016, 25, 549–569. [Google Scholar] [CrossRef]
- Ghosh, A.K.; Biswas, M. Distribution-free high-dimensional two-sample tests based on discriminating hyperplanes. Test 2016, 25, 525–547. [Google Scholar] [CrossRef]
- Székely, G.J.; Rizzo, M.L. Testing for equal distributions in high dimension. InterStat 2004, 5, 1249–1272. [Google Scholar]
- Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A Kernel Two-Sample Test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
- Sejdinovic, D.; Sriperumbudur, B.; Gretton, A.; Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 2013, 41, 2263–2291. [Google Scholar] [CrossRef]
- Zhang, J.T.; Smaga, Ł. Two-sample test for equal distributions in separable metric space: New maximum mean discrepancy based approaches. Electron. J. Stat. 2022, 16, 4090–4132. [Google Scholar] [CrossRef]
- Zhou, B.; Ong, Z.P.; Zhang, J.T. A new MMD-based two-sample test for equal distributions in separable metric spaces. Manuscript 2023, in press.
- Balogoun, A.S.K.; Nkiet, G.M.; Ogouyandjou, C. k-Sample problem based on generalized maximum mean discrepancy. arXiv 2018, arXiv:1811.09103. [Google Scholar]
- Zhang, J.T.; Guo, J.; Zhou, B. Testing equality of several distributions in separable metric spaces: A maximum mean discrepancy based approach. J. Econom. 2022, in press. [CrossRef]
- Zhang, J.T.; Guo, J.; Zhou, B. Linear hypothesis testing in high-dimensional one-way MANOVA. J. Multivar. Anal. 2017, 155, 200–216. [Google Scholar] [CrossRef]
- Gretton, A.; Fukumizu, K.; Harchaoui, Z.; Sriperumbudur, B.K. A Fast, Consistent Kernel Two-Sample Test. In Advances in Neural Information Processing Systems 22; Curran Associates, Inc.: New York, NY, USA, 2009; pp. 673–681. [Google Scholar]
- Welch, B.L. The generalization of `student’s’ problem when several different population variances are involved. Biometrika 1947, 34, 28–35. [Google Scholar] [CrossRef]
- Satterthwaite, F.E. An Approximate Distribution of Estimates of Variance Components. Biom. Bull. 1946, 2, 110–114. [Google Scholar] [CrossRef]
- Zhang, J.T. Two-Way MANOVA with Unequal Cell Sizes and Unequal Cell Covariance Matrices. Technometrics 2011, 53, 426–439. [Google Scholar] [CrossRef]
- Smaga, Ł.; Zhang, J.T. Linear Hypothesis Testing with Functional Data. Technometrics 2019, 61, 99–110. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).