Abstract
With the development of modern data collection techniques, researchers often encounter high-dimensional data across various research fields. An important problem is to determine whether several groups of such high-dimensional data originate from the same population. To address this problem, this paper presents a novel k-sample test of equal distributions for high-dimensional data, utilizing the Maximum Mean Discrepancy (MMD). The test statistic is constructed using a V-statistic-based estimator of the squared MMD derived for several samples. The asymptotic null and alternative distributions of the test statistic are derived. To approximate the null distribution accurately, three simple methods are described. To evaluate the performance of the proposed test, two simulation studies and a real data example are presented, demonstrating the effectiveness and reliability of the test in practical applications.
Keywords:
multi-sample test; hypothesis testing; parametric bootstrap; random permutation; Welch–Satterthwaite χ²-approximation; chi-squared-type mixtures
MSC:
62H15
1. Introduction
Testing whether multiple samples follow the same distribution is a fundamental challenge in data analysis, with wide-ranging applications across diverse fields. Traditional nonparametric tests designed for comparing the distributions of two samples, such as the Wald–Wolfowitz runs test, the Mann–Whitney–Wilcoxon rank-sum test, and the Kolmogorov–Smirnov test utilizing the Empirical Distribution Function (EDF), are well established for univariate data [1]. Extending these tests to the multivariate setting has been the focus of extensive research. This has led to the development of novel tests based on multivariate runs, ranks, EDFs, distances, and projections, as pioneered by [2,3,4,5], among others.
In this paper, our primary focus is on addressing the multi-sample problem for equal distributions in high-dimensional data. In contemporary data analysis, high-dimensional datasets have become increasingly prevalent and easily accessible across various domains. For instance, Section 5 introduces a dataset derived from a keratoconus study involving corneal surfaces. This collaborative effort involves Ms. Nancy Tripoli and Dr. Kenneth L. Cohen from the Department of Ophthalmology at the University of North Carolina, Chapel Hill. The dataset comprises 150 corneal surfaces, each characterized by over 6000 height measurements. These surfaces are categorized into four distinct groups based on their corneal shapes, prompting the question of whether these groups share a common underlying distribution. Consequently, there is a pressing need to develop tests that can assess distributional equality in the context of high-dimensional data.
Mathematically, a multi-sample problem for high-dimensional data can be described as follows. Assume that we have the following k samples of observed random elements in $\mathbb{R}^p$:
$$ \mathbf{y}_{i1}, \ldots, \mathbf{y}_{i n_i} \overset{\text{i.i.d.}}{\sim} F_i, \quad i = 1, \ldots, k, \qquad (1) $$
where the data dimension p can be very large and $F_1, \ldots, F_k$ are unknown cumulative distribution functions. Of interest is to test whether the k distribution functions are the same:
$$ H_0: F_1 = \cdots = F_k \quad \text{vs.} \quad H_1: F_i \neq F_j \ \text{for some} \ 1 \le i < j \le k. \qquad (2) $$
When k = 2, a variety of distance-based tests designed for multivariate data, such as those proposed by [2,3,4,5], can potentially be applied to test (2) in the context of high-dimensional data. However, it has been demonstrated by [6] that these tests may lack the power to effectively detect differences in distribution scales in high dimensions. In response to this challenge, ref. [6] introduced a test based on interpoint distances, while [6] proposed a high-dimensional two-sample run test based on the shortest Hamiltonian path. Nevertheless, ref. [7] showed that these tests, along with a graph-based test by [8], are less effective in detecting differences in location. To address this limitation, ref. [7] introduced two asymptotic normal tests for both location and scale differences, based on interpoint distances. However, it is worth noting that these tests rely on a strong mixing condition and impose a natural ordering on the components of high-dimensional observations, limiting their applicability. Furthermore, these tests involve complex U-statistic estimators, making them computationally intensive. Several other approaches have also been proposed for high-dimensional distribution testing. Ref. [9] presented a high-dimensional permutation test using a symmetric measure of distance between data points, but it comes with high computational costs. Refs. [10,11] proposed tests based on projections, which are generally more effective in detecting low-dimensional distributional differences. Ref. [12] introduced a test based on energy distance and permutation, but it is less powerful in detecting scale differences. Refs. [13,14] proposed kernel two-sample tests based on the Maximum Mean Discrepancy (MMD). Ref. [14] demonstrated the equivalence between the energy-distance-based test and the kernel-based test, showing that the energy-distance-based test can be viewed as a kernel test utilizing a kernel induced by the interpoint distance. The MMD leverages the kernel trick to define a distance between the embeddings of distributions in Reproducing Kernel Hilbert Spaces (RKHSs). It is well suited for checking distribution differences among several high-dimensional samples and is applicable to various data types, such as vectors, strings, or graphs. Recently, there has been further investigation into unbiased and biased MMD-based two-sample tests for high-dimensional data by [15,16], respectively.
On the other hand, when k ≥ 3, there are limited tests available for testing (2). One such test is the energy test developed by [12]. This test statistic is obtained by directly summing all pairwise energy distances, with its null distribution approximated through permutation. However, this test has some drawbacks, including being time-consuming and yielding p-values that may vary when applied multiple times to the same dataset, as evidenced by Figure 3 in Section 5. Another approach is presented by [17], who extended the idea of MMD from the two-sample problem to the multi-sample problem for equal distributions. They constructed an MMD-based test statistic capable of detecting deviations from distribution homogeneity in several samples. However, the resulting test statistic and its null limit distribution are very complicated in form, which restricts their practical use. To address this limitation, ref. [18] developed a new MMD-based test for testing (2). This test statistic is constructed using U-statistics, making it easy to conduct and yielding accurate results.
In this paper, we maintain our focus on the MMD-based approach for testing (2). However, unlike [18], where a U-statistic technique is employed to construct the test statistic, we take a distinct path by constructing an L²-norm-based test statistic. To achieve this, we first employ a canonical feature map to transform the k original samples (1) into k induced samples (3). Concurrently, we transform the k-sample equal distribution testing problem (2) into a mean vector testing problem (4). This transformation facilitates the straightforward construction of an L²-norm-based test (5) for assessing the mean vector testing problem (4). Leveraging a kernel trick, we derive a formula for computing the L²-norm-based test statistic using the k original samples (1).
Additionally, this paper makes several other significant contributions. Firstly, akin to the work of [17,18], we extend the concept of MMD from two-sample problems to the domain of multi-sample problems with equal distributions. Secondly, we derive the asymptotic null and alternative distributions of the proposed test statistic. Thirdly, we offer three distinct approaches for approximating the null distribution of the MMD-based test statistic, utilizing parametric bootstrap, random permutation, and the Welch–Satterthwaite (W–S) χ²-approximation methods. Lastly, we examine two specific scenarios in our comprehensive simulation studies. In the first scenario, the samples have the same mean vector but different covariance matrices, while in the second scenario, the samples exhibit both distinct mean vectors and covariance matrices. Our simulation results demonstrate that the tests we propose effectively maintain precise control over size in both scenarios. However, in terms of empirical power, they outperform (underperform) the energy test introduced by [12] in the first (second) scenario. In other words, when the primary difference in the distributions of the samples lies in their covariance matrices, the new tests are the preferred choice in terms of statistical power.
The remainder of this paper is organized as follows: Section 2 presents the main results and Section 3 introduces three methods to implement our test. In Section 4, we provide two simulation studies. Section 5 showcases an application to the corneal surface data mentioned earlier. Concluding remarks can be found in Section 6. Technical proofs of the main results are included in Appendix A.
2. Main Results
2.1. MMD for Several Distributions
In this section, we show how the MMD can be defined for several distributions. Let be an RKHS associated with a characteristic reproducing kernel . For any and in , the inner product and -norm of are defined as and , respectively. Let be the canonical feature mapping associated with , i.e., . Using this feature mapping and setting for , we obtain the following k induced samples in the RKHS :
which are derived from the k original samples in (1). Define for , representing the mean embeddings of the k distributions , where .
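For concreteness, the following display is a minimal sketch of the objects involved, in notation introduced here (the feature map $\varphi$, the kernel $\mathcal{K}$, the induced observations $\mathbf{y}_{ij}^{*}$, and the mean embeddings $\boldsymbol{\mu}_i$ are our labels and may differ from the original symbols):
$$ \varphi(\mathbf{x}) = \mathcal{K}(\mathbf{x}, \cdot), \qquad \langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle_{\mathcal{H}} = \mathcal{K}(\mathbf{x}, \mathbf{y}), \qquad \mathbf{y}_{ij}^{*} = \varphi(\mathbf{y}_{ij}), \quad j = 1, \ldots, n_i, \qquad \boldsymbol{\mu}_i = \mathbb{E}_{\mathbf{y} \sim F_i}\, \varphi(\mathbf{y}), \quad i = 1, \ldots, k, $$
where $n = n_1 + \cdots + n_k$ denotes the total sample size of the pooled sample.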
Ref. [13] established the MMD for two distributions in a separable metric space (see, e.g., Theorem 5 of [13]). Here, we extend it naturally for multi-sample distributions in . According to the MMD of [13], for any , where , “testing vs. ” based on the two samples and is equivalent to “testing vs. ” based on the two samples and . Therefore, testing (2) using the k original samples in (1) is equivalent to testing the following hypothesis using the k induced samples in (3):
To test (4), following [19], a natural L²-norm-based test statistic using (3) is given by
where and denote the group and grand sample means, respectively. Through some simple algebra, as given in Appendix A, we can express as
Let denote the weights of the k distributions such that and . Then, when we estimate using , the test statistic estimates the following quantity:
which can be naturally defined as the MMD of the k distributions with the weight vector . It is worth noting that the MMD for multiple distributions presented above is equivalent to the one derived by [18], offering a much simpler alternative compared to the formulation proposed by [17].
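As a hedged sketch of the quantities just described, assuming the weights are the sample proportions $w_i = n_i/n$ and writing $\bar{\mathbf{y}}_i^{*}$ and $\bar{\mathbf{y}}^{*}$ for the group and grand means of the induced samples (the symbol $T_n$ is ours):
$$ T_n \;=\; \sum_{i=1}^{k} n_i \bigl\| \bar{\mathbf{y}}_i^{*} - \bar{\mathbf{y}}^{*} \bigr\|_{\mathcal{H}}^{2}, \qquad \mathrm{MMD}^2(F_1, \ldots, F_k; \mathbf{w}) \;=\; \sum_{i=1}^{k} w_i \bigl\| \boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}}_{\mathbf{w}} \bigr\|_{\mathcal{H}}^{2}, \qquad \bar{\boldsymbol{\mu}}_{\mathbf{w}} = \sum_{i=1}^{k} w_i \boldsymbol{\mu}_i . $$
Under this reading, $T_n/n$ is the plug-in (V-statistic-type) estimator of the squared MMD obtained by replacing each $\boldsymbol{\mu}_i$ with the corresponding group sample mean.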
It is easy to justify that is indeed an MMD of the k distributions in the sense that if and only if . On the one hand, when , we have so that for any , we have , implying that for any , we have and hence . On the other hand, when , for any , we have so that for any , we have and hence . Therefore, we can use to test the equality of k distributions based on the k induced samples (3).
2.2. Computation of the Test Statistic
Notice that the k induced samples (3) are not directly computable, as the canonical feature mapping is defined only through the reproducing kernel. Fortunately, the reproducing kernel and its canonical feature mapping can be utilized via the following useful kernel trick: $\langle \varphi(\mathbf{x}), \varphi(\mathbf{y}) \rangle_{\mathcal{H}} = \mathcal{K}(\mathbf{x}, \mathbf{y})$. Using this, we can express the inner product as follows:
Let and . Then, using (7), we have
Therefore, using (6), we can compute using any of the following useful expressions:
In other words, the value of the test statistic can be computed from the original k samples (1) using the above expressions.
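The kernel-trick computation above lends itself to a compact implementation. The sketch below assumes the statistic form $T_n = \sum_i n_i \|\bar{\mathbf{y}}_i^{*} - \bar{\mathbf{y}}^{*}\|_{\mathcal{H}}^2$ from the previous sketch; the function name and the pooled-sample ordering (group 1 first, then group 2, and so on) are our conventions, not the paper's.

```python
import numpy as np

def mmd_vstat(K, ns):
    """k-sample MMD-type statistic computed from the pooled Gram matrix.

    K  : (n, n) Gram matrix of the pooled sample, K[a, b] = kernel(z_a, z_b),
         with observations ordered group by group.
    ns : sequence of group sizes (n_1, ..., n_k).

    Assumed form: T = sum_i n_i * ||ybar*_i - ybar*||_H^2, which by the kernel
    trick reduces to T = sum_i (sum of K over group-i block)/n_i - (total sum of K)/n.
    """
    ns = np.asarray(ns)
    n = ns.sum()
    starts = np.concatenate(([0], np.cumsum(ns)[:-1]))   # first index of each group
    within = sum(K[s:s + m, s:s + m].sum() / m for s, m in zip(starts, ns))
    return within - K.sum() / n
```

Only the Gram matrix of the pooled sample is needed; the feature map itself is never evaluated.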
2.3. Asymptotic Null Distribution
To explore the null distribution of , we can rewrite as
where represents the weighted average of the mean embeddings of the k distributions. Additionally, we have
In the expression for , one can observe that the mean embeddings of the k distributions have been subtracted. Consequently, follows the same distribution as that of under the null hypothesis. Thus, studying the null distribution of is equivalent to studying the distribution of .
Similar to the proof of (6), we can express as follows:
Let denote the centered version of , defined as
where , , and and are independent copies of and , respectively. We can observe two useful properties: when , we have
When and are independent, we have
Using (12), we can express
Let
It is evident that and can be considered as centered versions of and , respectively. Consequently, by using (11) and (16), we can express
Utilizing (15) and performing some straightforward algebraic manipulations, we can express
Assuming that the centered kernel $\tilde{\mathcal{K}}$ is square-integrable, i.e., $\iint \tilde{\mathcal{K}}^2(\mathbf{x}, \mathbf{y})\, dF(\mathbf{x})\, dF(\mathbf{y}) < \infty$, we can express $\tilde{\mathcal{K}}$ using Mercer’s expansion:
$$ \tilde{\mathcal{K}}(\mathbf{x}, \mathbf{y}) = \sum_{r=1}^{\infty} \lambda_r \varphi_r(\mathbf{x}) \varphi_r(\mathbf{y}), \qquad (19) $$
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$ represent the eigenvalues of $\tilde{\mathcal{K}}$, and $\varphi_1, \varphi_2, \ldots$ are the corresponding orthonormal eigenelements, satisfying
$$ \int \varphi_r(\mathbf{x}) \varphi_s(\mathbf{x})\, dF(\mathbf{x}) = \delta_{rs}, $$
where $\delta_{rs}$ equals 1 when $r = s$ and 0 otherwise. Now, let us introduce the following conditions:
- C1.
- We have $F_1 = \cdots = F_k = F$.
- C2.
- As $n \to \infty$, we have $n_i/n \to \rho_i \in (0, 1)$ for $i = 1, \ldots, k$.
- C3.
- is a reproducing kernel such that .
Condition C1 assumes that the null hypothesis is satisfied and the common distribution function is F. Condition C2 is a regularity condition for k-sample problems and it requires that the group sample sizes tend to ∞ proportionally. Condition C3 is required such that is square-integrable and expression (19) is valid.
In fact, under Condition C3, using (19) and the Cauchy–Schwarz inequality, we obtain the following results:
where . These inequalities hold due to the square-integrability assumption and the properties of Mercer’s expansion. Now, let us state the following theorem that establishes the asymptotic distribution of .
Theorem 1.
Under Conditions C1–C3, as , we have , where
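As a hedged reading of the limit's general form, assuming (as the proof in Appendix A suggests) independent chi-squared variables with k − 1 degrees of freedom and writing $\lambda_r$ for the Mercer eigenvalues in (19):
$$ \zeta \;=\; \sum_{r=1}^{\infty} \lambda_r A_r, \qquad A_1, A_2, \ldots \ \overset{\text{i.i.d.}}{\sim}\ \chi^2_{k-1}. $$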
It is worth highlighting that the limit null distribution of the proposed test statistic differs from the one derived in [18] (Theorem 1) and it offers a more straightforward alternative compared to the limit null distribution obtained by [17] (Theorem 1). This explains why the limit null distribution presented by [17] is not employed to approximate the null distribution of their test statistic. However, as demonstrated in Section 3, it is indeed feasible to utilize this distribution if desired.
2.4. Mean and Variance of
Theorem 2.
Under Condition C1, we have , and
where .
Note that under Condition C2, we have as . Then as , we have .
2.5. Asymptotic Power
In this subsection, we examine the asymptotic power of the proposed test under the following local alternative hypothesis:
where and are constant elements in the RKHS such that
and
It is important to note that when , the local hypothesis (24) simplifies into a fixed alternative hypothesis for . However, when , Equation (24) represents a strict local alternative hypothesis. In this case, as n → ∞, the strict local hypothesis converges toward the null hypothesis, and detecting it becomes exceedingly challenging. A test is typically considered root-n-consistent if it can detect a strict local alternative hypothesis with a probability approaching 1 as n → ∞. A root-n-consistent test is considered effective because it achieves the best possible detection rate for a local alternative hypothesis as n grows.
Theorem 3.
Assume that for all for some . Then, we have and , where .
Theorem 4.
Assume that for all for some . Then, under Condition C2 and the local alternative hypothesis (24), as , we have (a) ; (b) ; and (c)
where denotes the upper percentile of with ϵ being the given significance level, ; are defined in Condition C2; ; and denotes the cumulative distribution of .
Theorem 4 shows that the proposed test is indeed a root-n-consistent test.
3. Methods for Implementing the Proposed Test
In this section, we outline three different approaches for approximating the null distribution of the test statistic (8) in order to conduct the proposed test. These methods include a parametric bootstrap approach, a random permutation technique, and a χ²-approximation method. We evaluate and compare their performance in the next section.
3.1. Parametric Bootstrap Method
Theorem 1 reveals that the asymptotic null distribution of takes the form of a chi-squared-type mixture denoted as (22). The coefficients of this mixture are determined by the unknown eigenvalues of , where are independently and identically distributed according to the common distribution function F representing the k distributions when the null hypothesis is valid. Consequently, in order to estimate the asymptotic null distribution of , it is essential to consistently estimate . This consistency can be achieved by utilizing the empirical eigenvalues of the centered Gram matrix, as suggested by [20], to construct a reliable estimator for (22).
Let us recall that represents the total sample size. We pool the k samples (1) and denote it as
Under the null hypothesis, are independently and identically distributed from the common distribution F. Let represent the Gram matrix, where the th entry is defined as for . Additionally, let denote a vector of ones with dimensions , and denote the identity matrix of size . Then, the matrix is a projection matrix of size with rank .
Now, define , commonly referred to as the centered Gram matrix. Its th entry is given by
where . As n approaches infinity, for any fixed i and j, we can observe that, by the law of large numbers:
Let be all the non-zero eigenvalues of that can be obtained via an eigen-decomposition of . Set . Then, following [20], we can show that under Condition C3, the distribution of can be consistently estimated by
The parametric bootstrap method can be described as follows. Let us choose a large value for N, for example, . Using expression (28), we can obtain a sample of by independently generating a total of N times. Now, let represent the observed test statistic calculated using (8) based on the k samples (1). Using the parametric bootstrap method, we can conduct the proposed test by calculating the approximate p-value, which is given by , where is an indicator function that takes 1 when S is a true event and 0 otherwise.
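A minimal sketch of this parametric bootstrap, under the assumptions used in the earlier sketches: the statistic has the form computed by mmd_vstat, and its null distribution is approximated by the chi-squared-type mixture $\sum_r \hat{\lambda}_r \chi^2_{k-1}$, with $\hat{\lambda}_r$ the non-zero eigenvalues of the centered Gram matrix scaled by 1/n. The function name, the numerical threshold for "non-zero", and the default N are ours.

```python
import numpy as np

def bootstrap_pvalue(K, ns, T_obs, N=10_000, rng=None):
    """Parametric-bootstrap p-value for the k-sample MMD-type statistic.

    K     : (n, n) Gram matrix of the pooled sample.
    ns    : group sizes (n_1, ..., n_k).
    T_obs : observed test statistic (e.g., mmd_vstat(K, ns)).
    """
    rng = np.random.default_rng(rng)
    ns = np.asarray(ns)
    n, k = int(ns.sum()), len(ns)
    H = np.eye(n) - np.ones((n, n)) / n          # centering projection
    lam = np.linalg.eigvalsh(H @ K @ H / n)      # eigenvalues of the centered Gram matrix / n
    lam = lam[lam > 1e-10]                       # keep the numerically non-zero ones
    # N independent draws of the chi-squared-type mixture sum_r lam_r * chi2_{k-1}
    mix = rng.chisquare(df=k - 1, size=(N, lam.size)) @ lam
    return float(np.mean(mix >= T_obs))
```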
3.2. Random Permutation Method
We can also approximate the null distribution of using a random permutation method. Let represent a random permutation of the indices from the pooled sample (26). Consequently, the sequence
forms a permutation of the pooled sample (26). To create permuted samples, we utilize the first observations in the permuted pooled sample (29) as the first permuted sample, the next observations as the second permuted sample, and so on, until we obtain k permuted samples. These permuted samples are denoted as
The permuted test statistic, denoted as , is calculated using (8) but with the k samples (1) replaced by the k permuted samples (30).
The random permutation method proceeds as follows. Let N be a sufficiently large number, for instance, . Suppose we repeat the permutation process described above N times, resulting in N permuted test statistics denoted as . Then, we can use the empirical distribution of these permuted statistics to approximate the null distribution of the test statistic. Recall that represents the test statistic computed using (8) based on the k original samples (1). Following the random permutation method, the proposed test can be conducted by calculating the approximated p-value, given by .
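A sketch of the random permutation method, reusing mmd_vstat from the Section 2.2 sketch. Permuting the rows and columns of the pooled Gram matrix is equivalent to permuting the pooled observations before splitting them into groups of sizes $n_1, \ldots, n_k$, so the kernel never has to be re-evaluated; the function name and the default N are ours.

```python
import numpy as np

def permutation_pvalue(K, ns, T_obs, N=10_000, rng=None):
    """Random-permutation p-value for the k-sample MMD-type statistic."""
    rng = np.random.default_rng(rng)
    n = int(np.sum(ns))
    stats = np.empty(N)
    for b in range(N):
        idx = rng.permutation(n)                       # random relabelling of the pooled sample
        stats[b] = mmd_vstat(K[np.ix_(idx, idx)], ns)  # statistic on the permuted groups
    return float(np.mean(stats >= T_obs))
```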
3.3. Welch–Satterthwaite χ²-Approximation Method
The parametric bootstrap method and the random permutation method are effective for controlling size but can be computationally intensive, particularly with large total sample sizes. To address this issue, we can utilize the well-known Welch–Satterthwaite (W–S) χ²-approximation method [21,22]. This method is known to be reliable for approximating the distribution of a chi-squared-type mixture. Theorem 1 demonstrates that the asymptotic null distribution of the test statistic is a chi-squared-type mixture (22).
The core concept of the W–S χ²-approximation method is to approximate the null distribution of the test statistic using that of a random variable of the form $W = \beta \chi^2_d$, where β and d are unknown parameters. These parameters can be determined by matching the means and variances of W and of the quantity defined in (10), which has the same distribution as the test statistic under the null hypothesis. Specifically, the mean and variance of W are $\beta d$ and $2\beta^2 d$, respectively, while the mean and variance of the quantity in (10) are given in Theorem 2. Equating these means and variances, we obtain
$$ \beta = \frac{\mathrm{Var}_0}{2\,\mathrm{E}_0}, \qquad d = \frac{2\,\mathrm{E}_0^2}{\mathrm{Var}_0}, \qquad (31) $$
where E₀ and Var₀ denote the null mean and variance given in Theorem 2.
To implement the W–S -approximation method, we need to consistently estimate and based on the pooled sample (26) from the k samples (1). According to Theorem 2, these estimates can be obtained as follows:
where
with being defined in (27). Substituting these estimators into (31), we obtain
Let denote the upper 100α percentile of $\hat{\beta}\chi^2_{\hat{d}}$, where α is the given significance level, and let denote the observed test statistic computed using (8) based on the k samples (1). Then, through the W–S χ²-approximation method, the proposed test can be conducted by rejecting the null hypothesis when the observed test statistic exceeds this percentile or when the approximated p-value is less than α.
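A sketch of the W–S χ²-approximation. The paper estimates the null mean and variance via the estimators of Theorem 2; the stand-in below instead plugs in the eigenvalues of the centered Gram matrix, which targets the same limiting quantities under the mixture form assumed earlier (mean ≈ (k − 1)Σλ_r, variance ≈ 2(k − 1)Σλ_r²). All names are ours.

```python
import numpy as np
from scipy import stats

def ws_chi2_test(K, ns, T_obs):
    """Welch-Satterthwaite approximation: null law of the statistic ~ beta * chi2_d."""
    ns = np.asarray(ns)
    n, k = int(ns.sum()), len(ns)
    H = np.eye(n) - np.ones((n, n)) / n
    lam = np.linalg.eigvalsh(H @ K @ H / n)
    lam = lam[lam > 1e-10]
    mean_hat = (k - 1) * lam.sum()               # estimated null mean
    var_hat = 2 * (k - 1) * np.sum(lam ** 2)     # estimated null variance
    beta_hat = var_hat / (2 * mean_hat)          # beta = Var / (2 * Mean), as in (31)
    d_hat = 2 * mean_hat ** 2 / var_hat          # d = 2 * Mean^2 / Var, as in (31)
    pval = float(stats.chi2.sf(T_obs / beta_hat, df=d_hat))
    return beta_hat, d_hat, pval
```

The null hypothesis is then rejected when T_obs exceeds beta_hat times the upper 100α percentile of the chi-squared distribution with d_hat degrees of freedom, or equivalently when pval < α.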
4. Simulation Studies
In this section, we delve into intensive simulation studies aimed at assessing the performance of the test we propose when compared to the energy test introduced by [12], which we denote as . Our proposed test employs the parametric bootstrap, the random permutation, and the W–S χ²-approximation methods as described in Section 3. For simplicity, we refer to the resulting tests as , , and , respectively.
For simplicity, we opt for the Gaussian Radial Basis Function (RBF) kernel, denoted as , which is defined as follows:
Here, is referred to as the kernel width. It is worth noting that the Gaussian RBF kernel described above is bounded by 1, ensuring that Condition C3 is always met. Following the approach outlined in [20], we set to be equal to the median distance between observed vectors in the pooled sample (26).
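A sketch of the Gaussian RBF Gram matrix with the median-distance kernel width described above. Here the kernel is taken as exp(−‖x − y‖²/(2·width²)), which is one common convention and an assumption on our part; the function name is also ours.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_gram(X, width=None):
    """Gaussian RBF Gram matrix for the pooled sample X (n rows, p columns).

    If width is None, it is set to the median Euclidean distance between the
    observed vectors in the pooled sample (the median heuristic).
    """
    D = squareform(pdist(X, metric="euclidean"))             # (n, n) distance matrix
    if width is None:
        width = np.median(D[np.triu_indices_from(D, k=1)])   # median pairwise distance
    return np.exp(-(D ** 2) / (2.0 * width ** 2))
```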
We also employ the Average Relative Error (ARE) introduced by [23] to evaluate the overall effectiveness of a test in maintaining its nominal size. The ARE is calculated as follows: , where represents the empirical sizes observed across M different simulation settings. A smaller ARE value indicates better performance of a test in terms of size control. In this simulation study, we set the nominal size to .
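For reference, with $\hat{\alpha}_1, \ldots, \hat{\alpha}_M$ denoting the empirical sizes across the M settings and α the nominal size, the ARE criterion of [23] is usually written as follows (a sketch, in our notation):
$$ \mathrm{ARE} \;=\; \frac{100}{M} \sum_{j=1}^{M} \frac{\lvert \hat{\alpha}_j - \alpha \rvert}{\alpha}. $$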
4.1. Simulation 1
We set k = 3 for simplicity. We generate the samples (1) as follows. We set
where , while , are generated using the following three models:
- Model 1.
- .
- Model 2.
- with ; .
- Model 3.
- with .
It is important to note that and play pivotal roles as tuning parameters that govern the similarity of the distributions among the three generated samples. Specifically, when both and are set to zero (), the three samples generated from the three models exhibit identical distributions. If at least one of , where i can be 1 or 2, is non-zero, the samples still share the same mean vector but differ in their covariance matrices. Furthermore, it is worth mentioning that the test’s power increases as both and increase.
We set as and as , where denotes a matrix of ones with dimensions . To assess the performance of the considered tests across a range of dimensionality settings, we examine three cases: , , and . For each of these cases, we consider three sets of sample sizes : , , and . Additionally, we investigate three levels of correlation: , , and . These three values of correspond to samples with varying degrees of correlation, ranging from nearly uncorrelated to moderately correlated and highly correlated. Notably, correlation increases as grows. For simplicity, we set and across all three models.
In the case of the parametric bootstrap and random permutation methods, as well as the energy test, we use a total of replicates for computing the p-values at each simulation run, as described in Section 3.1 and Section 3.2. It is worth noting that the W–S χ²-approximation method, which does not require generating replicates, is the least time-consuming among the methods considered. The empirical sizes and powers are computed based on 1000 simulation runs.
Table 1 provides an overview of the empirical sizes of , , , and , with the last row displaying the associated ARE values for the three different values. Several observations can be made based on this table: Firstly, for nearly uncorrelated samples (), exhibits a slight tendency to be liberal, with an ARE value of that is marginally higher than the ARE values of the other three tests. Secondly, when the generated samples are moderately correlated () or highly correlated (), all four tests demonstrate fairly similar empirical sizes and ARE values, making them comparable in terms of size control. Finally, it is seen that the influence of sample sizes on size control is relatively minor, even though, in theory, a larger total sample size should result in better size control.
Table 1.
Empirical sizes (in %) of Simulation 1.
Figure 1 displays the empirical powers of all four tests in scenarios where all three generated samples have the same mean vectors, but they differ from each other in covariance matrices. Several conclusions can be drawn regarding these power values: Firstly, for , and , , , and exhibit similar empirical powers. This suggests that these three tests perform comparably, regardless of whether the data are nearly uncorrelated, moderately correlated, or highly correlated. Secondly, it is seen that under similar settings, as expected, the empirical powers of the tests generally increase with larger sample sizes. Finally, the empirical powers of consistently rank the lowest among all four tests. This indicates that is less powerful compared to the other three tests in these scenarios.
Figure 1.
Simulation 1. The empirical powers (in %) of , , , and under different cases of : 1. (10, 20, 30, 40, 1.2, 0.6), 2. (10, 80, 120, 160, 0.75, 0.375), 3. (10, 160, 240, 320, 0.65, 0.325), 4. (100, 20, 30, 40, 1.3, 0.65), 5. (100, 80, 120, 160, 0.85, 0.425), 6. (100, 160, 240, 320, 0.72, 0.36), 7. (500, 20, 30, 40, 1.65, 0.825), 8. (500, 80, 120, 160, 1, 0.5), 9. (500, 160, 240, 320, 0.8, 0.4).
4.2. Simulation 2
Admittedly, the MMD-based tests , , and may not always outperform the energy-distance-based test as they did in Simulation 1. To illustrate this point, we keep the same experimental framework as described in Simulation 1 but introduce a new collection of three models, defined as follows:
- Model 4.
- .
- Model 5.
- with .
- Model 6.
- with , .
Please be aware that , where i takes values 1 and 2 and r ranges from 1 to p, is adjusted to ensure that and across all three models. When both and are set to 0, the three generated samples follow the same distributions as in Simulation 1. Consequently, the empirical sizes of all four tests in Simulation 2 will be similar to those observed in Simulation 1. However, when at least one of (where i takes values 1 and 2) is non-zero, the three samples exhibit distinct mean vectors and covariance matrices, differing from those observed in Simulation 1. As a result, we focus on the empirical powers of all four tests based on Models 4–6.
Figure 2 presents the empirical powers of all four tests based on Models 4–6, offering several noteworthy insights. First, it is evident that , , and demonstrate similar empirical powers. This implies that these three tests exhibit comparable performance regardless of whether the generated data follow a normal or non-normal distribution. Second, it is also observed that under similar settings, the empirical powers of the tests generally increase with larger sample sizes. Lastly, when takes on values of 0.5 and 0.9, the empirical powers of surpass those of the other three tests. However, when , the empirical powers of all four tests are generally comparable. This indicates that, in the scenarios under consideration, demonstrates greater effectiveness compared to the other three tests when the correlation coefficient is relatively high.
Figure 2.
Simulation 2. The empirical powers (in %) of , , , and under different cases of : 1. (10, 20, 30, 40, 0.37, 0.185), 2. (10, 80, 120, 160, 0.195, 0.097), 3. (10, 160, 240, 320, 0.18, 0.09), 4. (100, 20, 30, 40, 0.142, 0.071), 5. (100, 80, 120, 160, 0.068, 0.034), 6. (100, 160, 240, 320, 0.044, 0.022), 7. (500, 20, 30, 40, 0.06, 0.03), 8. (500, 80, 120, 160, 0.028, 0.014), 9. (500, 160, 240, 320, 0.022, 0.011).
From these two simulation studies, it can be inferred that the proposed MMD-based tests , , and may outperform the energy-distance-based test when the differences in distributions are primarily in covariance matrices, while the reverse could be true when the differences in distribution involve both mean vectors and covariance matrices. Notably, the MMD-based test generally requires less computational effort compared to the bootstrap or permutation-based tests , , and .
5. Application to the Corneal Surface Data
The corneal surface data are briefly mentioned in Section 1. They were acquired during a keratoconus study, a collaborative project involving Ms. Nancy Tripoli and Dr. Kenneth L. Cohen from the Department of Ophthalmology at the University of North Carolina, Chapel Hill. This dataset comprises 150 observations with each corneal surface having more than 6000 height measurements. It can be categorized into four distinct groups: a group of 43 healthy corneas (referred to as the normal cornea group), a group of 14 corneas with unilateral suspect characteristics, a group of 21 corneas with suspect map features, and a group of 72 corneas clinically diagnosed with keratoconus. It is important to note that the corneal surfaces within the normal, unilateral suspect, and suspect map groups exhibit similar shapes, but they significantly differ from the corneal surfaces observed in the clinical keratoconus group (refer to Figure 1 in [24] for visualization).
In the process of reconstructing a corneal surface, ref. [24] utilized the Zernike regression model to fit the height measurements associated with the corneal surface. The height of the corneal surface at a specific radius r and angle is denoted as , while represents the height estimated through the fitted model within the predefined region of interest. This region of interest spans from to and from to , with being a predetermined positive constant. To naturally represent each corneal surface, a feature vector is constructed, consisting of values , where i ranges from 1 to K and j ranges from 1 to L. These values are obtained by evaluating the fitted corneal surface at a grid of points defined as and for and . For simplicity, we choose to set and , resulting in a feature vector with dimensions of 2000 for each corneal surface.
For simplicity, we put the fitted feature vectors for the complete corneal surface dataset collectively into a feature matrix with dimensions of . In this matrix, each row corresponds to a feature vector representing a corneal surface. Specifically, the initial 43 rows of the feature matrix correspond to observations from the normal group, sequentially followed by 14 rows from the unilateral suspect group, 21 rows from the suspect map group, and lastly, 72 rows from the clinical keratoconus group.
Our objective is to examine whether there are significant differences in the distributions among various corneal surface groups, referred to as multi-sample problems for the equality of distributions for high-dimensional data, given that the high-dimensional feature vectors represent the observations of the corneal surface data. In this application, we employ , , , and to address these problems. For both the parametric bootstrap and random permutation methods, as well as the energy test, we perform a total of N = 10,000 replicates to compute the associated p-values. For simplicity, we denote the normal, unilateral suspect, suspect map, and clinical keratoconus groups as NOR, UNI, SUS, and CLI, respectively.
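Putting the earlier sketches together, a hypothetical driver for the four-group comparison might look as follows. The feature-matrix file name, the variable names, and the reuse of gaussian_gram, mmd_vstat, bootstrap_pvalue, permutation_pvalue, and ws_chi2_test from the previous sketches are all our assumptions; only the group sizes (43, 14, 21, 72) and N = 10,000 come from the text.

```python
import numpy as np

# Hypothetical input: the 150 x 2000 feature matrix described above, with rows
# ordered as NOR (43), UNI (14), SUS (21), CLI (72).
X = np.load("corneal_features.npy")         # placeholder file name
ns = np.array([43, 14, 21, 72])

K = gaussian_gram(X)                        # median-heuristic RBF Gram matrix (Section 4)
T_obs = mmd_vstat(K, ns)                    # observed test statistic (Section 2.2)

p_pb = bootstrap_pvalue(K, ns, T_obs, N=10_000)      # parametric bootstrap (Section 3.1)
p_rp = permutation_pvalue(K, ns, T_obs, N=10_000)    # random permutation (Section 3.2)
_, _, p_ws = ws_chi2_test(K, ns, T_obs)              # W-S chi-squared approximation (Section 3.3)
print(p_pb, p_rp, p_ws)
```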
Table 2 displays the results obtained from the application of four statistical tests, namely , , , and , to assess the equality of distributions among different corneal surface groups. A careful examination of these results yields several noteworthy insights. To begin with, when considering the comparison among the corneal surface groups labeled “NOR vs. UNI vs. SUS vs. CLI”, it is evident that all four tests reject the null hypothesis. This rejection signifies that there exists at least one significant difference among the distributions of these four corneal surface groups. Consequently, further investigation is warranted to identify which specific group or groups differ from the others. Secondly, it is important to highlight the outcome for the comparison involving “NOR vs. UNI vs. SUS”. In this case, none of the four tests reject the null hypothesis at the significance level. This outcome indicates that the normal, unilateral suspect, and suspect map groups share a similar distribution pattern. Thirdly, across the remaining three comparisons, all four tests consistently reject the null hypothesis. These results align with the observations depicted in Figure 1 of [24], which illustrates the distinctiveness of corneal surfaces within the clinical keratoconus group when compared to the other three groups. Fourthly, when focusing on the comparison “NOR vs. UNI vs. SUS”, it is worth noting that the p-values obtained from , , and are quite similar, suggesting their comparable performance. This consistency in p-values is also reflected in the empirical sizes presented in Table 1. Lastly, when analyzing the cases involving CLI, it becomes evident that the p-values generated by consistently exhibit larger values compared to those produced by the other three tests. This discrepancy implies that may have a lower sensitivity in detecting distribution differences when compared to the other tests, indicating the potentially reduced statistical power in this real data example.
Table 2.
p-values (in %) for testing the distribution equality of corneal surface groups.
Notice that the test is bootstrap-based and the tests and are permutation-based. This means that their p-values are obtained by bootstrapping or permuting numerous random samples, as described in Section 3.1 and Section 3.2. Thus, the p-values of these tests are random, i.e., they differ from one run to another. However, the p-value of remains fixed. To investigate this, we applied , and to the case “NOR vs. UNI vs. SUS” 500 times. The boxplots of the corresponding p-values of the four tests are presented in Figure 3. It is evident that the p-values obtained from remain fixed, whereas those derived from , and exhibit variability. This contrast underscores the fact that p-values resulting from bootstrap-based or permutation-based tests indeed vary across runs.
Figure 3.
Boxplots of p-values of , , , and when applied to the case “NOR vs. UNI vs. SUS” of the corneal surface data 500 times.
6. Concluding Remarks
Testing whether multiple high-dimensional samples adhere to the same distribution is a common area of research interest. This paper introduces and investigates a novel MMD-based test to address this question. The null distribution of this test is approximated using three methods: parametric bootstrap, random permutation, and the W–S χ²-approximation approach. Results from two simulation studies and a real data application demonstrate that the proposed test exhibits effective size control and superior statistical power compared to the energy test introduced by [12] when the differences among sample distributions are primarily related to covariance matrices rather than mean vectors. Thus, the proposed test is generally well suited for conducting multi-sample equal distribution testing on high-dimensional data. We particularly recommend its use in scenarios where distribution differences are associated with covariance matrices. Conversely, when distribution differences predominantly pertain to means, the energy test is a more powerful choice. However, in practice, determining whether distribution differences are related to means or covariance matrices can be challenging. Therefore, we suggest considering both the new test and the energy test as viable options. Nevertheless, it is important to note that implementing the proposed test comes with certain challenges. Both the parametric bootstrap and random permutation methods can be computationally intensive, and their p-values vary across different applications. In contrast, the W–S χ²-approximation method offers computational efficiency and produces fixed p-values. However, its accuracy is limited, as it relies solely on matching two cumulants of the test statistic under the null hypothesis.
An intriguing question arises naturally: Can we enhance the accuracy of the proposed test by matching three cumulants of the test statistic? Recent work by [18] suggests that this is indeed possible. However, deriving the third cumulant of the test statistic presents a current challenge and requires further investigation. Another aspect to consider is the choice of kernel width. While the paper opts for simplicity by utilizing the median distance between observed vectors in the pooled sample, it is worth exploring the kernel width choice recommended by [18] to potentially enhance the test’s statistical power. These avenues for future research promise exciting developments and warrant further exploration.
Author Contributions
Conceptualization, J.-T.Z. and A.A.C.; methodology, J.-T.Z., T.Z. and Z.P.O.; software, Z.P.O. and T.Z.; validation, J.-T.Z. and T.Z.; formal analysis, Z.P.O.; investigation, J.-T.Z. and Z.P.O.; resources, Z.P.O.; data curation, Z.P.O.; writing—original draft preparation, J.-T.Z. and Z.P.O.; writing—review and editing, J.-T.Z. and T.Z.; visualization, A.A.C. and T.Z.; supervision, J.-T.Z.; project administration, J.-T.Z.; funding acquisition, J.-T.Z. All authors have read and agreed to the published version of the manuscript.
Funding
Zhang’s research was partially funded by the National University of Singapore academic research grant (22-5699-A0001).
Data Availability Statement
Publicly available datasets were analyzed in this study. This data can be found here: https://tandf.figshare.com/articles/dataset/Linear_hypothesis_testing_with_functional_data/6063026/1?file=10914914 (accessed on 20 October 2023).
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| MMD | Maximum Mean Discrepancy |
| EDF | Empirical Distribution Function |
| RKHS | Reproducing Kernel Hilbert Spaces |
| RBF | Radial Basis Function |
| ARE | Average Relative Error |
Appendix A. Technical Proofs
Proof of Theorem 1.
Under Condition C1, we have . Let . Under Condition C3, we have Mercer’s expansion (19). Through (14) and (20), we have
This, together with (20), implies that
Set . Under Condition C1, we have .
It follows that for a fixed , are i.i.d. with mean 0 and variance 1. For different r, are uncorrelated. Then, through (18) and (19), we have
where . Through (17), we have
where and .
Let denote the characteristic function of a random variable X. Set . Then, we have For any given , through the central limit theorem, under Conditions C2 and C3, as , we have . Set . Then, as , we have and are independently identically distributed. Set . Under Condition C2, we have . Thus, . Both and are idempotent matrices with rank . It follows that
which is a chi-squared distribution with degrees of freedom and are independent. It follows that as , we have and . Therefore, as , we have
It follows that
Let t be fixed. Under Condition C3 and (21), as , we have . Thus, for any given , there exist and , depending on and , such that as and , we have
For any fixed , through (A2), as , we have . Thus, there exists , depending on q and , such that as , we have
Recall that . Along the same lines as those for proving (A4), we can show that there exists , depending on and , such that as , we have
The convergence in distribution of to follows as we can let . □
Proof of Theorem 2.
Under Condition C1, let . Then, through (23), we have and . Thus, we have
Through (23) again, we have
□
Proof of Theorem 3.
First, through (12), we have for all , . Thus, we have
Then, through Theorem 2, we have
Finally, since are independent, we have
Furthermore, through the Cauchy–Schwarz inequality, we have
□
Proof of Theorem 4.
Under the given conditions, through Theorems 2 and 3, we have
and as ,
where are given in Condition C2. It follows that as , we have and . Thus, through the Markov inequality, as , we have
for all . Therefore, we have and (a) is proved. To prove (b), notice that through the central limit theorem, as , we have
Since are independent and , we have . To prove (c), notice that as , through (25), (a) and (b), we have
Thus, as , and through (A7), we have
where and . Thus, the theorem is proved. □
References
- Lehmann, E.L. Nonparametrics: Statistical Methods Based on Ranks; Springer: New York, NY, USA, 2006. [Google Scholar]
- Friedman, J.H.; Rafsky, L.C. Multivariate Generalizations of the Wald–Wolfowitz and Smirnov Two-Sample Tests. Ann. Stat. 1979, 7, 697–717. [Google Scholar] [CrossRef]
- Schilling, M.F. Multivariate Two-Sample Tests Based on Nearest Neighbors. J. Am. Stat. Assoc. 1986, 81, 799–806. [Google Scholar] [CrossRef]
- Baringhaus, L.; Franz, C. On a new multivariate two-sample test. J. Multivar. Anal. 2004, 88, 190–206. [Google Scholar] [CrossRef]
- Rosenbaum, P.R. An exact distribution-free test comparing two multivariate distributions based on adjacency. J. R. Stat. Soc. Ser. B 2005, 67, 515–530. [Google Scholar] [CrossRef]
- Biswas, M.; Mukhopadhyay, M.; Ghosh, A.K. A distribution-free two-sample run test applicable to high-dimensional data. Biometrika 2014, 101, 913–926. [Google Scholar] [CrossRef]
- Li, J. Asymptotic normality of interpoint distances for high-dimensional data with applications to the two-sample problem. Biometrika 2018, 105, 529–546. [Google Scholar] [CrossRef]
- Chen, H.; Friedman, J.H. A New Graph-Based Two-Sample Test for Multivariate and Object Data. J. Am. Stat. Assoc. 2017, 112, 397–409. [Google Scholar] [CrossRef]
- Hall, P.; Tajvidi, N. Permutation Tests for Equality of Distributions in High-Dimensional Settings. Biometrika 2002, 89, 359–374. [Google Scholar] [CrossRef]
- Wei, S.; Lee, C.; Wichers, L.; Marron, J.S. Direction-Projection-Permutation for High-Dimensional Hypothesis Tests. J. Comput. Graph. Stat. 2016, 25, 549–569. [Google Scholar] [CrossRef]
- Ghosh, A.K.; Biswas, M. Distribution-free high-dimensional two-sample tests based on discriminating hyperplanes. Test 2016, 25, 525–547. [Google Scholar] [CrossRef]
- Székely, G.J.; Rizzo, M.L. Testing for equal distributions in high dimension. InterStat 2004, 5, 1249–1272. [Google Scholar]
- Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A Kernel Two-Sample Test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
- Sejdinovic, D.; Sriperumbudur, B.; Gretton, A.; Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 2013, 41, 2263–2291. [Google Scholar] [CrossRef]
- Zhang, J.T.; Smaga, Ł. Two-sample test for equal distributions in separable metric space: New maximum mean discrepancy based approaches. Electron. J. Stat. 2022, 16, 4090–4132. [Google Scholar] [CrossRef]
- Zhou, B.; Ong, Z.P.; Zhang, J.T. A new MMD-based two-sample test for equal distributions in separable metric spaces. Manuscript 2023, in press.
- Balogoun, A.S.K.; Nkiet, G.M.; Ogouyandjou, C. k-Sample problem based on generalized maximum mean discrepancy. arXiv 2018, arXiv:1811.09103. [Google Scholar]
- Zhang, J.T.; Guo, J.; Zhou, B. Testing equality of several distributions in separable metric spaces: A maximum mean discrepancy based approach. J. Econom. 2022, in press. [CrossRef]
- Zhang, J.T.; Guo, J.; Zhou, B. Linear hypothesis testing in high-dimensional one-way MANOVA. J. Multivar. Anal. 2017, 155, 200–216. [Google Scholar] [CrossRef]
- Gretton, A.; Fukumizu, K.; Harchaoui, Z.; Sriperumbudur, B.K. A Fast, Consistent Kernel Two-Sample Test. In Advances in Neural Information Processing Systems 22; Curran Associates, Inc.: New York, NY, USA, 2009; pp. 673–681. [Google Scholar]
- Welch, B.L. The generalization of `student’s’ problem when several different population variances are involved. Biometrika 1947, 34, 28–35. [Google Scholar] [CrossRef]
- Satterthwaite, F.E. An Approximate Distribution of Estimates of Variance Components. Biom. Bull. 1946, 2, 110–114. [Google Scholar] [CrossRef]
- Zhang, J.T. Two-Way MANOVA with Unequal Cell Sizes and Unequal Cell Covariance Matrices. Technometrics 2011, 53, 426–439. [Google Scholar] [CrossRef]
- Smaga, Ł.; Zhang, J.T. Linear Hypothesis Testing with Functional Data. Technometrics 2019, 61, 99–110. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).