Abstract
High-dimensional parameter testing is commonly used in bioinformatics to analyze complex relationships in gene expression and brain connectivity studies, involving parameters like means, covariances, and correlations. In this paper, we present a novel approach for testing U-statistics-type parameters by leveraging jackknife pseudo-values. Inspired by Tukey’s conjecture, we establish the asymptotic independence of these pseudo-values, allowing us to reformulate U-statistics-type parameter testing as a sample mean testing problem. This reformulation enables the use of established sample mean testing frameworks, simplifying the testing procedure. We apply a multiplier bootstrap method to obtain critical values and provide a rigorous theoretical analysis to validate the approach. Simulation studies demonstrate the robustness of our method across a variety of scenarios. Additionally, we apply our approach to investigate differences in the dependency structures of a subset of genes within the Wnt signaling pathway, which is associated with lung cancer.
MSC:
62F03
1. Introduction
With the development of technology, high-dimensional parameter testing is widely used in various bioinformatics analyses. Numerous studies employ mean tests to investigate changes in genes of interest, such as [1,2]. Additionally, some studies use covariance matrix or precision matrix tests to illustrate brain connectivity analysis and apply correlation matrix tests for gene co-expression network analysis, such as [3,4,5]. In this paper, we concentrate on U-statistics type parameter testing, encompassing but not limited to mean, covariance, and correlation tests.
High-dimensional mean tests have been well studied, as reviewed by [6]. We categorize the proposed test statistics into five broad groups. Firstly, L2-norm-based test statistics, such as [7,8,9,10,11], are known for their effectiveness under dense alternatives. Another group consists of maximum-norm-based tests, as evidenced by [12,13,14], which are more suitable for sparse alternatives. Since the alternative hypothesis is usually unknown in advance, a third category of test statistics aims to accommodate diverse alternatives by combining the p-values of tests based on various norms, such as [15,16,17]. Additionally, some tests simplify the problem by projecting the high-dimensional mean vector onto lower dimensions. For instance, some studies explore random projections, such as [18,19], while ref. [20] seeks the optimal projection directions. Most of the tests mentioned above impose sparsity conditions on the covariance matrices, but dense patterns are common in practice. To address this, studies such as [21,22,23,24] enhance the signal strength and test performance by incorporating common factors.
High-dimensional covariance tests have also achieved significant advancements in recent years. For the one-sample covariance test, methods mainly include spectral-norm-based tests, such as [25,26,27], and Frobenius-norm-based tests, such as [28,29,30]. For the two-sample covariance test, Frobenius-norm-based tests, such as [30,31,32,33], perform well for dense alternatives, while tests based on the maximum entry-wise norm, such as [34], show strong performance for sparse alternatives.
High-dimensional correlation tests have received significant attention. For the one-sample Pearson correlation test, ref. [35] proposes a test suitable for sparse alternatives, while ref. [36] provides a test that is powerful for dense alternatives. For the two-sample correlation test, ref. [37] introduces a test suitable for sparse alternatives, and ref. [38] develops a general framework for testing correlation structures across one, two, and multiple sample scenarios. In addition to Pearson correlation matrix tests, rank-based correlation matrix tests, including Kendall’s tau and Spearman’s rank correlation, have been well studied, as demonstrated by [39,40,41]. Furthermore, ref. [42] proposes a framework for the equality test of U-statistic-based correlation matrices.
For all the aforementioned parameter tests, there exist U-statistic-based test statistics, such as [10,30,33,34,42]. There are also some adaptive and unified tests. Ref. [16] proposes a unified framework for testing high-dimensional parameters that can be estimated by U-statistic-based vectors. Ref. [43] constructs a family of U-statistics as unbiased estimators for Lq-norms of the test parameters, further combining the p-values across different orders q. Ref. [44] proposes a two-step Gaussian approximation for high-dimensional non-degenerate U-statistics and a bootstrap method for computing their probabilities within hyper-rectangles.
In this paper, we propose a novel approach for U-statistics-type parameters. Inspired by Tukey’s conjecture [45], we establish the asymptotic independence of jackknife pseudo-values for the U-statistic estimator. By constructing test statistics based on the sample means of these pseudo-values, we effectively transfer U-statistic-type testing into the sample mean testing framework. This reformulation allows us to apply established methods from sample mean testing, simplifying the testing procedure. We derive the critical values for our test statistics by applying a multiplier bootstrap to the pseudo-values. In addition, we conduct a comprehensive theoretical analysis of our proposed test, including validation of the multiplier bootstrap procedure for accurate critical value estimation, as well as an assessment of its asymptotic properties, such as size control and power performance.
The rest of this paper is organized as follows: In Section 2, we present the detailed testing procedures. Section 3 verifies the effectiveness of the multiplier bootstrap used in Section 2 and analyzes the theoretical performance of our proposed tests. Section 4 presents simulation results to justify the empirical performance of our methods. In Section 5, we apply our methodology to analyze the dependency differences in the Wnt signaling pathway between lung cancer patients and control patients. Finally, some conclusions and discussions are provided in Section 6.
2. Methodology
Let X and Y be two d-dimensional random vectors, independent of each other. Let X1, …, Xn1 be independent and identically distributed (i.i.d.) random samples distributed as X; similarly, let Y1, …, Yn2 be i.i.d. random samples distributed as Y. We set
where q is the dimensionality of the parameter we are interested in, and each component is defined through an m-order symmetric kernel function. We assume that each kernel function is symmetric and of the same order m only for notational simplicity. We then define two U-statistic-based vectors as in (1),
We use to denote the expectation of , i.e., , with for and . We are interested in testing the following hypotheses:
- (i) (One-sample problem) For a given
- (ii) (Two-sample problem)
Intuitively, with the estimators in (1), the two problems in (2) and (3) can be treated as one-sample and two-sample mean tests in high dimensions. However, it is difficult to derive or compute the asymptotic distribution of test statistics based directly on the U-statistic estimators in (1). To overcome this obstacle, we propose a test based on the jackknife pseudo-values, which take Tukey's classical delete-one form: writing the full-sample U-statistic vectors as U1 and U2, the pseudo-values are n1 U1 − (n1 − 1) U1(−i) for i = 1, …, n1, and n2 U2 − (n2 − 1) U2(−j) for j = 1, …, n2, where U1(−i) is the U-statistic vector based on the first sample with the i-th observation removed, and U2(−j) is the U-statistic vector based on the second sample with the j-th observation removed. As shown in [45,46], the jackknife pseudo-values are unbiased estimators of the target parameters, i.e.,
According to [47], the jackknife pseudo-values are not only uncorrelated but also asymptotically independent. Hence, the target parameter can be estimated by the sample mean of the jackknife pseudo-values.
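As a concrete sketch of this construction (not the paper's implementation; function names and the choice of Kendall's tau as an illustrative order-2 kernel are ours), the delete-one pseudo-values can be computed as follows. A useful check is that, for U-statistics, the sample mean of the pseudo-values recovers the full-sample estimate exactly:

```python
import numpy as np

def u_stat_kendall(x, y):
    """Kendall's tau as an order-2 U-statistic: average sign agreement over pairs."""
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return s / (n * (n - 1) / 2)

def jackknife_pseudovalues(x, y, u_stat):
    """Tukey's delete-one pseudo-values: n*U_n - (n-1)*U_{n-1}^(-i)."""
    n = len(x)
    full = u_stat(x, y)
    pv = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i          # leave observation i out
        pv[i] = n * full - (n - 1) * u_stat(x[mask], y[mask])
    return pv

rng = np.random.default_rng(0)
x, y = rng.standard_normal(30), rng.standard_normal(30)
pv = jackknife_pseudovalues(x, y, u_stat_kendall)
```

For a vector-valued parameter, the same computation is applied coordinate-wise, one kernel per coordinate.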
Further, we provide the variance estimators for as follows:
Remark 1.
The variance estimator in (7) is also the delete-1 jackknife estimator for the variance. As long as the estimator is a smooth function of the observations, this jackknife estimator is consistent.
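For concreteness, one standard form of the delete-1 jackknife variance estimator is the sample variance of the pseudo-values divided by n; the sketch below uses that convention, which may differ in normalization from the paper's (7):

```python
import numpy as np

def jackknife_variance(pseudo_values):
    """Delete-1 jackknife variance estimate for the point estimator:
    sample variance of the pseudo-values divided by n."""
    pv = np.asarray(pseudo_values, dtype=float)
    n = len(pv)
    return pv.var(ddof=1) / n

pv = np.array([1.2, 0.8, 1.1, 0.9, 1.0])
var_hat = jackknife_variance(pv)   # (0.04+0.04+0.01+0.01)/4 / 5 = 0.005
```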
Hence, for the one-sample testing problem (2), we construct the following test statistic:
Remark 2.
If the parameter represents the mean of , the jackknife pseudo-values reduce to the independent sample observations, and the test statistic simplifies to the sample mean tests in [12,13].
In high dimensions, for centered independent random vectors , ref. [48] derives that the distribution of can be approximated by the maximum of a sum of Gaussian random vectors with the same covariance matrices as . Meanwhile, ref. [48] proposes a multiplier bootstrap procedure to obtain these Gaussian random vectors. Motivated by [48], we apply the multiplier bootstrap procedure to the asymptotically independent jackknife pseudo-values . Through this procedure, one can approximate the distributions of and in (8) and (9). Specifically, let the multipliers be a sequence of i.i.d. standard normal random variables with mean 0 and variance 1, independent of the data. The b-th multiplier bootstrap samples of are . Correspondingly, the b-th multiplier bootstrap sample of is as follows:
Based on , we define the b-th bootstrap sample of the test statistics and by the following:
Based on these multiplier bootstrap samples, the critical value and p-value for and can be estimated by
After obtaining the critical value, we obtain the test for the hypothesis in (2) and (3) as follows:
Thus, we reject (2) if and only if , and reject (3) if and only if . Correspondingly, we can construct the p-value estimators of and as
For a given significance level , we then reject the of (2) if and only if , and reject (3) if and only if .
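The one-sample procedure described above can be sketched end-to-end as follows. This is an illustrative implementation of a studentized max-type statistic with a Gaussian multiplier bootstrap; variable names, the B = 1000 default, and the exact studentization are our assumptions, not the paper's formulas:

```python
import numpy as np

def multiplier_bootstrap_test(pv, alpha=0.05, B=1000, rng=None):
    """One-sample max-type test on pseudo-values pv (n x q) for H0: mean = 0.
    Critical value and p-value come from a Gaussian multiplier bootstrap."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, q = pv.shape
    mean = pv.mean(axis=0)
    centered = pv - mean
    sd = centered.std(axis=0, ddof=1)
    T = np.sqrt(n) * np.max(np.abs(mean) / sd)        # observed statistic
    e = rng.standard_normal((B, n))                    # i.i.d. N(0,1) multipliers
    # b-th bootstrap statistic: max over coordinates of the multiplier-weighted sum
    T_boot = np.max(np.abs(e @ centered) / (np.sqrt(n) * sd), axis=1)
    crit = np.quantile(T_boot, 1 - alpha)
    pval = np.mean(T_boot >= T)
    return T, crit, pval

rng = np.random.default_rng(1)
pv_null = rng.standard_normal((80, 25))               # pseudo-values with mean zero
T0, crit, p0 = multiplier_bootstrap_test(pv_null)
T1, _, p1 = multiplier_bootstrap_test(pv_null + 1.0)  # strong signal in every coordinate
```

The two-sample version applies the same multiplier scheme to each group's centered pseudo-values and bootstraps the combined statistic.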
3. Theoretical Analysis
In this section, we justify the validity of the multiplier bootstrap method used in the last section and study the empirical size and power of our proposed tests. Before presenting the detailed theoretical results, some mild assumptions are introduced.
Assumption 1.
There exists such that . For the one-sample problem, . For the two-sample problem, , where the two sample sizes are comparable .
Next, we introduce additional notation to present the assumptions on the kernel functions. Specifically, based on , we define the centered kernel function and the quantities derived from it as follows:
Further, set
Analogously, we can construct vectors and based on .
Assumption 2.
For any indices and ,
Assumption 3.
There exists a positive constant b, such that and , for any , where .
Assumption 4.
There exists a constant such that holds for all , and .
This assumption specifies the relationship between the sample size n and the parameter dimension q. Assumption 1 permits both the parameter dimension q and the sample size n to go to infinity, as long as the stated rate condition holds. Additionally, it requires the sample sizes of the two groups to be of the same order. Assumption 2 requires that the centered kernel functions follow sub-exponential distributions. This assumption is mild, as bounded kernels (such as those of widely used rank-based U-statistics) satisfy this condition. Assumption 3 excludes degenerate U-statistics and requires that the relevant inner products are non-degenerate. Assumption 4 includes mild moment conditions. These assumptions are crucial for the high-dimensional central limit theorem.
The multiplier bootstrap method plays an important role in our tests. Based on the above assumptions, we justify the validity of the multiplier bootstrap procedure used in Section 2 by the following theorem:
The detailed proof of Theorem 1 is presented in Appendix A.3. This theorem ensures that the distributions of our test statistics can be well approximated by the corresponding multiplier bootstrap distributions under the null hypothesis.
Given the pre-specified level , Theorem 2 establishes that the sizes of and are well controlled. We omit the proof, as this theorem can be viewed as a consequence of Theorem 1. Specifically, according to Theorem 1, the distributions of our test statistics and can be well approximated by those of and . Further, by the Dvoretzky–Kiefer–Wolfowitz inequality in Massart's version, we have
where and are the empirical distributions used to obtain the critical values in the test procedures, while and represent the theoretical distributions in Theorem 1. Thus, Theorem 2 is established by combining these results. In addition to asymptotic size control, we investigate the power properties of and in the subsequent theorem.
Theorem 3.
The detailed proof for this theorem is provided in Appendix A.4. This result demonstrates that, under certain mild conditions, the power of our proposed test converges to 1.
4. Simulation Study
In this section, we conduct comprehensive simulations to investigate the empirical performance of our proposed test. As discussed in Section 2, for the sample mean test, our method reduces to the tests introduced by [12,13]. Therefore, we focus on testing the Kendall tau correlation matrix and compare our method with several established methods. Specifically, we consider the sample covariance-based methods from [49,50], as well as the U-statistic-based Kendall tau correlation test from [42]. For simplicity, we refer to these methods as CLX, HD, and ZU, respectively, and denote our method as UJB.
We generate two sets of independent random samples, and , where and . Here, , where , and are independent and identically distributed (i.i.d.) random variables with variances . To mimic practical scenarios, we assume the samples are drawn from the following three models:
- Model 1 (Gamma Distribution): Let , for .
- Model 2 (Zero-Inflated Poisson Distribution): Let with probability 0.15, and with probability 0.85, for .
- Model 3 (Student’s t Distribution): Let , where is drawn from Unif, for .
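The three generating mechanisms above can be sketched as follows. The Gamma shape and scale, the Poisson rate, and the degrees-of-freedom range are illustrative placeholders of ours, since the paper's exact constants do not survive in this excerpt; only the 0.15/0.85 zero-inflation split is taken from the text:

```python
import numpy as np

def model1_gamma(n, d, rng, shape=4.0, scale=0.5):
    # Model 1 sketch: i.i.d. Gamma entries (shape/scale are illustrative).
    return rng.gamma(shape, scale, size=(n, d))

def model2_zip(n, d, rng, p_zero=0.15, lam=2.0):
    # Model 2 sketch: zero with probability 0.15, Poisson(lam) with probability 0.85.
    pois = rng.poisson(lam, size=(n, d)).astype(float)
    return np.where(rng.random((n, d)) < p_zero, 0.0, pois)

def model3_t(n, d, rng, df_low=5, df_high=15):
    # Model 3 sketch: Student's t entries with uniformly drawn degrees of freedom.
    df = rng.integers(df_low, df_high + 1, size=d)
    return rng.standard_t(df, size=(n, d))

rng = np.random.default_rng(0)
Z1, Z2, Z3 = (m(200, 10, rng) for m in (model1_gamma, model2_zip, model3_t))
```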
Thus, the covariance matrix of is , and the covariance matrix of is . Under the null hypothesis, we assume . We consider the following four covariance structures for :
- Case I (Block): Set , where is a diagonal matrix with i.i.d. entries drawn from Unif. The matrix represents a block correlation structure, with for all k, for (for ), and otherwise.
- Case II (Decay): Set , where .
- Case III (Non-Sparse): Define , where with and . The matrix is diagonal, with entries drawn from Unif, and is uniformly distributed on the Stiefel manifold , i.e., and , where is the identity matrix of dimension (set ).
- Case IV (Long-range dependence): Set , where , with .
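Cases I and II can be sketched as follows; the block size, within-block correlation, and geometric decay rate below are illustrative stand-ins for the paper's elided constants:

```python
import numpy as np

def block_corr(d, block=5, rho=0.5):
    """Case I sketch: block correlation structure, rho within blocks
    (block size and rho are illustrative choices)."""
    R = np.full((d, d), 0.0)
    for s in range(0, d, block):
        e = min(s + block, d)
        R[s:e, s:e] = rho
    np.fill_diagonal(R, 1.0)
    return R

def decay_corr(d, rate=0.5):
    """Case II sketch: entries decay geometrically with |j - k| (AR(1)-type)."""
    idx = np.arange(d)
    return rate ** np.abs(idx[:, None] - idx[None, :])

R1, R2 = block_corr(12), decay_corr(12)
```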
Table 1 shows the empirical sizes across Models 1–3 and Cases I–IV. Overall, all methods maintain sizes close to the nominal level, which indicates good control of the Type I error rate. In Models 1 and 2, the empirical sizes of the CLX and HD methods remain stable around the nominal level. ZU and UJB also perform well, although UJB slightly exceeds the nominal size in some instances. In Model 3, which employs a heavy-tailed t-distribution, the empirical size of UJB continues to align closely with the nominal level. In contrast, the empirical sizes for CLX and HD are more conservative, highlighting the robustness of UJB in the presence of heavy-tailed data.
Table 1.
Empirical sizes for Models 1–3 with Cases I–IV, based on 500 replications.
Under the alternative hypothesis, we introduce a random symmetric matrix with exactly eight nonzero entries. Among these, four entries are randomly selected from the upper triangle of , with magnitudes generated from the uniform distribution on , where is the maximum diagonal value of . The remaining four entries are determined by symmetry. We then define and , where . These matrices, and , are used to generate samples for and under the alternative hypothesis.
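The perturbation described above can be sketched as a random symmetric matrix with exactly eight nonzero off-diagonal entries; the magnitude range used below is a placeholder for the paper's scaling by the maximum diagonal value:

```python
import numpy as np

def sparse_perturbation(d, k=4, scale=1.0, rng=None):
    """Symmetric d x d matrix with exactly 2*k nonzero entries: k positions
    sampled from the upper triangle, mirrored below (magnitudes illustrative)."""
    rng = np.random.default_rng(0) if rng is None else rng
    iu, ju = np.triu_indices(d, k=1)
    pick = rng.choice(len(iu), size=k, replace=False)   # distinct upper-triangle slots
    vals = rng.uniform(0.5 * scale, scale, size=k) * rng.choice([-1.0, 1.0], size=k)
    U = np.zeros((d, d))
    U[iu[pick], ju[pick]] = vals
    return U + U.T                                      # symmetry fills the lower half

U = sparse_perturbation(20)
```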
The empirical power performance across Models 1–3 and Cases I–IV is summarized in Table 2. For Models 1 and 2, which involve Gamma and zero-inflated Poisson distributions, the data quickly approach normality, which allows the CLX and HD methods to perform comparably to UJB and ZU. However, UJB and the rank-based ZU test demonstrate a slight advantage. In Model 3, where the data follow a heavy-tailed t-distribution, UJB and ZU significantly outperform CLX and HD, highlighting the effectiveness of rank-based correlation tests for heavy-tailed distributions. Across all cases, UJB achieves comparable or slightly superior performance to ZU, underscoring the strength and robustness of our proposed method.
Table 2.
Empirical powers for Models 1–3 with Cases I–IV, based on 500 replications.
5. Real Data Analysis
In this section, we apply our proposed method to explore potential dependency differences within the Wnt signaling pathway, which is associated with lung cancer as well as other cancers, including gastric and breast cancer [51,52,53]. The dataset used for this analysis is publicly available through the Gene Expression Omnibus (GEO) repository (https://www.ncbi.nlm.nih.gov/geo/, accessed on 25 November 2024) under accession number GDS2771. It contains 22,283 microarray-derived gene expression measurements from large airway epithelial cells in 97 patients diagnosed with lung cancer and 90 control patients without lung cancer.
In this study, we focus on a subset of 119 genes within the Wnt signaling pathway. As the number of potential dependencies grows quadratically with the number of genes, reliable statistical inference becomes challenging with a limited sample size. To address this, we apply individual t-tests to each gene to assess differential expression between the two groups. This conservative approach reduces dimensionality while retaining potentially relevant genes for further analysis. Ultimately, we select 32 genes with statistically significant expression differences for the dependency analysis. This selected subset includes genes previously identified as important in lung cancer research, such as WNT1, WNT2, WNT5A [51], FZD [54], and RHOA [55]. Based on this subset, we use the testing methods outlined in Section 4 to test for the equality of the correlation matrices of these 32 genes between lung cancer patients and control patients. Given that the CLX and HD methods are conservative for heavy-tailed data, we transformed the gene expression levels by applying a logarithmic transformation and then standardized each gene within its respective group.
All methods reject the null hypothesis, providing strong evidence against equal dependency structures. Specifically, the p-values for our proposed method (UJB) and the comparison methods (CLX and HD) are 0.019, 0.046, and 0.029, respectively. This finding suggests that the dependency structure within the Wnt signaling pathway is likely distinct between lung cancer patients and controls. Thus, beyond differential gene expression, changes in dependency structures among genes in the Wnt pathway may offer new insights into the mechanisms underlying lung cancer progression. For example, ref. [56] shows that WNT5A–RHOA signaling drives tumorigenesis and represents a therapeutic target in small-cell lung cancer. It also highlights the importance of gene dependency structures in the Wnt pathway for understanding lung cancer progression.
6. Discussion
In this paper, we introduce a novel approach that reformulates the testing of U-statistic-type parameters into a sample mean testing problem by using jackknife pseudo-values. This transformation allows us to apply established methods for sample mean testing, simplifying the testing process while maintaining the flexibility and power inherent to U-statistic-based inference. Moreover, we obtain critical values for the test via a multiplier bootstrap and establish the validity of this procedure.
Our simulation study, involving samples from various distributions with different covariance structures, demonstrates the robustness and adaptability of the proposed method. Theoretical analysis further confirms the validity of our approach. However, a notable limitation lies in the computational cost of the jackknife procedure, which increases with the sample size. This can become a significant challenge as the number of samples grows large. Future work could explore strategies to improve computational efficiency or alternative approaches that preserve the statistical accuracy of the test while reducing its computational burden.
In addition to addressing computational challenges, another possible direction for future work lies in addressing size distortions. Some empirical sizes in Table 1 deviate from the nominal significance level, particularly for Model 3. To improve the empirical performance, future work could incorporate methods like simulation-based calibration to better approximate null distributions [57], imputation-based techniques for enhanced accuracy [58], or adjustments to test statistics to address structural dependencies [59].
Author Contributions
Methodology, M.Z.; validation, L.J.; writing—original draft preparation, M.Z.; writing—review and editing, L.J. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (No. 12101412) to M.Z., and the National Philosophy and Social Science Foundation of China (No. 23BTJ046) to L.J.
Data Availability Statement
The real data can be obtained from the Gene Expression Omnibus (GEO) repository (https://www.ncbi.nlm.nih.gov/geo/, accessed on 25 November 2024).
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A. Proof of the Main Theorem
Appendix A.1. Useful Lemmas
To establish the proof of the main results, some useful lemmas are required. Firstly, some additional notations are introduced. Let be independent centered random vectors in with the covariance matrix , and with .
Set to be independent Gaussian random vectors in with the same mean vector and covariance matrix as . Further, set
Define as the infimum over all numbers u such that
Similarly, are the corresponding quantities for the analogous Gaussian case, namely with replaced by in the above definition. According to [48], the following Kolmogorov–Smirnov distance
can be bounded as follows.
Lemma A1
(Gaussian approximation, Theorem 2.2 of [48]). Suppose there exist some positive constants such that for all . For any , we have
where C is a positive constant that only depends on and , and .
For all , set . Then, in (A1) can be bounded by . Hence, by taking , we have the bound for as follows
In addition, suppose and are two centered Gaussian random vectors in , with covariance matrices and , respectively. Ref. [48] bounds the Kolmogorov–Smirnov distance between the maxima of these two Gaussian vectors via the following lemma.
Lemma A2
(Comparison of the distributions of Gaussian maxima, Lemma 3.1 of [48]). Suppose all the diagonal elements of are bounded away from 0 and ∞. Then, we have
where , and C is a positive constant.
Let be independent random vectors, and let be a symmetric kernel function of order m. Define the corresponding U-statistic as follows:
Lemma A3 (Hoeffding, 1963).
If the kernel function is bounded,
where a and b are the lower and upper bounds of the kernel function , respectively, and ⌊n/m⌋ denotes the greatest integer not exceeding n/m.
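As a numerical sanity check of this exponential inequality, in the standard Hoeffding (1963) form with exponent −2⌊n/m⌋t²/(b − a)² (which we assume here), the bound can be compared against simulated tail frequencies for a bounded order-2 kernel; the indicator kernel below is our illustrative choice:

```python
import math
import numpy as np

def u_stat_indicator(x):
    """Order-2 U-statistic with the bounded kernel h(u, v) = 1{u + v > 1}."""
    n = len(x)
    pairs = np.triu(x[:, None] + x[None, :] > 1, k=1)  # upper-triangle pairs only
    return pairs.sum() / (n * (n - 1) / 2)

def hoeffding_u_bound(n, m, t, a=0.0, b=1.0):
    """Hoeffding-type tail bound for a U-statistic with kernel values in [a, b]."""
    return math.exp(-2 * (n // m) * t ** 2 / (b - a) ** 2)

rng = np.random.default_rng(3)
n, m, t = 40, 2, 0.15
theta = 0.5                       # P(U1 + U2 > 1) for independent Unif(0,1) draws
reps = 2000
hits = sum(u_stat_indicator(rng.random(n)) - theta >= t for _ in range(reps))
tail_freq = hits / reps           # empirical tail probability, never above the bound
```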
Lemma A4.
is a random vector with marginal distribution . For any , we have
Since is the m-order symmetric kernel function, and combining the definition in (12), we denote the covariance matrices of and by and , respectively. Further, the corresponding correlation matrices are defined as follows:
Specifically, the -th entry of is given by the following:
The following lemma provides an upper bound for the estimation errors of the sample covariance matrix and its correlation matrix .
Lemma A5.
Suppose the Assumptions 1–4 hold. When are sufficiently large, we have that
Appendix A.2. Proof of Lemma A5
Proof.
To prove Lemma A5, it suffices to show that the upper bound in (A3) holds for both and individually. Given the analogous definitions and assumptions, the second bound follows from the same arguments. Without loss of generality, we will show . According to the definition,
By the definition of , we need to bound the following equation,
By Theorem 6 in [60], and the sub-exponential assumptions for , it is easy to obtain the following upper bound for :
For ,
To bound , we bound and , respectively. For , it is obvious that
Hence, the key to bound is to study . According to the definition of in (4), and with some simple calculations, we have the following:
Next, we introduce
where
with threshold . By the triangle inequality, we have
According to the definition of in (A7), and by the triangle inequality, we have
For any , by choosing proper C, we have . By setting , we have
Using the exponential inequality for bounded U-statistics, we have
By Assumption 2, we also have
Combining these results, we have
Therefore, for a sufficiently large , we have
with probability .
For , using the definition of , we simplify the expression by applying the following Hoeffding decomposition. Specifically,
where , . Similarly, for , we have
Set
According to the definition of , we have
Hence, plugging these into , and performing some calculations, we obtain
By the triangle inequality and the Cauchy–Schwarz inequality, we have
Next, we bound and , respectively. By Assumption 4, it is obvious that is bounded. For ,
Given , can be treated as a symmetric kernel function of zero mean, and is a U-statistic. Analogously, is a U-statistic with a kernel function of zero mean and order m. Using the same technique employed to bound , we can introduce a thresholded kernel and the exponential inequality for U-statistics. Thus, given , for sufficiently large , we have
with probability . Hence, with probability . Furthermore, for sufficiently large ,
with probability . Combining the bound of in (A10), it is straightforward to obtain
Therefore, combining the bound of in (A4), we have
Next, we show the bound for . According to the definition of and the triangle inequality, we have
Hence, we need to bound and , respectively. For , considering , we have
By Assumptions 3 and 4, there are constants b and B such that for . Hence, we have
For , we have
Therefore, we have
By combining the bounding results from (A13) and (A14), we complete the proof of Lemma A5. □
Appendix A.3. Proof of Theorem 1
Proof.
The proof procedures for (13) and (14) are similar. Without loss of generality, we provide the specific proof process for (14).
According to the definition of in (9), we introduce an oracle test statistic with known variances
As the jackknife pseudo-values are asymptotically independent, and by Lemma A1, we have
where is a Gaussian distribution random vector defined as with . Here, , where and are the covariance matrices of and , respectively.
For , with a simple calculation, we have
By Assumptions 3 and 4, there exist positive constants b and B such that . According to Assumption 1, the two sample sizes are of the same order. Thus,
Further, by Lemma A5 we have
Hence, we have . For , arguments similar to those in Lemma A5 apply. By Hoeffding's inequality, we have
Hence, by combining (A16), we have
Appendix A.4. Proof of Theorem 3
Proof.
The proof procedures for (17) and (19) are similar. Without loss of generality, we outline the specific proof process for (19). Following the approach in Theorem 2, we define the oracle critical value from the theoretical distribution ,
The critical value used in the test serves as the bootstrap estimator of . As , we obtain
Therefore, to prove Theorem 3, it suffices to show that the lower bound of approaches 1 as .
First, we establish the lower bound for . As demonstrated in the proof for Theorem 1, given and , we find that follows the standard normal distribution. According to Lemma A4, by setting and , we have
By Theorem 5.8 of [61], it follows that
Thus, we have
Consequently, we have
Next, we focus on establishing the lower bound for L. Under , there exists some , . Set,
By the triangle inequality we have
Define the subset
According to Lemma A5 and (A18), with a probability of at least , we have
Thus, we have the desired bound. Under the alternative hypothesis, we have
Consequently, we have
Considering the signal size requirements in (20), by choosing z satisfying , we have
By the triangle inequality,
Thus, we have
With arguments similar to those used to obtain the upper bound for in the proof of Lemma A5, we have as . Further, we have with probability tending to 1 as , i.e., ; Theorem 3 is thus proved. □
References
- Hu, R.; Qiu, X.; Glazko, G.; Klebanov, L.; Yakovlev, A. Detecting Intergene Correlation Changes in Microarray Analysis: A New Approach to Gene Selection. BMC Bioinform. 2009, 10, 20. [Google Scholar] [CrossRef] [PubMed]
- Hu, R.; Qiu, X.; Glazko, G. A New Gene Selection Procedure Based on the Covariance Distance. Bioinformatics 2010, 26, 348–354. [Google Scholar] [CrossRef] [PubMed]
- Shaw, P.; Greenstein, D.; Lerch, J.; Clasen, L.; Lenroot, R.; Gogtay, N.; Evans, A.; Rapoport, J.; Giedd, J. Intellectual Ability and Cortical Development in Children and Adolescents. Nature 2006, 440, 676–679. [Google Scholar] [CrossRef]
- Shedden, K.; Chen, W.; Kuick, R.; Ghosh, D.; Macdonald, J.; Cho, K.R.; Giordano, T.J.; Gruber, S.B.; Fearon, E.R.; Taylor, J.M.G.; et al. Comparison of Seven Methods for Producing Affymetrix Expression Scores Based on False Discovery Rates in Disease Profiling Data. BMC Bioinform. 2005, 6, 26. [Google Scholar] [CrossRef]
- Dubois, P.C.; Trynka, G.; Franke, L.; Hunt, K.A.; Romanos, J.; Curtotti, A.; Zhernakova, A.; Heap, G.A.R.; Adány, R.; Aromaa, A.; et al. Multiple Common Variants for Celiac Disease Influencing Immune Gene Expression. Nat. Genet. 2010, 42, 295–302. [Google Scholar] [CrossRef]
- Huang, Y.; Li, C.; Li, R.; Yang, S. An Overview of Tests on High-Dimensional Means. J. Multivar. Anal. 2022, 188, 104813. [Google Scholar] [CrossRef]
- Bai, Z.; Saranadasa, H. Effect of High Dimension: By an Example of a Two Sample Problem. Stat. Sin. 1996, 6, 311–329. [Google Scholar]
- Srivastava, M.S.; Du, M. A Test for the Mean Vector with Fewer Observations than the Dimension. J. Multivar. Anal. 2008, 99, 386–402. [Google Scholar] [CrossRef]
- Srivastava, M.S. A Test for the Mean Vector with Fewer Observations than the Dimension Under Non-Normality. J. Multivar. Anal. 2009, 100, 518–532. [Google Scholar] [CrossRef]
- Chen, S.X.; Qin, Y.L. A Two-Sample Test for High-Dimensional Data with Applications to Gene-Set Testing. Ann. Stat. 2010, 38, 808–835. [Google Scholar] [CrossRef]
- Li, H.; Aue, A.; Paul, D.; Peng, J.; Wang, P. An Adaptable Generalization of Hotelling’s T2 Test in High Dimension. Ann. Stat. 2020, 48, 1815–1847. [Google Scholar] [CrossRef]
- Cai, T.T.; Liu, W.; Xia, Y. Two-Sample Test of High Dimensional Means Under Dependence. J. R. Stat. Soc. Ser. B Stat. Methodol. 2014, 76, 349–372. [Google Scholar]
- Chang, J.; Zheng, C.; Zhou, W.X.; Zhou, W. Simulation-Based Hypothesis Testing of High Dimensional Means Under Covariance Heterogeneity. Biometrics 2017, 73, 1300–1310. [Google Scholar] [CrossRef] [PubMed]
- Xue, K.; Yao, F. Distribution and Correlation-Free Two-Sample Test of High-Dimensional Means. Ann. Stat. 2020, 48, 1304–1328. [Google Scholar] [CrossRef]
- Xu, G.; Lin, L.; Wei, P.; Pan, W. An Adaptive Two-Sample Test for High-Dimensional Means. Biometrika 2016, 103, 609–624. [Google Scholar] [CrossRef]
- Zhou, C.; Zhang, X.; Zhou, W.; Liu, H. A Unified Framework for Testing High Dimensional Parameters: A Data-Adaptive Approach. arXiv 2018, arXiv:1808.02648. [Google Scholar]
- Feng, L.; Jiang, T.; Li, X.; Liu, B. Asymptotic independence of the sum and maximum of dependent random variables with applications to high-dimensional tests. Stat. Sin. 2024, 34, 1745–1763. [Google Scholar] [CrossRef]
- Lopes, M.E.; Jacob, L.J.; Wainwright, M.J. A More Powerful Two-Sample Test in High Dimensions Using Random Projection. arXiv 2012, arXiv:1108.2401v2. [Google Scholar]
- Srivastava, R.; Li, P.; Ruppert, D. RAPTT: An Exact Two-Sample Test in High Dimensions Using Random Projections. J. Comput. Graph. Stat. 2016, 25, 954–970.
- Liu, W.; Yu, X.; Zhong, W.; Li, R. Projection Test for Mean Vector in High Dimensions. J. Am. Stat. Assoc. 2024, 119, 744–756.
- Zhou, C.; Kong, X.B. Testing of High Dimensional Mean Vectors via Approximate Factor Model. J. Stat. Plan. Inference 2015, 167, 216–227.
- Zhang, M.; Zhou, C.; He, Y.; Zhang, X. Adaptive Test for Mean Vectors of High-Dimensional Time Series Data with Factor Structure. J. Korean Stat. Soc. 2018, 47, 450–470.
- He, Y.; Zhang, M.; Zhang, X.; Zhou, W. High-Dimensional Two-Sample Mean Vectors Test and Support Recovery with Factor Adjustment. Comput. Stat. Data Anal. 2020, 151, 107004.
- Ma, H.; Feng, L.; Wang, Z.; Bao, J. Adaptive Testing for Alphas in Conditional Factor Models with High Dimensional Assets. J. Bus. Econ. Stat. 2024, 42, 1356–1366.
- Johnstone, I.M. On the Distribution of the Largest Eigenvalue in Principal Components Analysis. Ann. Stat. 2001, 29, 295–327.
- Soshnikov, A. A Note on Universality of the Distribution of the Largest Eigenvalues in Certain Sample Covariance Matrices. J. Stat. Phys. 2002, 108, 1033–1056.
- Péché, S. Universality Results for the Largest Eigenvalues of Some Sample Covariance Matrix Ensembles. Probab. Theory Relat. Fields 2009, 143, 481–516.
- Birke, M.; Dette, H. A Note on Testing the Covariance Matrix for Large Dimension. Stat. Probab. Lett. 2005, 74, 281–289.
- Srivastava, M.S. Some Tests Concerning the Covariance Matrix in High Dimensional Data. J. Jpn. Stat. Soc. 2005, 35, 251–272.
- Chen, S.X.; Zhang, L.X.; Zhong, P.S. Tests for High-Dimensional Covariance Matrices. J. Am. Stat. Assoc. 2010, 105, 810–819.
- Schott, J.R. A Test for the Equality of Covariance Matrices When the Dimension Is Large Relative to the Sample Sizes. Comput. Stat. Data Anal. 2007, 51, 6535–6542.
- Srivastava, M.S.; Yanagihara, H. Testing the Equality of Several Covariance Matrices with Fewer Observations than the Dimension. J. Multivar. Anal. 2010, 101, 1319–1329.
- Li, J.; Chen, S.X. Two Sample Tests for High-Dimensional Covariance Matrices. Ann. Stat. 2012, 40, 908–940.
- Cai, T.T.; Ma, Z. Optimal Hypothesis Testing for High Dimensional Covariance Matrices. Bernoulli 2013, 19, 2359–2388.
- Cai, T.T.; Jiang, T. Limiting Laws of Coherence of Random Matrices with Applications to Testing Covariance Structure and Construction of Compressed Sensing Matrices. Ann. Stat. 2011, 39, 1496–1525.
- Qiu, Y.; Chen, S.X. Test for Bandedness of High-Dimensional Covariance Matrices and Bandwidth Estimation. Ann. Stat. 2012, 40, 1285.
- Cai, T.T.; Zhang, A. Inference for High-Dimensional Differential Correlation Matrices. J. Multivar. Anal. 2016, 143, 107–126.
- Zheng, S.; Cheng, G.; Guo, J.; Zhu, H. Test for High Dimensional Correlation Matrices. Ann. Stat. 2019, 47, 2887–2921.
- Zhou, W. Asymptotic Distribution of the Largest Off-Diagonal Entry of Correlation Matrices. Trans. Am. Math. Soc. 2007, 359, 5345–5363.
- Bao, Z.; Lin, L.C.; Pan, G.; Zhou, W. Spectral Statistics of Large Dimensional Spearman’s Rank Correlation Matrix and Its Application. Ann. Stat. 2015, 43, 2588–2623.
- Han, F.; Chen, S.; Liu, H. Distribution-Free Tests of Independence in High Dimensions. Biometrika 2017, 104, 813–828.
- Zhou, C.; Han, F.; Zhang, X.S.; Liu, H. An Extreme-Value Approach for Testing the Equality of Large U-Statistic Based Correlation Matrices. Bernoulli 2019, 25, 1472–1503.
- He, Y.; Xu, G.; Wu, C.; Pan, W. Asymptotically Independent U-Statistics in High-Dimensional Testing. Ann. Stat. 2021, 49, 154.
- Chen, X. Gaussian and Bootstrap Approximations for High-Dimensional U-Statistics and Their Applications. Ann. Stat. 2018, 46, 642–678.
- Tukey, J.W. Bias and Confidence in Not Quite Large Samples. Ann. Math. Stat. 1958, 29, 614.
- Quenouille, M.H. Notes on Bias in Estimation. Biometrika 1956, 43, 353–360.
- Shi, X. The Approximate Independence of Jackknife Pseudo-Values and the Bootstrap Methods. J. Wuhan Univ. Hydraul. Electr. Eng. 1984, 2, 83–90.
- Chernozhukov, V.; Chetverikov, D.; Kato, K. Gaussian Approximations and Multiplier Bootstrap for Maxima of Sums of High-Dimensional Random Vectors. Ann. Stat. 2013, 41, 2786–2819.
- Cai, T.; Liu, W.; Xia, Y. Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings. J. Am. Stat. Assoc. 2013, 108, 265–277.
- Chang, J.; Zhou, W.; Zhou, W.X.; Wang, L. Comparing Large Covariance Matrices Under Weak Conditions on the Dependence Structure and Its Application to Gene Clustering. Biometrics 2017, 73, 31–41.
- Mazieres, J.; He, B.; You, L.; Xu, Z.; Jablons, D.M. Wnt Signaling in Lung Cancer. Cancer Lett. 2005, 222, 1–10.
- Clements, W.M.; Wang, J.; Sarnaik, A.; Kim, O.J.; Macdonald, J.; Fenoglio-Preiser, C.; Groden, J.; Lowy, A.M. Beta-Catenin Mutation Is a Frequent Cause of Wnt Pathway Activation in Gastric Cancer. Cancer Res. 2002, 62, 3503–3506.
- Howe, L.R.; Brown, A.M. Wnt Signaling and Breast Cancer. Cancer Biol. Ther. 2004, 3, 36–41.
- Corda, G.; Sala, A. Non-Canonical WNT/PCP Signalling in Cancer: Fzd6 Takes Centre Stage. Oncogenesis 2017, 6, e364.
- Rapp, J.; Jaromi, L.; Kvell, K.; Miskei, G.; Pongracz, J.E. WNT Signaling – Lung Cancer Is No Exception. Respir. Res. 2017, 18, 167.
- Kim, K.B.; Kim, D.W.; Kim, Y.; Tang, J.; Kirk, N.; Gan, Y.; Kim, B.; Fang, B.; Park, J.I.; Zheng, Y.; et al. WNT5A–RHOA Signaling Is a Driver of Tumorigenesis and Represents a Therapeutically Actionable Vulnerability in Small Cell Lung Cancer. Cancer Res. 2022, 82, 4219–4233.
- Lloyd, C.J. Estimating Test Power Adjusted for Size. J. Stat. Comput. Simul. 2005, 75, 921–933.
- Cuparić, M.; Milošević, B. To Impute or to Adapt? Model Specification Tests’ Perspective. Stat. Pap. 2024, 65, 1021–1039.
- Papadimitriou, C.K.; Meintanis, S.G.; Andrade, B.B.; Tsionas, M.G. Specification Tests for Normal/Gamma and Stable/Gamma Stochastic Frontier Models Based on Empirical Transforms. Econom. Stat. 2024, in press.
- Delaigle, A.; Hall, P.; Jin, J. Robustness and Accuracy of Methods for High Dimensional Data Analysis Based on Student’s T-Statistic. J. R. Stat. Soc. Ser. B Stat. Methodol. 2011, 73, 283–301.
- Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).