Consistency of Restricted Maximum Likelihood Estimators in High-Dimensional Kernel Linear Mixed-Effects Models with Applications in Estimating Genetic Heritability

Shen, Xiaoxi; Lu, Qing

doi:10.3390/math13152363

Open AccessArticle

Consistency of Restricted Maximum Likelihood Estimators in High-Dimensional Kernel Linear Mixed-Effects Models with Applications in Estimating Genetic Heritability

by

Xiaoxi Shen

¹

and

Qing Lu

^2,*

¹

Department of Mathematics, Texas State University, San Marcos, TX 78666, USA

²

Department of Biostatistics, University of Florida, Gainesville, FL 32611, USA

^*

Author to whom correspondence should be addressed.

Mathematics 2025, 13(15), 2363; https://doi.org/10.3390/math13152363

Submission received: 12 June 2025 / Revised: 16 July 2025 / Accepted: 20 July 2025 / Published: 23 July 2025

(This article belongs to the Special Issue Statistics: Theories and Applications)

Download

Browse Figures

Versions Notes

Abstract

Restricted maximum likelihood (REML) estimators are commonly used to obtain unbiased estimators for the variance components in linear mixed models. In modern applications, particularly in genomic studies, the dimension of the design matrix with respect to the random effects can be high. Motivated by this, we first introduce high-dimensional kernel linear mixed models, derive the REML equations, and establish theoretical results on the consistency of REML estimators for several commonly used kernel matrices. The validity of the theories is demonstrated via simulation studies. Our results provide rigorous justification for the consistency of REML estimators in high-dimensional kernel linear mixed models and offer insights into the application of estimating genetic heritability. Finally, we apply the kernel linear mixed models to estimate genetic heritability in a real-world data application.

Keywords:

mixed effects models; restricted maximum likelihood; kernel matrix; random matrix theory

MSC:

62F12

1. Introduction

Heritability, particularly narrow-sense heritability, refers to the proportion of phenotypic variance explained by additive genetic variance. For example, it is well-known that the estimated heritability of human height is roughly 0.8 [1,2]. However, early identified genes could only explain a small fraction of height heritability. More recently, large-scale genome-wide association studies (GWASs) involving 5.4 million individuals have identified 12,111 height-related single-nucleotide polymorphisms (SNPs). It has been shown that common SNPs can explain 40–50% of the phenotypic variation in human height [3]. Therefore, the previously observed “missing heritability” can be attributed to the small effect sizes of individual SNPs. Furthermore, because a classical GWAS relies on linear models and multiple testing procedures, a SNP must meet stringent criteria at the whole-genome level (e.g., p-value <

10^{- 8}

) to be considered significant. As a result, many SNPs with small yet meaningful effects remain undetectable using classical GWAS methods.

The idea that heritability is spread across multiple SNPs that contribute to phenotypic traits, also known as polygenicity, has been widely accepted over the past few decades. As a result, rather than estimating the individual effect of each SNP, state-of-the-art models in GWASs estimate the cumulative effect of multiple SNPs in a genetic region, such as all SNPs in a gene, a pathway, or even a chromosome. For quantitative traits, linear mixed-effects models are commonly used to model the cumulative effect of multiple SNPs. It was shown that, using a linear mixed model, known as genome-wide complex trait analysis (GCTA) [4], 45% of the variance in human height could be explained by all common SNPs [5]. This finding has since been supported by the large-scale GWAS mentioned above.

Linear mixed-effects models (LMMs) provide powerful alternatives in GWASs, and various studies have explored LMMs in the high-dimensional setting. In Li et al. [6], the authors proposed inference methods for linear mixed models using techniques similar to the debiased LASSO method [7,8], achieving an estimator of the fixed effects with the optimal convergence rate and valid inferential procedures. However, their framework assumes that the cluster size of the random effects is large, which is generally not true in genetic studies where linear mixed-effects models are commonly applied. Similarly, Law and Ritov [9] proposed methods for constructing asymptotic confidence intervals and hypothesis tests for the significance of random effects. In both studies, however, high dimensionality resides in the fixed effects, while the dimension of the random effects is assumed to be low. This limits their applicability to genetic studies, where the SNP matrix is often treated as the design matrix for random effects [4,10].

Therefore, it is of greater interest to investigate the properties of variance component estimators when the random effects are high-dimensional. Jiang et al. [11] provided theoretical guarantees for the consistency of the restricted maximum likelihood (REML) estimator under the setting where the random-effects design matrix is high-dimensional and random in the so-called linear regime (

n / p \to τ

for some

0 < τ \leq 1

). These techniques have since been extended to construct confidence intervals for variance components [12] and to establish the asymptotic normality of heritability estimators based on GWAS summary statistics [13].

Currently, the most commonly used LMMs in GWASs, as studied in Jiang et al. [11], for a polygenetic quantitative trait can be expressed via the following additive model:

y_{i} = μ_{i} + \sum_{j \in C} G_{i j} u_{j} + ϵ_{i}, i = 1, \dots, n,

(1)

where

μ_{i}

is the fixed effect of the ith individual, which is typically of the form

μ_{i} = x_{i}^{T} β

, and

x_{i}

are covariates (e.g., age, gender, and race group) of the ith individual;

C

is a set of SNPs; and

G_{i j}

is the genotype of the jth SNP of the ith individual.

u_{j}

is the effect of the jth SNP, which typically follows a normal distribution

u_{1}, \dots, u_{| C |} \sim i . i . d . N (0, σ_{u}^{2})

with

| C |

being the cardinality of the set

C

.

ϵ_{i}

is the random error or other possible environmental effects, which are typically modeled by a normal distribution

ϵ_{1}, \dots, ϵ_{n} \sim i . i . d . N (0, σ_{ϵ}^{2})

. By writing

g_{i} = \sum_{j \in C} G_{i j} u_{j}

, model (1) can be written as

y_{i} = μ_{i} + g_{i} + ϵ_{i}, i = 1, \dots, n,

or in a vector form

y = μ + g + ϵ,

(2)

where

y = {[y_{1}, \dots, y_{n}]}^{T}

,

μ = {[μ_{1}, \dots, μ_{n}]}^{T}

, and

ϵ = {[ϵ_{1}, \dots, ϵ_{n}]}^{T}

. Under the distributional assumptions in Model (1), the distributions of

g

and

ϵ

are

g \sim N_{n} (0, σ_{u}^{2} {GG}^{T}), ϵ \sim N_{n} (0, σ_{ϵ}^{2} I_{n}),

(3)

where the matrix

G

is an

n \times | C |

-matrix containing SNPs in region

C

.

An underlying assumption in Model (1) or (2) is that the genetic effect associated with the quantitative trait of interest is linear. This may not be a reasonable assumption in reality, as generally the relationships between genotypes and phenotypes are rather complex. Therefore, the following semi-parametric models could be more appropriate.

y_{i} = μ_{i} + h (G_{i 1}, \dots, G_{i | C |}) + ϵ_{i}, i = 1, \dots, n,

(4)

where

h \in H

and

H

is some function space. In particular, let

H

be a reproducing kernel Hilbert space (RKHS) associated with a kernel function

K (\cdot, \cdot)

. The representer theorem [14] shows that when

μ_{i} = x_{i}^{T} β

, the estimating equation for

β

and

h = {[h (G_{11}, \dots, G_{1 | C |}), \dots, h (G_{n 1}, \dots, G_{n | C |})]}^{T}

under penalized least squares is equivalent to the Henderson mixed model equation, which is often used to obtain the BLUP in a linear mixed model. In this case, the linear mixed model is given as

y = X β + a + ϵ,

(5)

where

a \sim N_{n} (0, σ_{a}^{2} K)

with

K

being a kernel matrix,

ϵ \in N_{n} (0, σ_{ϵ}^{2} I_{n})

,

a

and

ϵ

are independent.

In this paper, we study the consistency of the REML estimator for variance components in a high-dimensional kernel linear mixed-effects model. Such models have also been widely used in spatial statistics [15], as well as in identifying significant genetic variants [10,16] and predicting genetic risk for diseases [17]. The results established in this paper extend the work by Jiang et al. [11] from linear kernels to other commonly used kernels that capture nonlinear genetic effects, thereby providing theoretical justification for the successful application of kernel LMMs in genetic studies.

The rest of the paper is organized as follows: Section 2 provides the formulation of the REML equations for kernel linear mixed-effects models and the main consistency results for the REML estimators. The main results rely heavily on random matrix theories (RMTs), and a brief review of the RMTs is also given in Section 2. Simulation studies are given in Section 3 to verify the theoretical results, followed by a real-world data application.

Notations : Throughout this paper, bold alphabetic or Greek letters are used for vectors, while bold capital alphabetic or Greek letters are used for matrices. For an

n \times n

matrix

A, B

,

A \geq 0

means that

A

is positive and semi-definite, while

A \geq B

means that

A - B \geq 0

. For a vector

x

,

| x | = \sqrt{\sum_{i = 1}^{n} x_{i}^{2}}

denotes its Euclidean norm, and we use

∥A∥

to denote the operator norm of a matrix

A

; that is,

∥A∥ = \sqrt{λ_{max} (A^{T} A)}

. When

A

is symmetric and positive semi-definite, this is the same as the largest eigenvalue of

A

.

\det (A)

is used to denote the determinant of a squared matrix

A

. For two matrices

A, B

of the same size,

A \circ B

denotes the Hadamard product; i.e.,

{[A \circ B]}_{i j} = A_{i j} B_{i j}

. Moreover, if

f : R \to R

is a univariate function and

A

is a matrix, then

f [A]

denotes a matrix of the same size as

A

and

{[f (A)]}_{i j} = f (A_{i j})

.

2. Materials and Methods

2.1. The REML Equations for Variance Components

Let q be the rank of the fixed-effects design matrix

X

and

A \in R^{n \times (n - q)}

such that

A^{T} X = 0

and

A^{T} A = I_{n - q}

. Then, by multiplying

A^{T}

on both sides of Equation (5), we obtain

\tilde{y} = \tilde{a} + \tilde{ϵ},

(6)

where

\tilde{y} = A^{T} y

and

\begin{matrix} \tilde{a} = A^{T} a & \sim N_{n} (0, σ_{a}^{2} A^{T} KA), \\ \tilde{ϵ} = A^{T} ϵ & \sim N_{n} (0, σ_{ϵ}^{2} I_{n - q}) . \end{matrix}

Let

f (\tilde{y} | θ)

be the probability density function of

\tilde{y}

. Then the restricted log-likelihood of

θ : = {[σ_{a}^{2}, σ_{ϵ}^{2}]}^{T}

is

\begin{matrix} ℓ_{r} (θ) & = log f (\tilde{y} | θ) \propto - \frac{1}{2} log \det (σ_{a}^{2} A^{T} KA + σ_{ϵ}^{2} I_{n - q}) - \frac{1}{2} {\tilde{y}}^{T} {(σ_{a}^{2} A^{T} KA + σ_{ϵ}^{2} I_{n - q})}^{- 1} \tilde{y} \\ = - \frac{1}{2} log \det (σ_{ϵ}^{2} (γ A^{T} KA + I_{n - q})) - \frac{1}{2 σ_{ϵ}^{2}} {\tilde{y}}^{T} {(γ A^{T} KA + I_{n - q})}^{- 1} \tilde{y} \\ = - \frac{n - q}{2} log σ_{ϵ}^{2} - \frac{1}{2} log \det (γ A^{T} KA + I_{n - q}) - \frac{1}{2 σ_{ϵ}^{2}} {\tilde{y}}^{T} {(γ A^{T} KA + I_{n - q})}^{- 1} \tilde{y}, \end{matrix}

where

γ = σ_{a}^{2} / σ_{ϵ}^{2}

is the ratio between two variance components. With slight abuse of notation, let

θ = {[γ, σ_{ϵ}^{2}]}^{T}

in what follows.

A well-known result for REML estimators is that they are independent of the choice of

A

, which is based on the following identity [18]

A {(γ A^{T} KA + I_{n - q})}^{- 1} A^{T} = V_{γ}^{- 1} - V_{γ}^{- 1} X {(X^{T} V_{γ}^{- 1} X)}^{- 1} X^{T} V_{γ}^{- 1},

where

V_{γ} = I_{n} + γ K

. Also, let

Σ_{γ} = I_{n - q} + γ A^{T} KA

, then

P_{γ} : = A Σ_{γ}^{- 1} A^{T} = V_{γ}^{- 1} - V_{γ}^{- 1} X {(X^{T} V_{γ}^{- 1} X)}^{- 1} X^{T} V_{γ}^{- 1} .

Therefore,

{\tilde{y}}^{T} {(γ A^{T} KA + I_{n - q})}^{- 1} \tilde{y} = y^{T} A Σ_{γ}^{- 1} A^{T} y = y^{T} P_{γ} y

, and the restricted log-likelihood function can be written as

ℓ_{r} (θ) \propto - \frac{n - q}{2} log σ_{ϵ}^{2} - \frac{1}{2} log \det (Σ_{γ}) - \frac{1}{2 σ_{ϵ}^{2}} y^{T} P_{γ} y .

(7)

By taking the derivatives with respect to

γ

and

σ_{ϵ}^{2}

and letting them equal to zero, we know that the REML estimators of

γ

and

σ_{ϵ}^{2}

satisfy the following REML equations:

\{\begin{matrix} {\hat{σ}}_{ϵ}^{2} = \frac{1}{n - q} y^{T} P_{γ} y \\ \frac{y^{T} P_{γ} {KP}_{γ} y}{tr (P_{γ} K)} = \frac{y^{T} P_{γ} y}{n - q} \end{matrix}

(8)

2.2. A Review on Random Matrix Theory

In this section, we briefly review the theory of random matrices. For more details on the application of random matrix theory in statistics, we refer interested readers to Bai and Silverstein [19] and Paul and Aue [20]. Let

A

be an

m \times m

matrix with eigenvalues

λ_{j}, j = 1, 2, \dots, m

. The empirical spectral distribution (ESD) of the matrix

A

is defined to be

F_{m}^{A} (x) = \frac{1}{m} \sum_{j = 1}^{m} I_{{λ_{j} \leq x}}, x \in R .

For a double array of i.i.d. random variables

{g_{j k} : j, k = 1, 2, \dots}

with a mean of 0 and a variance of

σ^{2}

, write

G = [g_{1}, \dots, g_{p}] = [\begin{matrix} {g^{'}}_{1}^{T} \\ ⋮ \\ {g^{'}}_{n}^{T} \end{matrix}]

with

g_{j} = {(g_{1 j}, \dots, g_{n j})}^{T}

and

{g^{'}}_{j} = {(g_{j 1}, \dots, g_{j p})}^{T}

. It is well known that the ESD of the sample covariance matrix

S = \frac{1}{p} {GG}^{T}

converges almost surely in distribution to the Marchenko–Pastur (M-P) law.

Theorem 1.

(Marchenko–Pastur Law). Suppose that

{g_{i j}}

are i.i.d. random variables with a mean of 0 and a variance of

σ^{2}

. Also assume that

n / p \to τ \in (0, \infty)

as

n, p \to \infty

. Then, with a probability of 1,

F^{S}

tends toward the M-P law having density

φ_{τ} (x) = \{\begin{matrix} {(2 π σ^{2} τ x)}^{- 1} \sqrt{(b_{+} (τ) - x) (x - b_{-} (τ))} & if x \in [b_{-} (τ), b_{+} (τ)] \\ 0 & otherwise . \end{matrix},

and has a point mass

1 - 1 / τ

at the origin if

τ > 1

, where

b_{\pm} (τ) = σ^{2} {(1 \pm \sqrt{τ})}^{2}

.

The following corollary is by Jiang et al. [11], which is a consequence of convergence in distribution and will be frequently used.

Corollary 1.

Under the assumptions of Theorem 1, for the integer l,

n^{- 1} tr (S^{l}) \to \int_{b_{-} (τ)}^{b_{+} (τ)} x^{l} φ_{τ} (x) d x, as p \to \infty .

In general, the assumptions on i.i.d. can be reduced. The following theorem is Theorem 4.3 in Bai and Silverstein [19].

Theorem 2.

Suppose that the entries of

G \in R^{n \times p}

are random variables that are independent of each n and identically distributed for all n and satisfy

E [| g_{11} - E [g_{11}] |^{2}] = 1

. Also assume that

T = diag {τ_{1}, \dots, τ_{p}}

,

τ_{i}

is real, and that the empirical distribution function of

{τ_{1}, \dots, τ_{p}}

converges a.s. to a probability distribution function H as

n \to \infty

. The entries of both

G

and

T

may depend on n. Set

B = A + \frac{1}{n} G T G^{T},

where

A \in R^{n \times n}

is symmetric and satisfies

F^{A} \to H

almost surely, where H is a distribution (possibly defective) on the real line. Assume that

G

,

T

, and

A

are independent. When

p / n \to y > 0

as

n \to \infty

, then the ESD of

B

and

F^{B}

converges almost surely as

n \to \infty

to a nonrandom distribution function F.

The limiting spectral behaviors of

p^{- 1} {GG}^{T}

have not only been studied, but the limiting spectral behaviors of inner-product kernel matrices

f [p^{- 1} {GG}^{T}]

have also been studied. Here we quote Theorem 2.1 in El Karoui [21].

Theorem 3.

(Spectrum of Inner-Product Kernel Random Matrices). Let us assume that we observe n i.i.d. random vectors

{g^{'}}_{i} \in R^{p}

. Consider the kernel matrix

K

with entries

K_{i j} = f (\frac{{g^{'}}_{i}^{T} g_{j}^{'}}{p}) .

Assume that

1.: $n ≍ p$ ; that is, $n / p$ and $p / n$ remain bounded as $p \to \infty$ .
2.: $Σ_{p}$ is a positive definite $p \times p$ matrix and $∥Σ_{p}∥$ remains bounded in p; that is, there exists $C > 0$ such that $∥Σ_{p}∥ \leq C$ for all p.
3.: $Σ_{p} / p$ has a finite limit; that is, there exists $l \in R$ such that ${lim}_{p \to \infty} tr (Σ_{p}) / p = l$ .
4.: ${g^{'}}_{i} = Σ_{p}^{1 / 2} Γ_{i}$ .
5.: The entries of $Γ_{i}$ , a p-dimensional random vector, are i.i.d. Also, as denoted by $Γ_{i} (k)$ , the kth entry of $Γ_{i}$ , we assume that $E [Γ_{i} (k)] = 0$ , $Var [Γ_{i} (k)] = 1$ , and $E [| Γ_{i} (k) |^{4 + η}] < \infty$ for some $η > 0$ .
6.: f is a $C^{1}$ function in the neighborhood of $l = {lim}_{p \to \infty} tr (Σ_{p}) / p$ and a $C^{3}$ function in the neighborhood of 0.

Under these assumptions, the kernel matrix

K

can (in probability) be approximated consistently in the operator norm when p and n tend to ∞ by the matrix

M

, where

M = (f (0) + f^{″} (0) \frac{tr (Σ_{p}^{2})}{2 p^{2}}) 1_{n} 1_{n}^{T} + f^{'} (0) \frac{{GG}^{T}}{p} + v_{p} I_{n},

(9)

where

v_{p} = f (\frac{tr (Σ_{p})}{p}) - f (0) - f^{'} (0) \frac{tr (Σ_{p})}{p} .

In other words,

∥K - M∥ \overset{p}{\to} 0, when p \to \infty .

El Karoui [21] also studied the behaviors of the spectrum of Euclidean distance kernel random matrices or, in other words, radial basis kernel matrices. The following result is from Theorem 2.2 from El Karoui [21].

Theorem 4.

(Spectrum of Euclidean Distance Kernel Random Matrices). Consider the

n \times n

kernel matrix

K

with entries

K_{i j} = f (\frac{∥g_{i}^{'} - g_{j}^{'}∥}{p}) .

Let us call

τ = 2 \frac{tr (Σ_{p})}{p} .

Let us call ψ the vector with the ith entry

ψ_{i} = {∥g_{i}^{'}∥}_{2}^{2} / p - tr (Σ_{p}) / p

. Suppose that the assumptions of Theorem 3 hold, but that conditions 5 and 6 are replaced by

(5’): The entries of $Γ_{i}$ , a p-dimensional random vector, are i.i.d. Also, as denoted by $Γ_{i} (k)$ the kth entry of $Γ_{i}$ , we assume that $E [Γ_{i} (k)] = 0$ , $Var [Γ_{i} (k)] = 1$ , and $E [| Γ_{i} (k) |^{5 + η}] < \infty$ for some $η > 0$ .
(6’): f is $C^{3}$ in a neighborhood of τ.

Then

K

can be approximated consistently in the operator norm (and in probability) by the matrix

M

, which is defined by

\begin{matrix} M & = f (τ) 1_{n} 1_{n}^{T} + f^{'} (τ) [1_{n} ψ^{T} + ψ 1_{n}^{T} - 2 \frac{{GG}^{T}}{p}] + \\ \frac{f^{″} (τ)}{2} [1_{n} {(ψ \circ ψ)}^{T} + (ψ \circ ψ) 1_{n}^{T} + 2 ψ ψ^{T} + 4 \frac{tr (Σ_{p}^{2})}{p^{2}} 1_{n} 1_{n}^{T}] + v_{p} I_{n}, \end{matrix}

(10)

where

v_{p} = f (0) + τ f^{'} (τ) - f (τ)

. In other words,

∥K - M∥ \overset{p}{\to} 0, when p \to \infty .

The following lemma provides a useful bound for the variance of a quadratic form, and its proof is given in Appendix A.

Lemma 1.

Suppose that

y = μ + e

, with μ being a fixed vector;

E [e] = 0

;

E [{ee}^{T}] = Σ

; and

E [| e |^{4}] < \infty

. Then

Var [y^{T} Γ y] \leq 2 Var [e^{T} Γ e] + 8 μ^{T} Γ^{T} Σ Γ μ .

2.3. Consistency of REML Estimators in Kernel LMMs

In this section, we will show the main results on the consistency of REML estimators for kernel linear mixed models under three commonly used classes of kernel matrices in genetic data analysis. The first one is the weighted product kernel, which is a natural generalization of the product kernel and is the kernel matrix used in a sequence kernel association test (SKAT) [10] in genetic association studies. The other two classes are the inner-product-based kernel matrices and the Euclidean distance-based kernel matrices. An example of an inner-product-based kernel is the polynomial kernel, and an example of a Euclidean distance-based kernel is a Gaussian kernel. Proofs of all results in this section can be found in Appendix A.

2.3.1. Weighted Product Kernel

The following theorem serves as a natural extension of the result by Jiang et al. [11] to the case of a weighted product kernel. It guarantees the consistency of REML estimators of the variance components in kernel linear mixed models when the entries in

G

are i.i.d. and have a finite fourth moment, and the weights in the diagonal matrix

W

are upper and lower bounded.

Theorem 5.

For

K = p^{- 1} {GWG}^{T}

with entries in

G \in R^{n \times p}

and

W \in R^{p \times p}

,

1.: $G_{i j}$ are i.i.d. with $E [G_{i j}] = 0$ , $E [G_{i j}^{2}] = 1$ , and $E [G_{i j}^{4}] < \infty$ ;
2.: $W = diag {w_{1}, \dots, w_{p}}$ with

$max_{1 \leq i \leq p} | w_{i} | \leq 1, min_{1 \leq i \leq p} | w_{i} | \geq δ,$

(11)

for some $δ > 0$ .

Then as

n / p \to τ \in (0, 1)

and

n, p \to \infty

, we have

\begin{matrix} \hat{γ} \overset{p}{\to} γ_{0}, {\hat{σ}}_{ϵ}^{2} \overset{p}{\to} σ_{ϵ}^{2}, \end{matrix}

where

γ_{0}

is the true value for γ.

In addition, the same result still holds when

G

is replaced by its column standardized version

\tilde{G}

as given in the following corollary. The proofs of the theorem and the following corollary are very similar to the proofs in Jiang et al. [11], and they can be found in Appendix A.

Corollary 2.

Let

G

be a random matrix whose elements are i.i.d. with

E [G_{i j}] = 0

,

E [G_{i j}^{2}] = 1

, and

E [G_{i j}^{4}] < \infty

. Let

\tilde{G}

be the matrix that is obtained by column standardizing the matrix

G

; i.e.,

\tilde{G} = (G - \bar{g} \otimes 1_{n}) D_{s}^{- 1},

where

\bar{g} = [{\bar{g}}_{1}, \dots, {\bar{g}}_{p}]

with

{\bar{g}}_{j} = n^{- 1} \sum_{i = 1}^{n} G_{i j}

,

j = 1, \dots, p

, and

D_{s} = diag {s_{1}, \dots, s_{p}}

with

s_{j}^{2} = {(n - 1)}^{- 1} \sum_{i = 1}^{n} {(G_{i j} - {\bar{g}}_{j})}^{2}

. Let

W \in R^{p \times p}

be the weight matrix satisfying Equation (11). Then for

\tilde{K} = p^{- 1} \tilde{G} W {\tilde{G}}^{T}

, as

n / p \to τ \in (0, 1)

and

n, p \to \infty

, we have

\begin{matrix} \hat{γ} \overset{p}{\to} γ_{0}, {\hat{σ}}_{ϵ}^{2} \overset{p}{\to} σ_{ϵ}^{2}, \end{matrix}

where

γ_{0}

is the true value for γ.

The i.i.d. assumption on the entries of

G

in Theorem 5 and Corollary 2 may limit their applicability in genetic studies, where SNPs are often correlated due to linkage disequilibrium (LD). However, it is worth noting that the correlation among SNPs does not necessarily hinder the application of these results. This is because the theoretical guarantees rely on the validity of the Marchenko–Pastur law, which was originally established for random matrices with i.i.d. entries having a mean of zero and unit variance. Importantly, the i.i.d. assumption on the entries can be relaxed: the Marchenko–Pastur law still holds for random matrices with i.i.d. rows, as long as each row has a mean of zero and unit variance and satisfies a light tail condition (e.g., sub-Gaussian) [22]. Therefore, as long as the SNP vectors from each individual are i.i.d., we expect that the results in Theorem 5 and Corollary 2 remain valid.

2.3.2. Inner-Product Kernel Matrices

Inner-product kernels, including the linear kernel and the polynomial kernel, are commonly used in practice. Generally, an inner-product-based kernel matrix has the form

K = f [p^{- 1} {GG}^{T}]

. The first thing that needs to be addressed is the positive definiteness of

K

. In fact, based on Hiai [23], a necessary and sufficient condition for

K

to be positive definite is when f is real analytic and

f^{(k)} (0) \geq 0

for all

k \geq 0

. In this subsection, we will implicitly assume that this condition is satisfied.

Theorem 6.

For

K = f [p^{- 1} {GG}^{T}]

with entries in

G \in R^{n \times p}

and the function f,

1.: $n ≍ p$ .
2.: $G_{i j}$ are i.i.d. with $E [G_{i j}] = 0$ , $E [G_{i j}^{2}] = 1$ , and $E [| G_{i j} |^{4 + η}] < \infty$ for some $η > 0$ .
3.: f is a $C^{1}$ function in the neighborhood of 1 and a $C^{3}$ function in the neighborhood of 0.

Then as

n / p \to τ \in (0, 1)

as

n, p \to \infty

, we have

\hat{γ} \overset{p}{\to} γ_{0}, {\hat{σ}}_{ϵ}^{2} \overset{p}{\to} σ_{ϵ}^{2},

where

γ_{0}

is the true value for γ.

2.3.3. Euclidean Distance Kernel Matrices

Another commonly used kernels in practice are the Euclidean distance-based kernels, such as the Gaussian kernel. Similar to the case in the inner-product kernel matrices, the positive definiteness of

K

is the first issue to be addressed. According to Wendland [24], f being completely monotone is a necessary and sufficient condition for a Euclidean distance kernel matrix to be positive semi-definite. In other words, the function f needs to satisfy

f \in C^{\infty} ([0, \infty)])

and

{(- 1)}^{k} f^{(k)} (r) \geq 0

for all

r > 0

and

k = 0, 1, 2, \dots

. Throughout Section 2.3.3, it will be implicitly assumed that f is completely monotone.

Theorem 7.

Let

K

be a kernel matrix with entries

K_{i j} = f (p^{- 1} {∥g_{i}^{'} - g_{j}^{'}∥}_{2}^{2})

. Suppose that the entries in

G \in R^{n \times p}

and the function f satisfying

1.: $n ≍ p$ .
2.: $G_{i j}$ are i.i.d. with $E [G_{i j}] = 0$ , $E [G_{i j}^{2}] = 1$ and $E [| G_{i j} |^{5 + η}] < \infty$ for some $η > 0$ .
3.: f is a $C^{3}$ function in a neighborhood of 2.

Then as

n / p \to τ \in (0, 1)

and

n, p \to \infty

, we have

\hat{γ} \overset{p}{\to} γ_{0}, {\hat{σ}}_{ϵ}^{2} \overset{p}{\to} σ_{ϵ}^{2},

where

γ_{0}

is the true value for γ.

Remark 1.

Both Theorems 6 and 7 rely on the approximation of inner-product kernel matrices or Euclidean distance kernel matrices by linear kernels, identity matrices, and some low-rank matrices. Based on the proofs in El Karoui [21], the approximation rate for inner-product kernel matrices is

o_{p} (p^{- δ / 2})

for some

δ < \frac{1}{2}

, while for a Gaussian kernel with entries with

exp {- {∥ζ_{i} - ζ_{j}∥}^{2} / 2}

, the approximation rate is

n exp {- p^{1 / 2 + 2 / m + δ}}

with

p^{1 / 2 + 2 / m + δ}

being the rate that

tr (Var [ζ])

grows to infinity, and this is an extremely fast rate.

3. Results

3.1. Simulation Studies

3.1.1. Weighted Product Kernel

We follow an idea similar to that by Jiang et al. [11]; we simulated the genotype for each SNP. Specifically, the minor allele frequencies (MAFs) for p SNPs

{f_{1}, f_{2}, \dots, f_{p}}

were generated from the uniform distribution

Unif [0.05, 0.5]

, where

f_{j}

is the allele for the jth SNP. In the simulation, the number of SNPs is set to

p = 2 n

. For the weight matrix

W

, the jth element

w_{j}

is defined to be

w_{j} = \frac{- {log}_{10} f_{j}}{{max}_{1 \leq j \leq p} - {log}_{10} f_{j}} .

The logarithm of MAF, which was defined in Li et al. [25], is one of the commonly used weights to detect effects from common variants and also takes the contributions from rare variants into account. To simulate the genotype matrix

\tilde{G} \in {0, 1, 2}^{n \times p}

, the Hardy–Weinberg equilibrium was assumed. Specifically, for the j SNP, the genotype value of each individual was sampled from

{0, 1, 2}

according to the probabilities

{(1 - f_{j})}^{2}, 2 f_{j} (1 - f_{j})

, and

f_{j}^{2}

, respectively. Given the simulated genotypes, column standardization was applied to each column in

\tilde{G}

, and the resulting genotype matrix was denoted as

G

. The weighted kernel matrix is defined as

K = p^{- 1} {GWG}^{T} .

The responses are simulated based on the following equation:

y = X β + a + ϵ,

(12)

where the fixed-effect design matrix

X = [1_{n}, \tilde{X}] \in R^{n \times 3}

and the elements in

\tilde{X}

were generated from a standard normal distribution. We set

β = {[1, 2, - 1]}^{T}

. The random effect

a \sim N_{n} (0, 0.6 K)

, and the random noise

ϵ \sim N_{n} (0, 2 I_{n})

. The choice of the variance component for the random effects is the same as in Jiang et al. [11] under their dense scenario. On the other hand, genetic data is often noisy, and the signal-to-noise ratio is low. To mimic the real situation, we set the variance of the random error to be two. In the simulation, we selected 100, 200, 400, 600, 800, and 1000 as the sample sizes. A total of 1000 Monte Carlo replications were conducted to evaluate the performance of the REML estimators for the variance components in the simulation model.

Figure 1 provides the boxplots for the REML estimators of the variance components in the simulation model (12). As shown from both panels, the REML estimators converge to the true values, which justifies the theoretical results developed in the previous sections. The mean and standard deviation of the REML estimators of the variance components are given in Table 1.

3.1.2. Inner Product Kernel

We followed the same procedures mentioned in Section 3.1.1 to simulate the genotype matrix

G

, and Equation (12) was used to generate the response, except that in this case the kernel matrix is

K = {(1 + p^{- 1} {GG}^{T})}^{2}

, with the polynomial kernel having an order of 2. In the expression, power means the elementwise power. Figure 2 shows the simulation results for the REML estimators of the two variance components in Equation (12). As we can see from both panels in Figure 2, the REML estimators converge to the true values as anticipated. The mean and standard deviation of the REML estimators of the variance components are given in Table 2.

3.1.3. Euclidean Distance Kernel

In this section, we investigated the consistency of REML estimators of the variance components when the kernel used to generate data in Equation (12) is the Gaussian kernel. When applying the Gaussian kernel

K (u, v) = exp {- \frac{{∥u - v∥}^{2}}{2 ϕ}}

, the tuning parameter

ϕ

is chosen to be the sample variance of the Euclidean distance between each pair. Figure 3 shows the simulation results for the REML estimators of the two variance components in Equation (12). The REML estimators converge to the true values as shown in both panels in Figure 3. The mean and standard deviation of the REML estimators of the variance components are given in Table 3.

3.2. Real Data Analysis

To check the performance of Restricted Maximum Likelihood (REML) estimators in high-dimensional kernel linear mixed models, we conducted a real data analysis similar to that by Jiang et al. [11]. The Study of Addiction: Genetics and Environment (SAGE) (dbGap Study Accession: phs000092.v1.p1) dataset was used to estimate the heritability of body mass index (BMI) based on kernel linear mixed models. To check the consistency of heritability estimation across different traits, we also include height and weight as the phenotypes.

After removing individuals without height or weight measures, a total of

n = 1985

individuals of European ancestry remain. We also performed a quality control process similar to that by Jiang et al. [11] to avoid bias in the estimation. Specifically, SNPs with a missing rate of >1%, a minor allele frequency (MAF) of <5%, and a p-value of <0.001 from the Hardy–Weinberg equilibrium test were excluded from the analysis. After the quality control process, p = 4,898,519 SNPs remained for analysis.

As in Jiang et al. [11], for the fixed-effect design matrix

X

, besides the intercept, the first 10 principal component scores computed from the product kernels were included. Kernel LMMs with the product kernel

K = p^{- 1} {GG}^{T}

, a second-order polynomial kernel, and a Gaussian kernel were used to estimate the variance components and hence the heritability. The results are summarized in Table 4. The heritability obtained by the linear kernel is 20.35%, which is very close to the one (19.6%) obtained in Jiang et al. [11]. It can be seen from the table that the heritability estimation of BMI is higher when the polynomial kernel

K = {(p^{- 1} {GG}^{T})}^{2}

or the Gaussian kernel

K (u, v) = exp {- \frac{{∥u - v∥}^{2}}{2 ϕ}}

with

ϕ = 1

is used. The heritability obtained by the polynomial kernel is 50.08%, while the one obtained by the Gaussian kernel is 34.29%.

It is believed that the BMI is highly heritable, and recent twin studies have demonstrated large variation in BMI heritability, ranging from 31% to 90% [26]. Based on the previous results, the heritability estimation obtained using the product kernel is below this range, while the estimations using the Gaussian kernel and the polynomial kernel are within this range. The trend that estimated heritability using the polynomial kernel is the highest among the three choices of kernel and is consistent among the other two phenotypes.

In addition, we also experimented with different values of the constant c in the polynomial kernel, ranging from 0 to 1. Table 5 summarizes the heritability estimates under these choices. The results show that the performance of heritability estimation is sensitive to the choice of kernel hyperparameters. In particular, Table 5 reveals a general pattern: as the value of c increases, the estimated heritability tends to decrease. One explanation is that

{(c + \frac{1}{p} {GG}^{T})}^{2} = c^{2} J_{n} + \frac{2 c}{p} {GG}^{T} + {(\frac{1}{p} {GG}^{T})}^{2}

so that the largest eigenvalue of the

{(c + \frac{1}{p} {GG}^{T})}^{2}

is lower bounded by

λ_{max} (c^{2} J_{n}) = n c^{2}

, which means that when c becomes larger, the rank-one matrix

c^{2} J_{n}

dominates the spectrum of the polynomial kernel matrix. Consequently, the kernel matrix carries less informative structure for estimating the variance component in the random effects, resulting in a smaller estimate of

{\hat{σ}}_{a}^{2}

.

4. Discussion

Kernel methods have been widely used in machine learning to capture nonlinear relationships between features and responses. In this paper, we demonstrate that when kernels are correctly specified, the REML estimators for the variance components in high-dimensional kernel linear mixed models are consistent across three classes of kernel functions. Simulation studies validate the theoretical results on consistency.

In addition, we would like to highlight a few directions for future work. First, as we have mentioned, the consistency of the REML estimators for variance components in kernel LMMs is established under the assumption that the kernels are correctly specified. In practice, the underlying data-generating process and the appropriate kernel are often unknown. Misspecifying the kernel in a kernel LMM could lead to inconsistent variance component estimators, resulting in biased estimates of heritability. Therefore, it is worth exploring a data-driven approach to identify the appropriate kernel function and evaluate its performance in estimating genetic heritability compared to commonly used kernels.

Second, the theories in this paper are established under the high-dimensional linear regime, where the sample size grows linearly with the number of SNPs. One reason for this assumption is that random matrix theory is widely used to derive theoretical results, and most existing work in random matrix theory focuses on the linear regime. Therefore, it is worthwhile to extend random matrix theory beyond the linear regime. Recently, Ghorbani et al. [27] and Mei et al. [28] developed theories for random kernel matrices under the polynomial regime, where

n ≍ p^{λ}

with

λ < 1

. However, in genetic studies, the number of SNPs is often much larger than the sample size. Thus, it is worthwhile to extend the results in this paper to scenarios where

n ≍ p^{λ}

and

λ > 1

and to develop the corresponding theories for random kernel matrices under this regime.

Last but not least, as demonstrated in the real data analyses, the performance of kernel LMMs depends heavily on the choice of the kernel matrix. In practice, researchers often lack prior knowledge about which kernel is most appropriate. Therefore, developing a data-driven method for kernel selection would be highly desirable. One widely used strategy is multiple kernel learning [29], in which a single kernel matrix is replaced by a convex combination of several kernel matrices, where the weights of the kernel matrices are also learning by minimizing a loss function (e.g., mean squared error in the regression setting). Extending the current work to accommodate kernel matrices constructed via multiple kernel learning would be a valuable direction for future research.

Author Contributions

Conceptualization, X.S. and Q.L.; methodology and formal analysis, X.S.; writing—original draft preparation, X.S.; writing—review and editing, Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The R code used for implementation can be found at https://github.com/SxxMichael/KernelLMM (accessed on 19 July 2025).

Acknowledgments

During the preparation of this manuscript/study, the authors used ChatGPT 4o for the purposes of correcting grammatical mistakes. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Mathematical Proofs of the Main Results

Appendix A.1. Proof of Lemma 1

Proof.

Let

Q = y^{T} Γ y

and note that

\begin{matrix} E [Q] & = E [y^{T} Γ y] = E [{(μ + e)}^{T} Γ (μ + e)] \\ = μ^{T} A μ + E [e^{T} Γ e], \end{matrix}

we can obtain

\begin{matrix} Q - E [Q] & = {(μ + e)}^{T} Γ (μ + e) - μ^{T} Γ μ - E [e^{T} Γ e] \\ = e^{T} Γ e - E [e^{T} Γ e] + 2 e^{T} Γ μ . \end{matrix}

Therefore,

\begin{matrix} Var [Q] & = E [{(Q - E [Q])}^{2}] \\ = E [{(e^{T} Γ e - E [e^{T} Γ e] + 2 e^{T} Γ μ)}^{2}] \\ \leq E [2 {(e^{T} Γ e - E [e^{T} Γ e])}^{2} + 8 μ^{T} Γ^{T} {ee}^{T} Γ μ] \\ = 2 Var [e^{T} Γ e] + 8 μ^{T} Γ^{T} Σ Γ μ \end{matrix}

□

Appendix A.2. Proof of Theorem 5

Proof.

Let

Σ_{γ} = A^{T} V_{γ} A

. Then according to the well-known identity (see for example, Searle et al. [18]), we have

\begin{matrix} P_{γ} & = V_{γ}^{- 1} - V_{γ}^{- 1} X {(X^{T} V_{γ}^{- 1} X)}^{- 1} X^{T} V_{γ}^{- 1} \\ = A {(A^{T} V_{γ} A)}^{- 1} A^{T} \\ = A Σ_{γ}^{- 1} A^{T} . \end{matrix}

Therefore,

\begin{matrix} | tr (P_{γ} K) | & = | tr (Σ_{γ}^{- 1} A^{T} KA) | \\ = | tr (Σ_{γ}^{- 1} A^{T} (p^{- 1} {GWG}^{T} A)) | \\ \leq λ_{max} (Σ_{γ}^{- 1}) p^{- 1} tr (A^{T} {GWG}^{T} A) . \end{matrix}

Under the assumptions of

W

, we have

I_{p} - W \geq 0

, and hence

A^{T} {GG}^{T} A \geq A^{T} {GWG}^{T} A,

which further implies that

tr (A^{T} {GG}^{T} A) \geq tr (A^{T} {GWG}^{T} A) .

As a consequence,

\begin{matrix} | tr (P_{γ} K) | & \leq λ_{max} (Σ_{γ}^{- 1}) p^{- 1} tr (A^{T} {GG}^{T} A) \\ = λ_{min}^{- 1} (Σ_{γ}) p^{- 1} tr (A^{T} {GG}^{T} A) \\ \leq p^{- 1} tr (A^{T} {GG}^{T} A) = O_{p} (n), \end{matrix}

where the last equality follows from Corollary 1.

Now we define

\begin{matrix} Δ = Δ (γ) & = y^{T} B (γ) y \\ B = B (γ) & = \frac{P_{γ} {KP}_{γ}}{tr (P_{γ} K)} - \frac{P_{γ}}{n - q} \\ = \frac{A Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} A^{T}}{tr (P_{γ} K)} - \frac{A Σ_{γ}^{- 1} A^{T}}{n - q} \\ = A (C_{1} - C_{2}) A^{T}, \end{matrix}

with

\begin{matrix} C_{1} & = Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1}/ tr (P_{γ} K) \\ C_{2} & = Σ_{γ}^{- 1}/ (n - q) . \end{matrix}

Next, we write

Δ = E [Δ | G] + Δ - E [Δ | G]

, and we are going to show that

E [Δ | G]

converges to some constant limit, and

Δ - E [Δ | G] = o_{p} (1)

. We first show that

Δ - E [Δ | G] = o_{p} (1)

. Based on Lemma 1,

\begin{matrix} Var [Δ | G] & = Var [y^{T} A (C_{1} - C_{2}) A^{T} y | G] \\ = Var [{\tilde{y}}^{T} (C_{1} - C_{2}) \tilde{y} | G] \\ \leq 2 σ_{ϵ}^{2} tr ((C_{1} - C_{2}) Σ_{0} (C_{1} - C_{2}) Σ_{0}) \\ = 2 σ_{ϵ}^{2} [tr ({(C_{1} Σ_{0})}^{2}) + tr ({(C_{2} Σ_{0})}^{2}) - 2 tr (C_{1} Σ_{0} C_{2} Σ_{0})] \end{matrix}

(A1)

where

\tilde{y} = A^{T} y \sim N_{n - q} (0, σ_{ϵ}^{2} A^{T} V_{0} A) = N_{n - q} (0, σ_{ϵ}^{2} Σ_{0})

and

V_{0}

and

Σ_{0}

are the corresponding matrices of

V_{γ}

and

Σ_{γ}

with the truth

γ_{0}

. We now consider each term in the right hand side of Equation (A1). First note that

tr ({(C_{1} Σ_{0})}^{2}) = {[tr (P_{γ} K)]}^{- 2} tr ({(Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0})}^{2}) .

Since

\begin{matrix} tr (P_{γ} K) & = tr (A Σ_{γ}^{- 1} A^{T} K) \\ = tr (Σ_{γ}^{- 1 / 2} A^{T} KA Σ_{γ}^{- 1 / 2}) \\ \geq λ_{min} (K) tr (Σ_{γ}^{- 1}), \end{matrix}

and it follows from the assumption on

W

and Corollary 1,

\begin{matrix} λ_{min} (K) & = p^{- 1} λ_{min} ({GWG}^{T}) \\ \geq λ_{min} (W) λ_{min} (p^{- 1} {GG}^{T}) \\ \geq δ λ_{min} (p^{- 1} {GG}^{T}) \\ \to δ b_{-} (τ) > 0 a . s ., \end{matrix}

\begin{matrix} tr (Σ_{γ}^{- 1}) & = tr ({(I_{n - q} + γ A^{T} K A)}^{- 1}) \\ = \sum_{i = 1}^{n - q} \frac{1}{1 + λ_{i} (γ A^{T} KA)} \\ \geq \frac{n - q}{1 + γ λ_{max} (A^{T} KA)} = O_{p} (n), \end{matrix}

we know that

tr (P_{γ} K) ≍_{p} n

. On the other hand,

tr ({(Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0})}^{2}) = tr ({(Σ_{0}^{1 / 2} Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0}^{1 / 2})}^{2})

and since

I_{p} \geq W

, it follows that

\begin{matrix} Σ_{0}^{1 / 2} Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0}^{1 / 2} & \leq Σ_{0}^{1 / 2} Σ_{γ}^{- 1} A^{T} p^{- 1} {GG}^{T} A Σ_{γ}^{- 1} Σ_{0}^{1 / 2} \end{matrix}

According to Corollary 1

tr ({(Σ_{0}^{1 / 2} Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0}^{1 / 2})}^{2}) \leq tr ({(Σ_{0}^{1 / 2} Σ_{γ}^{- 1} A^{T} p^{- 1} {GG}^{T} A Σ_{γ}^{- 1} Σ_{0}^{1 / 2})}^{2}) = O_{p} (n),

which implies that

tr ({(C_{1} Σ_{0})}^{2}) = O_{p} (n^{- 1})

.

For the second term, it is clear that

tr ({(C_{2} Σ_{0})}^{2}) = {(n - q)}^{- 2} tr ({(Σ_{γ}^{- 1} Σ_{0})}^{2}) \leq {(n - q)}^{- 1} λ_{max}^{2} (Σ_{γ}^{- 1} Σ_{0}) .

Since

\begin{matrix} λ_{max} (Σ_{γ}^{- 1} Σ_{0}) & \leq λ_{max} (Σ_{γ}^{- 1}) λ_{max} (Σ_{0}) \\ = {[λ_{min} (Σ_{γ})]}^{- 1} λ_{max} (I_{n - q} + γ_{0} A^{T} KA) \\ = \frac{1 + γ_{0} λ_{max} (A^{T} KA)}{1 + γ λ_{max} (A^{T} KA)} \\ \leq \frac{γ_{0}}{γ} \lor 1, \end{matrix}

we can know that

tr ({(C_{2} Σ_{0})}^{2}) = O_{p} (n^{- 1})

.

Similarly for the third term, we have

\begin{matrix} tr (C_{1} Σ_{0} C_{2} Σ_{0}) & = {[(n - q) tr (P_{γ} K)]}^{- 1} tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0} Σ_{γ}^{- 1} Σ_{0}) \\ \leq λ_{max} (Σ_{γ}^{- 1} Σ_{0}) {[(n - q) tr (P_{γ} K)]}^{- 1} tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0}) \\ \leq λ_{max} (Σ_{γ}^{- 1} Σ_{0}) {[(n - q) tr (P_{γ} K)]}^{- 1} tr (Σ_{γ}^{- 1} A^{T} p^{- 1} {GG}^{T} A Σ_{γ}^{- 1} Σ_{0}) \\ = O_{p} (n^{- 2} \cdot n) = O_{p} (n^{- 1}) . \end{matrix}

Therefore,

Var [Δ | G] = O_{p} (n^{- 1})

and by Chebyshev’s inequality, for any

t > 0

,

P (| Δ - E [Δ | G] | > t | G) \leq \frac{Var [Δ | G]}{t^{2}} \overset{p}{\to} 0, as n \to \infty .

It then follows from the Dominated Convergence Theorem that

P (| Δ - E [Δ | G] | > t) \to 0, \forall t > 0,

which implies that

Δ - E [Δ | G] = o_{p} (1)

.

Next, let us focus on

E [Δ | G]

. It is easy to see that

\begin{matrix} E [Δ | G] & = E [y^{T} A (C_{1} - C_{2}) A^{T} y | G] \\ = σ_{ϵ}^{2} tr ((C_{1} - C_{2}) Σ_{0}) \\ = σ_{ϵ}^{2} [\frac{tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0})}{tr (P_{γ} K)} - \frac{tr (Σ_{γ}^{- 1} Σ_{0})}{n - q}] . \end{matrix}

(A2)

Since

\begin{matrix} \frac{tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0})}{tr (P_{γ} K)} & \leq \frac{λ_{max} (Σ_{γ}^{- 1} Σ_{0}) tr (Σ_{γ}^{- 1} A^{T} KA)}{tr (P_{γ} K)} \\ = λ_{max} (Σ_{γ}^{- 1} Σ_{0}) \leq \frac{γ_{0}}{γ} \lor 1, \end{matrix}

it follows from Bounded Convergence Theorem that

E [\frac{tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0})}{tr (P_{γ} K)}] \to E [lim_{n \to \infty} \frac{tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0})}{tr (P_{γ} K)}], as n \to \infty .

Moreover, note that

\begin{matrix} E [tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0})] = E [tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1})] + γ_{0} E [tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} A^{T} KA)], \end{matrix}

and according to Theorem 2, the ESD of

\frac{p}{n} A^{T} KA

converges in distribution almost surely to a nonrandom distribution F. Let

λ_{1}, \dots, λ_{n - q}

be the eigenvalues of

(p / n) A^{T} KA

, then

\begin{matrix} \frac{tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1})}{n - q} & = \frac{1}{n - q} \sum_{i = 1}^{n - q} \frac{(n / p) λ_{i}}{{(1 + (n / p) γ λ_{i})}^{2}} \to \int \frac{τ x}{{(1 + τ γ x)}^{2}} d F (x) a . s . \\ \frac{tr (P_{γ} K)}{n - q} & = \frac{1}{n - q} \sum_{i = 1}^{n - q} \frac{(n / p) λ_{i}}{1 + (n / p) γ λ_{i}} \to \int \frac{τ x}{1 + τ γ x} d F (x) a . s . \\ \frac{tr ((Σ_{γ}^{- 1} A^{T} KA))^{2})}{n - q} & = \frac{1}{n - q} \sum_{i = 1}^{n - q} \frac{{(n / p)}^{2} λ_{i}^{2}}{{(1 + γ (n / p) λ_{i})}^{2}} \to \int \frac{τ^{2} x^{2}}{{(1 + τ γ x)}^{2}} d F (x) a . s . . \end{matrix}

For notation simplicity, we define

\begin{matrix} ϕ_{1} & = \int \frac{τ x}{{(1 + τ γ x)}^{2}} d F (x) \\ ϕ_{2} & = \int \frac{τ^{2} x^{2}}{{(1 + τ γ x)}^{2}} d F (x) . \end{matrix}

Similarly, since

{(n - q)}^{- 1} tr (Σ_{γ}^{- 1} Σ_{0}) \leq λ_{max} (Σ_{γ}^{- 1} Σ_{0}) \leq γ_{0} / γ \lor 1

, it follows from Bounded Convergence Theorem that

E [\frac{tr (Σ_{γ}^{- 1} Σ_{0})}{n - q}] \to E [lim_{n \to \infty} \frac{tr (Σ_{γ}^{- 1} Σ_{0})}{n - q}], as n \to \infty .

Moreover, note that

tr (Σ_{γ}^{- 1} Σ_{0}) = tr (Σ_{γ}^{- 1}) + γ_{0} tr (Σ_{γ}^{- 1} A^{T} KA),

and by Theorem 2,

\begin{matrix} \frac{tr (Σ_{γ}^{- 1})}{n - q} = \frac{1}{n - q} \sum_{i = 1}^{n - q} \frac{1}{1 + (n / p) γ λ_{i}} \to \int \frac{1}{1 + τ γ x} d F (x) a . s . . \end{matrix}

Let

\begin{matrix} ψ_{1} & = \int \frac{1}{1 + τ γ x} d F (x) \\ ψ_{2} & = \int \frac{τ x}{1 + τ γ x} d F (x) . \end{matrix}

Then we have

E [Δ | G] = σ_{ϵ}^{2} (\frac{ϕ_{1}}{ψ_{2}} + γ_{0} \frac{ϕ_{2}}{ψ_{2}} - ψ_{1} - γ_{0} ψ_{2}) = σ_{ϵ}^{2} (\frac{ϕ_{1}}{ψ_{2}} - ψ_{1}) (1 + γ_{0} \frac{ϕ_{2} / ψ_{2} - ψ_{2}}{ϕ_{1} / ψ_{2} - ψ_{1}})

Using Fubini Theorem and some tedious algebra, it follows that

E [Δ | G] = σ_{ϵ}^{2} (\frac{γ_{0}}{γ} - 1) (ψ_{1} - \frac{ϕ_{1}}{ψ_{2}}) .

Note that since

f (x) = {(1 + τ x)}^{- 1}

is non-increasing and

g (x) = {(1 + τ x)}^{- 1} τ x

is non-decreasing, we have for any

x, y \in R

(f (x) - f (y)) (g (x) - g (y)) \leq 0,

and hence

\begin{matrix} 0 & \geq \int \int (f (x) - f (y)) (g (x) - g (y)) d F (x) d F (y) \\ = 2 \int f (x) g (x) d F (x) - 2 \int f (x) d F (x) \int g (x) d F (x), \end{matrix}

which implies that

\int \frac{1}{1 + τ x} d F (x) \int \frac{τ x}{1 + τ x} d F (x) \geq \int \frac{τ x}{{(1 + τ x)}^{2}} d F (x),

that is,

ψ_{1} ψ_{2} \geq ϕ_{1}

. Therefore,

Δ \overset{p}{\to} σ_{ϵ}^{2} (\frac{γ_{0}}{γ} - 1) (ψ_{1} - \frac{ϕ_{1}}{ψ_{2}})

, which is a constant limit, and the limit is

> 0

,

= 0

, and

< 0

when

γ < γ_{0}

,

γ = γ_{0}

, and

γ > γ_{0}

, respectively. This proves the desired result as

\hat{γ}

is the solution to

Δ = 0

, and hence

{\hat{γ}}_{n} \to γ_{0}

.

Next, we prove the second part of this theorem. Let

\begin{matrix} {\hat{σ}}_{ϵ}^{2} & = \frac{1}{n - q} y^{T} P_{{\hat{γ}}_{n}} y \\ = E [\frac{1}{n - q} y^{T} P_{γ_{0}} y] + (\frac{1}{n - q} y^{T} P_{{\hat{γ}}_{n}} y - \frac{1}{n - q} y^{T} P_{γ_{0}} y) + (\frac{1}{n - q} y^{T} P_{γ_{0}} y - E [\frac{1}{n - q} y^{T} P_{γ_{0}} y]) . \end{matrix}

First note that

X^{T} P_{γ_{0}} X = X^{T} V_{γ_{0}}^{- 1} X - X^{T} V_{γ_{0}}^{- 1} X {(X^{T} V_{γ_{0}}^{- 1} X)}^{- 1} X^{T} V_{γ_{0}}^{- 1} X = 0

, so we have

\begin{matrix} E [\frac{1}{n - q} y^{T} P_{γ_{0}} y] & = \frac{1}{n - q} [tr (P_{γ_{0}} σ_{ϵ}^{2} V_{γ_{0}}) + β^{T} X^{T} P_{γ_{0}} X β] \\ = \frac{1}{n - q} [σ_{ϵ}^{2} tr (I_{n} - V_{γ_{0}}^{- 1} X {(X^{T} V_{γ_{0}} X)}^{- 1} X^{T})] \\ = \frac{1}{n - q} (n - q) σ_{ϵ}^{2} = σ_{ϵ}^{2} . \end{matrix}

Next, since

{\hat{γ}}_{n} \overset{p}{\to} γ_{0}

, by continuous mapping theorem, we have

∥\frac{1}{n - q} P_{γ_{n}} - \frac{1}{n - q} P_{γ_{0}}∥ \leq \frac{1}{n - q} \cdot n max_{1 \leq i \leq n} (P_{{\hat{γ}}_{n}, i i} - P_{γ_{0}, i i}) = o_{p} (1),

which implies that

\frac{1}{n - q} y^{T} P_{{\hat{γ}}_{n}} y - \frac{1}{n - q} y^{T} P_{γ_{0}} y = o_{p} (1)

. Finally, since

y = σ_{ϵ} V_{γ_{0}}^{1 / 2} z

where

z \sim N_{n} (0, I_{n})

, by the famous Hanson Wright inequality, for any

t > 0

,

\begin{matrix} P (|\frac{1}{n - q} y^{T} P_{γ_{0}} y - E [\frac{1}{n - q} y^{T} P_{γ_{0}} y]| > t) \\ = P (|z^{T} V_{γ_{0}}^{1 / 2} P_{γ_{0}} V_{γ_{0}}^{1 / 2} z - E [z^{T} V_{γ_{0}}^{1 / 2} P_{γ_{0}} V_{γ_{0}}^{1 / 2} z]| > (n - q) t / σ_{ϵ}^{2}) \\ ≲ exp \{- c min (\frac{{(n - q)}^{2} t^{2}}{σ_{ϵ}^{4} {∥V_{γ_{0}}^{1 / 2} P_{γ_{0}} V_{γ_{0}}^{1 / 2}∥}_{F}^{2}}, \frac{(n - q) t}{σ_{ϵ}^{2} ∥V_{γ_{0}}^{1 / 2} P_{γ_{0}} V_{γ_{0}}^{1 / 2}∥})\} . \end{matrix}

Note that

\begin{matrix} ∥V_{γ_{0}}^{1 / 2} P_{γ_{0}} V_{γ_{0}}^{1 / 2}∥ & \leq ∥I_{n} - V_{γ_{0}}^{- 1 / 2} X {(X^{T} V_{γ_{0}}^{- 1} X)}^{- 1} X^{T} V_{γ_{0}}^{- 1 / 2}∥ \leq 1, \\ {∥V_{γ_{0}}^{1 / 2} P_{γ_{0}} V_{γ_{0}}^{1 / 2}∥}_{F}^{2} & \leq n ∥V_{γ_{0}}^{1 / 2} P_{γ_{0}} V_{γ_{0}}^{1 / 2}∥ \leq n, \end{matrix}

so we obtain

P (|\frac{1}{n - q} y^{T} P_{γ_{0}} y - E [\frac{1}{n - q} y^{T} P_{γ_{0}} y]| > t) ≲ exp \{- O (n)\} \to 0,

which implies that

{\hat{σ}}_{ϵ}^{2} = σ_{ϵ}^{2} + o_{p} (1)

. □

Appendix A.3. Proof of Corollary 2

Proof.

For simplicity, let

A = (G - \bar{g} \otimes 1_{n}) W^{1 / 2}

and

B = \tilde{G} W^{1 / 2}

. Then

\tilde{K} = p^{- 1} {BB}^{T}

. Let

K^{'} = p^{- 1} {AA}^{T}

. Now note that

\begin{matrix} {∥A∥}_{F}^{2} & = tr ({AA}^{T}) \\ = tr ((G - \bar{g} \otimes 1_{n}) W {(G - \bar{g} \otimes 1_{n})}^{T}) \\ = tr ({GWG}^{T}) - 2 tr (GW {(\bar{g} \otimes 1_{n})}^{T}) + tr ((\bar{g} \otimes 1_{n}) W {(\bar{g} \otimes 1_{n})}^{T}) . \end{matrix}

Since

I_{p} \geq W

, we have

p^{- 1} {GG}^{T} \geq p^{- 1} {GWG}^{T}

and

\begin{matrix} p^{- 1} tr (GWG 6 T) & \leq p^{- 1} tr ({GG}^{T}) = O_{p} (n) . \\ p^{- 1} tr (GW {(\bar{g} \otimes 1_{n})}^{T}) & = p^{- 1} tr (GW {((\frac{1}{n} 1_{n}^{T} G) \otimes 1_{n})}^{T}) \\ = p^{- 1} tr ((\frac{1}{n} {GWG}^{T} 1_{n}) \otimes 1_{n}^{T}) \\ = n^{- 1} 1_{n}^{T} p^{- 1} {GWG}^{T} 1_{n} \\ \leq λ_{max} (p^{- 1} {GWG}^{T}) \\ \leq λ_{max} (p^{- 1} {GG}^{T}) = O_{p} (1) . \\ p^{- 1} tr ((\bar{g} \otimes 1_{n}) W {(\bar{g} \otimes 1_{n})}^{T}) & = p^{- 1} tr [((\frac{1}{n} 1_{n}^{T} G) \otimes 1_{n}) W {((\frac{1}{n} 1_{n}^{T} G) \otimes 1_{n})}^{T}] \\ = p^{- 1} tr [(\frac{1}{n^{2}} 1_{n}^{T} {GWG}^{T} 1_{n}]) \otimes (1_{n} 1_{n}^{T})] \\ = n^{- 1} 1_{n}^{T} p^{- 1} {GWG}^{T} 1_{n} \\ \leq λ_{max} (p^{- 1} {GWG}^{T}) = O_{p} (1) . \end{matrix}

Combining all of these yields

p^{- 1} {∥A∥}_{F}^{2} = O_{p} (n)

. On the other hand,

\begin{matrix} {∥B∥}_{F}^{2} & = tr ({BB}^{T}) = tr (\tilde{G} W {\tilde{G}}^{T}) \\ = tr ((G - \bar{g} \otimes 1_{n}) D_{s}^{- 1} W D_{s}^{- 1} {(G - \bar{g} \otimes 1_{n})}^{T}) \\ \leq λ_{max} (D_{s}^{- 1} W D_{s}^{- 1}) {∥G - \bar{g} \otimes 1_{n}∥}_{F}^{2} \\ \leq \frac{{max}_{1 \leq i \leq p} w_{i}}{{min}_{1 \leq j \leq p} s_{j}^{2}} {∥G - \bar{g} \otimes 1_{n}∥}_{F}^{2} . \end{matrix}

By Lemma 2.6 in Jiang et al. [11],

max_{1 \leq j \leq p} | s_{j}^{2} - 1 | \to 0 a . s .,

which implies that

{({min}_{1 \leq j \leq p} s_{j}^{2})}^{- 1} = O_{p} (1)

, and hence

{∥B∥}_{F}^{2} = O_{p} (n)

. Finally, since

\begin{matrix} {∥A - B∥}_{F}^{2} & = tr ((A - B) {(A - B)}^{T}) \\ = tr ((G - \bar{g} \otimes 1_{n}) (I_{p} - D_{s}^{- 1}) W (I_{p} - D_{s}^{- 1}) {(G - \bar{g} \otimes 1_{n})}^{T}) \\ \leq λ_{max} ((I_{p} - D_{s}^{- 1}) W (I_{p} - D_{s}^{- 1})) {∥G - \bar{g} \otimes 1_{n}∥}_{F}^{2}, \end{matrix}

and by Corollary 2.3 in Jiang et al. [11],

\begin{matrix} λ_{max} ((I_{p} - D_{s}^{- 1}) W (I_{p} - D_{s}^{- 1})) & \leq (max_{1 \leq i \leq p} w_{i}) λ_{max} ({(I_{p} - D_{s}^{- 1})}^{2}) \\ \leq λ_{max} ({(I_{p} - D_{s}^{- 1})}^{2}) \\ = o_{p} (1) \end{matrix}

Therefore, we can bound the Levy’s distance between

F^{K^{'}}

and

F^{\tilde{K}}

by Corollary A.42 in Bai and Silverstein [19]:

\begin{matrix} L^{4} (F^{K^{'}}, F^{\tilde{K}}) & \leq 2 n^{- 2} p^{- 2} ({∥A∥}_{F}^{2} + {∥B∥}_{F}^{2}) {∥A - B∥}_{F}^{2} \\ = n^{- 2} (O_{p} (n) + O_{p} (n)) \cdot o_{p} (n) \\ = o_{p} (1) . \end{matrix}

Hence, the ESD of

\tilde{K}

converges a.s. in distribution to the ESD of

K

. On the other hand,

\begin{matrix} ∥F^{K^{'}} - F^{{GWG}^{T}}∥ & \leq n^{- 1} rank (A - {GW}^{1 / 2}) \\ = n^{- 1} rank ((\bar{g} \otimes 1_{n}) W^{1 / 2}) \\ \leq n^{- 1} \to 0, as n \to \infty . \end{matrix}

Thus, the ESD of

\tilde{K}

converges a.s. in distribution to the ESD of

p^{- 1} {GWG}^{T}

, and the desired results in the corollary follows Theorem 5. □

Appendix A.4. Proof of Theorem 6

Proof.

We follow the same framework as in the proof of Theorem 5. Let

\tilde{M} = f^{'} (0) \frac{{GG}^{T}}{p} + v_{p} I_{n} .

Based on Theorem A.43 in Bai and Silverstein [19], we have

∥F_{n}^{M} - F_{n}^{\tilde{M}}∥ \leq \frac{1}{n} rank (M - \tilde{M}) = \frac{1}{n} \to 0, as n \to \infty .

The matrix

\tilde{M}

will play a vital part in the remainder of the proof as it is easy to see that the LSD of

\tilde{M}

converges in distribution a.s. to some nonrandom distribution function by Theorem 2. On the other hand, since the elements in

G

are i.i.d. with

E [G_{i j}] = 0

and

E [G_{i j}^{2}] = 1

, it is easy to see that the elements in

A^{T} G

are still i.i.d. with a mean of 0 and unit variance. Moreover, since

A^{T} \tilde{M} A = f^{'} (0) \frac{A^{T} {GG}^{T} A}{p} + v_{p} I_{n - q},

we know that the LSD of

A^{T} \tilde{M} A

also converges in distribution a.s. to some nonrandom distribution function, and we denote such distribution by F.

It can be seen from the proof of Theorem 5 that one major part of showing the result is to show that

tr ({(C_{1} Σ_{0})}^{2}) = O_{p} (n^{- 1})

,

tr ({(C_{2} Σ_{0})}^{2}) = O_{p} (n^{- 1})

and

tr (C_{1} Σ_{0} C_{2} Σ_{0}) = O_{p} (n^{- 1})

. The proof of

tr ({(C_{2} Σ_{0})}^{2}) = O_{p} (1)

is exactly the same arguments as in the proof of Theorem 5, so we focus on the other two quantities.

As a consequence of Theorem A.45 in Bai and Silverstein [19] and Theorem 3, we have

L (F_{n - q}^{A^{T} KA}, F_{n - q}^{A^{T} MA}) \leq ∥M - K∥ = o_{p} (1) .

Moreover, it follows from Theorem A.43 in Bai and Silverstein [19] that

∥F_{n - q}^{A^{T} MA} - F_{n - q}^{A^{T} \tilde{M} A}∥ \leq \frac{1}{n - q} rank (A^{T} (M - \tilde{M}) A) \leq \frac{1}{n - q},

which implies that

L (F_{n - q}^{A^{T} KA}, F_{n - q}^{A^{T} \tilde{M} A}) \leq L (F_{n - q}^{A^{T} KA}, F_{n - q}^{A^{T} MA}) + ∥F_{n - q}^{A^{T} MA} - F_{n - q}^{A^{T} \tilde{M} A}∥ = o_{p} (1) .

Therefore,

\begin{matrix} \frac{1}{n - q} tr (P_{γ} K) & = \frac{1}{n - q} tr (Σ_{γ}^{- 1} A^{T} KA) = \frac{1}{n - q} \sum_{i = 1}^{n - q} \frac{λ_{i} (A^{T} KA)}{1 + γ λ_{i} (A^{T} KA)} \\ = \int \frac{x}{1 + γ x} d F_{n - q}^{A^{T} KA} (x) \\ = \int \frac{x}{1 + γ x} d (F_{n - q}^{A^{T} KA} (x) - F_{n - q}^{A^{T} \tilde{M} A} (x)) + \int \frac{x}{1 + γ x} d F_{n - q}^{A^{T} \tilde{A} A} (x) \\ \to 0 + \int \frac{x}{1 + γ x} d F (x) a . s ., \end{matrix}

(A3)

which implies that

tr (P_{γ} K) ≍_{p} n

. Similarly, note that

\begin{matrix} \frac{1}{n - q} tr ({(Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0})}^{2}) \\ = & \frac{1}{n - q} \sum_{i = 1}^{n - q} \frac{{(1 + γ_{0} λ_{i} (A^{T} KA))}^{2} λ_{i}^{2} (A^{T} KA)}{{(1 + λ_{i} (A^{T} KA))}^{4}} \\ = & \int \frac{{(1 + γ_{0} x)}^{2} x^{2}}{{(1 + γ x)}^{4}} d F_{n - q}^{A^{T} KA} (x) \\ = & \int \frac{{(1 + γ_{0} x)}^{2} x^{2}}{{(1 + γ x)}^{4}} d (F_{n - q}^{A^{T} KA} (x) - F_{n - q}^{A^{T} \tilde{M} A} (x)) + \int \frac{{(1 + γ_{0} x)}^{2} x^{2}}{{(1 + γ x)}^{4}} d F_{n - q}^{A^{T} \tilde{M} A} (x) \\ \to & 0 + \int \frac{{(1 + γ_{0} x)}^{2} x^{2}}{{(1 + γ x)}^{4}} d F (x) a . s ., \end{matrix}

which implies that

tr ({(Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0})}^{2}) = O_{p} (n)

and hence

tr ({(C_{1} Σ_{0})}^{2}) = {[tr (P_{γ} K)]}^{- 2} tr ({(Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0})}^{2}) = O_{p} (n^{- 1}) .

For

tr (C_{1} Σ_{0} C_{2} Σ_{0})

, note that

\begin{matrix} \frac{1}{n - q} tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0}) \\ = & \frac{1}{n - q} \sum_{i = 1}^{n - q} \frac{(1 + γ_{0} λ_{i} (A^{T} KA)) λ_{i} (A^{T} KA)}{{(1 + γ λ_{i} (A^{T} KA))}^{2}} \\ = & \int \frac{(1 + γ_{0} x) x}{{(1 + γ x)}^{2}} d F_{n - q}^{A^{T} KA} (x) \\ = & \int \frac{(1 + γ_{0} x) x}{{(1 + γ x)}^{2}} d (F_{n - q}^{A^{T} KA} (x) - F_{n - q}^{A^{T} \tilde{M} A} (x)) + \int \frac{(1 + γ_{0} x)}{{(1 + γ x)}^{2}} d F_{n - q}^{A^{T} \tilde{M} A} (x) \\ \to & 0 + \int \frac{(1 + γ_{0} x) x}{{(1 + γ x)}^{2}} d F (x) a . s ., \end{matrix}

(A4)

which implies that

tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0}) = O_{p} (n)

, and hence

\begin{matrix} tr (C_{1} Σ_{0} C_{2} Σ_{0}) & = {[(n - q) tr (P_{γ} K)]}^{- 1} tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0} Σ_{γ}^{- 1} Σ_{0}) \\ \leq λ_{max} (Σ_{γ}^{- 1} Σ_{0}) {[(n - q) tr (P_{γ} K)]}^{- 1} tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0}) \\ = O_{p} (n^{- 2} \cdot n) = O_{p} (n^{- 1}) . \end{matrix}

Therefore,

Var [Δ | G] = O_{p} (n^{- 1})

and as in the proof of Theorem 5, it follows from Chebyshev’s inequality and Dominated Convergence Theorem that

Δ - E [Δ | G] = o_{p} (1)

.

Similar to the proof of Theorem 5, we now focus on

E [Δ | G]

. Recall from Equation (A2) that

\begin{matrix} E [Δ | G] & = σ_{ϵ}^{2} [\frac{tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0})}{tr (P_{γ} K)} - \frac{tr (Σ_{γ}^{- 1} Σ_{0})}{n - q}] \\ = σ_{ϵ}^{2} [\frac{tr (Σ_{γ}^{- 1} A^{T} KA Σ_{γ}^{- 1} Σ_{0})}{tr (P_{γ} K)} - \frac{tr (Σ_{γ}^{- 1}) + γ_{0} tr (P_{γ} K)}{n - q}] . \end{matrix}

Note that

\begin{matrix} \frac{1}{n - q} tr (Σ_{γ}^{- 1}) & = \frac{1}{n - q} \sum_{i = 1}^{n - q} \frac{1}{1 + γ λ_{i} (A^{T} KA)} = \int \frac{1}{1 + γ x} d F_{n - q}^{A^{T} KA} (x) \\ = \int \frac{1}{1 + γ x} d (F_{n - q}^{A^{T} KA} (x) - F_{n - q}^{A^{T} \tilde{M} A} (x)) + \int \frac{1}{1 + γ x} d F_{n - q}^{A^{T} \tilde{M} A} (x) \\ \to 0 + \int \frac{1}{1 + γ x} d F (x) . \end{matrix}

Combined with Equations (A3) and (A4), it is easy to see that

E [Δ | G]

converges to a constant limit. The remaining part of the proof follows the same statements as in the proof of Theorem 6, so it is omitted. □

Appendix A.5. Proof of Theorem 7

Proof.

Let

M

and

v_{p}

be as defined in Equation (10). Now let

\tilde{M} = - 2 f^{'} (τ) \frac{{GG}^{T}}{p} + v_{p} I_{n} .

By Theorem A.43 in Bai and Silverstein [19] and the triangle inequality for matrix rank, we have

\begin{matrix} ∥F_{n}^{M} - F_{n}^{\tilde{M}}∥ & \leq \frac{1}{n} rank (M - \tilde{M}) \\ = \frac{1}{n} rank (f (τ) 1_{n} 1_{n}^{T} + f^{'} (τ) [1_{n} ψ^{T} + ψ 1_{n}^{T}] + \\ \frac{f^{″} (τ)}{2} [1_{n} {(ψ \circ ψ)}^{T} + (ψ \circ ψ) 1_{n}^{T} + 2 ψ ψ^{T} + 4 \frac{tr (Σ_{p}^{2})}{p^{2}} 1_{n} 1_{n}^{T}]) \\ \leq \frac{7}{n} \to 0, as n \to \infty . \end{matrix}

On the other hand, as a consequence of Theorem A.45 in Bai and Silverstein [19] and Theorem 4,

L (F_{n - q}^{A^{T} KA}, F_{n - q}^{A^{T} MA}) \leq ∥A^{T} (K - M) A∥ \leq ∥K - M∥ = o_{p} (1) .

Moreover, it follows from Theorem A.43 in Bai and Silverstein [19] that

∥F_{n - q}^{A^{T} MA} - F_{n - q}^{A^{T} \tilde{M} A}∥ \leq \frac{1}{n - q} rank (A^{T} (M - \tilde{M}) A) \leq \frac{1}{n - q} rank (M - \tilde{M}) \leq \frac{7}{n - q},

which implies that

L (F_{n - q}^{A^{T} KA}, F_{n - q}^{A^{T} \tilde{M} A}) \leq L (F_{n - q}^{A^{T} KA}, F_{n - q}^{A^{T} MA}) + ∥F_{n - q}^{A^{T} MA} - F_{n - q}^{A^{T} \tilde{M} A}∥ = o_{p} (1) .

The remaining of the proof is the same as the proof of Theorem 3. □

References

Macgregor, S.; Cornes, B.K.; Martin, N.G.; Visscher, P.M. Bias, precision and heritability of self-reported and clinically measured height in Australian twins. Hum. Genet. 2006, 120, 571–580. [Google Scholar] [CrossRef] [PubMed]
Silventoinen, K.; Sammalisto, S.; Perola, M.; Boomsma, D.I.; Cornes, B.K.; Davis, C.; Dunkel, L.; De Lange, M.; Harris, J.R.; Hjelmborg, J.V.; et al. Heritability of adult body height: A comparative study of twin cohorts in eight countries. Twin Res. Hum. Genet. 2003, 6, 399–408. [Google Scholar] [CrossRef] [PubMed]
Yengo, L.; Vedantam, S.; Marouli, E.; Sidorenko, J.; Bartell, E.; Sakaue, S.; Graff, M.; Eliasen, A.U.; Jiang, Y.; Raghavan, S.; et al. A saturated map of common genetic variants associated with human height. Nature 2022, 610, 704–712. [Google Scholar] [CrossRef] [PubMed]
Yang, J.; Lee, S.H.; Goddard, M.E.; Visscher, P.M. GCTA: A tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 2011, 88, 76–82. [Google Scholar] [CrossRef] [PubMed]
Yang, J.; Benyamin, B.; McEvoy, B.P.; Gordon, S.; Henders, A.K.; Nyholt, D.R.; Madden, P.A.; Heath, A.C.; Martin, N.G.; Montgomery, G.W.; et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010, 42, 565–569. [Google Scholar] [CrossRef] [PubMed]
Li, S.; Cai, T.T.; Li, H. Inference for high-dimensional linear mixed-effects models: A quasi-likelihood approach. J. Am. Stat. Assoc. 2022, 117, 1835–1846. [Google Scholar] [CrossRef] [PubMed]
van de Geer, S.; Bühlmann, P.; Ritov, Y.; Dezeure, R. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Stat. 2014, 42, 1166–1202. [Google Scholar] [CrossRef]
Zhang, C.H.; Zhang, S.S. Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. Stat. Methodol. 2014, 76, 217–242. [Google Scholar] [CrossRef]
Law, M.; Ritov, Y. Inference and estimation for random effects in high-dimensional linear mixed models. J. Am. Stat. Assoc. 2023, 118, 1682–1691. [Google Scholar] [CrossRef]
Wu, M.C.; Lee, S.; Cai, T.; Li, Y.; Boehnke, M.; Lin, X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 2011, 89, 82–93. [Google Scholar] [CrossRef] [PubMed]
Jiang, J.; Li, C.; Paul, D.; Yang, C.; Zhao, H. On high-dimensional misspecified mixed model analysis in genome-wide association study. Ann. Stat. 2016, 44, 2127–2160. [Google Scholar] [CrossRef]
Dao, C.; Jiang, J.; Paul, D.; Zhao, H. Variance estimation and confidence intervals from genome-wide association studies through high-dimensional misspecified mixed model analysis. J. Stat. Plan. Inference 2022, 220, 15–23. [Google Scholar] [CrossRef] [PubMed]
Jiang, J.; Jiang, W.; Paul, D.; Zhang, Y.; Zhao, H. High-dimensional asymptotic behavior of inference based on gwas summary statistic. Stat. Sin. 2023, 33, 1555–1576. [Google Scholar] [CrossRef]
Liu, D.; Lin, X.; Ghosh, D. Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics 2007, 63, 1079–1088. [Google Scholar] [CrossRef] [PubMed]
Banerjee, S.; Carlin, B.P.; Gelfand, A.E. Hierarchical Modeling and Analysis for Spatial Data; Chapman and Hall/CRC: Boca Raton, FL, USA, 2003. [Google Scholar]
Shen, X.; Wen, Y.; Cui, Y.; Lu, Q. A conditional autoregressive model for genetic association analysis accounting for genetic heterogeneity. Stat. Med. 2022, 41, 517–542. [Google Scholar] [CrossRef] [PubMed]
de Los Campos, G.; Vazquez, A.I.; Fernando, R.; Klimentidis, Y.C.; Sorensen, D. Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS Genet. 2013, 9, e1003608. [Google Scholar] [CrossRef] [PubMed]
Searle, S.R.; Casella, G.; McCulloch, C.E. Variance Components; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 391. [Google Scholar]
Bai, Z.; Silverstein, J.W. Spectral Analysis of Large Dimensional Random Matrices; Springer: Berlin/Heidelberg, Germany, 2010; Volume 20. [Google Scholar]
Paul, D.; Aue, A. Random matrix theory in statistics: A review. J. Stat. Plan. Inference 2014, 150, 1–29. [Google Scholar] [CrossRef]
El Karoui, N. The spectrum of kernel random matrices. Ann. Stat. 2010, 38, 1–50. [Google Scholar] [CrossRef]
Couillet, R.; Liao, Z. Random Matrix Methods for Machine Learning; Cambridge University Press: Cambridge, UK, 2022. [Google Scholar]
Hiai, F. Monotonicity for entrywise functions of matrices. Linear Algebra Its Appl. 2009, 431, 1125–1146. [Google Scholar] [CrossRef]
Wendland, H. Scattered Data Approximation; Cambridge University Press: Cambridge, UK, 2004; Volume 17. [Google Scholar]
Li, M.; He, Z.; Zhang, M.; Zhan, X.; Wei, C.; Elston, R.C.; Lu, Q. A generalized genetic random field method for the genetic association analysis of sequencing data. Genet. Epidemiol. 2014, 38, 242–253. [Google Scholar] [CrossRef] [PubMed]
Schrempft, S.; van Jaarsveld, C.H.; Fisher, A.; Herle, M.; Smith, A.D.; Fildes, A.; Llewellyn, C.H. Variation in the heritability of child body mass index by obesogenic home environment. JAMA Pediatr. 2018, 172, 1153–1160. [Google Scholar] [CrossRef] [PubMed]
Ghorbani, B.; Mei, S.; Misiakiewicz, T.; Montanari, A. When do neural networks outperform kernel methods? Adv. Neural Inf. Process. Syst. 2020, 33, 14820–14830. [Google Scholar] [CrossRef]
Mei, S.; Misiakiewicz, T.; Montanari, A. Learning with invariances in random features and kernel models. In Proceedings of the Conference on Learning Theory, PMLR, Boulder, CO, USA, 15–19 August 2021; pp. 3351–3418. [Google Scholar]
Gönen, M.; Alpaydın, E. Multiple kernel learning algorithms. J. Mach. Learn. Res. 2011, 12, 2211–2268. [Google Scholar]

Figure 1. Boxplots for REML estimators of variance components obtained from Simulation (12) under the weighted linear kernel. The left panel shows the boxplots for the REML estimator of

σ_{a}^{2}

(truth = 0.6), and the right panel shows the boxplots for the REML estimator of

σ_{ϵ}^{2}