Relationship Between MiRKAT and Coefficient of Determination in Similarity Matrix Regression

The Microbiome Regression-based Kernel Association Test (MiRKAT) is widely used to test for association between microbiome composition and an outcome of interest. The MiRKAT statistic is derived as a variance-component score test in a kernel machine regression-based generalized linear mixed model. In this brief report, we show that the squared MiRKAT statistic is proportional to the $R^2$ (coefficient of determination) statistic of a similarity matrix regression, which characterizes the fraction of variability in outcome similarity that is explained by microbiome similarity.


Introduction
Recent research has highlighted the vital role of the human microbiome in many diseases and health conditions, including (but not limited to) obesity [1], diabetes [2], cancer [3], inflammatory disorders [4], and bacterial vaginosis [5]. Advances in next-generation sequencing and high-throughput functional profiling technologies, including metagenomics, metatranscriptomics, metaproteomics, and metabolomics, have provided powerful tools for surveying the microbiome in these research areas [6,7]. The field of microbiome studies, however, has not yet reached the maturity attained in other established molecular-epidemiological fields, such as cancer biomarker discovery and genome-wide association studies, that is needed to make the leap from "-omics" surveys to rational microbiome-based therapeutics. A primary obstacle to leveraging the large body of microbiome and metagenomics data is a set of computational and statistical challenges: high dimensionality, count and compositional data structure, sparsity (zero-inflation), over-dispersion, and phylogenetic relatedness, among others. To meet these challenges, specialized computational tools and quantitative approaches are needed to aid in understanding the role of the microbiota in maintaining homeostasis in their animal host, as well as in the initiation and propagation of disease.
A common mode of analysis in microbiome studies is diversity-based community-level analysis, wherein overall microbiome composition is studied in relation to outcomes of interest, such as host transcriptome [4], host genetics [8], and other clinical or environmental covariates [9]. Targeting overall microbiome community composition provides a holistic view that facilitates identification of large-scale differences, accommodates correlation among taxa, and harnesses phylogenetic relationships. Besides being biologically meaningful, community-level analysis is often statistically more powerful than individual taxon-level analysis, through a reduced multiple-testing burden and the ability to aggregate modest individual effects [10]. Motivated by these advantages, many novel statistical methods and computational tools have been proposed for efficiently testing for associations between outcomes and microbial community composition, using either alpha-diversity [11] or beta-diversity [10,12–19].
Among the existing quantitative analyses of association between microbial communities and their host, a powerful and popular method is the MiRKAT-type strategy, which regresses the outcome on microbiome compositions by way of the kernel machine regression framework [10,15,16]. A major advantage of MiRKAT over other microbiome community-level association analyses is that the kernel machine regression framework allows for flexible microbial effects (e.g., nonlinear effects and interactions) on the outcome. The performance of MiRKAT as an overall association test has been well studied in the literature. In this report, we study the interpretation of MiRKAT results by investigating the MiRKAT statistic, and show that it corresponds to the ratio of explained variation (by microbiome similarities) to total variation (in outcome similarities).

Materials and Methods
We first introduce some notation. Let the triplets $(Y_i, X_i, Z_i)$, $i = 1, \ldots, n$, be independent observations, where $Y_i$ denotes the outcome of interest (e.g., body mass index), $X_i$ denotes the $q \times 1$ covariate vector including the intercept (e.g., age, gender, and antibiotic use), and $Z_i$ is the $p \times 1$ composition vector of a microbiome community with $p$ taxa. MiRKAT relates the outcome to microbiome features through the generalized partial linear model
$$g\{E(Y_i \mid X_i, Z_i)\} = X_i^{\top}\beta + f(Z_i),$$
where $g(\cdot)$ is the link function (e.g., the identity function for a continuous outcome and the logit function for a binary outcome), $\beta$ is the vector of covariate coefficients, and $f(\cdot)$ is a centered smooth function in a reproducing kernel Hilbert space spanned by a kernel function $k(\cdot, \cdot)$. By using a nonparametric function $f(\cdot)$, the model allows for a flexible relationship (e.g., nonlinear effects and interactions) between the outcome $Y_i$ and the microbiome compositions [10]. In fact, for the purpose of testing $H_0: f(\cdot) = 0$, it is sufficient to specify the $n \times n$ kernel matrix $K$ with $K_{ij} = k(Z_i, Z_j)$, rather than explicitly defining the kernel function $k(\cdot, \cdot)$ [10]. Within the context of microbiome studies, we typically define such a kernel matrix from a β-diversity using
$$K = -\frac{1}{2}\left(I - \frac{\mathbf{1}\mathbf{1}^{\top}}{n}\right) D^{2} \left(I - \frac{\mathbf{1}\mathbf{1}^{\top}}{n}\right),$$
where $D$ is a matrix of pairwise β-diversities between individual microbial communities, $D^{2}$ is its element-wise square, $I$ is the $n \times n$ identity matrix, and $\mathbf{1}$ is the $n \times 1$ vector of ones. For example, $D$ could be a matrix of Bray-Curtis dissimilarities [20] or of the UniFrac-family distances [9].
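The construction of a kernel matrix from a β-diversity matrix can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `bray_curtis` and `distance_to_kernel` and the toy count data are hypothetical, and the positive semi-definiteness correction of the eigenvalues that some implementations apply afterwards is omitted.

```python
import numpy as np

def bray_curtis(Z):
    """Pairwise Bray-Curtis dissimilarity between the rows of a nonnegative matrix Z."""
    Z = np.asarray(Z, dtype=float)
    num = np.abs(Z[:, None, :] - Z[None, :, :]).sum(axis=2)
    den = (Z[:, None, :] + Z[None, :, :]).sum(axis=2)
    return num / den

def distance_to_kernel(D):
    """Double-center the element-wise squared distances: K = -1/2 J D^2 J."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix I - 11'/n
    return -0.5 * J @ (D ** 2) @ J

# Toy data (hypothetical): 6 samples, 10 taxa, strictly positive counts
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(6, 10)) + 1
Z = counts / counts.sum(axis=1, keepdims=True)  # compositions

D = bray_curtis(Z)
K = distance_to_kernel(D)
```

Because of the double centering, every row and column of `K` sums to zero, which is the sense in which the microbiome similarities are centered.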
The MiRKAT for the hypothesis $H_0: f(\cdot) = 0$ is derived from a variance component score test in a generalized linear mixed model (GLMM), and the specific MiRKAT statistic was proposed as [10]
$$Q = \frac{1}{2\phi}\,(Y - \hat{\mu}_0)^{\top} K\, (Y - \hat{\mu}_0), \tag{1}$$
where $\hat{\mu}_0 = (\hat{\mu}_{0,1}, \ldots, \hat{\mu}_{0,n})^{\top}$ are the estimates of $\mu = E(Y \mid X, Z)$ under the null GLMM of no microbiome effect on the outcome, and $K = \{K_{ij}\}_{n \times n}$, with $K_{ij}$ being a similarity metric measuring the similarity level between microbiome profiles $Z_i$ and $Z_j$. Examples of such similarity/dissimilarity metrics include the UniFrac family and the Bray-Curtis dissimilarity [10]. The original MiRKAT was proposed for either a continuous outcome or a binary outcome. Under a continuous outcome model, the dispersion parameter $\phi$ equals $\hat{\sigma}_0^2$, the estimate of the residual variance under the null model, whereas $\phi = 1$ under a binary outcome model [10]. The testing strategy of MiRKAT has further been extended to accommodate more complicated outcome types (e.g., survival times and multiple correlated outcomes) [15,16,19] and complex study designs (e.g., longitudinal microbiome studies) [21,22], all of which share the same spirit by deriving the test statistic as a variance component score test in a certain mixed effect model [10,15,16,22]. As a result, all of these MiRKAT-type test statistics have a form comparable to $Q$ in Equation (1). For ease of presentation, we will illustrate the connection between a MiRKAT statistic and an $R^2$ (coefficient of determination) statistic using a continuous outcome as an example.
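For a continuous outcome, the statistic in Equation (1) can be computed directly from null-model residuals. The following is a minimal sketch, assuming ordinary least squares for the null fit and a residual variance estimate with denominator $n - q$; the function name `mirkat_q_continuous` and the toy data are hypothetical, and the p-value calculation is not shown.

```python
import numpy as np

def mirkat_q_continuous(Y, X, K):
    """Return (Q, residuals, sigma2_hat) for a continuous outcome.

    X must already contain the intercept column; K is an n x n similarity matrix.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    n, q = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # null model: no microbiome effect
    resid = Y - X @ beta_hat                          # Y - mu_hat_0
    sigma2_hat = resid @ resid / (n - q)              # dispersion phi for a continuous outcome
    Q = resid @ K @ resid / (2.0 * sigma2_hat)
    return Q, resid, sigma2_hat

# Toy data (hypothetical): intercept plus one covariate, centered linear kernel
rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
G = rng.normal(size=(n, 4))
J = np.eye(n) - np.ones((n, n)) / n
K = J @ G @ G.T @ J

Q, resid, sigma2_hat = mirkat_q_continuous(Y, X, K)
```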
To build the connection between the MiRKAT statistic (1) and the $R^2$ statistic, we rearrange the non-kernel part and the kernel part in the numerator of $Q$. Let $S^{y} = (Y - \hat{\mu}_0)(Y - \hat{\mu}_0)^{\top}$ be the cross product of the residuals, where $S^{y}_{ij} = (Y_i - \hat{\mu}_{0,i})(Y_j - \hat{\mu}_{0,j})$ describes the covariate-adjusted ($X$-adjusted) outcome similarity between $Y_i$ and $Y_j$. An alternative way to study the association between outcome and microbiome is by the similarity matrix regression [23]
$$S^{y}_{ij} = a K_{ij} + e_{ij}, \tag{2}$$
where the $e_{ij}$ are mean-zero, normally distributed error terms. Since the outcome similarity $S^{y}_{ij}$ is calculated from the null model residuals, which have been $X$-adjusted (note that the intercept term is included in $X$), and $f(\cdot)$ is assumed to be a centered smooth function (so that $K$ is centered as well), both $S^{y}_{ij}$ and $K_{ij}$ are centered and, thus, we do not include an intercept term in the similarity matrix regression model (2). It has been pointed out that testing $H_0: a = 0$ is equivalent to testing that a corresponding variance component is zero in a random effect model [23], which is the null hypothesis in MiRKAT. Besides this correspondence between the null hypotheses of MiRKAT and similarity matrix regression, in this short report, we further demonstrate the correspondence between the MiRKAT statistic and the $R^2$ statistic (coefficient of determination) of similarity matrix regression.
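Define the vectorization of the matrix $S^{y}$ as $S_{\mathrm{vec}} = \mathrm{vec}(S^{y}) = (S^{y}_{11}, \ldots, S^{y}_{nn})^{\top}$, where vec stands for vectorization. The same notation applies to the microbiome similarities $K_{ij}$ and the error terms $e_{ij}$. Then, the matrix regression (2) can be reformulated in vector format as
$$S_{\mathrm{vec}} = a K_{\mathrm{vec}} + e_{\mathrm{vec}}. \tag{3}$$
Under the simple linear regression model (3), it is easy to verify that
$$\frac{\mathrm{Var}(E[S_{\mathrm{vec}} \mid K_{\mathrm{vec}}])}{\mathrm{Var}(S_{\mathrm{vec}})} = \mathrm{Corr}^2(S_{\mathrm{vec}}, K_{\mathrm{vec}}). \tag{4}$$
The correlation on the right-hand side of Equation (4) can be estimated by its empirical sample correlation, $\mathrm{Corr}_n(S_{\mathrm{vec}}, K_{\mathrm{vec}})$, further calculated as
$$\mathrm{Corr}_n(S_{\mathrm{vec}}, K_{\mathrm{vec}}) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} S^{y}_{ij} K_{ij}}{\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} (S^{y}_{ij})^2}\,\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} K_{ij}^2}} = \frac{(Y - \hat{\mu}_0)^{\top} K\, (Y - \hat{\mu}_0)}{\sum_{i=1}^{n} (Y_i - \hat{\mu}_{0,i})^2 \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} K_{ij}^2}} = \frac{2Q}{(n - q)\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} K_{ij}^2}}, \tag{5}$$
where $q$ is the dimension of $X$. The first equality uses $\sum_{i}\sum_{j} (S^{y}_{ij})^2 = \{\sum_{i} (Y_i - \hat{\mu}_{0,i})^2\}^2$, and the second equality holds because $(Y - \hat{\mu}_0)^{\top} K (Y - \hat{\mu}_0) = 2\hat{\sigma}_0^2 Q$ by Equation (1) and $\sum_{i=1}^{n} (Y_i - \hat{\mu}_{0,i})^2 = (n - q)\hat{\sigma}_0^2$. On the other hand, according to the law of total variance, $\mathrm{Var}(S_{\mathrm{vec}}) = \mathrm{Var}(E[S_{\mathrm{vec}} \mid K_{\mathrm{vec}}]) + E(\mathrm{Var}[S_{\mathrm{vec}} \mid K_{\mathrm{vec}}])$, where $\mathrm{Var}(E[S_{\mathrm{vec}} \mid K_{\mathrm{vec}}])$ and $E(\mathrm{Var}[S_{\mathrm{vec}} \mid K_{\mathrm{vec}}])$ represent the parts of the variance in the outcome similarities that are explained and unexplained by the microbiome similarities, respectively. In other words, the left-hand side of Equation (4) is the fraction of variation in outcome similarity explained by microbiome similarity or, equivalently, the $R^2$ statistic (coefficient of determination) of the similarity matrix regression (3). Putting all of this together, we have
$$R^2 = \frac{4Q^2}{(n - q)^2 \sum_{i=1}^{n}\sum_{j=1}^{n} K_{ij}^2}. \tag{6}$$
That is, given the microbiome similarity (such that $\sum_{i=1}^{n}\sum_{j=1}^{n} K_{ij}^2$ is a constant), the coefficient of determination $R^2$ is proportional to the squared MiRKAT statistic $Q^2$.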
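The identity between the $R^2$ of the similarity matrix regression and the rescaled squared MiRKAT statistic can be checked numerically on a single toy data set. The sketch below assumes a continuous outcome, an ordinary least squares null fit, and a hypothetical centered linear kernel in place of a β-diversity-based one; all variable names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 50, 3

# Covariates (with intercept) and a continuous outcome; coefficients are hypothetical
X = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])
Y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

# Hypothetical centered similarity matrix K
G = rng.normal(size=(n, 5))
J = np.eye(n) - np.ones((n, n)) / n
K = J @ G @ G.T @ J

# Null-model residuals, residual variance, and the MiRKAT statistic Q
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - q)
Q = resid @ K @ resid / (2.0 * sigma2_hat)

# Vectorized similarities and the no-intercept regression of S_vec on K_vec
S_vec = np.outer(resid, resid).ravel()
K_vec = K.ravel()
corr_n = (S_vec @ K_vec) / np.sqrt((S_vec @ S_vec) * (K_vec @ K_vec))
r2_regression = corr_n ** 2

# Right-hand side of the proportionality: 4 Q^2 / ((n - q)^2 * sum of K_ij^2)
r2_from_q = 4.0 * Q ** 2 / ((n - q) ** 2 * (K_vec @ K_vec))
```

Since both `S_vec` and `K_vec` have mean zero here (the residuals sum to zero and `K` is double-centered), the uncentered correlation used above agrees with the usual Pearson correlation.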

Results
We conducted a numerical study to verify the analytical result in Equation (6). We simulated the microbiome data $Z$ (of $p = 856$ taxa) from an estimated Dirichlet-multinomial distribution, following the same strategy used in MiRKAT [10], and considered a sample size of $n = 200$ in this simulation. After the microbiome data were generated, we simulated two covariates, $X_1$ and $X_2$, where $X_1$ was a Bernoulli variable with success probability 0.5 and $X_2$ was simulated from the normal distribution $N(\mathrm{scale}(\sum_j Z_{ij}), 1)$, whose mean depended on the microbiome composition. Then, the outcome was simulated according to a model with coefficient vector $\alpha = (\alpha_1, \ldots, \alpha_p)^{\top}$, where we randomly selected 50 of the $\alpha_j$ to be nonzero, generated from the uniform distribution between −1 and 1. After the $(Y_i, X_i, Z_i)$ were generated, we calculated both the outcome similarity $S^{y}$ and the microbiome similarity $K$. For the outcome similarity, the linear cross product of residuals (as described previously in this report) was used throughout the simulations. Four different microbiome similarity metrics were considered in this simulation: weighted UniFrac, unweighted UniFrac, generalized UniFrac (parameter set to 0.5, as used in MiRKAT), and Bray-Curtis. Each microbiome similarity metric was constructed from the corresponding β-diversity (as described in Zhao et al. [10]). We calculated both the MiRKAT statistic $Q$ and the $R^2$ statistic of the similarity matrix regression (3) using 1000 replicates. Finally, $R^2$ was compared to $4Q^2 / \{(n - q)^2 \sum_i \sum_j K_{ij}^2\}$, according to Equation (6). The result is reported in Figure 1, and it was confirmed that the two quantities are equal.
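A much smaller version of this numerical check can be sketched as follows. It is not the simulation reported here: the Dirichlet-multinomial parameters, the read depths, the outcome model, and the sizes `n`, `p`, and `n_rep` are all hypothetical choices made for illustration, and only the Bray-Curtis similarity is used because the UniFrac metrics require a phylogenetic tree.

```python
import numpy as np

rng = np.random.default_rng(2019)
n, p, n_rep = 40, 30, 20  # far smaller than n = 200, p = 856, 1000 replicates

def one_replicate(rng):
    # Hypothetical Dirichlet-multinomial counts with varying read depth
    alpha_dm = rng.gamma(shape=0.5, scale=1.0, size=p) + 0.05
    props = rng.dirichlet(alpha_dm, size=n)
    depth = rng.integers(500, 2000, size=n)
    counts = np.vstack([rng.multinomial(depth[i], props[i]) for i in range(n)])
    Z = counts / counts.sum(axis=1, keepdims=True)

    # Covariates: X1 Bernoulli(0.5); X2 normal with mean given by standardized total counts
    total = counts.sum(axis=1).astype(float)
    x1 = rng.binomial(1, 0.5, size=n)
    x2 = rng.normal((total - total.mean()) / total.std(), 1.0)
    X = np.column_stack([np.ones(n), x1, x2])

    # Hypothetical outcome model: sparse taxon effects, 10 nonzero coefficients
    alpha = np.zeros(p)
    nonzero = rng.choice(p, size=10, replace=False)
    alpha[nonzero] = rng.uniform(-1.0, 1.0, size=10)
    Y = 0.5 * x1 + 0.5 * x2 + 10.0 * Z @ alpha + rng.normal(size=n)

    # Bray-Curtis dissimilarity and the double-centered kernel
    num = np.abs(Z[:, None, :] - Z[None, :, :]).sum(axis=2)
    den = (Z[:, None, :] + Z[None, :, :]).sum(axis=2)
    D = num / den
    J = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * J @ (D ** 2) @ J

    # MiRKAT statistic Q and R^2 of the no-intercept similarity matrix regression
    q = X.shape[1]
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - q)
    Q = resid @ K @ resid / (2.0 * sigma2_hat)
    S_vec, K_vec = np.outer(resid, resid).ravel(), K.ravel()
    r2 = (S_vec @ K_vec) ** 2 / ((S_vec @ S_vec) * (K_vec @ K_vec))
    rhs = 4.0 * Q ** 2 / ((n - q) ** 2 * (K_vec @ K_vec))
    return r2, rhs

results = np.array([one_replicate(rng) for _ in range(n_rep)])
max_abs_diff = np.max(np.abs(results[:, 0] - results[:, 1]))
```

Plotting the first column of `results` against the second would place every replicate on the line y = x, up to floating-point error summarized by `max_abs_diff`.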

Discussion and Conclusions
In summary, we found a connection between two statistics which seem to be quite different. One is the MiRKAT-type statistic, which is usually derived as a score test statistic for a variance component in a GLMM. The other is an $R^2$-type statistic, the proportion of explained variation in similarity matrix regression. Despite the popularity of the MiRKAT test itself, it is striking to find a correspondence between the MiRKAT statistic and the proportion of variance in the outcome that is explained by the microbiome (at the similarity level). A high $R^2$ for a certain microbiome similarity may imply an underlying microbiome-trait association pattern (e.g., a high unweighted UniFrac $R^2$ may imply that the trait is influenced more by the presence/absence of OTUs than by their abundances). As a result, the correspondence between MiRKAT and $R^2$ can enhance the interpretability of the MiRKAT test, in the sense that a quantitative $R^2$ value is, in general, more straightforward and informative than the more qualitative MiRKAT p-value (significant or not). Moreover, this correspondence may also suggest the potential for development of a powerful similarity-learning procedure, by maximizing the proportion of explained variance (or, equivalently, minimizing the proportion of unexplained variance).

Figure 1. Comparison of the MiRKAT statistic and the $R^2$ statistic of similarity matrix regression. The red line represents the regression line, which is identical to the 45-degree line y = x.