MicrobiomeGWAS: A Tool for Identifying Host Genetic Variants Associated with Microbiome Composition

The microbiome is the collection of all microbial genes and can be investigated by sequencing highly variable regions of 16S ribosomal RNA (rRNA) genes. Evidence suggests that environmental factors and host genetics may interact to impact human microbiome composition. Identifying host genetic variants associated with human microbiome composition not only provides clues for characterizing microbiome variation but also helps to elucidate biological mechanisms of genetic associations, prioritize genetic variants, and improve genetic risk prediction. Since a microbiota functions as a community, it is best characterized by β diversity; that is, a pairwise distance matrix. We develop a statistical framework and a computationally efficient software package, microbiomeGWAS, for identifying host genetic variants associated with microbiome β diversity with or without interacting with an environmental factor. We show that the score statistics have positive skewness and kurtosis due to the dependent nature of the pairwise data, which makes p-value approximations based on asymptotic distributions unacceptably liberal. By correcting for skewness and kurtosis, we develop accurate p-value approximations, whose accuracy was verified by extensive simulations. We exemplify our methods by analyzing a set of 147 genotyped subjects with 16S rRNA microbiome profiles from non-malignant lung tissues. Correcting for skewness and kurtosis eliminated the dramatic deviation in the quantile–quantile plots. We provided preliminary evidence that six established lung cancer risk SNPs were collectively associated with microbiome composition for both unweighted (p = 0.0032) and weighted (p = 0.011) UniFrac distance matrices. In summary, our methods will facilitate analyzing large-scale genome-wide association studies of the human microbiome.


Introduction
The human body is colonized by bacteria, viruses, and other microbes that exceed the number of human cells by at least 10-fold and that exceed the number of human genes by at least 100-fold. The relationship between a person and his or her microbial population, termed the microbiota, is generally mutualistic. The microbiota may promote human health by inhibiting infection by pathogens, conditioning the immune system, synthesizing and digesting nutrients, and maintaining overall homeostasis. The microbiome, which is the collection of all microbial genes, can be investigated through massively parallel, nextgeneration DNA sequencing technologies. By amplifying and sequencing highly variable regions of 16S ribosomal RNA genes that are present in all eubacteria, cost-effective and informative microbiome profiles down to the genus level are obtained.
The human microbiome has been associated with diseases, including obesity [1], inflammatory bowel disease (IBD) [2], colorectal cancer [3], and breast cancer [4]. Thus, identifying factors that have a sustained impact on the microbiome is fundamental for elucidating its role in health conditions and for developing treatment strategies. Increasing evidence suggests that microbiome composition at a specific site of the human body is impacted by environmental factors [5,6], host genetics [7,8], and possibly by their interactions. In the mouse, quantitative trait loci (QTL) studies have identified loci contributing to the variation in the gut microbiome using linkage analysis [9,10]. Recently, Goodrich et al. [11] systematically investigated the heritability of the human gut microbiome by comparing monozygotic twins to dizygotic twins and found substantial heritability in different microbiome metrics, suggesting the important role of host genetics on gut microbiome diversity. Associations between individual host genetic variants and microbiome taxa abundances have also begun to emerge in other human samples [7,8,12]. These studies suggest that genome-wide association studies (GWAS) have great potential to identify host genetic variants associated with microbiome diversity.
GWAS of complex human diseases have identified many risk SNPs; however, the biological mechanisms are largely unknown for the majority of the risk SNPs. QTL studies of intermediate traits, e.g., gene expression [13,14], DNA methylation [15,16], chromatin structure [17,18], and metabolite production [19,20], have provided useful insights into the biological mechanisms of the GWAS findings. The human microbiome at a specific body site is another important and informative intermediate trait for interpreting GWAS signals. Knights et al. [8] reported that a risk SNP for IBD located in NOD2 was associated with the relative abundance of Enterobacteriaceae in the human gut microbiome. Tong et al. [7] show that a loss-of-function allele in FUT2 that increases the risk of developing Crohn's Disease (CD) may modulate the energy metabolism of the gut microbiome. In both examples, the microbiome is a potential intermediate for explaining the association between risk SNPs and disease risks, although a formal mediation analysis is required based on samples with genotype, microbiome, and disease status data. Moreover, identifying microbiomeassociated host genetic variants has the potential to prioritize SNPs for discovery and to improve the performance of polygenetic risk prediction.
Three types of microbiome metrics can be derived as phenotypes for GWAS analysis. First, for each taxon at a specified taxonomic level (phylum, class, order, family, genus, and species), we calculate the relative abundance (RA) of the taxon as the ratio of the number of sequencing reads assigned to the taxon to the total number of sequencing reads. In 16S ribosomal RNA sequence profiles, approximately 100-200 taxa with average RAs ≥ 0.1% (from the phylum level to the genus level) across samples are abundant enough for QTL analysis. One can perform a Poisson regression to examine the association between the RA of each taxon and each SNP. Significant associations are identified using Bonferroni correction (p < 5 × 10 −8 /200 = 2.5 × 10 −10 ) or by controlling FDR at an appropriate level. Second, multiple α-diversity metrics [21] can be calculated to reflect the richness (e.g., number of unique taxa) and evenness of each microbiome community after a procedure called rarefication, which eliminates the dependence between the estimated α diversity and the variable total number of sequence reads across subjects. Once the α-diversity metrics are derived, one may perform standard GWAS with α diversity as the phenotype using linear regression.
Because a microbiota functions as a community, the most important analysis for a microbiome GWAS may be by assessing the complete structure of the community by using a pairwise microbiome distance matrix (or β diversity) of the microbial community. Microbiome distances can be defined in different ways, based on using phylogenetic tree information or each taxon's abundance information. Bray-Curtis dissimilarity [22] quantifies the difference between two microbiome communities using the abundance information of specific taxa. UniFrac [23][24][25] is another widely used distance metric. Unlike the Bray-Curtis dissimilarity metric, UniFrac compares microbiome communities by using information on the relative relatedness of each taxon, specifically by phylogenetic distance (branch lengths on a phylogenetic tree). UniFrac has two variants: the weighted UniFrac [24], which accounts for the taxa abundance information, and the unweighted UniFrac [23], which only models the information of presence or absence. Recently, a generalized UniFrac distance metric [26] was developed to automatically appreciate the advantages of weighted and unweighted UniFrac metrics and was shown to provide better statistical power to detect associations between human health conditions and microbiome communities. GWAS based on a microbiome distance matrix aims to identify the host SNPs associated with microbiome composition. This has been done frequently by fitting non-parametric multivariate models [27]. This approach requires permutations to assess significance [28], which is computationally prohibitive, particularly when evaluating p-values less than 5 × 10 −8 -the standard GWAS p-value threshold-or even lower when testing multiple-diversity matrices. In a recent microbiome GWAS, the computation is prohibitive even using a moment matching method based on the F-statistic.
Intuitively, the microbiome distances tend to be smaller for pairs of subjects with similar genotypic values at the associated SNP. In addition, it is also of great interest to identify host SNPs that interact with an environmental factor to affect microbiome composition. Importantly, β diversity is temporally more stable compared with RA of taxa and α-diversity metrics based on the data from the Human Microbiome Project [29], suggesting a smaller power loss for a GWAS due to temporal variability. To our knowledge, no statistical methods or software packages have been designed to efficiently analyze microbiome GWAS data using distance matrices as phenotypes.
In this paper, we develop a statistical framework and a computationally efficient package, microbiomeGWAS, for analyzing microbiome GWAS data. Our package allows the detection of host SNPs with the main effect or interaction with an environment factor; i.e., host SNPs interacting with an environment factor to affect the microbiome composition. We calculate the variance of the score statistics by appropriately considering the dependence of the pairwise distances. Importantly, we show that the score statistics have positive skewness and kurtosis due to the dependence in pairwise distances, which makes the approximation of small p-values based on the asymptotic distribution too liberal, which easily yields false positive associations. Resampling methods, e.g., bootstrap or permutation, are computationally prohibitive for accurately approximating small p-values. We propose to improve the tail probability approximation by correcting for skewness and kurtosis of the score statistics. Numerical investigations demonstrate that our method provides a very accurate approximation, even for p = 5 × 10 −8 . MicrobiomeGWAS runs very efficiently, taking 36 min for analyzing main effects and 69 min for analyzing both main and interaction effects for a study with 2000 subjects and 500,000 SNPs, using a single core. MicrobiomeGWAS is available at https://github.com/lsncibb/microbiomeGWAS [30], accessed on 30 May 2022.
We illustrate our methods by applying microbiomeGWAS to non-malignant lung tissue samples (N = 147) in the Environment And Genetics in Lung cancer Etiology (EAGLE) study [31,32]. Because smoking may alter microbiome composition, we tested both the main effect and gene-smoking interaction effect. When p-values were calculated based on asymptotic distributions, the quantile-quantile (QQ) plots strongly deviated from the uniform distribution. Nine loci also achieved genome-wide significance based on asymptotic approximations. Correcting for skewness and kurtosis eliminated the inflation and also the genome-wide significance of these loci. However, we provide evidence that the established lung cancer risk SNPs are associated with lung microbiome composition.

A Score Statistic for Testing Main Effect
Suppose that we have a set of N subjects genotyped with SNP arrays. For notational simplicity, we consider only one SNP with a minor allele frequency (MAF) denoted as f . Our interest centers on testing whether the genotype of the SNP is associated with microbiome composition. Let g n = 0, 1, 2 represent the number of the minor alleles for subject n. We assume that the 16S rRNA gene of microbiota from a target site (e.g., gut) has been sequenced for these samples. Let d ij be the microbiome distance between subject i and subject j and D be the distance matrix.
Intuitively, if the SNP is associated with the microbiome composition, the microbiome distances tend to be smaller for subject pairs with similar genotypic values, as is illustrated in Figure 1. For N subjects, N(N − 1)/2 pairs can be divided into three groups with genetic distance 0, 1, and 2. For example, a pair of subjects with genotype (AA, AA) or (BB, BB) has genetic distance 0; a pair of subjects with genotype (AA, BB) or (BB, AA) has genetic distance 2; all other pairs have genetic distance 1. Apparently, we expect the microbiome distance to be positively correlated with genetic distance for subject pairs.

A Score Statistic for Testing Gene-Environment Interaction
Let denote an environmental variable. Define Δ = | − |. We extend the statistical framework to detect the SNP-environment interaction by assuming = + + − + Δ + , where denotes the main genetic effect, denote the additive gene-environment effect, and denotes the main effect of the environmental factor. We consider testing the null hypothesis that the SNP is not associated with microbiome composition either directly or by interacting with , . . : = = 0. The alternative hypothesis is : > 0 or > 0.  We define G ij = g i − g j as the genetic distance for a pair of subjects (i, j). We assume d ij = α + β M G ij + ε ij . A score statistic for testing H 0 : β M = 0 (main effect) vs. β M > 0 is derived as: The variance Var 0 ( S M |D) under H 0 : β M = 0 is calculated by considering the dependence in G ij , G kl and conditioning on the distance matrix D. Briefly, we have Var 0 ( S M |D) = ∑ i<j,k<l d ij d kl Cov G ij , G kl . When (i, j, k, l) are distinct, G ij and G kl are independent; i.e., Cov G ij , G kl = 0. Some algebra leads to where and The details for calculating Var G ij and Cov G ij , G ik are in Appendix A. The normalized statistic Z M = S M / Var 0 ( S M |D) ∼ N(0, 1) under H 0 asymptotically.
In analyses of real data, we typically have to adjust for covariates, including demographic variables and principal component analysis (PCA) scores derived based on genotypes, to eliminate potential population stratification. Given a distance matrix D and v covariates (x i1 , · · · , x iv ), we perform distance-based redundancy analyses using function capscale in the vegan package [33]. The residual matrix D , extracted using the residuals function in the vegan package [33], is now adjusted for these potential confounding factors and can be used for genetic analysis.

A Score Statistic for Testing Gene-Environment Interaction
Let E i denote an environmental variable. Define ∆ ij = g i E i − g j E j . We extend the statistical framework to detect the SNP-environment interaction by assuming where β M denotes the main genetic effect, β I denote the additive gene-environment effect, and β E denotes the main effect of the environmental factor. We consider testing the null hypothesis that the SNP is not associated with microbiome composition either directly or by interacting with E, i.e. H 0 : We estimate β E and α under H 0 and calculate (2), we derive the variance Var 0 (S I |D ) by accounting for the dependence in ∆ ij , ∆ kl : Let Z M = S M / Var 0 (S M |D ) and Z I = S I / Var 0 (S I |D ) . Asymptotically, Z M ∼ N(0, 1) and Z I ∼ N(0, 1) under H 0 . In Appendix B, we derive Let ρ = Cor 0 (Z M , Z I |D ) be the correlation between the two statistics. Asymptotically, Appendix C, we derive a statistic for jointly testing H 0 : Briefly, the 2D plane is partitioned to four parts ( Figure 2). The joint statistic is derived as where w 1 = (θ − 1/θ )/2, w 2 = (θ + 1/θ )/2 and θ = (1 − ρ)/(1 + ρ). The asymptotic p-value is calculated as The as totic p-value is calculated as

Improved p-Value Approximations by Correcting for Skewness and Kurtosis
Theoretic investigation suggests that the score statistics and have a po skewness, which makes the tail probability approximations based on the asymptoti tribution (0,1) unacceptably liberal ( Figure 3A,B). In a numeric example with skew = 0.2, ( > 5) =2.9 × 10 −7 based on (0, 1), which is approximately two orde magnitude more significant than p = 3.9 × 10 −5 based on 10 8 permutations. The signifi inflation becomes worse for smaller p-values and larger skewness . Similar but tedious calculations suggest that both statistics have positive kurtosis, making the ap imation based on (0, 1) even worse. One possible solution is to approximate tail abilities using permutations or bootstrap. However, these resampling methods are putationally prohibitive for testing millions of common SNPs in a large-scale study.
To address this problem, we calculated the skewness and kurtosis of the statistics under (Appendix D). We propose to improve the tail proba

Improved p-Value Approximations by Correcting for Skewness and Kurtosis
Theoretic investigation suggests that the score statistics Z M and Z I have a positive skewness, which makes the tail probability approximations based on the asymptotic distribution N(0, 1) unacceptably liberal ( Figure 3A,B). In a numeric example with skewness γ = 0.2, P(Z > 5) =2.9 × 10 −7 based on N(0, 1), which is approximately two orders of magnitude more significant than p = 3.9 × 10 −5 based on 10 8 permutations. The significance inflation becomes worse for smaller p-values and larger skewness γ. Similar but more tedious calculations suggest that both statistics have positive kurtosis, making the approximation based on N(0, 1) even worse. One possible solution is to approximate tail probabilities using permutations or bootstrap. However, these resampling methods are computationally prohibitive for testing millions of common SNPs in a large-scale study.
Given the distance matrix D, γ M ∝ 1/N 1/2 , γ I ∝ 1/N 1/2 , κ M ∝ 1/N and κ I ∝ 1/N (Appendix D). Thus, skewness decays much more slowly with sample size N than kurtosis ( Figure 3C,D). Thus, even for a large study with thousands of samples, correcting for skewness is necessary for accurately evaluating the tail probabilities. Importantly, both skewness and kurtosis highly depend on the MAF, suggesting that the impact of skewness and kurtosis is different across SNPs with a different MAF. Numerical studies ( Figure 3C,D) show that skewness and kurtosis are minimized when MAF = 0.5 and maximized when MAF ≈ 0.2-0.3.
Finally, we discuss how to approximate the tail probability of Q in (7) for testing H 0 : β M = β I = 0 by correcting for non-normality in Z M and Z I . When (Z M , Z I ) ∈ A 2 (or A 3 ), we calculate the skewness E(w 1 Z M + w 2 Z I ) 3 and the kurtosis E(w 1 Z M + w 2 Z I ) 4 − 3 and use (9) to approximate P(w 1 Z M + w 2 Z I > b). When (Z M , Z I ) ∈ A 1 , we first approximate their marginal p-values as p M and p I by (9), and then calculate the normal quantile z M = Φ(1 − p M ) and z I = Φ(1 − p I ). Because the correction primarily impacts the tails of the distributions, the correlation between the two statistics will remain roughly unchanged; i.e., cor 0 (Z M , Z I ) ≈ cor 0 (z M , z I ). Thus, when (Z M , Z I ) ∈ A 1 , the tail probability is approximated as P(χ 2 2 > (z M , z I )Ω −1 (z M , z I ) ).

Simulation Results
The main purpose of simulations was to investigate the type-I error of Z M (for testing the main genetic effect), Z I (for detecting SNP-environment interactions), and Q (for detecting either the main genetic effect or SNP-environment effect or both). Simulations were performed under different combinations of sample size, MAF, and microbiome distance matrices. To make the simulations realistic, we used an unweighted distance matrix of the fecal microbiome samples with the 16S rRNA V4 region sequences from the American Gut Project (AGP) [36]. The OTU table, rarefied to 10,000 sequence reads per sample, as well as the metadata were downloaded from the AGP website. Samples with less than 10,000 sequence reads were excluded from the analysis. The weighted and the unweighted UniFrac distance matrices were generated in the Quantitative Insights Into Microbial Ecology [21] (QIIME) pipeline. Because antibiotics may substantially change the microbiome composition to generate outliers that may distort the null distribution, we excluded samples with self-reported history of antibiotic usage within one month. After quality control, 1879 subjects remained for analysis. In the simulations, we randomly selected N samples for a given sample size N.
For each setting, the type-I error rates were evaluated based on 10 8 simulations under H 0 . For the interaction test and the joint test, the binary environment factor had a frequency of 50% and was simulated independent of the SNP. The type-I error rates are summarized in Table 1 for the weighted UniFrac distance matrix. The skewness and kurtosis are reported in Figure 3C,D. The statistics adjusted for skewness and kurtosis have accurate type-I error rates while the statistics without adjustment have unacceptably high type-I error rates. As the sample size increases, the impact of skewness and kurtosis decreases. However, even for a study with N = 1000, the type-I error rates are still seriously inflated. The results for the unweighted UniFrac distance matrix and for MAF = 0.5 are reported in Table S1.

Software Implementation, Memory Requirement, and Computational Complexity
We implemented our algorithms in a software package, microbiomeGWAS, which is freely available at https://github.com/lsncibb/microbiomeGWAS [30], accessed on 30 May 2022. MicrobiomeGWAS requires three sets of files: a microbiome distance matrix file, a set of PLINK binary files for GWAS genotypes, and a set of covariates. MicrobiomeG-WAS processes one SNP at a time and does not load all genotype data into memory; thus, it requires only memory for storing the distance matrix. Variance, skewness, and kurtosis can be partitioned into two parts related with the microbiome distance matrix and the MAF of the SNP separately; thus, we can quickly calculate these quantities for a predefined grid of MAFs. The overall computational complexity is about O(N 2 M), where N is the sample size and M is the number of SNPs. Figure 4 reports the computation time on a Linux server using a single core. For a study with 10,000 subjects, it takes approximately 15 h for analyzing the main effect and approximately 30 h for analyzing both the main and interaction effects for 0.5 million variants. As a comparison, in a recent microbiome GWAS [37], to analyze 7 × 10 −6 variants for the main effect and n = 3382 subjects in the SHIP-TREND cohort [37], it would take 61 years using one CPU and 94 days using one graph-processing unit for parallel computation. Moreover, their analytic pipeline could not jointly analyze all 8956 subjects from five cohorts because of the computational burden; instead, they performed a stepwise search that may cause power loss.

Software Implementation, Memory Requirement, and Computational Complexity
We implemented our algorithms in a software package, microbiomeGWAS, which is freely available at https://github.com/lsncibb/microbiomeGWAS [30], accessed on 30 May 2022. MicrobiomeGWAS requires three sets of files: a microbiome distance matrix file, a set of PLINK binary files for GWAS genotypes, and a set of covariates. MicrobiomeGWAS processes one SNP at a time and does not load all genotype data into memory; thus, it requires only memory for storing the distance matrix. Variance, skewness, and kurtosis can be partitioned into two parts related with the microbiome distance matrix and the MAF of the SNP separately; thus, we can quickly calculate these quantities for a predefined grid of MAFs. The overall computational complexity is about ( ), where is the sample size and is the number of SNPs. Figure 4 reports the computation time on a Linux server using a single core. For a study with 10,000 subjects, it takes approximately 15 h for analyzing the main effect and approximately 30 h for analyzing both the main and interaction effects for 0.5 million variants. As a comparison, in a recent microbiome GWAS [37], to analyze 7 × 10 −6 variants for the main effect and n = 3382 subjects in the SHIP-TREND cohort [37], it would take 61 years using one CPU and 94 days using one graph-processing unit for parallel computation. Moreover, their analytic pipeline could not jointly analyze all 8956 subjects from five cohorts because of the computational burden; instead, they performed a stepwise search that may cause power loss.

GWAS of Microbiome Diversity in Adjacent Normal Lung Tissues
We applied our methods to a set of lung cancer patients of Italian ancestry in the EAGLE [31] study. All subjects have germline genome-wide SNPs [32] and 16S rRNA microbiome data (V3-V4 region, Illumina MiSeq, 300 paired-end) in histologically normal lung tissues from these patients. Here, the histologically normal lung tissues were 1~5 cm from the tumor tissue. We performed a series of quality control steps to filter out low-quality sequence reads: average quality score <20 over 30 bp windows, less than 60% similarity to the Greengenes [38] reference, or identified as chimera reads using UCHIME [39]. Sequence reads were then processed by QIIME [21] to produce the relative abundances (RA) of taxa, two α-diversity metrics (observed number of species and Shannon's index), and β diversity metrics (unweighted and weighted UniFrac distances) rarified to 1000 reads. We included 147 subjects with at least 1000 high-quality sequence reads for genetic association analysis.
Out of the 147 subjects, 78 are current smokers, 8 never smoked, and 61 are former smokers. Because of the small number of never smokers, we merged never and former smokers as non-current smokers. All of the genetic association analyses were adjusted for sex, age, smoking status, and the top three PCA scores derived based on genome-wide SNPs. Here, the top three PCA scores were selected for controlling population stratification because the other PCA scores were unassociated with the distance matrices. We included 383,263 common SNPs with MAF ≥ 10% because rarer SNPs were expected to have no statistical power given the current sample size. We first performed GWAS analysis using PLINK [40] to identify the SNPs associated with taxa with an average RA greater than 0.1% or two α-diversity metrics. We did not detect genome-wide significant associations with either the main effects or gene-smoking interactions.
Next, we performed GWAS analysis using unweighted and weighted UniFrac distance matrices as a representation of eubacteria β diversity. The results for testing the main effects are reported in Figure 5. Results for testing the joint effects (main effect and SNP by smoking status interaction) are reported in Figure S1. Because of the small sample size, we observed large values of skewness and kurtosis, with the magnitude varying with the MAF of the SNPs ( Figure 5A). The score statistics based on the weighted UniFrac distance matrix had a much larger skewness and kurtosis than did the unweighted UniFrac matrix. Figure 5B,C report the quantile-quantile (QQ) plot of the logarithm of the association p-values for the unweighted and weighted UniFrac distance matrices, respectively. For each distance matrix, we produced QQ plots for p-values based on the asymptotic approximation and for p-values adjusted for skewness and kurtosis. For both distance matrices, the QQ plots before adjustment strongly deviated from the expected uniform distribution. Our adjustment eliminated the deviation. In addition, consistent with the observation that the skewness and kurtosis were larger for the weighted UniFrac distance matrix, the QQ plot deviated more for the analysis based on the weighted UniFrac distance. Note that the skewness and kurtosis only affect the tail probabilities; thus, the inflation of the QQ plot is not reflected by the genomic control lambda value [41], calculated as the median of the p-values. In fact, lambda ≈ 1 for all four QQ plots.
Without correcting for skewness and kurtosis, we identified three and six loci achieving genome-wide significance (p < 5 × 10 −8 ) for the unweighted and weighted UniFrac distance matrices, respectively ( Figure 5D). After correcting for skewness and kurtosis, no locus remained genome-wide significant ( Figure 5D), which was verified by 10 8 permutations. Importantly, skewness and kurtosis had a dramatic effect on tail probabilities. Here, we use SNP rs12785513 as an example, which was identified as the top SNP in both analyses. In the unweighted UniFrac analysis, p = 4.4 × 10 −9 without adjustment and p = 1.6 × 10 −6 after adjustment, a 364-fold inflation. The inflation was even larger for weighted UniFrac analysis because of larger skewness and kurtosis ( Figure 5A). In fact, p = 3.4 × 10 −10 without adjustment and p = 3.5 × 10 −6 after adjustment, a 1000-fold inflation. Although these SNPs were not significant genome-wide, they were the top SNPs from the current study. Thus, we report box-plots for each of these nine SNPs ( Figure 5E). As expected, in all box plots, microbiome distances tend to be larger in subject pairs with greater genetic distance at these SNPs. These associations remain to be replicated in studies with larger sample sizes. Without correcting for skewness and kurtosis, we identified three and six loci achieving genome-wide significance ( 5 × 10 ) for the unweighted and weighted UniFrac distance matrices, respectively ( Figure 5D). After correcting for skewness and kurtosis, no locus remained genome-wide significant ( Figure 5D), which was verified by 10 8 permutations. Importantly, skewness and kurtosis had a dramatic effect on tail probabilities. Here, we use SNP rs12785513 as an example, which was identified as the top SNP in both analyses. In the unweighted UniFrac analysis, p = 4.4 × 10 −9 without adjustment and p = 1.6 × 10 −6 after adjustment, a 364-fold inflation. The inflation was even larger for weighted UniFrac analysis because of larger skewness and kurtosis ( Figure 5A). In fact, p = 3.4 × 10 −10 without adjustment and p = 3.5 × 10 −6 after adjustment, a 1000-fold inflation. Although these SNPs were not significant genome-wide, they were the top SNPs from the current study. Thus, we report box-plots for each of these nine SNPs ( Figure 5E). As expected, in all box plots, microbiome distances tend to be larger in subject pairs with greater genetic Finally, we concentrated on the six common SNPs in four genomic regions reported to be associated with lung cancer risk in GWAS of European subjects: rs2036534 and rs1051730 at 15q25.1 [42][43][44][45] (CHRNA5-CHRNA3-CHRNB4), rs2736100 and rs401681 at locus 5p15.33 [31,46] (TERT/CLPTM1L), rs6489769 [47] at 12p13.3 (RAD52), and rs1333040 at 9p21.3 [48] (CDKN2A/CDKN2B). The SNPs at 15q25.1 and 5p15.33 have the largest effect sizes for lung cancer risk based on the meta-analysis from the Transdisciplinary Research in Cancer of the Lung (TRICL) consortium [48]: OR = 1.32 for rs1051730, OR = 1.26 for rs2036534, OR = 1.13 for rs2736100, and OR = 1.14 for rs401681. Rs3131379 at locus 6p21.33 [46] (BAT3/MSH5) was excluded because the MAF = 7.5%. No SNPs were significantly associated with taxa RAs or α-diversity metrics after correcting for multiple testing. However, association analysis based on the UniFrac distance matrices provided evidence that these SNPs may be associated with the lung microbiota (Table 2). These SNPs were independent except that rs2036534 and rs1051730 at 15q25.1 were weakly correlated with R 2 = 0.15. A test combining six z-scores (Z M ) and adjusting for the weak correlation yielded overall p-values of 0.0033 and 0.011 for the unweighted and the weighted UniFrac distance matrices, respectively. These results suggest that lung cancer risk SNPs were enriched for genetic association with the composition of the lung microbiome. The results for testing interactions and joint effects are reported in Table S2.

Discussion
We developed a software package, microbiomeGWAS, for identifying host genetic variants associated with microbiome composition. MicrobiomeGWAS can test both the main effect and SNP-environment interactions. Importantly, we found that the score statistics had positive skewness and kurtosis and that the tail probabilities evaluated based on asymptotic approximations were very liberal. We addressed this problem by explicitly adjusting for skewness and kurtosis. MicrobiomeGWAS runs very efficiently and takes only 36 min for testing main effects and 69 min for testing joint effects in a GWAS with 2000 subjects and 500,000 markers. Other statistical methods exist for testing the association of microbiome distance matrices. PERMANOVA [27] is an extension of multivariate analysis of variance to a matrix of pairwise distances and relies on permutations to evaluate significance. MiRKAT [49], a recently proposed method based on kernel regression, takes hours for evaluating one association for 2000 subjects. Neither is computationally feasible for analyzing a large-scale GWAS of a microbiome. Recently, an asymptotic distribution was proposed to approximate the p-value for the PERMANOVA pseudo-F statistic [50]; however, whether it is sufficiently accurate for very small p-values (p < 5 × 10 −8 , for GWAS) remains to be investigated.
Interactions of host genetic susceptibility with the microbiome have been postulated for many conditions, including inflammatory bowel diseases [51,52], autoimmune and rheumatic diseases [53][54][55][56], diabetes [57], and cancer, especially of the colon [58]. All models of these host-microbiome interactions also note the critical role of environmental factors, including diet, smoking, drugs, and antibiotics and other medications [59]. Although based on a very small initial sample set, the suggestive associations that we found between the six known lung cancer risk SNPs and the microbiome of adjacent normal lung tissue samples, including effects of cigarette smoking, provide preliminary evidence that our microbiomeGWAS method is likely to be a useful tool for generating data that will unravel host-microbiome interactions with high confidence.
We are working on two extensions for microbiomeGWAS: (1) jointly testing additive and dominant effects; and (2) testing genetic associations using many microbiome distance matrices. We have assumed an additive effect model ( Figure 1); however, several top SNPs in the EAGLE data suggest a dominant effect (e.g., rs8083714 in Figure 5E). Thus, a statistic for jointly testing the additive and dominant effects might be powerful for this scenario. The second extension is motivated by the fact that that the power to detect associations depends heavily on the choice of distance matrix. The recently developed generalized UniFrac [26] (gUniFrac) defines a series of distance matrices to reflect the different emphases of using taxa relative abundance information. gUniFrac has been shown to have a robust power for association studies [26]. Extending microbiomeGWAS to gUniFrac, however, requires solving two problems. First, the computational complexity is proportional to the number of distance matrices analyzed for associations, which can be addressed by implementing the algorithms using multithreading technology. Second, we need to derive accurate analytic approximations to the association p-values by correcting for the multiple testing introduced by many distance matrices. MiRKAT [49] has an option for using gUniFrac; however, intensive permutations are required to evaluate p-values.
In summary, GWAS of the microbiome of each body site has the potential to help one understand microbiome variation, to elucidate the biological mechanisms of genetic associations, to improve the power of identifying novel disease-associated genetic variants, and to improve the performance of genetic risk prediction. We expect our methods and software to be useful for large-scale GWAS of the human microbiome.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/genes13071224/s1, Table S1: Type-I error rates estimated based on 10 8 simulations. Table S2: Association P-values between lung cancer risk SNPs and microbiome composition in the EAGLE data. Figure S1: Quantile-quantile (QQ) plot for association p-values testing the joint effects (main effect and SNP by smoking interaction) using the unweighted UniFrac distance matrices. Figure S2: Derivation of the likelihood ratio statistic Q in (7) and (8).  Let G ij = G ij − EG ij and ∆ ij = ∆ ij − E∆ ij . We first calculate covariance under H 0 : When (i, j, m, n) are distinct, Cov G ij , ∆ mn = 0. Some algebra leads to with µ 2 and µ 3 specified in (3) and (4). Combining (2), (5) and (A4), we have Equation (A5) suggests that the correlation is asymptotically independent of the microbiome distance matrix. In the real-data analyses, we found that (A5) was very accurate when sample size N ≥ 50. The details of calculating Cov G ij , ∆ ij and Cov G ij , ∆ ik are provided in Supplemental Data.