Article

MicrobiomeGWAS: A Tool for Identifying Host Genetic Variants Associated with Microbiome Composition

1
Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institute of Health, Rockville, MD 20850, USA
2
Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
3
Molecular Genetics and Genomics Branch, Center for Scientific Review, National Institute of Health, Bethesda, MD 20817, USA
*
Author to whom correspondence should be addressed.
Genes 2022, 13(7), 1224; https://doi.org/10.3390/genes13071224
Submission received: 2 June 2022 / Revised: 30 June 2022 / Accepted: 1 July 2022 / Published: 9 July 2022
(This article belongs to the Special Issue Statistical Analysis of Microbiome Data: From Methods to Application)

Abstract

The microbiome is the collection of all microbial genes and can be investigated by sequencing highly variable regions of 16S ribosomal RNA (rRNA) genes. Evidence suggests that environmental factors and host genetics may interact to impact human microbiome composition. Identifying host genetic variants associated with human microbiome composition not only provides clues for characterizing microbiome variation but also helps to elucidate biological mechanisms of genetic associations, prioritize genetic variants, and improve genetic risk prediction. Since a microbiota functions as a community, it is best characterized by β diversity; that is, a pairwise distance matrix. We develop a statistical framework and a computationally efficient software package, microbiomeGWAS, for identifying host genetic variants associated with microbiome β diversity with or without interacting with an environmental factor. We show that the score statistics have positive skewness and kurtosis due to the dependent nature of the pairwise data, which makes p-value approximations based on asymptotic distributions unacceptably liberal. By correcting for skewness and kurtosis, we develop accurate p-value approximations, whose accuracy was verified by extensive simulations. We exemplify our methods by analyzing a set of 147 genotyped subjects with 16S rRNA microbiome profiles from non-malignant lung tissues. Correcting for skewness and kurtosis eliminated the dramatic deviation in the quantile–quantile plots. We provided preliminary evidence that six established lung cancer risk SNPs were collectively associated with microbiome composition for both unweighted (p = 0.0032) and weighted (p = 0.011) UniFrac distance matrices. In summary, our methods will facilitate analyzing large-scale genome-wide association studies of the human microbiome.

1. Introduction

The human body is colonized by bacteria, viruses, and other microbes that exceed the number of human cells by at least 10-fold and that exceed the number of human genes by at least 100-fold. The relationship between a person and his or her microbial population, termed the microbiota, is generally mutualistic. The microbiota may promote human health by inhibiting infection by pathogens, conditioning the immune system, synthesizing and digesting nutrients, and maintaining overall homeostasis. The microbiome, which is the collection of all microbial genes, can be investigated through massively parallel, next-generation DNA sequencing technologies. By amplifying and sequencing highly variable regions of 16S ribosomal RNA genes that are present in all eubacteria, cost-effective and informative microbiome profiles down to the genus level are obtained.
The human microbiome has been associated with diseases, including obesity [1], inflammatory bowel disease (IBD) [2], colorectal cancer [3], and breast cancer [4]. Thus, identifying factors that have a sustained impact on the microbiome is fundamental for elucidating its role in health conditions and for developing treatment strategies. Increasing evidence suggests that microbiome composition at a specific site of the human body is impacted by environmental factors [5,6], host genetics [7,8], and possibly by their interactions. In the mouse, quantitative trait loci (QTL) studies have identified loci contributing to the variation in the gut microbiome using linkage analysis [9,10]. Recently, Goodrich et al. [11] systematically investigated the heritability of the human gut microbiome by comparing monozygotic twins to dizygotic twins and found substantial heritability in different microbiome metrics, suggesting the important role of host genetics on gut microbiome diversity. Associations between individual host genetic variants and microbiome taxa abundances have also begun to emerge in other human samples [7,8,12]. These studies suggest that genome-wide association studies (GWAS) have great potential to identify host genetic variants associated with microbiome diversity.
GWAS of complex human diseases have identified many risk SNPs; however, the biological mechanisms are largely unknown for the majority of the risk SNPs. QTL studies of intermediate traits, e.g., gene expression [13,14], DNA methylation [15,16], chromatin structure [17,18], and metabolite production [19,20], have provided useful insights into the biological mechanisms of the GWAS findings. The human microbiome at a specific body site is another important and informative intermediate trait for interpreting GWAS signals. Knights et al. [8] reported that a risk SNP for IBD located in NOD2 was associated with the relative abundance of Enterobacteriaceae in the human gut microbiome. Tong et al. [7] showed that a loss-of-function allele in FUT2 that increases the risk of developing Crohn’s Disease (CD) may modulate the energy metabolism of the gut microbiome. In both examples, the microbiome is a potential intermediate for explaining the association between risk SNPs and disease risks, although a formal mediation analysis is required based on samples with genotype, microbiome, and disease status data. Moreover, identifying microbiome-associated host genetic variants has the potential to prioritize SNPs for discovery and to improve the performance of polygenic risk prediction.
Three types of microbiome metrics can be derived as phenotypes for GWAS analysis. First, for each taxon at a specified taxonomic level (phylum, class, order, family, genus, and species), we calculate the relative abundance (RA) of the taxon as the ratio of the number of sequencing reads assigned to the taxon to the total number of sequencing reads. In 16S ribosomal RNA sequence profiles, approximately 100–200 taxa with average RAs ≥ 0.1% (from the phylum level to the genus level) across samples are abundant enough for QTL analysis. One can perform a Poisson regression to examine the association between the RA of each taxon and each SNP. Significant associations are identified using Bonferroni correction ($p < 5 \times 10^{-8}/200 = 2.5 \times 10^{-10}$) or by controlling the FDR at an appropriate level. Second, multiple α-diversity metrics [21] can be calculated to reflect the richness (e.g., number of unique taxa) and evenness of each microbiome community after a procedure called rarefaction, which eliminates the dependence between the estimated α diversity and the variable total number of sequence reads across subjects. Once the α-diversity metrics are derived, one may perform standard GWAS with α diversity as the phenotype using linear regression.
Third, because a microbiota functions as a community, the most important analysis for a microbiome GWAS may be to assess the complete structure of the community using a pairwise microbiome distance matrix (β diversity). Microbiome distances can be defined in different ways, using phylogenetic tree information or each taxon’s abundance information. Bray–Curtis dissimilarity [22] quantifies the difference between two microbiome communities using the abundance information of specific taxa. UniFrac [23,24,25] is another widely used distance metric. Unlike the Bray–Curtis dissimilarity metric, UniFrac compares microbiome communities by using information on the relative relatedness of each taxon, specifically by phylogenetic distance (branch lengths on a phylogenetic tree). UniFrac has two variants: the weighted UniFrac [24], which accounts for the taxa abundance information, and the unweighted UniFrac [23], which only models the information of presence or absence. Recently, a generalized UniFrac distance metric [26] was developed to combine the advantages of the weighted and unweighted UniFrac metrics and was shown to provide better statistical power to detect associations between human health conditions and microbiome communities. GWAS based on a microbiome distance matrix aims to identify the host SNPs associated with microbiome composition. This has frequently been done by fitting non-parametric multivariate models [27]. This approach requires permutations to assess significance [28], which is computationally prohibitive, particularly when evaluating p-values less than $5 \times 10^{-8}$ (the standard GWAS p-value threshold) or even lower when testing multiple β-diversity matrices. In a recent microbiome GWAS, the computation was prohibitive even with a moment-matching method based on the F-statistic.
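To make the notion of a pairwise distance matrix concrete, the following minimal sketch computes Bray–Curtis dissimilarity, in its standard form (sum of absolute abundance differences divided by the sum of abundances), for a few toy abundance profiles. The function names and the toy counts are illustrative assumptions and are not part of microbiomeGWAS; UniFrac distances additionally require a phylogenetic tree and are not shown.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("profiles must list the same taxa")
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return numerator / denominator


def distance_matrix(profiles):
    """Symmetric N x N matrix of pairwise dissimilarities (zero diagonal)."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = bray_curtis(profiles[i], profiles[j])
    return d


# Hypothetical read counts for three subjects over three taxa.
toy_profiles = [[10, 0, 5], [5, 5, 5], [10, 0, 5]]
D = distance_matrix(toy_profiles)
```

Subjects with identical profiles have distance 0, and subjects that share no taxa have distance 1.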
Intuitively, the microbiome distances tend to be smaller for pairs of subjects with similar genotypic values at the associated SNP. In addition, it is also of great interest to identify host SNPs that interact with an environmental factor to affect microbiome composition. Importantly, β diversity is temporally more stable compared with RA of taxa and α-diversity metrics based on the data from the Human Microbiome Project [29], suggesting a smaller power loss for a GWAS due to temporal variability. To our knowledge, no statistical methods or software packages have been designed to efficiently analyze microbiome GWAS data using distance matrices as phenotypes.
In this paper, we develop a statistical framework and a computationally efficient package, microbiomeGWAS, for analyzing microbiome GWAS data. Our package detects host SNPs that affect microbiome composition through a main effect or through interaction with an environmental factor. We calculate the variance of the score statistics by appropriately considering the dependence of the pairwise distances. Importantly, we show that the score statistics have positive skewness and kurtosis due to the dependence in pairwise distances. As a result, approximating small p-values from the asymptotic distribution is too liberal and easily yields false positive associations. Resampling methods, e.g., bootstrap or permutation, are computationally prohibitive for accurately approximating small p-values. We propose to improve the tail probability approximation by correcting for skewness and kurtosis of the score statistics. Numerical investigations demonstrate that our method provides a very accurate approximation, even for $p = 5 \times 10^{-8}$. MicrobiomeGWAS runs very efficiently, taking 36 min for analyzing main effects and 69 min for analyzing both main and interaction effects for a study with 2000 subjects and 500,000 SNPs, using a single core. MicrobiomeGWAS is available at https://github.com/lsncibb/microbiomeGWAS [30], accessed on 30 May 2022.
We illustrate our methods by applying microbiomeGWAS to non-malignant lung tissue samples ($N = 147$) in the Environment And Genetics in Lung cancer Etiology (EAGLE) study [31,32]. Because smoking may alter microbiome composition, we tested both the main effect and gene–smoking interaction effect. When p-values were calculated based on asymptotic distributions, the quantile–quantile (QQ) plots strongly deviated from the uniform distribution. Nine loci also achieved genome-wide significance based on asymptotic approximations. Correcting for skewness and kurtosis eliminated the inflation and also the genome-wide significance of these loci. However, we provide evidence that the established lung cancer risk SNPs are associated with lung microbiome composition.

2. Material and Methods

2.1. A Score Statistic for Testing Main Effect

Suppose that we have a set of $N$ subjects genotyped with SNP arrays. For notational simplicity, we consider only one SNP with a minor allele frequency (MAF) denoted as $f$. Our interest centers on testing whether the genotype of the SNP is associated with microbiome composition. Let $g_n \in \{0, 1, 2\}$ represent the number of minor alleles for subject $n$. We assume that the 16S rRNA gene of microbiota from a target site (e.g., gut) has been sequenced for these samples. Let $d_{ij}$ be the microbiome distance between subject $i$ and subject $j$ and $D = (d_{ij})$ be the distance matrix.
Intuitively, if the SNP is associated with the microbiome composition, the microbiome distances tend to be smaller for subject pairs with similar genotypic values, as is illustrated in Figure 1. For $N$ subjects, the $N(N-1)/2$ pairs can be divided into three groups with genetic distance 0, 1, and 2. For example, a pair of subjects with genotype (AA, AA) or (BB, BB) has genetic distance 0; a pair of subjects with genotype (AA, BB) or (BB, AA) has genetic distance 2; all other pairs have genetic distance 1. We therefore expect the microbiome distance to be positively correlated with genetic distance across subject pairs.
We define $G_{ij} = |g_i - g_j|$ as the genetic distance for a pair of subjects $(i, j)$. We assume $d_{ij} = \alpha + \beta_M G_{ij} + \varepsilon_{ij}$. A score statistic for testing $H_0: \beta_M = 0$ (main effect) vs. $\beta_M > 0$ is derived as:
$$S_M = \sum_{i<j} \tilde{d}_{ij} G_{ij} \quad \text{with} \quad \tilde{d}_{ij} = d_{ij} - \frac{2}{N(N-1)} \sum_{k<l} d_{kl}. \tag{1}$$
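As a minimal sketch of the score statistic, the code below centers the pairwise distances by their mean over all $N(N-1)/2$ pairs and sums the products with the genetic distances $|g_i - g_j|$. The function names and the four-subject toy data are assumptions made for illustration and do not reproduce the microbiomeGWAS implementation.

```python
from itertools import combinations


def centered_distances(d):
    """Map each pair (i, j), i < j, to d[i][j] minus the mean pairwise distance."""
    pairs = list(combinations(range(len(d)), 2))
    mean = sum(d[i][j] for i, j in pairs) / len(pairs)
    return {(i, j): d[i][j] - mean for i, j in pairs}


def score_main(d, g):
    """Score statistic S_M: sum over pairs of centered distance times |g_i - g_j|."""
    centered = centered_distances(d)
    return sum(v * abs(g[i] - g[j]) for (i, j), v in centered.items())


# Toy data: subjects 0,1 carry genotype 0 and subjects 2,3 carry genotype 2.
# Distances are small within a genotype group and large between groups.
toy_d = [
    [0.0, 0.2, 0.8, 0.8],
    [0.2, 0.0, 0.8, 0.8],
    [0.8, 0.8, 0.0, 0.2],
    [0.8, 0.8, 0.2, 0.0],
]
toy_g = [0, 0, 2, 2]
s_m = score_main(toy_d, toy_g)
```

In this toy example the mean distance is 0.6, so the four between-group pairs each contribute $0.2 \times 2$ and the statistic is positive, in line with the one-sided alternative $\beta_M > 0$.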
The variance $\mathrm{Var}_0(S_M \mid D)$ under $H_0: \beta_M = 0$ is calculated by considering the dependence in $(G_{ij}, G_{kl})$ and conditioning on the distance matrix $D$. Briefly, writing $\tilde{d}_{ij}$ for the mean-centered distances used in $S_M$, we have $\mathrm{Var}_0(S_M \mid D) = \sum_{i<j,\,k<l} \tilde{d}_{ij} \tilde{d}_{kl} \, \mathrm{Cov}(G_{ij}, G_{kl})$. When $(i, j, k, l)$ are distinct, $G_{ij}$ and $G_{kl}$ are independent; i.e., $\mathrm{Cov}(G_{ij}, G_{kl}) = 0$. Some algebra leads to
$$\mathrm{Var}_0(S_M \mid D) = \frac{N(N-1)}{2} \mathrm{Var}(G_{ij}) \, \mu_2 + N(N-1)(N-2) \, \mathrm{Cov}(G_{ij}, G_{ik}) \, \mu_3 \tag{2}$$
where
$$\mu_2 = \frac{2}{N(N-1)} \sum_{i<j} \tilde{d}_{ij}^{\,2} \tag{3}$$
and
$$\mu_3 = \frac{2}{N(N-1)(N-2)} \sum_{i<j<k} \left( \tilde{d}_{ij} \tilde{d}_{ik} + \tilde{d}_{ij} \tilde{d}_{jk} + \tilde{d}_{ik} \tilde{d}_{jk} \right). \tag{4}$$
The details for calculating $\mathrm{Var}(G_{ij})$ and $\mathrm{Cov}(G_{ij}, G_{ik})$ are in Appendix A. The normalized statistic $Z_M = S_M / \sqrt{\mathrm{Var}_0(S_M \mid D)} \sim N(0, 1)$ under $H_0$ asymptotically.
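The variance formula can be sketched numerically as follows. Appendix A is not reproduced in this section, so the sketch obtains $\mathrm{Var}(G_{ij})$ and $\mathrm{Cov}(G_{ij}, G_{ik})$ by direct enumeration under the assumption of Hardy–Weinberg genotype frequencies and independent subjects; that assumption, the function names, and the literal $O(N^3)$ triple loop for $\mu_3$ are illustrative choices and not the package's implementation.

```python
from itertools import combinations, product
from math import sqrt


def center(d):
    """Map each pair (i, j), i < j, to d[i][j] minus the mean pairwise distance."""
    pairs = list(combinations(range(len(d)), 2))
    mean = sum(d[i][j] for i, j in pairs) / len(pairs)
    return {(i, j): d[i][j] - mean for i, j in pairs}


def genetic_moments(f):
    """Return (Var(G_ij), Cov(G_ij, G_ik)) for MAF f.

    ASSUMPTION: Hardy-Weinberg genotype frequencies, independent subjects.
    """
    p = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]
    geno = range(3)
    e1 = sum(p[a] * p[b] * abs(a - b) for a, b in product(geno, repeat=2))
    e2 = sum(p[a] * p[b] * (a - b) ** 2 for a, b in product(geno, repeat=2))
    e12 = sum(p[a] * p[b] * p[c] * abs(a - b) * abs(a - c)
              for a, b, c in product(geno, repeat=3))
    return e2 - e1 ** 2, e12 - e1 ** 2


def mu2_mu3(centered, n):
    """Second- and third-order summaries of the centered distances."""
    mu2 = 2.0 / (n * (n - 1)) * sum(v * v for v in centered.values())
    s3 = sum(centered[(i, j)] * centered[(i, k)]
             + centered[(i, j)] * centered[(j, k)]
             + centered[(i, k)] * centered[(j, k)]
             for i, j, k in combinations(range(n), 3))
    mu3 = 2.0 / (n * (n - 1) * (n - 2)) * s3
    return mu2, mu3


def z_main(d, g, f):
    """Return (Z_M, Var_0(S_M | D)) for distance matrix d, genotypes g, MAF f."""
    n = len(g)
    centered = center(d)
    s_m = sum(v * abs(g[i] - g[j]) for (i, j), v in centered.items())
    var_g, cov_g = genetic_moments(f)
    mu2, mu3 = mu2_mu3(centered, n)
    var_s = (n * (n - 1) / 2 * var_g * mu2
             + n * (n - 1) * (n - 2) * cov_g * mu3)
    return s_m / sqrt(var_s), var_s
```

The closed form agrees with the brute-force double sum over all pairs of pairs, in which pairs sharing both indices contribute $\mathrm{Var}(G_{ij})$, pairs sharing one index contribute $\mathrm{Cov}(G_{ij}, G_{ik})$, and disjoint pairs contribute zero.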
In analyses of real data, we typically have to adjust for covariates, including demographic variables and principal component analysis (PCA) scores derived from genotypes, to eliminate potential population stratification. Given a distance matrix $D$ and $v$ covariates $(x_{i1}, \ldots, x_{iv})$, we perform distance-based redundancy analysis using the function capscale in the vegan package [33]. The residual distance matrix, extracted using the residuals function in the vegan package [33], is adjusted for these potential confounding factors and is used in place of $D$ for genetic analysis.

2.2. A Score Statistic for Testing Gene–Environment Interaction

Let $E_i$ denote an environmental variable. Define $\Delta_{ij} = |g_i E_i - g_j E_j|$. We extend the statistical framework to detect the SNP–environment interaction by assuming $d_{ij} = \alpha + \beta_M G_{ij} + \beta_E |E_i - E_j| + \beta_I \Delta_{ij} + \varepsilon_{ij}$, where $\beta_M$ denotes the main genetic effect, $\beta_I$ denotes the additive gene–environment interaction effect, and $\beta_E$ denotes the main effect of the environmental factor. We consider testing the null hypothesis that the SNP is not associated with microbiome composition either directly or by interacting with $E$, i.e., $H_0: \beta_M = \beta_I = 0$. The alternative hypothesis is $H_1: \beta_M > 0$ or $\beta_I > 0$.
We estimate $\beta_E$ and $\alpha$ under $H_0$ and calculate $\tilde{d}_{ij} = d_{ij} - \hat{\alpha} - \hat{\beta}_E |E_i - E_j|$. Let $\tilde{D} = (\tilde{d}_{ij})$ be the residual matrix. The scores evaluated under $H_0$ are $S_M = \sum_{i<j} \tilde{d}_{ij} G_{ij}$ for $\beta_M$ and $S_I = \sum_{i<j} \tilde{d}_{ij} \Delta_{ij}$ for $\beta_I$. Similar to (2), we derive the variance $\mathrm{Var}_0(S_I \mid D)$ by accounting for the dependence in $(\Delta_{ij}, \Delta_{kl})$:
$$\mathrm{Var}_0(S_I \mid D) = \frac{N(N-1)}{2} \mathrm{Var}(\Delta_{ij}) \, \mu_2 + N(N-1)(N-2) \, \mathrm{Cov}(\Delta_{ij}, \Delta_{ik}) \, \mu_3. \tag{5}$$
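A minimal sketch of the two scores under the interaction model is given below. The text does not state how $\alpha$ and $\beta_E$ are estimated under $H_0$; the sketch assumes ordinary least squares over all subject pairs, and that choice, the function names, and the toy data are illustrative assumptions.

```python
from itertools import combinations


def residual_distances(d, e):
    """Residuals of d_ij regressed on |E_i - E_j| over all pairs i < j.

    ASSUMPTION: alpha and beta_E are fitted by ordinary least squares.
    """
    pairs = list(combinations(range(len(e)), 2))
    x = [abs(e[i] - e[j]) for i, j in pairs]
    y = [d[i][j] for i, j in pairs]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta_e = sxy / sxx if sxx > 0 else 0.0
    alpha = mean_y - beta_e * mean_x
    return {p: yi - alpha - beta_e * xi for p, xi, yi in zip(pairs, x, y)}


def interaction_scores(d, g, e):
    """Return (S_M, S_I) computed from the residual distances."""
    res = residual_distances(d, e)
    s_m = sum(r * abs(g[i] - g[j]) for (i, j), r in res.items())
    s_i = sum(r * abs(g[i] * e[i] - g[j] * e[j]) for (i, j), r in res.items())
    return s_m, s_i


def toy_distances(g, e, beta_e, beta_i):
    """Hypothetical noise-free distances following the model with beta_M = 0."""
    n = len(g)
    return [[0.0 if i == j else
             0.3 + beta_e * abs(e[i] - e[j])
             + beta_i * abs(g[i] * e[i] - g[j] * e[j])
             for j in range(n)] for i in range(n)]


toy_g = [0, 2, 1, 1, 2]
toy_e = [0, 1, 0, 1, 0]
null_scores = interaction_scores(toy_distances(toy_g, toy_e, 0.2, 0.0), toy_g, toy_e)
alt_scores = interaction_scores(toy_distances(toy_g, toy_e, 0.2, 0.1), toy_g, toy_e)
```

When the toy distances depend only on the environmental difference, the residuals vanish and both scores are zero; adding an interaction term makes $S_I$ positive.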
Let $Z_M = S_M / \sqrt{\mathrm{Var}_0(S_M \mid D)}$ and $Z_I = S_I / \sqrt{\mathrm{Var}_0(S_I \mid D)}$. Asymptotically, $Z_M \sim N(0, 1)$ and $Z_I \sim N(0, 1)$ under $H_0$. In Appendix B, we derive
$$\mathrm{Cov}_0(S_M, S_I \mid D) = \frac{N(N-1)}{2} \mathrm{Cov}(G_{ij}, \Delta_{ij}) \, \mu_2 + N(N-1)(N-2) \, \mathrm{Cov}(G_{ij}, \Delta_{ik}) \, \mu_3. \tag{6}$$
Let $\rho = \mathrm{Cor}_0(Z_M, Z_I \mid D)$ be the correlation between the two statistics. Asymptotically, $(Z_M, Z_I)$ follows a bivariate normal distribution with a correlation matrix $\Omega = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$. In Appendix C, we derive a statistic for jointly testing $H_0: \beta_M = \beta_I = 0$ vs. $H_1: \beta_M > 0$ or $\beta_I > 0$. Briefly, the 2D plane is partitioned into four regions, $A_1$ to $A_4$ (Figure 2). The joint statistic is derived as
$$Q = \begin{cases} (Z_M, Z_I) \, \Omega^{-1} (Z_M, Z_I)^T & (Z_M, Z_I) \in A_1 \\ (w_1 Z_M + w_2 Z_I)^2 & (Z_M, Z_I) \in A_2 \\ (w_2 Z_M + w_1 Z_I)^2 & (Z_M, Z_I) \in A_3 \\ 0 & (Z_M, Z_I) \in A_4 \end{cases} \tag{7}$$
where $w_1 = (\theta - 1/\theta)/2$, $w_2 = (\theta + 1/\theta)/2$ and $\theta = \sqrt{(1 - \rho)/(1 + \rho)}$. The asymptotic p-value is calculated as
$$P(Q > b^2) = q_1 P(\chi_2^2 > b^2) + q_2 P(N(0, 1) > b) + q_3 P(N(0, 1) > b), \tag{8}$$
where $q_i = P((Z_M, Z_I) \in A_i)$.
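The weights in the joint statistic can be checked numerically. The sketch below assumes $\theta = \sqrt{(1-\rho)/(1+\rho)}$, under which $w_1 Z_M + w_2 Z_I$ has unit variance when $Z_M$ and $Z_I$ have unit variances and correlation $\rho$, matching the $N(0, 1)$ tail used in the mixture p-value. The region probabilities $q_1, q_2, q_3$ depend on the partition in Figure 2 and Appendix C, which is not reproduced here, so they are passed in as inputs; all function names are hypothetical.

```python
from math import erfc, exp, sqrt


def joint_weights(rho):
    """Weights (w1, w2) of the joint statistic for correlation rho in (-1, 1)."""
    theta = sqrt((1 - rho) / (1 + rho))
    return (theta - 1 / theta) / 2, (theta + 1 / theta) / 2


def combo_variance(rho):
    """Variance of w1*Z_M + w2*Z_I for unit-variance Z's with correlation rho."""
    w1, w2 = joint_weights(rho)
    return w1 ** 2 + w2 ** 2 + 2 * rho * w1 * w2


def normal_upper_tail(b):
    """P(N(0,1) > b), computed with erfc for accuracy in the far tail."""
    return 0.5 * erfc(b / sqrt(2))


def joint_pvalue(b, q1, q2, q3):
    """Asymptotic mixture p-value P(Q > b^2) for b >= 0.

    q1, q2, q3 are the region probabilities, supplied by the caller.
    P(chi-square with 2 df > x) equals exp(-x / 2).
    """
    return q1 * exp(-b * b / 2) + (q2 + q3) * normal_upper_tail(b)
```

For $\rho = 0$ the weights reduce to $w_1 = 0$ and $w_2 = 1$, so the statistic in region $A_2$ is simply $Z_I^2$ and in region $A_3$ it is $Z_M^2$.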

2.3. Improved p-Value Approximations by Correcting for Skewness and Kurtosis

Theoretical investigation suggests that the score statistics $Z_M$ and $Z_I$ have a positive skewness, which makes the tail probability approximations based on the asymptotic distribution $N(0, 1)$ unacceptably liberal (Figure 3A,B). In a numeric example with skewness $\gamma = 0.2$, $P(Z > 5) = 2.9 \times 10^{-7}$ based on $N(0, 1)$, which is approximately two orders of magnitude more significant than $p = 3.9 \times 10^{-5}$ based on $10^{8}$ permutations. The significance inflation becomes worse for smaller p-values and larger skewness $\gamma$. Similar but more tedious calculations suggest that both statistics have positive kurtosis, making the approximation based on $N(0, 1)$ even worse. One possible solution is to approximate tail probabilities using permutations or bootstrap. However, these resampling methods are computationally prohibitive for testing millions of common SNPs in a large-scale study.
To address this problem, we calculated the skewness $\gamma$ and kurtosis $\kappa$ of the score statistics under $H_0$ (Appendix D). We propose to improve the tail probability approximation $P_0(Z > b)$ by correcting for the skewness and kurtosis, following the skewness correction in linkage analysis [34,35]. Technical details are provided in Appendix E. Correcting for both skewness and kurtosis leads to an approximation
$$P_0(Z > b) \approx \exp\left\{ -b \xi_1 + (1 + \sigma_1^2) \xi_1^2 / 2 + \gamma \xi_1^3 / 6 + \kappa \xi_1^4 / 24 \right\} \Phi(-\sigma_1 \xi_1) \tag{9}$$
where $\xi_1$ satisfies $\xi + \gamma \xi^2 / 2 + \kappa \xi^3 / 6 = b$, $\sigma_1^2 = 1 + \gamma \xi_1 + \kappa \xi_1^2 / 2$ and $\Phi(\cdot)$ is the cumulative distribution function of $N(0, 1)$. Correcting for skewness but ignoring kurtosis (i.e., assuming $\kappa = 0$) leads to an approximation
$$P_0(Z > b) \approx \exp\left\{ -b \xi_2 + (1 + \sigma_2^2) \xi_2^2 / 2 + \gamma \xi_2^3 / 6 \right\} \Phi(-\sigma_2 \xi_2) \tag{10}$$
where $\xi_2 = (\sqrt{1 + 2 \gamma b} - 1) / \gamma$, $\sigma_2^2 = 1 + \gamma \xi_2$. Numerical results presented in Figure 3B demonstrate that (9) works very well.
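The corrected tail approximation can be sketched in a few lines. The code below solves $\xi + \gamma \xi^2/2 + \kappa \xi^3/6 = b$ by bisection, which assumes $b \ge 0$, $\gamma \ge 0$ and $\kappa \ge 0$ so that the left-hand side is increasing and the root lies in $[0, b]$; the bisection solver and the function names are illustrative choices, not the package's implementation.

```python
from math import erfc, exp, sqrt


def normal_cdf(x):
    """Standard normal CDF, written with erfc so far tails stay accurate."""
    return 0.5 * erfc(-x / sqrt(2))


def solve_xi(b, gamma, kappa):
    """Root of xi + gamma*xi^2/2 + kappa*xi^3/6 = b on [0, b].

    ASSUMPTION: b, gamma, kappa are all non-negative.
    """
    lo, hi = 0.0, b
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid + gamma * mid ** 2 / 2 + kappa * mid ** 3 / 6 < b:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def corrected_tail(b, gamma, kappa=0.0):
    """Approximate P_0(Z > b) with skewness gamma and excess kurtosis kappa."""
    xi = solve_xi(b, gamma, kappa)
    sigma2 = 1 + gamma * xi + kappa * xi ** 2 / 2
    exponent = (-b * xi + (1 + sigma2) * xi ** 2 / 2
                + gamma * xi ** 3 / 6 + kappa * xi ** 4 / 24)
    return exp(exponent) * normal_cdf(-sqrt(sigma2) * xi)


p_normal = corrected_tail(5.0, 0.0)        # reduces to the N(0,1) tail
p_skew = corrected_tail(5.0, 0.2)          # skewness-only correction
p_both = corrected_tail(5.0, 0.2, 0.1)     # hypothetical kurtosis of 0.1
```

With $\gamma = \kappa = 0$ the root is $\xi = b$, the exponent vanishes, and the expression reduces to the normal tail $\Phi(-b)$; positive skewness or kurtosis yields a larger, less significant tail probability.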
Given the distance matrix $D$, $\gamma_M \propto N^{-1/2}$, $\gamma_I \propto N^{-1/2}$, $\kappa_M \propto N^{-1}$ and $\kappa_I \propto N^{-1}$ (Appendix D). Thus, skewness decays much more slowly with sample size $N$ than kurtosis (Figure 3C,D), and even for a large study with thousands of samples, correcting for skewness is necessary for accurately evaluating the tail probabilities. Importantly, both skewness and kurtosis depend strongly on the MAF, so their impact differs across SNPs with different MAFs. Numerical studies (Figure 3C,D) show that skewness and kurtosis are minimized when MAF = 0.5 and maximized when MAF ≈ 0.2–0.3.
Finally, we discuss how to approximate the tail probability of $Q$ in (7) for testing $H_0: \beta_M = \beta_I = 0$ by correcting for non-normality in $Z_M$ and $Z_I$. When $(Z_M, Z_I) \in A_2$ (or $A_3$), we calculate the skewness $E(w_1 Z_M + w_2 Z_I)^3$ and the kurtosis $E(w_1 Z_M + w_2 Z_I)^4 - 3$ and use (9) to approximate $P(w_1 Z_M + w_2 Z_I > b)$. When $(Z_M, Z_I) \in A_1$, we first approximate their marginal p-values as $p_M$ and $p_I$ by (9), and then calculate the normal quantiles $z_M = \Phi^{-1}(1 - p_M)$ and $z_I = \Phi^{-1}(1 - p_I)$. Because the correction primarily impacts the tails of the distributions, the correlation between the two statistics will remain roughly unchanged; i.e., $\mathrm{cor}_0(Z_M, Z_I) \approx \mathrm{cor}_0(z_M, z_I)$. Thus, when $(Z_M, Z_I) \in A_1$, the tail probability is approximated as $P(\chi_2^2 > (z_M, z_I) \, \Omega^{-1} (z_M, z_I)^T)$.

3. Results

3.1. Simulation Results

The main purpose of simulations was to investigate the type-I error of $Z_M$ (for testing the main genetic effect), $Z_I$ (for detecting SNP–environment interactions), and $Q$ (for detecting either the main genetic effect or SNP–environment interaction or both). Simulations were performed under different combinations of sample size, MAF, and microbiome distance matrices. To make the simulations realistic, we used an unweighted distance matrix of the fecal microbiome samples with the 16S rRNA V4 region sequences from the American Gut Project (AGP) [36]. The OTU table, rarefied to 10,000 sequence reads per sample, as well as the metadata were downloaded from the AGP website. Samples with fewer than 10,000 sequence reads were excluded from the analysis. The weighted and the unweighted UniFrac distance matrices were generated in the Quantitative Insights Into Microbial Ecology [21] (QIIME) pipeline. Because antibiotics may substantially change the microbiome composition to generate outliers that may distort the null distribution, we excluded samples with self-reported history of antibiotic usage within one month. After quality control, 1879 subjects remained for analysis. In the simulations, we randomly selected $N$ samples for a given sample size $N$.
For each setting, the type-I error rates were evaluated based on $10^{8}$ simulations under $H_0$. For the interaction test and the joint test, the binary environment factor had a frequency of 50% and was simulated independently of the SNP. The type-I error rates are summarized in Table 1 for the weighted UniFrac distance matrix. The skewness and kurtosis are reported in Figure 3C,D. The statistics adjusted for skewness and kurtosis have accurate type-I error rates while the statistics without adjustment have unacceptably high type-I error rates. As the sample size increases, the impact of skewness and kurtosis decreases. However, even for a study with $N = 1000$, the type-I error rates are still seriously inflated. The results for the unweighted UniFrac distance matrix and for MAF = 0.5 are reported in Table S1.

3.2. Software Implementation, Memory Requirement, and Computational Complexity

We implemented our algorithms in a software package, microbiomeGWAS, which is freely available at https://github.com/lsncibb/microbiomeGWAS [30], accessed on 30 May 2022. MicrobiomeGWAS requires three sets of files: a microbiome distance matrix file, a set of PLINK binary files for GWAS genotypes, and a set of covariates. MicrobiomeGWAS processes one SNP at a time and does not load all genotype data into memory; thus, it requires only memory for storing the distance matrix. Variance, skewness, and kurtosis can each be partitioned into two parts, one related to the microbiome distance matrix and one related to the MAF of the SNP; thus, we can quickly calculate these quantities for a predefined grid of MAFs. The overall computational complexity is about $O(N^2 M)$, where $N$ is the sample size and $M$ is the number of SNPs. Figure 4 reports the computation time on a Linux server using a single core. For a study with 10,000 subjects, it takes approximately 15 h for analyzing the main effect and approximately 30 h for analyzing both the main and interaction effects for 0.5 million variants. As a comparison, in a recent microbiome GWAS [37], to analyze $7 \times 10^{6}$ variants for the main effect in n = 3382 subjects in the SHIP-TREND cohort [37], it would take 61 years using one CPU and 94 days using one graphics processing unit for parallel computation. Moreover, their analytic pipeline could not jointly analyze all 8956 subjects from five cohorts because of the computational burden; instead, they performed a stepwise search that may cause power loss.
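The reported timings are consistent with the stated $O(N^2 M)$ complexity, as the following back-of-the-envelope sketch shows. It extrapolates from the single-core timings quoted for 2000 subjects and 500,000 SNPs (36 min for main effects, 69 min for joint effects), assuming that run time is exactly proportional to $N^2 M$; the helper name is hypothetical and actual run times will depend on hardware.

```python
def scaled_minutes(ref_minutes, ref_n, ref_m, n, m):
    """Extrapolate run time, ASSUMING cost is proportional to N^2 * M."""
    return ref_minutes * (n / ref_n) ** 2 * (m / ref_m)


# Reference timings quoted in the text for N = 2000, M = 500,000.
main_hours = scaled_minutes(36, 2000, 500_000, 10_000, 500_000) / 60
joint_hours = scaled_minutes(69, 2000, 500_000, 10_000, 500_000) / 60
```

Increasing $N$ five-fold multiplies the cost by 25, giving 15 h for main effects and about 29 h for joint effects, close to the approximately 15 h and 30 h reported for 10,000 subjects.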

3.3. GWAS of Microbiome Diversity in Adjacent Normal Lung Tissues

We applied our methods to a set of lung cancer patients of Italian ancestry in the EAGLE [31] study. All subjects have germline genome-wide SNPs [32] and 16S rRNA microbiome data (V3-V4 region, Illumina MiSeq, 300 paired-end) in histologically normal lung tissues from these patients. Here, the histologically normal lung tissues were 1–5 cm from the tumor tissue. We performed a series of quality control steps to filter out low-quality sequence reads: average quality score <20 over 30 bp windows, less than 60% similarity to the Greengenes [38] reference, or identified as chimeric reads using UCHIME [39]. Sequence reads were then processed by QIIME [21] to produce the relative abundances (RA) of taxa, two α-diversity metrics (observed number of species and Shannon’s index), and β-diversity metrics (unweighted and weighted UniFrac distances) rarefied to 1000 reads. We included 147 subjects with at least 1000 high-quality sequence reads for genetic association analysis.
Out of the 147 subjects, 78 are current smokers, 8 never smoked, and 61 are former smokers. Because of the small number of never smokers, we merged never and former smokers as non-current smokers. All of the genetic association analyses were adjusted for sex, age, smoking status, and the top three PCA scores derived based on genome-wide SNPs. Here, the top three PCA scores were selected for controlling population stratification because the other PCA scores were unassociated with the distance matrices. We included 383,263 common SNPs with MAF ≥ 10% because rarer SNPs were expected to have no statistical power given the current sample size. We first performed GWAS analysis using PLINK [40] to identify the SNPs associated with taxa with an average RA greater than 0.1% or two α-diversity metrics. We did not detect genome-wide significant associations with either the main effects or gene–smoking interactions.
Next, we performed GWAS analysis using unweighted and weighted UniFrac distance matrices as a representation of eubacteria β diversity. The results for testing the main effects are reported in Figure 5. Results for testing the joint effects (main effect and SNP by smoking status interaction) are reported in Figure S1. Because of the small sample size, we observed large values of skewness and kurtosis, with the magnitude varying with the MAF of the SNPs (Figure 5A). The score statistics based on the weighted UniFrac distance matrix had a much larger skewness and kurtosis than did those based on the unweighted UniFrac matrix. Figure 5B,C report the quantile–quantile (QQ) plot of the logarithm of the association p-values for the unweighted and weighted UniFrac distance matrices, respectively. For each distance matrix, we produced QQ plots for p-values based on the asymptotic approximation and for p-values adjusted for skewness and kurtosis. For both distance matrices, the QQ plots before adjustment strongly deviated from the expected uniform distribution. Our adjustment eliminated the deviation. In addition, consistent with the observation that the skewness and kurtosis were larger for the weighted UniFrac distance matrix, the QQ plot deviated more for the analysis based on the weighted UniFrac distance. Note that the skewness and kurtosis only affect the tail probabilities; thus, the inflation of the QQ plot is not reflected by the genomic control lambda value [41], which is calculated from the median of the association statistics. In fact, lambda ≈ 1 for all four QQ plots.
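The insensitivity of the genomic control lambda to tail inflation can be illustrated with a short sketch. It uses the common definition of lambda as the median of the 1-df chi-square statistics divided by the median of the $\chi_1^2$ distribution (about 0.455), converting each p-value to a chi-square statistic through the normal quantile. The function name and the synthetic p-values are illustrative assumptions; no EAGLE data are used.

```python
from statistics import NormalDist, median


def genomic_control_lambda(pvalues):
    """Median 1-df chi-square statistic divided by the chi-square(1) median."""
    nd = NormalDist()
    chi2 = [nd.inv_cdf(1 - p / 2) ** 2 for p in pvalues]
    chi2_median_null = nd.inv_cdf(0.75) ** 2  # about 0.455
    return median(chi2) / chi2_median_null


# Synthetic, perfectly uniform p-values on a grid.
n_tests = 1001
uniform_p = [(k + 0.5) / n_tests for k in range(n_tests)]

# Same p-values, except that the ten smallest are made 1000 times smaller,
# mimicking inflation confined to the extreme tail.
inflated_p = [p / 1000 if k < 10 else p for k, p in enumerate(uniform_p)]

lambda_uniform = genomic_control_lambda(uniform_p)
lambda_inflated = genomic_control_lambda(inflated_p)
```

Both lambda values equal 1, because changing only the smallest p-values leaves the median statistic untouched; this is why tail inflation caused by skewness and kurtosis is visible in a QQ plot but not in lambda.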
Without correcting for skewness and kurtosis, we identified three and six loci achieving genome-wide significance ($p < 5 \times 10^{-8}$) for the unweighted and weighted UniFrac distance matrices, respectively (Figure 5D). After correcting for skewness and kurtosis, no locus remained genome-wide significant (Figure 5D), which was verified by $10^{8}$ permutations. Importantly, skewness and kurtosis had a dramatic effect on tail probabilities. Here, we use SNP rs12785513 as an example, which was identified as the top SNP in both analyses. In the unweighted UniFrac analysis, $p = 4.4 \times 10^{-9}$ without adjustment and $p = 1.6 \times 10^{-6}$ after adjustment, a 364-fold inflation. The inflation was even larger for weighted UniFrac analysis because of larger skewness and kurtosis (Figure 5A). In fact, $p = 3.4 \times 10^{-10}$ without adjustment and $p = 3.5 \times 10^{-6}$ after adjustment, a roughly 10,000-fold inflation. Although these SNPs were not significant genome-wide, they were the top SNPs from the current study. Thus, we report box plots for each of these nine SNPs (Figure 5E). As expected, in all box plots, microbiome distances tend to be larger in subject pairs with greater genetic distance at these SNPs. These associations remain to be replicated in studies with larger sample sizes.
Finally, we concentrated on the six common SNPs in four genomic regions reported to be associated with lung cancer risk in GWAS of European subjects: rs2036534 and rs1051730 at 15q25.1 [42,43,44,45] (CHRNA5–CHRNA3–CHRNB4), rs2736100 and rs401681 at locus 5p15.33 [31,46] (TERT/CLPTM1L), rs6489769 [47] at 12p13.3 (RAD52), and rs1333040 at 9p21.3 [48] (CDKN2A/CDKN2B). The SNPs at 15q25.1 and 5p15.33 have the largest effect sizes for lung cancer risk based on the meta-analysis from the Transdisciplinary Research in Cancer of the Lung (TRICL) consortium [48]: OR = 1.32 for rs1051730, OR = 1.26 for rs2036534, OR = 1.13 for rs2736100, and OR = 1.14 for rs401681. SNP rs3131379 at locus 6p21.33 [46] (BAT3/MSH5) was excluded because its MAF was 7.5%. No SNPs were significantly associated with taxa RAs or α-diversity metrics after correcting for multiple testing. However, association analysis based on the UniFrac distance matrices provided evidence that these SNPs may be associated with the lung microbiota (Table 2). These SNPs were independent except that rs2036534 and rs1051730 at 15q25.1 were weakly correlated with $R^2 = 0.15$. A test combining six z-scores ($Z_M$) and adjusting for the weak correlation yielded overall p-values of 0.0033 and 0.011 for the unweighted and the weighted UniFrac distance matrices, respectively. These results suggest that lung cancer risk SNPs were enriched for genetic association with the composition of the lung microbiome. The results for testing interactions and joint effects are reported in Table S2.

4. Discussion

We developed a software package, microbiomeGWAS, for identifying host genetic variants associated with microbiome composition. MicrobiomeGWAS can test both the main effect and SNP–environment interactions. Importantly, we found that the score statistics had positive skewness and kurtosis and that the tail probabilities evaluated based on asymptotic approximations were very liberal. We addressed this problem by explicitly adjusting for skewness and kurtosis. MicrobiomeGWAS runs very efficiently and takes only 36 min for testing main effects and 69 min for testing joint effects in a GWAS with 2000 subjects and 500,000 markers. Other statistical methods exist for testing the association of microbiome distance matrices. PERMANOVA [27] is an extension of multivariate analysis of variance to a matrix of pairwise distances and relies on permutations to evaluate significance. MiRKAT [49], a recently proposed method based on kernel regression, takes hours for evaluating one association for 2000 subjects. Neither is computationally feasible for analyzing a large-scale GWAS of a microbiome. Recently, an asymptotic distribution was proposed to approximate the p-value for the PERMANOVA pseudo-F statistic [50]; however, whether it is sufficiently accurate for very small p-values ($p < 5 \times 10^{-8}$, for GWAS) remains to be investigated.
Interactions of host genetic susceptibility with the microbiome have been postulated for many conditions, including inflammatory bowel diseases [51,52], autoimmune and rheumatic diseases [53,54,55,56], diabetes [57], and cancer, especially of the colon [58]. All models of these host–microbiome interactions also note the critical role of environmental factors, including diet, smoking, drugs, and antibiotics and other medications [59]. Although based on a very small initial sample set, the suggestive associations that we found between the six known lung cancer risk SNPs and the microbiome of adjacent normal lung tissue samples, including effects of cigarette smoking, provide preliminary evidence that our microbiomeGWAS method is likely to be a useful tool for generating data that will unravel host–microbiome interactions with high confidence.
We are working on two extensions for microbiomeGWAS: (1) jointly testing additive and dominant effects; and (2) testing genetic associations using many microbiome distance matrices. We have assumed an additive effect model (Figure 1); however, several top SNPs in the EAGLE data suggest a dominant effect (e.g., rs8083714 in Figure 5E). Thus, a statistic for jointly testing the additive and dominant effects might be powerful for this scenario. The second extension is motivated by the fact that the power to detect associations depends heavily on the choice of distance matrix. The recently developed generalized UniFrac [26] (gUniFrac) defines a series of distance matrices to reflect the different emphases of using taxa relative abundance information. gUniFrac has been shown to have robust power for association studies [26]. Extending microbiomeGWAS to gUniFrac, however, requires solving two problems. First, the computational complexity is proportional to the number of distance matrices analyzed for associations, which can be addressed by implementing the algorithms using multithreading technology. Second, we need to derive accurate analytic approximations to the association p-values by correcting for the multiple testing introduced by many distance matrices. MiRKAT [49] has an option for using gUniFrac; however, intensive permutations are required to evaluate p-values.
In summary, GWAS of the microbiome of each body site has the potential to help one understand microbiome variation, to elucidate the biological mechanisms of genetic associations, to improve the power of identifying novel disease-associated genetic variants, and to improve the performance of genetic risk prediction. We expect our methods and software to be useful for large-scale GWAS of the human microbiome.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/genes13071224/s1, Table S1: Type-I error rates estimated based on $10^8$ simulations. Table S2: Association p-values between lung cancer risk SNPs and microbiome composition in the EAGLE data. Figure S1: Quantile-quantile (QQ) plot for association p-values testing the joint effects (main effect and SNP by smoking interaction) using the unweighted UniFrac distance matrices. Figure S2: Derivation of the likelihood ratio statistic Q in (7) and (8). Figure S3: Calculations related with genetic dependence.

Author Contributions

Conceptualization, X.H. and J.S.; methodology, X.H. and J.S.; software, X.H., L.S. and J.S.; data analysis, X.H., L.S. and J.S.; investigation, all authors; EAGLE data resource, M.T.L.; writing-original draft preparation, X.H. and J.S.; writing—review and editing, all authors. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the NIH Intramural Research Program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects in the study.

Data Availability Statement

The genetic data for the EAGLE study can be accessed from dbGaP with accession number phs000093.v2.p2. The American Gut Project data used for simulations can be obtained from https://github.com/biocore/American-Gut (accessed on 25 May 2022).

Acknowledgments

This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov, accessed on 25 May 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Calculating $\mathrm{Var}(G_{ij})$, $\mathrm{Cov}(G_{ij}, G_{ik})$, $\mathrm{Var}(\Delta_{ij})$ and $\mathrm{Cov}(\Delta_{ij}, \Delta_{ik})$

We first calculate $E(G_{ij})$, $\mathrm{Var}(G_{ij})$, and $\mathrm{Cov}(G_{ij}, G_{ik})$. Let $p_t = P(g_i = t)$ with $p_0, p_1, p_2 \ge 0$ and $p_0 + p_1 + p_2 = 1$. We can also assume the Hardy–Weinberg equilibrium and characterize the probabilities by the allele frequency $f$: $p_0 = (1-f)^2$, $p_1 = 2f(1-f)$ and $p_2 = f^2$. Some algebra leads to

$$E(G_{ij}) = E|g_i - g_j| = \sum_{m,n \in \{0,1,2\}} p_m p_n |m - n| = 2p_0p_1 + 2p_1p_2 + 4p_0p_2 \tag{A1}$$

$$\mathrm{Var}(G_{ij}) = E(G_{ij}^2) - E(G_{ij})^2 = (2p_0p_1 + 2p_1p_2 + 8p_0p_2) - (2p_0p_1 + 2p_1p_2 + 4p_0p_2)^2 \tag{A2}$$

$$\mathrm{Cov}(G_{ij}, G_{ik}) = p_1(1 - p_1) + 4p_0p_2(1 + p_1) - (2p_0p_1 + 2p_1p_2 + 4p_0p_2)^2 \tag{A3}$$

Now consider $\Delta_{ij} = |g_i E_i - g_j E_j|$. When $E_i$ is binary, $g_i E_i = 0$, 1 or 2. Let $p_t = P(g_i E_i = t)$. Then, $E(\Delta_{ij})$, $\mathrm{Var}(\Delta_{ij})$, and $\mathrm{Cov}(\Delta_{ij}, \Delta_{ik})$ can be calculated similarly using (A1)–(A3).
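The closed forms (A1)–(A3) can be checked numerically by summing over all genotype combinations. The following is a minimal sketch under the Hardy–Weinberg assumption stated above; the function names are illustrative and are not part of the microbiomeGWAS package.

```python
from itertools import product

def hwe_probs(f):
    """Genotype probabilities (p0, p1, p2) under Hardy-Weinberg equilibrium."""
    return [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]

def moments_closed_form(p):
    """E(G_ij), Var(G_ij), Cov(G_ij, G_ik) from (A1)-(A3)."""
    p0, p1, p2 = p
    mean = 2 * p0 * p1 + 2 * p1 * p2 + 4 * p0 * p2                 # (A1)
    var = (2 * p0 * p1 + 2 * p1 * p2 + 8 * p0 * p2) - mean ** 2    # (A2)
    cov = p1 * (1 - p1) + 4 * p0 * p2 * (1 + p1) - mean ** 2       # (A3)
    return mean, var, cov

def moments_enumerated(p):
    """Same quantities by brute-force enumeration of (g_i, g_j, g_k)."""
    pairs = list(product(range(3), repeat=2))
    mean = sum(p[a] * p[b] * abs(a - b) for a, b in pairs)
    second = sum(p[a] * p[b] * abs(a - b) ** 2 for a, b in pairs)
    # G_ij and G_ik share subject i, whose genotype is a.
    cross = sum(p[a] * p[b] * p[c] * abs(a - b) * abs(a - c)
                for a, b, c in product(range(3), repeat=3))
    return mean, second - mean ** 2, cross - mean ** 2
```

The same enumeration applies to $\Delta_{ij}$ after replacing the genotype probabilities with those of $g_i E_i$.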

Appendix B. Calculating $\rho = \mathrm{Cor}_0(Z_M, Z_I \mid D)$

Let $\tilde{G}_{ij} = G_{ij} - E G_{ij}$ and $\tilde{\Delta}_{ij} = \Delta_{ij} - E \Delta_{ij}$. We first calculate covariance under $H_0$:

$$\mathrm{Cov}_0(S_M, S_I \mid D) = \mathrm{Cov}_0\Big(\sum_{i<j} d_{ij} \tilde{G}_{ij},\ \sum_{m<n} d_{mn} \tilde{\Delta}_{mn}\Big) = \sum_{i<j,\, m<n} d_{ij} d_{mn}\, \mathrm{Cov}(G_{ij}, \Delta_{mn}).$$

When $(i, j, m, n)$ are distinct, $\mathrm{Cov}(G_{ij}, \Delta_{mn}) = 0$. Some algebra leads to

$$\mathrm{Cov}_0(S_M, S_I \mid D) = \binom{N}{2} \mathrm{Cov}(G_{ij}, \Delta_{ij})\, \mu_2 + 6 \binom{N}{3} \mathrm{Cov}(G_{ij}, \Delta_{ik})\, \mu_3 \tag{A4}$$

with $\mu_2$ and $\mu_3$ specified in (3) and (4). Combining (2), (5) and (A4), we have

$$\rho = \frac{\mathrm{Cov}_0(S_M, S_I \mid D)}{\sqrt{\mathrm{Var}_0(S_M \mid D)\, \mathrm{Var}_0(S_I \mid D)}} \xrightarrow{N \to \infty} \frac{\mathrm{Cov}(G_{ij}, \Delta_{ik})}{\sqrt{\mathrm{Cov}(G_{ij}, G_{ik})\, \mathrm{Cov}(\Delta_{ij}, \Delta_{ik})}} \tag{A5}$$

Equation (A5) suggests that the correlation is asymptotically independent of the microbiome distance matrix. In the real-data analyses, we found that (A5) was very accurate when sample size $N \ge 50$. The details of calculating $\mathrm{Cov}(G_{ij}, \Delta_{ij})$ and $\mathrm{Cov}(G_{ij}, \Delta_{ik})$ are provided in Supplemental Data.
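To illustrate the limiting correlation in (A5), the three covariances can be obtained by exact enumeration over the joint states of three subjects. This is a sketch under added assumptions that are not stated in the appendix: Hardy–Weinberg genotype frequencies, a binary exposure $E_i$ that is independent of $g_i$, and an exposure prevalence `pi_e`. The function name is hypothetical.

```python
from itertools import product
import math

def asymptotic_rho(f, pi_e):
    """Large-N limit of Cor(Z_M, Z_I) in (A5), by exact enumeration.

    f    : allele frequency (Hardy-Weinberg equilibrium assumed)
    pi_e : prevalence of the binary exposure, assumed independent of genotype
    """
    pg = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]
    # Each subject state is (genotype, exposure, probability).
    states = [(g, e, pg[g] * (pi_e if e else 1 - pi_e))
              for g in range(3) for e in (0, 1)]
    eG = eD = eGG = eDD = eGD = 0.0
    for (gi, ei, wi), (gj, ej, wj), (gk, ek, wk) in product(states, repeat=3):
        w = wi * wj * wk
        G_ij, G_ik = abs(gi - gj), abs(gi - gk)
        D_ij, D_ik = abs(gi * ei - gj * ej), abs(gi * ei - gk * ek)
        eG += w * G_ij
        eD += w * D_ij
        eGG += w * G_ij * G_ik
        eDD += w * D_ij * D_ik
        eGD += w * G_ij * D_ik
    cov_gd = eGD - eG * eD      # Cov(G_ij, Delta_ik)
    cov_gg = eGG - eG ** 2      # Cov(G_ij, G_ik)
    cov_dd = eDD - eD ** 2      # Cov(Delta_ij, Delta_ik)
    return cov_gd / math.sqrt(cov_gg * cov_dd)
```

When every subject is exposed (`pi_e = 1`), $\Delta_{ij} = G_{ij}$ and the function returns 1, which is a convenient sanity check.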

Appendix C. A Statistic for Testing $H_0: \beta_M = \beta_I = 0$ vs. $H_1: \beta_M > 0$ or $\beta_I > 0$

Denote $Z = (Z_M, Z_I)^T$. Under $H_0$, $Z \sim N(0, \Sigma)$ with $\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$. Let $\xi_M = E_1 Z_M \ge 0$ and $\xi_I = E_1 Z_I \ge 0$ be the non-centrality parameters of the two score statistics. The original testing problem is then equivalent to testing $H_0: \xi_M = \xi_I = 0$ vs. $H_1: \xi_M > 0$ or $\xi_I > 0$. Given the observed values $(Z_M, Z_I)$, the likelihood ratio statistic simplifies to

$$Q = Z^T \Sigma^{-1} Z - (Z - \hat{\xi})^T \Sigma^{-1} (Z - \hat{\xi}) \tag{A6}$$

where $\hat{\xi} = (\hat{\xi}_M, \hat{\xi}_I)^T = \arg\inf_{\xi_M \ge 0,\, \xi_I \ge 0} (Z - \xi)^T \Sigma^{-1} (Z - \xi)$ (Figure S2A).

To simplify the optimization problem in (A6), we perform a linear transformation: $Y^T = Z^T \Sigma^{-1/2}$ and $\nu^T = \xi^T \Sigma^{-1/2}$, where

$$\Sigma^{-1/2} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} 1/\sqrt{1-\rho} & 0 \\ 0 & 1/\sqrt{1+\rho} \end{pmatrix}$$

Under this transformation, $Q = Y^T Y - (Y - \hat{\nu})^T (Y - \hat{\nu})$ and can be interpreted as the difference of the squares of two distances (Figure S2B). The original parameter space $\{(\xi_M, \xi_I): \xi_M \ge 0, \xi_I \ge 0\}$ is now transformed to $\{(\nu_1, \nu_2): \nu_2 \ge \theta \nu_1, \nu_2 \ge -\theta \nu_1\}$ with $\theta = \sqrt{(1-\rho)/(1+\rho)}$. Thus, the new parameter space is bounded by the two lines $\nu_2 = \theta \nu_1$ and $\nu_2 = -\theta \nu_1$. We partition the 2D plane into four parts (see Figure S2B), identify $\hat{\nu} = \arg\inf_{\nu \in A_1} (Y - \nu)^T (Y - \nu)$ and calculate $Q$:

$$Q = \begin{cases} Y_1^2 + Y_2^2, & (Y_1, Y_2) \in A_1 \\ (Y_2 - Y_1/\theta)^2 / (1 + \theta^{-2}), & (Y_1, Y_2) \in A_2 \\ (Y_2 + Y_1/\theta)^2 / (1 + \theta^{-2}), & (Y_1, Y_2) \in A_3 \\ 0, & (Y_1, Y_2) \in A_4 \end{cases}$$

We now perform an inverse transformation using matrix

$$\Sigma^{1/2} = \frac{1}{\sqrt{2}} \begin{pmatrix} \sqrt{1-\rho} & 0 \\ 0 & \sqrt{1+\rho} \end{pmatrix} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$$

to return to the original parameter space. The four areas $\{A_1, A_2, A_3, A_4\}$ under the original space are in Figure 2 and Figure S2C.

Tedious calculations show that $(Y_2 + Y_1/\theta)^2 / (1 + \theta^{-2}) = (w_2 Z_M + w_1 Z_I)^2$ with $w_1 = (\theta - 1/\theta)/2$ and $w_2 = (\theta + 1/\theta)/2$. Similarly, $(Y_2 - Y_1/\theta)^2 / (1 + \theta^{-2}) = (w_1 Z_M + w_2 Z_I)^2$. This proves (7). In addition, $w_1 Z_M + w_2 Z_I \ge 0$ and $w_1^2 + 2\rho w_1 w_2 + w_2^2 = 1$; thus, $P\{(w_1 Z_M + w_2 Z_I)^2 > b^2\} = P\{w_1 Z_M + w_2 Z_I > b\} = P\{N(0,1) > b\}$. This proves (8). The probabilities in (8) could also be calculated from Figure S2B: $q_1 = 1/2 - (\arctan\theta)/\pi$, $q_2 = q_3 = 1/4$.
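The geometric construction above can be sketched in a few lines: decorrelate $(Z_M, Z_I)$, decide whether the transformed point lies in the cone, in its polar cone, or in between, and project onto the nearer boundary ray in the last case. This is an illustrative reimplementation and not the code of the microbiomeGWAS package; the function names are hypothetical.

```python
import math

def joint_statistic(z_m, z_i, rho):
    """Likelihood ratio statistic Q for H1: xi_M >= 0, xi_I >= 0 (Appendix C)."""
    a = 1.0 / math.sqrt(1.0 - rho)
    b = 1.0 / math.sqrt(1.0 + rho)
    theta = math.sqrt((1.0 - rho) / (1.0 + rho))
    # Y^T = Z^T Sigma^{-1/2}: coordinates with identity covariance under H0.
    y1 = (z_m - z_i) * a / math.sqrt(2.0)
    y2 = (z_m + z_i) * b / math.sqrt(2.0)
    if y2 >= theta * abs(y1):
        # Inside the cone: the unconstrained estimate is feasible.
        return y1 * y1 + y2 * y2
    if y2 <= -abs(y1) / theta:
        # Inside the polar cone: the constrained estimate is the origin.
        return 0.0
    # Otherwise project Y onto the nearer boundary ray (+/-1, theta).
    proj = (abs(y1) + theta * y2) / math.sqrt(1.0 + theta * theta)
    return proj * proj

def cone_weights(rho):
    """Null probabilities (q1, q2, q3) of the three regions with Q > 0."""
    theta = math.sqrt((1.0 - rho) / (1.0 + rho))
    q1 = 0.5 - math.atan(theta) / math.pi
    return q1, 0.25, 0.25
```

In the original coordinates the projection branch reduces to $(Z_M - \rho Z_I)^2/(1-\rho^2)$ or $(Z_I - \rho Z_M)^2/(1-\rho^2)$, which agrees with the weights $w_1 = -\rho/\sqrt{1-\rho^2}$ and $w_2 = 1/\sqrt{1-\rho^2}$ implied by the definition of $\theta$.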

Appendix D. Calculating Skewness and Kurtosis under $H_0$

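By definition, $\gamma = E_0(S_M^3 \mid D) / \mathrm{Var}_0^{3/2}(S_M \mid D)$ and $\kappa = E_0(S_M^4 \mid D) / \mathrm{Var}_0^{2}(S_M \mid D) - 3$. We first calculate $E_0(S_M^3 \mid D)$. Let $\tilde{G}_{ij} = G_{ij} - E G_{ij}$. We have

$$E_0(S_M^3 \mid D) = E_0\Big(\sum_{i<j} d_{ij} \tilde{G}_{ij}\Big)^3 = \sum_{i<j,\, m<n,\, s<t} d_{ij} d_{mn} d_{st}\, E\, \tilde{G}_{ij} \tilde{G}_{mn} \tilde{G}_{st}.$$

Figure S3A lists all combinations of $(i, j, m, n, s, t)$ with $E\, \tilde{G}_{ij} \tilde{G}_{mn} \tilde{G}_{st} \ne 0$; then

$$E_0(S_M^3 \mid D) = \binom{N}{2} \mu_4\, E \tilde{G}_{ij}^3 + \binom{N}{3} \big(\mu_5\, E \tilde{G}_{ij}^2 \tilde{G}_{ik} + \mu_6\, E \tilde{G}_{ij} \tilde{G}_{jk} \tilde{G}_{ik}\big) + \binom{N}{4} \big(\mu_7\, E \tilde{G}_{ij} \tilde{G}_{jk} \tilde{G}_{kl} + \mu_8\, E \tilde{G}_{ij} \tilde{G}_{ik} \tilde{G}_{il}\big),$$

where $(\mu_4, \mu_5, \mu_6, \mu_7, \mu_8)$ are provided in Supplemental Data. Similarly,

$$E_0(S_M^4 \mid D) = E_0\Big(\sum_{i<j} d_{ij} \tilde{G}_{ij}\Big)^4 = \sum_{i<j,\, m<n,\, s<t,\, x<y} d_{ij} d_{mn} d_{st} d_{xy}\, E\, \tilde{G}_{ij} \tilde{G}_{mn} \tilde{G}_{st} \tilde{G}_{xy}.$$

Figure S3B lists combinations of $(i, j, m, n, s, t, x, y)$ with $E\, \tilde{G}_{ij} \tilde{G}_{mn} \tilde{G}_{st} \tilde{G}_{xy} \ne 0$. Thus,

$$\begin{aligned} E_0(S_M^4 \mid D) ={}& \binom{N}{2} \mu_9\, E \tilde{G}_{ij}^4 + \binom{N}{3} \big(\mu_{10}\, E \tilde{G}_{ij}^3 \tilde{G}_{ik} + \mu_{11}\, E \tilde{G}_{ij}^2 \tilde{G}_{ik}^2 + \mu_{12}\, E \tilde{G}_{ij}^2 \tilde{G}_{jk} \tilde{G}_{ik}\big) \\ &+ \binom{N}{4} \big(\mu_{13}\, E \tilde{G}_{ij}^2 \tilde{G}_{jk} \tilde{G}_{kl} + \mu_{14}\, E \tilde{G}_{ij} \tilde{G}_{jk}^2 \tilde{G}_{kl} + \mu_{15}\, E \tilde{G}_{ij}^2 \tilde{G}_{ik} \tilde{G}_{il} + \mu_{16}\, E \tilde{G}_{ij} \tilde{G}_{jk} \tilde{G}_{ik} \tilde{G}_{il} \\ &\qquad + \mu_{17}\, E \tilde{G}_{ij} \tilde{G}_{jk} \tilde{G}_{kl} \tilde{G}_{il} + \mu_{18}\, E \tilde{G}_{ij}^2 \tilde{G}_{kl}^2\big) \\ &+ \binom{N}{5} \big(\mu_{19}\, E \tilde{G}_{ij} \tilde{G}_{jk} \tilde{G}_{kl} \tilde{G}_{lm} + \mu_{20}\, E \tilde{G}_{ij} \tilde{G}_{ik} \tilde{G}_{il} \tilde{G}_{im} + \mu_{21}\, E \tilde{G}_{ij} \tilde{G}_{ik} \tilde{G}_{il} \tilde{G}_{lm} + \mu_{22}\, E \tilde{G}_{ij} \tilde{G}_{ik} \tilde{G}_{lm}^2\big) \\ &+ \binom{N}{6} \mu_{23}\, E \tilde{G}_{ij} \tilde{G}_{ik} \tilde{G}_{lm} \tilde{G}_{ln} \end{aligned}$$

The constants $(\mu_9, \ldots, \mu_{23})$ are dependent on $D$ and are provided in Supplemental Data. Note that $\mathrm{Var}_0(S_M \mid D) \sim O(N^3)$ and $E_0(S_M^3 \mid D) \sim O(N^4)$; thus, $\gamma \sim O(1/\sqrt{N})$. Similarly, we can prove $\kappa \sim O(1/N)$.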
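For a very small sample, the null moments of $S_M$ can be computed exactly by enumerating all $3^N$ genotype vectors, which gives a brute-force reference for the analytic expansions above. The sketch below assumes independent subjects with Hardy–Weinberg genotype frequencies and takes $S_M = \sum_{i<j} d_{ij}(G_{ij} - E G_{ij})$; the function name and the toy distance matrix used to call it are hypothetical, and the approach is only practical for small $N$.

```python
from itertools import product

def exact_null_moments(d, f):
    """Exact null mean, variance, skewness and excess kurtosis of S_M.

    d : symmetric N x N distance matrix (list of lists), small N only
    f : allele frequency (Hardy-Weinberg equilibrium assumed)
    """
    n = len(d)
    p = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]
    mean_g = 2 * p[0] * p[1] + 2 * p[1] * p[2] + 4 * p[0] * p[2]  # E G_ij, (A1)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    m = [0.0] * 5  # raw moments E S^k for k = 0..4
    for g in product(range(3), repeat=n):
        w = 1.0
        for gi in g:
            w *= p[gi]
        s = sum(d[i][j] * (abs(g[i] - g[j]) - mean_g) for i, j in pairs)
        for k in range(5):
            m[k] += w * s ** k
    var = m[2] - m[1] ** 2
    skew = m[3] / var ** 1.5          # gamma, using E S = 0 under H0
    kurt = m[4] / var ** 2 - 3.0      # kappa
    return m[1], var, skew, kurt
```

With all off-diagonal distances equal to 1 and $N = 4$, the variance returned here equals $6\,\mathrm{Var}(G_{ij}) + 24\,\mathrm{Cov}(G_{ij}, G_{ik})$, since only pairs that coincide or share one subject contribute.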

Appendix E. Improve p-Value Approximations by Adjusting for Skewness and Kurtosis

We assume that $E_0 Z = 0$, $\mathrm{Var}_0 Z = 1$, $\gamma = E_0 Z^3$ and $\kappa = E_0 Z^4 - 3$ under the original probability measure $P_0$. The tail probability $P_0(Z > b)$ for a large value of $b$ is sensitive to the non-normality of $Z$, characterized by $\gamma$ and $\kappa$. We define a new probability measure by embedding $P_0$ in an exponential family:

$$dP_\xi = \exp\big(\xi Z - \phi(\xi)\big)\, dP_0 \tag{A10}$$

where $\phi(\xi) = \log E_0 \exp(\xi Z)$ is the log moment generating function. Note that $\gamma = \phi'''(0)$ and $\kappa = \phi^{(4)}(0)$. Because $E_0(Z) = 0$ and $\mathrm{Var}_0(Z) = 1$, Taylor's expansion leads to $\phi(\xi) \approx \xi^2/2 + \gamma \xi^3/6 + \kappa \xi^4/24$. Under $P_\xi$, we have

$$E_\xi Z = \int Z\, dP_\xi = \phi'(\xi) \approx \xi + \frac{\gamma}{2} \xi^2 + \frac{\kappa}{6} \xi^3 \tag{A11}$$

and

$$\mathrm{Var}_\xi Z = \phi''(\xi) \approx 1 + \gamma \xi + \frac{\kappa}{2} \xi^2 \tag{A12}$$

We choose $\xi$ such that $E_\xi Z \approx b$ by numerically solving the equation

$$\xi + \frac{\gamma}{2} \xi^2 + \frac{\kappa}{6} \xi^3 = b \tag{A13}$$

Under the probability measure $P_\xi$, $Z \sim N(b, \sigma^2)$ approximately with $\sigma^2 = 1 + \gamma \xi + \kappa \xi^2/2$ in (A12). By the likelihood ratio identity and (A10), we have

$$P_0(Z > b) = E_0 I_{Z>b} = E_\xi \frac{dP_0}{dP_\xi} I_{Z>b} = E_\xi\, e^{\phi(\xi) - \xi Z} I_{Z>b} = e^{\phi(\xi)}\, E_\xi\, e^{-\xi Z} I_{Z>b} \tag{A14}$$

Note that $e^{-\xi Z}$ decays very fast when $Z$ increases. Thus, the integral $E_\xi\, e^{-\xi Z} I_{Z>b}$ does not heavily depend on the tail distribution of $Z$. Assuming $Z \sim N(b, \sigma^2)$ under $P_\xi$, we can verify that

$$E_\xi\, e^{-\xi Z} I_{Z>b} = e^{-b\xi + \sigma^2 \xi^2/2}\, \Phi(-\sigma \xi) \tag{A15}$$

Combining (A14) and (A15) gives $P_0(Z > b) \approx \exp\big(\phi(\xi) - b\xi + \sigma^2 \xi^2/2\big)\, \Phi(-\sigma \xi)$, which is further approximated as

$$P_0(Z > b) \approx \exp\Big(-b\xi + \frac{1 + \sigma^2}{2} \xi^2 + \frac{\gamma}{6} \xi^3 + \frac{\kappa}{24} \xi^4\Big)\, \Phi(-\sigma \xi),$$

because $\phi(\xi) \approx \xi^2/2 + \gamma \xi^3/6 + \kappa \xi^4/24$ based on the Taylor expansion. This proves (9). If we correct skewness but assume kurtosis $\kappa = 0$, then $\phi(\xi) \approx \xi^2/2 + \gamma \xi^3/6$. We recalculate $\xi$ by setting $\kappa = 0$ in (A13) to derive $\xi = (\sqrt{1 + 2\gamma b} - 1)/\gamma$. This proves (10).
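The approximation derived in this appendix is straightforward to evaluate numerically. The following minimal sketch solves (A13) by bisection and then applies the corrected tail formula; it assumes $b > 0$, $\gamma \ge 0$ and $\kappa \ge 0$ (the positive skewness and kurtosis case discussed in the text), so that the left side of (A13) is increasing on $[0, b]$. The function names are illustrative and are not taken from the microbiomeGWAS package.

```python
import math

def normal_upper_tail(x):
    """P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def solve_tilt(b, gamma, kappa, tol=1e-12):
    """Solve xi + gamma*xi^2/2 + kappa*xi^3/6 = b, i.e. (A13), by bisection.

    Assumes b > 0, gamma >= 0 and kappa >= 0, so the root lies in [0, b].
    """
    lo, hi = 0.0, b
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid + gamma * mid ** 2 / 2 + kappa * mid ** 3 / 6 < b:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def corrected_tail(b, gamma, kappa=0.0):
    """Approximate P0(Z > b) corrected for skewness gamma and kurtosis kappa."""
    xi = solve_tilt(b, gamma, kappa)
    sigma2 = 1 + gamma * xi + kappa * xi ** 2 / 2            # (A12)
    log_factor = (-b * xi + (1 + sigma2) / 2 * xi ** 2
                  + gamma / 6 * xi ** 3 + kappa / 24 * xi ** 4)
    return math.exp(log_factor) * normal_upper_tail(math.sqrt(sigma2) * xi)
```

With $\gamma = \kappa = 0$ the root is $\xi = b$, the exponential factor equals 1, and the function returns the ordinary normal tail $\Phi(-b)$. With $\kappa = 0$ the bisection root matches the closed form $\xi = (\sqrt{1 + 2\gamma b} - 1)/\gamma$.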

References

  1. Turnbaugh, P.J.; Hamady, M.; Yatsunenko, T.; Cantarel, B.L.; Duncan, A.; Ley, R.E.; Sogin, M.L.; Jones, W.J.; Roe, B.A.; Affourtit, J.P.; et al. A core gut microbiome in obese and lean twins. Nature 2009, 457, 480–484.
  2. Morgan, X.C.; Tickle, T.L.; Sokol, H.; Gevers, D.; Devaney, K.L.; Ward, D.V.; Reyes, J.A.; Shah, S.A.; LeLeiko, N.; Snapper, S.B.; et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012, 13, R79.
  3. Ahn, J.; Sinha, R.; Pei, Z.; Dominianni, C.; Wu, J.; Shi, J.; Goedert, J.J.; Hayes, R.B.; Yang, L. Human gut microbiome and risk for colorectal cancer. J. Natl. Cancer Inst. 2013, 105, 1907–1911.
  4. Goedert, J.J.; Jones, G.; Hua, X.; Xu, X.; Yu, G.; Flores, R.; Falk, R.T.; Gail, M.H.; Shi, J.; Ravel, J.; et al. Investigation of the Association Between the Fecal Microbiota and Breast Cancer in Postmenopausal Women: A Population-Based Case-Control Pilot Study. J. Natl. Cancer Inst. 2015, 107, djv147.
  5. Lax, S.; Smith, D.P.; Hampton-Marcell, J.; Owens, S.M.; Handley, K.M.; Scott, N.M.; Gibbons, S.M.; Larsen, P.; Shogan, B.D.; Weiss, S.; et al. Longitudinal analysis of microbial interaction between humans and the indoor environment. Science 2014, 345, 1048–1052.
  6. Wu, G.D.; Chen, J.; Hoffmann, C.; Bittinger, K.; Chen, Y.Y.; Keilbaugh, S.A.; Bewtra, M.; Knights, D.; Walters, W.A.; Knight, R.; et al. Linking long-term dietary patterns with gut microbial enterotypes. Science 2011, 334, 105–108.
  7. Tong, M.; McHardy, I.; Ruegger, P.; Goudarzi, M.; Kashyap, P.C.; Haritunians, T.; Li, X.; Graeber, T.G.; Schwager, E.; Huttenhower, C.; et al. Reprograming of gut microbiome energy metabolism by the FUT2 Crohn’s disease risk polymorphism. ISME J. 2014, 8, 2193–2206.
  8. Knights, D.; Silverberg, M.S.; Weersma, R.K.; Gevers, D.; Dijkstra, G.; Huang, H.; Tyler, A.D.; van Sommeren, S.; Imhann, F.; Stempak, J.M.; et al. Complex host genetics influence the microbiome in inflammatory bowel disease. Genome Med. 2014, 6, 107.
  9. McKnite, A.M.; Perez-Munoz, M.E.; Lu, L.; Williams, E.G.; Brewer, S.; Andreux, P.A.; Bastiaansen, J.W.M.; Wang, X.; Kachman, S.D.; Auwerx, J.; et al. Murine Gut Microbiota Is Defined by Host Genetics and Modulates Variation of Metabolic Traits. PLoS ONE 2012, 7, e39191.
  10. Benson, A.K.; Kelly, S.A.; Legge, R.; Ma, F.; Low, S.J.; Kim, J.; Zhang, M.; Oh, P.L.; Nehrenberg, D.; Hua, K.; et al. Individuality in gut microbiota composition is a complex polygenic trait shaped by multiple environmental and host genetic factors. Proc. Natl. Acad. Sci. USA 2010, 107, 18933–18938.
  11. Goodrich, J.K.; Waters, J.L.; Poole, A.C.; Sutter, J.L.; Koren, O.; Blekhman, R.; Beaumont, M.; Van Treuren, W.; Knight, R.; Bell, J.T.; et al. Human genetics shape the gut microbiome. Cell 2014, 159, 789–799.
  12. Davenport, E.R.; Cusanovich, D.A.; Michelini, K.; Barreiro, L.B.; Ober, C.; Gilad, Y. Genome-Wide Association Studies of the Human Gut Microbiota. PLoS ONE 2015, 10, e0140301.
  13. GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science 2015, 348, 648–660.
  14. Battle, A.; Mostafavi, S.; Zhu, X.; Potash, J.B.; Weissman, M.M.; McCormick, C.; Haudenschild, C.D.; Beckman, K.B.; Shi, J.; Mei, R.; et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 2014, 24, 14–24.
  15. Bell, J.T.; Pai, A.A.; Pickrell, J.K.; Gaffney, D.J.; Pique-Regi, R.; Degner, J.F.; Gilad, Y.; Pritchard, J.K. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biol. 2011, 12, R10.
  16. Shi, J.; Marconett, C.N.; Duan, J.; Hyland, P.L.; Li, P.; Wang, Z.; Wheeler, W.; Zhou, B.; Campan, M.; Lee, D.S.; et al. Characterizing the genetic basis of methylome diversity in histologically normal human lung tissue. Nat. Commun. 2014, 5, 3365.
  17. McVicker, G.; van de Geijn, B.; Degner, J.F.; Cain, C.E.; Banovich, N.E.; Raj, A.; Lewellen, N.; Myrthil, M.; Gilad, Y.; Pritchard, J.K. Identification of Genetic Variants That Affect Histone Modifications in Human Cells. Science 2013, 342, 747–749.
  18. Kilpinen, H.; Waszak, S.M.; Gschwind, A.R.; Raghav, S.K.; Witwicki, R.M.; Orioli, A.; Migliavacca, E.; Wiederkehr, M.; Gutierrez-Arcelus, M.; Panousis, N.I.; et al. Coordinated Effects of Sequence Variation on DNA Binding, Chromatin Structure, and Transcription. Science 2013, 342, 744–747.
  19. Suhre, K.; Wallaschofski, H.; Raffler, J.; Friedrich, N.; Haring, R.; Michael, K.; Wasner, C.; Krebs, A.; Kronenberg, F.; Chang, D.; et al. A genome-wide association study of metabolic traits in human urine. Nat. Genet. 2011, 43, 565–569.
  20. Sabatti, C.; Service, S.K.; Hartikainen, A.L.; Pouta, A.; Ripatti, S.; Brodsky, J.; Jones, C.G.; Zaitlen, N.A.; Varilo, T.; Kaakinen, M.; et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat. Genet. 2009, 41, 35–46.
  21. Caporaso, J.G.; Kuczynski, J.; Stombaugh, J.; Bittinger, K.; Bushman, F.D.; Costello, E.K.; Fierer, N.; Pena, A.G.; Goodrich, J.K.; Gordon, J.I.; et al. QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 2010, 7, 335–336.
  22. Bray, J.R.; Curtis, J.T. An ordination of upland forest communities of southern Wisconsin. Ecol. Monogr. 1957, 27, 325–349.
  23. Lozupone, C.; Knight, R. UniFrac: A new phylogenetic method for comparing microbial communities. Appl. Environ. Microbiol. 2005, 71, 8228–8235.
  24. Lozupone, C.A.; Hamady, M.; Kelley, S.T.; Knight, R. Quantitative and qualitative β diversity measures lead to different insights into factors that structure microbial communities. Appl. Environ. Microbiol. 2007, 73, 1576–1585.
  25. Lozupone, C.; Hamady, M.; Knight, R. UniFrac--an online tool for comparing microbial community diversity in a phylogenetic context. BMC Bioinform. 2006, 7, 371.
  26. Chen, J.; Bittinger, K.; Charlson, E.S.; Hoffmann, C.; Lewis, J.; Wu, G.D.; Collman, R.G.; Bushman, F.D.; Li, H. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics 2012, 28, 2106–2113.
  27. Anderson, M.J. A new method for non-parametric multivariate analysis of variance. Austral. Ecol. 2001, 26, 32–46.
  28. Wang, J.; Thingholm, L.B.; Skieceviciene, J.; Rausch, P.; Kummen, M.; Hov, J.R.; Degenhardt, F.; Heinsen, F.A.; Ruhlemann, M.C.; Szymczak, S.; et al. Genome-wide association analysis identifies variation in vitamin D receptor and other host factors influencing the gut microbiota. Nat. Genet. 2016, 48, 1396–1406.
  29. Gevers, D.; Knight, R.; Petrosino, J.F.; Huang, K.; McGuire, A.L.; Birren, B.W.; Nelson, K.E.; White, O.; Methe, B.A.; Huttenhower, C. The Human Microbiome Project: A community resource for the healthy human microbiome. PLoS Biol. 2012, 10, e1001377.
  30. MicrobiomeGWAS. Available online: https://github.com/lsncibb/microbiomeGWAS (accessed on 30 May 2022).
  31. Landi, M.T.; Consonni, D.; Rotunno, M.; Bergen, A.W.; Goldstein, A.M.; Lubin, J.H.; Goldin, L.; Alavanja, M.; Morgan, G.; Subar, A.F.; et al. Environment And Genetics in Lung cancer Etiology (EAGLE) study: An integrative population-based case-control study of lung cancer. BMC Public Health 2008, 8, 203.
  32. Landi, M.T.; Chatterjee, N.; Yu, K.; Goldin, L.R.; Goldstein, A.M.; Rotunno, M.; Mirabello, L.; Jacobs, K.; Wheeler, W.; Yeager, M.; et al. A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am. J. Hum. Genet. 2009, 85, 679–691.
  33. Oksanen, J.; Blanchet, F.G.; Friendly, M.; Kindt, R.; Legendre, P.; McGlinn, D.; Minchin, P.R.; O’Hara, R.B.; Simpson, G.L.; Solymos, P.; et al. vegan: Community Ecology Package. R Package Version 2.5-7. 2020. Available online: https://CRAN.R-project.org/package=vegan (accessed on 30 May 2022).
  34. Tu, I.P.; Siegmund, D.O. The maximum of a function of a Markov chain and application to linkage analysis. Adv. Appl. Probab. 1999, 31, 510–531.
  35. Siegmund, D. Sequential Analysis: Tests and Confidence Intervals; Springer: New York, NY, USA, 1985.
  36. McDonald, D.; Hyde, E.; Debelius, J.W.; Morton, J.T.; Gonzalez, A.; Ackermann, G.; Aksenov, A.A.; Behsaz, B.; Brennan, C.; Chen, Y.; et al. American Gut: An Open Platform for Citizen Science Microbiome Research. mSystems 2018, 3, e00031-18.
  37. Rühlemann, M.C.; Hermes, B.M.; Bang, C.; Doms, S.; Moitinho-Silva, L.; Thingholm, L.B.; Frost, F.; Degenhardt, F.; Wittig, M.; Kässens, J.; et al. Genome-wide association study in 8956 German individuals identifies influence of ABO histo-blood groups on gut microbiome. Nat. Genet. 2021, 53, 147–155.
  38. DeSantis, T.Z.; Dubosarskiy, I.; Murray, S.R.; Andersen, G.L. Comprehensive aligned sequence construction for automated design of effective probes (CASCADE-P) using 16S rDNA. Bioinformatics 2003, 19, 1461–1468.
  39. Edgar, R.C.; Haas, B.J.; Clemente, J.C.; Quince, C.; Knight, R. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 2011, 27, 2194–2200.
  40. Purcell, S.; Neale, B.; Todd-Brown, K.; Thomas, L.; Ferreira, M.A.; Bender, D.; Maller, J.; Sklar, P.; de Bakker, P.I.; Daly, M.J.; et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007, 81, 559–575.
  41. Devlin, B.; Roeder, K. Genomic control for association studies. Biometrics 1999, 55, 997–1004.
  42. Hung, R.J.; McKay, J.D.; Gaborieau, V.; Boffetta, P.; Hashibe, M.; Zaridze, D.; Mukeria, A.; Szeszenia-Dabrowska, N.; Lissowska, J.; Rudnai, P.; et al. A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25. Nature 2008, 452, 633–637.
  43. McKay, J.D.; Hung, R.J.; Gaborieau, V.; Boffetta, P.; Chabrier, A.; Byrnes, G.; Zaridze, D.; Mukeria, A.; Szeszenia-Dabrowska, N.; Lissowska, J.; et al. Lung cancer susceptibility locus at 5p15.33. Nat. Genet. 2008, 40, 1404–1406.
  44. Thorgeirsson, T.E.; Geller, F.; Sulem, P.; Rafnar, T.; Wiste, A.; Magnusson, K.P.; Manolescu, A.; Thorleifsson, G.; Stefansson, H.; Ingason, A.; et al. A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature 2008, 452, 638–642.
  45. Amos, C.I.; Wu, X.; Broderick, P.; Gorlov, I.P.; Gu, J.; Eisen, T.; Dong, Q.; Zhang, Q.; Gu, X.; Vijayakrishnan, J.; et al. Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nat. Genet. 2008, 40, 616–622.
  46. Wang, Y.; Broderick, P.; Webb, E.; Wu, X.; Vijayakrishnan, J.; Matakidou, A.; Qureshi, M.; Dong, Q.; Gu, X.; Chen, W.V.; et al. Common 5p15.33 and 6p21.33 variants influence lung cancer risk. Nat. Genet. 2008, 40, 1407–1409.
  47. Shi, J.; Chatterjee, N.; Rotunno, M.; Wang, Y.; Pesatori, A.C.; Consonni, D.; Li, P.; Wheeler, W.; Broderick, P.; Henrion, M.; et al. Inherited variation at chromosome 12p13.33, including RAD52, influences the risk of squamous cell lung carcinoma. Cancer Discov. 2012, 2, 131–139.
  48. Timofeeva, M.N.; Hung, R.J.; Rafnar, T.; Christiani, D.C.; Field, J.K.; Bickeboller, H.; Risch, A.; McKay, J.D.; Wang, Y.; Dai, J.; et al. Influence of common genetic variation on lung cancer risk: Meta-analysis of 14 900 cases and 29 485 controls. Hum. Mol. Genet. 2012, 21, 4980–4995.
  49. Zhao, N.; Chen, J.; Carroll, I.M.; Ringel-Kulka, T.; Epstein, M.P.; Zhou, H.; Zhou, J.J.; Ringel, Y.; Li, H.; Wu, M.C. Testing in Microbiome-Profiling Studies with MiRKAT, the Microbiome Regression-Based Kernel Association Test. Am. J. Hum. Genet. 2015, 96, 797–807.
  50. Chen, J.; Zhang, X. D-MANOVA: Fast distance-based multivariate analysis of variance for large-scale microbiome association studies. Bioinformatics 2021, 38, 286–288.
  51. Leone, V.A.; Cham, C.M.; Chang, E.B. Diet, gut microbes, and genetics in immune function: Can we leverage our current knowledge to achieve better outcomes in inflammatory bowel diseases? Curr. Opin. Immunol. 2014, 31, 16–23.
  52. Huang, H.; Vangay, P.; McKinlay, C.E.; Knights, D. Multi-omics analysis of inflammatory bowel disease. Immunol. Lett. 2014, 162, 62–68.
  53. Troncone, R.; Discepolo, V. Celiac disease and autoimmunity. J. Pediatr. Gastroenterol. Nutr. 2014, 59 (Suppl. S1), S9–S11.
  54. Yeoh, N.; Burton, J.P.; Suppiah, P.; Reid, G.; Stebbings, S. The role of the microbiome in rheumatic diseases. Curr. Rheumatol. Rep. 2013, 15, 314.
  55. Sparks, J.A.; Costenbader, K.H. Genetics, environment, and gene-environment interactions in the development of systemic rheumatic diseases. Rheum. Dis. Clin. N. Am. 2014, 40, 637–657.
  56. Smith, J.A. Update on ankylosing spondylitis: Current concepts in pathogenesis. Curr. Allergy Asthma Rep. 2015, 15, 489.
  57. Nielsen, D.S.; Krych, L.; Buschard, K.; Hansen, C.H.; Hansen, A.K. Beyond genetics. Influence of dietary factors and gut microbiota on type 1 diabetes. FEBS Lett. 2014, 588, 4234–4243.
  58. Birt, D.F.; Phillips, G.J. Diet, genes, and microbes: Complexities of colon cancer prevention. Toxicol. Pathol. 2014, 42, 182–188.
  59. Marietta, E.; Rishi, A.; Taneja, V. Immunogenetic control of the intestinal microbiota. Immunology 2015, 145, 313–322.
Figure 1. Microbiome distances are positively correlated with genetic distances at an associated SNP.
Figure 2. Definition of the joint test of $H_0: \beta_M = \beta_I = 0$ vs. $H_1: \beta_M > 0$ or $\beta_I > 0$. We assume that $Z_M \sim N(0,1)$, $Z_I \sim N(0,1)$ and $\mathrm{cor}(Z_M, Z_I) = \rho$ under $H_0$. Details are in Appendix C.
Figure 3. Correcting tail probabilities for skewness and kurtosis. (A) The standard normal distribution $N(0,1)$ and an approximately normal distribution with positive skewness. The skewness has a large impact when calculating the tail probability $P(Z > b)$ for a large value of $b$. (B) Numerical evaluation of tail probability approximation for $Z_M$. We used the unweighted UniFrac distance matrix of 500 samples from the American Gut Project (AGP). For each value of $b$ (>0), we calculated p-values $P(Z_M > b)$ based on $N(0,1)$, skewness correction, both skewness and kurtosis correction, and $10^8$ simulations. (C) Skewness depends on minor allele frequency (MAF) of SNPs and the sample size of the study, calculated based on the weighted UniFrac distance matrix in the AGP data. (D) Kurtosis depends on MAF of SNPs and the sample size, calculated based on the weighted UniFrac distance matrix in the AGP data.
Figure 4. Computation time for a microbiome GWAS with 500,000 SNPs. “Main”: computation time for testing main effect only. “All”: computation time for testing main effect, interaction and the joint null hypothesis $H_0: \beta_M = 0, \beta_I = 0$.
Figure 5. Results of analyzing the microbiome GWAS data of 147 adjacent normal lung tissues in the EAGLE study. (A) Skewness and kurtosis for the main effect test using the unweighted and the weighted UniFrac distance matrices. (B) Quantile–quantile (QQ) plot for association p-values using the unweighted UniFrac distance matrix. “Adjusted”: p-values were corrected for skewness and kurtosis. “Unadjusted”: p-values were approximated based on the asymptotic distribution $N(0, 1)$. (C) QQ plot for association p-values using the weighted UniFrac distance matrix. (D) Manhattan plots based on the unweighted or the weighted UniFrac distance matrices. (E) Box plots for the top nine loci in the microbiome GWAS analysis. Subject pairs are classified into three groups according to the genetic distance $|g_i - g_j|$ at the SNP. The y-coordinate is the microbiome distance.
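The grouping behind the box plots in panel (E) can be sketched as follows. This is a minimal illustration, assuming genotypes are coded as minor-allele counts (0, 1, 2), so the pairwise genetic distance takes the three values 0, 1, and 2. The function name `group_pairs_by_genetic_distance` and the toy data are hypothetical and are not taken from the EAGLE study.

```python
from collections import defaultdict
from itertools import combinations


def group_pairs_by_genetic_distance(genotypes, distance):
    """Group pairwise microbiome distances by |g_i - g_j| at one SNP.

    genotypes: list of minor-allele counts (0, 1, or 2), one per subject.
    distance:  square, symmetric matrix (list of lists) of microbiome
               distances, e.g. UniFrac, indexed like `genotypes`.
    Returns a dict mapping genetic distance to the list of microbiome
    distances for all subject pairs with that genetic distance.
    """
    groups = defaultdict(list)
    for i, j in combinations(range(len(genotypes)), 2):
        groups[abs(genotypes[i] - genotypes[j])].append(distance[i][j])
    return dict(groups)


if __name__ == "__main__":
    # Toy data for four subjects (made up for illustration).
    g = [0, 1, 2, 0]
    d = [
        [0.00, 0.20, 0.60, 0.10],
        [0.20, 0.00, 0.30, 0.25],
        [0.60, 0.30, 0.00, 0.70],
        [0.10, 0.25, 0.70, 0.00],
    ]
    for k, values in sorted(group_pairs_by_genetic_distance(g, d).items()):
        print(f"|g_i - g_j| = {k}: {values}")
```

Each of the three lists would feed one box in a plot like panel (E); an association shows up as microbiome distance shifting with genetic distance across the groups.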
Table 1. Type-I error rates estimated based on $10^8$ simulations. Minor allele frequency = 20%. Simulations were based on the weighted UniFrac distance matrix of the gut microbiome data from the American Gut Project. Reported values are type-I error inflation factors; a value greater than 1 indicates an inflated type-I error.

| Method | N | $Z_M$, $\alpha = 10^{-3}$ | $Z_M$, $\alpha = 10^{-5}$ | $Z_M$, $\alpha = 10^{-7}$ | $Z_I$, $\alpha = 10^{-3}$ | $Z_I$, $\alpha = 10^{-5}$ | $Z_I$, $\alpha = 10^{-7}$ | $Q$, $\alpha = 10^{-3}$ | $Q$, $\alpha = 10^{-5}$ | $Q$, $\alpha = 10^{-7}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Asymptotic approximation | 100 | 5.5 | 51.6 | 610.0 | 4.7 | 36.1 | 342.8 | 7.3 | 80.9 | 1148.0 |
| | 200 | 3.7 | 23.0 | 187.3 | 3.1 | 15.8 | 105.5 | 4.6 | 33.0 | 316.7 |
| | 500 | 2.4 | 9.4 | 45.2 | 2.1 | 6.7 | 25.5 | 2.8 | 11.9 | 64.1 |
| | 1000 | 2.0 | 5.7 | 21.3 | 1.8 | 4.4 | 14.0 | 2.2 | 6.9 | 28.5 |
| Adjusted for skewness and kurtosis | 100 | 1.0 | 1.2 | 0.7 | 1.0 | 1.1 | 0.6 | 1.0 | 1.5 | 2.0 |
| | 200 | 1.0 | 1.1 | 1.0 | 1.0 | 1.1 | 0.7 | 0.9 | 1.3 | 1.8 |
| | 500 | 1.0 | 1.1 | 1.3 | 1.0 | 1.0 | 0.9 | 0.9 | 1.0 | 1.7 |
| | 1000 | 1.0 | 1.0 | 1.2 | 1.0 | 1.0 | 0.8 | 0.9 | 1.0 | 1.1 |
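The inflation factor reported in Table 1 is the empirical type-I error rate divided by the nominal level $\alpha$. A minimal sketch of that calculation is given below, assuming a list of p-values obtained from simulations under the null hypothesis; the function name `inflation_factor` and the toy p-values are hypothetical.

```python
def inflation_factor(p_values, alpha):
    """Empirical type-I error rate divided by the nominal level alpha.

    p_values: p-values from simulations generated under the null hypothesis.
    alpha:    nominal significance level, e.g. 1e-3.
    A result above 1 indicates an inflated (liberal) test; a result
    below 1 indicates a conservative test.
    """
    if not p_values:
        raise ValueError("p_values must not be empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    rejections = sum(1 for p in p_values if p <= alpha)
    empirical_rate = rejections / len(p_values)
    return empirical_rate / alpha


if __name__ == "__main__":
    # Toy input: 3 rejections out of 1000 null simulations at alpha = 1e-3.
    toy = [0.0005] * 3 + [0.5] * 997
    print(inflation_factor(toy, 1e-3))
```

With $10^8$ null simulations, a well-calibrated test is expected to produce only about 10 rejections at $\alpha = 10^{-7}$, so the estimates in the rightmost column of each block carry more Monte Carlo noise than those at $\alpha = 10^{-3}$.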
Table 2. Association p-values between lung cancer risk SNPs and microbiome composition in the EAGLE data.

| SNP | Chr | Annotated Genes | Unweighted UniFrac | Weighted UniFrac |
|---|---|---|---|---|
| rs2036534 | 15q25.1 | CHRNA3/4/5 | 0.425 | 0.167 |
| rs1051730 | 15q25.1 | CHRNA3/4/5 | 0.020 | 0.401 |
| rs2736100 | 5p15.33 | TERT | 0.089 | 0.267 |
| rs401681 | 5p15.33 | CLPTM1L | 0.056 | 0.005 |
| rs6489769 | 12p13.3 | RAD52 | 0.197 | 0.329 |
| rs1333040 | 9p21.3 | CDKN2A/B | 0.249 | 0.224 |
| Overall test | | | 0.0032 | 0.011 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Hua, X.; Song, L.; Yu, G.; Vogtmann, E.; Goedert, J.J.; Abnet, C.C.; Landi, M.T.; Shi, J. MicrobiomeGWAS: A Tool for Identifying Host Genetic Variants Associated with Microbiome Composition. Genes 2022, 13, 1224. https://doi.org/10.3390/genes13071224
