A Two-Stage Mutual Information Based Bayesian Lasso Algorithm for Multi-Locus Genome-Wide Association Studies

Genome-wide association study (GWAS) has turned out to be an essential technology for exploring the genetic mechanism of complex traits. To reduce the complexity of computation, it is well accepted to remove unrelated single nucleotide polymorphisms (SNPs) before GWAS, e.g., by using iterative sure independence screening expectation-maximization Bayesian Lasso (ISIS EM-BLASSO) method. In this work, a modified version of ISIS EM-BLASSO is proposed, which reduces the number of SNPs by a screening methodology based on Pearson correlation and mutual information, then estimates the effects via EM-Bayesian Lasso (EM-BLASSO), and finally detects the true quantitative trait nucleotides (QTNs) through likelihood ratio test. We call our method a two-stage mutual information based Bayesian Lasso (MBLASSO). Under three simulation scenarios, MBLASSO improves the statistical power and retains the higher effect estimation accuracy when comparing with three other algorithms. Moreover, MBLASSO performs best on model fitting, the accuracy of detected associations is the highest, and 21 genes can only be detected by MBLASSO in Arabidopsis thaliana datasets.


Introduction
Genome-wide association study (GWAS) has evolved to be an essential technology for exploring the genetic mechanism of complex traits [1]. It concentrates on identifying the significant single nucleotide polymorphisms (SNPs) associated with the given traits. In past years, several single-locus GWAS methods have been developed [1][2][3][4][5], and have detected a few variants among various traits successfully. However, they still have some drawbacks, such as the combined effects of multiple loci are ignored and the threshold in multiple test correction is hard to be determined [6].
To address these drawbacks, some classical high-dimensional statistical methods were well used in GWAS when the number of SNPs is not far more than that of individuals, such as the least absolute shrinkage and selector operator (Lasso) [7], the elastic net [8], and Bayesian Lasso [9,10]. However, the current situation is the opposite, because the number of SNPs is much larger than

Statistical Framework
In this study, we consider the linear mixed genetic model [6] as follows: where y is a n × 1 phenotypic vector of quantitative trait, and n is the number of individuals; 1 is a n × 1 vector in which every element is equal to 1, and µ is the overall mean; Q = (Q 1 , Q 2 , . . . , Q q ) is a n × q matrix of fixed effects, such as the population structure, and q is the number of fix effects; α is a q × 1 vector of fixed effects; and X is a n × p matrix of SNP genotype values. For each SNP, homozygous genotype are coded as 1, and −1, respectively, and the heterozygous ones are indicated by 0. p is the number of presumed QTNs, β is the QTN effects, and ε ∼ MV N n (0, σ 2 e I) is a n × 1 vector of residual error.

Simulation Experiments
To assess the performance of methods, we considered simulation scenarios based on the Arabidopsis thaliana datasets consisting of 216,130 SNPs, 199 accessions, and 107 phenotype traits [24]. For genotype simulation, we randomly selected 10,000 SNPs, 2000 for each of the five chromosomes, i.e., 11,226,038,776 bp on Chr.1, 5,045,828-6,6412,875 bp on Chr. 2,1,916,196,442 bp on Chr.3, 2,232,796-3,3143,893 bp on Chr.4, and 19,999,868-21,039,406 bp on Chr.5. Additionally, we generated the phenotype simulation data with sample size 199 from three different scenarios, and undertook 1000 times for each simulation. Six QTNs were assumed to be genuine; their heritabilities were set as 0.10, 0.05, 0.05, 0.15, 0.05, and 0.05, respectively; and their allelic frequencies we are all nearly 0.30. Both the overall mean and residual variance we are set as 10.0, and the positions and effects of the six QTNs are shown in Tables S1-S3. The genotype and phenotype simulations were the same as those used by Wang et al. [25].
The first model (only six QTNs' additive effects) is: The second model (six QTNs' additive effects plus polygenic effect) is:

Real Data and Preprocessing
We used four flowering-time related traits of Arabidopsis thalina datasets [24,26] for analysis. The four traits are days to flowering time under long days with vernalization (LDV), days to flowering time under short days with vernalization (SDV), days to flowering time under long days with two weeks vernalization (2W), and days to flowering time under long days with four weeks vernalization (4W), respectively. We removed the SNPs with minor allele frequency (MAF) less than 0.01, and 178376 SNPs remained ultimately. For phenotypes, we deleted the individuals with missing phenotype value, thus 168, 159, 152 and 119 individuals were reserved for each of the four traits LDV, SDV, 2W and 4W, respectively, and then a logarithmic transformation was performed to each phenotype value. Due to the strong population structure in Arabidopsis thaliana, we were obliged to eliminate the impact of population structure. We reorganized the SNP genotype data via the software PLINK (Version 1.09) [27] at first, then chose a suitable value for population number q from 1 to 5 with the minimum cross-validation error, and calculated the population structure matrix Q synchronously by using the software ADMIXTURE (Version 1.3) [28], and finally corrected the primary phenotype vector y by Q j , j = 1, 2, . . . , q, whose effectsα j were estimated by least-square method. Therefore, the corrected phenotype vector is:

Mutual Information
Mutual information proposed by Shannon [29] is based on the concept of entropy and has been widely used in feature selection [23]. Given two discrete random variables X and Y, the mutual information of X and Y is defined as: where H(X) is the entropy of X and H(X, Y) is the joint entropy of X and Y. They can be specified as: where p(x) = P(X = x) is the marginal probability density function, and p(x, y) = P(X = x, Y = y) is the joint probability density function. Mutual information can also be defined as: where H(X|Y) is the conditional entropy of X given Y. We calculated the mutual information by using the matlab package "MutualInfo" (Version 0.9) written by Peng et al. [23].
In fact, mutual information can be illustrated as the amount of information one random variable contained in another random variable. The larger the mutual information is, the stronger correlation between the two random variables is. In GWAS, we consider the phenotypic vector as one random variable, and the genotype vector of a SNP as another random variable. In this way, we can calculate the mutual information between each of the SNPs and phenotype.

SCAD
SCAD is a penalized likelihood approach that enables to selecting variables and estimating coefficients simultaneously due to its Oracle properties [12]. The objective function ξ is: where β = (β 1 , β 2 , . . . , β p ) T is the regression coefficient vector to be estimated and, λ and γ are penalty and shrinkage parameter, respectively, both of them are greater than 0. The former term of Equation (7) is the loss function, and the latter term is the penalty function defined by: , if λ ≤ |β j | < γλ and γ > 2, γ = 3.7 as suggested in the original study [12]. We performed SCAD by using the R package "ncvreg" from https://CRAN.R-project.org/package=ncvreg.

Likelihood Ratio Test
Likelihood ratio test is to compare the maximum of likelihood function in null hypothesis H 0 and alternative hypothesis H 1 , and further determine whether the hypothesis is effective. LOD (log of odds) score is a statistic criterion used in likelihood ratio test. The definition is: l 0 = e L 0 , l 1 = e L 1 , L 0 = L(θ −k ) and L 1 = L(θ) are the natural logarithms of the likelihood functions for null hypothesis H 0 : β k = 0 and alternative hypothesis H 1 , respectively, θ −k = {β 1 , . . . , β k−1 , β k+1 , . . . , β o } and θ = {β 1 , . . . , β o }, and o is the number of markers potentially associated with the trait. LOD ≥ 3 was proposed to be the significant criterion in multi-locus GWAS [25], which is slightly stringent and equivalent to P = Pr(χ 2 1 > 3 × 4.6052) ≈ 0.0002. Under H 0 , LOD × 4.6052 follows a χ 2 distribution with one degree of freedom. We set the significant criterion of MBLASSO, ISIS EM-BLASSO, and EM-BLASSO as LOD ≥ 3, which is Bonferroni correction for GEMMA by referring the published study [30].

A Two-Stage Mutual Information Based Bayesian Lasso (MBLASSO) Method
On the whole, this procedure is a two-stage strategy for multi-locus GWAS. In the first phase, we used a modified ISIS approach based on Pearson correlation and mutual information to obtain a subset of SNPs, the elements of which can be divided into two types, separately. As to Pearson correlation screening, Type I includes those SNPs with strong correlated to phenotype, and Type II consists of those SNPs weak correlated while indirectly correlated to phenotype with some SNPs from Type I. For mutual information screening, Types I and II are similar as those in Pearson correlation screening. The first phase of our method can be considered to select SNPs from two different measurements. In the second phase, we adopted EM-BLASSO [10] to estimate the effects and select the SNPs with nonzero effect (≥ 10 −5 ) to further likelihood ratio test procedure. We call this method MBLASSO. The flow chart is shown in Figure 1.

•
Step 8: Gather the SNPs selected from Steps 4 and 7 and remove the reduplicated ones. Then obtain a new subset of SNPs, that is, C = A B, the size of which is ν = k + τ. • Step 9: Use EM-BLASSO to estimate the effect of the ν SNPs from C and further eliminate the SNPs with zero effect, the source code for EM-BLASSO can be found at https://CRAN.R-project. org/package=mrMLM, where we can also download the program of ISIS EM-BLASSO. Note that the phenotype vector in this step refers to the original one (y). • Step 10: Apply the likelihood ratio test to identify the true QTNs, and set the significant criterion as LOD ≥ 3.

The Overlap Ratio between Pearson Correlation and Mutual Information Based Screening in MBLASSO
To illustrate the necessity of considering the correlation measured in mutual information between the SNPs and phenotype, we calculated the overlap ratio and average number of SNPs selected by Pearson correlation and mutual information in the first variable selection stage. The SNPs selected by Pearson correlation and mutual information can be divided into two types (Types I and Type II), respectively. We found that each type of screening obtains phenotype-associated SNPs without large overlapping (Table 1), which suggests that the SNPs from our MBLASSO method may have more associations with phenotype than ISIS EM-BLASSO.

Statistical Power for QTN Detection
The power for the ith QTN is: power i = i /1000, i = 1, 2, 3, 4, 5, 6, where i is the frequency that ith hypothetical QTN is successfully detected among all 1000 repetitions. A detected SNP within 1kb of the candidate QTN is regarded as true QTN [6,25,30]. In three simulations, powers of the six QTNs in MBLASSO are highest, except the second QTN powers are lower than those of EM-BLASSO (  Tables S1-S3). To measure the robustness of methods, we used the standard deviation of powers across the four QTNs, which was proposed by Ren et al. [30]. In Simulation 1, the standard deviations for MBLASSO, ISIS EM-BLASSO and EM-BLASSO are 8.14, 8.16 and 13.99, respectively, indicating the best stability of MBLASSO. The stability comparisons in Simulations 2 and 3 are the same as that in Simulation 1. Therefore, MBLASSO improves the power and has best stability in different scenarios. A Violin plot of average statistical powers for MBLASSO in three simulation scenarios is shown in Figure S1a.

Average Accuracy for QTN Effects
Mean squared error (MSE) was used to quantify the bias of effect estimation. The MSE of the ith QTN is:  Figure 3. Average mean squared errors (MSEs) for the six simulated QTNs in three simulation scenarios. The description of (a-c) is the same as that in Figure 2

Type 1 Error Ratio
Type 1 error ratio, also known as false positive ratio, is an important problem that needs to be overcome in GWAS. In Simulation 1, they are 0.0302%, 0.0325%, 0.0325% and 0.0259% for MBLASSO, ISIS EM-BLASSO, GEMMA, and EM-BLASSO, respectively, and GEMMA has the lowest Type 1 error ratio in Simulations 2 and 3 ( Figure 4). Note that all Type 1 error ratios are less than 0.05% (Figure 4 and Tables S1-S3), which indicates that all the four algorithms ensure the Type 1 error is at a very low level. A violin plot of Type 1 error ratios for MBLASSO in three simulation scenarios is shown in Figure S1c.

Computational Efficiency
The computing time of MBLASSO is longer than that of ISIS EM-BLASSO, because it needs additional computation of mutual information between all the SNPs and phenotype, but it takes less time than EM-BLASSO. For example, in Simulation 1, MBLASSO finishes the analysis of 199 individuals with 10,000 SNPs for 1000 repetitions in 4.12 h, ISIS EM-BLASSO takes 2.90 h, GEMMA spends 2.20 h, and while EM-BLASSO needs 28.86 h for the same dataset (Table S1). The specific hours spent on the other two simulations are largely identical with only minor differences to those in Simulation 1 (Tables S2 and S3), and the operations of computation are on a computer of Intel Xeon E5-2640 CPU 2.40 GHz.

Arabidopsis Thaliana Dataset Analysis
We analyzed four flowering-time related traits (LDV, SDV, 2W, and 4W) using by MBLASSO, ISIS EM-BLASSO, GEMMA and EM-BLASSO. Suppose that the candidate genes for the traits are in the proximity of 20 kb with the associated SNPs [6,25]; MBLASSO identifies 17, 18, 17 and 18 SNPs significant associated with each of the four traits LDV, SDV, 2W, and 4W, respectively. ISIS EM-BLASSO detects 14, 18, 19, and 16 remarkable associated SNPs; GEMMA identifies 3, 5, 1, and 2 significant SNPs; and EM-BLASSO tests 3, 0, 4, and 6 SNPs, respectively. A Venn diagram showings the overlap numbers of SNPs detected by the four algorithms in the four traits is presented in Figure S2.
To measure the model fitting degree of the detected SNPs, Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were computed for each trait in four various methods, where a lower value indicates a better model fitting. We can explicitly see that MBLASSO shows the lowest AIC and BIC for the four traits (Table 2), thus it is the best algorithm in model fitting, followed by ISIS EM-BLASSO, EM-BLASSO, and GEMMA. Meanwhile, by referring to the latest GO annotation [31] for Arabidopsis thalina genes at www. arabidopsis.org, we extracted the known genes related to flowering-time traits and found 5, 4, 2, and 3 known genes closed to the detected SNPs with MBLASSO; 3, 2, 1, and 2 known genes with ISIS EM-BLASSO; 0, 1, 0, and 1 known genes with GEMMA; and none of known genes could be identified by using EM-BLASSO for LDV, SDV, 2W and 4W, respectively (Table 3). These results suggest that the accuracy of associations retrieved by MBLASSO are the highest. Table 3. The accuracy of detected associations in four flowering-time related traits for Arabidopsis thaliana (the number behind slash in each cell is the count of detected SNPs, and the number in front of slash is the count of known genes in GO annotation), In addition, totally 21 genes were only detected by MBLASSO, among which five genes (AT5G45830, AT5G45840, AT3G57230, AT5G15850, and AT5G04240) are in the 98 candidate genes [24], and AT5G45830 (alias: DOG1) is the gene with the highest frequency significant associated with flowering-time related phenotypes. Nearly all of the 23 flowering-time related phenotypes are associated with this gene [24]. Meanwhile, AT5G45840 (alias: MDIS1) is one of the Top 5 flowering-time related genes studied by researchers in Arabidopsis thaliana (www.arabidopsis.org). The detailed GWAS results are listed in Table S4.

Traits MBLASSO ISIS EM-BLASSO GEMMA EM-BLASSO
About the computation speed, despite MBLASSO being is slower than ISIS EM-BLASSO and GEMMA, it is much faster than EM-BLASSO, for example, for the trait LDV, the time for MBLASSO is 2.31 min, ISIS EM-BLASSO requires 1.92 min, GEMMA takes 0.85 min and EM-BLASSO consumes to 183.6 min. We notice that the time costs of all the four flowering-time traits in MBLASSO, ISIS EM-BLASSO, and GEMMA are all less than 3 min (Table S5).

Discussion
MBLASSO is a GWAS method modified from ISIS EM-BLASSO, that is, iterative sure independence screening (ISIS) in the first stage of ISIS EM-BLASSO is replaced by a combination ISIS based on Pearson correlation and mutual information. We assume a subset of loci jointly affects the phenotype. In the first stage, we focus on selecting those SNPs that are likely to be highly associated. Considering some SNPs may have different correlations under various phenotypes, which are hard to measure only by Pearson correlation, so we adopt the mutual information to obtain the SNPs with potential correlation to phenotype. Meanwhile, since those SNPs individually irrelevant but jointly relevant to phenotype can be revived, this multi-objective screening process is a crucial component of our methodology to improve the statistical power. In the second stage, we apply the existing EM-BLASSO method [10], which is actually a single stage multi-locus GWAS strategy, to estimate the effects of selected SNPs and further filter out the SNPs with very small effect (<10 −5 ). Finally, we use likelihood ratio test to identify the true QTNs.
In fact, the method and criterion of hypothesis testing in different approaches may be different, e.g., the Wald test is applied in RMLM [25] and original EM-BLASSO [10], the significant level is P = 0.01 or 0.05, and a looser likelihood ratio test criterion LOD ≥ 2 is employed in pLARmEB [32]. Since different significant criteria will lead to changes in results, for above three simulations, we listed the performances (average power, average MSE and Type 1 error ratio) of MBLASSO in three different significant criteria (LOD = 3, LOD = 2 and P = 0.01) in Table S6. We can see the average power increased with the decrease of LOD value, but the Type 1 error ratio and average MSE also increased. This means that with the relaxation of significant criteria, high statistical power will be achieved, while false positives will be increased and estimation accuracy will be reduced. In addition, the performances at the significant criterion P = 0.01 in Wald test are between LOD = 2 and 3 in likelihood ratio test. GEMMA is a single-locus GWAS approach, and the significant threshold for each test is determined by Bonferroni correction (0.05/p, p is the number of SNPs). MBLASSO, ISIS EM-BLASSO, and EM-BLASSO are multi-locus approaches and do not require multiple test correction.
We conducted paired t-test (also used in [6,25,30]) for statistical power and MSE between MBLASSO and three other methods in three simulation scenarios (Table S7). We can see it has significant improvements compared with ISIS EM-BLASSO and GEMMA. For the traits SDV and 2 W in real Arabidopsis thaliana datasets, the numbers of significant SNPs identified by MBLASSO are not more than ISIS EM-BLASSO, but the degrees of model fitting are better ( Table 2); and the number of known candidate genes adjacent to the detected SNPs is still larger (Table 3), this phenomenon indicates MBLASSO may be more effective to capture the inherent relationship between SNPs and phenotype. The traditional EM-BLASSO [10] and GEMMA perform well in terms of Type 1 error ratio in the three simulations, but their performances in Arabidopsis thaliana dataset are worse than expected, not only achieving the worse model fitting performance but also fewer of genes are detected. On the whole, our algorithm MBLASSO is slightly slower than ISIS EM-BLASSO and GEMMA, but it is more effective and accurate for both simulation and real datasets.

Conclusions
Our algorithm MBLASSO is a modified version of ISIS EM-BLASSO; it integrates Pearson correlation and mutual information to the feature screening stage, and it considers different types of correlation between the SNPs and phenotype. In three different simulation scenarios, MBLASSO improves the statistical power and retains the higher effect estimation accuracy when comparing with three other methods. Meanwhile, the GWAS results in four flowering-time related traits are superior in model fitting; the accuracy of detected associations are the highest; and 21 genes can only be detected by MBLASSO.