Support Interval for Two-Sample Summary Data-Based Mendelian Randomization

The summary-data-based Mendelian randomization (SMR) method is gaining popularity in estimating the causal effect of an exposure on an outcome. In practice, the instrument SNP is often selected from the genome-wide association study (GWAS) on the exposure but no correction is made for such selection in downstream analysis, leading to a biased estimate of the effect size and invalid inference. We address this issue by using the likelihood derived from the sampling distribution of the estimated SNP effects in the exposure GWAS and the outcome GWAS. This likelihood takes into account how the instrument SNPs are selected. Since the effective sample size is 1, the asymptotic theory does not apply. We use a support for a profile likelihood as an interval estimate of the causal effect. Simulation studies indicate that this support has robust coverage while the confidence interval implied by the SMR method has lower-than-nominal coverage. Furthermore, the variance of the two-stage least squares estimate of the causal effect is shown to be the same as the variance used for SMR for one-sample data when there is no selection.


Introduction
A main interest in scientific research is to study the causal effect of an exposure x on an outcome y. When the outcome is continuous, the causal effect is the coefficient b in the following regression model: where u represents the unobserved factors and y is normally distributed with mean 0 and variance σ 2 y . Throughout this report, all variables are centered so that the intercept is equal to 0.
When u confounds the effect of x, the least squares estimate of b is biased. Mendelian randomization (MR) is a modern technique for correcting this bias [1][2][3][4][5], thanks to the availability of the large number of genome-wide association studies (GWASs). One appealing feature of the summary-data-based MR methods is that they don't rely on individuallevel data.
MR is an application of instrumental variable (IV) analysis to estimate b. IV analysis is able to control for unobserved confounders. MR uses single nucleotide polymorphisms (SNPs) as IVs. Let g denote the genotypic score of an SNP. For an IV to be valid, it must satisfy the following assumptions [6]: Relevance: It is associated with the exposure x (i.e., Cov(g, x) = 0); Exclusion Restriction: It affects the outcome y only through its association with the exposure; and Exchangeability: It is not associated with any confounders of the exposure-outcome association, which implies Cov(g, y) = bCov(g, x).
Under these assumptions, Equation (1) implies: where b gy = Cov(g, y)/Var(g) and b gx = Cov(g, x)/Var(g). Since b gx and b gy can be estimated from the exposure GWAS and the outcome GWAS, respectively, a popular summary-data MR (SMR) estimate of b is [2]: whereb gy andb gx are GWAS estimates of b gy and b gx , respectively. In practice, in order to satisfy the relevance assumption on an IV, an SNP is typically selected from the exposure GWAS, often at the genome-wide significance level p < 5 × 10 −8 . Hence the selected SNPs are subject to a winner's curse that leads to a biased effect estimate. This is an issue that has been recognized in the MR literature for some time [7][8][9][10][11][12][13]. A typical solution is to use another GWAS on the exposure to screen for IV SNPs [10,14,15]. However, such a GWAS may not always be available. A simple correction is to transform the false discovery rate to the z-scale [11]. An empirical study on the effect of the winner's curse on a Mendelian randomization study is presented in [12]. A review of methods to overcome the winner's curse in the context of genetic association studies is provided in [13].
As a matter of fact, many applications [2,16,17], including those contained in the original paper that proposed SMR [2], do not use another GWAS for screening. The IV SNPs are simply selected from the exposure GWAS and the selection bias is not corrected for in the downstream analysis [2,18,19]. Furthermore, this approach has been generalized to other settings [20,21].
This research is based on the sampling distribution of the estimated SNP effects on the exposure and on the outcome. Despite the large number of subjects used in the exposure GWAS and the outcome GWAS, there are only 4 summary statistics (i.e., two coefficient estimates and their respective standard errors) needed for MR at an IV SNP. The standard asymptotic theory, which requires a large sample size, does not apply since there is only one "observation" at an SNP. To sidestep this issue, we use a support derived from a profile likelihood as an interval estimate for b and assess its coverage probability through simulation. Support is the set of parameter values at which the log profile likelihood is a certain unit below the maximum log profile likelihood. It can be considered an extension of the confidence interval. Simulation studies demonstrate that the 2-unit support has robust coverage while the confidence interval implied by the SMR method has lower-than-nominal coverage.
In addition, we point out that the standard error ofb SMR derived from the delta method is the same as the standard error derived from the standard theory on two-stage least-squares (TSLS) regression for one-sample individual-level data in the absence of SNP selection.

One-Sample Individual-Level Data
In this subsection, we consider one-sample individual-level data where the IV SNP is not selected for its significant p-value. In this case, the delta method estimate of SE(b SMR ) is the same as the estimate derived from the theory on the TSLS method. The delta method is the method used by SMR [2]. This result indicates another connection between SMR and TSLS, in addition to the connection thatb SMR is the same as the TSLS estimate of b. James E. Pustejovsky proved this relationship in a blog post [22]. Below we provide a similar proof in our current context.
Consider the following GWAS model on exposure x: where x correlates with the unobserved confounder u. The reduced-form equation for y is: where b gy = bb gx . Let x, y, and g denote the centered vectors of n observations on x, y, and g, respectively. Estimates of b gx and b gy can be obtained as follows:b gx = g x/g g and b gy = g y/g g. Let P = gg /g g. Their second-order moments are estimated by: Var(b gy ) = n −1 y (I − P)y g g , We note that Cov(b gx ,b gy ) = 0. The TSLS estimate of b is the least squares estimate of the coefficient in the regression where the response is y and the predictor is Px: The delta method estimate of the variance ofb SMR is [2]: Sinceb 2 gx g g = x Px, we have: The last equal sign holds because: where 0 is a vector of 0's. The right-hand side of Equation (2) is exactly the estimated variance ofb TSLS defined in the standard theory on TSLS method [23]. Combining all these results, we have V delta = V TSLS . V delta can not be computed from GWAS summary data as there is no information on Cov(b gx ,b gy ). For the same reason, TSLS can also not be computed from GWAS summary data.
However, whenb gx andb gy are derived from two independent samples, Cov(b gx ,b gy ) = 0 and V delta can be computed from GWAS summary data, as is shown in the SMR method [2]. SMR tests whether the exposure has a causal effect on the outcome using a statistic T SMR defined by: whereb SMR =b 2 gy /b 2 gx and On the other hand, the TSLS method is not defined for two-samples MR although there are some extensions [24].

Two Independent Samples with a Selected SNP
In this subsection, we consider two-sample MR where the IV SNP is selected from the exposure GWAS. The purpose of this selection is to ensure that the IV SNP is associated with the exposure. This practice is commonly used in empirical MR studies [2,16]. The selection criterion for an SNP is typically |b gx |/SE(b gx ) ≥ τ for a prespecified τ. For the genome-wide significance level 5 × 10 −8 , τ = 5.45131.
The summary statistics used in an MR analysis areb gx , SE(b gx ),b gy , and SE(b gy ). To simplify notations, they will be denoted by x, σ x , y, and σ y , respectively, and b gx and b gy will be denoted by µ x and µ y , respectively. We ignore the sampling variation in σ x and σ y as they typically are derived from GWASs of very large sample sizes. A similar assumption is made elsewhere, for instance, [25]. In these notations, x and y have the following sampling distributions, respectively: where CN(·, ·) stands for a conditional normal given |x/σ x | ≥ τ and N(·, ·) a normal distribution. The distribution function CN(·, ·) was used to construct an approximate conditional likelihood for estimating µ x [26]. The term "approximate" comes from the fact that the distributions of x (prior to selection) and y are approximately normal.
where Φ(·) is the cumulative distribution function of standard normal. The density function of x is: 1 where φ(·) is the density function of standard normal. The expected value of x is: and its variance is: Note thatσ 2 x is no longer constant; its value depends on µ x . When there is no selection, τ = 0 and α 1 = α 2 . The mean and variance reduce to µ x and σ 2 x , respectively. Since x is independent of y, the likelihood function L(µ x , µ y ) is: where L x (µ x ) and L y (µ y ) are the likelihood functions based on x and y, respectively.
The likelihood function L y (µ y ) is: The MLE of µ y is apparentlyμ y = y. The likelihood function L x (µ x ) is: Its score equation is: This equation determines the MLEμ x for µ x . However, there is no explicit form forμ x . The MLEμ x can be obtained by maximizing L x (µ x ) numerically. From Equations (4) and (5), x is an unbiased estimate of the mean of the conditional normal distribution for x. However it is biased for µ x as the mean shown in Equation (4) is a nonlinear function of µ x . Similar comments are made elsewhere [26].
Since σ x > 0 and A > 0, Equation (5) indicates that when x > 0 µ x must be positive. Otherwise x − µ x would be positive and φ(α 2 ) − φ(α 1 ) is negative (because α 1 is closer to 0 than α 2 is). There would be a contradiction. Because µ x > 0 implies that the second term of Equation (5) is positive, there is x >μ x > 0. Following the same logic, when x < 0, there is x <μ x < 0. In either case, the naïve Wald ratio y/ Figure 1 shows theμ x /σ x as a function of x/σ x . The larger the value of x/σ x , the smaller the absolute difference |x/σ x −μ x /σ x |. When |x/σ x | ≥ 7.5 (corresponding to a p-value less than or equal to 6.38 × 10 −14 ), the absolute difference |x/σ x −μ x /σ x | is < 0.05 and seems to be negligible.
Under the null: there is L(µ x , µ y ) = L x (µ x )L y (0). The MLE of µ x is equal toμ x determined by Equation (5). Under the alternative: the MLE of µ y is y and the MLE of µ x is stillμ x . The likelihood ratio statistic for testing H 0 against H 1 is: This is the "conditional test" we proposed previously [9]. It is more powerful than the SMR statistic shown in Equation (3). The SMR statistic T SMR shown in Equation (3) does not taking into account the effect of the selection of the IV SNP on the inference. The variance of x is no longer σ 2 x because x is selected. Even if σ 2 x is replaced by the varianceσ 2 x of CN(µ x , σ 2 x ), the resulting statistic is less powerful than the T statistic: Replacingb SMR by y/μ x and σ x byσ x in Equation (3), a modification of the SMR statistic would be: That is,T SMR is less powerful than T; p-values for both statistics are calculated from the same distribution, which is chi-square with 1 df. The larger test statistic corresponds to the smaller p-value. The vertical line is at |x/σ x | = 7.5, which corresponds to a p-value of 6.38 × 10 −14 . The part corresponding to x/σ x < 0 is not shown sinceμ x /σ x is an odd function of x/σ x .

Support of Profile Likelihood
We now turn to an interval estimate for b = µ y /µ x . Such an estimate is not trivial since the asymptotic theory is irrelevant as there is effectively only one observation in L(µ x , u y ). For this reason, the distribution ofb = y/μ x , which is the MLE of b = µ y /µ x , is far from normal. To demonstrate this point, the following simulation study is conducted.
We generate 100,000 x's from a normal distribution with µ x = 4 and σ 2 x = 1 (so that there are a reasonable amount of x's to be selected), 7.412 of them satisfy |x| > 5.45131 and are selected. The same number (i.e., 7.412) of y's are generated independently from a normal distribution with mean µ y = bµ x and variance σ 2 y = 1. For each (x, y) pair,b = y/μ x is calculated. Histograms ofb for b = 0 and b = 2 are shown in Figure 2. For b = 0, the mean ofb is 0.0073 and the median is 0.0022. The distribution has a high probability in the neighborhood of 0. For b = 2, the distribution ofb seems to be bimodal and is skewed to the right with a mean equal to 7.4755 and a median equal to 2.5700. Both values are larger than the true value b = 2. These means and medians are also shown in Table 1 together with results from another simulation study described later. We consider the profile log-likelihood function pl(b) defined by: This function is maximized atb = y/μ x and the maximum is equal to pl(b) = log L(μ x , y) = log L x (μ x ) + log L y (y), which is also the maximum of log L(µ x , µ y ).
A natural interval estimate would be a 1 − α profile confidence interval defined as the set of b 0 such that H 0 : b = b 0 is not rejected at significance level α. However, the distribution of the log profile likelihood ratio is unknown for an arbitrary b 0 . The only exception is b 0 = 0 at which Intuitively, the log partial likelihood function pl(b) can not be approximated by a quadratic function in the vicinity ofb when b 0 = 0. As a result, the profile confidence interval for b can not be constructed. An example log partial likelihood function pl(b) is shown in Figure 3 for x/σ x = 5.4599 and y/σ y = 12.3155. For an interval estimate of b, we use the k-unit support defined by [27]: where k is a prespecified number. This interval consists of b 0 for which pl(b 0 ) is greater than log L(μ x , y) − k. It can be regarded as a generalization of the usual confidence interval. For instance, when b 0 = 0 and k = 2, This approximation worsens as |b 0 | moves further away from 0. For the example shown in Figure 3 (i.e., x/σ x = 5.4599 and y/σ y = 12.3155), the lower limit of the 2-unit support is 2.146 and the upper limit is greater than 43.406. The exact value of the upper limit is unknown due to numerical issues. It may be unbounded. By the way the support is constructed, the null H 0 : b 0 = 0 is rejected by the statistic T at significance level α if and only if the k-unit support, where k = [Φ −1 (1 − α/2)] 2 /2, contains 0.
We use a simulation study to investigate the coverage of a 2-unit support. For this purpose, data are generated as before but more values for b, i.e., b = 0, 0.5, 1, 1.5, and 2, are considered. For each value of b, we compare the winner's-0curse-corrected method and the SMR method in terms of a point estimate of b, an interval estimate of b, and a test of H 0 : b = 0. Results are reported in Table 1. Both the winner's-curse-corrected method and SMR method are biased in terms of the mean and median. The SMR 95% confidence interval, computed asb SMR ± 1.96 × V 1/2 delta , has worse coverage as b increases while the 2-unit support has rather stable coverage. In addition, the test statistic T is more powerful than the SMR method. Table 1. Results of simulation studies with µ x = 4 and σ x = σ y = 1. The statistic T is T = y 2 /σ 2 y .

An Empirical Data Analysis
We conducted a Mendelian randomization analysis of the effect of age of menarche on total pubertal height growth and late pubertal height growth using the winner's-cursecorrected method and the SMR method. Previously, we used the inverse-variance weighted (IVW) method [5] and the MR-Egger regression method [6] on these exposures and outcomes [15]. In that study, to avoid the winner's curse caused by the selection of IV SNPs, two other GWAS studies on age at menarche from an MR-Base database were used for validation. IV SNPs were significant in the main GWAS for age at menarche but not in the other two other GWASs which were removed. Such a procedure helps to avoid IV SNPs that are close to the selection threshold. In this study, we use all significant IV SNPs without further validation.
GWAS summary data were retrieved from the MR-Base database (http://www. mrbase.org/ accessed on 27 November 2022). At the genome-wide significance level 5 × 10 −8 , 117 instrument SNPs were selected from a previous study on age at menarche with 182,413 females of European ancestry [28]. After pruning for linkage disequilibrium, there are 84 SNPs left. The GWAS summary statistics on adult height were obtained from a study with 4946 females of European ancestry [29]. Thus, the population of this study matches that of the study on age at menarche.
For each SNP, the winner's-curse-corrected estimate of b and a support are computed in addition to the SMR estimate and the associated confidence interval. To correct for the 84 IV SNPs, the support is 5.9-unit since Pr(χ 2 1 > 2 × 5.9) = 0.05/84 and the nominal coverage of the confidence interval is 0.9994 (=1 − 0.05/84). As discussed previously, this support excludes 0 if and only if the T statistic is significant at the level 0.05/84. The p-value for the winner's-curse-corrected method is based on the T statistic. SNPs whose supports or confidence intervals do not contain 0 are shown in Table 2.
The estimates of b from the winner's-curse-corrected method and the SMR method are pretty close to each other for the SNPs shown in Table 2, as are the support and the confidence interval. This is due to the high significance of the association of these SNPs with the age at menarche (p-values: 4.552 × 10 −15 for rs7514705 and rs7642134; <4.552 × 10 −15 for rs7759938). For both total and late pubertal height growth, the T statistic is more significant than the T SMR statistic. For late pubertal height growth, SNP rs7514705 is significant for the T statistic but not for the T SMR statistic. Table 2. Results for the effects of age at menarche on total pubertal height growth and late pubertal height growth. To correct for the 84 IV SNPs, the support is 5.9-unit and the nominal coverage of the CI is 0.9994(= 1 − 0.05/84). This support excludes 0 if and only if the T statistic is significant at the level 0.05/84. The p-value is for the null H 0 : b = 0. It is computed from the T statistic (the winner's-curse-corrected method) or the T SMR statistic (the SMR method).

SNP
Gene Nameb ( 5.9-Unit Support) p-Value Another empirical application on the conditional test T is the study of schizophrenia, which was shown in our previous publication [9]. The T statistic identified some strong candidate genes (e.g., AKT3, RGS6, and KCNN3) for schizophrenia that are missed by the SMR method.

Discussion
Previously, we proposed a test statistic T for testing H 0 : b = 0 [15]. The current work extends the previous work by focusing on the point and interval estimate of the causal effect b. Because the "sample size" for the MR analysis is 1, the standard likelihood theory does not apply. As a result, it is not straightforward to construct a confidence interval.
We considered two extreme scenarios: one being the one-sample individual-level data and the other being independent-sample summary data. In addition, on of these scenarios is without the winner's curse caused by the selection of IV SNPs and the other suffers from the winner's curse. For one-sample individual-level data that is free of the winner's curse, the SMR method is the same as the TSLS method, not only in terms of the estimates of the causal effect size but also in terms of the variance of the estimates. For two independent-sample summary data with a selected SNP, the SMR test for H 0 : b = 0 is less powerful than the conditional test we proposed earlier [9]. Confidence intervals derived from the SMR method have poor coverage compared to their nominal levels. In comparison, the supports we proposed have stable coverage, at least in our simulation studies.
There are reports (also see our empirical data analyses) showing that the winner's curse may not have substantial impact on the MR estimates [12]. This is because in these cases the SNPs are strong. As indicated by Figure 1, the winner's curse affects the relatively weak IV SNPs most. These are the SNPs with |b gx /SE(b gx )| < 7.5 (i.e., p < 6.38 × 10 −14 ). For the strong SNPs, there is not much difference between x and its maximum likelihood estimate. An SNP is strong when either the effect size b or the sample size in the exposure GWAS is, or both, are large. The three-sample design [14] also helps in making an SNP strong by increasing the chance that the selected SNPs are highly significant. Theoretically, however, it does not eliminate the winner's curse as the probability that a weak SNP is significant in the discovery GWAS and the exposure GWAS is non-zero.
In the previous paragraph, the meaning of the term "weak" may be different than weak instrument in the usual sense although there is no universally-accepted definition of weak instrument. It is relative to the threshold for selecting IV SNPs. SNPs that barely pass the threshold are always weak. In comparison, a weak instrument in the usual sense seems to be characterized in absolute sense, for instance, the F-statistic for testing H 0 : b gx = 0 is less than 10 [30].
Although Equation (1) is on continuous traits, the proposed winner's-curse-corrected method works for dichotomous traits because it is based on the approximate normality on b gx andb gy .
The current study focuses on a single SNP analysis. A major advantage of such an analysis over multiple SNPs such as the IVW method and the MR-Egger regression method is that it involves less assumptions. For example, the causal effects at different SNPs are allowed to be different. An interesting topic would be to generalize the current work to the case of using multiple SNPs simultaneously.
Our winner's-curse-corrected method is designed for two independent (i.e., nonoverlapping) samples only. This is a limitation although it is not uncommon for methodology development, for instance, [14]. In practice, the study subjects for the exposure GWAS and the outcome GWAS may overlap [12]. The likelihood function L(µ x , µ y ) will be different than what is presented here. The conditional test T needs to be revised and the concept of support is still applicable. Future research on this topic is warranted.
The winner's-curse-corrected method has been implemented in the R package iGasso.
Funding: This study was partially supported by National Institutes of Health grant R01MH121394 (to S.H.).
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable
Data Availability Statement: GWAS summary data used in this research were retrieved from the MR-Base database (http://www.mrbase.org/ accessed on 27 November 2022).

Acknowledgments:
The author would like to thank two anonymous reviewers for their helpful comments.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: GWAS genome-wide association study IV instrumental variable MR Mendelian randomization SNP single nucleotide polymorphism TSLS two-stage least-squares