This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Mendelian randomization refers to the random allocation of alleles at the time of gamete formation. In observational epidemiology, the term refers to the use of genetic variants to estimate a causal effect between a modifiable risk factor and an outcome of interest. In this review, we recall the principles of a “Mendelian randomization” approach in observational epidemiology, which is based on the technique of instrumental variables; we provide simulations and an example based on real data to demonstrate its implications; we present the results of a systematic search for original articles that have used this approach; and we discuss some limitations of this approach in view of what has been found so far.

Observational studies have brought important insight into disease etiology. During the past decade, however, the validity of observational studies has been questioned [

One cannot, for ethical and technical reasons, randomize risk factors using controlled trials in humans. The identification of risk factors therefore relies on observational studies, which are prone to spurious results due to confounding factors, reverse causation, and/or selection biases [

Mendelian randomization refers to the random allocation of alleles at the time of gamete formation. A specific genotype carried by a person therefore results from two such randomized transmissions, one for the paternally inherited allele and the other for the maternally inherited allele. A logical consequence of these randomizations is that genotypes are not expected to be associated with known (measurable or not) or unknown confounders for any outcome of interest, except those lying on the causal pathway between the genotype and the outcome. This should make it possible to analyze the genotype-risk factor association and the genotype-outcome association in an unconfounded manner. By appropriately combining the results of these two analyses, one can obtain an estimate of the risk factor-outcome association that is itself not confounded. This is analogous to randomized controlled trials (of sufficient sample size), in which the random allocation of treatment (or preventive measure) is expected to lead to an even distribution of (known or unknown) confounding factors across groups. The term “Mendelian randomization” is now frequently used in observational epidemiology to refer to the use of genetic variants to estimate a causal effect between a specific modifiable risk factor and a trait/disease of interest. The idea is to overcome some of the problems encountered in observational epidemiology, such as residual confounding and reverse causation, by taking advantage of the natural random allocation of alleles during meiosis [

We here provide an example to illustrate this approach. The aldehyde dehydrogenase 2 (

Historically, the first description of the concept of Mendelian randomization in observational epidemiology is attributed to Katan [

We consider the case where an association between a continuous (or binary) modifiable exposure

The beta coefficient is a consistent estimate of the causal effect of $X$ on $Y$.

The beta coefficient is actually underestimating the true causal effect of $X$ on $Y$.

The beta coefficient is overestimating the true causal effect of $X$ on $Y$.

The non-zero beta coefficient is entirely due to the presence of a confounder which is related to both $X$ and $Y$.

The beta coefficient is non-zero because of a causal effect of $Y$ on $X$ (reverse causation).

In other words, if the interest lies in assessing “the causal effect of

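A linear model consistent with this setting is $Y = \alpha_1 + \beta_1 X + \varepsilon_1$, where $\beta_1$ is the causal effect of $X$ on $Y$ and $\varepsilon_1$ is the error term.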

The method of instrumental variables has been proposed to correct for the bias of the least squares estimate. For this, we need to have at our disposal an “instrumental variable”, or instrument

A second linear model, relating the exposure $X$ to the instrument $Z$, is $X = \alpha_2 + \beta_2 Z + \varepsilon_2$, with $\beta_2 \neq 0$.

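Denoting $\alpha_3 = \alpha_1 + \beta_1\alpha_2$, $\beta_3 = \beta_1\beta_2$ and $\varepsilon_3 = \varepsilon_1 + \beta_1\varepsilon_2$, we hence obtain a third linear model:
$$Y = \alpha_3 + \beta_3 Z + \varepsilon_3$$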
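The third model follows from substituting the second model into the first. The short derivation below is a sketch that assumes the first model has the form $Y = \alpha_1 + \beta_1 X + \varepsilon_1$ and the second the form $X = \alpha_2 + \beta_2 Z + \varepsilon_2$; these forms are inferred from the parameter relationships given in the text, with $X$ the exposure, $Y$ the outcome and $Z$ the instrument.

```latex
\begin{align*}
Y &= \alpha_1 + \beta_1 X + \varepsilon_1 \\
  &= \alpha_1 + \beta_1\,(\alpha_2 + \beta_2 Z + \varepsilon_2) + \varepsilon_1 \\
  &= (\alpha_1 + \beta_1\alpha_2) + \beta_1\beta_2\, Z + (\varepsilon_1 + \beta_1\varepsilon_2) \\
  &= \alpha_3 + \beta_3 Z + \varepsilon_3 .
\end{align*}
```

Reading off the slope gives $\beta_3 = \beta_1\beta_2$, which is why $\beta_1$ can be recovered as $\beta_3/\beta_2$ whenever $\beta_2 \neq 0$.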

In the end, the parameters of the first model can be consistently estimated using the relationships $\alpha_1 = \alpha_3 - \beta_1\alpha_2$ and $\beta_1 = \beta_3/\beta_2$, the denominator $\beta_2$ being non-zero by assumption. In particular, the instrumental variable (IV) estimate of the causal effect $\beta_1$ in the first model is the quotient of the two least squares estimates of the slope parameters $\beta_3$ and $\beta_2$ in the third and second models. Since the expectation of a quotient of two estimates is asymptotically equal to the quotient of the expectations of these estimates, the IV estimates are asymptotically unbiased, but they may be biased in finite samples.
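The quotient form of the IV estimate can be illustrated with a minimal Python sketch. The function names (`ls_slope`, `iv_ratio`), the seed and the sample size are illustrative assumptions and are not part of the original analysis; the data are generated with a confounder `u`, in the spirit of the third simulation model of this review (true causal effect equal to 1).

```python
import numpy as np

def ls_slope(a, b):
    """Least squares slope from regressing b on a (intercept included)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ac = a - a.mean()
    return float(ac @ (b - b.mean()) / (ac @ ac))

def iv_ratio(z, x, y):
    """IV estimate of the causal effect: slope of Y on Z divided by slope of X on Z."""
    beta3_hat = ls_slope(z, y)  # third model: Y regressed on Z
    beta2_hat = ls_slope(z, x)  # second model: X regressed on Z
    return beta3_hat / beta2_hat

# Hypothetical large sample with a confounder u (assumed for illustration)
rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)               # instrument
u = rng.normal(size=n)               # unobserved confounder
x = z + u + rng.normal(size=n)       # exposure
y = x + u + rng.normal(size=n)       # outcome, true causal effect = 1

ls_estimate = ls_slope(x, y)         # confounded least squares estimate
iv_estimate = iv_ratio(z, x, y)      # ratio (IV) estimate
print(f"LS: {ls_estimate:.3f}, IV: {iv_estimate:.3f}")
```

In this setting the least squares slope converges to 4/3 rather than 1, whereas the ratio estimate converges to the true value.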

Asymptotically, the IV estimates are normally distributed and explicit formulae for the standard errors are available, which makes it possible to calculate confidence intervals and to test for the nullity of the causal effect $\beta_1$ in the first model (as calculated, e.g., with the ivregress 2sls command implemented in Stata 10.0). The standard error of the estimates depends, among other things, on the percentage of explained variance in the second model (itself related to the percentage of explained variance in the third model). If this percentage is low, the instrument is said to be weak, the standard errors will be large and the test above will have low power. Moreover, when the instrument is weak, the bias of the IV estimates is typically larger and the asymptotic normal distribution may be a poor approximation to the true distribution of the IV estimates, so that inference becomes unreliable [. As a rule of thumb, an instrument is considered weak if the F statistic for testing the nullity of $\beta_2$ in the second model is below 10 [
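Instrument strength can be checked from the regression of the exposure on the instrument. The sketch below computes the F statistic and the percentage of explained variance for a single instrument; the function name `first_stage_f`, the slopes and the sample size are assumptions chosen for illustration, and the threshold of 10 is the rule of thumb mentioned above.

```python
import numpy as np

def first_stage_f(z, x):
    """F statistic and R^2 for the regression of X on a single instrument Z."""
    z = np.asarray(z, dtype=float)
    x = np.asarray(x, dtype=float)
    n = z.size
    zc = z - z.mean()
    xc = x - x.mean()
    beta2_hat = (zc @ xc) / (zc @ zc)       # slope of the second model
    resid = xc - beta2_hat * zc
    rss = resid @ resid                      # residual sum of squares
    ess = beta2_hat ** 2 * (zc @ zc)         # explained sum of squares
    f_stat = ess / (rss / (n - 2))           # one instrument: F equals t squared
    r2 = ess / (ess + rss)
    return float(f_stat), float(r2)

# Hypothetical data: one strong and one weak instrument effect
rng = np.random.default_rng(1)
n = 1000
z = rng.normal(size=n)
x_strong = 1.0 * z + rng.normal(size=n)
x_weak = 0.05 * z + rng.normal(size=n)

f_strong, r2_strong = first_stage_f(z, x_strong)
f_weak, r2_weak = first_stage_f(z, x_weak)
for label, f_stat in (("strong", f_strong), ("weak", f_weak)):
    flag = "weak instrument" if f_stat < 10 else "acceptable by the rule of thumb"
    print(f"{label}: F = {f_stat:.1f} ({flag})")
```

With a single instrument, F and R² carry the same information, since F = (n − 2) R² / (1 − R²); a low percentage of explained variance therefore translates directly into a low F statistic.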

Another equivalent way to calculate the IV estimates (but without their standard errors!) is to perform a “two-stage least squares”, regressing
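The equivalence between two-stage least squares and the ratio of slopes can be checked numerically. The following is a minimal sketch under the assumption of a single instrument; `two_stage_least_squares` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

def two_stage_least_squares(z, x, y):
    """Point estimate of the causal effect of X on Y by two-stage least squares."""
    # Stage 1: regress X on Z and keep the fitted values
    b2, a2 = np.polyfit(z, x, 1)      # polyfit returns slope first, then intercept
    x_hat = a2 + b2 * z
    # Stage 2: regress Y on the fitted values from stage 1
    b1, _a1 = np.polyfit(x_hat, y, 1)
    return float(b1)

# Hypothetical confounded data (assumed for illustration)
rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + u + rng.normal(size=n)
y = x + u + rng.normal(size=n)

tsls_estimate = two_stage_least_squares(z, x, y)
ratio_estimate = float(np.polyfit(z, y, 1)[0] / np.polyfit(z, x, 1)[0])
print(tsls_estimate, ratio_estimate)   # the two point estimates coincide
```

The standard errors reported by the second regression are not valid for the IV estimate, because they ignore that the fitted values were themselves estimated in the first stage; dedicated IV routines correct for this.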

In addition to testing for the nullity of the causal effect $\beta_1$, one may also test for the absence of correlation between $\varepsilon_1$ and

To illustrate that the method of instrumental variables is effective, we simulated data from five models consistent with the five above-mentioned interpretations (

The causal effect of $X$ on $Y$ is $\beta_1 = 1$ under the first three models, and $\beta_1 = 0$ under the last two models. Boxplots of the least squares (LS) estimates and of the instrumental variable (IV) estimates of parameter $\beta_1$ obtained from 1,000 samples of size n = 100 under each of the five models are shown on the top panel of

To provide an idea of what may happen when using a weak instrument, we considered the same five models, but the slopes involving

We next provide an example with real data to illustrate that the method of instrumental variables is able to correct for the bias of least squares in a case of reverse causation. We used the 1,268 participants of the population-based CoLaus study [. The parameter $\beta_1$ was estimated using least squares and the method of instrumental variables (the latter with the ivregress 2sls command implemented in Stata 10.0). The LS estimate (95% CI) was 5.53 (4.73; 6.33) mmol/L per risk allele. The IV estimate (95% CI) was −4.60 (−13.82; 4.63) mmol/L per risk allele, which was significantly different from the LS estimate in a Durbin-Wu-Hausman test (P = 0.03), and not significantly different from zero. Thus, while the result provided by least squares was highly significant, the instrumental variable approach did not show any evidence for a positive causal effect of GGT on alcohol consumption.

We searched MEDLINE using the following query: “Mendelian randomization” OR “Mendelian randomisation”, which retrieved 99 citations (January 13, 2009). We acknowledge that this search strategy might not have retrieved all publications using the concept of Mendelian randomization, but it should provide a good overview of what has been published. The aim was to identify original articles reporting results from an observational study using a Mendelian randomization approach. We also searched references from review papers and original articles, as well as citations of these papers.

We identified 23 studies with a dichotomous trait as the outcome of interest (

In order to use Mendelian randomization to infer causality in observational epidemiology, numerous conditions need to be fulfilled [

Also, there should be no pleiotropy, (

A practical condition is that there should be enough data to establish reliable genotype-intermediate phenotype, or genotype-outcome, associations. In our literature review, we observed that for many publications, estimates for these two associations came from different studies. Whenever independent studies have analyzed these two relationships, separate meta-analyses can be conducted. For studies having assessed both relationships, a multivariate model is needed in order to take into account the correlation in the genotype–phenotype and genotype–disease associations. Minelli

Many of the studies we identified applied a Mendelian randomization approach with a binary outcome. While econometricians have proposed instrumental variables methods for binary outcomes (see Lawlor

The Mendelian randomization approach in observational epidemiology is a valuable tool that has taken on a new dimension in the post-genomic era and is increasingly being used. Conceptually, it relies on an instrumental variable approach. The Mendelian randomization approach has had some successes in helping to unravel causal relationships in observational epidemiology. Examples are the recently published evidence for the causal role of body mass index on blood pressure [

We thank Peter Vollenweider, Gérard Waeber and Vincent Mooser for allowing us to use the CoLaus data to illustrate the instrumental variable approach. M.B. is supported by the Swiss School of Public Health Plus and by grants from the Swiss Science Foundation (PROSPER 3200BO-111362/1 and 111361/1, SPUM 33CM30/124087/1).

Directed acyclic graph (DAG) representing causal relationships between the genetic instrument (

Results of simulations based on 1,000 samples under each of the five simulation models. The true value of the causal effect is $\beta_1 = 1$ (solid horizontal line) for the first three models, and $\beta_1 = 0$ (dashed horizontal line) for the last two models.

Type and frequency of genetic instruments in Mendelian randomization. $R^2$ represents the proportion of variance of the intermediate phenotype explained by the genetic instrument, i.e., the $R^2$ value for the first linear regression in two-stage least squares regression models.

Description of the five models used for the simulations in Section 4.

| Model | Instrument | Confounder or latent variables | Exposure X | Outcome Y |
|---|---|---|---|---|
| 1. Causal effect of X on Y | Z = N(0,1) | | X = Z+N(0,1) | Y = X+N(0,1) |
| 2. Causal effect of X on Y and measurement errors on X and Y | Z = N(0,1) | Xtrue = Z+N(0,1); Ytrue = Xtrue+N(0,1) | X = Xtrue+N(0,1) | Y = Ytrue+N(0,1) |
| 3. Causal effect of X on Y and presence of a confounder | Z = N(0,1) | U = N(0,1) | X = Z+U+N(0,1) | Y = X+U+N(0,1) |
| 4. No causal effect between X and Y and presence of a confounder | Z = N(0,1) | U = N(0,1) | X = Z+U+N(0,1) | Y = U+N(0,1) |
| 5. Causal effect of Y on X (reverse causation) | Z = N(0,1) | | X = Z+Y+N(0,1) | Y = N(0,1) |
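The five data-generating models in the table can be simulated in a few lines. The sketch below follows the design described in the text (1,000 samples of size n = 100 per model) and compares the median least squares and IV estimates; the function names, the seed and the use of the median as a summary are illustrative assumptions, and the original simulations were not necessarily run this way.

```python
import numpy as np

def slope(a, b):
    """Least squares slope of b regressed on a (intercept included)."""
    ac = a - a.mean()
    return float(ac @ (b - b.mean()) / (ac @ ac))

def simulate(model, n, rng):
    """Draw one sample (z, x, y) from one of the five models in the table."""
    noise = lambda: rng.normal(size=n)   # independent N(0,1) draws
    z = noise()
    if model == 1:      # causal effect of X on Y
        x = z + noise()
        y = x + noise()
    elif model == 2:    # causal effect plus measurement errors on X and Y
        x_true = z + noise()
        y_true = x_true + noise()
        x = x_true + noise()
        y = y_true + noise()
    elif model == 3:    # causal effect plus confounder U
        u = noise()
        x = z + u + noise()
        y = x + u + noise()
    elif model == 4:    # no causal effect, confounder U only
        u = noise()
        x = z + u + noise()
        y = u + noise()
    elif model == 5:    # reverse causation: Y causes X
        y = noise()
        x = z + y + noise()
    else:
        raise ValueError("model must be an integer from 1 to 5")
    return z, x, y

def run(n_samples=1000, n=100, seed=1):
    """Median LS and IV estimates of the effect of X on Y for each model."""
    rng = np.random.default_rng(seed)
    summary = {}
    for model in range(1, 6):
        ls, iv = [], []
        for _ in range(n_samples):
            z, x, y = simulate(model, n, rng)
            ls.append(slope(x, y))                   # least squares estimate
            iv.append(slope(z, y) / slope(z, x))     # ratio (IV) estimate
        summary[model] = (float(np.median(ls)), float(np.median(iv)))
    return summary

results = run()
for model, (ls_median, iv_median) in results.items():
    print(f"model {model}: median LS = {ls_median:.2f}, median IV = {iv_median:.2f}")
```

Under these models the large-sample limits of the least squares slope are 1, 2/3, 4/3, 1/3 and 1/3 respectively, so least squares is consistent only under the first model, while the IV estimates centre on the true values (1, 1, 1, 0, 0).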

Literature review for dichotomous outcomes analyzed using a Mendelian randomization approach.

| Outcome | Genetic instrument | Intermediate phenotype | Evidence for causality | Reference |
|---|---|---|---|---|
| Type 2 diabetes | rs1799941 | SHBG | + | [ |
| Type 2 diabetes | rs6257, rs6259 | SHBG | + | [ |
| Type 2 diabetes | rs6564851 | β-carotene | − | [ |
| Type 2 diabetes | rs1007888 | MIF | + | [ |
| Coronary artery disease | rs2228671 | LDL-cholesterol | | [ |
| Coronary heart disease | Y142X, C679X | LDL | + | [ |
| Coronary heart disease | rs1130864 | CRP | − | [ |
| Coronary heart disease | rs7553007 | CRP | − | [ |
| Myocardial infarction | rs1130864 | CRP | − | [ |
| Myocardial infarction | KIV-2 (CNV) | Lp(a) | + | [ |
| Myocardial infarction | −148C/T | fibrinogen | − | [ |
| Stroke | C677T | homocysteine | + | [ |
| Hypertension | rs1800947, | CRP | − | [ |
| Metabolic syndrome | rs4988235 (-13910-C/T) | Milk consumption | + | [ |
| Hypertriglyceridemia | rs3758538 | RBP4 | − | [ |
| Cancer | E2, E3, E4 | cholesterol | − | [ |
| Head and neck cancer | rs671 | Alcohol consumption | + | [ |
| Oesophageal cancer | rs671 | Alcohol consumption | + | [ |
| Lung or kidney cancer | rs9939609 | BMI | + | [ |
| Polycystic ovary syndrome | Gly972Arg | Insulin | + | [ |
| Depression | rs662 | PON1 activity | − | [ |
| Stillbirth | slow/fast metabolizers | caffeine | + | [ |
| Cataract | rs9939609 | BMI | + | [ |

+ means evidence for causality; − means no evidence for causality. SHBG, sex hormone-binding globulin; LDL, low density lipoprotein; CRP, C-reactive protein; Lp(a), lipoprotein(a); RBP4, retinol binding protein 4; BMI, body mass index; PON1, paraoxonase 1. Gene symbols:

Literature review for continuous outcomes analyzed using a Mendelian randomization approach.

| Outcome | Genetic instrument | Intermediate phenotype | Evidence for causality | Reference |
|---|---|---|---|---|
| Metabolic traits (insulin, lipids, etc.) | rs9939609 | BMI | + | [ |
| BMI | rs1800947, rs1205 | CRP | − | [ |
| BMI | rs7553007, rs1805096 | CRP | + | [ |
| BMI, blood pressure, triglycerides, HDL, waist-to-hip ratio, HOMA-R | rs1800947, rs1130864, rs1205 | CRP | − | [ |
| Blood pressure | rs17782313, rs9939609 | BMI | + | [ |
| Blood pressure | rs1800947, | CRP | − | [ |
| Bone mass | rs17782313, rs9939609 | adiposity | + | [ |
| Bone mass density, bone fractures | rs4988235 (-13910-C/T) | Calcium intake | + | [ |
| HbA1c | rs1130864, rs1205, rs3093077 | CRP | − | [ |
| Carotid-intima media thickness | rs1130864, rs1205, rs3093077 | CRP | − | [ |
| Carotid-intima media thickness | rs2794521, rs3091244, rs1800947, rs1130864, rs1205 | CRP | − | [ |
| Carotid-intima media thickness | rs9939609 | BMI | + | [ |
| Serum leptin | rs2794521, rs3091244, rs1800947, rs1130864, rs1205 | CRP | − | [ |
| Lung function | rs1205, rs1800947 | CRP | + | [ |
| Physical functioning | rs5744256 | IL-18 | + | [ |

+ means evidence for causality; − means no evidence for causality. BMI, body mass index; CRP, C-reactive protein; IL-18, interleukin 18.