Tutorial

How to Interpret Heterogeneity in Meta-Analysis: A Structured Guide for Clinicians and Researchers

by
Javier Arredondo Montero
Department of Pediatric Surgery, Complejo Asistencial Universitario de León, c/Altos de Nava s/n, 24008 León, Spain
BioMedInformatics 2026, 6(3), 35; https://doi.org/10.3390/biomedinformatics6030035
Submission received: 16 April 2026 / Revised: 19 May 2026 / Accepted: 26 May 2026 / Published: 2 June 2026
(This article belongs to the Section Medical Statistics and Data Science)

Abstract

Heterogeneity is central to the credibility, transportability, and clinical interpretation of meta-analytic evidence, yet its assessment is frequently reduced to isolated or misinterpreted statistics. This tutorial provides a clinically oriented framework for interpreting heterogeneity across therapeutic, diagnostic, and prognostic evidence synthesis. It distinguishes sampling error from genuine between-study variability and explains why Q, I2, τ2, and prediction intervals should be interpreted as complementary rather than interchangeable measures. Rather than relying on I2 alone, meaningful interpretation requires assessing both the relative and absolute scale of between-study variability and its implications for future clinical settings. The tutorial also addresses estimator dependence, uncertainty in τ2, the limitations of different confidence-interval approaches in random-effects models, and the clinical interpretation of wide prediction intervals, including links with certainty-of-evidence judgments. Diagnostic test accuracy meta-analysis is discussed as a bivariate problem that requires hierarchical models, variance components, correlation parameters, and prediction regions, rather than univariate I2 summaries. Prognostic reviews are framed around case mix, baseline risk, discrimination, calibration, and transportability. Worked simulated examples, formal expressions, and practical algorithms are used to support decision-making about when pooling is reasonable, when it should be qualified, and when it should be avoided. Accurate interpretation of heterogeneity requires domain-appropriate modeling, transparent reporting, and clinical judgment rather than reliance on any single statistic.

1. Target Audience and Prerequisites

This tutorial is intended primarily for clinicians, clinical researchers, and applied health-science investigators who read, conduct, or interpret meta-analyses and seek a practical framework rather than a mathematical derivation of meta-analytic estimators. It assumes basic familiarity with forest plots, effect estimates, confidence intervals, and fixed-effect versus random-effects models. The aim is to offer a clinically oriented framework for interpreting heterogeneity, understanding when pooled estimates are trustworthy, and recognizing when uncertainty should modify clinical conclusions.

2. Introduction

Imagine walking through two forests. In the first, all trunks stand vertical, of equal height and girth, aligned in neat rows. The skyline is smooth, implying perfect order. This is homogeneity: studies pointing in the same direction, with slight variation.
Now picture a second forest. Here, one trunk is thicker, another taller, and several lean at different angles. The skyline is uneven and irregular. This is heterogeneity: differences in study results that go beyond what would be expected by chance. In this metaphor, each tree represents a study: its tilt reflects the effect estimate, its height or girth the sample size and precision, and the skyline the pooled evidence. What at first glance may appear to be a tidy alignment is, in truth, a landscape of variation—the heterogeneity that ultimately shapes the trustworthiness of meta-analytic conclusions (Figure 1).
Meta-analysis invites a familiar shortcut: readers often focus on the diamond at the bottom of the forest plot while overlooking the variation that gives it meaning. Yet the pooled estimate is only as trustworthy as the studies behind it. Heterogeneity lies between those studies, shaping both the credibility and the applicability of the summary effect.
Heterogeneity may take on different shapes depending on the type of review—therapeutic, diagnostic, or prognostic. Some forests are dense, and others are sparse; some are tilted by bias or design, while others are shaped by natural diversity. Across these review types, the central question remains the same: how much of the observed variability exceeds sampling error—the random variation expected because each study observes only a sample of the target population—where that variability comes from, and whether it changes the clinical interpretation.

3. Measures of Heterogeneity

How do we measure a forest’s irregularities? One might count trees, compare their heights, or note the angles at which they lean. Each measure captures part of the picture, but none tells the whole story. Similarly, heterogeneity can be quantified in several ways, each with its own strengths and limitations.
  • The Q Statistic 
1. Definition.
The Q statistic, introduced by Cochran in 1954 [1], is the oldest test for heterogeneity. It asks whether differences between studies exceed what would be expected by chance. In forest terms, random tilts are normal, but if many trunks lean sharply, something real makes the forest uneven. Q distinguishes these cases. In practice, Q sums the squared deviation of each study from the pooled mean, weighting precise studies more heavily. If studies align, Q stays small; if some diverge, Q grows.
2. Interpretation.
Under the null hypothesis of homogeneity, Q follows an approximate chi-square (χ2) distribution with degrees of freedom (df) equal to the number of studies minus one (df = k − 1) [2,3]. The χ2 distribution is a reference curve for scatter expected by chance—a “null model” of a perfectly straight forest: if observed Q greatly exceeds this baseline, variation is unlikely to be random. Degrees of freedom (df) serve to calibrate this test because they define the χ2 curve against which Q is compared when calculating the p-value. In a small grove, with few studies and therefore low df, the test has limited information and low power: even meaningful heterogeneity may remain undetected. In a vast forest, with numerous studies and therefore high df, even small or clinically trivial deviations may become statistically detectable.
In short, a larger Q suggests greater departure from homogeneity, but its interpretation depends strongly on study precision and the number of included studies; df adjusts the yardstick for judging whether the observed departure is extreme enough to reject homogeneity. A small p-value (e.g., <0.05) suggests that observed variability exceeds what would typically be expected by chance.
3. Limitations.
Because Q depends on df, it is strongly influenced by the number of studies. With few, it has low power and may miss true heterogeneity; with many, it becomes oversensitive, flagging trivial differences [4]. Another key limitation is that Q reflects the amount of excess variation but not its structure. Two meta-analyses can have the same Q value but exhibit very different patterns of irregularity—numerous small deviations versus one extreme outlier. In forest terms, Q can tell us that the skyline is uneven, but not why: whether the irregularity arises from many trees differing slightly in height, or from a dramatically shorter single trunk (Figure 2). Q is helpful as a first signal, but never sufficient on its own.
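The point that Q is blind to the structure of heterogeneity can be checked numerically. The sketch below is illustrative only: the effect estimates and the common within-study variance are hypothetical values chosen so that two very different patterns give the same Q.

```python
def cochran_q(effects, variances):
    """Cochran's Q: inverse-variance weighted sum of squared deviations."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))

# Hypothetical log risk ratios; every study has variance 0.0004 (SE = 0.02).
variances = [0.0004] * 10
many_small = [0.03, -0.03] * 5        # ten studies, each slightly off the mean
one_outlier = [0.01] * 9 + [-0.09]    # nine aligned studies and one far away

q_many = cochran_q(many_small, variances)
q_outlier = cochran_q(one_outlier, variances)
print(round(q_many, 2), round(q_outlier, 2))  # both 22.5
```

Both datasets return the same Q on 9 degrees of freedom, so the test result is identical, yet one reflects diffuse variation and the other a single discordant study. Only inspection of the forest plot or an influence analysis separates the two.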
  • The I2 Statistic 
1. Definition.
The I2 statistic, introduced by Higgins and Thompson in 2002 [5,6], is the proportion of total variability in effect estimates attributable to heterogeneity rather than chance. It is derived from Q and its df. I2 is a percentage: 0% means the observed scatter is attributable to sampling error alone; higher values indicate that a larger proportion of the observed variability is attributable to between-study differences. For example, I2 = 50% suggests that approximately half of the observed variability (not half of the studies!) is attributable to heterogeneity rather than sampling error.
2. Interpretation.
Guidelines sometimes classify I2 values into overlapping bands: 0–40% possibly unimportant, 30–60% moderate, 50–90% substantial, and 75–100% considerable. These thresholds are rough guides, not cut-offs. Two forests may both yield I2 = 50% and still look very different: one where tilts are barely noticeable, another where several trunks are almost falling. For this reason, the Cochrane Handbook cautions against applying thresholds rigidly [7].
3. Limitations.
I2 does not measure the actual size of heterogeneity, only its proportion. In meta-analyses involving highly precise studies, even small absolute differences can yield high I2 values. Conversely, with small, imprecise studies, I2 may appear low despite an obvious spread, or it may overestimate heterogeneity due to noise [8]. Another limitation is instability with few studies: in a small forest, I2 may under- or overestimate heterogeneity, and its apparent precision can be misleading. In summary, I2 is a useful descriptor of relative heterogeneity; however, it does not reveal the actual magnitude of heterogeneity and should not be interpreted in isolation.
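The dependence of I2 on study precision can be illustrated with the variance-ratio form of the statistic, in which I2 is the between-study variance divided by the sum of the between-study variance and a "typical" within-study variance. The sketch below assumes, for simplicity, that all studies share the same within-study variance; the numerical values are hypothetical.

```python
def i2_from_variances(tau2, typical_within_variance):
    """I2 (%) as between-study variance over total (between + within) variance."""
    return 100.0 * tau2 / (tau2 + typical_within_variance)

tau2 = 0.01  # hypothetical between-study variance, identical in both scenarios

i2_precise = i2_from_variances(tau2, 0.001)   # large, precise studies
i2_imprecise = i2_from_variances(tau2, 0.1)   # small, imprecise studies
print(round(i2_precise, 1), round(i2_imprecise, 1))  # 90.9 9.1
```

The absolute heterogeneity is the same in both scenarios; only the precision of the studies changes, and I2 moves from about 91% to about 9%.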
  • The τ2 Statistic 
1. Definition.
The τ2 statistic is the main measure of between-study variance in a meta-analysis. Variance describes how spread out a set of numbers is. If all studies give almost identical results, the variance is close to zero; if results differ widely, the variance is larger. While Q detects whether heterogeneity exists and I2 quantifies the proportion of observed variability beyond chance, τ2 quantifies the absolute magnitude of between-study variance on the squared effect-size scale, whereas τ expresses the between-study standard deviation on the effect-size scale. For example, in a meta-analysis of weight loss (in kilograms), τ describes the typical between-study spread directly in kilograms, whereas τ2 is the variance parameter used by the random-effects model to represent that dispersion.
In forest terms, I2 describes how much of the visible unevenness reflects systematic differences across the forest, whereas τ2 indicates how large those underlying differences are.
2. Interpretation.
Fixed-effect models assume that all studies estimate the same underlying effect. Any differences are attributed to sampling error, so the between-study variance is fixed at zero and τ2 is not estimated.
Random-effects models, by contrast, acknowledge that true effects may differ across studies due to variation in populations, interventions, or methods. Here, τ2 captures the variance of these true effects, that is, their average squared distance from the mean true effect. If τ2 = 0, the forest is uniform; as τ2 grows, the trunks lean at increasingly different angles.
Strictly, τ2 denotes the true between-study variance, whereas $\hat{\tau}^2$ denotes the empirical estimate calculated from the included studies using an estimator such as REML, Paule–Mandel, or DerSimonian–Laird. For readability, this tutorial follows common usage and refers to τ2 throughout, but any reported numerical value of τ2 should be understood as an estimate of the true between-study variance.
Estimating τ2 is not trivial. Several methods exist:
  • DerSimonian–Laird (DL). The most widely known and historically dominant estimator [9]. DL is simple and computationally convenient, which explains its widespread use in standard meta-analytic software. It is computed directly from Cochran’s Q, the same statistic underlying the standard χ2 test for homogeneity. This dependence on Q also contributes to its limitations: with few studies, sparse data, unbalanced weights, or substantial true heterogeneity, DL may underestimate τ2 (pulling the estimate toward zero) and produce overly narrow confidence intervals. Given these limitations, reliance on DL as the default estimator warrants caution.
  • Restricted maximum likelihood (REML). Widely recommended in contemporary methodological guidance [10,11]. Unlike DL, REML incorporates uncertainty in the pooled effect when estimating τ2 and is often less biased under many common meta-analytic conditions, particularly when the number of studies is small or heterogeneity is high.
  • Paule–Mandel. A moment-based estimator that, like REML, provides more stable estimates of between-study variance than DL, particularly in the presence of substantial heterogeneity or when study weights are unbalanced. It has been shown to perform well in terms of bias and coverage across a range of realistic meta-analytic scenarios.
  • Bayesian estimators. In Bayesian statistics, τ2 is not treated as a fixed number but as an uncertain quantity, described by a probability distribution. We start with a prior (what we already know or assume) and update it with the data to get a posterior (what seems plausible after seeing the evidence) [12]. The advantage is that we can make direct probability statements, like “there is a 70% chance that heterogeneity is above a clinically important level.” This approach is flexible, especially when data are scarce, but the results depend on how the prior is chosen, so the choice must be made transparently.
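To make the link between DL and Cochran’s Q concrete, the following minimal sketch implements the standard method-of-moments DL formula, in which the excess of Q over its degrees of freedom is rescaled by a function of the weights and truncated at zero. The data are hypothetical, and the function is an illustration rather than a replacement for validated meta-analytic software.

```python
def dersimonian_laird_tau2(effects, variances):
    """DerSimonian-Laird estimate of the between-study variance."""
    k = len(effects)
    w = [1.0 / v for v in variances]              # fixed-effect weights
    sum_w = sum(w)
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w  # scaling constant
    return max(0.0, (q - (k - 1)) / c)            # truncated at zero

# Hypothetical log risk ratios, each with within-study variance 0.01.
variances = [0.01] * 5
dispersed = [-0.35, -0.25, -0.15, -0.05, 0.05]
aligned = [-0.17, -0.16, -0.15, -0.14, -0.13]

tau2_dispersed = dersimonian_laird_tau2(dispersed, variances)
tau2_aligned = dersimonian_laird_tau2(aligned, variances)
print(round(tau2_dispersed, 4), tau2_aligned)  # 0.015 0.0
```

In the second dataset Q is smaller than its degrees of freedom, so the estimate is truncated to zero. This truncation is one reason DL can report no heterogeneity when only a few studies are available.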
Once τ2 is estimated, it influences not only descriptive measures of heterogeneity but also statistical inference, particularly the confidence interval around the pooled effect. In random-effects meta-analysis, this interval can be calculated in different ways. The conventional Wald-type approach, which is the default output in software such as Review Manager 5 (RevMan 5), version 5.4 (The Cochrane Collaboration, Copenhagen, Denmark, 2020), is simple, but it may underestimate uncertainty when the number of studies is small or when between-study variance is present (τ2 > 0). This can make the pooled effect look more precise than it really is. The Hartung–Knapp–Sidik–Jonkman (HKSJ) adjustment, recommended in contemporary guidance such as the Cochrane Handbook [7], was developed to provide more robust confidence intervals for the pooled effect in random-effects models: it rescales the standard error using the observed dispersion of the studies and replaces the normal quantile with a t quantile, which better reflects uncertainty when few studies are available. HKSJ intervals are often wider, but not always, and should not be interpreted as a universal solution. In very small meta-analyses (e.g., k < 3) or in highly consistent evidence bases, HKSJ may be unstable or even yield unexpectedly narrow intervals; modified approaches, such as truncated Knapp–Hartung adjustments, may be considered. Therefore, HKSJ should be viewed as a useful inferential adjustment or sensitivity analysis, not as a substitute for judgment regarding the number of studies, heterogeneity, and clinical compatibility.
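The difference between the two interval types can be sketched as follows. The code assumes that τ2 has already been estimated by some method and is supplied as an input, and that the t quantile for k − 1 degrees of freedom is taken from a table (2.776 for k = 5). The data, the function name, and the value of τ2 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def random_effects_cis(effects, variances, tau2, t_crit):
    """Pooled effect with Wald-type and HKSJ 95% confidence intervals."""
    k = len(effects)
    w = [1.0 / (v + tau2) for v in variances]       # random-effects weights
    sum_w = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    se_wald = sqrt(1.0 / sum_w)
    # HKSJ scaling factor: weighted dispersion of the studies around mu.
    q_hk = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, effects)) / (k - 1)
    se_hksj = sqrt(q_hk) * se_wald
    z = NormalDist().inv_cdf(0.975)
    wald = (mu - z * se_wald, mu + z * se_wald)
    hksj = (mu - t_crit * se_hksj, mu + t_crit * se_hksj)
    return mu, wald, hksj, q_hk

# Hypothetical log risk ratios and within-study variances (k = 5).
effects = [-0.30, -0.10, -0.25, 0.05, -0.15]
variances = [0.01, 0.02, 0.04, 0.02, 0.01]
mu, wald, hksj, q_hk = random_effects_cis(effects, variances, tau2=0.01, t_crit=2.776)
print(round(mu, 3), [round(x, 3) for x in wald], [round(x, 3) for x in hksj])
```

In this example the scaling factor is below 1, so the HKSJ standard error is smaller than the Wald standard error, yet the HKSJ interval is still wider because the t quantile with 4 degrees of freedom exceeds 1.96. With a sufficiently small scaling factor the HKSJ interval could become narrower than the Wald interval, which is the situation that truncated variants are designed to prevent.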
A key point is that τ2 and I2 are not interchangeable. Two meta-analyses may both report I2 = 50%, meaning that approximately half of the observed variability is attributable to between-study heterogeneity. τ2 indicates whether this variability is small or large on the squared effect-size scale, while τ expresses the corresponding between-study standard deviation on the effect-size scale. Figure 3 illustrates this: both panels are conceptually constructed to represent similar proportional heterogeneity, yet in one, the tilt is subtle; in the other, dramatic. Thus, τ2, not I2, sets the scale of absolute between-study heterogeneity and strongly influences random-effects weights, the width of pooled confidence intervals, the span of prediction intervals, and the stability of meta-regression.
3. Limitations.
τ2 is less intuitive than Q or I2 because it is a variance estimated on the squared effect-size scale; its square root, τ, is expressed in the same units as the effect estimate and is often easier to interpret clinically. A τ2 value of 0.04 has different implications depending on the effect-size metric. On a log risk-ratio scale, it corresponds to τ = 0.20. Since risk ratios are analyzed on the logarithmic scale, τ must be converted back to the risk-ratio scale to understand the multiplicative spread of true effects across studies. By contrast, for a mean difference measured in kilograms, the same τ2 corresponds to τ = 0.20 kg, so the between-study standard deviation—not the variance itself—is interpreted directly in kilograms. Its meaning is relative to the chosen metric. Another limitation is its instability with few studies. When the forest has only a handful of trees, τ2 can swing wildly depending on the estimator—sometimes suggesting that trunks are almost perfectly aligned; other times that the woodland is chaotic. Finally, τ2 is often reported without confidence intervals. Yet its uncertainty can be large, especially in small meta-analyses, and ignoring it risks giving a false sense of certainty. In summary, τ2 is harder to interpret but plays a central structural role, as it measures the real size of heterogeneity and governs pooled results. Because τ2 underpins inference, prediction intervals (PIs) are the most direct clinical extension of it.
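The back-conversion for ratio measures can be sketched in a few lines. The example assumes that true effects are approximately normally distributed on the log scale and ignores uncertainty in both the pooled effect and τ2, so it is a rough descriptive aid rather than a prediction interval.

```python
from math import exp, sqrt

def ratio_spread(tau2, z=1.96):
    """Approximate multiplicative range covering 95% of true ratio effects."""
    tau = sqrt(tau2)
    return exp(-z * tau), exp(z * tau)

low, high = ratio_spread(0.04)           # tau = 0.20 on the log risk-ratio scale
print(round(low, 2), round(high, 2))     # 0.68 1.48
```

Under these assumptions, τ2 = 0.04 implies that true risk ratios in individual settings typically range from about 0.68 to about 1.48 times the average risk ratio, which is a clinically substantial spread despite the small-looking variance.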
  • Prediction Intervals: Translating τ2 into Clinical Uncertainty 
A confidence interval (CI) reflects the precision of the pooled effect, but says nothing about what a new study might show. PIs extend τ2 by translating between-study variance into a range of plausible effects for future settings [13,14,15]. Because they incorporate both the uncertainty around the pooled effect and the between-study heterogeneity, PIs are usually wider than CIs.
In forest terms, the CI indicates the precision of the average tilt, while the PI shows the range of tilts likely to be found in other forests. Clinically, this matters: a pooled risk ratio may appear beneficial, with its confidence interval entirely below 1, yet the PI can cross the no-effect line, warning that in some contexts the intervention may not work or could even harm. In clinical terms, the PI asks whether the effect is likely to remain clinically meaningful in a new but comparable setting.
  • Limitations.
PIs become unstable when the number of studies (k) is small (e.g., <10–15), often yielding misleading coverage. This stems from their reliance on τ2: when τ2 is unstable, PIs inherit the instability. A robust estimate of τ2 is therefore essential, since all downstream measures of heterogeneity depend on it.
In small meta-analyses, PIs are best viewed as a conceptual tool that shows the plausible range of effects, rather than as a statistically robust interval.

4. Integrating Q, I2, τ2, and Prediction Intervals

No single statistic can capture the full complexity of heterogeneity. Q tells us whether the variability across studies exceeds what chance alone would explain. I2 indicates the proportion of the observed scatter that is real rather than random. τ2 measures the between-study variance on the squared scale of the chosen effect size, whereas τ expresses the corresponding between-study standard deviation on the effect-size scale.
Of these, τ2 is the most structural parameter. It governs pooled confidence intervals, prediction intervals, and meta-regression. I2 is easy to report and widely recognized, but it is strongly influenced by within-study precision and unstable when only a few studies are available. Q provides a formal test, but it is driven by sample size and df rather than by the actual importance of heterogeneity.

5. Minimal Mathematical Expressions Needed to Interpret Heterogeneity Clinically

This tutorial is intended for clinicians and applied researchers, not for mathematical derivations of meta-analytic estimators. Nevertheless, a few formal expressions are useful because they explain why heterogeneity statistics may behave differently in practice. In particular, they clarify why Q is sensitive to both precision and the number of studies, why I2 is a relative rather than an absolute measure, why τ2 is scale-dependent, and why prediction intervals widen when between-study variance is large.
  • Cochran’s Q (1) 
$$Q = \sum_{i=1}^{k} w_i \left( y_i - \hat{\theta} \right)^2$$
Here, $y_i$ is the effect estimate from study $i$, $\hat{\theta}$ is the pooled (inverse-variance weighted) effect, and $w_i$ is the study weight, that is, the inverse of the study's within-study variance. Because larger and more precise studies receive greater weight, small differences between precise studies can produce a large Q. Clinically, Q should therefore be interpreted as a signal that variability exceeds sampling error, not as a measure of how important that variability is.
  • I2 (2) 
$$I^2 = \max\left\{ 0,\ \frac{Q - (k - 1)}{Q} \right\} \times 100\%$$
I2 is derived from Q and expresses the proportion of observed variability attributed to between-study differences. This formula explains why I2 may be high when studies are very precise: if sampling error is small, even modest between-study variability may represent a large proportion of the total variability. Conversely, imprecise studies may yield a lower I2 despite clinically relevant dispersion. I2 is therefore a relative measure, not a direct measure of the magnitude of heterogeneity.
Note: the expression above is the conventional Q-based formulation of I2. Some software implementations of random-effects meta-analysis, including REML-based output, may instead report I2 using a τ2-based formulation. The two approaches are closely related but may yield slightly different numerical values in finite samples.
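As a quick check of the Q-based formula, the sketch below applies it to the Q values and study counts reported for the three simulated datasets in the worked example of this tutorial. The function name is illustrative.

```python
def i2_from_q(q, k):
    """Q-based I2 (%), truncated at zero."""
    return max(0.0, (q - (k - 1)) / q) * 100.0

# (Q, k) pairs as reported for Supplementary Files S1-S3.
i2_s1 = i2_from_q(31.06, 10)
i2_s2 = i2_from_q(19.41, 10)
i2_s3 = i2_from_q(15.97, 5)
print(round(i2_s1, 2), round(i2_s2, 2), round(i2_s3, 2))  # 71.02 53.63 74.95
```

These Q-based values (71.02%, 53.63%, and 74.95%) are close to, but not identical with, the REML-based values reported by the software (70.14%, 53.92%, and 75.42%), which illustrates the small finite-sample discrepancy between the two formulations.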
  • Between-study variance in a random-effects model (3) 
$$\theta_i \sim N\left( \mu, \tau^2 \right)$$
In a random-effects model, each study is allowed to estimate its own true effect $\theta_i$, distributed around an average effect $\mu$. The parameter $\tau^2$ represents the between-study variance. Unlike I2, τ2 is a variance defined on the squared effect-size scale, whereas τ is expressed in the same units as the effect estimate. Its interpretation, therefore, depends on the metric used. A given τ2 value has different implications for log risk ratios, odds ratios, hazard ratios, mean differences, or standardized mean differences. Compared with I2, τ2 is therefore more directly related to absolute heterogeneity, but also less immediately intuitive and estimator-dependent.
  • Prediction interval (4) 
$$\hat{\mu} \pm t_{k-2,\,0.975} \sqrt{SE(\hat{\mu})^2 + \hat{\tau}^2}$$
This expression shows why PIs are usually wider than CIs: they incorporate both uncertainty around the average effect and between-study heterogeneity. When $\hat{\tau}^2$ is large, or when few studies make $\hat{\tau}^2$ unstable, the PI widens. Clinically, this interval is often the most interpretable expression of heterogeneity because it asks what effect might be expected in a future comparable setting.

6. Worked Quantitative Example: Divergence Between Proportional and Absolute Heterogeneity

To illustrate why I2 and τ2 should not be interpreted interchangeably, three simulated meta-analytic datasets were created from hypothetical comparisons between an intervention group and a control group, using event and non-event counts. These examples are not intended to represent a specific clinical intervention, but to show how heterogeneity statistics may behave under different combinations of within-study precision, between-study variability, and number of included studies. To keep the examples focused on heterogeneity rather than sparse-data methods, the simulated datasets deliberately avoided zero-event cells. Real meta-analyses may involve zero events, rare outcomes, imbalanced group sizes, or other complexities that require additional analytic choices, such as continuity corrections or alternative effect-size estimators.
Random-effects models were fitted using restricted maximum likelihood (REML). The primary analyses were performed using Stata 19.0 (StataCorp LLC, College Station, TX, USA) with the official meta suite. The simulated datasets are provided as Supplementary Files S1–S3, and the reproducible Stata and R code is provided as Supplementary File S4. The corresponding forest plots are shown in Figure 4. The main text reports the key numerical results and their interpretation, whereas the Supplementary Files provide the underlying datasets and code to ensure reproducibility.
In Supplementary File S1 (k = 10), the studies are relatively precise, and the absolute between-study variance is small. The random-effects REML model yielded τ2 = 0.0049, I2 = 70.14% (as reported by the REML-based software output), and Q = 31.06 (p < 0.01). The pooled risk ratio was 0.870 (95% CI, 0.825–0.916), whereas the 95% PI ranged from 0.731 to 1.034. This example shows that I2 may be high even when the absolute magnitude of heterogeneity, expressed by τ2 on the log risk-ratio scale, is small. In such a case, I2 is high partly because within-study sampling error is small, so even modest between-study dispersion represents a large proportion of the total observed variability.
In Supplementary File S2 (k = 10), the studies are less precise, and the dispersion of absolute effects is larger. Here, τ2 increased to 0.0773, whereas I2 was lower than in the previous example, at 53.92%, with Q = 19.41 (p = 0.02). The pooled risk ratio was 0.929 (95% CI, 0.732–1.178), and the 95% PI was much wider, ranging from 0.462 to 1.870. This illustrates the complementary problem: a lower I2 does not necessarily imply lower absolute heterogeneity or greater clinical consistency. Larger τ2 and wider prediction intervals indicate that true effects may vary substantially across settings, even though the proportion of observed variability attributed to heterogeneity, as measured by I2, is lower.
Supplementary File S3 (k = 5) illustrates the instability that arises when the number of studies is small. With only five studies, the REML model yielded τ2 = 0.1488, I2 = 75.42%, Q = 15.97 (p < 0.01), and a pooled risk ratio of 0.939 (95% CI, 0.635–1.390). The 95% PI was extremely wide, ranging from 0.236 to 3.745. This example reinforces that prediction intervals are highly sensitive to the estimated τ2 and become particularly fragile in small meta-analyses.
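The reported prediction intervals can be approximately recovered from the summary values alone by applying the prediction-interval expression on the log risk-ratio scale. The sketch below is not the code in Supplementary File S4. It assumes that the reported 95% CIs are Wald-type (so that the standard error can be back-calculated with 1.96) and takes the t quantiles for k − 2 degrees of freedom from a table: 2.306 for k = 10 and 3.182 for k = 5.

```python
from math import exp, log, sqrt

def prediction_interval_rr(rr, ci_low, ci_high, tau2, t_crit, z=1.96):
    """Approximate 95% PI for a risk ratio from reported summary values."""
    mu = log(rr)
    se = (log(ci_high) - log(ci_low)) / (2 * z)   # SE recovered from the 95% CI
    half_width = t_crit * sqrt(se ** 2 + tau2)    # on the log scale
    return exp(mu - half_width), exp(mu + half_width)

# Pooled RR, 95% CI limits, and tau2 as reported for each simulated dataset.
pi_s1 = prediction_interval_rr(0.870, 0.825, 0.916, 0.0049, t_crit=2.306)
pi_s2 = prediction_interval_rr(0.929, 0.732, 1.178, 0.0773, t_crit=2.306)
pi_s3 = prediction_interval_rr(0.939, 0.635, 1.390, 0.1488, t_crit=3.182)

for label, (low, high) in [("S1", pi_s1), ("S2", pi_s2), ("S3", pi_s3)]:
    print(label, round(low, 3), round(high, 3))
```

The recomputed limits agree with the reported intervals to within rounding of the inputs. The calculation also makes the source of the width visible: in S3, the larger τ2 and the larger t quantile (3 degrees of freedom instead of 8) both contribute to the very wide interval.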
Importantly, these examples do not imply that τ2 is a standalone or uncertainty-free measure. The estimated τ2 remains estimator-dependent, and its uncertainty may be substantial, particularly when the number of studies is small. For this reason, τ2 should be reported with the estimator used and, where available, with an uncertainty interval, such as a 95% CI.
Together, these examples show that Q, I2, τ2, and PIs answer different questions. Q provides a formal test of whether observed variability exceeds what would be expected by sampling error alone, but it is sensitive to study precision and the number of included studies. I2 describes the proportion of observed variability attributable to between-study differences, τ2 quantifies the absolute between-study variance, and the PI translates this variance into the expected range of effects in a future comparable setting. Therefore, heterogeneity should not be classified using Q or I2 alone. A transparent interpretation should jointly consider Q, I2, τ2, the effect-size scale, the estimator used for τ2, the number of studies, and the width and clinical implications of the PI.

7. Operational Interpretation: Combining Q, I2, τ2, and Prediction Intervals

In practice, Q, I2, τ2, and PIs should be interpreted sequentially rather than as competing statistics. Cochran’s Q should be used as an initial signal that observed variability may exceed what would be expected by sampling error alone, but it should not determine model choice or pooling decisions by itself because it has low power with few studies and excessive sensitivity with large, precise evidence bases. I2 should then be used to describe the proportion of observed variability attributable to between-study differences, while remembering that it is not a measure of the absolute magnitude of heterogeneity. A high I2 may reflect small but precisely estimated differences, whereas a lower I2 may still coexist with clinically important dispersion.
τ2 should be examined next because it quantifies the absolute between-study variance on the squared effect-size scale. Its interpretation should always consider the metric used: the same τ2 value has different implications for log risk ratios, odds ratios, hazard ratios, mean differences, or standardized mean differences. For ratio measures, translating τ2 into τ and considering the approximate spread of plausible true effects on the exponentiated scale may help clinicians understand whether heterogeneity is likely to be clinically meaningful. However, τ2 should not be interpreted without considering the estimator used, the number of studies, and the uncertainty around the estimate.
Finally, the PI should be used to assess the practical consequences of heterogeneity. If the pooled effect is statistically significant but the prediction interval crosses the null, the average effect may still be beneficial, but the expected effect in a new setting is uncertain. If the PI includes both clinically important benefit and clinically important harm, conclusions should be cautious, and the certainty or applicability of the evidence may need to be downgraded. Conversely, when the pooled estimate and PI point in the same clinically relevant direction, heterogeneity is less likely to undermine decision-making, although its sources should still be explored.
A pragmatic clinical reading is therefore to use Q as a warning signal, I2 as a relative descriptor, τ2 as the scale-dependent magnitude of between-study variance, and the PI as the clinically interpretable expression of what may happen in a future comparable setting. Pooling is most defensible when studies are clinically and methodologically compatible, between-study dispersion is not large on the chosen effect-size metric, and the prediction interval does not alter the clinical conclusion. Pooling should be interpreted cautiously—or avoided—when heterogeneity is driven by bias, incompatible populations or interventions, unexplained design differences, or prediction intervals that include materially different clinical conclusions.

8. Heterogeneity in Diagnostic Test Accuracy Meta-Analysis

In therapeutic meta-analysis, heterogeneity is typically unidimensional, as all studies estimate the same type of effect. Diagnostic test accuracy (DTA) meta-analysis, by contrast, involves two outcomes, sensitivity and specificity, that must be analyzed jointly, since the diagnostic threshold links them. Raising the threshold typically increases specificity but decreases sensitivity, and lowering it does the opposite. Because of this correlation, pooling sensitivity and specificity separately with univariate methods is misleading. Instead, hierarchical random-effects models (the bivariate model and the HSROC model) are required [16,17,18].
1. Interpretation.
In DTA meta-analysis, heterogeneity is described by variance components for sensitivity and specificity, usually on the logit scale (a transformation that converts probabilities bounded between 0 and 1 into an unbounded scale suitable for modeling), together with the correlation between them. This correlation is clinically important because diagnostic thresholds often move sensitivity and specificity in opposite directions: a stricter threshold may increase specificity while reducing sensitivity, whereas a more permissive threshold may do the reverse. Unlike intervention reviews, where Q, I2, or τ2 usually summarizes a single outcome, diagnostic data are inherently bivariate. Applying univariate statistics, such as Q or I2, separately to sensitivity and specificity ignores this coupled structure and may misrepresent the actual pattern of heterogeneity.
A practical reporting recommendation is therefore to avoid using univariate I2 values as primary summaries of DTA heterogeneity for sensitivity and specificity. For most DTA meta-analyses, reviewers should instead report the variance components for logit sensitivity and specificity, the correlation parameter when available, and a graphical 95% prediction region in ROC space. This allows readers to see not only the average diagnostic performance but also how much diagnostic accuracy may vary across future settings. Bivariate I2 has been proposed [19], but its interpretation is less standardized, and it is rarely implemented as a routine decision-making tool. Similarly, the area of the 95% prediction ellipse and Median Odds Ratios (MORs), which express between-study heterogeneity on an odds-ratio-like scale, may provide additional information, but they should be considered complementary rather than definitive summaries. The most robust strategy is to interpret the variance components, correlation structure, and prediction region from a bivariate or HSROC model, while explicitly considering threshold effects, population spectrum, disease prevalence, reference standard quality, and risk of bias.
  2. Limitations.
Quantifying heterogeneity in DTA is inherently more complex than in intervention reviews. Even when hierarchical models are used, estimates of τ2 (Se), τ2 (Sp), and the threshold correlation become unstable in few-study settings, resulting in wide or imprecise variance estimates. This makes heterogeneity harder to measure, yet also more crucial, since threshold-driven differences are often a major source of heterogeneity in diagnostic accuracy research [20].

9. Heterogeneity in Prognostic Meta-Analysis

Prognostic reviews differ fundamentally from intervention or diagnostic ones. Their aims vary: some estimate overall prognosis (e.g., survival at fixed time points), others test a single factor (e.g., the hazard ratio for a biomarker), and others assess or validate multivariable models. Unlike therapeutic or diagnostic settings, prognostic outcomes are usually time-to-event, involve censoring, and depend on case-mix and follow-up.
  1. Interpretation.
In prognostic meta-analysis, heterogeneity requires distinct interpretation because variability often reflects differences in baseline risk, case mix, follow-up duration, outcome definition, predictor measurement, and modeling strategy. For single prognostic factors, random-effects pooling of hazard ratios or odds ratios is common, with τ2 quantifying between-study variance on the squared scale of the chosen effect-size metric (for ratio measures, the log scale on which pooling is performed). However, the same τ2 value may have different implications depending on whether the outcome is rare or common, whether follow-up is short or long, and whether covariate adjustment differs across studies.
For prediction model reviews, heterogeneity should not be reduced to pooled discrimination alone. Discrimination measures such as the c-statistic or AUC describe ranking ability, but they do not indicate whether predicted risks are accurate in absolute terms. Calibration measures, including calibration slope, calibration-in-the-large, observed-to-expected ratios, and calibration plots, are essential for assessing transportability. A model may retain acceptable discrimination in a new population while being poorly calibrated because baseline risk, predictor effects, or outcome incidence differ. Therefore, when prognostic models are synthesized, heterogeneity should be interpreted jointly across discrimination, calibration, case mix, and model specification. Multivariate meta-analysis may be useful when discrimination and calibration are synthesized together, although such approaches require sufficient data and careful interpretation.
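The distinction between discrimination and calibration can be made concrete with a small sketch. It computes a pairwise c-statistic and an observed-to-expected (O:E) ratio for a hypothetical validation cohort, then halves every predicted risk: the ranking, and therefore the c-statistic, is unchanged, while the O:E ratio doubles. All data and function names are illustrative assumptions, and the c-statistic here is for a binary outcome without censoring.

```python
from itertools import product


def c_statistic(risks, outcomes):
    """Share of event/non-event pairs in which the event has the higher risk.

    Ties in predicted risk count as half a concordant pair.
    """
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(risks, outcomes) if y == 0]
    score = 0.0
    for r_event, r_non in product(events, non_events):
        if r_event > r_non:
            score += 1.0
        elif r_event == r_non:
            score += 0.5
    return score / (len(events) * len(non_events))


def observed_expected_ratio(risks, outcomes):
    """Observed number of events divided by the sum of predicted risks."""
    return sum(outcomes) / sum(risks)


# Hypothetical validation cohort (illustrative numbers only).
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
risks_original = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
# Same ranking of patients, but every predicted risk is halved.
risks_halved = [r / 2 for r in risks_original]

for label, risks in [("original", risks_original), ("halved", risks_halved)]:
    c = c_statistic(risks, outcomes)
    oe = observed_expected_ratio(risks, outcomes)
    print(f"{label}: c-statistic = {c:.3f}, O:E = {oe:.2f}")
```

Because the two summaries respond to different features of the predictions, pooling only c-statistics across validation studies could hide substantial between-study variation in calibration.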
  2. Limitations.
Quantifying heterogeneity in prognostic reviews is difficult. With few studies, estimates of variance in hazard ratios, odds ratios, c-statistics, or calibration measures are unstable, giving imprecise prediction intervals and uncertain transportability assessments. Heterogeneity often reflects clinical and methodological diversity—baseline risk, predictor definitions, outcome timing, adjustment sets, modeling choices, and follow-up—rather than sampling error alone. This makes heterogeneity both harder to measure and more clinically consequential, because prognostic evidence may perform differently across risk structures. Transparent reporting of τ2, prediction intervals, case-mix variation, discrimination, and calibration is therefore essential for trustworthy prognostic meta-analysis.

10. Beyond Heterogeneity: Inconsistency in Network Meta-Analysis

In network meta-analysis (NMA), heterogeneity coexists with inconsistency—a distinct but related concept. Heterogeneity refers to variability within each direct comparison, whereas inconsistency refers to disagreement between direct and indirect estimates. A network may have substantial heterogeneity without clear inconsistency, or inconsistency despite apparently modest within-comparison heterogeneity. Both problems are related to the transitivity assumption: treatment effects are comparable only if effect modifiers are sufficiently similar across trials. When this assumption fails, two types of inconsistency may arise: loop inconsistency, in which closed loops yield conflicting estimates, and design inconsistency, in which effects differ across comparator sets. Distinguishing and reporting both heterogeneity and inconsistency is essential for a trustworthy NMA.
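For a single closed loop of three treatments (A, B, C), the direct-versus-indirect comparison can be sketched with the adjusted indirect comparison often attributed to Bucher: on an additive scale such as the log odds ratio, the indirect A vs. C effect is the sum of the A vs. B and B vs. C effects, and the variances add. The sketch assumes that the direct and indirect estimates are independent (no multi-arm trials in the loop). All numbers and function names are hypothetical.

```python
import math
from statistics import NormalDist


def indirect_estimate(d_ab, se_ab, d_bc, se_bc):
    """Indirect A vs. C effect from A vs. B and B vs. C (additive scale)."""
    d_ac = d_ab + d_bc
    se_ac = math.sqrt(se_ab**2 + se_bc**2)
    return d_ac, se_ac


def inconsistency_factor(d_direct, se_direct, d_indirect, se_indirect):
    """Difference between direct and indirect estimates, with a z-test."""
    diff = d_direct - d_indirect
    se_diff = math.sqrt(se_direct**2 + se_indirect**2)
    z = diff / se_diff
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return diff, se_diff, z, p_value


# Hypothetical log odds ratios and standard errors (illustrative only).
d_ind, se_ind = indirect_estimate(d_ab=-0.30, se_ab=0.10, d_bc=-0.20, se_bc=0.12)
diff, se_diff, z, p_value = inconsistency_factor(
    d_direct=-0.10, se_direct=0.15, d_indirect=d_ind, se_indirect=se_ind
)
print(f"indirect A vs. C: {d_ind:.2f} (SE {se_ind:.3f})")
print(f"direct minus indirect: {diff:.2f} (SE {se_diff:.3f}), z = {z:.2f}, p = {p_value:.3f}")
```

Such loop-specific tests tend to have low power, so a non-significant result does not by itself establish consistency; the clinical plausibility of transitivity still has to be judged.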

11. Advanced Extensions: When Conventional Random-Effects Models Are Not Enough

Conventional random-effects meta-analysis is often sufficient for introductory interpretation of heterogeneity in pairwise therapeutic reviews, but several settings require more advanced models. Bayesian hierarchical models may be useful when the number of studies is small, when external information can be incorporated through transparent priors, or when direct probability statements about τ2, prediction intervals, or clinically relevant thresholds are desired. Their advantage is flexibility and interpretability, but results may be sensitive to prior specification and should therefore be reported transparently.
Multilevel meta-analysis is relevant when effect sizes are not independent. This occurs, for example, when a review extracts multiple outcomes, time points, subgroups, intervention arms, or correlated comparisons from the same study. Treating such effects as independent may underestimate uncertainty and overstate precision. Multilevel models allow variability to be partitioned across levels, such as within-study and between-study components.
Robust variance estimation offers another approach when effect sizes are dependent and the exact covariance structure is unknown. It can provide valid standard errors under certain conditions without requiring a full specification of all correlations among effect sizes. However, it also requires an adequate number of studies or clusters and should not be treated as a universal correction for sparse or poorly structured evidence.
These approaches are important extensions, but they do not replace the need to interpret heterogeneity clinically. Regardless of the model used, reviewers must still ask whether studies are clinically compatible, whether variability reflects legitimate diversity or bias, whether between-study dispersion is clinically meaningful on the chosen effect-size metric, and whether the prediction interval changes the clinical conclusion.
Table 1 summarizes the main statistics used to assess heterogeneity, highlighting their interpretation, strengths, and limitations. Table 2 summarizes heterogeneity measures in diagnostic test accuracy meta-analysis.

12. Practical Management of Heterogeneity

Detecting heterogeneity is only the start; the challenge is handling it. Statistics like Q, I2, or τ2 show the forest is uneven, but not why. To advance, three questions matter: How much variability? Where from? What does it mean? These frame interpretation across therapeutic, diagnostic, and prognostic reviews, where the task is always to quantify variability, trace its sources, and judge its implications for practice.
  • Distinguishing Heterogeneity, Bias, and Inconsistency 
Before deciding whether to pool results, three related but distinct concepts should be separated. Heterogeneity refers to genuine variability in true effects across studies, usually arising from differences in populations, interventions, comparators, outcomes, follow-up, or settings. In this situation, studies may all be internally credible, but the underlying effects differ because they are not estimating precisely the same clinical reality.
Bias refers to systematic error within or across studies. It may arise from flaws in randomization, confounding, selective reporting, missing data, measurement error, or other design and conduct problems. Unlike heterogeneity arising from legitimate clinical or methodological diversity, bias does not reflect a meaningful distribution of true effects. It threatens validity. Pooling biased studies may therefore produce a precise but misleading summary estimate.
Inconsistency refers to disagreement between bodies of evidence that should be coherent under the assumptions of the review. In pairwise meta-analysis (the conventional comparison of two interventions or exposures at a time), this often appears as effect estimates pointing in materially different directions without a plausible explanation. In network meta-analysis, inconsistency has a more specific meaning: disagreement between direct and indirect evidence, usually because the transitivity assumption is not credible. Thus, heterogeneity concerns variability within an evidence set, bias concerns systematic error, and inconsistency concerns lack of coherence between estimates or evidence pathways.
A practical decision algorithm can be summarized as follows. First, assess whether the studies address a sufficiently similar clinical question. If populations, interventions, outcomes, or designs are incompatible, pooling should generally be avoided regardless of the numerical heterogeneity statistics. Second, assess risk of bias. If the variability appears to be mainly driven by studies at high risk of bias, the primary analysis should avoid uncritical pooling and consider sensitivity analyses that exclude or separate those studies. Third, if studies are clinically compatible and not dominated by major bias, quantify heterogeneity using Q, I2, τ2, and prediction intervals. Fourth, judge whether the prediction interval changes the clinical conclusion. If the pooled effect suggests benefit but the prediction interval includes no effect or clinically important harm, pooling may still be statistically possible, but the conclusion should be cautious and explicitly framed as context-dependent. Finally, if heterogeneity remains unexplained and materially affects the clinical conclusion, the review should prioritize exploration, stratified synthesis, or narrative interpretation rather than a single pooled estimate.
In short, pooling is most defensible when studies are clinically coherent, the risk of bias is not driving the dispersion, between-study dispersion is clinically acceptable on the chosen effect-size metric, and the prediction interval does not imply a materially different clinical conclusion. Pooling should be avoided or strongly qualified when studies are clinically incompatible, when bias is the likely source of variation, or when the prediction interval spans conclusions that would lead to different clinical decisions.
  • A practical algorithm for interpreting heterogeneity and deciding whether to pool 
  1. Check clinical and methodological compatibility. If populations, interventions, comparators, outcomes, follow-up periods, or study designs are incompatible, avoid pooling or use separate syntheses.
  2. Assess risk of bias. If dispersion is mainly driven by studies at high risk of bias, avoid uncritical pooling and perform sensitivity analyses.
  3. Use Q as an initial signal. A significant Q suggests variability beyond chance, but Q should not determine pooling on its own.
  4. Use I2 as a relative descriptor. I2 describes the proportion of variability attributable to between-study differences, not the magnitude of heterogeneity.
  5. Use τ2 to judge absolute heterogeneity. Interpret τ2 on the squared effect-size scale; when clinical interpretation is needed, translate it to τ on the effect-size scale and report the estimator used.
  6. Use the prediction interval to assess clinical implications. If the prediction interval spans materially different clinical conclusions, interpret the pooled estimate cautiously and consider downgrading certainty.
  7. Decide on synthesis. Pooling is reasonable if studies are coherent, bias is not driving the results, τ2 is clinically acceptable, and the prediction interval does not change the clinical conclusion. Pooling should be avoided or qualified if these conditions are not met.
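The quantitative steps of this algorithm (Q, I2, τ2, and the prediction interval) can be sketched in a few lines. The example below is a minimal illustration that assumes the DerSimonian–Laird estimator, a Wald-type confidence interval, and a prediction interval based on a t distribution with k − 2 degrees of freedom. To stay dependency-free, the t critical value is supplied by the caller. The data are hypothetical log risk ratios with equal variances, chosen only so the arithmetic is easy to follow; other estimators and interval methods would give different results.

```python
import math


def dersimonian_laird(effects, variances, t_crit, z_crit=1.96):
    """Q, I2, DerSimonian-Laird tau2, pooled effect, CI, and prediction interval.

    t_crit is the two-sided critical value of the t distribution with
    k - 2 degrees of freedom; it is passed in by the caller (an assumption
    made to keep this sketch free of external libraries).
    """
    k = len(effects)
    if k < 3:
        raise ValueError("a prediction interval needs at least 3 studies")

    # Fixed-effect (inverse-variance) weights and Cochran's Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1

    # Method-of-moments estimate of tau2, truncated at zero.
    c = sum_w - sum(wi**2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0

    # Random-effects pooled estimate and its standard error.
    w_star = [1.0 / (v + tau2) for v in variances]
    mu = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se_mu = math.sqrt(1.0 / sum(w_star))

    pi_half = t_crit * math.sqrt(tau2 + se_mu**2)
    return {
        "q": q, "df": df, "i2": i2, "tau2": tau2, "tau": math.sqrt(tau2),
        "mu": mu, "se_mu": se_mu,
        "ci": (mu - z_crit * se_mu, mu + z_crit * se_mu),
        "pi": (mu - pi_half, mu + pi_half),
    }


# Hypothetical log risk ratios from five studies (illustrative only).
effects = [-0.7, -0.5, -0.3, -0.1, -0.4]
variances = [0.04] * 5
result = dersimonian_laird(effects, variances, t_crit=3.182)  # t(0.975, df = 3)

print(f"Q = {result['q']:.2f} on {result['df']} df, I2 = {100 * result['i2']:.0f}%")
print(f"tau2 = {result['tau2']:.3f}, tau = {result['tau']:.3f}")
print(f"pooled = {result['mu']:.2f}, 95% CI = {result['ci'][0]:.2f} to {result['ci'][1]:.2f}")
print(f"95% PI = {result['pi'][0]:.2f} to {result['pi'][1]:.2f}")
```

With these hypothetical inputs, I2 is only 20% and the confidence interval excludes the null, yet the prediction interval (about −0.85 to 0.05) crosses it. This is the situation in which step 6 calls for a cautious, context-dependent conclusion.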
  • How much? 
The magnitude and implications of heterogeneity should be judged using the complementary framework summarized above: Q as an initial signal, I2 as a relative descriptor, τ2/τ as measures of absolute between-study dispersion, and the prediction interval as the clinically interpretable range of effects expected in future comparable settings.
  • Where from? 
  1. Subgroup analyses
Splitting studies into categories tests whether effects differ systematically—for example, by risk of bias, design, population, or outcome definition. In the forest metaphor, it is like comparing slopes, clearings, or groves to see if trees lean more in one setting than another. Credibility depends on prespecification, plausibility, consistency, and magnitude—that is, whether the subgroup was planned in advance, clinically credible, reproducible across related analyses, and large enough to matter [21].
  2. Leave-one-out checks
This method re-runs the analysis while omitting one study at a time [22,23]. If the forest skyline remains largely unchanged, the overall conclusion is less dependent on any single study; if removing one tree reshapes the canopy, that study may be influential. Leave-one-out analysis helps identify such cases, but it can exaggerate noise and should be interpreted as a sensitivity analysis.
  3. Meta-regression
When multiple factors may explain heterogeneity, meta-regression relates study-level variables (e.g., design, age, quality) to effect size [24]. It is like putting on colored lenses that reveal different leaning patterns. However, when the number of studies is small, meta-regression is usually underpowered and may produce spurious or unstable findings. A pragmatic rule is to include at least 10 studies per covariate, but this threshold is only a rough guide, not an absolute requirement. Supplementary File S5 provides an overview of the main analytical approaches available to explore heterogeneity in meta-analysis.
  4. Visual tools
Plots reveal patterns at a glance. Forest plots can suggest heterogeneity when point estimates are widely scattered or CIs show limited overlap, but they do not by themselves establish inconsistency. Baujat plots highlight studies driving heterogeneity [25], like oversized trees skewing the grove. Galbraith (radial) plots show departures from trend [26], and L’Abbé plots reveal scatter in event rates [27].
Funnel plots display study size or precision against effect, forming an inverted funnel when balanced. Large studies cluster near the pooled effect; smaller ones scatter. Distortion may signal small-study effects (publication bias, selective reporting, or true differences). Funnel plots usually require ≥10 studies, and interpretation is subjective. Statistical tests—Egger’s, Begg’s, or Deeks’ for DTA [28,29,30]—can complement, but none are definitive; asymmetry must be judged in context. Supplementary File S6 summarizes the main visual tools for heterogeneity.
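Of the exploratory tools above, the leave-one-out check is the simplest to sketch in code. The example below refits a random-effects model with each study omitted in turn and records how the pooled effect and τ2 change. It assumes a DerSimonian–Laird fit, and the data are hypothetical, with one deliberately discordant study; the function names are illustrative.

```python
def pool_dl(effects, variances):
    """Return (pooled effect, tau2) from a DerSimonian-Laird random-effects fit."""
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi**2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    w_star = [1.0 / (v + tau2) for v in variances]
    mu = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return mu, tau2


def leave_one_out(effects, variances):
    """Refit the model k times, omitting one study each time."""
    results = []
    for i in range(len(effects)):
        kept_effects = effects[:i] + effects[i + 1:]
        kept_variances = variances[:i] + variances[i + 1:]
        mu, tau2 = pool_dl(kept_effects, kept_variances)
        results.append((i, mu, tau2))
    return results


# Hypothetical effects; the last study is the discordant "tree".
effects = [-0.4, -0.4, -0.4, -0.4, 0.8]
variances = [0.04] * 5

mu_all, tau2_all = pool_dl(effects, variances)
print(f"all studies: pooled = {mu_all:.2f}, tau2 = {tau2_all:.3f}")
loo = leave_one_out(effects, variances)
for index, mu, tau2 in loo:
    print(f"omitting study {index + 1}: pooled = {mu:.2f}, tau2 = {tau2:.3f}")
```

In this constructed example, omitting the fifth study removes the estimated heterogeneity entirely, whereas omitting any other study changes little. This flags the fifth study as influential, but it does not show that the study is wrong.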
  • What does it mean? 
Finding heterogeneity is not the endpoint; the key question is whether it reflects legitimate clinical diversity, methodological differences, bias, or lack of coherence. If variability arises from credible differences across populations, interventions, outcomes, or settings, pooling may remain reasonable, but the result should be interpreted as an average effect across contexts. If variability is driven by bias, incompatible designs, or unexplained contradictions that change the clinical conclusion, a single pooled estimate may be misleading, and stratified synthesis or narrative interpretation may be more appropriate.

13. Prediction Intervals, Clinical Interpretation, and Certainty of Evidence

A wide prediction interval should not be interpreted simply as a statistical inconvenience. It is a clinically important expression of uncertainty about transportability. When the pooled effect is statistically significant but the prediction interval crosses the null, the average effect may still suggest benefit or harm, but the expected effect in a future comparable setting is uncertain. In practical terms, this means that the intervention may not reproduce the average pooled effect in all clinical contexts.
The key question is not only whether the prediction interval crosses the null, but whether it crosses clinically meaningful decision thresholds. A prediction interval that narrowly crosses the null may indicate limited uncertainty around the direction of effect, whereas a prediction interval spanning important benefit and important harm should substantially reduce confidence in the applicability of the pooled estimate. In the latter situation, clinicians should avoid presenting the pooled estimate as a universally expected effect and should instead frame conclusions as context-dependent.
This interpretation aligns with evidence-certainty frameworks such as GRADE [31]. In GRADE, inconsistency is assessed by considering variability in point estimates, overlap of confidence intervals, statistical heterogeneity, and whether differences across studies affect clinical conclusions [32]. Prediction intervals can make this judgment more explicit: if the prediction interval includes effects that would lead to different clinical decisions across settings, this supports downgrading certainty for inconsistency and/or indirectness, depending on whether the variability reflects unexplained heterogeneity or differences in populations, interventions, comparators, outcomes, or settings.
Therefore, a clinically useful interpretation of prediction intervals can be summarized as follows. If both the pooled estimate and the prediction interval support the same clinically relevant conclusion, heterogeneity is less likely to undermine decision-making. If the pooled estimate indicates a benefit but the prediction interval includes no effect, the conclusion should be cautious and framed as an average effect that may not generalize across settings. If the prediction interval includes both clinically important benefit and clinically important harm, the certainty and applicability of the evidence are substantially weakened, and subgroup analyses, sensitivity analyses, or narrative synthesis may be more appropriate than a single unqualified pooled conclusion.
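The three situations described above can be written as a small decision helper. This is a simplified sketch: it assumes an additive effect scale on which negative values indicate benefit (for example, a log risk ratio for an adverse outcome), and it assumes that the reviewer has prespecified thresholds for clinically important benefit and harm. The function name, labels, and thresholds are hypothetical.

```python
def classify_prediction_interval(pi_low, pi_high, benefit_threshold,
                                 harm_threshold, null=0.0):
    """Classify a prediction interval against clinical decision thresholds.

    benefit_threshold: effects at or below this value count as important benefit.
    harm_threshold: effects at or above this value count as important harm.
    """
    if pi_low <= benefit_threshold and pi_high >= harm_threshold:
        # Important benefit and important harm are both plausible.
        return "spans_benefit_and_harm"
    if pi_low < null < pi_high:
        # Direction of effect is uncertain in a future setting.
        return "includes_no_effect"
    # Whole interval lies on one side of the null.
    return "same_direction"


# Hypothetical intervals on the log risk ratio scale (illustrative only).
thresholds = {"benefit_threshold": -0.20, "harm_threshold": 0.20}
print(classify_prediction_interval(-0.60, -0.10, **thresholds))
print(classify_prediction_interval(-0.85, 0.05, **thresholds))
print(classify_prediction_interval(-0.90, 0.40, **thresholds))
```

The labels only structure the judgment. Whether a given category justifies downgrading certainty still depends on the clinical context and on how the thresholds were chosen.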

14. Transparency, Reproducibility, and Caution

Meta-analysis is a tool for clinical reasoning, not simply numerical synthesis. A polished pooled estimate with a narrow CI may look convincing, but concealing heterogeneity misleads. Transparency means documenting every analytic choice. Reproducibility means the same data and code yield the same results. Caution means recognizing heterogeneity as the rule, not the exception: sometimes acceptable, sometimes undermining trust, sometimes precluding pooling.
Supplementary File S7 lists reporting practices that enhance transparency and reproducibility. Supplementary File S8 lists common pitfalls (‘don’ts’) in reporting and interpreting heterogeneity.

15. Conclusions

This tutorial used the forest metaphor, formal expressions, and worked examples to clarify how heterogeneity should be interpreted in meta-analysis. Q, I2, τ2, and prediction intervals answer different questions: Q signals departure from homogeneity, I2 describes the relative proportion of observed variability attributable to between-study differences, τ2 estimates the absolute between-study variance, and prediction intervals translate that variance into expected effects in future comparable settings.
Heterogeneity is not merely a statistical nuisance. It may reflect legitimate clinical diversity, methodological differences, bias, or lack of coherence across evidence sources. Its interpretation, therefore, requires domain-appropriate models, transparent reporting, reproducible analyses, and clinical judgment. Pooling is most defensible when studies are clinically coherent, bias is not driving the dispersion, and the prediction interval does not alter the clinical conclusion. When these conditions are not met, cautious qualification, stratified synthesis, or narrative interpretation may be more trustworthy than a single pooled estimate.

16. Glossary of Key Terms

Heterogeneity: Genuine variability in true effects across studies, beyond what would be expected from sampling error alone. It may arise from differences in populations, interventions, outcomes, follow-up, settings, or methods.
Sampling error: Random variation caused by studying a sample rather than the entire target population. It explains why study estimates differ even when they are estimating the same true effect.
Bias: Systematic error caused by flaws in study design, conduct, analysis, reporting, or interpretation. Unlike legitimate heterogeneity, bias threatens validity rather than reflecting meaningful diversity.
Inconsistency: Lack of coherence between estimates that should be compatible. In network meta-analysis, it specifically refers to disagreement between direct and indirect evidence.
Cochran’s Q: A test statistic used to assess whether observed variability between study estimates exceeds what would be expected by chance. Its power depends on the number and precision of studies.
I2: The proportion of observed variability attributable to between-study heterogeneity rather than sampling error. It is a relative measure and does not quantify the absolute magnitude of heterogeneity.
τ2: The between-study variance in a random-effects meta-analysis. It describes the absolute variance of true effects on the squared effect-size scale and depends on the estimator used.
τ: The square root of τ2. It is the between-study standard deviation and is expressed on the same scale as the effect estimate, making it more directly interpretable than τ2.
Confidence interval: A range expressing the uncertainty around the pooled average effect. It describes the precision of the summary estimate, not the range of effects expected across settings.
Prediction interval: The expected range of effects in a future comparable study or setting. It incorporates between-study heterogeneity and is often more clinically informative than the confidence interval in random-effects meta-analysis.
Random-effects model: A meta-analytic model assuming that the true effect may vary across studies. It estimates an average effect and the between-study variance.
Transportability: The extent to which evidence from included studies can be applied to a new population, setting, intervention context, or clinical decision.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedinformatics6030035/s1. Supplementary File S1. Simulated dataset illustrating a meta-analysis with relatively precise studies and low absolute between-study heterogeneity. Supplementary File S2. Simulated dataset illustrating a meta-analysis with less precise studies and larger absolute between-study heterogeneity. Supplementary File S3. Simulated dataset illustrating a small-k meta-analysis and the instability of heterogeneity estimates and prediction intervals. Supplementary File S4. Reproducible Stata and R code for the simulated meta-analyses presented in Supplementary Files S1–S3. Supplementary File S5. Analytical approaches to exploring heterogeneity in meta-analysis. Supplementary File S6. Visual approaches to exploring heterogeneity in meta-analysis. Supplementary File S7. Good Practices for Reporting and Interpreting Heterogeneity in Meta-Analysis. Supplementary File S8. Common Pitfalls (“Don’ts”) in Reporting and Interpreting Heterogeneity.

Funding

This research received no external funding.

Institutional Review Board Statement

This study did not involve human subjects or animals. As only simulated data were used, no ethical approval or informed consent was required.

Informed Consent Statement

Not applicable.

Data Availability Statement

The simulated datasets used for the worked quantitative examples are provided as Supplementary Files S1–S3. The reproducible Stata and R code used to analyze these datasets is provided as Supplementary File S4. No patient-level, clinical, or human-subject data were used.

Acknowledgments

Artificial intelligence (ChatGPT 5, OpenAI) was used for language editing and to assist in drafting the structure of the simulated examples. All simulated datasets, statistical code, analyses, numerical results, interpretation, and final manuscript content were reviewed and verified by the author, who retains full responsibility for the work.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Cochran, W.G. The Combination of Estimates from Different Experiments. Biometrics 1954, 10, 101–129.
  2. Mood, A.M.; Graybill, F.A.; Boes, D.C. Introduction to the Theory of Statistics, 3rd ed.; McGraw-Hill: New York, NY, USA, 1974.
  3. Casella, G.; Berger, R.L. Statistical Inference, 2nd ed.; Duxbury Press: Boston, MA, USA, 2002.
  4. Hoaglin, D.C. Misunderstandings about Q and ‘Cochran’s Q test’ in meta-analysis. Stat. Med. 2016, 35, 485–495.
  5. Higgins, J.P.T.; Thompson, S.G. Quantifying heterogeneity in a meta-analysis. Stat. Med. 2002, 21, 1539–1558.
  6. Higgins, J.P.T.; Thompson, S.G.; Deeks, J.J.; Altman, D.G. Measuring inconsistency in meta-analyses. BMJ 2003, 327, 557–560.
  7. Higgins, J.P.T.; Thomas, J.; Chandler, J.; Cumpston, M.; Li, T.; Page, M.J.; Welch, V.A. (Eds.) Cochrane Handbook for Systematic Reviews of Interventions; Version 6.5; Cochrane: London, UK, 2024; Available online: www.cochrane.org/handbook (accessed on 15 October 2025).
  8. von Hippel, P.T. The heterogeneity statistic I2 can be biased in small meta-analyses. BMC Med. Res. Methodol. 2015, 15, 35.
  9. DerSimonian, R.; Laird, N. Meta-analysis in clinical trials. Control Clin. Trials 1986, 7, 177–188.
  10. Viechtbauer, W. Bias and Efficiency of Meta-Analytic Variance Estimators in the Random-Effects Model. J. Educ. Behav. Stat. 2005, 30, 261–293.
  11. Veroniki, A.A.; Jackson, D.; Viechtbauer, W.; Bender, R.; Bowden, J.; Knapp, G.; Kuss, O.; Higgins, J.P.; Langan, D.; Salanti, G. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Res. Synth. Methods 2016, 7, 55–79.
  12. Higgins, J.P.T.; Thompson, S.G.; Spiegelhalter, D.J. A re-evaluation of random-effects meta-analysis. J. R. Stat. Soc. Ser. A Stat. Soc. 2009, 172, 137–159.
  13. Riley, R.D.; Higgins, J.P.T.; Deeks, J.J. Interpretation of random effects meta-analyses. BMJ 2011, 342, d549.
  14. Nagashima, K.; Noma, H.; Furukawa, T.A. Prediction intervals for random-effects meta-analysis: A confidence distribution approach. Stat. Methods Med. Res. 2019, 28, 1689–1702.
  15. IntHout, J.; Ioannidis, J.P.A.; Rovers, M.M.; Goeman, J.J. Plea for routinely presenting prediction intervals in meta-analysis. BMJ Open 2016, 6, e010247.
  16. Reitsma, J.B.; Glas, A.S.; Rutjes, A.W.; Scholten, R.J.; Bossuyt, P.M.; Zwinderman, A.H. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J. Clin. Epidemiol. 2005, 58, 982–990.
  17. Rutter, C.M.; Gatsonis, C.A. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat. Med. 2001, 20, 2865–2884.
  18. Harbord, R.M.; Deeks, J.J.; Egger, M.; Whiting, P.; Sterne, J.A.C. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007, 8, 239–251, Erratum in Biostatistics 2008, 9, 779.
  19. Zhou, Y.; Dendukuri, N. Statistics for quantifying heterogeneity in univariate and bivariate meta-analyses of binary data: The case of meta-analyses of diagnostic accuracy. Stat. Med. 2014, 33, 2701–2717.
  20. Deeks, J.J.; Bossuyt, P.M.; Leeflang, M.M.; Takwoingi, Y. (Eds.) Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy; Version 2.0; Cochrane: London, UK, 2023; Available online: https://training.cochrane.org/handbook-diagnostic-test-accuracy/current (accessed on 15 October 2025).
  21. Oxman, A.D.; Guyatt, G.H. A consumer’s guide to subgroup analyses. Ann. Intern. Med. 1992, 116, 78–84.
  22. Viechtbauer, W.; Cheung, M.W.-L. Outlier and influence diagnostics for meta-analysis. Res. Synth. Methods 2010, 1, 112–125.
  23. Meng, Z.; Wang, J.; Lin, L.; Wu, C. Sensitivity analysis with iterative outlier detection for systematic reviews and meta-analyses. Stat. Med. 2024, 43, 1549–1563.
  24. Thompson, S.G.; Higgins, J.P.T. How should meta-regression analyses be undertaken and interpreted? Stat. Med. 2002, 21, 1559–1573.
  25. Baujat, B.; Mahé, C.; Pignon, J.; Hill, C. A graphical method for exploring heterogeneity in meta-analyses: Application to a meta-analysis of 65 trials. Stat. Med. 2002, 21, 2641–2652.
  26. Galbraith, R.F. A note on graphical presentation of estimated odds ratios from several clinical trials. Stat. Med. 1988, 7, 889–894.
  27. L’Abbé, K.A.; Detsky, A.S.; O’Rourke, K. Meta-analysis in clinical research. Ann. Intern. Med. 1987, 107, 224–233.
  28. Sterne, J.A.; Egger, M. Funnel plots for detecting bias in meta-analysis: Guidelines on choice of axis. J. Clin. Epidemiol. 2001, 54, 1046–1055.
  29. Egger, M.; Smith, G.D.; Schneider, M.; Minder, C. Bias in meta-analysis detected by a simple, graphical test. BMJ 1997, 315, 629–634.
  30. Begg, C.B.; Mazumdar, M. Operating characteristics of a rank correlation test for publication bias. Biometrics 1994, 50, 1088–1101.
  31. Atkins, D.; Best, D.; Briss, P.A.; Eccles, M.; Falck-Ytter, Y.; Flottorp, S.; Guyatt, G.H.; Harbour, R.T.; Haugh, M.C.; Henry, D.; et al. Grading quality of evidence and strength of recommendations. BMJ 2004, 328, 1490.
  32. Guyatt, G.H.; Oxman, A.D.; Kunz, R.; Woodcock, J.; Brozek, J.; Helfand, M.; Alonso-Coello, P.; Glasziou, P.; Jaeschke, R.; Akl, E.A.; et al. GRADE guidelines: 7. Rating the quality of evidence—Inconsistency. J. Clin. Epidemiol. 2011, 64, 1294–1302.
Figure 1. The upper panel depicts a perfectly aligned forest, where all trunks stand vertical and of equal height—an analogy for homogeneous studies with minimal heterogeneity. The lower panel shows the same number of trunks, but several are tilted at different angles, while others vary in height or trunk thickness. This uneven woodland represents heterogeneous studies, where differences across results may arise from multiple potential sources of variability rather than chance alone.
Figure 2. Different structures of heterogeneity with similar Q. The upper panel depicts several trunks with small variations in height, creating mild irregularity across the skyline. The lower panel shows otherwise symmetric trunks of equal height, except for one that is extremely short. Both scenarios could yield a similar Q statistic, yet they represent very different realities: the same quantification may arise from the progressive accumulation of many small deviations or from a single disproportionate outlier. This underscores that Q reflects the presence of excess variation but does not capture its underlying structure.
Figure 3. Same I², different τ². The upper panel shows six trunks with only slight deviations from vertical alignment in three of them. The lower panel mirrors this arrangement, but the tilts are more pronounced. Although the visual imbalance is kept conceptually similar in both panels, the degree of tilt differs, representing small versus large τ². This illustrates that I² captures the relative contribution of heterogeneity to observed variability, whereas τ² reflects its magnitude as between-study variance: two meta-analyses may share a similar I² yet differ greatly in absolute variability between studies. This illustration is intentionally conceptual and should not be interpreted as meaning that I² represents the proportion of studies that differ; I² is derived from Q and reflects the proportion of observed variability attributable to between-study heterogeneity.
Figure 4. Worked quantitative example showing divergence between I² and τ². The upper panel shows a simulated meta-analysis of relatively precise studies with low absolute between-study heterogeneity. Despite a small τ² value, I² is high because the within-study sampling error is small. The middle panel shows a simulated meta-analysis with less precise studies and larger absolute between-study heterogeneity. In this scenario, τ² and the prediction interval are substantially larger, although I² is lower than in the upper panel. The lower panel shows a small-k scenario, illustrating the instability of heterogeneity estimates when few studies are available. The key message is that Q, I², τ², and prediction intervals should not be interpreted interchangeably: Q tests whether observed variability exceeds what would be expected by sampling error alone, I² reflects proportional heterogeneity, whereas τ² and the prediction interval describe the absolute scale and clinical consequences of between-study variability. All analyses were performed using random-effects REML models in Stata 19.0 (StataCorp LLC, College Station, TX, USA) with the official meta suite. Effect sizes are presented as risk ratios. Q statistics and PIs are reported numerically in the main text and are not displayed as separate graphical intervals in the forest plots. Values displayed within the forest plots are rounded by the software; exact τ² values are reported in the main text.
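The divergence described in the Figure 4 caption can be sketched numerically. The snippet below is a minimal illustration, assuming the Higgins and Thompson relationship I² = τ² / (τ² + s²), where s² is a "typical" within-study variance, and a prediction interval built on the log risk-ratio scale. The function names, the two scenarios, and every number in them are hypothetical and are not the simulated data behind Figure 4. The critical value is supplied by the caller; the 0.975 quantile of a t distribution with k − 2 degrees of freedom is the conventional choice.

```python
import math


def i2_percent(tau2: float, typical_within_var: float) -> float:
    """I^2 (%) implied by tau^2 and a 'typical' within-study variance s^2."""
    if tau2 < 0 or typical_within_var <= 0:
        raise ValueError("tau2 must be >= 0 and the within-study variance > 0")
    return 100.0 * tau2 / (tau2 + typical_within_var)


def prediction_interval_rr(log_rr: float, se_log_rr: float,
                           tau2: float, crit: float) -> tuple[float, float]:
    """Prediction interval for a risk ratio, computed on the log scale.

    crit is a caller-supplied critical value (conventionally the 0.975
    quantile of t with k - 2 degrees of freedom).
    """
    half_width = crit * math.sqrt(tau2 + se_log_rr ** 2)
    return math.exp(log_rr - half_width), math.exp(log_rr + half_width)


# Hypothetical scenarios with k = 10 studies each, so the t quantile has 8 df.
T_CRIT_8DF = 2.306
scenarios = {
    "precise studies, small tau^2": {"tau2": 0.01, "s2": 0.001, "se": 0.05},
    "imprecise studies, large tau^2": {"tau2": 0.09, "s2": 0.09, "se": 0.15},
}

for label, s in scenarios.items():
    i2 = i2_percent(s["tau2"], s["s2"])
    low, high = prediction_interval_rr(math.log(0.8), s["se"], s["tau2"], T_CRIT_8DF)
    print(f"{label}: I^2 = {i2:.1f}%, PI for RR = {low:.2f} to {high:.2f}")
# I^2 is 90.9% in the first scenario and 50.0% in the second,
# yet the second scenario has the wider prediction interval.
```

The scenario with the higher I² has the narrower prediction interval, which is why the two measures cannot stand in for each other.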
Table 1. Key statistics for assessing heterogeneity in meta-analysis.
| Measure | What It Measures | How It Works | Strengths | Limitations | Forest Metaphor |
|---|---|---|---|---|---|
| Q (Cochran’s Q) | Tests whether variability between studies is greater than expected by sampling error alone | χ² test, df = k − 1 | Simple, widely implemented | Low power with few studies; too sensitive with many; only a test (yes/no) | Spotting whether the grove looks uneven at all |
| I² (Higgins & Thompson) | Proportion of observed variability attributable to between-study heterogeneity | Derived from Q and df | Intuitive %, widely reported | Influenced by study precision; unstable with few studies; not a measure of absolute heterogeneity | What fraction of the visible unevenness exceeds random scatter |
| τ² (tau-squared) | Between-study variance (absolute amount of heterogeneity) | Estimated via formulas (DL, REML, etc.) | Quantifies between-study variance on the squared effect-size scale; τ is in the same units as the effect estimate. | Harder to interpret; estimator-dependent; unstable with few studies | How variable the leaning is across the grove (for example, a few degrees versus 40°); strictly, τ, not τ², is on the same scale as the effect estimate. |
| Prediction Interval (PI) | Likely range of true effects in a new study | Extends the random-effects model using τ² | Adds realism: shows what to expect in future contexts | Wide intervals with few studies; rarely reported in practice | What leaning may we see in other forests |
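As a minimal sketch of how the first three quantities in Table 1 relate computationally, the Python snippet below derives Q, I², and τ² from study-level effects and their within-study variances. It uses the DerSimonian-Laird (DL) moment estimator because it has a closed form; this is a substitution for illustration and not the REML estimation used for the worked examples in this tutorial. The function name `heterogeneity_summary` and the three-study input are hypothetical.

```python
from typing import NamedTuple, Sequence


class HeterogeneitySummary(NamedTuple):
    q: float      # Cochran's Q
    df: int       # degrees of freedom, k - 1
    i2: float     # I^2 as a percentage
    tau2: float   # DerSimonian-Laird estimate of tau^2


def heterogeneity_summary(effects: Sequence[float],
                          variances: Sequence[float]) -> HeterogeneitySummary:
    """Q, I^2 and DL tau^2 from effects (e.g. log risk ratios) and their variances."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    if any(v <= 0 for v in variances):
        raise ValueError("within-study variances must be positive")

    weights = [1.0 / v for v in variances]  # inverse-variance weights
    sum_w = sum(weights)
    # Common-effect (fixed-effect) pooled estimate, the reference point for Q.
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum_w

    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = k - 1
    # I^2 is the share of Q in excess of its expectation under homogeneity.
    i2 = 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0

    # DL scaling constant; converts the excess of Q over df into a variance.
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)
    return HeterogeneitySummary(q, df, i2, tau2)


# Hypothetical log risk ratios with equal within-study variances
# (illustrative only, not data from this tutorial).
result = heterogeneity_summary([0.0, 0.2, 0.4], [0.01, 0.01, 0.01])
print(f"Q = {result.q:.2f} on {result.df} df, "
      f"I^2 = {result.i2:.1f}%, tau^2 = {result.tau2:.3f}")
# prints: Q = 8.00 on 2 df, I^2 = 75.0%, tau^2 = 0.030
```

Both I² and the DL τ² are truncated at zero when Q falls below its degrees of freedom, which is one reason these estimates are unstable when few studies are available.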
Table 2. Heterogeneity measures in diagnostic test accuracy meta-analysis.
| Measure/Approach | What It Measures | How It Works | Strengths | Limitations | Recommended Role |
|---|---|---|---|---|---|
| DTA: Univariate Q/I² | Variability assessed separately for sensitivity or specificity. | Applies conventional univariate heterogeneity statistics to one diagnostic dimension (sensitivity/specificity) at a time. | Familiar and easy to calculate. | Misleading as a primary summary because it ignores the correlation between sensitivity and specificity and threshold effects. | Avoid as the main heterogeneity summary in DTA meta-analysis. |
| DTA: Bivariate/HSROC hierarchical models | How sensitivity and specificity vary together across studies. | Models sensitivity and specificity jointly, allowing for threshold effects and correlations between the two measures. | Preserves the clinical relationship between sensitivity and specificity; supports 95% confidence and prediction regions. | Requires a sufficient number of studies and is more complex than univariate pooling. | Preferred framework for interpreting heterogeneity in most DTA meta-analyses. |
| DTA: Bivariate I² or related summary metrics | Attempts to summarize overall DTA heterogeneity as a single value. | Converts the joint variability of sensitivity and specificity into a more compact summary. | May be easier to communicate than separate model parameters. | Less standardized, less familiar, and rarely used as a routine tool. | Optional and complementary; not a replacement for reporting variance components, correlation, and prediction regions. |
