Bootstrap-Calibrated Outlier Detection and Influence Diagnostics for Meta-Analysis: The R Package boutliers

Noma, Hisashi; Maruo, Kazushi; Gosho, Masahiko

doi:10.3390/stats9030060


by Hisashi Noma ¹,²,*, Kazushi Maruo ³ and Masahiko Gosho ³

¹ Department of Interdisciplinary Statistical Mathematics, The Institute of Statistical Mathematics, Tokyo 190-8562, Japan

² The Graduate Institute for Advanced Studies, The Graduate University for Advanced Studies (SOKENDAI), Tokyo 190-8562, Japan

³ Department of Biostatistics, Institute of Medicine, University of Tsukuba, Tsukuba 305-8575, Japan

* Author to whom correspondence should be addressed.

Stats 2026, 9(3), 60; https://doi.org/10.3390/stats9030060 (registering DOI)

Submission received: 11 May 2026 / Revised: 31 May 2026 / Accepted: 8 June 2026 / Published: 12 June 2026

(This article belongs to the Section Biostatistics)


Abstract

Meta-analysis is a statistical tool commonly used within systematic reviews to synthesize quantitative evidence, but individual studies with atypical results or disproportionate influence can materially affect pooled estimates, heterogeneity estimates, and the conclusions drawn from evidence syntheses. Conventional outlier and influence diagnostics for meta-analysis are useful, but their interpretation often relies on asymptotic reference values or informal rules of thumb, which may be inadequate when the number of studies is limited or heterogeneity is substantial. We introduce boutliers, an R package that implements bootstrap-calibrated outlier detection and influence diagnostics for fixed-effect and random-effects meta-analysis. The package provides leave-one-study-out diagnostics based on Studentized deleted residuals, relative changes in the variance of the pooled effect estimator, and relative changes in the between-study variance, together with a likelihood-ratio diagnostic based on a mean-shifted model. For each diagnostic measure, bootstrap reference distributions, critical values, and p-values are provided to support quantitative interpretation of influential studies. We describe the statistical framework, implementation, and practical use of the package and illustrate its application using a real published meta-analysis dataset on spinal manipulative therapy for chronic low back pain. The boutliers package provides accessible tools for incorporating uncertainty-calibrated influence diagnostics into routine meta-analytic practice.

Keywords: meta-analysis; random-effects model; outlier detection; influence diagnostics; bootstrap calibration

1. Introduction

Meta-analysis provides a quantitative framework for synthesizing evidence across studies and is widely used in clinical research, public health, health technology assessment, and guideline development [1,2]. In practice, however, the studies included in a meta-analysis often differ in design, populations, settings, interventions, outcome definitions, and risk of bias. Such diversity gives rise to between-study heterogeneity and may also produce individual studies that are atypical or highly influential for the pooled results. Random-effects models are commonly used to account for between-study heterogeneity [2,3], but they do not remove the need to examine whether individual studies exert disproportionate influence on the overall synthesis.

Atypical or influential studies can change the magnitude and precision of the pooled effect, inflate or reduce heterogeneity estimates, and alter the interpretation of subgroup or sensitivity analyses. Influence diagnostics are therefore important not only for detecting unusual studies but also for assessing the robustness and transparency of evidence synthesis. In this context, the term “outlier” is often used for a study whose effect estimate deviates markedly from the fitted model, whereas “influential study” refers more broadly to a study whose inclusion materially affects estimates, measures of precision, heterogeneity, or inferential conclusions [4]. These concepts are related but not identical, and both are relevant in practical meta-analysis.

A primary approach to addressing these issues has been to adapt conventional outlier detection and influence diagnostic methods from regression analysis [5,6,7]. Viechtbauer and Cheung [4] proposed a comprehensive set of such methods for meta-analysis, including residual-based measures, leave-one-out diagnostics, and measures of influence on model parameters and heterogeneity. These methods have been widely used in applied systematic reviews and have also been extended to more complex evidence synthesis settings, including bivariate meta-analysis of diagnostic test accuracy studies [8], network meta-analysis [9], and analyses of multicenter or multiregional clinical trials [10,11].

Although these influence diagnostics are useful, their interpretation still often depends on asymptotic reference values or informal cutoffs. For example, Studentized residuals may be compared with conventional normal quantiles, and likelihood-ratio statistics may be compared with chi-square reference distributions. These criteria do not explicitly account for the finite number of studies, the observed pattern of within-study variances, or the estimated degree of between-study heterogeneity. In realistic meta-analytic settings, particularly when the number of studies is small or moderate, such approximations may be crude. Bootstrap methods provide a natural way to evaluate the sampling variability of diagnostic measures and to construct data-adaptive reference values [12]. Similar bootstrap-calibrated approaches have been used in recent methodological studies on influence diagnostics for multicenter and multiregional clinical trials, diagnostic test accuracy meta-analysis, and network meta-analysis [8,9,10,11]. However, user-friendly computational tools for applying these methods in routine pairwise meta-analysis remain limited.

This article makes three contributions. First, we present a unified implementation framework for bootstrap-calibrated outlier and influence diagnostics in standard fixed-effect and random-effects meta-analysis. The diagnostic measures considered here are closely related to existing influence diagnostics, and the contribution of boutliers is primarily to provide a practical bootstrap-calibrated implementation that yields empirical reference distributions, study-specific critical values, and bootstrap p-values for routine use. Second, we describe the implementation of these methods in the R package boutliers, which is available on CRAN and provides simple commands for applied users. Third, we illustrate how the resulting diagnostics can be interpreted in practice using a real published meta-analysis dataset on chronic low back pain. The package provides three leave-one-study-out diagnostic measures: (1) Studentized deleted residuals, (2) relative changes in the variance of the pooled effect estimator, and (3) relative changes in the between-study variance. In addition, it implements a likelihood-ratio diagnostic based on a mean-shifted random-effects model. These methods evaluate complementary aspects of influence and can support transparent sensitivity analyses in systematic reviews.

The remainder of this article is organized as follows. Section 2 describes the statistical models and diagnostic measures. Section 3 provides an overview of the boutliers package. Section 4 illustrates the main functions using a meta-analysis of spinal manipulative therapy for chronic low back pain. Section 5 concludes with practical implications and future directions.

2. Methods

2.1. Statistical Models and Inference Methods

We consider the standard normal-normal model for aggregate-data meta-analysis. Let $Y_i$ ($i = 1, \dots, N$) denote the study-specific estimate of a treatment effect measure, such as a mean difference, standardized mean difference, log risk ratio, log odds ratio, log hazard ratio, or risk difference. Let $\sigma_i^2$ denote the corresponding within-study variance. This framework assumes that the effect-size rows are statistically independent and that the supplied within-study variances are correctly specified. When multiple effect sizes are extracted from the same study, when treatment groups are shared, or when outcomes are correlated, users should aggregate dependent estimates, specify an appropriate covariance structure, or otherwise account for dependence before interpreting the analysis as a standard independent-effect meta-analysis. The random-effects model is expressed as

$$Y_i \sim N(\theta_i, \sigma_i^2), \qquad \theta_i \sim N(\mu, \tau^2),$$

where $\mu$ is the overall mean effect and $\tau^2$ is the between-study heterogeneity variance. When $\tau^2 = 0$, the model reduces to the fixed-effect model, which assumes a common underlying effect across studies [2].
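As a minimal sketch of the two-stage model above, the following base R code simulates one meta-analysis. The values of `N`, `mu`, `tau2`, and the range of `sigma2` are illustrative assumptions and are not taken from the article.

```r
# Simulate one dataset from the normal-normal random-effects model.
set.seed(1)
N      <- 10                     # number of studies (illustrative)
mu     <- -3                     # overall mean effect (illustrative)
tau2   <- 4                      # between-study variance (illustrative)
sigma2 <- runif(N, 1, 9)         # within-study variances, treated as known

theta <- rnorm(N, mean = mu, sd = sqrt(tau2))       # theta_i ~ N(mu, tau^2)
y     <- rnorm(N, mean = theta, sd = sqrt(sigma2))  # Y_i ~ N(theta_i, sigma_i^2)

# Marginally, Y_i ~ N(mu, sigma_i^2 + tau^2); setting tau2 to 0 gives
# the fixed-effect model.
marginal_var <- sigma2 + tau2
```

The same two-stage sampling scheme is what a parametric bootstrap under this model repeats many times with estimated parameters.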

Various estimators are available for the between-study variance, including the DerSimonian–Laird, restricted maximum likelihood (REML), Paule–Mandel, and Sidik–Jonkman estimators [13,14,15,16]. In this article, we use inverse-variance estimation for the fixed-effect model and REML estimation for the random-effects model as default choices. Adjustments to the uncertainty of the pooled effect, such as the Hartung–Knapp–Sidik–Jonkman method, may be used in primary meta-analyses [17], but the diagnostic statistics considered below are defined in terms of the fitted fixed-effect or random-effects model and its leave-one-study-out counterparts.

The log-likelihood function under the random-effects model is

$$l(\mu, \tau^2) = -\frac{1}{2} \sum_{i=1}^{N} \left\{ \log\{2\pi(\sigma_i^2 + \tau^2)\} + \frac{(y_i - \mu)^2}{\sigma_i^2 + \tau^2} \right\}.$$

The restricted log-likelihood can be written as

$$l_{RL}(\mu, \tau^2) = -\frac{1}{2} \left\{ \sum_{i=1}^{N} \left[ \frac{(y_i - \mu)^2}{\sigma_i^2 + \tau^2} + \log(\sigma_i^2 + \tau^2) \right] + \log\left( \sum_{i=1}^{N} \frac{1}{\sigma_i^2 + \tau^2} \right) \right\}.$$

The REML estimators $\hat{\mu}$ and $\hat{\tau}^2$ are obtained by maximizing the restricted log-likelihood. In the following sections, we primarily describe diagnostic measures under the random-effects model, but the Studentized residual and likelihood-ratio diagnostic can also be applied to the fixed-effect model by setting $\tau^2 = 0$.
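The restricted log-likelihood above can be maximized numerically. The sketch below is one possible base R implementation, not the code used by boutliers or metafor. The function names `restricted_loglik` and `reml_fit`, the search interval, and the toy data are assumptions made for illustration. For a fixed $\tau^2$ the maximizing $\mu$ is the inverse-variance weighted mean, so the search is one-dimensional.

```r
# Restricted log-likelihood as a function of tau2, with mu profiled out.
# y: effect estimates; v: within-study variances (treated as known).
restricted_loglik <- function(tau2, y, v) {
  w  <- 1 / (v + tau2)               # inverse-variance weights
  mu <- sum(w * y) / sum(w)          # maximizer over mu for this tau2
  -0.5 * (sum(w * (y - mu)^2) + sum(log(v + tau2)) + log(sum(w)))
}

# REML fit by one-dimensional search over tau2 >= 0.
# The upper search limit is a heuristic assumption.
reml_fit <- function(y, v, upper = 10 * var(y)) {
  opt  <- optimize(restricted_loglik, interval = c(0, upper),
                   y = y, v = v, maximum = TRUE)
  tau2 <- opt$maximum
  w    <- 1 / (v + tau2)
  list(mu = sum(w * y) / sum(w), tau2 = tau2, var_mu = 1 / sum(w))
}

# Toy data (hypothetical, not from the SMT example)
y <- c(-1, 0.5, 2, 8, -0.3)
v <- c(1, 1.5, 0.8, 1.2, 2)
fit <- reml_fit(y, v)
```

Note that `optimize` never evaluates the exact endpoint, so a boundary solution is returned as a value very close to, but not exactly, zero.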

2.2. Studentized Deleted Residuals

First, we consider the Studentized deleted residual, a standard regression diagnostic for identifying observations whose outcomes are unusually far from the fitted model after accounting for their uncertainty [5,6]. In meta-analysis, this measure evaluates whether an individual study’s effect estimate is unusually distant from the pooled effect, relative to its expected sampling variability and the estimated between-study heterogeneity.

As a preliminary quantity, one could compute a full-data Studentized residual by comparing the $a$th study with the pooled estimate obtained from all studies. However, such a residual may underestimate the extremeness of an individual study because the same study contributes to the estimation of the pooled effect and heterogeneity parameters. Therefore, the diagnostic statistic used here is the leave-one-study-out Studentized deleted residual, denoted by $t_a$, in which the fitted model is estimated after excluding the study being evaluated. In the remainder of this section, $t_a$ denotes this leave-one-study-out diagnostic statistic.

Let $\hat{\mu}^{(-a)}$ and $\hat{\tau}^{2(-a)}$ denote the REML estimators obtained from the dataset of $N - 1$ studies after excluding the $a$th study. The Studentized deleted residual is defined as

$$t_a = \frac{Y_a - \hat{\mu}^{(-a)}}{\sqrt{\mathrm{Var}[Y_a - \hat{\mu}^{(-a)}]}},$$

where

$$\mathrm{Var}[Y_a - \hat{\mu}^{(-a)}] = \left(\hat{w}_a^{(-a)}\right)^{-1} + \left(\sum_{i \neq a} \hat{w}_i^{(-a)}\right)^{-1}$$

and $\hat{w}_i^{(-a)} = (\hat{\tau}^{2(-a)} + \sigma_i^2)^{-1}$ ($i = 1, 2, \dots, N$). Because $\hat{\mu}^{(-a)}$ and $\hat{\tau}^{2(-a)}$ are estimated without using the $a$th study, $t_a$ can be interpreted as a predicted Studentized residual for that study.
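The definition of $t_a$ can be sketched in a few lines of base R. To keep the example closed-form and self-contained, the DerSimonian–Laird estimator is swapped in for REML; this is a simplification, not the package default. The function names `dl_tau2` and `deleted_residual` and the toy data are hypothetical.

```r
# DerSimonian-Laird estimate of tau^2 (closed form; used here instead of REML).
dl_tau2 <- function(y, v) {
  w     <- 1 / v
  mu_fe <- sum(w * y) / sum(w)
  Q     <- sum(w * (y - mu_fe)^2)
  max(0, (Q - (length(y) - 1)) / (sum(w) - sum(w^2) / sum(w)))
}

# Studentized deleted residual for study a.
deleted_residual <- function(a, y, v, tau2_fun = dl_tau2) {
  y_rest <- y[-a]
  v_rest <- v[-a]
  tau2    <- tau2_fun(y_rest, v_rest)          # tau^2 estimated without study a
  w_rest  <- 1 / (v_rest + tau2)
  mu_rest <- sum(w_rest * y_rest) / sum(w_rest)
  # Var[Y_a - mu^(-a)] = (sigma_a^2 + tau^2) + 1 / sum of leave-one-out weights
  var_diff <- (v[a] + tau2) + 1 / sum(w_rest)
  (y[a] - mu_rest) / sqrt(var_diff)
}

# Toy data (hypothetical): the fifth study is far from the others
y <- c(0.1, -0.2, 0.3, 0.0, 5)
v <- rep(0.04, 5)
t_all <- sapply(seq_along(y), deleted_residual, y = y, v = v)
```

In this toy dataset the fifth residual is very large, whereas the others are small because the extreme study inflates the heterogeneity estimate whenever it remains in the fitting set.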

In practice, appropriate reference values are needed to interpret $t_a$. Conventionally, the standard normal distribution has been used as an approximate reference distribution, and values exceeding 1.96 or 2.00 in absolute value are often regarded as indicative of potential outlyingness. However, this rule relies on large-sample approximations and does not account for the structure of the specific meta-analysis, including the number of studies, the pattern of within-study variances, and the estimated degree of between-study heterogeneity. We therefore use a parametric bootstrap procedure (Algorithm 1) to estimate the sampling distribution of $t_a$. This procedure, as used in boutliers, provides a model-based empirical reference distribution conditional on the fitted model and observed variance structure. The resulting thresholds are intended to support exploratory ranking and flagging of potentially influential studies, rather than definitive confirmatory testing.

For example, the 2.5th and 97.5th percentiles of the bootstrap distribution can be used as diagnostic thresholds. A study whose observed residual lies outside these limits may be flagged as potentially outlying relative to the fitted meta-analytic model. The same procedure can be applied under the fixed-effect model by setting $\tau^2 = 0$. In boutliers, this diagnostic is implemented by the STR function.

2.3. Relative Change of the Variance of the Pooled Effect Estimator

Second, we consider an influence measure based on the relative change in the variance of the pooled effect estimator under a leave-one-study-out scheme. This measure was originally discussed by Viechtbauer and Cheung [4]. Let $\hat{w}_i = (\hat{\tau}^2 + \sigma_i^2)^{-1}$ denote the inverse-variance weight under the fitted random-effects model, and let $\hat{w}_i^{(-a)}$ denote the corresponding weight evaluated at the leave-one-study-out estimate $\hat{\tau}^{2(-a)}$, as in Section 2.2. The variance ratio statistic for the $a$th study is defined as

$$\mathrm{VRATIO}_a = \frac{\mathrm{Var}[\hat{\mu}^{(-a)}]}{\mathrm{Var}[\hat{\mu}]} = \frac{\sum_{i=1}^{N} \hat{w}_i}{\sum_{i \neq a} \hat{w}_i^{(-a)}}.$$

$\mathrm{VRATIO}_a$ measures how the estimated variance of the pooled effect changes when the $a$th study is omitted. Values close to 1 indicate little influence on the precision of the pooled estimate. Unusually small values indicate that omitting the study reduces the estimated variance of the pooled effect, suggesting that the study contributes disproportionately to the uncertainty of the synthesis. Such studies should be examined together with residual-based diagnostics and substantive information. Values larger than 1 are usually less concerning in this context because some loss of precision is expected when a study is removed. Very large values may suggest that the study contributes substantially to precision, but such studies are not necessarily outliers in the usual sense.

To quantify the degree of influence, the bootstrap procedure described in Algorithm 1 can be applied by replacing $t_a$ with $\mathrm{VRATIO}_a$. For example, the lower 5th percentile of the bootstrap distribution can be used as a critical value. Because variance ratio statistics under a fixed-effect model primarily reflect the amount of within-study information contributed by each study, this measure is most useful under the random-effects model. In boutliers, this diagnostic is implemented by the VRATIO function.
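The remark about the fixed-effect model can be made explicit with a short derivation. Under the fixed-effect model the weights are $\sigma_i^{-2}$ whether or not a study is deleted, because no heterogeneity variance is re-estimated, so the ratio reduces to:

```latex
\mathrm{VRATIO}_a^{\mathrm{FE}}
  = \frac{\sum_{i=1}^{N} \sigma_i^{-2}}{\sum_{i \neq a} \sigma_i^{-2}}
  = 1 + \frac{\sigma_a^{-2}}{\sum_{i \neq a} \sigma_i^{-2}} \; > \; 1 .
```

The statistic then depends only on the within-study variances and not on the observed effects $Y_i$, which is why it carries little diagnostic information in that setting. Under the random-effects model, values below 1 become possible because $\hat{\tau}^{2(-a)}$ can be smaller than $\hat{\tau}^2$.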

Algorithm 1. Bootstrap calibration for the Studentized deleted residual.

1. For the dataset excluding the $a$th study, fit the random-effects model and obtain $\hat{\mu}^{(-a)}$ and $\hat{\tau}^{2(-a)}$.
2. For $b = 1, \dots, B$, generate bootstrap random effects $\theta_1^{(b)}, \dots, \theta_N^{(b)}$ from $N(\hat{\mu}^{(-a)}, \hat{\tau}^{2(-a)})$, and then generate bootstrap observations $Y_i^{(b)} \sim N(\theta_i^{(b)}, \sigma_i^2)$, $i = 1, \dots, N$.
3. For each bootstrap dataset, compute the Studentized deleted residual $t_a^{(b)}$.
4. Estimate the sampling distribution of $t_a$ by the empirical distribution of $t_a^{(1)}, \dots, t_a^{(B)}$.
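The steps of Algorithm 1 can be sketched in base R as follows. This is an illustration under stated assumptions rather than the boutliers implementation: the DerSimonian–Laird estimator replaces REML to keep the code closed-form, and the function names, the `alpha` argument, and the toy data are hypothetical.

```r
# Closed-form DerSimonian-Laird estimate of tau^2 (stand-in for REML).
dl_tau2 <- function(y, v) {
  w <- 1 / v
  Q <- sum(w * (y - sum(w * y) / sum(w))^2)
  max(0, (Q - (length(y) - 1)) / (sum(w) - sum(w^2) / sum(w)))
}

# Leave-one-out fit and Studentized deleted residual for study a.
loo_fit <- function(a, y, v) {
  tau2 <- dl_tau2(y[-a], v[-a])
  w    <- 1 / (v[-a] + tau2)
  list(mu = sum(w * y[-a]) / sum(w), tau2 = tau2, sum_w = sum(w))
}
deleted_residual <- function(a, y, v) {
  f <- loo_fit(a, y, v)
  (y[a] - f$mu) / sqrt(v[a] + f$tau2 + 1 / f$sum_w)
}

bootstrap_limits <- function(a, y, v, B = 500, alpha = 0.05) {
  f <- loo_fit(a, y, v)                           # Step 1: fit without study a
  N <- length(y)
  t_boot <- replicate(B, {
    theta_b <- rnorm(N, f$mu, sqrt(f$tau2))       # Step 2: random effects
    y_b     <- rnorm(N, theta_b, sqrt(v))         #         and observations
    deleted_residual(a, y_b, v)                   # Step 3: statistic
  })
  # Step 4: percentiles of the empirical bootstrap distribution
  quantile(t_boot, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Toy data (hypothetical): the fifth study is far from the others
set.seed(123)
y <- c(0.1, -0.2, 0.3, 0.0, 5)
v <- rep(0.04, 5)
limits  <- bootstrap_limits(5, y, v, B = 200)
flagged <- deleted_residual(5, y, v) < limits[1] ||
           deleted_residual(5, y, v) > limits[2]
```

Replacing `deleted_residual` inside the loop with another statistic gives the corresponding calibration for that statistic, which is how the same scheme extends to the ratio measures.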

2.4. Relative Change in the Heterogeneity Variance

A related influence measure is the relative change in the estimated between-study heterogeneity variance. Following Viechtbauer and Cheung [4], we define

$$\mathrm{TRATIO}_a = \frac{\hat{\tau}^{2(-a)}}{\hat{\tau}^2},$$

where $\hat{\tau}^{2(-a)}$ is the heterogeneity variance estimate obtained after excluding the $a$th study. Values close to 1 indicate little influence on the estimated heterogeneity variance. If $\mathrm{TRATIO}_a$ is much smaller than 1, the exclusion of the study substantially reduces the estimated between-study heterogeneity. Such a study may be influential for heterogeneity and may warrant further examination as a potential source of inconsistency or clinical or methodological diversity.

The bootstrap procedure in Algorithm 1 can again be used to estimate the sampling distribution of $\mathrm{TRATIO}_a$. The lower 5th percentile of the bootstrap distribution may be used as a diagnostic threshold. This measure is defined only under the random-effects model and is particularly relevant when substantial heterogeneity is present.
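Both ratio statistics can be computed from a full fit and a leave-one-study-out fit. The sketch below again uses the DerSimonian–Laird estimator as a closed-form stand-in for REML, and it re-estimates $\tau^2$ after deletion, which is what allows the variance ratio to fall below 1. The function names and toy data are hypothetical.

```r
# Closed-form DerSimonian-Laird estimate of tau^2 (stand-in for REML).
dl_tau2 <- function(y, v) {
  w <- 1 / v
  Q <- sum(w * (y - sum(w * y) / sum(w))^2)
  max(0, (Q - (length(y) - 1)) / (sum(w) - sum(w^2) / sum(w)))
}

ratio_diagnostics <- function(a, y, v) {
  tau2_full <- dl_tau2(y, v)
  tau2_loo  <- dl_tau2(y[-a], v[-a])
  var_full  <- 1 / sum(1 / (v + tau2_full))       # Var[mu-hat], all studies
  var_loo   <- 1 / sum(1 / (v[-a] + tau2_loo))    # Var[mu-hat], study a removed
  # TRATIO is undefined when the full-data tau^2 estimate is exactly zero.
  c(VRATIO = var_loo / var_full, TRATIO = tau2_loo / tau2_full)
}

# Toy data (hypothetical): the fifth study is far from the others
y <- c(0.1, -0.2, 0.3, 0.0, 5)
v <- rep(0.04, 5)
r5 <- ratio_diagnostics(5, y, v)   # removing the extreme study
r1 <- ratio_diagnostics(1, y, v)   # removing a typical study
```

In this toy example, deleting the extreme study drives both ratios far below 1, while deleting a typical study gives ratios above 1.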

2.5. Likelihood-Ratio Diagnostic Based on a Mean-Shifted Model

Another approach to influence diagnostics is to compare the standard meta-analytic model with an alternative model in which one study is allowed to have a shifted mean. Aoki et al. [10] and Nakamura and Noma [11] used a related mean-shifted model for detecting influential regions or centers in clinical trials. In the present meta-analytic setting, for the $a$th study, we consider

$$\theta_a \sim N(\mu + \zeta, \tau^2),$$

whereas the remaining studies follow

$$\theta_i \sim N(\mu, \tau^2) \quad (i \neq a).$$

The diagnostic hypothesis is

$$H_0 : \zeta = 0 \quad \text{vs.} \quad H_1 : \zeta \neq 0.$$

If the null hypothesis is rejected, the $a$th study may be regarded as influential in the sense that its effect estimate is not well described by the common random-effects distribution fitted to the remaining studies. However, this does not necessarily imply that the study is an outlier in a substantive or clinical sense. The result should therefore be interpreted as a diagnostic flag, not as an automatic basis for exclusion.

The likelihood under the null model is given by the standard random-effects log-likelihood

$$l_0(\mu, \tau^2) = -\frac{1}{2} \sum_{i=1}^{N} \left\{ \log\{2\pi(\sigma_i^2 + \tau^2)\} + \frac{(y_i - \mu)^2}{\sigma_i^2 + \tau^2} \right\}.$$

Under the mean-shifted alternative for the $a$th study, the log-likelihood is

$$l_{1[a]}(\mu, \tau^2, \zeta) = -\frac{1}{2} \left\{ \log\{2\pi(\sigma_a^2 + \tau^2)\} + \frac{(y_a - \mu - \zeta)^2}{\sigma_a^2 + \tau^2} \right\} - \frac{1}{2} \sum_{i \neq a} \left\{ \log\{2\pi(\sigma_i^2 + \tau^2)\} + \frac{(y_i - \mu)^2}{\sigma_i^2 + \tau^2} \right\}.$$

The likelihood-ratio statistic is

$$T_{[a]} = -2 \left\{ l_0(\tilde{\mu}, \tilde{\tau}^2) - l_{1[a]}(\tilde{\mu}_{[a]}, \tilde{\tau}_{[a]}^2, \tilde{\zeta}_{[a]}) \right\},$$

where $\tilde{\mu}$ and $\tilde{\tau}^2$ are the maximum likelihood estimates under the null model, and $\tilde{\mu}_{[a]}$, $\tilde{\tau}_{[a]}^2$, and $\tilde{\zeta}_{[a]}$ are the maximum likelihood estimates under the mean-shifted model for the $a$th study.

Under conventional large-sample theory, $T_{[a]}$ is compared with a chi-square distribution with 1 degree of freedom. However, this approximation may be unreliable in meta-analysis with a limited number of studies or substantial heterogeneity. We therefore use a bootstrap-calibrated likelihood-ratio diagnostic (Algorithm 2).

Algorithm 2. Bootstrap calibration for the mean-shifted likelihood-ratio diagnostic.

1. Compute the maximum likelihood estimates $\tilde{\mu}^{(-a)}$ and $\tilde{\tau}^{2(-a)}$ from the dataset excluding the $a$th study.
2. For $b = 1, \dots, B$, generate bootstrap datasets from the null random-effects model with parameters $\tilde{\mu}^{(-a)}$ and $\tilde{\tau}^{2(-a)}$.
3. For each bootstrap dataset, fit both the null model and the mean-shifted model for the $a$th study, and calculate the bootstrap likelihood-ratio statistic $T_{[a]}^{(b)}$.
4. Estimate the reference distribution of $T_{[a]}$ by the empirical distribution of $T_{[a]}^{(1)}, \dots, T_{[a]}^{(B)}$. The bootstrap p-value is obtained by comparing the observed likelihood-ratio statistic with this empirical distribution.

Because this diagnostic is applied separately to each study, the resulting bootstrap p-values are exploratory study-wise diagnostic quantities. They are not adjusted for the simultaneous examination of multiple studies and should not be interpreted as confirmatory hypothesis tests. In practice, they are intended to rank and flag potentially influential studies for sensitivity analyses and substantive review. Combined with bootstrap calibration, the likelihood-ratio diagnostic provides a useful complement to residual-based and variance-based influence measures.
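The likelihood-ratio statistic and its bootstrap calibration can be sketched in base R. This is an illustrative implementation under stated assumptions, not the code of the LRT function. It uses the fact that, under the alternative, the shift $\zeta$ absorbs the residual of the $a$th study, so that study contributes only its log-variance term. The function names, the search interval, the p-value convention (proportion of bootstrap statistics at least as large as the observed one), and the toy data are assumptions.

```r
# Log-likelihood as a function of tau2, with mu profiled out.
profile_loglik <- function(tau2, y, v) {
  w  <- 1 / (v + tau2)
  mu <- sum(w * y) / sum(w)
  -0.5 * sum(log(2 * pi * (v + tau2)) + w * (y - mu)^2)
}

# Maximize f over tau2 in [0, upper], checking the boundary tau2 = 0 explicitly.
max_over_tau2 <- function(f, upper, ...) {
  opt <- optimize(f, interval = c(0, upper), ..., maximum = TRUE)
  max(opt$objective, f(0, ...))
}

lr_stat <- function(a, y, v) {
  upper <- 10 * var(y) + 1                       # heuristic search limit
  l0 <- max_over_tau2(profile_loglik, upper, y = y, v = v)
  # Mean-shifted model: study a keeps only its log-variance term.
  shifted <- function(tau2, y, v) {
    -0.5 * log(2 * pi * (v[a] + tau2)) + profile_loglik(tau2, y[-a], v[-a])
  }
  l1 <- max_over_tau2(shifted, upper, y = y, v = v)
  max(0, 2 * (l1 - l0))                          # guard against rounding error
}

lr_bootstrap <- function(a, y, v, B = 200) {
  # Step 1: ML estimates without study a
  tau2_a <- optimize(profile_loglik, c(0, 10 * var(y) + 1),
                     y = y[-a], v = v[-a], maximum = TRUE)$maximum
  w    <- 1 / (v[-a] + tau2_a)
  mu_a <- sum(w * y[-a]) / sum(w)
  t_obs <- lr_stat(a, y, v)
  # Steps 2-3: sample from the marginal null model N(mu, sigma_i^2 + tau^2)
  t_boot <- replicate(B, lr_stat(a, rnorm(length(y), mu_a, sqrt(v + tau2_a)), v))
  # Step 4: bootstrap threshold and p-value
  c(LR = t_obs, Q = quantile(t_boot, 0.95, names = FALSE),
    P = mean(t_boot >= t_obs))
}

# Toy data (hypothetical): the fifth study is far from the others
set.seed(2026)
y <- c(0.1, -0.2, 0.3, 0.0, 5)
v <- rep(0.04, 5)
res <- lr_bootstrap(5, y, v, B = 100)
```

Sampling directly from the marginal distribution is equivalent to the two-stage generation of random effects and observations, and it keeps the loop short.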

3. R Package boutliers

boutliers is an R package available on CRAN. It implements bootstrap-calibrated influence diagnostics for meta-analysis models, as described in Section 2. Table 1 summarizes the main functions. The package is designed for applied researchers conducting systematic reviews as well as statisticians who wish to examine the robustness of meta-analytic conclusions. The main functions require only study-specific effect estimates and their estimated variances, making the package compatible with common workflows based on metafor or other meta-analysis software. boutliers is not intended to replace general-purpose meta-analysis software such as metafor. Rather, it complements such workflows by providing bootstrap-calibrated reference distributions, study-specific critical values, and bootstrap p-values for several influence diagnostics. Because the functions use precomputed effect estimates and within-study variances as inputs, the diagnostics inherit any assumptions used to construct those quantities. This is particularly relevant for repeated-measures outcomes, paired outcomes, cluster-correlated outcomes, shared controls, and reconstructed aggregate data. In the current implementation, bootstrap calibration is performed separately for each diagnostic and each study; therefore, computation time increases approximately linearly with the number of bootstrap replications and the number of studies. Boundary estimates of $\tau^2 = 0$ are allowed when supported by the selected heterogeneity estimator.

The package can be installed and loaded as follows:

> install.packages("boutliers")
> library("boutliers")

The implemented methods are also applicable, with appropriate data structures, to influence diagnostics for centers or regions in multicenter and multiregional clinical trials [10,11]. To facilitate such applications, the package includes example datasets representing these settings. R example programs used in this article are available at https://github.com/nomahi/boutliers/blob/master/Examplecode.r (accessed on 26 May 2026).

4. Functionality and Illustrative Examples

4.1. Example Dataset: Meta-Analysis of Chronic Low Back Pain

As an illustrative example, we consider the meta-analysis by Rubinstein et al. [18], which evaluated spinal manipulative therapy for chronic low back pain. This example is useful for demonstrating influence diagnostics because the synthesis includes a moderate number of trials, substantial between-study heterogeneity, and at least one study with a markedly different effect estimate. The dataset, provided as SMT in the boutliers package, includes 23 randomized controlled trials. The outcome was pain intensity at one month on a 0 to 100 scale, with higher values indicating worse pain. The comparison was spinal manipulative therapy (SMT; N = 1629) versus guideline-recommended therapies (N = 1526), and the effect measure was the mean difference.

Figure 1 shows the study-specific mean differences and the random-effects pooled estimate for the spinal manipulative therapy dataset. The individual estimates are widely dispersed, consistent with the substantial between-study heterogeneity observed in this example. Most estimates are located relatively close to the pooled mean difference, whereas one study shows a markedly larger treatment effect and appears visually extreme before formal diagnostic analysis. Using the DerSimonian–Laird estimator for the between-study variance, the pooled mean difference was −3.18 (95% CI: −7.70 to 1.34), with substantial heterogeneity (I² = 92%, τ² = 103.74, p < 0.01 by Cochran’s Q test).

Several study names appear more than once in Figure 1 because the dataset contains multiple effect-size rows from the published meta-analysis. These rows are retained here to reproduce the example dataset distributed with the package and to illustrate the software workflow. However, repeated labels may indicate multiple comparisons or shared groups within a trial, and such effect-size rows may be statistically dependent. Therefore, in substantive meta-analyses with dependent effect sizes, users should first aggregate effect sizes, model the covariance structure, or otherwise account for the dependence before applying these diagnostics as standard independent-effect meta-analysis diagnostics.

4.2. Studentized Deleted Residuals

The STR function performs influence analysis using Studentized deleted residuals. The main arguments and outputs are summarized in Table 2. The user provides a data frame containing study-specific effect estimates and their variances, and specifies the corresponding variable names as y and v. The number of bootstrap resampling iterations is specified by B, and the bootstrap percentile level is specified by alpha. A random seed can be specified using seed to ensure reproducibility.

The method argument specifies the estimator used for the variance component when the random-effects model is fitted. The default is REML. Other estimators available through the rma function in the metafor package can also be specified, including fixed-effect analysis (method = "FE"), the Sidik–Jonkman estimator (method = "SJ"), and the Paule–Mandel estimator (method = "PM"). The choice of model and heterogeneity estimator should generally be aligned with the primary meta-analysis or with the sensitivity analysis of interest.

The analysis can be performed as follows:

> STR(yi, vi, data = edat1, B = 2000)

The largest residuals in the example were:

 id         psi        Q1       Q2
 16 -5.10452658 -1.983772 1.912068
 23 -1.85339334 -2.017066 1.907075
 18 -1.69817290 -1.898156 1.869712
 22  1.14244992 -1.945369 2.060744
  6  0.99582436 -1.925249 1.905155

The output is ordered by the absolute magnitude of psi, the observed Studentized deleted residual $t_a$. Q1 and Q2 correspond to the 2.5th and 97.5th percentiles of the bootstrap distribution. In this example, the bootstrap-calibrated limits were close to, but not identical to, the conventional ±1.96 thresholds and varied slightly across studies because they were generated under the fitted meta-analytic model and the observed pattern of within-study variances. The 16th study lies well outside the bootstrap-calibrated reference limits and is therefore flagged as potentially influential by the Studentized deleted residual diagnostic. The other studies shown above do not exceed their corresponding bootstrap reference limits.
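The flagging rule described above can be written out as a small check. In this sketch the printed values are re-entered by hand into a data frame; the structure of the object actually returned by STR is not assumed, and the name `str_out` is hypothetical.

```r
# Values copied from the printed output above.
str_out <- data.frame(
  id  = c(16, 23, 18, 22, 6),
  psi = c(-5.10452658, -1.85339334, -1.69817290, 1.14244992, 0.99582436),
  Q1  = c(-1.983772, -2.017066, -1.898156, -1.945369, -1.925249),
  Q2  = c(1.912068, 1.907075, 1.869712, 2.060744, 1.905155)
)

# A study is flagged when its residual lies outside its own bootstrap limits.
str_out$flagged <- with(str_out, psi < Q1 | psi > Q2)
flagged_ids <- str_out$id[str_out$flagged]
```

Because the limits are study-specific, the comparison is made row by row rather than against a single common cutoff.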

4.3. Variance Ratio Statistics

The VRATIO function calculates influence diagnostics based on relative changes in the variance of the pooled effect estimator and the between-study heterogeneity variance. The main arguments and outputs are summarized in Table 3. As with STR, the user provides study-specific effect estimates and their variances, specifies the number of bootstrap resampling iterations, and can choose the estimator used for the random-effects model.

The analysis can be performed as follows:

> VRATIO(yi, vi, data = edat1, B = 2000)

The largest influences, according to the variance ratio for the pooled effect estimator, were:

$VRATIO
 id        VR        Q1
 16 0.4778609 0.8862575
 18 0.9616509 0.8924325
 23 0.9649587 0.9395354
 22 1.0233935 0.9240738
  6 1.0392249 0.9198538

For the heterogeneity variance ratio, the largest influences were:

$TAU2RATIO
 id        TR        Q2
 16 0.3774660 0.7671174
 18 0.8994492 0.8143563
 23 0.9273419 0.8990506
 22 0.9869376 0.8787000
  6 0.9991795 0.8641722

In both diagnostics, the 16th study falls below the bootstrap-calibrated lower threshold. This indicates that excluding this study substantially reduces both the variance of the pooled effect estimator and the estimated between-study heterogeneity. The diagnostic result should not be interpreted as automatically justifying exclusion of the study; rather, it indicates that the study should be examined carefully in sensitivity analyses and in relation to its clinical and methodological characteristics.

4.4. Likelihood-Ratio Diagnostic Under a Mean-Shifted Model

The LRT function implements the likelihood-ratio diagnostic based on the mean-shifted model described in Section 2.5. The main arguments and outputs are summarized in Table 4. Because the diagnostic is likelihood-based, model parameters are estimated by maximum likelihood. The user can specify either a fixed-effect or random-effects model through the model argument; the default is the random-effects model.

The analysis can be performed as follows:

> LRT(yi, vi, data = edat1, B = 2000)

The largest likelihood-ratio statistics were:

 id           LR        Q          P
 16 17.711008445 4.452721 0.00000000
 23  3.488316594 4.124637 0.06896552
 18  2.966951627 4.512144 0.11344328
 22  1.363317469 4.207158 0.27686157
  6  1.049898689 4.309350 0.34232884

Here, Q is the 95th percentile of the bootstrap distribution and serves as the bootstrap-calibrated rejection threshold. The 16th study exceeds this threshold and has a bootstrap p-value close to zero, indicating that it is influential under the mean-shifted likelihood-ratio diagnostic. The 23rd and 18th studies have larger likelihood-ratio statistics than the remaining studies, but do not exceed their bootstrap thresholds.
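To see how the bootstrap thresholds relate to the conventional asymptotic reference, the printed values can be compared with the chi-square distribution with 1 degree of freedom. As a sketch, the values below are re-entered by hand; the data frame `lrt_out` and its added columns are hypothetical names, not part of the package output.

```r
# Values copied from the printed output above.
lrt_out <- data.frame(
  id = c(16, 23, 18, 22, 6),
  LR = c(17.711008445, 3.488316594, 2.966951627, 1.363317469, 1.049898689),
  Q  = c(4.452721, 4.124637, 4.512144, 4.207158, 4.309350),
  P  = c(0.00000000, 0.06896552, 0.11344328, 0.27686157, 0.34232884)
)

chisq_cut <- qchisq(0.95, df = 1)     # asymptotic 5% critical value

# Asymptotic p-values for comparison with the bootstrap p-values in column P.
lrt_out$P_chisq <- pchisq(lrt_out$LR, df = 1, lower.tail = FALSE)

lrt_out$flag_boot  <- lrt_out$LR > lrt_out$Q     # bootstrap-calibrated rule
lrt_out$flag_chisq <- lrt_out$LR > chisq_cut     # asymptotic rule
```

For the five studies shown, every bootstrap threshold exceeds the chi-square critical value of about 3.84, so the calibrated rule is somewhat more conservative here, although both rules flag only the 16th study.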

Across the residual-based, variance-ratio, heterogeneity-ratio, and likelihood-ratio diagnostics, the 16th study was consistently flagged as potentially influential. The diagnostics do not identify the reason for this discrepancy; rather, they indicate that this study should be examined more closely with respect to clinical and methodological features, such as the intervention protocol, comparator treatment, outcome measurement, participant characteristics, risk of bias, and possible data-extraction issues.

A sensitivity analysis excluding this study yielded a pooled mean difference of −1.61 (95% CI: −4.74 to 1.51), with I² = 80%, τ² = 39.16, and p < 0.01 by Cochran’s Q test. This sensitivity analysis illustrates how influence diagnostics can clarify the extent to which a single study contributes to the magnitude of the pooled effect and the estimated heterogeneity. The purpose of the diagnostic procedure is not to mandate exclusion but to provide transparent quantitative evidence for robustness assessments.

5. Concluding Remarks

Meta-analysis is widely used to support clinical guidelines, health technology assessment, public health policy, and evidence-based decision-making. Because included studies often differ in design, populations, interventions, outcome definitions, and quality, careful assessment of heterogeneity and influence is essential. Conventional influence diagnostics for meta-analysis provide valuable tools, but their interpretation often relies on informal or asymptotic thresholds that may be inadequate in realistic settings.

The boutliers package provides an accessible implementation of bootstrap-calibrated influence diagnostics for meta-analysis. By supplying empirical reference distributions, critical values, and bootstrap p-values, the package helps users move beyond informal diagnostic cutoffs and supports more transparent robustness assessments. The residual-based, variance-ratio, heterogeneity-ratio, and likelihood-ratio diagnostics implemented in the package evaluate different aspects of study influence and should be interpreted jointly with substantive knowledge about the included studies.

In applied systematic reviews, these diagnostics can help investigators identify studies or effect-size rows that disproportionately affect the pooled estimate, its precision, or the estimated heterogeneity. When a study is flagged as potentially influential, investigators should not automatically exclude it. A recommended workflow is to first check the extracted data and variance calculation, then examine the study’s eligibility, risk of bias, population, intervention, comparator, and outcome definition. If the study remains eligible, investigators may report sensitivity analyses with and without the study and discuss the impact on the robustness of the conclusions.

It is also important to emphasize that boutliers is an influence and outlier diagnostic tool, not a publication-bias detector. A study may be influential because of its effect size, precision, or contribution to heterogeneity, but this does not by itself imply publication bias, selective reporting, or data irregularity. Conversely, publication bias or small-study effects may be present even when no single study is flagged as influential. Therefore, the diagnostics implemented in boutliers should be regarded as complementary to, rather than a replacement for, standard assessments of small-study effects, reporting bias, and publication bias. Future methodological work may extend these tools to more complex meta-analytic settings, including multivariate meta-analysis, network meta-analysis, and individual participant data meta-analysis.

Author Contributions

Conceptualization, H.N.; methodology, H.N., K.M. and M.G.; software, H.N.; validation, H.N.; formal analysis, H.N.; investigation, H.N. and K.M.; resources, H.N.; data curation, H.N.; writing—original draft preparation, H.N.; writing—review and editing, H.N., K.M. and M.G.; visualization, H.N.; project administration, H.N. and M.G.; funding acquisition, H.N. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by a Grant-in-Aid for Scientific Research from the Japan Society for the Promotion of Science (grant numbers: JP23K24811, JP23K11931, and JP26K02873).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original dataset used in Section 4 is available in the boutliers package on CRAN.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Egger, M.; Higgins, J.P.; Smith, G.D. Systematic Reviews in Health Research: Meta-Analysis in Context, 3rd ed.; BMJ Books: London, UK, 2022.
2. Higgins, J.P.T.; Thomas, J. Cochrane Handbook for Systematic Reviews of Interventions, 2nd ed.; Wiley-Blackwell: Chichester, UK, 2019.
3. Borenstein, M.; Hedges, L.V.; Higgins, J.P.T.; Rothstein, H.R. Introduction to Meta-Analysis, 2nd ed.; Wiley: Chichester, UK, 2021.
4. Viechtbauer, W.; Cheung, M.W. Outlier and influence diagnostics for meta-analysis. Res. Synth. Methods 2010, 1, 112–125.
5. Belsley, D.A.; Kuh, E.; Welsch, R.E. Regression Diagnostics; Wiley: New York, NY, USA, 1980.
6. Cook, R.D.; Weisberg, S. Residuals and Influence in Regression; Chapman & Hall: New York, NY, USA, 1982.
7. Weisberg, S. Applied Linear Regression, 4th ed.; John Wiley & Sons: Hoboken, NJ, USA, 2014.
8. Negeri, Z.F.; Beyene, J. Statistical methods for detecting outlying and influential studies in meta-analysis of diagnostic test accuracy studies. Stat. Methods Med. Res. 2020, 29, 1227–1242.
9. Noma, H.; Gosho, M.; Ishii, R.; Oba, K.; Furukawa, T.A. Outlier detection and influence diagnostics in network meta-analysis. Res. Synth. Methods 2020, 11, 891–902.
10. Aoki, M.; Noma, H.; Gosho, M. Methods for detecting outlying regions and influence diagnosis in multi-regional clinical trials. Biostat. Epidemiol. 2021, 5, 30–48.
11. Nakamura, R.; Noma, H. Detection of outlying centers and influence diagnostics for the analysis of multicenter clinical trials. Jpn. J. Biom. 2021, 41, 117–136.
12. Efron, B.; Tibshirani, R. An Introduction to the Bootstrap; CRC Press: New York, NY, USA, 1994.
13. DerSimonian, R.; Laird, N.M. Meta-analysis in clinical trials. Control Clin. Trials 1986, 7, 177–188.
14. Paule, R.C.; Mandel, J. Consensus values and weighting factors. J. Res. Natl. Bur. Stand. 1982, 87, 377–385.
15. Sidik, K.; Jonkman, J.N. Simple heterogeneity variance estimation for meta-analysis. J. R. Stat. Soc. C 2005, 54, 367–384.
16. Veroniki, A.A.; Jackson, D.; Bender, R.; Kuss, O.; Langan, D.; Higgins, J.P.T.; Knapp, G.; Salanti, G. Methods to calculate uncertainty in the estimated overall effect size from a random-effects meta-analysis. Res. Synth. Methods 2019, 10, 23–43.
17. IntHout, J.; Ioannidis, J.P.; Borm, G.F. The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC Med. Res. Methodol. 2014, 14, 25.
18. Rubinstein, S.M.; de Zoete, A.; van Middelkoop, M.; Assendelft, W.J.J.; de Boer, M.R.; van Tulder, M.W. Benefits and harms of spinal manipulative therapy for the treatment of chronic low back pain: Systematic review and meta-analysis of randomised controlled trials. BMJ 2019, 364, l689.

Figure 1. Forest plot for the meta-analysis of chronic low back pain [18].

Table 1. Description of the functions contained in the boutliers package.

Function	Description
STR	Studentized deleted residuals are calculated through leave-one-study-out analysis. In addition, bootstrap distributions of these residuals are obtained, and their percentiles are used to provide quantitative evaluations of the influence of individual studies.
VRATIO	Variance ratio statistics are computed from leave-one-study-out analyses of the estimators for the grand mean variance and the heterogeneity variance. In addition, bootstrap distributions of these statistics are obtained, and their percentiles are used to quantitatively evaluate the influence of individual studies.
LRT	Likelihood-ratio tests based on the mean-shifted model are performed for each study, and the corresponding bootstrap p-values are provided.

Table 2. Description of the functionalities of the STR function.

	Description
Arguments
y	A vector of the outcome measure estimates (e.g., MD, SMD, log OR, log RR, RD, log HR).
v	A vector of the variance estimates of y.
method	A character string specifying the estimation method (default: REML). The same options as those available for the method argument of the rma function in the metafor package can be used (e.g., FE for the fixed-effect model, SJ for the Sidik–Jonkman method, and PM for the Paule–Mandel method).
data	An optional data frame containing the variables y and v.
B	The number of bootstrap resamples (default: 2000).
alpha	The level that determines which bootstrap percentiles are output: the 100 × 0.5(1 − alpha)th and 100 × (1 − 0.5(1 − alpha))th percentiles. The default is 0.95, which gives the 2.5th and 97.5th percentiles.
seed	A numeric value that determines the random seed for reproducibility (default: 123456).
Value
id	ID of the study.
psi	The Studentized residuals by leave-one-out analysis (Studentized deleted residuals).
Q1	The 100 × 0.5(1 − alpha)th percentile of the bootstrap distribution of the Studentized residual (default: 2.5th percentile).
Q2	The 100 × (1 − 0.5(1 − alpha))th percentile of the bootstrap distribution of the Studentized residual (default: 97.5th percentile).
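
The mapping from alpha to the two reported percentiles can be made concrete with a short base-R sketch. The bootstrap sample `psi_boot` and the observed residual `psi_obs` below are hypothetical stand-ins rather than output from STR, and treating a residual outside the interval [Q1, Q2] as flagged is an assumption about how the percentiles are read.

```r
alpha <- 0.95                                  # default level listed in Table 2

# Lower and upper probabilities implied by alpha: 0.025 and 0.975.
probs <- c(0.5 * (1 - alpha), 1 - 0.5 * (1 - alpha))

set.seed(123456)
psi_boot <- rnorm(2000)   # hypothetical stand-in for B = 2000 bootstrap residuals
psi_obs <- -3.1           # hypothetical observed Studentized deleted residual

# Empirical percentiles playing the role of Q1 and Q2.
q <- quantile(psi_boot, probs = probs, names = FALSE)

# Assumed reading: the study is flagged when its residual lies outside [Q1, Q2].
flagged <- psi_obs < q[1] || psi_obs > q[2]

print(probs)
print(round(q, 2))
print(flagged)
```

With alpha = 0.95 the interval covers the central 95% of the bootstrap distribution, so a residual outside it is extreme relative to the empirical reference distribution rather than to a fixed cutoff such as ±1.96.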

Table 3. Description of the functionalities of the VRATIO function.

	Description
Arguments
y	A vector of the outcome measure estimates (e.g., MD, SMD, log OR, log RR, RD, log HR).
v	A vector of the variance estimates of y.
method	A character string specifying the estimation method (default: REML). The same options as those available for the method argument of the rma function in the metafor package can be used (e.g., FE for the fixed-effect model, SJ for the Sidik–Jonkman method, and PM for the Paule–Mandel method).
data	An optional data frame containing the variables y and v.
B	The number of bootstrap resamples (default: 2000).
alpha	The bootstrap percentile to be output (default: 0.05).
seed	A numeric value that determines the random seed for reproducibility (default: 123456).
Value
id	ID of the study.
VR	The VRATIO statistic (relative change of the variance of the overall estimator) by leave-one-out analysis.
Q1	The 100 × alpha-th percentile of the bootstrap distribution of the VRATIO statistic (default: 5th percentile).
TR	The T2RATIO statistic (relative change of the heterogeneity variance) by leave-one-out analysis.
Q2	The 100 × alpha-th percentile of the bootstrap distribution of the T2RATIO statistic (default: 5th percentile).

Table 4. Description of the functionalities of the LRT function.

	Description
Arguments
y	A vector of the outcome measure estimates (e.g., MD, SMD, log OR, log RR, RD, log HR).
v	A vector of the variance estimates of y.
model	A character string specifying the pooling model (RE: random-effects model, FE: fixed-effect model).
data	An optional data frame containing the variables y and v.
B	The number of bootstrap resamples (default: 2000).
alpha	The significance level (default: 0.05)
seed	A numeric value that determines the random seed for reproducibility (default: 123456).
Value
id	ID of the study.
LR	The likelihood-ratio statistic based on the mean-shifted model.
Q	The 100 × (1 − alpha)th percentile of the bootstrap distribution of the likelihood-ratio statistic (default: 95th percentile).
p	The bootstrap p-value for the likelihood ratio statistic.
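
How the three returned quantities relate to each other can be illustrated with a base-R sketch for the fixed-effect case only. Everything here is an assumption made for illustration: the data are hypothetical, the helpers `wss`, `lr_stat`, and `lr_boot_p` are not boutliers functions, and a simple parametric bootstrap under the null model stands in for the package's resampling scheme. The random-effects case would additionally require re-estimating the heterogeneity variance in each resample.

```r
# Hypothetical data; illustrative numbers only.
y <- c(-2.1, -0.5, 1.2, -3.4, 0.8, -14.0)
v <- c(1.5, 2.0, 1.8, 2.5, 1.2, 2.2)

# Weighted sum of squares about the fixed-effect estimate.
wss <- function(y, v) {
  w <- 1 / v
  sum(w * (y - sum(w * y) / sum(w))^2)
}

# Under the mean-shifted model, study i is fitted exactly, so with known
# variances the likelihood-ratio statistic is the drop in the weighted
# sum of squares when study i is removed.
lr_stat <- function(y, v, i) wss(y, v) - wss(y[-i], v[-i])

# Parametric bootstrap under the null (no mean shift) fixed-effect model.
lr_boot_p <- function(y, v, i, B = 2000, alpha = 0.05, seed = 123456) {
  set.seed(seed)
  w <- 1 / v
  mu0 <- sum(w * y) / sum(w)                    # null-model estimate
  obs <- lr_stat(y, v, i)
  boot <- replicate(B, lr_stat(rnorm(length(y), mu0, sqrt(v)), v, i))
  c(LR = obs,
    Q = unname(quantile(boot, 1 - alpha)),      # bootstrap critical value
    p = mean(boot >= obs))                      # bootstrap p-value
}

res <- lr_boot_p(y, v, i = 6)
print(round(res, 3))
```

In this sketch the p-value is the proportion of bootstrap statistics at least as large as the observed one, and Q is the empirical critical value against which LR can be compared directly.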

