Article

Fully Bayesian Inference for Meta-Analytic Deconvolution Using Efron’s Log-Spline Prior

1 Department of Educational Studies in Psychology, Research Methodology, and Counseling, The University of Alabama, Tuscaloosa, AL 35487, USA
2 Department of Statistics and Data Science, Northwestern University, Evanston, IL 60201, USA
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(16), 2639; https://doi.org/10.3390/math13162639
Submission received: 11 July 2025 / Revised: 10 August 2025 / Accepted: 13 August 2025 / Published: 17 August 2025
(This article belongs to the Section D1: Probability and Statistics)

Abstract

Meta-analytic deconvolution seeks to recover the distribution of true effects from noisy site-specific estimates. While Efron’s log-spline prior provides an elegant empirical Bayes solution with excellent point estimation properties, its plug-in nature yields severely anti-conservative uncertainty quantification for individual site effects—a critical limitation for what Efron terms “finite-Bayes inference.” We develop a fully Bayesian extension that preserves the computational advantages of the log-spline framework while properly propagating hyperparameter uncertainty into site-level posteriors. Our approach embeds the log-spline prior within a hierarchical model with adaptive regularization, enabling exact finite-sample inference without asymptotic approximations. Through simulation studies calibrated to realistic meta-analytic scenarios, we demonstrate that our method achieves near-nominal coverage (88–91%) for 90% credible intervals while matching empirical Bayes point estimation accuracy. We provide a complete Stan implementation handling heteroscedastic observations—a critical feature absent from existing software. The method enables principled uncertainty quantification for individual effects at modest computational cost, making it particularly valuable for applications requiring accurate site-specific inference, such as multisite trials and institutional performance assessment.

1. Introduction

Modern scientific investigations increasingly generate data characterized by numerous parallel estimation problems, where interest extends beyond global summaries to individual-level inference. This paradigm appears across diverse domains: thousands of genes tested for differential expression in genomic studies [1,2,3], hundreds of schools evaluated for educational effectiveness [4,5,6], multiple clinical sites in medical trials [7,8,9,10,11], economic field experiments examining heterogeneous treatment effects across multiple units or contexts [12,13], and cognitive neuroscience applications where individual-level neural parameters must be estimated from noisy fMRI time series [14]. While classical meta-analytic methods excel at estimating population-level parameters—the average effect and its heterogeneity [15]—contemporary applications often demand reliable inference for specific units within the ensemble. Efron [16] terms this challenge finite-Bayes inference: making calibrated probabilistic statements about individual parameters when those parameters are modeled as exchangeable draws from an unknown population distribution.
The distinction between population-level and individual-level inference is fundamental yet often conflated in practice [17]. Consider a multisite educational trial evaluating an intervention across K schools. While policymakers may primarily seek the average treatment effect across all schools, individual school administrators require accurate assessments of their specific school’s effect, complete with well-calibrated uncertainty quantification. This local inference problem—estimating the latent effect parameter Θ_i for a particular school site i based on the noisy observation θ̂_i while borrowing strength from the other K − 1 schools—exemplifies the finite-Bayes paradigm [16,18,19]. Similar inferential challenges arise in personalized medicine (where individual patient effects matter [20]), institutional performance evaluation (where specific hospitals or teachers face high-stakes assessments [21]), and targeted policy interventions (where resources must be allocated based on site-specific estimates [22]).
Recent methodological advances have begun addressing the unique challenges of finite-Bayes inference. Lee et al. [18] demonstrated that optimal estimation strategies depend critically on the specific inferential goal—estimating individual effects, ranking units, or characterizing the effect distribution—and that sufficient between-site information can justify flexible semiparametric approaches like Dirichlet Process mixtures (e.g., [23]) over conventional parametric models. However, their investigation revealed a striking gap: while sophisticated methods exist for point estimation (e.g., [24]), reliable inference for individual effects remains underdeveloped [25,26,27], particularly when the prior distribution must be nonparametrically estimated from the data itself.
This uncertainty quantification challenge lies at the intersection of empirical Bayes and fully Bayesian paradigms. Empirical Bayes methods, which estimate the prior distribution from the data before computing posteriors, have proven remarkably successful for point estimation and compound risk minimization [28]. Yet their “plug-in” nature—treating the estimated prior as fixed when computing posteriors—produces anti-conservative uncertainty assessments that can severely understate the true variability in finite samples [25,29,30]. This limitation becomes particularly acute in the finite-Bayes setting where accurate individual-level inference, not just average performance, is the goal.
Among empirical Bayes approaches, Efron’s log-spline prior [29] stands out for elegantly balancing flexibility and stability. By modeling the log-density of the prior distribution through a low-dimensional spline basis, this framework accommodates diverse shapes—including multimodality, heavy tails, and skewness—while maintaining computational tractability and near-parametric convergence rates. The method has found successful applications in field experiments on hiring discrimination [12], evaluating judicial decisions in the criminal justice system [31], and measuring the effectiveness of digital advertising [13]. However, its development and implementation have remained firmly within the empirical Bayes paradigm, inheriting the fundamental limitation of plug-in inference.
Conventional attempts to address uncertainty quantification within the empirical Bayes framework have proven inadequate for finite-Bayes inference [32]. Efron [29] proposed two approaches: delta-method approximations and parametric bootstrap procedures. While these methods do capture one form of uncertainty, our theoretical analysis reveals that they target a different quantity for individual-level inference. Specifically, they quantify the sampling variability of the posterior mean as an estimator—how much Θ ^ i EB = E [ Θ i | θ ^ i , g ^ ] would vary across hypothetical replications of the entire experiment. This frequentist stability measure, while valid for assessing the estimation procedure, fundamentally differs from the posterior uncertainty about the parameter Θ i given the observed data. For stakeholders requiring probabilistic statements about their site-specific effect—“What is the probability that our school’s effect exceeds the district average?”—the former provides little guidance.
The software landscape further complicates practical implementation. The deconvolveR package (version 1.2-1) [33] implements Efron’s framework but restricts attention to homoscedastic scenarios where all observations share a common variance. Real meta-analyses invariably feature heteroscedastic observations due to varying sample sizes across sites. While the package allows transformation to z-scores as a workaround, an approach taken in the main analysis of hiring discrimination by Kline et al. [12], this fundamentally alters the inferential target from the effect distribution to the distribution of z-scores—a scientifically different quantity. Moreover, the deconvolveR package does not provide clear routines for posterior uncertainty quantification beyond point estimates, leaving practitioners to implement ad hoc solutions.
This paper develops a fully Bayesian extension of Efron’s log-spline prior that preserves its computational advantages while providing calibrated interval estimation for finite-Bayes inference. By embedding the log-spline framework within a hierarchical Bayesian model and treating the shape parameters as random rather than fixed, our approach automatically propagates all sources of uncertainty—including hyperparameter estimation error—into the final posterior distributions. This yields properly calibrated credible intervals for individual effects without post hoc corrections or asymptotic approximations (e.g., [25,32]).
Our contributions are threefold. First, we provide a precise theoretical characterization of different uncertainty concepts in empirical Bayes inference, clarifying why existing approaches fail for individual-level inference. Second, through a simulation study calibrated to realistic meta-analytic scenarios, we demonstrate that our fully Bayesian approach achieves nominal coverage for individual site-specific effects while matching the point estimation accuracy of empirical Bayes. Third, we present a complete, annotated Stan implementation that handles heteroscedastic observations and provides researchers with a practical tool for finite-Bayes inference.
The remainder of this paper is organized as follows: Section 2 establishes the meta-analytic framework and introduces the deconvolution perspective that motivates our approach. Section 3 reviews empirical Bayes estimation of the log-spline prior, emphasizing its limitations for uncertainty quantification. Section 4 develops our fully Bayesian extension, showing how hierarchical modeling naturally resolves the inferential challenges. Section 5 describes our simulation study design, while Section 6 presents detailed results comparing point estimation accuracy and uncertainty calibration across methods. Section 7 demonstrates the practical implications through a reanalysis of firm-level labor market discrimination data from Kline et al. [12]. Section 8 offers discussion and concluding remarks. The appendices provide extensive technical details: Appendix A presents mathematical proofs of the uncertainty decomposition; Appendix B contains the complete Stan implementation; Appendix C, Appendix D and Appendix E examine sensitivity to grid specification, hyperprior choice, and performance in small-K scenarios, respectively.

2. Meta-Analytic Deconvolution Framework

2.1. Basic Setup

Consider K independent sites (or “studies”) indexed by i = 1 , , K , where each site i provides a point estimate θ ^ i of some scientific effect of interest along with its associated design-based standard error σ i . The canonical meta-analytic measurement model [34,35] assumes the following:
\Theta_i \overset{\mathrm{iid}}{\sim} G, \qquad \hat{\theta}_i \mid \Theta_i \sim N(\Theta_i, \sigma_i^2), \qquad i = 1, \ldots, K,    (1)
where Θ i denotes the true site-specific effect, and G is an unknown probability measure on R representing the population distribution of true effects. This hierarchical specification embodies the key meta-analytic assumption that, while individual studies may vary in their true effects Θ i , these effects are drawn from a common population distribution G.
The estimation problem that emerges from this framework is fundamentally one of deconvolution [29]: we observe noisy realizations θ̂_i but seek to recover information about the latent distribution G from which the true effects Θ_i are drawn. This recovered distribution serves multiple critical scientific purposes: (a) estimating the proportion of null or negligible effects, e.g., Pr_G(|Θ| ≤ τ) for some threshold τ; (b) producing shrinkage-based predictors of individual effects Θ_i that borrow strength across sites; (c) forecasting the effect size in a new (future) study; and (d) quantifying between-site heterogeneity beyond classical random-effects summaries.

2.2. Marginal Densities and the Convolution Structure

The statistical challenge becomes clear when we examine the marginal density of the observed estimates. In the simplified case, where all sites share a common measurement variance σ_i ≡ σ, the marginal density of an observation θ̂_i is given by the convolution [33]:
f(x) = \int_{\mathbb{R}} \phi_\sigma(x - \theta)\, g(\theta)\, d\theta, \qquad \phi_\sigma(u) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{-u^2/(2\sigma^2)\},    (2)
where g ( θ ) represents the Lebesgue density of G when it exists. This convolution structure reveals why deconvolution is inherently challenging: the observed data follow the “blurred” density f ( x ) rather than the target density g ( θ ) of scientific interest.
In the more realistic heteroscedastic setting that characterizes most meta-analyses, we must work with a collection of site-specific marginal densities:
f_i(x) = \int_{\mathbb{R}} \phi_{\sigma_i}(x - \theta)\, g(\theta)\, d\theta, \qquad i = 1, \ldots, K.    (3)
Each study contributes its own convolution kernel ϕ σ i ( · ) , reflecting the reality that different studies achieve different levels of precision. This heteroscedastic structure, while more complex, often provides additional information that can aid in the deconvolution process, as studies with smaller standard errors provide sharper views of the underlying distribution G.
The relationship between the unknown prior g ( θ ) and the observable marginal densities { f i ( x ) } lies at the heart of empirical Bayes methodology. Classical approaches have typically focused on estimating these marginal densities directly (f-modeling) and then applying Tweedie’s formula or similar techniques to recover posterior quantities of interest [16,36]. However, directly modeling the prior density g ( θ ) (g-modeling) often provides superior performance, particularly when scientific interest centers on characteristics of the effect distribution itself.

2.3. Ill-Posedness and the Need for Structural Constraints

The operator mapping g ↦ f_i in Equation (3) represents a compact smoothing transformation, and its inverse is therefore severely ill-posed in the sense of Hadamard. This ill-posedness manifests in several problematic ways: small perturbations in the observed data can lead to wildly different estimates of g(θ), and classical nonparametric estimators achieve convergence rates that are logarithmic in the sample size K rather than the more familiar polynomial rates. Specifically, Carroll and Hall [37] demonstrated that nonparametric deconvolution estimators achieve mean-squared error rates of order (log K)^{-1} at best, rendering them impractical for the moderate sample sizes (K ≈ 10–100) typical of meta-analyses or multisite trials. Indeed, even with impressive theoretical underpinnings, NPMLE has long been considered “computationally forbidding” for practical applications [16].
This poor rate of convergence necessitates the introduction of structural constraints or regularization to stabilize the inverse problem. The challenge lies in imposing sufficient structure to achieve reasonable finite-sample performance while retaining enough flexibility to capture the diverse range of effect distributions encountered in practice. Meta-analytic effect distributions commonly exhibit features such as multimodality (reflecting distinct subpopulations of studies), heavy tails (accommodating occasional extreme effects), and skewness (e.g., [12,30]).
While alternative approaches exist for meta-analytic deconvolution, each faces distinct trade-offs that motivate our log-spline approach. Dirichlet Process mixtures [23,38,39] offer flexibility and have shown promise in meta-analytic applications [18,40], but require careful specification of the concentration parameter and can exhibit sensitivity to hyperparameter choices in small samples [18,41]. The NPMLE [42,43], despite elegant theoretical properties, yields discrete mixing distributions and can suffer from non-uniqueness or computational challenges; while recent convex optimization implementations have mitigated some computational burden [44,45], the discrete nature of solutions complicates smooth density inference. Bayesian bootstrap methods [46,47] provide a fully nonparametric alternative but generate discrete posterior draws supported on observed data, requiring additional smoothing for density estimation and lacking the established contraction rate guarantees of spline-based priors. In contrast, the log-spline prior balances flexibility with computational tractability, achieving minimax-optimal posterior contraction rates for smooth densities [48,49] while providing interpretable continuous density estimates [29,50].

2.4. The Efron Log-Spline Prior: Flexible Parametric Regularization

The Efron log-spline prior provides an elegant solution to the tension between flexibility and regularization by modeling the prior density g ( θ ) as a member of a carefully chosen exponential family [29]. Specifically, let T = { θ 1 , , θ m } represent a fine grid spanning the support of G, and consider the exponential family:
g(\theta_j; \alpha) = \exp\{Q_j^\top \alpha - \phi(\alpha)\}, \qquad \phi(\alpha) = \log \sum_{j=1}^{m} \exp\{Q_j^\top \alpha\},    (4)
where Q ∈ ℝ^{m×p} is a fixed basis matrix with rows Q_j (typically constructed from natural cubic splines), and α ∈ ℝ^p is a low-dimensional parameter vector with p ≪ m.
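To make the construction in Equation (4) concrete, the short R sketch below builds a grid, a spline basis Q, and evaluates g(θ; α) as a normalized exponential of Q_j α. The grid bounds, the choice of p = 6 natural-spline basis functions, and the example coefficient vector are illustrative assumptions, not the settings used in the paper.

```r
# Minimal sketch of the log-spline family in Eq. (4); grid, p, and alpha are illustrative.
library(splines)

m     <- 101                                # number of grid points
theta <- seq(-3, 3, length.out = m)         # support grid T
p     <- 6
Q     <- ns(theta, df = p)                  # natural cubic spline basis, an m x p matrix

g_of_alpha <- function(alpha, Q) {
  eta <- as.vector(Q %*% alpha)             # Q_j' alpha at each grid point
  w   <- exp(eta - max(eta))                # subtract the max for numerical stability
  w / sum(w)                                # softmax form: exp{Q_j' alpha - phi(alpha)}
}

alpha_example <- c(1, -0.5, 0.3, 0, 0.2, -0.1)
g <- g_of_alpha(alpha_example, Q)
sum(g)                                      # equals 1: a proper pmf on the grid
```

The normalization in the last step of g_of_alpha plays the role of the cumulant function φ(α), so any choice of α yields a valid probability mass function on the grid.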
This log-spline specification offers several compelling advantages that make it particularly well-suited for meta-analytic applications. Flexibility is achieved through the choice of spline basis functions and the dimension p, allowing the family to interpolate between nearly uniform distributions (when α ≈ 0) and strongly multimodal or heavy-tailed shapes (for appropriate choices of α). Computational tractability follows from the concavity of the log-likelihood in α, ensuring that empirical Bayes estimation converges to a global optimum and that fully Bayesian inference benefits from log-concave posterior distributions that mix efficiently in modern MCMC algorithms. Statistical efficiency is restored through parametric regularization, replacing the (log K)^{-1} curse of nonparametric methods with near-parametric K^{-1/2} convergence rates in moderate sample sizes.
The log-spline framework also provides a natural pathway for inferential validity. Unlike nonparametric approaches that yield discrete estimates with problematic posterior distributions, the log-spline parameterization supports smooth interpolation and coherent probabilistic assessment of estimation uncertainty. This feature proves crucial for finite-sample inference about individual effects Θ_i, where accurate uncertainty quantification often matters more than point estimation accuracy.
The subsequent sections of this paper develop both empirical Bayes and fully Bayesian implementations of the log-spline framework, demonstrating how the latter provides exact finite-sample uncertainty propagation while maintaining the adaptive shrinkage benefits that have made empirical Bayes methods so successful in practice.

3. Empirical Bayes Estimation

The empirical Bayes (EB) paradigm provides a practical middle ground between classical frequentist and fully Bayesian approaches by estimating the prior distribution from the data itself. In this section, we review the theoretical framework for EB estimation in meta-analytic settings, emphasizing both its computational advantages and inherent limitations in uncertainty quantification.

3.1. Estimation Strategy: G-Modeling and the Geometry of α

Following Efron [29], we adopt a g-modeling approach that directly models the prior density g(θ) rather than the marginal density f(x). Specifically, we parameterize g(θ) as a member of a flexible exponential family defined on a finite grid T = {θ_1, …, θ_m} spanning the plausible range of effect sizes, as presented in Equation (4). The choice of basis matrix Q determines the flexibility and smoothness of the estimated prior. Following the recommendation of Efron [29], we construct Q using B-splines evaluated at the grid points T, typically with p ranging from 5 to 10 basis functions.
The role of α merits careful interpretation. Unlike traditional hyperparameters that might represent location or scale, the components of α control the shape of log g(θ) through the basis expansion. To develop geometric intuition, observe that the log-spline prior belongs to the exponential family with natural parameter α, where α parameterizes a path through the probability simplex via the exponential map. When α = 0, we obtain the maximum entropy distribution g(0) = (1/m, …, 1/m), which is uniform on the grid. As ∥α∥ increases, the prior becomes increasingly informative, departing from uniformity in a direction determined by α/∥α∥.
This geometric perspective reveals how α encodes prior information. Decomposing α = r · v, where r = ∥α∥ and ∥v∥ = 1, we see that r controls the “concentration” of the prior, while v determines its “shape.” The eigenvectors of Q^⊤Q define principal modes of variation in log-density space: components of α along dominant eigenvectors produce smooth, unimodal priors, while components along smaller eigenvectors can introduce multimodality or heavy tails. The Fisher information I(α) = Cov_α[Q] reveals that the effective dimensionality of the estimation problem adapts to the data, with the likelihood naturally emphasizing directions where g(α) concentrates mass.

3.2. Penalized Marginal Likelihood Estimation

Given the observed data { θ ^ i , σ i } i = 1 K , we estimate α by maximizing the penalized log marginal likelihood [29]:
\hat{\alpha} = \arg\max_{\alpha \in \mathbb{R}^p} \left\{ \ell(\alpha) - s(\alpha) \right\},    (5)
where the log-likelihood is
\ell(\alpha) = \sum_{i=1}^{K} \log f_i(\hat{\theta}_i; \alpha), \qquad f_i(x; \alpha) = \sum_{j=1}^{m} \phi_{\sigma_i}(x - \theta_j)\, g(\theta_j; \alpha),    (6)
and s ( α ) is a penalty function that regularizes the solution.
The discrete approximation in (6) replaces the integral ∫ φ_{σ_i}(x − θ) g(θ; α) dθ with a Riemann sum over the grid T. For sufficiently fine grids (typically m ≥ 50), this approximation is highly accurate while maintaining computational tractability.
Following Efron [29], we employ a ridge penalty s(α) = c_0∥α∥², with c_0 chosen to contribute approximately 5–10% additional Fisher information relative to the likelihood. This can be formalized by considering the ratio of expected information:
R(\alpha) = \frac{\operatorname{tr}\{\ddot{s}(\alpha)\}}{\operatorname{tr}\{I(\alpha)\}} \approx 0.05 \text{ to } 0.10,    (7)
where I(α) = −E[ℓ̈(α)] is the Fisher information matrix.
A key computational advantage of the log-spline formulation is that ℓ(α) is concave in α. To see this, note that each f_i(x; α) is log-convex in the weights g(θ_j; α), and the exponential family parameterization preserves concavity under the logarithmic transformation. Standard convex optimization algorithms, such as Newton–Raphson or L-BFGS, thus converge rapidly to the global maximum α̂.
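As a concrete illustration of Equations (5)–(7), the R sketch below simulates heteroscedastic observations, forms the kernel matrix of φ_{σ_i}(θ̂_i − θ_j) values, and maximizes the penalized log marginal likelihood with L-BFGS. The simulated data, grid, basis dimension, and the fixed penalty weight c_0 are placeholder assumptions; in practice c_0 would be tuned toward the 5–10% information target rather than fixed at 1.

```r
# Sketch of penalized marginal likelihood estimation, Eqs. (5)-(6); data and c0 are illustrative.
library(splines)
set.seed(1)

K         <- 200
Theta     <- rnorm(K, 0, 1)                        # true effects (toy example)
sigma     <- runif(K, 0.3, 1)                      # heteroscedastic standard errors
theta_hat <- rnorm(K, Theta, sigma)                # observed site estimates

m    <- 101
grid <- seq(-4, 4, length.out = m)
p    <- 6
Q    <- ns(grid, df = p)
A    <- outer(seq_len(K), seq_len(m),
              function(i, j) dnorm(theta_hat[i], grid[j], sigma[i]))  # phi_{sigma_i}(theta_hat_i - theta_j)

g_of_alpha <- function(alpha) {
  eta <- as.vector(Q %*% alpha)
  w   <- exp(eta - max(eta))
  w / sum(w)
}

c0 <- 1                                            # ridge penalty weight (assumed, not tuned)
neg_pen_loglik <- function(alpha) {
  g <- g_of_alpha(alpha)
  f <- as.vector(A %*% g)                          # discrete convolution, Eq. (6)
  -(sum(log(f)) - c0 * sum(alpha^2))               # negative of l(alpha) - s(alpha)
}

fit       <- optim(rep(0, p), neg_pen_loglik, method = "L-BFGS-B")
alpha_hat <- fit$par
g_hat     <- g_of_alpha(alpha_hat)                 # estimated prior on the grid
```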

3.3. Posterior Inference and Shrinkage Estimation

Having obtained α ^ , and hence g ^ ( θ ) = g ( θ ; α ^ ) , we can compute the empirical Bayes posterior distribution for each site i:
\hat{\pi}_{ij} = P\{\Theta_i = \theta_j \mid \hat{\theta}_i, \hat{g}\} = \frac{\phi_{\sigma_i}(\hat{\theta}_i - \theta_j)\, \hat{g}(\theta_j)}{\sum_{k=1}^{m} \phi_{\sigma_i}(\hat{\theta}_i - \theta_k)\, \hat{g}(\theta_k)}, \qquad j = 1, \ldots, m.    (8)
These posterior weights yield the EB point estimate and posterior variance:
\hat{\Theta}_i^{\mathrm{EB}} = \sum_{j=1}^{m} \theta_j\, \hat{\pi}_{ij}, \qquad \widehat{\mathrm{Var}}_{\mathrm{EB}}(\Theta_i \mid \hat{\theta}_i) = \sum_{j=1}^{m} (\theta_j - \hat{\Theta}_i^{\mathrm{EB}})^2\, \hat{\pi}_{ij}.    (9)
The EB estimator Θ̂_i^EB exhibits adaptive shrinkage: observations with large |θ̂_i|/σ_i are shrunk less toward the center of ĝ, while those with small signal-to-noise ratios are shrunk more aggressively. This data-driven shrinkage can be understood through the posterior weight function w_i(θ) = φ_{σ_i}(θ̂_i − θ) ĝ(θ)/f̂_i(θ̂_i), which balances the likelihood contribution against the estimated prior density.
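The plug-in computation in Equations (8)–(9) reduces to a few vectorized operations per site, as the R sketch below shows for a single site; the grid, the stand-in fitted prior g_hat, and the site's data are illustrative placeholders.

```r
# Sketch of the plug-in EB posterior for one site, Eqs. (8)-(9); inputs are placeholders.
grid  <- seq(-4, 4, length.out = 101)
g_hat <- dnorm(grid); g_hat <- g_hat / sum(g_hat)   # stand-in for the fitted log-spline prior

eb_posterior <- function(theta_hat_i, sigma_i, grid, g_hat) {
  w    <- dnorm(theta_hat_i, grid, sigma_i) * g_hat # phi_{sigma_i}(theta_hat_i - theta_j) * g_hat(theta_j)
  pi_i <- w / sum(w)                                # posterior weights pi_hat_{ij}
  mean_eb <- sum(grid * pi_i)                       # EB posterior mean
  var_eb  <- sum((grid - mean_eb)^2 * pi_i)         # EB posterior variance
  list(pi = pi_i, mean = mean_eb, var = var_eb)
}

eb_posterior(theta_hat_i = 1.2, sigma_i = 0.5, grid = grid, g_hat = g_hat)
```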
Efron [16] established that under suitable regularity conditions, the EB estimator achieves near-oracle performance in terms of compound risk. Specifically, if the true prior density g_0 of G lies within the log-spline family, then as K → ∞:
R(\hat{g}, \hat{\Theta}^{\mathrm{EB}}) - R_{g_0} = O(p/K),    (10)
where R(ĝ, Θ̂^EB) = K^{-1} Σ_{i=1}^{K} E[(Θ̂_i^EB − Θ_i)²] is the compound risk, and R_{g_0} is the oracle Bayes risk achieved by knowing g_0. This O(K^{-1}) regret bound, with the dimension p appearing only as a multiplicative constant, demonstrates the efficiency of the parametric g-modeling approach compared to nonparametric alternatives that typically achieve logarithmic rates.

3.4. Uncertainty Quantification for Plug-In EB

A fundamental limitation of the plug-in EB approach described above is that it treats the estimated prior g ^ as fixed when computing posterior quantities. This leads to anti-conservative uncertainty quantification that fails to account for the variability in estimating g from finite data. To understand this issue clearly, we must distinguish between two conceptually different sources of uncertainty:
Definition 1 
(Posterior uncertainty of the parameter). For a given prior g and observation θ̂_i, the posterior variance Var_g(Θ_i ∣ θ̂_i) quantifies uncertainty about the true parameter Θ_i. This is a Bayesian concept representing our updated beliefs about where Θ_i lies after observing data.
Definition 2 
(Sampling uncertainty of the estimator). The frequentist variance Var[Θ̂_i^EB] quantifies how much the point estimate Θ̂_i^EB = E_{ĝ}[Θ_i ∣ θ̂_i] would vary across repeated realizations of the entire dataset {θ̂_i}_{i=1}^{K}. This measures the stability of the estimation procedure itself.
  • The distinction is crucial: Definition 1 concerns uncertainty about the parameter Θ i conditional on observed data and a fixed prior, while Definition 2 concerns uncertainty about our estimate due to having estimated the prior from finite data.
Efron [16,29] addressed the second type of uncertainty using the delta method. Treating Θ ^ i EB as a function of α ^ , the first-order Taylor expansion yields the following:
\mathrm{Var}[\hat{\Theta}_i^{\mathrm{EB}}] \approx (\nabla_\alpha \hat{\Theta}_i^{\mathrm{EB}})^\top \, \widehat{\mathrm{Cov}}(\hat{\alpha}) \, (\nabla_\alpha \hat{\Theta}_i^{\mathrm{EB}}),    (11)
where Ĉov(α̂) = [−ℓ̈(α̂) + s̈(α̂)]^{−1} is obtained from the inverse of the penalized Hessian at convergence. The gradient ∇_α Θ̂_i^EB can be computed analytically using the chain rule:
\frac{\partial \hat{\Theta}_i^{\mathrm{EB}}}{\partial \alpha_k} = \sum_{j=1}^{m} \theta_j\, \frac{\partial \hat{\pi}_{ij}}{\partial \alpha_k},    (12)
where the derivatives of the posterior weights follow from straightforward but tedious algebra involving the exponential family structure.
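In practice, the gradient in Equation (12) can also be approximated numerically, which avoids the tedious analytic algebra. The R sketch below does this with central finite differences; the values of alpha_hat and its covariance matrix, as well as the grid, basis, and the site's data, are illustrative placeholders rather than fitted quantities.

```r
# Numerical sketch of the delta-method variance in Eqs. (11)-(12); all inputs are placeholders.
library(splines)

grid      <- seq(-4, 4, length.out = 101)
Q         <- ns(grid, df = 6)
alpha_hat <- rep(0, 6)                          # stands in for the penalized MLE
cov_alpha <- diag(0.01, 6)                      # stands in for [-l''(a) + s''(a)]^{-1}

eb_mean <- function(alpha, theta_hat_i, sigma_i) {
  eta <- as.vector(Q %*% alpha)
  g   <- exp(eta - max(eta)); g <- g / sum(g)
  w   <- dnorm(theta_hat_i, grid, sigma_i) * g
  sum(grid * w / sum(w))                        # Theta_hat_i^EB as a function of alpha
}

grad_eb <- function(alpha, theta_hat_i, sigma_i, h = 1e-5) {
  sapply(seq_along(alpha), function(k) {
    e <- replace(numeric(length(alpha)), k, h)
    (eb_mean(alpha + e, theta_hat_i, sigma_i) -
     eb_mean(alpha - e, theta_hat_i, sigma_i)) / (2 * h)
  })
}

gr      <- grad_eb(alpha_hat, theta_hat_i = 1.2, sigma_i = 0.5)
var_est <- as.numeric(t(gr) %*% cov_alpha %*% gr)   # delta-method Var[Theta_hat_i^EB]
sqrt(var_est)                                       # delta-method standard error
```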
The delta-method standard error from (11) quantifies the frequentist variability of the EB estimator—it tells us how uncertain our point estimate is, not how uncertain we should be about Θ i itself. This distinction becomes particularly important in finite samples where the uncertainty in estimating g can be substantial. To illustrate this conceptually, consider the decomposition:
\mathrm{Var}_{\mathrm{total}}(\Theta_i \mid \hat{\theta}_i, \mathrm{data}) = \underbrace{\mathrm{Var}_{\hat{g}}(\Theta_i \mid \hat{\theta}_i)}_{\text{plug-in posterior variance}} + \underbrace{\mathrm{Var}[\hat{\Theta}_i^{\mathrm{EB}}]}_{\text{estimator uncertainty}}.    (13)
The plug-in EB approach captures only the first term, while the delta method approximates the second. Neither alone provides a complete picture of our uncertainty about Θ i . A formal proof of this decomposition and its implications for coverage is provided in Appendix A.
Efron [16] also proposed a parametric bootstrap (Type III bootstrap) as an alternative to the delta method. The procedure resamples data from the fitted model:
  • Draw Θ_i^* ∼ ĝ and θ̂_i^* ∣ Θ_i^* ∼ N(Θ_i^*, σ_i²) for i = 1, …, K.
  • Re-estimate α̂^* from {θ̂_i^*, σ_i}_{i=1}^{K}.
  • Compute Θ̂_i^{EB*} = E_{ĝ^*}[Θ_i ∣ θ̂_i], where ĝ^* = g(·; α̂^*).
The empirical distribution of { Θ ^ i EB * } across bootstrap replications estimates the sampling distribution of the EB estimator. However, like the delta method, this still targets estimator uncertainty rather than parameter uncertainty.
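The following R sketch implements this resampling loop end to end; the simulated data, grid, basis, penalty weight, number of bootstrap replications, and the choice of site i = 1 as the example target are all illustrative assumptions.

```r
# Sketch of the Type III parametric bootstrap described above; data and settings are illustrative.
library(splines)
set.seed(2)

K         <- 200
sigma     <- runif(K, 0.3, 1)
theta_hat <- rnorm(K, rnorm(K), sigma)          # toy observed estimates
grid      <- seq(-4, 4, length.out = 101)
Q         <- ns(grid, df = 6)

fit_g <- function(theta_hat, sigma, c0 = 1) {   # penalized g-modeling fit (cf. Section 3.2)
  A <- outer(seq_along(theta_hat), seq_along(grid),
             function(i, j) dnorm(theta_hat[i], grid[j], sigma[i]))
  nll <- function(alpha) {
    eta <- as.vector(Q %*% alpha); g <- exp(eta - max(eta)); g <- g / sum(g)
    -(sum(log(A %*% g)) - c0 * sum(alpha^2))
  }
  alpha <- optim(rep(0, ncol(Q)), nll, method = "L-BFGS-B")$par
  eta   <- as.vector(Q %*% alpha); g <- exp(eta - max(eta)); g / sum(g)
}

g_hat <- fit_g(theta_hat, sigma)                # prior fitted to the original data

boot_means <- replicate(200, {
  Theta_star     <- sample(grid, K, replace = TRUE, prob = g_hat)   # Theta* ~ g_hat
  theta_hat_star <- rnorm(K, Theta_star, sigma)                     # theta_hat* | Theta*
  g_star         <- fit_g(theta_hat_star, sigma)                    # re-estimated prior g_hat*
  w <- dnorm(theta_hat[1], grid, sigma[1]) * g_star                 # posterior at the original theta_hat_1
  sum(grid * w / sum(w))                                            # Theta_hat_1^{EB*}
})
sd(boot_means)   # bootstrap SE of the EB estimator for site 1 (estimator uncertainty)
```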
This fundamental limitation of plug-in EB methods—their inability to fully propagate hyperparameter uncertainty into posterior inference—motivates the fully Bayesian approach developed in the next section. By placing a hyperprior on α and integrating over its posterior distribution, we can obtain posterior intervals for Θ i that properly reflect all sources of uncertainty, achieving nominal coverage in finite samples without ad hoc corrections.

4. Fully Bayesian Estimation

4.1. Finite-Bayes Inference and Site-Specific Effects

The empirical Bayes framework developed in the previous section excels at compound risk minimization and provides computationally efficient shrinkage estimators. However, when scientific interest narrows to specific site-level effects—what Efron [16] terms the finite-Bayes inferential setting—the plug-in nature of empirical Bayes methods becomes problematic. In this setting, the inferential goal shifts from minimizing average squared error across all K sites to providing accurate and calibrated uncertainty statements for individual effects Θ_i [18].
To illustrate this distinction, consider a multisite educational trial where site i_0 represents a particular school of interest. While the empirical Bayes estimator Θ̂_{i_0}^EB may achieve excellent average performance across all sites, stakeholders at site i_0 require not just a point estimate but a full posterior distribution P(Θ_{i_0} ∣ data) that accurately reflects all sources of uncertainty. This posterior should account for both the measurement error in θ̂_{i_0} and the uncertainty inherent in estimating the population distribution G from finite data.
The finite-Bayes perspective recognizes that different inferential goals demand different solutions [18,24]. When the focus narrows to precise and unbiased single-site inference, the anti-conservative uncertainty quantification of plug-in EB methods—which treats g ^ as if it were the true prior—becomes untenable. The fully Bayesian approach developed in this section addresses this limitation by propagating hyperparameter uncertainty through to site-level posterior inference, ensuring that posterior credible intervals achieve their nominal coverage even in finite samples.

4.2. Hierarchical Model Specification

To fully propagate uncertainty from the hyperparameter level to site-specific posteriors, we embed Efron’s log-spline prior within a hierarchical Bayesian framework. The model specification is as follows:
Level 1 (Observation model):
\hat{\theta}_i \mid \Theta_i \sim N(\Theta_i, \sigma_i^2), \qquad i = 1, \ldots, K.    (14)
Level 2 (Population model with discrete approximation):
P\{\Theta_i = \theta_j\} = g_j, \qquad j = 1, \ldots, m,    (15)
where T = { θ 1 , , θ m } is a fine grid, and g = ( g 1 , , g m ) forms a probability mass function on T .
Level 3 (Log-spline prior with adaptive regularization):
\log g = Q\alpha - \phi(\alpha)\, \mathbf{1}, \qquad \phi(\alpha) = \log \sum_{j=1}^{m} \exp(Q_j^\top \alpha),    (16)
where Q ∈ ℝ^{m×p} is the spline basis matrix, and 1 is the m-vector of ones.
Level 4 (Hyperpriors):
\lambda \sim \mathrm{Cauchy}^{+}(0, 5), \qquad \alpha_k \overset{\mathrm{iid}}{\sim} N(0, \lambda^{-1/2}), \qquad k = 1, \ldots, p.    (17)
The key innovation relative to the empirical Bayes approach lies in replacing the fixed ridge penalty parameter c 0 with the adaptive parameter λ . The half-Cauchy prior on λ serves as a weakly informative hyperprior that allows the data to determine the appropriate level of regularization. This specification has several important properties:
  • Scale-invariance: The half-Cauchy prior remains weakly informative across different scales of α , unlike conjugate gamma priors that can become inadvertently informative.
  • Adaptive regularization: The hierarchical structure α_k ∣ λ ∼ N(0, λ^{−1/2}) is equivalent to the penalized likelihood ℓ(α) − (λ/2)∥α∥₂², but with λ estimated from the data rather than fixed.
  • Computational stability: The reparameterization ensures that the posterior distribution remains well-behaved even when some components of α are near zero.
The joint posterior density for all parameters given the observed data D = { θ ^ i , σ i } i = 1 K is as follows:
\pi(\Theta, \alpha, \lambda \mid \mathcal{D}) \propto \prod_{i=1}^{K} \phi_{\sigma_i}(\hat{\theta}_i - \Theta_i) \prod_{i=1}^{K} g(\Theta_i; \alpha) \prod_{k=1}^{p} \phi_{\lambda^{-1/2}}(\alpha_k)\, \pi(\lambda),    (18)
where ϕ τ ( · ) denotes the normal density with standard deviation τ , and π ( λ ) is the half-Cauchy density.
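To make the four-level specification concrete, the R sketch below evaluates the unnormalized log posterior for (α, λ) with each Θ_i summed out over the grid, mirroring Equation (18) after marginalization. The toy data, grid, and basis are illustrative assumptions; the paper's actual implementation is the Stan program in Appendix B, which performs the same marginalization internally.

```r
# Sketch of the marginalized log posterior for (alpha, lambda); data, grid, and basis are toy assumptions.
library(splines)
set.seed(3)

K         <- 100
sigma     <- runif(K, 0.3, 1)
theta_hat <- rnorm(K, rnorm(K), sigma)

grid <- seq(-4, 4, length.out = 101)
Q    <- ns(grid, df = 6)
A    <- outer(seq_len(K), seq_along(grid),
              function(i, j) dnorm(theta_hat[i], grid[j], sigma[i]))

log_posterior <- function(alpha, lambda) {
  eta <- as.vector(Q %*% alpha)
  g   <- exp(eta - max(eta)); g <- g / sum(g)                  # Level 3: log-spline prior on the grid
  loglik   <- sum(log(A %*% g))                                # Levels 1-2 with Theta_i summed out
  logp_a   <- sum(dnorm(alpha, 0, lambda^(-1/2), log = TRUE))  # Level 4: alpha_k ~ N(0, lambda^{-1/2})
  logp_lam <- dcauchy(lambda, 0, 5, log = TRUE) + log(2)       # half-Cauchy(0, 5) density on lambda > 0
  loglik + logp_a + logp_lam
}

log_posterior(alpha = rep(0, 6), lambda = 1)   # evaluate at a reference point
```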

4.3. Marginalization and Posterior Inference

While the joint posterior (18) contains all relevant information, practical inference requires marginalization to obtain site-specific posteriors. The Stan implementation presented in Appendix B achieves this through a two-stage process that exploits the discrete nature of the approximating grid T .
First, for each MCMC iteration yielding parameters ( α ( s ) , λ ( s ) ) , we compute the induced prior weights:
g_j^{(s)} = \frac{\exp(Q_j^\top \alpha^{(s)})}{\sum_{k=1}^{m} \exp(Q_k^\top \alpha^{(s)})}, \qquad j = 1, \ldots, m.    (19)
Then, the posterior distribution for site i given the current hyperparameters is as follows:
\pi_{ij}^{(s)} = P\{\Theta_i = \theta_j \mid \hat{\theta}_i, \alpha^{(s)}, \lambda^{(s)}\} = \frac{\phi_{\sigma_i}(\hat{\theta}_i - \theta_j)\, g_j^{(s)}}{\sum_{k=1}^{m} \phi_{\sigma_i}(\hat{\theta}_i - \theta_k)\, g_k^{(s)}}.    (20)
The key insight is that marginalizing over the posterior distribution of ( α , λ ) automatically integrates over all sources of uncertainty:
P\{\Theta_i = \theta_j \mid \mathcal{D}\} = \int \pi_{ij}(\alpha, \lambda)\, \pi(\alpha, \lambda \mid \mathcal{D})\, d\alpha\, d\lambda \approx \frac{1}{S} \sum_{s=1}^{S} \pi_{ij}^{(s)},    (21)
where S is the number of MCMC samples.
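The Monte Carlo average in Equations (19)–(21) is simple to compute once hyperparameter draws are available, as the R sketch below shows for a single site. The draws of α are simulated stand-ins for MCMC output, and the grid, basis, and site data are toy assumptions.

```r
# Sketch of Monte Carlo marginalization over hyperparameter draws, Eqs. (19)-(21); inputs are stand-ins.
library(splines)
set.seed(4)

grid <- seq(-4, 4, length.out = 101)
Q    <- ns(grid, df = 6)
S    <- 500
alpha_draws <- matrix(rnorm(S * 6, 0, 0.3), S, 6)   # stand-in for posterior draws of alpha

theta_hat_i <- 1.2; sigma_i <- 0.5                  # one site's data (illustrative)
lik_i <- dnorm(theta_hat_i, grid, sigma_i)          # phi_{sigma_i}(theta_hat_i - theta_j)

post_i <- rowMeans(sapply(seq_len(S), function(s) {
  eta <- as.vector(Q %*% alpha_draws[s, ])
  g_s <- exp(eta - max(eta)); g_s <- g_s / sum(g_s) # g_j^(s), Eq. (19)
  w   <- lik_i * g_s
  w / sum(w)                                        # pi_ij^(s), Eq. (20)
}))                                                 # P{Theta_i = theta_j | D}, Eq. (21)

fb_mean <- sum(grid * post_i)                       # fully Bayesian posterior mean
cdf     <- cumsum(post_i)
ci_90   <- c(grid[which(cdf >= 0.05)[1]],           # equal-tailed 90% credible interval on the grid
             grid[which(cdf >= 0.95)[1]])
```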
This marginalization principle extends to any functional of interest. The generated quantities block in the Stan code in Appendix B demonstrates three important examples:
  • Posterior mean: Θ̂_i^FB = Σ_{j=1}^{m} θ_j P{Θ_i = θ_j ∣ D}.
  • Posterior mode (MAP): Θ̂_i^MAP = arg max_{θ_j} P{Θ_i = θ_j ∣ D}.
  • Credible intervals: For a (1 − a) credible interval, we find [L_i, U_i] such that P{L_i ≤ Θ_i ≤ U_i ∣ D} = 1 − a.
Each of these quantities automatically incorporates uncertainty from all levels of the hierarchy. Unlike the plug-in EB approach, which conditions on α ^ , or the delta-method correction, which approximates only first-order uncertainty, the fully Bayesian posterior samples provide exact finite-sample inference under the model.
Furthermore, if we examine the posterior variance of the estimator Θ ^ i FB across MCMC iterations:
\mathrm{Var}_{\mathrm{post}}[\hat{\Theta}_i^{\mathrm{FB}}] = \mathrm{Var}_{(\alpha, \lambda) \mid \mathcal{D}}\left[\sum_{j=1}^{m} \theta_j\, \pi_{ij}\right],    (22)
we recover precisely the quantity approximated by the delta method or Type III bootstrap in the empirical Bayes framework. This demonstrates that the fully Bayesian approach subsumes the uncertainty quantification attempts of empirical Bayes while additionally providing proper posterior inference for the parameters themselves. See Appendix A for a mathematical treatment of how the FB approach captures both sources of uncertainty.

4.4. Advantages over the Empirical Bayes Approach

Table 1 summarizes the key distinctions between empirical Bayes and fully Bayesian implementations of the log-spline framework across multiple dimensions of statistical practice.
The fully Bayesian approach offers compelling advantages when accurate uncertainty quantification for individual effects is paramount. The ability to make calibrated probability statements—for example, P{Θ_i > τ ∣ D} for a clinically relevant threshold τ—without post hoc corrections represents a fundamental improvement over plug-in methods. This advantage becomes particularly pronounced in settings with moderate K (10–50 sites) where hyperparameter uncertainty substantially impacts site-level inference.
Moreover, the FB framework naturally accommodates model extensions that would require substantial theoretical development in the EB setting. Incorporating site-level covariates, modeling correlation structures, or adding mixture components for null effects can be achieved through straightforward modifications to the Stan code, with uncertainty propagation handled automatically by the MCMC machinery.
The primary cost of these advantages is computational: while EB optimization typically completes in seconds, FB inference may require minutes to hours depending on the problem size and desired Monte Carlo accuracy. However, because meta-analyses are rarely time-critical and accurate uncertainty quantification carries substantial scientific value, this computational overhead is generally acceptable.
In summary, fully Bayesian inference aligns exactly with the finite-Bayes inferential goal and produces internally coherent uncertainty statements for individual effects, while offering greater modeling flexibility at only modest additional computational cost. The empirical Bayes approach remains valuable for rapid point estimation and approximate shrinkage, particularly when K is large and hyperparameter uncertainty is negligible. But when calibrated inference for an individual effect is the priority, the fully Bayesian treatment of the log-spline model provides a rigorous and practically achievable solution to the meta-analytic deconvolution problem.

5. Simulation Study

To evaluate the performance of our proposed fully Bayesian approach, we designed and executed a simulation study. The primary objectives were to assess two key dimensions of performance: (1) the accuracy of point estimates in recovering true underlying effects and (2) the validity of the corresponding uncertainty estimates.

5.1. Data-Generating Scenarios

Our simulation framework is built upon the “twin towers” example, a challenging bimodal scenario used by Efron [16] and Narasimhan and Efron [33] to test Empirical Bayes methods. As depicted by the red histograms in Figure 1, the true distribution of site-specific effects, G, is a disjoint bimodal distribution. The true effect sizes, { Θ i } i = 1 K , for K = 1500 sites are sourced from the disjointTheta dataset, available in the deconvolveR R package.
The original analyses in Efron [16] and Narasimhan and Efron [33] were conducted under the assumption of homoscedasticity, where the observed effects θ̂_i are generated as θ̂_i ∼ N(Θ_i, 1). This scenario, with a fixed standard error of one for all sites, represents an idealized case. It does not reflect the more common and complex situations researchers face, where sampling variances are heteroscedastic due to differing sample sizes or within-site dispersions across units.
To create a more realistic and comprehensive evaluation, our simulation design extends this framework to include both homoscedastic and heteroscedastic conditions. We generated the observed data θ̂_i ∼ N(Θ_i, σ_i²) across six distinct scenarios defined by two key factors: the average reliability (I) and the heterogeneity of sampling variances (R).
The first factor, average reliability (I), quantifies the overall signal-to-noise ratio of the observed data. It determines how informative the site-specific estimates θ̂_i are on average. With the between-site variance of the true effects fixed at Var(Θ_i) ≈ 1, the reliability is defined as I = 1/(1 + GM(σ_i²)), where GM(σ_i²) is the geometric mean of the sampling variances. A higher value of I indicates less noisy, more informative data. For instance, I = 0.9 implies that the average sampling variance is only about one-tenth of the true effect variance (GM(σ_i²) ≈ 0.11). The homoscedastic case in Efron [16] with σ_i² = 1 corresponds to a reliability level of I = 0.5.
The second factor, heterogeneity ratio (R), controls the dispersion of the sampling variances σ i 2 across the K sites. Following Paddock et al. [23], we define R as the ratio of the largest to the smallest sampling variance, R = σ max 2 / σ min 2 . The minimum and maximum variances are determined as functions of both I and R:
\sigma_{\max}^2 = \sqrt{R} \cdot \frac{1 - I}{I} \quad \text{and} \quad \sigma_{\min}^2 = \frac{1}{\sqrt{R}} \cdot \frac{1 - I}{I}    (23)
A vector of K sampling variances, {σ_i²}, is then generated by taking the exponential of equally spaced values on the logarithmic scale, ranging from ln(σ²_min) to ln(σ²_max). A value of R = 1 corresponds to the homoscedastic case where all σ_i² are equal, while R > 1 introduces heteroscedasticity. In our study, we examine three levels of reliability (I ∈ {0.5, 0.7, 0.9}) and two levels of heterogeneity (R ∈ {1, 9}), creating a total of six simulation conditions.
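The R sketch below generates the sampling variances directly from these definitions; it is written from the formulas above under the assumption that the √R scaling in Equation (23) is what ties the geometric mean to I while preserving the max/min ratio R, and should be read as a sketch rather than the study's exact generation code.

```r
# Sketch of the sampling-variance design implied by I and R; treat as an illustrative reconstruction.
make_sigma2 <- function(K, I, R) {
  gm     <- (1 - I) / I                    # geometric mean implied by I = 1 / (1 + GM(sigma_i^2))
  s2_max <- sqrt(R) * gm                   # Eq. (23)
  s2_min <- gm / sqrt(R)
  exp(seq(log(s2_min), log(s2_max), length.out = K))   # log-equally-spaced variances
}

sigma2 <- make_sigma2(K = 1500, I = 0.7, R = 9)
max(sigma2) / min(sigma2)                  # equals R
exp(mean(log(sigma2)))                     # geometric mean equals (1 - I) / I
```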
The impact of these factors is visualized in Figure 1 and Figure 2. Figure 1 shows that as reliability I increases (moving down the rows), the distribution of observed effects θ ^ i (gray histograms) becomes less dispersed and more closely resembles the true distribution G (red histograms). When I = 0.5 , substantial noise obscures the bimodal structure, whereas at I = 0.9 , the two “towers” are clearly discernible even in the observed data.
Figure 2 further illustrates these dynamics by plotting the observed effects against the true effects. As I increases, the points cluster more tightly around the identity line, indicating more accurate raw estimates. The effect of R is also clear: in the homoscedastic cases ( R = 1 , right column), the scatter of points around the identity line is uniform. In the heteroscedastic cases ( R = 9 , left column), the degree of scatter varies substantially from site to site, as indicated by the color intensity representing the standard error. For instance, when I = 0.5 and R = 9 , some sites have very precise estimates (low SE), while others are very noisy (high SE), even though the average reliability is the same as the corresponding homoscedastic case. This design allows us to test the robustness of each estimation method under a challenging and realistic range of data-generating conditions.

5.2. Simulation Analysis and Performance Evaluators

We applied two estimation approaches to each of the six simulated datasets: the empirical Bayes (EB) method described in Section 3 and the fully Bayesian (FB) approach presented in Section 4.
The FB approach was implemented using the Stan program detailed in Appendix B, with posterior sampling conducted via CmdStan v2.36.0 (released December 2024). The EB approach utilized the same Stan program but employed CmdStan’s optimize() function to obtain regularized maximum likelihood estimates of α and λ . Specifically, we used the L-BFGS quasi-Newton optimizer without Jacobian adjustment, yielding regularized MLEs rather than MAP estimates. Given the point estimates α ^ and λ ^ , we computed EB posterior means Θ ^ i EB by treating the estimated prior g ^ as fixed, following standard EB practice. The computation of standard errors and 90% confidence intervals for EB estimates via the delta method is detailed in our GitHub repository (https://github.com/joonho112/fully-bayes-efron-prior, accessed on 1 July 2025).
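For readers following along in R, a sketch of this workflow with the cmdstanr interface is shown below. The Stan file name "logspline_deconv.stan" and the fields of the data list are hypothetical stand-ins for the program listed in Appendix B, and the objects theta_hat, sigma, grid, and Q are assumed to have been prepared as described above; the chain and iteration settings are illustrative.

```r
# Sketch of the CmdStan workflow via cmdstanr; file name, data fields, and settings are assumptions.
library(cmdstanr)

stan_data <- list(K = length(theta_hat), theta_hat = theta_hat, sigma = sigma,
                  m = length(grid), theta_grid = grid, p = ncol(Q), Q = as.matrix(Q))

mod <- cmdstan_model("logspline_deconv.stan")        # hypothetical file name for the Appendix B program

fit_fb <- mod$sample(data = stan_data, chains = 4, parallel_chains = 4,
                     iter_warmup = 1000, iter_sampling = 3000)   # fully Bayesian inference (NUTS)

fit_eb <- mod$optimize(data = stan_data, algorithm = "lbfgs",
                       jacobian = FALSE)             # regularized MLE used for the EB analysis
```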
All computations were performed on a standard workstation with 12 CPU cores. Empirical Bayes optimization via L-BFGS completed in less than one second per dataset, while fully Bayesian MCMC inference required approximately 8–9 min per dataset (with variation depending on the specific simulation scenario). These timings include all pre- and post-processing steps.
We note that existing R packages deconvolveR and ebnm [51], while implementing Efron’s log-spline framework, are limited to homoscedastic settings. These packages require transformation to z-scores θ ^ i / σ i for heteroscedastic data, fundamentally altering the deconvolution target from G to the distribution of z-scores. This transformation yields substantially different and non-comparable results, necessitating our custom implementation.
We evaluated two critical dimensions of performance. First, point estimation accuracy was assessed through root mean squared error (RMSE) and correlation between estimates and true values:
\mathrm{RMSE} = \sqrt{\frac{1}{K} \sum_{i=1}^{K} (\hat{\Theta}_i - \Theta_i)^2}, \qquad \rho = \mathrm{Cor}(\hat{\Theta}_i, \Theta_i),    (24)
where Θ ^ i denotes either the posterior mean from FB or EB approaches. Second, uncertainty quantification validity was evaluated primarily through empirical coverage of 90% credible/confidence intervals:
\mathrm{Coverage}_{90} = \frac{1}{K} \sum_{i=1}^{K} \mathbb{I}\{L_i \le \Theta_i \le U_i\},    (25)
where [ L i , U i ] represents the 90% interval for site i.
To provide deeper insight into calibration, we additionally computed the following:
  • Probability Integral Transform (PIT): Under correct specification, PIT_i = Φ((Θ_i − Θ̂_i)/σ̂_i) should follow Uniform(0,1), where σ̂_i is the posterior/estimated standard deviation.
  • Standardized residuals: The z-scores z_i = (Θ_i − Θ̂_i)/σ̂_i should follow N(0, 1) if uncertainty is properly calibrated.
  • Interval Score (IS): Following Gneiting and Raftery [52], we computed
\mathrm{IS}_i = (U_i - L_i) + \frac{2}{0.1}(L_i - \Theta_i)\, \mathbb{I}\{\Theta_i < L_i\} + \frac{2}{0.1}(\Theta_i - U_i)\, \mathbb{I}\{\Theta_i > U_i\},    (26)
which penalizes both width and miscoverage, with lower values indicating better performance.
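All of these evaluation metrics are one-line computations given the true effects, point estimates, standard deviations, and interval limits. The R sketch below assembles them from placeholder inputs, following Equations (24)–(26).

```r
# Sketch of the evaluation metrics in Eqs. (24)-(26); all inputs are placeholder values.
set.seed(5)
K      <- 1500
Theta  <- rnorm(K)                           # true effects (placeholder)
est    <- Theta + rnorm(K, 0, 0.3)           # point estimates (placeholder)
sd_hat <- rep(0.3, K)                        # posterior / estimated SDs (placeholder)
L      <- est - 1.645 * sd_hat               # 90% interval limits (placeholder)
U      <- est + 1.645 * sd_hat

rmse     <- sqrt(mean((est - Theta)^2))                      # RMSE, Eq. (24)
rho      <- cor(est, Theta)                                  # correlation with true effects
coverage <- mean(L <= Theta & Theta <= U)                    # empirical 90% coverage, Eq. (25)
pit      <- pnorm((Theta - est) / sd_hat)                    # PIT values (should be ~ Uniform(0,1))
z        <- (Theta - est) / sd_hat                           # standardized residuals (should be ~ N(0,1))
is90     <- (U - L) +
            (2 / 0.1) * (L - Theta) * (Theta < L) +
            (2 / 0.1) * (Theta - U) * (Theta > U)            # interval score, Eq. (26)
c(rmse = rmse, rho = rho, coverage = coverage, mean_IS = mean(is90))
```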

6. Simulation Results

We present our findings in two parts: first examining the accuracy of point estimates in recovering true site-specific effects, followed by an assessment of the calibration and validity of uncertainty quantification. Throughout, we compare the fully Bayesian (FB) approach with empirical Bayes (EB) across the six simulation scenarios defined by reliability I ∈ {0.5, 0.7, 0.9} and heterogeneity ratio R ∈ {1, 9}.

6.1. Point Estimation Accuracy

Figure 3 provides a comprehensive visual assessment of how both methods recover the true bimodal distribution across varying data quality conditions. The most striking finding is the near-perfect agreement between FB (solid blue) and EB (dashed orange) posterior means across all scenarios, validating the empirical Bayes approximation for point estimation purposes. Both methods demonstrate sophisticated adaptive shrinkage, producing estimates that optimally balance between the noisy observed data (gray histograms) and the true distribution (red histograms).
The effect of reliability I dominates the recovery performance. At low reliability ( I = 0.5 ), both methods appropriately shrink estimates toward a unimodal compromise, barely preserving the bimodal structure. This conservative behavior reflects optimal decision-making under high measurement error: when individual observations are unreliable, the methods correctly pool information toward the center of the prior. As reliability increases to I = 0.9 , the distinct “twin towers” structure emerges clearly in the estimates, closely tracking the true distribution.
Heteroscedasticity shows more subtle effects. Comparing columns in Figure 3, the heteroscedastic cases ( R = 9 ) exhibit slightly better mode separation, particularly at intermediate reliability ( I = 0.7 ). This improvement stems from high-precision sites (those with small σ i ) providing sharp information about the distribution’s structure, effectively compensating for noisier observations. However, this benefit is modest compared to the dominant effect of average reliability.
Figure 4 quantifies these visual patterns through formal performance metrics. The RMSE analysis reveals remarkable consistency across methods: differences between FB, EB, and even MAP estimates are typically less than 0.001. For instance, at the challenging condition of I = 0.5 , R = 1 , the RMSE values are 0.757, 0.759, and 0.763 respectively—essentially indistinguishable for practical purposes. This equivalence extends across all conditions, demonstrating that the computational efficiency of EB (seconds) versus FB (minutes) comes with virtually no sacrifice in point estimation quality.
The dramatic impact of reliability on accuracy is quantified precisely: RMSE decreases from approximately 0.76–0.81 at I = 0.5 to 0.27–0.35 at I = 0.9 , representing a 65% reduction in error. The correlation heatmap reinforces this pattern even more strikingly, with correlations increasing from 0.83–0.85 to 0.97–0.98 across the same range. These near-perfect correlations at high reliability suggest we approach the theoretical limit of estimation accuracy given the discrete grid approximation and finite sample size.
Interestingly, heteroscedasticity’s impact on accuracy metrics is mixed and generally modest. While RMSE sometimes improves slightly under heteroscedasticity (e.g., at I = 0.5 : 0.814 vs. 0.757), correlations show the opposite pattern at low reliability (0.829 vs. 0.853). This suggests a trade-off: the benefit of having some high-precision sites is partially offset by the challenge of appropriately weighting vastly different information sources. At high reliability, these differences become negligible, with all methods achieving correlations exceeding 0.97.
The consistency of results across posterior means and modes (MAP) deserves emphasis. Even these fundamentally different point estimates—one minimizing squared error, the other maximizing posterior probability—yield comparable accuracy in the log-spline framework. This robustness suggests that the adaptive shrinkage inherent in the Bayesian framework, rather than the specific choice of point estimate, drives the excellent performance.
These results powerfully demonstrate that when point estimation is the primary goal, the computationally efficient EB approach sacrifices essentially nothing compared to full MCMC inference. The methods’ ability to recover complex bimodal structure from noisy observations, particularly evident as reliability increases, validates the log-spline framework as a flexible yet stable solution to the deconvolution problem.

6.2. Calibration of Uncertainty Estimates

While point estimation accuracy is crucial, the ability to provide well-calibrated credible intervals distinguishes the fully Bayesian approach from empirical Bayes alternatives. In this section, we evaluate two distinct types of uncertainty quantification: (1) parameter uncertainty about the true effects Θ_i (denoted Var(θ_rep)), and (2) estimator uncertainty about the posterior mean as a point estimate (denoted Var(θ_mean)). The empirical Bayes intervals are computed using the delta method approximation detailed in Section 3.4, while the fully Bayesian approach provides exact finite-sample inference through MCMC marginalization. We now examine whether the nominal 90% credible intervals achieve their stated coverage and explore deeper aspects of inferential validity.
Figure 5 provides an intuitive visualization of interval coverage by displaying 90% credible intervals for 150 randomly sampled sites. The intervals are sorted by true effect size and colored to indicate coverage (black) or non-coverage (blue) of the true values (red triangles). In the top panel ( I = 0.5 , R = 9 ), 83.3% of intervals contain the true values, falling short of the nominal 90% coverage. This mild undercoverage at low reliability reflects the challenge of uncertainty quantification when measurement error dominates. The bottom panel ( I = 0.7 , R = 9 ) achieves 94.0% coverage, slightly exceeding the nominal level. The visual pattern reveals that miscoverage is not systematic—both extreme and moderate effects can fall outside their intervals—suggesting proper calibration rather than systematic bias.
Figure 6 extends this analysis to all 1500 sites across the full range of simulation conditions, revealing a fundamental distinction between two types of uncertainty. The top row displays coverage for θ_rep, representing posterior uncertainty about the parameters themselves. Here, the FB approach (MCMC) consistently achieves coverage near the nominal 90% level, ranging from 88.2% to 91.1%. In stark contrast, the EB approach yields severely anti-conservative coverage, dropping as low as 4.9% at high reliability. This dramatic undercoverage directly results from treating the estimated prior ĝ as fixed, thereby ignoring hyperparameter uncertainty—precisely the limitation identified in Section 3.4 and proven formally in Appendix A.
The bottom row tells a different story. When examining θ_mean—the posterior mean as an estimator—both FB and EB achieve similarly low coverage (6.8% to 13.8%). This apparent failure is not a deficiency but rather reflects that these intervals target estimator uncertainty (how variable the posterior mean is across repeated experiments) rather than parameter uncertainty (where the true Θ_i lies given the data). The near-identical performance of FB and EB for θ_mean confirms that both methods correctly quantify this narrower source of variability, with FB’s marginal posterior variance of the estimator (Equation (22)) matching EB’s delta-method approximation.
Figure 7 provides a more nuanced assessment through calibration plots, examining coverage across all nominal levels from 10% to 95%. Perfect calibration would follow the diagonal line. The FB approach (red circles) adheres remarkably closely to this ideal across all conditions, with only minor deviations at extreme nominal levels. In contrast, the EB approach (green triangles) exhibits severe miscalibration, with empirical coverage plateauing far below nominal levels. This pattern holds across all reliability and heteroscedasticity conditions, confirming that FB’s proper uncertainty propagation yields well-calibrated inference throughout the probability scale.
Figure 8 examines the stability of coverage as evidence accumulates. These sequential coverage plots track the cumulative coverage rate as sites are processed in order. For FB inference, coverage quickly stabilizes near 90% after processing a few hundred sites, with only minor fluctuations thereafter. The most notable exception occurs at I = 0.5 , R = 9 , where coverage drifts downward after site 500, suggesting some sensitivity to the ordering of heteroscedastic observations at low reliability. Nevertheless, all conditions eventually converge to reasonable coverage levels, demonstrating the reliability of FB inference even in finite samples.
Figure 9 presents three additional diagnostic perspectives. The PIT histograms assess whether the predictive distributions are properly calibrated—under correct specification, these should be uniform. The FB results show reasonable uniformity with mild deviations, particularly some overdispersion at low reliability. The Q-Q plots of standardized residuals similarly confirm approximate normality, with minor heavy-tail behavior at the extremes. Most revealing is the bottom panel examining interval width versus estimation error. The strong positive correlation between interval width and absolute error, combined with high coverage for wide intervals (green points), indicates that the FB approach appropriately adapts uncertainty to the available information. Sites with narrow intervals tend to have small errors, while uncertain estimates correctly acknowledge their imprecision through wider intervals.
Taken together, these results establish a clear hierarchy of uncertainty quantification approaches. The FB method with θ_rep provides properly calibrated posterior inference for the parameters themselves, achieving nominal coverage through exact finite-sample marginalization over hyperparameter uncertainty. The EB approach, while computationally attractive, yields severely anti-conservative parameter inference due to its plug-in nature. When the inferential target shifts to estimator uncertainty (θ_mean), both approaches perform equivalently, but such intervals answer a fundamentally different question—how variable is our estimation procedure rather than where does the true parameter lie. For scientific applications requiring honest uncertainty assessment about site-specific effects, the fully Bayesian approach emerges as the principled choice, providing well-calibrated inference at modest additional computational cost.
The calibration results presented here are complemented by extensive sensitivity analyses in the appendices. Appendix C examines the robustness of our findings to grid resolution and bounds specification, while Appendix D investigates sensitivity to the hyperprior choice for λ . These supplementary analyses confirm that the superior uncertainty calibration of the fully Bayesian approach persists across a wide range of implementation choices.

7. Real-Data Application: Firm-Level Labor Market Discrimination

To demonstrate the practical implications of our proposed fully Bayesian framework, we reanalyze data from a large-scale resume correspondence experiment conducted by Kline, Rose, and Walters [12,53]. This experiment, which represents one of the most comprehensive investigations of employment discrimination among major US corporations, provides an ideal setting to evaluate the importance of proper uncertainty quantification for individual-level inference in meta-analytic contexts.

7.1. Data and Empirical Setting

Our analysis draws on the publicly available replication package accompanying Walters [19], focusing specifically on racial discrimination patterns across large US employers. The original experiment submitted fictitious job applications to entry-level positions at 108 Fortune 500 companies, with each firm represented by multiple job vacancies across different US counties. Following the seminal design of Bertrand and Mullainathan [54], applications were randomly assigned racially distinctive names to signal applicant race to employers, with each vacancy receiving four applications with distinctively Black names and four with distinctively white names.
Following Kline et al. [53], we restrict our analysis to K = 97 firms with at least 40 sampled vacancies and overall callback rates exceeding 3 percent, ensuring sufficient statistical power for firm-specific inference. This yields a final sample of 78,910 applications submitted to 10,453 job openings nested within these K firms. The primary outcome of interest is a binary indicator for whether the employer attempted to contact the applicant within 30 days via phone or email. While the original studies also examined gender discrimination, we focus exclusively on racial discrimination to maintain clarity in our methodological exposition. Readers interested in comprehensive details of the experimental design, sampling procedures, and substantive findings are referred to Kline et al. [12,53].

7.2. Estimation Approach

Following Walters [19], we estimate firm-specific discrimination parameters through separate OLS regressions for each employer. Specifically, for each firm i { 1 , , K } , we fit the regression:
$$Y_{ij} = \alpha_i + \Theta_i \cdot \mathbf{1}\{R_{ij} = \text{white}\} + e_{ij},$$
where $Y_{ij}$ indicates whether application $j$ to firm $i$ received a callback, $R_{ij} \in \{\text{white}, \text{Black}\}$ denotes the racial distinctiveness of the assigned name, $\alpha_i$ captures the baseline callback rate for applications with distinctively Black names at firm $i$, and $\Theta_i$ represents the differential treatment effect of distinctively white versus Black names. Since race assignment was randomized within each job vacancy, $\Theta_i$ can be interpreted as the causal effect of a distinctively white name on callback probability at firm $i$. Standard errors are clustered at the job level to account for within-job correlation in callback decisions and the stratified randomization design.
This estimation procedure yields firm-specific estimates θ ^ i with associated standard errors σ i for i = 1 , , K . These estimates correspond exactly to the observed data in our meta-analytic framework from Equation (1):
$$\hat{\theta}_i \mid \Theta_i \sim N(\Theta_i, \sigma_i^2),$$
where Θ i represents the true discrimination parameter for firm i.
We then apply our fully Bayesian log-spline framework to the collection of firm-specific estimates { θ ^ i , σ i } i = 1 K . The implementation follows the hierarchical specification detailed in Section 4, with the prior distribution G modeled through a log-spline density on a grid of 101 points spanning from −0.05 to 0.15. The spline basis employs M = 6 degrees of freedom, providing sufficient flexibility to capture potential multimodality or skewness in the distribution of discrimination across firms. We use Stan’s NUTS sampler with four chains of 3000 iterations each (following 1000 warmup iterations), yielding 12,000 posterior draws for inference. For comparison, we also compute empirical Bayes estimates using the delta method for uncertainty quantification, following the procedures outlined in Section 3.4.
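For readers who wish to reproduce this setup, the following R sketch illustrates how the firm-level inputs might be assembled into the data list expected by the Stan program in Appendix B. The data frame firm_estimates and its column names are placeholders for illustration, not objects from the replication package; the grid bounds and basis dimension follow the specification described above.
library(splines)

theta_hat <- firm_estimates$theta_hat   # firm-level OLS estimates (hypothetical object)
sigma     <- firm_estimates$sigma       # job-clustered standard errors

grid <- seq(-0.05, 0.15, length.out = 101)    # support for G, as described above
B    <- ns(grid, df = 6, intercept = FALSE)   # natural cubic spline basis (M = 6)

stan_data <- list(
  K = length(theta_hat),
  theta_hat = theta_hat,
  sigma = sigma,
  L = length(grid),
  grid = grid,
  M = ncol(B),
  B = B
)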

7.3. Posterior Uncertainty Versus Estimator Uncertainty

Figure 10 presents a comprehensive visualization of firm-specific discrimination estimates ordered by their posterior means. The stark contrast between the fully Bayesian and empirical Bayes approaches is immediately apparent in the width of the uncertainty intervals. The 90% credible intervals from the fully Bayesian approach (top panel) are substantially wider than the 90% confidence intervals from the empirical Bayes delta method (bottom panel), reflecting the fundamental distinction emphasized throughout this paper: the FB approach characterizes posterior distributions for the discrimination parameters themselves, while the EB delta method captures only the sampling variability of the posterior mean as an estimator.
This difference has profound implications for inference. Under the fully Bayesian approach, 42 of 97 firms (43%) have credible intervals that exclude zero, providing evidence of racial discrimination at the 90% credible level. In contrast, the empirical Bayes approach with its narrower intervals would classify 87 firms (89%) as discriminating, potentially overstating our certainty about firm-specific behavior. This discrepancy underscores the importance of proper uncertainty propagation when scientific interest centers on individual units rather than ensemble properties.

7.4. Identifying Firms with Strongest Evidence of Discrimination

Figure 11 focuses on the firms in the top and bottom quintiles of the discrimination distribution, providing detailed forest plots that leverage a unique advantage of the fully Bayesian framework. Because our approach yields complete posterior distributions for each firm’s discrimination parameter, we can directly compute the posterior probability that firm i discriminates against Black applicants:
$$P(\Theta_i > 0 \mid \mathcal{D}) = \int_0^{\infty} \pi(\Theta_i \mid \mathcal{D})\, d\Theta_i,$$
where π ( Θ i D ) represents the marginal posterior distribution integrating over all hyperparameter uncertainty. This quantity provides an interpretable and decision-relevant measure of evidence for discrimination at each firm, displayed alongside firm names in the figure.
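Because the Stan program in Appendix B stores one posterior draw of each firm's effect per MCMC iteration (theta_rep), this probability can be estimated directly as the fraction of draws above zero. A minimal R sketch, assuming fit is the cmdstanr fit object for this model:
library(posterior)

theta_draws <- as_draws_matrix(fit$draws("theta_rep"))   # 12,000 x 97 matrix of draws
p_discrim   <- colMeans(theta_draws > 0)                 # estimate of P(Theta_i > 0 | D)
head(sort(p_discrim, decreasing = TRUE), 10)             # firms with strongest evidence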
Among the top 20% of firms, the posterior probabilities of discrimination are uniformly high, with most exceeding 0.99. These firms, including major retailers and automotive companies, show compelling evidence of favoring white applicants. Notably, the credible intervals for these firms exclude zero by substantial margins, reinforcing the strength of evidence. In contrast, the bottom 20% of firms exhibit considerable heterogeneity in evidential strength despite their similar rankings. While Dr Pepper shows a posterior probability of only 0.28, suggesting limited evidence against the null hypothesis of no discrimination, firms like Target (0.79) and CBRE (0.91) have relatively high posterior probabilities despite being classified in the least discriminatory quintile. This nuanced picture, unavailable through point estimates alone, illustrates how the fully Bayesian framework enables more sophisticated decision-making by quantifying the strength of evidence for each individual firm.

7.5. Distribution of Discrimination Across Firms

Figure 12 compares the distribution of raw OLS estimates with the posterior mean estimates after shrinkage, revealing the extent of statistical noise in the firm-specific estimates. The raw estimates exhibit a mean of 0.021 with a standard deviation of 0.024, indicating that white applicants receive callbacks at rates 2.1 percentage points higher than Black applicants on average. However, after applying shrinkage through our fully Bayesian framework, the standard deviation of posterior means reduces to 0.011.
This variance reduction from 0.024 to 0.011 implies that $(1 - (0.011/0.024)^2) \times 100 \approx 79\%$ of the observed variance in firm-specific OLS estimates is attributable to statistical noise rather than true heterogeneity in discriminatory behavior. This finding aligns closely with Walters [19], though our fully Bayesian approach provides the additional benefit of properly calibrated uncertainty for each firm. Crucially, even after accounting for sampling noise, substantial and economically meaningful heterogeneity in discrimination remains across firms. The standard deviation of 1.1 percentage points (0.011) indicates that a firm one standard deviation above the average penalty of 1.7%p would exhibit a penalty of 2.8%p. This represents an approximately 65% more severe penalty than the average firm ((2.8%p − 1.7%p)/1.7%p), a difference with significant implications for job seekers and policy interventions.
The preservation of the distribution’s right skew after shrinkage suggests that discrimination is not uniformly distributed across employers but rather concentrated among a subset of firms with particularly strong biases. This pattern, clearly visible in both the raw and shrunk distributions, underscores the value of firm-specific analysis over simple aggregate measures and highlights the importance of targeted enforcement efforts focused on the most discriminatory employers.

8. Discussion and Conclusions

This paper has introduced and validated a fully Bayesian hierarchical model for Efron’s log-spline prior, a powerful semi-parametric tool for large-scale inference. Our work is motivated by a common and critical scientific objective: the need for accurate and reliable inference on individual parameters within a large ensemble, a goal Efron [16] terms “finite-Bayes inference.” By embedding the g-modeling framework within a coherent Bayesian structure, we have demonstrated a principled path to overcoming the limitations of standard plug-in Empirical Bayes (EB) methods, particularly in the crucial domain of uncertainty quantification.
Our simulation studies confirm that for the task of point estimation, the traditional EB approach performs admirably. When the inferential goal is limited to obtaining shrinkage estimates, the penalized likelihood optimization of the log-spline prior is computationally efficient and produces accurate posterior means, especially when the signal-to-noise ratio is high. This reinforces the status of EB as a valuable tool for exploratory analysis.
The primary contribution of this work, however, lies in the clarification and resolution of how uncertainty is quantified. A central finding of this paper is the sharp conceptual and empirical distinction between two different forms of uncertainty: the sampling variability of an estimator and the posterior uncertainty of a parameter. We have shown that standard EB error estimation methods [29], such as the Delta method or the parametric bootstrap, address the former. They quantify the frequentist stability of the posterior mean estimator, Θ ^ i EB , across hypothetical replications of an experiment. While this is a valid measure of procedural reliability, it does not provide what is often required for finite-Bayes inference: a credible interval representing our state of knowledge about the true, unobserved site-specific parameter Θ i .
Our proposed fully Bayesian (FB) approach directly resolves this ambiguity. By treating the prior’s shape parameters ( α ) and regularization strength ( λ ) as random variables, the MCMC sampling procedure naturally propagates all sources of uncertainty into the final posterior distribution for each Θ i . As our simulation results demonstrate, the resulting 90% credible intervals are well-calibrated, achieving near-nominal coverage across a wide range of conditions. This “one-stop” procedure provides a direct and theoretically coherent solution for researchers, eliminating the need for complex, secondary approximation steps [25,32] that are not yet standard in widely used software packages for EB estimation.
Beyond providing well-calibrated intervals, the FB framework offers significant downstream advantages. The availability of a full posterior sample for each Θ i empowers researchers to move beyond simple interval estimation. It enables principled decision-making under any specified loss function, far beyond the implicit squared-error loss of the posterior mean. This opens the door to a host of sophisticated analyses, including multiple testing control via the local false discovery rate [55], estimation of the empirical distribution function of the true effects [27], and rank-based inference [56], all while properly accounting for the full range of uncertainty.
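As an illustration of such downstream analyses, the R sketch below computes two posterior functionals from an S × K matrix of posterior draws of the site effects (for example, the theta_rep draws produced by the Stan program in Appendix B). The object name theta_draws and the cutoff value are illustrative assumptions.
# theta_draws: S x K matrix of posterior draws of the site effects (assumed available)
c0 <- 0.02                                        # illustrative cutoff on the effect scale

# Posterior mean of the empirical distribution function of true effects at c0
edf_at_c0 <- mean(rowMeans(theta_draws <= c0))

# Posterior expected rank of each site (1 = smallest effect), propagating all uncertainty
ranks_by_draw <- t(apply(theta_draws, 1, rank))   # S x K matrix of within-draw ranks
expected_rank <- colMeans(ranks_by_draw)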
The principal trade-off for these benefits is computational cost. MCMC sampling is inherently more intensive than the optimization routines used in EB. While modern hardware and efficient samplers like NUTS make this approach feasible for many problems, its cost may be a consideration in extremely large-scale applications. This suggests a promising avenue for future work: evaluating the performance of faster, approximate Bayesian inference methods—such as Automatic Differentiation Variational Inference (ADVI; [57]) or the Pathfinder algorithm [58]—within this hierarchical g-modeling context. Assessing how well these methods can replicate the accuracy and calibration of full MCMC would be a valuable contribution, potentially offering a practical compromise between the speed of EB and the inferential completeness of the FB approach.
In conclusion, this study reaffirms the power of Efron’s log-spline prior as a flexible tool for large-scale estimation. More importantly, it demonstrates that by placing this tool within a fully Bayesian framework, we can produce more reliable, interpretable, and useful uncertainty estimates. When the scientific priority is to understand the plausible range of individual site-specific effects, the fully Bayesian approach provides a robust and theoretically grounded solution.

Author Contributions

Conceptualization, J.L. and D.S.; methodology, J.L. and D.S.; software, J.L. and D.S.; validation, J.L. and D.S.; formal analysis, J.L.; investigation, J.L. and D.S.; resources, J.L.; data curation, J.L. and D.S.; writing—original draft preparation, J.L.; writing—review and editing, J.L. and D.S.; visualization, J.L.; supervision, J.L.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D240078 to the University of Alabama. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education. The APC was funded by the same grant.

Institutional Review Board Statement

Not applicable. This study involved statistical analysis and simulation only, with no human participants or animals.

Informed Consent Statement

Not applicable.

Data Availability Statement

The simulation study presented in this manuscript is based on the “twin towers” example data, which is publicly available in the deconvolveR R package. The real-data application examining firm-level labor market discrimination uses data from the publicly available replication package accompanying Walters [19], accessible at <https://sites.google.com/view/christopher-walters/research> (accessed on 1 July 2025). All R and Stan code used to generate the results and figures in this study is available in a public repository at <https://github.com/joonho112/fully-bayes-efron-prior> (accessed on 1 July 2025).

Acknowledgments

The authors would like to thank the Institute of Education Sciences for their support of this research. We are also grateful to the developers of the deconvolveR R package for making their simulated data publicly available, which facilitated the comparative simulation analysis presented in this work. We thank Christopher Walters for generously providing public access to the replication data from Kline et al. [12,53], which enabled the real-data application examining firm-level labor market discrimination.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Mathematical Results on Uncertainty Decomposition

This appendix provides a formal mathematical justification for the uncertainty quantification results presented in Section 3.4 and Section 4.3. We establish precise conditions under which empirical Bayes intervals exhibit anti-conservative coverage and demonstrate how the fully Bayesian approach naturally captures all sources of uncertainty.

Appendix A.1. Formal Setup and Notation

Consider the hierarchical model from Equation (1):
$$\Theta_i \overset{\mathrm{iid}}{\sim} G, \qquad \hat{\theta}_i \mid \Theta_i \sim N(\Theta_i, \sigma_i^2)$$
Let $\mathcal{D} = \{\hat{\theta}_i, \sigma_i\}_{i=1}^{K}$ denote the full dataset. In the log-spline framework, the prior $G$ is parameterized by $\alpha \in \mathbb{R}^p$, yielding the density $g(\theta; \alpha)$ on the discretized grid $\mathcal{T}$.
Definition A1 
(Posterior distributions). Define:
  • $\pi(\theta_i \mid \hat{\theta}_i, \alpha)$: the posterior distribution of $\Theta_i$ given fixed hyperparameter $\alpha$
  • $\pi(\theta_i \mid \mathcal{D})$: the marginal posterior distribution integrating over the posterior of $\alpha$
  • $\hat{\pi}(\theta_i \mid \hat{\theta}_i, \hat{\alpha})$: the plug-in EB posterior using the MLE $\hat{\alpha}$
Definition A2 
(Variance functionals). For a distribution π on Θ i , define:
  • V [ π ] = Var π ( Θ i ) : the variance under distribution π
  • m [ π ] = E π [ Θ i ] : the mean under distribution π

Appendix A.2. Main Decomposition Theorem

Theorem A1 
(Uncertainty Decomposition). Under regularity conditions ensuring the existence of all required moments, the total posterior variance can be decomposed as:
$$V[\pi(\cdot \mid \mathcal{D})] = E_{\alpha \mid \mathcal{D}}\big[V[\pi(\cdot \mid \hat{\theta}_i, \alpha)]\big] + \mathrm{Var}_{\alpha \mid \mathcal{D}}\big[m[\pi(\cdot \mid \hat{\theta}_i, \alpha)]\big]$$
Proof. 
By the law of total variance, for any random variables X and Y:
$$\mathrm{Var}(X) = E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}[E[X \mid Y]]$$
Apply this with $X = \Theta_i$ and $Y = \alpha$, conditioning on $\mathcal{D}$:
$$\mathrm{Var}(\Theta_i \mid \mathcal{D}) = E_{\alpha \mid \mathcal{D}}[\mathrm{Var}(\Theta_i \mid \hat{\theta}_i, \alpha, \mathcal{D})] + \mathrm{Var}_{\alpha \mid \mathcal{D}}[E[\Theta_i \mid \hat{\theta}_i, \alpha, \mathcal{D}]]$$
Since $\Theta_i \perp (\mathcal{D} \setminus \{\hat{\theta}_i\}) \mid (\hat{\theta}_i, \alpha)$ by the hierarchical structure:
$$\mathrm{Var}(\Theta_i \mid \hat{\theta}_i, \alpha, \mathcal{D}) = \mathrm{Var}(\Theta_i \mid \hat{\theta}_i, \alpha) = V[\pi(\cdot \mid \hat{\theta}_i, \alpha)]$$
Similarly:
$$E[\Theta_i \mid \hat{\theta}_i, \alpha, \mathcal{D}] = E[\Theta_i \mid \hat{\theta}_i, \alpha] = m[\pi(\cdot \mid \hat{\theta}_i, \alpha)]$$
This completes the proof.    □

Appendix A.3. Anti-Conservative Nature of Empirical Bayes Intervals

Theorem A2 
(EB Coverage Deficiency). Let $\mathrm{CI}_{\mathrm{EB}}^{1-\alpha}$ be the $(1-\alpha)$ credible interval based on $\hat{\pi}(\cdot \mid \hat{\theta}_i, \hat{\alpha})$. Then:
$$\liminf_{K \to \infty} P\big(\Theta_i \in \mathrm{CI}_{\mathrm{EB}}^{1-\alpha}\big) \le 1 - \alpha,$$
with equality if and only if $\mathrm{Var}_{\alpha \mid \mathcal{D}}\big[m[\pi(\cdot \mid \hat{\theta}_i, \alpha)]\big] = 0$.
Proof. 
The EB credible interval is constructed as follows:
$$\mathrm{CI}_{\mathrm{EB}}^{1-\alpha} = \big\{\, t : F_{\hat{\alpha}}(t \mid \hat{\theta}_i) \in [\alpha/2,\; 1-\alpha/2] \,\big\},$$
where $F_{\alpha}(t \mid \hat{\theta}_i) = P(\Theta_i \le t \mid \hat{\theta}_i, \alpha)$.
By Theorem A1:
$$V[\pi(\cdot \mid \mathcal{D})] = V[\hat{\pi}(\cdot \mid \hat{\theta}_i, \hat{\alpha})] + E_{\alpha \mid \mathcal{D}}\big[V[\pi(\cdot \mid \hat{\theta}_i, \alpha)] - V[\hat{\pi}(\cdot \mid \hat{\theta}_i, \hat{\alpha})]\big] + \mathrm{Var}_{\alpha \mid \mathcal{D}}\big[m[\pi(\cdot \mid \hat{\theta}_i, \alpha)]\big]$$
Under regularity conditions, $\hat{\alpha} \xrightarrow{p} \alpha_0$ implies the middle term vanishes asymptotically. However, the last term remains positive unless the posterior mean is constant in $\alpha$, which occurs only in degenerate cases.
Since $\mathrm{CI}_{\mathrm{EB}}^{1-\alpha}$ is based on the underestimated variance $V[\hat{\pi}(\cdot \mid \hat{\theta}_i, \hat{\alpha})]$, it must have coverage strictly less than $1-\alpha$ when the ignored variance component is positive.    □

Appendix A.4. Validity of Fully Bayesian Intervals

Theorem A3 
(FB Coverage). Let $\mathrm{CI}_{\mathrm{FB}}^{1-\alpha}$ be the $(1-\alpha)$ credible interval based on $\pi(\cdot \mid \mathcal{D})$. Under the correctly specified model:
$$P\big(\Theta_i \in \mathrm{CI}_{\mathrm{FB}}^{1-\alpha} \,\big|\, \mathcal{D}\big) = 1 - \alpha$$
Proof. 
By construction, $\mathrm{CI}_{\mathrm{FB}}^{1-\alpha}$ satisfies the following:
$$\int_{\mathrm{CI}_{\mathrm{FB}}^{1-\alpha}} \pi(\theta_i \mid \mathcal{D})\, d\theta_i = 1 - \alpha$$
Since π ( θ i D ) is the true posterior distribution under the model, the coverage probability equals the credible level by the fundamental property of Bayesian inference.    □

Appendix A.5. Connection to Delta Method Approximation

Proposition A1. 
The delta method variance approximation (Equation (11)) estimates the following:
$$\widehat{\mathrm{Var}}_{\mathrm{delta}}\big[\hat{\Theta}_i^{\mathrm{EB}}\big] \approx \mathrm{Var}_{\alpha \mid \mathcal{D}}\big[m[\pi(\cdot \mid \hat{\theta}_i, \alpha)]\big],$$
but ignores $E_{\alpha \mid \mathcal{D}}\big[V[\pi(\cdot \mid \hat{\theta}_i, \alpha)]\big]$.
Proof. 
The delta method approximates the following:
$$\hat{\Theta}_i^{\mathrm{EB}}(\hat{\alpha}) \approx \hat{\Theta}_i^{\mathrm{EB}}(\alpha_0) + \nabla_{\alpha} \hat{\Theta}_i^{\mathrm{EB}}\big|_{\alpha_0} \cdot (\hat{\alpha} - \alpha_0)$$
Taking variance:
$$\mathrm{Var}\big[\hat{\Theta}_i^{\mathrm{EB}}(\hat{\alpha})\big] \approx \big(\nabla_{\alpha} \hat{\Theta}_i^{\mathrm{EB}}\big)^{\top} \mathrm{Var}(\hat{\alpha})\, \big(\nabla_{\alpha} \hat{\Theta}_i^{\mathrm{EB}}\big)$$
Since $\hat{\Theta}_i^{\mathrm{EB}}(\alpha) = m[\pi(\cdot \mid \hat{\theta}_i, \alpha)]$, this captures the variability of the posterior mean with respect to $\alpha$, which corresponds to the second term in Theorem A1’s decomposition.    □

Appendix A.6. Numerical Illustration

To illustrate these theoretical results, consider the simple case where $\Theta_i \mid \alpha \sim N(\mu(\alpha), \tau^2(\alpha))$ and $\hat{\theta}_i \mid \Theta_i \sim N(\Theta_i, \sigma_i^2)$. The posterior is as follows:
$$\Theta_i \mid \hat{\theta}_i, \alpha \sim N\!\left( \frac{\tau^2(\alpha)\,\hat{\theta}_i + \sigma_i^2\,\mu(\alpha)}{\tau^2(\alpha) + \sigma_i^2},\; \frac{\tau^2(\alpha)\,\sigma_i^2}{\tau^2(\alpha) + \sigma_i^2} \right)$$
The EB approach uses $V[\hat{\pi}] = \tau^2(\hat{\alpha})\,\sigma_i^2 / \big(\tau^2(\hat{\alpha}) + \sigma_i^2\big)$, while the true posterior variance must additionally account for the variability in both $\mu(\alpha)$ and $\tau^2(\alpha)$ through the posterior distribution of $\alpha$.
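The short R sketch below makes this concrete. The draws standing in for the posterior of $(\mu(\alpha), \tau(\alpha))$ are purely illustrative assumptions; the point is that the plug-in variance corresponds to the first term of Theorem A1, while the fully Bayesian variance adds the variability of the conditional posterior mean.
set.seed(1)
theta_hat_i <- 0.5; sigma_i <- 0.3               # one site's estimate and standard error

# Stand-ins for draws from p(mu(alpha), tau(alpha) | D); values are illustrative only
mu_draws  <- rnorm(4000, mean = 0.0, sd = 0.05)
tau_draws <- abs(rnorm(4000, mean = 0.2, sd = 0.03))

w         <- tau_draws^2 / (tau_draws^2 + sigma_i^2)
cond_mean <- w * theta_hat_i + (1 - w) * mu_draws   # m[pi(. | theta_hat_i, alpha)]
cond_var  <- w * sigma_i^2                          # V[pi(. | theta_hat_i, alpha)]

fb_var  <- mean(cond_var) + var(cond_mean)          # Theorem A1: total posterior variance
tau_hat <- mean(tau_draws)                          # crude plug-in, standing in for the MLE
eb_var  <- tau_hat^2 * sigma_i^2 / (tau_hat^2 + sigma_i^2)
c(EB_plugin = eb_var, FB_total = fb_var)            # FB exceeds EB by the ignored terms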

Appendix A.7. Discussion

These mathematical results formalize the key insight of our paper: empirical Bayes methods systematically underestimate uncertainty by treating the estimated prior as fixed. The fully Bayesian approach naturally incorporates both sources of uncertainty—the conditional variance given the prior and the variance due to prior uncertainty—yielding properly calibrated intervals.
The decomposition in Theorem A1 shows that the magnitude of underestimation depends on how sensitive the posterior mean is to the hyperparameters. In settings with limited data (small K) or weak signal (low reliability I), this sensitivity is substantial, leading to the severe undercoverage observed in our simulations.

Appendix B. Stan Implementation of the Fully Bayesian Log-Spline Model

This appendix presents the complete Stan program implementing the fully Bayesian approach described in Section 4. We provide detailed annotations for each program block to facilitate reproducibility and adaptation to related problems.

Appendix B.1. Data Block: Input Specifications

The data block declares all external inputs required for model fitting:
data {
 int<lower=1> K;       // Number of sites/studies
 vector[K] theta_hat;     // Observed site-specific estimates
 vector<lower=0>[K] sigma;  // Site-specific standard errors
 int<lower=1> L;       // Number of grid points
 vector[L] grid;       // Grid points
 int<lower=1> M;       // Dimension of spline basis (df)
 matrix[L, M] B;       // Spline basis mat. evaluated at grid
}
The first three inputs—K, theta_hat, and sigma—directly correspond to the observed data { θ ^ i , σ i } i = 1 K from the meta-analysis. The remaining inputs define the discrete approximation and spline basis for the log-spline prior.
The grid construction follows the principle of covering the plausible range of effect sizes with sufficient resolution to accurately approximate continuous distributions. In our implementation, we construct the grid as follows:
grid <- seq(min(theta_hat) - 0.5, max(theta_hat) + 0.5, length.out = 101)
This creates L = 101 equally spaced points extending slightly beyond the observed data range. The choice of grid size balances computational efficiency against approximation accuracy; finer grids (larger L) improve the discrete approximation to the continuous integral in Equation (3), but the cost of the mixture likelihood calculations grows proportionally with L for each of the K sites.
The spline basis matrix B encodes the flexibility of the prior family. Following Efron [29], we employ natural cubic splines without an intercept:
B <- ns(grid, df = M, intercept = FALSE)
The choice M = 6 provides sufficient flexibility to capture multimodal distributions while maintaining computational stability. The absence of an intercept ensures identifiability, as the log-normalizing constant ϕ ( α ) in Equation (16) already provides location invariance. Natural splines impose linearity constraints beyond the boundary knots, preventing erratic behavior in the tails of the estimated prior—a crucial consideration given the heavy-tailed nature of many meta-analytic effect distributions.

Appendix B.2. Parameters and Transformed Parameters: Hierarchical Structure

The parameters block declares the unknowns to be estimated:
parameters {
  vector[M] alpha;       // Spline coefficients
  real<lower=0> lambda;    // Adaptive regularization parameter
}
These parameters directly correspond to the hierarchical specification in Equations (16) and (17). The constraint lambda > 0 is enforced through Stan’s built-in transformation, ensuring numerical stability during sampling.
The transformed parameters block constructs the prior distribution on the grid:
transformed parameters {
  vector[L] log_w = B * alpha;      // Log unnormalized weights
  vector[L] log_g = log_softmax(log_w);  // Log normalized probabilities
  simplex[L] g = softmax(log_w);      // Probability mass function
}
This parameterization exploits several computational advantages. First, working with log-scale quantities (log_w and log_g) maintains numerical precision even when some grid points have extremely small probabilities. Second, the log_softmax function implements the log-sum-exp trick internally, computing
$$\log g_j = \log w_j - \log \sum_{k=1}^{L} w_k = Q_j \alpha - \phi(\alpha)$$
in a numerically stable manner. The simultaneous computation of both log_g and g allows efficient calculation of both the likelihood (which requires log_g) and posterior quantities (which require g).

Appendix B.3. Model Block: Priors and Likelihood

The model block specifies the probabilistic structure:
model {
  // Hyperpriors
  lambda ~ cauchy(0, 5);          // Half-Cauchy prior
  alpha ~ normal(0, inv_sqrt(lambda));  // Cond. prior given lambda
  // Likelihood: marginal distribution of observed effects
  for (i in 1:K) {
    vector[L] log_components;
    for (j in 1:L) {
      log_components[j] = log_g[j] +
            normal_lpdf(theta_hat[i] | grid[j], sigma[i]);
    }
    target += log_sum_exp(log_components);
  }
}
The hyperprior specification implements the adaptive regularization scheme described in Equation (17). The half-Cauchy prior on lambda with scale 5 provides weak information while maintaining propriety. The conditional prior alpha[k] ∼ normal(0, inv_sqrt(lambda)) induces the penalty $(\lambda/2)\lVert \alpha \rVert_2^2$ in the log-posterior, directly paralleling the empirical Bayes penalty but with data-adaptive regularization strength.
The likelihood calculation implements the discrete mixture:
$$\hat{\theta}_i \sim \sum_{j=1}^{L} g_j \cdot N(\theta_j, \sigma_i^2).$$
The nested loop structure, while appearing computationally intensive, is necessary for the heteroscedastic case where each site has its own variance σ i 2 . The log_sum_exp function again employs numerical tricks to prevent underflow when summing probabilities across mixture components. For site i, this computes
$$\log f_i(\hat{\theta}_i) = \log \sum_{j=1}^{L} \exp\big\{ \log g_j + \log \phi_{\sigma_i}(\hat{\theta}_i - \theta_j) \big\}$$
in a stable manner by first identifying the maximum log-component and factoring it out before exponentiation.
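For completeness, the same stabilization can be written in a few lines of R. The objects log_g, grid, theta_hat_i, and sigma_i are assumed to be available in the session and mirror the corresponding quantities in the Stan program.
log_sum_exp <- function(x) {        # mirrors Stan's log_sum_exp()
  m <- max(x)
  m + log(sum(exp(x - m)))          # factor out the maximum before exponentiating
}

# Log marginal density of one observation under the discrete mixture
log_components <- log_g + dnorm(theta_hat_i, mean = grid, sd = sigma_i, log = TRUE)
log_f_i <- log_sum_exp(log_components)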

Appendix B.4. Generated Quantities: Marginalization and Posterior Summaries

The generated quantities block performs post-processing to extract inferential targets:
generated quantities {
  // Summary statistics of the prior distribution
  real mean_g = dot_product(g, grid);
  real var_g = dot_product(g, square(grid - mean_g));
  // Site-specific posterior quantities
  vector[K] theta_map;     // Maximum a posteriori estimates
  vector[K] theta_mean;     // Posterior means
  vector[K] theta_rep;     // Single posterior draw per site
  for (i in 1:K) {
    // Compute unnormalized log posterior for site i
    vector[L] log_post;
    for (j in 1:L) {
      log_post[j] = log_g[j] +
            normal_lpdf(theta_hat[i] | grid[j], sigma[i]);
    }
    // MAP estimate via grid search
    int max_idx = 1;
    for (j in 2:L) {
      if (log_post[j] > log_post[max_idx]) {
        max_idx = j;
      }
    }
    theta_map[i] = grid[max_idx];
    // Normalized posterior weights
    real log_post_max = max(log_post);
    vector[L] w = exp(log_post - log_post_max);
    w /= sum(w);
    // Posterior mean
    theta_mean[i] = dot_product(w, grid);
    // Posterior sample for credible interval construction
    theta_rep[i] = grid[categorical_rng(w)];
  }
}
This block implements the marginalization principle described in Equations (19)–(21). For each MCMC iteration with parameters ( α ( s ) , λ ( s ) ) , we compute the induced posterior distribution for each site. The key insight is that these calculations occur within each MCMC iteration, so averaging across iterations automatically marginalizes over posterior uncertainty in the hyperparameters.
The MAP estimate theta_map[i] provides a robust point estimate less sensitive to discretization than the posterior mean. The grid search, while simple, is efficient given the modest grid size and unimodal posterior typical of individual sites.
The posterior weights w represent P ( Θ i = θ j | θ ^ i , α ( s ) , λ ( s ) ) as in Equation (20). The numerical stabilization via log_post_max prevents underflow without affecting the normalized weights.
The posterior draws theta_rep[i] enable direct construction of credible intervals. Across S MCMC iterations, these draws form a sample from the marginal posterior P ( Θ i | D ) , automatically incorporating all sources of uncertainty. Credible intervals can be obtained as quantiles of these draws without parametric assumptions or asymptotic approximations.
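For example, assuming theta_draws is the S × K matrix of theta_rep draws extracted from the fitted model, 90% credible intervals are simply columnwise quantiles:
# theta_draws: S x K matrix of theta_rep draws from the fitted model (assumed available)
ci_90 <- t(apply(theta_draws, 2, quantile, probs = c(0.05, 0.95)))
colnames(ci_90) <- c("lower_5", "upper_95")
head(ci_90)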

Appendix B.5. MCMC Sampling Configuration

We implemented the Stan model using cmdstanr with the following configuration:
  • Chains and iterations: 4 parallel chains, each with 1000 warmup and 3000 sampling iterations, yielding 12,000 posterior draws in total;
  • Adaptation: Target average acceptance probability (adapt_delta) set to 0.9 to reduce divergent transitions in the challenging posterior geometry induced by the mixture likelihood;
  • Parallelization: Between-chain parallelization via parallel_chains = 4 and within-chain threading (threads_per_chain = 4) for sites with many mixture components.
This configuration balances computational efficiency with sampling quality. The relatively high adapt_delta addresses the posterior geometry challenges inherent in mixture models, where the likelihood surface can exhibit ridges corresponding to label-switching symmetries (though these are broken by the continuous grid structure).
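A minimal cmdstanr call reflecting this configuration is sketched below. The file name fb_logspline.stan is an assumed placeholder, and within-chain threading only yields speedups if the likelihood is written with reduce_sum; the call otherwise matches the settings listed above.
library(cmdstanr)

mod <- cmdstan_model("fb_logspline.stan",                  # assumed file name
                     cpp_options = list(stan_threads = TRUE))

fit <- mod$sample(
  data              = stan_data,
  chains            = 4,
  parallel_chains   = 4,
  threads_per_chain = 4,
  iter_warmup       = 1000,
  iter_sampling     = 3000,
  adapt_delta       = 0.9,
  seed              = 1
)

fit$summary(c("lambda", "mean_g", "var_g"))   # R-hat, ESS, and posterior summaries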
Convergence diagnostics indicated excellent mixing across all parameters. The potential scale reduction factors ( R ^ ) were uniformly below 1.01, and effective sample sizes exceeded 1000 for all key parameters, confirming that the chains adequately explored the posterior distribution. The absence of divergent transitions after warmup further validated the sampling configuration.

Appendix C. Grid Resolution and Bounds Sensitivity Analysis

This appendix addresses the sensitivity of our fully Bayesian inference framework to the discretization scheme used in approximating the continuous prior distribution G. The choice of grid resolution and bounds represents a fundamental trade-off between computational efficiency and approximation accuracy. We demonstrate through systematic analysis that our default configuration (101 grid points with 0.5 expansion factor) provides robust inference while maintaining computational tractability.

Appendix C.1. Sensitivity Analysis Design

To evaluate the impact of grid specification on inferential performance, we conducted a factorial experiment varying two key dimensions of the discretization scheme. Let T = { θ 1 , , θ L } denote the grid of support points for the discrete approximation of G.
The number of grid points L determines the granularity of the discrete approximation to the continuous prior. We evaluated three levels: $L \in \{51, 101, 201\}$, representing coarse, moderate, and fine discretizations. The approximation error in replacing the integral $\int \phi_{\sigma_i}(x - \theta)\, g(\theta)\, d\theta$ with the Riemann sum $\sum_{j=1}^{L} \phi_{\sigma_i}(x - \theta_j)\, g(\theta_j)\, \Delta\theta$ decreases as $O(L^{-2})$ for smooth densities under uniform spacing.
The range of the grid must adequately cover the support of the true effect distribution while avoiding computational waste on implausible values. We parameterized the grid bounds as follows:
$$[\theta_{\min},\, \theta_{\max}] = \big[\min(\Theta_{\mathrm{true}}) - \beta \cdot R,\;\; \max(\Theta_{\mathrm{true}}) + \beta \cdot R\big],$$
where $R = \max(\Theta_{\mathrm{true}}) - \min(\Theta_{\mathrm{true}})$ is the range of true effects, and $\beta$ is the expansion factor. We tested $\beta \in \{0.25, 0.5, 1.0\}$, corresponding to conservative, moderate, and generous bounds.
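A small R helper, assuming the simulated true effects are stored in a vector theta_true, constructs the grid for any cell of this design:
make_grid <- function(theta_true, L, beta) {
  R <- max(theta_true) - min(theta_true)       # range of the true effects
  seq(min(theta_true) - beta * R, max(theta_true) + beta * R, length.out = L)
}

grid_coarse <- make_grid(theta_true, L = 51,  beta = 0.25)
grid_ref    <- make_grid(theta_true, L = 101, beta = 0.5)   # reference configuration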
For each of the nine configurations in the 3 × 3 factorial design, we fitted the fully Bayesian model using Variational Bayes (ADVI) rather than MCMC to enable rapid evaluation across multiple settings. The variational approximation employed a mean-field Gaussian family with 10,000 iterations and generated 4000 posterior samples for uncertainty quantification. We used the challenging “twin towers” scenario from our main simulation study ( I = 0.7 , R = 9 ) as the test case, providing a realistic assessment under moderate reliability with heteroscedastic measurement errors.
The spline basis matrix B R L × M was constructed using natural cubic splines with M = 6 degrees of freedom, maintaining the same flexibility across all grid resolutions. This ensures that differences in performance arise from the discretization itself rather than changes in the prior family’s expressiveness.

Appendix C.2. Sensitivity Analysis Results

Table A1 presents comprehensive performance metrics across all grid configurations, while Figure A1 and Figure A2 provide visual comparisons.
The stability of point estimation across grid configurations is remarkable. Root mean squared errors range from 0.6946 to 0.6959, representing variations of less than 0.2% from the reference configuration. This consistency holds for both posterior means and posterior draws (Panel A, Figure A1). The correlation between estimated and true effects exceeds 0.71 for all configurations, with negligible differences attributable to grid specification.
The insensitivity to grid resolution suggests that even the coarsest grid ( L = 51 ) provides sufficient support points to capture the bimodal structure of the true distribution. Similarly, the minimal impact of bound expansion indicates that the adaptive shrinkage inherent in the Bayesian framework naturally down-weights regions of low posterior probability, making the precise placement of grid boundaries relatively unimportant.
Coverage probabilities for 90% credible intervals range from 87.8% to 90.9% across all configurations (Panel B, Figure A1), demonstrating good calibration regardless of grid specification. The slight undercoverage observed in some configurations (particularly L = 51 , β = 0.5 ) likely reflects the inherent approximation error in variational inference rather than grid effects per se.
Table A1. Performance metrics across grid configurations.
Grid Points | Bound Expansion | RMSE (Mean) | RMSE Change (%) | Coverage (90%) | Interval Width | Time (s) | Time Ratio
51 | 0.25 | 0.6959 | 0.10 | 0.8993 | 2.041 | 9.30 | 0.61
51 | 0.50 | 0.6946 | −0.09 | 0.8780 | 2.002 | 9.48 | 0.62
51 | 1.00 | 0.6955 | 0.04 | 0.8980 | 2.040 | 9.22 | 0.60
101 | 0.25 | 0.6953 | 0.03 | 0.9033 | 2.048 | 15.36 | 1.00
101 * | 0.50 | 0.6952 | 0.00 | 0.8887 | 2.023 | 15.32 | 1.00
101 | 1.00 | 0.6953 | 0.03 | 0.9087 | 2.055 | 15.20 | 0.99
201 | 0.25 | 0.6951 | −0.01 | 0.8847 | 2.011 | 27.52 | 1.80
201 | 0.50 | 0.6950 | −0.02 | 0.8827 | 2.009 | 27.60 | 1.80
201 | 1.00 | 0.6950 | −0.03 | 0.9087 | 2.055 | 28.44 | 1.86
Note: * Reference configuration (L = 101, bound expansion = 0.5). RMSE change calculated relative to reference. Coverage represents empirical coverage of 90% credible intervals across 1500 sites. Time ratio normalized to reference configuration.
Figure A1. Grid sensitivity analysis: impact of resolution and bounds on inference quality. Note: Panel (A) shows point estimation accuracy (RMSE) for posterior means and draws. Panel (B) displays coverage probabilities with nominal 90% level (red dashed line). Panel (C) presents computational costs. Panel (D) illustrates the accuracy–coverage trade-off. The reference configuration (L = 101, expansion = 0.5) is marked with triangular symbols. All configurations achieve similar performance, demonstrating robustness to grid specification.
Figure A2. Site-level estimation errors across selected grid configurations. Note: Scatter plots show estimation errors ($\hat{\Theta}_i - \Theta_i$) versus true effects for three representative configurations. Green points indicate sites where the 90% credible interval contains the true value; red points indicate non-coverage. The similar error patterns across configurations confirm that inference quality is robust to grid specification within reasonable bounds.
The mean interval width shows minimal variation (2.00 to 2.06), suggesting that the posterior uncertainty quantification remains stable across discretization schemes. This stability is particularly noteworthy given the threefold variation in computational cost between the coarsest and finest grids.
As expected, computational time scales approximately linearly with the number of grid points (Panel C, Figure A1). The reference configuration requires 15.3 s on average, while the finest grid ( L = 201 ) increases this to 27.6 s—an 80% increase for negligible improvement in inferential quality. Conversely, the coarsest grid ( L = 51 ) reduces computation time by 40% with minimal loss of accuracy.

Appendix C.3. Theoretical Justification

The robustness of our inference to grid specification emerges from three complementary mechanisms inherent in the hierarchical Bayesian framework.

Appendix C.3.1. Spline Smoothing and Implicit Interpolation

The log-spline parameterization induces smooth variation in the prior density across grid points through the basis expansion:
$$\log g(\theta_j) = \sum_{k=1}^{M} B_{jk}\, \alpha_k - \phi(\alpha),$$
where $B_{jk}$ represents the spline basis functions evaluated at grid point $\theta_j$. The smoothness constraints imposed by the natural cubic splines ensure that $g(\theta)$ varies continuously between grid points, effectively providing implicit interpolation. This means that even coarse grids can represent smooth densities accurately, with the approximation error decreasing as $O(L^{-4})$ for densities with bounded fourth derivatives.

Appendix C.3.2. Likelihood-Driven Adaptation

The posterior distribution of the hyperparameters ( α , λ ) concentrates on values that maximize the marginal likelihood:
$$p(\mathcal{D} \mid \alpha, \lambda) = \prod_{i=1}^{K} \sum_{j=1}^{L} \phi_{\sigma_i}(\hat{\theta}_i - \theta_j)\, g(\theta_j; \alpha).$$
This likelihood function naturally emphasizes regions of the grid where the data provide information. Grid points in low-density regions contribute negligibly to the likelihood, making their precise placement inconsequential. The adaptive regularization through λ further ensures that the effective dimensionality of the problem adjusts to the available information, preventing overfitting even with fine grids.

Appendix C.3.3. Posterior Averaging and Uncertainty Propagation

The fully Bayesian framework marginalizes over uncertainty in the grid-based approximation through the posterior distribution of α :
$$P(\Theta_i = \theta_j \mid \mathcal{D}) = \iint \pi_{ij}(\alpha, \lambda)\, p(\alpha, \lambda \mid \mathcal{D})\, d\alpha\, d\lambda,$$
where π i j ( α , λ ) is the posterior probability for site i at grid point j given the hyperparameters. This averaging naturally smooths over discretization artifacts, as the posterior uncertainty in α encompasses multiple plausible configurations of the prior density. The result is inference that remains stable even when individual grid points may be suboptimally placed.
The remarkable insensitivity of the fully Bayesian inference to grid specification—with performance variations of less than 1% across a wide range of configurations—validates our default choice and demonstrates that the method is robust to this implementation detail. This robustness, combined with the proper uncertainty quantification inherent in the fully Bayesian framework, makes our approach particularly suitable for practical meta-analytic applications where computational simplicity and inferential reliability are both priorities.

Appendix D. Sensitivity Analysis for λ Hyperprior Specification

This appendix addresses the sensitivity of our fully Bayesian inference framework to the specification of the hyperprior on the regularization parameter λ . Through comprehensive simulation experiments, we demonstrate that our choice of Cauchy+(0, 5) provides robust inference while maintaining computational stability, and that site-specific parameter estimation remains remarkably insensitive to reasonable alternative prior specifications.

Appendix D.1. Sensitivity Analysis Design

To evaluate the robustness of our inferential framework to the choice of hyperprior on λ , we conducted a systematic sensitivity analysis, examining nine distinct prior specifications. These configurations were selected to span a broad range of distributional families and tail behaviors, encompassing both conjugate and non-conjugate alternatives to our reference specification.
The alternative prior specifications fall into three categories: First, we examined alternative distributional families, including Exponential(0.5) as a light-tailed alternative, Gamma(2, 1) and Inverse-Gamma(2, 2) as conjugate priors with different tail weights, Log-Normal(0, 1) representing a log-scale normal family, and Uniform(0.01, 20) as a flat, non-informative prior. Second, we varied the scale parameter of the Cauchy distribution itself, testing Cauchy+(0, 1) for increased informativeness, Cauchy+(0, 2.5) as a moderately informative alternative, and Cauchy+(0, 10) for decreased informativeness. This design allows us to assess both the choice of distributional family and the degree of prior informativeness.
Figure A3 visualizes these prior distributions, revealing substantial differences in their implications for λ . The left panel demonstrates that while the reference Cauchy+(0, 5) prior maintains moderate density across a wide range of values, the Exponential(0.5) and Gamma(2, 1) priors concentrate mass near zero with rapidly declining tails. In contrast, the Log-Normal(0, 1) and Inverse-Gamma(2, 2) specifications allow for heavier tails, potentially accommodating more extreme regularization scenarios. The right panel illustrates how varying the Cauchy scale parameter from 1 to 10 dramatically affects the prior’s informativeness, with Cauchy+(0, 1) placing substantial mass near zero, while Cauchy+(0, 10) spreads probability more diffusely across the parameter space.
Figure A3. Prior distribution specifications for lambda. Note: Left panel compares alternative distributional families with the reference Cauchy+(0, 5) prior (solid line). Dashed lines represent light-tailed (Exponential, Gamma), heavy-tailed (Log-Normal, Inverse-Gamma), and non-informative (Uniform) alternatives. Right panel shows sensitivity to Cauchy scale parameter, demonstrating how prior informativeness varies from highly concentrated (scale = 1) to diffuse (scale = 10). Density values are computed over the range λ [ 0.01 , 15 ] to capture the relevant parameter space for regularization strength.
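The density comparison in Figure A3 can be reproduced with a few lines of base R. The inverse-gamma density is coded by hand since base R does not provide it, the rate/scale parameterizations follow the usual conventions and are assumptions here, and the plotting call is only a rough stand-in for the published figure.
lambda <- seq(0.01, 15, length.out = 500)

dinvgamma   <- function(x, a, b) b^a / gamma(a) * x^(-a - 1) * exp(-b / x)
half_cauchy <- function(x, scale) 2 * dcauchy(x, location = 0, scale = scale)

dens <- cbind(
  cauchy_5    = half_cauchy(lambda, 5),
  cauchy_1    = half_cauchy(lambda, 1),
  cauchy_10   = half_cauchy(lambda, 10),
  exponential = dexp(lambda, rate = 0.5),
  gamma       = dgamma(lambda, shape = 2, rate = 1),
  inv_gamma   = dinvgamma(lambda, 2, 2),
  lognormal   = dlnorm(lambda, meanlog = 0, sdlog = 1),
  uniform     = dunif(lambda, min = 0.01, max = 20)
)
matplot(lambda, dens, type = "l", xlab = expression(lambda), ylab = "Density")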
For each prior configuration, we fitted the fully Bayesian model using Variational Bayes (ADVI) with 10,000 iterations, generating 4000 posterior samples for inference. We employed the challenging “twin towers” bimodal scenario with moderate reliability ( I = 0.7 ) and heteroscedastic measurement errors ( R = 9 ) from our main simulation study, providing a realistic test case where regularization plays a crucial role in stabilizing the deconvolution. The computational efficiency of variational inference enabled rapid evaluation across all nine configurations while maintaining sufficient accuracy for comparative assessment.

Appendix D.2. Results and Interpretation

Table A2 presents performance metrics for the alternative distributional families, while Table A3 focuses on sensitivity to the Cauchy scale parameter. The results reveal remarkable stability in inferential performance across widely varying prior specifications.
The most striking finding is the minimal impact on point estimation accuracy. Root mean squared errors range from 0.6899 to 0.6913 across all prior specifications, representing variations of less than 0.2% from the reference configuration. This stability extends to both posterior means and posterior draws, as illustrated in Panel A of Figure A4. The near-identical RMSE values—all approximately 0.690 with differences in the fourth decimal place—demonstrate that the adaptive shrinkage mechanism of the log-spline framework dominates any influence from the regularization hyperprior.
Table A2. Performance comparison across alternative prior distributions for lambda.
Prior Distribution | RMSE | RMSE Δ (%) | Coverage (90%) | Coverage Δ (pp) | CI Width | λ Post. Mean | λ Post. SD | Time (s)
Cauchy+(0, 5) | 0.6902 | 0.00 | 91.0 | 0.00 | 1.836 | 0.206 | 0.120 | 16.10
Exponential(0.5) | 0.6904 | 0.03 | 91.0 | 0.00 | 1.837 | 0.210 | 0.115 | 16.31
Gamma(2, 1) | 0.6902 | 0.00 | 90.9 | −0.13 | 1.832 | 0.294 | 0.171 | 15.60
Log-Normal(0, 1) | 0.6902 | 0.00 | 90.7 | −0.33 | 1.826 | 0.278 | 0.143 | 15.96
Inv-Gamma(2, 2) | 0.6913 | 0.17 | 90.9 | −0.07 | 1.850 | 0.428 | 0.134 | 16.04
Uniform(0.01, 20) | 0.6899 | −0.04 | 91.0 | 0.00 | 1.836 | 0.301 | 0.206 | 15.60
Note: RMSE Δ and Coverage Δ represent percentage and percentage point changes relative to the reference Cauchy+(0, 5) prior. CI Width denotes mean width of 90% credible intervals. All metrics computed across 1500 sites in the twin-towers scenario with I = 0.7 , R = 9 .
Table A3. Sensitivity to Cauchy scale parameter.
Cauchy Prior Specification | RMSE | RMSE Δ (%) | Coverage (90%) | Coverage Δ (pp) | CI Width | λ Post. Mean | λ Post. SD
Cauchy+(0, 5) | 0.6902 | 0.00 | 91.0 | 0.00 | 1.836 | 0.206 | 0.120
Cauchy+(0, 1) | 0.6902 | 0.00 | 91.1 | 0.13 | 1.833 | 0.198 | 0.096
Cauchy+(0, 2.5) | 0.6907 | 0.08 | 90.9 | −0.07 | 1.825 | 0.223 | 0.112
Cauchy+(0, 10) | 0.6902 | 0.00 | 90.8 | −0.20 | 1.825 | 0.323 | 0.237
Note: Comparison of different scale parameters for the half-Cauchy prior. The reference specification (scale = 5) balances between overly informative (scale = 1) and diffuse (scale = 10) alternatives.
Figure A4. Lambda prior sensitivity analysis: impact on inference quality. Note: Panel (A) shows point estimation accuracy (RMSE) for posterior means and draws across nine prior specifications. Panel (B) displays empirical coverage of 90% credible intervals with nominal level (red dashed line). Panel (C) presents posterior estimates of λ with 95% credible intervals, demonstrating how different priors yield different regularization strengths. Panel (D) illustrates the accuracy–coverage trade-off, with all specifications clustering near optimal performance (low RMSE, coverage near 90%). Reference prior Cauchy+(0, 5) shown in red throughout.
Uncertainty calibration remains equally robust, with coverage probabilities for 90% credible intervals ranging from 90.7% to 91.1% (Panel B, Figure A4). All specifications achieve near-nominal coverage, with deviations of less than 0.4 percentage points from the target 90% level. This consistency is particularly noteworthy given the substantial differences in posterior estimates of λ itself, which range from 0.198 under Cauchy+(0, 1) to 0.428 under Inverse-Gamma(2, 2) (Panel C). The fact that such variation in the regularization parameter translates to negligible differences in site-specific inference underscores a key insight: the hierarchical structure of the model naturally adapts to maintain inferential stability regardless of the specific regularization strength.
The robustness extends across different data scenarios, as demonstrated in Figure A5. When we examine performance under varying reliability conditions—from low ( I = 0.5 ) to high ( I = 0.9 )—the consistency across prior specifications persists. RMSE differences remain below 0.001 and coverage rates stay within 2 percentage points of nominal across all scenarios. This uniform behavior suggests that the likelihood dominates the prior in determining the effective regularization, particularly when sufficient data ( K = 1500 sites) are available.
Figure A5. Robustness of lambda prior choice across data scenarios. Note: Comparison of RMSE (left panels) and coverage (right panels) across five representative prior specifications under different reliability conditions. Low reliability ( I = 0.5 , R = 1 ): high measurement error obscures bimodal structure. Base scenario ( I = 0.7 , R = 9 ): moderate reliability with heteroscedastic errors. High reliability ( I = 0.9 , R = 1 ): low measurement error clearly reveals twin towers structure. Consistency across rows demonstrates that prior choice impact remains minimal regardless of data quality.

Appendix D.3. Theoretical Justification

The remarkable insensitivity to the lambda hyperprior specification emerges from three complementary mechanisms inherent in our hierarchical framework. Understanding these mechanisms provides theoretical justification for the empirical robustness observed in our simulations.
First, the likelihood function for the log-spline coefficients $\alpha$ provides substantial information that overwhelms the prior contribution in moderate to large samples. The marginal likelihood $p(\mathcal{D} \mid \alpha, \lambda) = \prod_{i=1}^{K} f_i(\hat{\theta}_i; \alpha)$ involves $K = 1500$ independent observations, each contributing information about the shape of the prior distribution $g(\theta; \alpha)$. The Fisher information matrix $I(\alpha) = -E[\nabla_{\alpha}^2 \log p(\mathcal{D} \mid \alpha)]$ scales linearly with $K$, while the prior precision $\lambda I_p$ remains fixed. Consequently, for large $K$, the likelihood dominates the posterior:
$$p(\alpha \mid \mathcal{D}, \lambda) \propto p(\mathcal{D} \mid \alpha) \cdot p(\alpha \mid \lambda) \approx p(\mathcal{D} \mid \alpha) \quad \text{as } K \to \infty.$$
This likelihood dominance implies that different values of λ yielding similar likelihood maxima will produce similar posterior distributions for α and, hence, similar site-specific inference.
Second, the adaptive nature of the hierarchical model provides automatic calibration regardless of the hyperprior specification. The posterior distribution of $\lambda$ concentrates around values that optimize the marginal likelihood $p(\mathcal{D} \mid \lambda) = \int p(\mathcal{D} \mid \alpha)\, p(\alpha \mid \lambda)\, d\alpha$. This marginal likelihood exhibits a well-defined maximum that depends primarily on the data rather than the hyperprior. Even strongly informative priors like Cauchy+(0, 1) or diffuse priors like Cauchy+(0, 10) yield posterior distributions for $\lambda$ that concentrate near this data-driven optimum. The posterior mean of $\lambda$ varies by a factor of two across specifications (0.198 to 0.428), but this variation occurs within a range where the log-spline model maintains similar flexibility.
Third, and most crucially for site-specific inference, the two-stage shrinkage structure insulates individual effect estimates from hyperparameter uncertainty. Site-specific posteriors take the form:
$$p(\Theta_i \mid \mathcal{D}) = \int p(\Theta_i \mid \hat{\theta}_i, g)\, p(g \mid \mathcal{D})\, dg,$$
where g represents the population distribution parameterized by ( α , λ ) . The inner posterior p ( Θ i θ ^ i , g ) depends on g only through its shape, not its parameterization. Different combinations of ( α , λ ) that yield similar shapes for g produce nearly identical site-specific posteriors. Since the likelihood strongly constrains the feasible shapes of g—it must accommodate the observed bimodal structure while remaining smooth—various hyperprior specifications converge to similar functional forms.
This theoretical understanding explains why the choice of lambda hyperprior has minimal practical impact on finite-Bayes inference. The combination of likelihood dominance, adaptive regularization, and the two-stage shrinkage structure ensures that reasonable hyperprior specifications yield essentially equivalent inference for individual site effects. Our reference choice of Cauchy+(0, 5) represents a balanced specification that provides weak regularization while maintaining computational stability, but the robustness analysis demonstrates that this choice is not critical for obtaining reliable site-specific inference.
The insensitivity to hyperprior specification constitutes a desirable property for practical applications. Researchers can confidently apply the fully Bayesian log-spline framework without extensive sensitivity analyses or hyperparameter tuning. The method’s robustness stems from its fundamental structure rather than careful calibration, making it particularly suitable for routine use in meta-analytic applications where site-specific inference is the primary goal.

Appendix E. Performance Analysis for Small-K Meta-Analyses

This appendix evaluates the performance of the empirical Bayes (EB) and our proposed fully Bayesian (FB) approaches in small-K scenarios, where the number of sites ranges from 50 to 500. Such sample sizes are common in practical meta-analyses, particularly in clinical trials and policy evaluations where the number of available studies is limited. We assess whether the favorable properties demonstrated in our main simulation study (with K = 1500) extend to these more challenging small-sample settings.

Appendix E.1. Simulation Design

We conducted a targeted simulation study using the twin-towers scenario from our main analysis (Section 4.2) as the population from which to sample. The twin-towers scenario, with its bimodal distribution of true effects centered at θ = ± 2 , provides a challenging test case for density estimation methods in small samples. From the full dataset of K = 1500 sites, we created smaller datasets through stratified random sampling, ensuring balanced representation from both modes of the distribution.
Specifically, we examined five sample size scenarios: K ∈ {50, 100, 200, 500, 1500}, with 20 independent replications for each K. For each replication, we employed proportional stratified sampling that maintained the relative sizes of the two modes. Sites with $\theta_i < 0$ were assigned to the left tower, while sites with $\theta_i \ge 0$ were assigned to the right tower, with sampling proportions matching the original distribution.
Both methods were fitted using Variational Bayes (VB) to ensure computational efficiency across the large number of replications. For the EB approach, we used L-BFGS optimization to obtain maximum-likelihood estimates of the hyperparameters. For the FB approach, we employed automatic differentiation variational inference (ADVI) with a mean-field approximation, using 10,000 iterations and generating 4000 posterior samples for uncertainty quantification. The spline basis dimension was fixed at M = 6, while the grid size L was adapted to the sample size: L = 51 for K = 50, L = 71 for K = 100, L = 81 for K = 200, and L = 101 for K ≥ 500.
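The sketch below shows the corresponding cmdstanr calls, reusing the compiled model mod and data list stan_data from Appendix B. Treating L-BFGS optimization of the same hierarchical program as the EB fit is a simplification of the procedure described in the main text, and some argument names (e.g., output_samples) may differ across cmdstanr versions.
# EB-style fit: penalized maximum likelihood via L-BFGS on the same Stan program
eb_fit <- mod$optimize(data = stan_data, algorithm = "lbfgs", seed = 1)

# Fully Bayesian fit: mean-field ADVI, 10,000 iterations, 4000 posterior draws
fb_fit <- mod$variational(
  data           = stan_data,
  algorithm      = "meanfield",
  iter           = 10000,
  output_samples = 4000,
  seed           = 1
)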

Appendix E.2. Results

Figure A6 presents a comprehensive analysis of how sample size affects inference quality. Panel A demonstrates that point estimation accuracy, measured by root mean square error (RMSE) and correlation with true effects, remains remarkably stable across sample sizes. The RMSE increases only modestly from approximately 0.69 at K = 1500 to 0.72 at K = 50 for both methods, representing less than 5% degradation in accuracy. Similarly, the correlation between estimated and true effects decreases minimally, from 0.88 at K = 1500 to 0.87 at K = 50.
Perhaps more importantly, Panel B reveals that the FB approach maintains proper uncertainty calibration even at K = 50. The empirical coverage of 90% credible intervals remains close to the nominal level across all sample sizes, ranging from 88.7% at K = 50 to 89.6% at K = 500. This stability in coverage probability demonstrates that the hierarchical structure and regularization inherent in our approach successfully prevent overconfidence in small-sample settings.
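The evaluation metrics reported in Panels A and B reduce to a few lines of post-processing. The sketch below assumes a draws matrix theta_draws (posterior samples in rows, sites in columns) extracted from the FB fit and the vector theta of true effects; both names are ours.

# Point estimation accuracy and empirical coverage of 90% credible intervals.
post_mean <- colMeans(theta_draws)
rmse <- sqrt(mean((post_mean - theta)^2))
rho  <- cor(post_mean, theta)
ci   <- apply(theta_draws, 2, quantile, probs = c(0.05, 0.95))
coverage <- mean(theta >= ci[1, ] & theta <= ci[2, ])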
Panel C illustrates the computational efficiency of the VB implementation, with computation time scaling approximately linearly with K on the log-log scale. The FB approach requires roughly 0.8 s for K = 50 and 19.3 s for K = 1500, making it practical for routine use. The EB approach is consistently faster, completing in under 0.3 s even for the largest datasets, though this speed comes at the cost of lacking proper uncertainty quantification.
Panel D synthesizes these findings by plotting coverage probability against RMSE, revealing the fundamental trade-off between point estimation accuracy and uncertainty calibration. While the EB approach achieves slightly better RMSE (approximately 1–4% lower across all K values), it cannot provide valid uncertainty estimates. The FB approach maintains both reasonable point estimation accuracy and proper uncertainty calibration across the entire range of sample sizes.
Figure A6. Performance analysis across sample sizes from K = 50 to K = 1500. Panel (A) shows point estimation accuracy through RMSE and correlation metrics. Panel (B) displays coverage probability for the fully Bayesian approach, with the red dashed line indicating nominal 90% coverage. Panel (C) illustrates computational scaling on log-log axes. Panel (D) presents the trade-off between point estimation accuracy and uncertainty calibration, with numbers indicating sample size K. Note: Box plots show distributions across 20 replications per K value. Individual points represent replication-specific values. The empirical Bayes approach does not provide valid uncertainty estimates; hence, coverage probabilities are only shown for the fully Bayesian method. Computation times were measured on a standard workstation with six parallel cores.
Figure A7 provides a detailed view of site-specific estimates for K = 50, the most challenging scenario in our study. Despite the limited sample size, both methods successfully recover the bimodal structure of the true effect distribution. The FB approach provides appropriately wide credible intervals that reflect the increased uncertainty from limited data while still achieving meaningful shrinkage toward the estimated density.
Figure A7. Site-specific estimates for K = 50 from a representative replication. Points show posterior means plotted against true effects, with the diagonal line indicating perfect estimation. Error bars for the fully Bayesian approach show 90% credible intervals. Note: The bimodal structure of the true effect distribution is preserved in the estimates despite the small sample size. The fully Bayesian approach provides uncertainty quantification through credible intervals, while the empirical Bayes approach yields only point estimates.
The shrinkage analysis reveals an interesting pattern: the empirical variance of the posterior means relative to the variance of true effects (the shrinkage factor) ranges from 0.82 to 0.87 across different K values, indicating consistent and appropriate regularization. This stability in shrinkage behavior helps explain the robust performance in small samples—the method neither overshrinks (which would lose the bimodal structure) nor undershrinks (which would amplify noise).
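Under the same illustrative notation, the shrinkage factor quoted above is simply the ratio of the variance of the posterior means to the variance of the true effects.

# Shrinkage factor: values below 1 indicate shrinkage of the posterior means.
shrinkage <- var(post_mean) / var(theta)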

Appendix E.3. Theoretical Considerations

The robust performance of our approach in small-K settings can be understood through several theoretical mechanisms. First, the nonparametric empirical Bayes framework naturally adapts to the available information. When K is small, the estimated prior ĝ(θ) becomes less complex, effectively reducing the degrees of freedom in response to limited data. This adaptive complexity control, formalized through the spline basis representation with fixed dimension M, prevents overfitting while maintaining sufficient flexibility to capture essential features of the effect distribution.
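To make the fixed-dimension control concrete, the following R sketch builds the log-spline prior mass on a grid of support points from an M = 6 natural-spline basis; the grid range and starting coefficients are illustrative, and in practice the coefficient vector alpha is estimated (EB) or sampled (FB).

# Log-spline prior on a discrete grid: g(theta_l) proportional to exp{Q(theta_l) %*% alpha}.
library(splines)
grid <- seq(-4, 4, length.out = 51)       # L = 51 support points
Q    <- ns(grid, df = 6)                  # M = 6 basis functions
g_of_alpha <- function(alpha) {
  eta <- as.vector(Q %*% alpha)
  exp(eta - max(eta)) / sum(exp(eta - max(eta)))   # normalized prior mass
}
g_hat <- g_of_alpha(rep(0, 6))            # uniform start; replaced by estimated alpha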
Second, the hierarchical structure induces an effective sample size larger than K through borrowing of strength. Each site’s posterior distribution is informed not only by its own data but also by the global pattern learned from all sites. In the context of James–Stein estimation, this borrowing of strength is known to provide minimax optimal risk reduction. Our nonparametric extension preserves this property while accommodating non-normal prior distributions.
The stability of uncertainty calibration in small samples reflects the conservative nature of the Bayesian posterior. When data are limited, the posterior naturally becomes more diffuse, automatically inflating credible intervals to reflect increased uncertainty. This behavior contrasts with plug-in methods that treat estimated hyperparameters as known, leading to anti-conservative inference. The fully Bayesian approach propagates uncertainty through all levels of the hierarchy, yielding credible intervals that maintain nominal coverage even when the prior must be estimated from limited data.
From a frequentist perspective, the coverage properties can be understood through the lens of empirical process theory. The deconvolution problem we solve is a type of inverse problem, where regularization is essential for stable solutions. The spline-based approach provides implicit regularization through the restricted basis dimension, while the Bayesian framework adds explicit regularization through the prior on the basis coefficients. This dual regularization ensures that the effective degrees of freedom remain controlled even as K decreases.
The theoretical literature on deconvolution suggests that the convergence rate for estimating g(θ) is typically of order (K / log K)^{−β} for some β > 0 that depends on the smoothness of the true density. Our empirical results, showing only modest degradation from K = 1500 to K = 50, align with these theoretical predictions: a thirtyfold reduction in sample size yields less than a 5% increase in RMSE. This favorable scaling reflects the efficiency gains from assuming the structural model θ̂_i | θ_i ~ N(θ_i, σ_i²) with known σ_i.
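Given an estimated prior mass g_hat on the grid, the site-level update implied by this structural model is a simple discrete Bayes step, sketched below with illustrative inputs.

# Posterior over the grid for a single site with observation theta_hat_i and known se_i.
site_posterior <- function(theta_hat_i, se_i, grid, g_hat) {
  w <- g_hat * dnorm(theta_hat_i, mean = grid, sd = se_i)
  w / sum(w)
}
p_i <- site_posterior(0.8, 0.6, grid, g_hat)
sum(grid * p_i)                           # posterior mean for this site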

Appendix E.4. Practical Implications

These findings have important implications for applied researchers. First, our methods can be confidently applied to meta-analyses with as few as 50 studies, a sample size that encompasses the vast majority of published meta-analyses in medicine, psychology, and economics. The maintained coverage probability ensures that inference remains valid, while the modest loss in point estimation accuracy is unlikely to affect substantive conclusions.
Second, the computational efficiency of the VB implementation makes it feasible to conduct sensitivity analyses and bootstrap validation even in small samples. The ability to fit hundreds of models in minutes enables practitioners to explore robustness to prior specifications, outlier removal, and alternative modeling choices.
Finally, the results suggest that concerns about small-sample bias in empirical Bayes methods may be overstated in modern implementations. While classical moment-based estimators can exhibit substantial bias when K is small, our likelihood-based approach with appropriate regularization maintains good performance across the entire range of sample sizes encountered in practice. This robustness, combined with the ability to accommodate arbitrary prior shapes, makes the nonparametric empirical Bayes framework a valuable tool for meta-analysis in data-limited settings.

Figure 1. True vs. observed site-specific effects across informativeness (I) and heteroscedasticity (R) conditions. Note: Red bars represent the true bimodal distribution of site-specific effects (Θ_i) from the “twin towers” example, with 500 values in the left tower (−1.7 to −0.7) and 1000 values in the right tower (0.7 to 2.7). Gray bars show the distribution of observed effects (θ̂_i). Black curves indicate kernel density estimates. The average reliability I ranges from 0.5 (low signal-to-noise) to 0.9 (high signal-to-noise), while R = 1 represents homoscedastic and R = 9 represents heteroscedastic measurement errors. Mean(SE) denotes the geometric mean of standard errors across sites.
Figure 2. Scatterplots of true vs. observed site-specific effects with heteroscedastic measurement errors. Note: Each point represents one of K = 1500 sites, with true effects ( Θ i ) on the x-axis and observed effects ( θ ^ i ) on the y-axis. Color intensity indicates the magnitude of site-specific standard errors, ranging from dark purple (small SE) to yellow (large SE). The dashed line represents perfect agreement ( y = x ). Under homoscedastic conditions ( R = 1 ), all sites have identical standard errors, while under heteroscedastic conditions ( R = 9 ), standard errors vary substantially across sites despite maintaining the same geometric mean. As reliability I increases from 0.5 to 0.9, observations cluster more tightly around the identity line, indicating improved signal-to-noise ratio.
Figure 3. Distribution of true, observed, and estimated site-specific effects across simulation conditions. Note: Red histograms and curves show the true bimodal distribution of effects (Θ_i). Gray histograms and curves display observed effects (θ̂_i). Blue solid lines represent fully Bayesian posterior means via MCMC; orange dashed lines show empirical Bayes estimates. Rows correspond to reliability levels I ∈ {0.5, 0.7, 0.9}; columns to heterogeneity ratios R ∈ {1, 9}. Both methods demonstrate adaptive shrinkage that increases with measurement error, recovering the bimodal structure more accurately as reliability improves.
Figure 4. Quantitative assessment of point estimation accuracy. Note: Top panel displays root mean squared error (RMSE) across methods and conditions, with values shown above bars. Bottom panel presents correlation heatmap between true and estimated effects. Three methods are compared: fully Bayesian posterior mean (MCMC), empirical Bayes (EB), and maximum a posteriori (MAP). All methods achieve nearly identical performance, with differences typically less than 0.001 in RMSE. Performance improves dramatically with reliability I, while heteroscedasticity effects are modest and mixed.
Figure 5. Ninety percent credible intervals for 150 randomly sampled sites. Note: Black error bars indicate intervals that contain the true value; blue bars indicate non-coverage. Red triangles mark true site-specific effects. Sites are ordered by true effect size for visualization. (Top panel) ( I = 0.5 , R = 9 ): 83.3% coverage. (Bottom panel) ( I = 0.7 , R = 9 ): 94.0% coverage. Sample coverage varies due to finite sample size but approaches nominal 90% level.
Figure 6. Coverage probabilities of 90% intervals across all sites and conditions. Note: (Top row) shows coverage for θ rep (parameter uncertainty about Θ i ); (bottom row) for θ mean (estimator uncertainty of the posterior mean). For EB, intervals are computed using the delta method approximation (Equation (11)). Red dashed line indicates nominal 90% coverage. FB approach (MCMC) achieves near-nominal coverage for parameters while EB plug-in inference shows severe undercoverage. Both methods show appropriately low coverage for estimator uncertainty, correctly reflecting the narrower variability of posterior means as estimators.
Figure 7. Calibration assessment: nominal versus empirical coverage. Note: Points compare stated nominal coverage (x-axis) against observed empirical coverage (y-axis) across probability levels. Perfect calibration follows the diagonal. FB approach (red circles) shows excellent calibration across all conditions. EB approach (green triangles) exhibits severe miscalibration with empirical coverage plateauing below nominal levels. Results confirm proper uncertainty propagation in FB inference.
Figure 8. Sequential coverage rates as sites accumulate. Note: Lines track cumulative coverage as sites are processed sequentially. Red dashed line marks nominal 90% level. FB coverage stabilizes near target after several hundred sites, with minor fluctuations in heteroscedastic low-reliability conditions. Convergence to appropriate coverage demonstrates finite-sample reliability of the fully Bayesian approach.
Figure 9. Additional validity diagnostics for uncertainty quantification. Note: (Top panel): PIT histograms should be uniform under correct specification; mild deviations suggest minor miscalibration. (Middle panel): Q-Q plots confirm approximate normality of standardized residuals with slight heavy-tail behavior. (Bottom panel): Positive correlation between interval width and absolute error, with high coverage (green) for wide intervals, indicates appropriate adaptation of uncertainty to information content.
Figure 10. Firm-specific racial discrimination estimates with 90% uncertainty intervals. Note: Points represent posterior mean estimates of the white–Black callback rate difference for each of K = 97 firms, ordered by magnitude. Blue intervals exclude zero (indicating significant discrimination at the 90% level), while gray intervals include zero. Top panel shows fully Bayesian credible intervals capturing posterior uncertainty about parameters. Bottom panel shows empirical Bayes confidence intervals via delta method, capturing only estimator uncertainty. The substantial difference in interval widths illustrates the anti-conservative nature of plug-in empirical Bayes inference for individual firm effects.
Figure 11. Firms with strongest evidence of racial discrimination based on fully Bayesian parameter uncertainty. Note: Forest plot displays the 20% most discriminatory (right panel) and 20% least discriminatory (left panel) firms based on posterior mean estimates. Points show posterior means with 90% credible intervals. Values in parentheses indicate P(Θ_i > 0 | D), the posterior probability of discrimination against Black applicants. Among the most discriminatory firms, posterior probabilities approach 1.00, indicating overwhelming evidence of discrimination. Among the least discriminatory firms, probabilities range from 0.28 (Dr Pepper) to 0.79 (Target), illustrating heterogeneity in the strength of evidence even among firms with similar point estimates.
Figure 12. Distribution of racial discrimination estimates before and after Bayesian shrinkage. Note: Orange distribution shows raw OLS estimates of white–Black callback differences across K = 97 firms (mean = 0.021, SD = 0.024). Blue distribution shows fully Bayesian posterior means after shrinkage (mean = 0.017, SD = 0.011). Red vertical line indicates no discrimination.
Table 1. Comparison of empirical Bayes (EB) and fully Bayesian (FB) approaches to meta-analytic deconvolution using log-spline priors.
Aspect | Empirical Bayes | Fully Bayesian
Uncertainty in g | Ignored in plug-in inference; approximated via delta method or bootstrap for estimator uncertainty | Fully propagated through hierarchical model; posterior samples of g reflect estimation uncertainty
Inference for Θ_i | Point estimates Θ̂_i^EB with anti-conservative intervals; requires ad hoc corrections | Complete posterior distribution P(Θ_i ∣ D) with exact finite-sample validity
Loss-specific decisions | Optimized for squared error loss; adaptation to other losses requires re-estimation | Posterior samples enable decision-making under any loss function without re-fitting
Model extensions | Limited to fixed regularization; extensions require new theory | Natural incorporation of hierarchical structures, covariates, or mixture components
Diagnostics | Limited to likelihood-based checks | Full suite of Bayesian diagnostics: posterior predictive checks, MCMC convergence, prior sensitivity
Computation | Convex optimization; ≈0.3 s for K = 1500 | MCMC sampling; ≈8.4 min for K = 1500 (4 chains, 3000 iterations each); Variational Bayes alternative: ≈12 s
Software | Specialized implementations (e.g., deconvolveR package) | General-purpose probabilistic programming (Stan, NIMBLE, etc.)