Article

The Learning Rate Is Not a Constant: Sandwich-Adjusted Markov Chain Monte Carlo Simulation

1 Department of Civil and Environmental Engineering, University of California, Irvine, CA 92697, USA
2 Center for Nonlinear Dynamics in Economics and Finance (CeNDEF), Amsterdam School of Economics, University of Amsterdam, 1018 WB Amsterdam, The Netherlands
* Author to whom correspondence should be addressed.
Entropy 2025, 27(10), 999; https://doi.org/10.3390/e27100999
Submission received: 5 July 2025 / Revised: 26 August 2025 / Accepted: 26 August 2025 / Published: 25 September 2025

Abstract

A fundamental limitation of maximum likelihood and Bayesian methods under model misspecification is that the asymptotic covariance matrix of the pseudo-true parameter vector $\theta_*$ is not the inverse of the Fisher information, but rather the sandwich covariance matrix $\frac{1}{n}\mathbf{A}_*^{-1}\mathbf{B}_*\mathbf{A}_*^{-1}$, where $\mathbf{A}_*$ and $\mathbf{B}_*$ are the sensitivity and variability matrices, respectively, evaluated at $\theta_*$ for training data record $\omega_1, \ldots, \omega_n$. This paper makes three contributions. First, we review existing approaches to robust posterior sampling, including the open-faced sandwich adjustment and magnitude- and curvature-adjusted Markov chain Monte Carlo (MCMC) simulation. Second, we introduce a new sandwich-adjusted MCMC method. Unlike existing approaches that rely on arbitrary matrix square roots, eigendecompositions or a single scaling factor applied uniformly across the parameter space, our method employs a parameter-dependent learning rate $\lambda(\theta)$ that enables direction-specific tempering of the likelihood. This allows the sampler to capture directional asymmetries in the sandwich distribution, particularly under model misspecification or in small-sample regimes, and yields credible regions that remain valid when standard Bayesian inference underestimates uncertainty. Third, we propose information-theoretic diagnostics for quantifying model misspecification, including a strictly proper divergence score and scalar summaries based on the Frobenius norm, Earth mover's distance, and the Herfindahl index. These principled diagnostics complement residual-based metrics for model evaluation by directly assessing the degree of misalignment between the sensitivity and variability matrices, $\mathbf{A}_*$ and $\mathbf{B}_*$. Applications to two parametric distributions and a rainfall-runoff case study with the Xinanjiang watershed model show that conventional Bayesian methods systematically underestimate uncertainty, while the proposed method yields asymptotically valid and robust uncertainty estimates. Together, these findings advocate for sandwich-based adjustments in Bayesian practice and workflows.

1. Introduction

Suppose that we have a vector-valued statistical (mathematical) model $\mathbf{y} = f(\theta): \mathbb{R}^d \to \mathbb{R}^n$ of a $d \times 1$ vector of parameters $\theta = (\theta_1, \ldots, \theta_d)^\top$ that we wish to estimate from training data $\boldsymbol{\omega}_n = (\omega_1, \ldots, \omega_n)^\top$. Common practice is to define a residual loss function $e_t(\theta) = \omega_t - y_t(\theta)$ for all $t = 1, \ldots, n$ and minimize (maximize, if appropriate) this function using an automatic search algorithm. In this special issue on Bayesian estimation and information theory, we shall use a likelihood function $L_n(\theta)$ for $\theta$ given the n observations $\omega_1, \ldots, \omega_n$. However, the problem we address in this paper is not limited to Bayesian methods but applies equally to frequentist inference using least-squares methods. We use $L_\omega(\theta)$ as shorthand notation for $L(\omega \mid \theta)$ and write $\ell_\omega(\theta) = \log L_\omega(\theta)$ for the log-likelihood function. The joint likelihood $L_n(\theta)$ for the n-vector of data points, $\omega_1, \ldots, \omega_n$, is equal to the product of $L_{\omega_1}(\theta), \ldots, L_{\omega_n}(\theta)$. Now, the unnormalized d-variate posterior density $P_n(\theta) = P(\theta \mid \boldsymbol{\omega}_n)$ follows from Bayes' theorem [1], $P_n(\theta) \propto P(\theta)\,L_n(\theta)$, where $P(\theta)$ is the prior density. In logarithmic form, $\mathcal{P}_n(\theta) = \mathcal{P}(\theta) + \ell_n(\theta) - \log(Z_n)$, where $\mathcal{P}(\theta) = \log P(\theta)$ is the log-prior and $Z_n = \int P(\theta)\,L_n(\theta)\,d\theta$ denotes the marginal likelihood.
The Bernstein and von Mises [2] theorem establishes that when the sample size n grows, the posterior distribution of the parameters becomes approximately normal, centered on the true parameter values $\theta_0$ of the data-generating process and with covariance matrix $\frac{1}{n}\mathbf{I}_0^{-1}(\theta_0)$ equal to the inverse of the $d \times d$ Fisher [3] information matrix [4]
$\mathbf{I}_0(\theta_0) = \mathbb{E}_\omega\big[\nabla \ell_\omega(\theta_0)\,\nabla^\top \ell_\omega(\theta_0)\big].$
This theorem establishes that Bayesian credible sets asymptotically approximate optimal frequentist confidence sets and, as such, it forms the basis for using Bayesian credible sets in statistical inference. The fundamental underpinning of this theory is the information identity $\mathbf{A}_0 = \mathbf{B}_0$, where
$\mathbf{A}_0 = -\mathbb{E}_\omega\big[\nabla^2 \ell_\omega(\theta_0)\big],$
is the so-called sensitivity (negative Hessian) matrix, and
$\mathbf{B}_0 = \mathbb{E}_\omega\big[\nabla \ell_\omega(\theta_0)\,\nabla^\top \ell_\omega(\theta_0)\big] = \mathrm{Var}_\omega\big[\nabla \ell_\omega(\theta_0)\big],$
is the variability matrix at $\theta_0$. The term "variability" reflects the well-known identity $\mathrm{Var}[X] = \mathbb{E}[(X - \mu)(X - \mu)^\top]$ with $\mu = \mathbb{E}[X]$, applied to the score $\nabla \ell_\omega(\theta_0)$
$\mathrm{Var}\big[\nabla \ell_\omega(\theta_0)\big] = \mathbb{E}_\omega\Big[\big(\nabla \ell_\omega(\theta_0) - \mu_0\big)\big(\nabla \ell_\omega(\theta_0) - \mu_0\big)^\top\Big] = \mathbb{E}_\omega\big[\nabla \ell_\omega(\theta_0)\,\nabla^\top \ell_\omega(\theta_0)\big] - \mu_0\mu_0^\top.$
Under correct specification, the expected score $\mu_0 = \mathbb{E}_\omega[\nabla \ell_\omega(\theta_0)]$ equals the zero vector and, consequently, $\mathbf{B}_0 = \mathrm{Var}_\omega[\nabla \ell_\omega(\theta_0)]$.
We then also have that the maximum likelihood (ML) density estimator $\hat{\theta}_n$ of the posterior parameter distribution satisfies [5]
$\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}_d\big(\mathbf{0}, \mathbf{I}_0^{-1}(\theta_0)\big),$
where $\mathbf{I}_0(\theta_0)$ is the expected Fisher information for a single datum. Fisher information plays a fundamental role in statistical inference, including hypothesis testing, regression analysis, and the calculation of standard errors and parameter confidence intervals and regions. The information or second Bartlett [6] identity, $\mathbf{A}_0 = \mathbf{B}_0$, is only valid if the model $f(\theta)$ (and hence the likelihood function, $L_n(\theta)$) is correctly specified [7]. If the model is misspecified, the sensitivity and variability matrices are misaligned [8] and asymptotic $100(1-\alpha)\%$ credible intervals will usually have less than nominal frequentist coverage probabilities. Thus, Bayesian credible sets of confidence level $\gamma = 100(1-\alpha)\%$ cannot be interpreted as confidence sets of level $\gamma\%$ [9]. This so-called overconditioning [10,11,12] is a result of the customary aleatoric treatment of residuals when, in fact, they are nonrandom (systematic) in nature, and manifests in unduly small parameter uncertainty and poorly calibrated prediction intervals [13,14,15,16,17]. In such cases, interpretation of the posterior parameter distribution $P(\theta)\,L_n(\theta)$ may be problematic. Not only can the posterior $P_n(\theta)$ fail to provide a valid probabilistic description of information about $\theta$, but it may also be unclear whether $\theta$ corresponds to any meaningful or scientifically relevant quantities [18].
Upon misspecification, the true parameter values $\theta_0$ of the data generating process are not in the model parameter space $\Theta \subseteq \mathbb{R}^d$ (see Figure 1).
The best attainable values of the parameters, the so-called pseudo-true parameter values $\theta_*$, minimize the Kullback and Leibler [19] divergence between the true probability density function $q_\Omega(\omega \mid \theta_0)$ of $\Omega$ and the incorrect family of densities $f(\omega \mid \theta)$ defined by $\theta \in \Theta$ [8]. The consequence of misspecification is that the posterior distribution will now center on $\theta_*$, corresponding to the best distribution out of all distributions in the misspecified parametric family. However, a more pertinent problem is that the information identity $\mathbf{A}_* = \mathbf{B}_*$ will not hold under misspecification. The ML estimator will still be asymptotically normal, but now around the pseudo-true parameter values $\theta_*$
$\sqrt{n}\,(\hat{\theta}_n - \theta_*) \xrightarrow{d} \mathcal{N}_d\big(\mathbf{0}, \mathbf{G}_0^{-1}(\theta_*)\big),$
where the covariance matrix of the estimator is $\frac{1}{n}\mathbf{A}_*^{-1}\mathbf{B}_*\mathbf{A}_*^{-1} = \frac{1}{n}\mathbf{G}_0^{-1}(\theta_*)$ and $\mathbf{G}_0 = \mathbf{A}_*\mathbf{B}_*^{-1}\mathbf{A}_*$ is the so-called Godambe [20] information matrix. Thus, Godambe information $\mathbf{G}_0$ is the only valid currency of data information under misspecification. This information matrix guarantees asymptotically valid parameter confidence intervals and standard errors even when the likelihood function $L_n(\theta)$ is incorrectly specified [21].
This paper builds on Vrugt et al. [8] and addresses the fundamental limitation that Bayesian methods do not provide asymptotically valid standard errors when the model is misspecified [22,23,24,25]. The asymptotic covariance matrix of Markov chain Monte Carlo (MCMC) simulation methods is the inverse of a single "slice of bread," $\frac{1}{n}\mathbf{A}_*^{-1}$, rather than the asymptotically valid sandwich matrix $\frac{1}{n}\mathbf{A}_*^{-1}\mathbf{B}_*\mathbf{A}_*^{-1}$. Analytic and numerical case studies in Vrugt et al. [8] confirm that the posterior distribution can significantly overestimate the informativeness of streamflow measurements, resulting in overly optimistic model and parameter uncertainty estimates under misspecification. The sandwich estimator, by contrast, substantially widens the credible intervals for watershed model parameters and discharge. This theoretical inconsistency between Bayesian and frequentist approaches warrants a closer look at MCMC methodology, specifically, how we might adapt the Metropolis–Hastings (MH) algorithm [26,27] so that the stationary distribution of the Markov chains reflects the correct sandwich asymptotics. The general problem we address is that Bayesian methods yield $\theta \sim \mathcal{N}_d\big(\hat{\theta}_*, \frac{1}{n}\mathbf{A}_*^{-1}\big)$ as the asymptotic description of the posterior parameter distribution $P_n(\theta) \propto P(\theta)\,L_n(\theta)$, or $P_n(\theta) \propto \exp\mathcal{P}_n(\theta)$, whereas this should be $\theta \sim \mathcal{N}_d\big(\hat{\theta}_*, \frac{1}{n}\mathbf{A}_*^{-1}\mathbf{B}_*\mathbf{A}_*^{-1}\big)$ when the model is misspecified. This reconciliation of frequentist asymptotic theory with Bayesian computational procedures is of great practical importance, particularly for applications that make use of prior information, latent variables, and/or hierarchical models. We wish to enhance the robustness of Bayesian computation under model misspecification, while retaining the flexibility and coherence of MCMC simulation methods. We view this not as an attempt to force Bayesian and frequentist methods to align, but as a practical safeguard in applications where model assumptions are inevitably imperfect. Moreover, the strictly proper scoring rules we propose as a byproduct of the misalignment between the sensitivity and variability matrices offer information-theoretically principled metrics for quantifying the degree of misspecification and for guiding model selection and improvement.
The goals of this paper are three-fold. First, we review and examine existing methods for obtaining an asymptotically valid description of the sandwich posterior distribution using MCMC sampling methods. Then, as a second objective, we introduce a new and more rigorous sandwich sampling method which overcomes the limitations of currently available methods. In particular, existing approaches often rely on a single scalar correction factor applied uniformly across all parameters, which can fail to capture directional asymmetries in the sandwich distribution, especially under model misspecification or for small sample sizes. Our proposed method addresses this limitation by introducing a direction-dependent scaling factor or learning rate that adapts to the local curvature of the sandwich distribution. As the third and last objective of this paper, we present an information-theoretic interpretation of the strictly proper alignment score proposed by Vrugt et al. [8], which quantifies the concordance between the matrices $\mathbf{A}_*$ and $\mathbf{B}_*$. Several other scalar indicators of model misspecification are also introduced in this section.
The theory and methodology of this paper are an integral part of DREAM-Suite, a Matlab-Python software package for Bayesian model training, evaluation and diagnostics [28]. This software can be downloaded from the first author’s GitHub account https://github.com/jaspervrugt (accessed on 3 September 2025) and includes the different case studies presented herein.

2. Notation and Definitions

Boldface uppercase letters denote matrices, $\mathbf{A}$, boldface lowercase letters signify vectors, $\mathbf{a}$, and italic lowercase letters are used for scalars, a. The superscripts "$\top$" and "$-1$" stand for matrix transpose and matrix inverse, respectively. By default, we assume column vectors and, thus, we write $\mathbf{a} = (a_1, \ldots, a_n)^\top$ for an $n \times 1$ vector. If $\mathbf{X} = (X_1, \ldots, X_d)^\top$ is a vector of d random variables, then we say that its expectation is the vector $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_d)^\top$ and write $\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}]$, thus combining d scalar equations into one vector equation. The variance of random vector $\mathbf{X}$ is the $d \times d$ matrix $\boldsymbol{\Sigma}$ whose $(i,j)$th element is
$\mathrm{Cov}[X_i, X_j] = \mathbb{E}\big[(X_i - \mu_i)(X_j - \mu_j)\big],$
where $i, j \in (1, \ldots, d)$. In vector notation, we write
$\mathrm{Var}[\mathbf{X}] = \mathbb{E}\big[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top\big],$
thus combining $d^2$ scalar equations into one matrix equation. In this formulation, $\mathbf{X} - \boldsymbol{\mu}$ is a $d \times 1$ vector and the outer (cross) product of $\mathbf{X} - \boldsymbol{\mu}$ and $(\mathbf{X} - \boldsymbol{\mu})^\top$ returns a $d \times d$ matrix. The inner or dot product of two n-vectors $\mathbf{a}$ and $\mathbf{b}$ is equal to $\mathbf{a}^\top\mathbf{b}$ and returns a scalar. For notational convenience, we write $\mathbf{Z}_n^z(\theta)$ instead of $\big(\mathbf{Z}_n(\theta)\big)^z$, where the superscript $z \in \{-1, \top\}$ denotes either matrix inversion or transposition, respectively. This convention applies to any matrix $\mathbf{Z}$, such as $\mathbf{A}$, $\mathbf{B}$, $\mathbf{I}$, and $\mathbf{G}$. In the same spirit, we write $\nabla^\top_\theta \ell_\omega(\theta)$ to denote the transpose of the gradient vector $\nabla_\theta \ell_\omega(\theta)$, so that outer products are written compactly.
Suppose θ * are the pseudo-true parameter values of the data-generating process S and ω = ( ω 1 , , ω n ) and y = ( y 1 , , y n ) are n × 1 vectors of materialized and modeled outcomes, respectively. Then, the most important scalars, vectors, and matrices are
  • The likelihood is a scalar and denoted $L_\omega(\theta)$ for a single datum $\omega$. For a data set $\omega_1, \ldots, \omega_n$, we write $L_n(\theta)$. The symbol $\ell_n(\theta)$ denotes the natural logarithm of $L_n(\theta)$.
  • The $d \times d$ Hessian matrix $\mathbf{H}_\omega(\theta) = \nabla^2 \ell_\omega(\theta)$ contains the second-order partial derivatives of $\ell_\omega(\theta)$ w.r.t. $\theta$. The total Hessian is given by $\mathbf{H}_n(\theta) = \sum_{i=1}^{n} \mathbf{H}_{\omega_i}(\theta)$, equivalently $\mathbf{H}_n(\theta) = \nabla^2 \ell_n(\theta)$.
  • The $d \times d$ sensitivity matrix is defined as $\mathbf{A}_n = -\frac{1}{n}\mathbf{H}_n(\hat{\theta}_*)$ with probability limit $\mathbf{A}_* = \operatorname{plim} \mathbf{A}_n$.
  • The $d \times d$ variability matrix is defined as $\mathbf{B}_n = \frac{1}{n}\sum_{i=1}^{n} \nabla \ell_{\omega_i}(\hat{\theta}_*)\,\nabla^\top \ell_{\omega_i}(\hat{\theta}_*)$ with probability limit $\mathbf{B}_* = \operatorname{plim} \mathbf{B}_n$.
  • The $d \times d$ Fisher information matrix $\mathbf{I}_n(\theta_*) = \mathbb{E}_\omega\big[\nabla \ell_n(\theta_*)\,\nabla^\top \ell_n(\theta_*)\big]$ is the expectation w.r.t. $\omega$ of the outer product of the gradient of the log-likelihood evaluated at $\theta_*$.
  • The matrix inverse of the Fisher information $\mathbf{I}_n^{-1}(\theta_*)$ is a $d \times d$ covariance matrix. Under correct specification this naive variance equals the asymptotic variance of the ML estimator.
  • The $d \times d$ Godambe information matrix is defined as $\mathbf{G}_n(\hat{\theta}_*) = n\,\mathbf{A}_n\mathbf{B}_n^{-1}\mathbf{A}_n$, with probability limit $\mathbf{G}_0 = \operatorname{plim} \frac{1}{n}\mathbf{G}_n(\theta_*) = \mathbf{A}_*\mathbf{B}_*^{-1}\mathbf{A}_*$.
  • The matrix inverse of the Godambe information $\mathbf{G}_n^{-1}(\hat{\theta}_*)$ is a $d \times d$ covariance matrix. This robust or "sandwich" variance is a consistent estimator of the asymptotic variance of the ML estimator under misspecification.
Note that we omitted the subscript θ in the vector differential operator ∇ as differentiation of the log-likelihood function is always with respect to the parameters.
The entries of the $d \times d$ "information" matrices $\mathbf{I}_n$, $\mathbf{H}_n$, and $\mathbf{G}_n$ grow linearly (on average) with n, reflecting a steadily increasing amount of information about the unknown parameters $\theta$ with more data. In contrast, $\mathbf{A}_n$ and $\mathbf{B}_n$ are sample averages of the sensitivity and variability matrices for n data points. Cameron and Trivedi [29] treat these two $d \times d$ matrices as estimators of $\mathbf{A}_*$ and $\mathbf{B}_*$, respectively, the probability limits under the pseudo-true parameters $\theta_*$. For the time being, we formulate all our information matrices $\mathbf{A}_*$, $\mathbf{B}_*$, $\mathbf{I}_n(\theta_*)$, and $\mathbf{G}_n(\theta_*)$ as population quantities, as if the pseudo-true parameter values $\theta_*$ of the data generating process are exactly known. In practice, the "information" matrices $\mathbf{A}_*$ and $\mathbf{B}_*$ are replaced with empirical estimates $\mathbf{A}_n$ and $\mathbf{B}_n$, evaluated at the estimator $\hat{\theta}_*$ obtained from $\omega_1, \ldots, \omega_n$. Further details are provided later.
Statistical distributions are designated with common symbols. If $\mathbf{X}$ is multivariate normally distributed with mean $\boldsymbol{\mu} \in \mathbb{R}^d$ and $d \times d$ covariance matrix $\boldsymbol{\Sigma} = \mathrm{Var}[\mathbf{X}]$, we write $\mathbf{X} \sim \mathcal{N}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and use $\mathbf{X} \sim \mathcal{U}_d(\mathbf{a}, \mathbf{b})$ for the continuous d-variate uniform distribution on the closed region $[\mathbf{a}, \mathbf{b}]$, where $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$ and $a_j < b_j$ for all $j = 1, \ldots, d$. We write $P(\mathbf{X} \mid \boldsymbol{\omega})$ for the conditional pdf of $\mathbf{X}$ given the n materialized outcomes $\boldsymbol{\omega}$. The Greek letter $\alpha \in (0,1)$ denotes the probability of rejecting the null hypothesis when the null hypothesis is true. We write $\gamma = 1 - \alpha$ for the confidence level.

3. Illustrative Example

Before discussing how to modify Bayesian MCMC methods so that they sample the asymptotically correct sandwich distribution, we first demonstrate the information identity $\mathbf{A}_* = \mathbf{B}_*$, and its failure, for a simple parametric model and synthetic data.
We revisit the first study of Vrugt et al. [8] and consider as data generating process $\Omega \sim \mathcal{N}(\mu, \sigma^2)$ of random variable $\Omega$. We draw measurements $\omega_1, \ldots, \omega_n$ for $\mu = 0$, $\sigma^2 = 1$ and $n = 100$. As our model we use $y_i \sim \mathcal{N}(m, s^2)$ with m unknown and $s^2$ fixed at some predefined value. If $s^2 = \sigma^2$, then the model is correctly specified; otherwise, for $s^2 \neq \sigma^2$, the model is misspecified. Now, we wish to determine the value of m using training data $\omega_1, \ldots, \omega_{100}$. The normal log-likelihood $\ell_n^{\mathrm{n}}(m \mid s^2)$ is equal to
$\ell_n^{\mathrm{n}}(m \mid s^2) = \log L_n^{\mathrm{n}}(m \mid s^2) = -\frac{n}{2}\log(2\pi s^2) - \frac{1}{2s^2}\sum_{i=1}^{n}(\omega_i - m)^2.$
Figure 2 displays $\ell_n^{\mathrm{n}}(m \mid s^2)$ for $-5 \leq m \leq 5$ using $s^2 = 1/2$ (red), $s^2 = 1$ (green) and $s^2 = 2$ (blue).
For $s^2 = 1$ (green line), the model is correctly specified and the information identity $\mathbf{A}_* = \mathbf{B}_*$ will hold. This implies that the negative of the expected value of the second derivative $\ddot{\ell}_n^{\mathrm{n}}(m \mid s^2)$ of the log-likelihood function $\ell_n^{\mathrm{n}}(m \mid s^2)$ at the likelihood maximum $\hat{m} \approx \mu$ will equal the expected value of the squared first derivative $\dot{\ell}_n^{\mathrm{n}}(m \mid s^2)$ of $\ell_n^{\mathrm{n}}(m \mid s^2)$ at this maximum, where the expectation is with respect to $\omega \sim \Omega$. This is easy to demonstrate with an analytic proof. The first and second derivatives of $\ell_\omega^{\mathrm{n}}(m \mid s^2)$ with respect to m are
$\dot{\ell}_\omega^{\mathrm{n}}(m \mid s^2) = \frac{d}{dm}\ell_\omega^{\mathrm{n}}(m \mid s^2) = s^{-2}(\omega - m) \qquad \ddot{\ell}_\omega^{\mathrm{n}}(m \mid s^2) = \frac{d^2}{dm^2}\ell_\omega^{\mathrm{n}}(m \mid s^2) = -s^{-2}.$
The sensitivity matrix (a scalar in this case) is now equal to
$\mathbf{A}_* = -\mathbb{E}_\omega\big[\ddot{\ell}_\omega^{\mathrm{n}}(m \mid s^2)\big] = -\mathbb{E}_\omega[-s^{-2}] = s^{-2},$
and the variability matrix (also a scalar here) at the likelihood maximum m = μ is
$\mathbf{B}_* = \mathbb{E}_\omega\big[\dot{\ell}_\omega^{\mathrm{n}}(m \mid s^2)\,\dot{\ell}_\omega^{\mathrm{n}}(m \mid s^2)\big] = \mathbb{E}_\omega\big[s^{-2}(\omega - m)\,s^{-2}(\omega - m)\big] = s^{-4}\sigma^2.$
If we assume the variance $s^2 = \sigma^2$ of the green line in Figure 2, then $\mathbf{B}_* = s^{-2}$. This is equal to the sensitivity matrix $\mathbf{A}_* = s^{-2}$ from Equation (7), thus $\mathbf{A}_* = \mathbf{B}_*$. In this correctly specified case and with variance known, an exact $100(1-\alpha)\%$ confidence interval for $\hat{m}$ is
$\hat{m} \pm \Phi^{-1}\big(1 - \tfrac{1}{2}\alpha\big)\sqrt{\sigma^2/n},$
where $\Phi^{-1}(p_\alpha)$ is the quantile function of the standard normal distribution evaluated at percentile $p_\alpha = 1 - \tfrac{1}{2}\alpha$. This confidence interval for $\hat{m}$ coincides with the classical frequentist interval estimate of the sample mean [30].
For the other two models with $s^2 = 1/2$ (red line) and $s^2 = 2$ (blue line), $\mathbf{A}_* \neq \mathbf{B}_*$, and consequently the naive variance $\frac{1}{n}\mathbf{A}_*^{-1} = s^2/n$ will underestimate and overestimate, respectively, the actual uncertainty of m. Upon misspecification, the sandwich variance $\frac{1}{n}\mathbf{A}_*^{-1}\mathbf{B}_*\mathbf{A}_*^{-1} = s^2 \cdot s^{-4}\sigma^2 \cdot s^2/n = \sigma^2/n$ equals the correct variance ($\sigma^2/n$) of m. This estimator does not require prior knowledge of $\sigma^2$ as the matrices $\mathbf{A}_*$ and $\mathbf{B}_*$ are replaced by their sample estimates, $\mathbf{A}_n$ and $\mathbf{B}_n$, evaluated at $\theta = \hat{\theta}_*$.
In Appendix A, we verify that the variability matrix satisfies the variance rule given in Equation (4), confirming that $\mathbf{B}_* = \mathrm{Var}\big[\dot{\ell}_\omega^{\mathrm{n}}(m \mid s^2)\big]$.
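The analytic results above are easy to check by simulation. The following minimal sketch (ours, not part of DREAM-Suite; NumPy assumed) computes the sample analogues of $\mathbf{A}_*$ and $\mathbf{B}_*$ for the misspecified normal model and compares the naive and sandwich variances of $\hat{m}$ with the known value $\sigma^2/n$.
```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n = 0.0, 1.0, 100            # data-generating process: Omega ~ N(mu, sigma2)
s2 = 2.0                                 # model variance; s2 != sigma2 -> misspecified
omega = rng.normal(mu, np.sqrt(sigma2), n)

m_hat = omega.mean()                     # ML estimate of m (sample mean)
score = (omega - m_hat) / s2             # per-datum first derivative of the log-likelihood
A_n = 1.0 / s2                           # sensitivity: minus the mean second derivative
B_n = np.mean(score**2)                  # variability: mean squared score ~ sigma2 / s2**2

var_naive = (1.0 / A_n) / n              # s2/n: biased under misspecification
var_sandwich = (B_n / A_n**2) / n        # ~ sigma2/n: asymptotically valid
print(var_naive, var_sandwich, sigma2 / n)
```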

4. Sandwich-Adjusted MCMC Simulation

4.1. The Metropolis–Hastings Algorithm

We must summarize the posterior parameter distribution, $P_n(\theta) \propto P(\theta)\,L_n(\theta)$. When this task cannot be carried out analytically nor by analytic approximation, Monte Carlo simulation methods can be used to generate samples from the posterior distribution.
The basis of MCMC simulation is a Markov chain that generates a random walk through the search space and successively visits solutions with stable frequencies stemming from a stationary distribution, $P_n(\theta)$. Assume that the points $\{\theta^{(0)}, \theta^{(1)}, \ldots, \theta^{(t-1)}\}$ have already been sampled; then the MH algorithm [26,27] proceeds as follows (see Algorithm 1). At iteration t, the transition kernel $q(\theta_p \mid \theta^{(t-1)})$ generates a trial move $\theta_p$ around the current chain state $\theta^{(t-1)}$. Next, this candidate point is accepted with MH probability
$P_{\rm acc}(\theta^{(t-1)} \rightarrow \theta_p) = \min\left\{1, \dfrac{P(\theta_p)\,L_n(\theta_p)\,q(\theta^{(t-1)} \mid \theta_p)}{P(\theta^{(t-1)})\,L_n(\theta^{(t-1)})\,q(\theta_p \mid \theta^{(t-1)})}\right\},$
and, if accepted, we set $\theta^{(t)} = \theta_p$; otherwise, if the candidate point is rejected, the chain remains at its old position, $\theta^{(t)} = \theta^{(t-1)}$. Repeated application of these steps results in a Markov chain $\{\theta^{(0)}, \theta^{(1)}, \ldots, \theta^{(T)}\}$ which, under certain regularity conditions, has a unique stationary distribution with posterior probability density function, $P_n(\theta)$. In practice, this means that if one looks at values of $\theta$ sufficiently far from the arbitrary initial value, $\theta^{(0)}$, the successively generated states of the chain will be distributed according to $P_n(\theta)$, the posterior probability distribution of $\theta$. This so-called burn-in period $\{\theta^{(0)}, \theta^{(1)}, \ldots, \theta^{(b-1)}\}$, where $b \ll T$, is required to allow the chain to travel to the high-probability density (HPD) region of the target distribution. Thus, the last $M = T - b + 1$ samples $\{\theta^{(b)}, \theta^{(b+1)}, \ldots, \theta^{(T)}\}$ are used to approximate the posterior parameter distribution, $P_n(\theta) \propto P(\theta)\,L_n(\theta)$.
Algorithm 1 Metropolis–Hastings (MH)
  • Input: Prior, P ( θ ) , likelihood, L n ( θ ) , and transition density, q ( θ p θ ( t 1 ) )
            Total number of samples T
  • Output: Samples { θ ( 0 ) , θ ( 1 ) , , θ ( T ) } with stationary distribution P n ( θ ) P ( θ ) L n ( θ )
  • Draw initial chain state θ ( 0 ) from the prior distribution, θ ( 0 ) P ( θ )
  • for  t = 1   to  T  do
  •     Sample a proposal, θ p q ( · θ ( t 1 ) ) , from the transition kernel
  •     Compute the acceptance probability $P_{\rm acc}(\theta^{(t-1)} \rightarrow \theta_p)$ using Equation (9)
  •     Draw a label Z from a uniform distribution, $Z \sim \mathcal{U}(0,1)$
  •     if  $P_{\rm acc}(\theta^{(t-1)} \rightarrow \theta_p) \geq Z$  then
  •         Accept the candidate point, θ ( t ) = θ p and P n ( θ ( t ) ) = P n ( θ p )
  •     else
  •         Reject the proposal and set θ ( t ) = θ ( t 1 ) and P n ( θ ( t ) ) = P n ( θ ( t 1 ) )
  •     end if
  • end for
  • Return:  { θ ( 0 ) , θ ( 1 ) , , θ ( T ) }
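For readers who prefer code, a minimal random-walk implementation of Algorithm 1 might look as follows. This is a sketch only: the Gaussian proposal and the generic log_prior and log_lik callables are illustrative assumptions, not part of DREAM-Suite.
```python
import numpy as np

def metropolis_hastings(log_prior, log_lik, theta0, T, prop_cov, rng=None):
    """Random-walk Metropolis sampler targeting P(theta) * L_n(theta)."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(theta0)
    L = np.linalg.cholesky(prop_cov)              # symmetric Gaussian proposal
    chain = np.empty((T + 1, d))
    chain[0] = np.asarray(theta0, dtype=float)
    logp = log_prior(chain[0]) + log_lik(chain[0])
    for t in range(1, T + 1):
        prop = chain[t - 1] + L @ rng.standard_normal(d)
        logp_prop = log_prior(prop) + log_lik(prop)
        # accept with probability min{1, exp(logp_prop - logp)}
        if np.log(rng.uniform()) <= logp_prop - logp:
            chain[t], logp = prop, logp_prop
        else:
            chain[t] = chain[t - 1]
    return chain
```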
In the limit of $T \to \infty$, the MAP density estimate $\hat{\theta}_n$ of the sampled Markov chain will converge to the true parameter values $\theta_0$ of the data generating process
$\lim_{n,T \to \infty} \hat{\theta}_n = \theta_0,$
or, under misspecification, we write $\lim_{n,T \to \infty} \hat{\theta}_n = \theta_*$. The asymptotic covariance matrix of the Markov chain will equal the matrix inverse of a single "slice of bread"
$\lim_{n,T \to \infty} n\,\mathrm{Cov}\big[\{\theta^{(b)}, \theta^{(b+1)}, \ldots, \theta^{(T)}\}\big] = \mathbf{A}_*^{-1}.$
Thus, after burn-in, the covariance matrix of the chain draws is the nonlinear sample equivalent of $\frac{1}{n}\mathbf{A}_*^{-1}$, whereas we desire this to be
$\lim_{n,T \to \infty} n\,\mathrm{Cov}\big[\{\theta^{(b)}, \theta^{(b+1)}, \ldots, \theta^{(T)}\}\big] = \mathbf{A}_*^{-1}\mathbf{B}_*\mathbf{A}_*^{-1}.$
In the language of Shaby [25], we want to complete the sandwich by joining the slice of bread $\mathbf{A}_*^{-1}$ to the open-faced sandwich $\mathbf{B}_*\mathbf{A}_*^{-1}$ to obtain the desired sandwich covariance.
It would be ideal if we could reformulate the recipe of Algorithm 1 so that the sampled Markov chains always converge to the right asymptotic distribution, which under misspecification is the sandwich estimator. This has proven to be a formidable task. The culprit is the implicit assumption in Bayes' law that the model correctly describes the relationship between prior, likelihood, and evidence. We can relax this assumption with the use of so-called belief distributions, but it is not immediately clear how to turn this new paradigm into an MH recipe with the correct limiting distribution under misspecification.
We do not delve into MCMC theory but rather focus our attention on more practical remedies that help adjust the random walk of Algorithm 1 to the sandwich distribution. Existing methods for doing so transform either the likelihood function $L_n(\theta)$ or the parameter values $\theta$ to match the curvature of the sandwich distribution around $\theta_*$. All these methods assume knowledge of the MAP parameter values $\hat{\theta}_*$ and the sensitivity $\mathbf{A}_*$ and variability $\mathbf{B}_*$ matrices at $\theta = \theta_*$. Next, we review two existing recipes based on magnitude and curvature adjustments of the log-likelihood function to sample the posterior sandwich distribution. Then, we present the theory of a third, more rigorous, and convenient approach, which we coin the kernel-adjustment method for sandwich-adjusted MCMC simulation.

4.2. Method 1: Magnitude Adjustment

We can adjust the magnitude of the log-likelihood function to enforce the sandwich variance matrix $\boldsymbol{\Sigma}_n^{\rm sand}$ on the posterior realizations of the sampled Markov chain(s). If the log-likelihood function $\ell_n(\theta)$ is approximately quadratic in a neighborhood of the maximum a posteriori (MAP) parameter values $\hat{\theta}_*$, then posterior exploration of the scaled log-likelihood $k\,\ell_n(\theta)$ via the MH algorithm should yield a good approximation to the sandwich-adjusted posterior. This so-called omnibus adjustment was originally proposed by Pauli et al. [31] as a correction to the second Bartlett identity, and the scalar k is estimated following the procedures outlined by Ribatet et al. [24] and di San Miniato and Sartori [32]
$k = \dfrac{d}{\mathrm{tr}\big[(\boldsymbol{\Sigma}_n^{\rm naive})^{-1}\boldsymbol{\Sigma}_n^{\rm sand}\big]} \;\overset{p}{\longrightarrow}\; \dfrac{d}{\mathrm{tr}(\mathbf{A}_*^{-1}\mathbf{B}_*)}.$
The unary trace operator $\mathrm{tr}(\cdot)$ returns the sum of the diagonal elements of the $d \times d$ matrix $\mathbf{A}_*^{-1}\mathbf{B}_*$. This trace is equal to the sum of the eigenvalues of the matrix-matrix product $\mathbf{A}_*^{-1}\mathbf{B}_*$. The omnibus adjustment can be thought of as a tempering of the log-likelihood function and flattens $\ell_n(\theta)$ for $0 < k < 1$, thereby slowing down learning and matching the informativeness of the data with the Godambe information $\mathbf{G}_n(\theta_*) = n\,\mathbf{A}_*\mathbf{B}_*^{-1}\mathbf{A}_*$. The MH algorithm applied to the scaled log-likelihood $k\,\ell_n(\theta)$ will from now on be referred to as Method 1.
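For illustration, the omnibus scalar can be computed in a few lines from the sample estimates of the sensitivity and variability matrices discussed in Section 5 (a sketch assuming NumPy; A_n and B_n are placeholders for those estimates):
```python
import numpy as np

def omnibus_scalar(A_n, B_n):
    """k = d / tr(A^{-1} B); values 0 < k < 1 temper (flatten) the log-likelihood."""
    d = A_n.shape[0]
    return d / np.trace(np.linalg.solve(A_n, B_n))

# Method 1 then runs the MH sampler on the target P(theta) * exp(k * loglik(theta)).
```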
Remark 1. 
The omnibus scalar k is reminiscent of a learning rate in a power likelihood function, where 'learning rate' refers to the extent to which the data influence posterior updating. A value of $k < 1$ downweights the data and reduces sensitivity to outliers, while $k > 1$ increases the influence of the data on the posterior distribution. To provide a deeper intuition of the advantages of a power likelihood, we revisit example 1 in Section 3. Suppose that we inadvertently assumed that $s^2 = 2$ (blue line in Figure 2) and the normal distribution model is misspecified. Our estimate of the (naive) variance of $\hat{m}$ will be $\frac{1}{n}\mathbf{A}_*^{-1} = s^2/n = 2/n$, while the true variance of $\hat{m}$ is $\sigma^2/n$ or $1/n$. With the omnibus scalar, the sensitivity and variability matrices of $k\,\ell_n^{\mathrm{n}}(m \mid s^2)$ are equal to $\mathbf{A}_* = k s^{-2}$ and $\mathbf{B}_* = k^2 s^{-4}\sigma^2$, respectively (see Appendix C). For $s^2 = 2$, we have $\mathbf{A}_* = k/2$ and $\mathbf{B}_* = k^2/4$. The information identity $\mathbf{A}_* = \mathbf{B}_*$ holds if $k^2 - 2k = 0$, yielding $k = 2$. At this value of k, the naive variance of $k\,\ell_n^{\mathrm{n}}(m \mid s^2)$ is $\frac{1}{n}\mathbf{A}_*^{-1} = 2k^{-1}/n$ or $1/n$, which is the correct variance of $\hat{m}$ as $\sigma^2 = 1$. Thus, the idea behind the omnibus scalar k is to choose its value such that the powered likelihood $L_n^k(\theta)$ satisfies the information identity $\mathbf{A}_* = \mathbf{B}_*$. As a result, the naive posterior parameter distribution under $L_n^k(\theta)$ coincides with the sandwich distribution of the original likelihood $L_n(\theta)$. This omnibus correction yields the correct variance-covariance matrix for the estimator $\hat{\theta}_*$.
Remark 2. 
The use of a single scalar k for all d parameters may suffice when the sandwich distribution is well approximated by a multivariate Gaussian, that is, when the posterior is nearly symmetric and its surface is approximately quadratic around θ ^ * . However, in the presence of model misspecification, prior truncation, or directional heterogeneity in sensitivity, a global scalar k will distort the shape of the adjusted posterior distribution. In such cases, a separate scaling factor is required for each dimension to preserve the local geometry and asymmetry of the sandwich distribution. This directional asymmetry becomes especially pronounced in small-sample settings, say, n < 100 , where the central limit approximation does not hold and curvature varies across parameters.
Remark 3. 
Other definitions of the scalar k have been proposed in the literature [33], including those based on moment-matching conditions [34], adjustments to degrees of freedom [35] inspired by the Satterthwaite–Welch method [36,37], and alternative scaling approaches [38]. See also Varin et al. [39] for a broader overview.
Remark 4. 
The exponent in a power likelihood is typically denoted by λ rather than k.

4.3. Method 2: Curvature Adjustment

While the asymptotic covariance of $\hat{\theta}_*$ is the sandwich matrix $\boldsymbol{\Sigma}_n^{\rm sand}$, the MH algorithm instead yields $\frac{1}{n}\mathbf{A}_*^{-1}$, the inverse of a single "slice of bread". In the words of Shaby [25], we wish to complete the sandwich by attaching this slice to the open-faced piece $\mathbf{B}_*\mathbf{A}_*^{-1}$ to recover the full sandwich covariance $\frac{1}{n}\mathbf{A}_*^{-1}\mathbf{B}_*\mathbf{A}_*^{-1}$. We now review two approaches that adjust the curvature of the log-likelihood to match this target distribution.

4.3.1. A-Posteriori Adjustment

Let $\{\theta^{(b)}, \theta^{(b+1)}, \ldots, \theta^{(T)}\}$ denote the post-burn-in samples from a Markov chain of length T, where the first b realizations are discarded. This yields $M = T - b + 1$ samples drawn from the naive posterior $P_n(\theta) \propto P(\theta)\,L_n(\theta)$ based on a data sample of size n. We center these samples at a reference point $\hat{\theta}_*$ (e.g., posterior mode or mean) and pre-multiply by a $d \times d$ matrix $\boldsymbol{\Psi}_*$ to obtain open-faced sandwich (OFS)-adjusted samples [25]
$\theta^{(j)}_{\rm ofs} = \hat{\theta}_* + \boldsymbol{\Psi}_*\big(\theta^{(j)} - \hat{\theta}_*\big), \qquad j = 1, \ldots, M.$
This linear map applies direction-specific dilations along the principal axes of the local posterior ellipsoid without unnecessary rotation [38], thereby changing the local geometry of the naive posterior to that of the sandwich distribution. A convenient choice is [25]
$\boldsymbol{\Psi}_* = \mathbf{A}_*^{-1}\mathbf{B}_*^{1/2}\mathbf{A}_*^{1/2}.$
Under standard regularity conditions, the naive posterior is locally Gaussian with covariance $\frac{1}{n}\mathbf{A}_*^{-1}$. After centering at $\hat{\theta}_*$ and applying the transformation $\boldsymbol{\Psi}_*$, we obtain
$\mathrm{Cov}(\theta_{\rm ofs}) = \mathrm{Cov}\big[\boldsymbol{\Psi}_*(\theta - \hat{\theta}_*)\big] = \boldsymbol{\Psi}_*\,\tfrac{1}{n}\mathbf{A}_*^{-1}\,\boldsymbol{\Psi}_*^\top = \tfrac{1}{n}\,\mathbf{A}_*^{-1}\mathbf{B}_*^{1/2}\mathbf{A}_*^{1/2}\,\mathbf{A}_*^{-1}\,\mathbf{A}_*^{1/2}\mathbf{B}_*^{1/2}\mathbf{A}_*^{-1} = \tfrac{1}{n}\,\mathbf{A}_*^{-1}\mathbf{B}_*\mathbf{A}_*^{-1},$
which is precisely the asymptotic sandwich covariance. Here, $\mathbf{A}_*^{1/2}$ is the principal square root of the symmetric positive definite matrix $\mathbf{A}_*$ such that $\mathbf{A}_*^{1/2}\cdot\mathbf{A}_*^{1/2} = \mathbf{A}_*$.
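A compact numerical recipe for the OFS adjustment of Equation (11) might look as follows (a sketch with NumPy/SciPy; it assumes symmetric positive definite estimates A_n and B_n and a matrix of post-burn-in samples, one draw per row):
```python
import numpy as np
from scipy.linalg import sqrtm

def ofs_adjust(samples, theta_hat, A_n, B_n):
    """Map naive posterior draws (one per row) to OFS-adjusted draws."""
    A_half = np.real(sqrtm(A_n))                  # principal square root of A
    B_half = np.real(sqrtm(B_n))                  # principal square root of B
    Psi = np.linalg.solve(A_n, B_half @ A_half)   # Psi = A^{-1} B^{1/2} A^{1/2}
    return theta_hat + (samples - theta_hat) @ Psi.T
```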
This a posteriori correction method is computationally appealing, as it does not require additional evaluations of the likelihood function $L_n(\theta)$. However, the OFS-adjusted posterior samples may not accurately represent the true sandwich parameter distribution, particularly if the adjustment matrix $\boldsymbol{\Psi}_*$ is not constant over the region around $\hat{\theta}_*$ in which there is sandwich parameter uncertainty. This assumption is not unique to curvature-based methods. What is specific to such methods is the nonuniqueness of the matrix square roots $\mathbf{A}_*^{1/2}$ and $\mathbf{B}_*^{1/2}$ when the sensitivity and variability matrices $\mathbf{A}_*$ and $\mathbf{B}_*$ are not positive semi-definite. This can occur if the quadratic approximation of $\ell_n(\theta)$ via the second-order Taylor expansion at $\theta = \hat{\theta}_*$ does not adequately describe the actual curvature of the log-likelihood function. In such cases, $\mathbf{A}_*$ and $\mathbf{B}_*$ may not be symmetric. Moreover, under certain conditions, these matrices may be ill-conditioned or nearly singular. This may arise when two or more parameters exhibit strong linear dependence, when parameters have very different magnitudes, and/or when the sample size n is small. Ill-conditioning can also be introduced as an artifact of numerical approximation, particularly when finite differences are used to estimate the first- and second-order partial derivatives of $\ell_n(\theta)$ with respect to $\theta$. Floating-point arithmetic can lead to numerical instability due to rounding errors and subtractive cancellation when evaluating small differences between almost equal numbers. This will distort the computation of $\mathbf{A}_*$ and $\mathbf{B}_*$.
We can symmetrize a matrix $\mathbf{Z}$ by working with $\frac{1}{2}(\mathbf{Z} + \mathbf{Z}^\top)$ instead. Ill-conditioning can be addressed through Tikhonov regularization by adding to $\mathbf{A}_*$ and/or $\mathbf{B}_*$ a diagonal matrix, $\epsilon\mathbf{I}_d$, where $\mathbf{I}_d$ is the $d \times d$ identity matrix and $\epsilon > 0$ is a small positive scalar. This technique, also known as ridge regression, changes the eigenvalues of the matrix from $\underline{\lambda}_1, \ldots, \underline{\lambda}_d$ to $\underline{\lambda}_1 + \epsilon, \ldots, \underline{\lambda}_d + \epsilon$. If all eigenvalues are positive and the matrix is symmetric, then its principal square root will be unique. Throughout this paper, we assume that the matrices $\mathbf{A}_*$ and $\mathbf{B}_*$ are positive definite.
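Both safeguards are one-liners in practice. The sketch below (NumPy assumed) symmetrizes an estimated information matrix and adds a small ridge before any square root or inverse is computed.
```python
import numpy as np

def stabilize(Z, eps=1e-8):
    """Symmetrize Z and add a small ridge (Tikhonov regularization) to its diagonal."""
    Z_sym = 0.5 * (Z + Z.T)
    return Z_sym + eps * np.eye(Z.shape[0])
```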
The matrix square roots $\mathbf{A}_*^{1/2}$ and $\mathbf{B}_*^{1/2}$ can be computed using different methods, including the generalized Cholesky factorization [40,41], singular value decomposition [42], or eigendecomposition [43,44]. These methods yield similar matrix square roots when $\mathbf{A}_*$ and $\mathbf{B}_*$ are approximately symmetric and positive definite, since the key properties of symmetric positive definite matrices (such as real, positive eigenvalues and diagonalizability) still hold approximately in such cases. In this case, Cholesky factorization provides computationally inexpensive and stable estimates of $\mathbf{A}_*^{1/2}$ and $\mathbf{B}_*^{1/2}$ [45]. However, if the matrices $\mathbf{A}_*$ and $\mathbf{B}_*$ are far from symmetric, we should expect different matrix square roots, as the methods listed above have different optimality and invariance properties [46]. Preferably, the transformation from $\theta^{(j)}$ to $\theta^{(j)}_{\rm ofs}$ preserves directions of asymmetry. Singular value decomposition is numerically stable and preserves key geometric attributes of $\mathbf{A}_*$ and $\mathbf{B}_*$ [25]. For symmetric matrices, the matrix square roots are given by
$\mathbf{A}_*^{1/2} = \mathbf{U}_a\mathbf{D}_a^{1/2}\mathbf{U}_a^\top \qquad \text{and} \qquad \mathbf{B}_*^{1/2} = \mathbf{U}_b\mathbf{D}_b^{1/2}\mathbf{U}_b^\top,$
where $\mathbf{U}_a\mathbf{D}_a\mathbf{U}_a^\top$ and $\mathbf{U}_b\mathbf{D}_b\mathbf{U}_b^\top$ are the eigendecompositions of the sensitivity and variability matrices, respectively. Here, $\mathbf{U}_a$ and $\mathbf{U}_b$ are orthogonal matrices whose columns are the eigenvectors of $\mathbf{A}_*$ and $\mathbf{B}_*$, and $\mathbf{D}^{1/2} = \mathrm{diag}\big(\sqrt{\underline{\lambda}_1}, \ldots, \sqrt{\underline{\lambda}_d}\big)$ is the diagonal matrix of square roots of the corresponding eigenvalues.
The OFS adjustment of Equation (11) is equivalent to a linear transformation of the posterior samples. This transformation may not yield an accurate description of the true sandwich distribution if the model is highly nonlinear and/or the parameters are highly correlated. This difference between linear and nonlinear confidence intervals is well understood for the naive variance estimator [14,47,48,49,50], and these findings also apply to the sandwich estimator. In this situation, a priori adjustment (discussed next) will help determine whether the true sandwich confidence regions extend beyond this linear approximation. A consistent estimator of $\boldsymbol{\Psi}_*$ should generate credible intervals that are consistent $100(1-\alpha)\%$ confidence intervals.

4.3.2. A Priori Adjustment

The OFS adjustment (Method 2a) transforms posterior samples post hoc, after the MH algorithm has completed sampling. While computationally appealing, this approach does not guarantee an accurate characterization of the true sandwich distribution under model misspecification. An alternative is to adjust the curvature of the likelihood function $L_n(\theta)$ near the point estimate $\hat{\theta}_*$, thereby preserving the correct asymptotic behavior. Ribatet et al. [24] proposed achieving this by applying the affine transformation of Equation (11) during MCMC simulation
$\theta_p^{\rm ca} = \hat{\theta}_* + \mathbf{C}_*(\theta_p - \hat{\theta}_*).$
The likelihood $L_n(\theta)$ is then evaluated at the curvature-adjusted candidate point $\theta_p^{\rm ca}$ instead of $\theta_p$, resulting in the curvature-adjusted likelihood function
$L_n^{\rm ca}(\theta) = L_n(\theta^{\rm ca}) = L_n\big(\hat{\theta}_* + \mathbf{C}_*(\theta - \hat{\theta}_*)\big).$
The transformation in (14) does not change the location of the MAP solution. Indeed, if we substitute $\theta = \hat{\theta}_*$, then $\theta^{\rm ca} = \hat{\theta}_*$ and, thus, the MAP solution is also a global maximum of $L_n^{\rm ca}(\theta)$. The gradient (score) and curvature (Hessian) of the curvature-adjusted log-likelihood $\ell_n^{\rm ca}(\theta)$ at $\theta$ are equal to
$\nabla \ell_n^{\rm ca}(\theta) = \mathbf{C}_*^\top \nabla \ell_n(\theta^{\rm ca}) \qquad \text{and} \qquad \nabla^2 \ell_n^{\rm ca}(\theta) = \mathbf{C}_*^\top \nabla^2 \ell_n(\theta^{\rm ca})\,\mathbf{C}_*.$
At the MAP estimator, the Hessian of the original log-likelihood is asymptotically $\nabla^2 \ell_n(\hat{\theta}_*) \approx -n\,\mathbf{A}_*$, so that
$\nabla^2 \ell_n^{\rm ca}(\hat{\theta}_*) = -n\,\mathbf{C}_*^\top\mathbf{A}_*\mathbf{C}_*.$
To obtain the correct asymptotic curvature under model misspecification, we equate this expression to the sandwich information matrix
$\mathbf{C}_*^\top\mathbf{A}_*\mathbf{C}_* = \mathbf{A}_*\mathbf{B}_*^{-1}\mathbf{A}_*.$
If $\mathbf{A}_*$ is symmetric positive definite, both its square root $\mathbf{A}_*^{1/2}$ and inverse square root $\mathbf{A}_*^{-1/2}$ exist. Multiplying both sides of Equation (16) on the left and right by $\mathbf{A}_*^{-1/2}$ gives
$\big(\mathbf{A}_*^{-1/2}\mathbf{C}_*^\top\mathbf{A}_*^{1/2}\big)\big(\mathbf{A}_*^{1/2}\mathbf{C}_*\mathbf{A}_*^{-1/2}\big) = \mathbf{A}_*^{-1/2}\,\mathbf{A}_*\mathbf{B}_*^{-1}\mathbf{A}_*\,\mathbf{A}_*^{-1/2} = \mathbf{A}_*^{1/2}\mathbf{B}_*^{-1}\mathbf{A}_*^{1/2}.$
Letting $\mathbf{Q} = \mathbf{A}_*^{1/2}\mathbf{C}_*\mathbf{A}_*^{-1/2}$, and hence $\mathbf{C}_* = \mathbf{A}_*^{-1/2}\mathbf{Q}\mathbf{A}_*^{1/2}$, the above expression simplifies to
$\mathbf{Q}^\top\mathbf{Q} = \mathbf{A}_*^{1/2}\mathbf{B}_*^{-1}\mathbf{A}_*^{1/2}.$
The matrix $\mathbf{Q}$ is not unique, since any matrix of the form $\mathbf{R}\mathbf{Q}$, where $\mathbf{R}$ is an orthogonal matrix ($\mathbf{R}^\top\mathbf{R} = \mathbf{I}_d$), also satisfies the same condition. This phenomenon is known as rotational freedom, and it implies that any matrix square root $\mathbf{Q}_r$ is only defined up to an orthogonal rotation or reflection
$\mathbf{Q}_r = \mathbf{R}\mathbf{Q} = \mathbf{R}\big(\mathbf{A}_*^{1/2}\mathbf{B}_*^{-1}\mathbf{A}_*^{1/2}\big)^{1/2}.$
Substituting this expression for $\mathbf{Q}_r$ back into the expression for $\mathbf{C}_*$ yields the general form of the $d \times d$ curvature-adjustment matrix from Equation (14)
$\mathbf{C}_* = \mathbf{A}_*^{-1/2}\,\mathbf{R}\,\big(\mathbf{A}_*^{1/2}\mathbf{B}_*^{-1}\mathbf{A}_*^{1/2}\big)^{1/2}\,\mathbf{A}_*^{1/2}.$
This derivation demonstrates the nonuniqueness of matrix $\mathbf{C}_*$. If $\mathbf{A}_*$ and $\mathbf{B}_*$ commute (e.g., are simultaneously diagonalizable), the expression for $\mathbf{C}_*$ in Equation (17) simplifies to the following more compact form [24,32,38]
$\mathbf{C}_* = \mathbf{A}_*^{-1/2}\mathbf{G}_0^{1/2} = \mathbf{A}_*^{-1/2}\,\mathbf{A}_*^{1/2}\mathbf{B}_*^{-1/2}\mathbf{A}_*^{1/2} = \mathbf{B}_*^{-1/2}\mathbf{A}_*^{1/2},$
and enforces the sandwich covariance matrix $\frac{1}{n}\mathbf{A}_*^{-1}\mathbf{B}_*\mathbf{A}_*^{-1}$ onto the sampled Markov chains.
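For the commuting case of Equation (18), the tuning matrix can be assembled directly from matrix square roots, for example as in the following sketch (NumPy/SciPy assumed; A_n and B_n are symmetric positive definite sample estimates):
```python
import numpy as np
from scipy.linalg import sqrtm

def curvature_matrix(A_n, B_n):
    """C = B^{-1/2} A^{1/2}, the compact form of Equation (18) when A and B commute."""
    A_half = np.real(sqrtm(A_n))
    B_half = np.real(sqrtm(B_n))
    return np.linalg.solve(B_half, A_half)

def curvature_adjust(theta, theta_hat, C):
    """Affine map of a candidate point to its curvature-adjusted counterpart."""
    return theta_hat + C @ (theta - theta_hat)
```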
The mapping from θ to θ ca can be regarded as a succession of transformations in which the ellipsoidal contours of L n ca ( θ ) are first mapped to spheroids, and then transformed back to the contours of L n ( θ ) [38]. The MH acceptance probability of candidate point θ p becomes
$P_{\rm acc}(\theta^{(t-1)} \rightarrow \theta_p) = \min\left\{1, \dfrac{P(\theta_p)\,L_n(\theta_p^{\rm ca})\,q(\theta^{(t-1)} \mid \theta_p)}{P(\theta^{(t-1)})\,L_n(\theta^{(t-1)}_{\rm ca})\,q(\theta_p \mid \theta^{(t-1)})}\right\}.$
Thus, the comparison of the candidate point $\theta_p$ and the current chain state $\theta^{(t-1)}$ in curvature-adjusted MCMC simulation takes place after $\theta_p$ and $\theta^{(t-1)}$ are scaled and rotated [25]. Ribatet et al. [24] build on a result from Kent [51] to show that the acceptance probability in (19) shares the same asymptotic distribution as the true likelihood ratio. They further argue that the resulting sample has an asymptotic stationary distribution that is normal, with the desired sandwich covariance matrix.
Algorithm 2 outlines the steps of curvature-adjusted MCMC simulation using the MH algorithm. We refer to this procedure as the Curvature-Adjusted Metropolis–Hastings (CAMH) algorithm. The Markov chain generated by the CAMH algorithm has an asymptotic stationary distribution that is d-variate normal with mean $\hat{\theta}_*$ and $d \times d$ sandwich covariance matrix $\frac{1}{n}\mathbf{A}_*^{-1}\mathbf{B}_*\mathbf{A}_*^{-1}$. For a symmetric proposal distribution, $q(\theta^{(t-1)} \mid \theta_p) = q(\theta_p \mid \theta^{(t-1)})$, Equation (19) simplifies to
$P_{\rm acc}(\theta^{(t-1)} \rightarrow \theta_p) = \min\left\{1, \dfrac{P(\theta_p)\,L_n(\theta_p^{\rm ca})}{P(\theta^{(t-1)})\,L_n(\theta^{(t-1)}_{\rm ca})}\right\}.$
The above expression further reduces to a likelihood ratio with a uniform prior, P ( θ ) = U d ( θ min , θ max ) , where θ min and θ max are d-vectors with lower and upper bounds of the parameters, where θ j min < θ j max for all j = 1 , , d .
The CAMH algorithm facilitates exploration of the posterior sandwich distribution, though it presents certain implementation challenges. First and foremost, the tuning matrix $\mathbf{C}_*$ may not be uniquely defined when the log-likelihood $\ell_n(\theta)$ around $\hat{\theta}_*$ is not exactly quadratic. In such cases, the nonuniqueness of the matrix square roots $\mathbf{A}_*^{1/2}$ and $\mathbf{B}_*^{1/2}$ introduces an arbitrary rotation of the spheroids prior to the back-transformation. Although this rotation is inconsequential when the likelihood is locally quadratic around $\hat{\theta}_*$, care must be taken to ensure that the mapping preserves directions of asymmetry in the posterior sandwich distribution. This issue is not specific to CAMH; any method that relies on the square roots of the "information" matrices $\mathbf{A}_*$ and $\mathbf{B}_*$ is subject to this ambiguity.
Second, a subtler issue arises from the nature of the curvature adjustment itself. The transformation is affine and acts on the parameter values such that the likelihood at a point $\theta$ is evaluated at its transformed counterpart $\theta^{\rm ca}$, that is, $L_n(\theta^{\rm ca})$. This approach may be easy to implement, but is not intuitive. It may assign high likelihoods to points that are relatively far from the MAP estimate $\hat{\theta}_*$, even when their original likelihoods were comparatively low, and vice versa. For example, if $\theta_m$ is a local maximum of $L_n(\theta)$, the transformation substitutes $L_n(\theta_m)$ with $L_n\big(\hat{\theta}_* + \mathbf{C}_*(\theta_m - \hat{\theta}_*)\big)$, regardless of the actual likelihood at $\theta_m$. As a result, asymmetries or non-elliptical features of $L_n(\theta)$ can be distorted, compromising the geometric fidelity of the sandwich distribution.
Finally, the parameter transformation used in curvature adjustment can conflict with bounded parameter spaces. A candidate point θ p may satisfy the prior constraints in the original parameterization, yet its curvature-adjusted counterpart θ p ca may lie outside the feasible parameter space. This complication is not insurmountable, but requires careful handling to ensure that the Markov chain respects parameter constraints, preserves detailed balance, and maintains acceptable sampling efficiency.
Algorithm 2 Curvature-adjusted Metropolis–Hastings (CAMH)
  • Input: Prior, P ( θ ) , likelihood, L n ( θ ) , and transition density, q ( θ p θ ( t 1 ) )
       Total number of samples T
       MAP solution, θ ^ * , and d × d tuning matrix C * of Equation (18)
  • Output: Samples { θ ( 0 ) , θ ( 1 ) , , θ ( T ) } from sandwich posterior of P n ( θ ) P ( θ ) L n ( θ )
  • Draw initial chain state θ ( 0 ) from the prior distribution, θ ( 0 ) P ( θ )
  • Transform initial state, $\theta^{(0)}_{\rm ca} = \hat{\theta}_* + \mathbf{C}_*(\theta^{(0)} - \hat{\theta}_*)$, and compute $L_n(\theta^{(0)}_{\rm ca})$
  • for  t = 1   to  T  do
  •     Sample a proposal θ p q ( · θ ( t 1 ) ) from the transition kernel
  •     Transform the candidate point $\theta_p^{\rm ca} = \hat{\theta}_* + \mathbf{C}_*(\theta_p - \hat{\theta}_*)$ and compute $L_n(\theta_p^{\rm ca})$
  •     Compute the acceptance probability P acc ( θ ( t 1 ) θ p ) using Equation (19)
  •     Draw a label Z from a uniform distribution, Z U ( 0 , 1 )
  •     if  $P_{\rm acc}(\theta^{(t-1)} \rightarrow \theta_p) \geq Z$  then
  •    Accept the candidate point, θ ( t ) = θ p and P n ( θ ( t ) ) = P n ( θ p )
  •     else
  •    Reject the proposal and set θ ( t ) = θ ( t 1 ) and P n ( θ ( t ) ) = P n ( θ ( t 1 ) )
  •     end if
  • end for
  • Return:  { θ ( 0 ) , θ ( 1 ) , , θ ( T ) }

4.4. Method 3: Kernel Adjustment

Given the limitations of existing sampling methods, we propose a new approach, the so-called kernel-adjustment method, which combines elements of magnitude- and curvature-adjusted MCMC simulation but introduces two key innovations: (i) a direction-dependent scaling factor that captures asymmetric and non-quadratic features of the sandwich distribution, and (ii) an implementation that avoids matrix square roots of $\mathbf{A}_*$ and $\mathbf{B}_*$. This method guarantees an accurate description of the sandwich distribution by MCMC methods. We first develop the theoretical framework, then assess the performance of both the kernel-adjustment and existing MCMC sampling methods through applications to commonly used parametric distributions and to numerical models, using both synthetic and measured data.

Theory

Suppose $P_n(\theta) \propto P(\theta)\,L_n(\theta)$ is the unnormalized posterior density or, in logarithmic form and up to an additive constant, $\mathcal{P}_n(\theta) = \mathcal{P}(\theta) + \ell_n(\theta)$. Our proposed solution is to sample from the density
$\phi(\theta) \propto \exp\Big(\lambda(\theta)\big[\ell_n(\theta) - \ell_n(\hat{\theta}_*)\big]\Big),$
where $\lambda(\theta): \mathbb{R}^d \to \mathbb{R}$ is a scalar-valued function which linearly scales the difference in the log-likelihoods of the points $\theta$ and $\hat{\theta}_*$. The argument of the exponential,
$\ell_n^{\rm p}(\theta \mid \lambda) = \log L_n^{\rm p}(\theta \mid \lambda) = \lambda(\theta)\big[\ell_n(\theta) - \ell_n(\hat{\theta}_*)\big],$
is itself a power log-likelihood function. By construction, $\ell_n^{\rm p}(\hat{\theta}_* \mid \lambda) = 0$, while for all other $\theta \in \Theta \subseteq \mathbb{R}^d$ it is negative. Subtracting the value at $\hat{\theta}_*$ recenters $\ell_n^{\rm p}(\theta \mid \lambda)$ so that its Hessian $\mathbf{H}_n(\hat{\theta}_*) = \nabla^2 \ell_n^{\rm p}(\hat{\theta}_* \mid \lambda)$ accurately captures the local curvature, free from arbitrary offsets in $\ell_n(\theta)$. This centering allows us to exactly align the curvature of the power log-likelihood function with that of the sandwich distribution, even when $\ell_n(\theta)$ is asymmetric near the mode, and eliminates the need for matrix square roots (as shown later). Choosing $\hat{\theta}_*$ as the centering point ensures generality: under a uniform prior it coincides with the ML estimator, while under an informative prior it becomes the MAP estimator. This makes the approach seamlessly applicable to both frequentist and Bayesian settings. In short, subtracting $\ell_n(\hat{\theta}_*)$ normalizes the curvature for scaling and avoids contamination by arbitrary log-likelihood offsets.
The Metropolis acceptance probability for a candidate point $\theta_p$ now becomes
$P_{\rm acc}(\theta_p \mid \theta^{(t-1)}) = \min\left\{1, \dfrac{P(\theta_p)\,L_n^{\rm p}(\theta_p \mid \lambda)\,q(\theta^{(t-1)} \mid \theta_p)}{P(\theta^{(t-1)})\,L_n^{\rm p}(\theta^{(t-1)} \mid \lambda)\,q(\theta_p \mid \theta^{(t-1)})}\right\},$
where $L_n^{\rm p}(\theta \mid \lambda)$ is the normalized power likelihood function
$L_n^{\rm p}(\theta \mid \lambda) = \exp\big(\ell_n^{\rm p}(\theta \mid \lambda)\big) = \exp\big(\lambda(\theta)[\ell_n(\theta) - \ell_n(\hat{\theta}_*)]\big) = \exp\big(\lambda(\theta)\log L_n(\theta) - \lambda(\theta)\log L_n(\hat{\theta}_*)\big) = \exp\big(\log L_n(\theta)^{\lambda(\theta)} - \log L_n(\hat{\theta}_*)^{\lambda(\theta)}\big) = \dfrac{L_n(\theta)^{\lambda(\theta)}}{L_n(\hat{\theta}_*)^{\lambda(\theta)}}.$
The acceptance probability in Equation (22) reduces to a likelihood ratio
$P_{\rm acc}(\theta_p \mid \theta^{(t-1)}) = \min\big\{1, L_n^{\rm p}(\theta_p \mid \lambda)/L_n^{\rm p}(\theta^{(t-1)} \mid \lambda)\big\} = \min\left\{1, \dfrac{L_n(\hat{\theta}_*)^{\lambda(\theta^{(t-1)})}}{L_n(\hat{\theta}_*)^{\lambda(\theta_p)}} \cdot \dfrac{L_n(\theta_p)^{\lambda(\theta_p)}}{L_n(\theta^{(t-1)})^{\lambda(\theta^{(t-1)})}}\right\} = \min\left\{1, \dfrac{L_n(\theta_p)^{\lambda(\theta_p)}}{L_n(\theta^{(t-1)})^{\lambda(\theta^{(t-1)})}}\; L_n(\hat{\theta}_*)^{\lambda(\theta^{(t-1)}) - \lambda(\theta_p)}\right\},$
in the case of a uniform prior and symmetric transition kernel of the Markov chain.
We are now left to discuss the choice of $\lambda(\theta)$. What should this scalar-valued function be? For $\theta$ in the vicinity of $\hat{\theta}_*$ we know that
$\ell_n(\theta) - \ell_n(\hat{\theta}_*) \approx -\tfrac{1}{2}\,n\,(\theta - \hat{\theta}_*)^\top \mathbf{A}_*\,(\theta - \hat{\theta}_*),$
whereas we desire this to be
$\ell_n(\theta) - \ell_n(\hat{\theta}_*) = -\tfrac{1}{2}\,n\,(\theta - \hat{\theta}_*)^\top \mathbf{A}_*\mathbf{B}_*^{-1}\mathbf{A}_*\,(\theta - \hat{\theta}_*).$
So a sensible choice for $\lambda(\theta)$ might be
$\lambda(\theta) = \dfrac{(\theta - \hat{\theta}_*)^\top \mathbf{A}_*\mathbf{B}_*^{-1}\mathbf{A}_*\,(\theta - \hat{\theta}_*)}{(\theta - \hat{\theta}_*)^\top \mathbf{A}_*\,(\theta - \hat{\theta}_*)}.$
With this formulation for $\lambda(\theta)$, the acceptance probability $P_{\rm acc}(\theta_p \mid \theta^{(t-1)})$ of Equation (22) will guide a Markov chain to a stationary distribution with the underlying probability density function $\phi(\theta)$ in Equation (20). A formal proof of this result is provided in Appendix B, and the complete recipe for the sandwich-adjusted Metropolis–Hastings (SAMH) algorithm is given in Algorithm 3. By multiplying the difference in the log-likelihoods between any point $\theta$ and the MAP solution $\hat{\theta}_*$ by the learning rate $\lambda(\theta)$, the resulting Markov chain converges to a stationary distribution with the correct asymptotic sandwich variance.
Algorithm 3 Sandwich-adjusted Metropolis–Hastings (SAMH)
  • Input: Prior, P ( θ ) , likelihood, L n ( θ ) , and transition density, q ( θ p θ ( t 1 ) )
       Total number of samples T
       MAP solution, θ ^ * , associated likelihood, L n ( θ ^ * ) , and matrices A * and B *
  • Output: Samples { θ ( 0 ) , θ ( 1 ) , , θ ( T ) } from sandwich posterior of P n ( θ ) P ( θ ) L n ( θ )
  • Draw initial chain state θ ( 0 ) from the prior distribution, θ ( 0 ) P ( θ )
  • Compute λ ( θ ( 0 ) ) in Equation (24) and L n p ( θ ( 0 ) λ ) of Equation (21)
  • for  t = 1   to  T  do
  •     Sample a proposal θ p q ( · θ ( t 1 ) ) from the transition density
  •     Compute λ ( θ p ) in Equation (24) and L n p ( θ p λ ) in Equation (21)
  •     Compute the acceptance probability, P acc ( θ ( t 1 ) θ p ) , using Equation (22)
  •     Draw a label Z from a uniform distribution, Z U ( 0 , 1 )
  •     if  $P_{\rm acc}(\theta^{(t-1)} \rightarrow \theta_p) \geq Z$  then
  •    Accept the candidate point, θ ( t ) = θ p and P n ( θ ( t ) ) = P n ( θ p )
  •     else
  •    Reject the proposal and set θ ( t ) = θ ( t 1 ) and P n ( θ ( t ) ) = P n ( θ ( t 1 ) )
  •     end if
  • end for
  • Return:  { θ ( 0 ) , θ ( 1 ) , , θ ( T ) }
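A minimal sketch of the learning rate and the kernel-adjusted log-likelihood used by SAMH is given below (NumPy assumed; theta_hat, loglik_map and the matrices A_n and B_n are inputs obtained as described in Section 5, and loglik is the user-supplied log-likelihood function).
```python
import numpy as np

def learning_rate(theta, theta_hat, A_n, B_n):
    """lambda(theta) of Equation (24): ratio of Godambe- and sensitivity-based quadratic forms."""
    v = theta - theta_hat                     # undefined at theta == theta_hat (see Remark 7)
    G = A_n @ np.linalg.solve(B_n, A_n)       # Godambe information A B^{-1} A
    return (v @ G @ v) / (v @ A_n @ v)

def power_loglik(theta, theta_hat, loglik, loglik_map, A_n, B_n):
    """Kernel-adjusted log-likelihood lambda(theta) * [l_n(theta) - l_n(theta_hat)]."""
    lam = learning_rate(theta, theta_hat, A_n, B_n)
    return lam * (loglik(theta) - loglik_map)
```
Inside an MH sampler, power_loglik simply replaces the ordinary log-likelihood term in the acceptance ratio; the remainder of Algorithm 1 is unchanged.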
Before proceeding with a detailed discussion of our methodology, we briefly revisit our expression for $\lambda(\theta)$. According to Definition 7 in Section 2, we can express the numerator of Equation (24) in terms of $\mathbf{G}_0$, the expected Godambe information of a single observation
$\lambda(\theta) = \dfrac{(\theta - \hat{\theta}_*)^\top \mathbf{G}_0\,(\theta - \hat{\theta}_*)}{(\theta - \hat{\theta}_*)^\top \mathbf{A}_*\,(\theta - \hat{\theta}_*)}.$
Matrix $\mathbf{A}_*$ in the denominator is itself an information matrix. Under correct model specification, this matrix equals $\mathbf{I}_0$, the expected Fisher information of a single datum. The scalar $\lambda(\theta)$ is thus the ratio of two quadratic forms in information matrices, with $\mathbf{A}_*$ reflecting sensitivity-based curvature and $\mathbf{G}_0$ representing the Godambe (sandwich) information. In most practical cases of model misspecification, the latter has smaller curvature, so $0 < \lambda(\theta) \leq 1$, with equality at 1 under correct specification. Values $\lambda(\theta) > 1$ can occur but are uncommon, arising when the variability-based curvature locally exceeds the sensitivity-based curvature. In all cases, $\lambda(\theta)$ acts as a parameter-dependent learning rate that tempers data informativeness, ensuring that Algorithm 3 converges to the sandwich rather than the naive posterior distribution. The nonnegative multiplier $\lambda(\theta)$ can also be viewed as a kernel: it is symmetric about $\hat{\theta}_*$ and positive semi-definite due to the information matrices in both numerator and denominator. In keeping with the terminology of the other two methods, we refer to our approach as the kernel-adjustment method, though the term generalized power likelihood is also appropriate.
Remark 5. 
We could define $\phi(\theta) = \exp\big(-\tfrac{1}{2}\,n\,(\theta - \hat{\theta}_*)^\top \mathbf{A}_*\mathbf{B}_*^{-1}\mathbf{A}_*\,(\theta - \hat{\theta}_*)\big)$ as the probability density function of the sandwich posterior distribution [52]. This mathematical form is equivalent to a d-variate normal distribution $\mathcal{N}_d\big(\hat{\theta}_*, \tfrac{1}{n}\mathbf{A}_*^{-1}\mathbf{B}_*\mathbf{A}_*^{-1}\big)$ centered on $\hat{\theta}_*$ with $d \times d$ sandwich covariance matrix. We can draw any desired number of "posterior" samples from this distribution. However, this assumes that the local quadratic sandwich approximation is valid. Algorithm 3 relaxes this assumption for finite n. The rationale for choosing a multiplicative form (rather than, say, an additive one) will become evident soon when we interpret our method in terms of power likelihoods.
Remark 6. 
For the single-parameter case, the matrices $\mathbf{A}_*$ and $\mathbf{B}_*$ are scalars, and $\lambda = A_*/B_*$ can be interpreted as an estimator of the "best" $\lambda$ in the power likelihood, $\exp\big(\lambda\,\ell_n(\theta)\big)$, in the sense of having correctly sized credible sets asymptotically. In other words, for $d = 1$, our approach reduces to the magnitude-adjustment method (Method 1), but with one important distinction. We apply the power $\lambda$ to the log-likelihood difference $\ell_n(\theta) - \ell_n(\hat{\theta}_*)$ rather than to the log-likelihood $\ell_n(\theta)$ itself, as in Ribatet et al. [24]. This centering of the power likelihood around $\ell_n(\hat{\theta}_*)$ ensures proper scaling for $d > 1$.
Remark 7. 
The ratio $\lambda(\theta)$ in Equation (24) depends only on the direction of $\theta - \hat{\theta}_*$, not on its magnitude. In fact, we obtain $\lambda(\theta) = \lambda(r\theta)$ for any scalar $r \neq 0$. Although $\lambda(\theta)$ is not defined at $\theta = \hat{\theta}_*$, the product $\lambda(\theta)\big[\ell_n(\theta) - \ell_n(\hat{\theta}_*)\big]$ remains well defined. This is because both the denominator of Equation (24) and the log-likelihood difference $\ell_n(\theta) - \ell_n(\hat{\theta}_*)$ exhibit similar quadratic behavior in a neighborhood of $\theta = \hat{\theta}_*$. Thus, Algorithm 3 generalizes the power likelihood to the multi-parameter case $d > 1$ as
$P_n(\theta) \propto \exp\Big(\lambda(\theta)\big[\ell_n(\theta) - \ell_n(\hat{\theta}_*)\big]\Big),$
where $\lambda(\theta) = \lambda(r\theta)$ for any scalar $r \neq 0$, thus defining a distinct scaling factor for each direction in parameter space.
Remark 8. 
The kernel $\lambda(\theta)$ has an eigenspace decomposition. We can transform the $d \times 1$ vector of differences $\theta - \hat{\theta}_*$ and write $\lambda$ as a function of the matrix-vector product $\vartheta = \mathbf{A}_*^{1/2}(\theta - \hat{\theta}_*)$. The resulting expression
$\lambda_\vartheta(\vartheta) = \dfrac{\vartheta^\top \mathbf{A}_*^{1/2}\mathbf{B}_*^{-1}\mathbf{A}_*^{1/2}\,\vartheta}{\vartheta^\top \vartheta},$
is a Rayleigh quotient in terms of $\vartheta$. The largest and smallest values of the Rayleigh quotient are equal to the largest ($\underline{\lambda}_1$) and smallest ($\underline{\lambda}_d$) eigenvalues of the precision matrix $\mathbf{M} \triangleq \mathbf{A}_*^{1/2}\mathbf{B}_*^{-1}\mathbf{A}_*^{1/2}$.
Remark 9. 
Since $\mathbf{M}$ is the inverse variance-covariance matrix of the posterior of $\vartheta$, the critical values of $\lambda_\vartheta(\vartheta)$ where the gradient vanishes correspond to the orthogonal eigenvectors $\vartheta_1, \ldots, \vartheta_d$ of $\mathbf{M}$.
Remark 10. 
Eigenvalues and eigenvectors of $\mathbf{M}$ are informative on how much information the data carry, on average, about the transformed pseudo-true parameter values $\vartheta_* = \mathbf{A}_*^{1/2}(\theta_* - \hat{\theta})$ in each of the d eigendirections of $\vartheta$, relative to what one would expect under correct specification.
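These bounds are straightforward to verify numerically. The sketch below (NumPy/SciPy assumed; the two matrices are random positive definite placeholders, not estimates from a real model) evaluates $\lambda(\theta)$ for random directions and checks that it lies between the extreme eigenvalues of $\mathbf{M}$.
```python
import numpy as np
from scipy.linalg import sqrtm, eigvalsh

rng = np.random.default_rng(0)
d = 3
A = np.cov(rng.standard_normal((d, 200)))        # placeholder SPD "sensitivity" matrix
B = np.cov(rng.standard_normal((d, 200)))        # placeholder SPD "variability" matrix
A_half = np.real(sqrtm(A))
M = A_half @ np.linalg.solve(B, A_half)          # M = A^{1/2} B^{-1} A^{1/2}
lam_min, lam_max = eigvalsh(M)[[0, -1]]          # eigenvalues in ascending order

for _ in range(5):
    v = rng.standard_normal(d)                   # random direction theta - theta_hat
    lam = (v @ A @ np.linalg.solve(B, A @ v)) / (v @ A @ v)
    assert lam_min - 1e-10 <= lam <= lam_max + 1e-10
```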

4.5. Other Methods

As a general remedy for poor uncertainty quantification under misspecification, Frazier et al. [53] replace the usual posterior with a score-based approximate posterior
$\tilde{P}_n(\theta) \propto P(\theta)\,\exp\Big(-\dfrac{n}{2}\,\bar{\mathbf{s}}_n(\theta)^\top \hat{\mathbf{B}}_n^{-1}(\theta)\,\bar{\mathbf{s}}_n(\theta)\Big),$
where $\bar{\mathbf{s}}_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{s}_{\omega_i}(\theta)$ is the $d \times 1$ mean score with $\mathbf{s}_{\omega_i}(\theta) = \nabla \ell_{\omega_i}(\theta)$, and $\hat{\mathbf{B}}_n(\theta)$ estimates the score variability. This works well as $n \to \infty$, but with a flat prior the exponential kernel equals 1 at any $\theta$ where $\bar{\mathbf{s}}_n(\theta) = \mathbf{0}_d$, so all such roots attain the same peak height. This obscures the relative support of competing modes and distorts the faithful representation of multimodal posteriors. These issues are most acute at small sample sizes.
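For concreteness, the score-based kernel of Frazier et al. [53] can be evaluated as in the sketch below (NumPy assumed; score_fn is an illustrative placeholder that returns the $n \times d$ matrix of per-observation scores, and the sample covariance is just one way to estimate $\hat{\mathbf{B}}_n(\theta)$).
```python
import numpy as np

def log_score_posterior(theta, score_fn, log_prior):
    """Log of the score-based posterior kernel P(theta) * exp(-n/2 * sbar' Bhat^{-1} sbar)."""
    S = score_fn(theta)                       # n x d matrix of per-observation scores
    n = S.shape[0]
    sbar = S.mean(axis=0)                     # mean score
    B_hat = np.cov(S, rowvar=False)           # one possible estimate of score variability
    return log_prior(theta) - 0.5 * n * sbar @ np.linalg.solve(B_hat, sbar)
```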
In a recent paper, Li and Rice [54] reviewed Bayesian analogues of sandwich variance estimators and derived Bayes rules under a so-called balanced inference loss function, $\mathrm{BI}(\theta)$. Such loss functions, originally introduced by Zellner [55] and discussed by Dawid and Sebastiani [56] in the context of Bayesian decision theory and optimal experimental design, blend attributes of standard parametric inference with weighted average penalty terms for lack of fit and estimation error
$\mathrm{BI}(\theta, \boldsymbol{\Sigma}_n, \boldsymbol{\Phi}) = \underbrace{\log(|\boldsymbol{\Sigma}_n|) + (\theta - \boldsymbol{\mu}_\theta)^\top \boldsymbol{\Phi}\,\boldsymbol{\Sigma}_n^{-1}(\theta - \boldsymbol{\mu}_\theta)}_{\text{Estimation error}} + \underbrace{\dfrac{1}{n}\sum_{i=1}^{n} \nabla^\top \ell_{\omega_i}(\theta)\,\{\boldsymbol{\Phi}\,\mathbf{A}_n(\theta)\}^{-1}\,\nabla \ell_{\omega_i}(\theta)}_{\text{Lack of fit}},$
where $\boldsymbol{\Sigma}_n$ signifies the $n \times n$ measurement error covariance matrix of the data, $\omega_1, \ldots, \omega_n$, $\boldsymbol{\mu}_\theta$ is the $d \times 1$ vector of expected parameter values for the n data points, and $\boldsymbol{\Phi}$ is a $d \times d$ positive definite weighting matrix. This balanced inference loss function is equivalent to a negative log-likelihood function in a Bayesian context. Li and Rice [54] show by simulation that the balanced inference loss function yields robust Bayesian standard error estimates under model misspecification, thus retaining the attractive features of frequentist inference. Yet, the balanced loss function of Equation (26) is optimal only when residuals follow a Gaussian distribution. This is a significant limitation for discharge residuals of conceptual hydrologic models, which typically deviate from normality and are more accurately described by Laplacian or double-exponential distributions [11,57]. Moreover, the balanced loss function requires repeated evaluation of the sensitivity matrix $\mathbf{A}_n$, i.e., the empirical Fisher information, which incurs a substantial computational overhead on the order of $d^2$ model evaluations for each MH candidate point $\theta_p$. The balanced inference loss function is well-suited for the ABC model used in our first sandwich paper [8], as it provides an analytic expression for the sensitivity matrix $\mathbf{A}_n$. For other studies, we resort instead to magnitude-, curvature-, or sandwich-adjusted MCMC simulation.

5. Empirical Estimates of Information Matrices

Sandwich-adjusted MCMC simulation assumes knowledge of the true parameter values $\theta_0$ (and $\theta_{*}$ under misspecification) and the sensitivity and variability matrices $A_{*}$ and $B_{*}$, respectively, of the data-generating process S. These are theoretical quantities that are not known in practice. In Section 2, we defined $A_{*}$ as the probability limit of $A_n$ and, similarly, $B_{*} = \operatorname{plim} B_n$, where $A_n$ and $B_n$ are the averages of the sensitivity and variability matrices for the n data points $\omega_1, \ldots, \omega_n$. Thus, we must replace the population quantities $A_{*}$ and $B_{*}$ with their sample-based estimates
$$ A_n = -\,\mathbb{P}_{\hat{\theta}_{*}}^{\,n}\, \nabla_{\theta}^{2} \mathcal{L}_{\omega}(\theta) \qquad \text{and} \qquad B_n = \mathbb{P}_{\hat{\theta}_{*}}^{\,n}\, \nabla_{\theta} \mathcal{L}_{\omega}(\theta)\, \nabla_{\theta} \mathcal{L}_{\omega}(\theta)^{\top}, $$
where the notation with a precursor $\mathbb{P}_{\hat{\theta}_{*}}^{\,n}$ is borrowed from Kleijn and van der Vaart [9] and designates that we must evaluate the n-sample average for $\omega_1, \ldots, \omega_n$ of the quantity to its right at the MAP solution $\theta = \hat{\theta}_{*}$. Sample-based quantities exhibit variability due to the random nature of the data, but should be consistent estimates. That is, as the sample size n increases, $A_n$ and $B_n$ converge in probability to $A_{*}$ and $B_{*}$, respectively. The matrices $C_{*}$ and $\Psi_{*}$ used in a priori and a posteriori curvature-adjusted posterior exploration, respectively, are replaced by their sample equivalents
$$ C_n = B_n^{-1/2} A_n^{1/2} \qquad \text{and} \qquad \Psi_n = A_n^{-1} B_n^{1/2} A_n^{1/2}, $$
where the matrix square roots $A_n^{1/2}$ and $B_n^{1/2}$ follow from Equation (13) using singular value decomposition.
The MAP solution can be determined from an optimization method or an MCMC pre-trial. The Markov chain sample that maximizes the posterior density, $P_n(\theta) \propto P(\theta)\, L_n(\theta)$, is a MAP estimator. Next, the $d \times 1$ vector of first-order derivatives, $\nabla \mathcal{L}_{\omega}(\theta_{*})$, and the $d \times d$ matrix of second-order derivatives, $\nabla^{2} \mathcal{L}_{\omega}(\theta_{*})$, can be determined by numerical means using values of the log-likelihood function $\mathcal{L}_n(\theta)$ at points nearby $\hat{\theta}_{*}$. We use the DERIVESTsuite toolbox of D'Errico [58], a Matlab collection of fully adaptive numerical differentiation methods for scalar- and vector-valued functions. This toolbox handles the computation of first- and higher-order derivatives of functions that do not have simple analytical expressions. We employ semi-adaptive central difference schemes of varying orders, combined with a generalized Richardson [59] extrapolation approach (this method is also referred to as multi-term extrapolation in the context of numerical integration; see Romberg [60]), to enhance the accuracy of the first- and second-order partial derivatives of the log-likelihood function $\mathcal{L}_n(\theta)$ w.r.t. the parameters. This estimation uses a sequence of logarithmically spaced points away from the MAP solution. The "best" differencing interval is automatically selected from the sequence of proportionally cascading points to minimize the approximation errors of $\nabla \mathcal{L}_n(\theta_{*})$ and $\nabla^{2} \mathcal{L}_n(\theta_{*})$.
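For readers who prefer a self-contained illustration, the following Matlab function sketches the structure of this computation with a fixed-step central-difference scheme; the function handle loglik_t (per-observation log-likelihood) and the vector theta_map are hypothetical inputs, and the paper itself relies on the adaptive DERIVESTsuite routines rather than this simplified scheme.

function [An, Bn] = bread_and_meat(loglik_t, theta_map, n)
% Simplified central-difference sketch of the sample-based sensitivity
% (bread) and variability (meat) matrices at the MAP estimate; loglik_t(theta,t)
% is a hypothetical handle returning the log-likelihood of the t-th observation.
d = numel(theta_map);  h = 1e-4;          % fixed central-difference step
G = zeros(n, d);                          % scores of the n observations
for t = 1:n
    for j = 1:d
        e = zeros(d, 1);  e(j) = h;
        G(t, j) = (loglik_t(theta_map + e, t) - loglik_t(theta_map - e, t)) / (2*h);
    end
end
Bn = (G' * G) / n;                        % variability (meat) matrix B_n
Ln = @(theta) sum(arrayfun(@(t) loglik_t(theta, t), 1:n));   % full log-likelihood
H = zeros(d);                             % Hessian of the full log-likelihood
for j = 1:d
    for k = 1:d
        ej = zeros(d, 1);  ej(j) = h;
        ek = zeros(d, 1);  ek(k) = h;
        H(j, k) = (Ln(theta_map + ej + ek) - Ln(theta_map + ej - ek) ...
                 - Ln(theta_map - ej + ek) + Ln(theta_map - ej - ek)) / (4*h^2);
    end
end
An = -H / n;                              % sensitivity (bread) matrix A_n
end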
If successive data points $\omega_1, \ldots, \omega_n$ exhibit serial correlation, then we must correct the variability matrix $B_n$ for possible autocorrelation among the successive scores, $\nabla \mathcal{L}_{\omega_1}(\hat{\theta}_{*}), \ldots, \nabla \mathcal{L}_{\omega_n}(\hat{\theta}_{*})$. As in Vrugt et al. [8], we use the estimator of Newey and West [61] to determine the variability matrix $\beta_n$ of the scores $g_t = \nabla \mathcal{L}_{\omega_t}(\hat{\theta}_{*})$ as follows
$$ \beta_n = B_0 + \sum_{\tau=1}^{\tau_{\max}} w(\tau, \tau_{\max}) \big( B_{\tau} + B_{\tau}^{\top} \big), $$
where
$$ B_{\tau} = \frac{1}{n} \sum_{t=\tau+1}^{n} g_t\, g_{t-\tau}^{\top}, $$
is an estimate of the autocovariance matrix of scores a distance τ apart, τ max N + signifies the maximum lag and
$$ w(\tau, \tau_{\max}) = 1 - \frac{\tau}{1 + \tau_{\max}}, \qquad \tau \in [0, \tau_{\max}], $$
is a weight function which smooths the sample autocovariance function [62]. For $\tau = 0$, we obtain $B_0 = \frac{1}{n} \sum_{t=1}^{n} g_t\, g_t^{\top}$, which corresponds to $B_n$, the variance matrix of the scores, provided that the scores have zero mean. Under correct model specification, the sum of the lagged autocovariance matrices $B_{\tau}$ vanishes, yielding a $d \times d$ zero matrix. Consequently, we have $\beta_n = B_n = B_0$.
Bartlett [63] proposed truncating the sum in Equation (27) at a finite lag $\tau_{\max}$ so as to balance the trade-off between estimator variance and bias. This finite lag is also called the Bartlett window. Larger windows increase the estimator's variance, whereas smaller values of $\tau_{\max}$ increase the bias of $\beta_n$ by omitting relevant score autocovariances. Bartlett's ideas about the adequate choice of the truncation lag $\tau_{\max}$ have been formalized in rules of thumb such as $\tau_{\max} = \lfloor c \cdot n^{1/4} \rfloor$ or $\tau_{\max} = \lfloor 4c\, (n/100)^{2/9} \rfloor$ [64], where c is a small positive integer and the function $\lfloor \cdot \rfloor$ rounds down to the nearest integer. If the scores exhibit strong autocorrelation, one can set c relatively large, say $c = 10$; otherwise, one may use $c = 1$. We set $c = 5$ and obtain values of $\tau_{\max}$ on the order of 12, 18, 32, and 55 for data sets of length $n = 10$, $n = 100$, $n = 1000$, and $n = 10{,}000$, respectively.
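The sketch below assembles $\beta_n$ in Matlab from an $n \times d$ matrix G whose rows are the scores $g_t^{\top}$; G and tau_max are hypothetical inputs, and the scores are assumed to have (approximately) zero mean.

function beta_n = newey_west(G, tau_max)
% Sketch of the Newey-West estimator of the score variability matrix from
% an n x d matrix G whose t-th row is the score g_t' (assumed zero mean).
[n, ~]  = size(G);
beta_n  = (G' * G) / n;                              % B_0 (zero-lag term)
for tau = 1:tau_max
    B_tau  = (G(tau+1:n, :)' * G(1:n-tau, :)) / n;   % lag-tau autocovariance
    w      = 1 - tau / (1 + tau_max);                % Bartlett weight
    beta_n = beta_n + w * (B_tau + B_tau');
end
end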

6. Case Studies

We demonstrate the different sampling methods by application to three case studies of increasing complexity. The first two case studies involve statistical models and analytic differentiation. These two studies are purposely kept simple, as this allows us to clearly demonstrate the effects of model misspecification and illustrate how the sandwich estimator rectifies the resulting biases in uncertainty quantification due to a wrong model parameterization (study 1) or an inadequate parametric form (study 2) for the data-generating process. The third and last study considers the application of the presented methods to rainfall-discharge simulation using the Xinanjiang model [65,66]. This study confirms that traditional MCMC methods produce overly narrow credible regions, so-called overconditioning, and demonstrates the advantages of our proposed SAMH algorithm (Algorithm 3) for sandwich-adjusted posterior exploration.
For MCMC simulation, we employ the DREAM(ZS) algorithm [67,68,69], a differential evolution-based sampler that evolves multiple chains in parallel. Candidate points are generated dynamically using linear combinations of differences between chain states. The transition kernel is self-adaptive, automatically adjusting to the scale and orientation of the target distribution, P n ( θ ) P ( θ ) L n ( θ ) . Computational efficiency is not a primary concern in this paper, as our main objective is to evaluate the theoretical and practical differences between the magnitude-, curvature-, and sandwich-adjusted MCMC simulation methods. The relative speed with which an MCMC method converges to the sandwich distribution does not influence the validity of these adjustments, which are the central focus of our work. Nonetheless, the DREAM algorithm has been benchmarked extensively and shown to perform well across a wide range of complex inference problems (see references in Vrugt [28]). To ensure reliable posterior exploration, we monitor convergence using a suite of established diagnostics, including the single-chain methods of Raftery and Lewis [70] and Geweke [71], as well as the multi-chain scale-reduction factors proposed by Gelman and Rubin [72] and Brooks and Gelman [73], following best practices advocated by Cowles and Carlin [74].

6.1. Case Study 1

We revisit example 1 in Section 3 and compute the ML estimate $\hat{m}$ of the mean of the normal distribution model $\mathcal{N}(m, s^2)$, and the corresponding values of $A_n$, $B_n$, the omnibus scalar $\hat{k} = B_n^{-1} A_n$, and the naive $\Sigma_n^{\text{naive}} = \frac{1}{n} A_n^{-1}$ and sandwich $\Sigma_n^{\text{sand}} = \frac{1}{n} A_n^{-1} B_n A_n^{-1}$ variances using $s^2 = 2$, $s^2 = 1$ and $s^2 = 1/2$. The ML solution $\hat{m}$ is simply equal to the sample mean of the data points $\omega_1, \ldots, \omega_{100}$ drawn from $\mathcal{N}(\mu, \sigma^2)$ with $\mu = 0$ and $\sigma^2 = 1$, and $A_n$ and $B_n$ are derived by numerical means using the DERIVESTsuite toolbox of D'Errico [58]. We repeat this computation for $M = 10^4$ different realizations of the n data points. Table 1 presents the result of this Monte Carlo experiment and lists mean values of $\hat{m}$, $A_n$, $B_n$, $\hat{k}$, $\Sigma_n^{\text{naive}}$ and $\Sigma_n^{\text{sand}}$ and their respective standard deviations (in parentheses). The Matlab code is given in Appendix C.
The tabulated results confirm the theory. The ML estimate of the mean $\hat{m}$ is centered around zero and has a standard deviation that approaches the theoretic standard deviation $\sqrt{\sigma^2/n} = 0.1$. The ML sensitivity matrix $A_n$ derived from numerical differentiation equals its theoretic value $A_{*} = s^{-2}$ and does not differ between the trials. The variability matrix $B_n$ approaches its theoretic value $B_{*} = s^{-4} \sigma^2$ and has a nonzero standard deviation as $\sigma^2$ is replaced by the sample variance of the $\omega$'s. The mean of the omnibus scalar approaches its theoretic value $k = B_{*}^{-1} A_{*} = (s^{4}/\sigma^{2})\, s^{-2}$. Thus, we find that $k = s^2/\sigma^2$, and the standard deviation in parentheses results from replacing $\sigma^2$ of the data-generating process with the sample variance of the $\omega$'s. The naive variance estimator is equal to its theoretic value $\Sigma_n^{\text{naive}} = s^2/n$ and does not differ between the trials, as n and $s^2$ are fixed. The sandwich variance estimator $\Sigma_n^{\text{sand}}$ does not depend on the value of $s^2$ and asymptotically converges to the true variance of the mean $\mu$ of the data-generating process. The standard deviation is the result of the variation in $B_n$ in the Monte Carlo trials.
Figure 3 displays the histograms of the omnibus scalar k for each of the three normal distribution models. The use of the magnitude-adjusted log-likelihood $k\,\mathcal{L}_n(\theta)$ will retrieve the sandwich variance $\Sigma_n^{\text{sand}}$.
When $s^2 = 2$, the model underestimates the information contained in the data and the omnibus scalar is greater than one. Vice versa, for $s^2 = 0.5$ we systematically overestimate the informativeness of the data and, as a result, $k < 1$, which slows down learning and produces robust confidence intervals for $\mu$, the mean of the data-generating process. For $s^2 = 1$, the normal distribution model $\mathcal{N}(m, s^2)$ is equal to the standard normal distribution $\mathcal{N}(0, 1)$ of the data-generating process and $k = 1$.
To better understand the relationship between the number n of data points ω and the naive and sandwich variances, we repeat the analysis of Table 1 for different values of n. Figure 4 presents the results of this analysis.
On a double-logarithmic scale, the naive variance decreases linearly with the length n of the training record; on a linear scale, it is proportional to $1/n$. The naive variance of m depends on the choice of $s^2$. The sandwich variance does not depend on $s^2$ and settles on the true variance of m with increasing number of data points $\omega$.
Table 2 examines the coverage probabilities of the true mean $\mu$ of the data-generating process according to the $100(1-\alpha)\%$ confidence intervals of $\hat{\mu}$ derived from the naive and sandwich variance estimators.
The results of Table 2 demonstrate that the sandwich variance estimator provides adequate confidence intervals of the mean μ of the data-generating process, even if the underlying model is misspecified. The sandwich estimator Σ n sand consistently achieves the correct frequentist coverage probabilities, whereas the naive variance estimator Σ n naive either over- ( s 2 = 2 ) or under- ( s 2 = 1 / 2 ) estimates the coverage probabilities. The confidence intervals are either too dispersed or too sharp.
This concludes our first case study. This study was rather unrealistic in that misspecification was introduced by fixing one of the model parameters, $s^2$, at a wrong value. The correct distributional family was used, but with a wrong value for one of its parameters, namely the variance $s^2$. In the next study, we take misspecification one step further and use a different model for inference than was used to generate the data.

6.2. Case Study 2

Our second case study is another analytic exercise, but one that better reflects practice as the parametric form of our model differs from that of the data-generating process. We draw n measurements $\omega_1, \ldots, \omega_n$ from a gamma distribution $\Omega \sim \mathcal{G}(a, b)$ with pdf
$$ f_{\mathcal{G}}(\omega \mid a, b) = \frac{1}{b^{a}\, \Gamma(a)}\, \omega^{a-1} \exp(-\omega/b), \qquad \omega \geq 0, $$
where $a > 0$ and $b > 0$ are shape and scale parameters, respectively, and $\Gamma(z)$ is the Gamma function. Now, suppose our model for $\Omega$ is not a gamma but an exponential distribution $\mathcal{E}(\mu)$ with mean (scale) parameter $\mu > 0$ and pdf
$$ f_{\mathcal{E}}(\omega \mid \mu) = \mu^{-1} \exp(-\omega/\mu), \qquad \omega \geq 0. $$
Note that it is not uncommon to parameterize $\mathcal{E}(\mu)$ with a rate parameter $\lambda = \mu^{-1}$ instead. The likelihood function for a single observation $\omega$ is now equal to
$$ L_{\omega}(\mu) = f(\omega \mid \mu) = \mu^{-1} \exp(-\omega/\mu), $$
and the log-likelihood L ω ( μ ) becomes
$$ \mathcal{L}_{\omega}(\mu) = \log L_{\omega}(\mu) = -\log(\mu) - \omega/\mu. $$
Appendix D derives analytic expressions for the naive and sandwich variance estimators of the mean of the exponential distribution. We obtain
$$ \Sigma_n = \begin{cases} \Sigma_n^{\text{naive}} = m_{\omega}^{2}/n = \hat{I}_n^{-1}, & \text{naive variance}, \\[4pt] \Sigma_n^{\text{sand}} = s_{\omega}^{2}/n = \hat{G}_n^{-1}, & \text{sandwich variance}, \end{cases} $$
where $m_{\omega}$ and $s_{\omega}^2$ are the sample mean and sample variance of the data points, $\omega_1, \ldots, \omega_n$. According to Equation (A12), the omnibus scalar k is now equal to $\hat{k} = m_{\omega}^{2}/s_{\omega}^{2}$, whereas its theoretical value, derived in Equation (A13), corresponds to the shape parameter a of $\mathcal{G}(a, b)$.
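A minimal Matlab sketch of this calculation is given below. The shape $a = 2.5$ and scale $b = 0.2$ are illustrative assumptions (chosen so that $\mathbb{E}[\omega] = 1/2$ and $\operatorname{Var}[\omega] = 1/10$), and gamrnd requires the Statistics and Machine Learning Toolbox.

% Minimal sketch: naive and sandwich variances of the exponential-model
% mean fitted to gamma-distributed data; a and b are illustrative choices.
a = 2.5;  b = 0.2;  n = 100;
w         = gamrnd(a, b, n, 1);       % training data from G(a,b)
mu_hat    = mean(w);                  % ML estimate of the exponential mean
Sig_naive = mu_hat^2 / n;             % naive variance (slice of bread)
Sig_sand  = var(w) / n;               % sandwich variance
k_hat     = mu_hat^2 / var(w);        % omnibus scalar; theoretical value is a
fprintf('naive %.2e, sandwich %.2e, k_hat %.2f\n', Sig_naive, Sig_sand, k_hat);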
Table 3 again confirms that the naive variance estimator mischaracterizes the confidence intervals. The $100(1-\alpha)\%$ confidence intervals are too sharp and underestimate the theoretic coverage probabilities of the mean $\mu$ of the data-generating process. This overconditioning is a result of misspecification and, thus, due to a misalignment of the sensitivity and variability matrices. In contrast to the other methods, the coverage probabilities of the sandwich estimator align much more closely with theoretical expectations. The estimates are not perfect, as a result of the symmetry assumption used in constructing confidence intervals for $\hat{\mu}$. This assumption is not valid for the exponential distribution and is further exacerbated when the sample size n is small. To mitigate this latter effect, we chose $n = 100$ in our Monte Carlo experiments. To address the asymmetry, one could construct non-symmetric confidence intervals by identifying the shortest interval for $\hat{\mu}$ that contains the true mean with probability $1 - \alpha$. However, doing so would require knowledge of the posterior distribution of $\hat{\mu}$, which is generally not available in frequentist settings. Importantly, the variance of the ML estimates $\hat{\mu}$ across the M Monte Carlo trials matches the theoretical sandwich variance, $\Sigma_n^{\text{sand}} = a b^{2}/n$. This confirms that the only correct confidence intervals of $\hat{\mu}$ are those derived from the sandwich estimator.
Table 4 documents the coverage probabilities of the credible regions obtained from MCMC simulation with the DREAM(ZS) algorithm using the log-likelihood $\mathcal{L}_n(\mu)$ of Equation (A10) (=naive estimator), OFS-adjusted naive posterior samples of Equation (11), the magnitude-adjusted log-likelihood $k\,\mathcal{L}_n(\mu)$ with omnibus scalar k of Equation (10) (=Algorithm 1), the curvature-adjusted log-likelihood $\mathcal{L}_n^{\text{ca}}(\mu) = \mathcal{L}_n\{\hat{\mu} + C_n(\mu - \hat{\mu})\}$ of Equation (15) (=Algorithm 2), and the centralized power log-likelihood $\mathcal{L}_n^{p}(\mu \mid \lambda)$ of Equation (21) (=Algorithm 3).
The curvature-adjustment matrix $C_n$ is a scalar in this case, and according to Equations (18) and (29), we obtain $C_n = m_{\omega}\, s_{\omega}^{-1}$, where $s_{\omega}^{-1}$ is the reciprocal of the sample standard deviation of the $\omega$'s. For the OFS adjustment, we substitute the expressions for $A_{*}$ and $B_{*}$ of Equation (A7) into Equation (12) and obtain $\Psi_n = s_{\omega}/m_{\omega}$. The tabulated values confirm that
  • The asymptotic covariance matrix of the Metropolis algorithm is a single slice of bread. The $100(1-\alpha)\%$ credible intervals are in agreement with the frequentist confidence intervals of the naive variance estimator, $\Sigma_n^{\text{naive}}$, in Table 3 and underestimate the theoretic coverage probabilities.
  • The OFS adjustment of Equation (11) enlarges the spread of the naive posterior samples but the coverage probabilities of the so-obtained sandwich credible regions underestimate their counterparts of the sandwich estimator in Table 3.
  • The three MCMC recipes discussed in this paper successfully join a single slice of bread $\frac{1}{n} A_n^{-1}$ to the open-faced sandwich $B_n A_n^{-1}$ to produce the sandwich variance $\Sigma_n^{\text{sand}}$. The coverage probabilities of the $100(1-\alpha)\%$ credible regions of Algorithms 1–3 match those of the sandwich estimator in Table 3.
  • The tabulated values for Algorithm 3 are the first proof that the centralized power log-likelihood function L n p ( μ λ ) of Equation (21) works in practice. This inspires confidence that we can sample the sandwich distribution without using matrix square roots.
The OFS adjustment is computationally appealing and enlarges the spread of the naive posterior samples, yet the so-obtained credible regions underestimate the theoretical coverage probabilities. Magnitude, curvature, and kernel adjustment of the log-likelihood function all appear to be viable methods for sandwich-adjusted MCMC simulation. There are important differences between these three sampling methods, and their practical consequences are better illustrated with a multivariate target distribution.
Having completed the above exercise, we now replace $\mathcal{G}(a, b)$ with alternative distributions for the data-generating process. Figure 5 shows histograms of the omnibus scalar k when the data-generating process is (a) $\mathcal{G}(a, b)$, (b) $\mathcal{N}(\mu, \sigma^2)$, (c) $\mathrm{LOGN}(\mu, \sigma)$, (d) $\mathcal{W}(\alpha, \beta)$, and (e) $\mathcal{B}(a, b)$. For comparability, the scale, shape, and/or location parameters of each distribution are chosen such that $\mathbb{E}[\omega] = 1/2$ and $\operatorname{Var}[\omega] = 1/10$. The theoretical value of the omnibus scalar for each distribution is $k = \mu^{2}/\sigma^{2} = 2.5$.
The histograms of $\hat{k}$ appear remarkably similar across the different distributions. This confirms that our inferences for $\mu$ are robust and do not depend on the distribution of the data-generating process. The marginal distributions of the omnibus scalar center on the theoretic value of $k = 2.5$ and display a small right tail. The dispersion of $\hat{k}$ is a consequence of the finite sample size and will disappear if we set n much larger in the Monte Carlo trials.
We now move on to our third and last case study. This will involve the use of real-world data and a multivariate posterior distribution.

6.3. Case Study 3

Our third and final case study examines the streamflow response of the Leaf River near Collins, MS, USA. The precipitation–discharge transformation is simulated using the Xinanjiang conceptual watershed model originally developed by Zhao and Zhuang [65]. We adopt the implementation of Jayawardena and Zhou [75] and Knoben et al. [76], augmented with a pan evaporation parameter and three linear routing reservoirs. This configuration comprises seven control volumes that conceptually represent water storage and routing. Appendix E provides a detailed description of the Xinanjiang model structure, including the control volumes, state variables, flux relationships, and routing scheme used to convert areal average precipitation into total channel inflow and river discharge. The model equations are solved using a mass-conservative, second-order integration method with adaptive time stepping, ensuring both numerical stability and accuracy. A one-year spin-up period removes the influence of state variable initialization.
Table A2 lists the 14 parameters of the Xinanjiang model to be estimated from streamflow measurements. For inference, we express the Xinanjiang model as the vector-valued regression
ω = f ( θ , I ) + e ,
where ω = ( ω 1 , , ω n ) is the n × 1 vector of discharge observations, θ = ( f p , A im , a , b , f wm , f lm , c , s tot , β , k i , k g , c i , c g , k f ) signifies the parameter vector, I is the n × 2 matrix of exogenous variables containing daily areal-average rainfall and potential evapotranspiration, and e = ( e 1 , , e n ) is the n × 1 vector of discharge measurement errors. We assign a uniform prior P ( θ ) over the bounds given in Table A2 and use the standardized skewed-t (SST) density of Scharnagl et al. [77] to evaluate agreement between observed and simulated streamflows
$$ f_{\text{SST}}(\epsilon_t \mid 0, 1, \nu, \xi) = \frac{2\, \sigma_{\nu\xi}}{\xi + \xi^{-1}}\, \frac{\Gamma\{(\nu+1)/2\}}{\Gamma(\nu/2)\sqrt{\pi(\nu-2)}} \left[ 1 + \frac{1}{\nu - 2} \left( \frac{\mu_{\nu\xi} + \sigma_{\nu\xi}\, \epsilon_t}{\xi^{\,\operatorname{sign}(\mu_{\nu\xi} + \sigma_{\nu\xi} \epsilon_t)}} \right)^{2} \right]^{-(\nu+1)/2}, $$
where $\epsilon_t = e_t/s_t$ is the tth studentized streamflow residual, $\operatorname{sign}(x) = |x|/x$ denotes the signum function, and the scalars $\mu_{\nu\xi} = M_1(\xi - \xi^{-1})$ and $\sigma_{\nu\xi} = \{(M_2 - M_1^2)(\xi^{2} + \xi^{-2}) + 2M_1^{2} - M_2\}^{1/2}$ are shift and scale constants, respectively, which depend on the degrees of freedom $\nu > 2$, the skewness parameter $\xi > 0$, and the first and second absolute moments $M_1$ and $M_2$ of the SST density [57,77]. The total likelihood $L_n(\theta, \nu, \xi)$ for an n-record of studentized residuals $\epsilon_1(\theta), \ldots, \epsilon_n(\theta)$ is now equal to
$$ L_n(\theta, \nu, \xi) = C(\nu, \xi, n) \prod_{t=1}^{n} \left[ 1 + \frac{1}{\nu - 2} \left( \frac{\mu_{\nu\xi} + \sigma_{\nu\xi}\, \epsilon_t(\theta)}{\xi^{\,\operatorname{sign}(\mu_{\nu\xi} + \sigma_{\nu\xi} \epsilon_t(\theta))}} \right)^{2} \right]^{-(\nu+1)/2}, $$
where the prefactor C ( ν , ξ , n ) is
$$ C(\nu, \xi, n) = \left[ \frac{2\, \sigma_{\nu\xi}}{\xi + \xi^{-1}}\, \frac{\Gamma\{(\nu+1)/2\}}{\Gamma(\nu/2)\sqrt{\pi(\nu-2)}} \right]^{n}. $$
The measurement error standard deviation s t of the tth streamflow observation ω t is modeled as a linear function of the simulated discharge y t ( θ ) under model parameters θ
s t = s 0 + s 1 y t ( θ ) ,
where the intercept $s_0 = 10^{-4}$ (mm/d) is fixed at a small positive value, and the slope $s_1 > 0$ is determined offline so as to enforce unit variance of the studentized raw residuals $\epsilon_1(\theta), \ldots, \epsilon_n(\theta)$. The slope is obtained via an iterative root-finding procedure described in detail by Vrugt et al. [57]. With this variance model, the Student-t log-likelihood becomes
$$ \mathcal{L}_n^{s}(\theta, \nu, \xi \mid s_0) = n \log C(\nu, \xi, 1) - \sum_{t=1}^{n} \log\big( |s_0 + s_1 y_t(\theta)| \big) - \frac{\nu + 1}{2} \sum_{t=1}^{n} \log\left[ 1 + \frac{1}{\nu - 2} \left( \frac{\mu_{\nu\xi} + \sigma_{\nu\xi}\, \epsilon_t(\theta)}{\xi^{\,\operatorname{sign}(\mu_{\nu\xi} + \sigma_{\nu\xi} \epsilon_t(\theta))}} \right)^{2} \right]. $$
To facilitate both pairwise and parameter-wise comparisons of the d × d sensitivity A n and variability B n matrices, we apply the affine rescaling
$$ \underline{\theta}_j = \frac{\theta_j - \theta_j^{\min}}{\theta_j^{\max} - \theta_j^{\min}} \quad \text{for } j = 1, \ldots, 14, \qquad \text{and} \qquad \underline{\eta}_r = \frac{\eta_r - \eta_r^{\min}}{\eta_r^{\max} - \eta_r^{\min}} \quad \text{for } r = 1, 2, $$
which maps the Xinanjiang parameters θ = ( θ 1 , , θ 14 ) and nuisance variables η = ( ν , ξ ) onto the unit hypercube. Inference is then conducted on the normalized parameters
θ ̲ = ( f ̲ p , A ̲ im , a ̲ , b ̲ , f ̲ wm , f ̲ lm , c ̲ , s ̲ tot , β ̲ , k ̲ i , k ̲ g , c ̲ i , c ̲ g , k ̲ f ) ,
and normalized nuisance variables $\underline{\eta} = (\underline{\nu}, \underline{\xi})$. Prior to Xinanjiang model execution, $\underline{\theta}$ is transformed back to the original parameter scales using the lower and upper bounds in Table A2. The prior distributions for the degrees of freedom and skewness parameters are uniform with support $\nu \in (2, 10^{4}]$ and $\xi \in [10^{-1}, 10^{2}]$, respectively.
Figure 6 shows histograms of the marginal posterior distributions of the normalized Xinanjiang model parameters obtained using the DREAM(ZS) algorithm.
The Markov chain sample with the highest value of $P_n(\underline{\theta}, \underline{\nu}, \underline{\xi}) \propto P(\underline{\theta}, \underline{\nu}, \underline{\xi})\, L_n^{s}(\underline{\theta}, \underline{\nu}, \underline{\xi} \mid s_0 = 10^{-4})$ (red square) coincides almost perfectly with the ML solution (red cross) of the frequentist estimator, obtained separately by maximizing the Student t likelihood $L_n^{s}(\underline{\theta}, \underline{\nu}, \underline{\xi} \mid s_0 = 10^{-4})$ using a gradient-based optimization method. For all Xinanjiang model parameters except the tension water inflection parameter a and the free water shape parameter $\beta$, the MCMC-sampled marginal posterior distributions are unimodal, bell-shaped and centered around the ML solution. In contrast, the marginal posterior distribution of a is approximately uniform on the interval 0–0.4, whereas the density function of $\beta$ has a trapezoidal shape. The MAP values of these two parameters do not coincide with distinct posterior peaks, yet are in close vicinity of their ML estimates.
Most of the MCMC-sampled posterior histograms are in close agreement with the normal marginal distributions (blue lines) derived from the naive variance $\Sigma_n^{\text{naive}} = \frac{1}{n} A_n^{-1}$ of the frequentist estimator, where the sensitivity matrix $A_n = -\frac{1}{n} \nabla^{2} \mathcal{L}_n^{s}(\hat{\underline{\theta}}_{*}, \hat{\underline{\nu}}_{*}, \hat{\underline{\xi}}_{*} \mid s_0 = 10^{-4})$ is computed from the second-order partial derivatives of the Student t log-likelihood function. This frequentist estimator assumes model linearity and a symmetric Gaussian distribution around the ML estimate. In contrast, the MCMC method approximates the marginal distributions of the parameters from a large sample of posterior realizations. This enables MCMC to account for nonlinear model relationships and represent arbitrary posterior shapes, including skewed and heavy-tailed distributions. As a result, frequentist and Bayesian estimates of parameter uncertainty may differ. In the literature, this distinction is often framed in terms of linear versus nonlinear confidence intervals. However, in the Bayesian context, the more appropriate term is credible intervals, which reflect the probabilistic interpretation of uncertainty inherent to Bayesian inference.
Table A3 in Appendix F shows that most Xinanjiang parameters exhibit only weak correlations. Notable exceptions are the recession parameters k i and k g of the interflow and groundwater reservoirs, respectively, which display a very strong correlation ( r = 0.985 ), followed by a correlation of r = 0.856 between the tension water inflection parameter a and the total soil moisture storage s tot , and r = 0.782 between the fraction of tension water storage f wm and the free water distribution shape parameter β . The generally low posterior correlation coefficients account in part for the close agreement between the naive posterior histograms of the Xinanjiang parameters and the normal marginal distributions derived from the frequentist estimator.
Before examining the posterior Xinanjiang parameter distributions obtained from Algorithms 1–3, we first take a closer look in Table 5 at the bread and meat matrices of the Student t likelihood $L_n^{s}(\theta, \nu, \xi \mid s_0 = 10^{-4})$. Comparing these two matrices offers insight into the magnitude of the sandwich correction.
The main diagonal entries of A n s and β n s are in relatively poor agreement. The bolded entries of β n s are nearly an order of magnitude larger than their counterparts of A n s . This gives rise to a value of k = 0.1327 for the omnibus scalar of Pauli et al. [31] in Equation (10). This value is far removed from the desired value of k = 1 under correct specification. In Section 9, we formulate several other quantitative measures of (dis)similarity between the bread and meat matrices. This includes the Frobenius norm of the naive and sandwich variance matrices in Equation (35). The norm exceeds 2.0 , indicating substantial misspecification and underscoring the need for the sandwich estimator to robustly quantify Xinanjiang parameter uncertainty.
In Table A4 of Appendix G we compare the frequentist bread matrix A n s with the inverse of the covariance matrix of the DREAM(ZS)-sampled naive posterior realizations. The MCMC-derived bread matrix is in reasonable agreement with A n s , consistent with the close correspondence observed in Figure 6 between the frequentist characterization of naive parameter uncertainty and the normal posterior histograms sampled by the DREAM(ZS) algorithm for most parameters. The marginal posterior distributions of the tension water inflection parameter a and the free water shape parameter β deviate noticeably from normality, which explains in part the relatively large differences in their diagonal elements of the bread matrices of the frequentist and MCMC methods. The largest discrepancy is observed for the parameter f wm , whose MCMC-derived bread matrix value on the main diagonal is 0.122 , approximately 40 times smaller than the corresponding value of 4.860 from the frequentist estimator. The culprit may be the prior distribution, which truncates the posterior distribution of f wm at unity but does not affect the normal approximation underlying the frequentist characterization of naive parameter uncertainty. Thus, in summary, good agreement between the linear (frequentist-based) and nonlinear (Bayesian sample-based) estimates of the sensitivity (bread) matrix suggests that the posterior distribution is approximately Gaussian, and that the multinormal frequentist description of the ML uncertainty is consistent with the fully Bayesian approach.
Figure 7 presents histograms of the OFS-adjusted posterior samples of the Xinanjiang parameters and degrees of freedom $\nu$ of the Student t likelihood function $L_n^{s}(\theta, \nu, \xi \mid s_0 = 10^{-4})$. The OFS-adjusted posterior samples are derived from Equation (11) using $\Psi_n = A_n^{-1} B_n^{1/2} A_n^{1/2}$, where the matrix square roots, $A_n^{1/2}$ and $B_n^{1/2}$, are computed according to Equation (13) using singular value decomposition.
The OFS adjustment substantially enhances the dispersion of the posterior samples for all Xinanjiang parameters but the tension water inflection parameter a. The histograms of the OFS-adjusted posterior samples (green bars) stretch far beyond the normal marginal distributions (blue lines) derived from the sensitivity matrix $A_n$ of second-order partial derivatives of the Student t log-likelihood function evaluated at the ML solution $\{\hat{\theta}, \hat{\nu}, \hat{\xi}\}$, and beyond the naive (blue) histograms of the Xinanjiang parameters. The culprit is model misspecification and, consequently, a poor alignment of the sensitivity $A_n$ and variability $\beta_n$ matrices. The OFS-derived sandwich histograms of the Xinanjiang parameters are in reasonable agreement with the normal marginal distributions (green lines) of the frequentist estimator of $\Sigma_n^{\text{sand}}$. Note that the OFS-adjusted sandwich density functions for $f_{\text{wm}}$, $\beta$, and $k_i$ are visibly lower than their corresponding frequentist densities. This discrepancy arises because the OFS transformation in Equation (11) does not honor the unit interval of the normalized Xinanjiang parameters. Infeasible parameter values lower the probability density of the adjusted posterior samples within the admissible range. Last but not least, for several parameters, the OFS adjustment of Equation (11) altered the location of the mode (peak) of the sandwich distribution. The most notable shifts occurred for $f_p$, b, $f_{\text{lm}}$, c, $k_g$, and $k_f$. Such changes are somewhat counterintuitive and arise in part from the non-uniqueness of the matrix square roots $A_n^{1/2}$ and $B_n^{1/2}$ used in the adjustment.
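As an illustration, the following Matlab sketch applies the OFS transformation to a hypothetical M x d array of naive posterior draws; the variables samples, An, Bn, and theta_map are placeholders, and sqrtm is used here for the principal matrix square roots, whereas the paper computes them via singular value decomposition per Equation (13).

% Sketch of the OFS adjustment of Equation (11) applied to an M x d matrix
% 'samples' of naive posterior draws; An, Bn and the d x 1 MAP vector
% theta_map are hypothetical variables obtained as in Section 5.
Ah   = sqrtm(An);                            % principal root A_n^(1/2)
Bh   = sqrtm(Bn);                            % principal root B_n^(1/2)
Psi  = An \ (Bh * Ah);                       % Psi_n = A_n^(-1) B_n^(1/2) A_n^(1/2)
Xofs = theta_map' + (samples - theta_map') * Psi';   % theta_map + Psi*(theta - theta_map), row-wise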
Figure 8 presents a matrix plot of the bivariate 95% confidence (lines) and credible (dots) regions of all pairs of Xinanjiang parameters. The blue area corresponds to the naive variance whereas the green area is associated with the sandwich-adjusted posterior samples of Algorithm 3.
The bivariate scatter plots offer a clearer depiction of the naive and sandwich uncertainty estimates for the Xinanjiang parameters. The following conclusions can be drawn.
  • The naive Bayesian 95% credible regions (blue squares), as sampled by the DREAM(ZS) algorithm, are in strong agreement with the frequentist 95% confidence ellipsoids derived from the naive variance estimator. There are some notable exceptions, particularly in the bivariate scatter plots involving parameter a, where the MCMC-sampled naive confidence regions exceed the frequentist ellipsoids. This is a well-known phenomenon that highlights the distinction between linear and nonlinear confidence (or credible) regions [49,78,79,80].
  • The 95% credible regions of the sandwich-adjusted posterior samples (green dots) extend well beyond the sandwich ellipsoids (green lines) of the frequentist estimator. These linear sandwich confidence regions substantially underestimate the true parameter uncertainty, and appear woefully inadequate for accurately characterizing Xinanjiang discharge uncertainty.
  • For most parameter pairs, the MCMC-derived sandwich credible regions are unimodal and well described by a bivariate normal distribution.
  • The sandwich credible regions of the Xinanjiang parameters are much larger than their naive counterparts. This is a result of misspecification and confirms that the sensitivity (bread) matrix A n s of the Student t likelihood function substantially overestimates the information content of the discharge observations. The only valid currency of discharge data informativeness under model misspecification is the Godambe information, as expressed by the sandwich credible regions. The enlarged parameter uncertainty should yield the appropriate parameter coverage probabilities.
Substantial differences between linear and nonlinear confidence regions, such as those observed for the tension water inflection parameter a, often signal problems in model formulation. Other indicators of model misspecification include parameters whose MAP estimates occur at or near the bounds of their prior ranges. Although the Xinanjiang model does not exhibit this behavior for the Leaf River dataset, practical experience with other conceptual hydrologic models suggests that such issues are far from rare. When a MAP estimate lies close to a parameter bound, the local curvature of the log-likelihood becomes poorly defined, making it difficult or impossible to compute a stable Hessian (bread) matrix. This, in turn, undermines the validity of asymptotic approximations in frequentist inference, such as the ML sandwich estimator used herein.
Before turning our attention to Xinanjiang discharge uncertainty, we first examine in Figure 9 bivariate scatter plots of the OFS-adjusted posterior samples and their counterparts obtained from magnitude-, curvature-, and sandwich-adjusted MCMC simulation. For a direct comparison of the different methods, the same x- and y-axis limits are used for all four graphs in each column. We focus our attention on only a subset of the Xinanjiang parameter pairs.
The results in Figure 9 highlight several interesting observations.
  • The sandwich credible regions for the Xinanjiang parameters vary substantially across different sandwich-adjustment methods and often diverge from the ellipsoidal confidence regions obtained using the frequentist sandwich estimator.
  • The OFS-adjusted posterior samples in the top panel yield, on average, the smallest 95% credible regions for the Xinanjiang parameters. These regions are straightforward to construct from the naive posterior samples but systematically underestimate the width of the frequentist sandwich confidence regions (green lines). Moreover, the OFS transformation of Equation (11) does not guarantee preservation of the posterior mode. This is evident in Figure 7, where the peak of the OFS-adjusted posterior distributions has shifted away from the ML/MAP solution.
  • Magnitude-adjusted MCMC simulation yields, on average, the largest credible regions for the Xinanjiang parameters. The sandwich credible regions of this method usually extend beyond the frequentist sandwich ellipsoids, although not necessarily in both directions of parameter space. The magnitude-adjusted sandwich uncertainty is particularly large for the parameter pairs a-b and $k_i$-b, as shown in Figure 9b2 and Figure 9f2, respectively, with credible regions that extend across almost the entire parameter space and appear truncated by the boundaries of the uniform prior distribution. This behavior may be an artifact of the omnibus scalar k, which does not preserve the directional asymmetries inherent in the bread and meat matrices. Preserving these asymmetries would require a separate scaling factor for each Xinanjiang parameter.
  • The 95% credible regions derived from curvature-adjusted MCMC simulation are the overall closest match to the 95% sandwich ellipsoids obtained from the frequentist estimator. The prime examples are the credible regions of f p A im (Figure 9a3), a b (Figure 9b3), f lm c (Figure 9d3), and c g k f (Figure 9h3). The sandwich credible regions center on the ML estimator, have a single peak, and appear well described by a multinormal distribution. Large discrepancies between the frequentist confidence regions and curvature-adjusted credible regions are visible for f wm f lm and k g c i in Figure 9c3 and Figure 9g3, respectively.
  • The sandwich-adjusted credible regions obtained from our SAMH algorithm closely align with those derived from curvature-adjusted MCMC simulation, though with a slightly enlarged dispersion. The sandwich credible regions are a compromise between the results of magnitude- and curvature-adjusted MCMC simulation. This result inspires confidence that the centralized power likelihood $\mathcal{L}_n^{p}(\theta \mid \lambda)$ of Equation (23), coupled with the dynamic learning rate $\lambda(\theta)$ of Equation (25), can successfully infer the sandwich posterior distribution. This method avoids the need for principal matrix square roots $A_n^{1/2}$ and $B_n^{1/2}$ in constructing a Bayesian approximation to the frequentist sandwich distribution. The dynamic learning rate $\lambda(\theta)$ redistributes the posterior probability mass away from the ML solution according to the more robust sandwich description of the parameters of the Xinanjiang model.
We cannot prove that the sandwich credible regions are more accurate, as the pseudo-true values of the Xinanjiang model parameters that generated the observed discharge record are unknown. Instead, we rely on statistical theory, which establishes the sandwich estimator as the only valid asymptotic descriptor of data informativeness, and, consequently, parameter uncertainty under model misspecification.
Figure 10 shows posterior predictive bands for simulated streamflow from Xinanjiang over a representative segment of the six-year training period, obtained by propagating posterior draws of θ through the model.
The sandwich variance estimator substantially widens the parameter-uncertainty intervals for simulated streamflow from the Xinanjiang model, as evident in the right-hand panels. The 99% intervals expand markedly, especially near peak flows. Quantitatively, the 99%, 95%, 90% and 68% streamflow intervals based on sandwich parameter uncertainty contain 37.0%, 26.3%, 21.1% and 12.9% of the discharge observations, respectively, compared with 13.3%, 10.1%, 8.7% and 5.4% under the naive variance estimator. Thus, the naive 99% intervals achieve roughly the same coverage (≈13%) as the sandwich 68% intervals. In terms of width, the sandwich intervals are about twice as wide at low flows and roughly three times as wide at the hydrograph peaks (see Figure A2 in Appendix H).
Finally, we examine the discharge residuals obtained from the ML parameter values of the Xinanjiang model. Figure 11 shows a histogram of the studentized discharge residuals ϵ 1 ( θ ^ * ) , , ϵ n ( θ ^ * ) , with gray bars normalized to represent a probability density estimate such that the total area under the bars amounts to one. We also plot the SST density f SST ( ϵ 0 , 1 , ν , ξ ) of Equation (30) using the ML values of ν and ξ .
The histogram of the discharge residuals is in excellent agreement with the SST density. The studentized residuals $\epsilon_1(\hat{\theta}), \ldots, \epsilon_n(\hat{\theta})$ follow a Student t distribution with $\hat{\nu} = 2.92$ degrees of freedom and skewness $\hat{\xi} = 2.09$. This number of degrees of freedom is much smaller than one would expect from the sample size $n = 1827$ and the number of parameters $p = 16$ alone. This result once again confirms that the discharge residuals follow a Laplacian or double-exponential distribution [11,57]. A skewness of $\hat{\xi} = 2.09$ indicates that the distribution of MAP discharge residuals is right-skewed. Consequently, the mode (peak) of the distribution of the studentized streamflow residuals is located at $-0.64$, to the left of the median value of $-0.20$, which itself is smaller than the mean studentized residual of approximately 0.054. This mean value points to a negative bias in the Xinanjiang model, indicating a tendency, on average, to underestimate measured streamflows. The magnitude of this bias is around 0.14 mm/d or 11.3% of the mean measured discharge of 1.25 mm/d.
The SST density with low degrees of freedom exhibits both a sharper peak near its mean and heavier tails compared to the normal distribution (dotted line). This makes the Student t likelihood more robust to outliers and is well suited for inverse modeling of discharge data with the occasional large streamflow residuals. The largest residuals are typically attributable to precipitation measurement errors and are less governed by structural limitations and/or deficiencies of the hydrologic model. However, the sandwich estimator cannot distinguish between these two error sources. Both count as misspecification.

7. Numerical Estimation of the Sensitivity (Bread) Matrix

The naive and sandwich variance estimators rely on knowledge of the sensitivity matrix $A_n$ and variability matrix $B_n$, both evaluated at the ML estimator $\hat{\theta}_{*}$. Matrix $B_n = \frac{1}{n} \sum_{i=1}^{n} \nabla \mathcal{L}_{\omega_i}(\hat{\theta}_{*})\, \nabla \mathcal{L}_{\omega_i}(\hat{\theta}_{*})^{\top}$ is constructed solely from first-order derivatives of the log-likelihood. When computed carefully, either analytically or through numerical differentiation, the resulting matrix is typically symmetric and positive definite. This is not true for the sensitivity matrix $A_n = -\frac{1}{n} \nabla^{2} \mathcal{L}_n(\hat{\theta}_{*})$. The main challenge arises from the second-order derivatives of the log-likelihood, which are more difficult to compute than their first-order counterparts, especially when one or more parameters lie near their lower or upper bounds. The sensitivity matrix $A_n$ must be positive definite, and therefore invertible, to compute both the naive and sandwich variance estimators. If $A_n$ is not invertible, most textbooks advise that the model should be reconsidered, re-specified, and the analysis rerun, or, in some cases, that additional data should be collected. Holding certain model parameters constant at known or hypothesized values can restore invertibility, but this comes at the cost of reduced model flexibility and potentially introduces bias if the fixed values are incorrect. Furthermore, model simplification affects the estimates of the remaining variables and therefore the interpretation of the findings [81].
Gill and King [41] suggest using a pseudo-factorization $A_n = V^{\top} V$ of the sensitivity matrix if $A_n$ is not positive definite. Their so-called generalized Cholesky decomposition $V = \operatorname{gchol}(A_n)$ avoids the failures of earlier factorization methods by Gill and Murray [40] and Gill et al. [82] by selectively modifying small or negative pivots. This yields a controlled decomposition even when $A_n$ is indefinite or nearly singular. The resulting pseudo-variance matrix $(V^{\top} V)^{-1} = V^{-1} (V^{\top})^{-1}$ serves as a stand-in for $A_n^{-1}$. While this provides a computational workaround, it does not resolve the underlying invertibility problem; it merely allows variance estimation to proceed despite numerical artifacts. Consequently, when the sensitivity matrix is not invertible, results should be interpreted with caution, and model diagnostics and a careful reevaluation of assumptions remain essential.
Another possibility that requires almost no additional computation is to derive matrix $A_n$ from samples of the naive posterior distribution [25]. Theory establishes that this distribution will be asymptotically normal around the MAP estimator $\hat{\theta}_{*}$ with a covariance matrix equal to a single slice of bread, $\frac{1}{n} A_n^{-1}$. Thus, we can use the post-burn-in naive posterior samples as estimators of the bread matrix, $\hat{A}_n = \frac{1}{n} \operatorname{Cov}[\{\theta^{(b)}, \theta^{(b+1)}, \ldots, \theta^{(T)}\}]^{-1}$. Alternatively, we retain the results of the evaluations of $\mathcal{L}_n(\theta)$ at each iteration of the sampler and use them to numerically estimate the Hessian matrix at $\hat{\theta}_{*}$. This Hessian approximation will generally be a good estimator of $A_n$.
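A minimal sketch of this shortcut, assuming the post-burn-in chain samples are stored in an M x d array chain, the scalar n is given, and a meat estimate Bn is already available:

% Sketch: estimate the bread matrix from post-burn-in naive posterior
% draws and combine it with a meat estimate Bn to form the sandwich covariance.
A_hat    = inv(cov(chain)) / n;          % A_n ~ (1/n) * Cov[theta^(b),...,theta^(T)]^(-1)
Sig_sand = (A_hat \ Bn / A_hat) / n;     % (1/n) A_n^(-1) B_n A_n^(-1)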

8. Limitations of Sandwich-Adjusted MCMC Simulation

Unlike existing approaches that rely on arbitrary matrix square roots, eigendecompositions or a single scaling factor applied uniformly across the parameter space, our method employs a parameter-dependent learning rate λ ( θ ) that enables direction-specific tempering of the likelihood. This allows the sampler to capture directional asymmetries in the sandwich distribution, particularly under model misspecification or in small-sample regimes, and yields credible regions that remain valid when standard Bayesian inference underestimates uncertainty. In our research for this paper, we identified one potential weakness of our methodology. When the posterior distribution is multimodal and these modes are disconnected, then the learning rate λ ( θ ) can suppress one of the peaks, thereby inflating the probability mass of one or more other peaks. The sandwich-adjusted chains then concentrate on the other modes. Through our investigations, we found that a simple and effective remedy is to restrict the learning rate to the interval ( 0 , 1 ] . This preserves the multimodal structure of the posterior sandwich distribution.

9. Formal Measures for the Degree of Model Misspecification

The misalignment of the naive and sandwich variance estimators can be summarized by scalar measures of model misspecification. This idea is not new. For example, White [83] developed an information matrix test to assess whether the discrepancy between $A_n$ and $B_n$ is statistically significant. This is a Wald-type $\chi^2$ test: under regularity conditions and correct specification, the stacked elements of $\sqrt{n}\,(B_n - A_n)$ are asymptotically jointly normal with mean zero, so the associated quadratic form converges to $\chi^2_{d(d+1)/2}$ [84]. This section introduces additional misspecification metrics and presents an information-theoretic interpretation of the misalignment score of Vrugt et al. [8]. These measures complement commonly used model evaluation techniques such as residual diagnostics, which assess the validity of likelihood assumptions about variance, distributional form, and dependence structure [11,57]. In contrast, our proposed metrics do not rely explicitly on residual behavior or associated goodness-of-fit statistics. Instead, they assess misspecification implicitly through structural features, specifically, the alignment between the sensitivity and variability matrices, $A_n$ and $B_n$, which reflect both the model's internal dynamics and its interaction with the data. These diagnostics help guide model selection and improvement, and serve as a safeguard against overconfidence in model-based inference, particularly in applications where structural model error is difficult to detect or eliminate through residual analysis alone.
In theory, the proposed metrics can be evaluated at any θ Θ provided that the local sensitivity matrix is nonsingular (ideally positive definite) and the local variability matrix is positive semidefinite. In practice, we presuppose calibration and report the metrics at the MAP estimate θ ^ * , using the naive Σ n naive and sandwich Σ n sand variances computed from A n and B n . Without calibration, these quantities are not as meaningful.

9.1. Relative Entropy

Let $F = \mathcal{N}_d(\hat{\theta}_{*}, \Sigma_n^{\text{sand}})$ and $P = \mathcal{N}_d(\hat{\theta}_{*}, \Sigma_n^{\text{naive}})$ denote the d-variate normal distributions of the sandwich and naive variance estimators, respectively. The Kullback and Leibler [19] divergence $d_{\text{KL}}(P, F)$ of P from F equals (derivation in Appendix B of Vrugt [85])
$$ d_{\text{KL}}\big( \mathcal{N}_d(\hat{\theta}_{*}, \Sigma_n^{\text{naive}}),\, \mathcal{N}_d(\hat{\theta}_{*}, \Sigma_n^{\text{sand}}) \big) = \frac{1}{2} \Big[ \log\big| (\Sigma_n^{\text{naive}})^{-1} \Sigma_n^{\text{sand}} \big| + \operatorname{tr}\big\{ (\Sigma_n^{\text{sand}})^{-1} \Sigma_n^{\text{naive}} \big\} - d \Big]. $$
This statistical distance between the sandwich and naive posterior distributions is also known as the relative entropy from P to F, and equals the multivariate divergence score proposed by Dawid and Sebastiani [56] for identical means. We can express it in terms of the bread and meat matrices
$$ d_{\text{xx}}\big( \mathcal{N}_d(\hat{\theta}_{*}, \Sigma_n^{\text{naive}}),\, \mathcal{N}_d(\hat{\theta}_{*}, \Sigma_n^{\text{sand}}) \big) = \frac{1}{2} \Big[ \log\Big| n A_n \cdot \tfrac{1}{n} A_n^{-1} \beta_n A_n^{-1} \Big| + \operatorname{tr}\Big\{ n A_n \beta_n^{-1} A_n \cdot \tfrac{1}{n} A_n^{-1} \Big\} - d \Big] = \frac{1}{2} \log\big( |\beta_n A_n^{-1}| \big) + \frac{1}{2} \operatorname{tr}\big( A_n \beta_n^{-1} \big) - \frac{d}{2}. $$
This divergence score is strictly proper, meaning that $d_{\text{xx}}(P, F)$ is nonnegative and zero only if $P = F$, thus, $\Sigma_n^{\text{naive}} = \Sigma_n^{\text{sand}}$. The greater the misalignment between the sensitivity and variability matrices, the larger the value of $d_{\text{xx}}(P, F)$. The subscript 'xx' is intentionally used as a neutral placeholder, and we leave the formal naming of this divergence to future users or the broader research community. The misalignment score of Equation (32) is particularly well suited for applications in machine learning, where the sensitivity (bread) matrix $A_n$ and variability (meat) matrix $\beta_n$ can often be obtained "for free" as by-products of automatic differentiation. The misalignment score satisfies $d_{\text{xx}}(P, F) = H(P, F) - H(P)$, where
$$ H(P, F) = \frac{1}{2} \log\Big\{ (2\pi)^{d}\, \Big| \tfrac{1}{n} A_n^{-1} \beta_n A_n^{-1} \Big| \Big\} + \frac{1}{2} \operatorname{tr}\Big\{ n A_n \beta_n^{-1} A_n \cdot \tfrac{1}{n} A_n^{-1} \Big\} \;\Big( \big| \tfrac{1}{n} A_n^{-1} \beta_n A_n^{-1} \big| = n^{-d} |A_n^{-1}|^{2} |\beta_n| \Big) = \frac{d}{2} \log(2\pi) - \frac{d}{2} \log(n) - \log(|A_n|) + \frac{1}{2} \log(|\beta_n|) + \frac{1}{2} \operatorname{tr}(A_n \beta_n^{-1}), $$
is the cross-entropy between the d-variate normal naive and sandwich distributions and
$$ H(P) = \frac{1}{2} \log\Big\{ (2\pi e)^{d}\, \Big| \tfrac{1}{n} A_n^{-1} \Big| \Big\} = \frac{d}{2} + \frac{d}{2} \log(2\pi) - \frac{d}{2} \log(n) - \frac{1}{2} \log(|A_n|), $$
is the differential entropy of the multinormal naive distribution. This formulation highlights the misalignment score’s role as an information-theoretic measure of misspecification and closes the circle with our earlier work on probabilistic model evaluation [85].
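Assuming the sample-based bread matrix An and Newey-West meat matrix beta_n of Section 5 are available, the misalignment score of Equation (32) can be computed in a few lines of Matlab; the sketch below uses Cholesky-based log-determinants for numerical stability.

% Sketch of the misalignment score of Equation (32) from An and beta_n;
% logdet evaluates log-determinants of positive definite matrices.
logdet = @(X) 2 * sum(log(diag(chol(X))));
d      = size(An, 1);
d_xx   = 0.5*(logdet(beta_n) - logdet(An)) + 0.5*trace(An / beta_n) - d/2;
d_bar  = d_xx / d;                       % per-dimension misalignment score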
The misalignment score of Equation (32) can be directly compared across models with differing numbers of parameters. It should yield the same model ranking as the logarithmic score, or expected log predictive density
$$ S_{\text{LS}}(P, \omega) = \frac{1}{n} \sum_{t=1}^{n} \log f_{P_t}(\omega_t \mid M), $$
where P = { P 1 , , P n } are the posterior predictive distributions under model M for the naive estimator. This forms the basis of model selection criteria such as the widely applicable information criterion or WAIC and leave-one-out cross-validation [86,87]. If so desired, the misalignment score can be normalized to explicitly account for the number of model parameters d. This yields a per-dimension misalignment score d ¯ xx ( P , F )
$$ \bar{d}_{\text{xx}}\big( \mathcal{N}_d(\hat{\theta}_{*}, \Sigma_n^{\text{naive}}),\, \mathcal{N}_d(\hat{\theta}_{*}, \Sigma_n^{\text{sand}}) \big) = \frac{1}{2} d^{-1} \log\big( |\beta_n A_n^{-1}| \big) + \frac{1}{2} d^{-1} \operatorname{tr}\big( A_n \beta_n^{-1} \big) - \frac{1}{2}, $$
which can be compared across an ensemble of candidate models with differing number of parameters.

9.2. Fréchet Distance

The misalignment between F = N d ( θ ^ * , Σ n sand ) and P = N d ( θ ^ * , Σ n naive ) can also be quantified by the Earth Mover’s or Fréchet distance [88,89]
$$ d_F\big( \mathcal{N}_d(\hat{\theta}_{*}, \Sigma_n^{\text{naive}}),\, \mathcal{N}_d(\hat{\theta}_{*}, \Sigma_n^{\text{sand}}) \big) = \Big[ \operatorname{tr}\Big\{ \Sigma_n^{\text{naive}} + \Sigma_n^{\text{sand}} - 2 \big( \Sigma_n^{\text{naive}} \Sigma_n^{\text{sand}} \big)^{1/2} \Big\} \Big]^{1/2} = n^{-1/2} \Big[ \operatorname{tr}\Big\{ A_n^{-1} + A_n^{-1} \beta_n A_n^{-1} - 2 \big( A_n^{-2} \beta_n A_n^{-1} \big)^{1/2} \Big\} \Big]^{1/2}. $$
This distance is widely used in machine learning to compare the distribution of generated images from a model against the distribution of real images. Smaller values indicate greater similarity between distributions, with d F = 0 corresponding to perfect agreement.
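A minimal Matlab sketch of this computation is given below, with Sn_naive and Sn_sand as hypothetical covariance matrices; the symmetric form of the matrix square root is used because it has the same trace and is numerically better behaved.

% Sketch of the Frechet (Earth mover's) distance between the naive and
% sandwich normal approximations with identical means.
S1h = sqrtm(Sn_naive);
d_F = sqrt(trace(Sn_naive + Sn_sand - 2*sqrtm(S1h * Sn_sand * S1h)));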

9.3. Frobenius Norm

An alternative diagnostic metric is the Frobenius norm of the difference between the naive and sandwich variance estimators
$$ \big\| \Sigma_n^{\text{naive}} - \Sigma_n^{\text{sand}} \big\|_F = \sqrt{ \sum_{i=1}^{d} \sum_{j=1}^{d} \big( \Sigma_{n,ij}^{\text{naive}} - \Sigma_{n,ij}^{\text{sand}} \big)^{2} } = \Big\| \tfrac{1}{n} A_n^{-1} \big( \mathbf{I}_d - \beta_n A_n^{-1} \big) \Big\|_F, $$
where smaller values indicate better model specification, and a value of zero is ideal. Larger values imply greater degrees of misspecification. Alternatively, one can compare the observed Fisher I ^ n and Godambe G ^ n information matrices at the MAP parameter values
$$ \big\| \hat{I}_n - \hat{G}_n \big\|_F = \big\| n A_n \big( \mathbf{I}_d - \beta_n^{-1} A_n \big) \big\|_F. $$
This yields qualitatively similar conclusions but on a different scale. Further examination of the Fisher-Godambe discrepancy offers valuable insight into the nature and extent of model misspecification, particularly under different modeling assumptions and data sets.
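Both Frobenius-norm diagnostics follow directly from An, beta_n, and n, as in the hedged Matlab sketch below (variable names carried over from Section 5).

% Sketch of the Frobenius-norm diagnostics for the covariance and
% information matrices; An, beta_n and n are assumed available.
Sn_naive = inv(An) / n;                              % naive covariance
Sn_sand  = (An \ beta_n / An) / n;                   % sandwich covariance
nrm_cov  = norm(Sn_naive - Sn_sand, 'fro');          % Equation (35)
nrm_info = norm(n*An - n*(An / beta_n)*An, 'fro');   % Fisher-Godambe discrepancy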

9.4. Herfindahl Index

Under correct specification, the theoretical precision matrix $M \equiv A_{*}^{1/2} B_{*}^{-1} A_{*}^{1/2}$ defined in Remark 8 on Page 17 will equal an identity matrix $\mathbf{I}_d$. Then, the eigenvalues $\underline{\lambda}_1, \ldots, \underline{\lambda}_d$ of $M = \mathbf{I}_d$ will equal one. Suppose we normalize the d eigenvalues of M
$$ \underline{\lambda}_{i,n} = \frac{\underline{\lambda}_i}{\sum_{j=1}^{d} \underline{\lambda}_j}, \qquad i = 1, \ldots, d, $$
then the Herfindahl index H
$$ H = \sum_{i=1}^{d} \underline{\lambda}_{i,n}^{2}, $$
is a measure of how dispersed or concentrated the eigenvalues are across the parameter space. This metric is commonly used in economics as a scalar summary of the variance concentration in the principal components of a covariance matrix [90,91]. Under correct specification, all normalized eigenvalues attain a value of $d^{-1}$ and $H = 1/d$. This is the lowest possible value for the Herfindahl index and indicates maximum variance uniformity across dimensions. The maximum value $H = 1$ is reached when all variability is concentrated in a single direction. Higher Herfindahl indices, thus, imply that most of the uncertainty is concentrated along a few dimensions of the parameter space, potentially indicating ill-conditioning or overfitting. Large differences between the Herfindahl indices of $\Sigma_n^{\text{naive}}$ and $\Sigma_n^{\text{sand}}$ signal model misspecification. In particular, the naive estimator may imply uniformly distributed uncertainty, whereas the sandwich estimator captures the anisotropic structure introduced by model error.
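A small Matlab sketch of the Herfindahl index, computed from the sample analogues An and beta_n of the matrices that define M; these variable names are assumptions carried over from Section 5.

% Sketch of the Herfindahl index of the normalized eigenvalues of M;
% H = 1/d indicates a uniform spectrum, H = 1 full concentration.
Ah   = sqrtm(An);
M    = Ah / beta_n * Ah;            % M = A_n^(1/2) * inv(beta_n) * A_n^(1/2)
lam  = eig((M + M') / 2);           % symmetrize before the eigendecomposition
lamn = lam / sum(lam);              % normalized eigenvalues
H    = sum(lamn.^2);                % Herfindahl index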
For the identity matrix I d and diagonal matrix diag ( λ ̲ 1 , , λ ̲ d ) we find
$$ d_{\text{KL}}\big( \mathcal{N}_d(\mathbf{0}, \mathbf{I}_d),\, \mathcal{N}_d(\mathbf{0}, \operatorname{diag}(\underline{\lambda}_1, \ldots, \underline{\lambda}_d)) \big) = \frac{1}{2} \sum_{i=1}^{d} \big( \underline{\lambda}_i^{-1} - 1 + \log(\underline{\lambda}_i) \big), $$
and the reverse KL-divergence
$$ d_{\text{KL}}\big( \mathcal{N}_d(\mathbf{0}, \operatorname{diag}(\underline{\lambda}_1, \ldots, \underline{\lambda}_d)),\, \mathcal{N}_d(\mathbf{0}, \mathbf{I}_d) \big) = \frac{1}{2} \sum_{i=1}^{d} \big( \underline{\lambda}_i - 1 - \log(\underline{\lambda}_i) \big). $$
The symmetrized KL divergence is $\frac{1}{4} \sum_{i=1}^{d} \big( \underline{\lambda}_i + \underline{\lambda}_i^{-1} - 2 \big)$.
Thus, the Herfindahl index adds to the suite of diagnostics by providing an interpretable scalar summary of the effective dimensionality of the parameter uncertainty. This makes it particularly useful for comparing models of varying complexity or visualizing behavior along a complexity–regularization trade-off. A related measure is the sample variance of the $\underline{\lambda}_i$'s, $s_{\underline{\lambda}}^{2} = \big( \sum_{i=1}^{d} \underline{\lambda}_i^{2} - d\, m_{\underline{\lambda}}^{2} \big)/(d - 1)$, where $m_{\underline{\lambda}}$ is the sample mean of $\underline{\lambda}_1, \ldots, \underline{\lambda}_d$.
The misspecification diagnostics introduced here serve as a companion to predictive model selection criteria such as the Akaike information criterion (AIC; [92]), the Bayesian information criterion (BIC; [93]), and the WAIC [86]. Whereas these criteria rank models by expected predictive performance under a correctly specified likelihood, our measures assess whether that assumption is credible by quantifying the alignment between the sensitivity and variability matrices. Low misalignment scores support the use of AIC/BIC/WAIC with greater confidence. Large discrepancies warn that their penalties may understate uncertainty and yield overconfident rankings. In practice, the proposed diagnostics can be used to screen out poorly specified models before comparing predictive performance.
In this study, we applied the proposed sandwich-adjusted MCMC simulation method to a collection of discharge data sets and a suite of hydrologic models of varying complexity. For each case, we computed the omnibus scalar k introduced by Pauli et al. [31]. In nearly all applications, the estimated k deviated markedly from the value of unity expected under correct model specification, indicating substantial misspecification across all models. Although more complex models with larger parameter dimensionality d often yielded higher omnibus values, the relationship was not strictly monotonic. This suggests that model structure rather than dimensionality alone plays a critical role in determining specification quality. These findings highlight the practical value of our method for assessing model adequacy and underscore the need for further research on the interplay between model complexity, misalignment scores, other misspecification-based diagnostics, and commonly used residual-based measures of predictive performance.

10. Summary and Conclusions

Frequentist and Bayesian methods are widely used for standard tasks such as statistical inference and hypothesis testing, as well as for more specific tasks including model training (calibration) and prediction (forecasting). In an earlier article, we demonstrated a critical flaw in both maximum likelihood (ML) and Bayesian approaches under model misspecification. Contrary to common teaching and statistical practice, the asymptotic covariance matrix of the ML parameter estimates, $\Sigma_n$, does not equal the inverse of the observed Fisher information matrix, $\hat{I}_n$. Instead, it corresponds to the sandwich variance matrix $\Sigma_n^{\text{sand}} = \hat{G}_n^{-1}$, where the observed Godambe information is defined as $\hat{G}_n = n A_n B_n^{-1} A_n$. This Godambe matrix serves as the fundamental measure of data informativeness under model misspecification [8]. Here, $A_n$ and $B_n$ are sample averages of the sensitivity and variability matrices, respectively, for n data points $\omega_1, \ldots, \omega_n$, evaluated at the ML parameter estimates $\hat{\theta}_{*}$.
The goals of this paper were three-fold. First, we reviewed and examined three existing methods for producing asymptotically valid sandwich posterior distributions. The first method, known as the open-faced sandwich (OFS) adjustment of Shaby [25], applies direction-specific dilations along the principal axes of the samples of the naive posterior distribution, $\Sigma_n^{\text{naive}} = \hat{I}_n^{-1}$, to align its local curvature around the MAP estimator $\hat{\theta}_{*}$ with that of the sandwich variance matrix $\Sigma_n^{\text{sand}}$. Specifically, naive posterior samples $\theta^{(1)}, \ldots, \theta^{(M)}$ are centered on the MAP estimator and pre-multiplied by the matrix $\Psi_n = A_n^{-1} B_n^{1/2} A_n^{1/2}$ to yield OFS-adjusted samples $\theta_{\text{ofs}}^{(j)} = \hat{\theta}_{*} + \Psi_n (\theta^{(j)} - \hat{\theta}_{*})$ for all $j = 1, \ldots, M$. This a-posteriori transformation is computationally efficient and simple to implement, but it does not guarantee a fully accurate characterization of the sandwich distribution.
The second method, magnitude-adjusted MCMC of Pauli et al. [31], aligns the posterior with the sandwich distribution by raising the likelihood function $L_n(\theta)$ to a scalar power $0 < k < 1$, known as the omnibus scalar. The scalar k is chosen such that the estimated information matrices $A_n$ and $B_n$ satisfy the information identity $k A_n = k^2 B_n$. This power-likelihood approach, denoted $L_n^{k}(\theta)$, effectively tempers the learning rate and produces posterior samples whose covariance is inversely proportional to the observed Godambe information $\hat{G}_n = n A_n B_n^{-1} A_n$. While this method is computationally attractive, it applies a single scalar k to all d parameters. Consequently, it yields accurate results only when $L_n(\theta)$ is approximately quadratic near $\hat{\theta}_*$. If the posterior exhibits anisotropy (directional variation) or asymmetry, a single scalar k may distort the geometry of the true sandwich distribution, suggesting the need for dimensionality-specific scaling.
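Given sample estimates of the two information matrices, the omnibus scalar can be computed in one line of MATLAB. The sketch below assumes the trace convention that is equivalent to the expression derived in Appendix D; variable names are illustrative.
% Omnibus scalar from sample estimates of the sensitivity and variability matrices (sketch)
d = size(An, 1);
k = d / trace(An \ Bn);   % equals unity under correct model specification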
The third method, curvature-adjusted MCMC of Ribatet et al. [24], modifies the sampling procedure by evaluating the likelihood $L_n(\theta)$ at an affine transformation of the candidate points. Specifically, each proposed point $\theta_{\mathrm{p}}$ is transformed to $\theta_{\mathrm{p}}^{\mathrm{ca}} = \hat{\theta}_* + C_n(\theta_{\mathrm{p}} - \hat{\theta}_*)$, where the tuning matrix is defined as $C_n = B_n^{-1/2} A_n^{1/2}$. This transformation effectively enforces the sandwich covariance on the MCMC samples, ensuring that the sampled chains reflect the curvature implied by the observed Godambe information. However, this method has important limitations. The matrix square roots $A_n^{1/2}$ and $B_n^{1/2}$ are not uniquely defined unless the log-likelihood $\mathcal{L}_n(\theta)$ is exactly quadratic in the neighborhood of $\hat{\theta}_*$. As a result, the transformation can induce arbitrary rotations of the posterior ellipsoids, which may misrepresent the true directional asymmetries of the sandwich distribution. Moreover, curvature-adjusted MCMC does not respect parameter bounds, and care must be taken to ensure that proposed candidate points lie within the feasible domain.
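The corresponding proposal transformation is equally compact. The sketch below again relies on principal matrix square roots via sqrtm, treats theta_p and theta_map as row vectors, and uses illustrative variable names.
% Curvature adjustment of a candidate point (sketch)
Cn       = sqrtm(Bn) \ sqrtm(An);                     % Cn = Bn^{-1/2} An^{1/2}
theta_ca = theta_map + (theta_p - theta_map) * Cn.';  % evaluate L_n at theta_ca rather than theta_p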
As the second objective of this paper, we presented the theoretical foundation of a kernel adjustment method for sandwich-adjusted MCMC simulation. This approach is similar in spirit to magnitude-adjusted MCMC but employs a scaled log-likelihood function of the form $\mathcal{L}_n^{\mathrm{p}}(\theta \mid \lambda) = \lambda(\theta)\{\mathcal{L}_n(\theta) - \mathcal{L}_n(\hat{\theta}_*)\}$, centered at the maximum a posteriori (MAP) estimator $\hat{\theta}_*$ and governed by a nonconstant, parameter-dependent power $\lambda(\theta) > 0$. This dynamic learning rate is defined as $\lambda(\theta) = \{(\theta - \hat{\theta}_*)^{\top}\hat{G}_n(\theta - \hat{\theta}_*)\}/\{(\theta - \hat{\theta}_*)^{\top} n A_n(\theta - \hat{\theta}_*)\}$ and is typically less than one, as the sandwich or Godambe information $\hat{G}_n$ is generally smaller in magnitude than $n A_n$ under model misspecification. Thus, the power $\lambda(\theta) > 0$ flattens the posterior surface in regions where the observed information exceeds the robust information, reducing overconfidence and improving robustness under model misspecification. Note that under correct specification, $\hat{G}_n = \hat{I}_n = n A_n$, we obtain a unit learning rate for all $\theta$ and recover the naive covariance matrix $\Sigma_n^{\mathrm{naive}} = \hat{I}_n^{-1}$. The learning rate $\lambda(\theta)$ facilitates the construction of robust Bayesian credible regions under misspecification without requiring matrix square roots or eigendecompositions.
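For reference, the learning rate can be evaluated with a few lines of MATLAB. The sketch below substitutes the sample estimates A_n and B_n for their population counterparts and treats theta and theta_map as column vectors; variable names are illustrative.
% Parameter-dependent learning rate lambda(theta) (sketch)
Gn  = n * (An / Bn) * An;                  % observed Godambe information, n * An * inv(Bn) * An
r   = theta - theta_map;                   % displacement from the MAP estimate
lam = (r' * Gn * r) / (r' * (n * An) * r); % directional tempering factor (guard against r = 0 in practice)
lam = min(max(lam, 0), 1);                 % optional clamp to [0, 1], cf. Remark A1 (iii)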
We demonstrated the four different sandwich adjustment methods by application to three case studies of increasing complexity. The first two case studies focus attention on simple statistical models with analytically tractable derivatives and offer deeper insight into the differences between naive and sandwich variance estimators under a wrong parameterization (study 1) and an inadequate parametric form (study 2) of the data-generating process. The naive variance estimator fails to account for these discrepancies and leads to overconfident inference. The first study confirmed that the frequentist sandwich variance estimator produces asymptotically valid confidence intervals. The second study demonstrated that the OFS adjustment of Shaby [25] increased the spread of the naive posterior samples, but the resulting credible regions did not achieve the theoretical sandwich coverage probabilities. In contrast, parameter credible regions obtained using magnitude-, curvature-, and sandwich-adjusted MCMC simulation were in close agreement with one another and almost attained the expected coverage; the small remaining deviation was due to the assumption of symmetry in constructing the $100(1 - \alpha)\%$ credible intervals. Altogether, the first two studies confirmed that the sandwich estimator yields asymptotically valid "robust standard errors" even when $L_n(\theta)$ is wrongly parameterized or misspecified.
The third and final study applied the proposed methods to a rainfall–discharge simulation using the Xinanjiang watershed model. The results confirmed that traditional MCMC methods tend to produce overly narrow credible intervals for both model parameters and simulated outputs. This well-known phenomenon of overconditioning arises from the incorrect assumption of a well-specified model. Magnitude-, curvature-, and sandwich-adjusted MCMC simulation relax this assumption and yield substantially larger credible regions for the Xinanjiang model parameters and simulated streamflows. Our proposed method with a dynamic learning rate yields more robust Bayesian credible intervals than magnitude-adjusted MCMC sampling and does not suffer from the nonuniqueness of principal matrix square roots that affects curvature-adjusted MCMC simulation. All three methods (magnitude-, curvature-, and sandwich-adjusted MCMC) require sample estimates $A_n$ and $B_n$ of the sensitivity and variability matrices, respectively, along with an estimate of $\hat{\theta}_*$. In principle, the sampled chains from sandwich-adjusted MCMC converge rapidly to the sandwich distribution, since the chains are initialized in the vicinity of the MAP solution. However, this approach will always incur a greater computational cost than naive Bayesian methods.
As the third and final objective of this paper, we presented an information-theoretic interpretation of the alignment score proposed by Vrugt et al. [8]. This strictly proper score measures the concordance of the bread and meat matrices and can be decomposed into a cross-entropy and a differential entropy term. The misalignment score guides model improvement and enables direct comparison across models with different numbers of parameters, supporting model selection. We also explored other scalar measures of the degree of model misspecification, including the Earth mover's distance, the Frobenius norm, and the Herfindahl index. Each measure captures different aspects of the discrepancy between the naive and sandwich variance estimators caused by model misspecification. The Herfindahl index also quantifies the effective dimensionality of posterior uncertainty and serves as a useful diagnostic of anisotropy and concentration in the naive and sandwich variance estimators (see the sketch below). Application of these measures to a suite of hydrologic models confirmed that all models were substantially misspecified. This analysis further showed that increased model complexity does not guarantee better specification. Further research is warranted on the interplay between model complexity, the proposed misalignment score, other misspecification-based diagnostics, and widely used residual-based measures of predictive performance.
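As an illustration of the last of these summaries, a Herfindahl-type concentration index can be computed from the normalized eigenvalue spectrum of a covariance estimator. The sketch below is one plausible implementation under that assumption, not necessarily the exact definition used in the main text; Sigma denotes a naive or sandwich covariance estimate.
% Herfindahl-type concentration index of a covariance estimator (illustrative sketch)
lam   = eig((Sigma + Sigma.') / 2);   % eigenvalues of the (symmetrized) covariance matrix
p     = lam / sum(lam);               % normalized eigenvalue spectrum
HHI   = sum(p.^2);                    % 1/d (isotropic spread) <= HHI <= 1 (single dominant direction)
d_eff = 1 / HHI;                      % effective dimensionality of the posterior uncertainty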

Author Contributions

Conceptualization, J.A.V. and C.G.H.D.; methodology, J.A.V. and C.G.H.D.; software, J.A.V.; validation, J.A.V. and C.G.H.D.; formal analysis, J.A.V. and C.G.H.D.; investigation, J.A.V. and C.G.H.D.; resources, J.A.V.; data curation, J.A.V.; writing—original draft preparation, J.A.V.; writing—review and editing, J.A.V.; visualization, J.A.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The theory, methodology, and case studies presented in this paper are part of DREAM-Suite, a MATLAB–Python software package for Bayesian model training, evaluation, and diagnostics [28]. This software is available at https://github.com/jaspervrugt/dream-suite (accessed on 3 September 2025).

Acknowledgments

We appreciate the comments of the three anonymous reviewers. During the preparation of this manuscript, the authors used GPT-4o and GPT-5 (developed by OpenAI) to assist with language editing. The authors have thoroughly reviewed and edited all AI-generated content and take full responsibility for the final version of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MCMC: Markov Chain Monte Carlo
MH: Metropolis–Hastings
DREAM(ZS): DiffeRential Evolution Adaptive Metropolis
ML: Maximum Likelihood
MAP: Maximum A Posteriori
OFS: Open-Faced Sandwich
CAMH: Curvature-Adjusted Metropolis–Hastings
SAMH: Sandwich-Adjusted Metropolis–Hastings
DREAM-Suite: MATLAB–Python software package for Bayesian training, evaluation, and diagnostics

Appendix A

In this Appendix, we demonstrate that an equivalent expression for $B_*$ is the variance of the score $\dot{\mathcal{L}}^{\mathrm{n}}_{\omega}(m)$ at $m = \mu$.
The normal log-likelihood $\mathcal{L}^{\mathrm{n}}_{\omega}(m \mid s^2)$ for a single datum $\omega$ is given by
$\mathcal{L}^{\mathrm{n}}_{\omega}(m \mid s^2) = -\tfrac{1}{2}\log(2\pi s^2) - \tfrac{1}{2 s^2}(\omega - m)^2,$
with first derivative
$\dot{\mathcal{L}}^{\mathrm{n}}_{\omega}(m \mid s^2) = s^{-2}(\omega - m),$
as shown in Equation (6).
The expected value of the score is equal to
$\mathbb{E}_{\omega}[\dot{\mathcal{L}}^{\mathrm{n}}_{\omega}(m \mid s^2)] = \mathbb{E}_{\omega}[s^{-2}(\omega - m)] = s^{-2}(\mathbb{E}_{\omega}[\omega] - \mathbb{E}_{\omega}[m]) = s^{-2}(\mu - m),$
where $\mu = \mathbb{E}_{\omega}[\omega]$ is the population mean. The variance of the score becomes
$\mathrm{Var}[\dot{\mathcal{L}}^{\mathrm{n}}_{\omega}(m \mid s^2)] = \mathbb{E}_{\omega}\big[\{\dot{\mathcal{L}}^{\mathrm{n}}_{\omega}(m \mid s^2) - \mathbb{E}_{\omega}[\dot{\mathcal{L}}^{\mathrm{n}}_{\omega}(m \mid s^2)]\}\,\{\dot{\mathcal{L}}^{\mathrm{n}}_{\omega}(m \mid s^2) - \mathbb{E}_{\omega}[\dot{\mathcal{L}}^{\mathrm{n}}_{\omega}(m \mid s^2)]\}\big] = \mathbb{E}_{\omega}\big[\{s^{-2}(\omega - m) - s^{-2}(\mu - m)\}^2\big] = \mathbb{E}_{\omega}\big[\{s^{-2}(\omega - \mu)\}^2\big] = s^{-4}\,\mathbb{E}_{\omega}[(\omega - \mu)^2].$
At the likelihood maximum $m = \mu$, we obtain
$\mathrm{Var}[\dot{\mathcal{L}}^{\mathrm{n}}_{\omega}(m \mid s^2)] = s^{-4}\sigma^2,$
which corresponds to $B_*$ in Equation (8).

Appendix B

In this Appendix, we present an asymptotic proof of convergence of sandwich-adjusted MCMC simulation.
Theorem A1. 
Let $\theta_* \in \mathbb{R}^d$ denote the vector of pseudo-true parameter values and $\hat{\theta}_*$ the MLE (or MAP under a uniform prior). Assume the prior is continuous and strictly positive at $\theta_*$. Suppose the log-likelihood $\mathcal{L}_n(\theta) = \log L_n(\theta)$ is twice continuously differentiable in a neighborhood of $\theta_*$, and define
$A_* = -\tfrac{1}{n}\,\mathbb{E}[\nabla^2 \mathcal{L}_n(\theta_*)], \qquad B_* = \tfrac{1}{n}\,\mathbb{E}\big[\nabla \mathcal{L}_n(\theta_*)\,\nabla \mathcal{L}_n(\theta_*)^{\top}\big],$
with $A_*$ and $B_*$ positive definite. Consider the power log-likelihood of Equation (21)
$\mathcal{L}_n^{\mathrm{p}}(\theta \mid \lambda) = \lambda(\theta)\,[\mathcal{L}_n(\theta) - \mathcal{L}_n(\hat{\theta}_*)],$
with learning rate of Equation (24)
$\lambda(\theta) = \dfrac{(\theta - \hat{\theta}_*)^{\top} A_* B_*^{-1} A_* (\theta - \hat{\theta}_*)}{(\theta - \hat{\theta}_*)^{\top} A_* (\theta - \hat{\theta}_*)}.$
Then the corresponding target density
$\phi(\theta) \propto \left[\dfrac{L_n(\theta)}{L_n(\hat{\theta}_*)}\right]^{\lambda(\theta)}$
is asymptotically equivalent to a d-variate normal distribution with mean $\hat{\theta}_*$ and $d \times d$ covariance matrix $\tfrac{1}{n} A_*^{-1} B_* A_*^{-1}$.
Consequently, the stationary distribution of the sandwich-adjusted Metropolis–Hastings (SAMH) algorithm targeting ϕ is the sandwich-adjusted posterior in the large-sample limit.
Proof. 
Let $\theta$ lie in a neighborhood of $\hat{\theta}_*$. A second-order Taylor expansion of the log-likelihood around $\hat{\theta}_*$ yields
$\mathcal{L}_n(\theta) - \mathcal{L}_n(\hat{\theta}_*) \approx -\tfrac{1}{2}\,n\,(\theta - \hat{\theta}_*)^{\top} A_* (\theta - \hat{\theta}_*).$
Substituting this into (A1) gives
$\mathcal{L}_n^{\mathrm{p}}(\theta \mid \lambda) \approx -\tfrac{1}{2}\,n\,\lambda(\theta)\,(\theta - \hat{\theta}_*)^{\top} A_* (\theta - \hat{\theta}_*).$
Using (A2)
$\mathcal{L}_n^{\mathrm{p}}(\theta \mid \lambda) \approx -\tfrac{1}{2}\,n\,\dfrac{(\theta - \hat{\theta}_*)^{\top} A_* B_*^{-1} A_* (\theta - \hat{\theta}_*)}{(\theta - \hat{\theta}_*)^{\top} A_* (\theta - \hat{\theta}_*)}\,(\theta - \hat{\theta}_*)^{\top} A_* (\theta - \hat{\theta}_*),$
which simplifies to
$\mathcal{L}_n^{\mathrm{p}}(\theta \mid \lambda) \approx -\tfrac{1}{2}\,n\,(\theta - \hat{\theta}_*)^{\top} A_* B_*^{-1} A_* (\theta - \hat{\theta}_*).$
Therefore, under the assumed continuity and positivity of the prior at $\theta_*$,
$\phi(\theta) \propto \exp\!\big\{-\tfrac{1}{2}\,n\,(\theta - \hat{\theta}_*)^{\top} A_* B_*^{-1} A_* (\theta - \hat{\theta}_*)\big\},$
which is d-variate normal with mean $\hat{\theta}_*$ and covariance matrix $\tfrac{1}{n} A_*^{-1} B_* A_*^{-1}$. Hence, the SAMH target is asymptotically the sandwich-adjusted posterior. □
Remark A1. 
(i) 
Informative priors. Let $\mathcal{P}(\theta)$ be the log-prior. Under standard regularity conditions (prior density is positive and continuous near $\theta_*$), the log-prior curvature is $O(1)$ while the log-likelihood curvature is $O(n)$, so the asymptotic covariance remains $\tfrac{1}{n} A_*^{-1} B_* A_*^{-1}$, with $A_*$ and $B_*$ defined from the likelihood. Centering at the MAP estimator $\hat{\theta}$ improves finite-sample accuracy without altering the limit.
(ii) 
Sample estimators. As discussed in Section 5 we must replace A * and B * by consistent estimators A n and B n evaluated at θ ^ * . If A * = plim A n and B * = plim B n with positive definiteness, the same asymptotic result holds.
(iii) 
Finite-sample stability. When n is small or the model is ill-conditioned, eigenvalue clipping or ridge regularization of $A_n$ and $B_n$, robust covariance estimation for dependence [61], smoothing across iterations, and constraining $\lambda(\theta) \in [0, 1]$ can improve stability without affecting the asymptotic target; a brief sketch is given below.
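The following minimal MATLAB sketch illustrates these safeguards; the tolerance value and variable names are illustrative.
% Finite-sample safeguards: ridge and eigenvalue regularization of An and Bn (sketch)
tol    = 1e-8;                               % illustrative tolerance
An_reg = An + tol * eye(size(An, 1));        % ridge regularization of the sensitivity matrix
[V, D] = eig((Bn + Bn.') / 2);               % symmetrize before the eigendecomposition
Bn_reg = V * diag(max(diag(D), tol)) * V.';  % clip small or negative eigenvalues
lam    = min(max(lam, 0), 1);                % constrain the learning rate to [0, 1]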

Appendix C

In this Appendix, we determine the bread and meat matrices A * and B * , respectively, for the normal power likelihood function L n np ( m s 2 ) in Section 3.
The normal power log-likelihood $\mathcal{L}_n^{\mathrm{np}}(m \mid s^2)$ is defined as
$\mathcal{L}_n^{\mathrm{np}}(m \mid s^2) = k\,\mathcal{L}_n^{\mathrm{n}}(m \mid s^2) = -\tfrac{1}{2}\,n k \log(2\pi s^2) - \tfrac{k}{2 s^2}\sum_{i=1}^{n}(\omega_i - m)^2,$
or for a single datum $\omega$ we can write
$\mathcal{L}_{\omega}^{\mathrm{np}}(m \mid s^2) = -\tfrac{1}{2}\,k \log(2\pi s^2) - \tfrac{k}{2 s^2}(\omega - m)^2.$
The first and second derivatives with respect to m are
$\dot{\mathcal{L}}_{\omega}^{\mathrm{np}}(m \mid s^2) = \tfrac{d}{dm}\mathcal{L}_{\omega}^{\mathrm{np}}(m \mid s^2) = \tfrac{k}{s^2}(\omega - m), \qquad \ddot{\mathcal{L}}_{\omega}^{\mathrm{np}}(m \mid s^2) = \tfrac{d^2}{dm^2}\mathcal{L}_{\omega}^{\mathrm{np}}(m \mid s^2) = -\tfrac{k}{s^2}.$
The sensitivity matrix (a scalar in this case) is
$A_* = -\mathbb{E}_{\omega}[\ddot{\mathcal{L}}_{\omega}^{\mathrm{np}}(m \mid s^2)] = -\mathbb{E}_{\omega}[-k s^{-2}] = k s^{-2}.$
The variability matrix (also a scalar here) is
$B_* = \mathbb{E}_{\omega}[\dot{\mathcal{L}}_{\omega}^{\mathrm{np}}(m \mid s^2)\,\dot{\mathcal{L}}_{\omega}^{\mathrm{np}}(m \mid s^2)] = \mathbb{E}_{\omega}[k s^{-2}(\omega - m)\cdot k s^{-2}(\omega - m)] = k^2 s^{-4}\,\mathbb{E}_{\omega}[(\omega - m)^2].$
At the likelihood maximum $m = \mu$, we have $\mathbb{E}_{\omega}[(\omega - m)^2] = \sigma^2$ and hence $B_* = k^2 s^{-4}\sigma^2$.
Suppose the normal distribution model is correctly specified, so that $s^2 = \sigma^2$. Then, the variability matrix simplifies to $B_* = k^2 s^{-2}$ and the naive and sandwich variances equal
$\Sigma = \begin{cases} \tfrac{1}{n}A_*^{-1} = k^{-1} s^2/n, & \text{naive variance},\\ \tfrac{1}{n}A_*^{-1} B_* A_*^{-1} = (k^{-1} s^2)\,(k^2 s^{-4}\sigma^2)\,(k^{-1} s^2)/n = \sigma^2/n, & \text{sandwich variance}.\end{cases}$
The expression for the naive variance supports the widely held belief that applying an arbitrary power k > 0 to the likelihood function provides a mechanism to control parameter uncertainty. This idea underlies the GLUE methodology of Beven and Binley [13]. Specifically, values of 0 < k < 1 inflate the confidence regions of the estimated parameters, while learning rates k > 1 lead to a contraction of the “posterior” distribution of θ ^ * , thereby reducing parameter uncertainty.
Although elastic stretching of the likelihood function may appear to offer a pragmatic remedy for over-conditioning, it lacks rigorous theoretical support. This is clearly evidenced by the closed-form expression for the sandwich variance, in which the arbitrary power k cancels in the product A * 1 B * 1 A * 1 . Consequently, under model misspecification, the learning rate k has no effect on the estimated parameter (and predictive) uncertainty.
This result provides important insight into the robust quantification of model and predictive uncertainty in the presence of outliers or structural model errors. The same conclusion was previously discussed in Vrugt et al. [8], to which interested readers are referred for further discussion.

Computer Implementation

We implement the naive and sandwich variance estimators of Equation (A3) in Matlab. The code below generates Table 1 and Table 2. Built-in functions are highlighted with a low dash.
[MATLAB listing for Tables 1 and 2 not reproduced here.]
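Since the published listing appears as an image, the following minimal MATLAB sketch of the same Monte Carlo experiment is included for orientation; it is not the original code and uses the settings of Table 1 (μ = 0, σ² = 1, n = 100, M = 10⁴ trials).
% Monte Carlo check of naive vs. sandwich coverage for the normal model N(m, s2) (sketch)
mu = 0; sigma2 = 1; n = 100; M = 1e4; s2 = 2; alpha = 0.05;
hit = zeros(M, 2);                               % coverage indicators [naive, sandwich]
z   = sqrt(2) * erfinv(1 - alpha);               % standard normal quantile for a (1 - alpha) interval
for j = 1:M
    w     = mu + sqrt(sigma2) * randn(n, 1);     % training data from N(mu, sigma2)
    m_hat = mean(w);                             % ML estimate of the mean m
    An    = 1 / s2;                              % sensitivity (scalar)
    Bn    = mean((w - m_hat).^2) / s2^2;         % variability (scalar)
    V     = [1/(n*An), Bn/(n*An^2)];             % naive and sandwich variances of m_hat
    hit(j, :) = abs(m_hat - mu) <= z * sqrt(V);  % does each interval cover the true mean?
end
disp(100 * mean(hit))                            % empirical coverage in percent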

Appendix D

In this Appendix, we derive analytic expressions for the empirical and theoretic naive and sandwich variances of the mean μ of the exponential distribution E ( μ ) . This also leads to an expression for the omnibus scalar k in Equation (10).
Suppose measurements $\omega_1, \ldots, \omega_n$ are drawn from a gamma distribution $\Omega \sim \mathcal{G}(a, b)$ but we fit an exponential distribution $\mathcal{E}(\mu)$ with mean parameter $\mu > 0$. The exponential likelihood for a single observation $\omega$ is defined as
$L_{\omega}(\mu) = f(\omega \mid \mu) = \mu^{-1}\exp(-\omega/\mu),$
and the corresponding log-likelihood becomes
$\mathcal{L}_{\omega}(\mu) = \log L_{\omega}(\mu) = -\log(\mu) - \omega/\mu.$
The first and second derivatives of the log-likelihood with respect to $\mu$ are
$\dot{\mathcal{L}}_{\omega}(\mu) = \tfrac{d}{d\mu}\mathcal{L}_{\omega}(\mu) = -\mu^{-1} + \omega\mu^{-2}, \qquad \ddot{\mathcal{L}}_{\omega}(\mu) = \tfrac{d}{d\mu}\dot{\mathcal{L}}_{\omega}(\mu) = \mu^{-2} - 2\omega\mu^{-3}.$
The sensitivity matrix (scalar, since $\mu$ is univariate) is defined as
$A_* = -\mathbb{E}_{\omega}[\ddot{\mathcal{L}}_{\omega}(\mu)] = \mathbb{E}_{\omega}[-\mu^{-2} + 2\omega\mu^{-3}] = 2\,\mathbb{E}_{\omega}[\omega]\,\mu^{-3} - \mu^{-2},$
and the variability matrix (also a scalar here) is
$B_* = \mathbb{E}_{\omega}[\dot{\mathcal{L}}_{\omega}(\mu)\,\dot{\mathcal{L}}_{\omega}(\mu)] = \mathbb{E}_{\omega}[(-\mu^{-1} + \omega\mu^{-2})(-\mu^{-1} + \omega\mu^{-2})] = \mathbb{E}_{\omega}[\mu^{-2} + \omega^2\mu^{-4} - 2\omega\mu^{-3}] = \mu^{-2} - 2\,\mathbb{E}_{\omega}[\omega]\,\mu^{-3} + \mathbb{E}_{\omega}[\omega^2]\,\mu^{-4}.$

Appendix D.1. Correct Model Specification

If the data ω are drawn from E ( μ ) , then the known moment identities apply:
E ω [ ω ] = μ and E ω [ ω 2 ] = 2 μ 2 .
Substituting in Equations (A4) and (A5) yields
$A_* = -\mu^{-2} + 2\mu^{-3}\mu = \mu^{-2}, \qquad B_* = \mu^{-2} - 2\mu\,\mu^{-3} + 2\mu^2\,\mu^{-4} = \mu^{-2}.$
Hence, under correct model specification, we have $A_* = B_*$ for any $\mu \in \mathbb{R}^{+}$, implying that the naive and sandwich variance estimators coincide and both equal $\mu^2/n$.

Appendix D.2. Incorrect Model Specification

If $\Omega$ is not distributed according to $\mathcal{E}(\mu)$, the identities in (A6) no longer hold. Instead, we define
E ω [ ω ] = m ω and E ω [ ω 2 ] = Var [ ω ] + E [ ω ] 2 = s ω 2 + m ω 2 ,
where m ω and s ω 2 are the sample mean and variance of ω 1 , , ω n , respectively. Substituting in the sensitivity and variability matrices of Equations (A4) and (A5) gives
$A_n = 2 m_{\omega}\mu^{-3} - \mu^{-2}, \qquad B_n = \mu^{-2} - 2 m_{\omega}\mu^{-3} + (s_{\omega}^2 + m_{\omega}^2)\mu^{-4}.$
Let μ ^ be the maximum likelihood estimator of μ , and define the sample statistics
$m_{\omega} = \tfrac{1}{n}\sum_{t=1}^{n}\omega_t, \qquad s_{\omega}^2 = \tfrac{1}{n-1}\sum_{t=1}^{n}(\omega_t - m_{\omega})^2.$
The naive variance estimator then becomes
$\Sigma_n^{\mathrm{naive}} = \tfrac{1}{n}A_n^{-1} = (2 m_{\omega}\hat{\mu}^{-3} - \hat{\mu}^{-2})^{-1}/n = \hat{\mu}^{3}(2 m_{\omega} - \hat{\mu})^{-1}/n,$
and the sandwich variance estimator is
$\Sigma_n^{\mathrm{sand}} = \tfrac{1}{n}A_n^{-1} B_n A_n^{-1} = \hat{\mu}^{3}(2 m_{\omega} - \hat{\mu})^{-1}\,\{\hat{\mu}^{-2} - 2 m_{\omega}\hat{\mu}^{-3} + (s_{\omega}^2 + m_{\omega}^2)\hat{\mu}^{-4}\}\,\hat{\mu}^{3}(2 m_{\omega} - \hat{\mu})^{-1}/n = \hat{\mu}^{4}(2 m_{\omega} - \hat{\mu})^{-2}\{1 - 2 m_{\omega}\hat{\mu}^{-1} + (s_{\omega}^2 + m_{\omega}^2)\hat{\mu}^{-2}\}/n.$
The maximum likelihood estimate of μ can be derived from the full log-likelihood
$\mathcal{L}_n(\mu) = \sum_{t=1}^{n}\mathcal{L}_{\omega_t}(\mu) = -n\log(\mu) - \mu^{-1}\sum_{t=1}^{n}\omega_t,$
by setting its derivative to zero
$\tfrac{d}{d\mu}\mathcal{L}_n(\mu) = \tfrac{d}{d\mu}\big\{-n\log(\mu) - \mu^{-1}\,n\,m_{\omega}\big\} = 0.$
This results in the following expression for μ ^
$-n\hat{\mu}^{-1} + n\,m_{\omega}\hat{\mu}^{-2} = 0,$
from which it follows that the ML value of μ ^ = m ω , the sample mean of the data.
Substituting μ ^ = m ω into Equations (A8) and (A9) yields the following estimators of the naive and sandwich variances
$\Sigma_n = \Sigma_n^{\mathrm{naive}} = m_{\omega}^2/n \ \text{(naive variance)}, \qquad \Sigma_n^{\mathrm{sand}} = s_{\omega}^2/n \ \text{(sandwich variance)}.$
Since the n scores L ˙ ω 1 ( μ ^ ) , , L ˙ ω n ( μ ^ ) are independent, the variability matrix B n in Equation (A7) is not affected by serial dependence. Accordingly, no autocorrelation adjustment such as the Newey and West [61] correction in Equation (27) is required.
The omnibus scalar k for μ ^ = m ω is now equal to
$k = d\,/\,\mathrm{tr}\big\{(\Sigma_n^{\mathrm{naive}})^{-1}\Sigma_n^{\mathrm{sand}}\big\} = 1\,/\,\big\{(m_{\omega}^2/n)^{-1}(s_{\omega}^2/n)\big\} = m_{\omega}^2/s_{\omega}^2.$

Appendix D.3. Population Quantities

If we knew that the $\omega$'s were drawn from a gamma distribution $\mathcal{G}(a, b)$, then
$\mathbb{E}_{\omega}[\omega] = a b \quad \text{and} \quad \mathrm{Var}[\omega] = a b^2.$
We can substitute these expressions into the sensitivity and variability matrices of Equation (A11). This would give the following expressions for the naive and sandwich variances
$\Sigma^{\mathrm{naive}} = a^2 b^2/n \ \text{(naive variance)}, \qquad \Sigma^{\mathrm{sand}} = a b^2/n \ \text{(sandwich variance)}.$
This demonstrates that the sandwich variance can be either larger or smaller than the naive variance. Specifically, if $a < 1$ the sandwich variance exceeds the naive variance, whereas for $a > 1$ it yields smaller confidence intervals for $\hat{\mu}$. For $a = 1$, the two estimators coincide. This is intuitive, because setting $a = 1$ in the gamma PDF $f_{\mathcal{G}}(\omega \mid a, b)$ of Equation (28) recovers the exponential density $f_{\mathcal{E}}(\omega \mid \mu)$ with $b = \mu$.
We can also derive an expression for the theoretical omnibus scalar. Indeed, we can write
$k = d\,/\,\mathrm{tr}\big\{(\Sigma^{\mathrm{naive}})^{-1}\Sigma^{\mathrm{sand}}\big\} = 1\,/\,\big\{(a^2 b^2/n)^{-1}(a b^2/n)\big\} = a.$
Thus, the theoretical omnibus scalar is simply equal to a, the shape parameter of the gamma distribution. This confirms the relationship $\Sigma^{\mathrm{sand}} = a^{-1}\Sigma^{\mathrm{naive}}$.

Appendix D.4. Computer Implementation

We implement the naive and sandwich variance estimators of Equation (A11) in Matlab. We also compute credible intervals of $\hat{\mu}$ using OFS adjustment and magnitude-, curvature-, and sandwich-adjusted MCMC simulation with the Random Walk Metropolis algorithm. The script below computes Table 3 and Table 4 of this paper. Built-in functions are highlighted with a low dash.
[MATLAB listing for Tables 3 and 4 not reproduced here.]
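As in Appendix C, the published script appears as an image. The sketch below illustrates only the frequentist part of the computation (the naive and sandwich variances of μ̂ and the omnibus scalar for an exponential fit to gamma data); it is not the original code, omits the MCMC-based credible intervals, and gamrnd requires the Statistics Toolbox.
% Naive and sandwich variances of mu_hat for an exponential fit to gamma data (sketch)
a = 0.5; b = 0.2; n = 100; M = 1e4;              % settings of Table 3
out = zeros(M, 3);                               % [mu_hat, naive variance, sandwich variance]
for j = 1:M
    w  = gamrnd(a, b, n, 1);                     % data from G(a, b)
    mw = mean(w);                                % ML estimate mu_hat equals the sample mean
    s2 = var(w);                                 % sample variance (1/(n-1) normalization)
    out(j, :) = [mw, mw^2/n, s2/n];              % naive m^2/n and sandwich s^2/n
end
k = out(:, 2) ./ out(:, 3);                      % omnibus scalar m^2/s^2 per trial (theoretical value: a)
fprintf('naive %.2e  sandwich %.2e  mean k %.3f\n', mean(out(:, 2)), mean(out(:, 3)), mean(k))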

Appendix E

The Xinanjiang conceptual watershed model is the result of decades of work by Dr. Renjun Zhao and his colleagues at the Hydrological Bureau of the Ministry of Water Resources in China. The model’s initial formulation, based on a saturation-excess runoff mechanism and a top-down runoff generation approach, was developed in 1963 [65]. In 1980, it was formally named the Xinanjiang model [66], reflecting its intended application to the humid Xinanjiang river basin in China [94]. In a second development phase (1980–2002), several structural improvements were made, including a three-layer evapotranspiration module, the introduction of interflow as a runoff component, and the replacement of the original hydrograph method with a linear reservoir and/or lag-routing techniques.
The Xinanjiang model transforms areal average precipitation into streamflow by modeling control volumes, state variables, and fluxes as outlined in Figure A1.
Figure A1. Schematic illustration of the Xinanjiang conceptual watershed model. Blue boxes labeled in red are fictitious control volumes that govern the rainfall-runoff transformation. The model includes seven state variables: the tension ($w$) and free ($s_f$) water storages of the upper soil layer, the interflow ($s_i$) and groundwater ($s_g$) reservoirs, and the water levels $s_{r1}$, $s_{r2}$, and $s_{r3}$ of the routing reservoirs. Fluxes (arrows) describe water movement into and out of compartments: precipitation ($p_t$), runoff from impervious areas ($r_b$), infiltration ($p_i$), surface runoff from the contributing free area ($r_s$), evaporation ($e_1$), runoff ($r$), interflow ($r_i$), baseflow ($r_g$), delayed interflow ($q_i$), delayed baseflow ($q_g$), and surface runoff ($q_s$). These fluxes are computed as follows: $r_b = A_{\mathrm{im}} p_t$, $p_i = (1 - A_{\mathrm{im}}) p_t$, $r_s = r\{1 - (1 - s_f/s_{\max})^{\beta}\}$, $e_1 = e_{\mathrm{pan}}$ if $w > \mathrm{lm}$, $e_1 = (w/\mathrm{lm})\,e_{\mathrm{pan}}$ if $c \cdot \mathrm{lm} \le w \le \mathrm{lm}$, otherwise $e_1 = c \cdot e_{\mathrm{pan}}$, $r = p_i\{(0.5 - a)^{1-b}(w/w_{\max})^{b}\}$ if $(w/w_{\max}) \le 0.5 - a$ and $r = p_i\{1 - (0.5 + a)^{1-b}(1 - w/w_{\max})^{b}\}$ otherwise, $r_i = k_i s_f\{1 - (1 - s_f/s_{\max})^{\beta}\}$, $r_g = k_g s_f\{1 - (1 - s_f/s_{\max})^{\beta}\}$, $q_i = c_i s_i$, $q_g = c_g s_g$, and $q_s = r_b + r_s$, where $e_{\mathrm{pan}} = f_p e_p$ is pan evaporation, $e_p$ denotes the potential evapotranspiration, $w_{\max} = f_{\mathrm{wm}} s_{\mathrm{tot}}$ is the maximum tension water depth, $s_{\max} = (1 - f_{\mathrm{wm}}) s_{\mathrm{tot}}$ is the maximum free water depth, $\mathrm{lm} = f_{\mathrm{lm}} w_{\max}$ is the tension water threshold for evaporation change, and $f_p$, $A_{\mathrm{im}}$, $a$, $b$, $s_{\mathrm{tot}}$, $f_{\mathrm{wm}}$, $f_{\mathrm{lm}}$, $c$, $\beta$, $k_i$, $k_g$, $c_i$, $c_g$ and $k_f$ are free parameters. Total channel inflow $q_{\mathrm{ch}} = q_s + q_i + q_g$ is routed through three linear reservoirs (with identical recession constant $k_f$) and produces streamflow at the watershed outlet, $q_t = k_f s_{r3}$.
The Xinanjiang model is driven by daily time series of areal-average rainfall, ( p 1 , , p n ) , and potential evapotranspiration, ( e p 1 , , e p n ) . Our implementation follows the formulations of Zhao [94] and Jayawardena and Zhou [75], as summarized in ODE form by Knoben et al. [76], but includes two key additions: (i) an adjustment coefficient, f c , to convert meteorological estimates of potential evapotranspiration, e p (mm/d), into local estimates of actual evaporation; and (ii) a cascade of three linear reservoirs to route channel inflow and convert it into river discharge, q (mm/d).
Table A1 summarizes the state variables and fluxes associated with the control volumes of the Xinanjiang model, along with their corresponding symbols and units.
Table A1. State variables and fluxes of the Xinanjiang model.
Symbol | Description | Units
State:
w | Tension water storage in upper soil layer | mm
s_f | Free water storage in upper soil layer | mm
s_i | Interflow reservoir | mm
s_g | Groundwater reservoir | mm
s_rm | Water storage in cascade of routing reservoirs; m = 1, ..., 3 | mm
Fluxes into/out of compartments:
p_t | Precipitation | mm d−1
e_p | Potential evapotranspiration | mm d−1
e_pan | Pan evaporation | mm d−1
p_i | Infiltration | mm d−1
r_b | Direct runoff from impermeable area | mm d−1
r | Runoff from tension water | mm d−1
r_s | Surface runoff | mm d−1
r_i | Interflow | mm d−1
r_g | Baseflow | mm d−1
q_s | Surface runoff | mm d−1
q_i | Delayed interflow | mm d−1
q_g | Delayed baseflow | mm d−1
q_ch | Channel inflow | mm d−1
q_t | River discharge | mm d−1
A mass-conservative, second-order integration method with adaptive time stepping is used to solve for the state variables w, s f , s i , s g , s r 1 , s r 2 , and s r 3 , as well as the fluxes r x and q x into and out of the seven control volumes. A spin-up period is applied to minimize the influence of initial state conditions.
The fourteen parameters of the Xinanjiang model are listed in Table A2.
Table A2. Description of Xinanjiang parameters, including symbols, units, lower and upper bounds.
Symbol | Description | Units | Min. | Max.
f_p | Ratio of potential evapotranspiration to pan evaporation | - | 0.5 | 1.5
A_im | Impervious area | - | 10^-4 | 10^-1
a | Tension water distribution inflection parameter | - | -0.5 | 0.5
b | Tension water distribution shape parameter | - | 10^-1 | 2
f_wm | Fraction of s_tot that is w_max | - | 10^-3 | 1
f_lm | Fraction of w_max that is tension water threshold for evaporation change | - | 10^-3 | 1
c | Fraction of tension water threshold for second evaporation change | - | 10^-3 | 1
s_tot | Total soil moisture storage | mm | 1 | 10^3
β | Free water distribution shape parameter | - | 10^-3 | 2
k_i | Free water interflow parameter | d^-1 | 10^-3 | 3
k_g | Free water groundwater parameter | d^-1 | 10^-3 | 1
c_i | Interflow time coefficient | d^-1 | 10^-3 | 1
c_g | Baseflow time coefficient | d^-1 | 10^-3 | 1
k_f | Recession constant of routing reservoirs | d^-1 | 10^-1 | 5
The minimum and maximum values of the model parameters are collected in the d × 1 vectors θ min and θ max , respectively, where individual entries are denoted by θ j min and θ j max for all j = 1 , , d . This concludes the description of the Xinanjiang model.

Appendix F

This Appendix presents the correlation matrix of the naive posterior samples of the Xinanjiang model parameters and nuisance variables ν and ξ using the historical record of discharge measurements and Student t log-likelihood function of Equation (31).
Table A3. MCMC-derived naive posterior correlation matrix of the Xinanjiang model parameters and ν and ξ using the Student t likelihood.

Appendix G

In this Appendix, we compare the frequentist bread matrix $A_n^{\mathrm{s}}$ against the estimate $\hat{A}_n^{\mathrm{s}}$ derived from Bayesian inference using the naive posterior realizations, $\hat{A}_n^{\mathrm{s}} = \tfrac{1}{n}\,\mathrm{Cov}[\{\theta^{(b)}, \theta^{(b+1)}, \ldots, \theta^{(T)}\}]^{-1}$, where the first b samples $\theta^{(0)}, \ldots, \theta^{(b-1)}$ of the N Markov chains are discarded as burn-in.
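A compact way to form this estimator from the stored chain states is sketched below; it assumes the states of the N chains are stacked row-wise in the matrix chain, and the variable names are illustrative.
% Bayesian estimate of the bread matrix from naive posterior draws (sketch)
post  = chain(b+1:end, :);    % discard the first b rows as burn-in
A_hat = inv(cov(post)) / n;   % A_hat_n^s = (1/n) * Cov[posterior draws]^{-1}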
Table A4. Frequentist A n s and Bayesian A ^ n s estimator of the bread matrix of the Xinanjiang model parameters and nuisance variables ν and ξ for the Student t log-likelihood function.

Appendix H

In this Appendix, we present a scatterplot of the width of the $100\gamma\%$ Xinanjiang streamflow credible intervals as a function of the ML-simulated discharge. The blue squares and green dots correspond to the naive and sandwich variance estimators, respectively. The color tints represent different confidence levels.
Figure A2. Width of the 100 γ % streamflow credible intervals in units of mm/d resulting from naive (blue squares) and sandwich (green dots) parameter uncertainty, plotted as a function of the ML-simulated discharge values.

References

  1. Bayes, T. An essay toward solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F. R. S. communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S. Philos. Trans. R. Soc. Lond. 1763, 53, 370–418. [Google Scholar] [CrossRef]
  2. Bernstein, A.; von Mises, R. The Asymptotic Distribution of the Posterior in Bayesian Estimation. Ann. Math. Stat. 1949, 20, 743–752. [Google Scholar] [CrossRef]
  3. Fisher, R.A. On the probable error of a coefficient of correlation deduced from a small sample. Metron 1921, 1, 3–32. [Google Scholar] [CrossRef]
  4. Amari, S.I. Methods of Information Geometry. In Translations of Mathematical Monographs; American Mathematical Society: Providence, RI, USA, 2016; Volume 191. [Google Scholar]
  5. van der Vaart, A.W. Asymptotic Statistics; Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, UK, 1998. [Google Scholar] [CrossRef]
  6. Bartlett, M.S. Sufficiency and Statistical Inference. J. R. Stat. Soc. Ser. B 1955, 17, 268–273. [Google Scholar] [CrossRef]
  7. Miller, J.W. Asymptotic Normality, Concentration, and Coverage of Generalized Posteriors. J. Mach. Learn. Res. 2021, 22, 1–53. [Google Scholar]
  8. Vrugt, J.A.; Diks, C.G.H.; de Punder, R.; Grünwald, P. A Sandwich with Water: Bayesian/Frequentist Uncertainty Quantification under Model Misspecification. ARC Geophys. Res. 2025; in review. [Google Scholar]
  9. Kleijn, B.J.K.; van der Vaart, A.W. The Bernstein-Von-Mises theorem under misspecification. Electron. J. Stat. 2012, 6, 354–381. [Google Scholar] [CrossRef]
  10. Beven, K. On doing better hydrological science. Hydrol. Processes 2008, 22, 3549–3553. [Google Scholar] [CrossRef]
  11. Schoups, G.; Vrugt, J.A. A formal likelihood function for parameter and predictive inference of hydrologic models with correlated, heteroscedastic, and non-Gaussian errors. Water Resour. Res. 2010, 46. [Google Scholar] [CrossRef]
  12. Beven, K.; Smith, P. Concepts of Information Content and Likelihood in Parameter Calibration for Hydrological Simulation Models. J. Hydrol. Eng. 2015, 20, A4014010. [Google Scholar] [CrossRef]
  13. Beven, K.; Binley, A. The future of distributed models: Model calibration and uncertainty prediction. Hydrol. Processes 1992, 6, 279–298. [Google Scholar] [CrossRef]
  14. Kuczera, G.; Parent, E. Monte Carlo assessment of parameter uncertainty in conceptual catchment models: The Metropolis algorithm. J. Hydrol. 1998, 211, 69–85. [Google Scholar] [CrossRef]
  15. Kavetski, D.; Kuczera, G.; Franks, S.W. Semidistributed hydrological modeling: A “saturation path” perspective on TOPMODEL and VIC. Water Resour. Res. 2003, 39. [Google Scholar] [CrossRef]
  16. Vrugt, J.A.; Gupta, H.V.; Bouten, W.; Sorooshian, S. A Shuffled Complex Evolution Metropolis algorithm for optimization and uncertainty assessment of hydrologic model parameters. Water Resour. Res. 2003, 39. [Google Scholar] [CrossRef]
  17. Beven, K. A manifesto for the equifinality thesis. J. Hydrol. 2006, 320, 18–36. [Google Scholar] [CrossRef]
  18. Hoff, P.; Wakefield, J. Bayesian sandwich posteriors for pseudo-true parameters: A discussion of “Bayesian inference with misspecified models” by Stephen Walker. J. Stat. Plan. Inference 2013, 143, 1638–1642. [Google Scholar] [CrossRef][Green Version]
  19. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  20. Godambe, V.P. An Optimum Property of Regular Maximum Likelihood Estimation. Ann. Math. Stat. 1960, 31, 1208–1211. [Google Scholar] [CrossRef]
  21. Kauermann, G.; Carroll, R.J. A Note on the Efficiency of Sandwich Covariance Matrix Estimation. J. Am. Stat. Assoc. 2001, 96, 1387–1396. [Google Scholar] [CrossRef]
  22. Kass, R.E.; Raftery, A.E. Bayes Factors. J. Am. Stat. Assoc. 1995, 90, 773–795. [Google Scholar] [CrossRef]
  23. Bernardo, J.M.; Smith, A.F.M. Bayesian Theory; Wiley: Chichester, UK, 1994. [Google Scholar]
  24. Ribatet, M.; Cooley, D.; Davison, A.C. Bayesian Inference from Composite Likelihoods, with an application to spatial extremes. Stat. Sin. 2012, 22, 813–845. [Google Scholar]
  25. Shaby, B.A. The Open-Faced Sandwich Adjustment for MCMC Using Estimating Functions. J. Comput. Graph. Stat. 2014, 23, 853–876. [Google Scholar] [CrossRef]
  26. Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 1953, 21, 1087–1092. [Google Scholar] [CrossRef]
  27. Hastings, W.K. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 1970, 57, 97–109. [Google Scholar] [CrossRef]
  28. Vrugt, J.A. Markov chain Monte Carlo simulation using the DREAM software package: Theory, concepts, and MATLAB implementation. Environ. Model. Softw. 2016, 75, 273–316. [Google Scholar] [CrossRef]
  29. Cameron, A.; Trivedi, P. Microeconometrics: Methods and Applications; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar]
  30. Snedecor, G.W.; Cochran, W.G. Statistical Methods, 8th ed.; Iowa State University Press: Ames, IA, USA, 1989. [Google Scholar]
  31. Pauli, F.; Racugno, W.; Ventura, L. Bayesian composite marginal likelihoods. Stat. Sin. 2011, 21, 149–164. [Google Scholar]
  32. di San Miniato, M.L.; Sartori, N. Adjusted composite likelihood for robust Bayesian meta-analysis. arXiv 2021, arXiv:2104.01920. [Google Scholar] [CrossRef]
  33. Stoehr, J.; Friel, N. Calibration of conditional composite likelihood for Bayesian inference on Gibbs random fields. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; Lebanon, G., Vishwanathan, S.V.N., Eds.; Volume 38, pp. 921–929. [Google Scholar]
  34. Geys, H.; Molenberghs, G.; Ryan, L.M. Pseudolikelihood modelling of multivariate outcomes in developmental toxicology. J. Am. Stat. Assoc. 1999, 94, 734–745. [Google Scholar] [CrossRef]
  35. Varin, C. On composite marginal likelihoods. Adv. Stat. Anal. 2008, 92, 1–28. [Google Scholar] [CrossRef]
  36. Satterthwaite, F.E. An Approximate Distribution of Estimates of Variance Components. Biom. Bull. 1946, 2, 110–114. [Google Scholar] [CrossRef]
  37. Welch, B.L. The Generalization of ‘Student’s’ Problem when Several Different Population Variances are Involved. Biometrika 1947, 34, 28–35. [Google Scholar] [CrossRef]
  38. Chandler, R.E.; Bate, S. Inference for clustered data using the independence loglikelihood. Biometrika 2007, 94, 167–183. [Google Scholar] [CrossRef]
  39. Varin, C.; Reid, N.; Firth, D. An overview of composite likelihood methods. Stat. Sin. 2011, 21, 5–42. [Google Scholar]
  40. Gill, P.E.; Murray, W. Newton-type methods for unconstrained and linearly constrained optimization. Math. Program. 1974, 7, 311–350. [Google Scholar] [CrossRef]
  41. Gill, J.; King, G. What to Do When Your Hessian Is Not Invertible: Alternatives to Model Respecification in Nonlinear Estimation. Sociol. Methods Res. 2004, 33, 54–87. [Google Scholar] [CrossRef]
  42. Golub, G.H.; Reinsch, C. Singular Value Decomposition and Least Squares Solutions. Numer. Math. 1970, 14, 403–420. [Google Scholar] [CrossRef]
  43. Cauchy, A.L. Mémoire sur l’intégration des équations linéaires. Comptes Rendus Hebdomadaires des Séances de l’Académie des Sciences 1853, 36, 395–398. [Google Scholar]
  44. Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985. [Google Scholar] [CrossRef]
  45. Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. Numerical Recipes in C: The Art of Scientific Computing, 2nd ed.; Cambridge University Press: Cambridge, UK, 1992. [Google Scholar]
  46. Kessy, A.; Lewin, A.; Strimmer, K. Optimal Whitening and Decorrelation. Am. Stat. 2018, 72, 309–314. [Google Scholar] [CrossRef]
  47. Vecchia, A.V.; Cooley, R.L. Simultaneous confidence and prediction intervals for nonlinear regression models with application to a groundwater flow model. Water Resour. Res. 1987, 23, 1237–1250. [Google Scholar] [CrossRef]
  48. Kuczera, G. On the validity of first-order prediction limits for conceptual hydrologic models. J. Hydrol. 1988, 103, 229–247. [Google Scholar] [CrossRef]
  49. Cooley, R.L. Confidence Intervals for Ground-Water Models Using Linearization, Likelihood, and Bootstrap Methods. Groundwater 1997, 35, 869–880. [Google Scholar] [CrossRef]
  50. Vrugt, J.A.; Bouten, W. Validity of First-Order Approximations to Describe Parameter Uncertainty in Soil Hydrologic Models. Soil Sci. Soc. Am. J. 2002, 66, 1740–1751. [Google Scholar] [CrossRef]
  51. Kent, J.T. Robust Properties of Likelihood Ratio Test. Biometrika 1982, 69, 19–27. [Google Scholar] [PubMed]
  52. Müller, U.K. Risk of Bayesian Inference in Misspecified Models, and the Sandwich Covariance Matrix. Econometrica 2013, 81, 1805–1849. [Google Scholar] [CrossRef]
  53. Frazier, D.; Kohn, R.; Drovandi, C.; Gunawan, D. Reliable Bayesian Inference in Misspecified Models. Technical Report. arXiv 2023, arXiv:2302.06031. [Google Scholar] [CrossRef]
  54. Li, K.; Rice, K. A Bayesian “Sandwich” for Variance Estimation. Stat. Sci. 2024, 39, 589–600. [Google Scholar] [CrossRef]
  55. Zellner, A. Bayesian and non-Bayesian estimation using balanced loss functions. In Statistical Decision Theory and Related Topics V; Springer: New York, NY, USA, 1994; pp. 377–390. [Google Scholar]
  56. Dawid, P.; Sebastiani, P. Coherent dispersion criteria for optimal experimental design. Ann. Stat. 1999, 27, 65–81. [Google Scholar] [CrossRef]
  57. Vrugt, J.A.; Yumi de Oliveira, D.; Schoups, G.; Diks, C.G.H. On the use of distribution-adaptive likelihood functions: Generalized and universal likelihood functions, scoring rules and multi-criteria ranking. J. Hydrol. 2022, 615, 128542. [Google Scholar] [CrossRef]
  58. D’Errico, J. Adaptive Robust Numerical Differentiation. In Mathematics of Computation; American Mathematical Society: Providence, RI, USA, 2024. [Google Scholar]
  59. Richardson, L.F. The approximate arithmetical solution by finite differences of physical problems involving differential equations. Philos. Trans. R. Soc. Lond. Ser. A 1911, 210, 307–357. [Google Scholar]
  60. Romberg, W. Vereinfachte numerische Integration. In Det Kongelige Norske Videnskabers Selskab Forhandlinger; F. Bruns Bokhandel: Trondheim, Norway, 1955; Volume 28, pp. 30–36. [Google Scholar]
  61. Newey, W.K.; West, K.D. A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica 1987, 55, 703–708. [Google Scholar] [CrossRef]
  62. Anderson, T.W. The statistical analysis of time series. In Wiley Series in Probability and Mathematical Statistics; Wiley: New York, NY, USA, 1971. [Google Scholar]
  63. Bartlett, M.S. On the theoretical specification and sampling properties of autocorrelated time-series. Suppl. J. R. Stat. Soc. 1946, 8, 27–41. [Google Scholar] [CrossRef]
  64. Andrews, D.W.K. Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation. Econometrica 1991, 59, 817–858. [Google Scholar] [CrossRef]
  65. Zhao, R.; Zhuang, Y. Regional Patterns of Rainfall-Runoff Relationship. J. Hohai Univ. 1963, S2, 53–68. (In Chinese) [Google Scholar]
  66. Zhao, R.; Zhuang, Y.; Fang, L.; Liu, X.; Zhang, Q. The Xinanjiang Model. In Proceedings of the Oxford Symposium on Hydrological Forecasting, Oxford, UK, 15–18 April 1980; UNESCO-WMO Symposium: Geneva, Switzerland, 1980. [Google Scholar]
  67. Vrugt, J.A.; ter Braak, C.J.F.; Clark, M.P.; Hyman, J.M.; Robinson, B.A. Treatment of input uncertainty in hydrologic modeling: Doing hydrology backward with Markov chain Monte Carlo simulation. Water Resour. Res. 2008, 44, W00B09. [Google Scholar] [CrossRef]
  68. Vrugt, J.A.; ter Braak, C.; Diks, C.; Robinson, B.A.; Hyman, J.M.; Higdon, D. Accelerating Markov Chain Monte Carlo Simulation by Differential Evolution with Self-Adaptive Randomized Subspace Sampling. Int. J. Nonlinear Sci. Numer. Simul. 2009, 10, 273–290. [Google Scholar] [CrossRef]
  69. Laloy, E.; Vrugt, J.A. High-dimensional posterior exploration of hydrologic models using multiple-try DREAM(ZS) and high-performance computing. Water Resour. Res. 2012, 48, W01526. [Google Scholar] [CrossRef]
  70. Raftery, A.E.; Lewis, S. How many iterations in the Gibbs sampler? In Bayesian Statistics 4; Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M., Eds.; Oxford University Press: Oxford, UK, 1992; Volume 91, pp. 763–773. [Google Scholar]
  71. Geweke, J. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In Bayesian Statistics 4; Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M., Eds.; Oxford University Press: Oxford, UK, 1992; Volume 91, pp. 169–193. [Google Scholar]
  72. Gelman, A.; Rubin, D. Inference from iterative simulation using multiple sequences. Stat. Sci. 1992, 7, 457–511. [Google Scholar] [CrossRef]
  73. Brooks, S.; Gelman, A. General methods for monitoring convergence of iterative simulations. J. Comput. Graph. Stat. 1998, 7, 434–455. [Google Scholar] [CrossRef]
  74. Cowles, M.K.; Carlin, B.P. Markov chain Monte Carlo convergence Diagnostics: A comparative review. J. Am. Stat. Assoc. 1996, 91, 883–904. [Google Scholar] [CrossRef]
  75. Jayawardena, A.; Zhou, M. A modified spatial soil moisture storage capacity distribution curve for the Xinanjiang model. J. Hydrol. 2000, 227, 93–113. [Google Scholar] [CrossRef]
  76. Knoben, W.J.M.; Woods, R.A.; Freer, J.E. A Quantitative Hydrological Climate Classification Evaluated with Independent Streamflow Data. Water Resour. Res. 2018, 54, 5088–5109. [Google Scholar] [CrossRef]
  77. Scharnagl, B.; Iden, S.C.; Durner, W.; Vereeken, H.; Herbst, M. Inverse modelling of in situ soil water dynamics: Accounting for heteroscedastic, autocorrelated, and non-Gaussian distributed residuals. Hydrol. Earth Syst. Sci. Discuss. 2015, 12, 2155–2199. [Google Scholar]
  78. Christensen, S.; Cooley, R.L. Evaluation of confidence intervals for a steady-state leaky aquifer model. Adv. Water Resour. 1999, 22, 807–817. [Google Scholar] [CrossRef]
  79. Vugrin, K.W.; Swiler, L.P.; Roberts, R.M.; Stucky-Mack, N.J.; Sullivan, S.P. Confidence region estimation techniques for nonlinear regression in groundwater flow: Three case studies. Water Resour. Res. 2007, 43, W03423. [Google Scholar] [CrossRef]
  80. Lu, D.; Ye, M.; Hill, M.C. Analysis of regression confidence intervals and Bayesian credible intervals for uncertainty quantification. Water Resour. Res. 2012, 48, 1087–1096. [Google Scholar] [CrossRef]
  81. Leamer, E.E. Multicollinearity: A Bayesian Interpretation. Rev. Econ. Stat. 1973, 55, 371–380. [Google Scholar] [CrossRef]
  82. Gill, P.E.; Murray, W.; Wright, M.H. Practical Optimization; Academic Press: London, UK; New York, NY, USA, 1981. [Google Scholar]
  83. White, H. Maximum Likelihood Estimation of Misspecified Models. Econometrica 1982, 50, 1–25. [Google Scholar] [CrossRef]
  84. Wald, A. Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large. Trans. Am. Math. Soc. 1943, 54, 426–482. [Google Scholar] [CrossRef]
  85. Vrugt, J.A. Distribution-Based Model Evaluation and Diagnostics: Elicitability, Propriety, and Scoring Rules for Hydrograph Functionals. Water Resour. Res. 2024, 60, e2023WR036710. [Google Scholar] [CrossRef]
  86. Watanabe, S. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J. Mach. Learn. Res. 2010, 11, 3571–3594. [Google Scholar]
  87. Vehtari, A.; Gelman, A.; Gabry, J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat. Comput. 2017, 27, 1413–1432. [CrossRef]
  88. Fréchet, M. Sur la distance de deux lois de probabilité. Ann. L’Isup 1957, 3, 183–198. [Google Scholar]
  89. Dowson, D.C.; Landau, B.V. The Fréchet distance between multivariate normal distributions. J. Multivar. Anal. 1982, 12, 450–455. [Google Scholar] [CrossRef]
  90. Herfindahl, O.C. Concentration in the U.S. Steel Industry. Ph.D. Thesis, Columbia University, New York, NY, USA, 1950. [Google Scholar]
  91. Hirschman, A.O. The Paternity of an Index. Am. Econ. Rev. 1964, 54, 1044–1050. [Google Scholar]
  92. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723. [Google Scholar] [CrossRef]
  93. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
  94. Zhao, R.J. The Xinanjiang model applied in China. J. Hydrol. 1992, 135, 371–381. [Google Scholar] [CrossRef]
Figure 1. Consequences of model misspecification. Let $\mathcal{M}$ denote the family of densities $f(\omega \mid \theta)$ used to model observations $\omega_1, \ldots, \omega_n$ of a random variable $\Omega$. Suppose the true density $q_{\Omega}(\omega \mid \theta_0)$ lies outside $\mathcal{M}$ (two examples). The true parameters $\theta_0$ are unattainable and the best approximation is given by the pseudo-true parameters $\theta_* = (\theta_{1,*}, \ldots, \theta_{d,*}) = \arg\min_{\theta \in \Theta} d_{\mathrm{KL}}\big\{q_{\Omega}(\omega \mid \theta_0),\, f(\omega \mid \theta)\big\}$.
Figure 2. Normal log-likelihood L n n ( m s 2 ) from Equation (5) as a function of the mean m [ 5 , 5 ] of the Gaussian model y N ( m , s 2 ) , shown for three variances: s 2 = 1 / 2 , s 2 = 1 , and s 2 = 2 . The data ω 1 , , ω n are sampled from a normal distribution Ω N ( μ , σ 2 ) with μ = 0 , σ 2 = 1 , and sample size n = 100 . The vertical dashed gray line indicates the value of m that maximizes the log-likelihood.
Figure 3. Histogram of the omnibus scalar k for the M = 10 4 Monte Carlo simulations using (a) s 2 = 2 , (b) s 2 = 1 and (c) s 2 = 0.5 . The × in each graph corresponds to the mean value of k.
Figure 4. Relationship between number n of data points of the data generating process, Ω N ( 0 , 1 ) and the naive variance Σ n naive of the ML estimate m ^ of the mean of the normal distribution model N ( m , s 2 ) using s 2 = 2 (blue), s 2 = 1 (green) and s 2 = 1 / 2 (red). The black line displays the evolution of the sandwich variance estimator Σ n sand . This estimator is invariant to the choice of s 2 .
Figure 5. Histogram of the omnibus scalar k for M = 10 4 Monte Carlo simulations using the (a) Gamma, (b) Normal, (c) Lognormal, (d) Weibull and (e) Beta distributions for the data generating process. The mean value of k is separately displayed in each graph with the solid cross, whereas the vertical black line is the theoretic value of the omnibus scalar.
Figure 6. Marginal posterior distributions (blue histograms) of the Xinanjiang parameters (a) f p , (b) A im , (c) a, (d) b, (e) f wm , (f) f lm , (g) c, (h) s tot , (i) β , (j) k i , (k) k g , (l) c i , (m) c g , and (n) k f obtained from the DREAM(ZS) algorithm. Inference is based on the Student t likelihood function L n s ( θ , ν , ξ s 0 = 10 4 ) and a uniform prior distribution. The solid blue lines display the normal marginal distributions derived from the naive frequentist variance estimator. The red × corresponds to the ML estimator, whereas the red square is the MAP solution of the sampled Markov chains. To conserve space, we do not display numerical labels on the y-axis.
Figure 7. Marginal distributions of the OFS-adjusted naive posterior samples of the Xinanjiang parameters (a) f p , (b) A im , (c) a, (d) b, (e) f wm , (f) f lm , (g) c, (h) s tot , (i) β , (j) k i , (k) k g , (l) c i , (m) c g and (n) k f obtained from Equation (11). The solid blue and green lines display the normal frequentist distributions of the naive and sandwich variance estimators. The blue histograms correspond to the naive posterior parameter distributions of Figure 6 and the red × highlights the ML solution.
Figure 8. Scatter plot matrix of bivariate confidence and credible regions for all pairs of Xinanjiang model parameters. The ellipsoids show frequentist 95% confidence intervals estimated from the naive variance $\Sigma_n^{\mathrm{naive}} = \tfrac{1}{n}A_n^{-1}$ (in blue) and the sandwich variance $\Sigma_n^{\mathrm{sand}} = \tfrac{1}{n}A_n^{-1}B_nA_n^{-1}$ (in green). The blue squares and green dots represent the 95% credible regions of the naive and sandwich-adjusted posterior distributions sampled by the DREAM(ZS) algorithm. Axis values are omitted to save space.
Figure 9. Comparison of 95% parameter credible regions derived from (1) OFS-adjusted sandwich samples and (2) magnitude-, (3) curvature-, and (4) sandwich-adjusted MCMC simulation using Algorithms 1–3, respectively: (a1–a4) $f_p$–$A_{\mathrm{im}}$, (b1–b4) $a$–$b$, (c1–c4) $f_{\mathrm{wm}}$–$f_{\mathrm{lm}}$, (d1–d4) $f_{\mathrm{lm}}$–$c$, (e1–e4) $s_{\mathrm{tot}}$–$\beta$, (f1–f4) $k_i$–$b$, (g1–g4) $k_g$–$c_i$, and (h1–h4) $c_g$–$k_f$. The OFS-adjusted posterior samples are obtained from Equation (11) using $\Psi_n = A_n^{-1} B_n^{1/2} A_n^{1/2}$ with matrix square roots $A_n^{1/2}$ and $B_n^{1/2}$ computed according to Equation (13) using singular value decomposition. The blue and green ellipsoids are the 95% confidence regions of the frequentist naive and sandwich variance estimators. Red lines delineate the boundaries of the standard uniform prior distribution.
Figure 10. Simulation intervals for Xinanjiang streamflow based on the (a) naive and (b) sandwich variance estimators. Bands show the 68%, 90%, 95%, and 99% parameter uncertainty induced discharge intervals. Red dots are the observed discharge.
Figure 11. Histogram (gray bins) of the studentized residuals ϵ 1 ( θ ^ ) , , ϵ n ( θ ^ ) of the Xinanjiang discharge simulation and SST density f SST ( ϵ 0 , 1 , ν , ξ ) using ML values of the model parameters, degrees of freedom ν and kurtosis ξ . The probability density of the standard normal distribution f N ( ϵ μ = 0 , σ 2 = 1 ) is separately displayed with a dashed black line.
Table 1. Normal distribution model $N(m, s^2)$: ML estimate $\hat{m}$, and associated values of the information matrices $A_n$ and $B_n$, omnibus scalar $\hat{k}$, and naive and sandwich variance estimators using the normal log-likelihood function $\mathcal{L}_n^{\mathrm{n}}(m \mid s^2)$ in Equation (5) with $s^2 = 2$, $s^2 = 1$, and $s^2 = 0.5$. Tabulated values are averages over $M = 10^4$ different realizations of $\omega_1, \ldots, \omega_{100}$ sampled from the data-generating process. Standard deviations are listed in parentheses.
$s^2$ | $\hat{m}$ | $A_n$ | $B_n$ | $\hat{k}$ | $\Sigma_n^{\mathrm{naive}}$ | $\Sigma_n^{\mathrm{sand}}$
2 | −0.001 (0.099) | 0.500 (0.000) | 0.247 (0.036) | 2.041 (0.298) | 0.020 (0.000) | 0.010 (0.001)
1 | −0.001 (0.099) | 1.000 (0.000) | 0.990 (0.143) | 1.021 (0.149) | 0.010 (0.000) | 0.010 (0.001)
0.5 | −0.001 (0.099) | 2.000 (0.000) | 3.959 (0.568) | 0.516 (0.089) | 0.005 (0.000) | 0.010 (0.001)
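Because the fixed-variance normal model has a closed-form score and Hessian, the entries of Table 1 can be reproduced with a few lines of MATLAB. The sketch below is not the authors' Appendix code; it assumes the data-generating process $N(0, 1)$ with $n = 100$ and takes the omnibus scalar as $\hat{k} = A_n / B_n$, which is consistent with the tabulated values for this scalar-parameter example.

```matlab
% Normal model N(m, s2) with fixed variance s2, fitted to data from N(0,1)
rng(1); M = 1e4; n = 100; s2 = 2;          % s2 = 2 gives the first row of Table 1
mhat = zeros(M,1); An = zeros(M,1); Bn = zeros(M,1);
for i = 1:M
    w       = randn(n,1);                  % data from N(0,1)
    mhat(i) = mean(w);                     % ML estimate of m
    u       = (w - mhat(i)) / s2;          % score contributions d(logL)/dm
    An(i)   = 1 / s2;                      % sensitivity: -E[d2(logL)/dm2]
    Bn(i)   = mean(u.^2);                  % variability: mean squared score
end
khat      = An ./ Bn;                      % omnibus scalar (scalar-parameter case)
var_naive = 1 ./ (n * An);                 % naive variance  (1/n) An^-1
var_sand  = Bn ./ (n * An.^2);             % sandwich variance (1/n) An^-1 Bn An^-1
fprintf('mhat %.3f  An %.3f  Bn %.3f  khat %.3f  naive %.4f  sand %.4f\n', ...
    mean(mhat), mean(An), mean(Bn), mean(khat), mean(var_naive), mean(var_sand))
```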
Table 2. Coverage (in %) of the true mean $\mu$ of the data-generating process $N(\mu, \sigma^2)$ by the $100(1-\alpha)\%$ confidence intervals of the ML estimate $\hat{m}$ under the normal distribution model $N(m, s^2)$ with $s^2 = 2$, $s^2 = 1$, and $s^2 = 0.5$ using the naive and sandwich variance estimators. Results are based on the code in Appendix C using $M = 10^3$ trials with $\mu = 0$, $\sigma^2 = 1$, and sample size $n = 100$.
$\alpha$ | $s^2 = 2$ |  | $s^2 = 1$ |  | $s^2 = 0.5$ |  | Theoretical Coverage
 | $\Sigma_n^{\mathrm{naive}}$ | $\Sigma_n^{\mathrm{sand}}$ | $\Sigma_n^{\mathrm{naive}}$ | $\Sigma_n^{\mathrm{sand}}$ | $\Sigma_n^{\mathrm{naive}}$ | $\Sigma_n^{\mathrm{sand}}$ |
0.01 | 100.00 | 98.70 | 98.90 | 98.70 | 93.00 | 98.70 | 99%
0.05 | 99.10 | 94.80 | 94.70 | 94.80 | 83.10 | 94.80 | 95%
0.10 | 97.90 | 90.00 | 90.30 | 90.00 | 75.70 | 90.00 | 90%
0.20 | 92.70 | 79.20 | 79.20 | 79.20 | 64.00 | 79.20 | 80%
0.30 | 84.90 | 71.60 | 71.40 | 71.60 | 53.40 | 71.60 | 70%
0.40 | 75.90 | 60.10 | 60.40 | 60.00 | 44.00 | 60.00 | 60%
0.50 | 67.00 | 49.70 | 49.80 | 49.80 | 34.50 | 49.80 | 50%
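The coverage figures in Table 2 follow from a Monte Carlo loop over confidence intervals $\hat{m} \pm z_{1-\alpha/2}\sqrt{\Sigma_n}$. A minimal MATLAB sketch for one $(s^2, \alpha)$ combination is shown below; it is an independent re-implementation, not the Appendix C code.

```matlab
% Coverage of the true mean by naive and sandwich confidence intervals
rng(2); M = 1e3; n = 100; s2 = 2; mu = 0; alpha = 0.05;
z = norminv(1 - alpha/2);                  % normal critical value
cover_naive = 0; cover_sand = 0;
for i = 1:M
    w    = mu + randn(n,1);                % data-generating process N(mu, sigma^2 = 1)
    mhat = mean(w);
    An   = 1/s2;  Bn = mean(((w - mhat)/s2).^2);
    se_naive = sqrt(1/(n*An));             % naive standard error
    se_sand  = sqrt(Bn/(n*An^2));          % sandwich standard error
    cover_naive = cover_naive + (abs(mhat - mu) <= z*se_naive);
    cover_sand  = cover_sand  + (abs(mhat - mu) <= z*se_sand);
end
fprintf('naive %.1f%%  sandwich %.1f%%  nominal %.0f%%\n', ...
    100*cover_naive/M, 100*cover_sand/M, 100*(1-alpha))
```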
Table 3. Coverage (in %) of the true mean $\mu = a \cdot b$ of the data-generating process $G(a, b)$ by the $100(1-\alpha)\%$ confidence intervals of the ML estimate $\hat{\mu}$ under the exponential distribution $E(\mu)$ using the naive and sandwich variance estimators. Results are based on $M = 10^4$ trials with $a = 0.5$, $b = 0.2$ and sample size $n = 100$. The MATLAB code is given in Appendix D.
Method / Coverage | $\alpha = 0.01$ | $\alpha = 0.05$ | $\alpha = 0.1$ | $\alpha = 0.2$ | $\alpha = 0.3$ | $\alpha = 0.4$ | $\alpha = 0.5$
 | 99% | 95% | 90% | 80% | 70% | 60% | 50%
Naive estimator | 92.98 | 83.56 | 75.62 | 63.71 | 54.22 | 45.26 | 37.19
Sandwich estimator | 97.71 | 93.71 | 88.46 | 79.04 | 68.73 | 59.38 | 49.72
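For gamma data fitted with an exponential model, the naive standard error is $\hat{\mu}/\sqrt{n}$, whereas the sandwich standard error is essentially the sample standard deviation divided by $\sqrt{n}$; this difference explains the undercoverage of the naive intervals in Table 3. The MATLAB sketch below is an independent re-implementation of this experiment, not the Appendix D code.

```matlab
% Gamma data G(a,b) fitted with an exponential model E(mu)
rng(3); M = 1e4; n = 100; a = 0.5; b = 0.2; mu = a*b; alpha = 0.05;
z = norminv(1 - alpha/2);
cover_naive = 0; cover_sand = 0;
for i = 1:M
    w     = gamrnd(a, b, n, 1);             % gamma draws with mean a*b
    muhat = mean(w);                         % ML estimate of mu under E(mu)
    An    = 1/muhat^2;                       % exponential sensitivity at muhat
    Bn    = mean(((w - muhat)/muhat^2).^2);  % variability of the score
    se_naive = sqrt(1/(n*An));               % = muhat/sqrt(n)
    se_sand  = sqrt(Bn/(n*An^2));            % ~ std(w)/sqrt(n)
    cover_naive = cover_naive + (abs(muhat - mu) <= z*se_naive);
    cover_sand  = cover_sand  + (abs(muhat - mu) <= z*se_sand);
end
fprintf('naive %.2f%%  sandwich %.2f%%  nominal %.0f%%\n', ...
    100*cover_naive/M, 100*cover_sand/M, 100*(1-alpha))
```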
Table 4. Coverage (in %) of the true mean $\mu = a \cdot b$ of the data-generating process $G(a, b)$ by the $100(1-\alpha)\%$ credible intervals of the parameter $\hat{\mu}$ of $E(\mu)$ obtained from standard MCMC simulation (naive MCMC), OFS-adjusted naive posterior samples, and magnitude-, curvature-, and sandwich-adjusted DREAM(ZS) algorithms. Results are based on $M = 10^4$ trials with $a = 0.5$, $b = 0.2$ and $n = 100$.
Method | Likelihood | $\alpha = 0.01$ | $\alpha = 0.05$ | $\alpha = 0.1$ | $\alpha = 0.2$ | $\alpha = 0.3$ | $\alpha = 0.4$ | $\alpha = 0.5$
 |  | 99% | 95% | 90% | 80% | 70% | 60% | 50%
Naive MCMC | $L_n(\mu)$ | 92.33 | 83.11 | 75.30 | 63.69 | 53.38 | 44.64 | 36.38
OFS adjustment | $L_n(\mu)$ | 94.93 | 90.69 | 85.98 | 76.36 | 66.57 | 57.04 | 47.46
Algorithm 1 | $L_n(\mu)^{k}$ | 98.39 | 93.67 | 88.98 | 78.61 | 68.89 | 58.93 | 48.92
Algorithm 2 | $L_n^{\mathrm{ca}}(\mu)$ | 98.41 | 93.69 | 88.86 | 78.87 | 68.80 | 59.14 | 48.86
Algorithm 3 | $L_n^{\mathrm{p}}(\mu \mid \lambda)$ | 98.49 | 93.78 | 88.96 | 78.58 | 68.96 | 59.44 | 49.31
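Algorithm 1 tempers the likelihood with the omnibus scalar, i.e., it samples from a posterior proportional to $P(\mu)\,L_n(\mu)^{k}$. The paper uses the DREAM(ZS) sampler for this purpose; the sketch below substitutes a plain random-walk Metropolis step and a flat prior on $\mu > 0$ purely to illustrate the tempering mechanics, and all tuning constants are illustrative.

```matlab
% Magnitude-adjusted (tempered) sampling of E(mu) fitted to gamma data:
% the log-likelihood is multiplied by the scalar khat = An/Bn
rng(4); n = 100; w = gamrnd(0.5, 0.2, n, 1);
loglik = @(mu) sum(-log(mu) - w./mu);       % exponential log-likelihood
muhat  = mean(w);
An     = 1/muhat^2;
Bn     = mean(((w - muhat)/muhat^2).^2);
khat   = An/Bn;                              % learning rate (scalar-parameter case)
T = 2e4; mu = muhat; chain = zeros(T,1);
for t = 1:T
    prop = mu + 0.02*randn;                  % symmetric random-walk proposal
    if prop > 0 && log(rand) < khat*(loglik(prop) - loglik(mu))
        mu = prop;                           % accept
    end
    chain(t) = mu;
end
ci = quantile(chain(T/2+1:end), [0.025 0.975]);   % 95% credible interval
fprintf('95%% credible interval: [%.4f, %.4f]\n', ci(1), ci(2))
```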
Table 5. Total sensitivity $A_n^{\mathrm{s}}$ and variability $B_n^{\mathrm{s}}$ matrices of the Xinanjiang model parameters and the nuisance variables $\nu$ and $\xi$ of the Student t likelihood.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
