Decision-Making Under Model Misspecification: DRO with Robust Bayesian Ambiguity Sets

Dellaporta, Charita; O’Hara, Patrick; Damoulas, Theodoros

doi:10.3390/e28040430


by Charita Dellaporta ^1,*,†,‡, Patrick O’Hara ^2,‡ and Theodoros Damoulas ^2,3

¹ Department of Statistical Science, University College London, London WC1E 6BT, UK

² Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK

³ Department of Statistics, University of Warwick, Coventry CV4 7AL, UK

* Author to whom correspondence should be addressed.

† Part of this work was undertaken while CD was affiliated with 3.

‡ These authors contributed equally to this work.

Entropy 2026, 28(4), 430; https://doi.org/10.3390/e28040430

Submission received: 24 February 2026 / Revised: 31 March 2026 / Accepted: 9 April 2026 / Published: 11 April 2026

(This article belongs to the Special Issue Statistical Inference: Theory and Methods)


Abstract

Distributionally Robust Optimisation (DRO) protects risk-averse decision-makers by considering the worst-case risk within an ambiguity set of distributions based on the empirical distribution or a model. To further guard against finite, noisy data, model-based approaches admit Bayesian formulations that propagate uncertainty from the posterior to the decision-making problem. However, when the model is misspecified, the decision-maker must stretch the ambiguity set to contain the data-generating process (DGP), leading to overly conservative decisions. We address this challenge by introducing DRO with Robust Bayesian Ambiguity Sets (DRO-RoBAS), designed to be robust to model misspecification. These are Maximum Mean Discrepancy ambiguity sets centred at a robust posterior predictive distribution that incorporates beliefs about the DGP. We show that the resulting optimisation problem admits a dual formulation in the Reproducing Kernel Hilbert Space and we give probabilistic guarantees on the tolerance level of the ambiguity set. Our method outperforms other Bayesian and empirical DRO approaches in out-of-sample performance on the Newsvendor and Portfolio problems with various cases of model misspecification.

Keywords:

robustness; Bayesian inference; stochastic optimisation; misspecification; divergence-based inference

1. Introduction

Decision-makers frequently encounter the challenge of optimising under uncertainty since the data-generating process (DGP) is not fully known. As a result, they must rely on available data and model families to estimate the DGP within the optimisation objective. However, the data may be noisy, the data distribution might change with time, or the model might be misspecified or poorly fitted, leading to distributional uncertainty. Decision-making in this setting is a critical challenge in various applications such as inventory planning [1], portfolio optimisation [2] and distribution shifts in machine learning applications [3].

A risk-averse decision-maker might choose to hedge against distributional uncertainty by finding the decision that minimises the worst-case risk over a set of distributions. This worst-case protection is at the heart of Distributionally Robust Optimisation (DRO) that defines an ambiguity set of distributions with respect to an estimator of the DGP. This can be fully data-driven, using the empirical measure of the observations (see, e.g., [4,5,6,7]), or model-based when expert knowledge is available to fit a model to available data, resulting in a model-based estimator for the DGP [8,9,10]. Both approaches are sensitive to the choice of DGP estimator and additional estimation error might exist in model-based DRO due to a poor fit or model uncertainty.

To overcome this, recently developed Bayesian formulations of DRO use posterior beliefs to inform the optimisation problem [11] or the ambiguity set itself [12]. However, these methods inherit the sensitivity of Bayesian posteriors to model misspecification (see, e.g., [13,14]). A key goal in DRO methodology is to choose the size of the ambiguity set such that the DGP falls within it with high probability, as illustrated in Figure 1 (left) for the two formulations of Bayesian Ambiguity Sets (BAS) [12] and our proposed Robust BAS (RoBAS). If the estimate is not accurate—for example, when the model is misspecified—then a much larger size will be required to contain the DGP as illustrated in Figure 1 (right). The price to pay for this large size is the inclusion of many probability distributions that are unlikely to occur, and which could be very pessimistic with respect to the objective function, leading to an overly conservative decision. If the decision-maker wrongly assumes the model is well specified and incorrectly chooses an overly optimistic ambiguity set size, then the DGP may not lie in the set, and the decision could be overly optimistic compared to the out-of-sample outcome, often referred to as the optimiser’s curse [4].

Uncertainty over the DGP is extensively studied outside stochastic optimisation. Consider a parametric model

P_{θ}

, indexed by the parameter of interest

θ

. In the Bayesian framework, uncertainty about the parameter is typically expressed directly through prior beliefs. However, recent work in robust Bayesian inference by Lyddon et al. [15] takes a different approach: uncertainty about the parameter is induced by uncertainty, in the form of prior beliefs, about the DGP. This concept lies at the core of the Bayesian Nonparametric Learning (NPL) framework [15,16], which relaxes the well-specified model assumption imposed by standard Bayesian inference. In this spirit, we extend the recently proposed DRO-BAS framework [12] to tackle model misspecification through a robust NPL posterior coupled with the Maximum Mean Discrepancy (MMD) inside the ambiguity set, thus introducing DRO with Robust Bayesian Ambiguity Sets (DRO-RoBAS). While DRO-BAS targets distributional uncertainty with respect to the DGP, the robustness it offers is not sufficient under model misspecification (see Figure 1).

2. Background

Let

x \in X \subseteq R^{d}

be a decision-making variable for the cost function

f : X \times Ξ \to R

with data space

Ξ \subseteq R^{D}

. Let

{ξ_{i}}_{i = 1}^{n} \overset{iid}{\sim} P^{⋆}

be observations from the DGP

P^{⋆} \in P (Ξ)

, where

P (Ξ)

denotes the space of Borel probability measures on

Ξ

. Furthermore, consider a parametric model family

P_{Θ} : = {P_{θ} : θ \in Θ} \subset P (Ξ)

indexed by parameter of interest

θ \in Θ \subseteq R^{k}

. We say the model is misspecified if

P^{⋆} \notin P_{Θ}

. DRO methods construct an ambiguity set

A

, based on an estimator of

P^{⋆}

, called the nominal distribution, and minimise the worst-case expected cost over

A

.

The DRO literature typically categorises ambiguity sets into two classes [17]: moment-based and discrepancy-based. The former contain distributions that satisfy constraints on the moments of

P^{⋆}

, without necessarily considering an estimator. In contrast, discrepancy-based ambiguity sets consist of distributions close to the nominal according to a specified discrepancy measure. Examples include Integral Probability Metrics (IPMs) [18], such as the Wasserstein distance [4] and the MMD [5], as well as

ϕ

-divergences like the Kullback–Leibler (KL) divergence [7]. Regardless of the choice of ambiguity set, the resulting minimax problem can be seen as a game between the decision-maker who chooses x and an adversary who chooses the worst-case distribution in

A

:

\begin{matrix} min_{x \in X} sup_{P \in A} E_{ξ \sim P} [f_{x} (ξ)] \end{matrix}

(1)

where

f_{x} (ξ) : = f (x, ξ)

. Although most discrepancy-based DRO methods are fully empirical, i.e., the estimator is obtained only via

ξ_{1 : n}

, sometimes, such as in regression settings, the decision-maker needs to model the variables’ relationship via a model family

P_{Θ}

, describing the DGP. Model-based DRO methods (e.g., [8,9,10]) use the observations to obtain an estimator

P_{\hat{θ}} \in P_{Θ}

and use this as the nominal distribution. Thus, a poorly chosen

P_{\hat{θ}}

far from

P^{⋆}

(in some distance sense) requires a large

A

, leading to overly pessimistic decisions. This has led to Bayesian formulations of DRO which propagate uncertainty about

θ

in the optimisation problem.
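The game in (1) can be illustrated on a toy instance. The sketch below is only an illustration under stated assumptions: the ambiguity set is a hand-picked finite list of discrete distributions (not a discrepancy ball), the cost is a newsvendor-style cost with assumed unit costs h = 1 and b = 3, and all function names are hypothetical.

```python
def newsvendor_cost(x, xi, h=1.0, b=3.0):
    """Illustrative cost: h per unsold unit, b per unit of unmet demand (assumed values)."""
    return h * max(x - xi, 0.0) + b * max(xi - x, 0.0)


def expected_cost(x, dist, cost):
    """Expected cost of decision x under a discrete distribution {outcome: probability}."""
    return sum(p * cost(x, xi) for xi, p in dist.items())


def worst_case_cost(x, ambiguity_set, cost):
    """Inner supremum of (1): the adversary picks the worst distribution for x."""
    return max(expected_cost(x, dist, cost) for dist in ambiguity_set)


def dro_decision(decisions, ambiguity_set, cost):
    """Outer minimisation of (1) over a finite list of candidate decisions."""
    return min(decisions, key=lambda x: worst_case_cost(x, ambiguity_set, cost))


# Hypothetical ambiguity set: three candidate demand distributions on {1, 2, 3}.
ambiguity_set = [
    {1: 0.5, 2: 0.3, 3: 0.2},
    {1: 0.2, 2: 0.3, 3: 0.5},
    {1: 0.3, 2: 0.4, 3: 0.3},
]
decisions = [1, 2, 3]
x_star = dro_decision(decisions, ambiguity_set, newsvendor_cost)
print(x_star, worst_case_cost(x_star, ambiguity_set, newsvendor_cost))
```

With these numbers the worst-case costs of ordering 1, 2 and 3 units are 3.9, 1.7 and 1.3, so the robust decision is to order 3 units. Enlarging the list of distributions can only increase each worst-case cost, which is the mechanism behind the conservativeness discussed above.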

2.1. Bayesian Formulations of DRO

Shapiro et al. [11] introduced Bayesian DRO (BDRO) which defines an expected worst-case risk objective:

\begin{matrix} min_{x \in X} E_{Π (θ ∣ ξ_{1 : n})} [sup_{P : d_{KL} (P | | P_{θ}) \leq ϵ} E_{ξ \sim P} [f_{x} (ξ)]] \end{matrix}

(2)

where

Π (θ ∣ ξ_{1 : n})

denotes the parameter posterior distribution for model family

P_{Θ}

. However, risk-averse decision-makers are interested in worst-case risk formulations. For this reason, Dellaporta et al. [12] proposed two formulations of the DRO with Bayesian Ambiguity Sets (DRO-BAS) that correspond to a worst-case optimisation problem with ambiguity sets informed by the standard Bayesian posterior. In particular, they defined DRO-BAS_PP:

\begin{matrix} min_{x \in X} sup_{P : d_{KL} (P | | E_{Π (θ ∣ ξ_{1 : n})} [P_{θ}]) \leq ϵ} E_{ξ \sim P} [f_{x} (ξ)], \end{matrix}

(3)

based on a KL-based ambiguity set with nominal distribution the posterior predictive, and DRO-BAS_PE:

\begin{matrix} min_{x \in X} sup_{P : E_{Π (θ ∣ ξ_{1 : n})} [d_{KL} (P | | P_{θ})] \leq ϵ} E_{ξ \sim P} [f_{x} (ξ)] . \end{matrix}

(4)

which considers the expected KL under the posterior distribution. The authors showcased improved out-of-sample robustness compared to BDRO in a number of Exponential family models. Although DRO-BAS in the standard Bayesian setting offers an intuitive, posterior-informed ambiguity set, it can be severely affected by model misspecification. Indeed, BAS_PE only considers probability measures

P

that are absolutely continuous with respect to

P_{θ}

(denoted by

P ≪ P_{θ}

) and also admit an expected KL divergence close enough to

P_{θ}

. Since the expectation is informed by the posterior, a non-robust posterior will likely lie far away from the DGP. Similarly, BAS_PP considers probability measures

P

that admit small KL-divergence with respect to

P_{n}^{pred}

, where

P_{n}^{pred} : = E_{θ \sim Π (θ ∣ ξ_{1 : n})} [P_{θ}]

and

P ≪ P_{n}^{pred}

. Hence, the sensitivity of the Bayesian posterior will propagate to the posterior predictive and the resulting ambiguity set. A similar observation was made by Shapiro et al. [11] for the BDRO method in the misspecified case.

To remedy this, we exploit the flexibility of the DRO-BAS framework which allows us to choose a different posterior distribution and discrepancy measure, suitable for model misspecification. The notion of targeting a different discrepancy measure, other than the KL divergence, to induce robustness in the Bayesian posterior has been well established in the Bayesian inference literature. The NPL posterior [15,16] was introduced to, among others, remedy the sensitivity of Bayesian inference to model misspecification by removing the assumption that the model is correct. This is done by setting uncertainty, via nonparametric prior beliefs, directly on the DGP rather than on the parameter of interest. Incorporating DGP uncertainty in decision-making has also been considered by Wang et al. [19], who explored a nonparametric Dirichlet Process (DP) model for the DGP. Unlike the current paper, this work is not suited for parametric models and considers a weighted objective, with only one counterpart corresponding to a worst-case risk. We focus on decision-making under parametric models, which are especially useful for interpretability in decision-making, while also incorporating nonparametric prior beliefs about the DGP.

To achieve this, we leverage the work of Dellaporta et al. [20] who extended the NPL posterior to discrepancy-based loss functions and showed robustness guarantees when the Maximum Mean Discrepancy (MMD) is used. Using an MMD-based loss allows us to also employ the MMD to construct the ambiguity set. NPL is a natural choice for the DRO-BAS framework as distributional uncertainty in DRO stems directly from uncertainty in the DGP. Before we introduce the NPL posterior and the DRO-RoBAS framework, we first give a brief overview of robust Bayesian inference methodologies based on divergences.

2.2. Robust Bayesian Inference via Divergences

To mitigate the lack of robustness in standard Bayesian inference, a class of approaches known as Generalised Bayesian Inference (GBI) [21,22,23] has been introduced. In this framework, one replaces the log-likelihood with a general empirical loss function

l_{n} : Ξ^{n} \times Θ \to R

, together with a learning rate

β > 0

. The resulting posterior distribution for prior

π (θ)

is defined through the density

\begin{matrix} Π_{GBI} (θ | ξ_{1 : n}) \propto exp (- β l_{n} (ξ_{1 : n}, θ)) π (θ) . \end{matrix}

In most cases, the loss function

l_{n}

is chosen in relation to the model family

P_{Θ}

. In particular, the choice of

l_{n} (ξ_{1 : n}, θ) : = - log p (ξ_{1 : n} ∣ θ)

and

β = 1

recovers the standard Bayesian posterior. This observation highlights that the sensitivity of the standard Bayesian posterior to model misspecification stems from this specific choice of loss. To improve robustness, alternative generalised posteriors have been proposed by selecting a discrepancy D with desirable properties and taking the loss

l_{n}

induced by an approximation of

D (P^{⋆}, P_{θ})

using observed data. Examples of such works based on robust divergences or discrepancies include [24,25,26,27]. Beyond GBI approaches, Bayesian NPL provides an alternative generalised formulation of Bayesian inference, which in some cases is also grounded in discrepancy-based loss functions. Unlike GBI methods, NPL places uncertainty directly on the DGP, thereby enhancing robustness to model misspecification. As distributional robustness similarly seeks to account for uncertainty in the DGP, we adopt this framework in the present work and introduce it below.
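To make the generalised posterior density defined earlier in this section concrete, the following sketch evaluates it on a one-dimensional grid. It is a minimal illustration, not the method of any cited work: the Gaussian location model with unit variance, the N(0, 10²) prior, the data values and the function names are all assumptions chosen so that the β = 1 case can be checked against the conjugate Gaussian posterior.

```python
import math


def generalised_posterior_on_grid(loss, log_prior, grid, beta=1.0):
    """Normalised weights proportional to exp(-beta * loss(theta)) * prior(theta) on a grid."""
    log_w = [-beta * loss(t) + log_prior(t) for t in grid]
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    return [v / z for v in w]


data = [0.8, 1.1, 0.9, 1.3, 1.0]  # hypothetical observations


def neg_log_lik(theta):
    # Gaussian location model with unit variance, up to an additive constant.
    return 0.5 * sum((x - theta) ** 2 for x in data)


def log_prior(theta):
    # Assumed N(0, 10^2) prior, up to an additive constant.
    return -0.5 * theta ** 2 / 10.0 ** 2


grid = [-5.0 + 0.001 * i for i in range(10001)]

# beta = 1 with the negative log-likelihood gives the standard Bayesian posterior.
probs = generalised_posterior_on_grid(neg_log_lik, log_prior, grid, beta=1.0)
post_mean = sum(t * p for t, p in zip(grid, probs))

# A smaller learning rate down-weights the data and widens the posterior.
probs_half = generalised_posterior_on_grid(neg_log_lik, log_prior, grid, beta=0.5)
mean_half = sum(t * p for t, p in zip(grid, probs_half))
var_one = sum((t - post_mean) ** 2 * p for t, p in zip(grid, probs))
var_half = sum((t - mean_half) ** 2 * p for t, p in zip(grid, probs_half))
```

Replacing `neg_log_lik` with a discrepancy-based loss changes the posterior without changing this recipe, which is the sense in which the choice of loss drives robustness.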

Robust NPL Posterior

In this work, we propose an alternative formulation of DRO-BAS, based on the robust NPL posterior introduced by Lyddon et al. [15] and Fong et al. [16]. We introduce a DP prior

Q \sim DP (α, F)

on the DGP

P^{⋆}

where

α > 0

and

F \in P (Ξ)

. Here,

F

represents our prior beliefs about the DGP and the concentration parameter

α

dictates the strength of the beliefs with

α = 0

representing a non-informative prior. To see this, note that given data

ξ_{1 : n}

, the posterior is

\begin{matrix} \begin{matrix} Q | ξ_{1 : n} & \sim DP (α^{'}, F^{'}), \\ α^{'} : = α + n, F^{'} & : = \frac{α}{α + n} F + \frac{n}{α + n} P_{n} \end{matrix} \end{matrix}

(5)

where

P_{n} : = \frac{1}{n} \sum_{i = 1}^{n} δ_{ξ_{i}}

is the empirical measure and

δ_{ξ}

denotes the Dirac measure at

ξ \in Ξ

. For

α = 0

, the DP posterior is centred directly on

P_{n}

. If

P^{⋆}

was known, we could directly compute:

\begin{matrix} θ_{L} (P^{⋆}) : = \underset{θ \in Θ}{arg min} E_{ξ \sim P^{⋆}} [L (ξ; θ)] \end{matrix}

(6)

where

L : Ξ \times Θ \to R

denotes any loss function. Note that this NPL objective does not assume that the model is well specified but simply looks for the most likely value of

θ

under the expectation of the DGP or, equivalently, the parameter value that best describes the data under the candidate model. Since

P^{⋆}

is unknown and we instead have a nonparametric posterior over it, we can propagate our posterior beliefs to the parameter of interest through the push-forward measure

{(θ_{L})}_{#} (DP (α^{'}, F^{'}))

to give a posterior

Π_{NPL}

on

Θ

. Sampling from this posterior can be done through the Posterior Bootstrap [16]: For B Posterior Bootstrap iterations, at iteration

j \in [B]

:

Sample $Q^{(j)}$ from the posterior $DP (α^{'}, F^{'})$ .
Compute $θ^{(j)} = θ_{L} (Q^{(j)})$ where $θ_{L} (\cdot)$ is as in (6).
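The two Posterior Bootstrap steps can be sketched as follows. This is a simplified illustration under explicit assumptions: it takes α = 0, so that a draw from the DP posterior reduces to Dirichlet(1, …, 1) weights on the observations, and it swaps in a squared-error loss L(ξ; θ) = (ξ − θ)² so that the minimiser in (6) is the weighted mean in closed form. The function names and data are hypothetical.

```python
import random


def dirichlet_ones(n, rng):
    """Draw Dirichlet(1, ..., 1) weights by normalising independent Gamma(1, 1) variables."""
    g = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    s = sum(g)
    return [v / s for v in g]


def posterior_bootstrap(data, B, seed=0):
    """Return B draws theta^(1), ..., theta^(B) for alpha = 0 and squared-error loss."""
    rng = random.Random(seed)
    thetas = []
    for _ in range(B):
        # Step 1: Q^(j) = sum_i w_i * delta_{xi_i} with w ~ Dirichlet(1, ..., 1).
        w = dirichlet_ones(len(data), rng)
        # Step 2: theta^(j) minimises E_{Q^(j)}[(xi - theta)^2], i.e. the weighted mean.
        thetas.append(sum(wi * xi for wi, xi in zip(w, data)))
    return thetas


data = [0.8, 1.1, 0.9, 1.3, 1.0]  # hypothetical observations
thetas = posterior_bootstrap(data, B=2000, seed=0)
```

Each iteration is independent of the others, so the loop can be run in parallel. For a loss without a closed-form minimiser, step 2 becomes a numerical optimisation per draw.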

Dellaporta et al. [20] suggested using a discrepancy-based loss function in (6) which we introduce below.

2.3. Maximum Mean Discrepancy

The MMD belongs to the family of IPMs [28]. Let

H_{k}

be a Reproducing Kernel Hilbert Space (RKHS), for kernel

k : Ξ \times Ξ \to R

and norm

{∥ \cdot ∥}_{k}

. For

P_{k} (Ξ) : = {P \in P (Ξ) : \int_{Ξ} \sqrt{k (ξ, ξ)} P (d ξ) < \infty}

, the MMD between

P, Q \in P_{k} (Ξ)

is defined as:

\begin{matrix} D_{k} (P, Q) : = sup_{f \in H_{k}, {∥ f ∥}_{k} \leq 1} |E_{P} [f (ξ)] - E_{Q} [f (ξ)]| . \end{matrix}

(7)

Dellaporta et al. [20] define the NPL target in (6) as:

\begin{matrix} θ_{k} (P^{⋆}) : = \underset{θ \in Θ}{arg min} D_{k} (P^{⋆}, P_{θ}) . \end{matrix}

(8)

One of the attractive properties of the MMD is that the supremum in (7) can be obtained in closed form as

\begin{matrix} D_{k}^{2} (P, Q) & = E_{ξ, ξ^{'} \sim P} [k (ξ, ξ^{'})] - 2 E_{ξ \sim P, ξ^{'} \sim Q} [k (ξ, ξ^{'})] \\ + E_{ξ, ξ^{'} \sim Q} [k (ξ, ξ^{'})] \end{matrix}

(9)

and can be approximated via sampling (see, e.g., [29]). The resulting NPL posterior with the MMD is called an NPL-MMD posterior.
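The sample-based approximation of (9) can be sketched as below. This is a minimal illustration assuming one-dimensional data and a Gaussian kernel with unit lengthscale; it uses the simple plug-in (V-statistic) estimate, which equals the squared MMD between the two empirical measures. The function names are hypothetical.

```python
import math
import random


def gaussian_kernel(a, b, lengthscale=1.0):
    """Gaussian kernel on the real line (bounded by 1)."""
    return math.exp(-((a - b) ** 2) / (2.0 * lengthscale ** 2))


def mean_kernel(xs, ys, kernel):
    """Average of k(x, y) over all pairs, approximating E[k(xi, xi')]."""
    return sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Plug-in estimate of (9) from samples xs ~ P and ys ~ Q."""
    return (mean_kernel(xs, xs, kernel)
            - 2.0 * mean_kernel(xs, ys, kernel)
            + mean_kernel(ys, ys, kernel))


rng = random.Random(0)
p_samples = [rng.gauss(0.0, 1.0) for _ in range(200)]
q_near = [rng.gauss(0.0, 1.0) for _ in range(200)]  # same distribution as P
q_far = [rng.gauss(3.0, 1.0) for _ in range(200)]   # shifted distribution

near = mmd_squared(p_samples, q_near)
far = mmd_squared(p_samples, q_far)
```

The cost is quadratic in the number of samples. An unbiased variant would drop the diagonal terms in the two within-sample averages.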

We explore the gains of using the MMD both in the robust NPL posterior and in the Bayesian ambiguity set. Since the NPL-MMD posterior will target the point in the model family closest (w.r.t. the MMD) to

P^{⋆}

, the MMD-based ambiguity set may require a smaller radius to include

P^{⋆}

, resulting in less conservative decisions.

2.4. DRO with the Maximum Mean Discrepancy

The MMD has previously been used as a distance metric in the DRO context by Staib and Jegelka [5] who considered the ambiguity set

B_{ϵ}^{k} (P_{n}) : = {P \in P_{k} (Ξ) : D_{k} (P, P_{n}) \leq ϵ}

where

ϵ > 0

is the radius of the MMD ball, k is the kernel and

P_{n} : = \frac{1}{n} \sum_{i = 1}^{n} δ_{ξ_{i}}

is the empirical measure of the observations

ξ_{1 : n}

. As highlighted by the authors, this ambiguity set has several advantages. First, MMD-DRO can be readily applied to complex data structures such as images or graphs by choosing an appropriate kernel defined on the corresponding data space. This is in contrast to DRO with other distance metrics, such as the popular Wasserstein distance, for which many theoretical and optimisation results rely on specific choices of ground metric, limiting their applicability to more complex data. Moreover, existing finite-sample concentration results (see, e.g., [29,30]) can be used for radius selection. Importantly, these results do not suffer from the curse of dimensionality and do not require any assumptions on the DGP, contrary to other distance metrics like the Wasserstein [4]. Although several works have provided remedies to this curse [31], results usually require assumptions on the DGP or the loss function and are hence not always applicable. However, these advantages come at a cost: Staib and Jegelka [5] explored the optimisation of the MMD-DRO objective by deriving an upper bound which requires the loss function to be a member of the RKHS, a condition that is often hard to verify. To remedy this, Zhu et al. [6] studied the same ambiguity set along with an extended family of kernel-based ambiguity sets, collectively called Kernel DRO, and provided strong duality results for the optimisation problem which do not require the loss function to be a member of the corresponding RKHS. In this work, we adopt this dual formulation and give a detailed explanation in Section 3.1. Chen et al. [32] generalised this to conditional Kernel DRO, leveraging conditional distributions, and Romao et al. [33] explored this framework for dynamic programming.

3. DRO with Robust Bayesian Ambiguity Sets

We propose a robust version of DRO-BAS_PP (3) via the MMD and the NPL-MMD posterior predictive defined as:

\begin{matrix} P_{n}^{pred (NPL)} : = E_{Q \sim DP (α^{'}, F^{'})} [P_{θ_{k} (Q)}] . \end{matrix}

(10)

We assume that for

Q \in P (Ξ)

, the map

Q \mapsto θ_{k} (Q)

is measurable such that

θ_{k} (Q) \in arg {min}_{θ \in Θ} D_{k} (Q, P_{θ})

, so that the expectation in (10) is well defined. Throughout, we also assume that

P_{n}^{pred (NPL)} \in P_{k} (Ξ)

, which holds, for example, when the kernel k is bounded. A sufficient condition ensuring the existence of the minimiser in (8) is given in Section 3.2. Notice that contrary to the standard Bayesian posterior predictive,

P_{n}^{pred (NPL)}

is defined through marginalisation over the nonparametric posterior over the DGP and is defined for any choice of nonparametric prior

DP (α, F)

as defined in Section 2.2. Since the MMD can be approximated using only samples (Section 2.3), a closed-form density for the predictive in (10) is not required. We define the following Robust Bayesian Ambiguity Set (RoBAS) with the NPL posterior predictive:

\begin{matrix} B_{ϵ}^{k} (P_{n}^{pred (NPL)}) : = {P \in P_{k} (Ξ) : D_{k} (P, P_{n}^{pred (NPL)}) \leq ϵ} . \end{matrix}

Note that

B_{ϵ}^{k} (P_{n}^{pred (NPL)})

forms a ball when k is a characteristic kernel, as this makes the MMD a probability metric. This property is desirable as it guarantees that the MMD will be zero if and only if

P \equiv P_{n}^{pred (NPL)}

. We hence obtain the following DRO-RoBAS worst-case risk problem:

\begin{matrix} min_{x \in X} sup_{P \in B_{ϵ}^{k} (P_{n}^{pred (NPL)})} E_{ξ \sim P} [f_{x} (ξ)] \end{matrix}

(11)

Similarly to DRO-BAS, this optimisation problem corresponds to a worst-case risk over a set of probability measures informed by posterior beliefs. Posterior beliefs about

θ

are obtained via posterior beliefs about the DGP and the map

θ_{k}

in (8). Since the goal of DRO is to target uncertainty about the DGP, the NPL posterior is a natural choice to inform the ambiguity set as it takes into account any prior beliefs about the DGP. Moreover, by targeting the MMD, rather than the KL divergence as in the BAS case, RoBAS is not restricted to probability measures that are absolutely continuous with respect to

P_{θ}

.

Intuitively, RoBAS is expected to be a better-informed ambiguity set than BAS when the model is misspecified since it is informed by a robust posterior predictive and a robust discrepancy measure. This is better understood through a toy example. Figure 2 shows a Gaussian location model in the presence of outliers. The top panel shows the DGP of the training data contaminated with 20% of outliers along with a pathological model

P_{pathological}

with a mean larger than that of the DGP. In the BAS_PE case, the expected KL from the DGP to the model is significantly larger than that from the pathological model

P_{pathological}

due to the sensitivity of the Bayesian posterior to outliers. A similar result holds for the KL divergence between the posterior predictive and the DGP and pathological model in the BAS_PP case. In contrast, the MMD from the NPL posterior predictive to the DGP is much smaller than that to the pathological model. Alternative DRO-RoBAS formulations are provided in Appendix C. The proofs of the results in this section are in Appendix A.

3.1. Duality of the DRO-RoBAS Problem

We first formulate our optimisation problem as a kernel DRO problem [6]. This allows us to obtain a dual formulation of (11) in the RKHS which can be optimised using kernel methods. Let

ϕ : Ξ \to H_{k}

denote the feature map associated with kernel k and let

μ_{P} \in H_{k}

denote the kernel mean embedding of the probability measure

P \in P_{k} (Ξ)

, i.e.,

μ_{P} : = E_{ξ \sim P} [ϕ (ξ)]

. Then the MMD is equivalently defined as

D_{k} (P, Q) = {∥ μ_{P} - μ_{Q} ∥}_{k}

([29], Lemma 4). Throughout we assume that

P_{n}^{pred (NPL)} \in P_{k} (Ξ)

. Note that this condition is trivially satisfied for a bounded kernel k. Consider the following set satisfying the conditions of RoBAS:

\begin{matrix} C^{⋆} : = {μ \in H_{k} : ∥ μ - μ_{P_{n}^{pred (NPL)}} ∥_{k} \leq ϵ} . \end{matrix}

(12)

The associated ambiguity set induced by

C^{⋆}

is

\begin{matrix} \begin{matrix} K_{C^{⋆}} & : = {P \in P_{k} (Ξ), μ_{P} \in C^{⋆}} \equiv B_{ϵ}^{k} (P_{n}^{pred (NPL)}) \end{matrix} \end{matrix}

(13)

and our DRO-RoBAS problem in (11) is equivalent to:

\begin{matrix} \begin{matrix} min_{x \in X} sup_{P, μ_{P}} E_{ξ \sim P} [f_{x} (ξ)] s . t . P \in P_{k} (Ξ), μ_{P} \in C^{⋆} . \end{matrix} \end{matrix}

(14)

The equivalence can be seen through the

C^{⋆}

-induced ambiguity set for distributions

P

which can be written as

K_{C^{⋆}}

.

Before we proceed to our dual formulation, we introduce the main result from Zhu et al. [6] which gives a dual formulation of Kernel DRO problems for general sets

C

satisfying certain assumptions.

Theorem 1

(Zhu et al. [6], Theorem 3.1). Assume

C \subset H_{k}

is closed convex,

f_{x} (\cdot)

is proper, upper semi-continuous, and

ri (K_{C}) \neq \emptyset

, where

ri (K_{C})

denotes the relative interior of

K_{C}

. Then the primal problem:

\begin{matrix} min_{x} sup_{P, μ} E_{ξ \sim P} [f_{x} (ξ)] s . t . P \in P, μ_{P} = μ, μ \in C \end{matrix}

is equivalent to:

\begin{matrix} min_{x, g_{0} \in R, g \in H_{k}} g_{0} + δ_{C}^{⋆} (g) s . t . f_{x} (ξ) \leq g_{0} + g (ξ), \forall ξ \in Ξ \end{matrix}

where

δ_{C}^{⋆} (g) : = {sup}_{μ \in C} {〈g, μ〉}_{H_{k}}

is the support function of

C

.

This theorem gives an effective way to transition from the primal to the dual formulation by using the support function of the set

C

. Importantly, in contrast to other kernel-based dual formulations (e.g., [5]), this theorem does not require the objective function f to be a member of the RKHS

H_{k}

. We first derive the support function of

C^{⋆}

in the DRO-RoBAS case. We denote

E_{Q \sim DP (α^{'}, F^{'})}

and

E_{ξ \sim P_{θ_{k} (Q)}}

by

E_{{DP}_{ξ_{1 : n}}}

and

E_{P_{θ_{k} (Q)}}

respectively.

Proposition 1.

Let

C^{⋆}

be defined by (12). Then we have

δ_{C^{⋆}}^{⋆} (g) = E_{{DP}_{ξ_{1 : n}}} [E_{P_{θ_{k} (Q)}} [g (ξ)]] + ϵ {∥ g ∥}_{k}

.
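The form of this support function can be motivated by a short calculation. The sketch below is the standard argument for the support function of a norm ball in a Hilbert space and is offered only as intuition; it is not necessarily the route taken in the paper's own proof. The shorthand μ_0 for the kernel mean embedding of the NPL posterior predictive is introduced here for readability.

```latex
\begin{aligned}
\delta_{C^{\star}}^{\star}(g)
  &= \sup_{\mu \,:\, \|\mu - \mu_0\|_k \le \epsilon} \langle g, \mu \rangle_{H_k},
     \qquad \mu_0 := \mu_{P_n^{pred(NPL)}} \\
  &= \langle g, \mu_0 \rangle_{H_k}
     + \sup_{h \,:\, \|h\|_k \le \epsilon} \langle g, h \rangle_{H_k} \\
  &= \langle g, \mu_0 \rangle_{H_k} + \epsilon \, \|g\|_k
     \qquad \text{(Cauchy--Schwarz, attained at } h = \epsilon g / \|g\|_k \text{ for } g \neq 0\text{)} \\
\langle g, \mu_0 \rangle_{H_k}
  &= E_{\xi \sim P_n^{pred(NPL)}}[g(\xi)]
   = E_{Q \sim DP(\alpha', F')}\big[ E_{\xi \sim P_{\theta_k(Q)}}[g(\xi)] \big]
     \qquad \text{(reproducing property and (10))}
\end{aligned}
```

The first term is the expected value of g under the nominal distribution and the second is a penalty that grows with the radius ϵ and with the RKHS norm of g.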

We can now apply Theorem 1 to our problem.

Corollary 1.

Let

C^{⋆}

be as in (12) and

f_{x} (\cdot)

be proper and upper semi-continuous. Then problem (14) is equivalent to:

\begin{matrix} \begin{matrix} min_{x, g_{0} \in R, g \in H_{k}} & g_{0} + E_{{DP}_{ξ_{1 : n}}} [E_{P_{θ_{k} (Q)}} [g (ξ)]] + ϵ {∥ g ∥}_{k} \\ subject to & f_{x} (ξ) \leq g_{0} + g (ξ), \forall ξ \in Ξ . \end{matrix} \end{matrix}

(15)

Computation of (15):

The problem in (15) can be solved by the batch approach with discretisation of a semi-infinite programme (SIP) [35] suggested in Zhu et al. [6], in addition to a Sample Average Approximation (SAA). Let

{{\hat{ξ}}_{i}}_{i = 1}^{N}

be samples from the nested expectation in (15) and

{ζ_{j}}_{j = 1}^{m} \subseteq Ξ

be a set of discretisation points. Then the problem can be approximated by:

\begin{matrix} \begin{matrix} min_{x, g_{0} \in R, g \in H_{k}} & g_{0} + \frac{1}{N} \sum_{i = 1}^{N} g ({\hat{ξ}}_{i}) + ϵ {∥ g ∥}_{k} \\ s . t . & f_{x} (ζ_{j}) \leq g_{0} + g (ζ_{j}), \forall j \in [m] . \end{matrix} \end{matrix}

(16)

We can now apply the distributionally robust version of the Representer theorem ([6], Lemma B.1) which states that it is sufficient to parametrise g by

g (\cdot) = \sum_{i = 1}^{N} α_{i} k ({\hat{ξ}}_{i}, \cdot) + \sum_{j = 1}^{m} α_{N + j} k (ζ_{j}, \cdot)

for some

α_{k} \in R

, for all

k = 1, \dots, N + m

.
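Under this parametrisation, the objective and constraints of (16) become finite-dimensional: the RKHS norm of g is the square root of the quadratic form of the coefficients with the kernel Gram matrix. The sketch below only evaluates the objective and checks feasibility for a candidate pair (g_0, coefficients) at a fixed decision x. It does not solve the problem, which would need a convex solver. The Gaussian kernel, the absolute-deviation cost, the sample values and the function names are all assumptions for illustration.

```python
import math


def gaussian_kernel(a, b, lengthscale=1.0):
    """Assumed Gaussian kernel on the real line."""
    return math.exp(-((a - b) ** 2) / (2.0 * lengthscale ** 2))


def evaluate_discretised_dual(g0, coeffs, samples, grid_points, costs, eps,
                              kernel=gaussian_kernel):
    """Return (objective, feasible) of (16) for g = sum_l coeffs[l] * k(centre_l, .)."""
    centres = list(samples) + list(grid_points)
    if len(coeffs) != len(centres):
        raise ValueError("need one coefficient per sample and per grid point")

    def g(z):
        return sum(c * kernel(zc, z) for c, zc in zip(coeffs, centres))

    # Squared RKHS norm of g: coefficients' quadratic form with the Gram matrix.
    norm_sq = sum(ci * cj * kernel(zi, zj)
                  for ci, zi in zip(coeffs, centres)
                  for cj, zj in zip(coeffs, centres))
    objective = (g0 + sum(g(s) for s in samples) / len(samples)
                 + eps * math.sqrt(max(norm_sq, 0.0)))
    # Constraint f_x(zeta_j) <= g0 + g(zeta_j) at every discretisation point.
    feasible = all(f <= g0 + g(z) + 1e-12 for f, z in zip(costs, grid_points))
    return objective, feasible


samples = [1.0, 2.0, 3.0]                 # stand-ins for draws from the nested expectation
grid_points = [0.0, 1.0, 2.0, 3.0, 4.0]   # discretisation points
x = 2.0
costs = [abs(x - z) for z in grid_points]  # hypothetical cost f_x evaluated on the grid

# g = 0 with g0 equal to the largest cost is always feasible.
trivial = evaluate_discretised_dual(max(costs), [0.0] * 8, samples, grid_points,
                                    costs, eps=0.1)
```

The trivial candidate gives an objective equal to the largest cost on the grid, so any optimal solution of (16) can do no worse than that value.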

3.2. Tolerance Level Guarantees

We start by using the generalisation error results for the NPL-MMD posterior to obtain a result in probability that the DGP lies within our ambiguity set. First, we give a concentration type bound for

E_{Q \sim DP (α^{'}, F^{'})} [D_{k} (P^{⋆}, Q)]

. In practice, exact sampling from a DP is not possible, so we consider the approximation of the DP suggested in the NPL literature [15,16,20] to sample during the MMD Posterior Bootstrap. In particular, denote by

{\hat{DP}}_{ξ_{1 : n}}

the probability measure on

P (Ξ)

induced by the following sampling process for

(w_{1 : n}, {\tilde{w}}_{1 : τ}) \sim Dir (1, \dots, 1, \frac{α}{τ}, \dots, \frac{α}{τ})

and

{\tilde{ξ}}_{1 : τ} \overset{iid}{\sim} F

:

\begin{matrix} \begin{matrix} Q : = \sum_{i = 1}^{n} w_{i} δ_{ξ_{i}} + \sum_{k = 1}^{τ} {\tilde{w}}_{k} δ_{{\tilde{ξ}}_{k}} \sim {\hat{DP}}_{ξ_{1 : n}} . \end{matrix} \end{matrix}

(17)

The associated approximate posterior predictive is

{\hat{P}}_{n}^{pred (NPL)} : = E_{Q \sim {\hat{DP}}_{ξ_{1 : n}}} [P_{θ_{k} (Q)}] .
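The sampling process in (17) can be sketched as follows. This is a minimal illustration assuming one-dimensional data and a standard normal prior centring measure F; the Dirichlet weights are obtained by normalising independent Gamma draws. The function names, data and parameter values are hypothetical.

```python
import random


def sample_approx_dp(data, alpha, tau, prior_sampler, rng):
    """Draw one Q as in (17): returns (atoms, weights) of a discrete measure."""
    n = len(data)
    # Dirichlet(1, ..., 1, alpha/tau, ..., alpha/tau) via normalised Gamma variables.
    shapes = [1.0] * n + [alpha / tau] * tau
    gammas = [rng.gammavariate(s, 1.0) for s in shapes]
    total = sum(gammas)
    weights = [g / total for g in gammas]
    # tau pseudo-samples drawn i.i.d. from the prior centring measure F.
    pseudo = [prior_sampler(rng) for _ in range(tau)]
    return list(data) + pseudo, weights


def standard_normal(rng):
    # Assumed choice of F, for illustration only.
    return rng.gauss(0.0, 1.0)


rng = random.Random(0)
data = [0.8, 1.1, 0.9, 1.3, 1.0]  # hypothetical observations
atoms, weights = sample_approx_dp(data, alpha=1.0, tau=10,
                                  prior_sampler=standard_normal, rng=rng)
```

On average the pseudo-samples receive total weight α / (α + n), matching the mixture weight on F in (5). Passing each draw Q to an MMD minimiser and then sampling from the fitted model would give draws from the approximate posterior predictive.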

We provide a concentration result for the MMD between

P^{⋆}

and

{\hat{P}}_{n}^{pred (NPL)}

as the approximated predictive is used in practice. However, similar results can be derived for the exact case with

P_{n}^{pred (NPL)}

following the arguments in [20]. Additionally, all theoretical results regarding the duality of the DRO-RoBAS framework from Section 3.1 hold exactly the same for the approximated DP as they are proven for a general posterior. We make the following assumptions:

Assumption 1.

For every

Q \in P (Ξ)

there exists

c > 0

such that the set

{θ \in Θ : D_{k} (Q, P_{θ}) \leq {inf}_{θ \in Θ} D_{k} (Q, P_{θ}) + c}

is bounded.

Assumption 2.

The kernel k is such that

| k (ξ, ξ^{'}) | \leq M

,

M < \infty

, for any

ξ, ξ^{'} \in Ξ

.

Assumption 1 ensures that a minimiser in (8) exists and is a common assumption made in MMD estimator methods (see [30]). Assumption 2 is needed to obtain a concentration inequality for the NPL posterior and is often made in methods using MMD estimators, as it ensures robustness guarantees (see, e.g., [20,30,36,37]). Intuitively, bounded kernels control the contribution of extreme or unlikely observations, preventing them from having an arbitrarily large effect on the distance and thereby yielding a robust measure of discrepancy. Many commonly used kernels are bounded, such as the Gaussian, Matérn and Exponential kernels (see, e.g., [38]).

Theorem 2.

Suppose Assumptions 1 and 2 hold. Then with probability at least

1 - δ

:

\begin{matrix} \begin{matrix} D_{k} (P^{⋆}, {\hat{P}}_{n}^{pred (NPL)}) \leq inf_{θ \in Θ} D_{k} (P_{θ}, P^{⋆}) + C_{n, M, α} \end{matrix} \end{matrix}

(18)

where

C_{n, M, α}

is a constant depending on the number of samples n, the upper bound of the kernel M and the concentration parameter on the DP prior α.

Remark 1.

C_{n, M, α}

has an overall rate of

1 / \sqrt{n}

consistent with existing results for minimum MMD estimators [30,36]. Moreover, given an upper bound of the kernel M, the constant is fully known. Hence, if

{inf}_{θ \in Θ} D_{k} (P_{θ}, P^{⋆})

can be reasonably approximated, this result can be used to select the radius ensuring RoBAS includes

P^{⋆}

with high probability. However, in practice, this theoretical radius can be over-conservative and a suitable value can be chosen via cross-validation or related bootstrapping procedures following standard practices in the DRO literature (see, e.g., [12,39,40]).

We can now obtain an upper bound for the target optimisation problem for large enough

ϵ

.

Corollary 2.

Suppose Assumptions 1 and 2 hold and let

C_{n, M, α}

be as in Theorem 2. Then, for

ϵ \geq C_{n, M, α} + {inf}_{θ \in Θ} D_{k} (P_{θ}, P^{⋆})

, with probability at least

1 - δ

:

\begin{matrix} E_{ξ \sim P^{⋆}} [f_{x} (ξ)] \leq sup_{B_{ϵ}^{k} ({\hat{P}}_{n}^{pred (NPL)})} E_{ξ \sim P} [f_{x} (ξ)] . \end{matrix}

In the special case of Huber’s contamination model [41], we can obtain a guarantee similar to Theorem 2 which depends on the contamination level.

Corollary 3

(Huber’s cont. model). Suppose

P^{⋆} = (1 - η) P_{θ_{0}} + η Q

for some

θ_{0} \in Θ

,

Q \in P (Ξ)

and

η \in [0, 1]

. Suppose Assumptions 1 and 2 hold and let

C_{n, M, α}

be as in Theorem 2. Then with probability at least

1 - δ

:

D_{k} (P_{θ_{0}}, {\hat{P}}_{n}^{pred (NPL)}) \leq 4 η + 2 C_{n, M, α} .

4. Experiments

We evaluate our method on several different DGPs, model families and misspecification settings for two decision-making problems: the Newsvendor and the Portfolio.

We compare our method to existing Bayesian formulations of DRO—DRO-BAS [12] and Bayesian DRO (BDRO) [11]—both based on the KL divergence and standard Bayesian posterior. To assess how much robustness in our framework is gained through the robust posterior compared to the choice of the MMD in the ambiguity set, we further compare against the empirical method which uses an MMD ball around the empirical measure (denoted by Empirical MMD). This was presented in Staib and Jegelka [5] and also forms a special case of Kernel DRO [6]. Implementation details are provided in Appendix B.

We explore two types of misspecification.

1. Model misspecification, which occurs when the DGP $P^{⋆}$ does not belong to the model family $P_{Θ}$, e.g., if $P^{⋆}$ is multimodal while $P_{Θ}$ assumes unimodality. This affects Bayesian DRO methods (BDRO, DRO-BAS, DRO-RoBAS) but not empirical approaches, as the latter do not rely on a model.
2. Huber’s contamination model [41], which is a specific type of model misspecification (see Figure 2) wherein the training DGP is $P^{⋆} = (1 - η) \tilde{P} + η Q$ for some $η \in [0, 1]$ and $\tilde{P}, Q \in P (Ξ)$. Contamination, limited to the training set, impacts both Bayesian and empirical DRO methods since the test distribution is assumed to be $\tilde{P}$. Huber contamination relates to concepts like distribution shift and out-of-distribution robustness (e.g., [42]) and its importance in DRO has attracted increasing attention in recent work [43,44].

4.1. The Newsvendor Problem

We start with the commonly explored Newsvendor problem (e.g., [45]). The goal is to choose the optimal amount of products to buy based on consumers’ demand. The cost is defined as:

f (x, ξ) : = h max (x - ξ, 0) + b max (ξ - x, 0)

where

x \in R_{\geq 0}^{D}

is the number of product units ordered,

ξ \in R^{D}

is the consumers’ demand,

0 \in R^{D}

is the zero vector and b and h denote the backorder and holding cost per unit respectively. In all examples we follow the implementation of Shapiro et al. [11] and Dellaporta et al. [12], and set

b = 8

and

h = 3

. We run each experiment

J = 100

times for

n = 20

observations and compute the out-of-sample mean and variance of the cost incurred.
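As a minimal sketch of the cost defined above, the function below evaluates it with the stated values h = 3 and b = 8. The function name `newsvendor_cost` is hypothetical (it is not taken from the authors' repository), and summing the per-product costs over the D dimensions is our assumption for the multivariate case.

```python
import numpy as np

def newsvendor_cost(x, xi, h=3.0, b=8.0):
    """Holding cost h on unsold units plus backorder cost b on unmet demand.

    x  : order quantities, scalar or shape (D,)
    xi : demand, scalar or shape (..., D)
    Summing over the last axis is an assumption for D > 1.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    xi = np.atleast_1d(np.asarray(xi, dtype=float))
    over = np.maximum(x - xi, 0.0)   # units ordered but not sold
    under = np.maximum(xi - x, 0.0)  # demand that could not be met
    return np.sum(h * over + b * under, axis=-1)
```

Because b > h, under-ordering by one unit is penalised more heavily than over-ordering by one unit, which is why conservative decisions in this problem tend to order more.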

We consider two models and several DGP cases as follows. First, we assume the demand

ξ \in Ξ

follows a Gaussian distribution with known variance, i.e.,

P_{θ} : = N (θ, σ^{2} I_{D \times D})

while the DGP is a bimodal Gaussian distribution (case 1 above):

P^{⋆} : = 0.5 N (θ_{1}^{⋆}, σ^{2} I_{D \times D}) + 0.5 N (θ_{2}^{⋆}, σ^{2} I_{D \times D})

for

D = 1

and

D = 5

. Furthermore, we explore the same Gaussian model with a contaminated Gaussian training DGP (case 2 above):

P_{train}^{⋆} : = (1 - η) N (θ^{⋆}, σ^{2}) + η N (θ^{'}, σ^{2})

and an Exponential model

P_{θ} : = Exp (θ)

with a contaminated Exponential DGP for the training DGP (case 2 above):

P_{train}^{⋆} : = (1 - η) Exp (θ^{⋆}) + η N (μ, σ)

for

η \in {0.0, 0.1, 0.2}

.
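The univariate data-generating processes described above can be simulated with a few lines of NumPy. This is an illustrative sketch only: the function names are hypothetical, and the default parameter values are the univariate settings reported in Appendix B.

```python
import numpy as np

def sample_bimodal(n, theta1=10.0, theta2=60.0, sigma=5.0, rng=None):
    """Draw n points from 0.5 N(theta1, sigma^2) + 0.5 N(theta2, sigma^2)."""
    rng = np.random.default_rng(rng)
    from_second = rng.random(n) < 0.5
    means = np.where(from_second, theta2, theta1)
    return means + sigma * rng.standard_normal(n)

def sample_huber_gaussian(n, eta, theta=25.0, theta_out=75.0, sigma=5.0, rng=None):
    """Draw n training points from (1 - eta) N(theta, sigma^2) + eta N(theta_out, sigma^2).

    Returns the sample and a boolean mask marking the contaminated draws.
    """
    rng = np.random.default_rng(rng)
    is_outlier = rng.random(n) < eta
    means = np.where(is_outlier, theta_out, theta)
    return means + sigma * rng.standard_normal(n), is_outlier
```

In the contamination experiments only the training sample would be drawn with `eta > 0`; test data correspond to `eta = 0`.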

Figure 3 presents the out-of-sample mean and variance of the methods for the bimodal univariate and multivariate Gaussian DGPs.

The effect of model misspecification is notably more pronounced for the DRO-BAS and BDRO instantiations, which are based on the standard Bayesian posterior and the KL divergence. In the DRO-RoBAS case, this robustness is likely due to the fact that the obtained NPL-MMD posterior is bimodal, even though the model itself is unimodal. In contrast, the standard Bayesian posterior is highly sensitive to misspecification, resulting in a unimodal posterior concentrated between the two modes. Consequently, DRO-BAS and BDRO require very large values of

ϵ

to capture the true DGP, leading to conservative decisions that incur high out-of-sample costs. This is evident as they achieve lower mean-variance as

ϵ

increases.

We further observe (Figure 3) that in the univariate case, empirical MMD achieves a lower (difference of ≤1) out-of-sample mean for most values of

ϵ < 1

. However, DRO-RoBAS consistently shows a lower out-of-sample variance. In the multivariate case, the performances of the two methods are similar, though empirical MMD outperforms DRO-RoBAS in both mean and variance. Notably, this comparison pits a Bayesian method under model misspecification against a completely empirical method unaffected by this misspecification.

However, it is promising that DRO-RoBAS remains highly competitive against this baseline. The next example, based on contamination models, illustrates a scenario where robustness is crucial for both model-based and empirical methods.

In the second simulation, we consider the Huber contamination models [41] where the training set is contaminated, whereas the test set is not. Figure 4 demonstrates that both DRO-BAS formulations outperform DRO-RoBAS and the empirical MMD method in the well-specified case, where there is no contamination in the training set, and the training and test distributions are identical. In misspecified cases, where

η > 0

, DRO-RoBAS shows greater robustness compared to the other methods in terms of the out-of-sample mean-variance trade-off. This suggests that the robust posterior and robust distance measures in RoBAS contribute to a better-informed ambiguity set concerning the test set generating process.

This example further illustrates that, while the motivation for a robust ambiguity set stemmed from concerns about model misspecification, even entirely empirical methods, like the empirical MMD, can be sensitive to misspecifications arising from discrepancies between the training and test distributions.

4.2. The Portfolio Optimisation Problem

We continue with the multi-dimensional Portfolio problem, also considered by Shapiro et al. [11], which chooses stock weightings (

x \in R^{D}

) to maximise returns. The objective function is

f_{x} (ξ) = - ξ^{⊤} x

which corresponds to maximising the return and the optimisation is subject to the constraints

x_{i} \geq 0

for all

i = 1, \dots, D

and

\sum_{i = 1}^{D} x_{i} = 1

. We generate

n = 100

observations from a 5D Gaussian DGP with contamination on three-out-of-five dimensions:

P^{⋆} = (1 - η) N (μ^{⋆}, Σ^{⋆}) + η N (μ^{'}, Σ^{⋆})

. We use a multivariate Gaussian model with unknown mean and variance.
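For intuition only, consider the non-robust sample-average version of this problem (not the DRO-RoBAS formulation). Its objective is linear in x, so an optimal solution sits at a vertex of the simplex, placing all weight on the asset with the largest sample-mean return. The sketch below uses hypothetical means, a hypothetical helper name, and contamination on a single asset to show how outliers in the training sample can flip that vertex.

```python
import numpy as np

def saa_portfolio(returns):
    """Minimise the average of -xi^T x over {x >= 0, sum(x) = 1}.

    The objective is linear in x, so a vertex is optimal: put all weight
    on the asset with the highest sample-mean return.
    """
    mean_returns = np.asarray(returns, dtype=float).mean(axis=0)
    x = np.zeros_like(mean_returns)
    x[np.argmax(mean_returns)] = 1.0
    return x

rng = np.random.default_rng(0)
mu_clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # hypothetical values
mu_outlier = np.array([40.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical shift on asset 0
n, eta = 100, 0.2
is_outlier = np.arange(n) < round(eta * n)          # fixed outlier count for clarity
means = np.where(is_outlier[:, None], mu_outlier, mu_clean)

x_clean = saa_portfolio(mu_clean + rng.standard_normal((n, 5)))
x_contaminated = saa_portfolio(means + rng.standard_normal((n, 5)))
```

With these illustrative numbers the clean sample selects the last asset while the contaminated sample selects the first, which is the kind of sensitivity of the nominal distribution that a robust posterior predictive is meant to reduce.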

Figure 5 shows DRO-RoBAS is unaffected by the contamination, whilst empirical MMD, DRO-BAS, and BDRO are negatively affected. Consider

η = 0.1

: for

ϵ < 0.2

, empirical MMD performance quickly degrades compared to DRO-RoBAS; but, for

ϵ \geq 0.2

, empirical MMD performs similarly due to the MMD robustness. This effect is magnified for

η = 0.2

, demonstrating that the empirical nominal distribution is unreliable due to outliers, whilst DRO-RoBAS benefits from a robust nominal (the NPL-MMD posterior predictive) and thus performs better for small

ϵ

.

4.3. Computational Time

The increased robustness of DRO-RoBAS comes at the cost of higher computational demands (see Table A1 and Table A2 of Appendix B). This cost arises from the complex optimisation problem in the RKHS and the longer sampling time required for the NPL posterior. However, as previously demonstrated, this cost is justified by improved out-of-sample performance across various cases of model misspecification. Moreover, by leveraging the NPL-MMD and the MMD, DRO-RoBAS can be used for any choice of model family, even likelihood-free models. Notice that DRO-BAS_PE is limited to Exponential family models whereas the computational cost of BDRO and DRO-BAS_PP increases considerably if the posterior is not available in closed form, as methods like Markov Chain Monte Carlo are needed for posterior sampling. This highlights the flexibility and robustness of DRO-RoBAS despite its computational demands. Nevertheless, possible scalability improvements are discussed in Section 5.

5. Conclusions

Bayesian formulations of DRO for decision-making problems can suffer from model misspecification as the ambiguity set heavily relies on the non-robust Bayesian posterior. We addressed this challenge by using a robust NPL posterior to inform the ambiguity set and leveraging the MMD to construct both the posterior and the ambiguity set itself. We show that DRO-RoBAS admits a dual formulation in the RKHS and we provide probabilistic guarantees for the tolerance level such that the resulting optimisation problem upper bounds the true objective with high probability.

Scalability improvements for DRO-RoBAS can be achieved through existing tools from kernel methods such as Fourier features [46] and low-rank kernel matrix approximations [47] that are left for future work. Our empirical evidence suggests that if the model is well specified, or the level of misspecification is low, then existing Bayesian formulations like DRO-BAS can have better performance and scalability. At the same time, when model misspecification is moderate or high then DRO-RoBAS achieves significantly improved out-of-sample performance and robustness. Note that any prior knowledge on the misspecification level can be naturally incorporated into our framework to further boost performance.

Finally, the construction of Robust Bayesian Ambiguity Sets can extend beyond the choices of NPL and MMD. For example, another instantiation of our framework arises if we employ Generalised Bayesian Inference (GBI) [21]. One of the motivations of GBI is to induce robustness with respect to model misspecification [22,23,24,25,26,48,49] by targeting a different divergence than the KL. While these methods do not directly impose uncertainty on the DGP like NPL does, they can produce robust GBI posteriors, making them worth integrating into DRO-RoBAS. Notably, the duality results in Section 3.1 are based on the MMD choice; however, they hold under a general posterior and are not dependent on the NPL framework.

Author Contributions

Conceptualization, C.D., P.O. and T.D.; methodology, C.D., P.O. and T.D.; software, C.D. and P.O.; validation, C.D., P.O. and T.D.; formal analysis, C.D., P.O. and T.D.; investigation, C.D., P.O. and T.D.; writing—original draft preparation, C.D. and P.O.; writing—review and editing, C.D., P.O. and T.D. All authors have read and agreed to the published version of the manuscript.

Funding

CD acknowledges support from EPSRC grant [EP/T51794X/1] as part of the Warwick CDT in Mathematics and Statistics and EPSRC grant [EP/Y022300/1]. PO and TD acknowledge support from a UKRI Turing AI acceleration Fellowship [EP/V02678X/1] and a Turing Impact Award from the Alan Turing Institute. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC-BY) licence to any Author Accepted Manuscript version arising from this submission.

Data Availability Statement

The code to reproduce the simulated experiments is openly available at https://github.com/PatrickOHara/mis-dro-code (accessed on 8 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proofs of Theoretical Results

In this section, we provide detailed proofs of the theoretical results concerning the duality of the DRO-RoBAS problem and the tolerance level guarantees.

Appendix A.1. Proof of Proposition 1

First, we derive the support function of the effective ambiguity set in terms of kernel mean embeddings defined in Section 3.1:

\begin{matrix} C^{⋆} & : = {μ \in H_{k} : ∥ μ - μ_{P_{n}^{pred (NPL)}} ∥_{k} \leq ϵ} \end{matrix}

(A1)

where we denote by

E_{{DP}_{ξ_{1 : n}}} [\cdot]

the expectation under

E_{Q \sim DP (c^{'}, F^{'})} [\cdot]

and we have defined the NPL-MMD posterior predictive as

P_{n}^{pred (NPL)} : = E_{{DP}_{ξ_{1 : n}}} [P_{θ_{k} (Q)}]

. We start with the following Lemma.

Lemma A1.

For any

ϵ > 0

we have:

\begin{matrix} sup_{∥ μ - μ_{P_{n}^{pred (NPL)}} ∥_{k} \leq ϵ} {〈g, μ - μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} = ϵ {∥ g ∥}_{k} . \end{matrix}

Proof.

The proof follows the same logic as the proof for the ambiguity set corresponding to the MMD ball around the empirical measure, provided in Appendix A.2 of Zhu et al. [6]. To prove the equality, we establish both inequalities. We first prove that the left-hand side is less than or equal to the right-hand side. Applying the Cauchy–Schwarz inequality, we obtain:

\begin{matrix} sup_{∥ μ - μ_{P_{n}^{pred (NPL)}} ∥_{k} \leq ϵ} {〈g, μ - μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} \leq sup_{∥ μ - μ_{P_{n}^{pred (NPL)}} ∥_{k} \leq ϵ} {∥ g ∥}_{k} ∥ μ - μ_{P_{n}^{pred (NPL)}} ∥_{k} = ϵ {∥ g ∥}_{k} . \end{matrix}

(A2)

For the opposite direction, let

μ^{'} : = μ_{P_{n}^{pred (NPL)}} + ϵ \frac{g}{{∥ g ∥}_{k}}

. Then

\begin{matrix} ∥ μ^{'} - μ_{P_{n}^{pred (NPL)}} ∥_{H_{k}}^{2} \\ = {〈μ^{'} - μ_{P_{n}^{pred (NPL)}}, μ^{'} - μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} \\ = {〈μ^{'}, μ^{'}〉}_{H_{k}} - 2 {〈μ^{'}, μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} + {〈μ_{P_{n}^{pred (NPL)}}, μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} \\ = {〈μ_{P_{n}^{pred (NPL)}} + ϵ \frac{g}{{∥ g ∥}_{k}}, μ_{P_{n}^{pred (NPL)}} + ϵ \frac{g}{{∥ g ∥}_{k}}〉}_{H_{k}} + {〈μ_{P_{n}^{pred (NPL)}}, μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} \\ - 2 {〈μ_{P_{n}^{pred (NPL)}} + ϵ \frac{g}{{∥ g ∥}_{k}}, μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} \\ = {〈ϵ \frac{g}{{∥ g ∥}_{k}}, ϵ \frac{g}{{∥ g ∥}_{k}}〉}_{H_{k}} \\ = ϵ^{2} \frac{{〈g, g〉}_{H_{k}}}{{∥ g ∥}_{k}^{2}} \\ = ϵ^{2} \end{matrix}

and hence

\begin{matrix} ∥ μ^{'} - μ_{P_{n}^{pred (NPL)}} ∥_{k} = ϵ \end{matrix}

which proves that

μ^{'} \in C^{⋆}

. To complete the proof it suffices to show that

\begin{matrix} 〈g, μ^{'} - μ_{P_{n}^{pred (NPL)}}〉 = ϵ {∥ g ∥}_{k} . \end{matrix}

(A3)

By definition of

μ^{'}

we have:

\begin{matrix} {〈g, μ^{'} - μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} & = 〈g, μ_{P_{n}^{pred (NPL)}} + ϵ \frac{g}{{∥ g ∥}_{k}} - μ_{P_{n}^{pred (NPL)}}〉 \\ = 〈g, ϵ \frac{g}{{∥ g ∥}_{k}}〉 \\ = {ϵ ∥ g ∥}_{k} . \end{matrix}

Since

μ^{'} \in C^{⋆}

we have

\begin{matrix} sup_{∥ μ - μ_{P_{n}^{pred (NPL)}} ∥_{k} \leq ϵ} {〈g, μ - μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} \geq {〈g, μ^{'} - μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} = ϵ {∥ g ∥}_{k} \end{matrix}

which completes the proof. □
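The two directions of Lemma A1 can be checked numerically in a finite-dimensional stand-in for the RKHS. The sketch below replaces H_k with Euclidean space (an assumption made purely for illustration) and verifies that the point used in the proof attains the value ϵ‖g‖, while random points of the ball never exceed it.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, eps = 6, 0.3
g = rng.standard_normal(dim)     # stands in for g in H_k
mu0 = rng.standard_normal(dim)   # stands in for the centre embedding

# Candidate maximiser from the proof: move eps along the direction of g.
mu_star = mu0 + eps * g / np.linalg.norm(g)
attained = g @ (mu_star - mu0)
bound = eps * np.linalg.norm(g)

# Random displacements inside the ball stay below the bound (Cauchy-Schwarz).
directions = rng.standard_normal((1000, dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = eps * rng.random((1000, 1))
values = (radii * directions) @ g
```

Here `attained` matches `bound` up to floating-point error, and `values.max()` does not exceed `bound`.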

We are now ready to prove Proposition 1.

Proof of Proposition 1.

The result follows from the definition of the support function by applying Lemma A1 and the reproducing property as follows:

\begin{matrix} δ_{C^{⋆}}^{⋆} (g) & = sup_{μ \in C^{⋆}} {〈g, μ〉}_{H_{k}} \\ = {〈g, μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} + sup_{μ \in C^{⋆}} {〈g, μ〉}_{H_{k}} - {〈g, μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} \\ = {〈g, μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} + sup_{μ \in C^{⋆}} {〈g, μ - μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} \\ \overset{(1)}{=} E_{{DP}_{ξ_{1 : n}}} [E_{P_{θ_{k} (Q)}} [g (x)]] + sup_{∥ μ - μ_{P_{n}^{pred (NPL)}} ∥_{k} \leq ϵ} {〈g, μ - μ_{P_{n}^{pred (NPL)}}〉}_{H_{k}} \\ \overset{(2)}{=} E_{{DP}_{ξ_{1 : n}}} [E_{P_{θ_{k} (Q)}} [g (x)]] + ϵ {∥ g ∥}_{k} . \end{matrix}

Equality (1) follows from the reproducing property and the definition of a kernel mean embedding by noting that:

\begin{matrix} μ_{P_{n}^{pred (NPL)}} : = E_{ξ \sim P_{n}^{pred (NPL)}} [ϕ (ξ)] = E_{Q \sim {DP}_{ξ_{1 : n}}} [E_{ξ \sim P_{θ_{k} (Q)}} [ϕ (ξ)]] = E_{{DP}_{ξ_{1 : n}}} [μ_{P_{θ_{k} (Q)}}] \end{matrix}

(A4)

where

ϕ

denotes the feature map associated with kernel k. Equality (2) follows from Lemma A1. □

Appendix A.2. Proof of Corollary 1

To prove the Corollary it suffices to verify the assumptions of Theorem 1 for our set

C^{⋆}

.

Lemma A2.

C^{⋆}

, as defined in (A1), is closed and convex. Furthermore, ambiguity set

K_{C^{⋆}}

satisfies

ri (K_{C^{⋆}}) \neq \emptyset

, where

\begin{matrix} K_{C^{⋆}} : = {P \in P_{k} (Ξ) : μ_{P} \in C^{⋆}} = {P : D_{k} (P, P_{n}^{pred (NPL)}) \leq ϵ} . \end{matrix}

(A5)

Proof.

Convexity: Convexity follows trivially by the triangle inequality. Let

μ_{1}, μ_{2} \in C^{⋆}

and

λ \in [0, 1]

. Then

\begin{matrix} {∥λ μ_{1} + (1 - λ) μ_{2} - μ_{P_{n}^{pred (NPL)}}∥}_{k} & \leq λ {∥μ_{1} - μ_{P_{n}^{pred (NPL)}}∥}_{k} + (1 - λ) {∥μ_{2} - μ_{P_{n}^{pred (NPL)}}∥}_{k} \\ \leq λ ϵ + (1 - λ) ϵ = ϵ \end{matrix}

where the first inequality follows from the triangle inequality applied to the RKHS norm. Hence,

λ μ_{1} + (1 - λ) μ_{2} \in C^{⋆}

and

C^{⋆}

is convex.

Closedness of

C^{⋆}

: By definition, a closed set is a set whose complement is open, so it suffices to show the set

H_{k} \ C^{⋆}

is open. Again by definition, set

H_{k} \ C^{⋆}

is open if, for all

y \in H_{k} \ C^{⋆}

, there exists

r > 0

such that any point

z \in H_{k}

satisfying

{∥ y - z ∥}_{H_{k}} < r

also belongs to

H_{k} \ C^{⋆}

. Given any

y \in H_{k} \ C^{⋆}

, let

r = ∥ y - μ_{P_{n}^{pred (NPL)}} ∥_{H_{k}} - ϵ

. Observe that

r > 0

because

∥ y - μ_{P_{n}^{pred (NPL)}} ∥_{H_{k}} > ϵ

. Now, for all

z \in H_{k}

such that

{∥ y - z ∥}_{H_{k}} < r

, we can apply the triangle inequality to obtain

∥ y - μ_{P_{n}^{pred (NPL)}} ∥_{H_{k}} \leq ∥ z - μ_{P_{n}^{pred (NPL)}} ∥_{H_{k}} + {∥ z - y ∥}_{H_{k}} .

After rearranging, we have

\begin{matrix} ∥ z - μ_{P_{n}^{pred (NPL)}} ∥_{H_{k}} & \geq ∥ y - μ_{P_{n}^{pred (NPL)}} ∥_{H_{k}} - {∥ z - y ∥}_{H_{k}} \\ > ∥ y - μ_{P_{n}^{pred (NPL)}} ∥_{H_{k}} - r = ϵ . \end{matrix}

Therefore,

∥ z - μ_{P_{n}^{pred (NPL)}} ∥_{H_{k}} > ϵ

, so z lies in set

H_{k} \ C^{⋆}

, which concludes our claim that

H_{k} \ C^{⋆}

is open and proves that

C^{⋆}

is closed.

Non-empty relative interior of

K_{C^{⋆}}

: It suffices to prove that

K_{C^{⋆}}

is non-empty and convex. Convexity follows from an analogous proof of convexity of

C^{⋆}

since MMD satisfies the triangle inequality.

K_{C^{⋆}}

is non-empty since, for any

ϵ > 0

,

P_{n}^{pred (NPL)} \in K_{C^{⋆}}

. □

Appendix A.3. Proof of Theorem 2

We first provide a Lemma adapted from the results of Dellaporta et al. [20] which bounds the expected MMD (under a finite DGP sample) between the DGP and a sample from the approximate DP posterior. This is necessary to quantify the generalisation error between the DGP and the obtained model

P_{θ_{k} (Q)}

for a DP sample

Q

.

Lemma A3.

Let

Q \sim {\hat{DP}}_{ξ_{1 : n}}

be defined as in (17). Assume that

ξ_{1 : n} \overset{iid}{\sim} P^{⋆}

and that the kernel k is such that

| k (ξ, ξ^{'}) | \leq K

,

K < \infty

, for any

ξ, ξ^{'} \in Ξ

. Then

E_{ξ_{1 : n} \overset{iid}{\sim} P^{⋆}} [E_{{\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P^{⋆}, Q)]] \leq \sqrt{\frac{K}{n}} + \sqrt{\frac{2 K (n - 1) + c (c + 1)}{(c + n) (c + n + 1)}} + \sqrt{\frac{K c (c + 1)}{(c + n) (c + n + 1)}} .

(A6)

Proof.

The result follows directly from Lemmas 6 and 11 of Dellaporta et al. [20] with some small modifications. First notice that by the triangle inequality

\begin{matrix} D_{k} (P^{⋆}, Q) \leq D_{k} (P^{⋆}, {\hat{P}}_{n}) + D_{k} ({\hat{P}}_{n}, Q) \end{matrix}

(A7)

where

{\hat{P}}_{n} : = \frac{1}{n} \sum_{i = 1}^{n} δ_{ξ_{i}}

. By Dellaporta et al. [20], Lemma 6, we have that

\begin{matrix} E_{ξ_{1 : n} \overset{iid}{\sim} P^{⋆}} [D_{k} (P^{⋆}, {\hat{P}}_{n})] \leq \sqrt{\frac{K}{n}} . \end{matrix}

(A8)

Moreover, by Dellaporta et al. [20], Lemma 11, we have that:

\begin{matrix} E_{ξ_{1 : n} \overset{iid}{\sim} P^{⋆}} [E_{Q \sim {\hat{DP}}_{ξ_{1 : n}}} [D_{k} ({\hat{P}}_{n}, Q)]] & \leq \sqrt{\frac{2 K (n - 1) + c (c + 1)}{(c + n) (c + n + 1)}} + \sqrt{\frac{K c (c + 1)}{(c + n) (c + n + 1)}} . \end{matrix}

(A9)

Taking expectations and using Equations (A8) and (A9) in (A7) we obtain the required result:

\begin{matrix} E_{ξ_{1 : n} \overset{iid}{\sim} P^{⋆}} [E_{Q \sim {\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P^{⋆}, Q)]] & \leq E_{ξ_{1 : n} \overset{iid}{\sim} P^{⋆}} [D_{k} (P^{⋆}, {\hat{P}}_{n})] + E_{ξ_{1 : n} \overset{iid}{\sim} P^{⋆}} [E_{Q \sim {\hat{DP}}_{ξ_{1 : n}}} [D_{k} ({\hat{P}}_{n}, Q)]] \\ \leq \sqrt{\frac{K}{n}} + \sqrt{\frac{2 K (n - 1) + c (c + 1)}{(c + n) (c + n + 1)}} + \sqrt{\frac{K c (c + 1)}{(c + n) (c + n + 1)}} . \end{matrix}

(A10)

□

Based on this bound we can now provide a result in probability.

Lemma A4.

Assume that

ξ_{1 : n} \overset{iid}{\sim} P^{⋆}

and that kernel k is such that

| k (ξ, ξ^{'}) | \leq K

,

K < \infty

, for any

ξ, ξ^{'} \in Ξ

. Then with probability

1 - δ

we have:

\begin{matrix} E_{Q \sim {\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P^{⋆}, Q)] & \leq \sqrt{\frac{K}{n}} + \sqrt{\frac{2 K (n - 1) + c (c + 1)}{(c + n) (c + n + 1)}} \\ + \sqrt{\frac{K c (c + 1)}{(c + n) (c + n + 1)}} + \frac{\sqrt{2 n K (log \frac{1}{δ})}}{c + n} . \end{matrix}

(A11)

Proof.

We follow the proof technique of Lemma 1 of Briol et al. [30] based on McDiarmid’s inequality [50]. First, notice that by definition of

{\hat{DP}}_{ξ_{1 : n}}

in (17) we can re-write the objective as

\begin{matrix} E_{Q \sim {\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P^{⋆}, Q)] = E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}})]] \end{matrix}

(A12)

where

{\hat{Q}}_{ξ_{1 : n}} = \sum_{i = 1}^{n} w_{i} δ_{ξ_{i}} + \sum_{k = 1}^{τ} {\tilde{w}}_{k} δ_{{\tilde{ξ}}_{k}} \sim {\hat{DP}}_{ξ_{1 : n}}

and

E_{w \sim Dir}

denotes the expectation under

Dir (1, \dots 1, \frac{c}{τ}, \dots, \frac{c}{τ})

. Let

h (ξ_{1}, \dots, ξ_{n}) : = E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}})]]

. Then for all

{ξ_{i}}_{i = 1}^{n}, ξ_{i}^{'} \in Ξ

and

ξ_{1 : n}^{'} : = {ξ_{1}, \dots, ξ_{i - 1}, ξ_{i}^{'}, ξ_{i + 1}, \dots, ξ_{n}}

we have

\begin{matrix} |h (ξ_{1}, \dots, ξ_{i - 1}, ξ_{i}, ξ_{i + 1}, \dots, ξ_{n}) - h (ξ_{1}, \dots, ξ_{i - 1}, ξ_{i}^{'}, ξ_{i + 1}, \dots, ξ_{n})| \\ = |E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}})]] - E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}^{'}})]]| \\ = |E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}}) - D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}^{'}})]]| \\ \leq E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [|D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}}) - D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}^{'}})|]] \\ = E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [|∥ μ_{P^{⋆}} - μ_{{\hat{Q}}_{ξ_{1 : n}}} ∥_{k} - {∥ μ_{P^{⋆}} - μ_{{\hat{Q}}_{ξ_{1 : n}^{'}}} ∥}_{k}|]] \\ \leq E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [∥ μ_{P^{⋆}} - μ_{{\hat{Q}}_{ξ_{1 : n}}} - μ_{P^{⋆}} + μ_{{\hat{Q}}_{ξ_{1 : n}^{'}}} ∥_{k}]] \\ = E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [∥ w_{i} (k (ξ_{i}, \cdot) - k (ξ_{i}^{'}, \cdot)) ∥_{k}]] \\ = E_{w \sim Dir} [w_{i} {∥ k (ξ_{i}, \cdot) - k (ξ_{i}^{'}, \cdot) ∥}_{k}] \\ \leq E_{w \sim Dir} [w_{i}] 2 \sqrt{K} \\ = \frac{2 \sqrt{K}}{c + n} \end{matrix}

(A13)

where in the first inequality we used Jensen’s inequality and in the second we used the reverse triangle inequality. Then by McDiarmid’s inequality [50] we have that for any

ϵ > 0

:

\begin{matrix} P (E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}})]] \\ - E_{ξ_{1 : n} \overset{iid}{\sim} P^{⋆}} [E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}})]]] \geq ϵ) \\ \leq exp (\frac{- 2 ϵ^{2}}{\frac{4 n K}{{(n + c)}^{2}}}) \\ = exp (\frac{- ϵ^{2} {(n + c)}^{2}}{2 n K}) \end{matrix}

(A14)

Let

δ : = exp (\frac{- ϵ^{2} {(n + c)}^{2}}{2 n K})

then it follows that with probability

1 - δ

:

\begin{matrix} E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}})]] \\ \leq E_{ξ_{1 : n} \overset{iid}{\sim} P^{⋆}} [E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}})]]] + ϵ \\ = E_{ξ_{1 : n} \overset{iid}{\sim} P^{⋆}} [E_{w \sim Dir} [E_{{\tilde{ξ}}_{1 : T} \overset{iid}{\sim} F} [D_{k} (P^{⋆}, {\hat{Q}}_{ξ_{1 : n}})]]] + \frac{\sqrt{2 n K (log \frac{1}{δ})}}{n + c} \\ \leq \sqrt{\frac{K}{n}} + \sqrt{\frac{2 K (n - 1) + c (c + 1)}{(c + n) (c + n + 1)}} + \sqrt{\frac{K c (c + 1)}{(c + n) (c + n + 1)}} \\ + \frac{\sqrt{2 n K (log \frac{1}{δ})}}{c + n} \end{matrix}

(A15)

where the last inequality follows from Lemma A3. □
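For readers who want the intermediate algebra written out, the block below spells out how the bounded-difference constant from (A13) enters McDiarmid's inequality and how the definition of δ is inverted to give the last term of (A15). It is plain algebra on the expressions above and introduces no new assumptions; here ϵ is the deviation level in McDiarmid's inequality, not the radius of the ambiguity set.

```latex
\begin{aligned}
\sum_{i=1}^{n} \left(\frac{2\sqrt{K}}{c+n}\right)^{2} &= \frac{4nK}{(n+c)^{2}}, \\
\delta = \exp\left(\frac{-\epsilon^{2}(n+c)^{2}}{2nK}\right)
&\iff \log\frac{1}{\delta} = \frac{\epsilon^{2}(n+c)^{2}}{2nK} \\
&\iff \epsilon = \frac{\sqrt{2nK\log\frac{1}{\delta}}}{n+c} \quad (\epsilon > 0).
\end{aligned}
```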

We are now ready to prove the main result.

Proof of Theorem 2.

First note that

\begin{matrix} D_{k} (P^{⋆}, P_{θ_{k} (Q)}) - inf_{θ \in Θ} D_{k} (P_{θ}, P^{⋆}) & \leq D_{k} (P^{⋆}, Q) + D_{k} (Q, P_{θ_{k} (Q)}) - inf_{θ \in Θ} D_{k} (P_{θ}, P^{⋆}) \\ = D_{k} (P^{⋆}, Q) + inf_{θ \in Θ} D_{k} (P_{θ}, Q) - inf_{θ \in Θ} D_{k} (P_{θ}, P^{⋆}) \\ \leq D_{k} (P^{⋆}, Q) + |inf_{θ \in Θ} D_{k} (P_{θ}, Q) - inf_{θ \in Θ} D_{k} (P_{θ}, P^{⋆})| \\ \leq D_{k} (P^{⋆}, Q) + sup_{θ \in Θ} |D_{k} (P_{θ}, Q) - D_{k} (P_{θ}, P^{⋆})| \\ \leq D_{k} (P^{⋆}, Q) + sup_{θ \in Θ} D_{k} (Q, P^{⋆}) \\ = 2 D_{k} (P^{⋆}, Q) \end{matrix}

(A16)

where we used the triangle inequality in the first inequality, the definition of

θ_{k} (\cdot)

in the first equality and the reversed triangle inequality in the last inequality. The third inequality follows from the fact (mentioned in Briol et al. [30], Theorem 1) that since k is bounded, the family

{inf}_{θ \in Θ} D_{k} (P_{θ}, \cdot)

is uniformly bounded and for bounded functions

f, g

we have

| {inf}_{θ} f (θ) - {inf}_{θ} g (θ) | \leq {sup}_{θ} | f (θ) - g (θ) |

. Hence, by Jensen’s inequality, since the MMD is convex, we have

\begin{matrix} D_{k} (P^{⋆}, {\hat{P}}_{n}^{pred (NPL)}) = D_{k} (P^{⋆}, E_{{\hat{DP}}_{ξ_{1 : n}}} [P_{θ_{k} (Q)}]) \leq E_{{\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P^{⋆}, P_{θ_{k} (Q)})] \end{matrix}

(A17)

and therefore

\begin{matrix} D_{k} (P^{⋆}, {\hat{P}}_{n}^{pred (NPL)}) - inf_{θ \in Θ} D_{k} (P_{θ}, P^{⋆}) & \leq E_{Q \sim {\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P^{⋆}, P_{θ_{k} (Q)})] - inf_{θ \in Θ} D_{k} (P_{θ}, P^{⋆}) \\ \leq 2 E_{Q \sim {\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P^{⋆}, Q)] . \end{matrix}

(A18)

By Lemma A4 it follows that with probability

1 - δ

the advertised result holds. □
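To get a feel for the size of the resulting constant, the sketch below evaluates twice the right-hand side of (A11), which is what (A18) combined with Lemma A4 yields. Treating this quantity as C_{n, M, α} is our reading of the proof (with K the kernel bound and c the DP concentration parameter), and the function name is hypothetical.

```python
import math

def robas_radius_constant(n, K=1.0, c=0.0, delta=0.05):
    """Twice the right-hand side of (A11): a high-probability bound on
    D_k(P*, P_pred) - inf_theta D_k(P_theta, P*).

    n     : number of observations
    K     : upper bound on |k(xi, xi')|
    c     : DP concentration parameter
    delta : failure probability
    """
    denom = (c + n) * (c + n + 1)
    expected = (
        math.sqrt(K / n)
        + math.sqrt((2 * K * (n - 1) + c * (c + 1)) / denom)
        + math.sqrt(K * c * (c + 1) / denom)
    )
    deviation = math.sqrt(2 * n * K * math.log(1 / delta)) / (c + n)
    return 2 * (expected + deviation)
```

For c = 0, quadrupling n roughly halves the returned value, in line with the 1/\sqrt{n} rate noted in Remark 1. For small n the value can be large relative to the range of the MMD, which is consistent with the remark that the theoretical radius may be over-conservative in practice.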

Appendix A.4. Proof of Corollary 3

Recall that

P^{⋆} = (1 - η) P_{θ_{0}} + η Q

for some

θ_{0} \in Θ

,

Q \in P (Ξ)

and

η \in [0, 1]

. Corollary 3 is an immediate consequence of Lemma 3.3 of Chérief-Abdellatif and Alquier [36] which states that for any

θ \in Θ

:

\begin{matrix} | D_{k} (P_{θ}, P^{⋆}) - D_{k} (P_{θ}, P_{θ_{0}}) | \leq 2 η . \end{matrix}

(A19)

Using this and Theorem 2 we can prove the advertised result as follows:

Proof of Corollary 3.

It follows by (A19) that:

\begin{matrix} D_{k} (P_{θ_{k} (Q)}, P_{θ_{0}}) \leq 2 η + D_{k} (P_{θ_{k} (Q)}, P^{⋆}) . \end{matrix}

(A20)

and hence

\begin{matrix} E_{{\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P_{θ_{k} (Q)}, P_{θ_{0}})] \leq 2 η + E_{{\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P_{θ_{k} (Q)}, P^{⋆})] . \end{matrix}

(A21)

We then have:

\begin{matrix} D_{k} (P_{θ_{0}}, {\hat{P}}_{n}^{pred (NPL)}) & \leq E_{{\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P_{θ_{0}}, P_{θ_{k} (Q)})] \\ \leq 2 E_{{\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P_{θ_{0}}, Q)] + inf_{θ} D_{k} (P_{θ_{0}}, P_{θ}) \\ = 2 E_{{\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P_{θ_{0}}, Q)] \\ \leq 4 η + 2 E_{{\hat{DP}}_{ξ_{1 : n}}} [D_{k} (P^{⋆}, Q)] \end{matrix}

and the result follows from Lemma A4. The first inequality follows from Jensen’s inequality as in (A17), the second inequality follows from (A16), the equality holds from the fact that

P_{θ_{0}} \in P_{Θ}

and the last inequality follows from (A21). □

Appendix B. Additional Experimental Details

We implemented the dual problems for DRO-RoBAS (Corollary 1) in Python version 3.11 using CVXPY version 1.5.2 and the MOSEK solver version 10.1.28 following the implementation of Kernel DRO algorithms by Zhu et al. [6]. We used 200 discretisation points for the constraint, as described in Computation of (15). For the implementation of the NPL-MMD posterior, we follow the median heuristic suggested by Gretton et al. [29] and the ADAM optimiser [51] with learning rate

h = 0.1

. For the DP approximation we used

α = 0

and truncation limit

τ = 100

. For all Bayesian methods we use

N = 900

total Monte Carlo samples to approximate the expectations in the optimisation objectives.
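As a rough illustration of the quantities involved, the sketch below estimates the MMD between two samples with a Gaussian kernel whose lengthscale is set by a median heuristic. It is not the authors' implementation: the helper names are hypothetical, the V-statistic estimator is one of several possible choices, and taking the lengthscale to be the median pairwise distance is one common variant of the heuristic.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a (n, d) and b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def median_heuristic(x):
    """Lengthscale set to the median distance between distinct sample points."""
    d2 = pairwise_sq_dists(x, x)
    upper = d2[np.triu_indices(len(x), k=1)]
    return np.sqrt(np.median(upper))

def mmd(x, y, lengthscale):
    """V-statistic estimate of the MMD with a Gaussian kernel (bounded by 1)."""
    def k(a, b):
        return np.exp(-pairwise_sq_dists(a, b) / (2 * lengthscale ** 2))
    mmd_sq = k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()
    return np.sqrt(max(mmd_sq, 0.0))
```

A typical call would be `mmd(x, y, median_heuristic(x))` with `x` and `y` arrays of shape (n, d) and (m, d). The Gaussian kernel satisfies the boundedness condition used in Lemmas A3 and A4 with K = 1.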

The out-of-sample mean and variance were calculated as:

\begin{matrix} m (ϵ) & = \frac{1}{J T} \sum_{j = 1}^{J} \sum_{t = 1}^{T} f (x_{ϵ}^{(j)}, ξ_{n + t}), \\ v (ϵ) & = \frac{1}{J T - 1} \sum_{j = 1}^{J} \sum_{t = 1}^{T} {(f (x_{ϵ}^{(j)}, ξ_{n + t}) - m (ϵ))}^{2}, \end{matrix}

(A22)

where

x_{ϵ}^{(j)}

is the obtained solution on training dataset

ξ_{1 : n}^{(j)}

and we set

T = 50

for the number of test observations.
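The pooled statistics in (A22) can be computed directly from a J × T array of realised costs. The sketch below assumes such an array is available (the function name and the array layout are ours) and uses the 1 / (JT − 1) normalisation of (A22).

```python
import numpy as np

def out_of_sample_stats(costs):
    """costs[j, t] = f(x_eps^(j), xi_{n+t}) for J repetitions and T test points.

    Returns the pooled mean m(eps) and the pooled variance v(eps) with the
    1 / (J*T - 1) normalisation of (A22).
    """
    costs = np.asarray(costs, dtype=float)
    m = costs.mean()
    v = np.sum((costs - m) ** 2) / (costs.size - 1)
    return m, v
```

With J = 100 and T = 50 as in the experiments, `costs` would have shape (100, 50), and one such pair (m, v) would be computed per value of ϵ.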

Detailed results on the average solve and sampling times and associated standard deviations are provided in Table A1 and Table A2.

Data-Generating Process Settings

For the Newsvendor problem, we first considered a Gaussian distribution with known variance, i.e.,

P_{θ} : = N (θ, σ^{2} I_{D \times D})

while the DGP is a bimodal Gaussian distribution:

P^{⋆} : = 0.5 N (θ_{1}^{⋆}, σ^{2} I_{D \times D}) + 0.5 N (θ_{2}^{⋆}, σ^{2} I_{D \times D})

. We considered a univariate case (

D = 1

) with

(θ_{1}^{⋆}, θ_{2}^{⋆}, σ) = (10, 60, 5)

and a multivariate case (

D = 5

) with

θ_{1}^{⋆} = (10, 20, 33, 22, 25)

,

θ_{2}^{⋆} = (60, 60, 60, 60, 60)

and

σ^{2} = 5

. Furthermore, we explore the same Gaussian model

P_{θ} : = N (θ, σ^{2} I_{D \times D})

with a contaminated Gaussian training DGP:

P_{train}^{⋆} : = (1 - η) N (θ^{⋆}, σ^{2}) + η N (θ^{'}, σ^{2})

where we set

(θ^{⋆}, θ^{'}, σ) = (25, 75, 5)

for

η \in {0.0, 0.1, 0.2}

and an Exponential model

P_{θ} : = Exp (θ)

with a contaminated Exponential DGP for the training DGP:

P_{train}^{⋆} : = (1 - η) Exp (θ^{⋆}) + η N (μ, σ)

with

(θ^{⋆}, μ, σ) = (0.05, 100, 0.5)

for

η \in {0.0, 0.1, 0.2}

.

Table A1. Average (standard deviation) solve time in seconds of algorithms across different DGPs.

DGP	RoBAS	Empirical MMD	KL-BDRO	DRO-BAS_PE	DRO-BAS_PP
1D $N$ bimodal	275.33 (25.98)	2.17 (0.25)	3.90 (0.55)	0.29 (0.03)	0.30 (0.05)
5D $N$ bimodal	262.94 (31.72)	2.18 (0.36)	2.63 (0.26)	1.24 (0.19)	1.38 (0.29)
$N, η = 0.0$	286.27 (16.14)	2.22 (0.28)	4.01 (0.49)	0.29 (0.04)	0.30 (0.04)
$N, η = 0.1$	284.62 (17.42)	2.23 (0.24)	3.99 (0.42)	0.29 (0.03)	0.30 (0.04)
$N, η = 0.2$	260.67 (35.90)	2.20 (0.24)	3.74 (0.52)	0.28 (0.03)	0.28 (0.04)
Exp, $η = 0.0$	311.74 (37.48)	5.59 (0.68)	0.43 (0.06)	0.40 (0.06)	0.49 (0.06)
Exp, $η = 0.1$	296.65 (46.10)	5.39 (0.54)	0.44 (0.06)	0.42 (0.05)	0.50 (0.06)
Exp, $η = 0.2$	282.55 (53.23)	5.22 (0.47)	0.44 (0.06)	0.42 (0.05)	0.50 (0.06)

Table A2. Average (standard deviation) sampling time in seconds of algorithms across different DGPs.

DGP	RoBAS	KL-BDRO	DRO-BAS_PE	DRO-BAS_PP
1D $N$ bimodal	15.65 (4.76)	0.001 (0.002)	0.0002 (0.0)	0.0002 (0.0)
5D $N$ bimodal	15.86 (4.84)	0.0104 (0.0004)	0.0006 (0.0003)	0.001 (0.0004)
$N, η = 0.0$	15.85 (4.67)	0.0009 (0.002)	0.0002 (0.0)	0.0002 (0.0)
$N, η = 0.1$	15.78 (4.59)	0.0009 (0.0017)	0.0002 (0.0)	0.0002 (0.0)
$N, η = 0.2$	15.68 (4.58)	0.0008 (0.0014)	0.0002 (0.0)	0.0002 (0.0)
Exp, $η = 0.0$	38.10 (5.62)	0.1 (0.01)	0.0 (0.0)	0.0 (0.0)
Exp, $η = 0.1$	37.77 (4.78)	0.1 (0.01)	0.0 (0.0)	0.0 (0.0)
Exp, $η = 0.2$	37.82 (4.85)	0.1 (0.01)	0.0 (0.0)	0.0 (0.0)

Appendix C. Alternative RoBAS Formulations

We now discuss alternative RoBAS formulations analogous to those provided in Dellaporta et al. [12] for the DRO-BAS framework. In particular, in the DRO-BAS case the authors explored two distinct formulations of ambiguity sets: the BAS_PP (3), mirroring the definition of

B_{ϵ}^{k} (P_{n}^{pred (NPL)})

, and the BAS_PE (4) which is based on a posterior expectation by considering the expected KL distance to the model family.

In the case of Robust Bayesian Ambiguity Sets, an interesting connection can be made between

B_{ϵ}^{k} (P_{n}^{pred (NPL)})

and the ambiguity set based on the expected squared MMD, namely:

\begin{matrix} A_{ϵ}^{k} (DP (c^{'}, F^{'})) = \{P \in P_{k} (Ξ) : E_{Q \sim DP (c^{'}, F^{'})} [D_{k}^{2} (P, P_{θ_{k} (Q)})] \leq ϵ\} . \end{matrix}

(A23)

The squared MMD has previously been used in Generalised Bayesian Inference to define a loss function by Chérief-Abdellatif and Alquier [49]. By using the properties of the MMD, we can prove the following equivalence:

Lemma A5.

Let

E_{{DP}_{ξ_{1 : n}}}

denote the expectation

E_{Q \sim DP (c^{'}, F^{'})}

and suppose:

\begin{matrix} v (DP) & : = E_{{DP}_{ξ_{1 : n}}} [{〈μ_{P_{θ_{k} (Q)}}, μ_{P_{θ_{k} (Q)}}〉}_{H_{k}}] - {〈E_{{DP}_{ξ_{1 : n}}} [μ_{P_{θ_{k} (Q)}}], E_{{DP}_{ξ_{1 : n}}} [μ_{P_{θ_{k} (Q)}}]〉}_{H_{k}} \\ & = E_{{DP}_{ξ_{1 : n}}} [{∥μ_{P_{θ_{k} (Q)}}∥}_{H_{k}}^{2}] - {∥E_{{DP}_{ξ_{1 : n}}} [μ_{P_{θ_{k} (Q)}}]∥}_{H_{k}}^{2} \\ & < \infty . \end{matrix}

(A24)

Then for any

P \in P_{k} (Ξ)

:

\begin{matrix} E_{{DP}_{ξ_{1 : n}}} [D_{k}^{2} (P, P_{θ_{k} (Q)})] = D_{k}^{2} (P, P_{n}^{pred (NPL)}) + v (DP) \end{matrix}

(A25)

and for any

ϵ \geq 0

:

\begin{matrix} B_{ϵ}^{k} (P_{n}^{pred (NPL)}) \equiv A_{ϵ^{2} + v (DP)}^{k} (DP (c^{'}, F^{'})) . \end{matrix}

(A26)

Proof of Lemma A5.

The proof is based on the reproducing property of the RKHS. Recall that

E_{{DP}_{ξ_{1 : n}}}

denotes the expectation

E_{Q \sim DP (c^{'}, F^{'})}

. For any

P \in P_{k} (Ξ)

and

ϵ \geq 0

we have:

\begin{matrix} E_{{DP}_{ξ_{1 : n}}} [D_{k}^{2} (P, P_{θ_{k} (Q)})] \\ = {〈μ_{P}, μ_{P}〉}_{H_{k}} - 2 E_{{DP}_{ξ_{1 : n}}} [{〈μ_{P}, μ_{P_{θ_{k} (Q)}}〉}_{H_{k}}] + E_{{DP}_{ξ_{1 : n}}} [{〈μ_{P_{θ_{k} (Q)}}, μ_{P_{θ_{k} (Q)}}〉}_{H_{k}}] \\ = {〈μ_{P}, μ_{P}〉}_{H_{k}} - 2 E_{{DP}_{ξ_{1 : n}}} [{〈μ_{P}, μ_{P_{θ_{k} (Q)}}〉}_{H_{k}}] + {〈E_{{DP}_{ξ_{1 : n}}} [μ_{P_{θ_{k} (Q)}}], E_{{DP}_{ξ_{1 : n}}} [μ_{P_{θ_{k} (Q)}}]〉}_{H_{k}} \\ - {〈E_{{DP}_{ξ_{1 : n}}} [μ_{P_{θ_{k} (Q)}}], E_{{DP}_{ξ_{1 : n}}} [μ_{P_{θ_{k} (Q)}}]〉}_{H_{k}} + E_{{DP}_{ξ_{1 : n}}} [{〈μ_{P_{θ_{k} (Q)}}, μ_{P_{θ_{k} (Q)}}〉}_{H_{k}}] \\ = D_{k}^{2} (P, E_{{DP}_{ξ_{1 : n}}} [P_{θ_{k} (Q)}]) + v (DP) . \end{matrix}

(A27)

This proves the statement of Equation (A25). To prove the statement of Equation (A26) recall that

\begin{matrix} B_{ϵ}^{k} (P_{n}^{pred (NPL)}) & : = \{P \in P_{k} (Ξ) : D_{k} (P, P_{n}^{pred (NPL)}) \leq ϵ\} \end{matrix}

(A28)

\begin{matrix} = \{P \in P_{k} (Ξ) : D_{k} (P, E_{{DP}_{ξ_{1 : n}}} [P_{θ_{k} (Q)}]) \leq ϵ\} . \end{matrix}

(A29)

Using (A27) we have that for

ϵ \geq 0

and any

P \in P_{k} (Ξ)

:

\begin{matrix} D_{k} (P, E_{{DP}_{ξ_{1 : n}}} [P_{θ_{k} (Q)}]) \leq ϵ \end{matrix}

(A30)

\begin{matrix} \Leftrightarrow & D_{k}^{2} (P, E_{{DP}_{ξ_{1 : n}}} [P_{θ_{k} (Q)}]) \leq ϵ^{2} \end{matrix}

(A31)

\begin{matrix} \Leftrightarrow & E_{{DP}_{ξ_{1 : n}}} [D_{k}^{2} (P, P_{θ_{k} (Q)})] \leq ϵ^{2} + v (DP) \end{matrix}

(A32)

\begin{matrix} \Leftrightarrow & P \in A_{ϵ^{2} + v (DP)}^{k} (DP (c^{'}, F^{'})) . \end{matrix}

(A33)

Hence,

B_{ϵ}^{k} (P_{n}^{pred (NPL)}) \equiv A_{ϵ^{2} + v (DP)}^{k} (DP (c^{'}, F^{'}))

, which completes the proof. □
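Identity (A25) and the equivalence (A26) can be checked numerically on empirical measures. The sketch below is illustrative only: the Gaussian kernel, its lengthscale, the sample sizes, the tolerance `eps`, and the use of Gaussian draws as stand-ins for P and for each P_{θ_k(Q)} are all assumptions and not the experimental setup of the paper. Because V-statistic estimates are exact RKHS norms of empirical embeddings, and the pooled sample of equally sized draws is exactly their equal-weight mixture, both sides should agree up to floating-point error.

```python
import numpy as np

def inner(x, y, lengthscale=1.0):
    """RKHS inner product <mu_X, mu_Y> of two uniform empirical measures (Gaussian kernel)."""
    d = x[:, None] - y[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2).mean()

def mmd2(x, y):
    """Squared MMD between empirical measures (V-statistic form)."""
    return inner(x, x) - 2.0 * inner(x, y) + inner(y, y)

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a candidate P and J posterior draws theta_k(Q^(j)).
p = rng.normal(0.5, 1.0, size=40)
thetas = rng.normal(0.0, 0.3, size=25)
models = [rng.normal(t, 1.0, size=30) for t in thetas]  # samples from each P_theta
pred = np.concatenate(models)  # equal-weight mixture, playing the role of the predictive

# Left-hand side of (A25): posterior expectation of the squared MMD.
lhs = np.mean([mmd2(p, m) for m in models])

# v(DP) from (A24): E||mu||^2 - ||E mu||^2.
v = np.mean([inner(m, m) for m in models]) - inner(pred, pred)

# Right-hand side of (A25).
rhs = mmd2(p, pred) + v

# Membership in the two sets of (A26) for an arbitrary tolerance.
eps = 0.3
in_B = mmd2(p, pred) <= eps ** 2
in_A = lhs <= eps ** 2 + v

print(f"lhs={lhs:.6f}, rhs={rhs:.6f}, v={v:.6f}, in_B={in_B}, in_A={in_A}")
```

The membership flags `in_B` and `in_A` coincide for any choice of `eps` away from the floating-point boundary, which is the content of (A26).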

The constant

v (DP)

defined in (A24) is an effective variance term of the kernel mean embedding

μ_{P_{θ_{k} (Q)}}

in the RKHS under the nonparametric posterior

DP \equiv DP (c^{'}, F^{'})

. It follows directly from Lemma A5 that the primal DRO-RoBAS problem in (11) is equivalent to the following optimisation problem:

\begin{matrix} min_{x \in X} sup_{P : E_{{DP}_{ξ_{1 : n}}} [D_{k}^{2} (P, P_{θ_{k} (Q)})] \leq ϵ^{2} + v (DP)} E_{ξ \sim P} [f_{x} (ξ)] \end{matrix}

(A34)

or equivalently,

\begin{matrix} min_{x \in X} sup_{P \in A_{ϵ^{2} + v (DP)}^{k} (DP (c^{'}, F^{'}))} E_{ξ \sim P} [f_{x} (ξ)] . \end{matrix}

(A35)

Since these two problems are equivalent, we refer to both of them as DRO-RoBAS.
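The reading of v (DP) as a variance can be made explicit with a short calculation. The following is a sketch that uses only the definition (A24) and the identity between the embedding of the predictive and the posterior mean of the model embeddings used in (A29). It assumes that expectation and inner product can be exchanged (for instance under Bochner integrability of the embedding) and that the predictive belongs to P_{k} (Ξ); neither assumption is stated explicitly above.

```latex
\begin{aligned}
E_{DP_{\xi_{1:n}}}\!\left[ D_k^2\big(P_n^{pred(NPL)}, P_{\theta_k(Q)}\big) \right]
&= E_{DP_{\xi_{1:n}}}\!\left[ \big\| \mu_{P_{\theta_k(Q)}} - E_{DP_{\xi_{1:n}}}[\mu_{P_{\theta_k(Q)}}] \big\|_{H_k}^2 \right] \\
&= E_{DP_{\xi_{1:n}}}\!\left[ \| \mu_{P_{\theta_k(Q)}} \|_{H_k}^2 \right]
 - \big\| E_{DP_{\xi_{1:n}}}[\mu_{P_{\theta_k(Q)}}] \big\|_{H_k}^2 \\
&= v(DP) \;\geq\; 0 .
\end{aligned}
```

This is (A25) evaluated at the predictive itself. Under these assumptions, (A25) also shows that the expected squared MMD is at least v (DP) for every P, so the expected squared MMD set of (A23) is empty for any radius below v (DP). The offset in (A26) is therefore the smallest radius at which that set is non-empty.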

An alternative formulation could be defined, based on the expected MMD distance, i.e., by considering the set

\begin{matrix} \{P \in P_{k} (Ξ) : E_{{DP}_{ξ_{1 : n}}} [D_{k} (P, P_{θ_{k} (Q)})] \leq ϵ\} \end{matrix}

(A36)

for tolerance level

ϵ \geq 0

. This mirrors the BAS_PE in Dellaporta et al. [12] and could lead to a differently shaped ambiguity set as it corresponds to an expected MMD ball. However, deriving the dual formulation for the corresponding worst-case DRO problem based on this set is challenging as it involves deriving the convex conjugate of the expected MMD distance

{(E_{{DP}_{ξ_{1 : n}}} [D_{k} (\cdot, P_{θ_{k} (Q)})])}^{*}

or equivalently the support function

δ_{C}^{⋆} (g)

of the corresponding set

C : = \{μ \in H_{k} : E_{Q \sim DP (c^{'}, F^{'})} [{∥μ - μ_{P_{θ_{k} (Q)}}∥}_{H_{k}}] \leq ϵ\}

. This is nontrivial and is not tackled in this work. It is, however, an important direction for future work, as it could open the way to ambiguity sets whose shape itself, rather than just their nominal distribution as in RoBAS, is informed by the posterior.
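Although the dual of the expected MMD set in (A36) is left open, its position relative to the other two sets can be probed numerically. By convexity of the RKHS norm, the MMD to the predictive is at most the expected MMD, and by Jensen's inequality the expected MMD is at most the square root of the expected squared MMD. For a common ϵ this would place the set in (A36) between the expected squared MMD set of radius ϵ^2 and the ball B_{ϵ}^{k}. The sketch below checks both inequalities on random empirical measures. The Gaussian kernel, the sample sizes and the Gaussian stand-ins for P and P_{θ_k(Q)} are illustrative assumptions.

```python
import numpy as np

def gram_mean(x, y, lengthscale=1.0):
    """Inner product of the kernel mean embeddings of two uniform empirical measures."""
    d = x[:, None] - y[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2).mean()

def mmd(x, y):
    """MMD between empirical measures (V-statistic, clipped at zero before the root)."""
    sq = gram_mean(x, x) - 2.0 * gram_mean(x, y) + gram_mean(y, y)
    return np.sqrt(max(sq, 0.0))

rng = np.random.default_rng(1)

# Hypothetical posterior draws, each represented by equally many samples.
thetas = rng.normal(0.0, 0.5, size=20)
models = [rng.normal(t, 1.0, size=30) for t in thetas]
pred = np.concatenate(models)  # equal-weight mixture, standing in for the predictive

checks = []
for _ in range(50):
    # A random candidate distribution P, represented by samples.
    p = rng.normal(rng.uniform(-2.0, 2.0), rng.uniform(0.5, 2.0), size=40)
    dists = np.array([mmd(p, m) for m in models])
    d_pred = mmd(p, pred)                       # D_k(P, predictive)
    exp_mmd = dists.mean()                      # E[D_k(P, P_theta)]
    root_exp_sq = np.sqrt((dists ** 2).mean())  # sqrt(E[D_k^2(P, P_theta)])
    checks.append(bool(d_pred <= exp_mmd + 1e-12 and exp_mmd <= root_exp_sq + 1e-12))

all_hold = all(checks)
print("both inequalities hold for every candidate:", all_hold)
```

The small slack of 1e-12 only guards against floating-point rounding; the inequalities themselves are exact for empirical measures.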

References

1. Black, B. Parametric Distributionally Robust Optimisation Models for Resource and Inventory Planning Problems; Lancaster University: Lancaster, UK, 2022.
2. Li, J.Y.; Kwon, R.H. Portfolio selection under model uncertainty: A penalized moment-based optimization approach. J. Glob. Optim. 2013, 56, 131–164.
3. Zhang, J.; Menon, A.K.; Veit, A.; Bhojanapalli, S.; Kumar, S.; Sra, S. Coping with label shift via distributionally robust optimisation. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
4. Kuhn, D.; Esfahani, P.M.; Nguyen, V.A.; Shafieezadeh-Abadeh, S. Wasserstein Distributionally Robust Optimization: Theory and Applications in Machine Learning. In Operations Research & Management Science in the Age of Analytics; Informs: Catonsville, MD, USA, 2019; Chapter 6; pp. 130–166.
5. Staib, M.; Jegelka, S. Distributionally robust optimization and generalization in kernel methods. Adv. Neural Inf. Process. Syst. 2019, 32, 9134–9144.
6. Zhu, J.J.; Jitkrittum, W.; Diehl, M.; Schölkopf, B. Kernel distributionally robust optimization: Generalized duality theorem and stochastic approximation. In Proceedings of the International Conference on Artificial Intelligence and Statistics; PMLR: Cambridge, MA, USA, 2021; pp. 280–288.
7. Hu, Z.; Hong, L.J. Kullback-Leibler divergence constrained distributionally robust optimization. Available Optim. Online 2013, 1, 9.
8. Iyengar, G.; Lam, H.; Wang, T. Hedging against complexity: Distributionally robust optimization with parametric approximation. In Proceedings of the International Conference on Artificial Intelligence and Statistics; PMLR: Cambridge, MA, USA, 2023; pp. 9976–10011.
9. Michel, P.; Hashimoto, T.; Neubig, G. Modeling the second player in distributionally robust optimization. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
10. Michel, P.; Hashimoto, T.; Neubig, G. Distributionally robust models with parametric likelihood ratios. In Proceedings of the International Conference on Learning Representations, Virtual, 26–28 October 2022.
11. Shapiro, A.; Zhou, E.; Lin, Y. Bayesian distributionally robust optimization. SIAM J. Optim. 2023, 33, 1279–1304.
12. Dellaporta, C.; O’Hara, P.; Damoulas, T. Decision Making under the Exponential Family: Distributionally Robust Optimisation with Bayesian Ambiguity Sets. In Proceedings of the International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2025.
13. Grünwald, P. The safe Bayesian: Learning the learning rate via the mixability gap. In Proceedings of the International Conference on Algorithmic Learning Theory; Springer: Berlin/Heidelberg, Germany, 2012; pp. 169–183.
14. Walker, S.G. Bayesian inference with misspecified models. J. Stat. Plan. Inference 2013, 143, 1621–1633.
15. Lyddon, S.; Walker, S.; Holmes, C.C. Nonparametric learning from Bayesian models with randomized objective functions. Adv. Neural Inf. Process. Syst. 2018, 31, 2075–2085.
16. Fong, E.; Lyddon, S.; Holmes, C. Scalable nonparametric sampling from multimodal posteriors with the posterior bootstrap. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2019; pp. 1952–1962.
17. Rahimian, H.; Mehrotra, S. Distributionally robust optimization: A review. arXiv 2019, arXiv:1908.05659.
18. Husain, H.; Nguyen, V.; van den Hengel, A. Distributionally Robust Bayesian Optimization with φ-divergences. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36.
19. Wang, S.; Wang, H.; Li, X.; Honorio, J. Learning against distributional uncertainty: On the trade-off between robustness and specificity. IEEE J. Sel. Top. Signal Process. 2025, 19, 1420–1435.
20. Dellaporta, C.; Knoblauch, J.; Damoulas, T.; Briol, F.X. Robust Bayesian inference for simulator-based models via the MMD posterior bootstrap. In Proceedings of the International Conference on Artificial Intelligence and Statistics; PMLR: Cambridge, MA, USA, 2022; pp. 943–970.
21. Bissiri, P.G.; Holmes, C.C.; Walker, S.G. A general framework for updating belief distributions. J. R. Stat. Soc. Ser. B Stat. Methodol. 2016, 78, 1103–1130.
22. Ghosh, A.; Basu, A. Robust Bayes estimation using the density power divergence. Ann. Inst. Stat. Math. 2016, 68, 413–437.
23. Knoblauch, J.; Jewson, J.; Damoulas, T. An optimization-centric view on Bayes’ rule: Reviewing and generalizing variational inference. J. Mach. Learn. Res. 2022, 23, 1–109.
24. Matsubara, T.; Knoblauch, J.; Briol, F.X.; Oates, C.J. Robust generalised Bayesian inference for intractable likelihoods. J. R. Stat. Soc. Ser. B Stat. Methodol. 2022, 84, 997–1022.
25. Altamirano, M.; Briol, F.X.; Knoblauch, J. Robust and scalable Bayesian online changepoint detection. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2023; pp. 642–663.
26. Altamirano, M.; Briol, F.X.; Knoblauch, J. Robust and Conjugate Gaussian Process Regression. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2024; pp. 1155–1185.
27. Laplante, W.; Altamirano, M.; Duncan, A.; Knoblauch, J.; Briol, F.X. Robust and conjugate spatio-temporal gaussian processes. arXiv 2025, arXiv:2502.02450.
28. Müller, A. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab. 1997, 29, 429–443.
29. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773.
30. Briol, F.X.; Barp, A.; Duncan, A.B.; Girolami, M. Statistical inference for generative models with maximum mean discrepancy. arXiv 2019, arXiv:1906.05944.
31. Gao, R. Finite-sample guarantees for Wasserstein distributionally robust optimization: Breaking the curse of dimensionality. Oper. Res. 2023, 71, 2291–2306.
32. Chen, Y.; Kim, J.; Anderson, J. Distributionally robust decision making leveraging conditional distributions. In Proceedings of the IEEE Conference on Decision and Control, Cancun, Mexico, 6–9 December 2022; pp. 5652–5659.
33. Romao, L.; Hota, A.R.; Abate, A. Distributionally robust optimal and safe control of stochastic systems via kernel conditional mean embedding. In Proceedings of the IEEE Conference on Decision and Control; IEEE: New York, NY, USA, 2023; pp. 2016–2021.
34. Gao, R.; Kleywegt, A. Distributionally robust stochastic optimization with Wasserstein distance. Math. Oper. Res. 2023, 48, 603–655.
35. Vázquez, F.G.; Rückmann, J.J.; Stein, O.; Still, G. Generalized semi-infinite programming: A tutorial. J. Comput. Appl. Math. 2008, 217, 394–419.
36. Chérief-Abdellatif, B.E.; Alquier, P. Finite sample properties of parametric MMD estimation: Robustness to misspecification and dependence. Bernoulli 2022, 28, 181–213.
37. Alquier, P.; Gerber, M. Universal robust regression via maximum mean discrepancy. Biometrika 2024, 111, 71–92.
38. Seeger, M. Gaussian processes for machine learning. Int. J. Neural Syst. 2004, 14, 69–106.
39. Mohajerin Esfahani, P.; Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Math. Program. 2018, 171, 115–166.
40. Bertsimas, D.; Gupta, V.; Kallus, N. Robust saa. arXiv 2014, arXiv:1408.4445.
41. Huber, P.J. Robust estimation of a location parameter. In Breakthroughs in Statistics: Methodology and Distribution; Springer: Berlin/Heidelberg, Germany, 1992; pp. 492–518.
42. Liu, J.; Shen, Z.; He, Y.; Zhang, X.; Xu, R.; Yu, H.; Cui, P. Towards out-of-distribution generalization: A survey. arXiv 2021, arXiv:2108.13624.
43. Zhai, R.; Dan, C.; Kolter, Z.; Ravikumar, P. Doro: Distributional and outlier robust optimization. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2021; pp. 12345–12355.
44. Nietert, S.; Goldfeld, Z.; Shafiee, S. Outlier-robust wasserstein dro. Adv. Neural Inf. Process. Syst. 2023, 36, 62792–62820.
45. Porteus, E.L. Stochastic inventory theory. Handb. Oper. Res. Manag. Sci. 1990, 2, 605–652.
46. Rahimi, A.; Recht, B. Random features for large-scale kernel machines. Adv. Neural Inf. Process. Syst. 2007, 20, 1177–1184.
47. Bach, F. Sharp analysis of low-rank kernel matrix approximations. In Proceedings of the Conference on Learning Theory; PMLR: Cambridge, MA, USA, 2013; pp. 185–209.
48. Jewson, J.; Smith, J.Q.; Holmes, C. Principles of Bayesian inference using general divergence criteria. Entropy 2018, 20, 442.
49. Chérief-Abdellatif, B.E.; Alquier, P. MMD-Bayes: Robust Bayesian estimation via maximum mean discrepancy. In Proceedings of the Symposium on Advances in Approximate Bayesian Inference; PMLR: Cambridge, MA, USA, 2020; pp. 1–21.
50. McDiarmid, C. On the method of bounded differences. Surv. Comb. 1989, 141, 148–188.
51. Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.

Figure 1. Illustration of (approximated) BAS_PE, BAS_PP and RoBAS (ours) with

P_{θ} = N (μ, σ^{2})

over a grid of

(μ, σ)

pairs for a fixed

ϵ

. In the well-specified case (left), all ambiguity sets include the DGP while RoBAS covers a slightly bigger area than BAS_PE and BAS_PP. For a contaminated dataset (right), RoBAS continues to contain the DGP and maintains a similar area, whereas the BAS formulations exclude it and cover a much larger area of distributions further away from the DGP.

Figure 2. Contaminated Gaussian location example. (Top): Histogram of observed data along with the true (DGP), outlier and pathological densities. (Bottom): Posterior marginal distributions for NPL-MMD (blue) and standard Bayes (orange). The true mean is indicated with a black dotted line. For BAS_PE it holds that

E_{Π_{Bayes}} [d_{KL} (P_{pathological}, P_{θ})] \approx 0.17 < 0.42 \approx E_{Π_{Bayes}} [d_{KL} (P^{⋆}, P_{θ})]

and similarly for BAS_PP it holds that

d_{KL} (P_{pathological}, P_{n}^{pred}) \approx 0.18 < 0.38 \approx d_{KL} (P^{⋆}, P_{n}^{pred})

. In contrast for RoBAS we have

D_{k} (P_{pathological}, P_{n}^{pred (NPL)}) \approx 0.65 > 0.55 \approx D_{k} (P^{⋆}, P_{n}^{pred (NPL)})

. This example is inspired by Figure 1 of Gao and Kleywegt [34].

Figure 3. The out-of-sample mean and variance for the Newsvendor problem with a misspecified Gaussian location model and a bimodal Gaussian DGP. Results are shown for the univariate (

D = 1

, (top)) and the multivariate (

D = 5

, (bottom)) cases, with markers representing

ϵ

values. For illustration purposes, the bottom-left area of the multivariate case is shown in a zoomed-in view.

Figure 4. Out-of-sample mean-variance trade-off in the Newsvendor problem for a Gaussian location model (top) and an Exponential model (bottom) with a contaminated training dataset. Results are shown for contamination levels

η = 0.0

(left),

η = 0.1

(middle), and

η = 0.2

(right). Each marker represents a specific

ϵ

value, with some labelled for reference.

Figure 5. Out-of-sample mean-variance trade-off in the Portfolio problem for a 5D contaminated Gaussian DGP with

η = 0.0, 0.1, 0.2

. Note that the goal is to maximise returns, so larger out-of-sample mean is better.
