1. Introduction
Modern decision-making systems are increasingly deployed under distributional uncertainty, where the data-generating process is subject to shifts, perturbations, or misspecification. This is particularly salient in black-box optimization, where expensive function evaluations are performed under uncertain contextual conditions. Although vanilla Bayesian Optimization (BO) is known to be effective in deterministic or well-characterized stochastic settings [1,2,3], it can exhibit poor out-of-sample performance when the true distribution deviates from prior assumptions. In the Sample Average Approximation (SAA) literature, this phenomenon is often referred to as the “optimizer’s curse”.
Distributionally Robust Optimization (DRO) has emerged as a principled methodology for addressing such uncertainties by optimizing against worst-case scenarios drawn from carefully crafted ambiguity sets built around empirical samples, e.g., historical data. Instead of relying on point estimates of unknown distributions, DRO considers all distributions within a neighbourhood of the empirical distribution, providing performance guarantees that degrade gracefully under shifts of the underlying distributions. The underlying intuition is that, by hedging against distributional uncertainty, decision-makers obtain solutions that perform robustly across a variety of plausible scenarios, even at the cost of some nominal optimality under ideal conditions.
Applying DRO concepts to BO presents a number of characteristic challenges that set it apart from classical robust optimization [4]. To begin with, the black-box nature of expensive objective functions demands an explicit balance of exploration and exploitation while simultaneously coping with distributional uncertainty. Traditional acquisition functions must be fundamentally reconsidered to incorporate robustness considerations without sacrificing the sample efficiency that makes BO attractive for expensive optimization problems. Second, the construction of appropriate ambiguity sets becomes critical in the BO setting, where limited samples must inform both the underlying surrogate model of the black-box objective function and the distributional uncertainty quantification. The choice of distance metric for defining ambiguity sets (whether based on φ-divergences, Wasserstein distances, or moment-based constraints) has a significant impact on both computational tractability and the quality of robust solutions. Recent theoretical work by Ref. [5] shows that worst-case expectation problems can be transformed into finite convex problems for many practical loss functions, which provides a foundation for tractable distributionally robust formulations. Third, the high-dimensional nature of many practical optimization problems exacerbates the curse of dimensionality in distributional uncertainty quantification. As noted by Ref. [6], distance metrics defined in high-dimensional spaces can lead to exceedingly large ambiguity sets, potentially resulting in overly conservative solutions. This observation points to the need for more sophisticated approaches that can capture the geometric structure of data distributions while maintaining computational efficiency.
Current approaches to Distributionally Robust Bayesian Optimization (DRBO) have several limitations that this work addresses. Existing methods usually rely on simple empirical averaging or uniform sampling within ambiguity sets, which fail to capture the underlying geometric structure of the context distribution. This approach can lead to suboptimal representation of distributional uncertainty and, consequently, poor exploration strategies in the robust optimization landscape [7,8].
Furthermore, many existing robust BO methods exhibit excessive conservatism, especially when ambiguity sets are large compared to the available data. The resulting solutions may sacrifice significant performance under nominal conditions to hedge against unlikely worst-case scenarios. The challenge lies in developing acquisition functions that can adaptively balance robustness and performance as more information becomes available through the optimization process. Computational complexity presents another significant barrier, as the nested min–max formulation inherent in distributionally robust optimization introduces substantial computational overhead. Traditional approaches often need to solve complex optimization problems at each iteration, making them impractical for high-dimensional or time-sensitive applications. Recent work by Ref. [9] addressed Wasserstein Distributionally Robust Bayesian Optimization (WDRBO) with continuous contexts, establishing sublinear regret bounds for settings where context distributions are uncertain but lie within Wasserstein ball ambiguity sets. However, these approaches still have difficulty efficiently aggregating distributional information while preserving geometric structure.
This paper introduces a novel methodology that addresses these limitations through the principled application of the Wasserstein Barycenter, efficiently computed via Sinkhorn regularization. Our key insight is that the geometric structure of distributional uncertainty can be effectively captured by the Wasserstein Barycenter, which provides a natural generalization of the notion of a mean distribution to the space of probability measures. The Wasserstein Barycenter, also known as the Fréchet mean in the space of probability distributions endowed with the Wasserstein distance [10], offers a non-linear interpolation mechanism that preserves the geometric characteristics of the constituent distributions. Unlike simple weighted averages that ignore spatial structure, the Wasserstein Barycenter provides a principled approach to aggregating multiple plausible distributions while maintaining their essential geometric properties. Our methodology takes advantage of the computational efficiency of the Sinkhorn algorithm to approximate the Wasserstein Barycenter. While this approach trades traditional distance properties for divergence measures, it provides crucial advantages: computational tractability for practical applications and guaranteed uniqueness of solutions. By combining entropic regularization with iterative scaling techniques, Sinkhorn achieves significant computational speedups while preserving the essential geometric properties required for robust optimization. The resulting Distributionally Robust Sinkhorn Barycenter Upper Confidence Bound acquisition function seamlessly integrates distributional robustness into the BO framework while maintaining the exploration–exploitation balance that characterizes effective acquisition strategies. Our approach incorporates adaptive robustness scheduling and distributional Lipschitz regularization to ensure that the level of conservatism adapts appropriately to the available information and problem structure.
The main contributions of this work can be summarized as follows:
We propose SWBBO, a novel Bayesian Optimization algorithm that integrates Sinkhorn-regularized Wasserstein Barycenters into a robust acquisition strategy, providing geometry-aware modeling of distributional uncertainty in continuous contexts.
We develop a computationally efficient and theoretically grounded acquisition function that combines entropic optimal transport, adaptive robustness scheduling, and distributional Lipschitz regularization, offering strong regret guarantees under distributional shifts.
We demonstrate that SWBBO outperforms existing DRBO methods on both synthetic benchmarks and real-world-inspired optimization tasks, achieving faster convergence and improved robustness under repeated evaluations.
The rest of this paper proceeds as follows. Section 2 covers essential background on DRO and the Wasserstein Barycenter. Section 3 describes the DRBO framework, together with the Sinkhorn-regularized barycenter computation and the acquisition function. Section 4 formalizes the problem setting, introduces our distributional robustness framework, and presents the core methodology. Section 5 presents a comprehensive experimental evaluation, and Section 6 concludes with a discussion of future research directions.
2. Literature Review
2.1. Theoretical Foundations of DRO
The theoretical foundations of Distributionally Robust Optimization (DRO) have advanced significantly in recent years, revealing important connections between robust statistical methods and optimization under distributional uncertainty. In 2022, Rahimian and Mehrotra published a comprehensive survey [11] that thoroughly covers the field, tracing its evolution from classical robust optimization to modern distributionally robust approaches. Their work categorizes different types of ambiguity sets (including moment-based, distance-based, and divergence-based formulations) and examines the trade-offs between robustness guarantees and computational feasibility. Building on this foundation, Ref. [5] demonstrates that many DRO problems can be seen as extensions of robust statistical estimation methods. Rather than simply looking for robust parameter estimates, the focus shifts toward finding decisions that work well across multiple plausible data-generating distributions. The authors in Ref. [12] took these concepts further by developing adaptive DRO frameworks that dynamically adjust ambiguity sets as new data become available. This method addresses the fundamental challenge of balancing robustness with performance in sequential decision-making, providing a way to reduce the excessive conservatism typical of worst-case optimization while maintaining strong out-of-sample guarantees. This work tackles key problems found in traditional stochastic optimization (SO) methods that depend on fixed probability distributions. DRO’s mathematical framework handles uncertainty using three main paradigms. SO represents the first approach, which optimizes expected performance under a known distribution [7]:
$$\min_{x \in X} \; \mathbb{E}_{c \sim P}\left[ f(x, c) \right].$$
Let $x \in X$ denote the decision variables and $c \in C$ denote the context. The function $f(x, c)$ is the black-box objective function to be minimized, which depends on both decision and contextual variables. Decisions are made assuming perfect knowledge of the probability distribution $P$ governing the uncertain parameters $c$. Worst-case Robust Optimization (RO) adopts an extremely conservative approach by optimizing against the worst possible realization:
$$\min_{x \in X} \; \max_{c \in \Delta} f(x, c),$$
where no probabilistic assumptions are made about the uncertainty set $\Delta$. Finally, DRO achieves a balanced compromise by optimizing against the worst-case expectation over all distributions $Q$ within an ambiguity set $\mathcal{U}$:
$$\min_{x \in X} \; \max_{Q \in \mathcal{U}} \; \mathbb{E}_{c \sim Q}\left[ f(x, c) \right].$$
In the above formulations, $f(x, c)$ represents the objective function evaluated at decision $x$ under context $c$; depending on the optimization paradigm, we take expectations over $P$, worst-case realizations over $\Delta$, or worst-case expectations over the ambiguity set $\mathcal{U}$.
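To make the distinction between the three paradigms concrete, the following sketch evaluates the SO, RO, and DRO objectives for a toy analytic stand-in of $f(x, c)$ with a discrete context support and a small, hand-picked ambiguity set. All function names, distributions, and numerical values are illustrative assumptions, not part of the SWBBO implementation.

```python
# Illustrative sketch: SO vs. RO vs. DRO objectives on a toy problem.
import numpy as np

def f(x, c):
    # hypothetical black-box objective; a simple analytic stand-in
    return (x - c) ** 2 + 0.1 * np.sin(5 * x)

contexts = np.array([-1.0, 0.0, 1.0])          # discrete support of the context c
P_hat = np.array([0.2, 0.5, 0.3])              # empirical (nominal) distribution

# A toy ambiguity set: a few perturbed distributions around P_hat
ambiguity_set = [P_hat,
                 np.array([0.4, 0.4, 0.2]),
                 np.array([0.1, 0.4, 0.5])]

def so_objective(x):                           # SO: expectation under P_hat
    return np.dot(P_hat, f(x, contexts))

def ro_objective(x):                           # RO: worst-case context realization
    return np.max(f(x, contexts))

def dro_objective(x):                          # DRO: worst-case expectation over Q in U
    return max(np.dot(Q, f(x, contexts)) for Q in ambiguity_set)

xs = np.linspace(-2, 2, 401)
for name, obj in [("SO", so_objective), ("RO", ro_objective), ("DRO", dro_objective)]:
    vals = np.array([obj(x) for x in xs])
    print(f"{name}: argmin x = {xs[vals.argmin()]:.3f}, value = {vals.min():.3f}")
```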
2.2. Wasserstein Distance and Optimal Transport
In statistics and machine learning, the Wasserstein distance has gained increasing importance, as it allows the comparison of probability distributions of different types while satisfying all the properties of a distance metric under certain assumptions. It has recently been applied to extend BO to complicated (also known as exotic) tasks such as grey-box optimization [13] and multiple information source optimization [14].
The p-Wasserstein distance between probability measures $\alpha$ and $\beta$ in $P(\Omega)$ is defined as [15]
$$W_p(\alpha, \beta) = \left( \inf_{\pi \in \Pi(\alpha, \beta)} \int_{\Omega \times \Omega} d(x, y)^p \, \mathrm{d}\pi(x, y) \right)^{1/p},$$
where $\alpha$ and $\beta$ are probability measures (i.e., distributions) over the space $\Omega$, $\Pi(\alpha, \beta)$ denotes the set of joint probability measures (transport plans) on $\Omega \times \Omega$ having marginals $\alpha$ and $\beta$, and $d(\cdot, \cdot)$ is a ground distance on $\Omega$. Setting $p = 2$ and taking $d$ to be the Euclidean distance leads to the so-called 2-Wasserstein distance, which can be rewritten in Monge form as
$$W_2^2(\alpha, \beta) = \min_{T : T_{\#}\alpha = \beta} \int_{\Omega} \lVert x - T(x) \rVert^2 \, \mathrm{d}\alpha(x),$$
which is intended as an optimal transport problem [10]. This formulation interprets the distance as the minimum cost of transporting probability mass $\alpha$ to match the target probability mass $\beta$, where the symbol $\#$ denotes the push-forward operator ensuring that $\alpha$ matches $\beta$ under the transport map $T$.
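As an illustration of the discrete (Kantorovich) counterpart of this definition, the following sketch computes the 2-Wasserstein distance between two empirical point clouds with the POT library (assumed to be installed via `pip install pot`); the sample sizes and Gaussian parameters are arbitrary choices for the example.

```python
# Sketch: 2-Wasserstein distance between two discrete measures with POT.
import numpy as np
import ot  # Python Optimal Transport

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(50, 2))        # support of alpha
xt = rng.normal(2.0, 0.5, size=(60, 2))        # support of beta
a = np.full(50, 1 / 50)                        # uniform weights (marginal alpha)
b = np.full(60, 1 / 60)                        # uniform weights (marginal beta)

M = ot.dist(xs, xt, metric="sqeuclidean")      # cost matrix c(x, y) = ||x - y||^2
W2_squared = ot.emd2(a, b, M)                  # optimal transport cost (W_2^2)
print("W2 distance:", np.sqrt(W2_squared))
```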
For Gaussian distributions, that is, $\alpha = \mathcal{N}(\mu_1, \Sigma_1)$ and $\beta = \mathcal{N}(\mu_2, \Sigma_2)$, the 2-Wasserstein distance simplifies to [9]
$$W_2^2(\alpha, \beta) = \lVert \mu_1 - \mu_2 \rVert_2^2 + \mathbb{B}^2(\Sigma_1, \Sigma_2),$$
where $\mathbb{B}^2(\Sigma_1, \Sigma_2)$ represents the (squared) Bures metric between positive definite matrices [16,17]:
$$\mathbb{B}^2(\Sigma_1, \Sigma_2) = \operatorname{tr}\!\left( \Sigma_1 + \Sigma_2 - 2\left( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \right)^{1/2} \right).$$
Thus, in the case of centered Gaussian distributions (i.e., $\mu_1 = \mu_2 = 0$), the 2-Wasserstein distance reduces to the Bures metric. Moreover, if $\Sigma_1$ and $\Sigma_2$ are diagonal, the Bures metric is the Hellinger distance, while in the commutative case, that is, $\Sigma_1 \Sigma_2 = \Sigma_2 \Sigma_1$, the Bures metric is equal to the Frobenius norm $\lVert \Sigma_1^{1/2} - \Sigma_2^{1/2} \rVert_F$.
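A short sketch of this closed form, assuming SciPy for the matrix square root; the example means and covariances are arbitrary.

```python
# Sketch: closed-form 2-Wasserstein distance between two Gaussians via the Bures metric.
import numpy as np
from scipy.linalg import sqrtm

def bures_squared(S1, S2):
    S1_half = sqrtm(S1)
    cross = sqrtm(S1_half @ S2 @ S1_half)
    return float(np.trace(S1 + S2 - 2 * np.real(cross)))

def w2_gaussian(m1, S1, m2, S2):
    return np.sqrt(np.sum((m1 - m2) ** 2) + bures_squared(S1, S2))

m1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
m2, S2 = np.array([1.0, -1.0]), np.array([[0.5, 0.0], [0.0, 2.0]])
print("W2(N1, N2) =", w2_gaussian(m1, S1, m2, S2))
```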
When dealing with general probability distributions beyond the Gaussian case, computing the Wasserstein distance becomes much more difficult and requires solving the Monge or Kantorovich optimal transport problems. The Kantorovich formulation turns this into a linear program whose cost scales roughly as $O(n^3 \log n)$ for discrete measures on $n$ support points, which makes exact computation impractical for large-scale applications. This computational burden has motivated the development of entropic regularization techniques that approximate the optimal transport solution while maintaining computational tractability.
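For small instances, the Kantorovich linear program can be written out explicitly; the sketch below does so with SciPy's `linprog`, purely for illustration (practical solvers rely on specialized network-flow or entropic methods, and the grid sizes here are deliberately tiny).

```python
# Sketch: the Kantorovich formulation as an explicit linear program.
import numpy as np
from scipy.optimize import linprog

n, m = 20, 20
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, n))
y = np.sort(rng.uniform(0, 1, m))
a = np.full(n, 1 / n)                              # source marginal
b = np.full(m, 1 / m)                              # target marginal
C = (x[:, None] - y[None, :]) ** 2                 # squared-distance cost

# Constraints: row sums of the transport plan equal a, column sums equal b.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0               # sum_j P_ij = a_i
for j in range(m):
    A_eq[n + j, j::m] = 1.0                        # sum_i P_ij = b_j

res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs")
print("Exact W2^2 (LP):", res.fun)
```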
The computational tractability of Wasserstein-based methods has been revolutionized through entropic regularization approaches, most notably the introduction of Sinkhorn distances [18], which enable rapid computation of optimal transport through entropic penalties. Recent complexity analyses have improved our understanding of entropic regularized algorithms for optimal transport between discrete probability measures, establishing refined convergence bounds and computational guarantees [19]. The extension to Wasserstein Barycenter computation through fast Sinkhorn-based algorithms [15] has established the algorithmic foundations for practical implementation, maintaining essential geometric properties while delivering significant computational speedups.
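A minimal sketch of the Sinkhorn iterations for entropic optimal transport between two 1-D histograms; this from-scratch illustration, under simple assumptions on the regularization value and iteration count, is not the optimized solver used in our experiments.

```python
# Sketch: Sinkhorn iterations for entropically regularized optimal transport.
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iters=500):
    K = np.exp(-C / eps)              # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)               # scale rows to match marginal a
        v = b / (K.T @ u)             # scale columns to match marginal b
    P = u[:, None] * K * v[None, :]   # regularized transport plan
    return P, float(np.sum(P * C))    # plan and (approximate) transport cost

n = 50
x = np.linspace(0, 1, n)
a = np.exp(-((x - 0.3) ** 2) / 0.01); a /= a.sum()
b = np.exp(-((x - 0.7) ** 2) / 0.02); b /= b.sum()
C = (x[:, None] - x[None, :]) ** 2
P, cost = sinkhorn(a, b, C)
print("Entropic OT cost:", cost)
```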
2.3. Wasserstein Barycenters and Geometric Averaging
The mathematical foundations for geometric averaging of probability distributions are rooted in optimal transport theory and the concept of Wasserstein Barycenters.
Given a set of probability measures $\{\nu_i\}_{i=1}^N$ in $P(\Omega)$ with associated weights $\{\lambda_i\}_{i=1}^N$, $\lambda_i \ge 0$ and $\sum_{i=1}^N \lambda_i = 1$, their Wasserstein Barycenter $\bar{\nu}$ is defined as
$$\bar{\nu} = \arg\min_{\nu \in P(\Omega)} \sum_{i=1}^N \lambda_i \, W_2^2(\nu, \nu_i),$$
where $\bar{\nu}$ is uniquely defined by the weights $\lambda_1, \dots, \lambda_N$; thus, different weights lead to a different Wasserstein Barycenter. The most common situation is to consider equally weighted probability measures, that is, $\lambda_i = 1/N$, leading to
$$\bar{\nu} = \arg\min_{\nu \in P(\Omega)} \frac{1}{N} \sum_{i=1}^N W_2^2(\nu, \nu_i).$$
For univariate Gaussian distributions $\nu_i = \mathcal{N}(\mu_i, \sigma_i^2)$, the (equally weighted) Wasserstein Barycenter is itself Gaussian, $\bar{\nu} = \mathcal{N}(\bar{\mu}, \bar{\sigma}^2)$, with parameters determined by the barycenter problem above. The solution is elegantly simple: $\bar{\mu} = \frac{1}{N}\sum_{i=1}^N \mu_i$ and $\bar{\sigma} = \frac{1}{N}\sum_{i=1}^N \sigma_i$. This demonstrates that Wasserstein Barycenters of Gaussian distributions preserve the Gaussian structure while averaging parameters in the natural geometric space.
Figure 1 provides a simple, practical example showing the difference between the Wasserstein Barycenter of two (equally weighted) normal distributions and the L2-norm barycenter obtained by taking, as the distance, the L2-norm between the probability density functions aligned on a common support. The shape preservation guaranteed by the Wasserstein Barycenter is clearly visible: the Wasserstein Barycenter is again normal, whereas the L2 barycenter is bimodal.
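The comparison in Figure 1 can be reproduced in a few lines. The sketch below discretizes two Gaussians on a shared grid and contrasts the Sinkhorn (entropic Wasserstein) barycenter, computed with the POT library, against the plain L2 average of the densities; the grid, means, variances, and regularization value are illustrative assumptions.

```python
# Sketch: Sinkhorn barycenter vs. L2 average of two 1-D Gaussian densities (POT assumed).
import numpy as np
import ot

x = np.linspace(-6, 6, 300)

def gauss(mu, sigma):
    d = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return d / d.sum()

A = np.vstack([gauss(-2.0, 0.6), gauss(2.0, 1.0)]).T    # column-stacked histograms
M = ot.utils.dist0(300)                                 # squared-distance cost on the grid
M /= M.max()                                            # normalization, as in the POT examples
weights = np.array([0.5, 0.5])

wass_bary = ot.bregman.barycenter(A, M, reg=1e-3, weights=weights)
l2_bary = A @ weights                                   # bimodal mixture, not shape-preserving
print("Wasserstein barycenter mode near:", x[np.argmax(wass_bary)])
```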
Wasserstein Barycenters generalize McCann’s interpolation to the case of more than two measures, providing a natural extension of mean distributions that preserves geometric structure in probability spaces [20]. Recent theoretical advances have established both the fundamental importance and the computational complexity of these objects, with Wasserstein Barycenters being proven NP-hard to compute in general settings [21], highlighting the necessity for efficient algorithmic approximations.
2.4. Ambiguity Set Construction and Distance Metrics
Constructing ambiguity sets is one of the most important design decisions in DRO, since different distance metrics lead to entirely different optimization problems and computational challenges. Ref. [22] proposed a data-driven approach that relies on statistical hypothesis tests to build uncertainty sets that balance robustness with performance, providing principled statistical frameworks instead of arbitrary specifications. Their approach uses historical data to determine appropriate robustness levels and incorporates complex statistical testing procedures that can handle intricate dependency structures.
There are several basic ways to construct ambiguity sets in DRO. Moment-based ambiguity sets work by constraining the mean, variance, and higher-order moments; while computationally manageable, they often miss important aspects of the distribution. Divergence-based ambiguity sets create neighborhoods around reference measures using metrics such as the Kullback–Leibler divergence [17] or the Total Variation distance; these have solid theoretical backing but can be hard to interpret geometrically. Wasserstein-based ambiguity sets offer flexible and intuitive frameworks that use optimal transport metrics to measure distributional distances, providing both computational feasibility and meaningful geometric understanding.
Wasserstein-based methods have become popular among these options because they can handle realistic transportation costs and come with strong theoretical guarantees.
Figure 2 shows the three main ways to construct ambiguity sets in DRO, illustrating how moment-based, divergence-based, and Wasserstein-based methods create different-sized uncertainty neighborhoods around the reference distribution. The figure highlights the trade-offs between computational ease, geometric understanding, and methodological flexibility across these different approaches.
Computational Speed. How fast and practical it is to solve the resulting optimization problem.
Geometric Clarity. How easy it is to understand the geometric meaning of distributional distances and uncertainty neighborhoods (Wasserstein methods give clear transportation cost intuition, while moment-based methods involve abstract statistical constraints).
Adaptability. How well the method can handle different problem types, cost functions, and distributional characteristics (Wasserstein methods can adjust cost functions for specific uses, while moment-based methods are stuck with fixed moment structures).
Building on these foundations, Ref. [24] made significant theoretical contributions to optimal transport-based DRO, establishing crucial structural properties of the value function in Wasserstein-based DRO problems and developing efficient iterative solution schemes. The computational implications of different ambiguity set constructions vary significantly depending on the chosen distance metric, with Wasserstein-based methods requiring more sophisticated optimal transport algorithms but providing valuable geometric insights, particularly in high-dimensional settings where traditional distance metrics may be overly conservative.
2.5. BO Challenges
Traditional BO approaches run into several fundamental problems that have pushed researchers to develop distributionally robust extensions. DRBO methods specifically handle uncertainty in the underlying probability distribution of the objective function, which is affected by unknown stochastic noise. Ref. [9] shows how Wasserstein-based distributionally robust approaches can give theoretical guarantees even when the model is misspecified, while Ref. [4] lays out the mathematical foundations for incorporating distributional ambiguity. More generally, Wasserstein Distributionally Robust Optimization (WDRO) has become a solid framework for dealing with distributional uncertainty across different optimization problems [25,26], providing robustness guarantees that remain valid when the true distribution falls within a specified Wasserstein ball around the empirical distribution. This view of distributional robustness shifts the focus from obtaining a perfect model specification to developing optimization strategies that work well across a range of plausible distributions, which makes the approach more practical for real-world applications where distributional assumptions are inherently uncertain.
2.6. Acquisition Functions for DRBO
Standard acquisition functions must be adapted for the distributionally robust setting. The key challenge is incorporating distributional uncertainty into the acquisition strategy while maintaining computational tractability.
For UCB-based approaches, the robust acquisition function typically takes the following form:
$$\alpha_n^{\mathrm{DR}}(x) = \inf_{Q \in \mathcal{U}} \; \mathbb{E}_{c \sim Q}\!\left[ \mu_n(x, c) + \sqrt{\beta_n}\, \sigma_n(x, c) \right],$$
where $\mathcal{U}$ is the ambiguity set over the (a priori unknown) distribution of the context, $\mu_n$ and $\sigma_n$ are the GP’s posterior mean and standard deviation after $n$ observations, and $\beta_n$ is a parameter controlling the exploration–exploitation trade-off.
However, simple empirical averaging over observed contexts may not capture the underlying structure of the context distribution, motivating the adoption of the Wasserstein ball ambiguity set and, due to the need for computational efficiency, the Sinkhorn regularization.
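The sketch below shows one simple way to evaluate such a robust UCB when the inner infimum is approximated over a finite family of candidate context distributions. The GP posterior is replaced by analytic stand-ins, and the whole example is an illustrative simplification rather than the SWBBO acquisition itself.

```python
# Illustrative sketch: distributionally robust UCB over a finite family of context weights.
import numpy as np

def dr_ucb(x, contexts, candidate_dists, gp_mean, gp_std, beta=2.0):
    # Upper confidence bound of the surrogate at (x, c) for every context support point
    ucb = np.array([gp_mean(x, c) + np.sqrt(beta) * gp_std(x, c) for c in contexts])
    # Inner infimum over the ambiguity set, here approximated by a finite family of weights
    return min(float(np.dot(Q, ucb)) for Q in candidate_dists)

# Hypothetical stand-ins for the GP posterior mean and standard deviation
gp_mean = lambda x, c: -float(np.sum((x - c) ** 2))
gp_std = lambda x, c: 0.1 + 0.05 * float(np.linalg.norm(x - c))

contexts = np.array([[-1.0], [0.0], [1.0]])                 # context support points
candidates = [np.array([0.3, 0.4, 0.3]),                    # candidate context distributions
              np.array([0.5, 0.3, 0.2])]
x_grid = [np.array([v]) for v in np.linspace(-2.0, 2.0, 81)]
best_x = max(x_grid, key=lambda x: dr_ucb(x, contexts, candidates, gp_mean, gp_std))
print("Robust UCB maximizer:", best_x)
```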
2.7. Integrating Distributionally Robust Principles into BO
Combining BO with DRO has become a new framework for tackling both parametric and distributional uncertainty in data-driven optimization problems. Ref. [27] introduced Bayesian Distributionally Robust Optimization (BDRO), which builds a comprehensive theoretical foundation that differs from traditional DRO approaches by including Bayesian estimation of unknown parametric distributions. Their framework builds ambiguity sets using parametric distributions as reference points, which allows for robust optimization that keeps the benefits of Bayesian estimation when data are scarce while providing robustness against model uncertainty. Extending distributionally robust principles to BO is a recent development that tackles expensive function evaluation under distributional uncertainty. Ref. [28] worked on black-box optimization problems that include both design variables and uncertain context variables, handling both aleatoric and epistemic uncertainty sources using adaptive and safe optimization strategies in high-dimensional spaces. Ref. [9] made important progress by tackling situations where context distributions are uncertain but are known to lie within ambiguity sets defined as balls in the Wasserstein space, establishing sublinear regret bounds that provide theoretical convergence guarantees for DRBO algorithms. Together, these developments show that DRBO has matured into a unified framework that can handle both parametric and distributional uncertainties in complex decision-making situations.
DRBO brings several unique challenges compared to standard BO:
Ambiguity Set Construction: The choice of the ambiguity set $\mathcal{U}$ significantly impacts the robustness–performance trade-off. Typical choices include Wasserstein balls, ϕ-divergence balls, and moment-based sets [23].
Computational Complexity: The min–max formulation creates a nested optimization problem that remains computationally difficult even with recent algorithmic improvements. Modern approaches use dual formulations and cutting-plane methods to deal with the computational burden of continuous ambiguity sets [11,29,30,31], but scaling up to high-dimensional problems is still an active area of research.
Conservative Solutions: DRBO can produce overly conservative solutions when the ambiguity set is not sized correctly. Recent work has focused on developing principled approaches for ambiguity set calibration that balance conservatism with good performance, including adaptive shrinking strategies and risk-aware formulations [32,33]. One key limitation is sample efficiency: under distributional uncertainty, the algorithm takes longer to converge than standard Bayesian Optimization, because it must hedge against multiple possible scenarios. To address this problem, current research is working on better acquisition function designs and multi-fidelity approaches that can reduce the computational burden that comes with distributional robustness [34].
2.8. Computational Challenges and High-Dimensional Considerations
The curse of dimensionality creates fundamental challenges in DRO, since traditional distance metrics can produce prohibitively large ambiguity sets and overly conservative performance guarantees in high-dimensional spaces. Ref. [35] tackled this critical limitation by developing the first finite-sample guarantees for Wasserstein DRO that break the curse of dimensionality. Their groundbreaking work demonstrates how the out-of-sample performance of robust solutions depends on sample size, uncertainty dimension, and loss function complexity in a nearly optimal manner, providing theoretical foundations that make high-dimensional Wasserstein DRO practically viable. The theoretical foundations established by Ref. [23] provided essential performance guarantees and tractable reformulations for data-driven DRO using the Wasserstein metric, enabling practical implementations that have significantly influenced subsequent algorithmic developments. Ref. [25] showed that data-driven Wasserstein DRO can achieve workable reformulations and strong performance guarantees when the underlying geometry of the problem is properly accounted for, and found that Wasserstein ambiguity sets naturally regularize the problem, which helps prevent overfitting when working with limited samples.
Recent developments include advanced generic column generation algorithms for high-dimensional multimarginal optimal transport problems, which allow for accurate mesh-free Wasserstein Barycenters and cubic Wasserstein splines [36]. The comprehensive theoretical framework from Ref. [10] continues to support modern computational optimal transport applications with solid convergence properties and approximation bounds for practical data science uses.
2.9. Research Gaps
This literature review reveals critical computational and theoretical gaps in DRBO that standard approaches cannot address. Current robust BO methods face two key problems. First, naive aggregation strategies such as uniform weighting across uncertainty sets ignore how probability spaces are actually structured, which leads to suboptimal acquisition functions that are less efficient than methods that exploit distributional information. Second, existing Wasserstein-based approaches need to solve transport problems at every BO iteration, which creates major bottlenecks and makes them too slow to use in practice. The underlying difficulty is keeping the distributional geometry intact during optimization: when uncertainty sets contain distributions with very different supports or covariance structures, simply averaging them destroys the geometric relationships that drive good exploration–exploitation trade-offs. Our Sinkhorn-regularized approach tackles this computational barrier by approximating Wasserstein Barycenters while keeping geometric fidelity within ε-accuracy of exact solutions. Three challenges remain unresolved: handling distributional uncertainty sets that span different support dimensions, keeping barycenter approximation quality stable as the ambient space dimension grows, and developing adaptive regularization parameters that balance computational speed against geometric precision across different types of problems. This points to a clear need for computationally practical methods that preserve essential distributional geometry without giving up the convergence guarantees that make BO useful for expensive optimization problems.
5. Experiments and Results
Following the experimental setting presented in Ref. [3], we compared three DRBO algorithms on five benchmark test functions (Ackley, Modified Branin, Hartmann, Three Hump Camel, and Six Hump Camel) and three real-life-related problems, namely Continuous Vendor, Portfolio, and Portfolio Normal Optimization. The algorithms compared are Empirical Risk Bayesian Optimization (ERBO), Wasserstein Distance Robust Bayesian Optimization (WDRBO), and SWBBO (with the entropic regularization parameter set to 0.1, 0.01, and 0.001). Each algorithm is evaluated over 30 independent runs to support statistical comparison. The baseline methods, ERBO and WDRBO, are taken from the implementation provided by the authors of Ref. [9], freely accessible in their GitHub repository.
The following table summarizes the best observed value of the objective at the end of the optimization processes performed by the different algorithms (median and standard deviation over 30 independent runs), separately for each test problem.
As reported in Ref. [9], ERBO is always better (on median) than WDRBO, apart from in the case of Portfolio Normal Optimization. Moreover, in our experiments, which are more extensive than those reported in Ref. [9], this difference was statistically significant (with respect to a Wilcoxon test) in the following cases: Ackley (p-value = 0.001), Three Hump Camel (p-value = 0.046), and Six Hump Camel (p-value = 0.015), while it was not statistically significant in the following: Modified Branin (p-value = 0.532), Hartmann (p-value = 0.260), Continuous Vendor (p-value = 0.605), Portfolio Optimization (p-value = 0.665), and Portfolio Normal Optimization (p-value = 0.254).
When ERBO is compared against the best among the three SWBBO alternatives (Table 1), its results are better (on median) only in the case of Modified Branin, but this difference is not statistically significant (p-value = 0.290). In all the other test problems, the best SWBBO alternative is (on median) better than ERBO; more precisely, it is significantly better in the following: Six Hump Camel (p-value < 0.001), Continuous Vendor (p-value < 0.001), and Portfolio Optimization (p-value < 0.001), while the difference is not statistically significant in Ackley (p-value = 0.127), Hartmann (p-value = 0.708), Three Hump Camel (p-value = 0.440), and Portfolio Normal Optimization (p-value = 0.088). Results are also reported in Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11.
Figure 4. Results on Ackley over 30 independent runs: (top) best observed value of the objective with respect to BO queries and (bottom) boxplot of the final best observed value.
Figure 5. Results on Modified_Branin over 30 independent runs: (top) best observed value of the objective with respect to BO queries and (bottom) boxplot of the final best observed value.
Figure 6. Results on Hartmann over 30 independent runs: (top) best observed value of the objective with respect to BO queries and (bottom) boxplot of the final best observed value.
Figure 7. Results on Three Hump Camel over 30 independent runs: (top) best observed value of the objective with respect to BO queries and (bottom) boxplot of the final best observed value.
Figure 8. Results on Six Hump Camel over 30 independent runs: (top) best observed value of the objective with respect to BO queries and (bottom) boxplot of the final best observed value.
Figure 9. Results on Continuous_Vendor over 30 independent runs: (top) best observed value of the objective with respect to BO queries and (bottom) boxplot of the final best observed value.
Figure 10. Results on Portfolio_Optimization over 30 independent runs: (top) best observed value of the objective with respect to BO queries and (bottom) boxplot of the final best observed value.
Figure 11. Results on Portfolio_Normal_Optimization over 30 independent runs: (top) best observed value of the objective with respect to BO queries and (bottom) boxplot of the final best observed value.
Finally, Table 2 summarizes the results related to the Cumulative Stochastic Regret (CSR) and computational time. It is evident that ERBO offered a lower CSR, on median, more frequently than the other methods, even in cases in which SWBBO provided a better final value of the objective. The most plausible reason is the more explorative behaviour of SWBBO after the identification of its own best observed value.
As far as computational time is concerned, there are no relevant differences between the approaches, nor is there an evident relationship between CSR and time. As shown in Table 2, SWBBO1 and SWBBO3 are often faster than or comparable to the baseline methods. SWBBO2, which uses a moderate regularization value, incurs a slightly higher cost in some cases but offers improved robustness. Thus, computational trade-offs depend on the specific configuration and should be considered jointly with regret performance.
5.1. Evaluation Metrics
To evaluate how well our computed barycenters and transport maps perform, we use two established metrics from the entropic optimal transport literature [46]. The BW-UVP (Bures–Wasserstein Unexplained Variance Percentage) metric measures the quality of the generated barycenter $\hat{\nu}$ compared to the ground-truth barycenter $\nu^*$, and it is defined as
$$\mathrm{BW\text{-}UVP}(\hat{\nu}) = 100 \cdot \frac{\mathrm{BW}_2^2(\hat{\nu}, \nu^*)}{\operatorname{Var}(\nu^*)} \, \%,$$
where the (squared) Bures–Wasserstein metric is $\mathrm{BW}_2^2(\hat{\nu}, \nu^*) = \lVert \mu_{\hat{\nu}} - \mu_{\nu^*} \rVert_2^2 + \mathbb{B}^2(\Sigma_{\hat{\nu}}, \Sigma_{\nu^*})$, computed from the respective means and covariances of the two distributions. This metric provides a normalized measure of the distance between the computed and true barycenter, expressed as a percentage of the target distribution’s variance. The L2-UVP ($L^2$ Unexplained Variance Percentage) metric assesses the quality of the individual transport maps from each marginal distribution to the barycenter:
$$\mathrm{L}^2\text{-}\mathrm{UVP}(\hat{T}_i) = 100 \cdot \frac{\lVert \hat{T}_i - T_i^* \rVert^2_{L^2(\nu_i)}}{\operatorname{Var}(\nu^*)} \, \%,$$
where $\hat{T}_i$ denotes the learned transport map from the $i$-th marginal to the barycenter, and $T_i^*$ represents the ground-truth mapping. This metric captures how well the learned transport maps approximate the optimal transport plans, normalized by the variance of the target distribution.
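A sketch of how these two metrics can be estimated from samples, under the simplifying assumptions that the barycenters are summarized by their empirical means and covariances and that the total variance is taken as the trace of the covariance; the function names and toy data are illustrative, not the paper's evaluation code.

```python
# Sketch: sample-based estimates of BW-UVP and L2-UVP (assumed, simplified forms).
import numpy as np
from scipy.linalg import sqrtm

def bw2_squared(m1, S1, m2, S2):
    cross = sqrtm(sqrtm(S1) @ S2 @ sqrtm(S1))
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * np.real(cross)))

def bw_uvp(samples_hat, samples_true):
    """100 * BW_2^2(computed, true) / Var(true), in percent."""
    m1, S1 = samples_hat.mean(0), np.cov(samples_hat, rowvar=False)
    m2, S2 = samples_true.mean(0), np.cov(samples_true, rowvar=False)
    return 100.0 * bw2_squared(m1, S1, m2, S2) / np.trace(S2)

def l2_uvp(T_hat, T_star, marginal_samples, bary_samples):
    """100 * E||T_hat - T_star||^2 / Var(barycenter), in percent."""
    err = np.mean(np.sum((T_hat(marginal_samples) - T_star(marginal_samples)) ** 2, axis=1))
    return 100.0 * err / np.trace(np.cov(bary_samples, rowvar=False))

# toy usage with 2-D Gaussian samples and a slightly perturbed transport map
rng = np.random.default_rng(0)
true_b = rng.normal(size=(2000, 2))
approx_b = true_b + 0.05 * rng.normal(size=(2000, 2))
print("BW-UVP (%):", bw_uvp(approx_b, true_b))
print("L2-UVP (%):", l2_uvp(lambda z: z + 0.1, lambda z: z, true_b, true_b))
```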
For illustration, we compute these metrics for the Ackley function across different regularization parameter values. The results demonstrate how regularization strength affects both barycenter quality and transport map accuracy.
Table 3 shows how well the barycenter computation and transport map estimation perform for the Ackley test function across different regularization parameter values (ε = 0.1, 0.01, 0.001), using the two evaluation metrics BW-UVP and L2-UVP. The BW-UVP metric shows that performance varies with regularization strength: the larger regularization values (ε = 0.1, 0.01) yield acceptable barycenter quality, with BW-UVP values of 0.35% and 0.77%, respectively, while the smallest value (ε = 0.001) achieves an essentially exact barycenter approximation with BW-UVP = 0.0000%. The L2-UVP values stay consistently around 7–8% across all regularization levels, indicating that the learned transport maps maintain reasonable accuracy regardless of the chosen regularization parameter. These results show that weaker entropic regularization (smaller ε) improves barycenter quality for the Ackley function, while transport map fidelity stays stable, suggesting that our method performs robustly across different regularization settings for this test problem.
5.2. Sinkhorn vs. LP Barycenter Comparison
Table 4 shows a quantitative comparison between barycenters computed using the Sinkhorn Wasserstein method and those obtained through Linear Programming (LP) for the Ackley test function across different regularization parameters (γ = 0.1, 0.01, 0.001). We use three metrics for the evaluation: Wasserstein Distance (WD), Mean Euclidean Distance (MED), and Maximum Mean Discrepancy (MMD). The results demonstrate that the regularization parameter has a significant impact on the agreement between the Sinkhorn and LP Barycenters. For the Wasserstein Distance, the deviation decreases as the regularization becomes weaker: from 0.002593 at γ = 0.1 to 0.001621 at γ = 0.01, and further to 0.001208 at γ = 0.001. This pattern indicates that, for smaller γ values, the two methods agree much more closely under the optimal transport geometry. The Mean Euclidean Distance behaves the same way, dropping from 0.010303 at γ = 0.1 to 0.007938 at γ = 0.01, and then to 0.003189 at γ = 0.001, meaning that the pointwise differences between the barycenters become much smaller as the regularization weakens. The MMD Approximation, which measures distributional similarity through kernel-based comparison, shows the most dramatic improvement: from 0.000604 at γ = 0.1 to 0.000299 at γ = 0.01, and finally to 0.000059 at γ = 0.001.
Overall, the results show that both Sinkhorn and LP methods produce increasingly aligned barycenters as the regularization parameter becomes smaller, with the strongest agreement achieved at γ = 0.001 across all evaluation metrics for the Ackley function.
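One plausible way to compute the three comparison metrics for two barycenters discretized on the same 1-D grid is sketched below; the kernel bandwidth, the reading of MED as a mean pointwise deviation, and the toy histograms are assumptions made for illustration.

```python
# Sketch: WD, MED, and (Gaussian-kernel) MMD^2 between two barycenter histograms.
import numpy as np
from scipy.stats import wasserstein_distance

def compare_barycenters(grid, bary_a, bary_b, bandwidth=0.5):
    # Wasserstein distance between the two (normalized) histograms on the shared grid
    wd = wasserstein_distance(grid, grid, u_weights=bary_a, v_weights=bary_b)
    # Mean pointwise deviation between the density values (one reading of "MED")
    med = float(np.mean(np.abs(bary_a - bary_b)))
    # Squared MMD with a Gaussian kernel between the two discrete distributions
    K = np.exp(-((grid[:, None] - grid[None, :]) ** 2) / (2 * bandwidth ** 2))
    d = bary_a - bary_b
    mmd2 = float(d @ K @ d)
    return wd, med, mmd2

grid = np.linspace(-3.0, 3.0, 200)
p = np.exp(-grid ** 2); p /= p.sum()
q = np.exp(-(grid - 0.1) ** 2); q /= q.sum()
print(compare_barycenters(grid, p, q))
```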
The BW-UVP and L2-UVP metrics may not fully capture the convergence behaviour of entropic optimal transport, since they mainly look at pointwise differences rather than at the underlying geometric structure of the barycenter computation. The WD, MED, and MMD metrics are better suited for this evaluation because they directly measure how well the distributions align and whether the geometry stays consistent across the different ways of computing the barycenters.
Figure 12, Figure 13, Figure 14, Figure 15, Figure 16 and Figure 17 show a comparative analysis of Sinkhorn Barycenter computation versus LP Barycenter methods across different regularization strengths. The analysis is organized in pairs: the even-numbered Figure 12, Figure 14 and Figure 16 show the Sinkhorn Barycenter among multiple distributions with the computed barycenter highlighted in red, while the odd-numbered Figure 13, Figure 15 and Figure 17 give direct comparisons between the Sinkhorn–Wasserstein (S-W) method (red curves) and the LP method (green curves). The regularization parameter γ decreases progressively across the figure pairs: Figure 12 and Figure 13 use γ = 0.1, Figure 14 and Figure 15 use γ = 0.01, and Figure 16 and Figure 17 use γ = 0.001. This progression reveals the fundamental relationship between regularization strength and solution characteristics.
With high regularization (γ = 0.1, Figure 12 and Figure 13), the Sinkhorn method creates highly smooth, almost flat distributions that look quite different from the LP solution. The strong entropic regularization pushes the barycenter toward maximum entropy, resulting in broad, spread-out distributions, while the LP method keeps sharper, more concentrated peaks that better preserve the shape of the original distributions. At moderate regularization (γ = 0.01, Figure 14 and Figure 15), the Sinkhorn Barycenter moves closer to the LP solution while remaining smooth: the entropic regularization still smooths the solution noticeably, but the barycenter now captures the underlying distribution structure better, with more defined peaks and valleys. At low regularization (γ = 0.001, Figure 16 and Figure 17), the Sinkhorn and LP methods converge toward nearly identical solutions. The minimal regularization lets the Sinkhorn method come very close to the true optimal transport solution while retaining its computational advantages, and the distributions show sharp, well-defined features that match closely between both methods.
This progression illustrates the trade-off between computational speed and solution accuracy in regularized optimal transport: higher regularization yields faster convergence but may over-smooth important features of the distributions, whereas lower regularization stays more faithful to the true barycenter at a greater computational cost.
Figure 12. Sinkhorn Barycenter on Ackley at γ = 0.1.
Figure 13. Sinkhorn vs. LP Barycenter comparison on Ackley at γ = 0.1.
Figure 14. Sinkhorn Barycenter on Ackley at γ = 0.01.
Figure 15. Sinkhorn vs. LP Barycenter comparison on Ackley at γ = 0.01.
Figure 16. Sinkhorn Barycenter on Ackley at γ = 0.001.
Figure 17. Sinkhorn vs. LP Barycenter comparison on Ackley at γ = 0.001.
6. Conclusions: Limitations and Perspectives
BO is well suited to optimizing data-driven systems under uncertainty, where randomness arises from contextual conditions, model parameters, or noisy observations. In these settings, uncertainty is multi-faceted, which makes developing DRBO algorithms both important and challenging. One of the main challenges is that context distributions are infinite-dimensional, especially when dealing with continuous contexts. Solving BO problems robust to such uncertainty requires tractable approximations of otherwise intractable constrained optimization problems over probability spaces. Historically, robust BO formulations have employed φ-divergences to define ambiguity sets over context distributions. However, Wasserstein distances, arising from OT theory, have recently gained prominence due to their desirable geometric, topological, and statistical properties. Unlike φ-divergences, Wasserstein distances naturally incorporate the geometry of the underlying space and provide meaningful comparisons even when distributions have disjoint supports. This geometry-aware notion of robustness allows the Wasserstein metric to capture subtle shifts in distribution and thus leads to better-behaved optimization landscapes. As a result, Wasserstein-based ambiguity sets have become increasingly popular for modeling uncertainty in robust machine learning, including BO.
Despite these advantages, computational complexity remains a well-known bottleneck. The exact computation of Wasserstein distances has a complexity of roughly $O(n^3 \log n)$ with respect to the number of samples $n$. This makes OT prohibitive in high dimensions. To address this, entropic regularization via the Sinkhorn algorithm has emerged as a practical alternative, offering a per-iteration complexity of $O(n^2)$. Moreover, Sliced Wasserstein approaches and neural OT methods (e.g., via learned transport maps) have enabled further scaling by approximating high-dimensional OT through lower-dimensional projections or learned parametrizations. The SWBBO algorithm developed in this study contributes to this growing intersection of Bayesian Optimization and optimal transport. By leveraging Wasserstein Barycenters, our method offers tractable and robust BO strategies that adapt to distributional shifts in the context while maintaining computational efficiency through entropic regularization. In conclusion, Wasserstein-based BO methods such as SWBBO offer promising tools for robust and geometry-aware optimization under uncertainty. While computational challenges remain, the flexibility and strong theoretical foundations of optimal transport provide a compelling framework for advancing robust machine learning, with exciting future applications in optimization, control, and learning under uncertainty.
Looking ahead, we expect this direction to keep expanding, especially when we integrate it with generative modeling, distributional Reinforcement Learning, and distributional robustness. In generative modeling, for example, people are increasingly using Wasserstein distances as loss functions because they can compare entire distributions rather than just point estimates. In dictionary learning, researchers use Wasserstein Barycenters to combine atoms meaningfully across distributional structures. Similarly, Distributional Reinforcement Learning expands what we are trying to learn from mean rewards to full return distributions, often using Wasserstein-based metrics for more informative policy updates.