Article

Distributionally Robust Bayesian Optimization via Sinkhorn-Based Wasserstein Barycenter

by Iman Seyedi 1,*, Antonio Candelieri 2,* and Francesco Archetti 1
1 Department of Computer Science Systems and Communication, University of Milano-Bicocca, 20126 Milan, Italy
2 Department of Economics Management and Statistics, University of Milano-Bicocca, 20126 Milan, Italy
* Authors to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(3), 90; https://doi.org/10.3390/make7030090
Submission received: 17 July 2025 / Revised: 14 August 2025 / Accepted: 25 August 2025 / Published: 28 August 2025

Abstract

This paper introduces a novel framework for Distributionally Robust Bayesian Optimization (DRBO) with continuous context that integrates optimal transport theory and entropic regularization. We propose the sampling from the Wasserstein Barycenter Bayesian Optimization (SWBBO) method to deal with uncertainty about the context; that is, the unknown stochastic component affecting the observations of the black-box objective function. This approach captures the geometric structure of the underlying distributional uncertainty and enables robust acquisition strategies without incurring excessive computational costs. The method incorporates adaptive robustness scheduling, Lipschitz regularization, and efficient barycenter construction to balance exploration and exploitation. Theoretical analysis establishes convergence guarantees for the robust Bayesian Optimization acquisition function. Empirical evaluations on standard global optimization problems and real-life inspired benchmarks demonstrate that SWBBO consistently achieves faster convergence, good final regret, and greater stability than other recently proposed methods for DRBO with continuous context. Indeed, SWBBO outperforms all of them in terms of both optimization performance and robustness under repeated evaluations.

1. Introduction

Modern decision-making systems increasingly operate under distributional uncertainty, where the data-generating processes are subject to shifts, perturbations, or misspecification. This is most salient in black-box optimization settings, where expensive function evaluations are performed under uncertain contextual conditions. Although vanilla Bayesian Optimization (BO) is known to be effective for deterministic or well-characterized stochastic settings [1,2,3], it can exhibit poor out-of-sample performance when the true distribution deviates from prior assumptions. In Sample Average Approximation (SAA) methods, this phenomenon is often referred to as the “optimizer’s curse”.
Distributionally Robust Optimization (DRO) has emerged as a principled methodology for addressing such uncertainties by optimizing under worst-case scenarios drawn from carefully crafted ambiguity sets built around empirical samples, e.g., historical data. Instead of relying on point estimates of unknown distributions, DRO considers all distributions within a neighbourhood of the empirical distribution, providing performance guarantees that degrade gracefully under shifts of the underlying distributions. The underlying intuition is that, by hedging against distributional uncertainty, decision-makers obtain solutions that perform robustly across a variety of plausible scenarios, even at the cost of nominal optimality under ideal conditions.
Applying DRO concepts to BO presents a number of characteristic challenges that make it different from classical robust optimization [4]. To begin with, the black-box nature of expensive objective functions demands explicit balancing of exploration versus exploitation, simultaneously coping with distributional uncertainty. Traditional acquisition functions must be fundamentally reconsidered to incorporate robustness considerations without sacrificing the sample efficiency that makes BO attractive for expensive optimization problems. Second, the construction of appropriate ambiguity sets becomes critical in the BO setting, where limited samples must inform both the underlying surrogate model of the black-box objective function and the distributional uncertainty quantification. The choice of distance metric for defining ambiguity sets—whether based on φ-divergences, Wasserstein distances, or moment-based constraints—has a significant impact on both computational tractability and the quality of robust solutions. Recent theoretical work by Ref. [5] shows that worst-case expectation problems can be transformed into finite convex problems for many practical loss functions, which provides a foundation for tractable distributionally robust formulations. Third, the high-dimensional nature of many practical optimization problems exacerbates the curse of dimensionality in distributional uncertainty quantification. As noted by Ref. [6], distance metrics defined in high-dimensional spaces can lead to exceedingly large ambiguity sets, potentially resulting in overly conservative solutions. This observation points to the need for more sophisticated approaches that can capture the geometric structure of data distributions while maintaining computational efficiency.
Current approaches to Distributionally Robust Bayesian Optimization (DRBO) have several limitations that this work addresses. Existing methods usually rely on simple empirical averaging or uniform sampling within ambiguity sets, which fail to capture the underlying geometric structure of the context distribution. This approach can lead to suboptimal representation of distributional uncertainty and, consequently, poor exploration strategies in the robust optimization landscape [7,8].
Furthermore, many existing robust BO methods exhibit excessive conservatism, especially when ambiguity sets are large compared to the available data. The resulting solutions may sacrifice significant performance under nominal conditions to hedge against unlikely worst-case scenarios. The challenge lies in developing acquisition functions that can adaptively balance robustness and performance as more information becomes available through the optimization process. Computational complexity presents another significant barrier, as the nested min–max formulation inherent in distributionally robust optimization introduces substantial computational overhead. Traditional approaches often need to solve complex optimization problems at each iteration, making them impractical for high-dimensional or time-sensitive applications. Recent work by Ref. [9] addressed Wasserstein Distributionally Robust Bayesian Optimization (WDRBO) with continuous contexts, establishing sublinear regret bounds for settings where context distributions are uncertain but lie within Wasserstein ball ambiguity sets. However, both proposed approaches still have difficulty efficiently aggregating distributional information while preserving geometric structure.
This paper introduces a novel methodology that addresses these limitations through the principled application of the Wasserstein Barycenter, efficiently computed via the Sinkhorn regularization. Our key insight is that the geometric structure of distributional uncertainty can be effectively captured using Wasserstein Barycenter, which provides a natural generalization of the notion of mean distributions in the space of probability measures. The Wasserstein Barycenter, also known as the Fréchet mean, computed in the space of probability distributions endowed with Wasserstein distance [10], offers a non-linear interpolation mechanism that preserves the geometric characteristics of constituent distributions. Unlike simple weighted averages that ignore spatial structure, the Wasserstein Barycenter gives us a principled approach to aggregating multiple plausible distributions while maintaining their essential geometric properties. Our methodology takes advantage of the computational efficiency of the Sinkhorn algorithm to approximate Wasserstein Barycenter computation. While this approach trades the traditional distance properties for divergence measures, it provides crucial advantages: computational tractability for practical applications and guaranteed uniqueness of solutions. By combining entropic regularization with iterative scaling techniques, Sinkhorn achieves significant computational speedups while preserving the essential geometric properties required for robust optimization. The resulting Distributionally Robust Sinkhorn Barycenter Upper Confidence Bound acquisition function seamlessly integrates distributional robustness considerations into the BO framework while maintaining the exploration–exploitation balance that characterizes effective acquisition strategies. Our approach incorporates adaptive robustness scheduling and distributional Lipschitz regularization to ensure that the level of conservatism appropriately adapts to the available information and problem structure.
The main contributions of this work can be summarized as follows:
We propose SWBBO, a novel Bayesian Optimization algorithm that integrates Sinkhorn-regularized Wasserstein Barycenters into a robust acquisition strategy, providing geometry-aware modeling of distributional uncertainty in continuous contexts.
We develop a computationally efficient and theoretically grounded acquisition function that combines entropic optimal transport, adaptive robustness scheduling, and distributional Lipschitz regularization, offering strong regret guarantees under distributional shifts.
We demonstrate that SWBBO outperforms existing DRBO methods on both synthetic benchmarks and real-world-inspired optimization tasks, achieving faster convergence and improved robustness under repeated evaluations.
The rest of this paper proceeds in the following manner. We begin with Section 2, which covers essential background on DRO and Wasserstein Barycenter. In Section 3, we describe the DRBO framework, along with the Sinkhorn-regularized barycenter computation and the acquisition function. Section 4 formalizes the problem setting, introduces our distributional robustness framework, and presents the core methodology. Section 5 presents a comprehensive experimental evaluation, and Section 6 concludes with a discussion of future research directions.

2. Literature Review

2.1. Theoretical Foundations of DRO

The theoretical foundations of Distributionally Robust Optimization (DRO) have advanced significantly in recent years, revealing important connections between robust statistical methods and optimization under distributional uncertainty. In 2022, Rahimian and Mehrotra published a comprehensive survey [11] that thoroughly covers the field, tracing its evolution from classical robust optimization to modern distributionally robust approaches. Their work categorizes different types of ambiguity sets—including moment-based, distance-based, and divergence-based formulations—and examines the trade-offs between robustness guarantees and computational feasibility. Building on this foundation, Ref. [5] demonstrates that many DRO problems can be seen as extensions of robust statistical estimation methods. Rather than simply looking for robust parameter estimates, the focus shifts toward finding decisions that work well across multiple plausible data-generating distributions. The authors in Ref. [12] took these concepts further by developing adaptive DRO frameworks that dynamically adjust ambiguity sets as new data become available. This method addresses the fundamental challenge of balancing robustness with performance in sequential decision-making, providing a way to reduce the excessive conservatism typical of worst-case optimization while maintaining strong out-of-sample guarantees. This work tackles key problems found in traditional stochastic optimization (SO) methods that depend on fixed probability distributions. DRO’s mathematical framework handles uncertainty using three main paradigms. SO represents the first approach, which maximizes expected performance under a known distribution [7]:
$\max_{x \in X \subseteq \mathbb{R}^d} \ \mathbb{E}_{c \sim P}\left[ f(x, c) \right]$
Let $x \in X \subseteq \mathbb{R}^d$ denote the decision variables, and $c \sim P$ denote the context. The function $f(x, c)$ is the black-box objective function to be optimized, which depends on both decision and contextual variables. Decisions are made assuming perfect knowledge of the probability distribution $P$ governing the uncertain parameters $c$. Worst-case Robust Optimization (RO) adopts an extremely conservative approach by optimizing against the worst possible realization:
$\max_{x \in X \subseteq \mathbb{R}^d} \ \min_{c \in \Delta} \ f(x, c),$
where no probabilistic assumptions are made about the uncertainty set Δ.
Finally, DRO achieves a balanced compromise through optimizing against the worst-case expectation over all distributions Q within an ambiguity set U.
$\max_{x \in X \subseteq \mathbb{R}^d} \ \inf_{Q \in U} \ \mathbb{E}_{c \sim Q}\left[ f(x, c) \right],$
In these formulations, $f(x, c)$ represents the objective function evaluated at decision $x$ under context $c$. Depending on the optimization paradigm, the expectation over $c$ is taken under a known distribution, under the worst-case realization, or as a robust expectation over an ambiguity set.
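As a concrete illustration of how these three paradigms differ, the following sketch compares the SO, RO, and DRO objectives on a toy discrete problem; all decisions, contexts, values, and the small ambiguity set are hypothetical and chosen only to make the contrast visible.

```python
# Illustrative sketch (not from the paper): comparing the SO, RO, and DRO
# objectives for a toy problem with two candidate decisions and three
# context scenarios. All numbers are hypothetical.
import numpy as np

f = np.array([[3.0, 2.0, 1.0],    # f(x1, c) for contexts c1, c2, c3
              [2.2, 2.1, 2.0]])   # f(x2, c)
p_nominal = np.array([0.6, 0.3, 0.1])          # assumed context distribution P

# Stochastic Optimization: expected value under the nominal distribution.
so_values = f @ p_nominal

# Robust Optimization: worst-case context, no probabilistic assumptions.
ro_values = f.min(axis=1)

# DRO: worst-case expectation over a small ambiguity set of distributions Q.
ambiguity_set = [p_nominal,
                 np.array([0.4, 0.3, 0.3]),
                 np.array([0.2, 0.3, 0.5])]
dro_values = np.array([min(f[i] @ q for q in ambiguity_set) for i in range(2)])

print(so_values, ro_values, dro_values)
# SO prefers x1 (high nominal mean); RO and DRO prefer the flatter x2.
```

The risk-neutral SO criterion can favour a decision with a high nominal mean, while RO and DRO both shift preference to the decision whose performance degrades least across scenarios.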

2.2. Wasserstein Distance and Optimal Transport

In statistics and machine learning, the Wasserstein distance has increasingly gained importance as it allows the comparison of probability distributions of different types while satisfying all properties of a distance metric under certain assumptions. It has been recently applied to extend BO on complicated—also known as exotic—tasks such as grey-box optimization [13] and multiple information source optimization [14].
The p-Wasserstein distance between two probability measures $\alpha$ and $\beta$ in $P(\Omega)$ is defined as [15]
$W_p^p(\alpha, \beta) = \inf_{\pi \in \Pi(\alpha, \beta)} \int_{\Omega \times \Omega} d^p(\omega, \omega') \, d\pi(\omega, \omega')$
where $\Pi(\alpha, \beta)$ is the set of joint probability measures (couplings, or transport plans) on $\Omega \times \Omega$ having marginals $\alpha$ and $\beta$, $p \geq 1$, and $\omega, \omega'$ denote points of $\Omega$. Setting $p = 2$ and taking $d(\cdot,\cdot)$ to be the Euclidean distance leads to the so-called 2-Wasserstein distance, and Equation (4) can be rewritten as
$W_2^2(\alpha, \beta) = \min_{T_{\#}\alpha = \beta} \int \left\| T(\omega) - \omega \right\|_2^2 \, d\alpha(\omega)$
which is interpreted as an optimal transport problem [10]. This formulation views the distance as the minimum cost of transporting the probability mass of $\alpha$ to match the target probability mass $\beta$, where the symbol $\#$ denotes the push-forward operator ensuring that $\alpha$, transported by the map $T$, matches $\beta$.
For Gaussian distributions, that is $\alpha = \mathcal{N}(m_\alpha, \Sigma_\alpha)$ and $\beta = \mathcal{N}(m_\beta, \Sigma_\beta)$, the 2-Wasserstein distance simplifies to [9]
$W_2^2(\alpha, \beta) = \left\| m_\alpha - m_\beta \right\|_2^2 + B(\Sigma_\alpha, \Sigma_\beta)^2$
where $B$ represents the Bures metric between positive definite matrices [16,17]:
$B(\Sigma_\alpha, \Sigma_\beta)^2 = \mathrm{trace}\!\left( \Sigma_\alpha + \Sigma_\beta - 2 \left( \Sigma_\alpha^{1/2} \Sigma_\beta \Sigma_\alpha^{1/2} \right)^{1/2} \right)$
Thus, in the case of centered Gaussian distributions (i.e., $m_\alpha = m_\beta = 0$), the 2-Wasserstein distance reduces to the Bures metric. Moreover, if $\Sigma_\alpha$ and $\Sigma_\beta$ are diagonal, the Bures metric reduces to the Hellinger distance, while in the commutative case, that is $\Sigma_\alpha \Sigma_\beta = \Sigma_\beta \Sigma_\alpha$, the Bures metric equals the squared Frobenius norm $\left\| \Sigma_\alpha^{1/2} - \Sigma_\beta^{1/2} \right\|_{\mathrm{Frobenius}}^2$.
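The closed form above is easy to evaluate numerically. The following sketch computes the squared 2-Wasserstein distance between two Gaussians via the Bures metric; it relies only on scipy.linalg.sqrtm, and the example means and covariances are arbitrary illustrations, not values from the paper.

```python
# Minimal sketch of Equations (6)-(7): squared 2-Wasserstein distance between
# two Gaussians via the Bures metric. The matrices are illustrative.
import numpy as np
from scipy.linalg import sqrtm

def bures_squared(S_a, S_b):
    """Squared Bures metric between positive definite matrices."""
    root_a = sqrtm(S_a)
    cross = sqrtm(root_a @ S_b @ root_a)
    return np.trace(S_a + S_b - 2 * cross).real

def w2_squared_gaussian(m_a, S_a, m_b, S_b):
    """Squared 2-Wasserstein distance between N(m_a, S_a) and N(m_b, S_b)."""
    return np.sum((m_a - m_b) ** 2) + bures_squared(S_a, S_b)

m_a, S_a = np.zeros(2), np.eye(2)
m_b, S_b = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
print(w2_squared_gaussian(m_a, S_a, m_b, S_b))
```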
When dealing with general probability distributions beyond the Gaussian case, computing the Wasserstein distance becomes much more difficult and requires solving the Monge or Kantorovich optimal transport problems. The Kantorovich formulation turns this into a linear program with $O(n^3 \log n)$ complexity for discrete measures on $n$ points, which makes exact computation impractical for large-scale applications. This computational burden has motivated the development of entropic regularization techniques that approximate the optimal transport solution while maintaining computational tractability.
The computational tractability of Wasserstein-based methods has been revolutionized through entropic regularization approaches, most notably the introduction of Sinkhorn distances [18], which enables rapid computation of optimal transport through entropic penalties. Recent complexity analyses have improved our understanding of entropic regularized algorithms for optimal transport between discrete probability measures, establishing refined convergence bounds and computational guarantees [19]. The extension to Wasserstein Barycenter computation through fast Sinkhorn-based algorithms [15] has established the algorithmic foundations for practical implementation while maintaining essential geometric properties with significant computational speedups.
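For general discrete measures, the entropic approximation is readily available in the Python Optimal Transport (POT) library used later in this paper. The sketch below contrasts the exact optimal transport cost (ot.emd2) with its Sinkhorn approximation (ot.sinkhorn2) on two arbitrary point clouds; the sample data and the regularization value 0.1 are illustrative choices.

```python
# Hedged sketch of entropic-regularized optimal transport with POT.
import numpy as np
import ot

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(50, 2))   # source samples
xt = rng.normal(1.0, 1.5, size=(60, 2))   # target samples
a = np.full(50, 1.0 / 50)                 # uniform source weights
b = np.full(60, 1.0 / 60)                 # uniform target weights

M = ot.dist(xs, xt)                       # squared Euclidean cost matrix

exact_cost = ot.emd2(a, b, M)             # exact W_2^2 (linear program)
sinkhorn_cost = ot.sinkhorn2(a, b, M, reg=0.1)  # entropic approximation

print(exact_cost, sinkhorn_cost)          # Sinkhorn cost is typically slightly above the exact cost
```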

2.3. Wasserstein Barycenters and Geometric Averaging

The mathematical foundations for geometric averaging of probability distributions are rooted in optimal transport theory and the concept of Wasserstein Barycenters.
Given a set of probability measures $\{\alpha_i\}_{i=1}^N$ in $P(\Omega)$ with associated weights $\{\lambda_i\}_{i=1}^N$ such that $\sum_{i=1}^N \lambda_i = 1$, their Wasserstein Barycenter $\bar{\alpha}$ is defined as
$\bar{\alpha} \in \operatorname*{argmin}_{\alpha \in P(\Omega)} \sum_{i=1}^N \lambda_i W_2^2(\alpha, \alpha_i)$
where $\bar{\alpha}$ is uniquely determined by the weights $\{\lambda_i\}_{i=1}^N$; thus, different weights lead to different Wasserstein Barycenters. The most common situation is to consider equally weighted probability measures, that is $\lambda_i = 1/N$ for $i = 1, \dots, N$, leading to
$\bar{\alpha} \in \operatorname*{argmin}_{\alpha \in P(\Omega)} \frac{1}{N} \sum_{i=1}^N W_2^2(\alpha, \alpha_i)$
For univariate Gaussian distributions $\alpha_i = \mathcal{N}(m_i, \sigma_i)$, the Wasserstein Barycenter $\bar{\alpha} = \mathcal{N}(\bar{m}, \bar{\sigma})$ has parameters determined by
$(\bar{m}, \bar{\sigma}) \in \operatorname*{argmin}_{m, \sigma} \frac{1}{N} \sum_{i=1}^N \left[ (m_i - m)^2 + (\sigma_i - \sigma)^2 \right]$
The solution is elegantly simple: $\bar{m} = \frac{1}{N} \sum_{i=1}^N m_i$ and $\bar{\sigma} = \frac{1}{N} \sum_{i=1}^N \sigma_i$. This demonstrates that Wasserstein Barycenters of Gaussian distributions preserve the Gaussian structure while averaging parameters in the natural geometric space.
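A quick numerical check of this closed form, with arbitrary example parameters:

```python
# The 2-Wasserstein barycenter of univariate Gaussians N(m_i, sigma_i)
# simply averages means and standard deviations. Values are illustrative.
import numpy as np

means = np.array([-1.0, 0.5, 2.0])
stds = np.array([0.5, 1.0, 1.5])

m_bar = means.mean()     # barycenter mean = (1/N) * sum(m_i)
sigma_bar = stds.mean()  # barycenter std  = (1/N) * sum(sigma_i)

print(m_bar, sigma_bar)  # N(0.5, 1.0) for these inputs
```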
Figure 1 provides a practical and simple example showing the difference between the Wasserstein Barycenter of two (equally weighted) normal distributions and the L2-norm barycenter obtained by considering, as the distance, the L2-norm between the probability density functions aligned on the support. The shape preservation guaranteed by the Wasserstein Barycenter is clearly visible: it is normal, instead of bimodal.
Wasserstein Barycenters generalize McCann’s interpolation to the case of more than two measures, providing a natural extension of mean distributions that preserves geometric structure in probability spaces [20]. Recent theoretical advances have established both the fundamental importance and computational complexity of these objects, with Wasserstein Barycenters being proven NP-hard to compute in general settings [21], highlighting the necessity for efficient algorithmic approximations.

2.4. Ambiguity Set Construction and Distance Metrics

Constructing ambiguity sets is one of the most important design decisions in DRO, since different distance metrics create entirely different optimization problems and computational challenges. Ref. [22] came up with a new data-driven approach that relies on statistical hypothesis tests to build uncertainty sets that balance robustness with performance. This gives us principled statistical frameworks instead of arbitrary specifications. They use historical data to figure out appropriate robustness levels and incorporate complex statistical testing procedures that can handle intricate dependency structures.
There are several basic ways to construct ambiguity sets in DRO. Moment-based ambiguity sets work by constraining the mean, variance, and higher-order moments. While these are computationally manageable, they often miss important aspects of the distribution. Divergence-based ambiguity sets create neighborhoods around reference measures using metrics like Kullback–Leibler divergence [17] or Total Variation distance. These have solid theoretical backing but can be hard to interpret geometrically. Wasserstein-based ambiguity sets offer flexible and intuitive frameworks that use optimal transport metrics to measure distributional distances, which gives us both computational feasibility and meaningful geometric understanding.
Wasserstein-based methods have become popular among these options because they can handle realistic transportation costs and come with strong theoretical guarantees. Figure 2 shows the three main ways to construct ambiguity sets in DRO, illustrating how moment-based, divergence-based, and Wasserstein-based methods create different-sized uncertainty neighborhoods around the reference distribution. The figure highlights the trade-offs between computational ease, geometric understanding, and methodological flexibility across these different approaches.
Computational Speed. How fast and practical it is to solve the resulting optimization problem.
Geometric Clarity. How easy it is to understand the geometric meaning of distributional distances and uncertainty neighborhoods (Wasserstein methods give clear transportation cost intuition, while moment-based methods involve abstract statistical constraints).
Adaptability. How well the method can handle different problem types, cost functions, and distributional characteristics (Wasserstein methods can adjust cost functions for specific uses, while moment-based methods are stuck with fixed moment structures).
Building on these foundations, Ref. [24] made significant theoretical contributions to optimal transport-based DRO, establishing crucial structural properties of the value function in Wasserstein-based DRO problems and developing efficient iterative solution schemes. The computational implications of different ambiguity set constructions vary significantly depending on the chosen distance metric, with Wasserstein-based methods requiring more sophisticated optimal transport algorithms but providing valuable geometric insights, particularly in high-dimensional settings where traditional distance metrics may be overly conservative.

2.5. BO Challenges

Traditional BO approaches run into several basic problems that have pushed researchers to develop distributionally robust extensions. DRBO methods specifically handle uncertainty in the underlying probability distribution of the objective function, which is affected by unknown stochastic noise. Ref. [9] shows how Wasserstein-based distributionally robust approaches can give theoretical guarantees even when the model is misspecified. Meanwhile, Ref. [4] lays out the mathematical foundations for incorporating distributional ambiguity. More generally, Wasserstein Distributionally Robust Optimization (WDRO) has become a solid framework for dealing with distributional uncertainty across different optimization problems [25,26]. It provides robustness guarantees that stay valid when the true distribution falls within a specified Wasserstein ball around the empirical distribution. This way of thinking about distributional robustness changes the focus from obtaining perfect model specification to developing optimization strategies that work well across a range of plausible distributions. This makes the approach more practical for real-world applications where we are inherently uncertain about distributional assumptions.

2.6. Acquisition Functions for DRBO

Standard acquisition functions must be adapted for the distributionally robust setting. The key challenge is incorporating distributional uncertainty into the acquisition strategy while maintaining computational tractability.
For UCB-based approaches, the robust acquisition function typically takes the following form:
$\alpha_{\mathrm{RUCB}}(x) = \min_{P \in U} \mathbb{E}_{c \sim P}\left[ \mu_n(x, c) + \beta_n \sigma_n(x, c) \right]$
where U is the ambiguity set over distributions of the context, which is unknown a priori, μ n and σ n are the GP’s posterior mean and standard deviation, after n observations, and β n is a parameter controlling the exploration–exploitation trade-off.
However, simple empirical averaging over observed contexts may not capture the underlying structure of the context distribution, motivating the adoption of the Wasserstein ball ambiguity set and, due to the need for computational efficiency, the Sinkhorn regularization.
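To make the robust UCB concrete, the following sketch evaluates Equation (10) when the ambiguity set is approximated by a small, finite collection of candidate context distributions over a fixed grid; the posterior arrays and candidate distributions are placeholders standing in for a fitted GP and a calibrated ambiguity set.

```python
# Illustrative sketch of the robust UCB for a *discrete* ambiguity set
# U = {Q_1, ..., Q_m} over a finite context grid. The posterior mean/std
# arrays are placeholders standing in for a fitted GP.
import numpy as np

def robust_ucb(mu, sigma, ambiguity_set, beta=1.5):
    """Worst-case expected UCB over candidate context distributions.

    mu, sigma: arrays of shape (n_contexts,) with GP posterior at (x, c_j).
    ambiguity_set: list of probability vectors over the same context grid.
    """
    ucb = mu + beta * sigma
    return min(float(q @ ucb) for q in ambiguity_set)

mu = np.array([0.2, 0.8, 0.5])
sigma = np.array([0.3, 0.1, 0.2])
U = [np.array([0.5, 0.3, 0.2]), np.array([0.2, 0.2, 0.6])]
print(robust_ucb(mu, sigma, U))
```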

2.7. Integrating Distributionally Robust Principles into BO

Combining BO with DRO has become a new framework for tackling both parametric uncertainty and distributional uncertainty in data-driven optimization problems. Ref. [27] introduced Bayesian Distributionally Robust Optimization (BDRO), which builds a comprehensive theoretical foundation that differs from traditional DRO approaches by including Bayesian estimation of unknown parametric distributions. Their framework builds ambiguity sets using parametric distributions as reference points, which allows for robust optimization that keeps the benefits of Bayesian estimation when data are scarce while providing robustness against model uncertainty.
Extending distributionally robust principles to BO is a recent development that tackles expensive function evaluation under distributional uncertainty. Ref. [28] worked on black-box optimization problems that include both design variables and uncertain context variables. They handled both aleatoric and epistemic uncertainty sources using adaptive and safe optimization strategies in high-dimensional spaces. Ref. [9] made important progress by tackling situations where context distributions are uncertain, but we know they fall within ambiguity sets defined as balls in the Wasserstein space. The researchers established sublinear regret bounds that give us theoretical guarantees for how DRBO algorithms converge. These developments together show that DRBO has matured into a unified framework that can handle both parametric and distributional uncertainties in complex decision-making situations.
DRBO brings several unique challenges compared to standard BO:
Ambiguity Set Construction: The choice of ambiguity set U significantly impacts the robustness performance trade-off. People typically use Wasserstein balls, ϕ-divergence balls, and moment-based sets [23].
Computational Complexity: The min–max formulation creates a nested optimization problem that is still computationally difficult even with recent algorithmic improvements. Modern approaches use dual formulations and cutting-plane methods to deal with the computational burden of continuous ambiguity sets [11,29,30,31], but scaling up to high-dimensional problems is still an active area of research.
Conservative Solutions: DRBO can produce overly conservative solutions when the ambiguity set is not sized correctly. Recent work has focused on developing principled approaches for ambiguity set calibration that balance being conservative with maintaining good performance. This includes adaptive shrinking strategies and risk-aware formulations [32,33]. One key limitation is sample efficiency: when you are dealing with distributional uncertainty, it takes longer for the algorithm to converge compared to standard Bayesian Optimization. This happens because the algorithm needs to protect itself against multiple possible scenarios that could play out. To address this problem, current research is working on better acquisition function designs and multi-fidelity approaches that can reduce the computational burden that comes with distributional robustness [34].

2.8. Computational Challenges and High-Dimensional Considerations

The curse of dimensionality creates fundamental challenges in DRO, since traditional distance metrics can produce prohibitively large ambiguity sets and overly conservative performance guarantees in high-dimensional spaces. Ref. [35] tackled this critical limitation by developing the first finite-sample guarantees for Wasserstein DRO that break the curse of dimensionality. Their groundbreaking work demonstrates how the out-of-sample performance of robust solutions depends on sample size, uncertainty dimension, and loss function complexity in a nearly optimal manner, providing theoretical foundations that make high-dimensional Wasserstein DRO practically viable. The theoretical foundations established by Ref. [23] provided essential performance guarantees and tractable reformulations for data-driven DRO using the Wasserstein metric, enabling practical implementations that have significantly influenced subsequent algorithmic developments. Ref. [25] showed that data-driven Wasserstein DRO can achieve workable reformulations and strong performance guarantees when you properly account for the underlying geometry of the problem. They found that Wasserstein ambiguity sets naturally regularize the system, which helps prevent overfitting when you are working with limited samples.
Recent developments include advanced generic column generation algorithms for high-dimensional multimarginal optimal transport problems, which allow for accurate mesh-free Wasserstein Barycenters and cubic Wasserstein splines [36]. The comprehensive theoretical framework from Ref. [10] continues to support modern computational optimal transport applications with solid convergence properties and approximation bounds for practical data science uses.

2.9. Research Gaps

This literature review shows critical computational and theoretical gaps in DRBO that standard approaches simply cannot address. Current robust BO methods face two key problems: first, naive aggregation strategies like uniform weighting across uncertainty sets completely ignore how probability spaces are actually structured, which leads to suboptimal acquisition functions that work less efficiently than methods that understand distributions better. Second, existing Wasserstein-based approaches need to solve transport problems at every single BO iteration, which creates major bottlenecks and makes them too slow to use in practice. The real problem here is keeping distributional geometry intact during optimization. When uncertainty sets contain distributions that have completely different support or covariance structures, simply averaging them destroys the geometric relationships that actually drive good exploration–exploitation trade-offs. Our Sinkhorn-regularized approach tackles this computational barrier head-on by approximating Wasserstein Barycenters while keeping geometric fidelity within ε-accuracy of exact solutions. But we still face three unresolved challenges: handling distributional uncertainty sets that span different support dimensions, keeping barycenter approximation quality stable as the ambient space dimension grows, and developing adaptive regularization parameters that can balance computational speed against geometric precision across different types of problems. This points to a clear need for developing computationally practical methods that preserve essential distributional geometry without giving up the convergence guarantees that make BO useful for expensive optimization problems.

3. Sinkhorn Barycenter Upper Confidence Bound for DRBO

This section walks through our Sinkhorn Barycenter Upper Confidence Bound (SB-UCB) approach for DRBO. We start by setting up the DRBO framework and laying out the minimax optimization problem when we are dealing with distributional uncertainty (Section 3.1). Then we show how to extend standard GP modeling so it can handle both contextual variables and decision variables together (Section 3.2). The key innovation here is how we build Wasserstein-based ambiguity sets (Section 3.3) and use Sinkhorn-regularized Wasserstein Barycenters to create robust distributional representations (Section 3.4). Here is how it works: we take the empirical context distribution and split it into multiple candidate distributions, then compute their entropic-regularized Wasserstein Barycenter to obtain a single representative distribution. This representative distribution keeps the important geometric structure while still accounting for distributional uncertainty. We then use this barycenter distribution in our robust acquisition function to make optimization both safe and efficient.

3.1. DRBO Framework

DRBO tackles optimization problems where we are not sure about the underlying data-generating distribution, which means it extends traditional BO to handle situations where we have distributional ambiguity ([11,28]). This approach is especially useful in real-world applications like robust hyperparameter tuning when datasets shift over time, autonomous systems that need to work under uncertain environmental conditions, and Portfolio Optimization, where we face model risk ([37,38]).
The DRO problem is formulated as
$x^* = \operatorname*{argmax}_{x \in X \subseteq \mathbb{R}^d} \ \min_{P \in U} \ \mathbb{E}_{c \sim P}\left[ f(x, c) \right]$
where x X R d represents the decision variables, c C R k denotes the context variables, and f : X × C R is an unknown objective function. U is the ambiguity set containing plausible distributions.
In practice, we construct the ambiguity set from empirical observations, leading to
$x^* = \operatorname*{argmax}_{x \in X} \ \min_{P \in U_n} \ \mathbb{E}_{c \sim P}\left[ f(x, c) \right]$
where $U_n$ represents the ambiguity set constructed according to the $n$ observations of the context, that is $C_n = \{c_1, c_2, \dots, c_n\}$.

3.2. GP Modeling for DRBO

BO represents a sample-efficient model-based sequential method for global optimization of black-box, multi-extremal, and expensive-to-evaluate objective functions [1,2,3]. The fundamental problem addresses global minimization:
$x^* \in \operatorname*{argmin}_{x \in X \subseteq \mathbb{R}^h} f(x, c)$
where X typically represents the h-dimensional unit hypercube X = [ 0 , 1 ] h , c C R p denotes the contextual variables, and f : X × C R represents the expensive objective function exhibiting black-box and multi-extremal properties. The context c captures environmental conditions, problem parameters, or external factors that influence the objective function but are not directly controllable during optimization.
The generic iteration of BO consists of generating an approximation of f ( x , c ) depending on previous observations D ( n ) = ( X , C , y ) , with X = { x ( i ) } i = 1 : n , C = { c ( i ) } i = 1 : n and y = { y ( i ) } i = 1 : n . After that, the next query x ( n + 1 ) that balances exploration and exploitation is selected, given the current context c ( n + 1 ) .
GP regression over the joint input space $X \times C$ provides the most common approximation framework, leading to GP-based BO. While $X$ is the space of all possible decisions, $C$ is the space of the observable, but not controllable, contexts. The GP predictive equations for mean and variance are
$\mu(x, c) = \mu_0(x, c) + k\big((x, c), (X, C)\big)\left[ K + \sigma_\varepsilon^2 I \right]^{-1}\big( y - \mu_0(X, C) \big)$
$\sigma^2(x, c) = k\big((x, c), (x, c)\big) - k\big((x, c), (X, C)\big)\left[ K + \sigma_\varepsilon^2 I \right]^{-1} k\big((X, C), (x, c)\big)$
where $\mu_0(x, c)$ represents the prior mean over the joint space, $k(\cdot, \cdot)$ denotes the kernel function over pairs $(x, c)$, and $K$ is the $n \times n$ kernel matrix whose entries are $K_{ij} = k\big((x_i, c_i), (x_j, c_j)\big)$. Then, the acquisition function, such as GP-UCB, determines the next query given the observed context by solving
$x^{(n+1)} \in \operatorname*{argmin}_{x \in X} \ \mu_n\big(x, c^{(n+1)}\big) - \beta_n \sigma_n\big(x, c^{(n+1)}\big)$
where the optimization is performed over the decision variables x while conditioning on the given context c ( n + 1 ) .
In our implementation, we use a Matérn 5/2 kernel operating on the joint input space of decision variables and contexts. Specifically, we use an anisotropic Matérn 5/2 kernel, with a different length-scale for each dimension, as determined by Automatic Relevance Determination (ARD) [39].
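The following sketch shows one way such a surrogate can be set up with scikit-learn: a GP over the joint (decision, context) input with an anisotropic Matérn 5/2 kernel (one length-scale per dimension, i.e., ARD). The synthetic data and kernel hyperparameters are illustrative and not the paper's experimental configuration.

```python
# Hedged sketch of a GP surrogate on the joint (x, c) space with an ARD
# Matern 5/2 kernel, using scikit-learn. Training data are synthetic.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X_dec = rng.uniform(0, 1, size=(20, 2))          # decision variables x
C_ctx = rng.normal(0.5, 0.1, size=(20, 1))       # observed contexts c
XC = np.hstack([X_dec, C_ctx])                   # joint input (x, c)
y = np.sin(3 * XC[:, 0]) + XC[:, 2] + 0.05 * rng.normal(size=20)

# One length-scale per joint dimension implements ARD.
kernel = Matern(length_scale=np.ones(XC.shape[1]), nu=2.5)
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-4, normalize_y=True)
gp.fit(XC, y)

mu, std = gp.predict(np.array([[0.3, 0.7, 0.5]]), return_std=True)
print(mu, std)   # posterior mean and standard deviation at a query (x, c)
```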

3.3. Ambiguity Set Construction via Wasserstein Distance

Given empirical observations { c 1 , c 2 , , c n } , we construct Wasserstein-based ambiguity sets:
$U_n^\varepsilon = \left\{ P : W_2^2\big(P, \hat{P}_n\big) \leq \varepsilon \right\}$
where $\hat{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{c_i}$ is the empirical distribution and $W_2$ is the 2-Wasserstein distance.
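For discrete empirical measures, membership in this ambiguity set can be checked directly with an optimal transport solver. The sketch below uses POT's ot.emd2 to compute $W_2^2$ between the empirical context distribution and a candidate distribution; the sampled contexts and the radius eps are illustrative.

```python
# Sketch of the Wasserstein-ball ambiguity set: checking whether a candidate
# discrete distribution P lies within radius eps of the empirical context
# distribution P_hat_n. POT's ot.emd2 computes W_2^2 here.
import numpy as np
import ot

rng = np.random.default_rng(1)
contexts = rng.normal(0.0, 1.0, size=(30, 1))          # observed c_1..c_n
candidate = rng.normal(0.2, 1.0, size=(30, 1))         # support of candidate P

a = np.full(len(contexts), 1.0 / len(contexts))        # empirical weights
b = np.full(len(candidate), 1.0 / len(candidate))
M = ot.dist(contexts, candidate)                       # squared Euclidean cost

w2_sq = ot.emd2(a, b, M)
eps = 0.5
print(w2_sq, w2_sq <= eps)   # True -> candidate belongs to U_n^eps
```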

3.4. Sinkhorn Barycenter Construction for Robust Representation

3.4.1. Distributional Uncertainty Partitioning

We partition the uncertainty space by considering multiple plausible distributions within the ambiguity set. We construct m candidate distributions by
  • Random Partitioning: Generate m random partitions from the empirical distribution using permutation-based splitting
  • Bootstrap Sampling: Generate m bootstrap samples from the empirical distribution
  • Balanced Partitioning: Create distributions by balanced division to ensure minimum samples per distribution
Each candidate distribution $P_i$ defines a discrete measure $\hat{P}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \delta_{c_i^{(j)}}$, where $c_i^{(j)}$ denotes the $j$-th sample in candidate distribution $P_i$, derived from $\{c_1, c_2, \dots, c_n\}$.
Figure 3 shows the three ways we can partition distributional uncertainty. Random partitioning (b) creates balanced distributions by splitting them based on permutations, bootstrap sampling (c) builds distributions by sampling with replacement, and balanced partitioning (d) takes the empirical distribution (a) and divides it step-by-step into equal-sized chunks. We chose to go with the random partitioning approach because it strikes a good balance between randomness and size constraints—this way, each distribution has enough samples to be meaningful while still creating variety across the ambiguity set {P1, P2, P3}.
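The three partitioning strategies can be sketched in a few lines of NumPy; the helper names below are ours, and the minimum-size constant in the balanced variant is an illustrative choice.

```python
# Hedged sketch of the three partitioning strategies of Section 3.4.1 for
# building m candidate context distributions from n observed contexts.
import numpy as np

def random_partitions(contexts, m, rng):
    """Permutation-based splitting into m roughly equal parts."""
    return np.array_split(rng.permutation(contexts), m)

def bootstrap_samples(contexts, m, rng):
    """m bootstrap resamples (with replacement) of the empirical contexts."""
    n = len(contexts)
    return [contexts[rng.integers(0, n, size=n)] for _ in range(m)]

def balanced_partitions(contexts, m, min_size=5):
    """Sequential division into equal-sized chunks with a minimum size."""
    m = min(m, max(1, len(contexts) // min_size))
    return np.array_split(contexts, m)

rng = np.random.default_rng(0)
contexts = rng.normal(size=(60, 1))
P = random_partitions(contexts, m=3, rng=rng)   # candidate distributions P_1..P_3
print([len(p) for p in P])
```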

3.4.2. Entropic Regularized Wasserstein Barycenter for Robust Aggregation

The concept of Wasserstein Barycenters, introduced by Ref. [20], allows us to construct a representative distribution that captures the geometry of the ambiguity set. We compute the Wasserstein Barycenter of candidate distributions. Given candidate distributions { P 1 , , P m } over C constructed from empirical context observations (where each P i represents a plausible distributional model of the context-generating process), the Wasserstein Barycenter ν * with weights λ = ( λ 1 , , λ m ) , where i = 1 m λ i = 1 , solves
$\nu^* = \operatorname*{argmin}_{\nu \in P(C)} \sum_{i=1}^m \lambda_i W_2^2(\nu, P_i)$
where $W_2(\cdot, \cdot)$ is the 2-Wasserstein distance, $P(C)$ denotes the space of probability measures supported on the context space $C$, and $P_i \in P(C)$. The barycenter $\nu^*$ provides a geometrically principled aggregation of the candidate distributions that preserves their essential structural properties while accounting for distributional uncertainty in the context-generation process.
Computing exact Wasserstein Barycenters is computationally prohibitive for large datasets. Following Ref. [18] and the comprehensive treatment by Ref. [10], we employ entropic regularization to make the problem tractable. The entropic regularized optimal transport problem is
$W_{2,\gamma}^2(\nu, P_i) = \min_{T \in \Pi(\nu, P_i)} \langle C, T \rangle + \gamma H(T)$
where
  • $C$ is the cost matrix with entries $C_{ij} = \| c_i - c_j \|_2^2$
  • $T$ is the transport plan constrained to $\Pi(\nu, P_i)$ (the set of couplings with marginals $\nu$ and $P_i$)
  • $\gamma > 0$ is the regularization parameter
  • $H(T) = \sum_{j,k} T_{jk} \log T_{jk}$ is the (negative) entropy of the transport plan
The regularized barycenter problem becomes
$\nu_\gamma^* = \operatorname*{argmin}_{\nu \in P(C)} \sum_{i=1}^m \lambda_i W_{2,\gamma}^2(\nu, P_i)$
This formulation allows for efficient computation using the Sinkhorn algorithm while preserving the essential geometric properties of the Wasserstein Barycenter [40].

4. Distributionally Robust Sinkhorn Barycenter UCB Method

This section walks through the complete algorithmic implementation of our Distributionally Robust Sinkhorn Barycenter UCB (DR-SB-UCB) method. We start by explaining the Sinkhorn algorithm and how it efficiently computes Barycenters using entropic regularization (Section 4.1). Then, we develop a robust acquisition function that maximizes UCB values over barycenter support points while adding distributional Lipschitz regularization to control how sensitive the method is to context variations, following the approach by Ref. [9] (Section 4.2). The complete SWBBO algorithm brings these components together with adaptive robustness scheduling that becomes less conservative over time (Section 4.3). Finally, we establish theoretical convergence guarantees for each part of the algorithm, including how fast Sinkhorn converges, bounds on barycenter approximation quality, and regret bounds for the overall robust optimization process.

4.1. Sinkhorn Algorithm for Robust Barycenter Computation

In our acquisition function, as we discussed in the previous section, we used the Entropic Regularized Wasserstein Barycenter to compute a compact and representative distribution over the empirical context distributions { P 1 , , P m } . This approach lets us efficiently approximate the Wasserstein Barycenter by adding an entropic penalty to the optimal transport formulation, which makes the computation scalable. The entropic regularization transforms the optimal transport problem into a form that can be solved using Bregman projections [41,42]. Bregman divergences provide a geometric framework for measuring distances between probability distributions based on convex functions—in this case, the entropy function. The Sinkhorn algorithm can be viewed as alternating Bregman projections onto marginal constraints, which guarantees convergence to the unique solution of the regularized problem. This framework is crucial because it provides the theoretical foundation for why entropic regularization preserves the geometric structure of optimal transport while making computation tractable. To implement this method, we used the ot.bregman.barycenter function from the Python Optimal Transport (POT) library [43]. This implementation handles free support barycenter computation by alternating between Sinkhorn-based optimal transport plan estimation and support point updates [43,44]. We chose the regularization parameter γ = 0.1 to obtain a good balance between computational efficiency and approximation quality, and we set the maximum number of iterations to T m a x = 100 to give it enough time to converge.
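A minimal sketch of this step, using ot.bregman.barycenter on a shared 1-D context grid (a fixed-support simplification of the free-support computation described above); the two Gaussian-like histograms, the grid, and the cost normalization are illustrative choices.

```python
# Hedged sketch of an entropic-regularized barycenter on a shared discrete
# support (a 1-D context grid), using POT's ot.bregman.barycenter.
import numpy as np
import ot

grid = np.linspace(-3, 3, 100).reshape(-1, 1)          # common support
M = ot.dist(grid, grid)                                 # squared Euclidean cost
M /= M.max()                                            # scale costs for stability

def hist(mean, std):
    h = np.exp(-0.5 * ((grid.ravel() - mean) / std) ** 2)
    return h / h.sum()

A = np.vstack([hist(-1.0, 0.4), hist(1.2, 0.6)]).T      # columns = candidate P_i
weights = np.array([0.5, 0.5])                          # lambda_i

bary = ot.bregman.barycenter(A, M, reg=0.1, weights=weights)
print(bary.shape, bary.sum())                           # a histogram on the grid
```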

4.2. Distributionally Robust Upper Confidence Bound Acquisition

4.2.1. Robust UCB with Barycenter Representation

Given the barycenter representation, we define the Distributionally Robust Sinkhorn Barycenter UCB acquisition function as
$\alpha_{\mathrm{DR\text{-}SB\text{-}UCB}}(x) = \max_{c \in \mathrm{Barycenter}} \mathrm{UCB}(x, c) - L(x) \cdot R(t)$
where
$\mathrm{UCB}(x, c) = \mu_n(x, c) + \beta_t \sigma_n(x, c)$
Here, μ n ( x , c ) and σ n ( x , c ) are the posterior mean and standard deviation of the GP at ( x , c ) after n observations. The term L ( x ) represents a Lipschitz constant, estimated via the gradient norm of the UCB for the context variable, and R ( t ) is a robustness radius that bounds the size of the Wasserstein uncertainty set. The radius term follows an adaptive schedule that balances robustness and performance:
$R(t) = R_0 / \sqrt{t}$
where $R_0 > 0$ is the initial radius parameter (set to 0.3 in our implementation) and $t$ is the current iteration.
The max operation selects the most promising support point across the barycenter, capturing the optimistic (best-case) perspective while maintaining robustness through the barycenter construction. The barycenter weights are implicitly incorporated through the support point selection process during barycenter computation. This acquisition function promotes the selection of inputs that are robustly optimal across plausible perturbations of the context distribution. The maximization over support points captures the most optimistic evaluation while accounting for distributional uncertainty via the penalization term $L(x) \cdot R(t)$. The evolution of the best observed value is reported in a chart, together with a boxplot of the final best value over different runs, as in our results (Figures 4–11).
In SWBBO, the Sinkhorn-based Wasserstein Barycenter is a discrete probability distribution, not merely a set of samples. It consists of support points and associated weights, forming an empirical distribution that captures the geometric structure of the context uncertainty. These support points are directly used in the acquisition function (Equation (22)), where the UCB is evaluated and maximized over them.
This is visually illustrated in Figures 12–17, where the barycenter (in red) clearly differs from simple samples: it interpolates the original distributions while preserving shape and structure, especially at lower regularization levels (e.g., Figure 16).
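The following sketch shows how Equation (22) can be evaluated at a single candidate $x$ once the barycenter support points are available; the GP posterior is passed in as a callable, the Lipschitz value is assumed to be precomputed, and the decaying radius $R(t) = R_0/\sqrt{t}$ with $R_0 = 0.3$ mirrors our reading of Equation (24). Function names are ours.

```python
# Minimal sketch of the DR-SB-UCB acquisition at a candidate x over the
# barycenter support points. The GP posterior is a placeholder callable.
import numpy as np

def dr_sb_ucb(x, barycenter_support, gp_posterior, lipschitz, t,
              beta_t=1.5, r0=0.3):
    """max_{c in barycenter} [mu + beta*sigma] - L(x) * R(t), with R(t) = R0/sqrt(t)."""
    ucb_values = []
    for c in barycenter_support:
        mu, sigma = gp_posterior(x, c)       # GP posterior at the joint point (x, c)
        ucb_values.append(mu + beta_t * sigma)
    robustness_radius = r0 / np.sqrt(t)
    return max(ucb_values) - lipschitz * robustness_radius

# toy usage with a fake posterior
fake_posterior = lambda x, c: (np.sin(x + c), 0.1 + 0.05 * abs(c))
support = np.array([-0.5, 0.0, 0.7])
print(dr_sb_ucb(0.4, support, fake_posterior, lipschitz=0.8, t=10))
```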

4.2.2. Distributional Lipschitz Regularization

The Lipschitz term L ( x ) provides regularization based on the function’s sensitivity to changes in the distribution. This formulation of distributional Lipschitz regularization follows the approach introduced by Micheli et al. [9]:
$L(x) = \max_{c \in C_L} \left\| \nabla_c \left[ \mu_n(x, c) + \beta_t \sigma_n(x, c) \right] \right\|_2$
where $C_L$ is a set of sampled contexts for Lipschitz constant estimation, and the gradient is computed via automatic differentiation.
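A hedged sketch of this estimate using PyTorch automatic differentiation; ucb_fn is a placeholder for a differentiable GP-based UCB surface, and the sampled contexts are illustrative.

```python
# Estimating the distributional Lipschitz term as the largest gradient norm
# of the UCB with respect to the context, via PyTorch autograd.
import torch

def estimate_lipschitz(ucb_fn, x, context_samples):
    """max over sampled contexts of || d/dc UCB(x, c) ||_2."""
    norms = []
    for c0 in context_samples:
        c = torch.tensor(c0, dtype=torch.float64, requires_grad=True)
        value = ucb_fn(x, c)            # scalar UCB at the joint point (x, c)
        grad, = torch.autograd.grad(value, c)
        norms.append(grad.norm().item())
    return max(norms)

# toy usage with a differentiable stand-in for mu + beta * sigma
ucb_fn = lambda x, c: torch.sin(x + c).sum() + 0.1 * (c ** 2).sum()
contexts = [[0.1], [0.5], [-0.3]]
print(estimate_lipschitz(ucb_fn, torch.tensor([0.4], dtype=torch.float64), contexts))
```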
The form of the confidence parameter β t follows the standard formulation introduced by Srinivas et al. [45] for GP-UCB in Equation (26), which provides high-probability bounds on the cumulative regret. In our implementation, we keep it fixed (default: 1.5) for computational simplicity, as commonly performed in prior robust BO literature (e.g., Ref. [9]).
$\beta_t = 2 \log\!\left( \frac{2 \pi^2 t^2}{3 \delta} \right) + 2 d \log\!\left( t^2 \, d \, b \, r \sqrt{\log\!\left( \frac{4 d a}{\delta} \right)} \right)$
where $d$ is the dimension of the input space, $\delta \in (0,1)$ is the confidence level, and $a$, $b$, $r$ are the problem-dependent constants of the bound in Ref. [45] (related to the Lipschitz behaviour of $f$ and the size of the domain).
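For reference, a small helper that evaluates this expression; the constants a, b, and r are set to placeholder values, and, as noted above, our implementation simply keeps $\beta_t = 1.5$.

```python
# Helper mirroring the beta_t expression above (Srinivas et al. GP-UCB).
# a, b, r are problem-dependent constants; the defaults here are placeholders.
import numpy as np

def beta_t(t, d, delta=0.1, a=1.0, b=1.0, r=1.0):
    term1 = 2.0 * np.log(t ** 2 * 2.0 * np.pi ** 2 / (3.0 * delta))
    term2 = 2.0 * d * np.log(t ** 2 * d * b * r * np.sqrt(np.log(4.0 * d * a / delta)))
    return term1 + term2

print(beta_t(t=10, d=3))
```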

4.3. Complete Distributionally Robust Optimization Procedure

Now that we have established the theoretical foundation of entropic regularized Wasserstein Barycenters and the distributionally robust acquisition function, we can present the complete algorithmic procedure for our proposed SWBBO method. The algorithm brings together all the key components we have discussed in the previous sections: Sinkhorn-based barycenter computation for robust context aggregation, the DR-SB-UCB acquisition function with distributional Lipschitz regularization, and adaptive robustness radius scheduling. The algorithm below gives you a step-by-step implementation that balances computational efficiency with theoretical guarantees, using the POT library’s optimized Sinkhorn implementation while keeping the distributional robustness properties that are essential for uncertain environments. Here is the complete SWBBO Algorithm 1:
Algorithm 1: SWBBO
  Input: Objective function $f$, domain $X$, context space $C$, budget $N$, initial samples $n_0$, ambiguity radius $\varepsilon$, regularization parameter $\gamma$, number of candidate distributions $m$, Sinkhorn iterations $T_{max}$
  Output: Robust optimal point $x^*$ and function evaluations $\{(x_t, c_t, y_t)\}_{t=1}^{N}$
1. Initialization:
  • Sample $n_0$ initial points $\{(x_i, c_i)\}_{i=1}^{n_0}$ and evaluate $y_i = f(x_i, c_i)$
  • Set $D = \{(x_i, c_i, y_i)\}_{i=1}^{n_0}$
  • Initialize context samples $C_L$ for Lipschitz estimation
For $t = n_0 + 1, \dots, N$:
2. Fit Gaussian Process:
  • Train the GP model $GP_t$ on $D$
  • Compute the posterior $GP_t(x, c) = \mathcal{N}\big(\mu_t(x, c), \sigma_t^2(x, c)\big)$
3. Construct Candidate Distributions:
  • Extract context samples $\{c_i\}_{i=1}^{t-1}$ from $D$
  • Partition contexts into $m$ candidate distributions $\{P_1, \dots, P_m\}$ by random partitioning, as explained in Section 3.4.1
  • Set uniform weights $\lambda_i = 1/m$ for $i = 1, \dots, m$
4. Compute Entropic Regularized Barycenter:
  • Solve the regularized barycenter problem, Equation (21): $\nu_\gamma^* = \operatorname*{argmin}_{\nu \in P(C)} \sum_{i=1}^m \lambda_i W_{2,\gamma}^2(\nu, P_i)$
  • Use ot.bregman.barycenter with regularization parameter $\gamma = 0.1$ and maximum iterations $T_{max} = 100$
5. Estimate the Distributional Lipschitz Constant, Equation (25): $L(x) = \max_{c \in C_L} \left\| \nabla_c \left[ \mu_n(x, c) + \beta_t \sigma_n(x, c) \right] \right\|_2$
6. Robust Acquisition Optimization:
  • Set the adaptive robustness radius, Equation (24): $R(t) = R_0 / \sqrt{t}$
  • Define the acquisition function, Equation (22): $\alpha_{\mathrm{DR\text{-}SB\text{-}UCB}}(x) = \max_{c \in \mathrm{Barycenter}} \mathrm{UCB}(x, c) - L(x) \cdot R(t)$
  • Solve $x_t = \operatorname*{argmax}_{x \in X} \alpha_{\mathrm{DR\text{-}SB\text{-}UCB}}(x)$ using multi-start optimization
7. Sample Context and Evaluate:
  • Sample context $c_t$ from the current environment
  • Evaluate $y_t = f(x_t, c_t)$
  • Update the dataset $D = D \cup \{(x_t, c_t, y_t)\}$
8. Return $x^*$
This algorithm balances optimistic exploration with robustness to distributional uncertainty over context variables, leveraging Wasserstein Barycenters and regularized UCB scoring.
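To complement the pseudocode, the following is a compact, runnable toy version of the SWBBO loop under strong simplifications: 1-D decision and context, an sklearn GP surrogate, a fixed-grid Sinkhorn barycenter via POT, a constant standing in for the Lipschitz estimate, a fixed $\beta_t = 1.5$, and $R(t) = R_0/\sqrt{t}$. All helper names, constants, and the toy objective are ours, not the paper's configuration.

```python
# Runnable toy sketch of the SWBBO loop (Algorithm 1) under simplifying
# assumptions; see the lead-in text for what is assumed rather than given.
import numpy as np
import ot
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
f = lambda x, c: -(x - 0.6) ** 2 - 0.5 * (x - c) ** 2     # toy objective (maximize)
sample_context = lambda: float(np.clip(rng.normal(0.5, 0.15), 0, 1))

grid = np.linspace(0, 1, 50).reshape(-1, 1)               # shared context support
M = ot.dist(grid, grid); M /= M.max()

def barycenter_support(contexts, m=3, reg=0.1, k=5):
    """Sinkhorn barycenter of m random partitions; returns k top support points."""
    parts = np.array_split(rng.permutation(contexts), m)
    A = np.vstack([np.histogram(p, bins=50, range=(0, 1))[0] + 1e-9
                   for p in parts]).T
    A /= A.sum(axis=0, keepdims=True)
    bary = ot.bregman.barycenter(A, M, reg=reg)
    return grid[np.argsort(bary)[-k:], 0]                  # most probable contexts

def acquisition(gp, supports, t, beta=1.5, r0=0.3, lipz=1.0):
    """Grid search of max_c UCB(x, c) - lipz * R(t) over candidate decisions x."""
    xs = np.linspace(0, 1, 200)
    best_x, best_val = None, -np.inf
    for x in xs:
        XC = np.column_stack([np.full(len(supports), x), supports])
        mu, sd = gp.predict(XC, return_std=True)
        val = np.max(mu + beta * sd) - lipz * r0 / np.sqrt(t)
        if val > best_val:
            best_x, best_val = x, val
    return best_x

# Initialization
X = list(rng.uniform(0, 1, 5)); C = [sample_context() for _ in range(5)]
y = [f(x, c) for x, c in zip(X, C)]

for t in range(6, 30):
    gp = GaussianProcessRegressor(Matern(length_scale=[0.2, 0.2], nu=2.5),
                                  alpha=1e-4, normalize_y=True)
    gp.fit(np.column_stack([X, C]), y)                      # 2. surrogate on (x, c)
    supports = barycenter_support(np.array(C))              # 3-4. barycenter support
    x_t = acquisition(gp, supports, t)                       # 5-6. robust acquisition
    c_t = sample_context()                                   # 7. observe context
    X.append(x_t); C.append(c_t); y.append(f(x_t, c_t))

print("best x:", X[int(np.argmax(y))], "best f:", max(y))   # 8. robust optimum
```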

4.4. Convergence Properties

The convergence behavior of the proposed SWBBO algorithm builds upon established theoretical guarantees from optimal transport theory, entropic regularization, and Bayesian Optimization. This section integrates these results to characterize the convergence of SWBBO in terms of both approximation quality and cumulative regret under distributional uncertainty.
Proposition 1 
(Sinkhorn Algorithm Convergence)
Under entropic regularization, the Sinkhorn algorithm converges geometrically to the unique minimizer of the regularized optimal transport problem. Specifically, the convergence rate is $O(\exp(-t/\tau))$, where $\tau \propto 1/\gamma$ depends on the regularization parameter $\gamma > 0$ and the geometry of the support (e.g., the diameter of the context space) [18,19]. This ensures that our barycenter approximation via Sinkhorn is computationally efficient and converges reliably at each iteration of BO.
Proposition 2 
(Regularized Barycenter Approximation Error)
As $\gamma \to 0$, the entropic-regularized Wasserstein Barycenter $\hat{P}_\gamma$ converges to the true (unregularized) Wasserstein Barycenter $\hat{P}^*$. For discrete empirical distributions with finite support, the approximation error is bounded as $W_2(\hat{P}_\gamma, \hat{P}^*) \leq O\big(\gamma \log(1/\gamma)\big)$ ([10,15]). Therefore, the SWBBO acquisition function constructed from $\hat{P}_\gamma$ remains faithful to the true geometry of the context distribution as $\gamma$ decreases.
Proposition 3 (Robust GP-UCB Regret Bound with Barycentric Aggregation): In each iteration, SWBBO uses a Wasserstein Barycenter (computed via Sinkhorn) to summarize the distributional uncertainty and defines the robust acquisition function accordingly. Assuming the objective function $f$ lies in the reproducing kernel Hilbert space $\mathcal{H}_k$ with bounded norm $\| f \|_k^2 \leq B$, and that the posterior mean and variance are estimated using a GP with kernel $k(x, x')$, the robust cumulative regret after $N$ steps satisfies
$R_N^{\mathrm{robust}} = O^*\!\left( \sqrt{N} \left( B \sqrt{\gamma_N} + \gamma_N \right) \right)$
This regret bound is adapted from the GP-UCB framework by Srinivas et al. [45], extended to account for our robust acquisition function using Wasserstein Barycenters, where $\gamma_N$ is the maximum information gain from $N$ observations. The $O^*$ notation suppresses logarithmic factors, typically of order $\log^3 N$. The quantity $\gamma_N$ depends on the kernel function and the effective dimensionality of the input space [3,45].
Proposition 4 (Distributional Robustness Guarantee): Under the Wasserstein ambiguity set framework, the optimization of the worst-case expected value provides distributional robustness guarantees [23]. The solution quality degrades gracefully as the true distribution deviates from the empirical distribution, with performance bounds depending on the Wasserstein distance between distributions.

5. Experiments and Results

Following the experimental setting presented in Ref. [3], we compared three DRBO algorithms on five benchmark test functions (Ackley, Modified Branin, Hartmann, Three Hump Camel, and Six Hump Camel) and three real-life-inspired problems, namely Continuous Vendor, Portfolio Optimization, and Portfolio Normal Optimization. The algorithms compared are Empirical Risk Bayesian Optimization (ERBO), Wasserstein Distance Robust Bayesian Optimization (WDRBO), and SWBBO (with γ = 0.1, 0.01, and 0.001). Each algorithm is evaluated over 30 independent runs to ensure statistical significance. The baseline methods, ERBO and WDRBO, use the implementation provided by the authors of Ref. [9], freely accessible on their GitHub repository.
The following table summarizes the best value of f ( x , c ) observed at the end of the optimization processes performed by the different algorithms (median and standard deviation over 30 independent runs), separately for the test problems.
As reported in Ref. [9], ERBO is always better (on median) than WDRBO, apart from in the case of Portfolio Normal Optimization. Moreover, in our experiments—which are more extensive than those reported in Ref. [9]—this difference was statistically significant (with respect to a Wilcoxon test) in the following cases: Ackley (p-value = 0.001), Three Hump Camel (p-value = 0.046), and Six Hump Camel (p-value = 0.015), while it was not statistically significant in the following: Modified_Branin (p-value = 0.532), Hartmann (p-value = 0.260), Continuous Vendor (p-value = 0.605), Portfolio Optimization (p-value = 0.665), and Portfolio Normal Optimization (p-value = 0.254).
When ERBO is compared against the best among the three SWBBO alternatives (Table 1), its results are better (on median) only in the case of Modified Branin, but this difference is not statistically significant (p-value = 0.290). In all the other test problems, the best SWBBO alternative is (on median) better than ERBO; more precisely, it is significantly better in the following: Six Hump Camel (p-value < 0.001), Continuous Vendor (p-value < 0.001), and Portfolio Optimization (p-value < 0.001), while the difference is not statistically significant in Ackley (p-value = 0.127), Hartmann (p-value = 0.708), Three Hump Camel (p-value = 0.440), and Portfolio Normal Optimization (p-value = 0.088). Results are also reported in Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11.
Figure 4. Results on Ackley over 30 independent runs: (top) best observed value of $f(x,c)$ with respect to BO queries and (bottom) boxplot of the final best value of $f(x,c)$.
Figure 5. Results on Modified_Branin over 30 independent runs: (top) best observed value of $f(x,c)$ with respect to BO queries and (bottom) boxplot of the final best value of $f(x,c)$.
Figure 6. Results on Hartmann over 30 independent runs: (top) best observed value of $f(x,c)$ with respect to BO queries and (bottom) boxplot of the final best value of $f(x,c)$.
Figure 7. Results on Three Hump Camel over 30 independent runs: (top) best observed value of $f(x,c)$ with respect to BO queries and (bottom) boxplot of the final best value of $f(x,c)$.
Figure 8. Results on Six Hump Camel over 30 independent runs: (top) best observed value of $f(x,c)$ with respect to BO queries and (bottom) boxplot of the final best value of $f(x,c)$.
Figure 9. Results on Continuous_Vendor over 30 independent runs: (top) best observed value of $f(x,c)$ with respect to BO queries and (bottom) boxplot of the final best value of $f(x,c)$.
Figure 10. Results on Portfolio_Optimization over 30 independent runs: (top) best observed value of $f(x,c)$ with respect to BO queries and (bottom) boxplot of the final best value of $f(x,c)$.
Figure 11. Results on Portfolio_Normal_Optimization over 30 independent runs: (top) best observed value of $f(x,c)$ with respect to BO queries and (bottom) boxplot of the final best value of $f(x,c)$.
Finally, Table 2 summarizes the results in terms of Cumulative Stochastic Regret (CSR) and computational time. ERBO achieved a lower median CSR more frequently than the other methods, even in cases where SWBBO provided a better value of f(x, c). The most plausible reason is the more explorative behaviour of SWBBO after it has identified its own best observed f(x, c).
As far as computational time is concerned, there are no relevant differences between the approaches, nor is there an evident relationship between CSR and time. As shown in Table 2, SWBBO1 and SWBBO3 are often faster than, or comparable to, the baseline methods. SWBBO2, which uses a moderate regularization value, incurs a slightly higher cost in some cases but offers improved robustness. Computational trade-offs therefore depend on the specific configuration and should be considered jointly with regret performance.
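For reference, the sketch below shows one plausible way to accumulate a cumulative stochastic regret curve from a sequence of observed values; it assumes per-query regret is measured against a known reference optimum (maximization), which is a simplification for illustration and not necessarily the exact definition used to produce Table 2.

```python
import numpy as np

def cumulative_stochastic_regret(observed_values, reference_optimum):
    """Hedged sketch: accumulate per-query regret against a reference optimum.

    observed_values: array of f(x_t, c_t) over BO queries (maximization assumed).
    reference_optimum: best achievable value used as the regret reference.
    """
    per_query_regret = reference_optimum - np.asarray(observed_values)
    return np.cumsum(per_query_regret)

# Hypothetical usage with placeholder data.
values = np.array([0.2, 0.5, 0.7, 0.9, 0.95])
print(cumulative_stochastic_regret(values, reference_optimum=1.0))
```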

5.1. Evaluation Metrics

To evaluate how well our computed barycenters and transport maps perform, we use two established metrics from the entropic optimal transport literature [46]. The $BW_2^2$-UVP (Bures–Wasserstein Unexplained Variance Percentage) metric measures the quality of the generated barycenter $\nu$ compared to the ground truth $\tilde{\nu}$, and it is defined as
$$ BW_2^2\text{-}UVP(\nu, \tilde{\nu}) = 100 \cdot \frac{BW_2^2(\nu, \tilde{\nu})}{\tfrac{1}{2}\,\mathrm{Var}(\tilde{\nu})}\,\%, $$
where the Bures–Wasserstein metric $BW_2^2(\nu, \tilde{\nu}) = W_2^2\big(\mathcal{N}(m_\nu, \Sigma_\nu), \mathcal{N}(m_{\tilde{\nu}}, \Sigma_{\tilde{\nu}})\big)$ is computed between Gaussians with the respective means and covariances of the two distributions. This metric provides a normalized measure of the distance between the computed and true barycenter, expressed as a percentage of the target distribution's variance. The $L^2$-UVP ($L^2$ Unexplained Variance Percentage) metric assesses the quality of the individual transport maps from each marginal distribution to the barycenter:
$$ L^2\text{-}UVP(\hat{T}, T^*) = 100 \cdot \frac{\lVert \hat{T} - T^* \rVert^2}{\mathrm{Var}(\tilde{\nu})}\,\%, $$
where $\hat{T}$ denotes the learned transport map from the marginal to the barycenter, and $T^*$ represents the ground-truth mapping. This metric captures how well the learned transport maps approximate the optimal transport plans, normalized by the variance of the target distribution.
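A minimal Python sketch of both metrics is given below; it assumes the distributions are summarized by empirical means and covariances, that Var(ν̃) is taken as the total variance (trace of the covariance matrix), and that the learned and reference transport maps are available as sample-aligned arrays. These are our illustrative assumptions, not a prescription from [46].

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein_sq(mean_a, cov_a, mean_b, cov_b):
    """Squared Bures-Wasserstein distance between the Gaussian approximations
    N(mean_a, cov_a) and N(mean_b, cov_b)."""
    sqrt_a = sqrtm(cov_a)
    cross = sqrtm(sqrt_a @ cov_b @ sqrt_a)
    return float(np.sum((mean_a - mean_b) ** 2)
                 + np.trace(cov_a + cov_b - 2.0 * np.real(cross)))

def bw22_uvp(samples_computed, samples_true):
    """BW2^2-UVP (%): Bures-Wasserstein gap between Gaussian fits, normalized
    by half the total variance (trace of covariance) of the target.
    Both inputs: arrays of shape (n_samples, dim)."""
    m_c, c_c = samples_computed.mean(0), np.cov(samples_computed, rowvar=False)
    m_t, c_t = samples_true.mean(0), np.cov(samples_true, rowvar=False)
    bw2 = bures_wasserstein_sq(m_c, np.atleast_2d(c_c), m_t, np.atleast_2d(c_t))
    return 100.0 * bw2 / (0.5 * np.trace(np.atleast_2d(c_t)))

def l2_uvp(T_hat, T_star, samples_true):
    """L2-UVP (%): mean squared error between learned and reference transport
    maps on aligned samples, normalized by the target's total variance."""
    var_target = np.trace(np.atleast_2d(np.cov(samples_true, rowvar=False)))
    return 100.0 * np.mean(np.sum((T_hat - T_star) ** 2, axis=-1)) / var_target
```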
For illustration, we compute these metrics for the Ackley function across different regularization parameter values. The results demonstrate how regularization strength affects both barycenter quality and transport map accuracy.
Table 3 shows how well the barycenter computation and transport map estimation perform for the Ackley test function across different regularization parameter values (γ = 0.1, 0.01, 0.001), using the two evaluation metrics $BW_2^2$-UVP and $L^2$-UVP. The $BW_2^2$-UVP metric shows that performance varies with regularization strength: moderate regularization (γ = 0.1, 0.01) yields acceptable barycenter quality, with $BW_2^2$-UVP values of 0.35% and 0.77%, respectively, while weak regularization (γ = 0.001) achieves an essentially exact barycenter approximation with $BW_2^2$-UVP = 0.0000%. The $L^2$-UVP values remain consistently around 7–8% across all regularization levels, indicating that the learned transport maps maintain reasonable accuracy regardless of the regularization parameter. These results show that weaker regularization (smaller γ) improves barycenter quality for the Ackley function, while transport-map fidelity remains stable, suggesting that the method performs robustly across different regularization settings for this test problem.

5.2. Sinkhorn vs. LP Barycenter Comparison

Table 4 shows a quantitative comparison between barycenters computed using the Sinkhorn–Wasserstein method and those obtained through Linear Programming (LP) for the Ackley test function across different regularization parameters (γ = 0.1, 0.01, 0.001). Three metrics are used for the evaluation: Wasserstein Distance (WD), Mean Euclidean Distance (MED), and Maximum Mean Discrepancy (MMD). The results demonstrate that the regularization parameter has a significant impact on the agreement between the Sinkhorn and LP barycenters. For the Wasserstein Distance, the deviation decreases as the regularization weakens: from 0.002593 at γ = 0.1 to 0.001621 at γ = 0.01, and further to 0.001208 at γ = 0.001. This pattern indicates that smaller γ values bring the two methods into much closer agreement under the optimal transport geometry. The Mean Euclidean Distance behaves analogously, dropping from 0.010303 at γ = 0.1 to 0.007938 at γ = 0.01 and to 0.003189 at γ = 0.001, meaning that the pointwise differences between the barycenters shrink as the regularization weakens. The MMD, which measures distributional similarity through a kernel-based comparison, shows the most pronounced improvement: from 0.000604 at γ = 0.1 to 0.000299 at γ = 0.01, and finally to 0.000059 at γ = 0.001.
Overall, the results show that both Sinkhorn and LP methods produce increasingly aligned barycenters as the regularization parameter becomes smaller, with the strongest agreement achieved at γ = 0.001 across all evaluation metrics for the Ackley function.
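In outline, this comparison can be reproduced with the POT library [43], assuming it exposes ot.bregman.barycenter for the entropically regularized (Sinkhorn) barycenter on a fixed support, ot.lp.barycenter for its exact LP counterpart, and ot.emd2 for the optimal transport cost between the two. The histograms below are illustrative stand-ins, not the distributions used in our experiments, and smaller γ values may require a stabilized or log-domain Sinkhorn solver.

```python
# Hedged sketch (POT library): fixed-support Sinkhorn vs. LP barycenters on a
# 1D grid, compared via an exact OT cost and a simple pointwise discrepancy.
import numpy as np
import ot

n = 100                                    # grid size
x = np.linspace(-5.0, 5.0, n).reshape(-1, 1)
M = ot.utils.dist(x, x)                    # squared Euclidean cost matrix
M /= M.max()                               # normalization for numerical stability

def gauss_hist(mu, sigma):
    """Illustrative input histogram on the grid."""
    h = np.exp(-0.5 * ((x.ravel() - mu) / sigma) ** 2)
    return h / h.sum()

A = np.vstack([gauss_hist(-2.0, 0.6), gauss_hist(2.0, 0.8)]).T   # (n, 2)
weights = np.array([0.5, 0.5])

gamma = 0.1                                # smaller gamma -> closer to LP, but
bary_sinkhorn = ot.bregman.barycenter(A, M, reg=gamma, weights=weights)
bary_lp = ot.lp.barycenter(A, M, weights=weights)

wd = ot.emd2(bary_sinkhorn, bary_lp, M)    # exact OT cost between barycenters
gap = np.mean(np.abs(bary_sinkhorn - bary_lp))   # simple pointwise discrepancy
print(f"WD = {wd:.6f}, mean pointwise gap = {gap:.6f}")
```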
The $BW_2^2$-UVP and $L^2$-UVP metrics may not fully capture convergence in the entropic optimal transport sense, since they mainly measure pointwise differences rather than the underlying geometric structure of the barycenter computation. The WD, MED, and MMD metrics are better suited for this evaluation because they directly measure how well the distributions align and whether the geometry remains consistent across the different ways of computing the barycenters.
Figure 12, Figure 13, Figure 14, Figure 15, Figure 16 and Figure 17 provide a comparative analysis of Sinkhorn Barycenter computation versus the LP Barycenter method across different regularization strengths. The analysis is organized in pairs: the even-numbered Figure 12, Figure 14 and Figure 16 show the Sinkhorn Barycenter among multiple distributions with the computed barycenter highlighted in red, while the odd-numbered Figure 13, Figure 15 and Figure 17 give direct comparisons between the Sinkhorn–Wasserstein (S-W) method (red curves) and the LP method (green curves). The regularization parameter γ decreases progressively across the figure pairs: Figure 12 and Figure 13 use γ = 0.1, Figure 14 and Figure 15 use γ = 0.01, and Figure 16 and Figure 17 use γ = 0.001. This progression reveals the fundamental relationship between regularization strength and solution characteristics.
At high regularization (γ = 0.1, Figure 12 and Figure 13), the Sinkhorn method produces highly smooth, almost flat distributions that differ visibly from the LP solution. The strong entropic regularization pushes the barycenter toward maximum entropy, yielding broad, spread-out distributions, whereas the LP method keeps sharper, more concentrated peaks that better preserve the shape of the original distributions. At moderate regularization (γ = 0.01, Figure 14 and Figure 15), the Sinkhorn Barycenter moves closer to the LP solution while remaining smooth: the entropic term still smooths the solution noticeably, but the barycenter captures the underlying distributional structure better, with more clearly defined peaks and valleys. At low regularization (γ = 0.001, Figure 16 and Figure 17), the Sinkhorn and LP methods converge toward nearly identical solutions; the minimal regularization lets the Sinkhorn method approach the true optimal transport solution very closely while retaining its computational advantages, and the distributions show sharp, well-defined features that match closely between the two methods.
This progression illustrates the trade-off between computational efficiency and solution accuracy in regularized optimal transport: higher regularization yields faster convergence but may oversmooth important features of the distributions, while lower regularization stays more faithful to the true barycenter at a higher computational cost.
Figure 12. Sinkhorn Barycenter on Ackley at γ = 0.1.
Figure 13. Sinkhorn vs. LP Barycenter comparison on Ackley at γ = 0.1.
Figure 14. Sinkhorn Barycenter on Ackley at γ = 0.01.
Figure 15. Sinkhorn vs. LP Barycenter comparison on Ackley at γ = 0.01.
Figure 16. Sinkhorn Barycenter on Ackley at γ = 0.001.
Figure 17. Sinkhorn vs. LP Barycenter comparison on Ackley at γ = 0.001.

6. Conclusions: Limitations and Perspectives

BO is highly effective for optimizing data-driven systems under uncertainty, where randomness may stem from contextual conditions, model parameters, or noisy observations. In such settings uncertainty is multi-faceted, which makes the development of DRBO algorithms both important and challenging. One of the main difficulties is that context distributions are infinite-dimensional objects, especially when contexts are continuous. Solving BO problems that are robust to such uncertainty requires tractable approximations of otherwise intractable constrained optimization problems over probability spaces. Historically, robust BO formulations have employed φ-divergences to define ambiguity sets over context distributions. However, Wasserstein distances, arising from OT theory, have recently gained prominence due to their desirable geometric, topological, and statistical properties. Unlike φ-divergences, Wasserstein distances naturally incorporate the geometry of the underlying space and provide meaningful comparisons even when distributions have disjoint supports. This geometry-aware notion of robustness allows the Wasserstein metric to capture subtle shifts in distribution and thus leads to better-behaved optimization landscapes. As a result, Wasserstein-based ambiguity sets have become increasingly popular for modeling uncertainty in robust machine learning, including BO.
Despite these advantages, computational complexity remains a well-known bottleneck. The exact computation of Wasserstein distances has a complexity of $O(n^3 \log n)$ with respect to the number of samples n, which makes OT prohibitive in high dimensions. To address this, entropic regularization via the Sinkhorn algorithm has emerged as a practical alternative, offering a per-iteration complexity of $O(n^2)$. Moreover, Sliced Wasserstein approaches and neural OT methods (e.g., via learned transport maps) have enabled further scaling by approximating high-dimensional OT through lower-dimensional projections or learned parametrizations. The SWBBO algorithm developed in this study contributes to this growing intersection of Bayesian Optimization and optimal transport. By leveraging Wasserstein Barycenters, our method offers tractable and robust BO strategies that adapt to distributional shifts in context while maintaining computational efficiency through entropic regularization. In conclusion, Wasserstein-based BO methods such as SWBBO offer promising tools for robust and geometry-aware optimization under uncertainty. While computational challenges remain, the flexibility and strong theoretical foundations of optimal transport provide a compelling framework for advancing robust machine learning, with exciting future applications in optimization, control, and learning under uncertainty.
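To make the $O(n^2)$ per-iteration cost concrete, the following minimal Sinkhorn iteration alternates two diagonal scalings whose dominant operations are n × n matrix–vector products. This is a generic textbook sketch for illustration only, not the exact implementation used in SWBBO.

```python
import numpy as np

def sinkhorn(a, b, M, reg, n_iters=1000, tol=1e-9):
    """Minimal Sinkhorn iteration for entropic OT between histograms a and b.

    Each iteration costs O(n^2): two matrix-vector products with the Gibbs
    kernel K = exp(-M / reg). Illustrative sketch, not a production solver.
    """
    K = np.exp(-M / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        u_prev = u
        v = b / (K.T @ u)                 # O(n^2) matrix-vector product
        u = a / (K @ v)                   # O(n^2) matrix-vector product
        if np.max(np.abs(u - u_prev)) < tol:
            break
    P = u[:, None] * K * v[None, :]       # regularized transport plan
    return P, float(np.sum(P * M))        # plan and transport cost
```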
Looking ahead, we expect this research direction to keep expanding, especially through integration with generative modeling, distributional Reinforcement Learning, and distributional robustness. In generative modeling, for example, Wasserstein distances are increasingly used as loss functions because they compare entire distributions rather than point estimates. In dictionary learning, Wasserstein Barycenters are used to combine atoms meaningfully across distributional structures. Similarly, distributional Reinforcement Learning expands the learning target from mean rewards to full return distributions, often using Wasserstein-based metrics for more informative policy updates.

Author Contributions

Conceptualization, all; methodology, F.A. and A.C.; software, I.S.; validation, A.C.; writing—original draft preparation, I.S. and F.A.; writing—review and editing, all; visualization, A.C. and I.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The entire code, including the acquisition functions, is freely available at the following repository: https://github.com/iman-ie/SWBBO (accessed on 12 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Garnett, R. Bayesian Optimization; Cambridge University Press: Cambridge, UK, 2023. [Google Scholar]
  2. Frazier, P.I. Bayesian Optimization. In Recent Advances in Optimization and Modeling of Contemporary Problems, Proceedings of the INFORMS Annual Meeting, Phoenix, AZ, USA, 4–7 November 2018; INFORMS: Catonsville, MD, USA, 2018; pp. 255–278. [Google Scholar]
  3. Archetti, F.; Candelieri, A. Bayesian Optimization and Data Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; p. 849. [Google Scholar]
  4. Seyedi, I.; Candelieri, A.; Messina, E.; Archetti, F. Wasserstein Distributionally Robust Optimization for Chance-Constrained Facility Location Under Uncertain Demand. Mathematics 2025, 13, 2144. [Google Scholar] [CrossRef]
  5. Blanchet, J.; Li, J.; Lin, S.; Zhang, X. Distributionally Robust Optimization and Robust Statistics. arXiv 2024, arXiv:2401.14655. [Google Scholar] [CrossRef]
  6. Liu, J.; Wu, J.; Li, B.; Cui, P. Distributionally robust optimization with data geometry. Adv. Neural Inf. Process. Syst. 2022, 35, 33689–33701. [Google Scholar]
  7. Kirschner, J.; Bogunovic, I.; Jegelka, S.; Krause, A. Distributionally robust Bayesian optimization. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Valencia, Spain, 26–28 August 2020; pp. 2174–2184. Available online: http://proceedings.mlr.press/v108/kirschner20a.html (accessed on 7 July 2025).
  8. Husain, H.; Nguyen, V.; van den Hengel, A. Distributionally Robust Bayesian Optimization with ϕ-Divergences. 2023. Available online: https://proceedings.neurips.cc/paper_files/paper/2023/file/3feb8ed3c33c3310b45f80be7dfef707-Supplemental-Conference.pdf (accessed on 17 July 2025).
  9. Micheli, F.; Balta, E.C.; Tsiamis, A.; Lygeros, J. Wasserstein Distributionally Robust Bayesian Optimization with Continuous Context. arXiv 2025, arXiv:2503.20341. [Google Scholar] [CrossRef]
  10. Peyré, G.; Cuturi, M. Computational optimal transport: With applications to data science. Found. Trends Mach. Learn. 2019, 11, 355–607. [Google Scholar] [CrossRef]
  11. Rahimian, H.; Mehrotra, S. Frameworks and Results in Distributionally Robust Optimization. Open J. Math. Optim. 2022, 3, 1–85. [Google Scholar] [CrossRef]
  12. Bertsimas, D.; Sim, M.; Zhang, M. Adaptive Distributionally Robust Optimization. Manag. Sci. 2019, 65, 604–618. [Google Scholar] [CrossRef]
  13. Candelieri, A.; Ponti, A.; Archetti, F. Wasserstein enabled Bayesian optimization of composite functions. J. Ambient Intell. Humaniz. Comput. 2023, 14, 11263–11271. [Google Scholar] [CrossRef]
  14. Sabbatella, A.; Ponti, A.; Candelieri, A.; Archetti, F. Bayesian Optimization Using Simulation-Based Multiple Information Sources over Combinatorial Structures. Mach. Learn. Knowl. Extr. 2024, 6, 2232–2247. [Google Scholar] [CrossRef]
  15. Cuturi, M.; Doucet, A. Fast computation of Wasserstein Barycenters. In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 22–24 June 2014; pp. 685–693. [Google Scholar]
  16. Mallasto, A.; Feragen, A. Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  17. Joyce, J.M. Kullback-Leibler Divergence. In International Encyclopedia of Statistical Science; Lovric, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 720–722. [Google Scholar]
  18. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013. [Google Scholar]
  19. Dvurechensky, P.; Gasnikov, A.; Omelchenko, S.; Tiurin, A. A Stable Alternative to Sinkhorn’s Algorithm for Regularized Optimal Transport. In Mathematical Optimization Theory and Operations Research, Lecture Notes in Computer Science; Kononov, A., Khachay, M., Kalyagin, V.A., Pardalos, P., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12095, pp. 406–423. [Google Scholar]
  20. Agueh, M.; Carlier, G. Barycenters in the Wasserstein Space. SIAM J. Math. Anal. 2011, 43, 904–924. [Google Scholar] [CrossRef]
  21. Heinemann, F.; Klatt, M.; Munk, A. Kantorovich–Rubinstein Distance and Barycenter for Finitely Supported Measures: Foundations and Algorithms. Appl. Math. Optim. 2023, 87, 4. [Google Scholar] [CrossRef]
  22. Bertsimas, D.; Gupta, V.; Kallus, N. Data-driven robust optimization. Math. Program. 2018, 167, 235–292. [Google Scholar] [CrossRef]
  23. Mohajerin Esfahani, P.; Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Math. Program. 2018, 171, 115–166. [Google Scholar] [CrossRef]
  24. Blanchet, J.; Murthy, K.; Zhang, F. Optimal Transport-Based Distributionally Robust Optimization: Structural Properties and Iterative Schemes. Math. Oper. Res. 2022, 47, 1500–1529. [Google Scholar] [CrossRef]
  25. Kuhn, D.; Esfahani, P.M.; Nguyen, V.A.; Shafieezadeh-Abadeh, S. Wasserstein Distributionally Robust Optimization: Theory and Applications in Machine Learning. In Operations Research & Management Science in the Age of Analytics, Proceedings of INFORMS Annual Meeting, Washington, DC, USA, 20–23 October 2019; Netessine, S., Shier, D., Greenberg, H.J., Eds.; Informs: Catonsville, MD, USA, 2019; pp. 130–166. [Google Scholar]
  26. Gao, R.; Chen, X.; Kleywegt, A.J. Wasserstein Distributionally Robust Optimization and Variation Regularization. Oper. Res. 2024, 72, 1177–1191. [Google Scholar] [CrossRef]
  27. Shapiro, A.; Zhou, E.; Lin, Y. Bayesian Distributionally Robust Optimization. SIAM J. Optim. 2023, 33, 1279–1304. [Google Scholar] [CrossRef]
  28. Kirschner, J.; Mutny, M.; Hiller, N.; Ischebeck, R.; Krause, A. Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 3429–3438. [Google Scholar]
  29. Rajabi-Kafshgar, A.; Seyedi, I.; Tirkolaee, E.B. Circular closed-loop supply chain network design considering 3D printing and PET bottle waste. Environ. Dev. Sustain. 2024, 27, 20345–20381. [Google Scholar] [CrossRef]
  30. Chen, Z.; Kuhn, D.; Wiesemann, W. Technical Note—Data-Driven Chance Constrained Programs over Wasserstein Balls. Oper. Res. 2024, 72, 410–424. [Google Scholar] [CrossRef]
  31. Hosseini Baboli, S.A.; Arabkoohsar, A.; Seyedi, I. Numerical modeling and optimization of pressure drop and heat transfer rate in a polymer fuel cell parallel cooling channel. J. Braz. Soc. Mech. Sci. Eng. 2023, 45, 201. [Google Scholar] [CrossRef]
  32. Lam, H. Robust Sensitivity Analysis for Stochastic Systems. Math. Oper. Res. 2016, 41, 1248–1275. [Google Scholar] [CrossRef]
  33. Bayraksan, G.; Love, D.K. Data-Driven Stochastic Programming Using Phi-Divergences. In The Operations Research Revolution; Aleman, D., Thiele, A., Smith, J.C., Greenberg, H.J., Eds.; Informs: Catonsville, MD, USA, 2015; pp. 1–19. [Google Scholar]
  34. Zhou, Z.; Mertikopoulos, P.; Bambos, N.; Glynn, P.; Ye, Y.; Li, L.J.; Fei-Fei, L. Distributed asynchronous optimization with unbounded delays: How slow can you go? In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 5970–5979. [Google Scholar]
  35. Gao, R. Finite-sample guarantees for Wasserstein distributionally robust optimization: Breaking the curse of dimensionality. Oper. Res. 2023, 71, 2291–2306. [Google Scholar] [CrossRef]
  36. Altschuler, J.; Niles-Weed, J.; Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  37. Snoek, J.; Larochelle, H.; Adams, R.P. Practical bayesian optimization of machine learning algorithms. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
  38. Berkenkamp, F.; Krause, A.; Schoellig, A.P. Bayesian optimization with safety constraints: Safe and automatic parameter tuning in robotics. Mach. Learn. 2023, 112, 3713–3747. [Google Scholar] [CrossRef] [PubMed]
  39. Williams, C.K.; Rasmussen, C.E. Gaussian Processes for Machine Learning, 2nd ed.; MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
  40. Santambrogio, F. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling; Progress in Nonlinear Differential Equations and Their Applications; Springer International Publishing: Cham, Switzerland, 2015; Volume 87. [Google Scholar]
  41. Benamou, J.-D.; Carlier, G.; Cuturi, M.; Nenna, L.; Peyré, G. Iterative Bregman Projections for Regularized Transportation Problems. SIAM J. Sci. Comput. 2015, 37, A1111–A1138. [Google Scholar] [CrossRef]
  42. Pham, T.; Dal Poz Kouřimská, H.; Wagner, H. Bregman–Hausdorff Divergence: Strengthening the Connections Between Computational Geometry and Machine Learning. Mach. Learn. Knowl. Extr. 2025, 7, 48. [Google Scholar] [CrossRef]
  43. POT: Python Optimal Transport—POT Python Optimal Transport 0.9.5 Documentation. Available online: https://pythonot.github.io/releases.html (accessed on 17 July 2025).
  44. Lindheim, J.V. Simple approximative algorithms for free-support Wasserstein barycenters. Comput. Optim. Appl. 2023, 85, 213–246. [Google Scholar] [CrossRef]
  45. Srinivas, N.; Krause, A.; Kakade, S.M.; Seeger, M. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. arXiv 2009, arXiv:0912.3995v4. [Google Scholar]
  46. Howard, S.; Potaptchik, P.; Deligiannidis, G. Schrödinger Bridge Matching for Tree-Structured Costs and Entropic Wasserstein Barycentres. arXiv 2025, arXiv:2506.17197. [Google Scholar] [CrossRef]
Figure 1. Shape preservation of the Wasserstein Barycenter against L2-norm barycenter.
Figure 2. Nested ambiguity sets around a reference distribution inspired by Ref. [23].
Figure 3. Illustration of distributional uncertainty partitioning methods. (a) Shows the original empirical distribution $\hat{P}$ with n = 24 contexts. (b) Random partitioning creates K = 3 distributions by randomly splitting the data. (c) Bootstrap sampling creates distributions by sampling with replacement. (d) Balanced partitioning splits contexts one by one into equal-sized distributions. Each method creates an ambiguity set {P1, P2, P3} for uncertainty quantification.
Table 1. Best observed value of f(x, c) at the end of the optimization: median and (standard deviation) over 30 independent runs.

| Test problem | ERBO | WDRBO | SWBBO γ = 0.1 | SWBBO γ = 0.01 | SWBBO γ = 0.001 |
| Ackley | −3.6299 (0.9521) | −4.4796 (0.9650) | −3.1460 (0.9238) | −3.2748 (0.9356) | −3.1267 (0.9036) |
| Modified_Branin | −2.2648 (0.7371) | −2.4816 (1.0053) | −2.4422 (0.5570) | −2.8250 (1.1473) | −2.5814 (1.2763) |
| Hartmann | −2.2648 (0.7371) | −2.4816 (1.0053) | −2.4422 (0.5570) | −2.8250 (1.1473) | −2.5814 (1.2763) |
| Three Hump Camel | 3.3068 (0.0451) | 3.2740 (0.1113) | 3.2969 (0.0634) | 3.2909 (0.0793) | 3.2735 (0.1121) |
| Six Hump Camel | −0.0003 (0.0003) | −0.0004 (0.0004) | −0.0003 (0.0003) | −0.0003 (0.0003) | −0.0003 (0.0003) |
| Continuous_Vendor | 1.0141 (0.0078) | 1.0186 (0.0108) | 1.0242 (0.0098) | 1.0235 (0.0088) | 1.0235 (0.0094) |
| Portfolio | 1.1213 (0.2880) | 1.1020 (0.2988) | 1.8119 (0.2000) | 1.7784 (0.1794) | 1.8518 (0.2247) |
| Portfolio_normal | 23.3439 (1.4309) | 22.9966 (1.7259) | 23.5711 (1.9742) | 24.4015 (1.3356) | 24.0899 (1.4707) |
Table 2. Cumulative Stochastic Regret (CSR) and computational time: median and (standard deviation) over 30 independent runs.

| Test problem | ERBO CSR | ERBO Time | WDRBO CSR | WDRBO Time | SWBBO1 CSR | SWBBO1 Time | SWBBO2 CSR | SWBBO2 Time | SWBBO3 CSR | SWBBO3 Time |
| Ackley | 329.37 (40.99) | 192.78 (35.5) | 477.969 (54.21) | 142.11 (18.3) | 221.232 (25.989) | 101.66 (10.2) | 215.007 (42.56) | 239.08 (63.7) | 217.558 (31.23) | 178.28 (81.89) |
| Modified_Branin | 847.91 (153.5) | 114.50 (23.5) | 909.500 (160.32) | 98.704 (20.6) | 913.623 (177.59) | 113.17 (10.9) | 1411.67 (330.3) | 276.05 (66.4) | 1300.74 (312.1) | 174.34 (82.34) |
| Hartmann | 53.4273 (11.62) | 176.45 (19.1) | 58.9245 (19.395) | 196.57 (25.99) | 52.3729 (25.716) | 176.19 (33.97) | 54.9945 (38.994) | 319.52 (165.0) | 53.2995 (47.91) | 239.63 (156.4) |
| Three Hump Camel | 4.8008 (0.6641) | 205.87 (34.91) | 4.9377 (0.6990) | 244.91 (50.62) | 4.8738 (0.7061) | 101.93 (11.76) | 4.9080 (0.6867) | 112.18 (13.56) | 5.0666 (0.7160) | 161.29 (73.60) |
| Six Hump Camel | 135.199 (49.428) | 121.96 (15.64) | 142.203 (50.576) | 126.97 (22.99) | 162.483 (51.242) | 110.29 (13.47) | 166.402 (67.497) | 114.83 (14.70) | 173.509 (84.057) | 88.930 (8.576) |
| Continuous_Vendor | 13.0761 (1.6257) | 90.267 (12.90) | 14.3611 (1.5091) | 100.67 (12.27) | 59.9870 (11.795) | 111.64 (13.74) | 63.1201 (11.191) | 104.04 (12.14) | 76.4769 (19.224) | 94.936 (10.22) |
| Portfolio | 436.454 (63.871) | 191.84 (30.01) | 436.660 (70.312) | 206.12 (33.78) | 479.525 (47.024) | 171.26 (20.65) | 493.743 (38.103) | 187.76 (19.20) | 484.731 (37.243) | 207.32 (31.35) |
| Portfolio_normal | 595.509 (80.1849) | 186.44 (25.100) | 573.669 (72.0094) | 189.62 (23.748) | 646.748 (63.9574) | 176.69 (26.136) | 669.314 (58.4004) | 184.69 (25.934) | 695.958 (62.4598) | 145.18 (12.555) |
Table 3. $BW_2^2$-UVP and $L^2$-UVP metrics for the Ackley test problem.

| Metric | γ | Ackley |
| $BW_2^2$-UVP (%) | 0.1 | 0.3542 |
|  | 0.01 | 0.7747 |
|  | 0.001 | 0.0000 |
| $L^2$-UVP (%) | 0.1 | 7.1934 |
|  | 0.01 | 7.8164 |
|  | 0.001 | 7.5822 |
Table 4. Comparison metrics between Sinkhorn–Wasserstein and LP Barycenters for the Ackley test problem.

| Metric | γ | Ackley |
| WD | 0.1 | 0.002593 |
|  | 0.01 | 0.001621 |
|  | 0.001 | 0.001208 |
| MED | 0.1 | 0.010303 |
|  | 0.01 | 0.007938 |
|  | 0.001 | 0.003189 |
| MMD | 0.1 | 0.000604 |
|  | 0.01 | 0.000299 |
|  | 0.001 | 0.000059 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
