Shortfall-Based Wasserstein Distributionally Robust Optimization

Ruoxuan Li; Wenhua Lv; Tiantian Mao

doi:10.3390/math11040849

¹ Department of Statistics and Finance, University of Science and Technology of China, Hefei 230052, China

² School of Mathematics and Finance, Chuzhou University, Chuzhou 239000, China

* Author to whom correspondence should be addressed.

Mathematics 2023, 11(4), 849; https://doi.org/10.3390/math11040849

This article belongs to the Special Issue Advances in Mathematical Modelling and Statistical Methods for Risk Management

Abstract

In this paper, we study a distributionally robust optimization (DRO) problem with affine decision rules. In particular, we construct an ambiguity set based on a new family of Wasserstein metrics, the shortfall–Wasserstein metrics, which apply normalized utility-based shortfall risk measures to summarize the transportation cost random variables. We demonstrate that the multi-dimensional shortfall–Wasserstein ball can be affinely projected onto a one-dimensional one. A noteworthy consequence of this reformulation is that our program enjoys a finite sample guarantee that does not depend on the dimension of the nominal distribution. The distributionally robust optimization problem is also computationally tractable: we provide a dual formulation and verify strong duality, which enables a direct and concise reformulation of the problem. Our results offer a new DRO framework that can be applied in numerous contexts such as regression and portfolio optimization.

Keywords:

distributionally robust optimization; Wasserstein metrics; utility-based shortfall risk measures

MSC:

90C17; 91B05; 91G70

1. Introduction

In the literature of operations research (OR) and machine learning (ML), stochastic optimization problems of the following form have been widely studied:

\inf_{ω \in D} ρ^{P} (l (ω^{⊤} ξ)),

(1)

where

ξ \in R^{n}

is a random vector with distribution

P

,

ω

is the decision variable restricted to the set

D

,

l (\cdot)

represents a cost/loss function and

ρ^{P} (\cdot)

is a risk measure evaluated under the assumption that the underlying distribution is

P

. In ML applications, the function

ρ^{P} (\cdot)

typically takes the form

ρ^{P} (X) = E^{P} [X]

. In OR,

l (\cdot)

typically represents a disutility function, and

E^{P} l (X)

is used as a tool to quantify risks.

In practice, the true distribution

P

of

ξ

is often unknown. To overcome the lack of knowledge on this distribution, distributionally robust optimization (DRO) was proposed as an alternative modeling paradigm. It seeks to find a decision variable

ω

that minimizes the worst-case expected loss

\sup_{P \in \mathcal{P}} E^{P} l (ω^{⊤} ξ)

, where

\mathcal{P}

is referred to as the ambiguity set characterized through some known properties of the true distribution

P

. The choice of

\mathcal{P}

is of great significance, and there are typically two ways to construct it. The first is the moment-based ambiguity set, which contains distributions whose moments satisfy certain conditions [1]. The other, more popular approach is the discrepancy-based ambiguity set, which is generally taken to be a ball containing the distributions close to a nominal distribution with respect to a statistical distance. Popular choices of the statistical distance include the Kullback–Leibler (KL) divergence [2,3] and the Wasserstein metric [4]. Since the Wasserstein metric can be defined between a discrete distribution and a continuous distribution, the ambiguity set based on the Wasserstein metric includes richer distributions than one based on a divergence [5]. This makes the Wasserstein metric more popular for modeling the ambiguity set. However, the authors of [5] point out that the ball

B_{ε}^{d_{W}} ({\hat{P}}_{N})

is desirable because it contains various forms of distributions, but it may be overly conservative, since distributions differing greatly from the empirical distribution may also be included.

Motivated by this, we extend classic Wasserstein metrics to shortfall–Wasserstein metrics by applying utility-based shortfall risk measures to summarize the distribution of the transportation cost random variable. The formal definition is given in Section 2. The properties of utility-based shortfall risk measures have been widely studied in the risk measure literature [6,7], and this class naturally includes the expectation as a special case. Based on the shortfall–Wasserstein metrics, we define an ambiguity set and formulate a new DRO problem. Utility-based shortfall risk measures have already been applied in DRO; see, e.g., [8,9]. However, in that literature, the shortfall risk measure serves as the objective. In this paper, we instead use a utility-based shortfall risk measure to construct the ambiguity set, which is novel. Moreover, we study the tractability of the new problem: we reformulate it and reduce it to a tractable convex problem. We also study the finite sample guarantee for the new metric. A key ingredient for these results is a projection result for the ambiguity set based on the shortfall–Wasserstein metric. Such projection results for different ambiguity sets have been studied in [10,11]. In this paper, a necessary and sufficient condition for the projection result of the new ambiguity set is given.

The main contribution of the paper can be summarized as follows.

A new family of Wasserstein metrics based on utility-based shortfall risk measures is introduced and is called the shortfall–Wasserstein metric. We propose a data-driven DRO problem based on the shortfall–Wasserstein metric. It is shown that the new DRO model enjoys both a finite sample guarantee and computational tractability.
For the new shortfall–Wasserstein metric, we define the corresponding uncertainty set, which is called the shortfall–Wasserstein ball, and give an equivalent characterization of the projection result to a one-dimensional ball. Based on the projection result, we show that the multi-dimensional constraint of our distributionally robust optimization model can be reformulated as a one-dimensional one. Based on this reformulation, we establish a finite sample guarantee for the DRO problem that is free from the curse of dimensionality.
We obtain a dual formulation for the robust optimization problem and verify the strong duality. In addition, the dual form admits a reformulation that can be completely characterized when taking the discrete empirical distribution as the center of ambiguity sets.

Related Work

The central idea behind data-driven DRO (DD-DRO) is to model the distribution of the uncertain parameters by constructing an uncertainty set from the sample. Designing a good uncertainty set is crucial. There are two kinds of sets in the literature: the moment-based ambiguity set [1] and a “ball” structure based on a statistical distance [4]. Popular choices of the statistical distance include the Kullback–Leibler (KL) divergence [2,3], the Wasserstein metric [4,12,13] and the CVaR– and expectile–Wasserstein metrics [5]. The shortfall–Wasserstein metrics proposed in this paper employ a utility-based shortfall risk measure to construct the ambiguity set, which includes the expectile–Wasserstein metric as a special case. There are also uncertainty sets defined via stochastic dominance; see [14].

Notational conventions. Throughout this paper, we denote by

\bar{R} := R \cup \{+ \infty\}

the extended reals and let

R_{+} : = [0, \infty)

. Let

M (R^{n})

denote the set of all distributions supported on

R^{n}

. For

p ⩾ 1

,

{∥ \cdot ∥}_{p}

denotes the

\ell_{p}

norm on

R^{n}

. By

δ_{ξ}

we denote the Dirac distribution concentrating unit mass at

ξ

and

x_{+} : = max {x, 0}

and

x_{-} : = max {- x, 0}

. For any

A \subset S \times S

, let us denote

Proj_{1} (A) = \{x_{1} : (x_{1}, x_{2}) \in A\}

. Moreover, by S we denote a Polish space and by

B (S)

the corresponding

σ

-algebra and we let

M_{b} (S)

be the set of all finite signed measures on

(S, B (S))

. We use

C_{b} (S)

to denote the set of all bounded continuous functions from S to

R

.

P (S)

denotes the set of all probability measures on S. For any

μ \in P (S)

, we use

B_{μ} (S)

to denote the completion of

B (S)

with respect to

μ

. The extension of

μ

to

B_{μ} (S)

is unique, and we interpret

μ

in

\int φ d μ

as this extension defined on

B_{μ} (S)

when

φ : (S, B (S)) \to (R, B (R))

is not measurable but

φ : (S, B_{μ} (S)) \to (R, B (R))

is measurable [15]. For any

μ \in P (S)

and

m ⩾ 1

, by

L^{m} (d μ)

we denote the set of all Borel-measurable functions

f : S \to R

satisfying

\int | f |^{m} d μ < \infty

. Let

U (S) : = \cap_{μ \in P (S)} B_{μ} (S)

, then we shall use

m_{U} (S; \bar{R})

to denote the set of all measurable functions

φ : (S, U (S)) \to (\bar{R}, B (\bar{R}))

.

2. Shortfall–Wasserstein Metric

In data-driven settings, the nominal distribution

P_{0}

is usually chosen as the empirical distribution. Let us denote the training dataset by

{\hat{Ξ}}_{N} : = {{\hat{ξ}}_{i}}_{i ⩽ N} \subseteq R^{n}

and the empirical distribution by

{\hat{P}}_{N} = \frac{1}{N} \sum_{i = 1}^{N} δ_{{\hat{ξ}}_{i}}

, with

δ_{ξ}

denoting the Dirac distribution at

ξ \in R^{n}

. In the DRO models introduced above, the choice of the probability metric d plays a significant role in constructing the ambiguity set

B_{ε}^{d} ({\hat{P}}_{N})

. One of the most widely used probability metrics is the (type-1) Wasserstein metric [16]

d_{W} (P_{1}, P_{2}) := \inf \{ E^{Π} (∥ ξ_{1} - ξ_{2} ∥) : Π \text{ is a joint distribution of } ξ_{1} \text{ and } ξ_{2} \text{ with marginals } P_{1} \text{ and } P_{2}, \text{ respectively} \},

which is applicable to any distributions

P_{1}, P_{2} \in M (R^{n})

with finite first moments. In this paper, we take

\ell_{p}

norm

{∥ \cdot ∥}_{p}

with

p > 1

. The ambiguity set based on the Wasserstein metric is naturally defined as

B_{ε}^{d_{W}} ({\hat{P}}_{N}) := \{P \in M (R^{n}) : d_{W} ({\hat{P}}_{N}, P) ⩽ ε\},

which is called the Wasserstein ball centered at

{\hat{P}}_{N}

with radius

ε

. In the literature, the distributionally robust problem whose ambiguity set is the above Wasserstein ball has been widely studied [4,12,13] and is known as the Wasserstein data-driven distributionally robust optimization model (W-DD-DRO),

inf_{ω \in D} sup_{P \in B_{ε}^{d_{W}} ({\hat{P}}_{N})} E^{P} l (ω^{⊤} ξ) .

As argued by [5], the ball

B_{ε}^{d_{W}} ({\hat{P}}_{N})

is desirable because it contains various forms of distributions, but it may be overly conservative, since distributions differing greatly from the empirical distribution may also be included. Motivated by this, they proposed the CVaR–Wasserstein and expectile–Wasserstein balls and studied the reformulation and tractability of the corresponding DRO problems. In this paper, we extend the expectile–Wasserstein metrics to shortfall–Wasserstein metrics by applying normalized utility-based shortfall risk measures to evaluate the transportation cost.
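For intuition about the type-1 Wasserstein metric defined above, the following minimal sketch computes it in the simplest possible setting: two empirical distributions on the real line with the same number of equally weighted atoms. In that special case the monotone (sorted) coupling is optimal, so no linear program is needed. The function name `wasserstein_1d` is illustrative and not part of the paper.

```python
def wasserstein_1d(xs, ys):
    """Type-1 Wasserstein distance between two empirical distributions on R.

    Assumption of this sketch: both samples have the same number of
    equally weighted atoms, so matching sorted atoms is an optimal coupling.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many atoms in both samples")
    pairs = zip(sorted(xs), sorted(ys))
    # Average transportation cost |x - y| under the sorted coupling.
    return sum(abs(a - b) for a, b in pairs) / len(xs)
```

Shifting every atom of a sample by a constant moves the distribution by exactly that constant in this metric, which is one way to see how a ball of radius ε around the empirical distribution can contain distributions of very different shape.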

2.1. Risk Measures

We first introduce some notions of risk measures. Let

(Ω, B, P)

be a probability space, where

B

is a

σ

-algebra on

Ω

. Following the convention in mathematical finance, we describe a risk by a random variable

X : Ω \to R

. A risk measure is a functional

ρ

mapping from a set of risks to

R

. We list some desired properties as follows:

(P1): (Translation invariance) $ρ (X + a) = ρ (X) + a$ for any $a \in R$ ;
(P2): (Positive homogeneity) $ρ (c X) = c ρ (X)$ for any $c > 0$ ;
(P3): (Monotonicity) $ρ (X_{1}) ⩾ ρ (X_{2})$ for any $X_{1} ⩾ X_{2}$ ;
(P4): (Subadditivity) $ρ (X_{1} + X_{2}) ⩽ ρ (X_{1}) + ρ (X_{2})$ ;
(P5): (Convexity) $ρ (α X + (1 - α) Y) ⩽ α ρ (X) + (1 - α) ρ (Y),$ $\forall α \in (0, 1)$ ;
(P6): (Law invariance) $ρ (X_{1}) = ρ (X_{2})$ for any $X_{1} \overset{d}{=} X_{2}$ .

A risk measure satisfying the above properties (P1) and (P3) is called a monetary risk measure, and a risk measure satisfying the above properties (P1)–(P4) is called a coherent risk measure, which has been viewed as one of the most important classes of risk measures since the seminal work [17]. A risk measure

ρ

is called a convex risk measure if it satisfies properties (P1), (P3) and (P5). We then introduce the definition of utility-based shortfall risk measures.

Definition 1

(Utility-based shortfall risk measures). Let

u : R \to R

be non-decreasing and continuous, satisfying

u (0) = 0

. For a random variable X, the utility-based shortfall risk measure is defined as

S_{u} (X) = \inf \{t \in R : E [u (X - t)] ⩽ u (0)\} .
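To make Definition 1 concrete, the sketch below approximates the utility-based shortfall risk of an empirical sample by bisection on t. It assumes that u is strictly increasing with u(0) = 0, so that t ↦ E[u(X − t)] is non-increasing and changes sign between the smallest and largest sample point. The two-sided power function used here anticipates the form that appears later in Theorem 1; the function names and default parameters are illustrative only.

```python
def power_utility(x, a1=1.0, a2=1.0, beta=2.0):
    """u(x) = a1 * x^beta for x >= 0 and -a2 * (-x)^beta for x < 0."""
    return a1 * x ** beta if x >= 0 else -a2 * (-x) ** beta


def shortfall_risk(samples, u, tol=1e-10):
    """Approximate S_u(X) = inf{t : E[u(X - t)] <= 0} for an empirical sample.

    Assumption of this sketch: u is strictly increasing with u(0) = 0.
    """
    def mean_u(t):
        return sum(u(x - t) for x in samples) / len(samples)

    # At t = min(samples) every term is >= 0; at t = max(samples) every term is <= 0.
    lo, hi = min(samples), max(samples)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_u(mid) <= 0:
            hi = mid  # mid is feasible, so the infimum is at most mid
        else:
            lo = mid  # mid is infeasible, so the infimum is above mid
    return hi
```

With a linear u (beta equal to 1 and a1 = a2) the routine returns the sample mean, which illustrates how the expectation arises as a special case of the shortfall risk measure.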

It is well known that any utility-based shortfall risk measure satisfies the monotonicity and translation invariance, and thus is a monetary risk measure. Based on the utility-based shortfall risk measure, we now formally give the definition of shortfall–Wasserstein metric as follows.

Definition 2

(Shortfall–Wasserstein metric). A metric

d_{u} (\cdot, \cdot) : M (R^{n}) \times M (R^{n}) \to R \cup {+ \infty}

is called the Shortfall–Wasserstein metric if it has the form of

d_{u} (P_{1}, P_{2}) := \inf \{ S_{u}^{Π} (∥ ξ_{1} - ξ_{2} ∥_{p}) : Π \text{ is a joint distribution of } ξ_{1} \text{ and } ξ_{2} \text{ with marginals } P_{1} \text{ and } P_{2}, \text{ respectively} \},

where

p > 1

and

{∥ \cdot ∥}_{p}

is the

\ell_{p}

norm on

R^{n}

.

We first study the basic properties of the shortfall–Wasserstein metric.

Proposition 1.

Let

u : R \to R

be a convex, non-decreasing and continuous function satisfying

u (0) = 0

. We have the following statements:

(i): $d_{u}$ satisfies the identity of indiscernibles, i.e., $d_{u} (P_{1}, P_{2}) = 0$ if and only if $P_{1} = P_{2}$ ;
(ii): $d_{u}$ satisfies the symmetry, that is, $d_{u} (P_{1}, P_{2}) = d_{u} (P_{2}, P_{1})$ for any $P_{1}, P_{2}$ ;
(iii): $d_{u}$ satisfies the non-negativity, i.e., $d_{u} (P_{1}, P_{2}) ⩾ 0$ for any $P_{1}, P_{2}$ ;
(iv): If u is positively homogeneous, then $d_{u}$ satisfies the triangle inequality: $d_{u} (P_{1}, P_{2}) + d_{u} (P_{2}, P_{3}) ⩾ d_{u} (P_{1}, P_{3})$ for any $P_{1}$ , $P_{2}$ , $P_{3}$ .

Proof.

(i) Denote by

Π (P_{1}, P_{2})

the set of all distributions in

M (R^{2 n})

with marginals

P_{1}

and

P_{2}

. To see the sufficiency, note that when

P_{1} = P_{2} = P

and

ξ \sim P

, the set

Π (P_{1}, P_{2})

contains the joint distribution of

(ξ, ξ)

. Therefore,

d_{u} (P_{1}, P_{2}) = 0

as

S_{u} (∥ ξ - ξ ∥) = 0

. To see the necessity, note that under the condition,

S_{u}

satisfies convexity. By Theorem 4.2 of [18], we have

S_{u} ⩾ E

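. Consequently, the shortfall–Wasserstein metric dominates the Wasserstein metric; since the latter satisfies the identity of indiscernibles, a vanishing shortfall–Wasserstein distance implies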

P_{1} = P_{2}

.

(ii) The symmetry follows immediately.

(iii) Since

u (0) = 0

and u is non-decreasing, we have

u (x) ⩾ 0

for every

x ⩾ 0

. As the norm is non-negative and u is non-decreasing, we have

S_{u} (∥ ξ_{1} - ξ_{2} ∥) ⩾ 0

, which means that

d_{u}

satisfies non-negativity.

(iv) From the definition,

S_{u}

is convex and homogeneous as u is convex and homogeneous. It follows that for any

ε > 0

,

Π_{1} \in Π (P_{1}, P_{2}), Π_{2} \in Π (P_{2}, P_{3})

exist such that

d_{u} (P_{1}, P_{2}) ⩾ S_{u}^{Π_{1}} (∥ ξ_{1} - ξ_{2} ∥) - ε, \quad d_{u} (P_{2}, P_{3}) ⩾ S_{u}^{Π_{2}} (∥ ξ_{2} - ξ_{3} ∥) - ε .

From Theorem 6.10 of [19], we know that

ξ_{1}^{*}, ξ_{2}^{*}, ξ_{3}^{*}

exist such that

(ξ_{1}^{*}, ξ_{2}^{*})

has the joint distribution

Π_{1}

and

(ξ_{2}^{*}, ξ_{3}^{*})

has the joint distribution

Π_{2}

. As a result,

\begin{aligned} d_{u} (P_{1}, P_{2}) + d_{u} (P_{2}, P_{3}) & ⩾ S_{u} (∥ ξ_{1}^{*} - ξ_{2}^{*} ∥) + S_{u} (∥ ξ_{2}^{*} - ξ_{3}^{*} ∥) - 2 ε \\ & = S_{u} (2 ∥ ξ_{1}^{*} - ξ_{2}^{*} ∥) / 2 + S_{u} (2 ∥ ξ_{2}^{*} - ξ_{3}^{*} ∥) / 2 - 2 ε \\ & ⩾ S_{u} (∥ ξ_{1}^{*} - ξ_{3}^{*} ∥) - 2 ε, \end{aligned}

where the equality follows from the positive homogeneity of

S_{u}

and the last inequality holds due to the convexity of

S_{u}

and the subadditivity of the norm. □

Proposition 1 tells us that when u is increasing, convex and positively homogeneous with

u (0) = 0

, then the metric

d_{u}

satisfies all the desired properties of a distance metric.

2.2. Formulation of DRO Problems Based on the Shortfall–Wasserstein Metric

Based on the shortfall–Wasserstein metric, we define the following ambiguity set

B_{ε}^{d_{u}} ({\hat{P}}_{N}) := \{P \in M (R^{n}) : d_{u} ({\hat{P}}_{N}, P) ⩽ ε\},

which is called the shortfall–Wasserstein ball centered at the empirical distribution

{\hat{P}}_{N}

. We consider the following problem

\inf_{ω \in D} \sup_{P \in B_{ε}^{d_{u}} ({\hat{P}}_{N})} E^{P} l (ω^{⊤} ξ),

(2)

where

ω

is the decision vector,

D

is the feasible set of the decision vector, and l is the loss function. Problem (2) is called the shortfall–Wasserstein data-driven distributionally robust optimization model (SW-DD-DRO).

To end the section, we present two applications of stochastic optimization (1): regression and risk minimization.

2.2.1. Regression

Considering a linear regression problem, our purpose is to find a linear predictor function

β^{⊤} X

with

β

the regression coefficient vector. Our aim here is to find an accurate estimator of

β

that is robust to adversarial perturbations of the data. Thus, distributionally robust regression models are constructed in the following form:

\inf_{β \in \bar{D}} \sup_{P \in B_{ε}^{d} ({\hat{P}}_{N})} E^{P} l (Y - β^{⊤} X) .

(3)

In the literature, the loss function typically takes the form

l (Y - β^{⊤} X) = | Y - β^{⊤} X |^{p}

with

p ⩾ 1

, and the regression model is known as least-squares regression when

p = 2

. For convenience, we denote

ξ : = (Y, X)

and

D : = {(1, - β) : β \in \bar{D}}

, so that the problem can be written in the form of (2).
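The change of variables above can be checked numerically. The following sketch stacks the response and the covariates into ξ = (Y, X), builds ω = (1, −β), and confirms that ω^⊤ξ reproduces the regression residual Y − β^⊤X, so the empirical loss can be evaluated in either form. All function names and the toy data are illustrative assumptions, not taken from the paper.

```python
def residual(y, x, beta):
    """Regression residual Y - beta^T X."""
    return y - sum(b * xi for b, xi in zip(beta, x))


def linear_form(y, x, beta):
    """The same quantity written as omega^T xi with xi = (Y, X), omega = (1, -beta)."""
    xi = [y] + list(x)
    omega = [1.0] + [-b for b in beta]
    return sum(w * z for w, z in zip(omega, xi))


def empirical_loss(data, beta, p=2.0):
    """Average of |Y - beta^T X|^p over equally weighted observations (y, x)."""
    return sum(abs(linear_form(y, x, beta)) ** p for y, x in data) / len(data)
```

Because the loss depends on the data only through the scalar ω^⊤ξ, the regression model fits the affine decision rule structure assumed in problem (2).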

2.2.2. Portfolio Optimization

If we denote by

ξ

a random vector of returns from n different financial assets and by

ω

the allocation vector, then the random variable

ω^{⊤} ξ

is the total return of the portfolio, and problem (1) becomes a portfolio optimization problem. As an example, the risk measure

ρ^{P} (ω^{⊤} ξ) = E^{P} ({(ω^{⊤} ξ - c)}_{+}^{p})

belongs to a well-known class of downside risk measures, obtained when the loss function takes the form

l (X) = {(X - c)}_{+}^{p}, p ⩾ 1

.
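As a small illustration of the formula above, the sketch below evaluates the empirical counterpart of E[(ω^⊤ξ − c)_+^p] over a list of equally weighted scenarios. It simply evaluates the expression as written in the text; the function name and the scenario layout (one list of n asset values per scenario) are assumptions made for this example.

```python
def downside_risk(scenarios, omega, c, p):
    """Empirical E[(omega^T xi - c)_+^p] over equally weighted scenarios."""
    total = 0.0
    for xi in scenarios:
        value = sum(w * r for w, r in zip(omega, xi))  # omega^T xi
        total += max(value - c, 0.0) ** p              # (value - c)_+^p
    return total / len(scenarios)
```

Only scenarios in which ω^⊤ξ exceeds the threshold c contribute, and the exponent p controls how strongly large exceedances are penalized.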

3. Shortfall–Wasserstein Data-Driven DRO

3.1. Reformulation of the Shortfall–Wasserstein DRO

To solve problem (2), the key step is to solve the inner maximization problem of (2), that is, for fixed

ω \in D, ε ⩾ 0

,

\begin{matrix} \begin{matrix} sup_{P \in M (R^{n})} & E^{P} l (ω^{⊤} ξ) \\ subject to & d_{u} (P, {\hat{P}}_{N}) ⩽ ε . \end{matrix} \end{matrix}

(4)

By the definition of the metric, the above problem can be rewritten as

\begin{array}{ll} \sup_{P \in M (R^{n})} & E^{P} l (ω^{⊤} ξ) \\ \text{subject to} & S_{u}^{Π} (∥ ξ - \hat{ξ} ∥_{p}) ⩽ ε, \\ & Π \in M (R^{n} \times R^{n}) \text{ is a joint distribution of } ξ \text{ and } \hat{ξ} \\ & \text{with marginals } P \text{ and } {\hat{P}}_{N}, \text{ respectively.} \end{array}

(5)

As

S_{u} (\cdot)

satisfies translation invariance, for fixed

ε ⩾ 0

, we have the constraint of the above problem

S_{u}^{Π} (∥ ξ - \hat{ξ} ∥_{p}) ⩽ ε

being equivalent to

S_{u}^{Π} (∥ ξ - \hat{ξ} ∥_{p} - ε) ⩽ 0

, that is,

E [u (∥ ξ - \hat{ξ} ∥_{p} - ε)] ⩽ 0

. Therefore, we can further rewrite the above problem (5) as

\begin{array}{ll} \sup_{P \in M (R^{n})} & E^{P} l (ω^{⊤} ξ) \\ \text{subject to} & E^{Π} u (∥ ξ - \hat{ξ} ∥_{p} - ε) ⩽ 0, \\ & Π \in M (R^{n} \times R^{n}) \text{ is a joint distribution of } ξ \text{ and } \hat{ξ} \\ & \text{with marginals } P \text{ and } {\hat{P}}_{N}, \text{ respectively.} \end{array}

(6)

To further simplify this problem, we first define the following notation:

B_{p, ε}^{n} (P_{0}) := \{P : E^{Π} u (∥ ξ - ξ_{0} ∥_{p} - ε) ⩽ 0, Π (d ξ, R^{n}) = P (d ξ), Π (R^{n}, d ξ_{0}) = P_{0} (d ξ_{0})\},

(7)

and

B_{ω, p, ε} (P_{0}) := \{F_{ω^{⊤} Z} : F_{Z} \in B_{p, ε}^{n} (P_{0})\} .

(8)

We study the projection result of the shortfall–Wasserstein ball. Throughout the following subsections,

U

denotes the set of all strictly increasing continuous functions on

R

with

u (0) = 0

. In the following theorem, we show that the constraint of the above problem (6) can be conveniently converted to a univariate setting.

Theorem 1.

Let

X

be a random vector with distribution function

F_{X} \in M (R^{n})

, let

u \in U

, and let

p > 1, 1 / p + 1 / q = 1

. Then

B_{ω, p, ε} (F_{X}) = B_{{∥ ω ∥}_{q} ε}^{1} (F_{ω^{⊤} X})

holds for any

ε ⩾ 0, ω \in R^{n}

, if and only if

β, a_{1}, a_{2} > 0

exist such that

u (x) = \begin{cases} a_{1} x^{β}, & x ⩾ 0, \\ - a_{2} {(- x)}^{β}, & x < 0 . \end{cases}

(9)

Proof.

To see the sufficiency, assume that

u (\cdot)

is given by (9) with

β, a_{1}, a_{2} > 0

. Then we can verify that

u (λ x) = λ^{β} u (x)

for any

x \in R

and

λ ⩾ 0

. For any

F \in B_{ω, p, ε} (F_{X})

,

Z \in R^{n}

exists such that

F_{Z} \in B_{p, ε}^{n} (F_{X})

satisfies

E^{Π} u (∥ Z - X ∥_{p} - ε) ⩽ 0

, and

F = F_{ω^{⊤} Z}

. Due to the increasing property of

u (\cdot)

and the Hölder inequality, we have

\begin{aligned} E u (| ω^{⊤} Z - ω^{⊤} X | - ∥ ω ∥_{q} ε) & ⩽ E u (∥ ω ∥_{q} ∥ Z - X ∥_{p} - ∥ ω ∥_{q} ε) \\ & = ∥ ω ∥_{q}^{β} E u (∥ Z - X ∥_{p} - ε) ⩽ 0 . \end{aligned}

Thus,

F \in B_{{∥ ω ∥}_{q} ε}^{1} (F_{ω^{⊤} X})

, then

B_{ω, p, ε} (F_{X}) \subset B_{{∥ ω ∥}_{q} ε}^{1} (F_{ω^{⊤} X})

.

The converse direction of the set inclusion is presented next. For any

F \in B_{{∥ ω ∥}_{q} ε}^{1} (F_{ω^{⊤} X})

,

Z \sim F

exists satisfying

E u (| Z - ω^{⊤} X | - ∥ ω ∥_{q} ε) ⩽ 0

. Let

Y = Z - ω^{⊤} X

,

\tilde{Z} = X + (ω^{q / p} Y) / (ω^{⊤} ω^{q / p})

, in which

ω^{q / p}

is defined as

ω^{q / p} := {(\operatorname{sign} (ω_{1}) {| ω_{1} |}^{q / p}, \dots, \operatorname{sign} (ω_{n}) {| ω_{n} |}^{q / p})}^{⊤},

and

\operatorname{sign} : R \to \{- 1, 1\}

is the sign function. Thus, we compute

ω^{⊤} ω^{q / p} = \sum_{i = 1}^{n} | ω_{i} | {| ω_{i} |}^{q / p} = \sum_{i = 1}^{n} {| ω_{i} |}^{1 + q / p} = \sum_{i = 1}^{n} {| ω_{i} |}^{q} = {∥ ω ∥}_{q}^{q}

,

∥ ω^{q / p} Y ∥_{p} = {(\sum_{i = 1}^{n} {| ω_{i} |}^{q} {| Y |}^{p})}^{1 / p} = | Y | {(\sum_{i = 1}^{n} {| ω_{i} |}^{q})}^{1 / p} = | Y | ∥ ω ∥_{q}^{q / p}

. Then,

\begin{aligned} E u (∥ \tilde{Z} - X ∥_{p} - ε) & = E u (∥ (ω^{q / p} Y) / (ω^{⊤} ω^{q / p}) ∥_{p} - ε) \\ & = E u (\frac{1}{∥ ω ∥_{q}^{q}} ∥ ω^{q / p} Y ∥_{p} - ε) \\ & = E u (\frac{∥ ω ∥_{q}^{q / p}}{∥ ω ∥_{q}^{q}} | Y | - ε) \\ & = E u (\frac{1}{∥ ω ∥_{q}} | Y | - ε) \\ & = \frac{1}{∥ ω ∥_{q}^{β}} E u (| Z - ω^{⊤} X | - ∥ ω ∥_{q} ε) ⩽ 0, \end{aligned}

where the fourth equality comes from

q (1 - 1 / p) = 1

. As

ω^{⊤} \tilde{Z} = Z

, we obtain

F \in B_{ω, p, ε} (F_{X})

. This implies

B_{{∥ ω ∥}_{q} ε}^{1} (F_{ω^{⊤} X}) \subset B_{ω, p, ε} (F_{X})

. Hence, we conclude that

B_{ω, p, ε} (F_{X}) = B_{{∥ ω ∥}_{q} ε}^{1} (F_{ω^{⊤} X})

.
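The lifting step used in the converse inclusion can be verified numerically for a single realization. The sketch below builds the vector written as Z with a tilde in the proof from a point x in R^n, a scalar target z and a nonzero weight vector ω, and the accompanying checks confirm the two identities the argument relies on: the projection ω^⊤Z̃ equals z, and the ℓ_p distance between Z̃ and x equals |z − ω^⊤x| / ∥ω∥_q. The helper names are illustrative, and setting sign(0) = 1 is an arbitrary choice consistent with a sign function taking values in {−1, 1}.

```python
def sign(t):
    # Sign function with values in {-1, 1}; sign(0) = 1 is an arbitrary choice.
    return 1.0 if t >= 0 else -1.0


def lp_norm(v, p):
    return sum(abs(t) ** p for t in v) ** (1.0 / p)


def dot(a, b):
    return sum(s * t for s, t in zip(a, b))


def lift(x, z, omega, p):
    """Return x + omega^{q/p} * y / (omega^T omega^{q/p}) with y = z - omega^T x.

    Assumptions of this sketch: p > 1 and omega is not the zero vector.
    """
    q = p / (p - 1.0)                                   # conjugate exponent
    w = [sign(t) * abs(t) ** (q / p) for t in omega]    # the vector omega^{q/p}
    y = z - dot(omega, x)
    scale = y / dot(omega, w)                           # dot(omega, w) = ||omega||_q^q
    return [xi + wi * scale for xi, wi in zip(x, w)]
```

The second identity is what turns the n-dimensional transportation cost into a one-dimensional one, with the radius rescaled by ∥ω∥_q.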

To see the necessity, suppose

B_{ω, p, ε} (F_{X}) = B_{{∥ ω ∥}_{q} ε}^{1} (F_{ω^{⊤} X})

holds for any

ε ⩾ 0, ω \in R^{n}

. From (7) and (8), we obtain that

E u (∥ Z - X ∥_{p} - ε) ⩽ 0 \Leftrightarrow E u (| ω^{⊤} Z - ω^{⊤} X | - ∥ ω ∥_{q} ε) ⩽ 0, \forall ω \in R^{n} .

(10)

Specifically, choose

Z - X = Y e

, where Y is a random variable on

R

and

e \in R^{n}

with

{∥ e ∥}_{p} = 1

. Let

ω = λ e / {∥ e ∥}_{q}, λ > 0

, then we have

E u (| ω^{⊤} Z - ω^{⊤} X | - ∥ ω ∥_{q} ε) = E u (λ \frac{| e^{⊤} e |}{∥ e ∥_{q}} | Y | - λ ε) = E u (λ | Y | - λ ε),

where the second equality comes from

| e^{⊤} e | = ∥ e ∥_{p} ∥ e ∥_{q}

and

{∥ e ∥}_{p} = 1

. Noting that

E u (∥ Z - X ∥_{p} - ε) = E u (| Y | - ε)

, from (10) we obtain

E u (λ | Y | - λ ε) ⩽ 0

if and only if

E u (| Y | - ε) ⩽ 0

for any

λ > 0

. By the arbitrariness of random variable Y and

ε ⩾ 0

, we obtain for any random variable X on

R

E u (X) ⩽ 0 \Leftrightarrow E u (λ X) ⩽ 0, \forall λ > 0 .

It follows that

S_{u} (\cdot)

is positively homogeneous. By Proposition 2.9 of [20], we obtain the representation (9). This completes the proof. □

By Theorem 1, a multi-dimensional ball can be affinely projected onto a one-dimensional ball; thus, the multi-dimensional constraint of (6) can be simplified to the following one-dimensional one, which makes the problem much more tractable.

Corollary 1.

Under the conditions of Theorem 1, for u given by (9), problem (6) is equivalent to

\begin{array}{ll} \sup_{G \in M (R)} & E^{G} l (Y) \\ \text{subject to} & E^{G} u (| Y - ω^{⊤} \hat{ξ} | - ∥ ω ∥_{q} ε) ⩽ 0 . \end{array}

(11)

3.2. Finite Sample Guarantee

Next, we demonstrate that SW-DD-DRO enjoys a finite sample guarantee. In practice, the optimal solution

{\hat{ω}}_{N}

is constructed from the training dataset

{\hat{Ξ}}_{N}

, but we are usually more interested in the out-of-sample performance of the optimizer. As the true distribution

P_{0}

is typically unknown, the out-of-sample performance is hard to evaluate. We therefore aim to establish a tight upper bound that provides a performance guarantee for the solution, since our main concern is to control the costs from above. To clarify the analysis, we first introduce some notation and assumptions.

$V^{*}$ : the optimal risk we target, i.e.,

$\begin{matrix} V^{*} : = inf_{ω \in D} E^{P_{0}} l (ω^{⊤} ξ), \end{matrix}$

(12)

where $P_{0}$ is the true distribution.
${\hat{V}}_{N}$ : the in-sample risk achieved by SW-DD-DRO, i.e.,

$\begin{matrix} {\hat{V}}_{N} : = inf_{ω \in D} sup_{P \in B_{ε_{u}}^{d_{u}} ({\hat{P}}_{N})} E^{P} l (ω^{⊤} ξ), \end{matrix}$

(13)

where ${\hat{P}}_{N}$ is the empirical distribution.
$V_{*}$ : the out-of-sample risk achieved by the SW-DD-DRO solution, i.e.,

$\begin{matrix} V_{*} = E^{P_{0}} l ({\hat{ω}}_{N}^{⊤} ξ), \end{matrix}$

(14)

where ${\hat{ω}}_{N}$ is the optimal solution of the problem in (13).

From (11), the ambiguity set of the inner maximization problem of (13) can be reformulated as

B_{∥ ω ∥_{q} ε_{u}}^{1} (F_{ω^{⊤} \hat{ξ}}) := \{G \in M (R) : E^{G} u (| Y - ω^{⊤} \hat{ξ} | - ∥ ω ∥_{q} ε_{u}) ⩽ 0\} .

To establish the result of finite sample guarantee, the following assumptions are needed.

Assumption 1

(Light-tailed distribution). With β identified in the function

u (\cdot)

defined in Theorem 1, an exponent

a > β

exists such that

A := E^{P_{0}} [\exp (∥ ξ ∥_{p}^{a})] = \int \exp (∥ ξ ∥_{p}^{a}) P_{0} (d ξ) < \infty

.

Assumption 2.

The feasible set

D \subseteq R^{n}

is bounded, so that

H_{D} : = {sup}_{ω \in D} {∥ ω ∥}_{q} < + \infty

, where

1 / p + 1 / q = 1

.

Assumption 3.

The feasible set

D \subseteq R^{n}

is bounded away from the origin, so that

L_{D} : = {inf}_{ω \in D} {∥ ω ∥}_{q} > 0

, where

1 / p + 1 / q = 1

.

Assumption 1 is a common assumption that controls the rate of decay of the tail of the distribution

P_{0}

. Assumptions 2 and 3 are requirements of the feasible set

D

, which are also mentioned in [21,22].

Proposition 2

(Finite sample guarantee). Suppose that Assumptions 1–3 hold and that

u (\cdot)

has the form (9) given in Theorem 1. Let

{\hat{V}}_{N}

and

{\hat{ω}}_{N}

denote the optimal value and an optimizer of problem (13). Define the radius of the ambiguity set as follows, with

η \in (0, 1)

and a constant C depending on

a_{1}, a_{2}, β

,

ε_{N}^{u} (η) = \begin{cases} C^{- 1 / β} L_{D}^{- 1} {(\frac{\log (c_{1} η^{- 1})}{c_{2} N})}^{1 / (2 β)}, & \text{if } N ⩾ \frac{\log (c_{1} η^{- 1})}{c_{2}}, \\ C^{- 1 / β} L_{D}^{- 1} {(\frac{\log (c_{1} η^{- 1})}{c_{2} N})}^{1 / a}, & \text{if } N < \frac{\log (c_{1} η^{- 1})}{c_{2}}, \end{cases}

(15)

where

c_{1}

,

c_{2}

depend only on a, A,

H_{D}

and β. Then the following finite sample guarantee holds:

P^{N} \{V_{*} ⩽ {\hat{V}}_{N}\} ⩾ 1 - η .

Proof.

For any nonnegative random variable X, denote

I_{1} := I \{X ⩾ ε\}, I_{2} := I \{X < ε\}

. Then

\begin{aligned} E u (X - ε) & = E [u (X - ε) I_{1}] + E [u (X - ε) I_{2}] \\ & = a_{1} E [{(X - ε)}^{β} I_{1}] - a_{2} E [{(ε - X)}^{β} I_{2}] \\ & ⩽ a_{1} E [X^{β} I_{1}] - a_{2} E [{(ε - X)}^{β} I_{2}] \\ & = a_{1} E [X^{β}] - E [(a_{1} X^{β} + a_{2} {(ε - X)}^{β}) I_{2}] . \end{aligned}

Define

f (x) = a_{1} x^{β} + a_{2} {(ε - x)}^{β}, x \in [0, ε)

, we have

f^{'} (x) = a_{1} β x^{β - 1} - a_{2} β {(ε - x)}^{β - 1}

,

f^{″} (x) = a_{1} β (β - 1) x^{β - 2} + a_{2} β (β - 1) {(ε - x)}^{β - 2}

. Solving the equation

f^{'} (x) = 0

, we obtain

x_{0} = \frac{{a_{2}}^{1 / (β - 1)}}{{a_{1}}^{1 / (β - 1)} + {a_{2}}^{1 / (β - 1)}} ε .

Denote

c_{0} : = \frac{{a_{2}}^{1 / (β - 1)}}{{a_{1}}^{1 / (β - 1)} + {a_{2}}^{1 / (β - 1)}}

, then

x_{0} \in (0, ε)

as

0 < c_{0} < 1

. When

0 < β < 1

, we obtain

f^{″} (x) < 0

, thus

f (x)

is concave. In the interval

[0, ε)

,

f (x)

attains its infimum at the endpoints, which means

f (x) ⩾ min {a_{1}, a_{2}} ε^{β}

. When

β > 1

, we have

f^{″} (x) > 0

,

f (x)

is a convex function and attains its infimum at

x_{0}

. We have

f (x) ⩾ f (c_{0} ε) = (a_{1} {c_{0}}^{β} + a_{2} {(1 - c_{0})}^{β}) ε^{β}

. Then

f (x) ⩾ min {a_{1} {c_{0}}^{β} + a_{2} {(1 - c_{0})}^{β}, min {a_{1}, a_{2}}} ε^{β}

holds for all

β > 0

. Denote

C := \min \{a_{1} {c_{0}}^{β} + a_{2} {(1 - c_{0})}^{β}, \min \{a_{1}, a_{2}\}\} / a_{1}

. Therefore,

E [(a_{1} X^{β} + a_{2} {(ε - X)}^{β}) I_{2}] ⩾ a_{1} C ε^{β}

, and

E u (X - ε) ⩽ a_{1} E [X^{β}] - a_{1} C ε^{β} .

(16)
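The lower bound on f used to obtain (16) can be checked numerically. The sketch below computes the constant C from a_1, a_2 and β as defined in the proof and evaluates f(x) = a_1 x^β + a_2 (ε − x)^β, so that the inequality f(x) ⩾ a_1 C ε^β can be tested on a grid over [0, ε). The case β = 1 is not covered by the stationary-point formula (the exponent 1/(β − 1) is undefined), so the code falls back to the endpoint value there; that fallback is an assumption of this sketch, as are the function names.

```python
def bound_constant(a1, a2, beta):
    """Constant C = min{a1*c0^beta + a2*(1-c0)^beta, min{a1, a2}} / a1."""
    if beta == 1.0:
        # Assumption of this sketch: f is linear here, so its infimum sits at an endpoint.
        return min(a1, a2) / a1
    k = 1.0 / (beta - 1.0)
    c0 = a2 ** k / (a1 ** k + a2 ** k)          # stationary point x0 = c0 * eps
    interior = a1 * c0 ** beta + a2 * (1.0 - c0) ** beta
    return min(interior, min(a1, a2)) / a1


def f(x, eps, a1, a2, beta):
    """f(x) = a1 * x^beta + a2 * (eps - x)^beta on [0, eps)."""
    return a1 * x ** beta + a2 * (eps - x) ** beta
```

For β > 1 the minimum is attained at the interior stationary point, while for 0 < β < 1 the function is concave and the endpoint values a_1 ε^β and a_2 ε^β give the bound, matching the two cases distinguished in the proof.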

As a result, for any

ε > 0

,

E u (X - ε) ⩽ 0

if

E [X^{β}] ⩽ C ε^{β}

. The

β

-type Wasserstein metric is defined as the following:

d_{W}^{β} (P_{1}, P_{2}) := \inf \{ E^{Π} (∥ ξ_{1} - ξ_{2} ∥_{p}^{β}) : Π \text{ is a joint distribution of } ξ_{1} \text{ and } ξ_{2} \text{ with marginals } P_{1} \text{ and } P_{2}, \text{ respectively} \} .

Therefore, we have

B_{C ε^{β}}^{d_{W}^{β}} (F_{ω^{⊤} \hat{ξ}}) \subseteq B_{ε}^{d_{u}} (F_{ω^{⊤} \hat{ξ}}) .

(17)

From the assumption,

a > β

,

ξ_{0} \sim P_{0}

exists, satisfying

A : = E^{P_{0}} [exp (∥ ξ_{0} ∥_{p}^{a})] < \infty

. We have

E^{F_{0}} [exp (H_{D}^{- a} | ω^{⊤} ξ_{0} |^{a})] ⩽ E^{P_{0}} [exp (∥ ξ_{0} ∥_{p}^{a})] < \infty,

where

F_{0} : = F_{ω^{⊤} ξ_{0}}

, and the first inequality is due to the Hölder inequality. The above inequality implies that

F_{ω^{⊤} ξ_{0}}

satisfies the condition of Theorem 2 of [23]. As the corresponding empirical distribution of

F_{ω^{⊤} ξ_{0}}

is

F_{ω^{⊤} \hat{ξ}}

, we know that the finite sample guarantee holds when

d_{W}^{β}

is applied to summarize the transportation cost random variable and

ε_{N}^{W} (η)

is specified, that is

P (F_{ω^{⊤} ξ_{0}} \in B_{ε_{N}^{W}}^{d_{W}^{β}} (F_{ω^{⊤} \hat{ξ}})) ⩾ 1 - η, \quad \text{where} \quad ε_{N}^{W} (η) := \begin{cases} {(\frac{\log (c_{1} η^{- 1})}{c_{2} N})}^{1 / 2}, & \text{if } N ⩾ \frac{\log (c_{1} η^{- 1})}{c_{2}}, \\ {(\frac{\log (c_{1} η^{- 1})}{c_{2} N})}^{β / a}, & \text{if } N < \frac{\log (c_{1} η^{- 1})}{c_{2}}, \end{cases}

(18)

where

c_{1}

,

c_{2}

depend only on a, A,

H_{D}

and

β

. We define

ε_{N}^{u} (η)

for the shortfall–Wasserstein metric

d_{u}

,

ε_{N}^{u} (η) := \frac{1}{L_{D}} {(\frac{ε_{N}^{W} (η)}{C})}^{1 / β} = \begin{cases} C^{- 1 / β} L_{D}^{- 1} {(\frac{\log (c_{1} η^{- 1})}{c_{2} N})}^{1 / (2 β)}, & \text{if } N ⩾ \frac{\log (c_{1} η^{- 1})}{c_{2}}, \\ C^{- 1 / β} L_{D}^{- 1} {(\frac{\log (c_{1} η^{- 1})}{c_{2} N})}^{1 / a}, & \text{if } N < \frac{\log (c_{1} η^{- 1})}{c_{2}} . \end{cases}

(19)

From (17), we obtain

\begin{aligned} P (V_{*} ⩽ {\hat{V}}_{N}) & ⩾ P (P_{0} \in B_{ε_{N}^{u}}^{d_{u}} ({\hat{P}}_{N})) \\ & = P (F_{ω^{⊤} ξ_{0}} \in B_{∥ ω ∥_{q} ε_{N}^{u}}^{d_{u}} (F_{ω^{⊤} \hat{ξ}})) \\ & ⩾ P (F_{ω^{⊤} ξ_{0}} \in B_{{(\frac{∥ ω ∥_{q}}{L_{D}})}^{β} ε_{N}^{W}}^{d_{W}^{β}} (F_{ω^{⊤} \hat{ξ}})) \\ & ⩾ P (F_{ω^{⊤} ξ_{0}} \in B_{ε_{N}^{W}}^{d_{W}^{β}} (F_{ω^{⊤} \hat{ξ}})) ⩾ 1 - η, \end{aligned}

where

ξ_{0} \sim P_{0}, \hat{ξ} \sim {\hat{P}}_{N}

. The first inequality follows from the fact that

P_{0} \in B_{ε_{N}^{u}}^{d_{u}} ({\hat{P}}_{N})

implies

V_{*} ⩽ {\hat{V}}_{N}

. This completes the proof. □

We show in Proposition 2 that the out-of-sample performance of

{\hat{ω}}_{N}

can be bounded, when the radius of the ambiguity set is properly calibrated, by the optimal value

{\hat{V}}_{N}

with some confidence level. It is noteworthy that the order of the radius

ε

in this paper is

O (N^{- 1 / 2 β})

, independent of the dimension of the nominal distribution, while the order of the radius suffers seriously from the curse of dimensionality in [4].
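The radii in (18) and (19) can be evaluated directly once the constants are fixed. The following is a minimal sketch, assuming placeholder values for `c1`, `c2`, `C` and `L_D` (the source only states that such constants exist); the function names are hypothetical.

```python
import math


def wasserstein_radius(n_samples, eta, c1, c2, a, beta):
    """Radius eps_N^W(eta) of (18); c1 and c2 are placeholder constants."""
    threshold = math.log(c1 / eta) / c2
    ratio = threshold / n_samples
    # Large-sample regime uses exponent 1/2, small-sample regime uses beta/a.
    exponent = 0.5 if n_samples >= threshold else beta / a
    return ratio ** exponent


def shortfall_radius(n_samples, eta, c1, c2, a, beta, C, L_D):
    """Radius eps_N^u(eta) of (19): (1/L_D) * (eps_N^W / C) ** (1/beta)."""
    eps_w = wasserstein_radius(n_samples, eta, c1, c2, a, beta)
    return (eps_w / C) ** (1.0 / beta) / L_D
```

In the large-sample regime, quadrupling N divides the shortfall radius by 4^{1/(2β)}, whatever the dimension of ξ, which is the dimension-free rate discussed above.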

4. Worst-Case Expectation under the Shortfall–Wasserstein Metric

This section studies the tractability of solving (4). For notational convenience, denote

c (x, y) : = u (| y - x | - ε^{*})

, and X has a known probability distribution

μ \in P (R)

. We define the primal problem in the following form:

\bar{I} : = sup \{\int l (y) d π (x, y) : π \in Φ_{μ}\},

where

Φ_{μ} : = {π \in Π (μ, ν) : \int c (x, y) d π (x, y) ⩽ 0}

. Let

C : = {(x, y) \in R \times R, c (x, y) < + \infty}

,

Λ_{c, l} : = {(λ, φ) : λ ⩾ 0, φ \in m_{U} (R; \bar{R}), φ (x) + λ c (x, y) ⩾ l (y) for all (x, y) \in C}

[15]. For every such

(λ, φ) \in Λ_{c, l}

, define

J (λ, φ) : = \int_{C} φ d μ .

For any

π \in Φ_{μ}

,

(λ, φ) \in Λ_{c, l}

,

π (C) = 1

as

\int c d π

is finite. As a result, for every measurable

g : (R \times R, B (R \times R)) \to (\bar{R}, B (\bar{R}))

,

\int_{C} g d π = \int g d π

. Thus we have

\begin{matrix} J (λ, φ) & = \int_{C} φ d μ ⩾ \int_{C} (l (y) - λ c (x, y)) d π (x, y) \\ = \int l (y) d π (x, y) - λ \int c (x, y) d π (x, y) \\ ⩾ \int l (y) d π (x, y) = I (π) . \end{matrix}

According to the tradition in optimization theory, we refer to the following problem as the dual to the primal problem:

\underset{̲}{J} : = inf \{\int φ d μ : (λ, φ) \in Λ_{c, l}\} .

Consequently, the weak duality holds:

\underset{̲}{J} ⩾ \bar{I} .

To further identify the equivalence between

\bar{I}

and

\underset{̲}{J}

, for every

λ ⩾ 0

, we define

φ_{λ} : R \to \bar{R}

as follows:

φ_{λ} (x) : = \{\begin{matrix} sup_{y \in R} {l (y) - λ c (x, y)}, if λ > 0, \\ sup_{y \in R} {l (y) : c (x, y) < + \infty}, if λ = 0 . \end{matrix}

(20)

To simplify the notation, we write

λ c (x, y) = + \infty

whenever

λ = 0

and

c (x, y) = + \infty

; thus, we have

φ_{λ} (x) = {sup}_{y \in R} {l (y) - λ c (x, y)}, x \in R

. In Theorem 2, we show that, for performance measures l satisfying Assumption 4,

\bar{I}

equals

\underset{̲}{J}

.

Assumption 4.

The function

l : R \to R

is upper semicontinuous with

l \in L^{1} (d μ)

.

Theorem 2.

Under Assumption 4 and

c (x, y) = u (| y - x | - ε^{*})

, we can conclude

(a): $\bar{I} = \underset{̲}{J}$ ;
(b): A dual optimizer of the form $(λ^{*}, φ_{λ^{*}})$ exists for some $λ^{*} ⩾ 0$ and $φ_{λ^{*}} (\cdot)$ defined as in (20). Moreover, any feasible solutions $π^{*}$ and $(λ^{*}, φ_{λ^{*}})$ are optimizers of the primal and dual problem, satisfying $I (π^{*}) = J (λ^{*}, φ_{λ^{*}})$ , if and only if

$\begin{matrix} l (y) - λ^{*} c (x, y) = sup_{z \in R} {l (z) - λ^{*} c (x, z)} π^{*} a . s ., \end{matrix}$

(21)

$\begin{matrix} λ^{*} \int c (x, y) d π^{*} (x, y) = 0 . \end{matrix}$

(22)

Additionally, if the primal optimizer

π^{*}

exists and there is exactly one y in

R

that attains the supremum in

{sup}_{y \in R} {l (y) - λ^{*} c (x, y)}

for μ-almost every

x \in R

, then

π^{*}

is unique.

Remark 1.

From (21), if the optimal measure

π^{*}

exists,

π^{*} {(x, y) \in R \times R : y \in {arg max}_{z \in R} {l (z) - λ^{*} c (x, z)}} = 1

, which means the worst-case joint probability is identified by a transport plan that transports mass from x to the optimizer of the local optimization problem

{sup}_{z \in R} {l (z) - λ^{*} c (x, z)}

. Furthermore, when

λ^{*} > 0

, we obtain

\int c (x, y) d π^{*} (x, y) = 0

.

Remark 2.

If a subset

A \subset R, μ (A) > 0

exists, for any

x \in A

, the rate the loss function

l (y)

grows to

+ \infty

is faster than the rate of

c (x, y)

growth, then

\bar{I} = \underset{̲}{J} = + \infty

. The reason is that for every

λ ⩾ 0

and

x \in A

,

φ_{λ} (x) = {sup}_{z \in R} {l (z) - λ c (x, z)} = + \infty

; thus,

\int φ_{λ} (x) d μ (x) = + \infty

. Therefore, sometimes, it may be necessary to require the loss function l not to grow faster than c.
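The blow-up described in Remark 2 can be observed numerically. The sketch below is an illustration under assumed choices (quadratic loss, linear u, a uniform grid); `truncated_sup` is a hypothetical helper that evaluates the inner supremum over a window of growing width.

```python
def truncated_sup(loss, u, lam, x, radius, half_width, step=0.5):
    """sup of loss(y) - lam * u(|y - x| - radius) over a grid on [x - W, x + W]."""
    n = int(round(2 * half_width / step))
    grid = (x - half_width + k * step for k in range(n + 1))
    return max(loss(y) - lam * u(abs(y - x) - radius) for y in grid)


# Quadratic loss outgrows a linear cost, so the truncated supremum keeps growing.
values = [
    truncated_sup(lambda y: y * y, lambda t: t, 5.0, 0.0, 1.0, width)
    for width in (10.0, 100.0, 1000.0)
]
```

No finite λ stops this growth here, which is consistent with the requirement that l should not grow faster than c.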

The proof of Theorem 2 is provided in Appendix A.

Corollary 2.

Under Assumption 4 and

c (X, Y) = u (| Y - X | - ε^{*})

, we can conclude

\begin{matrix} \bar{I} = inf_{λ ⩾ 0} \{E_{μ} [sup_{y \in R} {l (y) - λ u (| y - X | - ε^{*})}]\} . \end{matrix}

(23)

Proof.

From the proof of Theorem 2, we can see that there always exist

λ^{*} \in [0, + \infty), \underset{̲}{J} = {inf}_{λ ⩾ 0} {\int φ_{λ} (x) d μ (x)} = \int φ_{λ^{*}} (x) d μ (x) ⩾ \bar{I} = \underset{̲}{J}

, then we conclude

\begin{matrix} \bar{I} = \underset{̲}{J} = inf_{λ ⩾ 0} \{\int φ_{λ} (x) d μ (x)\} & = inf_{λ ⩾ 0} \{E_{μ} [sup_{y \in R} {l (y) - λ c (X, y)}]\} \\ = inf_{λ ⩾ 0} \{E_{μ} [sup_{y \in R} {l (y) - λ u (| y - X | - ε^{*})}]\} . \end{matrix}

The proof is complete. □

Remark 3.

This conclusion that the optimal value of the multi-dimensional primal problem can be obtained by paying attention to the univariate reformulation in the right-hand side of (23) is of great significance. Moreover, the right-hand side of (23) is completely characterized given

X : = ω^{⊤} \hat{ξ}

and the training dataset

{{\hat{ξ}}_{i}}_{i ⩽ N}

.

Looking back on the problem (11), with X taking values in

{ω^{⊤} {\hat{ξ}}_{1}, ω^{⊤} {\hat{ξ}}_{2}, \dots, ω^{⊤} {\hat{ξ}}_{N}}

with equal probability and

ε^{*} : = {∥ ω ∥}_{q} ε

, we obtain

\begin{matrix} {\bar{I}}_{N} = inf_{λ ⩾ 0} \{\frac{1}{N} \sum_{i = 1}^{N} [sup_{y \in R} {l (y) - λ u (| y - ω^{⊤} {\hat{ξ}}_{i} | - {∥ ω ∥}_{q} ε)}]\} . \end{matrix}

(24)
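Formula (24) can be approximated by brute force, which is a convenient sanity check for closed-form reductions. The sketch below is an assumption-laden illustration: the supremum over y and the infimum over λ are both taken over finite grids supplied by the caller, so the result is only reliable when the y-grid is wide and the λ-grid contains the minimizer. The function names are hypothetical.

```python
def dual_objective(lam, projected, loss, u, radius, y_grid):
    """Sample average of sup_y { loss(y) - lam * u(|y - x_i| - radius) } on y_grid.

    projected holds the scalars w^T xi_i; radius stands for ||w||_q * eps.
    """
    total = 0.0
    for x in projected:
        total += max(loss(y) - lam * u(abs(y - x) - radius) for y in y_grid)
    return total / len(projected)


def worst_case_expectation(projected, loss, u, radius, lam_grid, y_grid):
    """Grid approximation of the right-hand side of (24)."""
    return min(
        dual_objective(lam, projected, loss, u, radius, y_grid)
        for lam in lam_grid
    )
```

For example, with `loss = abs`, a piecewise-linear `u` and the projected samples `[1.0, -2.0]`, the minimum over a small λ-grid can be compared against a hand computation.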

Indeed, this result implies that this multi-dimensional optimization problem in (5) can also be solved by putting effort into the completely characterized reformulation of (24). Considering the problem with the ambiguity set constructed by the classic Wasserstein metric (the norm defined in the metric is also

{∥ \cdot ∥}_{p}

with

p > 1

),

\begin{matrix} inf_{ω \in D} sup_{P \in B_{ε}^{d_{W}} ({\hat{P}}_{N})} E^{P} l (ω^{⊤} ξ) . \end{matrix}

Denoting the inner maximization problem by

{\bar{I}}_{N}^{W} = {sup}_{P \in B_{ε}^{d_{W}} ({\hat{P}}_{N})} E^{P} l (ω^{⊤} ξ)

, we can obtain a similar result by applying Theorem 7 of [10] and Theorem 1 of [15], that is

\begin{matrix} {\bar{I}}_{N}^{W} = inf_{λ ⩾ 0} \{λ {∥ ω ∥}_{q} ε + \frac{1}{N} \sum_{i = 1}^{N} [sup_{y \in R} {l (y) - λ | y - ω^{⊤} {\hat{ξ}}_{i} |}]\} . \end{matrix}

(25)

Observing the formulas above, we can find that

{\bar{I}}_{N} = {\bar{I}}_{N}^{W}

when

u (x) = x_{+} - x_{-} = x

. Result (24) is more flexible than result (25) because the form of

u (\cdot)

can be chosen freely.
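The coincidence of (24) and (25) for u(x) = x holds term by term for every fixed λ, since λ u(|y − x| − r) = λ|y − x| − λr. The following sketch checks this on a grid; the function names, the grid, and the sample values are illustrative assumptions.

```python
def shortfall_dual(lam, projected, loss, u, radius, y_grid):
    """Objective inside the infimum of (24) for a fixed lam."""
    return sum(
        max(loss(y) - lam * u(abs(y - x) - radius) for y in y_grid)
        for x in projected
    ) / len(projected)


def wasserstein_dual(lam, projected, loss, radius, y_grid):
    """Objective inside the infimum of (25) for a fixed lam."""
    return lam * radius + sum(
        max(loss(y) - lam * abs(y - x) for y in y_grid) for x in projected
    ) / len(projected)


identity = lambda t: t
grid = [-50 + 0.5 * k for k in range(201)]
gaps = [
    abs(shortfall_dual(lam, [1.0, -2.0], abs, identity, 0.5, grid)
        - wasserstein_dual(lam, [1.0, -2.0], abs, 0.5, grid))
    for lam in (1.0, 1.5, 2.0)
]
```

Here `radius` plays the role of ∥ω∥_q ε, and every entry of `gaps` is zero up to rounding.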

To elaborate on the way our results help to solve practical problems, several simulations are introduced in the following.

5. Simulation

In this section, we show how Corollary 2 applies to the regression problem and the portfolio optimization problem. Moreover, several simulations are carried out to investigate the performance of our model.

5.1. Regression Model

As mentioned in Section 2, our result can help to find an accurate estimator of the regression coefficient vector in linear regression problems

\begin{matrix} inf_{β \in \bar{D}} sup_{P \in B_{ε}^{d} ({\hat{P}}_{N})} E^{P} l (Y - β^{⊤} X), \end{matrix}

(26)

with

ξ : = (Y, X)

and

D : = {(1, - β) : β \in R^{d - 1}}

. In the literature, loss functions with the form

{l : l (Y, X) = | Y - β^{⊤} X |^{p}, p ⩾ 1}

are widely studied, especially when

p = 1

and

p = 2

. To illustrate how this result makes regression programs tractable, an example is presented below.

Example 1.

The Least Absolute Deviation (LAD) regression model seeks the regression coefficient estimator that minimizes the sum of absolute residuals

\sum_{i = 1}^{N} | y_{i} - {x_{i}}^{⊤} β |

. Take

l (y) : = | y |

, and more specifically, we take

ε = 1

,

u (x) = α x_{+} - β x_{-}, 0 < α < β

. Then we have

\begin{matrix} {\bar{I}}_{N} & = inf_{λ ⩾ 0} \{\frac{1}{N} \sum_{i = 1}^{N} [sup_{y \in R} {l (y) - λ u (| y - ω^{⊤} {\hat{ξ}}_{i} | - {∥ ω ∥}_{q})}]\} \\ = inf_{λ ⩾ 0} \{\frac{1}{N} \sum_{i = 1}^{N} [sup_{y \in R} {| y | - α λ (| y - ω^{⊤} {\hat{ξ}}_{i} | - {∥ ω ∥}_{q})_{+} + β λ (| y - ω^{⊤} {\hat{ξ}}_{i} | - {∥ ω ∥}_{q})_{-}}]\} . \end{matrix}

Let

h_{i} (y) = | y | - α λ (| y - ω^{⊤} {\hat{ξ}}_{i} | - {∥ ω ∥}_{q})_{+} + β λ (| y - ω^{⊤} {\hat{ξ}}_{i} | - {∥ ω ∥}_{q})_{-}

, then

h_{i} (y)

can be further broken down into the following form:

h_{i} (y) = \{\begin{matrix} | y | - α λ (y - ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q}), y ⩾ ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}, \\ | y | + β λ ({∥ ω ∥}_{q} - | y - ω^{⊤} {\hat{ξ}}_{i} |), ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q} < y < ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}, \\ | y | - α λ (ω^{⊤} {\hat{ξ}}_{i} - y - {∥ ω ∥}_{q}), y ⩽ ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q} . \end{matrix}

Considering the case

y \to + \infty

, then

h_{i} (y) = y - α λ (y - ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q}) = (1 - α λ) y + α λ (ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}) .

To make the supremum bounded, it is necessary to have

1 - α λ ⩽ 0

, that is,

λ ⩾ 1 / α

. Similarly, when

y \to - \infty

, we can also derive

λ ⩾ 1 / α

. We next consider the following three cases.

Case A.: Considering when $y ⩾ ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}$ , we can see that $h_{i} (y)$ is non-increasing whether $ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}$ is non-negative or non-positive; thus, $h_{i} (y)$ attains its supremum at $y = ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}$ , with $h_{i} (ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}) = | ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q} | .$
Case B.: Considering when $ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q} < y < ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}$ , we need to further consider the form of $h_{i} (y)$ under the cases $ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q} ⩽ 0$ and $ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q} > 0$ .
(Case B.1) If $ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q} ⩽ 0$ ,

$h_{i} (y) = \{\begin{matrix} - (1 + β λ) y + β λ (ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}), ω^{⊤} {\hat{ξ}}_{i} ⩽ y < ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}, \\ (β λ - 1) y - β λ (ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q}), ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q} < y < ω^{⊤} {\hat{ξ}}_{i} . \end{matrix}$

The supremum in this case is $- ω^{⊤} {\hat{ξ}}_{i} + β λ {∥ ω ∥}_{q}$ .
(Case B.2) If $ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q} > 0$ and $ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q} > 0$ ,

$h_{i} (y) = \{\begin{matrix} (1 - β λ) y + β λ (ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}), ω^{⊤} {\hat{ξ}}_{i} ⩽ y < ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}, \\ (1 + β λ) y - β λ (ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q}), ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q} < y ⩽ ω^{⊤} {\hat{ξ}}_{i} . \end{matrix}$

The supremum in this case is $ω^{⊤} {\hat{ξ}}_{i} + β λ {∥ ω ∥}_{q}$ .
(Case B.3) If $ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q} > 0$ and $ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q} ⩽ 0$ and $ω^{⊤} {\hat{ξ}}_{i} > 0$ ,

$h_{i} (y) = \{\begin{matrix} (1 - β λ) y + β λ (ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}), ω^{⊤} {\hat{ξ}}_{i} < y < ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}, \\ (1 + β λ) y - β λ (ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q}), 0 < y ⩽ ω^{⊤} {\hat{ξ}}_{i}, \\ (β λ - 1) y - β λ (ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q}), ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q} < y ⩽ 0 . \end{matrix}$

The supremum in this case is $ω^{⊤} {\hat{ξ}}_{i} + β λ {∥ ω ∥}_{q}$ .
(Case B.4) If $ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q} > 0$ and $ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q} ⩽ 0$ and $ω^{⊤} {\hat{ξ}}_{i} ⩽ 0$ ,

$h_{i} (y) = \{\begin{matrix} (1 - β λ) y + β λ (ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}), 0 < y < ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}, \\ - (1 + β λ) y + β λ (ω^{⊤} {\hat{ξ}}_{i} + {∥ ω ∥}_{q}), ω^{⊤} {\hat{ξ}}_{i} < y ⩽ 0, \\ (β λ - 1) y - β λ (ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q}), ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q} < y ⩽ ω^{⊤} {\hat{ξ}}_{i} . \end{matrix}$

The supremum in this case is $- ω^{⊤} {\hat{ξ}}_{i} + β λ {∥ ω ∥}_{q}$ .
Combining the above four subcases, we conclude that the supremum is $| ω^{⊤} {\hat{ξ}}_{i} | + β λ {∥ ω ∥}_{q}$ in Case B. Moreover, in all these situations, $h_{i} (y)$ is first monotonically increasing and then monotonically decreasing.
Case C.: Considering when $y ⩽ ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q}$ , similar to Case A, $h_{i} (y)$ is non-decreasing and attains its supremum at $y = ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q}$ , with $h_{i} (ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q}) = | ω^{⊤} {\hat{ξ}}_{i} - {∥ ω ∥}_{q} | .$

Combining the above three cases, we have that

h_{i} (y)

is monotonically increasing and then monotonically decreasing under all situations. The supremum is

| ω^{⊤} {\hat{ξ}}_{i} | + β λ {∥ ω ∥}_{q}

. As a result, we have the optimum value

\begin{matrix} {\bar{I}}_{N} = inf_{λ ⩾ 1 / α} \{\frac{1}{N} \sum_{i = 1}^{N} [| ω^{⊤} {\hat{ξ}}_{i} | + β λ {∥ ω ∥}_{q}]\} = \frac{1}{N} \sum_{i = 1}^{N} | ω^{⊤} {\hat{ξ}}_{i} | + \frac{β}{α} {∥ ω ∥}_{q} . \end{matrix}

We can also consider this problem under the Wasserstein metric. A similar discussion yields that

\begin{matrix} {\bar{I}}_{N}^{W} = \frac{1}{N} \sum_{i = 1}^{N} | ω^{⊤} {\hat{ξ}}_{i} | + {∥ ω ∥}_{q} . \end{matrix}

Although the two results share similar formulas, the result under the shortfall–Wasserstein metric is more flexible as it is possible to adjust the parameters

α, β

to achieve better performance.
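The case analysis of Example 1 can be cross-checked numerically: for any λ ⩾ 1/α, the supremum of h_i over y should equal |ω^⊤ ξ̂_i| + βλ∥ω∥_q. The sketch below evaluates h_i on a grid centred at a = ω^⊤ ξ̂_i; the parameter values and the helper names are assumptions chosen for illustration.

```python
def h(y, a, r, alpha, beta, lam):
    """h_i(y) of Example 1 with a = w^T xi_i and r = ||w||_q (eps = 1)."""
    t = abs(y - a) - r
    return abs(y) - alpha * lam * max(t, 0.0) + beta * lam * max(-t, 0.0)


def grid_sup(a, r, alpha, beta, lam, half_width=30.0, step=0.01):
    """Maximum of h over a uniform grid that contains y = a exactly."""
    n = int(round(half_width / step))
    return max(h(a + k * step, a, r, alpha, beta, lam) for k in range(-n, n + 1))


# With 0 < alpha < beta and lam >= 1/alpha the maximum sits at y = a.
sup_at_boundary_lam = grid_sup(a=-0.7, r=1.0, alpha=0.5, beta=1.5, lam=2.0)
sup_at_larger_lam = grid_sup(a=-0.7, r=1.0, alpha=0.5, beta=1.5, lam=3.0)
```

Both values agree with |a| + βλr, and the smaller one is obtained at λ = 1/α, as the infimum over λ ⩾ 1/α requires.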

The model that we simulated here is as follows:

y_{i} = β^{⊤} x_{i} + e_{i}, β = {(1, 2, 2, 3)}^{⊤}, x_{i} = {(x_{1 i}, x_{2 i}, x_{3 i}, x_{4 i})}^{⊤}, e_{i} \sim N (0, 1) .

Applying the above result under the shortfall–Wasserstein metric to this practical problem, we seek an estimator

\hat{β}

that minimizes

\frac{1}{N} \sum_{i = 1}^{N} | y_{i} - β^{⊤} x_{i} | + \frac{β}{α} (1 + {∥ β ∥}_{q}) .

(27)

When

α = β

, it reduces to the case of the classic Wasserstein metric. To illustrate their performance, we generate six hundred sets of samples with

x_{i}^{⊤}

generated from the distribution

U (0, 1)

and

e_{i}

generated from the distribution

N (0, 1)

independently and identically. The trend of MSE as

λ : = \frac{β}{α}

ranges from one to ten is presented below.

In Figure 1, we find that it is possible to achieve smaller MSE by adjusting the value of

λ

. Moreover, when

q = 2

and

q = 3

, the value of

λ

that minimizes MSE is larger than 1, which means that the shortfall–Wasserstein robust regression can achieve a better prediction effect than the Wasserstein robust regression.

Figure 1. MSE with respect to

λ

for different values of q.
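A regularized objective of the form (27) can be minimized with elementary tools. The sketch below is not the authors' solver: it uses plain subgradient descent with best-iterate tracking, synthetic data drawn as described above, and a hypothetical parameter `kappa` standing for the coefficient that multiplies (1 + ∥β∥_q) in (27).

```python
import random


def q_norm(b, q):
    return sum(abs(v) ** q for v in b) ** (1.0 / q)


def objective(b, xs, ys, kappa, q):
    """Mean absolute residual plus kappa * (1 + ||b||_q)."""
    resid = sum(abs(y - sum(bj * xj for bj, xj in zip(b, x))) for x, y in zip(xs, ys))
    return resid / len(ys) + kappa * (1.0 + q_norm(b, q))


def subgradient(b, xs, ys, kappa, q):
    n = len(ys)
    g = [0.0] * len(b)
    for x, y in zip(xs, ys):
        r = y - sum(bj * xj for bj, xj in zip(b, x))
        s = (r > 0) - (r < 0)
        for j, xj in enumerate(x):
            g[j] -= s * xj / n
    norm = q_norm(b, q)
    if norm > 0:  # at b = 0 the zero vector is a valid subgradient of the norm
        for j, bj in enumerate(b):
            sign = (bj > 0) - (bj < 0)
            g[j] += kappa * sign * abs(bj) ** (q - 1) / norm ** (q - 1)
    return g


def fit(xs, ys, kappa, q=2.0, steps=2000, step0=0.5):
    """Subgradient descent from b = 0; returns the best iterate and its value."""
    b = [0.0] * len(xs[0])
    best, best_val = list(b), objective(b, xs, ys, kappa, q)
    for t in range(1, steps + 1):
        g = subgradient(b, xs, ys, kappa, q)
        b = [bj - step0 / t ** 0.5 * gj for bj, gj in zip(b, g)]
        val = objective(b, xs, ys, kappa, q)
        if val < best_val:
            best, best_val = list(b), val
    return best, best_val


def simulate(n, seed=0):
    """x_i uniform on (0, 1)^4, e_i standard normal, beta = (1, 2, 2, 3)."""
    rng = random.Random(seed)
    beta_true = [1.0, 2.0, 2.0, 3.0]
    xs = [[rng.random() for _ in beta_true] for _ in range(n)]
    ys = [sum(b * x for b, x in zip(beta_true, row)) + rng.gauss(0.0, 1.0) for row in xs]
    return xs, ys
```

Calling `fit(*simulate(100), kappa=1.0)` for a range of `kappa` and `q` values would give the kind of sweep summarized in Figure 1; a convex solver would be a more efficient choice for larger experiments.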

To test the robustness of our model, we further introduce some outliers and compare the performance with the least-squares regression (LSR) model and the ridge regression model. We set

α = β

for the shortfall–Wasserstein robust regression model and the regularization coefficient 1 for the ridge regression. We run one hundred iterations, producing one hundred sets of samples each time, half of which serve as the training set while the rest serve as the testing set. Moreover, ten sets of outlier samples are added to the training set at each iteration. The result is presented below.

In Figure 2, the MSE under the shortfall–Wasserstein robust regression model is much smaller and less volatile than the MSE under the other two models, which means the shortfall–Wasserstein robust regression model is better at resisting large deviations in the predictors. Consequently, the estimator based on the shortfall–Wasserstein metric is reliable, stable, and superior to the one based on the Wasserstein metric.

Figure 2. The red, blue, and green curves represent the MSE under the SW-DD-DRO, the least-squares methods, and the ridge regression, respectively.

5.2. Portfolio Optimization

With

ξ

being a random vector of returns from n different financial assets and

ω

being the allocation vector, the problem becomes a portfolio optimization problem. In this subsection, we take

l (X) = {(X - c)}_{+}

to characterize the downside risk, and the problem we are interested in is of the following form:

inf_{ω \in D} sup_{P \in B_{ε}^{d_{u}} ({\hat{P}}_{N})} E^{P} {(ω^{⊤} ξ - c)}_{+},

where c is a constant. By Corollary 2, with

u (x) = α x_{+} - β x_{-}, 0 < α < β

, the inner maximum problem can be reformulated to the following form:

\frac{1}{N} \sum_{i = 1}^{N} {(ω^{⊤} {\hat{ξ}}_{i} - c)}_{+} + \frac{β}{α} {∥ ω ∥}_{q} ε .

With the ambiguity set constructed by the classic Wasserstein metric, the result is the same as the form of the above formula with

α = β

. For simulation, we choose four MSCI index assets: the MSCI Denmark index, the MSCI Turkey index, the MSCI Greece index, and the MSCI Norway index. We collect the daily closing prices of those indexes from 1 January 2020 to 31 December 2022 from cn.investing.com. It is noteworthy that the COVID-19 outbreak began in 2020, and the distribution of assets is highly uncertain during this period. Based on those data, we assume that the initial wealth is USD 1000 and that short sales are not allowed. We use a 30-day sliding window: the optimal weight is calculated from the previous 30 days of historical data and is taken as the investment decision for the next day. As the time window rolls, the cumulative return curves under different strategies are presented below.

The four curves represent the cumulative returns under the model constructed by the shortfall–Wasserstein metric with

\frac{β}{α} = 9

, the model constructed by the classic Wasserstein metric, the mean-variance model and the

\frac{1}{n}

portfolio model, respectively. The mean-variance model, a model that aims to minimize investment variance under certain expected returns, was proposed by [24]. The

\frac{1}{n}

portfolio model divides the money equally among each asset, and [25] found that the

\frac{1}{n}

portfolio performs well when the overall distribution of assets is highly uncertain.

In Figure 3, we can see that the first two cumulative return curves under the robust optimization models outperform the curves under the mean-variance model and the

\frac{1}{n}

portfolio model. Because of the similarity of the result of models under the shortfall–Wasserstein metric and the classic Wasserstein metric, the first two curves essentially behave the same. Since 2020, the global economy has been in recession due to the pandemic. In 2022, the epidemic was relatively stable and the economy slowly began to recover. During this period, the distribution of the four assets is highly uncertain. All four of these models perform well as the curves climb steadily, but the robust models take the variability into account and make better decisions, especially during the recovery period.

Figure 3. Cumulative return curves under different strategies.
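The rolling-window procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the weights are searched on a coarse grid of the no-short-sale simplex, `coef` is a hypothetical name for the factor multiplying ∥ω∥_q ε in the reformulated objective, and the returns in any test run are synthetic rather than the MSCI data.

```python
import itertools


def robust_objective(w, window, c, coef, eps, q=2.0):
    """Empirical mean of (w^T xi - c)_+ plus coef * ||w||_q * eps."""
    loss = sum(
        max(sum(wi * ri for wi, ri in zip(w, r)) - c, 0.0) for r in window
    ) / len(window)
    norm = sum(abs(wi) ** q for wi in w) ** (1.0 / q)
    return loss + coef * norm * eps


def simplex_grid(n_assets, steps):
    """All weight vectors with entries k / steps that are non-negative and sum to 1."""
    for combo in itertools.product(range(steps + 1), repeat=n_assets):
        if sum(combo) == steps:
            yield [k / steps for k in combo]


def rolling_weights(returns, window_len, c, coef, eps, steps=10):
    """For each day t, pick the grid weights minimizing the objective on the prior window."""
    chosen = []
    for t in range(window_len, len(returns)):
        window = returns[t - window_len:t]
        best = min(
            simplex_grid(len(returns[0]), steps),
            key=lambda w: robust_objective(w, window, c, coef, eps),
        )
        chosen.append(best)
    return chosen
```

With `window_len=30`, each entry of the returned list is the allocation applied on the day that follows its window, matching the description of the sliding window above.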

In general, in regression problems, our results are more reliable and robust when the sample is contaminated; in portfolio optimization problems, when the distribution of assets is highly uncertain, our results perform significantly better. Moreover, our result can be more widely used than the result under the Wasserstein model as it is applicable for many complex forms of the function l. Even when the loss function

l (\cdot)

is relatively simple, our model can achieve the same or even better performance than the classic Wasserstein model by adjusting the form of the function

u (\cdot)

.

6. Conclusions

In this paper, we propose a new DRO framework by extending the classic Wasserstein metrics to the shortfall–Wasserstein metrics. We study the tractability and reformulations of the shortfall–Wasserstein DRO problems for the loss function which is linear in the decision vector. This case of objective function includes many applications, such as regression and portfolio selection. One interesting result in the paper is that we give an equivalent characterization of the projection result to a one-dimensional ball. Based on the projection result, we show that the multi-dimensional constraint of our distributionally robust models can be reformulated as a one-dimensional one. Based on this reformulation, we established the finite sample guarantee of the DRO problem which is free from the curse of dimensionality. We present the application of our model in regression and provide simulation results to illustrate the performance of our results. In addition, we also present the real-data analysis on a portfolio selection to illustrate the performance of our new DRO model. Since this paper focuses on the linear cost function

l (w^{⊤} x)

, a possible future study is to consider the general loss function

l (w, x)

and study the reformulations and tractability of the general DRO problems.

Author Contributions

Conceptualization, R.L.; methodology, R.L., W.L. and T.M.; software, R.L.; validation, R.L., W.L. and T.M.; formal analysis, R.L. and W.L.; investigation, W.L. and T.M.; resources, W.L. and T.M.; data curation, R.L.; writing—original draft preparation, R.L.; writing—review and editing, W.L. and T.M.; visualization, R.L. and W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (Nos. 71671176, 71871208) and Anhui Natural Science Foundation (No. 2208085MA07).

Data Availability Statement

The data that support the analysis of this study are openly available in https://cn.investing.com/.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Theorem 2

The proof of Theorem 2 follows an idea similar to that of Theorem 1 of [15]. For completeness, we include it in this appendix. We first verify the strong duality in compact Polish spaces S and then generalize to

R

.

Appendix A.1. Strong Duality in Compact Spaces

To prepare for the verification, let us denote

C_{S} : = {(x, y) \in S \times S, c (x, y) < + \infty}

,

Λ_{S, c, l} : = {(λ, φ) : λ ⩾ 0, φ \in m_{U} (S; \bar{R}), φ (x) + λ c (x, y) ⩾ l (y) for all (x, y) \in C_{S}}

, and for

\forall η \in R

, define

\bar{I_{η}} = sup \{\int l (y) d π (x, y) : π \in ⋃_{ν \in P (S)} Π (μ, ν), \int c d π ⩽ - η\},

\underset{̲}{J_{η}} = inf {- λ η + \int φ d μ : (λ, φ) \in Λ_{S, c, l}} .

Proposition A1.

Suppose S is a compact Polish space, and

l : S \to R

satisfies Assumption 4,

c (X, Y) = u (| Y - X | - ε^{*})

, then for any

η \in R

,

\bar{I_{η}} = \underset{̲}{J_{η}}

, and a primal optimizer

π_{η}^{*}

satisfying

I (π_{η}^{*}) = \bar{I_{η}}

exists.

Proof.

To prepare to apply the Fenchel duality theorem (see Theorem 1 in Chapter 7 of [26]), denote

X = C_{b} (S \times S)

and its topological dual

X^{*} = M_{b} (S \times S)

, which is a consequence of the Riesz representation theorem. Next, let

C : = {g (x, y) \in X : g (x, y) = φ (x) + λ c (x, y) for all x, y for some φ \in C_{b} (S), λ ⩾ 0},

D : = {g (x, y) \in X : g (x, y) ⩾ l (y) for every x, y} .

In the set C, each g is identified by the pair

(λ, φ)

φ (x) = g (x, x) - λ c (- ε^{*}), λ = \frac{g (x, y) - g (x, x) + λ c (- ε^{*})}{c (x, y)},

for some

x, y

in S,

c (x, y) \neq 0

. Define functions

Φ : C \to R

and

Γ : D \to R

as below:

Φ (g) : = - λ η + \int φ d μ, Γ (g) : = 0 .

From the definition, it is evident that the functional

Φ

is convex and

Γ

is concave,

inf_{g \in C \cap D} {Φ (g) - Γ (g)} = inf \{- λ η + \int φ d μ : φ (x) + λ c (x, y) ⩾ l (y) for all x, y for some φ \in C_{b} (S), λ ⩾ 0\} .

The corresponding conjugate functions

Φ^{*} : C^{*} \to R

and

Γ^{*} : D^{*} \to R

are as below:

C^{*} = \{π \in X^{*} : sup_{g \in C} [⟨ g, π ⟩ - Φ (g)] < + \infty\} = \{π \in X^{*} : sup_{g \in C} \{\int g d π - Φ (g)\} < + \infty\},

D^{*} = \{π \in X^{*} : inf_{g \in D} [⟨ g, π ⟩ - Γ (g)] > - \infty\} = \{π \in X^{*} : inf_{g \in D} \int g d π > - \infty\},

Φ^{*} (π) = sup_{g \in C} \{\int g d π - Φ (g)\}, Γ^{*} (π) = inf_{g \in D} \int g d π .

For every

π \in M_{b} (S \times S)

,

\begin{matrix} sup_{g \in C} \{\int g d π - Φ (g)\} & = sup_{(λ, φ) \in R_{+} \times C_{b} (S)} \{\int (φ (x) + λ c (x, y)) d π (x, y) - \int φ (x) d μ (x) + λ η\} \\ = sup_{(λ, φ) \in R_{+} \times C_{b} (S)} \{λ (\int c (x, y) d π (x, y) + η) + \int φ (x) (d π (x, y) - d μ (x))\} \\ = \{\begin{matrix} 0, if \int c d π ⩽ - η and π (A \times S) = μ (A) for all A \in B (S), \\ \infty, otherwise . \end{matrix} \end{matrix}

Therefore,

C^{*} = \{π \in M_{b} (S \times S) : \int c d π ⩽ - η, π (A \times S) = μ (A) for all A \in B (S)\}, Φ^{*} (π) = 0 .

Next, to determine

D^{*}

, we can see that if a measure

π \in M_{b} (S \times S)

is not non-negative,

{inf}_{g \in D} \int g d π = - \infty

(for more details, see Lemma B.6 of [15]). When

π \in M_{b} (S \times S)

is non-negative, as l is an upper semicontinuous function that is bounded from above, we can use a monotonically decreasing sequence of continuous functions to approximate l point-by-point. As a result of the monotone convergence theorem, we have the following equality:

inf \{\int g (x, y) d π (x, y) : g (x, y) ⩾ l (y) for all x, y\} = \int l (y) d π (x, y) .

Then,

D^{*} = \{π \in M_{b} (S \times S) : \int l (y) d π (x, y) > - \infty\}, Γ^{*} (π) = \int l (y) d π (x, y) .

Then,

Γ^{*} (π) - Φ^{*} (π) = \int l (y) d π (x, y),

C^{*} \cap D^{*} = \{π \in ⋃_{ν \in P (S)} Π (μ, ν) : \int c d π ⩽ - η, \int l (y) d π (x, y) > - \infty\} .

Then,

\bar{I_{η}} = sup {Γ^{*} (π) - Φ^{*} (π) : π \in C^{*} \cap D^{*}} .

Because the set

C \cap D

contains points of the relative interiors of C and D, the epigraph of the function has a nonempty interior, and

{inf}_{g \in C \cap D} {Φ (g) - Γ (g)}

is finite, Fenchel’s duality theorem yields

inf_{g \in C \cap D} {Φ (g) - Γ (g)} = sup_{π \in C^{*} \cap D^{*}} {Γ^{*} (π) - Φ^{*} (π)} = \bar{I_{η}} .

Moreover, there exists some

π_{η}^{*} \in Φ_{μ}

that attains the supremum on the right-hand side. As

C_{b} (S) \subseteq m_{U} (S; \bar{R})

, then

\underset{̲}{J_{η}} ⩽ inf \{- λ η + \int φ d μ : φ (x) + λ c (x, y) ⩾ l (y) for all x, y for some φ \in C_{b} (S), λ ⩾ 0\} = \bar{I_{η}} .

Similar to the proof of weak duality, we can verify that

\bar{I_{η}} ⩽ \underset{̲}{J_{η}}

. Then we obtain

\bar{I_{η}} = \underset{̲}{J_{η}}

and some

π_{η}^{*}

exist such that

I (π_{η}^{*}) = \bar{I_{η}}

. The proof is complete. □

Appendix A.2. Strong Duality in Non-Compact Spaces

Proposition A1 establishes strong duality in compact Polish spaces. We now extend strong duality to

R

. The verification is broken down into several steps, of which Proposition A2 is the first.

For convenience, we first introduce some notations. For any

π \in P (R \times R)

, denote

S_{π} : = S p t (π_{X}) \cup S p t (π_{Y}),

in which

S p t (π_{X})

and

S p t (π_{Y})

are, respectively, referred to as the supports of marginals

π_{X} (\cdot) : = π (\cdot \times R)

and

π_{Y} (\cdot) : = π (R \times \cdot)

. We have that the set

S_{π} \times S_{π}

is

σ

-compact as every probability measure defined on a Polish space has

σ

-compact support; thus, in Proposition A2,

S_{π} \times S_{π}

can be expressed as the union of an increasing sequence of compact subsets

(S_{n} \times S_{n} : n ⩾ 1)

[15]. Then we can apply the results of Proposition A1 via a sequential argument. For any closed subset

V \subseteq R

, let

Λ (V \times V) : = {(λ, φ) : λ ⩾ 0, φ \in m_{U} (V; \bar{R}), φ (x) + λ c (x, y) ⩾ l (y) for all (x, y) \in (V \times V) \cap C},

where

C : = {(x, y) \in R \times R : c (x, y) < \infty}

and

m_{U} (V; \bar{R})

denotes the set of measurable functions

φ : (V, U (V)) \to (\bar{R}, B (\bar{R}))

. With this notation,

Λ (R \times R) = Λ_{c, l}

. Furthermore, the function

φ_{λ} (x) : = {sup}_{y \in S} {l (y) - λ c (x, y)} : (R, U (R)) \to (\bar{R}, B (\bar{R}))

is measurable. A more detailed explanation of this can be found in [15]. Finally, let us denote E as

E : = \{π \in ⋃_{ν \in P (R)} Π (μ, ν) : \int c (x, y) d π (x, y) < + \infty, \int | l (y) | d π (x, y) < + \infty\} .

Proposition A2.

If Assumption 4 holds and

c (X, Y) = u (| Y - X | - ε^{*})

, then for any

π \in E

,

inf_{(λ, φ) \in Λ (S_{π} \times S_{π})} J (λ, φ) ⩽ \bar{I} .

Proof.

This proof is similar to [15]. We give a proof for completeness here. From the discussion above the proposition, we know that

S_{π} \times S_{π}

is

σ

-compact. By the definition of

σ

-compactness, an increasing sequence of compact subsets of

S_{n} \times S_{n}

exists such that

S p t (π) \subseteq S_{π} \times S_{π} = \cup_{n ⩾ 1} (S_{n} \times S_{n})

. As

\int | l (y) | d π (x, y)

and

\int c (x, y) d π (x, y)

are finite, one is able to find

(S_{n} \times S_{n}, n ⩾ 1)

and

η \in (0, + \infty)

to satisfy

p_{n} : = π (S_{n} \times S_{n}) ⩾ 1 - \frac{1}{n},

\int c (x, y) 1_{{(S_{n} \times S_{n})}^{c}} d π (x, y) ⩽ \frac{η}{n} (1 - \frac{1}{n}),

\int | l (y) | 1_{{(S_{n} \times S_{n})}^{c}} d π (x, y) ⩽ \frac{1}{n},

where

{(S_{n} \times S_{n})}^{c} : = (R \times R) ∖ (S_{n} \times S_{n})

. Define

π_{n} \in P (S_{n} \times S_{n})

and its corresponding marginals

μ_{n} \in P (S_{n})

as below:

π_{n} (\cdot) : = \frac{π (\cdot \cap (S_{n} \times S_{n}))}{p_{n}}, μ_{n} (\cdot) : = π_{n} (\cdot \times S_{n}) .

For every

n ⩾ 1

, define

\bar{I_{η, n}} : = sup \{\int l (y) γ (d x, d y) : γ \in \underset{ν \in P (S_{n})}{\cup} Π (μ_{n}, ν), \int c (x, y) d γ (x, y) ⩽ - \frac{η}{n}\},

\underset{̲}{J_{η, n}} : = inf \{- λ \frac{η}{n} + \int φ d μ_{n} : (λ, φ) \in Λ (S_{n} \times S_{n})\},

whose supports are

S_{n} \times S_{n}

. As

S_{n}

is compact, from Proposition A1, we know that a

γ_{η, n}^{*} \in P (S_{n} \times S_{n})

exists, satisfying

\int l (y) γ_{η, n}^{*} (d x, d y) = \bar{I_{η, n}} = \underset{̲}{J_{η, n}} .

Construct a measure

\tilde{π} \in P (R \times R)

,

\tilde{π} (\cdot) = p_{n} γ_{η, n}^{*} (\cdot \cap (S_{n} \times S_{n})) + π (\cdot \cap {(S_{n} \times S_{n})}^{c}) .

It can be verified that

\tilde{π} \in Π (μ, ν)

for some

ν \in P (R)

. From the definition of

\tilde{π}

, we have

\begin{matrix} \tilde{π} (\cdot \times R) & = p_{n} γ_{η, n}^{*} ((\cdot \cap S_{n}) \times S_{n}) + π ((\cdot \times R) \cap {(S_{n} \times S_{n})}^{c}) \\ = p_{n} μ_{n} (\cdot \cap S_{n}) + π ((\cdot \times R) \cap {(S_{n} \times S_{n})}^{c}) \\ = π ((\cdot \times R) \cap (S_{n} \times S_{n})) + π ((\cdot \times R) \cap {(S_{n} \times S_{n})}^{c}) \\ = π (\cdot \times R) = μ (\cdot), \end{matrix}

where the third equality holds because

μ_{n} (\cdot) = π (\cdot \times S_{n}) / p_{n}

, and

p_{n} μ_{n} (\cdot \cap S_{n}) = π ((\cdot \times R) \cap (S_{n} \times S_{n}))

. Furthermore,

\int c d \tilde{π} = p_{n} \int_{S_{n} \times S_{n}} c d γ_{η, n}^{*} + \int_{{(S_{n} \times S_{n})}^{c}} c d π ⩽ - \frac{η}{n} (1 - \frac{1}{n}) + \frac{η}{n} (1 - \frac{1}{n}) = 0 .
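The gluing construction of π̃ can be illustrated on a finite example. In the sketch below, the joint measure, the set S_n, and the plan γ are all assumptions made for illustration; in the proof γ is the optimizer supplied by Proposition A1, whereas here any plan on S_n × S_n with first marginal μ_n serves to check that the first marginal of π̃ is again μ.

```python
from fractions import Fraction


def first_marginal(plan):
    """First marginal of a discrete plan given as {(x, y): mass}."""
    out = {}
    for (x, _), mass in plan.items():
        out[x] = out.get(x, 0) + mass
    return out


def glue(pi, S, gamma_builder):
    """Build pi_tilde = p_n * gamma on S x S plus pi restricted to the complement."""
    inside = {k: m for k, m in pi.items() if k[0] in S and k[1] in S}
    outside = {k: m for k, m in pi.items() if k not in inside}
    p_n = sum(inside.values())
    pi_n = {k: m / p_n for k, m in inside.items()}  # normalized restriction
    mu_n = first_marginal(pi_n)
    gamma = gamma_builder(mu_n)  # any plan on S x S with first marginal mu_n
    tilde = dict(outside)
    for k, m in gamma.items():
        tilde[k] = tilde.get(k, 0) + p_n * m
    return tilde


quarter = Fraction(1, 4)
pi = {(0, 0): quarter, (1, 2): quarter, (2, 5): quarter, (5, 1): quarter}
# Illustrative plan: move all mass of mu_n to the point y = 0.
pi_tilde = glue(pi, {0, 1, 2}, lambda mu_n: {(x, 0): m for x, m in mu_n.items()})
```

Exact rational arithmetic makes the marginal identity hold with equality rather than up to rounding.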

Then,

\tilde{π} \in Φ_{μ}

, and consequently,

\bar{I} ⩾ \int l (y) \tilde{π} (d x, d y) = p_{n} \int_{S_{n} \times S_{n}} l d γ_{η, n}^{*} + \int_{{(S_{n} \times S_{n})}^{c}} l d π .

Thus, we have

\bar{I} ⩾ p_{n} \bar{I_{η, n}} - n^{- 1}

, and as

\bar{I_{η, n}} = \underset{̲}{J_{η, n}}

, we have

\underset{̲}{J_{η, n}} ⩽ {(1 - \frac{1}{n})}^{- 1} (\bar{I} + \frac{1}{n})

. Proposition A2 already holds when

\bar{I} = + \infty

, so we take

\bar{I} < + \infty

; for every

n ⩾ 1

and any

ε > 0

, take an

ε

-optimal solution

(λ_{n}, φ_{n})

for

\underset{̲}{J_{η, n}}

,

- λ_{n} \frac{η}{n} + \int φ_{n} d μ_{n} ⩽ \underset{̲}{J_{η, n}} + ε .

As

(λ_{n}, φ_{n})

belongs to

Λ (S_{n} \times S_{n})

, by definition,

φ_{n} (x) ⩾ \sup_{z \in S_{n}} \{l (z) - λ_{n} c (x, z)\}

for every

x \in S_{n}

. Then, for every n, we have

- λ_{n} \frac{η}{n} + \int \sup_{z \in S_{n}} \{l (z) - λ_{n} c (x, z)\} d μ_{n} (x) ⩽ \underline{J_{η, n}} + ε .

Combining this with the fact that

μ_{n} (\cdot) = π (\cdot \times S_{n}) / p_{n}

, we further have

\varlimsup_{n \to \infty} (- λ_{n} \frac{η}{n} + \int \sup_{z \in S_{n}} \{l (z) - λ_{n} c (x, z)\} \frac{1_{S_{n}} (x) \cdot 1_{S_{n}} (y)}{p_{n}} d π (x, y)) ⩽ \bar{I} + ε .

(A1)

Since

c (x, x) = c (- ε^{*}) > - \infty

, the function

l (x) - λ_{n} c (- ε^{*})

is a lower bound of the integrand above on

S_{n} \times S_{n}

. As we also have

l \in L^{1} (d μ)

, we obtain the following two results:

(a) From (A1) and the fact that

λ_{n} ⩾ 0

, both

\varlimsup_{n \to \infty} λ_{n}

and

\varliminf_{n \to \infty} λ_{n}

are finite, so

\{λ_{n} : n ⩾ 1\}

has a convergent subsequence; that is, there exists a subsequence

\{n_{k} : k ⩾ 1\}

such that

λ_{n_{k}} \to λ^{*}

as

k \to \infty

for some

λ^{*} \in [0, \infty)

;

(b) By Fatou’s lemma and the dominated convergence theorem, we obtain

\begin{matrix} \bar{I} + ε & ⩾ \varliminf_{k \to \infty} (- λ_{n_{k}} \frac{η}{n_{k}} + \int \sup_{z \in S_{n_{k}}} \{l (z) - λ_{n_{k}} c (x, z)\} \frac{1_{S_{n_{k}}} (x) \cdot 1_{S_{n_{k}}} (y)}{p_{n_{k}}} d π (x, y)) \\ & ⩾ \int_{S_{π} \times S_{π}} \sup_{z \in S_{π}} \{l (z) - λ^{*} c (x, z)\} d π (x, y) \\ & = \int_{S_{π}} \sup_{z \in S_{π}} \{l (z) - λ^{*} c (x, z)\} d μ (x) . \end{matrix}

Here, we used the facts that

\frac{η}{n_{k}} \to 0, λ_{n_{k}} \to λ^{*}, p_{n_{k}} \to 1

as

k \to \infty

and

\cup_{n ⩾ 1} S_{n} = \cup_{k ⩾ 1} S_{n_{k}} = S_{π}

. We also used the fact that

\varliminf_{k \to \infty} \sup_{z \in S_{n_{k}}} \{l (z) - λ_{n_{k}} c (x, z)\} ⩾ \sup_{z \in S_{π}} \{l (z) - λ^{*} c (x, z)\}

(see Lemma B.7 in Appendix B of [15]). If we let

φ^{*} (x) = \sup_{z \in S_{π}} \{l (z) - λ^{*} c (x, z)\}

, then

(λ^{*}, φ^{*}) \in Λ (S_{π} \times S_{π})

, and due to the arbitrariness of

ε

, it follows that

J (λ^{*}, φ^{*}) ⩽ \bar{I} .

This completes the proof. □

Proposition A3.

Suppose that Assumption 4 is in force and

c (X, Y) = u (| Y - X | - ε^{*})

. Then, for any

λ ⩾ 0

,

sup_{π \in E} \int {l (y) - λ c (x, y)} d π (x, y) = \int sup_{z \in R} {l (z) - λ c (x, z)} d μ (x) .

Proof.

Let

g (x, y) : = l (y) - λ c (x, y)

. For

n ⩾ 1

and

k ⩽ n^{2}

, define

A_{k, n} : = \{(x, y) \in R \times R : \frac{k - 1}{n} ⩽ g (x, y) ⩽ \frac{k}{n}\},

B_{k, n} : = \mathrm{Proj}_{1} (A_{k, n}) \setminus \cup_{j > k} \mathrm{Proj}_{1} (A_{j, n}) .

Noting that g is upper semicontinuous and the sets

A_{k, n}

are Borel-measurable, we have that their projections

B_{k, n}

are also Borel-measurable subsets of

R

. Additionally, from the definition, the collection

(B_{k, n} : k ⩽ n^{2})

is disjoint. Then, by the Jankov–von Neumann selection theorem [27], for each

k ⩽ n^{2}

such that

A_{k, n} \neq \emptyset

, there exists a universally measurable function

γ_{k} : \mathrm{Proj}_{1} (A_{k, n}) \to R

satisfying

\frac{k - 1}{n} ⩽ g (x, γ_{k} (x)) ⩽ \frac{k}{n} .

Next, as

B_{k, n} \subseteq \mathrm{Proj}_{1} (A_{k, n})

and

(B_{k, n} : k ⩽ n^{2})

is disjoint, we define

Γ_{n} : R \to R

by

Γ_{n} (x) : = \begin{cases} γ_{k} (x), & \text{if } x \in B_{k, n} \text{ for some } k ⩽ n^{2}, \\ x, & \text{otherwise} . \end{cases}

Since each

γ_{k}

is measurable,

Γ_{n}

is also measurable. For every

x \in R

, if k exists such that

x \in B_{k, n}

, then

g (x, Γ_{n} (x)) = g (x, γ_{k} (x)) ⩾ \frac{k - 1}{n}

and

\sup \{g (x, y) : g (x, y) ⩽ n, y \in R\} - \frac{1}{n} ⩽ \frac{k}{n} - \frac{1}{n} = \frac{k - 1}{n}

. If no such k exists, then

g (x, Γ_{n} (x)) = g (x, x)

and

\{(x, y) : g (x, y) ⩽ n, y \in R\}

is an empty set. In both cases, we obtain

\begin{matrix} \sup \{g (x, y) : g (x, y) ⩽ n, y \in R\} - \frac{1}{n} ⩽ g (x, Γ_{n} (x)) ⩽ n . \end{matrix}

(A2)
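The band construction behind (A2) can be illustrated on a finite grid, where no measurable selection theorem is needed. The following is a minimal sketch under stated assumptions: the function name `select_near_sup`, the choices l(y) = y and u(t) = exp(t) - 1, and the values of λ and ε* are hypothetical and only serve to produce concrete numbers.

```python
import math


def select_near_sup(g_row, n):
    """Return an index j with g_row[j] within 1/n of sup{g : g <= n}, or None.

    k is the largest integer such that some value not exceeding n lies in
    the band [(k - 1) / n, k / n]; any point of that band is a valid choice.
    """
    scaled = [n * v for v in g_row if v <= n]
    if not scaled:
        return None            # truncated set is empty: the text sets Gamma_n(x) = x
    k = max(math.ceil(s) for s in scaled)
    for j, v in enumerate(g_row):
        if v <= n and k - 1 <= n * v <= k:
            return j


def g(x, y, lam=1.5, eps=0.5):
    # Hypothetical l(y) = y and c(x, y) = u(|y - x| - eps) with u(t) = exp(t) - 1.
    return y - lam * (math.exp(abs(y - x) - eps) - 1.0)


grid = [k / 10 for k in range(-60, 61)]     # finite stand-in for R
row = [g(0.3, y) for y in grid]             # values g(x, .) at x = 0.3

for n in (1, 5, 50):
    j = select_near_sup(row, n)
    truncated_sup = max(v for v in row if v <= n)
    print(n, grid[j], row[j], truncated_sup)
```

As n grows, the selected value approaches the supremum of g(x, ·) over the grid, which is the discrete analogue of the limit taken after (A2).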

Letting

n \to \infty

, we have

\sup \{g (x, y) : y \in R\} ⩽ \varliminf_{n \to \infty} g (x, Γ_{n} (x)) ⩽ \infty .

Define the family of probability measures

(π_{n} : n ⩾ 1)

as

d π_{n} (x, y) = d μ (x) \cdot d δ_{Γ_{n} (x)} (y) .

From (A2),

g (x, y) ⩽ n

holds

π_{n}

-almost surely; thus,

\int | c (x, y) | d π_{n} < + \infty

and

\int | l (y) | d π_{n} < + \infty

. Moreover, since

π_{n} (\cdot \times R) = μ (\cdot)

, then

π_{n} \in E

. Finally, since

g (x, Γ_{n} (x)) ⩾ l (x) - λ c (- ε^{*}) - 1 > - \infty

, by Fatou’s lemma,

\begin{matrix} \varliminf_{n \to + \infty} \int g (x, y) d π_{n} (x, y) & = \varliminf_{n \to + \infty} \int g (x, Γ_{n} (x)) d μ (x) \\ & ⩾ \int \varliminf_{n \to + \infty} g (x, Γ_{n} (x)) d μ (x) \\ & ⩾ \int \sup_{y \in R} g (x, y) d μ (x) . \end{matrix}

Then,

\sup_{π \in E} \int g d π ⩾ \int \sup_{y \in R} g (x, y) d μ (x)

holds. Conversely, since every

π \in E

has first marginal μ, we have

\int g d π ⩽ \int \sup_{y \in R} g (x, y) d π (x, y) = \int \sup_{y \in R} g (x, y) d μ (x)

, and the proof is complete. □
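The interchange of supremum and integral in Proposition A3 has a simple finite analogue, sketched below. All concrete choices are assumptions made for illustration: μ is uniform on four hypothetical points, R is replaced by a finite grid, and g uses l(y) = y with u(t) = exp(t) - 1.

```python
import math
import random


def g(x, y, lam, eps=0.5):
    # Hypothetical l(y) = y and c(x, y) = u(|y - x| - eps) with u(t) = exp(t) - 1.
    return y - lam * (math.exp(abs(y - x) - eps) - 1.0)


xs = [-1.0, 0.0, 0.5, 2.0]                   # support of mu (uniform weights)
grid = [k / 10 for k in range(-60, 61)]      # finite stand-in for R
lam = 1.5

# Right-hand side: integrate the pointwise supremum against mu.
rhs = sum(max(g(x, y, lam) for y in grid) for x in xs) / len(xs)

# Left-hand side, attained by pi = mu (x) delta_{Gamma(x)} with Gamma(x) an argmax.
selector = {x: max(grid, key=lambda y: g(x, y, lam)) for x in xs}
lhs_selector = sum(g(x, selector[x], lam) for x in xs) / len(xs)

random.seed(1)


def random_coupling_value():
    """Value of the integral of g under a random coupling with first marginal mu."""
    value = 0.0
    for x in xs:
        w = [random.random() for _ in grid]
        s = sum(w)
        value += sum(wi / s * g(x, y, lam) for wi, y in zip(w, grid)) / len(xs)
    return value


print(lhs_selector, rhs, random_coupling_value())
```

The selector coupling reproduces the right-hand side exactly, while randomly drawn couplings with the same first marginal stay below it. On the real line the argmax need not exist, which is why the proof works with the approximate selectors Γ_n instead.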

Proof of Theorem 2.

If

\bar{I} = \infty

, then

\bar{I} ⩽ \underline{J} = \infty

and the proof is complete. Consider the case that

\bar{I} < \infty

. As a result of Proposition A2, for every

π \in E

, we have

\bar{I} ⩾ \inf_{(λ, φ) \in Λ (S_{π} \times S_{π})} \{\int φ (x) d μ (x)\} ⩾ \inf_{λ ⩾ 0} \{\int \sup_{y \in S_{π}} \{l (y) - λ c (x, y)\} d μ (x)\} .

(A3)

For any

π \in E

and

λ ⩾ 0

, define

T (λ, π) : = \int \sup_{y \in S_{π}} \{l (y) - λ c (x, y)\} d μ (x) .

As

c (x, x) = c (- ε^{*})

, we have

T (λ, π) ⩾ \int l d μ - λ c (- ε^{*}) .

(A4)

Since

T (λ, π) > \bar{I}

for every

λ > λ_{max} : = (\int l d μ - \bar{I}) / c (- ε^{*})

, we may restrict attention to the compact subset

[0, λ_{max}]

. From (A3), we then obtain

\bar{I} ⩾ \inf_{λ \in [0, λ_{max}]} T (λ, π) .

(A5)

As

\sup_{y \in S_{π}} \{l (y) - λ c (x, y)\}

is a lower semicontinuous function with respect to the variable

λ

, we can verify that

T (λ, π)

is lower semicontinuous in

λ

as well. Indeed, for any

λ_{n} \to λ

, as (A4) holds, we can apply Fatou’s lemma to obtain

\begin{matrix} \varliminf_{n \to + \infty} T (λ_{n}, π) & ⩾ \int \varliminf_{n \to + \infty} \sup_{y \in S_{π}} \{l (y) - λ_{n} c (x, y)\} d μ (x) \\ & ⩾ \int \sup_{y \in S_{π}} \{l (y) - λ c (x, y)\} d μ (x) = T (λ, π) . \end{matrix}

Moreover,

T (λ, π)

is a convex function with respect to

λ

for fixed

π

, being an integral of pointwise suprema of functions that are affine in λ. Additionally, for any

α \in (0, 1)

and

π_{1}, π_{2} \in E

,

\begin{matrix} T (λ, α π_{1} + (1 - α) π_{2}) & = \int \sup_{y \in S_{α π_{1} + (1 - α) π_{2}}} \{l (y) - λ c (x, y)\} d μ (x) \\ & ⩾ \max_{i = 1, 2} \{\int \sup_{y \in S_{π_{i}}} \{l (y) - λ c (x, y)\} d μ (x)\} \\ & ⩾ α T (λ, π_{1}) + (1 - α) T (λ, π_{2}), \end{matrix}

which means that

T (λ, π)

is concave in

π

for fixed

λ

. By applying the minimax theorem, we can conclude

\sup_{π \in E} \inf_{λ \in [0, λ_{max}]} T (λ, π) = \inf_{λ \in [0, λ_{max}]} \sup_{π \in E} T (λ, π) .

In conjunction with (A5), this yields

\bar{I} ⩾ \sup_{π \in E} \inf_{λ \in [0, λ_{max}]} T (λ, π) ⩾ \inf_{λ \in [0, λ_{max}]} \{\sup_{π \in E} \int \sup_{y \in S_{π}} \{l (y) - λ c (x, y)\} d μ (x)\} .

(A6)

Now, since

\int \sup_{y \in S_{π}} \{l (y) - λ c (x, y)\} d μ (x) ⩾ \int (l (y) - λ c (x, y)) d π (x, y)

for any

π \in E

, from Proposition A3, we have

\sup_{π \in E} \int \sup_{y \in S_{π}} \{l (y) - λ c (x, y)\} d μ (x) = \int \sup_{y \in R} \{l (y) - λ c (x, y)\} d μ (x) = \int φ_{λ} (x) d μ (x) .

In conjunction with (A6), we further obtain

\bar{I} ⩾ \inf_{λ \in [0, λ_{max}]} \int φ_{λ} (x) d μ (x) ⩾ \inf_{λ ⩾ 0} \int φ_{λ} (x) d μ (x) .

(A7)

Let

g (λ) : = \int φ_{λ} (x) d μ (x)

. By Fatou’s lemma,

\varliminf_{n \to + \infty} g (λ_{n}) ⩾ g (λ)

as

λ_{n} \to λ

, so

g (\cdot)

is lower semicontinuous. Moreover,

g (λ) \to \infty

as

λ \to \infty

because

g (λ) ⩾ \int l (x) d μ (x) - λ c (- ε^{*})

and

c (- ε^{*}) < 0

. As a result, the level sets

\{λ : λ ⩾ 0, g (λ) ⩽ m\}

are compact for every m. Therefore,

g (\cdot)

attains its infimum, i.e., we can find

λ^{*} \in [0, \infty)

such that

\inf_{λ ⩾ 0} \{\int φ_{λ} (x) d μ (x)\} = \int φ_{λ^{*}} (x) d μ (x)

. As

φ_{λ^{*}} (x) + λ^{*} c (x, y) ⩾ l (y)

, we have

(λ^{*}, φ_{λ^{*}}) \in Λ_{c, f}

; thus,

\underline{J} ⩽ \int φ_{λ^{*}} (x) d μ (x)

. In conjunction with (A7), we have

\bar{I} ⩾ \underline{J}

, which completes the proof of assertion (i).
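The properties of the dual objective used above (convexity in λ, growth as λ increases, and domination of every feasible primal value) can be inspected numerically. The sketch below is an illustration only: the functions `u`, `cost`, and `loss`, the value of `EPS`, the support of μ, and the finite grids replacing R and [0, ∞) are all hypothetical choices.

```python
import math

EPS = 0.5                                    # hypothetical value of epsilon^*


def u(t):
    return math.exp(t) - 1.0                 # hypothetical loss function with u(0) = 0


def cost(x, y):
    return u(abs(y - x) - EPS)               # c(x, y) = u(|y - x| - epsilon^*)


def loss(y):
    return y * y                             # hypothetical l


def dual_value(lam, xs, grid):
    """Integral of phi_lambda against mu, with mu uniform on xs and the
    supremum over R replaced by a maximum over a finite grid."""
    return sum(max(loss(y) - lam * cost(x, y) for y in grid) for x in xs) / len(xs)


xs = [-1.0, 0.0, 0.5, 2.0]
grid = [k / 10 for k in range(-60, 61)]      # contains every point of xs

# The identity coupling is feasible because c(x, x) = u(-EPS) < 0,
# so the integral of l against mu is a lower bound on the primal value.
primal_lower = sum(loss(x) for x in xs) / len(xs)

lams = [k / 4 for k in range(0, 81)]         # lambda on an equally spaced grid in [0, 20]
values = [dual_value(lam, xs, grid) for lam in lams]
lam_star = lams[values.index(min(values))]
print(lam_star, min(values), primal_lower)
```

In this toy setting the dual objective is a maximum of finitely many affine functions of λ, so its convexity is immediate, and the minimizer found on the grid lies in the interior of the search interval. The gap between the printed dual minimum and `primal_lower` is expected, since the identity coupling is feasible but generally not optimal.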

The argument above shows that a dual optimizer

(λ^{*}, φ_{λ^{*}})

always exists. Next, if there exists

π^{*}

satisfying

\bar{I} = I (π^{*}) = J (λ^{*}, φ_{λ^{*}})

, that is, a primal optimizer, then we have

\int l (y) d π^{*} (x, y) = \int φ_{λ^{*}} (x) d μ (x) .

In addition, as

π^{*} \in Φ_{μ}

and

φ_{λ^{*}} (x) = \sup_{z \in R} \{l (z) - λ^{*} c (x, z)\}

, we have

\begin{matrix} \int l (y) d π^{*} (x, y) & = \int (l (y) - λ^{*} c (x, y)) d π^{*} (x, y) + λ^{*} \int c (x, y) d π^{*} (x, y) \\ & ⩽ \int φ_{λ^{*}} (x) d π^{*} (x, y) + λ^{*} \int c (x, y) d π^{*} (x, y) \\ & ⩽ \int φ_{λ^{*}} (x) d π^{*} (x, y) = \int l (y) d π^{*} (x, y) . \end{matrix}

Thus, both inequalities hold with equality, and we can conclude

l (y) - λ^{*} c (x, y) = \sup_{z \in R} \{l (z) - λ^{*} c (x, z)\} \quad π^{*}\text{-a.s.},

(A8)

λ^{*} \int c (x, y) d π^{*} (x, y) = 0 .

(A9)

Conversely, if (A8) and (A9) are satisfied by some

π^{*} \in Φ_{μ}

and

(λ^{*}, φ_{λ^{*}}) \in Λ_{c, f}

, then

\begin{matrix} \int l (y) d π^{*} (x, y) & = \int (l (y) - λ^{*} c (x, y)) d π^{*} (x, y) + λ^{*} \int c (x, y) d π^{*} (x, y) \\ & = \int \sup_{z \in R} \{l (z) - λ^{*} c (x, z)\} d π^{*} (x, y) = \int φ_{λ^{*}} (x) d μ (x), \end{matrix}

which means that

π^{*}

and

(λ^{*}, φ_{λ^{*}})

are optimal for the primal and dual problems, respectively. The proof of the uniqueness of the primal optimizer

π^{*}

is similar to the proof of Theorem 1(b) in [15] and is thus omitted here. This completes the proof. □
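As a sanity check of the optimality conditions (A8) and (A9), the following worked example takes a deliberately simple setting. The choices μ = δ_0, l(y) = y, and the linear function u(t) = t (so that c(x, y) = |y - x| - ε^* with ε^* > 0) are illustrative assumptions; in particular, it is not checked here whether a linear u satisfies Assumption 4. Under these choices, the constraint \int c d π ⩽ 0 reads E|Y| ⩽ ε^* for Y distributed as the second marginal of π.

```latex
\begin{aligned}
\bar{I} &= \sup \{ \mathbb{E}[Y] : \mathbb{E}|Y| \leqslant \varepsilon^{*} \} = \varepsilon^{*},
  \quad \text{attained by } \pi^{*} = \delta_{0} \otimes \delta_{\varepsilon^{*}}, \\
\varphi_{\lambda}(0) &= \sup_{z \in \mathbb{R}} \{ z - \lambda (|z| - \varepsilon^{*}) \}
  = \begin{cases} \lambda \varepsilon^{*}, & \lambda \geqslant 1, \\ +\infty, & 0 \leqslant \lambda < 1, \end{cases} \\
\underline{J} &= \inf_{\lambda \geqslant 0} \varphi_{\lambda}(0) = \varepsilon^{*},
  \quad \text{attained at } \lambda^{*} = 1, \\
\text{(A8):} \quad & l(\varepsilon^{*}) - \lambda^{*} c(0, \varepsilon^{*})
  = \varepsilon^{*} - 0 = \varphi_{\lambda^{*}}(0), \\
\text{(A9):} \quad & \lambda^{*} \int c \, d\pi^{*} = |\varepsilon^{*}| - \varepsilon^{*} = 0 .
\end{aligned}
```

In this toy case the primal and dual values coincide at ε^*, and the pair (π^*, λ^*) satisfies both conditions. The primal optimizer is not unique here, since any nonnegative Y with mean ε^* attains the same value.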

References

1. Wiesemann, W.; Kuhn, D.; Sim, M. Distributionally robust convex optimization. Oper. Res. 2014, 62, 1358–1376.
2. Jiang, R.; Guan, Y. Data-driven chance constrained stochastic program. Math. Program. 2016, 158, 291–327.
3. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
4. Mohajerin Esfahani, P.; Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Math. Program. 2018, 171, 115–166.
5. Li, J.Y.M.; Mao, T. A general Wasserstein framework for data-driven distributionally robust optimization: Tractability and applications. arXiv 2022, arXiv:2207.09403.
6. Föllmer, H.; Schied, A. Convex measures of risk and trading constraints. Financ. Stoch. 2002, 6, 429–447.
7. Föllmer, H.; Schied, A. Stochastic Finance: An Introduction in Discrete Time, 4th ed.; Walter de Gruyter: Berlin, Germany, 2016.
8. Delage, E.; Guo, S.; Xu, H. Shortfall risk models when information on loss function is incomplete. Oper. Res. 2022, forthcoming.
9. Guo, S.; Xu, H. Distributionally robust shortfall risk optimization model and its approximation. Math. Program. 2019, 174, 473–498.
10. Mao, T.; Wang, R.; Wu, Q. Model Aggregation for Risk Evaluation and Robust Optimization. arXiv 2022, arXiv:2201.06370.
11. Popescu, I. Robust mean-covariance solutions for stochastic optimization. Oper. Res. 2007, 55, 98–112.
12. Hanasusanto, G.A.; Kuhn, D. Conic programming reformulations of two-stage distributionally robust linear programs over Wasserstein balls. Oper. Res. 2018, 66, 849–869.
13. Kuhn, D.; Esfahani, P.M.; Nguyen, V.A.; Shafieezadeh-Abadeh, S. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics; Informs: Catonsville, MD, USA, 2019; pp. 130–166.
14. Peng, C.; Delage, E. Data-driven optimization with distributionally robust second order stochastic dominance constraints. Oper. Res. 2022.
15. Blanchet, J.; Murthy, K. Quantifying distributional model risk via optimal transport. Math. Oper. Res. 2019, 44, 565–600.
16. Kantorovich, L.V.; Rubinshtein, S.G. On a space of totally additive functions. Vestn. St. Petersburg Univ. Math. 1958, 13, 52–59.
17. Artzner, P.; Delbaen, F.; Eber, J.-M.; Heath, D. Coherent measures of risk. Math. Financ. 1999, 9, 203–228.
18. Bäuerle, N.; Müller, A. Stochastic orders and risk measures: Consistency and bounds. Insur. Math. Econ. 2006, 38, 132–148.
19. Kallenberg, O. Foundations of Modern Probability; Springer: Berlin/Heidelberg, Germany, 1997; Volume 2.
20. Mao, T.; Cai, J. Risk measures based on behavioural economics theory. Financ. Stoch. 2018, 22, 367–393.
21. Shafieezadeh-Abadeh, S.; Kuhn, D.; Esfahani, P.M. Regularization via mass transportation. J. Mach. Learn. Res. 2019, 20, 1–68.
22. Wu, Q.; Li, J.Y.M.; Mao, T. On generalization and regularization via Wasserstein distributionally robust optimization. arXiv 2022, arXiv:2212.05716.
23. Fournier, N.; Guillin, A. On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Relat. Fields 2015, 162, 707–738.
24. Markowitz, H.M. Portfolio selection. J. Financ. 1952, 7, 77–91.
25. Plyakha, Y.; Uppal, R.; Vilkov, G. Why Does an Equal-Weighted Portfolio Outperform Value-and Price-Weighted Portfolios? 2012. Available online: https://ssrn.com/abstract=2724535 (accessed on 16 September 2017).
26. Luenberger, D.G. Optimization by Vector Space Methods; John Wiley & Sons: Hoboken, NJ, USA, 1997.
27. Bertsekas, D.; Shreve, S.E. Stochastic Optimal Control: The Discrete-Time Case; Athena Scientific: Belmont, MA, USA, 1996; Volume 5.

Figure 1. MSE with respect to

λ

for different values of q.

Figure 2. The red, blue, and green curves represent the MSE under the SW-DD-DRO, the least-squares methods, and the ridge regression, respectively.

Figure 3. Cumulative return curves under different strategies.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
