Abstract
Fortet-Mourier (FM) metrics are probability metrics that have been widely adopted in the quantitative stability analysis of stochastic programming problems. In this study, we contribute several types of convergence assertions between a probability distribution and its empirical approximation when the deviation is measured by FM metrics, and we consider their applications in stochastic optimization. We first establish a quantitative relation between FM metrics and Wasserstein metrics. We then derive a non-asymptotic moment estimate, an asymptotic convergence result, and a non-asymptotic concentration estimate for FM metrics, which supplement the existing results. Finally, we apply the derived results to four kinds of stochastic optimization problems, either extending existing results to more general cases or providing alternative avenues. These discussions demonstrate both the motivation and the significance of our study.
Keywords:
Fortet-Mourier metric; discrete approximation; stochastic optimization; stochastic dominance; distributionally robust optimization
MSC:
90C15; 91B70
1. Introduction
The estimation of the distance between a distribution and its empirical approximation obtained from independent and identically distributed (iid) samples is an important subject in probability theory, mathematical statistics, and information theory. It has vast applications in many fields, such as quantization, optimal matching, density estimation, and clustering (see [1] and the references therein for more details). To quantify the distance between two probability distributions, several rules have been adopted to generate probability metrics, such as the commonly used ζ-structure metrics. By selecting different generators of the ζ-structure metric, we obtain a number of well-known probability metrics, such as the Wasserstein metric (in this study, the Wasserstein metric always means the 1-Wasserstein metric, which is also called the Kantorovich-Rubinstein metric or Kantorovich metric), the FM metric, and the total variation metric.
Among probability metrics with ζ-structure, the Wasserstein metric is the most popular one; it has been widely applied in statistics, probability, and machine learning [2]. It originates from the optimal transportation problem and can thus be interpreted as an optimal mass transportation plan. Beyond its practical meaning in transportation, the Wasserstein metric has several good properties. For example, convergence in the Wasserstein metric is equivalent to weak convergence plus convergence of the first order absolute moment [2].
Some of the literature concentrates on the convergence analysis, under Wasserstein metrics, between a distribution and its empirical approximation. From now on, we refer to this as the data-driven Wasserstein metric for simplicity, and we use the same convention for other probability metrics. These convergence analyses can be divided mainly into two parts: moment estimates, which aim at providing the rate of convergence for the expectation of the Wasserstein distance between a distribution and its empirical approximation, and concentration estimates, which focus on the violation probability under a given tolerance. As for moment estimates, some earlier results can be found in [3,4], which provide a relatively loose convergence rate. More recently, Weed and Bach [5] focused on the compact support set case and obtained a sharp convergence rate. Dereich et al. [6] conducted an almost optimal convergence analysis; however, they imposed some restrictions on the range of parameters. An interesting result was given in [1], which extends some results in [6] from a limited range of parameters to the general case. As for concentration estimates, only a few results are available. The corresponding results can be found in [7,8] under some strong assumptions; moreover, they require the violation parameter to be large enough. In [9], Zhao and Guan investigated the case with a discrete and bounded support set. In particular, an elaborate result on the rate of convergence of the data-driven Wasserstein distance was presented in [1].
As pointed out in [2] (p. 110), the Wasserstein metric is a rather strong probability metric. Intuitively, harsh conditions are needed to establish strong Wasserstein-type upper bound estimates. Indeed, we know from the definition of the Wasserstein metric that its generator is the set of Lipschitz continuous functions with Lipschitz modulus one.
Compared with the Wasserstein metric, FM metrics are more general; their generator is a class of locally Lipschitz continuous functions. Therefore, it is easier to obtain upper bound estimates by adopting FM metrics. In view of this, FM metrics have been widely used in the quantitative stability analysis of stochastic programming problems when the underlying probability distribution is perturbed or approximated; see, for example, [10,11,12,13]. Moreover, FM metrics have a close relationship with Wasserstein metrics through the dual representation (the Kantorovich-Rubinstein theorem). In particular, the pth order FM metric reduces to the Wasserstein metric when $p = 1$. From this point of view, the FM metric can be viewed as an extension of the Wasserstein metric. Nevertheless, there are few results concerning the convergence analysis of data-driven FM metrics. To the best of our knowledge, only Strugarek [14] examined the asymptotic convergence under the FM distance.
In view of the above situations, in this article we study the data-driven FM metric. The main contributions of this study can be summarized as follows:
- We establish the quantitative connection between the Wasserstein metric and the FM metric. Based on this connection, we investigate the non-asymptotic moment estimate, asymptotic convergence, and non-asymptotic concentration estimate for data-driven FM metrics.
- We provide an alternative avenue for the convergence analysis of discrete approximations for two-stage stochastic programming problems. Different from the convergence or exponential rate of convergence analysis in [15,16], where some complex conditions are required, our approach is straightforward and brief.
- We reestablish the quantitative stability results for stochastic optimization problems with stochastic dominance constraints through FM metrics. Compared with that in [17], our conditions are weaker and different probability metrics are adopted. More importantly, we can apply the convergence conclusion to examine the discrete approximation method which is crucial for numerical solution.
- We consider data-driven distributionally robust optimization (DRO) problems with FM ball, which extends the results in [18] from the ambiguity set constructed by Wasserstein ball to the FM ball case. We prove the finite sample guarantee and asymptotic consistency, which lay the theoretical foundation for the data-driven approach for the DRO model.
- We analyze the discrete approximation of the DRO problem whose ambiguity set is constructed with the general moment information. Compared with the existing work [19] under the bounded support set, we weaken their conditions and extend their results to the case with an unbounded support set.
The remainder of this study is organized as follows. In Section 2, we give some prerequisites for further discussion. In Section 3, we discuss different kinds of convergence results for data-driven FM metrics. We consider four applications to verify our convergence results and to further demonstrate the motivation and significance of this study in Section 4. Finally, we have some concluding remarks in Section 5.
2. Prerequisites
Let $\xi$ be a random vector defined on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$ with support set $\Xi \subset \mathbb{R}^s$. Then, its induced probability distribution (sometimes called a probability measure) on $\Xi$ is $P := \mathbb{P} \circ \xi^{-1}$. We use $\mathscr{P}(\Xi)$ to denote the set of all probability distributions on $\Xi$. The set of probability distributions having finite pth order absolute moments is denoted by $\mathscr{P}_p(\Xi)$.
Probability metrics measure the distance between two probability distributions. Generally, they do not satisfy all the axioms of a distance in a metric space. A commonly used class of probability metrics is the class with ζ-structure, whose definition is as follows.
Definition 1.
Let $\mathscr{F}$ be a set of measurable functions from Ξ to $\mathbb{R}$. Then, for any $P, Q \in \mathscr{P}(\Xi)$,
$$ d_{\mathscr{F}}(P, Q) := \sup_{f \in \mathscr{F}} \left| \int_{\Xi} f(\xi) \, P(d\xi) - \int_{\Xi} f(\xi) \, Q(d\xi) \right| $$
is called the ζ-structure probability metric induced by $\mathscr{F}$.
The set $\mathscr{F}$ in Definition 1 completely determines the resulting ζ-structure probability metric, so it is called the generator of the ζ-structure probability metric. FM metrics and Wasserstein metrics can be deduced from the ζ-structure probability metric by choosing specific generators. In particular, we have the following definitions.
Definition 2.
Let $P, Q \in \mathscr{P}_p(\Xi)$ for some $p \ge 1$ and let $\mathscr{F}_p(\Xi)$ denote the set of locally Lipschitz continuous functions given by
$$ \mathscr{F}_p(\Xi) := \left\{ f : \Xi \to \mathbb{R} : |f(\xi) - f(\tilde{\xi})| \le \max\{1, \|\xi\|^{p-1}, \|\tilde{\xi}\|^{p-1}\} \, \|\xi - \tilde{\xi}\| \ \text{for all } \xi, \tilde{\xi} \in \Xi \right\}. $$
Then, the pth order FM metric between P and Q is
$$ \zeta_p(P, Q) := \sup_{f \in \mathscr{F}_p(\Xi)} \left| \int_{\Xi} f(\xi) \, P(d\xi) - \int_{\Xi} f(\xi) \, Q(d\xi) \right|. $$
Definition 3.
Let $P, Q \in \mathscr{P}_1(\Xi)$ and
$$ \mathscr{F}_W(\Xi) := \left\{ f : \Xi \to \mathbb{R} : |f(\xi) - f(\tilde{\xi})| \le \|\xi - \tilde{\xi}\| \ \text{for all } \xi, \tilde{\xi} \in \Xi \right\}. $$
Then, the Wasserstein metric between P and Q is
$$ W(P, Q) := \sup_{f \in \mathscr{F}_W(\Xi)} \left| \int_{\Xi} f(\xi) \, P(d\xi) - \int_{\Xi} f(\xi) \, Q(d\xi) \right|. $$
It is easy to see from the above definitions that $\mathscr{F}_W(\Xi) \subset \mathscr{F}_p(\Xi)$ and hence $W(P, Q) \le \zeta_p(P, Q)$ for any $P, Q \in \mathscr{P}_p(\Xi)$. Moreover, if $f \in \mathscr{F}_p(\Xi)$, then $-f \in \mathscr{F}_p(\Xi)$ as well, and the same holds for $\mathscr{F}_W(\Xi)$. Therefore, we can ignore the absolute value operator in Definitions 2 and 3 when we take the supremum. Moreover, both FM metrics and Wasserstein metrics have a close relationship with weak convergence. One can refer to [10] (p. 490) and [2] (Theorem 6.9) for more details.
The Wasserstein metric has an alternative definition in terms of couplings of the marginal distributions. Specifically, the Wasserstein metric between P and Q is defined as (see [2], Definition 6.1)
$$ W(P, Q) = \inf_{\pi \in \Pi(P, Q)} \int_{\Xi \times \Xi} \|\xi - \tilde{\xi}\| \, \pi(d\xi, d\tilde{\xi}), \qquad (1) $$
where $\Pi(P, Q)$ is the collection of all joint distributions of $\xi$ and $\tilde{\xi}$ with marginal distributions P and Q, respectively. It is known from the Kantorovich-Rubinstein theorem [20] that Definition 3 is the dual representation of (1).
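As a quick numerical illustration of the equivalence between the coupling form (1) and the dual form in Definition 3, the one-dimensional case can be checked directly: the optimal coupling of two equally weighted samples simply matches order statistics. The following Python sketch is purely illustrative; the Gaussian samples and sample sizes are our own choices and not part of the original analysis.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1000)   # samples representing P
y = rng.normal(0.5, 1.2, 1000)   # samples representing Q

# Coupling form (1) in one dimension: the optimal transport plan matches
# sorted samples (quantile coupling), so W(P_N, Q_N) = mean_i |x_(i) - y_(i)|
# when both empirical measures carry N equally weighted atoms.
w_coupling = np.mean(np.abs(np.sort(x) - np.sort(y)))

# SciPy evaluates the same 1-Wasserstein distance from the empirical CDFs,
# which corresponds to the dual (Kantorovich-Rubinstein) representation.
w_dual = wasserstein_distance(x, y)

print(f"quantile coupling: {w_coupling:.4f}   scipy: {w_dual:.4f}")
```

Both values agree up to floating point error, which is a small sanity check of the duality rather than a proof.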
We have the following extension theorem for Lipschitz functions in Hilbert space (see [21], Theorems 4 and 5).
Lemma 1.
Let X and Y be Hilbert spaces, let $S \subset X$, and let $f : S \to Y$ be a Lipschitz function with Lipschitz modulus L. Then, there exists a Lipschitz function $\tilde{f} : X \to Y$ such that $\tilde{f}(x) = f(x)$ for any $x \in S$, and L is also the Lipschitz modulus of $\tilde{f}$.
Lemma 1 is important for the following discussion. In [1], the authors assumed that the support set is the whole space $\mathbb{R}^s$. They obtained the non-asymptotic moment estimate [1] (Theorem 1) and concentration estimate [1] (Theorem 2) for the Wasserstein metric. For any $P, Q \in \mathscr{P}_1(\Xi)$, we can view them as probability distributions $\tilde{P}, \tilde{Q}$ on $\mathbb{R}^s$ through the following correspondence:
$$ \tilde{P}(A) := P(A \cap \Xi) $$
for all Borel sets $A \subset \mathbb{R}^s$. That is, we set the probability of the region $\mathbb{R}^s \setminus \Xi$ to be zero. Generally, we have $W_{\Xi}(P, Q) = W_{\mathbb{R}^s}(\tilde{P}, \tilde{Q})$. The details are as follows:
$$ W_{\Xi}(P, Q) = \sup_{f \in \mathscr{F}_W(\Xi)} \left| \int_{\Xi} f \, dP - \int_{\Xi} f \, dQ \right| = \sup_{f \in \mathscr{F}_W(\Xi)} \left| \int_{\mathbb{R}^s} \tilde{f} \, d\tilde{P} - \int_{\mathbb{R}^s} \tilde{f} \, d\tilde{Q} \right|, $$
where $\mathscr{F}_W(\Xi)$ denotes the collection of all Lipschitz continuous functions with Lipschitz modulus 1 on $\Xi$, and $\tilde{f}$ is the extension of $f$ to $\mathbb{R}^s$ according to Lemma 1. Obviously, $\tilde{f} \in \mathscr{F}_W(\mathbb{R}^s)$, which is the set of Lipschitz continuous functions with Lipschitz modulus 1 over $\mathbb{R}^s$. Thus, we have the estimation
$$ \sup_{f \in \mathscr{F}_W(\Xi)} \left| \int_{\mathbb{R}^s} \tilde{f} \, d\tilde{P} - \int_{\mathbb{R}^s} \tilde{f} \, d\tilde{Q} \right| \le \sup_{g \in \mathscr{F}_W(\mathbb{R}^s)} \left| \int_{\mathbb{R}^s} g \, d\tilde{P} - \int_{\mathbb{R}^s} g \, d\tilde{Q} \right| = W_{\mathbb{R}^s}(\tilde{P}, \tilde{Q}). $$
That is, $W_{\Xi}(P, Q) \le W_{\mathbb{R}^s}(\tilde{P}, \tilde{Q})$.
On the other hand, for any $g \in \mathscr{F}_W(\mathbb{R}^s)$, its restriction to $\Xi$ is Lipschitz continuous with Lipschitz modulus 1. Thus,
$$ W_{\mathbb{R}^s}(\tilde{P}, \tilde{Q}) = \sup_{g \in \mathscr{F}_W(\mathbb{R}^s)} \left| \int_{\Xi} g \, dP - \int_{\Xi} g \, dQ \right| \le \sup_{f \in \mathscr{F}_W(\Xi)} \left| \int_{\Xi} f \, dP - \int_{\Xi} f \, dQ \right| = W_{\Xi}(P, Q). $$
Finally, we have $W_{\Xi}(P, Q) = W_{\mathbb{R}^s}(\tilde{P}, \tilde{Q})$. Therefore, although all the convergence results in [1] were derived under $\Xi = \mathbb{R}^s$, we can extend them to any support set $\Xi \subset \mathbb{R}^s$.
Lemma 2
([1], Theorem 1). Let for some . Then, there exists a constant C depending only on s (the dimension of Ξ) and p such that, for all ,
where log is the natural logarithm.
Lemma 2 cannot cover all the pairs , for example, or . However, we can always reset p such that Lemma 2 holds by the following procedures. If or 2 and , P must belong to for any . If and , we can select such that . If and , we can choose any . Then, we let and have that Lemma 2 holds with or 2 and or and . Therefore, Lemma 2 is applicable for any through carefully prepared p. In the following discussion, without loss of generality, we always assume that Lemma 2 holds for any pair . Further, we can, according to Lemma 2, obtain the following uniform upper bound:
for any and .
Assumption 1.
Let satisfy
for some constant b.
Lemma 3.
Suppose that Assumption 1 holds for some . Then, we have for that
for all , where α and β are two positive constants depending only on P, b, and s.
Proof.
Based on Assumption 1, we know that Condition (1) in [1] holds. Then, due to , Lemma 3 directly follows from [1] (Theorem 2). □
For a more comprehensive version of Lemma 3, one can refer to [1] (Theorem 2). Here, we focus on the case because it is more interesting for us to investigate a smaller violation rather than a bigger one. A simplified version can also be found in [18] (Theorem 3.4) where the assumption is imposed.
To simplify the following discussion, we derive a uniform upper bound for the right-hand side in Lemma 3. Note the fact that for any . We have
Letting gives us that
When , we have
Moreover, for ,
Therefore, we can obtain a loose but uniform upper bound estimation
for any and .
3. Convergence Analyses of Data-Driven FM Metrics
In this section, we will investigate different kinds of convergence for data-driven FM metrics. To this end, let $\xi^1, \xi^2, \ldots, \xi^N$ be N iid samples generated according to P. These samples are viewed here as the random sample $\xi^1(\omega), \ldots, \xi^N(\omega)$, $\omega \in \Omega$, on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Then, we obtain the empirical distribution defined as
$$ P_N(\cdot) := \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{\xi^i}(\cdot), $$
where $\mathbb{1}_{\xi^i}(\cdot)$ is the indicator function, that is, $\mathbb{1}_{\xi^i}(A) = 1$ for $\xi^i \in A$ and $\mathbb{1}_{\xi^i}(A) = 0$ otherwise.
We first give the following vital lemma.
Lemma 4.
Let for some . Then,
for any R satisfying and . Here, $\mathbf{0}$ is the origin of $\mathbb{R}^s$ and $B(\mathbf{0}, R)$ is the closed ball centered at $\mathbf{0}$ with radius R.
The proof of Lemma 4 can be found in Appendix A.
If we define
then, we can obtain a tighter upper bound estimation of , that is
The first convergence result is about the non-asymptotic moment estimate. It provides an upper bound for the expectation of the FM distance between P and its empirical approximation distribution.
Theorem 1
(Non-asymptotic moment estimates for FM metrics). Suppose that for some . Then, for sufficiently large N, we have
where is a sequence of positive numbers satisfying as .
The proof of Theorem 1 can be found in Appendix A.
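Since the first order FM metric coincides with the Wasserstein metric, the p = 1 case of the moment estimate can be examined by simulation. The sketch below estimates E[W(P, P_N)] for a one-dimensional standard normal P (an illustrative choice, not taken from the text) via the CDF representation of the one-dimensional Wasserstein distance, and reports sqrt(N) times the estimate, which should stay roughly constant if the rate is of order N^{-1/2} in this setting.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
grid = np.linspace(-6.0, 6.0, 4001)   # integration grid for the CDF distance
dx = grid[1] - grid[0]

def w1_empirical_vs_true(samples):
    """W(P, P_N) = integral of |F(t) - F_N(t)| dt in one dimension (Riemann sum)."""
    F_N = np.searchsorted(np.sort(samples), grid, side="right") / len(samples)
    return np.sum(np.abs(F_N - norm.cdf(grid))) * dx

reps = 200
for N in [50, 200, 800, 3200]:
    mean_w = np.mean([w1_empirical_vs_true(rng.normal(size=N)) for _ in range(reps)])
    print(f"N={N:5d}  E[W(P, P_N)] ~ {mean_w:.4f}  sqrt(N)*E ~ {np.sqrt(N) * mean_w:.3f}")
```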
Theorem 1 establishes the convergence in the sense of expectation. However, it fails to tell us the sample-wise convergence. The following theorem states the asymptotic convergence under FM metrics for almost every sample.
Theorem 2
(Asymptotic convergence of FM metrics). Suppose that . Then,
with probability 1, as . Here is defined at the beginning of this section.
The proof of Theorem 2 can be found in Appendix A.
Theorems 1 and 2 establish convergence. As we know, the rate of convergence is quite important for guiding the solution process in practice. The following theorem gives an estimate of the convergence rate under certain assumptions.
Theorem 3
(Non-asymptotic concentration estimates for FM metrics). Suppose that and Assumption 1 holds with . Then, for any , we have
for some constants depending on P, b, and s, and depending on P, b, s, and ϵ.
The proof of Theorem 3 can be found in Appendix A.
Remark 1.
Here we assume that . The main reason is that we want to give a relatively simple proof. Fortunately, it is more interesting for us to consider a small violation rather than a large one.
Under certain assumptions, we can obtain an estimation for . For example, if for , here σ is a positive constant, we have . Then, according to the properties of convex quadratic functions, the rate function has the lower bound
Thus, we can further obtain a concrete estimate for . For more details in this aspect, one can refer to [16].
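For the same reason (the first order FM metric equals the Wasserstein metric), the exponential decay asserted by Theorem 3 can be probed empirically in the p = 1 case by estimating the violation probability P(W(P, P_N) > ϵ) from repeated sample draws. The standard normal distribution, the tolerance 0.1, and the sample sizes below are hypothetical choices for illustration only.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
grid = np.linspace(-6.0, 6.0, 2001)
dx = grid[1] - grid[0]

def w1_empirical_vs_true(samples):
    """One-dimensional W(P, P_N) via the CDF-difference integral."""
    F_N = np.searchsorted(np.sort(samples), grid, side="right") / len(samples)
    return np.sum(np.abs(F_N - norm.cdf(grid))) * dx

eps, reps = 0.1, 2000
for N in [50, 100, 200, 400]:
    hits = sum(w1_empirical_vs_true(rng.normal(size=N)) > eps for _ in range(reps))
    print(f"N={N:4d}  estimated P(W(P, P_N) > {eps}) = {hits / reps:.3f}")
```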
4. Applications
In this section, we consider four applications of the convergence conclusions about FM metrics obtained in Section 3. Specifically, we study the discrete approximation of two-stage stochastic programming problems, stochastic optimization problems with dominance constraints, data-driven distributionally robust optimization problems with an FM ball, and the discrete approximation of distributionally robust optimization problems with a general moment ambiguity set. These applications not only further illustrate the motivation of this study but also provide alternative avenues or extensions of current results.
4.1. Two-Stage Stochastic Linear Programming Problems
Discrete approximation is an important issue in stochastic optimization, which is crucial for its numerical solution. In this subsection, by employing the convergence results in Section 3, we give an alternative avenue for analyzing the discrete approximation of two-stage stochastic programming problems.
Consider the two-stage stochastic programming problem:
where ; is a polyhedron; the probability measure P is supported on , which is a polyhedron; and
Here , , , . and depend affine linearly on .
Denote , and let and denote the optimal value and optimal solution set of Problem (4). Moreover, we use to denote the set . Denote .
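Before turning to the stability machinery, the empirical approximation of Problem (4) can be made concrete on a toy instance: the sketch below solves the deterministic equivalent LP of the sample average approximation for a two-stage problem with simple (newsvendor-type) recourse. All cost coefficients, the gamma-distributed demand, and the bound on x are hypothetical and only serve to illustrate how the empirical problem is assembled and solved.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Toy two-stage problem with simple recourse (all numbers are hypothetical):
#   min_x  c*x + E[Q(x, d)],  Q(x, d) = min { q_s*y1 + q_h*y2 : y1 - y2 = d - x, y >= 0 },
# where y1 measures unmet demand and y2 measures surplus.
c, q_s, q_h, x_max = 1.0, 3.0, 0.5, 30.0

def solve_saa(demands):
    """Solve the deterministic equivalent LP of the empirical (SAA) approximation."""
    N = len(demands)
    # Decision vector: [x, y1_1, y2_1, ..., y1_N, y2_N]
    cost = np.concatenate(([c], np.tile([q_s / N, q_h / N], N)))
    A_eq = np.zeros((N, 1 + 2 * N))
    A_eq[:, 0] = 1.0                      # x ...
    for i in range(N):
        A_eq[i, 1 + 2 * i] = 1.0          # ... + y1_i
        A_eq[i, 2 + 2 * i] = -1.0         # ... - y2_i
    b_eq = demands                        # = d_i   (equivalent to y1_i - y2_i = d_i - x)
    bounds = [(0.0, x_max)] + [(0.0, None)] * (2 * N)
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.x[0], res.fun

for N in [10, 100, 1000]:
    d = rng.gamma(shape=4.0, scale=2.0, size=N)   # iid samples of the random demand
    x_N, v_N = solve_saa(d)
    print(f"N={N:5d}   x_N = {x_N:6.3f}   v_N = {v_N:7.3f}")
```

As N grows, the empirical optimal value and solution stabilize, which is the behavior quantified below through the FM metric.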
To quantify the upper semicontinuity or the deviation distance of the optimal solution set, we define the growth function as
Its inverse function is given by
Thus, we can define the associated conditioning function as
It is easy to verify that is nondecreasing and is increasing. Both and are lower semicontinuous on and vanish at 0. One can refer to [10] for more details.
Moreover, we have as . We illustrate this fact by contradiction. Suppose that there exists a sequence satisfying as , such that . Denote . The lower semicontinuity of means that is closed. Thus, . Due to the nondecreasing property of and as , must be bounded. Without loss of generality, we assume that as , where is a positive constant. According to the lower semicontinuity of , we have
which leads to a contradiction.
According to the definition of , we can immediately deduce that as .
To introduce the following discussion, we make some standard assumptions (see [11]).
Assumption 2.
Let the following assertions hold:
- (1)
- For each pair , and ;
- (2)
- .
Under the above assumptions, we have the following quantitative stability results about the optimal value and optimal solution set of Problem (4).
Lemma 5
([11], Theorem 3.3). Suppose that Assumption 2 holds and is nonempty and bounded. Then, there exist constants and such that
when and , where is the closed unit ball in .
Based on Lemma 5 and the convergence results in Section 3, we have the following convergence conclusions between the two-stage stochastic programming problem (4) and its empirical approximation.
Theorem 4.
Suppose that: (i) Assumption 2 holds; (ii) is nonempty and bounded. Then,
with probability 1, as .
Proof.
For the first assertion, we have from Theorem 2 that with probability 1. This means that: for the defined in Lemma 5, there exists a positive number such that for any , for almost every . Then, by Lemma 5, we have that
hold almost surely as , here L is defined in Lemma 5. According to Theorem 2 and the property of , we have
and thus
with probability 1, as . These facts imply that
with probability 1, as . □
Theorem 5.
Suppose that: (i) Assumption 1 holds with ; (ii) Assumption 2 holds; (iii) is nonempty and bounded. Then, for any , there exist depending on P and s, and depending on P, s and ϵ, such that
Proof.
If for L defined in Lemma 5, we have from Theorem 3 that
for any , where depends on P and s, and depends on P, s and . Here we use to stress its dependence on .
As shown in Theorem 4, a sufficient condition for
is , where is defined in Lemma 5. Without loss of generality, we assume that . Analogously, we have that
Then, we obtain
Similarly, we have
where the equality follows from the strictly increasing property of . By the same procedure, we can derive the second assertion. □
Remark 2.
The convergence analysis about two-stage stochastic programming problems can also be found in [11] (Section 4), where the covering and bracketing numbers are introduced. However, it seems difficult to verify the growth rate of the covering or bracketing number in the general case (see [11], Proposition 4.2). Our convergence results are more straightforward. Compared with [11] (Proposition 4.2), instead of the growth rate of the covering or bracketing number, we use the light-tailed distribution assumption. This assumption is commonly used in the literature, see for example [1,18].
4.2. Stochastic Optimization Problems with Stochastic Dominance Constraints
In this part, we consider stochastic optimization problems with stochastic dominance constraints. Stochastic dominance is an important ingredient in economics, decision theory, statistics, and, nowadays, modern optimization. It has been widely studied in the last two decades; see, for example, [17,22,23,24,25,26] and the references therein. Different from classical stochastic optimization models, which cope with random variables by taking expectations, stochastic dominance can better reflect the relationship between two random variables. Expected utility theory can also provide a comparison of two random variables; however, it is hardly possible to explicitly express the utility functions of decision makers [27]. From this point of view, stochastic dominance is more convenient in practice. In fact, stochastic dominance has a close relationship with expected utility theory. Generally, a random variable dominates another random variable in the kth order, denoted by , if for every nondecreasing function from a certain set of utility functions [17]. In particular, first order dominance holds if and only if for every nondecreasing utility function, and second order dominance holds if and only if for every nondecreasing and concave utility function [27].
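For the second order relation, a standard equivalent characterization is the shortfall criterion: X dominates Y in second order if and only if $\mathbb{E}[(\eta - X)_+] \le \mathbb{E}[(\eta - Y)_+]$ for all real $\eta$. This suggests a simple empirical test, sketched below with purely illustrative Gaussian samples (the random variables, sample sizes, and grid are our own assumptions).

```python
import numpy as np

def ssd_violation(x_samples, y_samples, n_grid=200):
    """Empirical check of second order dominance X >=_(2) Y via the shortfall
    criterion E[(eta - X)_+] <= E[(eta - Y)_+], evaluated on a finite grid of eta."""
    lo = min(x_samples.min(), y_samples.min())
    hi = max(x_samples.max(), y_samples.max())
    eta = np.linspace(lo, hi, n_grid)
    short_x = np.mean(np.maximum(eta[:, None] - x_samples[None, :], 0.0), axis=1)
    short_y = np.mean(np.maximum(eta[:, None] - y_samples[None, :], 0.0), axis=1)
    return np.max(short_x - short_y)       # <= 0 means the empirical relation holds

rng = np.random.default_rng(1)
x = rng.normal(1.0, 1.0, 5000)             # samples of a candidate G(x, xi)
y = rng.normal(0.0, 1.0, 5000)             # samples of the benchmark Y
print("largest empirical shortfall gap:", ssd_violation(x, y))
```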
The convex stochastic optimization model with the kth order stochastic dominance constraint can be described as (see [22,27]):
where D is a nonempty closed and convex subset of ; is a convex function; Y is a random variable supported on , which can be treated as the random benchmark; and . Moreover, we assume that G is locally Lipschitz continuous with respect to in the following sense:
for any , where and . G satisfies the linear growth condition:
for every and , where B is any bounded subset of , and depends on B.
Actually, we can impose a more general growth condition on G, for example,
for and the following discussion still holds. Here a linear growth condition simplifies the demonstration. The above requirements for can be met easily. For instance, the objective function of the two-stage stochastic programming problem with fixed recourse satisfies the above conditions (see [11], Proposition 3.2).
Due to its attractive modeling technique, the quantitative stability analysis of stochastic optimization models with dominance constraints has recently been investigated in several works. Dentcheva et al. first studied in [22] stochastic optimization problems with first order stochastic dominance constraints, and this was extended by Dentcheva and Römisch in [17] to problems with general kth () order stochastic dominance constraints. In [24], Chen and Jiang weakened the assumptions of the quantitative stability analysis in [17] by considering the case where is generated by the two-stage fully random stochastic programming problem.
To establish the convergence results, we first investigate the quantitative stability of model (6). By convention, we consider its relaxed problem (see also [17,24]):
where is a compact interval: .
In view of our focus in this study, we reestablish the quantitative stability conclusions of Problem (9) in what follows. We use and to denote the probability distributions of and Y, respectively. We denote the feasible solution set of Problem (9) by
and its perturbed feasible solution set under by
First we examine the quantitative stability of the feasible solution set.
Proposition 1.
Proof.
We know from the proof of [17] (Proposition 3.2) that
whenever the right-hand side is less than or equal to some positive scalar .
In view of this, we estimate
and
respectively.
Note the fact that (see [17], (3.9))
for some positive constant and any . Then, we have
This means that
where .
Similarly, we have
which means that
Taking , we have
whenever . □
The quantitative stability result in Proposition 1 differs from the corresponding results in [17] in two respects. One is the locally Lipschitz continuity of G; the other is the probability metric we choose. In [17], the authors assumed that G is Lipschitz continuous, and they adopted Rachev metrics and the th order Wasserstein metric. As far as we know, no data-driven results exist under Rachev metrics.
Let and denote the optimal value and optimal solution set of Problem (9), respectively. Similar to that in Section 4.1, we can define the growth function of Problem (9) as
Then, its inverse function and the associated conditioning function are
and
Proposition 2.
Under the conditions of Proposition 1, there exist constants and such that
whenever .
Proof.
Since f is convex, f is locally Lipschitz continuous. Since D is compact, f is in fact Lipschitz continuous over D. Then, the assertions follow from a similar proof as that for [17] (Theorem 3.3). □
Now we consider the iid samples of and Y. For convenience, we assume that the samples drawn from and Y have the same sample size N. The N iid samples of are and the N iid samples of Y are . Then, we have the following empirical distributions:
and
With these preparations, we can establish the following convergence results.
Theorem 6.
Let D be compact, , , and G satisfy the locally Lipschitz continuity condition (7) and the linear growth condition (8).
- (i)
- We have and with probability 1, as .
- (ii)
- If, moreover, and for some and , then, for , there exist positive scalars depending on , b and s; depending on , b, s and ϵ; depending on and c; and depending on , c and ϵ, such that and where L and are defined in Propositions 1 and 2, respectively.
Proof.
Part (i) can be similarly proved as that in Theorem 4 by utilizing Theorem 2 and Proposition 1.
For Part (ii), we have
where the last inequality follows from Theorem 3; depends on , b, and s; depends on , b, s, and ; depends on and c; and depends on , c, and .
The second and third probability inequalities can be analogously verified and thus we omit the proof here. □
4.3. Data-Driven DRO Problems with FM Ball
A general stochastic optimization model can be formulated as
where , , and is the support set of . The sample average approximation (SAA) method is usually used to solve Problem (10) numerically. The SAA method tacitly assumes that we can generate any number of samples based on P. To approximate Problem (10) well, a large sample size is needed [28]. In practice, however, the true probability distribution P is not known exactly, and acquiring additional samples is expensive, so we cannot generate a sufficiently large number of samples to make the SAA method well-founded. Nevertheless, it is possible to obtain a limited number of samples or scenarios, such as historical data. Under these settings, the data-driven DRO model has been proposed [18,29,30]. The natural idea is to use the partial information to construct an ambiguity set such that the true probability distribution is included in it. As pointed out in [18], under certain conditions, this approach offers powerful out-of-sample performance guarantees.
For further discussion, we denote the limited finite samples by and the corresponding empirical distribution by . Since the number of samples N is limited, we cannot adopt the classical SAA method, which requires that the sample size tends to infinity. However, we can use the limited information to construct a set of probability measures which contains the true one, that is, the ambiguity set. In this subsection, we consider the following FM ball-based ambiguity set:
where , the positive constant r stands for the confidence parameter determined by the decision maker. Then, we have the data-driven DRO problem with the FM ball-based ambiguity set of Problem (10) as follows:
It is common to see the Wasserstein ball used to build the ambiguity set; see, for example, [18]. To further explain the reasonableness of and motivation for adopting the FM metric, we make the following comments.
Remark 3.
As we know, a key issue for DRO problems is how to build the ambiguity set. Different kinds of ambiguity sets have been proposed, such as moment information [31], ζ-ball [32], and so on. Of course, the FM metric, as a specific case of the ζ-structure probability metric, can be employed to construct the ambiguity set.
More importantly, the decision maker can utilize the limited empirical distribution to obtain an approximate optimal value, say . From prior experience, the decision maker usually has some confidence, measured by a deviation constant , that the true optimal value, denoted by , lies in the interval . Frequently, is locally Lipschitz continuous in the following sense:
for some positive constant L. A typical example is that is the objective function of the two-stage stochastic programming problem, here (see [11], Proposition 3.2). Then, we have the quantitative relationship:
Therefore, it is reasonable for the decision maker to consider the ambiguity set
Moreover, since with probability 1, as (see Theorem 2), P must be included in for suitable N and r.
Finally, we have $W(P, Q) \le \zeta_p(P, Q)$, with equality when $p = 1$. Thus,
This tells us that the ambiguity set constructed by the FM ball is tighter than that constructed with the Wasserstein ball.
All these arguments motivate us to consider the data-driven DRO problem with the FM ball-based ambiguity set.
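Although the inner maximization in Problem (11) is infinite dimensional in general, it can be illustrated numerically by restricting the candidate distributions to a finite grid of support points. The sketch below does this for p = 1, where the FM ball coincides with the Wasserstein ball and the ball constraint becomes a transportation-cost constraint; the samples, the grid, and the piecewise linear loss are hypothetical choices, not part of the original model.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)

# Inner problem of (11) for a fixed decision x:  sup { E_Q[f(x, xi)] : zeta_1(P_N, Q) <= r },
# with Q restricted to a finite grid of support points (a discretized illustration only).
xi_hat = rng.normal(0.0, 1.0, 20)                       # the N observed samples
p_hat = np.full(xi_hat.size, 1.0 / xi_hat.size)         # empirical weights
grid = np.concatenate([np.linspace(-5.0, 5.0, 101), xi_hat])   # candidate support of Q
f = lambda z: np.maximum(z, 0.3 * z)                    # hypothetical piecewise linear loss

def worst_case_expectation(radius):
    N, M = xi_hat.size, grid.size
    cost = np.abs(xi_hat[:, None] - grid[None, :])      # transport cost |xi_i - z_j|
    # Variables: a transport plan pi_ij >= 0; its column marginal is the worst-case Q.
    c_obj = -np.tile(f(grid), N)                        # maximize sum_ij pi_ij * f(z_j)
    A_eq = np.zeros((N, N * M))
    for i in range(N):
        A_eq[i, i * M:(i + 1) * M] = 1.0                # sum_j pi_ij = p_hat_i
    A_ub = cost.ravel()[None, :]                        # sum_ij cost_ij * pi_ij <= radius
    res = linprog(c_obj, A_ub=A_ub, b_ub=[radius], A_eq=A_eq, b_eq=p_hat,
                  bounds=[(0.0, None)] * (N * M), method="highs")
    return -res.fun

for r in [0.0, 0.1, 0.5, 1.0]:
    print(f"radius {r:4.1f}:  worst-case expectation = {worst_case_expectation(r):6.3f}")
```

At radius zero the program returns the empirical expectation, and the worst-case value grows with the radius, consistent with the monotonicity of the ambiguity set.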
To quantify the out-of-sample performance of the data-driven DRO problem (11), we examine the following probability
where is any optimal solution of Problem (11). Of course, we hope that, for sufficiently small , there exists a finite positive integer such that
for any .
If P satisfies , we have . Thus,
for any , which of course implies that
From Theorem 3, we have for any that
here we use the notation to stress the dependence of on r. Consequently,
Denote
where $\lceil \cdot \rceil$ stands for rounding up to the nearest integer. Sometimes, we use the notation to stress the dependence of on and r.
Summarizing the above discussions, we obtain the following so-called finite sample guarantee property (see also [18,30]).
Proposition 3
Proposition 3 tells us, for the fixed confidence parameter r, at least how large the sample size should be to ensure the significance level . Now we slightly modify model (11) and consider the following data-driven DRO problem:
where and as . It reflects the natural fact that the decision maker becomes more confident with more information. Meanwhile, the model (11) emphasizes the fixed limited information. We use to denote the optimal value of Problem (14). In what follows, we investigate the asymptotic consistency whenever N tends to infinity. To this end, we need the following lemma.
Lemma 6.
Let and be two sequences of random variables defined on the probability space . If converges almost surely and
for , where with , we have
Proof.
We prove by contradiction. That is, we assume that
This implies that there exists a subset with such that
for every . Define the sequence as
for . Obviously, according to (15), we have , which implies that
Then, we can always choose a sufficiently large , such that
Choose
and we have
for all , which implies that
This contradicts the definition of . We complete the proof. □
The following proposition states that the optimal value and optimal solution set of the data-driven DRO problem (14) converge to those of the original Problem (10), which verifies the reasonability of our data-driven DRO model (14).
Proposition 4
(Asymptotic consistency). Suppose that is locally Lipschitz continuous in the following sense:
for every , where and . Let be any optimal solution of Problem (14). Then, the following assertions hold:
- (i)
- with probability 1, as ;
- (ii)
- If, moreover, X is closed, is lower semicontinuous for every and dominates some P-integrable function uniformly with respect to , then, any accumulation point of is an optimizer of Problem (10) almost surely.
Proof.
Part (i): Notice that
where and are any optimal solutions of Problems (10) and (14), respectively. For the first term on the right-hand side, we have
almost surely, as , where the first inequality is due to the definition of supremum for some with almost surely and ; the second inequality follows from the definition of FM metric. Similarly, we can derive
almost surely, as . Thus, we obtain that
Part (ii): Without loss of generality, in the following discussion, we assume with probability 1 as . Moreover, we select a sequence with and . According to (12), for each pair with defined in (14), we can select an ( is defined in (13)) such that
We know from Lemma 6 and assertion (i) that
Then, the following inequalities hold almost surely:
where (a) follows from due to the closedness of X; (b) follows from the lower semicontinuity of for every ; (c) is due to Fatou’s lemma; (d) follows from (16). □
Remark 4.
Propositions 3 and 4 establish the finite sample guarantee and the asymptotic consistency, which are two desirable properties of the data-driven DRO problem [18,30]. Different from the existing results in [18], where the Wasserstein ball is used to construct the ambiguity set, we adopt the FM ball. Due to the features of the Wasserstein metric, to ensure the existence of the significance parameter ϵ, the authors of [18] explicitly derived the radius depending on ϵ and N and the finite sample size depending only on ϵ. In Proposition 3, we view both r and ϵ as parameters because couples with ϵ implicitly in Theorem 3. Moreover, the assumptions for the asymptotic consistency (Proposition 4) differ from those in [18] (Theorem 3.6), where upper semicontinuity and linear growth were employed. Here we use locally Lipschitz continuity together with a weaker lower-bound assumption. Specifically, Ref. [18] (Theorem 3.6) employs the Borel–Cantelli lemma to obtain
This is not applicable for our case, so we need Lemma 6.
4.4. Discrete Approximation for DRO Problems with General Moment Information
We consider the following general DRO problem:
where is a compact set, , , is a closed and convex set in the Cartesian product of some finite dimensional vector and/or matrix spaces, and is a general mapping on . We implicitly assume that, for each , for all .
The above ambiguity set is very general, and it covers almost all the available ambiguity sets with moment information (see, e.g., [19], Examples 3–5). Zhang et al. discussed in [33] the quantitative stability of the DRO problem with a general moment information ambiguity set. There are usually two ways to solve Problem (17) numerically: one is to use some kind of duality argument to reformulate Problem (17) as a tractable problem [18,31]; the other is to discretize the ambiguity set, which leads to a saddle point problem in a finite dimensional space [19]. For instance, the discrete approximation in [19] is conducted under a bounded support set. In this part, by employing our results in Section 3, we consider the discrete approximation for Problem (17) under weaker conditions.
Denote by the collection of all discrete distributions which have at most N supporting elements, that is,
We define the discrete approximation of as
Obviously, . Then, the discrete approximation of Problem (17) can be written as
We use and to denote the optimal value and optimal solution set of Problem (17). and are the optimal value and optimal solution set of Problem (18). To make sense of the discrete approximation, we hope that Problem (18) can approximately solve Problem (17) when N is sufficiently large.
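For a fixed first-stage decision, the inner supremum in the discretized Problem (18) is a finite dimensional linear program over the weights placed on the N support points. The sketch below solves such a program for a hypothetical moment set (a mean interval and a second-moment bound, in the spirit of the moment systems in [19]); all support points, moment bounds, and loss values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)

# Discretized inner problem:  sup { sum_i q_i f(x, xi_i) : q in the moment set, q >= 0, sum_i q_i = 1 },
# with the hypothetical moment set  |E_q[xi] - mu0| <= d1  and  E_q[xi^2] <= s0.
xi = rng.normal(0.0, 1.0, 200)          # fixed support points xi_1, ..., xi_N
mu0, d1, s0 = 0.0, 0.1, 1.5             # assumed moment bounds (not from the text)
f_vals = np.maximum(xi, 0.0)            # f(x, xi_i) for one fixed decision x (illustrative)

c = -f_vals                             # maximize  =>  minimize the negative
A_ub = np.vstack([xi, -xi, xi ** 2])    # E_q[xi] <= mu0 + d1, -E_q[xi] <= d1 - mu0, E_q[xi^2] <= s0
b_ub = np.array([mu0 + d1, d1 - mu0, s0])
A_eq = np.ones((1, xi.size))
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0.0, None)] * xi.size, method="highs")
print("discretized worst-case expectation:", -res.fun)
```

Refining the support (increasing N) enlarges the discrete ambiguity set toward the original one, which is the convergence studied in Theorem 7 below.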
To continue the following discussion, we define the growth function of Problem (17) as
and its inverse function is
Thus, the associated conditioning function is defined as
Immediately, we have the following quantitative stability results:
Proposition 5.
Suppose that: (i) ; (ii) for each and a measurable function with for any ; (iii) is lower semicontinuous for each ; (iv)
for each . Then, and
Proof.
Since and is lower semicontinuous for each , we have from Fatou’s lemma that
holds for any such that . This implies that is lower semicontinuous. According to [34] (Lemma 4.1), is lower semicontinuous too. This, together with the compactness of X, ensures that . Similarly, we can prove that .
Note that
where (a) follows from the fact ; (b) is due to the definition of the pth order FM metric.
Finally, based on the first assertion, the inclusion for the optimal solution sets can be analogously derived as that in [11]. □
For simplicity as well as to show the linear relationship more clearly, we write as in what follows. We need the following technical assumption to proceed.
Assumption 3
(see [19]). The system satisfies the following Slater condition:
for some and .
Proposition 6.
Suppose that Assumption 3 holds and . Then, there exists an with , such that for any and , we have
for any and , where is a positive integer depending on and ω.
Proof.
Let the empirical approximation of be . Then, we have from the law of large numbers that
with probability 1, as . Equivalently, there exists an with , such that for any and , we have
for . This implies that
or equivalently,
for , where is the unit closed ball in the space of .
Notice that , and hence, for , the Slater condition holds with respect to for the system
Now we define, for any , and
Obviously, we have . Similar to that proof of [19] (Theorem 2), we again obtain . Then, we have
for and with , where the last inequality follows from Theorem 2.
Finally, letting and completes the proof. □
The following theorem states that the discrete approximation ambiguity set converges to as in the sense of FM metrics.
Theorem 7.
Suppose that: (i) Assumption 3 holds; (ii) ; (iii)
Then,
with probability 1.
Proof.
For any , by the triangle inequality, we have
where is the empirical distribution of P with N samples. Since , we know from Proposition 6 that
for and almost every . Thus, we have
Subsequently,
For the first term on the right-hand side, the definition of supremum, the boundedness of , and Theorem 2 give rise to
with probability 1, where is a sequence included in such that
and is a positive sequence with as . Thus, we obtain
with probability 1.
Analogously, by the law of large numbers, we can derive that
with probability 1.
Then, we complete the proof. □
The following corollary shows that the approximation of Problem (17) by Problem (18) is reasonable.
Corollary 1.
Under the conditions of Proposition 5 and Theorem 7, we have
with probability 1, as .
Remark 5.
In this subsection, we investigated the discrete approximation of the DRO problem with a general moment information ambiguity set. Compared with the existing work [19], we have weakened the necessary assumptions and extended the results to a more general case. Firstly, Lipschitz continuity of the objective function is required in [19] (Theorem 14), due to the adoption of the Wasserstein metric, so that the upper bound between the discrete approximation of the DRO problem and the original DRO problem can be derived [19] (Proposition 7); we only require locally Lipschitz continuity. More importantly, the discussion in [19] is restricted to the bounded support set case because the upper bound in [19] (Proposition 7) would be infinite, and hence not well defined, when the support set is unbounded. By employing our convergence results in Section 3, our support set can be unbounded.
5. Concluding Remarks
In this study, we investigated different kinds of convergence assertions about data-driven FM metrics and their possible applications. In view of the rich results about Wasserstein metrics (Lemmas 2 and 3), we first established the relationship between the FM metric and the Wasserstein metric (Lemma 4). Based on these results, the non-asymptotic moment estimate (Theorem 1), asymptotic convergence estimate (Theorem 2), and non-asymptotic concentration estimate (Theorem 3) for FM metrics were presented. These convergence assertions for FM metrics were applied to the asymptotic analyses of the empirical approximations of four kinds of stochastic optimization problems. The results demonstrate the motivation of this study and its importance.
There are still some topics to settle in the future. For example, we leave the numerical tractability for the results in Section 4.3 and Section 4.4 for future work.
Author Contributions
Supervision, Z.C.; Writing—original draft, J.J.; Writing—review & editing, H.H. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by China Postdoctoral Science Foundation (Grant Number 2020M673117), the National Natural Science Foundation of China (Grant Numbers 11991023 and 11735011).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Proof of Lemma 4.
According to the definition of FM metric, we have
Moreover, it is easy to verify that adding any constant will not change the value of the integral. For simplification of the following discussion, without loss of generality, we hereafter set for any fixed . We denote and is the complementary set of . Since and thus , we have an upper bound estimation of as follows:
If , then, , and we have the following upper bound of :
Then, we continue
where is defined by . It is easy to see that h is Lipschitz continuous on with Lipschitz modulus 1. Based on Lemma 1, we can extend h to , and its restriction on is denoted by . Then, is Lipschitz continuous on with Lipschitz modulus 1. Thus, we can continue
So, for any , we have
Note that , so
Similarly, this means that for any . Then, we continue
The proof is complete. □
Proof of Theorem 1.
According to Lemma 4 with , we have
Moreover, since
we obtain
Meanwhile, we know from Lemma 2 that
where as . Then, we take . Since as , we have and for sufficiently large N. Therefore, we have
and
Thus, letting
completes the proof. □
Proof of Theorem 2.
To prove this assertion, we need to verify that: for any , there exists a positive number such that
as for almost every . Notice from Lemma 4 that
for sufficiently large R.
We can deduce from and Lemma 2 that
and
with probability 1. Thus, there always exists a sufficiently large positive number such that
as . Moreover, there exists a positive number such that
as with probability 1, which implies from the triangle inequality that
with probability 1. Combining (A2) with (A3), we have
as and , with probability 1.
On the other hand, we know from the Glivenko-Cantelli theorem [35] that
which implies that there exists a positive number such that
when .
Proof of Theorem 3.
We know from Lemma 4 that
Then, we have
For the first term, we know from (3) that
We, in what follows, consider the estimation of the second term:
Since , we can choose a sufficiently large such that
Then, we have
Furthermore, according to Cramér’s large deviation theorem, we have
where is the so-called (large deviations) rate function defined as
and
for , where the last inequality follows from Assumption 1 with .
We know from [28] (Section 7.2.9) that is positive, convex, and infinitely differentiable at the interior of its domain. This means that is also convex and infinitely differentiable at the interior of its domain, which is consistent with the domain of . Since is finite on , is differentiable on . Note that
Then, the derivative of
which is
is larger than 0 at . Due to its differentiability, which implies the continuity, there exists a sufficiently small such that (A7) is larger than 0 for any . Then, for any , we have
Therefore, we obtain that is positive.
Finally, we obtain
Letting and
completes the proof. □
References
- Fournier, N.; Guillin, A. On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Relat. Fields 2015, 162, 707–738.
- Villani, C. Optimal Transport: Old and New; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008; Volume 338.
- Rachev, S.T.; Rüschendorf, L. Mass Transportation Problems: Volume I: Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1998; Volume 1.
- Horowitz, J.; Karandikar, R.L. Mean rates of convergence of empirical measures in the Wasserstein metric. J. Comput. Appl. Math. 1994, 55, 261–273.
- Weed, J.; Bach, F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv 2017, arXiv:1707.00087.
- Dereich, S.; Scheutzow, M.; Schottstedt, R. Constructive quantization: Approximation by empirical measures. Ann. Inst. Henri Poincaré Probab. Stat. 2013, 49, 1183–1203.
- Bolley, F.; Guillin, A.; Villani, C. Quantitative concentration inequalities for empirical measures on non-compact spaces. Probab. Theory Relat. Fields 2007, 137, 541–593.
- Boissard, E. Simple bounds for the convergence of empirical and occupation measures in 1-Wasserstein distance. Electron. J. Probab. 2011, 16, 2296–2333.
- Zhao, C.; Guan, Y. Data-driven risk-averse two-stage stochastic program with ζ-structure probability metrics. Optim. Online 2015, 2, 1–40.
- Römisch, W. Stability of stochastic programming problems. Handb. Oper. Res. Manag. Sci. 2003, 10, 483–554.
- Rachev, S.T.; Römisch, W. Quantitative stability in stochastic programming: The method of probability metrics. Math. Oper. Res. 2002, 27, 792–818.
- Römisch, W.; Vigerske, S. Quantitative stability of fully random mixed-integer two-stage stochastic programs. Optim. Lett. 2008, 2, 377–388.
- Han, Y.; Chen, Z. Quantitative stability of full random two-stage stochastic programs with recourse. Optim. Lett. 2015, 9, 1075–1090.
- Strugarek, C. On the Fortet-Mourier Metric for the Stability of Stochastic Optimization Problems, an Example; Humboldt-Universität zu Berlin: Berlin, Germany, 2004.
- Shapiro, A. Monte Carlo sampling methods. Handb. Oper. Res. Manag. Sci. 2003, 10, 353–425.
- Shapiro, A.; Xu, H. Stochastic mathematical programs with equilibrium constraints, modelling and sample average approximation. Optimization 2008, 57, 395–418.
- Dentcheva, D.; Römisch, W. Stability and sensitivity of stochastic dominance constrained optimization models. SIAM J. Optim. 2013, 23, 1672–1688.
- Esfahani, P.M.; Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Math. Program. 2018, 171, 115–166.
- Liu, Y.; Pichler, A.; Xu, H. Discrete approximation and quantification in distributionally robust optimization. Math. Oper. Res. 2018, 44, 19–37.
- Kantorovich, L.V.; Rubinstein, G.S. On a space of completely additive functions. Vestn. Leningrad. Univ. 1958, 13, 52–59.
- Valentine, F.A. A Lipschitz condition preserving extension for a vector function. Am. J. Math. 1945, 67, 83–93.
- Dentcheva, D.; Henrion, R.; Ruszczyński, A. Stability and sensitivity of optimization problems with first order stochastic dominance constraints. SIAM J. Optim. 2007, 18, 322–337.
- Dentcheva, D.; Ruszczyński, A. Robust stochastic dominance and its application to risk-averse optimization. Math. Program. 2010, 123, 85–100.
- Chen, Z.; Jiang, J. Stability analysis of optimization problems with kth order stochastic and distributionally robust dominance constraints induced by full random recourse. SIAM J. Optim. 2018, 28, 1396–1419.
- Sun, H.; Xu, H. Convergence analysis of stationary points in sample average approximation of stochastic programs with second order stochastic dominance constraints. Math. Program. 2014, 143, 31–59.
- Liu, Y.; Xu, H. Stability analysis of stochastic programs with second order dominance constraints. Math. Program. 2013, 142, 435–460.
- Dentcheva, D.; Ruszczyński, A. Optimization with stochastic dominance constraints. SIAM J. Optim. 2003, 14, 548–566.
- Shapiro, A.; Dentcheva, D.; Ruszczyński, A. Lectures on Stochastic Programming: Modeling and Theory; SIAM: Philadelphia, PA, USA, 2014.
- Bertsimas, D.; Gupta, V.; Kallus, N. Data-driven robust optimization. Math. Program. 2018, 167, 235–292.
- Bertsimas, D.; Gupta, V.; Kallus, N. Robust sample average approximation. Math. Program. 2018, 171, 217–282.
- Delage, E.; Ye, Y. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 2010, 58, 595–612.
- Pichler, A.; Xu, H. Quantitative stability analysis for minimax distributionally robust risk optimization. Math. Program. 2022, 191, 47–77.
- Zhang, J.; Xu, H.; Zhang, L. Quantitative stability analysis for distributionally robust optimization with moment constraints. SIAM J. Optim. 2016, 26, 1855–1882.
- Jiang, J.; Chen, Z. Quantitative stability analysis of two-stage stochastic linear programs with full random recourse. Numer. Funct. Anal. Optim. 2019, 40, 1847–1876.
- Varadarajan, V.S. On the convergence of sample probability distributions. Sankhyā Indian J. Stat. 1958, 19, 23–26.