Convergence Analysis on Data-Driven Fortet-Mourier Metrics with Applications in Stochastic Optimization

Abstract: Fortet-Mourier (FM) probability metrics have been widely adopted in the quantitative stability analysis of stochastic programming problems. In this study, we establish several types of convergence assertions between a probability distribution and its empirical distribution when the deviation is measured by FM metrics, and we consider their applications in stochastic optimization. We first establish a quantitative relation between FM metrics and Wasserstein metrics. After that, we derive the non-asymptotic moment estimate, asymptotic convergence, and non-asymptotic concentration estimate for FM metrics, which supplement the existing results. Finally, we apply the derived results to four kinds of stochastic optimization problems, either extending existing results to more general cases or providing alternative avenues. All these discussions demonstrate the motivation as well as the significance of our study.


Introduction
The estimation of the distance between a distribution and its empirical approximation obtained from some independent and identically distributed (iid) samples is an important subject in probability theory, mathematical statistics, and information theory. It has vast applications in many fields, such as quantization, optimal matching, density estimation, clustering, and so on (see [1] and the references therein for more details). To quantify the distance between two probability distributions, some rules have been adopted to generate probability metrics, such as the commonly used ζ-structure metric. By selecting different generators of the ζ-structure metric, we obtain a number of well-known probability metrics, such as the Wasserstein metric (in this study, when we refer to Wasserstein metric, it means the 1-Wasserstein metric, which is also called the Kantorovich-Rubinstein metric or Kantorovich metric), FM metric, and total variation metric.
Among probability metrics with ζ-structure, the Wasserstein metric is the most popular one which has been widely applied in statistics, probability, and machine learning [2]. It originates from the optimal transportation problem and thus can be interpreted as an optimal mass transportation plan. Except for its practical meaning in transportation, the Wasserstein metric has some good properties. For example, convergence in the Wasserstein metric is equivalent to weak convergence plus the convergence of the first order absolute moment [2].
There is some literature concentrating on the convergence analysis under Wasserstein metrics between a distribution and its empirical approximation. From now on, we refer to this as the data-driven Wasserstein metric for simplicity, and we treat other probability metrics in the same way. These convergence analyses can mainly be divided into two parts: moment estimates, which aim at providing the rate of convergence for the expectation of the Wasserstein distance between a distribution and its empirical approximation, and concentration estimates, which focus on the violation probability under a given tolerance. As for moment estimates, some earlier results can be found in [3,4], which provide a relatively loose convergence rate. More recently, Weed and Bach [5] focused on the case of a compact support set and obtained a sharp convergence rate. Dereich et al. [6] conducted an almost optimal convergence analysis, but placed some restrictions on the range of parameters. An interesting result was given in [1], which extends some results in [6] from a limited range of parameters to the general case. As for concentration estimates, only a few results are available. The corresponding results can be found in [7,8] under some strong assumptions; moreover, they require the violation parameter to be large enough. In [9], Zhao and Guan investigated the case with a discrete and bounded support set. In particular, an elaborate result on the rate of convergence of the data-driven Wasserstein distance was presented in [1].
As pointed out in [2] (p. 110), the Wasserstein metric is a rather strong probability metric. Intuitively, strong conditions are needed to establish Wasserstein-type upper bound estimates. Indeed, we know from the definition of the Wasserstein metric that its generator is the set of Lipschitz continuous functions with Lipschitz modulus one.
Compared with the Wasserstein metric, FM metrics are more general; their generator is a class of locally Lipschitz continuous functions. Therefore, it is easier to obtain upper bounds by adopting FM metrics. In view of this, FM metrics have been widely used in the quantitative stability analysis of stochastic programming problems when the underlying probability distribution is perturbed and approximated; see for example [10][11][12][13]. Moreover, FM metrics have a close relationship with Wasserstein metrics through dual representation (the Kantorovich-Rubinstein theorem). In particular, the pth order FM metric reduces to the Wasserstein metric when p = 1. From this point of view, the FM metric can be viewed as an extension of the Wasserstein metric. Nevertheless, there are few results concerning the convergence analysis for data-driven FM metrics. To the best of our knowledge, only Strugarek [14] examined the asymptotic convergence analysis under the FM distance.
In view of the above situation, in this article we study the data-driven FM metric. The main contributions of this study can be summarized as follows:
• We establish the quantitative connection between the Wasserstein metric and the FM metric. Based on this connection, we investigate the non-asymptotic moment estimate, asymptotic convergence, and non-asymptotic concentration estimate for data-driven FM metrics.
• We provide an alternative avenue for the convergence analysis of discrete approximations for two-stage stochastic programming problems. Different from the convergence or exponential rate of convergence analysis in [15,16], where some complex conditions are required, our approach is straightforward and brief.
• We reestablish the quantitative stability results for stochastic optimization problems with stochastic dominance constraints through FM metrics. Compared with [17], our conditions are weaker and different probability metrics are adopted. More importantly, we can apply the convergence conclusion to examine the discrete approximation method, which is crucial for numerical solution.
• We consider data-driven distributionally robust optimization (DRO) problems with an FM ball, which extends the results in [18] from the ambiguity set constructed by a Wasserstein ball to the FM ball case. We prove the finite sample guarantee and asymptotic consistency, which lay the theoretical foundation for the data-driven approach for the DRO model.
• We analyze the discrete approximation of the DRO problem whose ambiguity set is constructed with general moment information. Compared with the existing work [19] under a bounded support set, we weaken the conditions and extend the results to the case with an unbounded support set.
The remainder of this study is organized as follows. In Section 2, we give some prerequisites for further discussion. In Section 3, we discuss different kinds of convergence results for data-driven FM metrics. We consider four applications to verify our convergence results and to further demonstrate the motivation and significance of this study in Section 4. Finally, we have some concluding remarks in Section 5.

Prerequisites
Let ξ : Ω → Ξ ⊆ R^s be a random vector defined on the probability space (Ω, F, P). Then, its induced probability distribution (sometimes called a probability measure) on Ξ is P := P ∘ ξ^{−1}. We use P(Ξ) to denote the set of all probability distributions on Ξ. The set of probability distributions having finite pth order absolute moments is denoted by P_p(Ξ) := {P ∈ P(Ξ) : ∫_Ξ ‖ξ‖^p P(dξ) < +∞}.
Probability metrics measure the distance between two probability distributions. Generally, they do not satisfy all three axioms of a distance in a metric space. A commonly used class is that of probability metrics with ζ-structure, defined as follows.

Definition 1. Let G be a set of measurable functions from Ξ to R. Then, for any P, Q ∈ P(Ξ),

ζ_G(P, Q) := sup_{g ∈ G} | ∫_Ξ g(ξ) P(dξ) − ∫_Ξ g(ξ) Q(dξ) |

is called the ζ-structure probability metric induced by G.
The set G in Definition 1 totally determines the resulting ζ-structure probability metric, so it is called the generator of the ζ-structure probability metric. FM metrics and Wasserstein metrics can be deduced from the ζ-structure probability metric by choosing specific generators. In particular, we have the following definitions.

Definition 2. Let P, Q ∈ P_p(Ξ) for some p ≥ 1 and let G_FM^p denote the set of locally Lipschitz continuous functions given by

G_FM^p := { g : Ξ → R : |g(ξ) − g(ξ′)| ≤ max{1, ‖ξ‖, ‖ξ′‖}^{p−1} ‖ξ − ξ′‖ for all ξ, ξ′ ∈ Ξ }.

Then, the pth order FM metric between P and Q is

ζ_p(P, Q) := sup_{g ∈ G_FM^p} | ∫_Ξ g(ξ) P(dξ) − ∫_Ξ g(ξ) Q(dξ) |.

Definition 3. Let P, Q ∈ P_1(Ξ) and

G_W := { g : Ξ → R : |g(ξ) − g(ξ′)| ≤ ‖ξ − ξ′‖ for all ξ, ξ′ ∈ Ξ }.

Then, the Wasserstein metric between P and Q is

D_W(P, Q) := sup_{g ∈ G_W} | ∫_Ξ g(ξ) P(dξ) − ∫_Ξ g(ξ) Q(dξ) |.

It is easy to see from the above definitions that ζ_1(P, Q) = D_W(P, Q) for any P, Q ∈ P_1(Ξ). Moreover, if g ∈ G_FM^p, then −g ∈ G_FM^p, and the same holds for G_W. Therefore, we can ignore the absolute value operator in Definitions 2 and 3 when taking the supremum. Moreover, both FM metrics and Wasserstein metrics have a close relationship with weak convergence. One can refer to [10] (p. 490) and [2] (Theorem 6.9) for more details.
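As a quick numerical illustration of Definition 3 (ours, not part of the paper's development): in one dimension the Wasserstein metric D_W between two discrete distributions can be computed directly with SciPy, since in 1-D it equals the L1 distance between the two cumulative distribution functions.

```python
# Illustration: D_W between two discrete 1-D distributions via SciPy.
import numpy as np
from scipy.stats import wasserstein_distance

# P puts mass 1/2 on each of {0, 1}; Q puts mass 1/2 on each of {0, 2}.
p_support, p_weights = np.array([0.0, 1.0]), np.array([0.5, 0.5])
q_support, q_weights = np.array([0.0, 2.0]), np.array([0.5, 0.5])

# D_W(P, Q) = sup over 1-Lipschitz g of |E_P g - E_Q g|; in 1-D this equals
# the integral of |F_P - F_Q|, which is what SciPy computes.
d = wasserstein_distance(p_support, q_support, p_weights, q_weights)
print(d)  # 0.5: half the mass must be moved a distance of 1
```

By the identity ζ_1 = D_W, this value is also the first order FM distance between the two distributions.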
The Wasserstein metric has an alternative definition in terms of couplings of the marginal distributions. Specifically, the Wasserstein metric between P and Q is defined as (see [2], Definition 6.1):

D_W(P, Q) = inf_{π ∈ Π(P,Q)} ∫_{Ξ×Ξ} ‖ξ_1 − ξ_2‖ π(dξ_1, dξ_2),   (1)

where Π(P, Q) is the collection of all joint distributions of ξ_1 and ξ_2 with marginal distributions P and Q, respectively. It is known from the Kantorovich-Rubinstein theorem [20] that Definition 3 is the dual representation of (1). We have the following extension theorem for Lipschitz functions in Hilbert spaces (see [21], Theorems 4 and 5).

Lemma 1. Let X and Y be Hilbert spaces and g : B ⊆ X → Y be a Lipschitz function with Lipschitz modulus L_g. Then, there exists a Lipschitz function ĝ : X → Y such that ĝ(x) = g(x) for any x ∈ B and L_g is also the Lipschitz modulus of ĝ.
Lemma 1 is important for the following discussion. In [1], the authors assumed that the support set Ξ is the whole space R^s. They obtained the non-asymptotic moment estimate [1] (Theorem 1) and concentration estimate [1] (Theorem 2) for the Wasserstein metric. For any P, Q ∈ P(Ξ), we can view them as probability distributions P̂, Q̂ ∈ P(R^s) through the correspondence

P̂(A) := P(A ∩ Ξ), Q̂(A) := Q(A ∩ Ξ) for all Borel sets A ⊆ R^s.

That is, we set the probability of the region R^s \ Ξ to be zero. We claim that D_W(P, Q) = D_W(P̂, Q̂). The details are as follows:

D_W(P, Q) = sup_{g ∈ G_W(Ξ)} ( ∫_Ξ g dP − ∫_Ξ g dQ ) = sup_{ĝ ∈ Ĝ_W(R^s)} ( ∫_{R^s} ĝ dP̂ − ∫_{R^s} ĝ dQ̂ ),

where G_W(Ξ) denotes the collection of all Lipschitz continuous functions with Lipschitz modulus 1 on Ξ, and Ĝ_W(R^s) is the extension of G_W(Ξ) according to Lemma 1. Obviously, Ĝ_W(R^s) ⊆ G_W(R^s), where G_W(R^s) is the set of Lipschitz continuous functions with Lipschitz modulus 1 over R^s. Thus, we have the estimation

sup_{ĝ ∈ Ĝ_W(R^s)} ( ∫_{R^s} ĝ dP̂ − ∫_{R^s} ĝ dQ̂ ) ≤ sup_{ḡ ∈ G_W(R^s)} ( ∫_{R^s} ḡ dP̂ − ∫_{R^s} ḡ dQ̂ ).

That is, D_W(P, Q) ≤ D_W(P̂, Q̂). On the other hand, for any ḡ ∈ G_W(R^s), its restriction to Ξ is Lipschitz continuous with Lipschitz modulus 1. Thus,

∫_{R^s} ḡ dP̂ − ∫_{R^s} ḡ dQ̂ = ∫_Ξ ḡ dP − ∫_Ξ ḡ dQ ≤ D_W(P, Q).

Finally, we have D_W(P, Q) = D_W(P̂, Q̂). Therefore, although all the convergence results in [1] were derived on R^s, we can extend them to any support set Ξ ⊆ R^s.

Lemma 2 ([1], Theorem 1). Let P ∈ P_p(Ξ) for some p > 1. Then, there exists a constant C depending only on s (the dimension of Ξ) and p such that, for all N ≥ 1,

E[D_W(P, P_N)] ≤ C M_p^{1/p}(P) × { N^{−1/2} + N^{−(p−1)/p} if s = 1; N^{−1/2} log(1 + N) + N^{−(p−1)/p} if s = 2; N^{−1/s} + N^{−(p−1)/p} if s > 2 },

where M_p(P) := ∫_Ξ ‖ξ‖^p P(dξ) and log is the natural logarithm.
Lemma 2 does not cover all pairs (s, p), for example, (s, p) = (1, 2) or (s, p) = (2, 2). However, we can always reset p such that Lemma 2 holds by the following procedure. If s = 1 or 2 and p = 2, then P must belong to P_q(Ξ) for any q ∈ (1, 2). If s > 2 and p = 2, we can select q ∈ (1, 2) such that q ≠ s/(s − 1). If s > 2 and p = s/(s − 1), we can choose any q ∈ (1, s/(s − 1)). Then, we let p = q and have that Lemma 2 holds. Therefore, Lemma 2 is applicable for any s ∈ N through a carefully chosen p. In the following discussion, without loss of generality, we always assume that Lemma 2 holds for any pair (s, p) ∈ N × [1, +∞). Further, according to Lemma 2, we can obtain the uniform upper bound

E[D_W(P, P_N)] ≤ C M_p^{1/p}(P) ( N^{−1/max{2,s}} log(1 + N) + N^{−(p−1)/p} )

for any s ∈ N and N ≥ 1.
Lemma 3 ([1], Theorem 2). Suppose that E_P[exp(γ‖ξ‖^a)] ≤ b for some γ > 0, a > 1, and some constant b. Then, for any ε ∈ (0, 1] and N ≥ 1,

P(D_W(P, P_N) ≥ ε) ≤ C ( exp(−cNε^2) 1{s = 1} + exp(−cN(ε/log(2 + 1/ε))^2) 1{s = 2} + exp(−cNε^s) 1{s ≥ 3} ),

where the positive constants C and c depend only on s, a, γ, and b.

For a more comprehensive version of Lemma 3, one can refer to [1] (Theorem 2). Here, we focus on the case ε ∈ (0, 1] because it is more interesting for us to investigate a smaller violation rather than a bigger one. A simplified version can also be found in [18] (Theorem 3.4), where the assumption s ≠ 2 is imposed.
To simplify the following discussion, we derive a uniform upper bound for the right-hand side in Lemma 3. Note the fact that 1 + δ ≤ e^δ for any δ ∈ R, which yields

log(2 + 1/ε) = log(1 + (1 + 1/ε)) ≤ 1 + 1/ε.

Letting ε ∈ (0, 1/2] gives us that 1 + 1/ε ≤ 3/(2ε), and hence ε/log(2 + 1/ε) ≥ 2ε^2/3. Moreover, for ε ∈ (0, 1/2], ε^2 ≥ ε^{max{s,4}} and ε^s ≥ ε^{max{s,4}}. Therefore, we can obtain the loose but uniform upper bound estimation

P(D_W(P, P_N) ≥ ε) ≤ C exp(−cNε^{max{s,4}})

for any ε ∈ (0, 1/2] and s ∈ N, where C and c are positive constants (possibly different from those in Lemma 3).
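The decay of the data-driven Wasserstein distance discussed above can be visualized with a small Monte Carlo experiment (our illustration, not part of [1]); here the true distribution P is taken to be U[0,1] and is approximated by a dense fixed quantile grid.

```python
# Monte Carlo sketch: E[D_W(P, P_N)] for P = U[0,1] shrinks as N grows,
# consistent with the moment estimates discussed above.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def mean_w1(n_samples: int, n_reps: int = 200) -> float:
    """Average W1 distance between U[0,1] and an n-sample empirical measure."""
    # A large fixed reference sample stands in for the true distribution P.
    ref = np.linspace(0.0, 1.0, 10_001)  # dense grid of U[0,1] quantiles
    dists = [wasserstein_distance(rng.random(n_samples), ref)
             for _ in range(n_reps)]
    return float(np.mean(dists))

small, large = mean_w1(10), mean_w1(1000)
print(small, large)  # the average distance decreases markedly with N
```

For s = 1 the expected distance shrinks roughly like N^{−1/2}, in line with Lemma 2.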

Convergence Analyses of Data-Driven FM Metrics
In this section, we investigate different kinds of convergence for data-driven FM metrics. To this end, let ξ^1, ξ^2, · · · , ξ^N be N iid samples generated according to P. These samples are viewed here as random samples ξ^i : Ω → Ξ, 1 ≤ i ≤ N, on the probability space (Ω, F, P). Then, we obtain the empirical distribution P_N defined as

P_N := (1/N) ∑_{i=1}^N δ_{ξ^i},

where δ_ξ denotes the Dirac measure placing unit mass at ξ. We first give the following vital lemma.

Lemma 4.
Let P, Q ∈ P_p(Ξ) for some p ≥ 1. Then, there exists a constant c_p > 0 depending only on p such that, for any R satisfying R ≥ 1 and B(0, R) ∩ Ξ ≠ ∅,

ζ_p(P, Q) ≤ c_p ( R^{p−1} D_W(P, Q) + ∫_{‖ξ‖>R} ‖ξ‖^p P(dξ) + ∫_{‖ξ‖>R} ‖ξ‖^p Q(dξ) ).

Here, 0 is the origin in R^s and B(0, R) is the closed ball centered at 0 with radius R.
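As a numerical side note (our construction, not from the paper): for discrete distributions, every g in the FM generator satisfies |g(ξ) − g(ξ′)| ≤ c_p(ξ, ξ′) with the locally Lipschitz cost c_p(ξ, ξ′) = max{1, |ξ|, |ξ′|}^{p−1} |ξ − ξ′|, so the optimal transport cost with respect to c_p bounds ζ_p(P, Q) from above and can be computed as a linear program.

```python
# Sketch: an upper bound on the p-th order FM distance between two discrete
# distributions via a transport LP with cost c_p(x, y).
import numpy as np
from scipy.optimize import linprog

def fm_transport_bound(xs, p_w, ys, q_w, p=2):
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    n, m = len(xs), len(ys)
    # Cost matrix c_p(x_i, y_j) = max(1, |x_i|, |y_j|)**(p-1) * |x_i - y_j|.
    scale = np.maximum(1.0, np.maximum(np.abs(xs)[:, None], np.abs(ys)[None, :]))
    cost = scale ** (p - 1) * np.abs(xs[:, None] - ys[None, :])
    # Marginal constraints: row sums of the plan = p_w, column sums = q_w.
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        a_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([p_w, q_w])
    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

# For p = 1 the cost is |x - y| and the bound coincides with D_W(P, Q).
b1 = fm_transport_bound([0.0, 1.0], [0.5, 0.5], [0.0, 2.0], [0.5, 0.5], p=1)
print(b1)  # 0.5, matching the Wasserstein distance of this pair
```

The bound follows because for any transport plan π with marginals P and Q, |∫g dP − ∫g dQ| = |∫(g(x) − g(y)) dπ| ≤ ∫ c_p dπ.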
The proof of Lemma 4 can be found in Appendix A. If we choose R to balance the two kinds of terms on the right-hand side of Lemma 4, we can obtain a tighter upper bound estimation of ζ_p(P, Q). The first convergence result is about the non-asymptotic moment estimate. It provides an upper bound for the expectation of the FM distance between P and its empirical approximation distribution.
Theorem 1 (Non-asymptotic moment estimates for FM metrics). Suppose that P ∈ P p (Ξ) for some p > 1. Then, for sufficiently large N, we have The proof of Theorem 1 can be found in Appendix A. Theorem 1 establishes the convergence in the sense of expectation. However, it fails to tell us the sample-wise convergence. The following theorem states the asymptotic convergence under FM metrics for almost every sample.
Theorem 2 (Asymptotic convergence of FM metrics). Suppose that P ∈ P_p(Ξ). Then,

ζ_p(P, P_N) → 0 with probability 1, as N → ∞.

Here, P_N is defined at the beginning of this section.
The proof of Theorem 2 can be found in Appendix A. Theorems 1 and 2 establish convergence. As we know, the rate of convergence is quite important for guiding the solution process in practice. The following theorem estimates the convergence rate under certain assumptions.
Theorem 3 (Non-asymptotic concentration estimates for FM metrics). Suppose that E_P[exp(γ‖ξ‖^b)] < +∞ for some γ > 0 and b > p. Then, for any ε ∈ (0, 1/2] and N ≥ 1,

P(ζ_p(P, P_N) ≥ ε) ≤ ᾱ exp(−β̄N),

where the positive scalar ᾱ depends on P, b, and s, and β̄ depends on P, b, s, and ε.

The proof of Theorem 3 can be found in Appendix A.

Remark 1.
Here, we assume that ε ∈ (0, 1/2]. The main reason is that we want to give a relatively simple proof. Fortunately, it is more interesting for us to consider a small violation rather than a large one. Under certain assumptions, we can obtain an estimation for the rate function I. For example, if M(t) ≤ exp(σ^2 t^2/2) for t ∈ R, where σ is a positive constant, we have log M(t) ≤ σ^2 t^2/2. Then, according to the properties of convex quadratic functions, the rate function has the lower bound

I(z) = sup_{t ∈ R} { tz − log M(t) } ≥ sup_{t ∈ R} { tz − σ^2 t^2/2 } = z^2/(2σ^2).

Thus, we can further obtain a concrete estimate for β̄. For more details in this aspect, one can refer to [16].

Applications
In this section, we consider four applications of the convergence conclusions about FM metrics obtained in Section 3. Specifically, we study the discrete approximation of two-stage stochastic programming problems, stochastic optimization problems with dominance constraints, data-driven distributionally robust optimization problems with an FM ball, and the discrete approximation for distributionally robust optimization problems with a general moment ambiguity set. These will not only further illustrate the motivations of this study but also provide alternative avenues or extensions for the current results.

Two-Stage Stochastic Linear Programming Problems
Discrete approximation is an important issue in stochastic optimization, which is crucial for its numerical solution. In this subsection, by employing the convergence results in Section 3, we give an alternative avenue for analyzing the discrete approximation of two-stage stochastic programming problems.
Consider the two-stage stochastic programming problem:

min_{x ∈ X} E_P[f(x, ξ)], f(x, ξ) := c^T x + Q(x, ξ),   (4)

where c ∈ R^n; X ⊆ R^n is a polyhedron; the probability measure P is supported on Ξ ⊆ R^s, which is a polyhedron; and

Q(x, ξ) := inf { q^T y : Wy = h(ξ) − T(ξ)x, y ≥ 0 }

is the second-stage (recourse) value function with fixed recourse matrix W. Here, q, W, h(ξ), and T(ξ) have suitable dimensions, and we let v(P) and S(P) denote the optimal value and optimal solution set of Problem (4), respectively. Moreover, we use Pos W to denote the set {Wŷ : ŷ ≥ 0}. To quantify the upper semicontinuity or the deviation distance of the optimal solution set, we define the growth function ψ_P : R_+ → R_+ as

ψ_P(τ) := min { E_P[f(x, ξ)] − v(P) : d(x, S(P)) ≥ τ, x ∈ X }.

Its inverse function ψ_P^{−1} is given by

ψ_P^{−1}(t) := sup { τ ∈ R_+ : ψ_P(τ) ≤ t }.

Thus, we can define the associated conditioning function Ψ_P : R_+ → R_+ as

Ψ_P(η) := η + ψ_P^{−1}(2η).

It is easy to verify that ψ_P is nondecreasing and Ψ_P is increasing. Both ψ_P and Ψ_P are lower semicontinuous on R_+ and vanish at 0. One can refer to [10] for more details.
Moreover, we have ψ_P^{−1}(t) → 0+ as t → 0+. We verify this fact by contradiction. Suppose that there exists a sequence {t_n} satisfying t_n → 0+ as n → ∞ such that τ_n := ψ_P^{−1}(t_n) does not tend to 0. The lower semicontinuity of ψ_P means that {τ ∈ R_+ : ψ_P(τ) ≤ t_n} is closed. Thus, ψ_P(τ_n) ≤ t_n. Due to the nondecreasing property of ψ_P and t_n → 0+ as n → ∞, {τ_n} must be bounded. Without loss of generality, we assume that τ_n → τ* as n → ∞, where τ* is a positive constant. According to the lower semicontinuity of ψ_P, we have

ψ_P(τ*) ≤ liminf_{n→∞} ψ_P(τ_n) ≤ lim_{n→∞} t_n = 0,

which leads to a contradiction.
To introduce the following discussion, we make some standard assumptions (see [11]).
Under the above assumptions, we have the following quantitative stability results about the optimal value and optimal solution set of Problem (4).
Based on Lemma 5 and the convergence results in Section 3, we have the following convergence conclusions between the two-stage stochastic programming problem (4) and its empirical approximation.
Theorem 4. Suppose that: (i) Assumption 2 holds; (ii) S(P) is nonempty and bounded. Then,

|v(P) − v(P_N)| → 0, d(S(P_N), S(P)) → 0 with probability 1, as N → ∞.

Proof. For the first assertion, we have from Theorem 2 that ζ_2(P, P_N) → 0 with probability 1. This means that, for the δ defined in Lemma 5, there exists a positive number N_0 = N_0(δ, ω) such that for any N ≥ N_0, ζ_2(P, P_N) ≤ δ for almost every ω ∈ Ω. Then, by Lemma 5, we have that

|v(P) − v(P_N)| ≤ L ζ_2(P, P_N), d(S(P_N), S(P)) ≤ Ψ_P(L ζ_2(P, P_N))

hold almost surely for N ≥ N_0, where L is defined in Lemma 5. According to Theorem 2 and the properties of Ψ_P, we have ζ_2(P, P_N) → 0 and thus Ψ_P(L ζ_2(P, P_N)) → 0 with probability 1, as N → ∞. These facts imply that |v(P) − v(P_N)| → 0 and d(S(P_N), S(P)) → 0 with probability 1, as N → ∞.
As shown in the proof of Theorem 4, a sufficient condition for the stability bounds in Lemma 5 to apply is ζ_2(P, P_N) ≤ δ, where δ is defined in Lemma 5. Without loss of generality, we assume that δ ∈ (0, 1/2]. Analogously, combining Lemma 5 with the concentration estimate in Theorem 3, we obtain, for ε ∈ (0, 1/2],

P(|v(P) − v(P_N)| ≥ Lε) ≤ P(L ζ_2(P, P_N) ≥ Lε) = P(ζ_2(P, P_N) ≥ ε) ≤ ᾱ exp(−β̄N).

Similarly, we have

P( d(S(P_N), S(P)) ≥ Ψ_P(Lε) ) ≤ P( Ψ_P(L ζ_2(P, P_N)) ≥ Ψ_P(Lε) ) = P( ζ_2(P, P_N) ≥ ε ),

where the equality follows from the strictly increasing property of Ψ_P(·). By the same procedure, we can derive the second assertion.
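The discrete approximation can be made tangible with a toy sketch (ours; all problem data are hypothetical). Consider a newsvendor-type instance of the two-stage problem, min_x c·x + E_P[Q(x, ξ)] with Q(x, ξ) = min{q·y : y ≥ ξ − x, y ≥ 0}; replacing P by the empirical distribution of N scenarios yields one large LP over (x, y_1, …, y_N), the deterministic equivalent of the SAA problem.

```python
# Toy sketch of the empirical (SAA) approximation of a two-stage problem,
# solved via the deterministic-equivalent LP.  All data are hypothetical.
import numpy as np
from scipy.optimize import linprog

def solve_saa(scenarios, c=1.0, q=3.0):
    n = len(scenarios)
    # Objective: c*x + (q/n) * sum_i y_i  over variables (x, y_1, ..., y_n).
    obj = np.concatenate([[c], np.full(n, q / n)])
    # Recourse constraints y_i >= xi_i - x  <=>  -x - y_i <= -xi_i.
    a_ub = np.zeros((n, n + 1))
    a_ub[:, 0] = -1.0
    a_ub[np.arange(n), np.arange(n) + 1] = -1.0
    b_ub = -np.asarray(scenarios, float)
    res = linprog(obj, A_ub=a_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[0], res.fun  # first-stage decision and optimal value v(P_N)

x_n, v_n = solve_saa([1.0, 2.0, 3.0, 4.0])
print(x_n, v_n)  # x = 3, v = 3.75 for this four-scenario empirical distribution
```

As the scenario set grows, v(P_N) approaches v(P), which is the behavior quantified by Theorem 4 and the exponential bounds above.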

Remark 2.
The convergence analysis of two-stage stochastic programming problems can also be found in [11] (Section 4), where covering and bracketing numbers are introduced. However, it seems difficult to verify the growth rate of the covering or bracketing number in the general case (see [11], Proposition 4.2). Our convergence results are more straightforward. Compared with [11] (Proposition 4.2), instead of the growth rate of the covering or bracketing number, we use the light-tailed distribution assumption. This assumption is commonly used in the literature; see for example [1,18].

Stochastic Optimization Problems with Stochastic Dominance Constraints
In this part, we consider stochastic optimization problems with stochastic dominance constraints. Stochastic dominance is an important ingredient in economics, decision theory, statistics, and, nowadays, modern optimization. It has been widely studied in the last two decades; see for example [17,[22][23][24][25][26] and the references therein. Different from classical stochastic optimization models, which cope with random variables by taking expectations, stochastic dominance can better reflect the relationship between two random variables. It is known that expected utility theory can also provide a comparison of two random variables. However, it is hardly possible for us to explicitly express the utility functions of decision makers [27]. From this point of view, stochastic dominance is more friendly in practice. Actually, stochastic dominance has a close relationship with expected utility theory. Generally, a random variable X dominates another random variable Y in the kth (k ≥ 1) order, denoted by X ⪰_(k) Y, if E[u(X)] ≥ E[u(Y)] for every nondecreasing function u(·) from a certain set of utility functions [17]. Specially, X ⪰_(1) Y if and only if E[u(X)] ≥ E[u(Y)] for every nondecreasing utility function u(·), and X ⪰_(2) Y if and only if E[u(X)] ≥ E[u(Y)] for every nondecreasing and concave utility function u(·) [27].
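The first order characterization above is easy to check empirically (a side illustration of ours, not from the paper): X first-order dominates Y exactly when F_X(t) ≤ F_Y(t) for all t, and for two equal-size samples this reduces to comparing order statistics.

```python
# Sketch: checking empirical first-order stochastic dominance.
import numpy as np

def fosd(sample_x, sample_y) -> bool:
    """True if the empirical law of sample_x first-order dominates sample_y.

    With equal sample sizes, F_X <= F_Y everywhere iff every order statistic
    of X is at least the corresponding order statistic of Y.
    """
    x = np.sort(np.asarray(sample_x, float))
    y = np.sort(np.asarray(sample_y, float))
    assert x.shape == y.shape, "equal sample sizes assumed for this shortcut"
    return bool(np.all(x >= y))

print(fosd([2, 3, 5], [1, 3, 4]))  # True: each order statistic is at least as large
print(fosd([2, 3, 5], [1, 6, 4]))  # False: sorted([1, 6, 4]) = [1, 4, 6] and 5 < 6
```

Higher-order dominance can be checked analogously by comparing the iterated (kth order) empirical distribution functions.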
The convex stochastic optimization model with the kth order stochastic dominance constraint can be described as (see [22,27]):

min_{x ∈ D} f(x) s.t. G(x, ξ) ⪰_(k) Y,   (9)

where D is a nonempty, closed, and convex subset of R^n; f : R^n → R is a convex function; Y is a random variable supported on Y ⊆ R, which can be treated as a random benchmark; and G : R^n × Ξ → R. Moreover, we assume that G is locally Lipschitz continuous with respect to ξ in the following sense: for any ξ, ξ′ ∈ Ξ,

|G(x, ξ) − G(x, ξ′)| ≤ L_G max{1, ‖ξ‖, ‖ξ′‖}^{p−1} ‖ξ − ξ′‖,   (7)

where p ≥ 1 and L_G > 0, and that G satisfies the linear growth condition: for every x ∈ B and ξ ∈ Ξ,

|G(x, ξ)| ≤ C_G(B)(1 + ‖ξ‖),   (8)

where B is any bounded subset of R^n and C_G(B) > 0 depends on B. Actually, we can impose a more general growth condition on G, for example |G(x, ξ)| ≤ C_G(B)(1 + ‖ξ‖^q) for q ≥ 1, and the following discussion still holds; the linear growth condition simply simplifies the demonstration. The above requirements on G(x, ξ) can be met easily. For instance, the objective function of the two-stage stochastic programming problem with fixed recourse satisfies the above conditions (see [11], Proposition 3.2). Due to its attractive modeling technique, the quantitative stability analysis of stochastic optimization models with dominance constraints has recently been investigated in several works. Dentcheva et al. first studied, in [22], stochastic optimization problems with first order stochastic dominance constraints, which was extended by Dentcheva and Römisch in [17] to problems with general kth (k ≥ 2) order stochastic dominance constraints. In [24], Chen and Jiang weakened the assumptions of the quantitative stability analysis in [17] by considering the case where G(x, ξ) is generated by a two-stage fully random stochastic programming problem.
In view of our focus in this study, we reestablish the quantitative stability conclusions for Problem (9) in what follows. We use P_ξ and P_Y to denote the probability distributions of ξ and Y, respectively. We denote the feasible solution set of Problem (9) by

X(P_ξ, P_Y) := { x ∈ D : G(x, ξ) ⪰_(k) Y },

and its perturbed feasible solution set under (Q_ξ, Q_Y) by

X(Q_ξ, Q_Y) := { x ∈ D : G(x, ξ) ⪰_(k) Y, where ξ and Y follow Q_ξ and Q_Y, respectively }.

First, we examine the quantitative stability of the feasible solution set.

Proposition 1.
Let D be compact, P_ξ ∈ P_{k+p−2}(Ξ), P_Y ∈ P_{k−1}(Y), and let G satisfy the locally Lipschitz continuity condition (7) and the linear growth condition (8). Then, there exist constants L > 0 and δ > 0 such that

H(X(P_ξ, P_Y), X(Q_ξ, Q_Y)) ≤ L ( ζ_{k+p−2}(P_ξ, Q_ξ) + ζ_{k−1}(P_Y, Q_Y) )

whenever ζ_{k+p−2}(P_ξ, Q_ξ) + ζ_{k−1}(P_Y, Q_Y) ≤ δ, where H denotes the Hausdorff distance.

Proof. We know from the proof of [17] (Proposition 3.2) that the deviation of the feasible sets is bounded by a constant multiple (say L_1) of a uniform distance between the associated kth order distribution functions, whenever the right-hand side is less than or equal to some positive scalar δ̃.
In view of this, we estimate the distances between the kth order distribution functions of G(x, ξ) under P_ξ and Q_ξ, and of Y under P_Y and Q_Y, respectively. Note the fact that (see [17], (3.9)) the relevant (k−1)th order distribution functions are Lipschitz continuous on I with some positive constant K_I for any η ∈ I. Then, both uniform distances can be bounded by the FM distances ζ_{k+p−2}(P_ξ, Q_ξ) and ζ_{k−1}(P_Y, Q_Y), up to constants. Taking L := (1/(k−1)!) max{L_1, K_I(k−1)}, we obtain the asserted estimate.

The quantitative stability result in Proposition 1 differs in two respects from the corresponding results in [17]. One is the local Lipschitz continuity of G; the other is the probability metric we choose. In [17], the authors assumed that G is Lipschitz continuous and adopted Rachev metrics and the (k−1)th order Wasserstein metric. As far as we know, no data-driven results exist under Rachev metrics.

Proposition 2. Under the conditions of Proposition 1, there exist constants L̃ > 0 and δ̄ > 0 such that the optimal values and optimal solution sets of Problem (9) under (P_ξ, P_Y) and under the perturbation (Q_ξ, Q_Y) satisfy quantitative stability estimates of the same form as in Proposition 1, with L replaced by L̃, whenever ζ_{k+p−2}(P_ξ, Q_ξ) + ζ_{k−1}(P_Y, Q_Y) ≤ δ̄.
Proof. Since f is convex, f is locally Lipschitz continuous. Since D is compact, f is in fact Lipschitz continuous over D. Then, the assertions follow from a similar proof as that for [17] (Theorem 3.3).
Now we consider iid samples of ξ and Y. For convenience, we assume that the samples drawn from ξ and Y have the same sample size N. The N iid samples of ξ are ξ^1, ξ^2, · · · , ξ^N and the N iid samples of Y are Y^1, Y^2, · · · , Y^N. Then, we have the following empirical distributions:

P_{ξ,N} := (1/N) ∑_{i=1}^N δ_{ξ^i}, P_{Y,N} := (1/N) ∑_{i=1}^N δ_{Y^i}.

With these preparations, we can establish the following convergence results.
Theorem 5. Suppose that the conditions of Proposition 1 hold. (i) Then, H(X(P_{ξ,N}, P_{Y,N}), X(P_ξ, P_Y)) → 0 with probability 1, as N → ∞. (ii) If, moreover, the light-tail conditions of Theorem 3 hold for P_ξ and P_Y for some b > k + p − 2 and c > k − 1, then, for ε ∈ (0, 1/2], there exist positive scalars α_1 depending on P_ξ, b, and s; β_1 depending on P_ξ, b, s, and ε; α_2 depending on P_Y and c; and β_2 depending on P_Y, c, and ε, such that

P( H(X(P_{ξ,N}, P_{Y,N}), X(P_ξ, P_Y)) ≥ Lε ) ≤ α_1 exp(−β_1 N) + α_2 exp(−β_2 N),

where L and L̃ are defined in Propositions 1 and 2, respectively. Proof. Part (i) can be proved similarly to Theorem 4, by utilizing Theorem 2 and Proposition 1.
For Part (ii), we have

P( H(X(P_{ξ,N}, P_{Y,N}), X(P_ξ, P_Y)) ≥ Lε ) ≤ P( ζ_{k+p−2}(P_ξ, P_{ξ,N}) ≥ ε/2 ) + P( ζ_{k−1}(P_Y, P_{Y,N}) ≥ ε/2 ) ≤ α_1 exp(−β_1 N) + α_2 exp(−β_2 N),

where the last inequality follows from Theorem 3; α_1 depends on P_ξ, b, and s; β_1 depends on P_ξ, b, s, and ε; α_2 depends on P_Y and c; and β_2 depends on P_Y, c, and ε.
The second and third probability inequalities can be analogously verified and thus we omit the proof here.

Data-Driven DRO Problems with FM Ball
A general stochastic optimization model can be formulated as

min_{x ∈ X} E_P[g(x, ξ)],   (10)

where g : X × Ξ → R, X ⊆ R^n, and Ξ ⊆ R^s is the support set of ξ. The sample average approximation (SAA) method is usually used to solve Problem (10) numerically. The SAA method implicitly assumes that we can generate any number of samples based on P. To better approximate Problem (10), a large sample size is needed [28]. However, in practice, the true probability distribution P cannot be known exactly, and thus we cannot generate a sufficiently large number of samples to make the SAA method well-defined, due to the expensive cost of acquiring more samples. Nevertheless, it is possible to obtain a limited number of samples or scenarios, such as historical data. Under these settings, the data-driven DRO model has been proposed [18,29,30]. The natural idea is to use the partial information to construct an ambiguity set that contains the true probability distribution. As pointed out in [18], under certain conditions this offers powerful out-of-sample performance guarantees.

For further discussion, we denote the limited finite samples by ξ^1, ξ^2, · · · , ξ^N and the corresponding empirical distribution by P_N. Since the number of samples N is limited, we cannot adopt the classical SAA method, which requires that the sample size tend to infinity. However, we can use the limited information to construct a set of probability measures containing the true one, that is, the ambiguity set. In this subsection, we consider the following FM ball-based ambiguity set:

B_r(P_N) := { Q ∈ P(Ξ) : ζ_p(Q, P_N) ≤ r },

where p ≥ 1 and the positive constant r stands for the confidence parameter determined by the decision maker. Then, the data-driven DRO counterpart of Problem (10) with the FM ball-based ambiguity set is:

min_{x ∈ X} sup_{Q ∈ B_r(P_N)} E_Q[g(x, ξ)].   (11)

It is common to see the Wasserstein ball used to build the ambiguity set, for example in [18]. To further explain the reasonableness of and motivation for adopting the FM metric, we offer the following comments.

Remark 3.
As we know, a key issue for DRO problems is how to build the ambiguity set. Different kinds of ambiguity sets have been proposed, such as moment information [31], ζ-ball [32], and so on. Of course, the FM metric, as a specific case of the ζ-structure probability metric, can be employed to construct the ambiguity set.
More importantly, the decision maker can utilize the limited empirical distribution P_N to obtain an approximate optimal value, say v(P_N). By prior experience, the decision maker usually has some confidence, measured by a deviation constant r̃ > 0, that the true optimal value, denoted by v(P), lies in the interval [v(P_N) − r̃, v(P_N) + r̃]. Frequently, g(x, ξ) is locally Lipschitz continuous in the following sense:

|g(x, ξ) − g(x, ξ′)| ≤ L max{1, ‖ξ‖, ‖ξ′‖}^{p−1} ‖ξ − ξ′‖ for all ξ, ξ′ ∈ Ξ,

for some positive constant L. A typical example is that g(x, ξ) is the objective function of a two-stage stochastic programming problem, in which case p = 2 (see [11], Proposition 3.2). Then, we have the quantitative relationship

|v(P) − v(P_N)| ≤ L ζ_p(P, P_N).

Therefore, it is reasonable for the decision maker to consider the ambiguity set {Q ∈ P(Ξ) : ζ_p(Q, P_N) ≤ r := r̃/L}. Moreover, since ζ_p(P, P_N) → 0 with probability 1 as N → ∞ (see Theorem 2), P is included in {Q ∈ P(Ξ) : ζ_p(Q, P_N) ≤ r} for suitable N and r.
Finally, we have ζ_p(P, Q) ≥ D_W(P, Q), with equality if p = 1. Thus,

{ Q ∈ P(Ξ) : ζ_p(Q, P_N) ≤ r } ⊆ { Q ∈ P(Ξ) : D_W(Q, P_N) ≤ r }.

This tells us that the ambiguity set constructed by the FM ball is tighter than that constructed with the Wasserstein ball.
All these arguments motivate us to consider the data-driven DRO problem with the FM ball-based ambiguity set.
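For intuition (our illustration, with hypothetical data): when p = 1 the FM ball coincides with the Wasserstein ball, and if the candidate distributions live on a fixed finite support, the inner supremum of the data-driven DRO problem is a linear program in a transport plan π, where the row sums of π recover P_N and the column sums define the worst-case distribution Q.

```python
# Sketch: worst-case expectation over a 1-Wasserstein (= first order FM) ball
# around an empirical distribution on a fixed finite support, as an LP.
import numpy as np
from scipy.optimize import linprog

def worst_case_expectation(support, p_hat, losses, radius):
    support = np.asarray(support, float)
    n = len(support)
    cost = np.abs(support[:, None] - support[None, :])  # |xi_i - xi_j|
    # Variables pi_ij >= 0; maximize sum_j losses_j * (sum_i pi_ij).
    obj = -np.tile(np.asarray(losses, float), n)        # negate: linprog minimizes
    # Row sums of pi equal the empirical weights p_hat.
    a_eq = np.zeros((n, n * n))
    for i in range(n):
        a_eq[i, i * n:(i + 1) * n] = 1.0
    # Total transport cost at most `radius`.
    a_ub = cost.ravel()[None, :]
    res = linprog(obj, A_ub=a_ub, b_ub=[radius], A_eq=a_eq, b_eq=p_hat,
                  bounds=(0, None), method="highs")
    return -res.fun

# Empirical distribution on {0, 1, 2} with weights (1/2, 1/2, 0); loss g = xi.
wc = worst_case_expectation([0.0, 1.0, 2.0], [0.5, 0.5, 0.0], [0.0, 1.0, 2.0], 0.25)
print(wc)  # 0.75: a transport budget of 0.25 raises E[xi] from 0.5 by 0.25
```

For p > 1 the FM ball is larger than the corresponding transport-cost ball on a fixed support, so this LP value is a conservative (lower) proxy for the FM worst case.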
We use v* and v_N to denote the optimal values of Problems (10) and (11), respectively. To quantify the out-of-sample performance of the data-driven DRO problem (11), we examine the probability

P( E_P[g(x_N, ξ)] ≤ v_N ),

where x_N is any optimal solution of Problem (11). Of course, we hope that, for sufficiently small ε > 0, there exists a finite positive integer N_0 such that

P( E_P[g(x_N, ξ)] ≤ v_N ) ≥ 1 − ε   (12)

for any N ≥ N_0. If P satisfies ζ_p(P, P_N) < r, we have P ∈ B_r(P_N). Thus, for any x ∈ X,

E_P[g(x, ξ)] ≤ sup_{Q ∈ B_r(P_N)} E_Q[g(x, ξ)],

which of course implies that E_P[g(x_N, ξ)] ≤ v_N. From Theorem 3, we have, for any r ∈ (0, 1/2], that P(ζ_p(P, P_N) ≥ r) ≤ ᾱ exp(−β̄(r)N); here we use the notation β̄(r) > 0 to stress the dependence of β̄ on r. Consequently,

P(ζ_p(P, P_N) ≤ r) ≥ P(ζ_p(P, P_N) < r) ≥ 1 − ᾱ exp(−β̄(r)N).
A sufficient condition for (12) is

N ≥ N_0(ε, r) := ⌈ (1/β̄(r)) log(ᾱ/ε) ⌉,   (13)

where ⌈·⌉ stands for rounding up to an integer; indeed, ᾱ exp(−β̄(r)N) ≤ ε for every such N. Sometimes, we use the notation N_0(ε, r) to stress the dependence of N_0 on ε and r. Summarizing the above discussions, we obtain the following so-called finite sample guarantee property (see also [18,30]).

Proposition 3 (Finite sample guarantee). Suppose that the conditions of Theorem 3 hold. Then, for any ε > 0 and r ∈ (0, 1/2], the out-of-sample guarantee (12) holds for all N ≥ N_0(ε, r).
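The sample-size threshold implied by the concentration bound ᾱ exp(−β̄(r)N) ≤ ε is a one-line computation; the sketch below uses hypothetical placeholder values for the constants ᾱ and β̄(r), which in practice come from Theorem 3.

```python
# Sketch: the smallest N with alpha * exp(-beta_r * N) <= eps, i.e.,
# N_0 = ceil(log(alpha / eps) / beta_r).  alpha and beta_r are hypothetical.
import math

def n_zero(alpha: float, beta_r: float, eps: float) -> int:
    """Smallest sample size N satisfying alpha * exp(-beta_r * N) <= eps."""
    return max(1, math.ceil(math.log(alpha / eps) / beta_r))

n0 = n_zero(alpha=2.0, beta_r=0.05, eps=0.01)
print(n0)  # 106 samples suffice at significance level 1% for these constants
assert 2.0 * math.exp(-0.05 * n0) <= 0.01
```

Note that β̄(r) shrinks as the radius r shrinks, so a tighter ambiguity set demands a larger sample size for the same significance level ε.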
Proposition 3 tells us, for a fixed confidence parameter r, at least how large the sample size should be to ensure the significance level ε. Now we slightly modify model (11) and consider the following data-driven DRO problem:

min_{x ∈ X} sup_{Q ∈ B_{r_N}(P_N)} E_Q[g(x, ξ)],   (14)

where r_N > 0 and r_N → 0 as N → ∞. This reflects the natural fact that the decision maker becomes more confident with more information, whereas model (11) emphasizes fixed, limited information. We use v̂_N to denote the optimal value of Problem (14). In what follows, we investigate the asymptotic consistency as N tends to infinity. To this end, we need the following lemma.
Proof. We prove by contradiction; that is, we assume that the assertion fails. This implies that there exists a subset Ω̃ ⊆ Ω with P(Ω̃) > 0 such that the claimed convergence fails for every ω̃ ∈ Ω̃. Define the sequence {Ω_N} accordingly for N ∈ N. Obviously, according to (15), we have P(Ω_N) ≤ κ_N, which implies that ∑_{N=1}^∞ P(Ω_N) ≤ ∑_{N=1}^∞ κ_N < +∞. Then, we can always choose a sufficiently large Ñ ∈ N such that ∑_{N≥Ñ} P(Ω_N) < P(Ω̃). Choosing an ω̃ ∈ Ω̃ \ ∪_{N≥Ñ} Ω_N then contradicts the definition of Ω̃. We complete the proof.
The following proposition states that the optimal value and optimal solution set of the data-driven DRO problem (14) converge to those of the original Problem (10), which verifies the reasonability of our data-driven DRO model (14).
Proposition 4 (Asymptotic consistency). Suppose that g(x, ξ) is locally Lipschitz continuous in the following sense:

|g(x, ξ) − g(x, ξ′)| ≤ L max{1, ‖ξ‖, ‖ξ′‖}^{p−1} ‖ξ − ξ′‖ for all ξ, ξ′ ∈ Ξ,

for every x ∈ X, where L > 0 and p ≥ 1. Let x_N be any optimal solution of Problem (14). Then, the following assertions hold: (i) |v̂_N − v*| → 0 almost surely as N → ∞; (ii) if, moreover, X is closed, g(·, ξ) is lower semicontinuous for every ξ ∈ Ξ, and g(x, ξ) dominates some P-integrable function uniformly with respect to x ∈ X, then any accumulation point of {x_N} is an optimizer of Problem (10) almost surely.

Proof. Part (i): Notice that
where x* and x_N are any optimal solutions of Problems (10) and (14), respectively. For the first term on the right-hand side, we have sup_{Q ∈ B_{r_N}(P_N)} almost surely as N → ∞, where the first inequality is due to the definition of the supremum for some δ_N > 0 with δ_N → 0 almost surely and Q_N ∈ B_{r_N}(P_N); the second inequality follows from the definition of the FM metric. Similarly, we can derive almost surely, as N → ∞. Thus, we obtain that |v̂_N − v*| → 0 almost surely as N → ∞.
Part (ii): Without loss of generality, assume in the following discussion that x_N → x̄ with probability 1 as N → ∞. Moreover, we select a sequence {ε_k} with ε_k ∈ (0, 1) and ∑_{k=1}^∞ ε_k < ∞. According to (12), for each pair (ε_k, r_k) with r_k defined in (14), we can select We know from Lemma 6 and assertion (i) that Then, the following inequalities hold almost surely: where (a) follows from x̄ ∈ X due to the closedness of X; (b) follows from the lower semicontinuity of g(·, ξ) for every ξ ∈ Ξ; (c) is due to Fatou's lemma; (d) follows from (16).

Remark 4.
Propositions 3 and 4 establish the finite sample guarantee and the asymptotic consistency, which are two desirable properties of the data-driven DRO problem [18,30]. Different from the existing results in [18], where the Wasserstein ball is used to construct the ambiguity set, we adopt the FM ball. Owing to the features of the Wasserstein metric, to ensure the existence of the significance parameter ε, they explicitly derived the radius r(ε, N), depending on ε and N, and the finite sample size N_0(ε), depending only on ε. In Proposition 3, we view both r and ε as parameters because β couples with ε implicitly in Theorem 3. Moreover, the assumptions for the asymptotic consistency (Proposition 4) differ from those in [18] (Theorem 3.6), where upper semicontinuity and linear growth were employed; here we use local Lipschitz continuity together with a weaker lower-bound assumption. Specifically, Ref. [18] (Theorem 3.6) employs the Borel-Cantelli lemma to obtain P{E_P[g(x_N, ξ)] ≤ v̂_N for sufficiently large N} = 1.
This is not applicable in our case, so we need Lemma 6.

Discrete Approximation for DRO Problems with General Moment Information
We consider the following general DRO problem: where X ⊆ R^n is a compact set, h : X × Ξ → R, P := {P ∈ P(Ξ) : E_P[Γ(ξ)] ∈ K}, K is a closed and convex set in the Cartesian product of some finite dimensional vector and/or matrix spaces, and Γ is a general mapping on Ξ. We implicitly assume that, for each x ∈ X, E_P[h(x, ξ)] < +∞ for all P ∈ P. The ambiguity set P above is very general, and it covers almost all the available ambiguity sets with moment information (see, e.g., [19], Examples 3-5). Zhang et al. [33] discussed the quantitative stability of the DRO problem with a general moment information ambiguity set. There are usually two ways to numerically solve Problem (17): one is to use some kind of duality argument to reformulate Problem (17) as a tractable problem [18,31]; the other is to discretize the ambiguity set, which leads to a saddle point problem in a finite dimensional space [19]. For instance, the discrete approximation in [19] is conducted under a bounded support set. In this part, by employing our results in Section 3, we consider the discrete approximation for Problem (17) under weaker conditions. Denote by P̂_N the collection of all discrete distributions with at most N support points, that is, We define the discrete approximation of P as Obviously, P_N ⊆ P. Then, the discrete approximation of Problem (17) can be written as We use v(P) and S(P) to denote the optimal value and optimal solution set of Problem (17); v(P_N) and S(P_N) are the optimal value and optimal solution set of Problem (18). For the discrete approximation to make sense, we expect Problem (18) to approximately solve Problem (17) when N is sufficiently large.
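To make the discretized inner problem concrete, here is a minimal sketch for a scalar ξ with a single mean constraint E_Q[ξ] = μ over a fixed set of atoms: the resulting linear program attains its maximum at a distribution supported on at most two atoms, which we can simply enumerate. The atoms, the function h, and μ below are illustrative choices, not data from the paper.

```python
def worst_case_expectation(atoms, h, mu):
    """Inner problem of a discretized DRO with scalar support and one
    mean constraint: maximize sum_i q_i * h(atom_i) over probability
    vectors q with sum_i q_i * atom_i = mu.  The LP's extreme points
    put mass on at most two atoms straddling mu, so we enumerate pairs.
    Illustrative only; returns -inf if the constraint is infeasible."""
    best = float("-inf")
    for i, a in enumerate(atoms):
        if a == mu:
            best = max(best, h(a))
        for b in atoms[i + 1:]:
            lo, hi = (a, b) if a <= b else (b, a)
            if lo <= mu <= hi and hi > lo:
                w = (mu - lo) / (hi - lo)   # mass placed on hi
                best = max(best, (1 - w) * h(lo) + w * h(hi))
    return best

atoms = [0.0, 1.0, 2.0, 3.0]
print(worst_case_expectation(atoms, lambda x: (x - 1.5) ** 2, 1.5))
```

For the convex h above, the worst case splits its mass between the two extreme atoms 0 and 3, which is the expected behavior of moment-constrained ambiguity sets.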
Proof. Since h(x, ξ) ≥ g(ξ) and h(·, ξ) is lower semicontinuous for each ξ ∈ Ξ, we have from Fatou's lemma that holds for any {x_k} ⊆ X such that lim_{k→∞} x_k = x̄. This implies that E_P[h(·, ξ)] is lower semicontinuous. According to [34] (Lemma 4.1), sup_{P∈P} E_P[h(·, ξ)] is lower semicontinuous too. This, together with the compactness of X, ensures that S(P) ≠ ∅. Similarly, we can prove that S(P_N) ≠ ∅. Note that where (a) follows from the fact that P_N ⊆ P; (b) is due to the definition of the pth order FM metric. Finally, based on the first assertion, the inclusion for the optimal solution sets can be derived analogously to that in [11].
For simplicity, as well as to show the linear relationship more clearly, we write E_P[Γ(ξ)] as ⟨P, Γ(ξ)⟩ in what follows. We need the following technical assumption to proceed.
The following theorem states that the discrete approximation ambiguity set P_N converges to P as N → ∞ in the sense of FM metrics. Proof. For any P ∈ P, by the triangle inequality, we have where P_N is the empirical distribution of P with N samples. Since P_N ∈ P̂_N, we know from Proposition 6 that for N ≥ N̂(δ, ω) and almost every ω ∈ Ω. Thus, we have Subsequently, For the first term on the right-hand side, the definition of the supremum, the boundedness of P, and Theorem 2 give rise to with probability 1, where {P^k} ⊆ P is a sequence such that sup_{P∈P} ζ_p(P, P_N) ≤ ζ_p(P^k, P^k_N) + ε_k, and {ε_k} is a positive sequence with ε_k → 0. Thus, we obtain lim_{N→∞} sup_{P∈P} ζ_p(P, P_N) = 0 with probability 1. Analogously, by the law of large numbers, we can derive that lim_{N→∞} sup_{P∈P} |⟨P, Γ(ξ)⟩ − ⟨P_N, Γ(ξ)⟩| = 0 with probability 1. Then, we complete the proof.
The following corollary shows the reasonability of approximating Problem (17) by Problem (18), with probability 1, as N → ∞.

Remark 5.
In this subsection, we investigated the discrete approximation of the DRO problem with a general moment information ambiguity set. Compared with the existing work [19], we weaken the required assumptions and extend the results to a more general setting. Firstly, the Lipschitz continuity of the objective function is required in [19] (Theorem 14) due to the adoption of the Wasserstein metric, so that the upper bound between the discrete approximation of the DRO problem and the original DRO problem can be derived [19] (Proposition 7); we only require local Lipschitz continuity. More importantly, they restricted attention to the bounded support set case because the upper bound in [19] (Proposition 7) becomes infinite, and hence ill defined, when the support set is unbounded. By employing our convergence results in Section 3, our support set is allowed to be unbounded.

Concluding Remarks
In this study, we investigated different kinds of convergence assertions about data-driven FM metrics and their possible applications. In view of the rich results on Wasserstein metrics (Lemmas 2 and 3), we first established the relationship between the FM metric and the Wasserstein metric (Lemma 4). Based on these results, the non-asymptotic moment estimate (Theorem 1), the asymptotic convergence estimate (Theorem 2), and the non-asymptotic concentration estimate (Theorem 3) for FM metrics were presented. These convergence assertions were then applied to the asymptotic analyses of the empirical approximations of four kinds of stochastic optimization problems. The results demonstrate both the motivation and the significance of this study.
There are still some topics to settle in the future; for example, we leave the numerical tractability of the results in Sections 4.3 and 4.4 for future work.
Then, we continue where h : M_R → R is defined by h(·) := g(·)/R^{p−1}. It is easy to see that h is Lipschitz continuous on M_R with Lipschitz modulus 1. Based on Lemma 1, we can extend h to R^s, and its restriction to Ξ is denoted by h̃(·). Then, h̃(·) is Lipschitz continuous on Ξ with Lipschitz modulus 1. Thus, we can continue So, for any ξ ∈ Ξ, we have Similarly, this means that |h̃(ξ)| ≤ 2‖ξ‖ for any ξ ∈ M_R^c. Then, we continue The proof is complete.
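The extension step invoked here is, assuming Lemma 1 is of the standard McShane-Whitney type, the classical construction by which a 1-Lipschitz function on M_R extends to all of R^s with the same modulus:

```latex
\tilde h(x) \;=\; \inf_{y \in M_R}\bigl\{\, h(y) + \lVert x - y \rVert \,\bigr\},
\qquad x \in \mathbb{R}^{s}.
```

One checks directly that h̃ = h on M_R and that h̃ inherits the Lipschitz modulus 1 from h, since the infimum of a family of 1-Lipschitz functions of x is again 1-Lipschitz.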

Proof of Theorem 2.
To prove this assertion, we need to verify that, for any ε > 0, there exists a positive number N(ε, ω) such that ζ_p(P, P_N) ≤ ε (A1) for N ≥ N(ε, ω) and almost every ω ∈ Ω. Notice from Lemma 4 that ζ_p(P, P_N) ≤ R^{p−1} D_W(P, P_N) + 4 ∫_{M_R^c} ‖ξ‖^p (P + P_N)(dξ) for sufficiently large R. We can deduce from P ∈ P_p(Ξ) and Lemma 2 that for R ≥ R(ε) and N ≥ N_1, with probability 1.
On the other hand, we know from the Glivenko-Cantelli theorem [35] that D_W(P, P_N) → 0 as N → ∞ with probability 1, which implies that there exists a positive number N_2 := N_2(ε, ω) such that when N ≥ N_2.
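The Glivenko-Cantelli-type convergence D_W(P, P_N) → 0 can be observed numerically. The sketch below approximates the 1-Wasserstein distance between the empirical distribution of an iid Uniform(0,1) sample and its parent by matching order statistics to uniform quantiles; this quantile-matching shortcut is an illustrative one-dimensional approximation, not a construction from the paper.

```python
import random

def w1_to_uniform(sample):
    """Approximate D_W between the empirical distribution of `sample`
    and Uniform(0,1): match the i-th order statistic to the uniform
    quantile (i + 0.5)/N and average the absolute gaps."""
    xs = sorted(sample)
    n = len(xs)
    return sum(abs(x - (i + 0.5) / n) for i, x in enumerate(xs)) / n

rng = random.Random(42)
small = w1_to_uniform([rng.random() for _ in range(10)])
large = w1_to_uniform([rng.random() for _ in range(10000)])
print(small, large)  # the distance shrinks as N grows
```

The observed decay (roughly of order N^{-1/2} in one dimension) is consistent with the empirical convergence rates discussed in Section 3.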
Proof of Theorem 3. We know from Lemma 4 that ζ_p(P, P_N) ≤ R^{p−1} D_W(P, P_N) + 4 ∫_{M_R^c} ‖ξ‖^p (P + P_N)(dξ).
For the first term, we know from (3) that P{R^{p−1} D_W(P, P_N) ≥ ε/2} = P{D_W(P, P_N) ≥ ε/(2R^{p−1})} In what follows, we consider the estimation of the second term: P{4 ∫_{M_R^c} ‖ξ‖^p (P + P_N)(dξ) ≥ ε/2}.
Since P ∈ P_p(Ξ), we can choose a sufficiently large R = R(ε) such that ∫_{M_R^c} ‖ξ‖^p P(dξ) ≤ ε/32.
Then, we have Furthermore, according to Cramér's large deviation theorem, we have where I(·) is the so-called (large deviations) rate function defined as and where the last inequality follows from Assumption 1 with b > p. We know from [28] (Section 7.2.9) that M(t) is positive, convex, and infinitely differentiable in the interior of its domain. This means that log M(t) is also convex and infinitely differentiable in the interior of its domain, which coincides with the domain of M(t).
Since M(t) is finite on [−1, 1], M(t) is differentiable on (−1, 1). Note that Then, its derivative, given by (A7), is larger than 0 at t = 0. By differentiability, and hence continuity, there exists a sufficiently small 0 < t̂ ≤ 1 such that (A7) is larger than 0 for any t ∈ [0, t̂]. Then, for any t ∈ (0, t̂], we have Therefore, we obtain that I(ε/16) is positive. Finally, we obtain P(ζ_p(P, P_N) ≥ ε) ≤ α exp(−βN)
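As a numerical sanity check on the rate function, the sketch below evaluates the Legendre transform I(x) = sup_t (tx − log M(t)) on a grid of t values. For a standard normal, log M(t) = t²/2, so I(x) = x²/2 in closed form; the grid-based supremum is an illustrative approximation only, not part of the proof.

```python
def rate_function(x, log_mgf, t_grid):
    """Legendre transform I(x) = sup_t (t*x - log M(t)), evaluated on
    a finite grid, as in Cramer's theorem.  Assumes log_mgf is finite
    on the whole grid."""
    return max(t * x - log_mgf(t) for t in t_grid)

# standard normal: log M(t) = t^2 / 2, so I(x) = x^2 / 2
grid = [i / 1000 for i in range(-3000, 3001)]
I = rate_function(1.0, lambda t: t * t / 2, grid)
print(I)  # close to 0.5
```

Positivity of I at the relevant threshold (here I(ε/16) in the proof) is exactly what makes the exponential bound α exp(−βN) non-trivial.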