Survey of Distances between the Most Popular Distributions

Abstract: We present a number of upper and lower bounds for the total variation distances between the most popular probability distributions. In particular, some estimates of the total variation distances in the cases of multivariate Gaussian distributions, Poisson distributions, binomial distributions, between a binomial and a Poisson distribution, and also in the case of negative binomial distributions are given. Next, estimates of the Lévy–Prokhorov distance in terms of the Wasserstein metrics are discussed, and the Fréchet, Wasserstein and Hellinger distances for multivariate Gaussian distributions are evaluated. Some novel context-sensitive distances are introduced, and a number of bounds mimicking the classical results from information theory are proved.


Introduction
Measuring a distance, whether in the sense of a metric or a divergence, between two probability distributions (PDs) is a fundamental endeavor in machine learning and statistics [1]. We encounter it in clustering, density estimation, generative adversarial networks, image recognition and just about any field that takes a statistical approach towards data. The most popular case is measuring the distance between multivariate Gaussian PDs, but other examples, such as the Poisson, binomial and negative binomial distributions, frequently appear in applications too. Unfortunately, the available textbooks and reference books do not present them in a systematic way. Here, we make an attempt to fill this gap. For this aim, we review the basic facts about the metrics for probability measures and provide specific formulae and simplified proofs that cannot easily be found in the literature. Many of these facts may be considered scientific folklore: known to experts but not presented in any regular way in the established sources. (A tale becomes folklore when it is passed down and whispered around; the second half of the word, lore, comes from Old English lār, i.e., 'instruction'.) The basic reference for the topic is [2], and, in recent years, the theory has achieved substantial progress. A selection of recent publications on stability problems for stochastic models may be found in [3], but not much attention is devoted there to the relationships between different metrics useful in specific applications. Hopefully, this survey helps to make this treasure more accessible and easier to handle.
The rest of the paper proceeds as follows: In Section 2, we define the total variation, Kolmogorov–Smirnov, Jensen–Shannon and geodesic metrics. Section 3 is devoted to the total variation distance for 1D Gaussian PDs. In Section 4, we survey a variety of different cases: Poisson, binomial, negative-binomial, etc. In Section 5, the total variation bounds for multivariate Gaussian PDs are presented, and they are proved in Section 6. In Section 7, estimates of the Lévy–Prokhorov distance in terms of the Wasserstein metrics are presented. The Gaussian case is thoroughly discussed in Section 8. In Section 9, the relatively new topic of distances between measures of different dimensions is briefly discussed. Finally, in Section 10, new context-sensitive metrics are introduced and a number of inequalities mimicking the classical bounds from information theory are proved.

The Most Popular Distances
The most interesting metrics on the space of probability distributions are the total variation (TV), Lévy–Prokhorov and Wasserstein distances. We will also discuss the Fréchet, Kolmogorov–Smirnov and Hellinger distances. Let us remind readers that, for probability measures P, Q with densities p, q w.r.t. a reference measure ν, the total variation distance is TV(P, Q) = sup_A |P(A) − Q(A)| = (1/2) ∫ |p − q| dν. We need the coupling characterization of the total variation distance. For two distributions, P and Q, a pair (X, Y) of random variables (r.v.) defined on the same probability space is called a coupling for P and Q if X ∼ P and Y ∼ Q. Note the following fact: there exists a (maximal) coupling (X, Y) such that P(X ≠ Y) = TV(P, Q). Therefore, for any measurable function f, we have TV(f(X), f(Y)) ≤ TV(X, Y), with equality if f is invertible.
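The coupling characterization is easy to see numerically. A minimal sketch (the three-point distributions below are invented for illustration): for discrete P, Q, the diagonal of a maximal coupling carries the overlap mass Σ min(p_i, q_i), so P(X ≠ Y) coincides with the TV distance.

```python
import numpy as np

# Two illustrative distributions on the three-point space {0, 1, 2}.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# Total variation distance: TV(P, Q) = (1/2) * sum_i |p_i - q_i|.
tv = 0.5 * np.abs(p - q).sum()

# Maximal coupling: place mass min(p_i, q_i) on the diagonal (X = Y)
# and spread the residual mass off the diagonal.  Then P(X != Y) = TV(P, Q).
overlap = np.minimum(p, q).sum()
prob_x_neq_y = 1.0 - overlap

print(tv, prob_x_neq_y)  # both equal 0.3
```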
In the one-dimensional case, the Kolmogorov–Smirnov distance is useful (only for probability measures on R): Kolm(P, Q) = sup_{x ∈ R} |F_P(x) − F_Q(x)|, where F_P, F_Q are the corresponding CDFs. Suppose X ∼ P, Y ∼ Q, and Y has a density w.r.t. the Lebesgue measure bounded by a constant C. Then, Kolm(P, Q) ≤ √(2C Wass_1(P, Q)). Here, Wass_1(P, Q) is the Wasserstein-1 distance defined in Section 7. Let X_1, X_2 be random variables with the probability density functions p, q, respectively. Define the Kullback–Leibler (KL) divergence KL(P_X1 || P_X2) = ∫ p(x) ln(p(x)/q(x)) dx.
The total variation distance and the Kullback–Leibler (KL) divergence appear naturally in statistics. Say, in testing the binary hypothesis H_0: X ∼ P versus H_1: X ∼ Q, the sum of the errors of both types satisfies inf_d [α(d) + β(d)] = 1 − TV(P, Q), with the infimum taken over all reasonable decision rules d. Moreover, when minimizing the probability of a type-II error subject to a type-I error constraint, the optimal test guarantees that the probability of a type-II error decays exponentially in the sample size n, at a rate given by the KL divergence, in view of Sanov's theorem. A similar picture holds in the case of selecting between M ≥ 2 distributions. The KL divergence is not symmetric and does not satisfy the triangle inequality. However, it gives rise to the so-called Jensen–Shannon divergence [4] JS(P, Q) = (1/2) D(P||R) + (1/2) D(Q||R) with R = (1/2)(P + Q), whose square root is a metric. It is a lower bound for the total variation distance: 0 ≤ JS(P, Q) ≤ TV(P, Q).
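As a quick sanity check on the chain 0 ≤ JS ≤ TV, the following sketch (the distributions and helper name are illustrative) computes both quantities for a pair of discrete distributions:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats for discrete vectors."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
r = 0.5 * (p + q)  # the mixture midpoint

tv = 0.5 * np.abs(p - q).sum()          # total variation distance
js = 0.5 * kl(p, r) + 0.5 * kl(q, r)    # Jensen-Shannon divergence

print(tv, js)  # JS does not exceed TV
```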
The Jensen-Shannon metric is not easy to compute in terms of covariance matrices in the multi-dimensional Gaussian case.
The proof is sketched in Section 6. The upper bound is based on the following.
Proposition 2 (Pinsker's inequality). Let X_1, X_2 be random variables with the probability density functions p, q, and the Kullback–Leibler divergence KL(P_X1 || P_X2). Then, for τ(X_1, X_2) = TV(X_1, X_2), τ(X_1, X_2) ≤ √( KL(P_X1 || P_X2)/2 ). Proof of Pinsker's inequality. We need the following pointwise bound: 3(p − q)^2 ≤ (2p + 4q)(p ln(p/q) − p + q). (12) If P and Q are singular, then KL = ∞ and Pinsker's inequality holds true. Assume P is absolutely continuous w.r.t. Q. In view of the representation τ = (1/2)∫|p − q| and the Cauchy–Schwarz inequality, (∫|p − q|)^2 ≤ ∫(2p + 4q)/3 · ∫(p ln(p/q) − p + q) = 2 KL(P_X1 || P_X2). To check (12), define x = p/q and g(x) = (2x + 4)(x ln x − x + 1) − 3(x − 1)^2; then g(1) = 0 and g′(x) = 4[(x + 1) ln x − 2(x − 1)], which is nonpositive for x ≤ 1 and nonnegative for x ≥ 1, so g ≥ 0. [Mark S. Pinsker was invited to be the Shannon Lecturer at the 1979 IEEE International Symposium on Information Theory but could not obtain permission at that time to travel to the symposium. However, he was officially recognized by the IEEE Information Theory Society as the 1979 Shannon Award recipient.]
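A quick numerical stress test of Pinsker's inequality on random, strictly positive discrete distributions (an illustrative sketch; the names and parameters are ad hoc):

```python
import numpy as np

rng = np.random.default_rng(0)

def pinsker_holds(p, q):
    """Check TV(p, q) <= sqrt(KL(p||q)/2) for strictly positive vectors."""
    tv = 0.5 * np.abs(p - q).sum()
    kl = float(np.sum(p * np.log(p / q)))
    return tv <= np.sqrt(0.5 * kl) + 1e-12

ok = True
for _ in range(1000):
    a, b = rng.random((2, 5)) + 0.05   # entries bounded away from zero
    if not pinsker_holds(a / a.sum(), b / b.sum()):
        ok = False
print(ok)  # no counterexample found
```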
For one-dimensional Gaussian distributions, KL(N(μ_1, σ_1^2) || N(μ_2, σ_2^2)) = ln(σ_2/σ_1) + (σ_1^2 + (μ_1 − μ_2)^2)/(2σ_2^2) − 1/2. In the multi-dimensional Gaussian case, KL(N(μ_1, Σ_1) || N(μ_2, Σ_2)) = (1/2)[tr(Σ_2^{−1}Σ_1) + (μ_2 − μ_1)^T Σ_2^{−1}(μ_2 − μ_1) − d + ln(det Σ_2/det Σ_1)]. Next, define the Hellinger distance H(P, Q)^2 = (1/2) ∫ (√p − √q)^2 dν = 1 − ∫ √(pq) dν and note that, for one-dimensional Gaussian distributions, H^2 = 1 − √(2σ_1σ_2/(σ_1^2 + σ_2^2)) exp(−(μ_1 − μ_2)^2/(4(σ_1^2 + σ_2^2))). For multi-dimensional Gaussian PDs with δ = μ_1 − μ_2, H^2 = 1 − [det(Σ_1)^{1/4} det(Σ_2)^{1/4} / det((Σ_1 + Σ_2)/2)^{1/2}] exp(−δ^T ((Σ_1 + Σ_2)/2)^{−1} δ/8). In fact, the following inequalities hold: H^2(X, Y) ≤ τ(X, Y) ≤ H(X, Y)√(2 − H^2(X, Y)), where χ^2(X, Y) = ∫ (p − q)^2/q dx denotes the chi-squared divergence. These inequalities are not sharp. For example, the Cauchy–Schwarz inequality immediately implies τ(X, Y) ≤ (1/2)√(χ^2(X, Y)). There are also reverse inequalities in some cases.
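The closed-form 1D Gaussian Hellinger expression can be checked against direct numerical integration (a sketch using SciPy; the parameter values are arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0

# Closed-form squared Hellinger distance between N(mu1, s1^2) and N(mu2, s2^2).
h2_closed = 1.0 - np.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) * np.exp(
    -(mu1 - mu2) ** 2 / (4 * (s1**2 + s2**2))
)

# Direct evaluation: H^2 = 1 - integral of sqrt(p(x) * q(x)) dx.
bc, _ = quad(
    lambda x: np.sqrt(norm.pdf(x, mu1, s1) * norm.pdf(x, mu2, s2)), -40, 40
)
h2_numeric = 1.0 - bc

print(h2_closed, h2_numeric)  # the two values agree
```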

Proposition 3 (Le Cam's inequalities).
The following inequality holds: Therefore, by the Cauchy–Schwarz inequality: where ∆ is small enough. Let r < d, and let A be an r × d semi-orthogonal matrix, AA^T = I_r. Define τ := τ(AX, AY). Then, Proof. In view of Le Cam's inequalities, it is enough to evaluate η^2. Note that all r eigenvalues of AΣA^T are positive. [Ernst Hellinger was imprisoned in Dachau but was released through the intervention of influential friends and emigrated to the US.]

Bounds on the Total Variation Distance
This section is devoted to the basic examples and is partially based on [5]. However, it includes more proofs and additional details. (Figure: TV distances between Pois(1) and Pois(1 + a), and between binomial distributions. (a) The upper bound becomes useless for p_2 − p_1 ≥ 0.07. (b) Blue and orange curves show the exact TV distance: the blue curve works for 1 ≤ λ_2/λ_1 ≤ 2 and the orange curve for 2 ≤ λ_2/λ_1 ≤ 4. The linear upper bound (red curve) is not relevant, and the square-root upper bound (green curve) becomes useless for λ_2/λ_1 ≥ 4.)

Proposition 4 (Distances between exponential distributions). (a) Let X
(23) Given y > 0, the area of an
Proof. Let us prove the following inequality: where p = p_1, p + x = p_2 and q = 1 − p. By concavity of the logarithm, given p ∈ (0, 1), this gives the bound np_1 ≤ l as follows: (32) On the other hand, as h(0) = 0 and h′(x) = ln(1 + x/p) − ln(1 − x/q) has a constant sign on the relevant interval, this implies the bound l ≤ np_2. Indeed: The rest of the solution goes in parallel with that of Proposition 5. Equation (27) is replaced with the following relation: if S_n(p) ∼ Bin(n, p), then, In fact, iterated integration by parts transforms the RHS of (35) into the LHS of (35).
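The location of the crossing point l between the two binomial pmfs, which the proof shows satisfies np_1 ≤ l ≤ np_2, is easy to confirm numerically; the parameters below are illustrative:

```python
import numpy as np
from scipy.stats import binom

n, p1, p2 = 50, 0.3, 0.4
k = np.arange(n + 1)
f1 = binom.pmf(k, n, p1)
f2 = binom.pmf(k, n, p2)

# Exact total variation distance between Bin(n, p1) and Bin(n, p2).
tv = 0.5 * np.abs(f1 - f2).sum()

# The pmf ratio f2/f1 is increasing in k, so f2 >= f1 from some index l on.
l = int(np.argmax(f2 >= f1))
print(tv, n * p1, l, n * p2)  # l lies between n*p1 and n*p2
```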

Proposition 7 (Distance between binomial and Poisson distributions).
Alternative bound. A stronger form of (39) is available, where S_n(u) ∼ Bin(n, u).
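Proposition 7 concerns the regime in which Bin(n, p) is close to Pois(np). A standard benchmark here (not restated above) is Le Cam's bound TV(Bin(n, p), Pois(np)) ≤ np^2, which the following sketch confirms numerically for illustrative parameters:

```python
import numpy as np
from scipy.stats import binom, poisson

n, p = 100, 0.03
k = np.arange(n + 1)   # full support of the binomial

# TV distance between Bin(n, p) and Pois(np); the Poisson tail beyond n
# is negligible for these parameters, and truncation only lowers tv.
tv = 0.5 * np.abs(binom.pmf(k, n, p) - poisson.pmf(k, n * p)).sum()

lecam = n * p * p   # Le Cam's upper bound n * p^2
print(tv, lecam)
```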
Suppose r < d, and we want to find a low-dimensional projection A ∈ R^{r×d}, AA^T = I_r, of the multidimensional data X ∼ N(μ_1, Σ_1) and Y ∼ N(μ_2, Σ_2) such that TV(AX, AY) → max. The problem may be reduced to the case μ_1 = μ_2 = 0, Σ_1 = I_n, Σ_2 = Σ, cf. [6]. In view of (44), it is natural to maximize the sum of g(γ_i), where g(x) = 1/x − 1/2 and γ_i are the eigenvalues of AΣA^T. Consider all permutations π of these eigenvalues. Then, the rows of the matrix A should be selected as the normalized eigenvectors of Σ associated with the eigenvalues γ_i.
For the optimization procedure in (47), the following result is very useful.

Estimation of Lévy-Prokhorov Distance
Let P_i, i = 1, 2, be probability distributions on a metric space W with metric r. Define the Lévy–Prokhorov distance ρ_{L−P}(P_1, P_2) between P_1, P_2 as the infimum of numbers ε > 0 such that, for any closed set C ⊂ W, P_1(C) ≤ P_2(C^ε) + ε, where C^ε stands for the ε-neighborhood of C in the metric r. It can be checked that ρ_{L−P}(P_1, P_2) ≤ τ(P_1, P_2), i.e., the total variation distance. Equivalently, the distance admits a coupling representation over P(P_1, P_2), the set of all joint distributions on W × W with marginals P_i. Next, define the Wasserstein distance W^r_p(P_1, P_2) between P_1, P_2 by W^r_p(P_1, P_2) = inf (E r(X_1, X_2)^p)^{1/p}, with the infimum over all couplings (X_1, X_2) of P_1 and P_2. In the case of a Euclidean space with r(x_1, x_2) = ||x_1 − x_2||, the index r is omitted. The Total Variation, Wasserstein and Kolmogorov–Smirnov distances defined above are stronger than weak convergence (i.e., convergence in distribution, which is weak* convergence on the space of probability measures, seen as a dual space). That is, if any of these metrics go to zero as n → ∞, then we have weak convergence. The converse is not true. However, weak convergence is metrizable (e.g., by the Lévy–Prokhorov metric).
The Lévy–Prokhorov distance is quite tricky to compute, whereas the Wasserstein distance can be found explicitly in a number of cases. Say, in the 1D case W = R^1, we have Theorem 5. For d = 1, W_1(P_1, P_2) = ∫_R |F_1(x) − F_2(x)| dx, where F_i is the CDF of P_i. Proof. First, check the upper bound W_1(P_1, P_2) ≤ ∫_R |F_1(x) − F_2(x)| dx. Then, in view of the Fubini theorem, the integral of |F_1 − F_2| equals the expected cost of the monotone coupling. For the proof of the reverse inequality, see [8].
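Theorem 5 can be checked numerically on empirical measures (a sketch; sample sizes and parameters are arbitrary). SciPy's wasserstein_distance implements W_1, and the CDF-difference integral is evaluated on a grid:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 4000)
y = rng.normal(0.5, 1.0, 4000)

w1_lib = wasserstein_distance(x, y)

# Direct evaluation of the 1D formula: W1 = integral of |F1(t) - F2(t)| dt.
grid = np.linspace(-7.0, 8.0, 30001)
F1 = np.searchsorted(np.sort(x), grid, side="right") / x.size
F2 = np.searchsorted(np.sort(y), grid, side="right") / y.size
w1_cdf = float(np.sum(np.abs(F1 - F2)[:-1] * np.diff(grid)))

print(w1_lib, w1_cdf)  # the two evaluations agree closely
```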
Proposition 10. For d = 1 and p > 1, Proof. It follows from the identity relating the joint CDF of a coupling to its marginals: the minimum is achieved for F̃(x, y) = min[F_1(x), F_2(y)]. For an alternative expression, see [9]; here, ϕ and Φ are the PDF and CDF of the standard Gaussian RV. Note that, in the case μ_X = μ_Y, the first term in (74) vanishes, and the second term gives the answer. We also present expressions for the Fréchet-3 and Fréchet-4 distances. All of these expressions are minimized when Cov(X_j, Y_j), j = 1, . . . , d, are maximal. However, this fact does not lead immediately to explicit expressions for the Wasserstein metrics. The problem here is that the joint covariance matrix Σ_{X,Y} should be positive definite. Thus, the straightforward choice Corr(X_j, Y_j) = 1 is not always possible; see Theorem 6 below and [10].
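For 1D Gaussians, the monotone (quantile) coupling gives the closed form W_2^2 = (μ_X − μ_Y)^2 + (σ_X − σ_Y)^2, which the following sketch verifies via the quantile representation (the parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

muX, sX, muY, sY = 0.0, 1.0, 2.0, 3.0

# Closed form for 1D Gaussians: W2^2 = (muX - muY)^2 + (sX - sY)^2.
w2_closed = np.hypot(muX - muY, sX - sY)

# Quantile representation: W2^2 = integral over (0,1) of (Q1(u) - Q2(u))^2 du,
# evaluated by a midpoint rule.
u = (np.arange(200000) + 0.5) / 200000
qx = norm.ppf(u, loc=muX, scale=sX)
qy = norm.ppf(u, loc=muY, scale=sY)
w2_quantile = float(np.sqrt(np.mean((qx - qy) ** 2)))

print(w2_closed, w2_quantile)
```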
[Maurice René Fréchet (1878-1973), a French mathematician, worked in topology, functional analysis, probability theory and statistics. He was the first to introduce the concept of a metric space (1906) and to prove the representation theorem in L^2 (1907). However, in both cases, the credit was given to other people: Hausdorff and Riesz. Some sources claim that he discovered the Cramér-Rao inequality before anybody else, but such a claim is impossible to verify since the lecture notes of his class appear to be lost. Fréchet worked in several places in France before moving to Paris in 1928. In 1941, he succeeded Borel at the Chair of Calculus of Probabilities and Mathematical Physics at the Sorbonne. In 1956, he was elected to the French Academy of Sciences, at the age of 78, which was rather unusual. He influenced and mentored a number of young mathematicians, notably Fortet and Loève. He was an enthusiast of Esperanto; some of his papers were published in this language.]

Wasserstein Distance in the Gaussian Case
In the Gaussian case, it is convenient to use the following extension of Dobrushin's bound for p = 2. For simplicity, assume that both matrices Σ_1^2 and Σ_2^2 are non-singular (in the general case, the statement holds with Σ_1^{−1} understood as the Moore–Penrose inverse). Then, the L_2-Wasserstein distance W_2(X_1, X_2) satisfies W_2(X_1, X_2)^2 = ||μ_1 − μ_2||^2 + tr[Σ_1^2 + Σ_2^2 − 2(Σ_1 Σ_2^2 Σ_1)^{1/2}], where (Σ_1 Σ_2^2 Σ_1)^{1/2} stands for the positive definite matrix square root. The value (78) is achieved by the explicit linear coupling given in (79). Note that the expression in (79) vanishes when Σ_1^2 = Σ_2^2.
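A sketch of this formula in code, using scipy.linalg.sqrtm for the matrix square root (the covariance matrices below are arbitrary examples):

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_sq(mu1, S1, mu2, S2):
    """W2^2 between N(mu1, S1) and N(mu2, S2):
    |mu1 - mu2|^2 + tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})."""
    r1 = sqrtm(S1)
    cross = np.real(sqrtm(r1 @ S2 @ r1))   # discard tiny imaginary round-off
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2 * cross))

mu = np.zeros(2)
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.eye(2)

d_same = gaussian_w2_sq(mu, S1, mu, S1)   # vanishes for equal covariances
d_diff = gaussian_w2_sq(mu, S1, mu, S2)
print(d_same, d_diff)
```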
Consider, e.g., Σ^2 = [1 ρ; ρ 1] with ρ ∈ (−1, 1). Proof. First, reduce to the case μ_1 = μ_2 = 0 by using the identity W_2^2(X_1, X_2) = ||μ_1 − μ_2||^2 + W_2^2(X_1 − μ_1, X_2 − μ_2). Note that the infimum in (19) is always attained on Gaussian measures, as W_2(X_1, X_2) is expressed in terms of the covariance matrix Σ^2 = Σ^2_{X,Y} only (cf. (81) below). Let us write the covariance matrix in the block form (80), where the so-called Schur complement S = Σ_2^2 − K^T Σ_1^{−2} K. The problem is reduced to finding the matrix K in (80) that minimizes the expression (81) subject to the constraint that the matrix Σ^2 in (80) is positive definite. The goal is to check that the minimum in (81) is achieved when the Schur complement S in (80) equals 0. Consider the fiber σ^{−1}(S), i.e., the set of all matrices K such that σ(K) = S. It is enough to check that the maximum value of tr(K) on this fiber equals (82). Since the matrix S is positive definite, it is easy to check that the fiber S = 0 should be selected. In order to establish (82), represent the positive definite matrix via its eigendecomposition (83), where the diagonal matrix D_r^2 = diag(λ_1^2, . . . , λ_r^2, 0, . . . , 0) and λ_i > 0. Next, U = (U_r | U_{d−r}) is the orthogonal matrix of the corresponding eigenvectors. We obtain an r × r identity whose solutions are parametrized by an orthogonal matrix O_r; the matrix O_r parametrizes the fiber σ^{−1}(S). As a result, we have an optimization problem in a matrix-valued argument O_r, subject to the constraint O_r^T O_r = I_r. A straightforward computation gives the answer tr[(M^T M)^{1/2}], which is equivalent to (82). Technical details can be found in [11,12].

Remark 3.
For general zero-mean RVs X, Y ∈ R^d with the covariance matrices Σ_i^2, i = 1, 2, the following inequality holds [13]:
Example 5 (Wasserstein-2 distance between a Dirac measure on R^m and a discrete measure on R^d). Let y ∈ R^m and μ_1 ∈ M(R^m) be the Dirac measure with μ_1(y) = 1, i.e., all mass centered at y. Let x_1, . . . , x_k ∈ R^d be distinct points, p_1, . . . , p_k ≥ 0, p_1 + . . . + p_k = 1, and let μ_2 ∈ M(R^d) be the discrete measure of point masses with μ_2(x_i) = p_i, i = 1, . . . , k. We seek the Wasserstein distance Ŵ_2(μ_1, μ_2) in closed form. Suppose m ≤ d; then, noting that the second infimum is attained by b = y − Σ_{i=1}^k p_i V x_i and defining C accordingly in the last infimum, let the eigenvalue decomposition of the symmetric positive semidefinite matrix C be C = QΛQ^T with Λ = diag(λ_1, . . . , λ_d); the minimum is attained when V ∈ O(m, d) has row vectors given by the last m columns of Q ∈ O(d).
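In the simplest case m = d (no projection involved), every transport plan must move all of μ_2's mass to the single point y, so Ŵ_2^2 = Σ_i p_i ||x_i − y||^2. A minimal sketch with invented points:

```python
import numpy as np

# Dirac mass at y versus a discrete measure sum_i p_i * delta_{x_i} in R^2.
# The only coupling sends every x_i to y, hence W2^2 = sum_i p_i ||x_i - y||^2.
y = np.array([1.0, 0.0])
xs = np.array([[0.0, 0.0], [2.0, 1.0], [1.0, 3.0]])
ps = np.array([0.5, 0.3, 0.2])

w2_sq = float(np.sum(ps * np.sum((xs - y) ** 2, axis=1)))
print(w2_sq)  # 0.5*1 + 0.3*2 + 0.2*9 = 2.9
```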
Note that the geodesic distance (7) and (8) between Gaussian PDs (or the corresponding covariance matrices) is equivalent to the formula for the Fisher information metric for the multivariate normal model [15]. Indeed, the multivariate normal model is a differentiable manifold, equipped with the Fisher information as a Riemannian metric; this may be used in statistical inference. Example 6. Consider i.i.d. random variables Z_1, . . . , Z_n that are bivariate normally distributed with diagonal covariance matrices, i.e., we focus on the manifold M_diag = {N(μ, Λ) : μ ∈ R^2, Λ diagonal}. In this manifold, consider the submodel M*_diag = {N(μ, σ^2 I) : μ ∈ R^2, σ^2 ∈ R_+} corresponding to the hypothesis H_0: σ_1^2 = σ_2^2. First, consider the standard statistical estimates Z̄ for the mean and s_1, s_2 for the variances. If σ̂^2 denotes the geodesic estimate of the common variance, the squared distance between the initial estimate and the geodesic estimate under the hypothesis H_0 is minimized by σ̂^2 = s_1 s_2. Hence, instead of the arithmetic mean of the initial estimates, we use as an estimate the geometric mean of these quantities.
Finally, we present the distance between symmetric positive definite matrices. The distance is defined as follows: In order to estimate the distance (93), after the simultaneous diagonalization of the matrices A and B, the following classical result is useful:

Context-Sensitive Probability Metrics
The weighted entropy and other weighted probabilistic quantities have generated a substantial amount of literature (see [16,17] and the references therein). The purpose is to introduce a disparity between outcomes of the same probability: in the case of the standard entropy, such outcomes contribute the same amount of information/uncertainty, which is appropriate in context-free situations. However, imagine two equally rare medical conditions, occurring with probability p ≪ 1, one of which carries a major health risk while the other is just a peculiarity. Formally, they provide the same amount of information, − log p, but the value of this information can be very different. The applications of the weighted entropy to clinical trials are under active development (see [18] and the literature cited therein). In addition, the contribution to the distance (say, from a fixed distribution Q) related to these outcomes is the same in any conventional sense. The weighted metrics, built from weight functions, are supposed to fulfill the task of sample graduation, at least to a certain extent.
Let the weight function, or graduation, ϕ > 0 on the phase space X be given. Define the total weighted variation (TWV) distance τ_ϕ(P_1, P_2) = (1/2) sup_{|f| ≤ ϕ} |∫ f d(P_1 − P_2)| = (1/2) ∫ ϕ |p_1 − p_2| dν. Similarly, define the weighted Hellinger distance: let p_1, p_2 be the densities of P_1, P_2 w.r.t. a measure ν; then, H_ϕ(P_1, P_2)^2 = (1/2) ∫ ϕ (√p_1 − √p_2)^2 dν.
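A small discrete sketch of how a graduation ϕ reweights the distance (the weights and probabilities are invented for illustration): the third outcome is "high-stakes", so a discrepancy there contributes ten times more to τ_ϕ than to the ordinary TV distance.

```python
import numpy as np

# Graduation phi on a three-point space; the third outcome is high-stakes.
phi = np.array([1.0, 1.0, 10.0])
p1 = np.array([0.50, 0.45, 0.05])
p2 = np.array([0.50, 0.49, 0.01])

tv = 0.5 * np.abs(p1 - p2).sum()                      # ordinary TV distance
tv_phi = 0.5 * float(np.sum(phi * np.abs(p1 - p2)))   # weighted TV distance

print(tv, tv_phi)  # 0.04 vs 0.22: the weighted metric flags the rare risk
```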

Conclusions
The contribution of the current paper is summarized in Table 1 below. Objects 1-8 belong to the treasures of probability theory and statistics, and we present a number of examples and additional facts that are not easy to find in the literature. Objects 9-10, as well as the distances between distributions of different dimensions, appeared quite recently. They are not fully studied and are quite rarely used in applied research. Finally, objects 11-12 have been recently introduced by the author and his collaborators. This is the field of current and future research.