Survey of Distances between the Most Popular Distributions

Mark Kelbert

doi:10.3390/analytics2010012

Abstract

We present a number of upper and lower bounds for the total variation distances between the most popular probability distributions. In particular, some estimates of the total variation distances in the cases of multivariate Gaussian distributions, Poisson distributions, binomial distributions, between a binomial and a Poisson distribution, and also in the case of negative binomial distributions are given. Next, the estimations of Lévy–Prohorov distance in terms of Wasserstein metrics are discussed, and Fréchet, Wasserstein and Hellinger distances for multivariate Gaussian distributions are evaluated. Some novel context-sensitive distances are introduced and a number of bounds mimicking the classical results from the information theory are proved.

Keywords:

probability distribution; total variation distance; Pinsker’s inequality; Le Cam’s inequalities; Lévy–Prohorov distance; Wasserstein distance; Jensen–Shannon distance; context-sensitive metrics

1. Introduction

Measuring a distance, whether in the sense of a metric or a divergence, between two probability distributions (PDs) is a fundamental endeavor in machine learning and statistics [1]. We encounter it in clustering, density estimation, generative adversarial networks, image recognition and just about any field that undertakes a statistical approach towards data. The most popular case is measuring the distance between multivariate Gaussian PDs, but other examples such as Poisson, binomial and negative binomial distributions, etc., frequently appear in applications too. Unfortunately, the available textbooks and reference books do not present them in a systematic way. Here, we make an attempt to fill this gap. For this aim, we review the basic facts about the metrics for probability measures, and provide specific formulae and simplified proofs that could not be easily found in the literature. Many of these facts may be considered as a scientific folklore known to experts but not represented in any regular way in the established sources. A tale that becomes folklore is one that is passed down and whispered around. The second half of the word, lore, comes from Old English lār, i.e., ‘instruction’. The basic reference for the topic is [2], and, in recent years, the theory has achieved substantial progress. A selection of recent publications on stability problems for stochastic models may be found in [3], but not much attention is devoted to the relationship between different metrics useful in specific applications. Hopefully, this survey helps to make this treasure more accessible and easy to handle.

The rest of the paper proceeds as follows: In Section 2, we define the total variation, Kolmogorov–Smirnov, Jensen–Shannon and geodesic metrics. Section 3 is devoted to the total variation distance for 1D Gaussian PDs. In Section 4, we survey a variety of different cases: Poisson, binomial, negative-binomial, etc. In Section 5, the total variation bounds for multivariate Gaussian PDs are presented, and they are proved in Section 6. In Section 7, the estimations of Lévy–Prohorov distance in terms of Wasserstein metrics are presented. The Gaussian case is thoroughly discussed in Section 8. In Section 9, a relatively new topic of distances between the measures of different dimensions is briefly discussed. Finally, in Section 10, new context-sensitive metrics are introduced and a number of inequalities mimicking the classical bounds from information theory are proved.

2. The Most Popular Distances

The most interesting metrics on the space of probability distributions are the total variation (TV), Lévy–Prohorov and Wasserstein distances. We will also discuss the Fréchet, Kolmogorov–Smirnov and Hellinger distances. Let us recall that, for probability measures

P, Q

with densities

p, q

,

TV (P, Q) = sup_{A \subset R^{d}} | P (A) - Q (A) | = \frac{1}{2} \int_{R^{d}} | p (u) - q (u) | d u

(1)

We need the coupling characterization of the total variation distance. For two distributions,

P

and

Q

, a pair

(X, Y)

of random variables (r.v.) defined on the same probability space is called a coupling for

P

and

Q

if

X \sim P

and

Y \sim Q

. Note the following fact: there exists a coupling

(X, Y)

such that

P (X \neq Y) = TV (P, Q)

. Therefore, for any measurable function f, we have

P (f (X) \neq f (Y)) \leq TV (X, Y)

with equality iff f is invertible.

In a one-dimensional case, the Kolmogorov–Smirnov distance is useful (only for probability measures on

R

): Kolm

(P, Q) = sup_{x \in R} | P (- \infty, x) - Q (- \infty, x) | \leq TV (P, Q)

. Suppose

X \sim P, Y \sim Q

are two r.v.’s, and Y has a density w.r.t. Lebesgue measure bounded by a constant C. Then, Kolm

(P, Q) \leq 2 \sqrt{C {Wass}_{1} (P, Q)}

. Here, Wass

_{1} (P, Q) = inf [E | X - Y | : X \sim P, Y \sim Q]

.

Let

X_{1}, X_{2}

be random variables with the probability density functions

p, q

, respectively. Define the Kullback–Leibler (KL) divergence

KL (P_{X_{1}} | | P_{X_{2}}) = \int p log \frac{p}{q} .

(2)

Example 1.

Consider the scale family

{p_{s} (x) = \frac{1}{s} p (\frac{x}{s}), s \in (0, \infty)}

. Then,

KL (p_{s_{1}} | | p_{s_{2}}) = KL (p_{\frac{s_{1}}{s_{2}}} | | p_{1}) = KL (p_{1} | | p_{\frac{s_{2}}{s_{1}}}) .

The total variation distance and the Kullback–Leibler (KL) divergence appear naturally in statistics. For example, in testing the binary hypothesis

H_{0}

:

X \sim P

versus

H_{1}

:

X \sim Q

, the sum of errors of both types

inf_{d} [P (d (X) = H_{1}) + Q (d (X) = H_{0})] = \int min [p, q] = 1 - TV (P, Q)

(3)

as the infimum over all reasonable decision rules d:

X \to {H_{0}, H_{1}}

or the critical domains W is achieved for

W^{*} = {p (x) < q (x)}

. Moreover, when minimizing the probability of type-II error subjected to type-I error constraints, the optimal test guarantees that the probability of type-II error decays exponentially in view of Sanov’s theorem

lim_{n \to \infty} - \frac{ln Q (d (X) = H_{0})}{n} = KL (P | | Q) .

(4)

where n is the sample size. In the case of selecting between

M \geq 2

distributions,

inf_{d} max_{1 \leq j \leq M} P_{j} (d (X) \neq j) \geq 1 - \frac{\frac{1}{M^{2}} \sum_{j, k = 1}^{M} KL (P_{j}, P_{k}) + log 2}{M - 1} .

(5)

The KL-divergence is not symmetric and does not satisfy the triangle inequality. However, it gives rise to the so-called Jensen–Shannon metric [4]

JS (P, Q) = \sqrt{D (P | | R) + D (Q | | R)}

with

R = \frac{1}{2} (P + Q)

. It is a lower bound for the total variation distance

0 \leq JS (P, Q) \leq TV (P, Q) .

(6)

The Jensen–Shannon metric is not easy to compute in terms of covariance matrices in the multi-dimensional Gaussian case.

A natural way to develop a computationally effective distance in the Gaussian case is to define first a metric between positive definite matrices. Let

λ_{1}, \dots, λ_{d}

be the generalized eigenvalues, i.e., the solutions of

\det (Σ_{1} - λ Σ_{2}) = 0

. Define the distance between positive definite matrices by

d (Σ_{1}, Σ_{2}) = \sqrt{\sum_{j = 1}^{d} {(ln λ_{j})}^{2}}

, and a geodesic metric between Gaussian PDs

X_{1} \sim

N

(μ_{1}, Σ_{1})

and

X_{2} \sim

N

(μ_{2}, Σ_{2})

:

d (X_{1}, X_{2}) = {(δ^{T} S^{- 1} δ)}^{1 / 2} + {(\sum_{j = 1}^{d} {(ln λ_{j})}^{2})}^{1 / 2}

(7)

where

δ = μ_{1} - μ_{2}

and

S = \frac{1}{2} Σ_{1} + \frac{1}{2} Σ_{2}

. Equivalently,

d^{2} (Σ_{1}, Σ_{2}) = tr [{(ln (Σ_{1}^{- 1 / 2} Σ_{2} Σ_{1}^{- 1 / 2}))}^{2}] .

(8)

Remark 1.

It may be proved that the set of symmetric positive definite matrices

M^{+} (d, R)

is a Riemannian manifold, and (8) is a geodesic distance corresponding to the bilinear form

B (X, Y) = 4 tr (X Y)

on the tangent space of symmetric matrices

M (d, R)

.

3. Total Variation Distance between 1D Gaussian PDs

Let

Φ

and

φ

be the standard normal distribution and its density. Let

X_{i} \sim

N

(μ_{i}, σ_{i}^{2})

,

i = 1, 2

. Define

τ = τ (X_{1}, X_{2}) = TV (N (μ_{1}, σ_{1}^{2}), N (μ_{2}, σ_{2}^{2})) .

Note that

τ

depends on the parameters

Δ = | δ |

, with

δ = μ_{1} - μ_{2}

, and

σ_{1}^{2}, σ_{2}^{2}

.

Proposition 1.

In the case

σ_{1}^{2} = σ_{2}^{2} = σ^{2}

, the total variation distance is computed exactly:

τ (X_{1}, X_{2}) = 2 Φ (\frac{| μ_{1} - μ_{2} |}{2 σ}) - 1

.

Proof.

By using a shift, we can assume that

μ_{1} = 0

and

μ_{2} = Δ > 0

. Then, the set

A = {x : p_{1} (x) > p_{2} (x)}

is specified as

A = {e^{- \frac{x^{2}}{2 σ^{2}}} > e^{- \frac{{(x - Δ)}^{2}}{2 σ^{2}}}} = (- \infty, Δ / 2) .

Hence,

τ (X_{1}, X_{2}) = \frac{1}{\sqrt{2 π} σ} \int_{- \infty}^{Δ / 2} (e^{- \frac{x^{2}}{2 σ^{2}}} - e^{- \frac{{(x - Δ)}^{2}}{2 σ^{2}}}) d x = Φ (b) - Φ (- b)

(9)

where

b = \frac{Δ}{2 σ}

. Using the property

Φ (- b) = 1 - Φ (b)

leads to the answer. □
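
As a numerical sanity check of Proposition 1, the following minimal sketch compares the closed form with a midpoint-rule approximation of the integral in (1). The function names, the integration window of 12 standard deviations and the step count are illustrative assumptions, not part of the original text.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def tv_closed_form(mu1, mu2, sigma):
    """Proposition 1: TV between N(mu1, sigma^2) and N(mu2, sigma^2)."""
    return 2.0 * Phi(abs(mu1 - mu2) / (2.0 * sigma)) - 1.0

def tv_numeric(mu1, s1, mu2, s2, steps=200_000):
    """Midpoint-rule approximation of (1/2) * integral of |p - q|."""
    lo = min(mu1 - 12 * s1, mu2 - 12 * s2)   # assumed truncation window
    hi = max(mu1 + 12 * s1, mu2 + 12 * s2)
    h = (hi - lo) / steps
    acc = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        acc += abs(normal_pdf(x, mu1, s1) - normal_pdf(x, mu2, s2))
    return 0.5 * acc * h

# The two values should agree up to the quadrature error.
print(tv_closed_form(0.0, 1.0, 1.0), tv_numeric(0.0, 1.0, 1.0, 1.0))
```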

Theorem 1.

\frac{1}{200} min [1, max [\frac{| σ_{1}^{2} - σ_{2}^{2} |}{min [σ_{1}^{2}, σ_{2}^{2}]}, \frac{40 Δ}{min [σ_{1}, σ_{2}]}]] \leq τ \leq \frac{3 | σ_{1}^{2} - σ_{2}^{2} |}{2 max [σ_{1}^{2}, σ_{2}^{2}]} + \frac{Δ}{2 max [σ_{1}, σ_{2}]}

(10)
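
The two-sided bound (10) can be probed numerically. The sketch below evaluates both sides of (10) and a quadrature approximation of τ for a few parameter choices; the helper names and the test parameters are hypothetical, and the check illustrates the bound rather than proving it.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def tv_numeric(mu1, s1, mu2, s2, steps=200_000):
    """Midpoint-rule approximation of (1/2) * integral of |p - q|."""
    lo = min(mu1 - 12 * s1, mu2 - 12 * s2)
    hi = max(mu1 + 12 * s1, mu2 + 12 * s2)
    h = (hi - lo) / steps
    acc = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        acc += abs(normal_pdf(x, mu1, s1) - normal_pdf(x, mu2, s2))
    return 0.5 * acc * h

def theorem1_bounds(mu1, s1, mu2, s2):
    """Lower and upper bounds of (10); s1, s2 are standard deviations."""
    delta = abs(mu1 - mu2)
    dvar = abs(s1 * s1 - s2 * s2)
    lower = min(1.0, max(dvar / min(s1, s2) ** 2,
                         40.0 * delta / min(s1, s2))) / 200.0
    upper = 1.5 * dvar / max(s1, s2) ** 2 + 0.5 * delta / max(s1, s2)
    return lower, upper

for params in [(0.0, 1.0, 0.5, 1.5), (0.0, 1.0, 0.0, 1.1), (0.0, 1.0, 3.0, 1.0)]:
    lower, upper = theorem1_bounds(*params)
    print(params, lower, tv_numeric(*params), upper)
```

Note that the upper bound may exceed 1 (for instance, for well separated means), in which case it carries no information.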

The proof is sketched in Section 6. The upper bound is based on the following.

Proposition 2 (Pinsker’s inequality).

Let

X_{1}, X_{2}

be random variables with the probability density functions

p, q

, and the Kullback–Leibler divergence

KL (P_{X_{1}} | | P_{X_{2}})

. Then, for

τ (X_{1}, X_{2}) = TV (X_{1}, X_{2})

,

\begin{matrix} τ (X_{1}, X_{2}) \leq min [1, \sqrt{KL (P_{X_{1}} | | P_{X_{2}}) / 2}] . \end{matrix}

(11)

Proof of Pinsker’s inequality.

We need the following bound:

| x - 1 | \leq \sqrt{(\frac{4}{3} + \frac{2 x}{3}) ϕ (x)}, ϕ (x) : = x ln x - x + 1

(12)

If

P

and

Q

are singular, then

KL = \infty

and Pinsker’s inequality holds true. Assume

P

and

Q

are absolutely continuous. In view of (12) and the Cauchy–Schwarz inequality,

\begin{matrix} τ (X, Y) = \frac{1}{2} \int | p - q | = \frac{1}{2} \int q | \frac{p}{q} - 1 | 1_{{q > 0}} \\ \leq \frac{1}{2} {(\int (\frac{4 q}{3} + \frac{2 p}{3}))}^{1 / 2} {(\int q ϕ (\frac{p}{q}) 1_{{q > 0}})}^{1 / 2} \\ = {(\frac{1}{2} \int p ln (\frac{p}{q}) 1_{{q > 0}})}^{1 / 2} = {(KL (P | | Q) / 2)}^{1 / 2} \end{matrix}

(13)

To check (12), define

g (x) = {(x - 1)}^{2} - (\frac{4}{3} + \frac{2 x}{3}) ϕ (x)

Then,

g (1) = g^{'} (1) = 0

,

g^{″} (x) = - \frac{4 ϕ (x)}{3 x} < 0

. Hence,

g (x) = g (1) + g^{'} (1) (x - 1) + \frac{1}{2} g^{″} (ξ) {(x - 1)}^{2} = - \frac{4 ϕ (ξ)}{6 ξ} {(x - 1)}^{2} \leq 0 .

□

[Mark S. Pinsker was invited to be the Shannon Lecturer at the 1979 IEEE International Symposium on Information Theory, but could not obtain permission at that time to travel to the symposium. However, he was officially recognized by the IEEE Information Theory Society as the 1979 Shannon Award recipient].

For one-dimensional Gaussian distributions, the divergence taken with the second distribution in the first argument equals (since τ is symmetric, Pinsker’s inequality (11) applies with either ordering)

KL (P_{X_{2}} | | P_{X_{1}}) = \frac{1}{2} (\frac{σ_{2}^{2}}{σ_{1}^{2}} - 1 + \frac{Δ^{2}}{σ_{1}^{2}} - ln \frac{σ_{2}^{2}}{σ_{1}^{2}}) .

In the multi-dimensional Gaussian case,

KL (P_{X_{2}} | | P_{X_{1}}) = \frac{1}{2} (tr (Σ_{1}^{- 1} Σ_{2} - I) + δ^{T} Σ_{1}^{- 1} δ - ln \det (Σ_{2} Σ_{1}^{- 1}))

(14)

Next, define the Hellinger distance

η (X, Y) = \frac{1}{\sqrt{2}} {(\int {(\sqrt{p_{X} (u)} - \sqrt{p_{Y} (u)})}^{2} d u)}^{1 / 2}

(15)

and note that, for one-dimensional Gaussian distributions,

η {(X, Y)}^{2} = 1 - \frac{\sqrt{2 σ_{1} σ_{2}}}{\sqrt{σ_{1}^{2} + σ_{2}^{2}}} e^{- \frac{Δ^{2}}{4 (σ_{1}^{2} + σ_{2}^{2})}}

(16)

For multi-dimensional Gaussian PDs with

δ = μ_{1} - μ_{2}

,

η {(X, Y)}^{2} = 1 - \frac{2^{d / 2} \det {(Σ_{1})}^{1 / 4} \det {(Σ_{2})}^{1 / 4}}{\det {(Σ_{1} + Σ_{2})}^{1 / 2}} exp (- \frac{1}{8} δ^{T} {(\frac{Σ_{1} + Σ_{2}}{2})}^{- 1} δ) .

(17)

In fact, the following inequalities hold:

τ (X, Y) \leq \sqrt{2} η (X, Y) \leq \sqrt{2} \sqrt{KL (P_{X} | | P_{Y})} \leq \sqrt{2} \sqrt{χ^{2} (X, Y)}

(18)

where

χ^{2} (P, Q) = \int \frac{{(p (x) - q (x))}^{2}}{p (x)} d x

. These inequalities are not sharp. For example, the Cauchy–Schwarz inequality immediately implies

τ (X, Y) \leq \frac{1}{2} \sqrt{χ^{2} (X, Y)}

. There are also reverse inequalities in some cases.

Proposition 3 (Le Cam’s inequalities).

The following inequality holds:

η {(X, Y)}^{2} \leq τ (X, Y) \leq η (X, Y) {(2 - η {(X, Y)}^{2})}^{1 / 2}

(19)

Proof of Le Cam’s inequalities.

From

τ (X, Y) = \frac{1}{2} \int | p - q | = 1 - \int min [p, q]

and

min [p, q]

\leq \sqrt{p q}

, it follows that

τ (X, Y) \geq 1 - \int \sqrt{p q} = η^{2} (X, Y)

. Next,

\int min [p, q] + \int max [p, q] = 2

. Therefore, by Cauchy–Schwarz:

\begin{matrix} {(\int \sqrt{p q})}^{2} = {(\int \sqrt{min [p, q] max [p, q]})}^{2} \leq \int min [p, q] \int max [p, q] \\ = \int min [p, q] (2 - \int min [p, q]) \end{matrix}

(20)

Hence,

\begin{matrix} {(1 - η {(X, Y)}^{2})}^{2} \leq (1 - τ (X, Y)) (1 + τ (X, Y)) \\ \Rightarrow τ (X, Y) \leq η (X, Y) {(2 - η {(X, Y)}^{2})}^{1 / 2} . \end{matrix}

(21)

□
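
Pinsker’s inequality (11), Le Cam’s inequalities (19) and the closed form (16) can be checked together for a pair of one-dimensional Gaussian PDs. The sketch below is an illustration under stated assumptions: all three quantities are obtained by midpoint quadrature, the KL divergence is computed as the integral of p ln(p/q) as in (2), and the function names and parameters are hypothetical.

```python
import math

def log_normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def distances_numeric(mu1, s1, mu2, s2, steps=200_000):
    """Return (tau, eta^2, KL(P||Q)) for P = N(mu1, s1^2), Q = N(mu2, s2^2)."""
    lo = min(mu1 - 12 * s1, mu2 - 12 * s2)
    hi = max(mu1 + 12 * s1, mu2 + 12 * s2)
    h = (hi - lo) / steps
    tv = bc = kl = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        lp = log_normal_pdf(x, mu1, s1)
        lq = log_normal_pdf(x, mu2, s2)
        p, q = math.exp(lp), math.exp(lq)
        tv += abs(p - q)
        bc += math.exp(0.5 * (lp + lq))   # sqrt(p * q)
        kl += p * (lp - lq)               # p * ln(p / q)
    return 0.5 * tv * h, 1.0 - bc * h, kl * h

def hellinger_sq_closed(mu1, s1, mu2, s2):
    """Formula (16) for the squared Hellinger distance."""
    v = s1 * s1 + s2 * s2
    return 1.0 - math.sqrt(2.0 * s1 * s2 / v) * math.exp(-(mu1 - mu2) ** 2 / (4.0 * v))

tau, eta2, kl = distances_numeric(0.0, 1.0, 1.0, 2.0)
eta = math.sqrt(eta2)
print("eta^2 <= tau <= eta*sqrt(2-eta^2):", eta2, tau, eta * math.sqrt(2.0 - eta2))
print("Pinsker bound sqrt(KL/2):", math.sqrt(kl / 2.0))
print("closed-form eta^2:", hellinger_sq_closed(0.0, 1.0, 1.0, 2.0))
```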

Example 2.

Let

X \sim

N

(0, Σ_{1})

,

Y \sim

N

(0, Σ_{2})

be d-dimensional Gaussian vectors. Suppose that

Σ_{2} = (1 + Δ) Σ_{1}

, where Δ is small enough. Let

r < d

and A be

r \times d

semi-orthogonal matrix

A A^{T} = I_{r}

. Define

τ : = τ (A X, A Y)

. Then,

\frac{1}{16} Δ^{2} r \leq τ \leq \frac{1}{2^{3 / 2}} Δ \sqrt{r} .

Proof.

In view of Le Cam’s inequalities, it is enough to evaluate

η^{2}

. Note that all r eigenvalues of

Σ_{1} Σ_{2}^{- 1}

equal

{(1 + Δ)}^{- 1}

. Thus,

η^{2} = 1 - \frac{4^{r / 4} {(1 + Δ)}^{r / 4}}{{(2 + Δ)}^{r / 2}} = \frac{1}{8} Δ^{2} [- r (\frac{r}{4} - 1) + \frac{r}{2} (\frac{r}{2} - 1)] + o (Δ^{2}) = \frac{1}{16} Δ^{2} r + o (Δ^{2}) .

(22)

□

[Ernst Hellinger was imprisoned in Dachau but released by the interference of influential friends and emigrated to the US].

4. Bounds on the Total Variation Distance

This section is devoted to the basic examples and partially based on [5]. However, it includes more proofs and additional details (Figure 1).

Figure 1. Exact TV distance and the upper bound for (a) TV(Bin

(20, \frac{1}{2})

, Bin

(20, \frac{1}{2} + a))

and (b) TV(Pois

(1)

, Pois

(1 + a)

). (a) Note that the upper bound becomes useless for

p_{2} - p_{1} \geq 0.07

; (b) blue and orange curves – exact TV distance: the blue curve works for

1 \leq \frac{λ_{2}}{λ_{1}} \leq 2

and the orange curve for

2 \leq \frac{λ_{2}}{λ_{1}} \leq 4

. Note that the linear upper bound (red curve) is not relevant and the square root upper (green curve) bound becomes useless for

\frac{λ_{2}}{λ_{1}} \geq 4

.

Proposition 4 (Distances between exponential distributions).

(a) Let

X \sim

Exp

(λ)

,

Y \sim

Exp

(μ)

,

0 < λ < μ < \infty

. Then,

τ (X, Y) = {(\frac{λ}{μ})}^{\frac{λ}{μ - λ}} - {(\frac{λ}{μ})}^{\frac{μ}{μ - λ}} .

(23)

(b) Let

X = (X_{1}, \dots, X_{d})

,

Y = (Y_{1}, \dots, Y_{d})

, each with d i.i.d. components

X_{i} \sim

Exp

(λ)

,

Y_{i} \sim

Exp

(μ)

. Then,

τ (X, Y) = \int_{z^{*}}^{\infty} (λ^{d} e^{- λ y} - μ^{d} e^{- μ y}) \frac{y^{d - 1}}{(d - 1)!} d y

(24)

where

z^{*} = \frac{d}{μ - λ} ln \frac{μ}{λ}

.

Proof.

(a) Indeed, the set

A = {x > 0 : λ e^{- λ x} > μ e^{- μ x}}

coincides with the half-axis

(y^{*}, \infty)

with

y^{*} = \frac{1}{μ - λ} ln \frac{μ}{λ}

. Consequently,

τ (X, Y) = e^{- λ y^{*}} - e^{- μ y^{*}}

. (b) In this case, the set on which the first joint density exceeds the second is

A = {x : x_{i} > 0, \sum_{j = 1}^{d} x_{j} > z^{*}}

with

z^{*} = \frac{d}{μ - λ} ln \frac{μ}{λ}

. Both joint densities depend on x only through the sum

y = \sum_{j = 1}^{d} x_{j}

, and the sum of d i.i.d. Exp(λ) components has the Gamma density

λ^{d} e^{- λ y} \frac{y^{d - 1}}{(d - 1)!}

. Then,

τ (X, Y) = \int_{x \in A} [\prod_{j = 1}^{d} λ e^{- λ x_{j}} - \prod_{j = 1}^{d} μ e^{- μ x_{j}}] d x

coincides with (24). □
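
Part (a) of Proposition 4 is easy to check numerically. The following sketch compares formula (23) with a midpoint-rule approximation of half the integral of the absolute difference of the two exponential densities; the function names and the truncation point are assumptions made for illustration.

```python
import math

def tv_exp_closed(lam, mu):
    """Formula (23), assuming 0 < lam < mu."""
    r = lam / mu
    return r ** (lam / (mu - lam)) - r ** (mu / (mu - lam))

def tv_exp_numeric(lam, mu, steps=200_000):
    """Midpoint rule for (1/2) * integral over (0, hi) of |density difference|."""
    hi = 50.0 / min(lam, mu)   # assumed cut-off; the neglected tail is below e^{-50}
    h = hi / steps
    acc = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        acc += abs(lam * math.exp(-lam * x) - mu * math.exp(-mu * x))
    return 0.5 * acc * h

print(tv_exp_closed(1.0, 2.0), tv_exp_numeric(1.0, 2.0))
```

For λ = 1 and μ = 2 the exponents in (23) are 1 and 2, so the closed form reduces to 1/2 − 1/4.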

Proposition 5 (Distances between Poisson distributions).

Let

X_{i} \sim

Po

(λ_{i})

, where

0 < λ_{1} < λ_{2} .

Then,

τ (X_{1}, X_{2}) = \int_{λ_{1}}^{λ_{2}} P (N (u) = l - 1) d u \leq min [λ_{2} - λ_{1}, \sqrt{\frac{2}{e}} (\sqrt{λ_{2}} - \sqrt{λ_{1}})]

(25)

where

N (u) \sim

Po

(u)

and

l = l (λ_{1}, λ_{2}) = ⌈ (λ_{2} - λ_{1}) {(ln (λ_{2} / λ_{1}))}^{- 1} ⌉

(26)

with

⌈ λ_{1} ⌉ \leq l \leq ⌈ λ_{2} ⌉

.

Proof.

Let

N (t) \sim

Po

(t)

; then, via iterated integration by parts,

P (N (t) \leq n) = \sum_{k = 0}^{n} e^{- t} \frac{t^{k}}{k!} = \int_{t}^{\infty} e^{- u} \frac{u^{n}}{n!} d u = \int_{t}^{\infty} P (N (u) = n) d u .

(27)

Hence, Kolm

(X_{1}, X_{2}) = τ (X_{1}, X_{2}) = P (X_{2} \geq l) - P (X_{1} \geq l) =

P (X_{1} \leq l - 1) - P (X_{2} \leq l - 1) = \int_{λ_{1}}^{λ_{2}} P (N (u) = l - 1) d u

where

l = min [k \in Z_{+} : f (k) \geq 1] = ⌈ (λ_{2} - λ_{1}) {(ln (λ_{2} / λ_{1}))}^{- 1} ⌉

and

f (k) = \frac{P (N (λ_{2}) = k)}{P (N (λ_{1}) = k)}

. □
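
The representation of τ through the threshold l in (26) can be verified directly from the Poisson probability mass functions. The sketch below is a minimal illustration; the function names and the truncation index `kmax` are assumptions, and the truncation is adequate only for moderate values of the rates.

```python
import math

def pois_pmf(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def pois_cdf(k, lam):
    return sum(pois_pmf(j, lam) for j in range(0, k + 1))

def poisson_threshold(lam1, lam2):
    """Index l from (26), assuming 0 < lam1 < lam2."""
    return math.ceil((lam2 - lam1) / math.log(lam2 / lam1))

def tv_poisson_exact(lam1, lam2, kmax=200):
    """Half the l1-distance between the two pmfs, truncated at kmax."""
    return 0.5 * sum(abs(pois_pmf(k, lam1) - pois_pmf(k, lam2)) for k in range(kmax + 1))

def tv_poisson_via_cdf(lam1, lam2):
    """P(X1 <= l - 1) - P(X2 <= l - 1), as in the proof of Proposition 5."""
    l = poisson_threshold(lam1, lam2)
    return pois_cdf(l - 1, lam1) - pois_cdf(l - 1, lam2)

def poisson_upper_bound(lam1, lam2):
    """Right-hand side of (25)."""
    return min(lam2 - lam1, math.sqrt(2.0 / math.e) * (math.sqrt(lam2) - math.sqrt(lam1)))

lam1, lam2 = 1.0, 2.5
print(poisson_threshold(lam1, lam2),
      tv_poisson_exact(lam1, lam2),
      tv_poisson_via_cdf(lam1, lam2),
      poisson_upper_bound(lam1, lam2))
```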

Proposition 6 (Distances between binomial distributions).

Let

X_{i} \sim

Bin

(n, p_{i})

,

0 < p_{1} < p_{2} < 1

. Then,

τ (X_{1}, X_{2}) = n \int_{p_{1}}^{p_{2}} P (S_{n - 1} (u) = l - 1) d u \leq \frac{\sqrt{e}}{2} \frac{ψ (p_{2} - p_{1})}{{(1 - ψ (p_{2} - p_{1}))}^{2}}

(28)

where

S_{n - 1} (u) \sim

Bin

(n - 1, u)

and

ψ (x) = x \sqrt{\frac{n + 2}{2 p_{1} (1 - p_{1})}}

. Finally, define

l = ⌈\frac{- n ln (1 - \frac{p_{2} - p_{1}}{1 - p_{1}})}{ln (1 + \frac{p_{2} - p_{1}}{p_{1}}) - ln (1 - \frac{p_{2} - p_{1}}{1 - p_{1}})}⌉

(29)

with

⌈ n p_{1} ⌉ \leq l \leq ⌈ n p_{2} ⌉

.

Proof.

Let us prove the following inequality:

n p \leq \frac{- n ln (1 - x / q)}{ln (1 + x / p) - ln (1 - x / q)} \leq n (p + x), 0 < x < q

(30)

where

p = p_{1}

,

p + x = p_{2}

and

q = 1 - p

. By concavity of the ln, given

p \in (0, 1)

and

q = 1 - p

,

f (x) = p ln (1 + x / p) + q ln (1 - x / q) \leq ln 1 = 0, 0 < x < q .

(31)

This gives the bound

⌈ n p_{1} ⌉ \leq l

as follows:

\begin{matrix} p ln (1 + x / p) + q ln (1 - x / q) \leq 0 \Rightarrow n p ln (1 + x / p) - n p ln (1 - x / q) \leq - n ln (1 - x / q) \\ \Rightarrow n p \leq \frac{- n ln (1 - x / q)}{ln (1 + x / p) - ln (1 - x / q)} . \end{matrix}

(32)

On the other hand,

h (x) = (p + x) ln (1 + x / p) + (q - x) ln (1 - x / q) \geq 0, 0 \leq x \leq q

(33)

as

h (0) = 0

and

h^{'} (x) = ln (1 + x / p) - ln (1 - x / q) \geq 0

; this implies the bound

l \leq ⌈ n p_{2} ⌉

. Indeed:

\begin{matrix} (p + x) ln (1 + x / p) + (q - x) ln (1 - x / q) \geq 0 \Rightarrow \\ n (p + x) ln (1 + x / p) - n (p + x) ln (1 - x / q) \geq - n ln (1 - x / q) \\ \Rightarrow n (p + x) \geq \frac{- n ln (1 - x / q)}{ln (1 + x / p) - ln (1 - x / q)} . \end{matrix}

(34)

The rest of the proof runs parallel to that of Proposition 5. Equation (27) is replaced with the following relation: if

S_{n} (p) \sim

Bin

(n, p)

; then,

P (S_{n} (p) \geq k) = n \int_{0}^{p} P (S_{n - 1} (u) = k - 1) d u

(35)

In fact, iterated integration by parts yields the RHS of (35)

\begin{matrix} = \frac{n (n - 1) \dots (n - k + 1)}{1 \dots k} p^{k} {(1 - p)}^{n - k} + \frac{n (n - 1) \dots (n - k)}{1 \dots (k + 1)} p^{k + 1} {(1 - p)}^{n - k - 1} \\ + \dots + p^{n} = \end{matrix}

(36)

the LHS of (35). □
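
The threshold (29) and the tail identity (35) can both be checked numerically. The sketch below is illustrative only: the function names, the midpoint quadrature and the parameter values are assumptions, and `math.comb` requires Python 3.8 or later.

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def binom_cdf(k, n, p):
    return sum(binom_pmf(j, n, p) for j in range(0, k + 1))

def binom_threshold(n, p1, p2):
    """Index l from (29), assuming 0 < p1 < p2 < 1."""
    x, q = p2 - p1, 1.0 - p1
    a = math.log(1.0 + x / p1)
    b = math.log(1.0 - x / q)
    return math.ceil(-n * b / (a - b))

def tv_binom_exact(n, p1, p2):
    """Half the l1-distance between the two binomial pmfs."""
    return 0.5 * sum(abs(binom_pmf(k, n, p1) - binom_pmf(k, n, p2)) for k in range(n + 1))

def tv_binom_via_cdf(n, p1, p2):
    """P(X1 <= l - 1) - P(X2 <= l - 1) with l from (29)."""
    l = binom_threshold(n, p1, p2)
    return binom_cdf(l - 1, n, p1) - binom_cdf(l - 1, n, p2)

def tail_via_integral(n, k, p, steps=20_000):
    """Right-hand side of (35): n * integral_0^p P(S_{n-1}(u) = k - 1) du."""
    h = p / steps
    return n * h * sum(binom_pmf(k - 1, n - 1, (i + 0.5) * h) for i in range(steps))

n, p1, p2 = 20, 0.5, 0.6
print(binom_threshold(n, p1, p2), tv_binom_exact(n, p1, p2), tv_binom_via_cdf(n, p1, p2))
print(tail_via_integral(n, 12, p2), 1.0 - binom_cdf(11, n, p2))
```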

Proposition 7 (Distance between binomial and Poisson distributions).

Let

X \sim

Bin

(n, p)

and

Y \sim

Po

(n p)

,

0 < n p < 2 - \sqrt{2}

. Then,

τ (X, Y) = n p [{(1 - p)}^{n - 1} - e^{- n p}]

(37)

An alternative bound is

TV (Bin (n, \frac{λ}{n}), Pois (λ)) \leq 1 - {(1 - \frac{λ}{n})}^{1 / 2} .

(38)

For the sum of Bernoulli r.v.’s

S_{n} = \sum_{j = 1}^{n} X_{j}

with

P (X_{i} = 1) = p_{i}

,

τ (S_{n}, Y_{n}) = \frac{1}{2} \sum_{k = 0}^{\infty} | P (S_{n} = k) - \frac{λ_{n}^{k}}{k!} e^{- λ_{n}} | < \sum_{i = 1}^{n} p_{i}^{2}

(39)

where

Y_{n} \sim

Po

(λ_{n})

,

λ_{n} = p_{1} + p_{2} + \dots + p_{n}

(Le Cam). A stronger result: for

X_{i} \sim

Bernoulli

(p_{i})

and

Y_{i} \sim

Po

(λ_{i} = p_{i})

, there exists a coupling s.t.

τ (X_{i}, Y_{i}) = P (X_{i} \neq Y_{i}) = p_{i} (1 - e^{- p_{i}}) .

The stronger form of (39):

\frac{1}{32} (1 \land λ_{n}^{- 1}) \sum_{i = 1}^{n} p_{i}^{2} \leq τ (S_{n}, Y_{n}) \leq λ_{n}^{- 1} (1 - e^{- λ_{n}}) \sum_{i = 1}^{n} p_{i}^{2} .

(40)
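
For identical success probabilities, the exact formula (37) and the bounds (39) and (40) can be compared numerically. The sketch below assumes p_i = p for all i (so that S_n is binomial); the function names and the number of extra Poisson terms are illustrative choices.

```python
import math

def binom_pmf(k, n, p):
    if k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def pois_pmf(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tv_binom_poisson(n, p, extra=60):
    """Half the l1-distance between Bin(n, p) and Po(n p), truncated at n + extra."""
    lam = n * p
    return 0.5 * sum(abs(binom_pmf(k, n, p) - pois_pmf(k, lam))
                     for k in range(n + extra + 1))

def closed_form_37(n, p):
    """Formula (37), valid for 0 < n p < 2 - sqrt(2)."""
    return n * p * ((1.0 - p) ** (n - 1) - math.exp(-n * p))

def bounds_40(n, p):
    """Lower and upper bounds of (40) with p_i = p for all i."""
    lam, s = n * p, n * p * p
    return min(1.0, 1.0 / lam) * s / 32.0, (1.0 - math.exp(-lam)) / lam * s

n, p = 50, 0.01
lower, upper = bounds_40(n, p)
print(lower, tv_binom_poisson(n, p), closed_form_37(n, p), upper, n * p * p)
```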

Proposition 8 (Distance between negative binomial distributions).

Let

X_{i} \sim

NegBin

(m, p_{i})

,

0 < p_{1} < p_{2} < 1

. Then,

τ (X_{1}, X_{2}) = (m + l - 1) \int_{p_{1}}^{p_{2}} P (S_{m + l - 2} (u) = m - 1) d u

(41)

where

S_{n} (u) \sim

Bin

(n, u)

and

l = ⌈ - m \frac{ln (1 + \frac{p_{2} - p_{1}}{p_{1}})}{ln (1 - \frac{p_{2} - p_{1}}{1 - p_{1}})} ⌉

(42)

with

⌈ m \frac{1 - p_{2}}{p_{2}} ⌉ \leq l \leq ⌈ m \frac{1 - p_{1}}{p_{1}} ⌉

.

5. Total Variation Distance in the Multi-Dimensional Gaussian Case

Theorem 2.

Let

τ = TV (N (μ_{1}, Σ_{1}), N (μ_{2}, Σ_{2}))

, and

Σ_{1}, Σ_{2}

be positive definite. Let

δ = μ_{1} - μ_{2}

and Π be a

d \times (d - 1)

matrix whose columns form a basis for the subspace orthogonal to δ. Let

λ_{1}, \dots, λ_{d - 1}

denote the eigenvalues of the matrix

{(Π^{T} Σ_{1} Π)}^{- 1} Π^{T} Σ_{2} Π - I_{d - 1}

and

λ = \sqrt{\sum_{i = 1}^{d - 1} λ_{i}^{2}}

. If

μ_{1} \neq μ_{2}

, then

\frac{1}{200} min [1, φ (δ, Σ_{1}, Σ_{2})] \leq τ \leq \frac{9}{2} min [1, φ (δ, Σ_{1}, Σ_{2})]

(43)

where

φ (δ, Σ_{1}, Σ_{2}) = max [\frac{| δ^{T} (Σ_{1} - Σ_{2}) δ |}{δ^{T} Σ_{1} δ}, \frac{δ^{T} δ}{\sqrt{δ^{T} Σ_{1} δ}}, λ]

(44)

In the case of equal means

μ_{1} = μ_{2}

, the bound (43) is simplified:

\frac{1}{100} min [1, λ] \leq τ \leq \frac{3}{2} min [1, λ] .

(45)

Here,

λ = \sqrt{\sum_{j = 1}^{d} λ_{j}^{2}}

,

λ_{1}, \dots, λ_{d}

are the eigenvalues of

Σ_{1}^{- 1} Σ_{2} - I_{d}

for positive definite

Σ_{1}, Σ_{2}

.

The proof is given in Section 6.

Suppose

r ≪ d

, and we want to find a low-dimensional projection

A \in R^{r \times d}, A A^{T} = I_{r}

of the multidimensional data

X \sim

N

(μ_{1}, Σ_{1})

and

Y \sim

N

(μ_{2}, Σ_{2})

such that

TV (A X, A Y) \to max

. The problem may be reduced to the case

μ_{1} = μ_{2} = 0,

Σ_{1} = I_{d}

,

Σ_{2} = Σ

, cf. [6]. In view of (45), it is natural to maximize

min [1, \sum_{i = 1}^{r} g (γ_{i})]

(46)

where

g (x) = {(\frac{1}{x} - 1)}^{2}

and

γ_{i}

are the eigenvalues of

A Σ A^{T}

. Consider all permutations

π

of these eigenvalues. Let

π^{*} = {argmax}_{π} \sum_{i = 1}^{r} g (λ_{π (i)}), γ_{i} = λ_{π^{*} (i)}, i = 1, \dots, r .

(47)

Then, rows of matrix A should be selected as the normalized eigenvectors of

Σ

associated with the eigenvalues

γ_{i}

.
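
The selection step (47) amounts to picking the r eigenvalues of Σ with the largest values of g. A minimal sketch is given below, assuming the eigenvalues of Σ are already available; the function names and the sample spectrum are hypothetical, and the computation of the corresponding eigenvectors (the rows of A) is left out.

```python
def g_tv(x):
    """g(x) = (1/x - 1)^2, the function used in (46)."""
    return (1.0 / x - 1.0) ** 2

def select_eigenvalues(eigs, r, g=g_tv):
    """Return the r eigenvalues with the largest g-values, as in (47)."""
    ranked = sorted(eigs, key=g, reverse=True)
    return ranked[:r]

def objective(selected, g=g_tv):
    """The quantity (46) for the selected eigenvalues."""
    return min(1.0, sum(g(x) for x in selected))

eigs = [0.2, 0.9, 1.0, 3.0]          # hypothetical spectrum of Sigma
chosen = select_eigenvalues(eigs, r=2)
print(chosen, objective(chosen))
```

Eigenvalues far from 1 in either direction are favoured, since g vanishes at x = 1; the same routine can be reused with another g from Remark 2 by passing it as the `g` argument.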

Remark 2.

For zero-mean Gaussian models, this procedure may be repeated mutatis mutandis for any of the so-called f-divergences

D_{f} (P | | Q) : = E_{P} [f (\frac{d Q}{d P})]

, where f is a convex function such that

f (1) = 0

, cf. [6]. The most interesting examples are:

(1): KL-divergence: $f (t) = t log t$ and $g (x) = \frac{1}{2} (x - log x - 1)$ ;
(2): Symmetric KL-divergence: $f (t) = (t - 1) log t$ and $g (x) = \frac{1}{2} (x + \frac{1}{x} - 2)$ ;
(3): The total variation distance: $f (t) = \frac{1}{2} | t - 1 |$ and $g (x) = {(\frac{1}{x} - 1)}^{2}$ ;
(4): The square of Hellinger distance: $f (t) = {(\sqrt{t} - 1)}^{2}$ and $g (x) = {(\frac{x + 1}{x})}^{2}$ ;
(5): $χ^{2} -$ divergence: $f (t) = {(t - 1)}^{2}$ and $g (x) = \frac{1}{\sqrt{x (2 - x)}}$ .

For the optimization procedure in (47), the following result is very useful.

Theorem 3 (Poincaré Separation Theorem).

Let Σ be a real symmetric

d \times d

matrix and A be a semi-orthogonal

r \times d

matrix. The eigenvalues λ_{1} \geq \dots \geq λ_{d} of Σ (sorted in descending order) and the eigenvalues of

A Σ A^{T}

denoted by

{γ_{i}, i = 1, \dots, r}

(sorted in the descending order) satisfy

λ_{d - (r - i)} \leq γ_{i} \leq λ_{i}, i = 1, \dots, r .

Proposition 9.

Let

X, Y

be two Gaussian PDs with the same covariance matrix:

X \sim

N

(μ_{1}, Σ)

,

Y \sim

N

(μ_{2}, Σ)

. Suppose that matrix Σ is non-singular. Then,

τ (X, Y) = 2 Φ (| | Σ^{- 1 / 2} (μ_{1} - μ_{2}) | | / 2) - 1 .

(48)

Proof.

Here, the set

A : = {x \in R^{d} : p (x | μ_{1}, Σ) > p (x | μ_{2}, Σ)}

is a half-space. Indeed,

p (x | μ_{1}, Σ) > p (x | μ_{2}, Σ) \Leftrightarrow 2 x^{T} Σ^{- 1} (μ_{2} - μ_{1}) < μ_{2}^{T} Σ^{- 1} μ_{2} - μ_{1}^{T} Σ^{- 1} μ_{1} .

(49)

After the change of variables

x \to x + μ_{1}

, and with δ = μ_{2} - μ_{1} in this proof, we need to evaluate the expression

\begin{matrix} I : = \frac{1}{{(2 π)}^{d / 2} \det {(Σ)}^{1 / 2}} \int_{R^{d}} 1 (x^{T} Σ^{- 1} δ < \frac{1}{2} | | Σ^{- 1 / 2} δ {| |}^{2}) \\ \times (e^{- x^{T} Σ^{- 1} x / 2} - e^{- {(x - δ)}^{T} Σ^{- 1} (x - δ) / 2}) d x . \end{matrix}

(50)

Take an orthogonal

d \times d

matrix O such that

O Σ^{- 1 / 2} δ = | | Σ^{- 1 / 2} δ | | e_{1}

and change the variables

x = Σ^{1 / 2} O^{T} u

. Then,

\begin{matrix} x^{T} Σ^{- 1} δ = | | Σ^{- 1 / 2} δ | | u_{1}, x^{T} Σ^{- 1} x = u^{T} u, \\ {(x - δ)}^{T} Σ^{- 1} (x - δ) = u^{T} u + {| | Σ^{- 1 / 2} δ | |}^{2} - 2 | | Σ^{- 1 / 2} δ | | u_{1} . \end{matrix}

(51)

Thus,

\begin{matrix} I = \frac{1}{{(2 π)}^{d / 2}} \int_{R^{d - 1}} e^{- v^{T} v / 2} d v \int_{- \infty}^{| | Σ^{- 1 / 2} δ | | / 2} (e^{- u_{1}^{2} / 2} - e^{- {(u_{1} - | | Σ^{- 1 / 2} δ | |)}^{2} / 2}) d u_{1} \\ = Φ (b) - Φ (- b) \end{matrix}

(52)

where

b = | | Σ^{- 1 / 2} δ | | / 2

. □
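
Formula (48) only requires the Mahalanobis norm of the mean difference. The sketch below computes it with a hand-written Cholesky factorization so that it stays within the standard library; the function names are illustrative assumptions, and Σ is assumed symmetric positive definite and given as a list of rows.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cholesky(S):
    """Lower-triangular L with L L^T = S (S symmetric positive definite)."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L

def mahalanobis_sq(delta, S):
    """delta^T S^{-1} delta via forward substitution L y = delta."""
    L = cholesky(S)
    y = []
    for i in range(len(delta)):
        y.append((delta[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return sum(v * v for v in y)

def tv_equal_cov(mu1, mu2, S):
    """Formula (48): TV between N(mu1, S) and N(mu2, S)."""
    delta = [a - b for a, b in zip(mu1, mu2)]
    return 2.0 * Phi(math.sqrt(mahalanobis_sq(delta, S)) / 2.0) - 1.0

print(tv_equal_cov([1.0, 1.0], [0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]]))
```

For d = 1 the expression reduces to the formula of Proposition 1.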

6. Proofs for the Multi-Dimensional Gaussian Case

Let

X_{i} \sim

N

(μ_{i}, Σ_{i}), i = 1, 2

. W.l.o.g., assume that

Σ_{1}, Σ_{2}

are positive definite; the general case follows from the identity

TV (N (0, Σ_{1}), N (0, Σ_{2})) = TV (N (0, Π^{T} Σ_{1} Π), N (0, Π^{T} Σ_{2} Π))

(53)

where

Π

is

d \times r

matrix whose columns form an orthogonal basis for range

(Σ_{1, 2})

. Denote

u = (μ_{1} + μ_{2}) / 2, δ = μ_{1} - μ_{2}

and decompose

\forall w \in R^{d}

as

w = u + f_{1} (w) δ + f_{2} (w), f_{2} {(w)}^{T} δ = 0 .

Then,

\begin{matrix} max [TV (f_{1} (X_{1}), f_{1} (X_{2})), TV (f_{2} (X_{1}), f_{2} (X_{2}))] \leq TV (X_{1}, X_{2}) \\ \leq TV (f_{1} (X_{1}), f_{1} (X_{2})) + TV (f_{2} (X_{1}), f_{2} (X_{2})) \end{matrix}

(54)

All the components are Gaussian and

f_{1} (X_{1}) \sim

N

(\frac{1}{2}, \frac{δ^{T} Σ_{1} δ}{{(δ^{T} δ)}^{2}})

,

f_{1} (X_{2}) \sim

N

(- \frac{1}{2}, \frac{δ^{T} Σ_{2} δ}{{(δ^{T} δ)}^{2}})

,

f_{2} (X_{1}) \sim

N

(0, P Σ_{1} P)

,

f_{2} (X_{2}) \sim

N

(0, P Σ_{2} P)

,

P = I_{d} - \frac{δ δ^{T}}{δ^{T} δ}

. We claim that

\begin{matrix} \frac{1}{200} min [1, max [\frac{| δ^{T} (Σ_{1} - Σ_{2}) δ |}{2 δ^{T} Σ_{1} δ}, \frac{40 δ^{T} δ}{\sqrt{δ^{T} Σ_{1} δ}}]] \\ \leq TV (f_{1} (X_{1}), f_{1} (X_{2})) \leq \frac{3 | δ^{T} (Σ_{1} - Σ_{2}) δ |}{2 δ^{T} Σ_{1} δ} + \frac{δ^{T} δ}{2 \sqrt{δ^{T} Σ_{1} δ}}, \end{matrix}

(55)

\frac{1}{100} min [1, λ] \leq TV (f_{2} (X_{1}), f_{2} (X_{2})) \leq \frac{3}{2} λ

(56)

where

λ = {(\sum_{j = 1}^{d} λ_{j}^{2})}^{1 / 2}

and

λ_{i}

are the eigenvalues of

Σ_{1}^{- 1} Σ_{2} - I_{d}

.

Proof of upper bound.

It follows from Pinsker’s inequality. Let

d = 1

and

σ_{2} \geq σ_{1}

. Then, for

x = \frac{σ_{2}^{2}}{σ_{1}^{2}}

, we have

x - 1 - ln x \leq {(x - 1)}^{2}

and, by Pinsker’s inequality,

\begin{matrix} TV (N (μ_{1}, σ_{1}^{2}), N (μ_{2}, σ_{2}^{2})) \leq \frac{1}{2} \sqrt{\frac{σ_{2}^{2}}{σ_{1}^{2}} - 1 - ln \frac{σ_{2}^{2}}{σ_{1}^{2}} + \frac{Δ^{2}}{σ_{1}^{2}}} \\ \leq \frac{1}{2} \sqrt{\frac{σ_{2}^{2}}{σ_{1}^{2}} - 1 - ln \frac{σ_{2}^{2}}{σ_{1}^{2}}} + \frac{1}{2} \sqrt{\frac{Δ^{2}}{σ_{1}^{2}}} \leq \frac{1}{2} \frac{| σ_{2}^{2} - σ_{1}^{2} |}{σ_{1}^{2}} + \frac{1}{2} \frac{Δ}{σ_{1}} . \end{matrix}

(57)

For

d > 1

, it is enough to obtain the upper bound in the case

μ_{1} = μ_{2} = 0

. Again, Pinsker’s inequality implies: if

λ_{i} > - \frac{2}{3} \forall i

,

4 TV {(N (0, Σ_{1}), N (0, Σ_{2}))}^{2} \leq \sum_{i = 1}^{d} (λ_{i} - ln (1 + λ_{i})) \leq \sum_{i = 1}^{d} λ_{i}^{2} = λ^{2}

(58)

□

Sketch of proof for lower bound, cf. [7].

In a 1D case with

X_{i} \sim

N

(μ_{i}, σ_{i}^{2})

(

μ_{1} \leq μ_{2}

),

\begin{matrix} TV (N (μ_{1}, σ_{1}^{2}), N (μ_{2}, σ_{2}^{2})) \geq P (X_{2} \geq μ_{2}) - P (X_{1} \geq μ_{2}) = \\ \frac{1}{2} - (\frac{1}{2} - P (X_{1} \in (μ_{1}, μ_{2}))) = P (X_{1} \in (μ_{1}, μ_{2})) \\ \geq \frac{1}{5} min [1, \frac{Δ}{σ_{1}}] \end{matrix}

(59)

Next,

TV (N (μ_{1}, σ_{1}^{2}), N (μ_{2}, σ_{2}^{2})) \geq \frac{1}{2} TV (N (0, σ_{1}^{2}), N (0, σ_{2}^{2}))

(60)

Indeed, assume w.l.o.g.

μ_{1} \leq μ_{2}, σ_{1} \leq σ_{2}

. Then,

\exists c = c (σ_{1}, σ_{2})

:

TV (N (0, σ_{1}^{2}), N (0, σ_{2}^{2})) = P (N (0, σ_{2}^{2}) \notin [- c, c]) - P (N (0, σ_{1}^{2}) \notin [- c, c])

Hence,

\begin{matrix} TV (N (μ_{1}, σ_{1}^{2}), N (μ_{2}, σ_{2}^{2})) \geq P (N (μ_{2}, σ_{2}^{2}) > c + μ_{1}) - P (N (μ_{1}, σ_{1}^{2}) > c + μ_{1}) \\ \geq \frac{1}{2} TV (N (0, σ_{1}^{2}), N (0, σ_{2}^{2})) \end{matrix}

(61)

Thus, it is enough to study the case

μ_{1} = μ_{2} = 0

. Let

C = diag (1 + λ_{i})

. Then,

TV (N (0, Σ_{1}), N (0, Σ_{2})) = TV (N (0, C^{- 1}), N (0, I_{d}))

In the case when there exists i:

| λ_{i} | > 0.1

,

\begin{matrix} TV (N (0, C^{- 1}), N (0, I_{d})) \geq TV (N (0, {(1 + λ_{i})}^{- 1}), N (0, 1)) = \\ TV (N (0, 1), N (0, 1 + λ_{i})) \geq P (N (0, 1) \in [- 1, 1]) \\ - P (N (0, 1.1) \in [- 1, 1]) > 0.68 - 0.66 > 0.01 \end{matrix}

(62)

Finally, in the case when

| λ_{i} | \leq 0.1

\forall i

, the result follows from the lower bound

TV (N (0, C^{- 1}), N (0, I_{d})) \geq \frac{λ}{6} - \frac{λ^{2}}{8} - \frac{1}{2} (e^{λ^{2}} - 1)

(63)

The bound (63)

> \frac{λ}{100}

if

λ < 0.17

and

> 0.01

if

λ \geq 0.17

and

| λ_{i} | < 0.1 \forall i

. We refer to [7] for the proofs of these facts. □

7. Estimation of Lévy–Prokhorov Distance

Let

P_{i}, i = 1, 2,

be probability distributions on a metric space W with metric r. Define the Lévy–Prokhorov distance

ρ^{L - P} (P_{1}, P_{2})

between

P_{1}, P_{2}

as the infimum of numbers

ϵ > 0

such that, for any closed set

C \subset W

,

P_{1} (C) - P_{2} (C_{ϵ}) < ϵ, P_{2} (C) - P_{1} (C_{ϵ}) < ϵ

(64)

where

C_{ϵ}

stands for the

ϵ

-neighborhood of C in metric r. It could be checked that

ρ^{L - P} (P_{1}, P_{2}) \leq τ (P_{1}, P_{2})

, i.e., the total variation distance. Equivalently,

ρ^{L - P} (P_{1}, P_{2}) = inf_{\bar{P} \in P (P_{1}, P_{2})} inf [ϵ > 0 : \bar{P} (r (X_{1}, X_{2}) > ϵ) < ϵ]

(65)

where

P (P_{1}, P_{2})

is the set of all joint

\bar{P}

on

W \times W

with marginals

P_{i}

.

Next, define the Wasserstein distance

W_{p}^{r} (P_{1}, P_{2})

between

P_{1}, P_{2}

by

W_{p}^{r} (P_{1}, P_{2}) = inf_{\bar{P} \in P (P_{1}, P_{2})} {(E_{\bar{P}} [r {(X_{1}, X_{2})}^{p}])}^{1 / p} .

(66)

In the case of Euclidean space with

r (x_{1}, x_{2}) = | | x_{1} - x_{2} | |

, the index r is omitted.

Total Variation, Wasserstein and Kolmogorov–Smirnov distances defined above are stronger than weak convergence (i.e., convergence in distribution, which is weak* convergence on the space of probability measures, seen as a dual space). That is, if any of these metrics go to zero as

n \to \infty

, then we have weak convergence. The converse is not true in general; nevertheless, weak convergence is metrizable (e.g., by the Lévy–Prokhorov metric).

Theorem 4 (Dobrushin’s bound).

ρ^{L - P} (P_{1}, P_{2}) \leq {[W_{1}^{r} (P_{1}, P_{2})]}^{1 / 2} .

(67)

Proof.

Suppose that there exists a closed set C for which at least one of the inequalities (64) fails, say

P_{1} (C) \geq ϵ + P_{2} (C_{ϵ})

. Then, for any joint

\bar{P}

with marginals

P_{1}

and

P_{2}

,

\begin{matrix} E_{\bar{P}} [r (X_{1}, X_{2})] \geq E_{\bar{P}} [1 (r (X_{1}, X_{2}) \geq ϵ) r (X_{1}, X_{2})] \\ \geq ϵ \bar{P} (r (X_{1}, X_{2}) \geq ϵ) \geq ϵ \bar{P} (X_{1} \in C, X_{2} \in W ∖ C_{ϵ}) \\ \geq ϵ [\bar{P} (X_{1} \in C) - \bar{P} (X_{1} \in C, X_{2} \in C_{ϵ})] \\ \geq ϵ [\bar{P} (X_{1} \in C) - \bar{P} (X_{2} \in C_{ϵ})] = ϵ [P_{1} (X_{1} \in C) - P_{2} (X_{2} \in C_{ϵ})] \geq ϵ^{2} . \end{matrix}

(68)

Hence, W_{1}^{r} (P_{1}, P_{2}) \geq ϵ^{2} for every ϵ < ρ^{L - P} (P_{1}, P_{2}), which leads to (67), as claimed. □

The Lévy–Prokhorov distance is quite tricky to compute, whereas the Wasserstein distance can be found explicitly in a number of cases. Say, in a 1D case

W = R^{1}

, we have

Theorem 5.

For

d = 1

,

W_{1} (P_{1}, P_{2}) = \int_{R} | F_{1} (x) - F_{2} (x) | d x .

(69)

Proof.

First, check the upper bound

W_{1} (P_{1}, P_{2}) \leq \int_{R} | F_{1} (x) - F_{2} (x) | d x

. Consider

ξ \sim

U

[0, 1]

,

X_{i} = F_{i}^{- 1} (ξ), i = 1, 2

. Then, in view of the Fubini theorem,

E [| X_{1} - X_{2} |] = \int_{0}^{1} | F_{1}^{- 1} (y) - F_{2}^{- 1} (y) | d y = \int_{R} | F_{1} (x) - F_{2} (x) | d x .

(70)

For the proof of the reverse inequality, see [8]. □
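The identity (70) can be checked on empirical distributions. The sketch below assumes two samples of equal size and NumPy; `w1_quantile` and `w1_cdf` are illustrative names, not part of any library.

```python
import numpy as np

def w1_quantile(x, y):
    """W_1 via the coupling X_i = F_i^{-1}(xi): for two empirical laws with the
    same number of atoms this is the mean of |x_(k) - y_(k)| over order statistics."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    return float(np.mean(np.abs(x - y)))

def w1_cdf(x, y):
    """W_1 as the integral of |F_1 - F_2| in (69); both empirical CDFs are
    constant between consecutive points of the pooled sample."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    grid = np.sort(np.concatenate([x, y]))
    widths = np.diff(grid)
    # Value of each empirical CDF on the interval [grid[k], grid[k+1])
    f1 = np.searchsorted(x, grid[:-1], side="right") / x.size
    f2 = np.searchsorted(y, grid[:-1], side="right") / y.size
    return float(np.sum(np.abs(f1 - f2) * widths))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)
y = rng.normal(0.5, 2.0, size=500)
print(w1_quantile(x, y), w1_cdf(x, y))  # the two values agree up to rounding
```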

Proposition 10.

For

d = 1

and

p > 1

,

\begin{matrix} W_{p} {(P_{1}, P_{2})}^{p} = p (p - 1) \int_{- \infty}^{\infty} d y \int_{y}^{\infty} max [F_{2} (y) - F_{1} (x), 0] {(x - y)}^{p - 2} d x \\ + p (p - 1) \int_{- \infty}^{\infty} d x \int_{x}^{\infty} max [F_{1} (x) - F_{2} (y), 0] {(y - x)}^{p - 2} d y . \end{matrix}

(71)

Proof.

It follows from the identity, valid for a pair (X, Y) with joint CDF F and marginal CDFs $F_{1}$, $F_{2}$,

\begin{matrix} E [{| X - Y |}^{p}] = p (p - 1) \int_{- \infty}^{\infty} d y \int_{y}^{\infty} [F_{2} (y) - F (x, y)] {(x - y)}^{p - 2} d x \\ + p (p - 1) \int_{- \infty}^{\infty} d x \int_{x}^{\infty} [F_{1} (x) - F (x, y)] {(y - x)}^{p - 2} d y \end{matrix}

(72)

The minimum is achieved for

\bar{F} (x, y) = min [F_{1} (x), F_{2} (y)]

. An alternative expression is (see [9]):

W_{p} {(P_{1}, P_{2})}^{p} = \int_{0}^{1} {| F_{1}^{- 1} (t) - F_{2}^{- 1} (t) |}^{p} d t .

(73)

□
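As a sanity check of (73), the quantile integral can be approximated numerically. The sketch below assumes two 1D Gaussian laws and uses only the standard library (`statistics.NormalDist`); for Gaussians the quantile functions are $μ_{i} + σ_{i} z$, so (73) with $p = 2$ gives $W_{2}^{2} = {(μ_{1} - μ_{2})}^{2} + {(σ_{1} - σ_{2})}^{2}$. The helper name `wp_quantile` and the parameter values are illustrative.

```python
from statistics import NormalDist

def wp_quantile(d1, d2, p=2.0, n=20000):
    """Midpoint-rule approximation of (73):
    (int_0^1 |F1^{-1}(t) - F2^{-1}(t)|^p dt)^(1/p)."""
    total = 0.0
    for k in range(n):
        t = (k + 0.5) / n
        total += abs(d1.inv_cdf(t) - d2.inv_cdf(t)) ** p
    return (total / n) ** (1.0 / p)

P1, P2 = NormalDist(0.0, 1.0), NormalDist(1.0, 3.0)
approx = wp_quantile(P1, P2, p=2.0)
# Closed form for 1D Gaussians: sqrt((mu1 - mu2)^2 + (sigma1 - sigma2)^2)
exact = ((0.0 - 1.0) ** 2 + (1.0 - 3.0) ** 2) ** 0.5
print(approx, exact)
```

The midpoint rule slightly underestimates the tails of the integrand, so the agreement is approximate rather than exact.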

Proposition 11.

Let

(X, Y) \in R^{2 d}

be jointly Gaussian random variables (RVs) with

E [X] = μ^{X}, E [Y] = μ^{Y}

. Then, the Fréchet-1 distance is

\begin{matrix} ρ^{F_{1}} (X, Y) : = E [\sum_{j = 1}^{d} | X_{j} - Y_{j} |] \\ = \sum_{j = 1}^{d} [(μ_{j}^{X} - μ_{j}^{Y}) (1 - 2 Φ (- \frac{(μ_{j}^{X} - μ_{j}^{Y})}{{\hat{σ}}_{j}})) + 2 {\hat{σ}}_{j} φ (- \frac{(μ_{j}^{X} - μ_{j}^{Y})}{{\hat{σ}}_{j}})] . \end{matrix}

(74)

where

{\hat{σ}}_{j} = {({(σ_{j}^{X})}^{2} + {(σ_{j}^{Y})}^{2} - 2 Cov (X_{j}, Y_{j}))}^{1 / 2}

, φ and Φ are PDF and CDF of the standard Gaussian RV. Note that, in the case

μ^{X} = μ^{Y}

, the first term in (74) vanishes, and the second term gives

ρ^{F_{1}} (X, Y) = \sqrt{\frac{2}{π}} \sum_{j = 1}^{d} {\hat{σ}}_{j} .

(75)

We also present expressions for the Fréchet-3 and Fréchet-4 distances

\begin{matrix} ρ^{F_{3}} (X, Y) = {(E [\sum_{j = 1}^{d} {| X_{j} - Y_{j} |}^{3}])}^{1 / 3} = {(\sum_{j = 1}^{d} [{(μ_{j}^{X} - μ_{j}^{Y})}^{3} (1 - 2 Φ (- \frac{(μ_{j}^{X} - μ_{j}^{Y})}{{\hat{σ}}_{j}})) \\ + 6 {(μ_{j}^{X} - μ_{j}^{Y})}^{2} {\hat{σ}}_{j} φ (- \frac{(μ_{j}^{X} - μ_{j}^{Y})}{{\hat{σ}}_{j}}) + 3 {({\hat{σ}}_{j})}^{2} (μ_{j}^{X} - μ_{j}^{Y}) [1 - 2 Φ (- \frac{(μ_{j}^{X} - μ_{j}^{Y})}{{\hat{σ}}_{j}}) - \\ 2 \frac{(μ_{j}^{X} - μ_{j}^{Y})}{{\hat{σ}}_{j}} φ (- \frac{(μ_{j}^{X} - μ_{j}^{Y})}{{\hat{σ}}_{j}})] + 2 {({\hat{σ}}_{j})}^{3} φ (- \frac{(μ_{j}^{X} - μ_{j}^{Y})}{{\hat{σ}}_{j}}) [{(\frac{(μ_{j}^{X} - μ_{j}^{Y})}{{\hat{σ}}_{j}})}^{2} + 2]])}^{1 / 3} \\ ρ^{F_{4}} (X, Y) = {(E [\sum_{j = 1}^{d} {| X_{j} - Y_{j} |}^{4}])}^{1 / 4} = {(\sum_{j = 1}^{d} [{(μ_{j}^{X} - μ_{j}^{Y})}^{4} + 6 {(μ_{j}^{X} - μ_{j}^{Y})}^{2} {({\hat{σ}}_{j})}^{2} + 3 {({\hat{σ}}_{j})}^{4}])}^{1 / 4} . \end{matrix}

(76)

All of these expressions are minimized when

Cov (X_{j}, Y_{j}), j = 1, \dots, d

are maximal. However, this fact does not lead immediately to the explicit expressions for Wasserstein’s metrics. The problem here is that the joint covariance matrix

Σ_{X, Y}

must be positive semidefinite. Thus, the straightforward choice

Corr (X_{j}, Y_{j}) = 1

is not always possible; see Theorem 6 below and [10].
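The closed form (74) is easy to cross-check by simulation. The following sketch treats a single coordinate $j$, assuming illustrative parameter values and a fixed random seed; `frechet1_component` is a hypothetical helper name.

```python
import math
import random

def phi(z):
    """Standard Gaussian PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def frechet1_component(mu_x, mu_y, sd_x, sd_y, cov):
    """One summand of (74): E|X_j - Y_j| for jointly Gaussian (X_j, Y_j)."""
    m = mu_x - mu_y
    s = math.sqrt(sd_x**2 + sd_y**2 - 2.0 * cov)   # hat-sigma_j, assumed > 0
    return m * (1.0 - 2.0 * Phi(-m / s)) + 2.0 * s * phi(-m / s)

# Monte Carlo cross-check with correlation rho (illustrative values).
mu_x, mu_y, sd_x, sd_y, rho = 1.0, 0.3, 2.0, 1.5, 0.4
rng = random.Random(1)
n, acc = 200_000, 0.0
for _ in range(n):
    u, v = rng.gauss(0, 1), rng.gauss(0, 1)
    x = mu_x + sd_x * u
    y = mu_y + sd_y * (rho * u + math.sqrt(1 - rho**2) * v)
    acc += abs(x - y)
print(acc / n, frechet1_component(mu_x, mu_y, sd_x, sd_y, rho * sd_x * sd_y))
```

With equal means the function returns $\sqrt{2 / π} {\hat{σ}}_{j}$, in agreement with (75), and the value decreases as the covariance grows.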

[Maurice René Fréchet (1878–1973), a French mathematician, worked in topology, functional analysis, probability theory and statistics. He was the first to introduce the concept of a metric space (1906) and prove the representation theorem in

L_{2}

(1907). However, in both cases, the credit was given to other people: Hausdorff and Riesz. Some sources claim that he discovered the Cramér–Rao inequality before anybody else, but such a claim was impossible to verify since lecture notes of his class appeared to be lost. Fréchet worked in several places in France before moving to Paris in 1928. In 1941, he succeeded Borel at the Chair of Calculus of Probabilities and Mathematical Physics in Sorbonne. In 1956, he was elected to the French Academy of Sciences, at the age of 78, which was rather unusual. He influenced and mentored a number of young mathematicians, notably Fortet and Loève. He was an enthusiast of Esperanto; some of his papers were published in this language].

8. Wasserstein Distance in the Gaussian Case

In the Gaussian case, it is convenient to use the following extension of Dobrushin’s bound for

p = 2

:

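ρ^{L - P} (P_{1}, P_{2}) \leq {[W_{p} (P_{1}, P_{2})]}^{p / (p + 1)}, p \geq 1 .

(77)

Indeed, repeating the chain (68) with $r^{p}$ in place of $r$ gives $E_{\bar{P}} [r {(X_{1}, X_{2})}^{p}] \geq ϵ^{p + 1}$; for $p = 1$, the bound (77) reduces to (67).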

Theorem 6.

Let

X_{i} \sim

N

(μ_{i}, Σ_{i}^{2}), i = 1, 2,

be d-dimensional Gaussian RVs. For simplicity, assume that both matrices

Σ_{1}^{2}

and

Σ_{2}^{2}

are non-singular (in the general case, the statement holds with

Σ_{1}^{- 1}

understood as the Moore–Penrose inverse). Then, the

L_{2} -

Wasserstein distance

W_{2} (X_{1}, X_{2}) = W_{2} (N (μ_{1}, Σ_{1}^{2}), N (μ_{2}, Σ_{2}^{2}))

equals

W_{2} (X_{1}, X_{2}) = {[{| | μ_{1} - μ_{2} | |}^{2} + tr (Σ_{1}^{2}) + tr (Σ_{2}^{2}) - 2 tr [{(Σ_{1} Σ_{2}^{2} Σ_{1})}^{1 / 2}]]}^{1 / 2}

(78)

where

{(Σ_{1} Σ_{2}^{2} Σ_{1})}^{1 / 2}

stands for the positive semidefinite matrix square root. The value (78) is achieved when

X_{2} = μ_{2} + A (X_{1} - μ_{1})

where

A = Σ_{1}^{- 1} {(Σ_{1} Σ_{2}^{2} Σ_{1})}^{1 / 2} Σ_{1}^{- 1}

.

Corollary 1.

Let

μ_{1} = μ_{2} = 0

. Then, for

d = 1

,

W_{2} (X_{1}, X_{2}) = | σ_{1} - σ_{2} |

. For

d = 2

,

W_{2} (X_{1}, X_{2}) = {[tr (Σ_{1}^{2}) + tr (Σ_{2}^{2}) - 2 {[tr (Σ_{1}^{2} Σ_{2}^{2}) + 2 \sqrt{\det (Σ_{1}^{2} Σ_{2}^{2})}]}^{1 / 2}]}^{1 / 2} .

(79)

Note that the expression in (79) vanishes when

Σ_{1}^{2} = Σ_{2}^{2}

.

Example 3.

(a) Let

X \sim

N

(0, Σ_{X}^{2})

,

Y \sim

N

(0, Σ_{Y}^{2})

where

Σ_{X}^{2} = σ_{X}^{2} I_{d}

and

Σ_{Y}^{2} = σ_{Y}^{2} I_{d}

. Then,

W_{2} (X, Y) = \sqrt{d} | σ_{X} - σ_{Y} |

.

(b) Let $d = 2$ , $X \sim$ N $(0, Σ_{X}^{2})$ , $Y \sim$ N $(0, Σ_{Y}^{2})$ , where $Σ_{X}^{2} = σ_{X}^{2} I_{2}$ , $Σ_{Y}^{2} = σ_{Y}^{2} (\begin{matrix} 1 & ρ \\ ρ & 1 \end{matrix})$ and $ρ \in (- 1, 1)$ . Then,

$W_{2} (X, Y) = 2^{1 / 2} {(σ_{X}^{2} + σ_{Y}^{2} - σ_{X} σ_{Y} {[2 + 2 {(1 - ρ^{2})}^{1 / 2}]}^{1 / 2})}^{1 / 2} .$

(c) Let $d = 2$ , $X \sim$ N $(0, Σ_{X}^{2})$ , $Y \sim$ N $(0, Σ_{Y}^{2})$ , where $Σ_{X}^{2} = σ_{X}^{2} (\begin{matrix} 1 & ρ_{1} \\ ρ_{1} & 1 \end{matrix})$ , $Σ_{Y}^{2} = σ_{Y}^{2} (\begin{matrix} 1 & ρ_{2} \\ ρ_{2} & 1 \end{matrix})$ and $ρ_{1}, ρ_{2} \in (- 1, 1)$ . Then,

$W_{2} (X, Y) = 2^{1 / 2} {(σ_{X}^{2} + σ_{Y}^{2} - σ_{X} σ_{Y} {[2 + 2 ρ_{1} ρ_{2} + 2 {(1 - ρ_{1}^{2})}^{1 / 2} {(1 - ρ_{2}^{2})}^{1 / 2}]}^{1 / 2})}^{1 / 2} .$

Note that, in the case

ρ_{1} = ρ_{2}

,

W_{2} (X, Y) = \sqrt{2} | σ_{X} - σ_{Y} |

as in (a).
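Formula (78) and the closed forms of Example 3 can be compared numerically. The sketch below assumes NumPy and computes matrix square roots through the symmetric eigendecomposition; the helper names `psd_sqrt` and `w2_gaussian` and the parameter values are illustrative.

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric positive semidefinite square root via the eigendecomposition."""
    w, Q = np.linalg.eigh((M + M.T) / 2.0)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def w2_gaussian(mu1, cov1, mu2, cov2):
    """Formula (78); cov1 and cov2 play the role of Sigma_1^2 and Sigma_2^2."""
    s1 = psd_sqrt(cov1)                          # Sigma_1
    cross = psd_sqrt(s1 @ cov2 @ s1)             # (Sigma_1 Sigma_2^2 Sigma_1)^{1/2}
    d2 = np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2)
    d2 += np.trace(cov1) + np.trace(cov2) - 2.0 * np.trace(cross)
    return float(np.sqrt(max(d2, 0.0)))

sx, sy, r1, r2 = 1.3, 0.7, 0.5, -0.2
zero = np.zeros(2)
cov_x = sx**2 * np.array([[1.0, r1], [r1, 1.0]])
cov_y = sy**2 * np.array([[1.0, r2], [r2, 1.0]])

# Example 3(c), closed form
inner = 2 + 2 * r1 * r2 + 2 * np.sqrt(1 - r1**2) * np.sqrt(1 - r2**2)
closed_c = np.sqrt(2 * (sx**2 + sy**2 - sx * sy * np.sqrt(inner)))
print(w2_gaussian(zero, cov_x, zero, cov_y), closed_c)

# Example 3(a) with d = 2: sqrt(d) * |sigma_X - sigma_Y|
closed_a = np.sqrt(2) * abs(sx - sy)
print(w2_gaussian(zero, sx**2 * np.eye(2), zero, sy**2 * np.eye(2)), closed_a)
```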

Proof.

First, reduce to the case

μ_{1} = μ_{2} = 0

by using the identity

W_{2}^{2} (X_{1}, X_{2}) = {| | μ_{1} - μ_{2} | |}^{2} + W_{2}^{2} (ξ_{1}, ξ_{2})

with

ξ_{i} = X_{i} - μ_{i}

. Note that the infimum in (19) is always attained on Gaussian measures as

W_{2} (X_{1}, X_{2})

is expressed in terms of the covariance matrix

Σ^{2} = Σ_{X, Y}^{2}

only (cf. (81) below). Let us write the covariance matrix in the block form

Σ^{2} = (\begin{matrix} Σ_{1}^{2} & K \\ K^{T} & Σ_{2}^{2} \end{matrix}) = (\begin{matrix} Σ_{1} & 0 \\ K^{T} Σ_{1}^{- 1} & I \end{matrix}) (\begin{matrix} I & 0 \\ 0 & S \end{matrix}) (\begin{matrix} Σ_{1} & Σ_{1}^{- 1} K \\ 0 & I \end{matrix})

(80)

with the Schur complement

S = Σ_{2}^{2} - K^{T} Σ_{1}^{- 2} K

. The problem is reduced to finding the matrix K in (80) that minimizes the expression

\int_{R^{d} \times R^{d}} {| | x - y | |}^{2} d P_{X, Y} (x, y) = tr (Σ_{1}^{2}) + tr (Σ_{2}^{2}) - 2 tr (K)

(81)

subject to a constraint that the matrix

Σ^{2}

in (80) is positive semidefinite. The goal is to check that the minimum of (81) is achieved when the Schur complement S in (80) equals 0. Consider the fiber

σ^{- 1} (S)

, i.e., the set of all matrices K such that

σ (K) : = Σ_{Y}^{2} - K^{T} {(Σ_{X}^{2})}^{- 1} K = S

. It is enough to check that the maximum value of

tr (K)

on this fiber equals

max_{K \in σ^{- 1} (S)} tr (K) = tr [{(Σ_{X} (Σ_{Y}^{2} - S) Σ_{X})}^{1 / 2}] .

(82)

Since the matrix S is positive semidefinite, it is easy to check that the fiber

S = 0

should be selected. In order to establish (82), represent the positive semidefinite matrix

Σ_{Y}^{2} - S

in the form

Σ_{Y}^{2} - S = U D_{r}^{2} U^{T}

, where the diagonal matrix

D_{r}^{2} = diag (λ_{1}^{2}, \dots, λ_{r}^{2}, 0, \dots, 0)

and

λ_{i} > 0

. Next,

U = (U_{r} | U_{d - r})

is the orthogonal matrix of the corresponding eigenvectors. We obtain the following

r \times r

identity:

{(Σ_{X}^{- 1} K U_{r} D_{r}^{- 1})}^{T} (Σ_{X}^{- 1} K U_{r} D_{r}^{- 1}) = I_{r} .

(83)

It means that

Σ_{X}^{- 1} K U_{r} D_{r}^{- 1} = O_{r}

, an ‘orthogonal’

d \times r

matrix, with

O_{r}^{T} O_{r} = I_{r}

, and

K = Σ_{X} O_{r} D_{r} U_{r}^{T}

. The matrix

O_{r}

parametrises the fiber

σ^{- 1} (S)

. As a result, we have an optimization problem

tr (O_{r}^{T} M) \to max, M = Σ_{X} U_{r} D_{r}

(84)

in a matrix-valued argument

O_{r}

, subject to the constraint

O_{r}^{T} O_{r} = I_{r}

. A straightforward computation gives the answer

tr [{(M^{T} M)}^{1 / 2}]

, which is equivalent to (82). Technical details can be found in [11,12]. □
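The two claims of the proof (the optimal cross-covariance has zero Schur complement, and its trace equals the trace of the matrix square root) can be verified numerically. This is a sketch on randomly generated covariance matrices, assuming NumPy; all variable names are illustrative, and the random orthogonal matrices sample the fiber $S = 0$ through the parametrisation $K = Σ_{X} O D U^{T}$ used above.

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric positive semidefinite square root via the eigendecomposition."""
    w, Q = np.linalg.eigh((M + M.T) / 2.0)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

rng = np.random.default_rng(7)
d = 3
G1, G2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
cov1 = G1 @ G1.T + np.eye(d)      # Sigma_1^2, non-singular by construction
cov2 = G2 @ G2.T + np.eye(d)      # Sigma_2^2

s1 = psd_sqrt(cov1)
s1_inv = np.linalg.inv(s1)
root = psd_sqrt(s1 @ cov2 @ s1)
A = s1_inv @ root @ s1_inv        # linear map of Theorem 6
K = cov1 @ A                      # Cov(X_1, X_2) when X_2 = A X_1 (zero means)

schur = cov2 - K.T @ np.linalg.inv(cov1) @ K
print(np.abs(schur).max())                 # numerically zero
print(np.trace(K), np.trace(root))         # equal traces

# Other points of the fiber S = 0 never give a larger trace.
w2, U = np.linalg.eigh(cov2)
D = np.diag(np.sqrt(w2))
best = np.trace(root)
for trial in range(1000):
    O, R = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix
    K_O = s1 @ O @ D @ U.T
    assert np.trace(K_O) <= best + 1e-9
```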

Remark 3.

For general zero-mean RVs

X, Y \in R^{d}

with the covariance matrices

Σ_{i}^{2}, i = 1, 2

, the following inequality holds [13]:

tr (Σ_{1}^{2}) + tr (Σ_{2}^{2}) - 2 tr [{(Σ_{1} Σ_{2}^{2} Σ_{1})}^{1 / 2}] \leq E [{| | X - Y | |}^{2}] \leq tr (Σ_{1}^{2}) + tr (Σ_{2}^{2}) + 2 tr [{(Σ_{1} Σ_{2}^{2} Σ_{1})}^{1 / 2}] .

(85)

9. Distance between Distributions of Different Dimensions

For

m \leq d

, define a set of matrices with orthonormal rows:

O (m, d) = {V \in R^{m \times d} : V V^{T} = I_{m}}

(86)

and a set of affine maps

φ : R^{d} \to R^{m}

such that

φ_{V, b} (x) = V x + b

.

Definition 1.

For any measures

μ \in

M

(R^{m})

and

ν \in

M

(R^{d})

, the embeddings of μ into

R^{d}

are the set of d-dimensional measures

Φ^{+} (μ, d) : = {α \in M (R^{d}) : φ_{V, b} (α) = μ}

for some

V \in O (m, d), b \in R^{m}

, and the projections of ν onto

R^{m}

are the set of m-dimensional measures

Φ^{-} (ν, m) : = {β \in M (R^{m}) : φ_{V, b} (ν) = β}

for some

V \in O (m, d), b \in R^{m}

.

Given a metric

κ

between measures of the same dimension, define the projection distance

d^{-} (μ, ν) : = {inf}_{β \in Φ^{-} (ν, m)} κ (μ, β)

and the embedding distance

d^{+} (μ, ν) : = {inf}_{α \in Φ^{+} (μ, d)}

κ (α, ν)

. It may be proved [14] that

d^{+} (μ, ν) = d^{-} (μ, ν)

; denote the common value by

\hat{d} (μ, ν)

.

Example 4.

Let us compute the Wasserstein distance between one-dimensional

X \sim

N

(μ_{1}, σ^{2})

and d-dimensional

Y \sim

N

(μ_{2}, Σ)

. Denote by

λ_{1} \geq λ_{2} \geq \dots \geq λ_{d}

the eigenvalues of Σ. Then,

{\hat{W}}_{2} (X, Y) = \{\begin{matrix} σ - \sqrt{λ_{1}} & \text{if } σ > \sqrt{λ_{1}} \\ 0 & \text{if } \sqrt{λ_{d}} \leq σ \leq \sqrt{λ_{1}} \\ \sqrt{λ_{d}} - σ & \text{if } σ < \sqrt{λ_{d}} . \end{matrix}

(87)

Indeed, in view of Theorem 6, write

\begin{matrix} {(W_{2}^{-} (X, Y))}^{2} = min_{{| | x | |}_{2} = 1, b \in R} [{| | μ_{1} - x^{T} μ_{2} - b | |}_{2}^{2} \\ + tr (σ^{2} + x^{T} Σ x - 2 σ \sqrt{x^{T} Σ x})] = min_{{| | x | |}_{2} = 1} {(σ - \sqrt{x^{T} Σ x})}^{2}, \end{matrix}

(88)

and (87) follows.
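Formula (87) translates into a few lines of code. The sketch below assumes NumPy; `w2_hat_1d_vs_gaussian` is an illustrative name, and the brute-force search over random unit vectors only approximates the minimum in (88).

```python
import numpy as np

def w2_hat_1d_vs_gaussian(sigma, cov):
    """Formula (87): distance between N(mu_1, sigma^2) on R and N(mu_2, cov) on R^d.
    The means play no role because the shift b in phi_{V,b} is free."""
    lam = np.linalg.eigvalsh(cov)                  # ascending eigenvalues
    lo, hi = np.sqrt(lam[0]), np.sqrt(lam[-1])     # sqrt(lambda_d), sqrt(lambda_1)
    if sigma > hi:
        return float(sigma - hi)
    if sigma < lo:
        return float(lo - sigma)
    return 0.0

cov = np.diag([4.0, 1.0, 0.25])        # square roots of eigenvalues: 2, 1, 0.5
for sigma in (3.0, 1.2, 0.1):
    print(sigma, w2_hat_1d_vs_gaussian(sigma, cov))

# Brute-force check of (88): minimise |sigma - sqrt(x^T cov x)| over random unit x.
rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
proj_sd = np.sqrt(np.einsum("ni,ij,nj->n", X, cov, X))
brute = float(np.min(np.abs(3.0 - proj_sd)))
print(brute)                            # slightly above 3 - 2 = 1
```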

Example 5 (Wasserstein-2 distance between Dirac measure on $R^{m}$ and a discrete measure on $R^{d}$ ).

Let

y \in R^{m}

and

μ_{1} \in

M

(R^{m})

be the Dirac measure with

μ_{1} (y) = 1

, i.e., all mass concentrated at

y

. Let

x_{1}, \dots, x_{k} \in R^{d}

be distinct points,

p_{1}, \dots, p_{k} \geq 0, p_{1} + \dots + p_{k} = 1

, and let

μ_{2} \in

M

(R^{d})

be the discrete measure of point masses with

μ_{2} (x_{i}) = p_{i}

,

i = 1, \dots, k

. We seek the Wasserstein distance

{\hat{W}}_{2} (μ_{1}, μ_{2})

in a closed-form solution. Suppose

m \leq d

; then,

\begin{matrix} {(W_{2}^{-} (μ_{1}, μ_{2}))}^{2} = inf_{V \in O (m, d)} inf_{b \in R^{m}} \sum_{i = 1}^{k} p_{i} {| | V x_{i} + b - y | |}_{2}^{2} \\ = inf_{V \in O (m, d)} \sum_{i = 1}^{k} p_{i} {| | V x_{i} - \sum_{i = 1}^{k} p_{i} V x_{i} | |}_{2}^{2} = inf_{V \in O (m, d)} tr (V C V^{T}) \end{matrix}

(89)

noting that the second infimum is attained by

b = y - \sum_{i = 1}^{k} p_{i} V x_{i}

and defining C in the last infimum to be

C : = \sum_{i = 1}^{k} p_{i} (x_{i} - \sum_{i = 1}^{k} p_{i} x_{i}) {(x_{i} - \sum_{i = 1}^{k} p_{i} x_{i})}^{T} \in R^{d \times d} .

(90)

Let the eigenvalue decomposition of the symmetric positive semidefinite matrix C be

C = Q Λ Q^{T}

with

Λ = diag (λ_{1}, \dots, λ_{d})

,

λ_{1} \geq \dots \geq λ_{d} \geq 0 .

Then,

inf_{V \in O (m, d)} tr (V C V^{T}) = \sum_{i = 0}^{m - 1} λ_{d - i}

(91)

and is attained when

V \in

O

(m, d)

has row vectors given by the last m columns of

Q \in

O

(d)

.
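The recipe (89)–(91) can be sketched as follows, assuming NumPy and randomly generated support points; `dirac_vs_discrete_w2_sq` is an illustrative name. Note that `numpy.linalg.eigh` returns eigenvalues in ascending order, so the "last m columns of Q" in the descending convention of the text correspond to the first m columns here.

```python
import numpy as np

def dirac_vs_discrete_w2_sq(points, probs, m):
    """Squared projection distance (89)-(91): sum of the m smallest eigenvalues
    of the weighted covariance C in (90). Returns the value, an optimal V, and C."""
    x = np.asarray(points, float)              # shape (k, d)
    p = np.asarray(probs, float)               # shape (k,), sums to 1
    centred = x - p @ x                        # subtract the weighted mean
    C = (centred * p[:, None]).T @ centred     # matrix C of (90)
    lam, Q = np.linalg.eigh(C)                 # ascending eigenvalues
    V = Q[:, :m].T                             # rows: eigenvectors of the m smallest
    return float(lam[:m].sum()), V, C

rng = np.random.default_rng(3)
pts = rng.normal(size=(6, 4)) * np.array([3.0, 2.0, 1.0, 0.1])
p = np.full(6, 1 / 6)
val, V, C = dirac_vs_discrete_w2_sq(pts, p, m=2)
print(val, np.trace(V @ C @ V.T))              # identical up to rounding
```

The location y of the Dirac mass does not enter the result, since the optimal shift b absorbs it.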

Note that the geodesic distance (7) and (8) between Gaussian PDs (or corresponding covariance matrices) is equivalent to the formula for the Fisher information metric for the multivariate normal model [15]. Indeed, the multivariate normal model is a differentiable manifold, equipped with the Fisher information as a Riemannian metric; this may be used in statistical inference.

Example 6.

Consider i.i.d. random variables

Z_{1}, \dots, Z_{n}

having a bivariate normal distribution with a diagonal covariance matrix, i.e., we focus on the manifold

M_{d i a g} = {N (μ, Λ) : μ \in R^{2}, Λ diagonal}

. In this manifold, consider the submodel

M_{d i a g}^{*} = {N (μ, σ^{2} I) : μ \in R^{2}, σ^{2} \in R_{+}}

corresponding to the hypothesis

H_{0} : σ_{1}^{2} = σ_{2}^{2}

. First, consider the standard statistical estimates

\bar{Z}

for the mean and

s_{1}^{2}, s_{2}^{2}

for the variances. If

{\bar{σ}}^{2}

denotes the geodesic estimate of the common variance, the squared distance between the initial estimate and the geodesic estimate under the hypothesis

H_{0}

is given by

\frac{n}{2} [{(ln \frac{{\bar{σ}}^{2}}{s_{1}^{2}})}^{2} + {(ln \frac{{\bar{σ}}^{2}}{s_{2}^{2}})}^{2}]

(92)

which is minimized by

{\bar{σ}}^{2} = s_{1} s_{2}

. Hence, instead of the arithmetic mean of the initial variance estimates, the geodesic estimate of the common variance is their geometric mean.
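A quick numerical illustration of the minimisation of (92), using only the standard library. The sample variances and the sample size are made-up values, and the grid search is only a crude stand-in for the exact minimiser.

```python
import math

def geodesic_sq_dist(var_bar, s1_sq, s2_sq, n):
    """Expression (92) with sample variances s1_sq = s_1^2 and s2_sq = s_2^2."""
    return 0.5 * n * (math.log(var_bar / s1_sq) ** 2 + math.log(var_bar / s2_sq) ** 2)

s1_sq, s2_sq, n = 1.5, 6.0, 50               # illustrative values
geo = math.sqrt(s1_sq * s2_sq)               # s_1 * s_2, geometric mean of the variances
ari = 0.5 * (s1_sq + s2_sq)                  # arithmetic mean, for comparison

grid = [0.5 + 0.001 * k for k in range(10000)]
best = min(grid, key=lambda v: geodesic_sq_dist(v, s1_sq, s2_sq, n))
print(best, geo, ari)
```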

Finally, we present the distance between symmetric positive definite matrices of different dimensions. Let

m \leq d

, let A be an

m \times m

matrix and let

B = (\begin{matrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{matrix})

be a

d \times d

matrix; here,

B_{11}

is an

m \times m

block. Then, the distance is defined as follows:

d_{2} (A, B) : = {(\sum_{j = 1}^{m} {(max [0, ln λ_{j} (A^{- 1} B_{11})])}^{2})}^{1 / 2} .

(93)

In order to estimate the distance (93), after the simultaneous diagonalization of matrices A and B, the following classical result is useful:

Theorem 7 (Cauchy interlacing inequalities).

Let

B = (\begin{matrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{matrix})

be a

d \times d

symmetric positive definite matrix with eigenvalues

λ_{1} (B) \leq \dots \leq λ_{d} (B)

and

m \times m

block

B_{11}

. Then,

λ_{j} (B) \leq λ_{j} (B_{11}) \leq λ_{j + d - m} (B), j = 1, \dots, m .

(94)
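Both the interlacing inequalities (94) and the distance (93) are straightforward to evaluate numerically. The sketch below assumes NumPy and a randomly generated matrix B; the function name `d2` mirrors the notation of (93) and is not a library routine.

```python
import numpy as np

rng = np.random.default_rng(11)
d, m = 5, 3
G = rng.normal(size=(d, d))
B = G @ G.T + 0.1 * np.eye(d)          # symmetric positive definite d x d
B11 = B[:m, :m]                        # leading m x m block

lam_B = np.linalg.eigvalsh(B)          # ascending, as in Theorem 7
lam_B11 = np.linalg.eigvalsh(B11)
for j in range(m):                     # 0-based j corresponds to j + 1 in (94)
    assert lam_B[j] - 1e-9 <= lam_B11[j] <= lam_B[j + d - m] + 1e-9

def d2(A, B, m):
    """Distance (93); the eigenvalues of A^{-1} B11 are real and positive
    because A and B11 are symmetric positive definite."""
    lam = np.linalg.eigvals(np.linalg.solve(A, B[:m, :m])).real
    return float(np.sqrt(np.sum(np.maximum(0.0, np.log(lam)) ** 2)))

print(d2(np.eye(m), B, m))
```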

10. Context-Sensitive Probability Metrics

The weighted entropy and other weighted probabilistic quantities generated a substantial amount of literature (see [16,17] and the references therein). The purpose was to introduce a disparity between outcomes of the same probability: in the case of a standard entropy, such outcomes contribute the same amount of information/uncertainty, which is appropriate in context-free situations. However, imagine two equally rare medical conditions, occurring with probability

p ≪ 1

, one of which carries a major health risk while the other is just a peculiarity. Formally, they provide the same amount of information:

- log p

, but the value of this information can be very different. Applications of the weighted entropy to clinical trials are under active development (see [18] and the literature cited therein). In addition, the contribution to the distance (say, from a fixed distribution

Q

) related to these outcomes is the same in any conventional sense. Weighted metrics, built on weight functions, are meant to grade outcomes by their importance, at least to a certain extent.

Let the weight function or graduation

φ > 0

on the phase space

X

be given. Define the total weighted variation (TWV) distance

τ_{φ} (P_{1}, P_{2}) = \frac{1}{2} (sup_{A} [\int_{A} φ d P_{1} - \int_{A} φ d P_{2}] + sup_{A} [\int_{A} φ d P_{2} - \int_{A} φ d P_{1}]) .

(95)

Similarly, define the weighted Hellinger distance. Let

p_{1}, p_{2}

be the densities of

P_{1}, P_{2}

w.r.t. a measure

ν

. Then,

η_{φ} (P_{1}, P_{2}) : = \frac{1}{\sqrt{2}} {(\int φ {(\sqrt{p_{1}} - \sqrt{p_{2}})}^{2} d ν)}^{1 / 2} .

(96)

Lemma 1.

Let

p_{1}, p_{2}

be the densities of

P_{1}, P_{2}

w.r.t. a measure ν. Then,

τ_{φ} (P_{1}, P_{2})

is a distance and

τ_{φ} (P_{1}, P_{2}) = \frac{1}{2} \int φ | p_{1} - p_{2} | d ν

(97)

Proof.

The triangle inequality and other properties of the distance follow immediately. Next,

\begin{matrix} \int_{p_{1} > p_{2}} φ (p_{1} - p_{2}) = \frac{1}{2} (\int φ p_{1} - \int φ p_{2}) + \frac{1}{2} \int φ | p_{1} - p_{2} | d ν \\ \int_{p_{2} > p_{1}} φ (p_{2} - p_{1}) = \frac{1}{2} (\int φ p_{2} - \int φ p_{1}) + \frac{1}{2} \int φ | p_{1} - p_{2} | d ν \end{matrix}

(98)

Summing up these equalities implies (97). □
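On a finite phase space, Lemma 1 can be checked by brute force over all events. The weights and probability vectors below are illustrative, and only the standard library is used.

```python
from itertools import combinations

phi = [1.0, 5.0, 0.5, 2.0]          # illustrative weights
p1 = [0.1, 0.2, 0.3, 0.4]
p2 = [0.25, 0.25, 0.25, 0.25]
idx = range(len(phi))

def subsets(items):
    """All subsets of a finite collection, including the empty set."""
    items = list(items)
    for r in range(len(items) + 1):
        yield from combinations(items, r)

def signed_gap(A, p, q):
    """int_A phi dP - int_A phi dQ for a finite event A."""
    return sum(phi[i] * (p[i] - q[i]) for i in A)

# Definition (95): two suprema over all events A (the empty set gives 0).
twv_sup = 0.5 * (max(signed_gap(A, p1, p2) for A in subsets(idx))
                 + max(signed_gap(A, p2, p1) for A in subsets(idx)))
# Formula (97)
twv_l1 = 0.5 * sum(phi[i] * abs(p1[i] - p2[i]) for i in idx)
print(twv_sup, twv_l1)
```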

Let

\int φ p_{1} d ν \geq \int φ p_{2} d ν

. Then, by the weighted Gibbs inequality [16],

{KL}_{φ} (P_{1} | | P_{2}) : = \int φ p_{1} log \frac{p_{1}}{p_{2}} \geq 0

.

Theorem 8 (Weighted Pinsker’s inequality).

\frac{1}{2} \int φ | p_{1} - p_{2} | \leq \sqrt{{KL}_{φ} (P_{1} | | P_{2}) / 2} \sqrt{\int φ p_{1}} .

(99)

Proof.

Define the function

G (x) = x log x - x + 1

. The following bound holds, cf. (12):

G (x) = x log x - x + 1 \geq \frac{3}{2} \frac{{(x - 1)}^{2}}{x + 2}, x > 0 .

(100)

Now, by the Cauchy–Schwarz inequality,

\begin{matrix} {(\int φ p_{2} | \frac{p_{1}}{p_{2}} - 1 |)}^{2} \leq \int φ \frac{{(\frac{p_{1}}{p_{2}} - 1)}^{2}}{\frac{p_{1}}{p_{2}} + 2} p_{2} \int φ (\frac{p_{1}}{p_{2}} + 2) p_{2} \\ \leq 3 \int φ \frac{{(\frac{p_{1}}{p_{2}} - 1)}^{2}}{\frac{p_{1}}{p_{2}} + 2} p_{2} \int φ p_{1} \leq 2 \int φ G (\frac{p_{1}}{p_{2}}) p_{2} \int φ p_{1} \leq 2 {KL}_{φ} (P_{1} | | P_{2}) \int φ p_{1} . \end{matrix}

(101)

□

Theorem 9 (Weighted Le Cam’s inequality).

τ_{φ} (P_{1}, P_{2}) \geq η_{φ} {(P_{1}, P_{2})}^{2} .

(102)

Proof.

In view of inequality

\frac{1}{2} | p_{1} - p_{2} | = \frac{1}{2} p_{1} + \frac{1}{2} p_{2} - min [p_{1}, p_{2}] \geq \frac{1}{2} p_{1} + \frac{1}{2} p_{2} - \sqrt{p_{1} p_{2}},

one obtains

τ_{φ} (P_{1}, P_{2}) \geq \frac{1}{2} \int φ p_{1} + \frac{1}{2} \int φ p_{2} - \int φ \sqrt{p_{1} p_{2}} = η_{φ} {(P_{1}, P_{2})}^{2} .

(103)

□
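The weighted Pinsker inequality (99) and the weighted Le Cam inequality (102) can be illustrated on a small discrete example. The weights and probability vectors are made up; they are chosen so that the assumption $\int φ p_{1} d ν \geq \int φ p_{2} d ν$ holds.

```python
import math

phi = [1.0, 5.0, 0.5, 2.0]            # illustrative weights
p1 = [0.25, 0.25, 0.25, 0.25]
p2 = [0.1, 0.2, 0.3, 0.4]

def wsum(f):
    """Weighted sum: sum_i phi_i * f(p1_i, p2_i)."""
    return sum(w * f(a, b) for w, a, b in zip(phi, p1, p2))

mass1 = wsum(lambda a, b: a)
mass2 = wsum(lambda a, b: b)
assert mass1 >= mass2                 # assumption behind the weighted Gibbs inequality

twv = 0.5 * wsum(lambda a, b: abs(a - b))                               # (97)
kl = wsum(lambda a, b: a * math.log(a / b))                             # KL_phi(P1 || P2)
hell_sq = 0.5 * wsum(lambda a, b: (math.sqrt(a) - math.sqrt(b)) ** 2)   # eta_phi^2 from (96)

pinsker_rhs = math.sqrt(kl / 2.0) * math.sqrt(mass1)                    # right side of (99)
print(f"Pinsker: {twv:.4f} <= {pinsker_rhs:.4f}")
print(f"Le Cam : {twv:.4f} >= {hell_sq:.4f}")
```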

Next, we relate the TWV distance to the sum of the weighted errors of both types in hypothesis testing. Let C be the critical domain for testing the hypothesis

H_{1}

:

P_{1}

versus the alternative

H_{2}

:

P_{2}

. Denote by

α_{φ} = \int_{C} φ p_{1}

and

β_{φ} = \int_{X ∖ C} φ p_{2}

the weighted error probabilities of types I and II.

Lemma 2.

Let

d = d_{C}

be the decision rule with the critical domain C. Then,

inf_{d} [α_{φ} + β_{φ}] = \frac{1}{2} [\int φ d P_{1} + \int φ d P_{2}] - τ_{φ} (P_{1}, P_{2}) .

(104)

Proof.

Denote

C^{*} = {x : p_{2} (x) > p_{1} (x)}

. For every C,

\begin{matrix} \int_{C} φ d P_{1} + \int_{X ∖ C} φ d P_{2} = \int φ d P_{2} \\ + \int φ | p_{1} - p_{2} | [1 (x \in C \cap X ∖ C^{*}) - 1 (x \in C \cap C^{*})] d ν . \end{matrix}

(105)

The right-hand side is minimal for $C = C^{*}$, where it equals $\int φ d P_{2} - \int_{p_{2} > p_{1}} φ (p_{2} - p_{1}) d ν$; the second equality in (98) then yields (104).

□
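Lemma 2 can be verified by enumerating all critical domains on a finite phase space. The weights and probability vectors below are illustrative, and only the standard library is used.

```python
from itertools import combinations

phi = [1.0, 5.0, 0.5, 2.0]            # illustrative weights
p1 = [0.25, 0.25, 0.25, 0.25]
p2 = [0.1, 0.2, 0.3, 0.4]
points = range(4)

def weighted_errors(C):
    """alpha_phi + beta_phi for the decision rule with critical domain C."""
    alpha = sum(phi[i] * p1[i] for i in C)
    beta = sum(phi[i] * p2[i] for i in points if i not in C)
    return alpha + beta

all_C = [c for r in range(5) for c in combinations(points, r)]
best_C = min(all_C, key=weighted_errors)
lhs = weighted_errors(best_C)

twv = 0.5 * sum(phi[i] * abs(p1[i] - p2[i]) for i in points)
rhs = 0.5 * (sum(phi[i] * p1[i] for i in points)
             + sum(phi[i] * p2[i] for i in points)) - twv      # right side of (104)
print(best_C, lhs, rhs)
```

The minimising domain is the set where $p_{2} > p_{1}$, as in the proof.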

Theorem 10 (Weighted Fano’s inequality).

Let

P_{1}, \dots, P_{M}

,

M \geq 2

be probability distributions such that

P_{j} ≪ P_{k}

,

\forall j, k

. Then,

\begin{matrix} inf_{d} max_{1 \leq j \leq M} \int φ (x) 1 (d (x) \neq j) d P_{j} (x) \geq \frac{log (M)}{log (M - 1)} \frac{1}{M} \sum_{j = 1}^{M} \int φ p_{j} \\ - \frac{1}{log (M - 1)} [\frac{1}{M^{2}} \sum_{j, k = 1}^{M} {KL}_{φ} (P_{j}, P_{k}) + log 2 \frac{1}{M} \sum_{j = 1}^{M} \int φ p_{j}] \end{matrix}

(106)

where the infimum is taken over all tests with values in

{1, \dots, M}

.

Proof.

Let

Z \in {1, \dots, M}

be a random variable such that

P (Z = i) = \frac{1}{M}

and let

X \sim P_{Z}

. Note that

P_{Z}

is a mixture distribution so that, for any measure

ν

such that

P_{Z} ≪ ν

, we have

\frac{d P_{Z}}{d ν} = \frac{1}{M} \sum_{k = 1}^{M} \frac{d P_{k}}{d ν}

and so

P (Z = j | X = x) = d P_{j} (x) {(\sum_{k = 1}^{M} d P_{k} (x))}^{- 1} .

By Jensen’s inequality applied to the convex function

- log x

\begin{matrix} \int φ (x) \sum_{j = 1}^{M} P (Z = j | X = x) log P (Z = j | X = x) d P_{X} (x) \\ \leq \frac{1}{M^{2}} \sum_{j, k = 1}^{M} \int φ log (\frac{d P_{j}}{d P_{k}}) d P_{j} - log (M) \frac{1}{M} \sum_{j = 1}^{M} \int φ p_{j} \\ = \frac{1}{M^{2}} \sum_{j, k = 1}^{M} {KL}_{φ} (P_{j}, P_{k}) - log (M) \frac{1}{M} \sum_{j = 1}^{M} \int φ p_{j} . \end{matrix}

(107)

On the other hand, denote by

q_{j} = \frac{P (Z = j | X)}{P (Z \neq d (X) | X)}

and

h (x) = x log x + (1 - x) log (1 - x)

. Note that

h (x) \geq - log 2

and by Jensen’s inequality

\sum_{j \neq d (X)} q_{j} log q_{j} \geq - log (M - 1)

. The following inequality holds:

\begin{matrix} \sum_{j = 1}^{M} P (Z = j | X) log P (Z = j | X) \\ = (1 - P (Z \neq d (X) | X)) log (1 - P (Z \neq d (X) | X)) + \sum_{j \neq d (X)} P (Z = j | X) log P (Z = j | X) \\ = h (P (Z = d (X) | X)) + P (Z \neq d (X) | X) \sum_{j \neq d (X)} q_{j} log q_{j} \\ \geq - log 2 - log (M - 1) P (d (X) \neq Z | X) . \end{matrix}

(108)

Integration of (108) yields

\begin{matrix} \int φ (x) \sum_{j = 1}^{M} P (Z = j | X = x) log P (Z = j | X = x) d P_{X} (x) \\ \geq (- log 2 \frac{1}{M} \sum_{j = 1}^{M} \int φ p_{j} - log (M - 1) max_{1 \leq j \leq M} \int φ (x) 1 (d (x) \neq j) d P_{j}) . \end{matrix}

(109)

Combining (107) and (109) proves (106). □

11. Conclusions

The contribution of the current paper is summarized in Table 1 below. The objects 1–8 belong to the treasures of probability theory and statistics, and we present a number of examples and additional facts that are not easy to find in the literature. The objects 9–10, as well as the distances between distributions of different dimensions, appeared quite recently. They are not fully studied and are quite rarely used in applied research. Finally, objects 11–12 have been recently introduced by the author and his collaborators. This is the field of current and future research.

Table 1. The main metrics and divergencies.

Funding

This research is supported by the grant 23-21-00052 of RSF and the HSE University Basic Research Program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Suhov, Y.; Kelbert, M. Probability and Statistics by Example: Volume I. Basic Probability and Statistics; Second Extended Edition; Cambridge University Press: Cambridge, UK, 2014; 457p.
2. Rachev, S.T. Probability Metrics and the Stability of Stochastic Models; Wiley: New York, NY, USA, 1991.
3. Zeifman, A.; Korolev, V.; Sipin, A. (Eds.) Stability Problems for Stochastic Models: Theory and Applications; MDPI: Basel, Switzerland, 2020.
4. Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860.
5. Kelbert, M.; Suhov, Y. What scientific folklore knows about the distances between the most popular distributions. Izv. Sarat. Univ. (N.S.) Ser. Mat. Mekh. Inform. 2022, 22, 233–240.
6. Dwivedi, A.; Wang, S.; Tajer, A. Discriminant Analysis under f-Divergence Measures. Entropy 2022, 24, 188.
7. Devroye, L.; Mehrabian, A.; Reddad, T. The total variation distance between high-dimensional Gaussians. arXiv 2020, arXiv:1810.08693v5.
8. Vallander, S.S. Calculation of the Wasserstein distance between probability distributions on the line. Theory Probab. Appl. 1973, 18, 784–786.
9. Rachev, S.T. The Monge-Kantorovich mass transference problem and its stochastic applications. Theory Probab. Appl. 1985, 29, 647–676.
10. Gelbrich, M. On a formula for the L₂ Wasserstein metric between measures on Euclidean and Hilbert spaces. Math. Nachrichten 1990, 147, 185–203.
11. Givens, R.M.; Shortt, R.M. A class of Wasserstein metrics for probability distributions. Mich. Math J. 1984, 31, 231–240.
12. Olkin, I.; Pukelsheim, F. The distance between two random vectors with given dispersion matrices. Lin. Algebra Appl. 1982, 48, 257–263.
13. Dowson, D.C.; Landau, B.V. The Fréchet distance between multivariate Normal distributions. J. Multivar. Anal. 1982, 12, 450–456.
14. Cai, Y.; Lim, L.-H. Distances between probability distributions of different dimensions. IEEE Trans. Inf. Theory 2022, 68, 4020–4031.
15. Skovgaard, L.T. A Riemannian geometry of the multivariate normal model. Scand. J. Stat. 1984, 11, 211–223.
16. Stuhl, I.; Suhov, Y.; Yasaei Sekeh, S.; Kelbert, M. Basic inequalities for weighted entropies. Aequ. Math. 2016, 90, 817–848.
17. Stuhl, I.; Kelbert, M.; Suhov, Y.; Yasaei Sekeh, S. Weighted Gaussian entropy and determinant inequalities. Aequ. Math. 2022, 96, 85–114.
18. Kasianova, K.; Kelbert, M.; Mozgunov, P. Response-adaptive randomization for multi-arm clinical trials using context-dependent information measures. Comput. Stat. Data Anal. 2021, 158, 107187.

Figure 1. Exact TV distance and the upper bound for (a) TV(Bin

(20, \frac{1}{2})

, Bin

(20, \frac{1}{2} + a))

and (b) TV(Pois

(1)

, Pois

(1 + a)

). (a) Note that the upper bound becomes useless for

p_{2} - p_{1} \geq 0.07

; (b) blue and orange curves – exact TV distance: the blue curve works for

1 \leq \frac{λ_{2}}{λ_{1}} \leq 2

and the orange curve for

2 \leq \frac{λ_{2}}{λ_{1}} \leq 4

. Note that the linear upper bound (red curve) is not relevant and the square root upper (green curve) bound becomes useless for

\frac{λ_{2}}{λ_{1}} \geq 4

.

Table 1. The main metrics and divergencies.

Number	Name	Reference	Comment
1	Kullback–Leibler	(2)	Divergence but not a distance
2	Total variation (TV)	(1)	Bounded by Pinsker’s inequality
3	Kolmogorov–Smirnov	p. 2	Specific for 1D case
4	Hellinger	(16)	Bounded by Le Cam’s inequality
5	Lévy–Prohorov	(65)	Metrization of the weak convergence
6	Fréchet	(74)–(76)	Requires the joint distribution
7	Wasserstein	(69)	Marginal distributions only
8	$χ^{2}$	p. 5	Divergence but not a distance
9	Jensen–Shannon	(6)	Constructed from Kullback–Leibler
10	Geodesic	(8)	Specific for Gaussian case
11	Weighted TV	(97)	Context sensitive
12	Weighted Hellinger	(96)	Context sensitive

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
