Comparing the Zeta Distributions with the Pareto Distributions from the Viewpoint of Information Theory and Information Geometry: Discrete versus Continuous Exponential Families of Power Laws

We consider the zeta distributions, which are discrete power-law distributions that can be interpreted as the counterparts of the continuous Pareto distributions with unit scale. The family of zeta distributions forms a discrete exponential family with normalizing constants expressed using the Riemann zeta function. We present several information-theoretic measures between zeta distributions, study their underlying information geometry, and compare the results with their continuous counterparts, the Pareto distributions.


Introduction
Zeta distributions [1,2] are parametric discrete distributions with probability mass functions indexed by a scalar parameter s ∈ (1, ∞) and supported on the set of positive integers N:

p_s(x) = 1/(ζ(s) x^s), x ∈ N,

such that ∑_{x∈N} p_s(x) = 1. The normalizing function ζ(s) is the real Riemann zeta function [3][4][5]:

ζ(s) = ∑_{n=1}^∞ 1/n^s.

The set of zeta distributions Z = {p_s(x) : s ∈ (1, ∞)} forms a discrete exponential family [6,7] with natural parameter θ(s) = s lying in the natural parameter space Θ = (1, ∞), sufficient statistic t(x) = −log x, and cumulant function (or log-normalizer) F(θ) = log ζ(θ). Therefore, it follows from the theory of exponential families [7] that log ζ(θ) is a strictly convex and real analytic function (see Figure 1). Thus, the pmf of the zeta distributions can be rewritten in the canonical form of exponential families as:

p_s(x) = exp(θ(s) t(x) − F(θ(s))).

The characteristic function is thus φ_s(t) = ζ(s + it)/ζ(s). Thus, a zeta distribution p_s(x) can be interpreted as the discrete equivalent of a Pareto distribution q_s(x) of scale 1 and shape s − 1 with probability density function q_s(x) = (s − 1)/x^s for x > 1 (see Table 1). Table 1. Comparisons between the Zeta family and the Pareto subfamily. The function ζ(s) is the real zeta function.

Zeta Distribution                          Pareto Distribution
Discrete EF                                Continuous EF
support N = {1, 2, . . .}                  support X = (1, ∞)
pmf p_s(x) = 1/(ζ(s) x^s)                  pdf q_s(x) = (s − 1)/x^s
cumulant F(θ) = log ζ(θ)                   cumulant F(θ) = −log(θ − 1)

Both are univariate uni-order exponential families exp(θ t(x) − F(θ)) with sufficient statistic t(x) = −log x.

The zeta function is known to be irrational at many positive odd integers [8][9][10] and can be calculated using Bernoulli numbers [11] for positive even integers:

ζ(2n) = (−1)^{n+1} B_{2n} (2π)^{2n} / (2 (2n)!), n ∈ N.

The zeta function can be calculated fast [12] and precisely [13]. The derivatives of the zeta function have also been studied [12,14].
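As a quick numerical sanity check of the pmf and of the even-integer formula above, one can use truncated series (a minimal sketch; all function names are our own, and the truncation cutoffs are arbitrary):

```python
import math

def zeta(s, terms=200_000):
    """Truncated-series approximation of the real Riemann zeta function, s > 1."""
    return sum(n ** -s for n in range(1, terms + 1))

s = 4.0
Z = zeta(s)  # normalizing constant of the zeta pmf p_s(x) = 1/(zeta(s) x^s)

# The pmf sums to ~1 over the support N = {1, 2, ...} (truncated here).
total = sum(x ** -s / Z for x in range(1, 200_001))

# Even-integer values via Bernoulli numbers: zeta(2n) = (-1)^(n+1) B_{2n} (2 pi)^{2n} / (2 (2n)!),
# with B_4 = -1/30 this gives zeta(4) = pi^4/90.
zeta4 = -(-1 / 30) * (2 * math.pi) ** 4 / (2 * math.factorial(4))
print(total, Z, zeta4)  # total ~ 1, and Z ~ zeta4 ~ pi^4/90
```

The truncated sum agrees with the Bernoulli closed form up to the tail of the series, which decays like N^{−3} for s = 4.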
The zeta distributions are related to the Zipf distributions [15] p_{s,N}(x) ∝ 1/x^s for x ∈ {1, . . . , N} and the Zipf-Mandelbrot distributions [16,17] p_{s,q,N}(x) ∝ 1/(x + q)^s for x ∈ {1, . . . , N}, which play an important role in quantitative linguistics. See [6] for more details. The Zipf distributions and the Zipf-Mandelbrot distributions both have finite support and can be interpreted as truncated zeta distributions (right truncation for the Zipf distributions and both left and right truncations for the Zipf-Mandelbrot distributions) with normalizing constants that can be calculated approximately using properties of the zeta function [18]. Left-only truncations of the zeta distributions are called Hurwitz zeta distributions [19]. Similarly, truncated Pareto distributions are used in applications [20]. Notice that the truncated distributions of an exponential family with fixed truncation support form another exponential family [21]. The zeta distributions are infinitely divisible [19,22]: for every n, a random variable following a zeta distribution can be written as the sum of n independent and identically distributed random variables. In applications, it is important to quantitatively discriminate between zeta distributions (see, for example, [23,24] or [25]). Mixtures of zeta distributions have also been used to model social networks [26]. In general, products of exponential families yield other exponential families; the products of d zeta distributions form an exponential family called the zeta-star distributions [22].
In this paper, we first study various information-theoretic measures between zeta distributions by considering them as a discrete exponential family [7]: we consider the α-divergences [27] between zeta distributions in Section 2 and study their Kullback-Leibler-type limit divergences when α → 1 and α → 0 in Section 3. We then compare these results with the counterpart results obtained for the continuous exponential family of Pareto distributions in Section 4. Finally, we conclude this work in Section 5.

Amari's α-Divergences and Sharma-Mittal Divergences
To measure the dissimilarity between two zeta distributions p_{s_1} and p_{s_2}, one can use the α-divergences [27] defined for a real α ∈ (0, 1) as follows:

D_α[p_{s_1} : p_{s_2}] = (1 − I_α[p_{s_1} : p_{s_2}]) / (α(1 − α)),

where

I_α[p_{s_1} : p_{s_2}] = ∑_{x∈N} p_{s_1}(x)^α p_{s_2}(x)^{1−α}

is the α-Bhattacharyya coefficient (a similarity measure also called an affinity coefficient). It follows from [28] that the negative logarithm of the skewed Bhattacharyya coefficient amounts to a skewed Jensen divergence between the natural parameters of the exponential family E:

I_α[p_{s_1} : p_{s_2}] = exp(−J_{F,α}(s_1 : s_2)),

where J_{F,α} is the skewed Jensen divergence induced by a strictly convex and smooth function F(θ):

J_{F,α}(θ_1 : θ_2) = αF(θ_1) + (1 − α)F(θ_2) − F(αθ_1 + (1 − α)θ_2).

Since F(θ) = log ζ(θ), we have the α-divergences between two zeta distributions p_{s_1} and p_{s_2} available in closed form:

I_α[p_{s_1} : p_{s_2}] = ζ(αs_1 + (1 − α)s_2) / (ζ(s_1)^α ζ(s_2)^{1−α}).
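The closed form can be cross-checked against the defining sum over the two pmfs (a minimal sketch using truncated series; function names are illustrative):

```python
def zeta(s, terms=200_000):
    """Truncated-series approximation of the Riemann zeta function, s > 1."""
    return sum(n ** -s for n in range(1, terms + 1))

def alpha_bhattacharyya(s1, s2, alpha):
    """Closed form I_alpha = zeta(alpha*s1 + (1-alpha)*s2) / (zeta(s1)**alpha * zeta(s2)**(1-alpha))."""
    return zeta(alpha * s1 + (1 - alpha) * s2) / (zeta(s1) ** alpha * zeta(s2) ** (1 - alpha))

def alpha_divergence(s1, s2, alpha):
    """Amari alpha-divergence D_alpha = (1 - I_alpha) / (alpha * (1 - alpha))."""
    return (1 - alpha_bhattacharyya(s1, s2, alpha)) / (alpha * (1 - alpha))

# Cross-check the closed form against the defining sum over the two pmfs.
s1, s2, alpha = 4.0, 12.0, 0.5
Z1, Z2 = zeta(s1), zeta(s2)
direct = sum((x ** -s1 / Z1) ** alpha * (x ** -s2 / Z2) ** (1 - alpha)
             for x in range(1, 200_001))
closed = alpha_bhattacharyya(s1, s2, alpha)
print(closed, direct, alpha_divergence(s1, s2, alpha))
```

Both evaluations of I_α agree up to floating-point rounding, since they truncate the same series at the same cutoff.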
Corollary 1 (Approximation of the Kullback-Leibler divergence). Since D_KL = lim_{α→1} D_α, the Kullback-Leibler divergence between two zeta distributions p_{s_1} and p_{s_2} can be approximated for small values ε > 0 by

D_KL[p_{s_1} : p_{s_2}] ≈ (1 − I_{1−ε}[p_{s_1} : p_{s_2}]) / (ε(1 − ε)).

We can also calculate the KLD D_KL[p^{X_1}_{s_1} : p^{X_2}_{s_2}] between two truncated zeta distributions with nested supports X_1 ⊆ X_2 (see [21]). A truncated zeta distribution on the support {a, a + 1, . . . , b} ⊂ N (with b > a) has pmf

p_{a,b}(x) = x^{−s} / ∑_{i=a}^{b} i^{−s}.

The Chernoff information [29] is defined by C[p_1, p_2] = −log min_{α∈(0,1)} I_α[p_1, p_2]. The unique optimal value α* maximizing the Chernoff α-divergences C_α[p_1, p_2] = −log I_α[p_1, p_2] is called the Chernoff exponent [29] due to its role in bounding the probability of error in Bayesian hypothesis testing. When both pdfs or pmfs belong to the same exponential family, we have [29]

C[p_{θ_1}, p_{θ_2}] = B_F(θ_1 : (θ_1 θ_2)_{α*}),

where B_F denotes the Bregman divergence (corresponding to the KLD) and (θ_1 θ_2)_α = αθ_1 + (1 − α)θ_2. For a uni-order exponential family such as the zeta distributions, a closed-form formula for the optimal Chernoff exponent α* is reported in [29]: the optimal interpolated parameter satisfies F'((θ_1 θ_2)_{α*}) = (F(θ_1) − F(θ_2))/(θ_1 − θ_2), so that

α* = ((F')^{−1}((F(θ_1) − F(θ_2))/(θ_1 − θ_2)) − θ_2) / (θ_1 − θ_2).

The Sharma-Mittal divergences [30] between two densities p and q form a biparametric family of relative entropies defined by

D_{α,β}[p : q] = (1/(β − 1)) ((∑_x p(x)^α q(x)^{1−α})^{(1−β)/(1−α)} − 1).

The Sharma-Mittal divergence is induced from the Sharma-Mittal entropies, which unify the extensive Rényi entropies with the non-extensive Tsallis entropies [30]. The Sharma-Mittal divergences include the Rényi divergences (β → 1) and the Tsallis divergences (β → α), and in the limit case of α, β → 1, the Kullback-Leibler divergence [31]. When both densities p = p_{θ_1} and q = p_{θ_2} belong to the same exponential family, we have the following closed-form formula [31]:

D_{α,β}[p_{θ_1} : p_{θ_2}] = (1/(β − 1)) (exp(−((1 − β)/(1 − α)) J_{F,α}(θ_1 : θ_2)) − 1).

Thus, we get the following theorem: For α > 0, α ≠ 1, β ≠ 1, the Sharma-Mittal divergence between two zeta distributions p_{s_1} and p_{s_2} is

D_{α,β}[p_{s_1} : p_{s_2}] = (1/(β − 1)) ((ζ(αs_1 + (1 − α)s_2) / (ζ(s_1)^α ζ(s_2)^{1−α}))^{(1−β)/(1−α)} − 1).
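Because F is strictly convex, F' is strictly increasing, so the optimality condition for the Chernoff exponent can be solved by simple bisection. A sketch for the zeta family (truncated series; names and cutoffs are our own choices, not from the source):

```python
import math

def F(theta, terms=20_000):
    """Cumulant function of the zeta family: F(theta) = log zeta(theta)."""
    return math.log(sum(n ** -theta for n in range(1, terms + 1)))

def Fprime(theta, terms=20_000):
    """F'(theta) = zeta'(theta)/zeta(theta), via truncated sums."""
    z = sum(n ** -theta for n in range(1, terms + 1))
    zp = -sum(math.log(n) * n ** -theta for n in range(2, terms + 1))
    return zp / z

def chernoff_exponent(t1, t2):
    """Solve F'(theta*) = (F(t1) - F(t2)) / (t1 - t2) by bisection; return alpha*."""
    slope = (F(t1) - F(t2)) / (t1 - t2)
    lo, hi = min(t1, t2), max(t1, t2)
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if Fprime(mid) < slope:
            lo = mid
        else:
            hi = mid
    theta_star = 0.5 * (lo + hi)
    return (theta_star - t2) / (t1 - t2)

a = chernoff_exponent(4.0, 12.0)
# The Chernoff alpha-divergence C_alpha = J_{F,alpha}(t1 : t2) should peak at alpha*.
C = lambda al: al * F(4.0) + (1 - al) * F(12.0) - F(al * 4.0 + (1 - al) * 12.0)
print(a, C(a))
```

The returned α* lies in (0, 1), and the skewed Jensen divergence C_α attains its maximum there.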

The Kullback-Leibler Divergence between Two Zeta Distributions
It is well-known that the KLD between two probability mass functions of an exponential family amounts to a reverse Bregman divergence induced by the cumulant function [32]:

D_KL[p_{s_1} : p_{s_2}] = B_F(θ(s_2) : θ(s_1)),

where B_F(θ_1 : θ_2) = F(θ_1) − F(θ_2) − (θ_1 − θ_2) F'(θ_2). Furthermore, this Bregman divergence amounts to a Fenchel-Young divergence [33], so that we have

D_KL[p_{s_1} : p_{s_2}] = F(θ(s_2)) + F*(η(s_1)) − θ(s_2) η(s_1),

where F*(η) denotes the Legendre convex conjugate of F, θ(s) = s, and η(s) = F'(s) = ζ'(s)/ζ(s) is the moment parameter [7]. Moreover, the convex conjugate F*(η(s)) corresponds to the negentropy [34]: F*(η(s)) = −H[p_s], where the entropy of a zeta distribution p_s is defined by:

H[p_s] = − ∑_{x∈N} p_s(x) log p_s(x).

Using the fact that p_s(i) = 1/(i^s ζ(s)), we can express the entropy as follows:

H[p_s] = ∑_{i∈N} (1/(i^s ζ(s))) log(i^s ζ(s)) = log ζ(s) − s ζ'(s)/ζ(s).

The function ζ'(θ)/ζ(θ) has been tabulated in [35] (page 400). Notice that the maximum likelihood estimator [7] of s from n independently and identically distributed observations x_1, . . . , x_n satisfies η(ŝ) = −(1/n) ∑_{i=1}^n log x_i. The inverse of the zeta function ζ^{−1}(·) has been studied in [36].
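The entropy expression can be checked numerically against the defining sum (a minimal sketch assuming truncated-series evaluation of ζ and ζ'; names are illustrative):

```python
import math

def zeta_and_deriv(s, terms=100_000):
    """Truncated sums for zeta(s) and zeta'(s) = -sum_n log(n)/n**s."""
    z = sum(n ** -s for n in range(1, terms + 1))
    zp = -sum(math.log(n) * n ** -s for n in range(2, terms + 1))
    return z, zp

def zeta_entropy(s):
    """Closed form H[p_s] = log zeta(s) - s * zeta'(s)/zeta(s) (so that F*(eta) = -H)."""
    z, zp = zeta_and_deriv(s)
    return math.log(z) - s * zp / z

s = 4.0
z, _ = zeta_and_deriv(s)
# Defining sum H = -sum_x p_s(x) * log p_s(x), truncated at the same cutoff.
direct = -sum((x ** -s / z) * math.log(x ** -s / z) for x in range(1, 100_001))
print(zeta_entropy(s), direct)  # the two values agree
```

Expanding log p_s(x) = −s log x − log ζ(s) inside the direct sum reproduces the closed form term by term, which is why the two values coincide up to rounding.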
Proposition 1 (KLD between zeta distributions). The Kullback-Leibler divergence between two zeta distributions can be written as:

D_KL[p_{s_1} : p_{s_2}] = log ζ(s_2) − log ζ(s_1) − (s_2 − s_1) ζ'(s_1)/ζ(s_1).

Moreover, the logarithmic derivative of the zeta function can be expressed using the von Mangoldt function [37] (page 1850) for θ > 1:

ζ'(θ)/ζ(θ) = − ∑_{i=1}^∞ Λ(i)/i^θ,

where Λ(i) = log p if i = p^k for some prime p and integer k ≥ 1, and 0 otherwise. Notice that the zeta function can be calculated using the Euler product formula: ζ(θ) = ∏_{p prime} (1 − p^{−θ})^{−1}.

Theorem 3. The Kullback-Leibler divergence between two zeta distributions can be expressed using the real zeta function ζ and the von Mangoldt function Λ as:

D_KL[p_{s_1} : p_{s_2}] = log(ζ(s_2)/ζ(s_1)) + (s_2 − s_1) ∑_{i=1}^∞ Λ(i)/i^{s_1}.

For example, for s_1 = 4 and s_2 = 12, we find D_KL[p_{s_1} : p_{s_2}] = 0.430495790304827 . . .
It is well-known that the KLD between two arbitrarily close zeta distributions p_s and p_{s+ds} amounts to half of the quadratic distance induced by the Fisher information:

D_KL[p_s : p_{s+ds}] = (1/2) I(s) ds²,

where I(s) = −E_{p_s}[(log p_s(x))''] denotes the Fisher information, and the first-order and second-order derivatives are taken with respect to the parameter s. Thus, for uni-order exponential families, the Fisher information matrix is

I(s) = F''(s) = (log ζ(s))'' = (ζ''(s) ζ(s) − ζ'(s)²)/ζ(s)².

This second-order derivative (log ζ(s))'' has been studied in [38]. We have

(log ζ(s))'' = ∑_{n=1}^∞ Λ(n) log n / n^s,

where Λ(n) is the von Mangoldt function.
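The two expressions for (log ζ(s))'' can likewise be compared numerically (a sketch with truncated sums and a trial-division Λ; names are illustrative):

```python
import math

def fisher_information(s, terms=100_000):
    """I(s) = (log zeta(s))'' = (zeta''(s)*zeta(s) - zeta'(s)**2) / zeta(s)**2."""
    z = sum(n ** -s for n in range(1, terms + 1))
    zp = -sum(math.log(n) * n ** -s for n in range(2, terms + 1))
    zpp = sum(math.log(n) ** 2 * n ** -s for n in range(2, terms + 1))
    return (zpp * z - zp * zp) / (z * z)

def von_mangoldt(n):
    """Lambda(n) = log p if n = p**k for a prime p and integer k >= 1, else 0."""
    for p in range(2, math.isqrt(n) + 1):
        if n % p == 0:
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
    return math.log(n) if n > 1 else 0.0

s = 4.0
# Dirichlet-series form: (log zeta(s))'' = sum_n Lambda(n) * log(n) / n**s.
series = sum(von_mangoldt(n) * math.log(n) * n ** -s for n in range(2, 100_001))
print(fisher_information(s), series)  # the two expressions agree
```

Positivity of I(s) reflects the strict convexity of F(θ) = log ζ(θ).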

Comparison of the Zeta Family with a Pareto Subfamily
The zeta distribution is also called the "pure power-law distribution" in the literature [2]. We can compute the α-divergences between two Pareto distributions q_{s_1} and q_{s_2} with fixed scale 1 and respective shapes s_1 − 1 and s_2 − 1. The Pareto density writes q_s(x) = (s − 1)/x^s for x ∈ X = (1, ∞). The family of such Pareto distributions forms a continuous exponential family with natural parameter θ = s, sufficient statistic t(x) = −log(x), and convex cumulant function F(θ) = −log(θ − 1) for θ ∈ Θ = (1, ∞). Thus we have [28]:

I_α[q_{s_1} : q_{s_2}] = exp(−J_{F,α}(s_1 : s_2)) = (s_1 − 1)^α (s_2 − 1)^{1−α} / (αs_1 + (1 − α)s_2 − 1),

and we obtain the following closed form for the α-divergences between two Pareto distributions q_{s_1} and q_{s_2}:

D_α[q_{s_1} : q_{s_2}] = (1/(α(1 − α))) (1 − (s_1 − 1)^α (s_2 − 1)^{1−α} / (αs_1 + (1 − α)s_2 − 1)).

The moment parameter is η(s) = F'(s) = −1/(s − 1). The differential entropy of the Pareto distribution q_s is

h[q_s] = s/(s − 1) − log(s − 1).

For the KLD, we find

D_KL[q_{s_1} : q_{s_2}] = log((s_1 − 1)/(s_2 − 1)) + (s_2 − s_1)/(s_1 − 1).

Example 5. For comparison, we calculate the KLD between two Pareto distributions with parameters s_1 = 4 and s_2 = 12. We find D_KL[q_{s_1} : q_{s_2}] = log(3/11) + 8/3 = 1.367383682536406 . . . Table 1 compares the discrete exponential family of zeta distributions with the continuous exponential family of Pareto distributions with fixed scale 1.
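The Pareto closed forms are simple enough to verify directly; the sketch below also checks that the α-divergence tends to the KLD as α → 1 (function names are our own):

```python
import math

def pareto_bhattacharyya(s1, s2, alpha):
    """I_alpha[q_s1 : q_s2] from the cumulant function F(theta) = -log(theta - 1)."""
    return (s1 - 1) ** alpha * (s2 - 1) ** (1 - alpha) / (alpha * s1 + (1 - alpha) * s2 - 1)

def pareto_alpha_divergence(s1, s2, alpha):
    """D_alpha = (1 - I_alpha) / (alpha * (1 - alpha))."""
    return (1 - pareto_bhattacharyya(s1, s2, alpha)) / (alpha * (1 - alpha))

def pareto_kld(s1, s2):
    """D_KL[q_s1 : q_s2] = log((s1-1)/(s2-1)) + (s2-s1)/(s1-1)."""
    return math.log((s1 - 1) / (s2 - 1)) + (s2 - s1) / (s1 - 1)

def pareto_entropy(s):
    """Differential entropy h[q_s] = s/(s-1) - log(s-1)."""
    return s / (s - 1) - math.log(s - 1)

# Example 5: s1 = 4, s2 = 12 gives log(3/11) + 8/3.
print(pareto_kld(4.0, 12.0))  # ~1.367383682536406
# As alpha -> 1, the alpha-divergence tends to the KLD:
print(pareto_alpha_divergence(4.0, 12.0, 0.9999))
print(pareto_entropy(4.0))
```

Unlike the zeta case, no series evaluation is needed here: every quantity is an elementary expression in s_1 and s_2.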

Conclusions
In general, it is interesting to consider discrete counterparts of continuous exponential families. For example, the discrete Gaussian distributions (or discrete normal distributions) defined as maximum entropy distributions have been studied in [39,40]. The log-normalizer or cumulant function of the discrete Gaussian distributions is related to the Riemann theta function [41]. Given a prescribed sufficient statistic t(x), we may define the continuous exponential family with respect to the Lebesgue measure µ as the set of probability density functions p(x) maximizing the differential entropy under the moment constraint E_p[t(x)] = η. The corresponding discrete exponential family is obtained by considering the probability mass functions maximizing the Shannon entropy under the same moment constraint E_p[t(x)] = η.