The Cauchy Distribution in Information Theory

The Gaussian law reigns supreme in the information theory of analog random variables. This paper showcases a number of information theoretic results which find elegant counterparts for Cauchy distributions. New concepts such as that of equivalent pairs of probability measures and the strength of real-valued random variables are introduced here and shown to be of particular relevance to Cauchy distributions.


Introduction
Since the inception of information theory [1], the Gaussian distribution has emerged as the paramount example of a continuous random variable leading to closed-form expressions for information measures and extremality properties possessing great pedagogical value. In addition, the role of the Gaussian distribution as a ubiquitous model for analog information sources and for additive thermal noise has elevated the corresponding formulas for rate-distortion functions and capacity-cost functions to iconic status in information theory. Beyond discrete random variables, by and large, information theory textbooks confine their coverage and examples to Gaussian random variables.
The exponential distribution has also been shown [2] to lead to closed-form formulas for various information measures such as differential entropy, mutual information and relative entropy, as well as rate-distortion functions for Markov processes and the capacity of continuous-time timing channels with memory such as the exponential-server queue [3].
Despite its lack of moments, the Cauchy distribution also leads to pedagogically attractive closed-form expressions for various information measures. In addition to showcasing those, we introduce an attribute, which we refer to as the strength of a real-valued random variable, under which the Cauchy distribution is shown to possess optimality properties. Along with the stability of the Cauchy law, those properties result in various counterparts to the celebrated fundamental limits for memoryless Gaussian sources and channels.
To enhance readability and ease of reference, the rest of this work is organized in 120 items grouped into 17 sections, plus an appendix. Section 2 presents the family of Cauchy random variables and their basic properties as well as multivariate generalizations, and the Rider univariate density which includes the Cauchy density as a special case and finds various information theoretic applications.
Section 3 gives closed-form expressions for the differential entropies of the univariate and multivariate densities covered in Section 2.
Introduced previously for unrelated purposes, the Shannon and η-transforms reviewed in Section 4 prove useful to derive several information theoretic results for Cauchy and related laws.
Applicable to any real-valued random variable and inspired by information theory, the central notion of strength is introduced in Section 5 along with its major properties. In particular, it is shown that convergence in strength is an intermediate criterion between convergence in probability and convergence in L_q, q > 0, and that differential entropy is continuous with respect to the addition of independent vanishing-strength noise. Section 6 shows that, for any ρ > 0, the maximal differential entropy density satisfying E[log(1 + |Z|^ρ)] ≤ θ can be obtained in closed form, but its shape (not just its scale) depends on the value of θ. In particular, the Cauchy density is the solution only if ρ = 2 and θ = log 4. In contrast, we show that, among all the random variables with a given strength, the centered Cauchy density has maximal differential entropy, regardless of the value of the constraint. This result suggests the definition of the entropy strength of Z as the strength of a Cauchy random variable whose differential entropy is the same as that of Z. Modulo a constant factor, entropy power is the square of entropy strength. Section 6 also gives a maximal differential entropy characterization of the standard spherical Cauchy multivariate density.
Information theoretic terminology for the logarithm of the Radon-Nikodym derivative, as well as its distribution, the relative information spectrum, is given in Section 7. The relative information spectrum for Cauchy distributions is found and shown to depend on their location and scale through a single scalar. This is a rare property, not satisfied by most common families such as Gaussian, exponential, Laplace, etc. Section 8 introduces the notion of equivalent pairs of probability measures, which plays an important role not only in information theory but in statistical inference. Distinguishing P₁ from Q₁ has the same fundamental limits as distinguishing P₂ from Q₂ if (P₁, Q₁) and (P₂, Q₂) are equivalent pairs. Section 9 studies the interplay between f-divergences and equivalent pairs. A simple formula for the f-divergence between Cauchy distributions results from the explicit expression for the relative information spectrum found in Section 7. These results are then used to easily derive a host of explicit expressions for χ²-divergence, relative entropy, total variation distance, Hellinger divergence and Rényi divergence in Sections 10-14, respectively.
In addition to the Fisher information matrix of the Cauchy family, Section 15 finds a counterpart of de Bruijn's identity [4] for convolutions with scaled Cauchy random variables, instead of convolutions with scaled Gaussian random variables as in the conventional setting.
Section 16 is devoted to mutual information. The mutual information between a Cauchy random variable and its noisy version contaminated by additive independent Cauchy noise exhibits a pleasing counterpart (modulo a factor of two) with the Gaussian case, in which the signal-to-noise ratio is now given by the ratio of strengths rather than variances. With Cauchy noise, Cauchy inputs maximize mutual information under an output strength constraint. The elementary fact that an output variance constraint translates directly into an input variance constraint does not carry over to input and output strengths, and indeed we identify non-Cauchy inputs that may achieve higher mutual information than a Cauchy input with the same strength. Section 16 also considers the dual setting in which the input is Cauchy, but the additive noise need not be. Lower bounds on the mutual information, attained by Cauchy noise, are offered. However, as the bounds do not depend exclusively on the noise strength, they do not rule out the possibility that a non-Cauchy noise with identical strength may be least favorable. If distortion is measured by strength, the rate-distortion function of a Cauchy memoryless source is shown to admit (modulo a factor of two) the same rate-distortion function as the memoryless Gaussian source with mean-square distortion, replacing the source variance by its strength. Theorem 17 gives a very general continuity result for mutual information that encompasses previous such results. While convergence in probability to zero of the input to an additive-noise transformation does not imply vanishing input-output mutual information, convergence in strength does under very general conditions on the noise distribution.
Some concluding observations about generalizations and open problems are collected in Section 17, including a generalization of the notion of strength.
The definite integrals used in the main body are collected and justified in Appendix A.

The Cauchy Distribution and Generalizations
In probability theory, the Cauchy (also known as Lorentz and as Breit-Wigner) distribution is the prime example of a real-valued random variable none of whose moments of order one or higher exists, and as such it is not encompassed by either the law of large numbers or the central limit theorem.

1.
A real-valued random variable V is said to be standard Cauchy if its probability density function is

f_V(x) = 1 / (π (1 + x²)), x ∈ R. (1)

Furthermore, X is said to be Cauchy if there exist λ ≠ 0 and µ ∈ R such that X = λV + µ, in which case

f_X(x) = |λ| / (π (λ² + (x − µ)²)), x ∈ R, (2)

where µ and |λ| are referred to as the location (or median) and scale, respectively, of the Cauchy distribution. If µ = 0, (2) is said to be centered Cauchy.
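As a quick numerical sanity check (an illustrative sketch outside the paper's development, assuming scipy is available), the density in (2) integrates to one and places half of its mass below the location µ, confirming that µ is the median:

```python
import math
from scipy.integrate import quad

def cauchy_pdf(x, mu=0.0, lam=1.0):
    # density (2): |lam| / (pi * (lam^2 + (x - mu)^2))
    return abs(lam) / (math.pi * (lam * lam + (x - mu) ** 2))

mu, lam = 1.5, 2.0
total, _ = quad(cauchy_pdf, -math.inf, math.inf, args=(mu, lam))  # total mass
below, _ = quad(cauchy_pdf, -math.inf, mu, args=(mu, lam))        # mass below mu
```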

2.
Since E[max{0, V}] = E[max{0, −V}] = ∞, the mean of a Cauchy random variable does not exist. Furthermore, E[|V|^q] = ∞ for q ≥ 1, and the moment generating function of V does not exist (except, trivially, at 0). The characteristic function of the standard Cauchy random variable is

E[e^{itV}] = e^{−|t|}, t ∈ R. (3)

3.
Using (3), we can verify that a Cauchy random variable has the curious property that adding an independent copy to it has the same effect, statistically speaking, as adding an identical copy. In addition to the Gaussian and Lévy distributions, the Cauchy distribution is stable: a linear combination of independent copies remains in the family; and it is infinitely divisible: it can be expressed as an n-fold convolution for any n. It follows from (3) that if {V₁, V₂, . . .} are independent, standard Cauchy, and a is a deterministic sequence with finite ℓ₁-norm ‖a‖₁, then ∑_{i=1}^∞ a_i V_i has the same distribution as ‖a‖₁ V. In particular, the time average of independent identically distributed Cauchy random variables has the same distribution as any of the random variables. The families {λV, λ ∈ I} and {V + µ, µ ∈ I}, with I any interval of the real line, are some of the simplest parametrized families of random variables that do not form an exponential family.
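The stability property can be illustrated empirically (a simulation sketch, not part of the paper): the time average of i.i.d. standard Cauchy samples has, like a single standard Cauchy variable, median zero and unit semi-interquartile length.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_samples = 20, 100_000
# time average of 20 i.i.d. standard Cauchy random variables
avg = rng.standard_cauchy((n_samples, n_vars)).mean(axis=1)
q25, q50, q75 = np.percentile(avg, [25, 50, 75])
semi_iq = (q75 - q25) / 2  # ~1 for a standard Cauchy
```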

4.
If Θ is uniformly distributed on [−π/2, π/2], then tan Θ is standard Cauchy. This follows since, in view of (1) and (A1), the standard Cauchy cumulative distribution function is

F_V(x) = 1/2 + (1/π) arctan(x), x ∈ R. (4)

Therefore, V has unit semi-interquartile length. The functional inverse of (4) is the standard Cauchy quantile function given by

Q_V(t) = tan(π (t − 1/2)), t ∈ (0, 1). (5)
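Item 4 lends itself to a quick numerical check (an illustrative sketch, assuming numpy is available): the tangent of a uniform angle reproduces the standard Cauchy quartiles, and the quantile function Q(t) = tan(π(t − 1/2)) inverts the cdf F(x) = 1/2 + arctan(x)/π.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.uniform(-np.pi / 2, np.pi / 2, 100_000)
v = np.tan(theta)                      # standard Cauchy samples (Item 4)
q25, q75 = np.percentile(v, [25, 75])  # should be close to -1 and +1

t = np.linspace(0.01, 0.99, 99)
Q = np.tan(np.pi * (t - 0.5))          # quantile function
F_of_Q = 0.5 + np.arctan(Q) / np.pi    # cdf applied to quantile: recovers t
```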

5.
If X₁ and X₂ are standard Gaussian with correlation coefficient ρ ∈ (−1, 1), then X₁/X₂ is Cauchy with location ρ and scale √(1 − ρ²). This implies that the reciprocal of a standard Cauchy random variable is also standard Cauchy.
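A simulation sketch (not in the original text) corroborating Item 5, with the scale read as √(1 − ρ²): the empirical median and semi-interquartile range of the ratio of correlated standard Gaussians recover the Cauchy location ρ and scale √(1 − ρ²).

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n = 0.6, 400_000
x2 = rng.standard_normal(n)
# construct X1 standard Gaussian with correlation rho to X2
x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
ratio = x1 / x2
q25, q50, q75 = np.percentile(ratio, [25, 50, 75])
loc_hat = q50                # estimates rho = 0.6
scale_hat = (q75 - q25) / 2  # estimates sqrt(1 - rho^2) = 0.8
```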
6.
Taking the cue from the Gaussian case, we say that a random vector is multivariate Cauchy if any linear combination of its components has a Cauchy distribution. Necessary and sufficient conditions for a characteristic function to be that of a multivariate Cauchy were given by Ferguson [5]. Unfortunately, no general expression is known for the corresponding probability density function. This accounts for the fact that one aspect in which the Cauchy distribution does not quite reach the wealth of information theoretic results attainable with the Gaussian distribution is the study of multivariate models of dependent random variables. Nevertheless, special cases of the multivariate Cauchy distribution do admit some interesting information theoretic results, as we will see below. The standard spherical multivariate Cauchy probability density function on R^n is (e.g., [6])

f_{V^n}(x^n) = Γ((n+1)/2) / ( π^{(n+1)/2} (1 + ‖x^n‖²)^{(n+1)/2} ), x^n ∈ R^n, (6)

where Γ(·) is the Gamma function. Therefore, V^n = (V₁, . . . , V_n) are exchangeable random variables. If X₀, X₁, . . . , X_n are independent standard normal, then the vector X₀⁻¹ X^n has the density in (6). With the aid of (A10), we can verify that any subset of k ∈ {1, . . . , n − 1} components of V^n is distributed according to V^k. In particular, the marginals of (6) are given by (1). Generalizing (3), the characteristic function of (6) is

E[exp(i t^n · V^n)] = exp(−‖t^n‖), t^n ∈ R^n. (7)

7.
In parallel to Item 1, we may generalize (6) by dropping the restriction that it be centered at the origin and allowing ellipsoidal deformation, i.e., letting Z^n = Λ^{1/2} V^n + µ with µ ∈ R^n and a positive definite n × n matrix Λ. Therefore,

f_{Z^n}(x^n) = Γ((n+1)/2) / ( π^{(n+1)/2} det^{1/2}(Λ) (1 + (x^n − µ)ᵀ Λ⁻¹ (x^n − µ))^{(n+1)/2} ). (8)

While ρᵀZ^n is a Cauchy random variable for ρ ∈ R^n − {0}, (8) fails to encompass every multivariate Cauchy distribution—in particular, the important case of independent Cauchy random variables. Another reason the usefulness of the model in (8) is limited is that it is not closed under independent additions: if V^n and Ṽ^n are independent, each distributed according to (6), then Λ^{1/2} V^n + Λ̃^{1/2} Ṽ^n, while multivariate Cauchy, does not have a density of the type in (8) unless Λ̃ = αΛ for some α > 0.
8.
Another generalization of the (univariate) Cauchy distribution, which comes into play in our analysis, was introduced by Rider in 1958 [7]. With ρ > 0 and βρ > 1,

f_{V_{β,ρ}}(x) = κ_{β,ρ} (1 + |x|^ρ)^{−β}, x ∈ R, (9)

where the normalizing constant is

κ_{β,ρ} = ρ Γ(β) / ( 2 Γ(1/ρ) Γ(β − 1/ρ) ). (10)

In addition to the (β, ρ) parametrization in (9), we may introduce scale and location parameters by means of λV_{β,ρ} + µ, just as we did in the Cauchy case (β, ρ) = (1, 2). Another notable special case is √ν V_{(ν+1)/2, 2}, which is the centered Student-t random variable with ν degrees of freedom, itself equivalent to a Pearson type VII distribution.

9.
The differential entropy of a Cauchy random variable is, using (A3),

h(λV + µ) = log(4π|λ|). (12)

Throughout this paper, unless the logarithm base is explicitly shown, it can be chosen by the reader as long as it is the same on both sides of the equation. For natural logarithms, the information measure unit is the nat.

10. An alternative, sometimes advantageous, expression for the differential entropy of a real-valued random variable is feasible if its cumulative distribution function F_X is continuous and strictly monotonic. Then, the quantile function Q_X is its functional inverse, i.e., F_X(Q_X(t)) = t for all t ∈ (0, 1), which implies that Q̇_X(t) f_X(Q_X(t)) = 1 for all t ∈ (0, 1). Moreover, since X and Q_X(U), with U uniformly distributed on [0, 1], have identical distributions, we obtain

h(X) = ∫₀¹ log Q̇_X(t) dt. (13)

Since (4) is indeed continuous and strictly monotonic, we can verify that we recover (12) by means of (5), (13) and (A2).

11. Despite not having finite moments, an independent identically distributed sequence of Cauchy random variables {Z_i} is information stable in the sense that

−(1/n) log f_{Z^n}(Z^n) = (1/n) ∑_{i=1}^n −log f_{Z₁}(Z_i) → h(Z₁), a.s., (14)

because of the strong law of large numbers.

12. With V^n distributed according to the standard spherical multivariate Cauchy density in (6), it is shown in [8] that

E[log_e(1 + ‖V^n‖²)] = ψ((n+1)/2) + γ + log_e 4, (15)

where γ is the Euler-Mascheroni constant and ψ(·) is the digamma function. Therefore, the differential entropy of (6) is, in nats, (see also [9])

h(V^n) = ((n+1)/2) E[log_e(1 + ‖V^n‖²)] + ((n+1)/2) log_e π − log_e Γ((n+1)/2) (16)
       = ((n+1)/2) ( ψ((n+1)/2) + γ + log_e 4 + log_e π ) − log_e Γ((n+1)/2), (17)

whose growth is essentially linear with n: the conditional differential entropy h(V_n | V^{n−1}) = h(V^n) − h(V^{n−1}) converges to a constant as n → ∞.

13. By the scaling law of differential entropy and its invariance to location, we obtain h(λZ + µ) = h(Z) + log |λ| and, in the multivariate case, h(Λ^{1/2} V^n + µ) = h(V^n) + (1/2) log det(Λ).

14. Invoking (A6), we obtain a closed-form formula for the differential entropy, in nats, of the generalized Cauchy distribution (9) as

h(V_{β,ρ}) = log_e (1/κ_{β,ρ}) + β ( ψ(β) − ψ(β − 1/ρ) ), (22)

with κ_{β,ρ} defined in (10).
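The closed form h(V) = log 4π and the quantile-based route (13) can be cross-checked numerically (an illustrative scipy sketch, not part of the paper's derivations):

```python
import math
from scipy.integrate import quad

f = lambda x: 1.0 / (math.pi * (1.0 + x * x))  # standard Cauchy density (1)
# direct route: h = -E[log f(V)]
h_direct, _ = quad(lambda x: -f(x) * math.log(f(x)), -math.inf, math.inf)

# quantile route: h = integral over (0,1) of log Qdot(t), Q(t) = tan(pi(t - 1/2))
Qdot = lambda t: math.pi / math.cos(math.pi * (t - 0.5)) ** 2
h_quantile, _ = quad(lambda t: math.log(Qdot(t)), 0.0, 1.0)

h_closed = math.log(4 * math.pi)  # log(4*pi) nats
```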

The Shannon and η-Transforms
In this section, we recall the definitions of two notions introduced in [10] for the unrelated purpose of expressing the asymptotic singular value distribution of large random matrices.

Strength
The purpose of this section is to introduce an attribute which is particularly useful to compare random variables that do not have finite moments.

21. The strength ς(Z) ∈ [0, +∞] of a real-valued random variable Z is defined as

ς(Z) = min{ ς ≥ 0 : E[log(1 + Z²/ς²)] ≤ log 4 }. (34)

It follows that the only random variable with zero strength is Z = 0, almost surely. If the inequality in (34) is not satisfied for any ς > 0, then ς(Z) = ∞; otherwise, ς(Z) is the solution to

E[log(1 + Z²/ς²(Z))] = log 4. (35)

If ς(Z) ≤ ς, then (35) holds with ≤.

22. The set of probability measures whose strength is upper bounded by a given finite nonnegative constant,

A_ς = { P_Z : ς(Z) ≤ ς }, (36)

is convex: The set A₀ is a singleton as seen in Item 21, while, for 0 < ς < ∞, we can express (36) as

A_ς = { P_Z : E[log(1 + Z²/ς²)] ≤ log 4 }. (37)

Therefore, if P_{Z₀} ∈ A_ς and P_{Z₁} ∈ A_ς, we must have αP_{Z₁} + (1 − α)P_{Z₀} ∈ A_ς.

23. The peculiar constant in the definition of strength is chosen so that if V is standard Cauchy, then its strength is ς(V) = 1 because, in view of (29),

E[log(1 + V²)] = log 4. (38)

24. If Z = k ∈ R, a.s., then its strength is

ς(k) = |k|/√3, (39)

since log(1 + k²/ς²) = log 4 requires k²/ς² = 3.

25. The left side of (35) is the Shannon transform of Z² evaluated at ς⁻², which is continuous in ς². If ς(Z) ∈ (0, ∞), then (35) can be written as

ς²(Z) = 1 / 𝒱⁻¹_{Z²}(log 4), (40)

where, on the right side, 𝒱⁻¹_{Z²} denotes the functional inverse of the Shannon transform of Z². Clearly, the square root of the right side of (40) cannot be expressed as the expectation with respect to Z of any b : R → R that does not depend on P_Z. Nevertheless, thanks to (37), (36) can be expressed as a single convex expectation constraint on P_Z.

26.

Theorem 1.
The strength of a real-valued random variable satisfies the following properties: with equality if and only if |Z| is deterministic.
if it exists; otherwise, ς(Z) = ∞. In particular, for ς(Z) ∈ (0, ∞),

D(Z‖ς(Z)V) = log(4π ς(Z)) − h(Z), (47)

where V is standard Cauchy, and D(X‖Y) stands for the relative entropy with reference probability measure P_Y and dominated measure P_X.
(i) The finiteness of strength is sufficient for the finiteness of the entropy of the integer part of the random variable. (j) If Z_n → Z in L_q for any q ∈ (0, 1], then ς(Z_n) → ς(Z).

Proof.
For the first three properties, it is clear that they are satisfied if ς(Z) = 0, i.e., Z = 0 almost surely. (a) If ς² ∈ (0, ∞) is the solution to (35), then λ²ς² is a solution to (35) with λZ taking the role of Z. If (35) has no solution, neither does its version in which λZ takes the role of Z.
(b) Jensen's inequality applied to the left side of (35) results in E[log(1 + Z²/ς²)] ≤ log(1 + E[Z²]/ς²). The strict concavity of log(1 + t) implies that equality holds if and only if Z² is deterministic. If (35) has no solution, the same reasoning implies that E[Z²] = ∞. (c) First, it is easy to check that, for q ∈ (0, 2), the function f_q : (0, ∞) → (0, ∞) given by f_q(t) = t^{−q} log₄(1 + t²) attains its maximum κ_q at a unique point. Assume ς(Z) ∈ (0, ∞). Since κ_q t^q ≥ log₄(1 + t²) for all t > 0, letting t = |Z|/ς(Z) and taking expectations, (35) (choosing 4 as the logarithm base) results in ς^q(Z) ≤ κ_q E[|Z|^q], which is the same as (44).
Substituting x by X and averaging over X, the result follows from the definition of strength.
(h) It is sufficient to assume λ = 1 for the condition on the right of (49) because the condition on the left holds if and only if it holds for αZ, for any α > 0, and

D(Z‖V) = −h(Z) + E[log(1 + Z²)] + log π, (59)

which is finite unless either h(Z) = −∞ or E[log(1 + Z²)] = ∞. This establishes =⇒ in view of (48). To establish ⇐=, it is enough to show (60), in view of (48) and the fact that, according to (59), h(Z) > −∞ if both D(Z‖V) and E[log(1 + Z²)] are finite. To show (60), we invoke the following variational representation of relative entropy (first noted by Kullback [12] for absolutely continuous random variables): if P_Z ≪ P_V, then the variational bound is attained only at Q = P_Z. Let Q be an absolutely continuous random variable with a suitably chosen probability density function. Then, (65) holds since (4/5) log(1 + x²) ≥ −log(π log_e 2) + 2 log log_e |x| + log |x| for |x| > 2, (67) while (68)–(70) follow from (48), log(1 + x²) ≤ 2 log(1 + |x|), and p. 3743 in [13], respectively. (j) If ς(Z) = 0, then Z = 0 a.s., and the result follows from (44). For all (x, z) ∈ R², (71) follows by maximizing the left side over z ∈ R. Denote the difference between the right side and the left side of (72) by f_q(x), an even function which satisfies f_q(0) = 0; therefore, (72) follows. Assuming 0 < ς(Z) < ∞ and, because of the scaling property in (42), without loss of generality ς(Z) = 1, (74) and (75) result in a bound which requires that ς(Z_n) → 1, since, by assumption, the right side vanishes. Assume now that ς(Z) = ∞ and, therefore, E[log(1 + Z²)] = ∞. Inequality (75) remains valid in this case, implying that, as soon as the right side is finite (which it must be for all sufficiently large n), E[log(1 + Z_n²)] = ∞, and therefore, ς(Z_n) = ∞ in view of (48). (k) 1st ⇐= For any ε > 0, Markov's inequality yields the desired implication. =⇒ First, we show that (78) holds for any α > 0. The case 0 < α < 1 is trivial.
The case α > 1 follows from a two-sided bound on E[log(1 + α²Z²)], in which ≥ is obvious and ≤ holds by elementary algebra. If ς(Z_n) = ∞ infinitely often, so is E[log(1 + Z_n²)] in view of (48). Assume that lim sup ς(Z_n) = ς ∈ (0, ∞], and that ς(Z_n) is finite for all sufficiently large n. Then, there is a subsequence such that ς(Z_{n_i}) → ς and the corresponding bound holds for all sufficiently large i and λ < ς. Consequently, (78) implies the desired conclusion. 2nd ⇐= Suppose that E[log(1 + Z_n²)] → 0. Then, there is a subsequence along which the expectations converge to some η. If η ≥ log 4, then ς(Z_{n_i}) > 1 along the subsequence. Because of the continuity of the Shannon transform and the fact that it grows without bound as its argument goes to infinity (Item 25), if 0 < η < log 4, we can find 1 < α < ∞ such that the defining inequality is met. We start by showing (83), where we have denoted the right side of (71), with arbitrary logarithm base, by a function whose lower and upper bounds are attained uniquely at x = 0 and |x| = 1/√2, respectively. The lower bound results in ⇐= in (83). To show =⇒, decompose, for arbitrary ε > 0, the expectation according to the events {|X_n| ≤ ε} and {|X_n| > ε}, where (87) holds from the upper bound in (84) and the fact that (89) is decreasing in ε, and (88) holds for all sufficiently large n if E[log(1 + X_n²)] → 0. Since the right side of (88) goes to 0 as ε → 0, (83) is established.
Therefore, we may restrict attention to ς(Z) = 1 without loss of generality. Following (71) and (74), and abbreviating Z_n = Z + X_n, we obtain the desired bound. Thus, the desired result follows in view of (50) and (83). To handle the case ς(Z) = ∞, we use the same reasoning as in the proof of (j), since (83) remains valid in that case.
Under the assumptions, Part (l) guarantees that ς(Z + X_n) → ς(Z). Analogously, according to Part (k), Z + X_n → Z in distribution since X_n → 0 in probability. Since the strength of X_n + Z is finite for all sufficiently large n, we may invoke (47) to express, for those n,

h(X_n + Z) = log(4π ς(X_n + Z)) − D(X_n + Z ‖ ς(X_n + Z)V). (94)

The lower semicontinuity of relative entropy under weak convergence (which, in turn, is a corollary to the Donsker-Varadhan [14,15] variational representation of relative entropy) results in

lim inf_n D(X_n + Z ‖ ς(X_n + Z)V) ≥ D(Z ‖ ς(Z)V). (95)

Therefore, (92) follows from (94) and (95). 27. In view of (42) and Item 23, ς(λV) = |λ| if V is standard Cauchy. Furthermore, if X₁ and X₂ are centered independent Cauchy random variables, then their sum is centered Cauchy with

ς(X₁ + X₂) = ς(X₁) + ς(X₂). (96)

More generally, it follows from Theorem 1-(d) that, if X₁ is centered Cauchy and (96) holds for X₂ = αX and all α ∈ R, then X must be centered Cauchy. Invoking (52), we obtain an expression which is also valid for λ = 0, as we saw in Item 24. 28. If X is standard Gaussian, then ς²(X) = 0.171085 . . . , and ς²(σX) = σ²ς²(X). Therefore, if X₁ and X₂ are zero-mean independent Gaussian random variables, then

ς²(X₁ + X₂) = ς²(X₁) + ς²(X₂). (98)

Thus, in this case, ς(X₁ + X₂) < ς(X₁) + ς(X₂). 29. It follows from Theorem 1-(d) that, with X independent of the standard Cauchy V, ς(X + V) may exceed ς(X) + ς(V). An example is the heavy-tailed probability density function for which 7.0158 . . . = ς(X + V) > ς(X) + ς(V) = 6.8457 . . .. 30. Using (A8), we can verify that, if X is zero-mean uniform with variance σ², then ς(X) = √3 σ/c, where c is the solution to log_e(1 + c²) + (2/c) arctan(c) = 2 + log_e 4. 31. We say that Z_n → 0 in strength if ς(Z_n) → 0. Parts (j) and (k) of Theorem 1 show that this convergence criterion is intermediate between the traditional in-probability and L_q criteria. It is not equivalent to either one: if Z_n = e^{√n} with probability 1/n, and Z_n = 0 otherwise, then Z_n → 0 in strength, but not in L_q for any q > 0. 32. The assumption in Theorem 1-(m) that X_n → 0 in strength cannot be weakened to convergence in probability.
Suppose that X_n is absolutely continuous with the probability density function in (103). We have X_n → 0 in probability since, regardless of how small ε > 0, P[X_n > ε] = 1/n for all n ≥ 1/ε. Furthermore, h(X_n) = ∞, because (103) is the mixture of a uniform and an infinite differential entropy probability density function, and differential entropy is concave. We conclude that the conclusion of Theorem 1-(m) fails even though X_n → 0 in probability. 33. The following result on the continuity of differential entropy is shown in [16]. This result is weaker than Theorem 1-(m) because finite first absolute moment implies finite strength, as we saw in (44), and if X_n → 0 in L₁, then it vanishes in strength too. 34. If Z and V are centered and standard Cauchy, respectively, then min_λ D(Z‖λV) is achieved by λ = ς(Z). Otherwise, in general, this does not hold.
Otherwise, the minimum is attained at the solution of an equation in which we have used the η-transform in (31). 35. Using (28) and the concavity of log(1 + x), we can verify that (107) holds if X₀ and X₁ are centered Cauchy. Not only is this property not satisfied if X₁ is not Cauchy, but (107) need not hold in that case, as we can verify numerically for α = 0.1, λ₁ = 1 and λ₀ > 20.
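The defining equation (35) is amenable to direct numerical solution. The sketch below (illustrative only, assuming scipy is available) recovers ς(V) = 1 for the standard Cauchy (Item 23), the value ς²(X) ≈ 0.171085 quoted in Item 28 for the standard Gaussian, and ς(k) = |k|/√3 for a constant (Item 24):

```python
import math
from scipy.integrate import quad
from scipy.optimize import brentq

LOG4 = math.log(4.0)

def strength(pdf):
    """Solve E[log(1 + Z^2/s^2)] = log 4 for s, given the density of Z."""
    def gap(s):
        val, _ = quad(lambda z: pdf(z) * math.log1p((z / s) ** 2),
                      -math.inf, math.inf)
        return val - LOG4
    return brentq(gap, 1e-3, 1e3)  # gap is decreasing in s

cauchy = lambda z: 1.0 / (math.pi * (1.0 + z * z))
gauss = lambda z: math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

s_cauchy = strength(cauchy)        # Item 23: equals 1
s_gauss = strength(gauss)          # Item 28: squared value 0.171085...
k = 2.0
s_const = abs(k) / math.sqrt(3.0)  # Item 24: log(1 + k^2/s^2) = log 4 gives s = |k|/sqrt(3)
```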

Maximization of Differential Entropy
36. Among random variables with a given second moment (resp. first absolute moment), differential entropy is maximized by the zero-mean Gaussian (resp. Laplace) distribution. More generally, among random variables with a given p-absolute moment µ, differential entropy is maximized by the parameter-p Subbotin (or generalized normal) distribution with p-absolute moment µ [17]. Among nonnegative random variables with a given mean, differential entropy is maximized by the exponential distribution. In those well-known solutions, the cost function is an affine function of the negative logarithm of the maximal differential entropy probability density function. Is there a cost function such that, among all random variables with a given expected cost, the Cauchy distribution is the maximal differential entropy solution? To answer this question, we adopt a more general viewpoint. Consider the following result, whose special case ρ = 2 was solved in [18] using convex optimization.

Theorem 2. Fix ρ > 0 and θ > 0. Then,

max { h(Z) : E[log_e(1 + |Z|^ρ)] ≤ θ } = h(V_{β,ρ}), (109)

where V_{β,ρ} is defined in (9), the right side of (109) is given in (22), and β > 1/ρ is the solution to

θ = ψ(β) − ψ(β − 1/ρ). (110)

Therefore, the standard Cauchy distribution is the maximal differential entropy distribution provided that ρ = 2 and θ = log_e 4.

Proof.
(a) For every ρ > 0 and θ > 0, there is a unique β > 1/ρ that satisfies (110) because the function of β on the right side is strictly monotonically decreasing, grows without bound as β ↓ 1/ρ, and goes to zero as β → ∞.

(b)
For any Z which satisfies E[log_e(1 + |Z|^ρ)] ≤ θ, its relative entropy, in nats, with respect to V_{β,ρ} is

D(Z ‖ V_{β,ρ}) = −h(Z) + log_e(1/κ_{β,ρ}) + β E[log_e(1 + |Z|^ρ)] (111)
             ≤ −h(Z) + log_e(1/κ_{β,ρ}) + β θ (112)
             = −h(Z) + log_e(1/κ_{β,ρ}) + β ( ψ(β) − ψ(β − 1/ρ) ) (113)
             = h(V_{β,ρ}) − h(Z), (114)

where (113) and (114) follow from (110) and (22), respectively. Since relative entropy is nonnegative, and zero only if both measures are identical, not only does (109) hold but any random variable other than V_{β,ρ} achieves strictly lower differential entropy.
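The special Cauchy case singled out after Theorem 2 can be checked numerically (an illustrative scipy sketch): the standard Cauchy meets the constraint with θ = log_e 4, while a Gaussian competitor scaled to the same cost achieves smaller differential entropy. The scaling of the competitor uses the value ς²(X) = 0.171085 quoted in Item 28.

```python
import math
from scipy.integrate import quad

f = lambda x: 1.0 / (math.pi * (1.0 + x * x))  # standard Cauchy density (1)
# cost E[log_e(1 + V^2)] under the standard Cauchy law: equals log_e 4
theta, _ = quad(lambda x: f(x) * math.log1p(x * x), -math.inf, math.inf)
h_cauchy = math.log(4 * math.pi)               # its differential entropy, nats

# Gaussian Z = sigma * X scaled so that E[log_e(1 + Z^2)] = log_e 4 as well
sigma2 = 1.0 / 0.171085                        # from Item 28 (assumed value)
h_gauss = 0.5 * math.log(2 * math.pi * math.e * sigma2)
```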
37. An unfortunate consequence stemming from Theorem 2 is that, while we were able to find a cost function such that the Cauchy distribution is the maximal differential entropy distribution under an average cost constraint, this holds only for a specific value of the constraint. This behavior is quite different from the classical cases discussed in Item 36, for which the solution is, modulo scale, the same regardless of the value of the cost constraint. As we see next, this deficiency is overcome by the notion of strength introduced in Section 5. 38.

Theorem 3. The differential entropy of a random variable Z satisfies

h(Z) ≤ log(4π ς(Z)), (115)

with equality if and only if Z is a centered Cauchy random variable.

Proof.
(a) If Z is not an absolutely continuous random variable or, more generally, h(Z) = −∞, such as in the case ς(Z) = 0 in which Z = 0 with probability one, then (115) is trivially satisfied. (b) Otherwise, assuming 0 < ς(Z) < ∞, we may invoke (47) to conclude that not only does (115) hold, but it is satisfied with equality if and only if Z = ς(Z)V. 39. The entropy power of a random variable Z is the variance of a Gaussian random variable whose differential entropy is h(Z), i.e., in nats,

N(Z) = (1/(2πe)) e^{2h(Z)}. (116)

While the power of a Cauchy random variable is infinite, its entropy power is given by

N(λV + µ) = (8π/e) λ². (117)

In the same spirit as the definition of entropy power, Theorem 3 suggests the definition of N_C(Z), the entropy strength of Z, as the strength of a centered Cauchy random variable whose differential entropy is h(Z), i.e.,

h(Z) = h(N_C(Z)V). (118)

Therefore,

N_C(Z) = (1/(4π)) e^{h(Z)} (119)
       ≤ ς(Z), (120)

where (119) follows from (56), and (120) holds with equality if and only if Z is centered Cauchy. Note that, for all (α, µ) ∈ R²,

N_C(αZ + µ) = |α| N_C(Z). (121)

Comparing (116) and (118), we see that entropy power is simply a scaled version of the entropy strength squared,

N(Z) = (8π/e) N_C²(Z). (122)

The entropy power inequality (e.g., [19,20]) states that, if X₁ and X₂ are independent real-valued random variables, then

N(X₁ + X₂) ≥ N(X₁) + N(X₂), (123)

regardless of whether they have moments. According to (122), we may rewrite the entropy power inequality (123) replacing each entropy power by the corresponding squared entropy strength. Therefore, the squared entropy strength of the sum of independent random variables satisfies

N_C²(X₁ + X₂) ≥ N_C²(X₁) + N_C²(X₂). (124)

It is well known that equality holds in (123), and hence (124), if and only if both random variables are Gaussian. Indeed, if X₁ and X₂ are centered Cauchy with respective strengths ς₁ > 0 and ς₂ > 0, then (124) becomes (ς₁ + ς₂)² > ς₁² + ς₂². 40. Theorem 3 implies that any random variable with infinite differential entropy has infinite strength. There are indeed random variables with finite differential entropy and infinite strength.
For example, let Z ∈ [2, ∞) be an absolutely continuous random variable with a suitably chosen probability density function; then, h(Z) = 1.99258... nats, while the entropy of the quantized version as well as the strength satisfy H(⌊Z⌋) = ∞ = ς(Z). 41. With the same approach, we may generalize Theorem 3 to encompass the full slew of the generalized Cauchy distributions in (9). To that end, fix ρ > 0 and θ > 0, and define the (ρ, θ)-strength of a random variable as

ς_{ρ,θ}(Z) = min{ ς ≥ 0 : E[log_e(1 + |Z|^ρ/ς^ρ)] ≤ θ }. (126)

Therefore, ς_{ρ,θ}(Z) = ς(Z) for (ρ, θ) = (2, log_e 4), and if (β, ρ, θ) satisfy (110), then

ς_{ρ,θ}(λV_{β,ρ}) = |λ|. (127)

42.

Theorem 4. Fix ρ > 0 and θ > 0, and let β be determined by (110). Then,

h(Z) ≤ h(V_{β,ρ}) + log ς_{ρ,θ}(Z), (128)

with equality if and only if Z = ς_{ρ,θ}(Z) V_{β,ρ}.

Proof.
As with Theorem 3, in the proof we may assume 0 < ς_{ρ,θ}(Z) < ∞ to avoid trivialities. The result then follows, in nats, from the nonnegativity of the relative entropy with respect to V_{β,ρ}, where (130) follows from (110). 43. In the multivariate case, we may find a simple upper bound on differential entropy based on the strength of the norm of the random vector.
Theorem 5. The differential entropy of a random vector Z^n is upper bounded by

h(Z^n) ≤ h(V^n) + n log ς(‖Z^n‖). (134)

Proof.
As in the proof of Theorem 3, we may assume that 0 < ς(‖Z^n‖) < ∞. As usual, V^n denotes the standard spherical multivariate Cauchy density in (6). In the resulting chain, (136) and (137) follow from (6) and the definition of strength, respectively.
For n = 1, Theorem 5 becomes the bound in (115). For n = 2, 3, . . ., the right side of (15) is greater than log_e 4, and, therefore, ς(‖V^n‖) > 1. Consequently, in the multivariate case, there is no Z^n such that (134) is tight. 44. To obtain a full generalization of Theorem 3 in the multivariate case, it is advisable to define the strength of a random n-vector as

ς(Z^n) = min{ ς ≥ 0 : E[log_e(1 + ‖Z^n‖²/ς²)] ≤ θ_n }, (138)

for θ_n = ψ((n+1)/2) + γ + log_e 4. With this definition,

ς(λV^n) = |λ|, (139)

which can be verified by means of (15)–(17). Notice that, for n = 1, (138) is equal to (34). The following result provides a maximal differential entropy characterization of the standard spherical multivariate Cauchy density.

Theorem 6. Let V^n have the standard multivariate Cauchy density (6). Then,

h(Z^n) ≤ h(V^n) + n log_e ς(Z^n), (140)

where h(V^n) is given in (17).
in view of (138). Hence, the difference between the right and left sides of (140) is equal to zero if and only if Z^n = λV^n for some λ ≠ 0; otherwise, it is positive.
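The relation between (116) and (122) can be illustrated with a small sketch (illustrative only; the entropy strength formula e^{h}/(4π) follows, in nats, from h(ςV) = log(4πς)):

```python
import math

def entropy_power(h):
    # (116): variance of the Gaussian whose differential entropy is h (nats)
    return math.exp(2 * h) / (2 * math.pi * math.e)

def entropy_strength(h):
    # strength of the centered Cauchy whose differential entropy is h:
    # h = log(4*pi*s)  =>  s = e^h / (4*pi)
    return math.exp(h) / (4 * math.pi)

# sanity check: a standard Gaussian has entropy power 1
h_gauss = 0.5 * math.log(2 * math.pi * math.e)
# the ratio N / N_C^2 is the constant 8*pi/e, independent of h, as in (122)
ratio = entropy_power(1.0) / entropy_strength(1.0) ** 2
```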

Relative Information
45. For probability measures P and Q on the same measurable space (A, F), such that P ≪ Q, the logarithm of their Radon-Nikodym derivative is the relative information, denoted by

ı_{P‖Q}(x) = log (dP/dQ)(x). (141)

46. As usual, we may employ the notation ı_{X‖Y}(x) to denote ı_{P_X‖P_Y}(x). The distributions of the random variables ı_{X‖Y}(X) and ı_{X‖Y}(Y) are referred to as relative information spectra (e.g., [21]). It can be shown that there is a one-to-one correspondence between the cumulative distribution functions of ı_{X‖Y}(X) and ı_{X‖Y}(Y). For example, if they are absolutely continuous random variables with respective probability density functions f_{ı_{X‖Y}(X)} and f_{ı_{X‖Y}(Y)}, then

f_{ı_{X‖Y}(Y)}(α) = e^{−α} f_{ı_{X‖Y}(X)}(α). (142)

Obviously, the distributions of ı_{X‖Y}(X) and (dP_X/dP_Y)(X) determine each other. One caveat is that relative information may take the value −∞, although it can be shown that ı_{X‖Y}(X) = −∞ with probability zero. 47. The information spectra determine all measures of the distance between the respective probability measures of interest (e.g., [22,23]), including f-divergences and Rényi divergences. For example, the relative entropy (or Kullback-Leibler divergence) of the dominated measure P with respect to the reference measure Q is the average of the relative information when the argument is distributed according to P, i.e., D(P‖Q) = E[ı_{P‖Q}(X)] with X ∼ P. 48. The information spectra also determine the fundamental trade-off in hypothesis testing. Let α_ν(P₁, P₀) denote the minimal probability of deciding P₀ when P₁ is true, subject to the constraint that the probability of deciding P₁ when P₀ is true is no larger than ν. A consequence of the Neyman-Pearson lemma is that α_ν(P₁, P₀) is determined by the distributions of ı_{P₁‖P₀}(Y₀) and ı_{P₁‖P₀}(Y₁), where Y₀ ∼ P₀ and Y₁ ∼ P₁. 49. Cauchy distributions are absolutely continuous with respect to each other and, in view of (2), the relative information between Cauchy distributions with respective scale/location pairs (λ₁, µ₁) and (λ₀, µ₀) is ı(x) = log[ λ₁ (λ₀² + (x − µ₀)²) / ( λ₀ (λ₁² + (x − µ₁)²) ) ]. 50. The following result, proved in Item 58, shows that the relative information spectrum corresponding to Cauchy distributions with respective scale/locations (λ₁, µ₁) and (λ₀, µ₀) depends on the four parameters through the single scalar

( (λ₀ + λ₁)² + (µ₁ − µ₀)² ) / (4 λ₀ λ₁) ≥ 1, (152)

where equality holds if and only if (λ₁, µ₁) = (λ₀, µ₀).
51. The indefinite integral (e.g., see 2.261 in [24]) will be used in the sequel.

52. For future use, note that the endpoints of the support of (153) are their respective reciprocals.

Equivalent Pairs of Probability Measures
53. Suppose that P_1 and Q_1 are probability measures on (A_1, F_1) such that P_1 ≪ Q_1, and P_2 and Q_2 are probability measures on (A_2, F_2) such that P_2 ≪ Q_2. We say that (P_1, Q_1) and (P_2, Q_2) are equivalent pairs, and write (P_1, Q_1) ≡ (P_2, Q_2), if the cumulative distribution functions of ı_{P_1‖Q_1}(X_1) and ı_{P_2‖Q_2}(X_2) are identical with X_1 ∼ P_1 and X_2 ∼ P_2. Naturally, ≡ is an equivalence relation. Because of the one-to-one correspondence indicated in Item 46, the definition of equivalent pairs does not change if we require equality of the information spectra under the dominated measure, i.e., that ı_{P_1‖Q_1}(Y_1) and ı_{P_2‖Q_2}(Y_2) be equally distributed, with Y_1 ∼ Q_1 and Y_2 ∼ Q_2. Obviously, the requirement that the information spectra coincide is the same as requiring that the distributions of (dP_1/dQ_1)(Y_1) and (dP_2/dQ_2)(Y_2) be equal. As in Item 46, we also employ the notation (X_1, X_2) ≡ (Y_1, Y_2) when (P_{X_1}, P_{X_2}) ≡ (P_{Y_1}, P_{Y_2}).

54. Suppose that the output probability measures of a certain (random or deterministic) transformation are Q_0 and Q_1 when the input is distributed according to P_0 and P_1, respectively. If (P_0, P_1) ≡ (Q_0, Q_1), then the transformation is a sufficient statistic for deciding between P_0 and P_1 (i.e., the case of a binary parameter).

55. If (A, F) is a measurable space on which the probability measures P_{X_1} and P_{X_2} are defined, and φ : A → B is an (F, G)-measurable injective function, then P_{φ(X_1)} and P_{φ(X_2)} are probability measures on (B, G) and ı_{φ(X_1)‖φ(X_2)}(φ(x)) = ı_{X_1‖X_2}(x). Consequently, (X_1, X_2) ≡ (φ(X_1), φ(X_2)).

56. The most important special case of Item 55 is an affine transformation of an arbitrary real-valued random variable X, which enables the reduction of four-parameter problems to two-parameter problems: for all (λ_2, µ_1, µ_2) ∈ R³ and λ_1 ≠ 0, we have (λ_1 X + µ_1, λ_2 X + µ_2) ≡ (X, (λ_2/λ_1) X + (µ_2 − µ_1)/λ_1), by choosing the affine function φ(x) = (x − µ_1)/λ_1.
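The invariance in Item 55 and the affine reduction in Item 56 can be checked pointwise: relative information is unchanged when both Cauchy measures are pushed through φ(x) = (x − µ_1)/λ_1. A minimal Python sketch, with arbitrary parameter values:

```python
import math

def rel_info(x, mu1, lam1, mu0, lam0):
    # relative information i_{P1||P0}(x) for Cauchy measures P1, P0
    f1 = lam1 / (math.pi * (lam1 ** 2 + (x - mu1) ** 2))
    f0 = lam0 / (math.pi * (lam0 ** 2 + (x - mu0) ** 2))
    return math.log(f1 / f0)

mu1, lam1, mu0, lam0 = 3.0, 2.0, -1.0, 0.5
# phi maps lam1*V + mu1 to V, and lam0*V + mu0 to lam_p*V + mu_p:
phi = lambda x: (x - mu1) / lam1
lam_p, mu_p = lam0 / lam1, (mu0 - mu1) / lam1

# the Jacobians of the injective affine map cancel in the density ratio,
# so the relative information agrees point by point:
max_gap = max(
    abs(rel_info(x, mu1, lam1, mu0, lam0)
        - rel_info(phi(x), 0.0, 1.0, mu_p, lam_p))
    for x in (-5.0, -1.0, 0.0, 0.7, 4.0)
)
```

Since the identity holds pointwise, the two pairs trivially have identical information spectra, i.e., they are equivalent pairs.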

57.
Theorem 8. If X^n ∈ R^n is an even random vector, i.e., P_{X^n} = P_{−X^n}, then the equivalence (161) holds. Moreover, (161) holds even if X^n is not even, because the function x − µ is injective, in particular with µ = µ_3, where (162) and (166) follow from Part (a), (164) follows because −x + µ_3 − µ_4 is injective, and (165) holds because X^n is even.

58. We now proceed to prove Theorem 7.

Proof.
Since λV and −λV have identical distributions, we may assume for convenience that λ_1 > 0 and λ_0 > 0. Furthermore, capitalizing on Item 56, we may assume (λ_1, µ_1) = (1, 0) and (λ_0, µ_0) = (λ, µ); we can verify that we recover (151) through the aforementioned substitution.
Once we have obtained the expectation of Z = (dP_V/dP_{λV+µ})(V), we proceed to determine its distribution. Denoting the right side of (169) by ζ, we obtain the chain (172)–(175), where Θ is uniformly distributed on [−π, π]. We have substituted V = tan Θ (see Item 4) in (172), and invoked elementary trigonometric identities in (173) and (174). Since the phase in (175) does not affect it, the distribution of Z is indeed as claimed in (152), and (153) follows because the probability density function of cos Θ is 1/(π√(1 − y²)) on (−1, 1).

59. In general, it need not hold that (X, Y) ≡ (Y, X); for example, this fails if X and Y are zero-mean Gaussian with different variances. However, the class of scalar Cauchy distributions does satisfy this property, since the result of Theorem 7 is invariant to swapping λ_1 ↔ λ_0 and µ_1 ↔ µ_0. More generally, Theorem 7 implies that, if λ_1 λ_0 γ_1 γ_0 ≠ 0, then (177) holds. Curiously, (177) implies that (V, V + 1) ≡ (V, 2V + 1).

60. For location-dilation families of random variables, we saw in Item 56 how to reduce a four-parameter problem to a two-parameter problem with the appropriate substitution. In the Cauchy case, Theorem 7 reveals that, in fact, we can go one step further and turn it into a one-parameter problem. We have two basic ways of doing this, namely the two solutions λ = ζ ± √(ζ² − 1) of ζ = (λ² + 1)/(2λ).

f -Divergences
This section studies the interplay of f-divergences and equivalent pairs of probability measures.

61.
If P ≪ Q and f : [0, ∞) → R is convex and right-continuous at 0, the f-divergence D_f(P‖Q) is defined as in (179).

62. The most important property of f-divergence is the data processing inequality D_f(P_X‖Q_X) ≥ D_f(P_Y‖Q_Y), where P_Y and Q_Y are the responses of a (random or deterministic) transformation to P_X and Q_X, respectively. If f is strictly convex at 1 and D_f(P_X‖Q_X) < ∞, then (P_X, Q_X) ≡ (P_Y, Q_Y) is necessary and sufficient for equality in (180).
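The inequality in (180) can be illustrated with the simplest deterministic transformation, quantization: binning two Cauchy measures can only shrink their relative entropy. In the Python sketch below, the closed form used for the unquantized divergence is the standard Cauchy–Cauchy formula (our reconstruction of (198)), and the bin edges are arbitrary:

```python
import math

def cauchy_cdf(x, mu, lam):
    return 0.5 + math.atan((x - mu) / lam) / math.pi

def cell_probs(edges, mu, lam):
    # probabilities of the cells (-inf, e1], (e1, e2], ..., (ek, inf)
    cdf = [0.0] + [cauchy_cdf(e, mu, lam) for e in edges] + [1.0]
    return [cdf[i + 1] - cdf[i] for i in range(len(cdf) - 1)]

def kl_discrete(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

edges = [-8, -4, -2, -1, 0, 1, 2, 4, 8]
# quantized D(P||Q) for P = Cauchy(1, 1), Q = Cauchy(0, 2):
d_binned = kl_discrete(cell_probs(edges, 1, 1), cell_probs(edges, 0, 2))
# unquantized divergence, D(Cauchy(1,1) || Cauchy(0,2)) = log(10/8):
d_full = math.log(((1 + 2) ** 2 + (1 - 0) ** 2) / (4 * 1 * 2))
```

Here the quantizer is not a sufficient statistic for the pair, so the inequality is strict.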

Proof.
As mentioned in Item 53, (P_1, Q_1) ≡ (P_2, Q_2) is equivalent to (dP_1/dQ_1)(Y_1) and (dP_2/dQ_2)(Y_2) being identically distributed.

⟹ According to (179), D_f(P‖Q) is determined by the distribution of the random variable (dP/dQ)(Y), Y ∼ Q.

⟸ For t ∈ R, the function f_t(x) = e^{tx}, x ≥ 0, is convex and right-continuous at 0, and D_{f_t}(P‖Q) is the moment generating function, evaluated at t, of the random variable (dP/dQ)(Y), Y ∼ Q.

65. Since P ≪ Q is not necessary in order to define (finite) D_f(P‖Q), it is possible to enlarge the scope of Theorem 9 by defining (P_1, Q_1) ≡ (P_2, Q_2) dropping the restriction that P_1 ≪ Q_1 and P_2 ≪ Q_2. For that purpose, let µ_1 and µ_2 be σ-finite measures on (A_1, F_1) and (A_2, F_2), respectively, dominating the corresponding pairs, and denote the respective densities by p_i and q_i. The enlarged definition requires that, when restricted to [0, 1], the random variables p_1(Y_1)/q_1(Y_1) and p_2(Y_2)/q_2(Y_2) be equally distributed, and that, when restricted to [0, 1], the random variables q_1(X_1)/p_1(X_1) and q_2(X_2)/p_2(X_2) be equally distributed. Note that those conditions imply that (e) P_1({ω ∈ A_1 : q_1(ω) = 0}) = P_2({ω ∈ A_2 : q_2(ω) = 0}).
For example, if P_1 ⊥ Q_1 and P_2 ⊥ Q_2, then (P_1, Q_1) ≡ (P_2, Q_2). To show the generalized version of Theorem 9, it is convenient to use the symmetrized form in (182).

66. Suppose that there is a class C of probability measures on a given measurable space with the property that there exists a convex function g : (0, ∞) → R (right-continuous at 0) such that, if (P_1, Q_1) ∈ C² and (P_2, Q_2) ∈ C², then (183) holds. In such a case, Theorem 9 indicates that C² can be partitioned into equivalence classes such that, within every equivalence class, the value of D_f(P‖Q) is constant, though naturally dependent on f. Throughout C², the value of D_g(P‖Q) determines the value of D_f(P‖Q), i.e., we can express D_f(P‖Q) = ϑ_{f,g}(D_g(P‖Q)), where ϑ_{f,g} is a non-decreasing function. Consider the following examples:

(a) Let C be the class of real-valued Gaussian probability measures with given variance σ² > 0. Since Theorem 8 implies that (N(µ_1, σ²), N(µ_2, σ²)) ≡ (N(µ_3, σ²), N(µ_4, σ²)) as long as (µ_1 − µ_2)² = (µ_3 − µ_4)², (184) indicates that (183) is satisfied with g(t) given by the right-continuous extension of t log t. Therefore, we can conclude that, regardless of f, D_f(N(µ_1, σ²)‖N(µ_2, σ²)) depends on (µ_1, µ_2, σ²) only through (µ_1 − µ_2)²/σ².

(b) Let C be the collection of all Cauchy random variables. Theorem 7 reveals that (183) is also satisfied if g(x) = x², because, if X ∼ P and Y ∼ Q, then D_g(P‖Q) = E[(dP/dQ)(X)].

67. An immediate consequence of Theorems 7 and 9 is that, for any valid f, the f-divergence between Cauchy densities is symmetric: D_f(P‖Q) = D_f(Q‖P). This property does not generalize to the multivariate case: while Theorem 8 yields equivalence in special cases, in general (Λ^{1/2} V^n, V^n) ≢ (V^n, Λ^{1/2} V^n), since the corresponding relative entropies do not coincide, as shown in [8].

68.
It follows from Item 66 and Theorem 7 that any f-divergence between Cauchy probability measures, D_f(λ_1 V + µ_1 ‖ λ_0 V + µ_0), is a monotonically increasing function of ζ(λ_1, µ_1, λ_0, µ_0) given by (149). The following result shows how to obtain that function from f and the density in (153).

Proof.
In view of (179) and the definition of Z in Theorem 7, the result follows.

Theorem 10 then yields the corresponding expressions. The most common f-divergences are such that f(1) = 0, since in that case D_f(P‖Q) ≥ 0. In addition, adding the function αt − α to f(t) does not change the value of D_f(P‖Q), and with appropriately chosen α, we can turn f(t) into a canonical form in which not only f(1) = 0 but also f(t) ≥ 0. In the special case in which the second measure is fixed, Theorem 9 in [25] shows that, if ess sup (dP_n/dQ)(Y) → 1 with Y ∼ Q, then the limiting expression holds provided the limit on the right side exists; otherwise, the left side lies between the left and right limits at 1. In the Cauchy case, we can allow the second probability measure to depend on n and sharpen that result by means of Theorem 10. In particular, it can be shown that the analogous limit holds provided the right side is not 0/0.
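The symmetry of Item 67 is easy to probe numerically. In the Python sketch below (quadrature via the substitution x = tan θ; parameter choices arbitrary), D(P‖Q) and D(Q‖P) agree for a Cauchy pair but not for two Gaussians with different variances:

```python
import math

def kl(f1, f0, n=400_000):
    # D(P1||P0) over the real line via x = tan(theta), dx = (1 + x^2) dtheta
    h = math.pi / n
    total = 0.0
    for k in range(n):
        x = math.tan(-math.pi / 2 + (k + 0.5) * h)
        p1, p0 = f1(x), f0(x)
        if p1 > 0.0 and p0 > 0.0:          # guard against Gaussian tail underflow
            total += p1 * math.log(p1 / p0) * (1 + x * x)
    return total * h

cauchy = lambda mu, lam: (lambda x: lam / (math.pi * (lam ** 2 + (x - mu) ** 2)))
gauss = lambda mu, s: (lambda x: math.exp(-((x - mu) ** 2) / (2 * s * s))
                       / (s * math.sqrt(2 * math.pi)))

p, q = cauchy(1.0, 2.0), cauchy(0.0, 1.0)
d_pq, d_qp = kl(p, q), kl(q, p)             # equal: (P, Q) == (Q, P) for Cauchy

g1, g2 = gauss(0.0, 1.0), gauss(0.0, 2.0)
e_12, e_21 = kl(g1, g2), kl(g2, g1)         # unequal for Gaussians
```

The Cauchy values also match the closed form log(10/8) for this pair, consistent with the ζ-dependence discussed in Item 68.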

71. If P and Q are Cauchy distributions, then (149), (151) and (195) result in a formula obtained in Appendix D of [26] using complex analysis and the Cauchy integral formula. In addition, invoking complex analysis and the maximal group invariant results in [27,28], ref. [26] shows that any f-divergence between Cauchy distributions can be expressed as a function of their χ²-divergence, although [26] left open how to obtain that function, which is given by Theorem 10 upon substituting ζ = 1 + χ².

The relative entropy between Cauchy distributions is given by

D(λ_1 V + µ_1 ‖ λ_0 V + µ_0) = log ( ((λ_1 + λ_0)² + (µ_1 − µ_0)²) / (4 λ_1 λ_0) ),    (198)

where λ_1 λ_0 ≠ 0 (we may take λ_1, λ_0 > 0 since λV and |λ|V are identically distributed). The special case λ_1 = λ_0 of (198) was found in Example 4 of [29]. The next four items give different simple justifications for (198). An alternative proof was recently given in Appendix C of [26] using complex analysis (holomorphic functions) and the Cauchy integral formula. Yet another, much more involved, proof is reported in [30]. See also Remark 19 in [26] for another route invoking the Lévy–Khintchine formula and the Frullani integral.

Since, for absolutely continuous random variables, D(X‖Y) = −h(X) − E[log f_Y(X)], we can write
where (200) follows from (12) and (A4) with α² = λ² + µ² and cos β = µ/|α|. Now, substituting λ = λ_0/λ_1 and µ = (µ_0 − µ_1)/λ_1, we obtain (198) since, according to Item 56, (V, λV + µ) ≡ (λ_1 V + µ_1, λ_0 V + µ_0).

74. From the formula found in Example 4 of [29] and the fact that, according to (197), the χ²-divergence between Cauchy distributions is available in closed form, we obtain (201). Moreover, as argued in Item 60, (201) is also valid for the relative entropy between Cauchy distributions with λ_1 ≠ λ_0 as long as χ² is given by (197). Indeed, we can verify that the right side of (201) becomes (198) with said substitution.

75. (198) also follows directly from the definition of relative entropy and Theorem 7, via (204), where V_1 = λ_1 Ṽ + µ_1 and V_0 = λ_0 Ṽ + µ_0, and Ṽ is an independent copy of V. In contrast, the corresponding result in the Gaussian case, in which X, X_1, X_0 are independent Gaussian with means µ, µ_1, µ_0 and variances σ², σ²_1, σ²_0, respectively, is (206). In fact, it is shown in Lemma 1 of [31] that (206) holds even if X_1 and X_0 are not Gaussian but have finite variances. It is likely that (205) holds even if V_1 and V_0 are not Cauchy, but have finite strengths.

78. An important information-theoretic result due to Csiszár [32] is that if Q_1 ≪ Q_2 and P satisfies the orthogonality condition (207), then the Pythagorean identity (208) holds. Among other applications, this result leads to elegant proofs of minimum relative entropy results. For example, the closest Gaussian to a given P with a finite second moment has the same first and second moments as P. If we let Q_1 and Q_2 be centered Cauchy with strengths λ_1 and λ_2, respectively, then the orthogonality condition (207) becomes, with the aid of (148) and (198), condition (209). If, in addition, P is centered Cauchy, we can use (28) to verify that (209) holds only in the trivial cases in which either λ_1 = λ_2 or P = Q_1. For non-Cauchy P, (208) may indeed be satisfied with λ_1 ≠ λ_2. For example, using (30), if X = V_{2,2}, then (209), and therefore (208), holds with (λ_1, λ_2) = (2, 0.35459…).

79.
Mutually absolutely continuous random variables may be such that both relative entropies are infinite, as in (210). An easy example is that of Gaussian X and Cauchy Z; or, if we let X be Cauchy, (210) holds with Z having the very heavy-tailed density function in (62).

80. While relative entropy is lower semi-continuous, it is not continuous. For example, using the Cauchy distribution, we can show that relative entropy is not stable against small contamination of a Gaussian random variable: if X is Gaussian and independent of V, then the relative entropy between X + λV and X is infinite no matter how small λ ≠ 0 is.
82. Example 15 of [33] shows that the total variation distance between centered Cauchy distributions admits the closed form (217)–(218), in view of (197). Since any f-divergence between Cauchy distributions depends on the parameters only through the corresponding χ²-divergence, (217)–(218) imply the general formula (219). Alternatively, applying Theorem 11 to the case of Cauchy random variables, note that, in this case, Z is an absolutely continuous random variable with density function (153). Therefore, P[Z = 1] = 0, and (220)–(221) follow, where (221) follows from (154) and a standard arcsine identity. Though more laborious (see [26]), (219) can also be verified by direct integration.
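The direct-integration check mentioned at the end of Item 82 is easy to automate. In the Python sketch below, the closed form (2/π)·arctan √(χ²/2), with χ² = ((λ_1 − λ_0)² + (µ_1 − µ_0)²)/(2 λ_1 λ_0), is our reconstruction of (197) and (219), not quoted verbatim from the text:

```python
import math

def tv_numeric(mu1, lam1, mu0, lam0, n=400_000):
    # (1/2) * integral of |f1 - f0| over the real line, via x = tan(theta)
    h = math.pi / n
    total = 0.0
    for k in range(n):
        x = math.tan(-math.pi / 2 + (k + 0.5) * h)
        f1 = lam1 / (math.pi * (lam1 ** 2 + (x - mu1) ** 2))
        f0 = lam0 / (math.pi * (lam0 ** 2 + (x - mu0) ** 2))
        total += abs(f1 - f0) * (1 + x * x)
    return 0.5 * total * h

def tv_closed(mu1, lam1, mu0, lam0):
    # total variation as a function of the chi-square divergence
    chi2 = ((lam1 - lam0) ** 2 + (mu1 - mu0) ** 2) / (2 * lam1 * lam0)
    return (2 / math.pi) * math.atan(math.sqrt(chi2 / 2))

gap_scale = abs(tv_numeric(0, 2, 0, 1) - tv_closed(0, 2, 0, 1))   # scale change
gap_shift = abs(tv_numeric(1, 1, 0, 1) - tv_closed(1, 1, 0, 1))   # location change
```

For the pure location shift, the densities cross once at the midpoint, and the quadrature reproduces (2/π)·arctan(1/2) exactly as the closed form predicts.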

Hellinger Divergence
Notable special cases include the squared Hellinger distance H²(P‖Q).

84. For Cauchy random variables, Theorem 10 yields a closed form in which ζ is as given in (149); we have used (A15), and P_α(·) denotes the Legendre function of the first kind, which satisfies P_{−α} = P_{α−1} (see 8.2.1 in [34]).

The Chernoff information
satisfies C(P‖Q) = C(Q‖P) regardless of (P, Q). If, as in the case of Cauchy measures, (P, Q) ≡ (Q, P), then the Chernoff information is equal to the Bhattacharyya distance, namely −log(1 − ½ H²(P‖Q)), where H²(P‖Q) is the squared Hellinger distance, i.e., the f-divergence with f(t) = (1 − √t)². Together with Item 87, (240) gives the Chernoff information for Cauchy distributions. While it involves the complete elliptic integral function, its simplicity should be contrasted with the formidable expression for Gaussian distributions, recently derived in [38]. The reason (240) holds is that the supremum in (239) is achieved at λ* = 1/2. To see this, note the chain (241)–(242), where (241) reflects the skew-symmetry of Rényi divergence, and (242) holds because (P, Q) ≡ (Q, P). Since the function f(λ), λ ∈ [0, 1], is concave and its own mirror image, it is maximized at λ* = 1/2.
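That the exponent is maximized at λ* = 1/2 for Cauchy pairs can be seen numerically: with g(a) = −log ∫ p^a q^{1−a}, the Python sketch below (arbitrary parameters) exhibits g(0.3) = g(0.7) < g(0.5):

```python
import math

def neg_log_renyi_integral(a, f1, f0, n=400_000):
    # -log of the integral of f1^a * f0^(1-a) over the real line, via x = tan(theta)
    h = math.pi / n
    total = 0.0
    for k in range(n):
        x = math.tan(-math.pi / 2 + (k + 0.5) * h)
        total += (f1(x) ** a) * (f0(x) ** (1 - a)) * (1 + x * x)
    return -math.log(total * h)

cauchy = lambda mu, lam: (lambda x: lam / (math.pi * (lam ** 2 + (x - mu) ** 2)))
p, q = cauchy(0.0, 3.0), cauchy(1.0, 1.0)

g = {a: neg_log_renyi_integral(a, p, q) for a in (0.3, 0.5, 0.7)}
# g[0.5] is the Bhattacharyya distance, which equals the Chernoff information here
```

The mirror symmetry g(a) = g(1 − a) is exactly the (P, Q) ≡ (Q, P) property; together with concavity in a it pins the maximizer at 1/2.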

Fisher's Information
93. The score function of the standard Cauchy density (1) is ρ_V(x) = −2x/(1 + x²). Then, ρ_V(V) is a zero-mean random variable with second moment equal to Fisher's information J(V) = 1/2, where we have used (A11). Since Fisher's information is invariant to location and scales as J(X) = α² J(αX), we obtain J(λV + µ) = 1/(2λ²). Together with (117), the product of entropy power and Fisher's information is 4π/e, thereby abiding by Stam's inequality [4], 1 ≤ N(X) J(X).

94. Introduced in [39], Fisher's information of a density function (245) quantifies its similarity with a slightly shifted version of itself. A more general notion is the Fisher information matrix of a random transformation P_{Y|X} : R^k → Y satisfying the regularity condition (247). The Fisher information matrix of P_{Y|X} at θ then satisfies (with relative entropy in nats) a local quadratic approximation. For the Cauchy family, the parametrization vector has two components, location and strength, namely θ = (µ, λ). The regularity condition (247) is satisfied in view of (205), and we can use the closed-form expression in (205) to obtain the corresponding matrix.

95. The relative Fisher information J(X‖Y) can also be defined. Although the purpose of this definition is to avoid some of the pitfalls of the classical definition of Fisher's information, not only do equivalent pairs fail to have the same relative Fisher information but, unlike relative entropy or f-divergence, relative Fisher information is not transparent to injective transformations; for example, J(X‖Y) = λ² J(λX‖λY). Centered Cauchy random variables illustrate this fact.

96. de Bruijn's identity [4] states that, if N ∼ N(0, 1) is independent of X, then, in nats, (d/dt) h(X + √t N) = ½ J(X + √t N). As well as serving as the key component in the original proofs of the entropy power inequality, the differential equation in (254) provides a concrete link between Shannon theory and its prehistory. As we show in Theorem 12, it turns out that there is a Cauchy counterpart of de Bruijn's identity (254).
Before stating the result, we introduce the following notation for a parametrized random variable Y_t (to be specified later): J(Y_t) and K(Y_t) denote the Fisher information with respect to location and with respect to dilation, respectively (corresponding to the coefficients J_11 and J_22 of the Fisher information matrix when θ = (µ, λ), as in Item 94). The key to (254) is that Y_t = X + √t N, N ∼ N(0, 1), satisfies the heat equation, i.e., its probability density function u(y; t) obeys ∂u/∂t = ½ ∂²u/∂y².

Theorem 12. Suppose that X is independent of the standard Cauchy V. Then, in nats, (260) holds for h(X + tV).

Proof.
Equation (259) does not hold in the current case, in which Y_t = X + tV. However, some algebra (the differentiation/integration swaps can be justified by invoking the bounded convergence theorem) indicates that the convolution with the Cauchy density satisfies the Laplace partial differential equation (262): the density u(y; t) of Y_t obeys ∂²u/∂t² + ∂²u/∂y² = 0. The derivative of the differential entropy of Y_t is given, in nats, by (263). Taking another derivative, the left side of (260) becomes (264)–(268), where

• (265) holds because the first term on the right side of (264) is zero;
• (266) follows from (262);
• (267) follows from (258);
• (268) follows from integration by parts, exactly as in [4] (or p. 673 of [19]).
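The harmonicity claim in the proof can be checked by finite differences. The Python sketch below takes the simplest case X = 0, so that the density of Y_t = tV is the Poisson kernel u(y; t) = t/(π(t² + y²)); the general convolution inherits the same PDE by differentiating under the integral:

```python
import math

def u(y, t):
    # density of t*V when X = 0: the Poisson kernel of the upper half-plane
    return t / (math.pi * (t * t + y * y))

def laplacian(y, t, h=1e-2):
    # central second differences in y and in t
    uyy = (u(y + h, t) - 2 * u(y, t) + u(y - h, t)) / (h * h)
    utt = (u(y, t + h) - 2 * u(y, t) + u(y, t - h)) / (h * h)
    return uyy + utt

# the residual u_yy + u_tt should vanish up to O(h^2) discretization error
residuals = [abs(laplacian(y, t)) for (y, t) in ((0.0, 1.0), (0.5, 0.8), (-2.0, 3.0))]
```

This is the structural contrast with the Gaussian case: Gaussian smoothing solves the heat equation in t, while Cauchy smoothing is harmonic in (y, t).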
97. Theorem 12 reveals that the increasing function f_X(t) = h(X + tV) is concave (which does not follow from the concavity of the differential entropy functional on densities).
In contrast, it was shown by Costa [40] that the entropy power N(X + √t N), with N ∼ N(0, 1), is concave in t.
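Returning to the computations in Item 93, they can be reproduced directly: with x = λ tan θ, the score of the centered Cauchy density becomes −sin(2θ)/λ, so its second moment under the uniform θ is 1/(2λ²). A Python sketch (the identities J(V) = 1/2, h(λV) = log(4πλ), and the product 4π/e are as stated in Item 93):

```python
import math

def fisher_cauchy(lam, n=200_000):
    # J = E[score(V)^2] with score(x) = -2x/(lam^2 + x^2); substituting
    # x = lam*tan(theta) makes the density uniform over theta
    h = math.pi / n
    total = 0.0
    for k in range(n):
        x = lam * math.tan(-math.pi / 2 + (k + 0.5) * h)
        total += (-2 * x / (lam * lam + x * x)) ** 2
    return total / n

J1 = fisher_cauchy(1.0)   # should be 1/2
J2 = fisher_cauchy(2.0)   # should be 1/(2*2^2) = 1/8

# entropy power N = exp(2h)/(2*pi*e) with h(lam*V) = log(4*pi*lam):
lam = 1.7
N = (4 * math.pi * lam) ** 2 / (2 * math.pi * math.e)
product = N * (1 / (2 * lam * lam))   # N*J = 4*pi/e, independent of lam
```

The product exceeds 1, consistent with Stam's inequality, but is far from the Gaussian extremal value of 1.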

Mutual Information
98. Most of this section is devoted to an additive noise model. We begin with the simplest case, in which X_C is centered Cauchy, independent of W_C, also centered Cauchy with ς(W_C) > 0. Then, (11) yields

I(X_C; X_C + W_C) = log ( 1 + ς(X_C)/ς(W_C) ),    (271)

thereby establishing a pleasing parallelism with Shannon's formula [1] for the mutual information between a Gaussian random variable and its sum with an independent Gaussian random variable. Aside from a factor of 1/2, in the Cauchy case the role of the variance is taken by the strength. Incidentally, as shown in [2], if N is standard exponential on (0, ∞), an independent X on [0, ∞) can be found so that X + N is exponential, in which case formula (271) also applies because the ratio of strengths of exponentials is equal to the ratio of their means. More generally, if input and noise are independent non-centered Cauchy, their locations do not affect the mutual information, but they do affect their strengths, so, in that case, (271) holds provided that the strengths are evaluated for the centered versions of the Cauchy random variables.

99. It is instructive, as well as useful in the sequel, to obtain (271) through a more circuitous route. The information density (e.g., [41]) is defined as ı_{X;Y}(x; y) = ı_{P_{Y|X=x}‖P_Y}(y). Averaging with respect to (X_C, Y_C) = (X_C, X_C + W_C), we obtain (271).

100. If the strengths of the output Y = X + N and the independent noise N are finite and their differential entropies are not −∞, we can obtain a general representation of the mutual information without requiring that either input or noise be Cauchy. This follows by invoking (56), since, as we saw in (49), the finiteness of the strengths guarantees the finiteness of the relative entropies in (278). We can readily verify the alternative representation in which strength is replaced by standard deviation, and the standard Cauchy V is replaced by the standard normal W. A byproduct of (278) is the upper bound (281)–(282), where (281) follows from N_C(Y) ≤ ς(Y), and (282) follows by dropping the last term on the right side of (278).
Note that (281) is the counterpart of the upper bound given by Shannon [1], in which the standard deviation of Y takes the place of the strength in the numerator, and the square root of the noise entropy power takes the place of the entropy strength in the denominator. Shannon gave his bound three years before Kullback and Leibler introduced relative entropy in [42]. The counterpart of (282), with analogous substitutions of strengths by standard deviations, was given by Pinsker [43], and by Ihara [44] for continuous-time processes.

101. We proceed to investigate the maximal mutual information between the (possibly non-Cauchy) input and its additive Cauchy-noise contaminated version.
Theorem 13. Maximal mutual information: output strength constraint. For any η ≥ ς(W_C) > 0,

max_{X : ς(X + W_C) ≤ η} I(X; X + W_C) = log ( η / ς(W_C) ),    (283)

where W_C is centered Cauchy, independent of X. The maximum in (283) is attained uniquely by the centered Cauchy distribution with strength η − ς(W_C).

Proof.
For centered Cauchy noise, the upper bound in (282) simplifies to log(η/ς(W_C)), which shows ≤ in (283). If the input is centered Cauchy X_C with strength η − ς(W_C), then ς(X_C + W_C) = η, and I(X_C; X_C + W_C) is equal to the right side of (283) in view of (271).
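Formula (271) can also be recovered from differential entropies, I = h(Y) − h(W), using the standard value h(λV) = log(4πλ) and the fact that strengths of independent Cauchy variables add. A Python sketch, with arbitrary strengths:

```python
import math

def cauchy_entropy(lam, n=400_000):
    # h(lam*V) = -E[log f] by quadrature; exact value is log(4*pi*lam)
    h_step = math.pi / n
    total = 0.0
    for k in range(n):
        x = lam * math.tan(-math.pi / 2 + (k + 0.5) * h_step)
        total += -math.log(lam / (math.pi * (lam * lam + x * x)))
    return total / n

sx, sw = 3.0, 1.5      # strengths of the input X_C and the noise W_C
mi = cauchy_entropy(sx + sw) - cauchy_entropy(sw)   # I = h(Y) - h(W), in nats
target = math.log(1 + sx / sw)                      # the right side of (271)
```

The shared singular part of the two entropy integrals cancels in the difference, so the mutual information estimate is far more accurate than either entropy on its own.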
102. In the information theory literature, the maximization of mutual information over the input distribution is usually carried out under a constraint on the average cost E[b(X)] for some real-valued function b. Before we investigate whether the optimization in (283) can be cast into that conventional paradigm, it is instructive to realize that the maximization of mutual information in the case of input-independent additive Gaussian noise can be viewed as one in which we allow any input such that the output variance is constrained; because the output variance is the sum of the input and noise variances, the familiar optimization over variance-constrained inputs obtains. Likewise, in the case of additive exponential noise and random variables taking nonnegative values, if we constrain the output mean, we automatically constrain the input mean. In contrast, the output strength is not equal to the sum of the Cauchy noise strength and the input strength, unless the input is Cauchy. Indeed, as we saw in Theorem 1-(d), the output strength depends not only on the input strength but also on the shape of its probability density function. Since the noise is Cauchy, (45) yields an equivalent input constraint, which is the same input constraint found in [45] (see also Lemma 6 in [46] and Section V in [47]), in which η affects not only the allowed expected cost but also the definition of the cost function itself. If X is centered Cauchy with strength η − ς(W_C), then (286) is satisfied with equality, in keeping with the fact that that input achieves the maximum in (283). Any alternative input with the same strength that produces output strength lower than or equal to η can only result in lower mutual information. However, as we saw in Item 29, we can indeed find input distributions with strength η − ς(W_C) that produce output strength higher than η. Can any of those input distributions achieve I(X; Y) > log(η/ς(W_C))? The answer is affirmative.
If we let X = V_{β,2}, defined in (9), we can verify numerically that, for β ∈ [0.8, 1), (287) holds. We conclude that, at least for θ/ς(W_C) ∈ (1, ς(V_{0.8,2})) = (1, 3.126…), the capacity-input-strength function satisfies (288).

103. Although not always acknowledged, the key step in the maximization of mutual information over the input distribution for a given random transformation is to identify the optimal output distribution. The results in Items 101 and 102 point out that it is mathematically more natural to impose constraints on attributes of the observed noisy signal than of the transmitted noiseless signal. In the usual framework of power constraints, both formulations are equivalent, as an increase in the gain of the receiver antenna (or a decrease in the front-end amplifier thermal noise) of κ dB has the same effect as an increase of κ dB in the gain of the transmitter antenna (or an increase in the output power of the transmitter amplifier). When, as in the case of strength, the two formulations lead to different solutions, it is worthwhile to recognize that what we usually view as transmitter/encoder constraints also involve receiver features.

104. Consider a multiaccess channel Y_i = X_{1i} + X_{2i} + W_i, where W_i is a sequence of independent centered Cauchy random variables of strength ς(W). While the capacity region is unknown if we place individual cost or strength constraints on the transmitters, it is easily solvable if we impose an output strength constraint. In that case, the capacity region is the triangle

{(R_1, R_2) : R_1 ≥ 0, R_2 ≥ 0, R_1 + R_2 ≤ log(η/ς(W))},    (289)

where η > ς(W) is the output strength constraint. To see this, note that (a) the corner points are achievable thanks to Theorem 13; (b) if the transmitters are synchronous, a time-sharing strategy with Cauchy distributed inputs satisfies the output strength constraint in view of (107); (c) replacing the independent encoders by a single encoder which encodes both messages would not achieve a higher rate sum.
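The strength bookkeeping behind the region in Item 104 can be sketched for an arbitrary rate split: in the Python fragment below, the expressions for the user strengths ς_1, ς_2 are our reconstruction derived from (271), not quoted from the text:

```python
import math

eta, sw, alpha = 8.0, 1.0, 0.4   # output-strength budget, noise strength, rate split

# choose user strengths so that user 1 (decoded first, seeing user 2 as noise)
# gets rate alpha*log(eta/sw), and user 2 gets the remainder:
s2 = eta ** (1 - alpha) * sw ** alpha - sw
s1 = eta - s2 - sw               # total output strength then meets the budget exactly

R1 = math.log(eta / (s2 + sw))   # user 1: single-user decoder, user 2 treated as noise
R2 = math.log((s2 + sw) / sw)    # user 2: after subtracting user 1's codeword
```

Sweeping α over (0, 1) traces the dominant face of the triangle, with the corner points recovering Theorem 13.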
It is also possible to achieve (289) using the successive decoding strategy invented by Cover [48] and Wyner [49] for the Gaussian multiple-access channel: fix α ∈ (0, 1); to achieve R_1 = α log(η/ς(W)) and R_2 = (1 − α) log(η/ς(W)), we let the transmitters use random coding with sequences of independent Cauchy random variables with respective strengths ς_1 and ς_2, which abide by the output strength constraint since ς_1 + ς_2 + ς(W) = η. The rate pair is achievable by successive decoding: a single-user decoder for user 1 treats the codeword transmitted by user 2 as noise; upon decoding the message of user 1, it is re-encoded and subtracted from the received signal, thereby presenting a single-user decoder for user 2 with a signal devoid of any trace of user 1 (with high probability).

105. The capacity per unit energy of the additive Cauchy-noise channel Y_i = X_i + λ V_i, where {V_i} is an independent sequence of standard Cauchy random variables, was shown in [29] to equal (4λ²)^{−1} log e, even though the capacity-cost function of such a channel is unknown. A corollary of Theorem 13 is that the capacity per unit output strength of the same channel is given by (295). By considering only Cauchy distributed inputs, the capacity per unit input strength is lower bounded by (296), but is otherwise unknown, as it is not encompassed by the formula in [29].

106. We turn to the scenario, dual to that in Theorem 13, in which the input is Cauchy but the noise need not be. As Shannon showed in [1], if the input is Gaussian, then among all noise distributions with a given second moment, independent Gaussian noise is the least favorable. Shannon proved this by applying the entropy power inequality to the numerator on the right side of (279), and then further weakened the resulting lower bound by replacing the noise entropy power in the denominator by its variance.
Taking a cue from this simple approach, we apply the entropy strength inequality (124) to (277) to obtain (298)–(299), where (299) follows from N²_C(W) ≤ ς²_C(W). Unfortunately, unlike the case of Gaussian input, this route falls short of showing that Cauchy noise of a given strength is least favorable, because the right side of (299) is strictly smaller than the Cauchy-input Cauchy-noise mutual information in (271). Evidently, while the entropy power inequality is tight for Gaussian random variables, it is not for Cauchy random variables, as we observed in Item 39. For this approach to succeed in showing that, under a strength constraint, the least favorable noise is centered Cauchy, we would need that, if W is independent of the standard Cauchy V, then N_C(V + W) − N_C(W) ≥ 1. (See Item 119-(a).)

107. As in Item 102, the counterpart in the Cauchy-input case is more challenging, due to the fact that, unlike variance, the output strength need not be equal to the sum of the input and noise strengths. The next two results give lower bounds which, although achieved by Cauchy noise, do not depend on the noise distribution through its strength alone.
with equality if W is centered Cauchy.
Although the lower bound in Theorem 14 is achieved by a centered Cauchy, it does not rule out the existence of W such that ς(W) = ς(W_C) and I(X_C; X_C + W) < I(X_C; X_C + W_C).

108. For the following lower bound, it is advisable to assume, for notational simplicity and without loss of generality, that ς(X_C) = 1. To remove that restriction, we may simply replace W by ς(X_C) W.
Theorem 15. Let V be standard Cauchy, independent of W. Then, (306) holds, where λ(W) is the solution to (307). Equality holds in (306) if W is a centered Cauchy random variable, in which case λ(W) = ς(W).

Proof.
It can be shown that, if P_{XY} = P_X P_{Y|X} = P_Y P_{X|Y} and Q_{Y|X} is an auxiliary random transformation such that P_X Q_{Y|X} = Q_Y Q_{X|Y}, where Q_Y is the response of Q_{Y|X} to P_X, then the decomposition (308) holds, where (X, Y) ∼ P_X P_{Y|X} and the information density ı_{X;Y} corresponds to the joint probability measure P_X Q_{Y|X}. We can particularize this decomposition of mutual information to the case where P_X = P_V, P_{Y|X=x} = P_{W+x}, and Q_{Y|X=x} = P_{W_C+x}, where W_C is centered Cauchy with strength λ > 0. Then, P_X Q_{Y|X} is the joint distribution of V and V + W_C. Taking expectation with respect to (x, y) = (V, V + t), and invoking (52), we obtain an expression in t; finally, taking expectation with respect to t = W, we obtain the bound. If λ = λ(W), namely the solution to (307), then (306) follows as a result of (308). If W = ς(W)V, then the solution to (307) is λ(W) = ς(W), and the equality in (306) can be seen by specializing (271) to (ς(X_C), ς(W_C)) = (1, ς(W)).
109. On the other hand, the bound can be evaluated explicitly if W has the probability density function in (100).

110. As the proof indicates, at the expense of additional computation, we may sharpen the lower bound in Theorem 15 to a bound attained at the solution of a modified version of (307).

111.
Theorem 16. The rate-distortion function of a memoryless source whose distribution is centered Cauchy with strength ς(X), under the requirement that the time-average of the distortion strength be upper bounded by D, is given by

R(D) = log ( ς(X) / D ),  0 < D < ς(X),

and R(D) = 0 for D ≥ ς(X).

Proof.
If D ≥ ς(X), reproducing the source by (0, …, 0) results in a time-average of the distortion strength equal to (1/n) Σ_{i=1}^n ς(X_i) = ς(X). Therefore, R(D) = 0. If 0 < D < ς(X), we proceed to determine the minimal I(X; X̂) among all P_{X̂|X} such that ς(X − X̂) ≤ D. For any such random transformation, the chain (320)–(323) holds, where (320) holds because conditioning cannot increase differential entropy, and (322) follows from Theorem 3 applied to Z = X − X̂. The fact that there is an allowable P_{X̂|X} that achieves the lower bound with equality is best seen by letting X = X̂ + Z, where Z and X̂ are independent centered Cauchy random variables with ς(Z) = D and ς(X̂) = ς(X) − D. Then, P_{X̂|X} P_X = P_{X|X̂} P_{X̂} is such that the X marginal is indeed centered Cauchy with strength ς(X), and ς(X − X̂) = D. Recalling (271), the lower bound in (323) can indeed be satisfied with equality. We are not finished yet, since we need to justify that the rate-distortion function is indeed given by (325), which does not follow from the conventional memoryless lossy compression theorem with average distortion because, although the distortion measure is separable, it is not the average of a function with respect to the joint probability measure P_{XX̂}. This departure from the conventional setting does not impact the direct part of the theorem (i.e., ≤ in (325)), but it does affect the converse, and in particular the proof of the fact that the n-version of the right side of (325) single-letterizes. To that end, it is sufficient to show that the function of D on the right side of (325) is convex (e.g., see pp. 316–317 in [19]). In the conventional setting, this follows from the convexity of mutual information in the random transformation since, with a distortion function d(·, ·), we have (326), where (X, X̂_1) ∼ P_X P¹_{X̂|X}, (X, X̂_0) ∼ P_X P⁰_{X̂|X}, and (X, X̂_α) ∼ α P_X P¹_{X̂|X} + (1 − α) P_X P⁰_{X̂|X}.
Unfortunately, as we saw in Item 35, strength is not convex on the space of probability measures, so, in general, we cannot claim (327). The way out of this quandary is to realize that (327) is only needed for those P⁰_{X̂|X} and P¹_{X̂|X} that attain the minimum on the right side of (325) for the respective distortion bounds D_0 and D_1. As we saw earlier in this proof, those optimal random transformations are such that X − X̂_0 and X − X̂_1 are centered Cauchy. Fortuitously, as we noted in (107), (327) does indeed hold when we restrict attention to mixtures of centered Cauchy distributions.
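The achievability construction in the proof reduces to simple strength bookkeeping, which can be sketched as follows (Python; the additive-Cauchy mutual-information formula is (271)):

```python
import math

sx, D = 2.0, 0.5                 # source strength and distortion budget, D < sx

# test channel: X = Xhat + Z with independent centered Cauchy components
s_xhat, s_z = sx - D, D          # strengths add: s_xhat + s_z = sx
rate = math.log((s_xhat + s_z) / s_z)   # I(Xhat; X) via (271), treating Z as the noise
target = math.log(sx / D)        # the claimed rate-distortion value R(D)
```

Because strengths of independent Cauchy variables add, the same splitting argument applies recursively, which is exactly the successive-refinability property noted afterwards.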
Theorem 16 gives another example in which the Shannon lower bound to the rate-distortion function is tight. In addition to Gaussian sources with mean-square distortion, other examples can be found in [50]. Another interesting aspect of the lossy compression of memoryless Cauchy sources under the strength distortion measure is that it is optimally successively refinable in the sense of [51,52]. As in the Gaussian case, this is a simple consequence of the stability of the Cauchy distribution and the fact that the strength of the sum of independent Cauchy random variables is equal to the sum of their respective strengths (Item 27).

112. The continuity of mutual information can be shown under the following sufficient conditions.

Theorem 17. Suppose that X_n is a sequence of real-valued random variables that vanishes in strength, Z is independent of X_n, h(Z) > −∞, and 0 < ς(Z) < ∞. Then, lim_{n→∞} I(X_n; X_n + Z) = 0.

Proof.

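As a numerical illustration of Theorem 17, the following sketch uses a hypothetical toy setup (not from the paper): X uniform on {−a, +a}, whose strength a/√3 vanishes as a → 0, and Z standard Cauchy. It computes I(X; X + Z) = h(X + Z) − h(Z) by trapezoidal quadrature, with both differential entropies evaluated on the same grid so that the truncated-tail errors largely cancel:

```python
import numpy as np

def cauchy_pdf(y):
    # standard Cauchy density
    return 1.0 / (np.pi * (1.0 + y * y))

def mutual_info_binary_input(a, grid):
    """I(X; X+Z) in nats for X uniform on {-a, +a} and Z standard
    Cauchy, via I = h(X+Z) - h(Z) with both entropies computed by
    trapezoidal integration on the same grid."""
    def entropy(pdf_vals):
        integrand = -pdf_vals * np.log(pdf_vals)
        return np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(grid))
    mixture = 0.5 * cauchy_pdf(grid - a) + 0.5 * cauchy_pdf(grid + a)
    return entropy(mixture) - entropy(cauchy_pdf(grid))

grid = np.linspace(-2000.0, 2000.0, 2_000_001)
vals = [mutual_info_binary_input(a, grid) for a in (1.0, 0.1, 0.01)]
print(vals)   # shrinks as the input strength a/sqrt(3) shrinks
```

Consistent with the theorem, the mutual information decreases toward 0 as the input strength vanishes, even though H(X) stays fixed at one bit.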
113.
The assumption h(Z) > −∞ is not superfluous for the validity of Theorem 17, even though it was not needed in Theorem 1-(m). Suppose that Z is integer valued, and X_n = (nL)^{−1} ∈ (0, ½), where L ∈ {2, 3, …} has a probability mass function with H(L) = ∞ and E[L^{−1}] < ∞. Then, I(X_n; X_n + Z) = H(X_n) = H(L) = ∞, while E[|X_n|] = 0.328289…/n, and therefore ς(X_n) → 0.

114. In the case in which V^n and W^n are standard spherical multivariate Cauchy random vectors with densities in (6), it follows from (7) that λ_X V^n + λ_W W^n has the same distribution as (|λ_X| + |λ_W|) V^n. Therefore,

I(λ_X V^n; λ_X V^n + λ_W W^n) = h((|λ_X| + |λ_W|) V^n) − h(λ_W W^n) = n log(1 + |λ_X|/|λ_W|),  (330)

where we have used the scaling law h(α X^n) = n log|α| + h(X^n). There is no possibility of a Cauchy counterpart of the celebrated log-determinant formula for additive Gaussian vectors (e.g., Theorem 9.2.1 in [41]) because, as pointed out in Item 7, Λ^{1/2} V^n + Λ̃^{1/2} W^n is not distributed according to the ellipsoidal density in (8) unless Λ and Λ̃ are proportional, in which case the setup reverts to that in (330).

115. To conclude this section, we leave aside additive noise models and consider the mutual information between a partition of the components of the standard spherical multivariate Cauchy density (6). If I ∩ J = ∅, then (17) yields

I(X(I); X(J)) = h_{|I|} + h_{|J|} − h_{|I|+|J|},

where h_n stands for the right side of (17). For example, if i ≠ j, then, in nats, I(X_i; X_j) = 2 h_1 − h_2. More generally, the dependence index among the n random variables in the standard spherical multivariate Cauchy density, namely D(P_{X^n} ‖ P_{X_1} × ··· × P_{X_n}) = n h_1 − h_n, admits a closed form (see also [9,53]) which, in nats and for even n, involves the terms log(2k − 1) − (n + 1)/(2k − 1).

116. The shared information of n random variables is a generalization of mutual information introduced in [54] for deriving the fundamental limit of interactive data exchange among agents who have access to the individual components and establish a dialog to ensure that all of them find out the value of the random vector. The shared information of X^n is defined as

min_Π (1/(|Π| − 1)) D(P_{X^n} ‖ ∏_{ℓ=1}^{|Π|} P_{X(I_ℓ)}),

where the minimum is over all partitions Π = {I_1, …, I_{|Π|}} of {1, …, n} such that |Π| > 1.
If we divide (338) by n − 1, we obtain the shared information of n random variables distributed according to the standard spherical multivariate Cauchy model. This is a consequence of the following result, which is of independent interest.

Theorem 18. If X^n are exchangeable random variables, any subset of which has finite differential entropy, then for any partition Π = {I_1, …, I_{|Π|}} of {1, …, n} with |Π| > 1,

(1/(|Π| − 1)) D(P_{X^n} ‖ ∏_{ℓ=1}^{|Π|} P_{X(I_ℓ)}) ≥ (1/(n − 1)) D(P_{X^n} ‖ P_{X_1} × ··· × P_{X_n}).
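A step worth recording is that, under the stated finiteness assumptions, each relative entropy with respect to a product of marginals is a difference of differential entropies:

```latex
D\!\left( P_{X^n} \,\middle\|\, \prod_{\ell=1}^{|\Pi|} P_{X(I_\ell)} \right)
  = \sum_{\ell=1}^{|\Pi|} h\!\left( X(I_\ell) \right) - h(X^n).
```

Consequently, the inequality in Theorem 18 is equivalent to (1/(|Π| − 1)) ( ∑_ℓ h(X(I_ℓ)) − h(X^n) ) ≥ (1/(n − 1)) ( ∑_{i=1}^n h(X_i) − h(X^n) ), a comparison of entropy sums in which exchangeability guarantees that h(X(I)) depends on I only through its cardinality |I|.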
Naturally, the same proof applies to n discrete exchangeable random variables with finite joint entropy.

117.
We have seen that a number of key information theoretic properties pertaining to the Gaussian law are also satisfied in the Cauchy case. Conceptually, those extensions shed light on the underlying reason the conventional Gaussian results hold. Naturally, we would like to explore how far beyond the Cauchy law those results can be expanded. As far as the maximization of differential entropy is concerned, the essential step is to redefine strength, tailoring it to the desired law: fix a reference random variable W with probability density function f_W and finite differential entropy h(W) ∈ ℝ, and define the W-strength of a real-valued random variable Z as

ς_W(Z) = inf{ ς > 0 : −E[log f_W(ς^{−1} Z)] ≤ h(W) }.

For example, if W is standard Gaussian, then ς_W(Z) = (E[Z²])^{1/2}, while if W is standard Cauchy, then ς_W(Z) is the strength ς(Z).

Theorem 19. max_{Z : ς_W(Z) ≤ ς} h(Z) = h(W) + log ς.  (348)

For any ς satisfying the constraint in the definition of ς_W(Z), upper bounding the differential entropy of Z by its cross entropy relative to the density of ςW yields h(Z) ≤ log ς − E[log f_W(ς^{−1} Z)] ≤ log ς + h(W). Therefore, h(Z) ≤ h(W) + log ς_W(Z), by definition of ς_W(Z), thereby establishing ≤ in (348). Equality holds since ς_W(ςW) = ς.
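As an illustration of the W-strength, here is a small numerical sketch (the helper name and parameters are illustrative, not from the paper) for the standard Gaussian reference, in which case the definition above reduces to ς_W(Z) = (E[Z²])^{1/2}:

```python
import numpy as np

def w_strength_gaussian(z, lo=1e-6, hi=1e6, iters=80):
    """W-strength for W standard Gaussian: the smallest s with
    -E[log f_W(Z/s)] <= h(W).  Here -E[log f_W(Z/s)] =
    0.5*log(2*pi) + E[Z^2]/(2 s^2) and h(W) = 0.5*log(2*pi*e),
    so the condition reduces to s^2 >= E[Z^2]."""
    h_W = 0.5 * np.log(2 * np.pi * np.e)
    m2 = np.mean(np.asarray(z, dtype=float) ** 2)
    for _ in range(iters):                  # geometric bisection on s
        s = np.sqrt(lo * hi)
        cross_ent = 0.5 * np.log(2 * np.pi) + m2 / (2 * s**2)
        if cross_ent > h_W:
            lo = s                          # s too small
        else:
            hi = s
    return np.sqrt(lo * hi)

rng = np.random.default_rng(1)
z = 2.0 * rng.standard_normal(200_000)      # Z = 2W, so the W-strength is 2
print(w_strength_gaussian(z))               # close to sqrt(E[Z^2]) = 2
```

Although for the Gaussian reference the answer is simply the root-mean-square value, the bisection mirrors the general definition, so the same routine applies to any reference density f_W with finite h(W).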
A corollary to Theorem 19 is a very general form of the Shannon lower bound for the rate-distortion function of a memoryless source Z such that the distortion is constrained to have W-strength not higher than D, namely,

R(D) ≥ h(Z) − h(W) − log D.

Theorem 19 finds an immediate extension to the multivariate case:

max_{Z^n : ς_{W^n}(Z^n) ≤ ς} h(Z^n) = h(W^n) + n log ς,

where, for W^n with h(W^n) ∈ ℝ, we have defined

ς_{W^n}(Z^n) = inf{ ς > 0 : −E[log f_{W^n}(ς^{−1} Z^n)] ≤ h(W^n) }.
For example, if W^n is zero-mean multivariate Gaussian with positive definite covariance Σ, then ς²_{W^n}(Z^n) = (1/n) E[(Z^n)ᵀ Σ^{−1} Z^n].

118. One aspect in which we have shown that Cauchy distributions lend themselves to a simplification unavailable in the Gaussian case is the single-parametrization of their likelihood ratios, which paves the way for a slew of closed-form expressions for f-divergences and Rényi divergences. It would be interesting to identify other multiparameter (even just scale/location) families of distributions that enjoy the same property. To that end, it is natural, though by no means assured of success, to study various generalizations of the Cauchy distribution, such as the Student-t random variable or, more generally, the Rider distribution in (9). The information theoretic study of general stable distributions is hampered by the fact that they are characterized by their characteristic functions (e.g., p. 164 in [55]), which, so far, have not lent themselves to the determination of relative entropy or even differential entropy.

119. Although we cannot expect that the cornucopia of information theoretic results in the Gaussian case can be extended to other domains, we have been able to show that a number of those results do find counterparts in the Cauchy case. Nevertheless, much remains to be explored. To name a few open directions:

(a) The concavity of the entropy-strength N_C(X + tV), a counterpart of Costa's entropy power inequality [40], would guarantee the least favorability of Cauchy noise among all strength-constrained noises, as well as an entropy strength counterpart of the entropy power inequality.

(b) Information theoretic analyses quantifying the approach to normality in the central limit theorem are well-known (e.g., [56–58]). It would be interesting to explore the decrease in the relative entropy (relative to the Cauchy law) of independent sums distributed according to a law in the domain of attraction of the Cauchy distribution [55].
(c) Since de Bruijn's identity is one of the ancestors of the I-MMSE formula of [59], and we now have a counterpart of de Bruijn's identity for convolutions with scaled Cauchy, it is natural to wonder whether there may be some sort of integral representation of the mutual information between a random variable and its noisy version contaminated by additive Cauchy noise. In this respect, note that counterparts of the I-MMSE formula for models other than additive Gaussian noise have been found in [60–62].

(d) Mutual information is robust against the addition of small non-Gaussian contamination, in the sense that its effect is the same as if it were Gaussian [63]. The proof methods rely on Taylor series expansions that require the existence of moments. Any Cauchy counterparts (recall Item 77) would require substantially different methods.

(e) Pinsker [41] showed that Gaussian processes are information stable under only very mild assumptions. The key is that, modulo a constant factor, the variance of the information density is upper bounded by its mean, the mutual information. Does the spherical multivariate Cauchy distribution enjoy similar properties?
120. Although not surveyed here, there are indeed a number of results in the engineering literature advocating Cauchy models in certain heavy-tailed infinite-variance scenarios (see, e.g., [45] and the references therein). In the end, either we abide by the information theoretic maxim that "there is nothing more practical than a beautiful formula", or we pay heed to Poisson who, after pointing out in [64] that Laplace's proof of the central limit theorem broke down for what we now refer to as the Cauchy law, remarked: "But we shall not take this particular case into account; it will suffice to have noted it on account of its singularity, and it doubtless does not occur in practice."
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Data Availability Statement: Not applicable.

Conflicts of Interest: The author declares no conflict of interest.