Article

Axiomatic Characterizations of Information Measures

Imre Csiszár
Rényi Institute of Mathematics, Hungarian Academy of Sciences, P.O. Box 127, H-1364 Budapest, Hungary
Entropy 2008, 10(3), 261-273; https://doi.org/10.3390/e10030261
Submission received: 1 September 2008 / Accepted: 12 September 2008 / Published: 19 September 2008

Abstract

Axiomatic characterizations of Shannon entropy, Kullback I-divergence, and some generalized information measures are surveyed. Three directions are treated: (A) Characterization of functions of probability distributions suitable as information measures. (B) Characterization of set functions on the subsets of {1,…,N} representable by joint entropies of components of an N-dimensional random vector. (C) Axiomatic characterization of MaxEnt and related inference rules. The paper concludes with a brief discussion of the relevance of the axiomatic approach for information theory.

1. Introduction

Axiomatic characterizations of Shannon entropy
$$ H(P) = - \sum_{i=1}^{n} p_i \log p_i $$
and Kullback I-divergence (relative entropy)
$$ D(P||Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}, $$
and of some generalized information measures will be surveyed, for discrete probability distributions P = (p1,…,pn), Q = (q1,…,qn), n = 2, 3, ….
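As a concrete illustration (not part of the original survey), the following minimal Python sketch computes the two quantities for small distributions, using natural logarithms; the function names are ours.

```python
import numpy as np

def shannon_entropy(p):
    """H(P) = -sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def i_divergence(p, q):
    """D(P||Q) = sum_i p_i log(p_i / q_i); requires q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

P = [0.5, 0.25, 0.25]
Q = [1/3, 1/3, 1/3]
print(shannon_entropy(P))   # 1.0397... nats
print(i_divergence(P, Q))   # 0.0589... nats, and always >= 0
```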
No attempt at completeness is made; the references cover only part of the historically important contributions, but they are believed to be representative of the development of the main ideas in the field. It is also illustrated how a research direction originating in information theory developed into a branch of the theory of functional equations; the latter, however, is not entered in depth, for its major achievements appear to be solutions of mathematical problems beyond information-theoretic relevance.

1.1. Historical comments

“Shannon entropy” first appeared in statistical physics, in works of Boltzmann and Gibbs, in the 19th century. Quantum entropy, of a density matrix with eigenvalues p1,…,pn, is defined by the same expression (von Neumann [45]). I-divergence was defined as an information measure by Kullback and Leibler [40] and may have been used much earlier in physics. The non-negativity of I-divergence is sometimes called Gibbs’ inequality, but this author could not verify that it does appear in Gibbs’ works. Wald [58] used I-divergence as a tool (without a name) in sequential analysis.
It was Shannon’s information theory [52] that established the significance of entropy as a key information measure, soon complemented by I-divergence, and stimulated their profound applications in other fields such as large deviations [50], ergodic theory [38], and statistics [39].
Axiomatic characterizations of entropy also go back to Shannon [52]. In his view, this is “in no way necessary for the theory” but “lends a certain plausibility” to the definition of entropy and related information measures. “The real justification resides” in operational relevance of these measures.

1.2. Directions of axiomatic characterizations

(A) Characterize entropy as a function of the distribution P = (p1,…,pn), n = 2, 3, …: Show that it is the unique function that satisfies certain postulates, preferably intuitively desirable ones. Similarly for I-divergence. This direction has an extensive literature. Main references: Aczél-Daróczy [1], Ebanks-Sahoo-Sander [26].
(B) Characterize entropy as a set function: Determine the class of set functions φ(A), A ⊂ {1,…,N}, which can be represented as $\varphi(A) = H(\{X_i\}_{i \in A})$ for suitable random variables X1,…,XN, or as a limit of a sequence of such “entropic” set functions. This direction has been initiated by Pippenger [47], the main reference is Yeung [59].
(C) Characterize axiomatically the MaxEnt inference principle. To infer a distribution P = (p1,…,pn) from incomplete information specifying only linear constraints $\sum_{i=1}^{n} p_i a_{ij} = b_j$, $j = 1,\dots,k$, this principle (Jaynes [33], Kullback [39]) calls for maximizing H(P) or, if a “prior guess” Q is available, minimizing D(P||Q) subject to the given constraints. References: Shore-Johnson [53], Paris-Vencovská [46], Csiszár [18].
(D) Not entered: Information without probability [32], [35], and the “mixed theory of information” [2].

2. Direction (A)

Properties of entropy that have been used as postulates:
-
Positivity: H(P) ≥ 0
-
Expansibility: “Expansion” of P by a new component equal to 0 does not change H(P)
-
Symmetry: H(P) is invariant under permutations of p1,…,pn
-
Continuity: H(P) is a continuous function of P (for fixed n)
-
Additivity: H(P × Q) = H(P) + H(Q)
-
Subadditivity: H(X, Y) ≤ H(X) + H(Y)
-
Strong additivity: H(X, Y) = H(X) + H(Y|X)
-
Recursivity: $H(p_1,\dots,p_n) = H(p_1+p_2, p_3,\dots,p_n) + (p_1+p_2)\, H\!\left(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2}\right)$
-
Sum property: $H(P) = \sum_{i=1}^{n} g(p_i)$, for some function g.
Above, H(X), H(Y), H(X, Y) are the entropies of the distributions of random variables X, Y (with values in {1,…,n} and {1,…,m}) and of their joint distribution. H(Y|X) denotes the average of the entropies of the conditional distributions of Y on the conditions X = i, 1 ≤ i ≤ n, weighted by the probabilities of the events X = i.
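Before turning to the characterization theorems, a small numerical sketch (our own illustration, natural logarithms) may help connect the postulates: it checks strong additivity on a random joint distribution and recursivity on a four-point distribution.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# A joint distribution of (X, Y) on {1,2,3} x {1,2}.
rng = np.random.default_rng(0)
pxy = rng.random((3, 2)); pxy /= pxy.sum()
px = pxy.sum(axis=1)

# Strong additivity: H(X,Y) = H(X) + H(Y|X),
# with H(Y|X) = sum_i P(X=i) * H( P(Y | X=i) ).
H_cond = sum(px[i] * H(pxy[i] / px[i]) for i in range(3))
assert np.isclose(H(pxy), H(px) + H_cond)

# Recursivity: merging the first two cells of P costs
# (p1+p2) * H(p1/(p1+p2), p2/(p1+p2)).
p = np.array([0.1, 0.2, 0.3, 0.4])
s = p[0] + p[1]
merged = np.array([s, p[2], p[3]])
assert np.isclose(H(p), H(merged) + s * H([p[0] / s, p[1] / s]))
```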

2.1. Shannon entropy and I-divergence

Shannon [52] showed that continuity, strong additivity, and the property that H(1/n,…,1/n) increases with n, determine entropy up to a constant factor. The key step of the proof was to show that the assumptions imply H(1/n,…,1/n) = c log n.
Faddeev [27] showed that recursivity plus 3-symmetry (symmetry for n = 3) plus continuity for n = 2 determine H(P) up to constant factor.
Further contributions along these lines include
  • Tverberg [56] and Lee [41]: relaxed continuity to Lebesgue integrability and to measurability, respectively
  • Diderrich [25]: Recursivity plus 3-symmetry plus boundedness suffice
  • Daróczy-Maksa [24]: Positivity instead of boundedness does not suffice.
These works used as a key tool the functional equation for f(x) = H(x,1 − x)
$$ f(x) + (1-x)\, f\!\left(\frac{y}{1-x}\right) = f(y) + (1-y)\, f\!\left(\frac{x}{1-y}\right) $$
where x, y ∈ [0, 1), x + y ≤ 1. Aczél-Daróczy [1] showed that all solutions of this equation with f (0) = f (1) = 0 are given by
$$ f(x) = x\, h(x) + (1-x)\, h(1-x), \qquad 0 < x < 1, $$
where h is any function satisfying
$$ h(uv) = h(u) + h(v), \qquad u, v > 0. $$
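As a quick check (ours), one can verify numerically that the entropy function f(x) = −x log x − (1 − x) log(1 − x), i.e., the solution corresponding to h(u) = −log u, satisfies the functional equation on its whole domain.

```python
import numpy as np

def f(x):
    # Binary Shannon entropy f(x) = H(x, 1-x), with the convention 0 log 0 = 0.
    out = 0.0
    for t in (x, 1.0 - x):
        if t > 0:
            out -= t * np.log(t)
    return out

rng = np.random.default_rng(1)
for _ in range(1000):
    x = rng.uniform(0, 1)
    y = rng.uniform(0, 1 - x)        # so x, y in [0, 1) and x + y <= 1
    lhs = f(x) + (1 - x) * f(y / (1 - x))
    rhs = f(y) + (1 - y) * f(x / (1 - y))
    assert np.isclose(lhs, rhs)
```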
Chaundy-McLeod [13] showed, by solving another functional equation, that the sum property with continuous g, plus additivity, determine Shannon’s entropy up to a constant factor.
Daróczy [23] proved the same under the weaker conditions that g is measurable, g(0) = 0, and H is (3,2)-additive (additive for P = (p1, p2, p3), Q = (q1, q2)). However, (2,2)-additivity does not suffice.
The intuitively most appealing axiomatic result is due to Aczél-Forte-Ng [3], extending Forte’s previous work [29]: Symmetry, expansibility, additivity, and subadditivity uniquely characterize linear combinations with non-negative coefficients of H(P) and H0(P) = log |{i : pi > 0}|. The same postulates plus continuity for n = 2 determine Shannon entropy up to a constant factor.
I-divergence has similar characterizations as entropy, both via recursivity, and the sum property plus additivity. For I-divergence, recursivity means
$$ D(p_1,\dots,p_n \,\|\, q_1,\dots,q_n) = D(p_1+p_2, p_3,\dots,p_n \,\|\, q_1+q_2, q_3,\dots,q_n) + (p_1+p_2)\, D\!\left(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2} \,\Big\|\, \frac{q_1}{q_1+q_2}, \frac{q_2}{q_1+q_2}\right). $$
The sum property means that, for some function G of two variables,
$$ D(p_1,\dots,p_n \,\|\, q_1,\dots,q_n) = \sum_{i=1}^{n} G(p_i, q_i). $$
The first results in this direction employed the device of admitting also “incomplete distributions” for Q (with sum of probabilities less than 1). Not needing this, Kannappan-Ng [36,37] proved: D(P||Q), as a function of arbitrary probability distributions P and strictly positive probability distributions Q, is determined up to a constant factor by recursivity, 3-symmetry, measurability in p for fixed q and in q for fixed p of D(p, 1 − p||q, 1 − q), plus D(1/2, 1/2||1/2, 1/2) = 0.
For the proof the following analogue, with four unknown functions, of the functional equation in the characterization of entropy had to be solved:
$$ f_1(x) + (1-x)\, f_2\!\left(\frac{y}{1-x}\right) = f_3(y) + (1-y)\, f_4\!\left(\frac{x}{1-y}\right), $$
where x, y ∈ [0, 1), x + y ≤ 1.
Both kinds of characterizations have been extended to “information measures” depending on more than two distributions. Here, only the following corollary of more general (deep) results in the book [26] is mentioned, as an illustration. If a function of m strictly positive probability distributions $P_j = (p_{j1},\dots,p_{jn})$, $j = 1,\dots,m$, is of form $\sum_{i=1}^{n} G(p_{1i},\dots,p_{mi})$ with measurable G, and satisfies additivity, then this function equals a linear combination of entropies H(Pj) and divergences D(Pi||Pj).

2.2. Rényi entropies and divergences

Shannon entropy and I-divergence are means $\sum p_k I_k$ of individual informations $I_k = -\log p_k$ or $I_k = \log \frac{p_k}{q_k}$. Rényi [48] introduced alternative information measures, which are generalized means $\psi^{-1}\!\left(\sum p_k \psi(I_k)\right)$, where ψ is a continuous, strictly monotone function, and which satisfy additivity. Entropy and divergence of order α ≠ 1 correspond to ψ(x) equal to $e^{(1-\alpha)x}$ and $e^{(\alpha-1)x}$, respectively:
$$ H_\alpha(P) = \frac{1}{1-\alpha} \log \sum p_k^{\alpha}, \qquad D_\alpha(P||Q) = \frac{1}{\alpha-1} \log \sum p_k^{\alpha} q_k^{1-\alpha}; $$
here, the sums are over k ∈ {1,…,n} with pk > 0. Taking the limit as α → 1 gives H1 = H, D1 = D.
These quantities were previously considered by Schützenberger [51]. Rényi [48] showed for the case of divergence, and conjectured also for entropy, that only these generalized means give rise to additive information measures, provided “incomplete distributions” were also considered. The latter conjecture was proved by Daróczy [20]. Then Daróczy [21] showed without recourse to incomplete distributions that the entropies of order α > 0 exhaust all additive entropies equal to generalized means such that the entropy of (p, 1 − p) approaches 0 as p → 0 (the last condition excludes α ≤ 0.)
Rényi entropies are additive, but not subadditive unless α = 1 or 0. If P = (1/n,…,1/n) then Hα(P) = log n; otherwise Hα(P) is a strictly decreasing function of α. Moreover, $H_\infty(P) := \lim_{\alpha\to\infty} H_\alpha(P) = -\log \max_k p_k$.
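The following sketch (our illustration) evaluates Hα(P) over a range of orders and displays the behaviour just described: the Shannon limit at α = 1, the strict decrease in α for non-uniform P, and the min-entropy limit −log max_k p_k.

```python
import numpy as np

def renyi_entropy(p, alpha):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))            # Shannon limit H_1 = H
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

P = np.array([0.5, 0.3, 0.2])
alphas = [0.5, 0.9, 0.999, 1.001, 2.0, 10.0, 100.0]
print([renyi_entropy(P, a) for a in alphas])     # strictly decreasing in alpha
print(-np.log(P.max()))                          # H_infinity, approached as alpha grows
```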
Rényi entropies have operational relevance in the theory of random search [49], for variable length source coding (average codelength in exponential sense [12]), block coding for sources and channels (generalized cutoff rates [19]), and in cryptography (privacy amplification [9]).
Remark 1.
For information transmission over noisy channels, a key information measure is mutual information, which can be expressed via entropy and I-divergence in several equivalent ways. The α-analogues of these expressions are no longer equivalent; the one with demonstrated operational meaning is
$$ I_\alpha(P, W) = \min_{Q} \sum_{k=1}^{n} p_k\, D_\alpha(W_k || Q), $$
see Csiszár [19]. Here W is the channel matrix with rows Wk = (wk1,…,wkm), P = (p1,…,pn) is the input distribution, and the minimization is over distributions Q = (q1,…,qm). This definition of mutual information of order α, and different earlier ones (Sibson [54], Arimoto [7]) give the same maximum over input distributions P (“capacity of order α” of the channel W).
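The minimization over Q in this definition is easy to carry out numerically; the sketch below (our illustration, using scipy.optimize with a softmax parametrization of Q, not a method taken from [19]) evaluates Iα(P, W) for a small binary-input channel.

```python
import numpy as np
from scipy.optimize import minimize

def D_alpha(p, q, alpha):
    """Renyi divergence of order alpha (alpha != 1) between distributions p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.log(np.sum(p[nz] ** alpha * q[nz] ** (1 - alpha))) / (alpha - 1)

def I_alpha(P, W, alpha):
    """I_alpha(P, W) = min_Q sum_k p_k D_alpha(W_k || Q), minimized numerically."""
    m = W.shape[1]
    def objective(z):                 # z unconstrained; Q = softmax(z) stays in the simplex
        q = np.exp(z - z.max()); q /= q.sum()
        return sum(pk * D_alpha(Wk, q, alpha) for pk, Wk in zip(P, W))
    res = minimize(objective, np.zeros(m), method="Nelder-Mead")
    return res.fun

P = np.array([0.4, 0.6])
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])           # rows W_k are the conditional output distributions
print(I_alpha(P, W, alpha=2.0))
```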

2.3. Other entropies and divergences

The f-divergence of P from Q is
$$ D_f(P||Q) = \sum_{k=1}^{n} q_k\, f\!\left(\frac{p_k}{q_k}\right), $$
where f is a convex function on (0, ∞) with f (1) = 0. It was introduced by Csiszár [14,15], and independently by Ali-Silvey [5]. An unsophisticated axiomatic characterization of f-divergences appears in Csiszár [17].
In addition to I-divergence, this class contains reversed I-divergence, Hellinger distance, χ²-divergence, variation distance, etc. It shares some key properties of I-divergence, in particular monotonicity: for any partition $\mathcal{A} = (A_1,\dots,A_m)$ of {1,…,n}, with the notation $P^{\mathcal{A}} = (p_1^{\mathcal{A}},\dots,p_m^{\mathcal{A}})$, $p_i^{\mathcal{A}} = \sum_{k \in A_i} p_k$, it holds that
$$ D_f(P^{\mathcal{A}} || Q^{\mathcal{A}}) \le D_f(P||Q). $$
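A short sketch (ours) implements Df for several standard choices of f and checks the monotonicity property on a random coarsening of {1,…,n}; the lambda names are ours.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_k q_k f(p_k / q_k), assuming all q_k > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(q * f(p / q))

kl        = lambda t: t * np.log(t) - t + 1        # I-divergence (the f >= 0 version)
hellinger = lambda t: (np.sqrt(t) - 1) ** 2
chi2      = lambda t: (t - 1) ** 2
variation = lambda t: np.abs(t - 1)

def coarsen(p, blocks):
    """P^A: sum the components of p over each block A_i of a partition."""
    return np.array([p[list(b)].sum() for b in blocks])

rng = np.random.default_rng(2)
p = rng.random(6); p /= p.sum()
q = rng.random(6); q /= q.sum()
blocks = [(0, 3), (1, 2, 5), (4,)]                 # a partition of {0,...,5}

for f in (kl, hellinger, chi2, variation):
    assert f_divergence(coarsen(p, blocks), coarsen(q, blocks)) <= f_divergence(p, q) + 1e-12
```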
Remark 2.
The f-divergence of probability distributions does not change if f(t) is replaced by f(t) + a(t − 1) for any a ∈ ℝ; hence f ≥ 0 may be assumed without loss of generality. If f ≥ 0, the obvious extension of the definition to arbitrary $P, Q \in \mathbb{R}_+^n$ retains the intuitive meaning of divergence. Accordingly, the I-divergence of arbitrary $P, Q \in \mathbb{R}_+^n$ is defined as the f-divergence with f(t) = t log t − t + 1,
$$ D(P||Q) = \sum_{i=1}^{n} \left( p_i \log \frac{p_i}{q_i} - p_i + q_i \right). $$
A generalization of mutual information via f-divergences, and as a special case a concept of f-entropy, appear in Csiszár [16]. Different concepts of f-entropies were defined by Arimoto [6], viz. $H_f(P) = \sum_{k=1}^{n} f(p_k)$, f concave, and $H^f(P) = \inf_Q \sum_{k=1}^{n} p_k f(q_k)$. Both were used to bound probability of error. Ben-Bassat [8] determined the best bounds possible in terms of $H_f(P)$. The f-entropy of [16] coincides with $H^{\tilde f}(P)$ in the sense of [6], where $\tilde f(x) = x f(1/x)$.
Very general information measures have been considered in the context of statistical decision theory; see Grünwald and Dawid [30] and references there. A function l(Q, k) of probability distributions Q = (q1,…,qn) and k ∈ {1,…,n}, measuring the loss when Q has been inferred and outcome k is observed, is called a proper score if the average loss $\sum_{k=1}^{n} p_k\, l(Q, k)$ is minimized for Q equal to the true distribution P, whatever this P is. Then $\sum_{k=1}^{n} p_k\, l(P, k)$ is called the entropy of P corresponding to the proper score l. In this context, Shannon entropy is distinguished as that corresponding to the only proper score of form l(Q, k) = f(qk), the logarithmic score. Indeed, if for some n > 2
$$ \sum_{k=1}^{n} p_k f(p_k) \le \sum_{k=1}^{n} p_k f(q_k) $$
for all strictly positive distributions P and Q on {1,…,n}, then f(x) = c log x + b, with c ≤ 0. This result has a long history; the book [1] attributes its first fully general and published proof to Fischer [28].
In the decision theory framework, Arimoto’s entropies $H_f(P)$ correspond to “separable Bregman scores” [30].
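To illustrate what properness means in practice, the sketch below (our illustration) compares the logarithmic score with a tempting but improper alternative l(Q, k) = 1 − qk: a crude search over the simplex shows that the expected logarithmic loss is minimized near the true P, while the improper score rewards piling all mass on the most likely outcome.

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([0.6, 0.3, 0.1])                       # the "true" distribution

log_score    = lambda Q, k: -np.log(Q[k])           # logarithmic score (proper)
linear_score = lambda Q, k: 1.0 - Q[k]              # an improper alternative

def expected_loss(score, Q):
    return sum(P[k] * score(Q, k) for k in range(len(P)))

candidates = rng.dirichlet(np.ones(3), size=50000)  # crude search over the simplex
best_log = min(candidates, key=lambda Q: expected_loss(log_score, Q))
best_lin = min(candidates, key=lambda Q: expected_loss(linear_score, Q))
print(best_log)   # close to P = (0.6, 0.3, 0.1): the log score is proper
print(best_lin)   # piles mass on the most likely outcome: not proper
```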

2.4. Entropies and divergences of degree α

This subclass of f-entropies/divergences is defined, for α ≠ 0, 1, by
$$ H_\alpha(P) = c_\alpha \left( \sum_{k=1}^{n} p_k^{\alpha} - 1 \right), \qquad D_\alpha(P||Q) = c_\alpha \left( \sum_{k=1}^{n} p_k^{\alpha} q_k^{1-\alpha} - 1 \right). $$
Here cα is some constant, positive if 0 < α < 1 and negative otherwise. Its typical choices are such that (1 − α)cα → 1 as α → 1; taking the limit as α → 1 then gives H1 = H, D1 = D.
Entropy of degree α was introduced by Havrda-Charvát [31]. The special case of α = 2 (“quadratic entropy”) may have appeared earlier, Vajda [57] used it to bound probability of error for testing multiple hypotheses. Divergences of degree 2 and 1/2 have long been used in statistics, the former since the early 20th century (χ2 test), the latter goes back at least to Bhattacharyya [10].
In statistical physics, Hα(P) is known as Tsallis entropy, referring to [55]. Previously, Lindhard-Nielsen [42] proposed generalized entropies for statistical physics, effectively the same as entropies of degree α and order α, also unaware of their prior use in information theory.
Entropies/divergences of order α and those of degree α are in a one-to-one functional relationship. In principle, it would suffice to use only one of them, but in different situations one or the other is more convenient. For example, in source coding for identification it is entropy of degree 2 that naturally enters [4].
Entropies of degree α ≥ 1 are subadditive, but entropies of any degree α ≠ 1 are neither additive nor recursive. Rather,
$$ H_\alpha(P \times Q) = H_\alpha(P) + H_\alpha(Q) + c_\alpha^{-1} H_\alpha(P) H_\alpha(Q), $$
$$ H_\alpha(p_1,\dots,p_n) = H_\alpha(p_1+p_2, p_3,\dots,p_n) + (p_1+p_2)^{\alpha}\, H_\alpha\!\left(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2}\right). $$
With these “α-additivity” and “α-recursivity” properties, the analogues of the characterization theorems for Shannon entropy hold, the first one due to Havrda-Charvát [31]. Remarkably, characterization via α-recursivity requires no regularity conditions [22]. Similar results hold for divergence of degree α, and for “information measures” of degree α involving more than two distributions. See the book [26] for details, some of which are very complex. For divergence, α-recursivity means
$$ D_\alpha(p_1,\dots,p_n \,\|\, q_1,\dots,q_n) = D_\alpha(p_1+p_2, p_3,\dots,p_n \,\|\, q_1+q_2, q_3,\dots,q_n) + (p_1+p_2)^{\alpha} (q_1+q_2)^{1-\alpha}\, D_\alpha\!\left(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2} \,\Big\|\, \frac{q_1}{q_1+q_2}, \frac{q_2}{q_1+q_2}\right). $$
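For concreteness, the sketch below (ours, using the common normalization cα = 1/(1 − α)) verifies the α-additivity and α-recursivity identities numerically for one value of α.

```python
import numpy as np

def H_deg(p, alpha):
    """Entropy of degree alpha with the normalization c_alpha = 1/(1 - alpha)."""
    p = np.asarray(p, float)
    return (np.sum(p ** alpha) - 1.0) / (1.0 - alpha)

alpha = 1.7
c_alpha = 1.0 / (1.0 - alpha)
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.6, 0.4])
PQ = np.outer(P, Q).ravel()                  # the product distribution P x Q

# alpha-additivity: H(P x Q) = H(P) + H(Q) + c_alpha^{-1} H(P) H(Q)
assert np.isclose(H_deg(PQ, alpha),
                  H_deg(P, alpha) + H_deg(Q, alpha)
                  + H_deg(P, alpha) * H_deg(Q, alpha) / c_alpha)

# alpha-recursivity: merging the first two cells
s = P[0] + P[1]
assert np.isclose(H_deg(P, alpha),
                  H_deg([s, P[2]], alpha)
                  + s ** alpha * H_deg([P[0] / s, P[1] / s], alpha))
```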

3. Direction (B)

This very important direction cannot be covered here in detail. We mention only the following key results: for N ≥ 4, the closure of the class of “entropic” set functions is a proper subclass of the class of polymatroids (Zhang-Yeung [60]). It is a convex cone (Yeung [59]), but not a polyhedral cone (Matúš [44]), i.e., no finite set of linear entropy inequalities can provide the requested characterization.

4. Direction (C)

Here, some of the axiomatic results of Csiszár [18] are surveyed. Attention is not restricted to probability distributions; the object to be inferred could be
(i) a probability distribution P = (p1,…,pn), or
(ii) any $P = (p_1,\dots,p_n) \in \mathbb{R}_+^n$, or
(iii) any $P \in \mathbb{R}^n$.
For technical reasons, in (i) and (ii) strict positivity of each pk is required. This conforms with the intuitive desirability of excluding inferences that certain events have probability 0. Below, n is fixed, n ≥ 5 in case (i), n ≥ 3 in cases (ii), (iii).
The only information about P is that it belongs to a feasible set F which could be any nonempty set determined by constraints
$$ \sum_{i=1}^{n} p_i a_{ij} = b_j, \qquad j = 1,\dots,m, $$
that is, consisting of all P as in (i), (ii) or (iii) that satisfy the constraints. Assume that a prior guess (default model) Q is available which could be arbitrary (as P in (i), (ii) or (iii)).
An inference rule is any mapping Π that assigns to each feasible set F and prior guess Q an inference Π(F, Q) = P ∈ F. Axioms will be stated as desiderata for a “good” inference rule. The results substantiate that in cases (i) and (ii) the “best” inference rule is to let Π(F, Q) be the I-projection of Q to F (MaxEnt), and that in case (iii) the regular Euclidean projection (least squares) is “best”. Reasonable alternative rules will also be identified.
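As an illustration of the I-projection rule (our numerical sketch, not an algorithm from [18]), the following code minimizes D(P||Q) over distributions satisfying a single moment constraint and checks that the result has the exponential-tilt form q_i e^{t a_i}/Z known for such projections.

```python
import numpy as np
from scipy.optimize import minimize

def i_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

n = 4
Q = np.full(n, 0.25)                          # prior guess: uniform
a = np.array([1.0, 2.0, 3.0, 4.0])            # one constraint: sum_i p_i * a_i = b
b = 3.0

res = minimize(
    lambda p: i_divergence(p, Q),
    x0=Q,
    method="SLSQP",
    bounds=[(1e-9, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0},
                 {"type": "eq", "fun": lambda p: a @ p - b}],
)
P_star = res.x
print(P_star)                                  # the I-projection of Q onto F

# For a single moment constraint the I-projection is an exponential tilt
# q_i * exp(t * a_i) / Z(t); check that P_star has this form.
t = np.log(P_star[1] / P_star[0]) / (a[1] - a[0])
tilt = Q * np.exp(t * a); tilt /= tilt.sum()
print(np.allclose(P_star, tilt, atol=1e-4))
```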
In the second axiom, we use the term “set of I-local constraints”, where I is a subset of {1,…,n}. This means constraints of form $\sum_{i \in I} p_i a_{ij} = b_j$; in case (i) it is also supposed that one of them is $\sum_{i \in I} p_i = t$, for some 0 < t < 1.
The axioms are as follows:
  • Regularity: (a) QF implies Π(F, Q) = Q, (b) F1F and Π(F, Q) ∈ F1 imply Π(F1, Q) = Π(F, Q), (c) for each PQ, among the feasible sets determined by a single constraint there exists a unique F such that Π(F, Q) = P , (d) Π(F, Q) depends continuously on F.
  • Locality: If F1 is defined by a set of I-local constraints and F2 by a set of Ic-local ones, then the components $p_i$, $i \in I$, of $P = \Pi(F_1 \cap F_2, Q)$ are determined by F1 and {qi : i ∈ I}.
  • Transitivity: If F1F, Π(F, Q) = P, then Π(F1, Q) = Π(F1, P).
  • Semisymmetry: If F = {P : pi + pj = t} for some i ≠ j and constant t, and Q satisfies qi = qj, then P = Π(F, Q) satisfies $p_i = p_j$.
  • Weak scaling (for cases (i), (ii)): For F as above, P = Π(F, Q) always satisfies
$$ p_i = \frac{t}{q_i + q_j}\, q_i, \qquad p_j = \frac{t}{q_i + q_j}\, q_j. $$
Theorem 1. 
An inference rule Π is regular and local iff Π(F, Q) is the minimizer, subject to P ∈ F, of a “distance”
$$ d(P, Q) = \sum_{k=1}^{n} f_k(p_k, q_k), $$
defined by functions $f_k(p, q) \ge 0 = f_k(q, q)$, continuously differentiable in p, in cases (i), (ii) with $\frac{\partial}{\partial p} f_k(p, q) \to -\infty$ as p → 0, and such that d(P, Q) is strictly quasiconvex in P.
In cases (ii),(iii), these functions fk are necessarily convex in p.
Theorem 2. 
An inference rule as in Theorem 1 satisfies
(a) 
transitivity iff the functions fk are of form
$$ f_k(p, q) = \varphi_k(p) - \varphi_k(q) - \varphi_k'(q)(p - q), $$
where $\Phi(P) = \sum_{k=1}^{n} \varphi_k(p_k)$ is strictly convex; then d(P, Q) is the Bregman distance
$$ d(P, Q) = \Phi(P) - \Phi(Q) - [\mathrm{grad}\, \Phi(Q)]^T (P - Q) $$
(b) 
semisymmetry iff f1 = ⋯ = fn
(c) 
weak scaling (in cases (i), (ii)) iff the functions f1 = ⋯ = fn are of form $q f(p/q)$, where f is strictly convex, f(1) = f′(1) = 0, and f′(x) → −∞ as x → 0; then d(P, Q) is the f-divergence Df(P||Q).
Bregman distances were introduced in [11]. An axiomatic characterization (in the continuous case), and hints to various applications, appear in Jones and Byrne [34]. The corresponding inference rule satisfies transitivity because it satisfies the “Pythagorean identity”
$$ d(P, Q) = d(P, \Pi(F, Q)) + d(\Pi(F, Q), Q), \qquad P \in F. $$
It is not hard to see that in both cases (i) and (ii), only I-divergence is simultaneously an f-divergence and a Bregman distance.
Corollary 1. 
In cases (i), (ii), regularity + locality + transitivity + weak scaling uniquely characterize the MaxEnt inference rule, with Π(F, Q) equal to the I-projection of Q to F.
In cases (ii), (iii), a natural desideratum is
Scale invariance: For each feasible set F, prior guess Q, and t > 0, Π(tF, tQ) = tΠ(F, Q). In case (iii), another desideratum is translation invariance, defined analogously.
Theorem 3. 
A regular, local, transitive and semisymmetric inference rule Π satisfies
(a) 
translation and scale invariance (in case (iii)) iff Π(F, Q) equals the Euclidean projection of Q to F
(b) 
scale invariance (in case (ii)) iff Π(F, Q) is the minimizer of
$$ d_\alpha(P, Q) = \sum_{i=1}^{n} h_\alpha(p_i, q_i) $$
subject to PF, where α ≤ 1 and
$$ h_\alpha(p, q) = \begin{cases} p \log(p/q) - p + q & \alpha = 1 \\ \log(q/p) + (p/q) - 1 & \alpha = 0 \\ (q^{\alpha} - p^{\alpha})/\alpha + q^{\alpha-1}(p - q) & \text{else} \end{cases} $$
Remark 3. α = 1 gives I-divergence, α = 0 Itakura-Saito distance. An early report of success (in spectrum reconstruction) using dα with α = 1/m appears in [34].
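A small sketch (ours) of the hα family of Theorem 3(b): α = 1 gives the per-coordinate I-divergence term of Remark 2, α = 0 the Itakura-Saito term, and the general branch tends to the α = 0 case directly and to the α = 1 case after rescaling by 1/(1 − α). Since d and any positive multiple of d induce the same inference rule, this rescaling is immaterial.

```python
import numpy as np

def h_alpha(p, q, alpha):
    """Per-coordinate term of d_alpha(P, Q) in Theorem 3(b), for alpha <= 1."""
    if alpha == 1:
        return p * np.log(p / q) - p + q        # I-divergence term (cf. Remark 2)
    if alpha == 0:
        return np.log(q / p) + p / q - 1        # Itakura-Saito term
    return (q ** alpha - p ** alpha) / alpha + q ** (alpha - 1) * (p - q)

p, q = 0.3, 0.5
print(h_alpha(p, q, 1))                         # 0.0468...
print(h_alpha(p, q, 0.999) / (1 - 0.999))       # ~ the same value: the general branch,
                                                # rescaled, tends to the alpha = 1 case
print(h_alpha(p, q, 1e-6), h_alpha(p, q, 0))    # the alpha -> 0 limit is direct
```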
Alternate characterizations of the MaxEnt and least squares inference rules involve the intuitively appealing axiom of “product consistency” in cases (i), (ii), or “sum consistency” in case (iii). This axiom applies also in the absence of a default model; then inference via maximizing Shannon entropy, respectively minimizing the Euclidean norm, is arrived at. See [18] for details.

5. Discussion

After the survey of various axiomatic approaches to information measures in the previous sections, their scientific value is now briefly addressed.
Direction (A) has an extensive literature that includes many good and many weak papers. For mathematicians, good mathematics has scientific value in its own right; the controversial issue is its relevance for information theory. Note, following Shannon [52], that the justification for regarding a quantity as an information measure resides in the mathematical theorems, if any, demonstrating its operational significance. This author knows of one occasion [31] when an axiomatic approach led to a new information measure of practical interest, and of another [48] when such an approach initiated research that succeeded in finding operational meanings of a previously insignificant information measure. One benefit of axiomatic work in direction (A) is the proof that new information measures with certain desirable properties do not exist. On the other hand, this research direction has developed far beyond its origins and has become a branch of the theory of functional equations. Its main results in the last 30 years are of interest primarily for specialists of that theory.
Direction (B), only briefly mentioned here, addresses a problem of major information theoretic significance. Its full solution appears far ahead, but research in this direction has already produced valuable results. In particular, many new inequalities for Shannon entropy have been discovered, starting with [60].
Direction (C) addresses the characterization of “good” inference rules, which certainly appears relevant for the theory of inference. Such characterizations involving information measures, primarily Shannon entropy and I-divergence, and secondarily Bregman distances and f-divergences, indirectly amount to characterizations of the latter. As a preferable feature, these characterizations of information measures are directly related to operational significance (for inference).

Acknowledgement

This work was partially supported by the Hungarian Research Grant OTKA T046376.

References

  1. Aczél, J.; Daróczy, Z. On Measures of Information and Their Characterizations; Academic Press: New York, 1975. [Google Scholar]
  2. Aczél, J.; Daróczy, Z. A mixed theory of information I. RAIRO Inform. Theory 1978, 12, 149–155. [Google Scholar]
  3. Aczél, J.; Forte, B.; Ng, C.T. Why Shannon and Hartley entropies are “natural”. Adv. Appl. Probab. 1974, 6, 131–146. [Google Scholar] [CrossRef]
  4. Ahlswede, R.; Cai, N. An interpretation of identification entropy. IEEE Trans. Inf. Theory 2006, 52, 4198–4207. [Google Scholar] [CrossRef]
  5. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. B 1966, 28, 131–142. [Google Scholar]
  6. Arimoto, S. Information-theoretic considerations on estimation problems. Information and Control 1971, 19, 181–194. [Google Scholar] [CrossRef]
  7. Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Topics in Information Theory; Colloq. Math. Soc. J. Bolyai 16; Csiszár, I., Elias, P., Eds.; North Holland: Amsterdam, 1977; pp. 41–52. [Google Scholar]
  8. Ben-Bassat, M. f-entropies, probability of error, and feature selection. Information and Control 1978, 39, 227–242. [Google Scholar] [CrossRef]
  9. Bennett, C.; Brassard, G.; Crépeau, C.; Maurer, U. Generalized privacy amplification. IEEE Trans. Inf. Theory 1995, 41, 1915–1923. [Google Scholar] [CrossRef]
  10. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–109. [Google Scholar]
  11. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Math. and Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
  12. Campbell, L.L. A coding theorem and Rényi’s entropy. Information and Control 1965, 8, 423–429. [Google Scholar] [CrossRef]
  13. Chaundy, T.W.; McLeod, J.B. On a functional equation. Edinburgh Math. Notes 1960, 43, 7–8. [Google Scholar] [CrossRef]
  14. Csiszár, I. Eine informationstheorische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci. 1963, 8, 85–107. [Google Scholar]
  15. Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 1967, 2, 299–318. [Google Scholar]
  16. Csiszár, I. A class of measures of informativity of observation channels. Periodica Math. Hungar. 1972, 2, 191–213. [Google Scholar] [CrossRef]
  17. Csiszár, I. Information measures: a critical survey. In Trans. 7th Prague Conference on Inf. Theory, etc.; Academia: Prague, 1977; pp. 73–86. [Google Scholar]
  18. Csiszár, I. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Statist. 1991, 19, 2032–2066. [Google Scholar]
  19. Csiszár, I. Generalized cutoff rates and Rényi information measures. IEEE Trans. Inf. Theory 1995, 41, 26–34. [Google Scholar] [CrossRef]
  20. Daróczy, Z. Über die gemeinsame Charakterisierung der zu den nicht vollständigen Verteilungen gehörigen Entropien von Shannon und von Rényi. Z. Wahrscheinlichkeitsth. Verw. Gebiete 1963, 1, 381–388. [Google Scholar] [CrossRef]
  21. Daróczy, Z. Über Mittelwerte und Entropien vollständiger Wahrscheinlichkeitsverteilungen. Acta Math. Acad. Sci. Hungar. 1964, 15, 203–210. [Google Scholar] [CrossRef]
  22. Daróczy, Z. Generalized information functions. Information and Control 1970, 16, 36–51. [Google Scholar] [CrossRef]
  23. Daróczy, Z. On the measurable solutions of a functional equation. Acta Math. Acad. Sci. Hungar. 1971, 34, 11–14. [Google Scholar] [CrossRef]
  24. Daróczy, Z.; Maksa, Gy. Nonnegative information functions. In Analytic Function Methods in Probability and Statistics; Colloq. Math. Soc. J. Bolyai 21; Gyires, B., Ed.; North Holland: Amsterdam, 1979; pp. 65–76. [Google Scholar]
  25. Diderrich, G. The role of boundedness in characterizing Shannon entropy. Information and Control 1975, 29, 149–161. [Google Scholar] [CrossRef]
  26. Ebanks, B.; Sahoo, P.; Sander, W. Characterizations of Information Measures; World Scientific: Singapore, 1998. [Google Scholar]
  27. Faddeev, D.K. On the concept of entropy of a finite probability scheme (in Russian). Uspehi Mat. Nauk 1956, 11, 227–231. [Google Scholar]
  28. Fischer, P. On the inequality ∑pif(pi) ≥ ∑pif(qi). Metrika 1972, 18, 199–208. [Google Scholar] [CrossRef]
  29. Forte, B. Why Shannon’s entropy. In Conv. Inform. Teor., Rome 1973; Symposia Math. 15; Academic Press: New York, 1975; pp. 137–152. [Google Scholar]
  30. Grünwald, P.; Dawid, P. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann. Statist. 2004, 32, 1367–1433. [Google Scholar]
  31. Havrda, J.; Charvát, F. Quantification method of classification processes. Concept of structural a-entropy. Kybernetika 1967, 3, 30–35. [Google Scholar]
  32. Ingarden, R.S.; Urbanik, K. Information without probability. Colloq. Math. 1962, 9, 131–150. [Google Scholar]
  33. Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620–630. [Google Scholar] [CrossRef]
  34. Jones, L.K.; Byrne, C.L. General entropy criteria for inverse problems, with applications to data compression, pattern classification and cluster analysis. IEEE Trans. Inf. Theory 1990, 36, 23–30. [Google Scholar] [CrossRef]
  35. Kampé de Fériet, J.; Forte, B. Information et probabilité. C. R. Acad. Sci. Paris A 1967, 265, 110–114, 142–146, and 350–353. [Google Scholar]
  36. Kannappan, Pl.; Ng, C.T. Measurable solutions of functional equations related to information theory. Proc. Amer. Math. Soc. 1973, 38, 303–310. [Google Scholar] [CrossRef]
  37. Kannappan, Pl.; Ng, C.T. A functional equation and its applications in information theory. Ann. Polon. Math. 1974, 30, 105–112. [Google Scholar]
  38. Kolmogorov, A.N. A new invariant for transitive dynamical systems (in Russian). Dokl. Akad. Nauk SSSR 1958, 119, 861–864. [Google Scholar]
  39. Kullback, S. Information Theory and Statistics; Wiley: New York, 1959. [Google Scholar]
  40. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Statist. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  41. Lee, P.M. On the axioms of information theory. Ann. Math. Statist. 1964, 35, 415–418. [Google Scholar] [CrossRef]
  42. Lindhard, J.; Nielsen, V. Studies in statistical dynamics. Kong. Danske Vid. Selskab Mat.-fys. Medd. 1971, 38, 1–42. [Google Scholar]
  43. Maksa, Gy. On the bounded solutions of a functional equation. Acta Math. Acad. Sci. Hungar. 1981, 37, 445–450. [Google Scholar] [CrossRef]
  44. Matúš, F. Infinitely many information inequalities. In IEEE ISIT07 Nice, Symposium Proceedings; pp. 41–44.
  45. Neumann, J. Thermodynamik quantenmechanischer Gesamtheiten. Gött. Nachr. 1927, 1, 273–291. [Google Scholar]
  46. Paris, J.; Vencovská, A. A note on the inevitability of maximum entropy. Int’l J. Inexact Reasoning 1990, 4, 183–223. [Google Scholar] [CrossRef]
  47. Pippenger, N. What are the laws of information theory? In Special Problems on Communication and Computation Conference, Palo Alto, CA, Sep. 3–5, 1986.
  48. Rényi, A. On measures of entropy and information. In Proc. 4th Berkeley Symp. Math. Statist. Probability, 1960; Univ. Calif. Press: Berkeley, 1961; Vol. 1, pp. 547–561. [Google Scholar]
  49. Rényi, A. On the foundations of information theory. Rev. Inst. Internat. Stat. 1965, 33, 1–4. [Google Scholar] [CrossRef]
  50. Sanov, I.N. On the probability of large deviations of random variables (in Russian). Mat. Sbornik 1957, 42, 11–44. [Google Scholar]
  51. Schützenberger, M.P. Contribution aux applications statistiques de la théorie de l’information. Publ. Inst. Statist. Univ. Paris 1954, 3, 3–117. [Google Scholar]
  52. Shannon, C.E. A mathematical theory of communication. Bell System Tech. J. 1948, 27, 379–423 and 623–656. [Google Scholar] [CrossRef]
  53. Shore, J.E.; Johnson, R.W. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inf. Theory 1980, 26, 26–37. [Google Scholar] [CrossRef]
  54. Sibson, R. Information radius. Z. Wahrscheinlichkeitsth. Verw. Gebiete 1969, 14, 149–161. [Google Scholar] [CrossRef]
  55. Tsallis, C. Possible generalizations of the Boltzmann-Gibbs statistics. J. Statist. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
  56. Tverberg, H. A new derivation of the information function. Math. Scand. 1958, 6, 297–298. [Google Scholar]
  57. Vajda, I. Bounds on the minimal error probability for testing a finite or countable number of hy- potheses (in Russian). Probl. Inform. Transmission 1968, 4, 9–17. [Google Scholar]
  58. Wald, A. Sequential Analysis; Wiley: New York, 1947. [Google Scholar]
  59. Yeung, R.W. A First Course in Information Theory; Kluwer: New York, 2002. [Google Scholar]
  60. Zhang, Z.; Yeung, R.W. On characterizations of entropy function via information inequalities. IEEE Trans. Inf. Theory 1998, 44, 1440–1452. [Google Scholar] [CrossRef]
