Abstract
In this paper, we present a review of some of the recent progress made in characterising and understanding information inequalities, which are the fundamental physical laws in communications and compression. We will begin with the introduction of a geometric framework for information inequalities, followed by the first non-Shannon inequality proved by Zhang and Yeung in 1998 [1]. The discovery of this non-Shannon inequality was a breakthrough in the area and has led to the subsequent discovery of many more non-Shannon inequalities. We will also review the close relations between information inequalities and other research areas such as Kolmogorov complexity, determinantal inequalities, and group-theoretic inequalities. These relations have led to non-traditional techniques for proving information inequalities and, at the same time, have had an impact on those related areas through the introduction of information-theoretic tools.
1. Introduction
Information inequalities are the “physical laws” that characterise the fundamental limits in communications and compression. Probably the most well-known information inequalities are the nonnegativity of entropy and mutual information, dating back to Shannon [2]. They are indispensable in proving converse coding theorems and play a critical role in information theory.
To illustrate how inequalities are invoked to prove a converse, consider the following classical scenario: Alice aims to send a source message M to Bob in a hostile environment where the transmitted message may be eavesdropped on by a malicious adversary Eve. To ensure that Eve learns nothing about the source message M, Alice encrypts it into a transmitted message X using a private key K which is known only to Bob and herself. It is well known that in order to have perfect secrecy, the entropy of the key K must be at least as large as the entropy of the message M. Such a result can be proved by invoking a few information inequalities as follows:
where (a) is due to perfect secrecy (i.e., M and X are independent), (b) follows from the fact that M can be reconstructed from the key K and the encrypted message X, (c) follows from the nonnegativity of conditional entropy, and (d) is due to the nonnegativity of mutual information.
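For concreteness, one standard chain of steps consistent with the justifications (a)–(d) above can be written as follows (a reconstruction in standard notation, with H and I denoting entropy and mutual information; it is a sketch of the usual argument rather than a verbatim reproduction of the original display):

```latex
\begin{align*}
H(M) &\overset{(a)}{=} H(M \mid X) \\
     &\overset{(b)}{=} H(M \mid X) - H(M \mid X, K) \\
     &= I(M ; K \mid X) \\
     &= H(K \mid X) - H(K \mid M, X) \\
     &\overset{(c)}{\le} H(K \mid X) \\
     &= H(K) - I(K ; X) \\
     &\overset{(d)}{\le} H(K).
\end{align*}
```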
Besides their role in proving converse coding theorems, information inequalities are also shown to have close relations with inequalities for Kolmogorov complexities [3], group-theoretic inequalities [4], subspace rank inequalities [5], determinantal inequalities [6] and combinatorial inequalities [7]. Therefore, any new technique in characterising information inequalities will also have direct impact on these areas.
Despite their great importance, characterising information inequalities is not an easy task. It had been open for years whether there exist other information inequalities besides the nonnegativity of entropy and mutual information. No further information inequalities were found for fifty years, until [1] reported the first “non-Shannon” information inequality. The significance of that result lay not only in the inequality itself, but also in its construction. This particular construction has been the main ingredient in every non-Shannon inequality discovered since. Using this approach, new inequalities can be found mechanically [8], and there are in fact infinitely many such independent inequalities even when only four random variables are involved [9]. Despite this progress, however, a complete characterisation is still missing.
In this survey paper, we will review some of the major progress in the area of information inequalities. The organisation of the paper is as follows. In Section 3, we will first outline a geometric framework for information inequalities, based on which we will explain how a Shannon inequality can be proved mechanically. Then we will outline the proof of a non-Shannon inequality which was first proved in [1]. A geometric perspective on the proof will also be given. Next, Matúš’ series of information inequalities (and its relaxation) will be discussed.
In Section 4, we will consider several “equivalent frameworks” for information inequalities. The first and most natural one is the scenario in which the random variables are continuous. We will prove that information inequalities for discrete and continuous random variables are “essentially the same”. Then we will shift our focus to the one-to-one relations between information inequalities, inequalities for Kolmogorov complexity, group-theoretic inequalities and inequalities for box assignments. In Section 5, we will consider two constrained classes of information inequalities, subject respectively to the constraints that the random variables are induced by vector subspaces and that they are Gaussian. These constrained classes of information inequalities are equivalent to subspace rank inequalities and determinantal inequalities respectively.
2. Notations
Let be a finite set and be its power set. If n is understood implicitly, we will simply denote by . We define as the set of all real functions defined on . Hence, is a -dimensional Euclidean space. Elements in are called rank functions over . Let be nonempty sets and be n jointly distributed discrete random variables defined on respectively. For any , denotes the joint random variable defined over (the Cartesian product of for ). As an example, is the random variable . For simplicity, the parentheses in the subscript are usually omitted, i.e., is written as (or even simply ).
For a discrete random variable X, denotes the support of the probability distribution function of X. In other words,
The (discrete) entropy of X, denoted by , is defined as
where p is the probability distribution of X. We will also use the following conventions. An element and the singleton set containing it are not distinguished. For any set and subset , denotes the subset .
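As a small illustration (a minimal sketch using a hypothetical distribution), the discrete entropy can be computed directly from a probability mass function:

```python
import math

def entropy(pmf):
    """Discrete entropy H(X) = -sum_x p(x) log2 p(x), ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# Hypothetical example: a biased binary source.
p_X = {0: 0.25, 1: 0.75}
print(entropy(p_X))  # ~0.811 bits
```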
3. A Framework for Information Inequalities
Let be a set of discrete random variables. It induces a rank function h which is defined as follows: For any ,
We call h the entropy function induced by . For any function h in , we define
If h is the entropy function induced by random variables , then is the conditional entropy and is the mutual information .
All entropy functions must satisfy the following polymatroidal axioms.
The second axiom (R2) corresponds to the fact that conditional entropy is nonnegative, and the third axiom (R3) corresponds to the fact that the conditional mutual information between and given is nonnegative.
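To make the construction concrete, the following sketch (using a hypothetical joint distribution and base-2 logarithms) computes the entropy function over all subsets of variables and numerically checks the polymatroidal axioms (R1)-(R3), read here as normalisation, monotonicity and submodularity:

```python
import math
from itertools import combinations

def joint_entropy(pmf, subset):
    """H(X_alpha): entropy of the marginal distribution on the coordinates in `subset`."""
    marginal = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in subset)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

# Hypothetical joint distribution of (X1, X2): a binary source observed through a noisy channel.
pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
N = (0, 1)
h = {alpha: joint_entropy(pmf, alpha)
     for r in range(len(N) + 1) for alpha in combinations(N, r)}

# (R1): h of the empty set is zero.
assert abs(h[()]) < 1e-12
# (R2): monotonicity, h(alpha) <= h(beta) whenever alpha is a subset of beta.
assert all(h[a] <= h[b] + 1e-12 for a in h for b in h if set(a) <= set(b))
# (R3): submodularity, h(a) + h(b) >= h(a union b) + h(a intersect b).
for a in h:
    for b in h:
        u = tuple(sorted(set(a) | set(b)))
        i = tuple(sorted(set(a) & set(b)))
        assert h[a] + h[b] >= h[u] + h[i] - 1e-12
print(h)
```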
3.1. Geometric Framework
Characterisation of entropic functions is one of the most important and challenging problems in information theory. In the following, we will review the geometric framework proposed in [10], which has greatly simplified our understanding of information inequalities.
A function is called weakly entropic if there exists such that is entropic, and is called almost entropic if it is the limit of a sequence of weakly entropic functions. Let be the set of all entropic functions and be its closure. Then is a closed and convex cone, and in fact is the set of all almost entropic functions. Compared to , its closure is more manageable. In fact, for many applications, it is sufficient to consider . The following theorem shows that characterising all linear information inequalities is equivalent to characterising the set .
Theorem 1
(Yeung [10]) An information inequality is valid (i.e., holds for all discrete random variables) if and only if
Unfortunately, is still extremely difficult to characterise explicitly for . As we shall see, the cone is not polyhedral and hence cannot be defined by a finite number of linear inequalities. Theorem 1 offers a geometric perspective for understanding information inequalities. Based on the theorem, Yeung and Yan [11] wrote a software package called the Information-Theoretic Inequality Prover (ITIP) which can mechanically verify all Shannon inequalities.
The idea behind ITIP is very simple: Suppose we have a cone of such that . Consider an information inequality
Suppose one can verify that
Then by Theorem 1, the information inequality (8) will be valid. In other words, if the minimum of the following optimisation problem is nonnegative,
then the information inequality (8) is valid.
As is a cone (hence, for all and ), it suffices to test whether the origin is a global minimum of the above optimisation problem. Furthermore, as the optimisation problem is convex, the optimality of can be verified by checking the Karush–Kuhn–Tucker (KKT) conditions.
In ITIP, is chosen as the cone whose elements are all rank functions h that satisfy the polymatroidal axioms (R1)-(R3). By picking such a cone, ITIP can prove all inequalities that are implied by the three axioms (or equivalently, all Shannon inequalities).
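The following is a minimal sketch of the idea behind ITIP (it is not the actual ITIP software), for three random variables: the candidate inequality is minimised over the cone defined by the monotonicity and submodularity constraints using a linear program, and a bounded minimum of zero certifies that the inequality is implied by (R1)-(R3):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

n = 3
subsets = [frozenset(s) for r in range(1, n + 1)
           for s in itertools.combinations(range(1, n + 1), r)]
idx = {s: i for i, s in enumerate(subsets)}

def h_coeff(alpha):
    """Coefficient vector selecting h(alpha), with h(empty set) treated as 0."""
    v = np.zeros(len(subsets))
    if alpha:
        v[idx[frozenset(alpha)]] = 1.0
    return v

# Build the polymatroid (Shannon) cone: monotonicity and submodularity constraints, each of the form row . h >= 0.
rows = []
for a, b in itertools.product(subsets + [frozenset()], repeat=2):
    if a < b:                        # monotonicity: h(b) - h(a) >= 0
        rows.append(h_coeff(b) - h_coeff(a))
    rows.append(h_coeff(a) + h_coeff(b) - h_coeff(a | b) - h_coeff(a & b))

def shannon_provable(coeffs):
    """Check whether sum_alpha coeffs[alpha] * h(alpha) >= 0 is implied by the polymatroidal axioms."""
    c = sum(w * h_coeff(a) for a, w in coeffs.items())
    # linprog minimises c.x subject to A_ub.x <= b_ub; our cone constraints rows.x >= 0 become -rows.x <= 0.
    res = linprog(c, A_ub=-np.array(rows), b_ub=np.zeros(len(rows)), method="highs")
    return res.status == 0 and res.fun > -1e-9

# I(X1;X2|X3) >= 0 is a Shannon inequality ...
print(shannon_provable({(1, 3): 1, (2, 3): 1, (1, 2, 3): -1, (3,): -1}))   # True
# ... whereas I(X1;X2) <= 0 is not.
print(shannon_provable({(1, 2): 1, (1,): -1, (2,): -1}))                   # False
```

If the linear program is unbounded below, the candidate inequality is not implied by the Shannon inequalities alone, which is exactly the situation for the non-Shannon inequalities discussed next.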
3.2. Non-Shannon Inequalities
It has been an open question for many years whether there exist information inequalities that are not implied by Shannon’s information inequalities. This question was finally answered in [1] where non-Shannon type inequalities were constructed explicitly. The proof was based on the use of auxiliary random variables. This turns out to be a very powerful technique. In fact, all subsequently discovered non-Shannon type information inequalities are essentially proved by the same technique.
Theorem 2
(Non-Shannon inequality [1]) Let be random variables. Then
Or equivalently, if h is entropic, then
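Written out in a commonly used form (with the variable labels A, B, C, D chosen here for illustration), the Zhang–Yeung inequality reads:

```latex
I(A;B) + I(A;C,D) + 3\, I(C;D \mid A) + I(C;D \mid B) \;\ge\; 2\, I(C;D).
```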
The information inequality in Theorem 2 is a non-Shannon inequality because one can construct a rank function such that (1) h satisfies all the polymatroidal axioms (R1)-(R3) and (2) h violates the inequality (10).
To illustrate the technique in proving new inequalities, we will sketch the proof for Theorem 2. Further details can be found in [1,12].
Sketch of proof of Theorem 2:
Let h be the entropy function induced by a set of discrete random variables whose underlying distribution is p. Construct two auxiliary random variables and such that
It is easy to see that the marginals of and are the same. By invoking the basic Shannon inequalities (involving six random variables), we can prove that
Hence,
Similarly, we can also prove that
and consequently,
Again, by invoking only Shannon’s inequalities, it can be proved that
Combining (15) and (17), the theorem is proved. □
Remark: In the above proof of Theorem 2, the non-Shannon inequality is proved by invoking only a sequence of Shannon inequalities. This seems impossible at first glance, as by definition, non-Shannon inequalities are all inequalities that are not implied by Shannon inequalities. The trick, however, is to apply Shannon inequalities over a larger set of random variables.
Using the geometric framework obtained earlier, we will describe in the following a “geometric interpretation” of the proof of the non-Shannon inequality.
Consider a set such that . Let . We define as a function such that
for all . Similarly, for any subset of , is the following subset
Now, suppose that one can construct two cones and such that
- ;
- For any , there exists a such that . Or equivalently, .
From conditions 1 and 2, we have
Again, using Theorem 1, we can prove that an information inequality
is valid if
Equivalently, the inequality (18) is valid if the minimum of the following linear program is zero.
Remark: Instead of verifying whether an information inequality is valid or not, we can also use the Fourier-Motzkin elimination method to find all the linear inequalities that define the cone . Clearly, each such inequality corresponds to a valid information inequality over .
Now, we will revisit the non-Shannon inequality in Theorem 2. Let and . Given any random variables , construct two random variables and such that the probability distribution of is given by (11). Let g be the entropy function of and h be the entropy function of . Then it is easy to see that for all and ,
and .
Let
and (which is the set of all functions h that satisfy the polymatroidal axioms). Then clearly and . It can be numerically verified that the minimum of the linear program in (19) is zero when the information inequality is the non-Shannon inequality (9). Consequently, the non-Shannon inequality is indeed proved.
3.3. Non-Polyhedral Property
In the previous subsection, we discussed a promising technique for proving (or even discovering) new information inequalities. Using the same technique proposed in [1], more and more linear information inequalities have been discovered [8,13,14,15]. Later, in [9], Matúš obtained a countably infinite set of linear information inequalities for a set of four random variables. Using the same set of inequalities, Matúš further proved that is not polyhedral. In the following, we will review Matúš’ inequalities and their relaxation.
Remark: The non-polyhedral property of was later used in [16] to show that the set of achievable tuples of a network is in general also non-polyhedral. As a result, this proved that the Linear Programming bound is not tight in general.
Theorem 3 (Matúš)
Let and . Then
where for any distinct elements ,
While Matúš proved a series of linear information inequalities, it is sometimes difficult to use infinitely many inequalities at the same time. In [17], the series of Matúš’ inequalities was relaxed to a single non-linear inequality.
Remark: Using this single nonlinear inequality, it can be proved that the set of all almost entropic functions is not polyhedral.
Theorem 4
(Quadratic information inequality [17]) Let ,
If , then
and consequently,
Remark: Subject to the constraint that , the series of linear inequalities (20) is implied by the Shannon inequalities. Therefore, the constraint (i.e., ) imposed in Theorem 4 is not critical.
Conjecture 1
(20) holds for all . Consequently, if , then
4. Equivalent Frameworks
In the previous section, we described a framework for information inequalities for discrete random variables. We also demonstrated the common proof technique. In this section, we will construct several different frameworks which are “equivalent” or “almost equivalent” to the earlier one. These equivalence relations among different frameworks will turn out to be very useful in deriving new information-theoretic tools.
4.1. Differential Entropy
The previous framework for information inequalities assumes that all random variables are discrete. A very natural extension of the framework is thus to relax this restriction by allowing random variables to be continuous. To achieve this goal, we first need an analogue of discrete entropy for continuous random variables.
Definition 1 (Differential entropies)
Let be a set of continuous random variables such that are real-valued. For any , let be the density function of . Then the differential entropy of is defined as
Remark: For notational simplicity, we abuse notation by using to denote both discrete and differential entropy. However, the exact meaning should be clear from the context.
Discrete and differential entropies share similar and dissimilar properties. The main difference is that differential entropy can be negative, unlike discrete entropy. However, mutual information and its conditional counterpart (defined analogously to (7)) remain nonnegative. In fact, as we shall see, the sets of information inequalities for discrete and continuous random variables are almost the same.
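The following sketch illustrates numerically that differential entropy can be negative (the distributions used here are chosen purely for illustration):

```python
import math

# Differential entropy of Uniform(0, a) is log(a): negative whenever a < 1.
a = 0.1
h_uniform = math.log(a)                                    # ~ -2.30 nats

# Differential entropy of a Gaussian N(0, sigma^2) is 0.5 * log(2*pi*e*sigma^2).
sigma2 = 1.0
h_gauss = 0.5 * math.log(2 * math.pi * math.e * sigma2)    # ~ 1.42 nats

print(h_uniform, h_gauss)
```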
Definition 2 (Balanced inequalities)
An information inequality (for either discrete or continuous random variables) is called balanced if for all , .
For any information inequality or expression , its residual weight is defined as
Clearly, an information inequality is balanced if and only if for all .
Example 1
The residual weights of the information inequality are both equal to one. Hence, the inequality is not balanced.
For any information inequality , its balanced counterpart is the following inequality
which is balanced (as its name suggests).
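A minimal sketch, assuming an inequality of the form "sum over subsets alpha of b_alpha H(X_alpha) >= 0" is represented by a dictionary mapping subsets to coefficients (this representation is hypothetical), for computing residual weights and checking balancedness:

```python
def residual_weights(coeffs, n):
    """Residual weight of variable i: the sum of b_alpha over all subsets alpha containing i."""
    return {i: sum(b for alpha, b in coeffs.items() if i in alpha) for i in range(1, n + 1)}

def is_balanced(coeffs, n):
    """An inequality is balanced if every residual weight is zero."""
    return all(abs(r) < 1e-12 for r in residual_weights(coeffs, n).values())

# Hypothetical examples over two random variables X1, X2:
# H(X1, X2) >= 0  ->  residual weights (1, 1): not balanced.
print(residual_weights({(1, 2): 1}, 2), is_balanced({(1, 2): 1}, 2))
# I(X1; X2) >= 0, i.e. H(X1) + H(X2) - H(X1, X2) >= 0  ->  residual weights (0, 0): balanced.
print(residual_weights({(1,): 1, (2,): 1, (1, 2): -1}, 2),
      is_balanced({(1,): 1, (2,): 1, (1, 2): -1}, 2))
```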
Proposition 1
(Necessity and sufficiency of balanced inequalities [6]) An information inequality is a valid discrete information inequality if and only if
1. its residual weights for all n, and
2. its balanced counterpart is also valid.
Consequently, all valid discrete information inequalities are implied by the set of all valid balanced inequalities and the nonnegativity of (conditional) entropies.
It turns out that this set of balanced information inequalities also plays the same significant role for inequalities involving continuous random variables.
Theorem 5
(Equivalence [6]) All information inequalities for continuous random variables are balanced. Furthermore, a balanced information inequality
is valid for continuous random variables if and only if it is also valid for discrete random variables.
By Theorem 5, to characterise information inequalities, it is sufficient to consider only balanced information inequalities which are the same for either discrete or continuous random variables.
4.2. Inequalities for Kolmogorov Complexity
The second framework we will describe is quite different from the earlier information-theoretic frameworks. For information inequalities, the objects of interest are random variables. However, for the following Kolmogorov complexity framework, the objects of interest are deterministic strings instead.
To understand what Kolmogorov complexity is, let us consider the following example: Suppose that and are the following binary strings
The Kolmogorov complexity of a string x (denoted by ) is the minimal program length required to output that string [18]. In the above example, it is clear that the Kolmogorov complexity of is much smaller than that of (which is obtained by flipping a fair coin).
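Kolmogorov complexity itself is not computable, but compressed length is a rough practical proxy for it. The following sketch (an illustration only, not a definition) contrasts a highly structured string with a string of random coin flips:

```python
import os
import zlib

n = 10_000
structured = b"01" * (n // 2)                                   # a highly regular binary string
random_str = bytes((b & 1) + ord("0") for b in os.urandom(n))   # "coin flips" written as '0'/'1' characters

# Compressed length is an upper-bound-flavoured proxy for Kolmogorov complexity.
print(len(zlib.compress(structured, 9)))   # very small (a few dozen bytes)
print(len(zlib.compress(random_str, 9)))   # much larger (on the order of a kilobyte here)
```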
Although the objects of interest are different, [3] proved a surprising result that inequalities for Kolmogorov complexities and for entropies are essentially the same.
Theorem 6
(Equivalence [3]) An information inequality (for discrete random variables) is valid if and only if the corresponding Kolmogorov complexity inequality defined below
is also valid.
4.3. A Group-Theoretic Framework
Besides Kolmogorov complexities, information inequalities are also closely related to group-theoretic inequalities [4]. To understand their relation, we first illustrate how to construct a random variable from a subgroup.
Definition 3 (Group-theoretic construction of random variables)
Let G be a finite group and U be a random variable that takes value in G uniformly. In other words,
for all .
For any subgroup K of G, it partitions G into left (or right) cosets of K in G such that each coset has exactly elements. Note that each coset can be written as the following subset for some element
where ∘ is the binary group operator. Let be the collection of all left cosets of K in G. The subgroup K induces a random variable , which is defined as the random left coset of K in G that contains U. In fact, is equal to the following coset
Since U is uniformly distributed over G, we can easily prove that is uniformly distributed over and that
The above construction of a random variable from a subgroup can be extended naturally to multiple subgroups.
Theorem 7
(Group characterisable random variables [4]) Let G be a finite group and be a set of subgroups of G. For each , let be the random variable induced by the subgroup as defined above. Then for any ,
1. ,
2. ,
3. is uniformly distributed over its support. In other words, the value of the probability distribution function of is either zero or a constant.
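The sketch below (using S3 as a hypothetical example group, represented by permutation tuples) constructs the coset random variables for two subgroups and numerically checks the well-known identity that the joint entropy of the coset random variables indexed by a set alpha equals log(|G| / |G_alpha|), where G_alpha is the intersection of the corresponding subgroups:

```python
import math
from itertools import permutations

# Hypothetical example: G = S3, represented as tuples (images of 0, 1, 2).
G = list(permutations(range(3)))

def compose(p, q):
    """(p o q)(x) = p(q(x))."""
    return tuple(p[q[i]] for i in range(3))

# Two subgroups of S3 (chosen for illustration):
G1 = [p for p in G if p[2] == 2]                  # the permutations fixing the point 2
G2 = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]            # the cyclic subgroup A3

def coset(K, u):
    """The left coset u o K, used as the value of the induced random variable."""
    return frozenset(compose(u, k) for k in K)

def H(alpha):
    """Joint entropy (in nats) of the coset random variables for the subgroups in `alpha`, with U uniform on G."""
    joint = {}
    for u in G:
        key = tuple(coset(K, u) for K in alpha)
        joint[key] = joint.get(key, 0.0) + 1 / len(G)
    return -sum(p * math.log(p) for p in joint.values())

inter = [p for p in G1 if p in G2]                 # G1 intersect G2 = {identity}
print(H([G1]), math.log(len(G) / len(G1)))         # both equal log 3
print(H([G1, G2]), math.log(len(G) / len(inter)))  # both equal log 6
```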
Definition 4
A function is called group characterisable if it is the entropy function of a set of random variables induced by a finite group G and its subgroups . Furthermore, h is
1. representable if are all vector spaces, and
2. abelian if G is abelian.
Clearly, random variables induced by a set of subgroups must satisfy all valid information inequalities. Therefore, we have the following theorem.
Theorem 8
(Group-theoretic inequalities [4]) Let
be a valid information inequality. Then for any finite group G and its subgroups , we have
or equivalently,
Theorem 8 shows that we can directly “translate” any information inequality into a group-theoretic inequality. A very surprising result proved in [4] was that the converse also holds.
Theorem 9
(Converse [4]) The information inequality (30) is valid if it is satisfied by all random variables induced by groups, or equivalently, if the group-theoretic inequality (32) is valid.
Theorems 8 and 9 imply that to prove an information inequality, it is necessary and sufficient to verify whether the inequality is satisfied by all random variables induced by groups. Later, we will further illustrate how to use the two theorems to derive a group-theoretic proof for information inequalities.
In the following, we will further show that many statistical properties of random variables induced by groups have analogous algebraic interpretations.
Lemma 1 (Properties of group induced random variables)
Suppose that is a set of random variables induced by a finite group G and its subgroups . Then
1. (Functional dependency) (i.e., is a function of ) if and only if . Hence, functional dependency is equivalent to the subset relation;
2. (Independency) if and only if
3. (Conditioning preserves group characterisation) for any fixed , the group and its subgroups for induce a set of random variables such that for all . In other words, for any group characterisable , let such that for all . Then g is also group characterisable.
Proposition 2
(Duality [19]) Let be a set of vector subspaces of over the finite field . Define the following subspace for :
Then, for any ,
Hence, if such that for all , then h is weakly representable.
Remark: While and W are both subspaces of V and , in general. If , then (defined as in (34)) is the orthogonal complement of .
Theorems 8 and 9 imply that proving an information inequality (30) is equivalent to proving a group-theoretic inequality (32). In the following, we will illustrate the idea by providing a group-theoretic proof for the nonnegativity of mutual information
Example 2 (Group-theoretic Proof)
Let G be a finite group and and be its subgroups. Let
where ∘ is the binary group operator. As S is a subset of , . With a simple counting argument (removing duplicates), it can easily be proved that
Therefore,
Finally, according to Theorems 8 and 9, the inequality (35) follows.
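Spelled out, with the entropies of the coset random variables written as log(|G|/|G_i|) (notation chosen here for illustration), the counting argument amounts to:

```latex
|S| \;=\; \frac{|G_1|\,|G_2|}{|G_1 \cap G_2|} \;\le\; |G|
\quad\Longleftrightarrow\quad
\log\frac{|G|}{|G_1|} + \log\frac{|G|}{|G_2|} \;\ge\; \log\frac{|G|}{|G_1 \cap G_2|},
```

and by the coset construction above, the right-hand side is exactly H(X1) + H(X2) ≥ H(X1, X2), i.e., the nonnegativity of I(X1; X2).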
It is worth mentioning that Theorems 8 and 9 also suggest an information-theoretic proof for group-theoretic inequalities. For example, the following information inequality
implies the following group-theoretic inequality
The meaning of this inequality and its implications in group theory are yet to be understood.
4.4. Combinatorial Perspective
Random variables that are induced by groups have many interesting properties. One such property is that they are quasi-uniform in nature.
Definition 5 (Quasi-uniform random variables)
A set of random variables is called quasi-uniform if for all , is uniformly distributed over its support . In other words,
Since is uniformly distributed for all , the entropy is thus equal to .
According to the Asymptotic Equipartition Property (AEP) [12], for a sufficiently long sequence of independent and identically distributed random variables, the set of typical sequences has a total probability close to one and the probability of each typical sequence is approximately the same. In a certain sense, quasi-uniform random variables possess a non-asymptotic equipartition property in that their probabilities are completely concentrated and uniformly distributed over their supports. As a result, quasi-uniform random variables can be fully characterised by their supports (because the probability distributions are uniform over the supports). This offers a combinatorial interpretation of quasi-uniform random variables. It turns out that this interpretation offers a combinatorial approach to proving information inequalities.
Definition 6 (Box assignment)
Let be nonempty finite sets and be their Cartesian product . A box assignment in is a nonempty subset of .
For any and , we define
Roughly speaking, is the set of elements in whose “-coordinate” is for . The set will be called the -layer of . Hence, contains all such that the -layer of is nonempty. We will call the α-projection of .
Definition 7 (Quasi-uniform box assignment)
A box assignment is called quasi-uniform if for any , the cardinality of is constant for all . We will denote this constant by for simplicity.
The following proposition shows that quasi-uniform box assignments and quasi-uniform random variables are in fact equivalent.
Proposition 3
(Equivalence [7]) Let be a set of quasi-uniform random variables and be its probability distribution’s support. Then is a quasi-uniform box assignment in . Furthermore, for all ,
Conversely, for any quasi-uniform box assignment , there exists a set of quasi-uniform random variables whose probability distribution’s support is indeed .
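The following sketch (with a hypothetical box assignment: two fair bits together with their parity) checks quasi-uniformity directly from Definition 7 and computes the entropies, in bits, of the induced quasi-uniform random variables as the logarithms of the projection sizes:

```python
import math
from itertools import combinations

def layers(A, alpha):
    """Group the points of the box assignment A by their alpha-coordinates (the alpha-layers)."""
    groups = {}
    for point in A:
        key = tuple(point[i] for i in alpha)
        groups.setdefault(key, []).append(point)
    return groups

def is_quasi_uniform(A, n):
    """A is quasi-uniform if, for every alpha, all nonempty alpha-layers have the same size."""
    for r in range(1, n + 1):
        for alpha in combinations(range(n), r):
            sizes = {len(v) for v in layers(A, alpha).values()}
            if len(sizes) > 1:
                return False
    return True

# Hypothetical example: the support {(x, y, x XOR y)} of two fair bits and their parity.
A = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]
print(is_quasi_uniform(A, 3))                  # True
# Entropy (in bits) of the induced quasi-uniform random variables: log2 of the alpha-projection size.
for alpha in [(0,), (2,), (0, 1), (0, 1, 2)]:
    print(alpha, math.log2(len(layers(A, alpha))))   # 1, 1, 2, 2 bits
```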
As random variables induced by groups are quasi-uniform, by Theorems 8 and 9, we have the following combinatorial interpretation for information inequalities.
Theorem 10
(Combinatorial interpretation [7]) An information inequality
is valid if and only if the following box assignment inequality is valid
or equivalently,
for all quasi-uniform box assignments .
Again, in the following example, we will illustrate how to use the combinatorial interpretation to derive a “combinatorial proof” of an information inequality.
Example 3 (Combinatorial proof)
Let be a quasi-uniform box assignment in . Suppose . Then it is obvious that and . In other words, and consequently,
By Theorem 10, we prove that .
4.5. Coding Perspective
We can also view a box assignment as an error-correcting code such that is the set of all codewords. For each codeword , is the symbol to be transmitted across a channel. Taking this coding perspective, in the following a box assignment will simply be called a code. Also, a code is called a quasi-uniform code if is a quasi-uniform box assignment. Again, each quasi-uniform code induces a set of quasi-uniform random variables .
For any code (which is just a box assignment) and two codewords , the Hamming distance between codewords and is defined as
In addition, the minimum Hamming distance of the code is defined as
The minimum Hamming distance of a code characterises how strong the error-correcting capability of the code is. Specifically, a code with minimum Hamming distance d can correct up to ⌊(d−1)/2⌋ symbol errors.
Example 4
Let be a length-3 code containing only two codewords and . The minimum Hamming distance of this code is 3 and hence can correct any single symbol error. For instance, suppose the codeword is transmitted. If a symbol error occurs, the receiver will receive either , or . In any case, the receiver can always determine which symbol is erroneous (by using a bounded-distance decoder) and hence can correct it.
In addition to the minimum Hamming distance, in many cases, a code’s distance profile is also of great importance: Let be a code and c be a codeword in . The distance profile of centered at c is a set of integers where
In other words, is the number of codewords in whose Hamming distance to the centre codeword c is r.
The profile contains information about how likely a decoding error (i.e., the receiver decoding a wrong codeword) is to occur if the transmitted codeword is c. In general, the distance profile depends on the choice of c. A code is called distance-invariant if its distance profile is independent of c. Roughly speaking, a distance-invariant code is one where the probability of decoding error is the same for all transmitted codewords.
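A short sketch (using a hypothetical length-3 even-weight binary code) computing Hamming distances, the minimum distance and the distance profiles, and observing that all profiles coincide, i.e., the code is distance-invariant:

```python
from itertools import combinations

# Hypothetical quasi-uniform (in fact linear) code: all even-weight binary words of length 3.
C = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def hamming(c1, c2):
    """Number of coordinates in which the two codewords differ."""
    return sum(a != b for a, b in zip(c1, c2))

def min_distance(code):
    return min(hamming(c1, c2) for c1, c2 in combinations(code, 2))

def distance_profile(code, centre):
    """B_r = number of codewords at Hamming distance r from the centre codeword."""
    profile = {}
    for c in code:
        r = hamming(centre, c)
        profile[r] = profile.get(r, 0) + 1
    return profile

print(min_distance(C))                               # 2
print({c: distance_profile(C, c) for c in C})        # identical profiles -> distance-invariant
```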
Theorem 11
(Distance invariance [20]) Quasi-uniform codes are distance-invariant.
Example 5 (Linear codes)
Let P be a parity check matrix (over a finite field ) and the code is defined by
Then is called a linear code. Note that for a linear code, if , then is also contained in . Linear codes are quasi-uniform codes and hence are also distance-invariant.
In the following, we will consider only quasi-uniform codes. For simplicity, we will assume without loss of generality that there is a zero-codeword (by renaming). Also, for any , we define the Hamming weight of the codeword c (denoted by ) as .
Definition 8 (Weight enumerator)
The weight enumerator of a quasi-uniform code C with length n is
where x and y are indeterminates, and . Using simple counting, it is easy to prove that
In many cases, it is more convenient to work with the weight enumerator than the distance profile. However, conceptually, they are equivalent (i.e., they can be uniquely obtained from each other). Clearly, the weight enumerator is uniquely determined by the code . However, what “structural property” of the code determines the weight enumerator? For example, suppose that we construct a new code from by exchanging the first and the second codeword symbols. It is obvious that this modification will not affect the weight enumerator. In other words, the ordering of the codeword symbols has no effect on the weight enumerator. The question therefore is: What property of a code has a direct effect on the weight enumerator?
To answer the question, let us return to the perspective that a quasi-uniform code is merely a quasi-uniform box assignment (and also its associated set of quasi-uniform random variables). These random variables have a simple interpretation here: Suppose a codeword is randomly and uniformly selected from . Then is the symbol in the random codeword C, i.e., .
Theorem 12
(Generalised Greene’s Theorem [20]) Let C be a quasi-uniform code and be its induced quasi-uniform random variables. Suppose that ρ is the entropy function of . In other words, . Then
Remark: Greene’s Theorem is a special case of Theorem 12 when the code is a linear code.
By Theorem 12, the weight enumerator (and hence also the error-correcting capability) of a quasi-uniform code depends only on the entropy function induced by the codeword symbol random variables. By exploiting the relation between the entropy function of a set of quasi-uniform random variables and the weight enumerator of the induced code, we open a new door to harnessing coding-theory results to derive new information-theoretic results.
Example 6
(Code-theoretic proof) Consider a set of quasi-uniform random variables which induces a length-2 quasi-uniform code C. By the Generalised Greene’s Theorem, the number of codewords with Hamming weight 1 is given by
As is nonnegative, (50) implies that
Finally, by Theorem 10 (or, to be precise, a variation of it), an information inequality holds if and only if it also holds for all quasi-uniform random variables. Consequently, we have proved that (51) holds for all random variables.
5. Constrained Information Inequalities
In previous sections, we considered general information inequalities, where no constraint is imposed on the choice of random variables. In the following, we will focus on two constrained classes of information inequalities: subspace rank inequalities and determinantal inequalities.
5.1. Rank Inequalities
Let be a set of vector subspaces over a field . A subspace rank inequality is an inequality about the rank or dimension of subspaces in the following form:
For example, it is straightforward to prove that
which is a direct consequence of the following identity
Subspace rank inequalities are in fact constrained information inequalities, subject to the constraint that the random variables are induced by vector subspaces over a field. Clearly, all valid information inequalities (including all Shannon inequalities) are subspace rank inequalities. For example, the subspace rank inequality (53) is indeed equivalent to the nonnegativity of mutual information. Besides these known unconstrained information inequalities, one of the most well-known subspace rank inequalities is the Ingleton inequality [21]. A recent work [22] proved that the Ingleton inequalities also include Shannon inequalities as special cases and determined the unique minimal set of Ingleton inequalities that imply all the others.
Theorem 13 (Ingleton inequality)
Suppose r is a representable polymatroid over . Then for every choice of subsets
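In a commonly used form (with the four subsets written here as A, B, C, D for illustration), the Ingleton inequality reads:

```latex
r(A) + r(B) + r(C \cup D) + r(A \cup B \cup C) + r(A \cup B \cup D)
\;\le\;
r(A \cup B) + r(A \cup C) + r(A \cup D) + r(B \cup C) + r(B \cup D).
```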
It had been open for years whether there exist subspace rank inequalities that are not implied by the Ingleton inequalities and the Shannon inequalities. It was not until recently that the question was finally answered. In [5], the insufficiency of the Ingleton inequalities to characterise all subspace rank inequalities was proved, and in [23,24], new subspace rank inequalities not implied by the Ingleton inequalities were explicitly constructed. In fact, the subspace rank inequalities for up to five variables have all been determined. However, a complete characterisation involving more than five variables is still missing. In the following, we will review some of the important results along this line of work.
Theorem 14
(Kinser [23]) Suppose and h is representable over . Then
Or equivalently,
Theorem 15
(Dougherty et al. [24]) Suppose and h is representable over . Then
Remark: In addition to the inequalities obtained in Theorem 15, the work [24] found all subspace rank inequalities in five variables (called the DFZ inequalities) as well as many more new inequalities in six variables.
Definition 9 (ϵ-truncation)
Let h be a polymatroid over and . Define g as follows where
Then g is called the ϵ-truncation of h.
Definition 10 (Truncation-preserving inequalities)
Let . A set of rank inequalities
is said to preserve truncation (or is truncation-preserving) if for any h satisfying all the inequalities in (60), its truncation also satisfies all the inequalities.
Proposition 4
(Chan et al. [5]) DFZ inequalities are truncation preserving.
Theorem 16
(Insufficiency of truncation preserving inequalities [5]) Let be the set of all subspace rank inequalities involving n variables (or subspaces). Then for sufficiently large n, is not truncation-preserving.
5.2. Determinantal Inequalities
Information inequalities for Gaussian random variables are another interesting class of information inequalities. As we shall see, they are equivalent to determinantal inequalities.
Definition 11 (Gaussian polymatroid)
Let h be a polymatroid over . It is called Gaussian if there exists a set of jointly Gaussian random variables with a covariance matrix and a partition of into n disjoint nonempty subsets such that for any ,
where for all . Furthermore, h is called weakly Gaussian if there exists such that is Gaussian, and almost Gaussian if h is the limit of a sequence of weakly Gaussian functions.
It is straightforward to prove that the weakly Gaussian property is closed under addition. In other words, if h and g are weakly Gaussian, then their sum is also weakly Gaussian. Furthermore, as with information inequalities for continuous random variables, if an inequality
holds for all Gaussian random variables [25], then it must be balanced. Therefore, in the following, we will only consider balanced information inequalities.
Let be a set of jointly Gaussian random variables with covariance matrix K, which is a positive definite matrix. Suppose is partitioned into n disjoint nonempty subsets . A very compelling property of a set of Gaussian random variables is that its entropy and the determinant of its covariance matrix are related as follows:
where is the principal submatrix of K obtained by deleting the rows and columns that are not indexed by β. Substituting (63) back into (62), the inequality (62) is satisfied by all Gaussian random variables if and only if
where . Since the inequality (62) is balanced,
for all . On the other hand,
Therefore, the inequality (62) holds for all Gaussian random variables if and only if the following determinantal inequality holds for all positive definite matrices K
or equivalently,
As a direct consequence, for any valid information inequality, we can use the above relation to derive a corresponding determinantal inequality. For example, the following well-known determinantal inequalities can all be proved using this “information-theoretic method”.
- (Hadamard inequality) Let K be a positive definite matrix. Then where is the diagonal entry of K (a numerical check of this inequality is sketched after this list). This inequality follows from the following information inequality
- (Szasz inequality) For any , This determinantal inequality follows from the following information inequality
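As a numerical sanity check (a sketch; the positive definite matrix below is randomly generated for illustration), the Hadamard inequality and its information-theoretic reading can be verified as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate a random positive definite matrix K = B B^T + I.
n = 5
B = rng.standard_normal((n, n))
K = B @ B.T + np.eye(n)

lhs = np.linalg.det(K)
rhs = np.prod(np.diag(K))
print(lhs <= rhs + 1e-9, lhs, rhs)   # Hadamard: det(K) <= product of the diagonal entries of K

# Equivalently, via the Gaussian entropy relation h(X_beta) = 0.5 * log((2*pi*e)^|beta| * det(K_beta)),
# Hadamard's inequality is the determinantal form of  sum_i h(X_i) >= h(X_1, ..., X_n).
h_joint = 0.5 * np.log((2 * np.pi * np.e) ** n * lhs)
h_marginals = sum(0.5 * np.log(2 * np.pi * np.e * K[i, i]) for i in range(n))
print(h_marginals >= h_joint - 1e-9)
```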
Finally, we conclude this section with the following open question: While a Gaussian polymatroid is clearly almost entropic, is it true that an almost entropic polymatroid is almost Gaussian? In other words, for any almost entropic polymatroid h, can we construct a sequence of Gaussian polymatroids such that
for some for all i.
6. Summary and Conclusions
In this paper, we have reviewed some of the recent progress in the characterisation of information inequalities. We began with a geometric framework for information inequalities which has simplified the understanding of information inequalities. We also reviewed how the first non-Shannon inequality was proved and highlighted the general idea behind the proof. Next, we studied the infinite series of inequalities over and considered a nonlinear relaxation of the series of inequalities.
We have also reviewed how information inequalities are related to Kolmogorov complexity inequalities, group-theoretic inequalities and inequalities for box assignments. Based on their relations, we demonstrated non-traditional approaches to proving information inequalities.
Finally, we investigated two constrained classes of information inequalities. The first class arises when the random variables are induced by vector subspaces. In this case, the constrained inequalities are equivalent to subspace rank inequalities. We showed that the Ingleton and DFZ inequalities are insufficient to characterise all subspace rank inequalities in general, as the set of all subspace rank inequalities is not truncation-preserving. The second constrained class arises when the random variables are Gaussian. We have shown that these constrained inequalities are in fact equivalent to determinantal inequalities.
As a final remark, we would like to emphasise that this survey paper does not aim to cover every aspect of information inequalities. In fact, there are many interesting pieces of work that we did not cover. For example, as pointed out by one of the reviewers, one very interesting area is the relation between convex body inequalities and information inequalities [26,27]. We strongly encourage interested readers to further explore those related areas.
Acknowledgements
This work was supported by the Australian Government under ARC grant DP1094571.
References
- Zhang, Z.; Yeung, R.W. On the characterization of entropy function via information inequalities. IEEE Trans. Inform. Theory 1998, 44, 1440–1452.
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656.
- Hammer, D.; Romashchenko, A.; Shen, A.; Vereshchagin, N. Inequalities for Shannon entropy and Kolmogorov complexity. J. Comput. Syst. Sci. 2000, 60, 442–464.
- Chan, T.H.; Yeung, R.W. On a relation between information inequalities and group theory. IEEE Trans. Inform. Theory 2002, 48, 1992–1995.
- Chan, T.H.; Grant, A.; Kern, D. Novel technique in characterising representable polymatroids. IEEE Trans. Inform. Theory 2009, submitted for publication.
- Chan, T.H. Balanced information inequalities. IEEE Trans. Inform. Theory 2003, 49, 3261–3267.
- Chan, T.H. A combinatorial approach to information inequalities. Commun. Inform. Syst. 2001, 1, 1–14.
- Dougherty, R.; Freiling, C.; Zeger, K. Six new non-Shannon information inequalities. In Proceedings of the IEEE International Symposium on Information Theory, 2006; pp. 233–236.
- Matus, F. Infinitely many information inequalities. In Proceedings of ISIT 2007, Nice, France, June 2007.
- Yeung, R. A framework for linear information inequalities. IEEE Trans. Inform. Theory 1997, 43, 1924–1934.
- Yeung, R.; Yan, Y. Information Theoretic Inequality Prover. Available online: http://user-www.ie.cuhk.edu.hk/ITIP/ (accessed on 27 January 2011).
- Yeung, R. A First Course in Information Theory; Kluwer Academic/Plenum Publishers: New York, NY, USA, 2002.
- Yeung, R.W.; Zhang, Z. A class of non-Shannon-type information inequalities and their applications. Commun. Inform. Syst. 2001, 1, 87–100.
- Sason, I. Identification of new classes of non-Shannon type constrained information inequalities and their relation to finite groups. In Proceedings of the 2002 IEEE International Symposium on Information Theory, Lausanne, Switzerland, 30 June–5 July 2002.
- Makarychev, K.; Makarychev, Y.; Romashchenko, A.; Vereshchagin, N. A new class of non-Shannon-type inequalities for entropies. Commun. Inform. Syst. 2002, 2, 147–165.
- Chan, T.H.; Grant, A. Dualities between entropy functions and network codes. IEEE Trans. Inform. Theory 2008, 54, 4470–4487.
- Chan, T.; Grant, A. Non-linear information inequalities. Entropy 2008, 10, 765–775.
- Strictly speaking, the Kolmogorov complexity of a string depends on the chosen “computer model”. However, the choice of the computer model only affects the resulting Kolmogorov complexity up to an additive constant (because different computer models can emulate each other). Asymptotically, such a difference is not significant.
- Chan, T.H.; Grant, A. Linear programming bounds for network coding. IEEE Trans. Inform. Theory 2011, submitted for publication.
- Chan, T.H.; Grant, A.; Britz, T. Properties of quasi-uniform codes. In Proceedings of the 2010 IEEE International Symposium on Information Theory, Austin, TX, USA, June 2010.
- Ingleton inequalities are not valid information inequalities, as there exist almost entropic polymatroids violating the inequalities.
- Guille, L.; Chan, T.H.; Grant, A. The minimal set of Ingleton inequalities. IEEE Trans. Inform. Theory 2009.
- Kinser, R. New inequalities for subspace arrangements. J. Combin. Theory Ser. A 2010.
- Dougherty, R.; Freiling, C.; Zeger, K. Linear rank inequalities on five or more variables. arXiv preprint, 2009.
- Each Xi can be a vector of jointly distributed Gaussian random variables as defined in (61).
- Lutwak, E.; Yang, D.; Zhang, G. Cramer-Rao and moment-entropy inequalities for Renyi entropy and generalized Fisher information. IEEE Trans. Inform. Theory 2005, 51, 473–478.
- Lutwak, E.; Yang, D.; Zhang, G. Moment-entropy inequalities for a random vector. IEEE Trans. Inform. Theory 2007, 53, 1603–1607.
© 2011 by the author; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license http://creativecommons.org/licenses/by/3.0/.