Recent Progresses in Characterising Information Inequalities

In this paper, we review some of the recent progress made in characterising and understanding information inequalities, which are the fundamental physical laws in communications and compression. We begin by introducing a geometric framework for information inequalities, followed by the first non-Shannon inequality, proved by Zhang et al. in 1998 [1]. The discovery of this non-Shannon inequality was a breakthrough in the area and has led to the subsequent discovery of many more non-Shannon inequalities. We also review the close relations between information inequalities and other research areas such as Kolmogorov complexity, determinantal inequalities, and group-theoretic inequalities. These relations have led to non-traditional techniques for proving information inequalities and, at the same time, have influenced those related areas through the introduction of information-theoretic tools.


Introduction
Information inequalities are the "physical laws" that characterise the fundamental limits in communications and compression. Probably the most well-known information inequalities are the nonnegativity of entropy and mutual information, dating back to Shannon [2]. They are indispensable in proving converse coding theorems and play a critical role in information theory.
To illustrate how inequalities are invoked to prove a converse, consider the following classical scenario: Alice aims to send a source message M to Bob in a hostile environment where the transmitted message may be eavesdropped by a malicious adversary Eve. To ensure that Eve learns nothing about the source message M, Alice encrypts it into a transmitted message X using a private key K known only to Bob and herself. It is well known that, for perfect secrecy, the entropy of the key K must be at least as large as the entropy of the message M, i.e., H(K) ≥ H(M). Such a result can be proved by invoking a few information inequalities in four steps, where (a) is due to perfect secrecy (i.e., M and X are independent), (b) follows from the fact that M can be reconstructed from the key K and the encrypted message X, (c) follows from the nonnegativity of the conditional entropy H(K|M, X), and (d) is due to the nonnegativity of the mutual information I(X; K).
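A chain of (in)equalities consistent with steps (a)-(d) can be sketched as follows. This is a reconstruction from the step descriptions only and may differ in form from the original display.

```latex
\begin{align*}
H(M) &\overset{(a)}{=} H(M \mid X) \\
     &\overset{(b)}{=} H(M \mid X) - H(M \mid K, X) = I(M; K \mid X) \\
     &= H(K \mid X) - H(K \mid M, X) \\
     &\overset{(c)}{\le} H(K \mid X) \\
     &= H(K) - I(X; K) \\
     &\overset{(d)}{\le} H(K).
\end{align*}
```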
Besides their role in proving converse coding theorems, information inequalities have also been shown to be closely related to inequalities for Kolmogorov complexities [3], group-theoretic inequalities [4], subspace rank inequalities [5], determinantal inequalities [6] and combinatorial inequalities [7]. Therefore, any new technique for characterising information inequalities will also have a direct impact on these areas.
Although information inequalities are of great importance, characterising them is not an easy task. It was open for years whether there exist information inequalities other than the nonnegativity of entropies and mutual information. No further information inequalities were found for fifty years, until [1] reported the first "non-Shannon" information inequality. The significance of that result lay not only in the inequality itself, but also in its construction. This particular construction has been the main ingredient in every non-Shannon inequality subsequently discovered. Using this approach, new inequalities can be found mechanically [8], and there are in fact infinitely many such independent inequalities even when only four random variables are involved [9]. Despite this progress, a complete characterisation is still missing.
In this survey paper, we review some of the major progress in the area of information inequalities. The paper is organised as follows. In Section 2, we first outline a geometric framework for information inequalities, based on which we explain how a Shannon inequality can be proved mechanically. We then outline the proof of a non-Shannon inequality first proved in [1], together with a geometric perspective on the proof. Next, Matúš' series of information inequalities (and its relaxation) is discussed.
In Section 3, we consider several "equivalent frameworks" for information inequalities. The first and most natural one is the scenario in which the random variables are continuous. We show that information inequalities for discrete and continuous random variables are "essentially the same". We then turn to the one-to-one relations between information inequalities, inequalities for Kolmogorov complexity, group-theoretic inequalities and inequalities for box assignments. In Section 4, we consider two constrained classes of information inequalities, in which the random variables are, respectively, induced by vector subspaces and Gaussian. These constrained classes of information inequalities are equivalent to subspace rank inequalities and determinantal inequalities, respectively.

Notations
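Let $\mathcal{N}_n = \{1, \ldots, n\}$ be a finite set and $2^{\mathcal{N}_n}$ be its power set. If n is understood implicitly, we will simply denote $\mathcal{N}_n$ by $\mathcal{N}$. We define $\mathcal{H}[\mathcal{N}]$ as the set of all real functions defined on $2^{\mathcal{N}}$. Hence, $\mathcal{H}[\mathcal{N}]$ is a real vector space of dimension $2^n$. For any $\alpha \subseteq \mathcal{N}$, $X_\alpha$ denotes the joint random variable $(X_i : i \in \alpha)$ defined over $\mathcal{X}_\alpha$ (the Cartesian product of the alphabets $\mathcal{X}_i$ for $i \in \alpha$). As an example, $X_{\{1,2\}}$ is the random variable $(X_1, X_2)$. For simplicity, the braces in the subscript are usually omitted, i.e., $X_{\{1,2\}}$ is written as $X_{1,2}$ (or even simply $X_{12}$).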
For a discrete random variable X, $\lambda(X)$ denotes the support of the probability distribution function of X. In other words, $\lambda(X) = \{x : \Pr(X = x) > 0\}$. The (discrete) entropy of X, denoted by H(X), is defined as $H(X) = -\sum_{x \in \lambda(X)} p(x) \log p(x)$, where p is the probability distribution of X. We will also use the following conventions. A singleton and its single element are not distinguished. For any set $\{Y_i, i \in \mathcal{N}\}$ and subset $\alpha \subseteq \mathcal{N}$, $Y_\alpha$ denotes the subset $\{Y_i, i \in \alpha\}$.

A Framework for Information Inequalities
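Let $\{X_i, i \in \mathcal{N}\}$ be a set of discrete random variables. It induces a rank function h defined as follows: for any $\alpha \subseteq \mathcal{N}$, $h(\alpha) \triangleq H(X_\alpha)$. We call h the entropy function induced by $\{X_1, \ldots, X_n\}$. For any function h in $\mathcal{H}[\mathcal{N}]$ and any $\alpha, \beta, \gamma \subseteq \mathcal{N}$, we define $h(\alpha|\beta) \triangleq h(\alpha \cup \beta) - h(\beta)$ and $I_h(\alpha; \beta | \gamma) \triangleq h(\alpha \cup \gamma) + h(\beta \cup \gamma) - h(\alpha \cup \beta \cup \gamma) - h(\gamma)$. If h is the entropy function induced by random variables $\{X_i, i \in \mathcal{N}\}$, then $h(\alpha|\beta) = H(X_\alpha | X_\beta)$ and $I_h(\alpha; \beta | \gamma) = I(X_\alpha; X_\beta | X_\gamma)$. All entropy functions must satisfy the following polymatroidal axioms:
(R1) $h(\emptyset) = 0$;
(R2) $h(\alpha) \le h(\beta)$ if $\alpha \subseteq \beta$;
(R3) $h(\alpha) + h(\beta) \ge h(\alpha \cup \beta) + h(\alpha \cap \beta)$.
The second axiom (R2) corresponds to the nonnegativity of conditional entropy, and the third axiom (R3) corresponds to the nonnegativity of the conditional mutual information between $X_\alpha$ and $X_\beta$ given $X_{\alpha \cap \beta}$.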
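The entropy function and the polymatroidal axioms can be checked numerically on small examples. The following is a minimal sketch, assuming a joint distribution given as a dictionary from outcome tuples to probabilities; the function names and the 0-based variable indices are illustrative choices, not notation from the text.

```python
import itertools
import math
from collections import defaultdict


def entropy_function(pmf, n):
    """Map each subset alpha of {0, ..., n-1} (a frozenset) to H(X_alpha) in bits."""
    h = {}
    for r in range(n + 1):
        for alpha in itertools.combinations(range(n), r):
            marginal = defaultdict(float)
            for outcome, p in pmf.items():
                marginal[tuple(outcome[i] for i in alpha)] += p
            h[frozenset(alpha)] = -sum(
                p * math.log2(p) for p in marginal.values() if p > 0
            )
    return h


def is_polymatroid(h, tol=1e-9):
    """Check (R1) h(empty) = 0, (R2) monotonicity and (R3) submodularity."""
    if abs(h[frozenset()]) > tol:
        return False
    for a in h:
        for b in h:
            if a <= b and h[a] > h[b] + tol:              # (R2)
                return False
            if h[a] + h[b] + tol < h[a | b] + h[a & b]:   # (R3)
                return False
    return True


# Illustrative distribution: X0, X1 independent fair bits, X2 = X0 XOR X1.
pmf = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
h = entropy_function(pmf, 3)
print(is_polymatroid(h), h[frozenset({0, 1, 2})])
```

Because h is induced by actual random variables, the check is expected to succeed; a failure would indicate an invalid distribution or a numerical issue rather than a counterexample.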

Geometric Framework
Characterisation of entropic functions is one of the most important and challenging problems in information theory. In the following, we review the geometric framework proposed in [10], which has greatly simplified our understanding of information inequalities.
A function $h \in \mathcal{H}[\mathcal{N}]$ is called weakly entropic if there exists δ > 0 such that δ · h is entropic, and is called almost entropic if it is the limit of a sequence of weakly entropic functions. Let $\Gamma^*(\mathcal{N})$ be the set of all entropic functions and $\bar{\Gamma}^*(\mathcal{N})$ be its closure. Then $\bar{\Gamma}^*(\mathcal{N})$ is a closed and convex cone, and is in fact the set of all almost entropic functions. Compared to $\Gamma^*(\mathcal{N})$, its closure $\bar{\Gamma}^*(\mathcal{N})$ is more manageable. In fact, for many applications, it is sufficient to consider $\bar{\Gamma}^*(\mathcal{N})$. The following theorem shows that characterising all linear information inequalities is equivalent to characterising the set $\bar{\Gamma}^*(\mathcal{N})$.
Theorem 1 (Yeung [10]) An information inequality $\sum_{\alpha \subseteq \mathcal{N}} c_\alpha H(X_\alpha) \ge 0$ is valid (i.e., holds for all discrete random variables) if and only if $\sum_{\alpha \subseteq \mathcal{N}} c_\alpha h(\alpha) \ge 0$ for all $h \in \bar{\Gamma}^*(\mathcal{N})$.
Unfortunately, $\bar{\Gamma}^*(\mathcal{N})$ is still extremely difficult to characterise explicitly for n ≥ 4. As we shall see, the cone is not polyhedral and hence cannot be defined by a finite number of linear inequalities. Nevertheless, Theorem 1 offers a geometric perspective for understanding information inequalities. Based on the theorem, Yan et al. [11] wrote the software Information-Theoretic Inequality Prover (ITIP), which can mechanically verify all Shannon inequalities.
The idea behind ITIP is very simple. Suppose we have a cone Υ in $\mathcal{H}[\mathcal{N}]$ such that $\bar{\Gamma}^*(\mathcal{N}) \subseteq \Upsilon$. If $\sum_{\alpha \subseteq \mathcal{N}} c_\alpha h(\alpha) \ge 0$ for all $h \in \Upsilon$, then by Theorem 1 the information inequality (8) is valid. In other words, if the minimum of the optimisation problem $\min_{h \in \Upsilon} \sum_{\alpha \subseteq \mathcal{N}} c_\alpha h(\alpha)$ is nonnegative, then the information inequality (8) is valid.
As Υ is a cone (hence, δh ∈ Υ for all δ ≥ 0 and h ∈ Υ), it suffices to test whether the origin 0 is a global minimum of the above optimisation problem. Furthermore, as the optimisation problem is convex, the optimality of 0 can be verified by checking the Karush-Kuhn-Tucker (KKT) conditions.
In ITIP, Υ is chosen as the cone Γ(N) consisting of all rank functions h that satisfy the polymatroidal axioms (R1)-(R3). With this choice, ITIP can prove all inequalities that are implied by the three axioms (or equivalently, all Shannon inequalities).
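ITIP itself solves a linear program. The sketch below does not reproduce that; it only illustrates the underlying idea in a solver-free way, by enumerating the elemental Shannon inequalities (a standard generating set for the polymatroidal axioms) and checking a hand-supplied nonnegative combination of them against a target inequality. All names are illustrative assumptions, and the certificate is chosen by hand rather than found automatically.

```python
from collections import Counter
from itertools import combinations


def elemental_inequalities(n):
    """Elemental Shannon inequalities on variables 1..n.

    Each value maps subsets (frozensets) to coefficients of h(subset);
    the inequality reads: sum of coeff * h(subset) >= 0.
    """
    N = frozenset(range(1, n + 1))
    out = {}
    for i in N:                                   # H(X_i | X_{N \ i}) >= 0
        out[("H", i)] = Counter({N: 1, N - {i}: -1})
    for i, j in combinations(sorted(N), 2):       # I(X_i; X_j | X_K) >= 0
        rest = sorted(N - {i, j})
        for r in range(len(rest) + 1):
            for K in combinations(rest, r):
                K = frozenset(K)
                c = Counter()
                c[K | {i}] += 1
                c[K | {j}] += 1
                c[K | {i, j}] -= 1
                c[K] -= 1
                out[("I", i, j, K)] = c
    return out


def combine(certificate, elemental):
    """Sum weight * elemental inequality; drop h(empty set) and zero terms."""
    total = Counter()
    for label, weight in certificate.items():
        assert weight >= 0, "certificate weights must be nonnegative"
        for subset, coeff in elemental[label].items():
            total[subset] += weight * coeff
    return {s: c for s, c in total.items() if s and c != 0}


# Target: H(X1) + H(X2) + H(X3) - H(X1, X2, X3) >= 0.
target = {frozenset({1}): 1, frozenset({2}): 1, frozenset({3}): 1,
          frozenset({1, 2, 3}): -1}
# Hand-picked certificate: I(1;2) + I(1;3) + I(2;3|1).
certificate = {("I", 1, 2, frozenset()): 1,
               ("I", 1, 3, frozenset()): 1,
               ("I", 2, 3, frozenset({1})): 1}
elemental = elemental_inequalities(3)
print(combine(certificate, elemental) == target)
```

A full prover would search for the certificate weights with a linear programming solver instead of taking them as input.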

Non-Shannon Inequalities
It was an open question for many years whether there exist information inequalities that are not implied by Shannon's information inequalities. This question was finally answered in [1], where non-Shannon type inequalities were constructed explicitly. The proof was based on the use of auxiliary random variables. This turns out to be a very powerful technique. In fact, all subsequently discovered non-Shannon type information inequalities are essentially proved by the same technique.
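Theorem 2 (Non-Shannon inequality [1]) Let $X_1, X_2, X_3, X_4$ be discrete random variables. Then $2I(X_3; X_4) \le I(X_1; X_2) + I(X_1; X_3, X_4) + 3I(X_3; X_4 | X_1) + I(X_3; X_4 | X_2)$ (9). Or equivalently, if h is entropic, then $h(1) + 2h(3) + 2h(4) + h(12) + 4h(134) + h(234) \le 3h(13) + 3h(14) + 3h(34) + h(23) + h(24)$ (10).
The information inequality in Theorem 2 is a non-Shannon inequality because one can construct a rank function $h \in \mathcal{H}[\mathcal{N}_4]$ such that (1) h satisfies all the polymatroidal axioms (R1)-(R3) and (2) h violates the inequality (10). To illustrate the technique for proving new inequalities, we sketch the proof of Theorem 2. Further details can be found in [1,12].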
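The existence of a polymatroid that violates the non-Shannon inequality can be checked by brute force. The sketch below assumes that the inequality takes the form commonly attributed to [1], namely 2I(X3;X4) ≤ I(X1;X2) + I(X1;X3,X4) + 3I(X3;X4|X1) + I(X3;X4|X2). The particular rank function h is an illustrative choice and is not claimed to be the one used in [1].

```python
from itertools import combinations

N = (1, 2, 3, 4)
SUBSETS = [frozenset(c) for r in range(5) for c in combinations(N, r)]


def h(s):
    """Illustrative rank function: depends on |s|, except that h({1,2}) = 4."""
    s = frozenset(s)
    if s == frozenset({1, 2}):
        return 4
    return {0: 0, 1: 2, 2: 3, 3: 4, 4: 4}[len(s)]


def satisfies_polymatroid_axioms(f):
    """Brute-force check of (R1), (R2) and (R3) over all pairs of subsets."""
    if f(frozenset()) != 0:
        return False
    for a in SUBSETS:
        for b in SUBSETS:
            if a <= b and f(a) > f(b):
                return False
            if f(a) + f(b) < f(a | b) + f(a & b):
                return False
    return True


def mutual_information(f, a, b, c=()):
    """I_f(a; b | c) = f(a u c) + f(b u c) - f(a u b u c) - f(c)."""
    a, b, c = frozenset(a), frozenset(b), frozenset(c)
    return f(a | c) + f(b | c) - f(a | b | c) - f(c)


def zhang_yeung_gap(f):
    """Right-hand side minus left-hand side of the assumed inequality."""
    I = mutual_information
    rhs = (I(f, {1}, {2}) + I(f, {1}, {3, 4})
           + 3 * I(f, {3}, {4}, {1}) + I(f, {3}, {4}, {2}))
    return rhs - 2 * I(f, {3}, {4})


print(satisfies_polymatroid_axioms(h), zhang_yeung_gap(h))
```

A negative gap for a function passing the axiom check means that the function lies in Γ(N4) but cannot be almost entropic, which is exactly why the inequality is not implied by the Shannon inequalities.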

Sketch of proof of Theorem 2:
Let h be the entropy function induced by a set of discrete random variables $\{X_1, X_2, X_3, X_4\}$ whose underlying distribution is p. Construct two auxiliary random variables $X_1'$ and $X_2'$ such that the joint distribution of $(X_1, X_2, X_3, X_4, X_1', X_2')$ is $p(x_1, x_2, x_3, x_4)\, p(x_1', x_2', x_3, x_4) / p(x_3, x_4)$ whenever $p(x_3, x_4) > 0$, and zero otherwise (11). It is easy to see that the marginals of $\{X_1, X_2, X_3, X_4\}$ and $\{X_1', X_2', X_3, X_4\}$ are the same. By invoking the basic Shannon inequalities (involving six random variables), we can prove that Similarly, we can also prove that and consequently, Again, by invoking only Shannon's inequalities, it can be proved that Combining (15) and (17), the theorem is proved.
Remark: In the above proof of Theorem 2, the non-Shannon inequality is proved by invoking only a sequence of Shannon inequalities. This seems impossible at first glance since, by definition, non-Shannon inequalities are exactly those inequalities that are not implied by Shannon inequalities. The trick, however, is to apply Shannon inequalities over a larger set of random variables.
Using the geometric framework obtained earlier, we will describe in the following a "geometric interpretation" for the proof of the non-Shannon's inequality. Consider for all α ⊆ N .Similarly, for any subset Now, suppose that one can construct two cones Υ and C such that From the conditions 1 and 2, we have Again, using Theorem 1, we can prove that an information inequality Equivalently, the inequality ( 18) is valid if the minimum of the following linear program is zero.
Remark: Instead of verifying whether an information inequality is valid or not, we can also use the Fourier-Motzkin elimination method to find all linear inequalities that define the cone proj N (C ∩ Υ).
Clearly, each such inequality corresponds to a valid information inequality over $\{X_i, i \in \mathcal{N}\}$. Now, we revisit the non-Shannon inequality in Theorem 2. Let $\mathcal{N} = \{1, 2, 3, 4\}$ and $\mathcal{M} = \{1, 2, 3, 4, 1', 2'\}$. Given any random variables $\{X_1, \ldots, X_4\}$, construct two random variables $X_1'$ and $X_2'$ such that the probability distribution of $\{X_1, X_2, X_3, X_4, X_1', X_2'\}$ is given by (11). Let g be the entropy function of $\{X_1, \ldots, X_4\}$ and h be the entropy function of $\{X_1, X_2, X_3, X_4, X_1', X_2'\}$. Then it is easy to see that for all i ∈ {1, 2} and β ⊆ {3, 4}, and Υ = Γ(M) (which is the set of all functions h that satisfy the polymatroidal axioms). Then clearly It can be numerically verified that the minimum of the linear program in (19) is zero when the information inequality is the non-Shannon inequality (9). Consequently, the non-Shannon inequality is indeed proved.

Non-Polyhedral Property
In the previous subsection, we discussed a promising technique for proving (or even discovering) new information inequalities. Using the technique proposed in [1], more and more linear information inequalities have been discovered [8,13-15]. Later in [9], Matúš obtained a countably infinite set of linear information inequalities for a set of four random variables. Using the same set of inequalities, Matúš further proved that $\bar{\Gamma}^*(\mathcal{N}_4)$ is not polyhedral. In the following, we review Matúš' inequalities and their relaxation.
Remark: The non-polyhedral property of $\bar{\Gamma}^*(\mathcal{N}_4)$ was later used in [16] to show that the set of achievable tuples of a network is in general also non-polyhedral. As a result, the Linear Programming bound is not tight in general.
where for any distinct elements i, j, k ∈ N 4 , 12,34 g g( 13) + g (23) + g( 14)
While Matúš proved a series of linear information inequalities, it is sometimes difficult to use this infinite number of inequalities at the same time. In [17], the series of Matúš' inequalities is relaxed to a single non-linear inequality.
Remark: Using one single nonlinear inequality, it can be proved that the set of all almost entropic functions is not polyhedral.
Theorem 4 (Quadratic information inequality [17]) and consequently, Remark: Subject to the constraint that b(g) > 2a(g), then the series of linear inequalities ( 20) is implied by the Shannon inequalities.Therefore, the constraint (i.e., b(g) > 2a(g)) we imposed on Theorem 4 is not critical.

Equivalent Frameworks
In the previous section, we described a framework for information inequalities for discrete random variables and demonstrated a common proof technique. In this section, we construct several different frameworks which are "equivalent" or "almost equivalent" to the earlier one. These equivalence relations among different frameworks turn out to be very useful in deriving new information-theoretic tools.

Differential Entropy
The previous framework for information inequalities assumes that all random variables are discrete. A very natural extension of the framework is thus to relax this restriction by allowing random variables to be continuous. To achieve this goal, we first need an analogue of discrete entropy for continuous random variables.
Definition 1 (Differential entropies) Let $\{X_i, i \in \mathcal{N}\}$ be a set of real-valued continuous random variables. For any $\alpha \subseteq \mathcal{N}$, let $f_\alpha(x_i, i \in \alpha)$ be the density function of $(X_i, i \in \alpha)$. Then the differential entropy of $(X_i, i \in \alpha)$ is defined as $H(X_\alpha) \triangleq -\int f_\alpha(x_\alpha) \log f_\alpha(x_\alpha)\, dx_\alpha$.
Remark: For notational simplicity, we abuse notation by using H(X) to denote both discrete and differential entropies. The exact meaning should be clear from the context.
Discrete and differential entropies share some properties but differ in others. The main difference is that differential entropy can be negative, unlike discrete entropy. However, mutual information and its conditional counterpart (defined analogously as in (7)) remain nonnegative. In fact, as we shall see, the sets of information inequalities for discrete and continuous random variables are almost the same.
For any information inequality $\sum_{\alpha \subseteq \mathcal{N}} c_\alpha H(X_\alpha) \ge 0$ or expression $\sum_{\alpha \subseteq \mathcal{N}} c_\alpha H(X_\alpha)$, its n-th residual weight $r_n$ is defined as $r_n \triangleq \sum_{\alpha : n \in \alpha} c_\alpha$. Clearly, an information inequality is balanced if and only if $r_n = 0$ for all $n \in \mathcal{N}$.
Example 1 The residual weights $r_1$, $r_2$ of the information inequality $H(X_1) + H(X_2) \ge 0$ are both equal to one. Hence, the inequality is not balanced.
For any information inequality α⊆N c α H(X α ) ≥ 0, its balanced counterpart is the following inequality which is balanced (as its name suggests).
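Residual weights are easy to compute mechanically. The sketch below represents an inequality as a dictionary from subsets to coefficients and, as an assumption not spelled out in the surviving text, takes the balanced counterpart to be the original expression minus the sum over n of r_n H(X_n | X_{N\n}). Function names are illustrative.

```python
def residual_weights(coeffs, n):
    """r_i = sum of c_alpha over all subsets alpha that contain i."""
    return {i: sum(c for alpha, c in coeffs.items() if i in alpha)
            for i in range(1, n + 1)}


def balanced_counterpart(coeffs, n):
    """Subtract r_i * H(X_i | X_{N \\ i}) = r_i * (h(N) - h(N \\ {i})) for each i.

    Assumed form of the balanced counterpart; see the lead-in above.
    """
    N = frozenset(range(1, n + 1))
    out = dict(coeffs)
    for i, r in residual_weights(coeffs, n).items():
        out[N] = out.get(N, 0) - r
        rest = N - {i}
        out[rest] = out.get(rest, 0) + r
    # Drop the empty set (h(empty) = 0) and zero coefficients.
    return {a: c for a, c in out.items() if a and c != 0}


# Example 1: H(X1) + H(X2) >= 0.
coeffs = {frozenset({1}): 1, frozenset({2}): 1}
balanced = balanced_counterpart(coeffs, 2)
print(residual_weights(coeffs, 2), balanced)
```

For Example 1 the result is 2H(X1) + 2H(X2) - 2H(X1, X2) ≥ 0, i.e., twice the nonnegativity of I(X1; X2), whose residual weights are all zero.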
Proposition 1 (Necessity and sufficiency of balanced inequalities [6]) An information inequality $\sum_{\alpha \subseteq \mathcal{N}} c_\alpha H(X_\alpha) \ge 0$ is a valid discrete information inequality if and only if (1) its residual weights satisfy $r_n \ge 0$ for all n, and (2) its balanced counterpart is also valid.
Consequently, all valid discrete information inequalities are implied by the set of all valid balanced inequalities and the nonnegativity of (conditional) entropies.
It turns out that this set of balanced information inequalities also plays the same significant role for inequalities involving continuous random variables.
Theorem 5 (Equivalence [6]) All valid information inequalities for continuous random variables are balanced. Furthermore, a balanced information inequality is valid for continuous random variables if and only if it is also valid for discrete random variables.
By Theorem 5, to characterise information inequalities, it is sufficient to consider only balanced information inequalities which are the same for either discrete or continuous random variables.

Inequalities for Kolmogorov Complexity
The second framework we will describe is quite different from the earlier information-theoretic frameworks. For information inequalities, the objects of interest are random variables. For the following Kolmogorov complexity framework, however, the objects of interest are deterministic strings.
To understand what Kolmogorov complexity is, let us consider the following example: Suppose that $x_1$ and $x_2$ are two binary strings, where $x_1$ is highly regular and $x_2$ is obtained by flipping a fair coin. The Kolmogorov complexity of a string x (denoted by K(x)) is the minimal program length required to output that string [18]. In the above example, it is clear that the Kolmogorov complexity of $x_1$ is much smaller than that of $x_2$.
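Kolmogorov complexity is not computable, but the length of a compressed encoding gives a crude upper-bound proxy that makes the contrast between a regular and a random string tangible. The sketch below uses zlib for this purpose; the two strings are illustrative stand-ins, not the strings of the original example.

```python
import random
import zlib


def compressed_length(bits: str) -> int:
    """Length in bytes of the zlib-compressed string: a rough proxy only."""
    return len(zlib.compress(bits.encode("ascii"), 9))


rng = random.Random(0)                                    # fixed seed for repeatability
x1 = "01" * 500                                           # highly regular, 1000 symbols
x2 = "".join(rng.choice("01") for _ in range(1000))       # simulated fair coin flips

print(compressed_length(x1), compressed_length(x2))
```

The regular string compresses to a small fraction of the size of the coin-flip string, mirroring the statement that K(x1) is much smaller than K(x2).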
Although the objects of interest are different, [3] proved the surprising result that inequalities for Kolmogorov complexities and for entropies are essentially the same.

is valid if and only if the corresponding Kolmogorov complexity inequality defined below
is also valid.

A Group-Theoretic Framework
Besides Kolmogorov complexities, information inequalities are also closely related to group-theoretic inequalities [4]. To understand their relation, we first illustrate how to construct a random variable from a subgroup.
Definition 3 (Group-theoretic construction of random variables) Let G be a finite group and U be a random variable that takes values in G uniformly. In other words, $\Pr(U = i) = 1/|G|$ for all i ∈ G.
Any subgroup K of G partitions G into |G|/|K| left (or right) cosets of K in G, each of which has exactly |K| elements. Note that each left coset can be written as the subset $g \circ K \triangleq \{g \circ k : k \in K\}$ for some g ∈ G, where ∘ is the binary group operator. Let $\Omega_K$ be the collection of all left cosets of K in G. The subgroup K induces a random variable $X_K$, which is defined as the random left coset of K in G that contains U. In fact, $X_K$ is equal to the coset $U \circ K$. Since U is uniformly distributed over G, we can easily prove that $X_K$ is uniformly distributed over $\Omega_K$ and that $H(X_K) = \log(|G|/|K|)$. The above construction of a random variable from a subgroup can be extended naturally to multiple subgroups.
Theorem 7 (Group characterisable random variables [4]) Let G be a finite group and $\{G_i, i \in \mathcal{N}\}$ be a set of subgroups of G. For each $i \in \mathcal{N}$, let $X_i$ be the random variable induced by the subgroup $G_i$ as defined above. Then for any $\alpha \subseteq \mathcal{N}$, $H(X_\alpha) = \log \frac{|G|}{|\bigcap_{i \in \alpha} G_i|}$.
Clearly, random variables induced by a set of subgroups must satisfy all valid information inequalities. Therefore, we have the following theorem.
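Theorem 8 (Group-theoretic inequalities [4]) Let $\sum_{\alpha \subseteq \mathcal{N}} c_\alpha H(X_\alpha) \ge 0$ (30) be a valid information inequality. Then for any finite group G and its subgroups $\{G_i, i \in \mathcal{N}\}$, we have $\sum_{\alpha \subseteq \mathcal{N}} c_\alpha \log \frac{|G|}{|\bigcap_{i \in \alpha} G_i|} \ge 0$, or equivalently, $\prod_{\alpha \subseteq \mathcal{N}} \left( \frac{|G|}{|\bigcap_{i \in \alpha} G_i|} \right)^{c_\alpha} \ge 1$ (32).
Theorem 8 shows that we can directly "translate" any information inequality into a group-theoretic inequality. A very surprising result proved in [4] is that the converse also holds.
Theorem 9 (Converse [4]) The information inequality (30) is valid if it is satisfied by all random variables induced by groups, or equivalently, if the group-theoretic inequality (32) is valid.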
Theorems 8 and 9 suggest that, to prove an information inequality, it is necessary and sufficient to verify that the inequality is satisfied by all random variables induced by groups. Later, we will illustrate how to use the two theorems to derive a group-theoretic proof for information inequalities.
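The group-theoretic construction can be tried out on a small example. The sketch below uses the cyclic group Z_6 with the subgroups {0, 3} and {0, 2, 4}, both chosen purely for illustration, builds the coset-valued random variables, and compares their joint entropies with log(|G| / |intersection of the G_i|), the formula assumed here for Theorem 7.

```python
import math
from collections import Counter

G = range(6)                                   # Z_6 under addition modulo 6
SUBGROUPS = {1: frozenset({0, 3}), 2: frozenset({0, 2, 4})}


def coset(u, K):
    """The coset u + K of the subgroup K in Z_6."""
    return frozenset((u + k) % 6 for k in K)


def joint_entropy(alpha):
    """H(X_alpha) in bits, where X_i is the coset of G_i containing a uniform U."""
    counts = Counter(tuple(coset(u, SUBGROUPS[i]) for i in alpha) for u in G)
    return -sum((c / 6) * math.log2(c / 6) for c in counts.values())


def predicted_entropy(alpha):
    """log2(|G| / |intersection of G_i, i in alpha|)."""
    inter = frozenset(G)
    for i in alpha:
        inter &= SUBGROUPS[i]
    return math.log2(6 / len(inter))


for alpha in [(1,), (2,), (1, 2)]:
    print(alpha, joint_entropy(alpha), predicted_entropy(alpha))
```

Here the two subgroups intersect trivially, so the pair (X_1, X_2) determines U and the two induced random variables are independent.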
In the following, we show that many statistical properties of random variables induced by groups have analogous algebraic interpretations.
Lemma 1 (Properties of group induced random variables) Suppose that {X i , i ∈ N} is a set of random variables induced by a finite group G and its subgroups {G i , i ∈ N }.Then Hence, functional dependency is equivalent to subset relation; 2. (Independency) I(X i ; X j |X l ) = 0 if and only if 3. (Conditioning preserves group characterisation) for any fixed any α ⊆ N , the group K ∩ i∈α K i and its subgroups

In other words, for any group characterisable $h \in \mathcal{H}[\mathcal{N}]$, let $g \in \mathcal{H}[\mathcal{N}]$ be such that $g(\beta) = h(\beta|\alpha)$ for all $\beta \subseteq \mathcal{N}$. Then g is also group characterisable.
Proposition 2 (Duality [19]) Let $\{V_1, \ldots, V_n\}$ be a set of vector subspaces of $V \triangleq \mathbb{F}^m$ over the finite field $\mathbb{F}$. Define the following subspace $W_i$ for $i \in \mathcal{N}$: Then, for any $\alpha \subseteq \mathcal{N}$,
Remark: While $W^\perp$ and $W$ are both subspaces of V and dim W +dim
Theorems 8 and 9 suggest that proving an information inequality (30) is equivalent to proving a group-theoretic inequality (32). In the following, we illustrate the idea by providing a group-theoretic proof for the nonnegativity of mutual information.
Example 2 (Group-theoretic proof) Let G be a finite group and $G_1$ and $G_2$ be its subgroups. Let $S \triangleq \{a \circ b : a \in G_1, b \in G_2\}$, where ∘ is the binary group operator. As S is a subset of G, |S| ≤ |G|. With a simple counting argument (by removing duplications), it can be proved easily that $|S| = |G_1|\,|G_2| / |G_1 \cap G_2|$. Finally, according to Theorems 8 and 9, the inequality (35) follows.
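The translation step behind Example 2 can be written out as follows. This is a sketch that assumes the entropy formula $H(X_\alpha) = \log(|G| / |\bigcap_{i \in \alpha} G_i|)$ for group-induced random variables and takes (35) to be the nonnegativity of $I(X_1; X_2)$.

```latex
\begin{align*}
I(X_1; X_2) \ge 0
 &\iff \log\frac{|G|}{|G_1|} + \log\frac{|G|}{|G_2|}
       - \log\frac{|G|}{|G_1 \cap G_2|} \ge 0 \\
 &\iff |G| \, |G_1 \cap G_2| \ge |G_1| \, |G_2| \\
 &\iff |G| \ge \frac{|G_1| \, |G_2|}{|G_1 \cap G_2|}
       = \bigl| \{ a \circ b : a \in G_1,\ b \in G_2 \} \bigr| .
\end{align*}
```

The last equality holds because each product $a \circ b$ arises from exactly $|G_1 \cap G_2|$ pairs $(a \circ c,\ c^{-1} \circ b)$ with $c \in G_1 \cap G_2$, and the final inequality holds because the set of products is contained in G.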
It is worth mentioning that Theorems 8 and 9 also suggested an information-theoretic proof for group-theoretic inequalities.For example, the following information inequality implies the following group-theoretic inequality The meaning of this inequality and its implications in group theory are yet to be understood.

Combinatorial Perspective
Random variables that are induced by groups have many interesting properties. One interesting property is that they are quasi-uniform in nature.

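Definition 5 (Quasi-uniform random variables) A set of random variables $\{X_1, \ldots, X_n\}$ is called quasi-uniform if, for all $\alpha \subseteq \mathcal{N}$, $X_\alpha$ is uniformly distributed over its support $\lambda(X_\alpha)$.
Since $X_\alpha$ is uniformly distributed for all $\alpha \subseteq \mathcal{N}$, the entropy $H(X_\alpha)$ is thus equal to $\log |\lambda(X_\alpha)|$.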
According to the Asymptotic Equipartition Property (AEP) [12], for a sufficiently long sequence of independent and identically distributed random variables, the set of typical sequences has a total probability close to one and the probability of each typical sequence is approximately the same. In a certain sense, quasi-uniform random variables possess a non-asymptotic equipartition property: the probabilities are completely concentrated and uniformly distributed over their supports. As a result, quasi-uniform random variables can be fully characterised by their supports (because the probability distributions are uniform over the supports). This offers a combinatorial interpretation for quasi-uniform random variables, which in turn offers a combinatorial approach to proving information inequalities.
Definition 6 (Box assignment) Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be nonempty finite sets and $\mathcal{X}$ be their Cartesian product $\prod_{i=1}^{n} \mathcal{X}_i$. A box assignment A in $\mathcal{X}$ is a nonempty subset of $\mathcal{X}$.
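For any $\alpha \subseteq \mathcal{N}$ and $a_\alpha \triangleq (a_i, i \in \alpha) \in \prod_{i \in \alpha} \mathcal{X}_i$, we define $A_{\mathcal{N}|\alpha}(a_\alpha) \triangleq \{(b_1, \ldots, b_n) \in A : b_i = a_i \text{ for all } i \in \alpha\}$ and $A_\alpha \triangleq \{a_\alpha : A_{\mathcal{N}|\alpha}(a_\alpha) \neq \emptyset\}$. Roughly speaking, $A_{\mathcal{N}|\alpha}(a_\alpha)$ is the set of elements in A whose i-th coordinate is $a_i$ for all $i \in \alpha$.
The set $A_{\mathcal{N}|\alpha}(a_\alpha)$ will be called the $a_\alpha$-layer of A. Hence, $A_\alpha$ contains all $a_\alpha$ such that the $a_\alpha$-layer of A is nonempty. We will call $A_\alpha$ the α-projection of A.
Definition 7 (Quasi-uniform box assignment) A box assignment A is called quasi-uniform if for any $\alpha \subseteq \mathcal{N}$, the cardinality of $A_{\mathcal{N}|\alpha}(a_\alpha)$ is constant for all $a_\alpha \in A_\alpha$. We will denote the constant by $|A_{\mathcal{N}|\alpha}|$ for simplicity.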
The following proposition shows that quasi-uniform box assignments and quasi-uniform random variables are in fact equivalent.
Proposition 3 (Equivalence [7]) Let $\{X_1, \ldots, X_n\}$ be a set of quasi-uniform random variables and A be the support of its probability distribution. Then A is a quasi-uniform box assignment in $\prod_{i \in \mathcal{N}} \mathcal{X}_i$. Furthermore, for all $\alpha \subseteq \mathcal{N}$, $H(X_\alpha) = \log |A_\alpha|$. Conversely, for any quasi-uniform box assignment A, there exists a set of quasi-uniform random variables $\{X_1, \ldots, X_n\}$ whose probability distribution has support A.
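Quasi-uniformity of a box assignment can be tested directly from the definition of layers and projections. The sketch below is a brute-force check over all coordinate subsets; the function names and the 0-based coordinates are illustrative choices.

```python
from collections import Counter
from itertools import combinations


def layer_sizes(A, alpha):
    """Map each a_alpha in the alpha-projection of A to the size of its layer."""
    return Counter(tuple(a[i] for i in alpha) for a in A)


def is_quasi_uniform(A, n):
    """True if, for every alpha, all nonempty a_alpha-layers have the same size."""
    for r in range(n + 1):
        for alpha in combinations(range(n), r):
            if len(set(layer_sizes(A, alpha).values())) != 1:
                return False
    return True


repetition = {(0, 0, 0), (1, 1, 1)}
parity = {(a, b, a ^ b) for a in (0, 1) for b in (0, 1)}
lopsided = {(0, 0), (0, 1), (1, 0)}

print(is_quasi_uniform(repetition, 3),
      is_quasi_uniform(parity, 3),
      is_quasi_uniform(lopsided, 2))
```

The third assignment fails because fixing the first coordinate to 0 leaves two elements while fixing it to 1 leaves only one, so a uniform distribution on it does not have a uniform first marginal.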
As random variables induced by groups are quasi-uniform, by Theorems 8 and 9, we have the following combinatorial interpretation for information inequalities.

is valid if and only if the following box assignment inequality is valid
or equivalently, for all quasi-uniform box assignments A.
Again, in the following example, we will illustrate how to use the combinatorial interpretation to derive a "combinatorial proof" for information inequality.

Example 3 (Combinatorial proof) Let A be a quasi-uniform box assignment in
By Theorem 10, we prove that H(X 1 ) + H(X 2 ) ≥ H(X 1 , X 2 ).

Coding Perspective
We can also view a box assignment A as an error correcting code such that A is the set of all codewords. For each codeword $(a_1, \ldots, a_n)$, $a_i$ is the i-th symbol to be transmitted across a channel. Taking this coding perspective, in the following, a box assignment will simply be called a code. Also, a code C is called a quasi-uniform code if C is a quasi-uniform box assignment. Again, each quasi-uniform code C induces a set of quasi-uniform random variables $\{X_1, \ldots, X_n\}$.
For any code C (which is just a box assignment) and two codewords $c, c' \in C$, the Hamming distance between codewords $c \triangleq (c_1, \ldots, c_n)$ and $c' \triangleq (c_1', \ldots, c_n')$ is defined as $D(c, c') \triangleq |\{i \in \mathcal{N} : c_i \neq c_i'\}|$. In addition, the minimum Hamming distance of the code C is defined as $\min_{c, c' \in C,\, c \neq c'} D(c, c')$. The minimum Hamming distance of a code characterises how strong the error correcting capability of the code is. Specifically, a code C with minimum Hamming distance d can correct up to $\lfloor (d-1)/2 \rfloor$ symbol errors.
Example 4 Let C be a length-3 code containing only two codewords (0, 0, 0) and (1, 1, 1). The minimum Hamming distance of this code is 3, and hence the code can correct any single symbol error. For instance, suppose the codeword (0, 0, 0) is transmitted. If a symbol error occurs, the receiver will receive either (1, 0, 0), (0, 1, 0) or (0, 0, 1). In any case, the receiver can always determine which symbol is erroneous (by using a bounded-distance decoder) and hence can correct it.
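Example 4 can be reproduced in a few lines. The sketch below uses a nearest-codeword decoder, which for this code and a single symbol error gives the same result as the bounded-distance decoder mentioned in the example; the function names are illustrative.

```python
from itertools import combinations


def hamming_distance(c, d):
    """Number of coordinates in which the two words differ."""
    return sum(a != b for a, b in zip(c, d))


def minimum_distance(code):
    """Smallest Hamming distance between two distinct codewords."""
    return min(hamming_distance(c, d) for c, d in combinations(code, 2))


def nearest_codeword(code, received):
    """Decode to a codeword closest to the received word."""
    return min(code, key=lambda c: hamming_distance(c, received))


CODE = [(0, 0, 0), (1, 1, 1)]
d = minimum_distance(CODE)
correctable = (d - 1) // 2
decoded = [nearest_codeword(CODE, r) for r in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]]
print(d, correctable, decoded)
```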
In addition to the minimum Hamming distance, in many cases, a code's distance profile is also of great importance: Let C be a code and c be a codeword in C. The distance profile of C centered at c is a set of integers $A(C, c) \triangleq \{A_r(c) : r = 1, \ldots, n\}$ where $A_r(c) \triangleq |\{c' \in C : D(c, c') = r\}|$. In other words, $A_r(c)$ is the number of codewords in C whose Hamming distance to the centering codeword c is r.
The profile A(C, c) contains information about how likely a decoding error (i.e., the receiver decodes a wrong codeword) is to occur if the transmitted codeword is c. In general, the distance profile A(C, c) depends on the choice of c. A code is called distance-invariant if its distance profile A(C, c) is independent of c. Roughly speaking, a distance-invariant code is one where the probability of decoding error is the same for all transmitted codewords c ∈ C.
Theorem 11 (Distance invariance [20]) Quasi-uniform codes are distance-invariant.
Example 5 (Linear codes) Let P be an (n − k) × n parity check matrix (over a finite field $\mathbb{F}$) and let the code C be defined by $C = \{c \in \mathbb{F}^n : P c^\top = 0\}$. Then C is called a linear code. Note that, for a linear code, if $c_1, c_2 \in C$, then $c_1 + c_2$ is also contained in C. Linear codes are quasi-uniform codes and hence are also distance-invariant.
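Distance invariance of a linear code can be observed on a small binary example. In the sketch below, the parity check matrix P is an arbitrary illustrative choice, the code is taken to be the set of binary words c with P c^T = 0, and a three-word non-quasi-uniform code is included for contrast.

```python
from itertools import product


def linear_code(P, n):
    """All binary words c of length n with P c^T = 0 over GF(2)."""
    return [c for c in product((0, 1), repeat=n)
            if all(sum(p * x for p, x in zip(row, c)) % 2 == 0 for row in P)]


def distance_profile(code, c):
    """[A_1(c), ..., A_n(c)]: number of codewords at each distance from c."""
    profile = [0] * (len(c) + 1)
    for d in code:
        profile[sum(a != b for a, b in zip(c, d))] += 1
    return profile[1:]


P = [[1, 1, 0, 1],
     [0, 1, 1, 1]]                       # illustrative 2 x 4 parity check matrix
code = linear_code(P, 4)
profiles = {c: distance_profile(code, c) for c in code}
print(profiles)

lopsided = [(0, 0), (0, 1), (1, 0)]     # not quasi-uniform
print(distance_profile(lopsided, (0, 0)), distance_profile(lopsided, (0, 1)))
```

Every codeword of the linear code sees the same profile, whereas the profile of the second code depends on the chosen centre.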
In the following, we consider only quasi-uniform codes. For simplicity, we assume without loss of generality that there is a zero codeword 0 ∈ C (by renaming). Also, for any c ∈ C, we define the Hamming weight of the codeword c (denoted by D(c)) as D(c, 0).

Definition 8 (Weight enumerator) The weight enumerator of a quasi-uniform code C with length n is where x and y are indeterminates, and $A_r \triangleq A_r(0)$. Using simple counting, it is easy to prove that
In many cases, it is more convenient to work with the weight enumerator than with the distance profile. However, conceptually, they are equivalent (i.e., they can be uniquely obtained from each other). Clearly, the weight enumerator is uniquely determined by the code C. However, what "structural property" of the code C determines the weight enumerator? For example, suppose that we construct a new code from C by exchanging the first and the second codeword symbols. It is obvious that this modification will not affect the weight enumerator. In other words, the ordering of the codeword symbols has no effect on the weight enumerator. The question therefore is: What property of a code has a direct effect on the weight enumerator?
To answer the question, let us use the old perspective that a quasi-uniform code is merely a quasi-uniform box assignment (and also its associated set of quasi-uniform random variables). These random variables $\{X_1, \ldots, X_n\}$ have a simple interpretation here: Suppose a codeword $C = (C_1, \ldots, C_n)$ is randomly and uniformly selected from the code. Then $X_i$ is the i-th symbol in the random codeword C, i.e., $X_i = C_i$. Our answer to the above question is given in the following theorem.
Theorem 12 (Generalised Greene's Theorem [20]) Let C be a quasi-uniform code and $\{X_1, \ldots, X_n\}$ be its induced quasi-uniform random variables. Suppose that ρ is the entropy function of $\{X_1, \ldots, X_n\}$. In other words, $\rho(\alpha) = H(X_i, i \in \alpha)$. Then
Remark: Greene's Theorem is a special case of Theorem 12 when the code C is a linear code. By Theorem 12, the weight enumerator (and also the error-correcting capability) of a quasi-uniform code depends only on the entropy function induced by the codeword symbol random variables. By exploiting the relation between the entropy function of a set of quasi-uniform random variables and the weight enumerator of the induced code, we open a new door to harnessing coding theory results to derive new information theory results.
Example 6 (Code-theoretic proof) Consider a set of quasi-uniform random variables $\{X_1, X_2\}$ which induces a length-2 quasi-uniform code C. By the Generalised Greene's Theorem, the number of codewords which have Hamming weight 1 is given by As $A_1$ is nonnegative, (50) implies that Finally, by Theorem 10 (or, to be precise, a variation of it), an information inequality holds if and only if it also holds for all quasi-uniform random variables. Consequently, (51) holds for all random variables.

Constrained Information Inequalities
In previous sections, we considered general information inequalities where we do not impose any constraint on the choice of random variables. In the following, we focus on two constrained classes of information inequalities: subspace rank inequalities and determinantal inequalities.

Rank Inequalities
Let $\{V_1, \ldots, V_n\}$ be a set of vector subspaces over a field $\mathbb{F}$. A subspace rank inequality is an inequality about the rank or dimension of subspaces in the following form: $\sum_{\alpha \subseteq \mathcal{N}} c_\alpha \dim \langle V_i, i \in \alpha \rangle \ge 0$, where $\langle V_i, i \in \alpha \rangle$ denotes the subspace spanned by the $V_i$ with $i \in \alpha$. For example, it is straightforward to prove that $\dim V_1 + \dim V_2 \ge \dim \langle V_1, V_2 \rangle$ (53), which is a direct consequence of the identity $\dim V_1 + \dim V_2 = \dim \langle V_1, V_2 \rangle + \dim (V_1 \cap V_2)$.
Subspace rank inequalities are in fact constrained information inequalities subject to the criterion that the random variables are induced by vector subspaces over a field. Clearly, all valid information inequalities (including all Shannon inequalities) are subspace rank inequalities. For example, the subspace rank inequality (53) is indeed equivalent to the nonnegativity of mutual information. Besides these unconstrained information inequalities, the most well-known subspace rank inequalities are the Ingleton inequalities [21]. A recent work [22] proved that Ingleton inequalities also include Shannon inequalities as special cases and determined the unique minimal set of Ingleton inequalities that imply all the others.
Theorem 13 (Ingleton inequality) Suppose r is a representable polymatroid over X. Then for every choice of subsets
It was open for years whether there exist subspace rank inequalities that are not implied by the Ingleton and Shannon inequalities. It was not until recently that the question was finally answered. In [5], the insufficiency of the Ingleton inequalities to characterise all subspace rank inequalities was proved, and in [23,24], new subspace rank inequalities not implied by the Ingleton inequalities were explicitly constructed. In fact, the subspace rank inequalities for up to five variables have all been determined. However, a complete characterisation involving more than five variables is still missing. In the following, we review some of the important results along this line of work.
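The Ingleton inequality can be spot-checked on randomly generated subspaces. The sketch below works over GF(2), encodes vectors as integer bitmasks, and assumes the commonly stated form r(1) + r(2) + r(123) + r(124) + r(34) ≤ r(12) + r(13) + r(14) + r(23) + r(24), where r of a set of indices is the dimension of the span of the corresponding subspaces. The random generators and all names are illustrative; passing such a check is evidence only, not a proof.

```python
import random


def rank_gf2(vectors):
    """Rank over GF(2) of vectors encoded as integer bitmasks."""
    pivots = {}                       # leading-bit position -> reduced vector
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in pivots:
                pivots[top] = v
                break
            v ^= pivots[top]          # cancel the leading bit and continue
    return len(pivots)


def ingleton_gap(gens):
    """Right-hand side minus left-hand side of the assumed Ingleton inequality.

    gens maps each index 1..4 to a list of generators of the subspace V_i.
    """
    def r(*idx):
        return rank_gf2([v for i in idx for v in gens[i]])

    lhs = r(1) + r(2) + r(1, 2, 3) + r(1, 2, 4) + r(3, 4)
    rhs = r(1, 2) + r(1, 3) + r(1, 4) + r(2, 3) + r(2, 4)
    return rhs - lhs


rng = random.Random(1)
gaps = [
    ingleton_gap({i: [rng.randrange(1, 2 ** 6) for _ in range(rng.randint(1, 3))]
                  for i in range(1, 5)})
    for _ in range(200)
]
print(min(gaps))
```

Since the inequality holds for every arrangement of subspaces, the smallest observed gap should never be negative.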
Theorem 14 (Kinser [23]) Suppose $\mathcal{X} = \{X_1, \ldots, X_n\}$ and h is representable over $\mathcal{X}$. Then

Or equivalently,

Theorem 15 (Dougherty et al. [24]) Suppose $\mathcal{X} = \{A, B, C_1, \ldots, C_n\}$ and h is representable over $\mathcal{X}$. Then

Remark: In addition to the inequalities obtained in Theorem 15, the work [24] found all subspace rank inequalities in five variables (called DFZ inequalities) and many other new inequalities in six variables.
Definition 9 ($\ell$-truncation) Let h be a polymatroid over $\mathcal{Y}$ and $0 \le \ell \le h(\mathcal{Y})$. Define g as follows

where

Then g is called the $\ell$-truncation of h.

Definition 10 (Truncation-preserving inequalities) Let
is said to preserve truncation (or is truncation-preserving) if for any h satisfying all the inequalities in (60), its truncation also satisfies all the inequalities.
Theorem 16 (Insufficiency of truncation-preserving inequalities [5]) Let $\Delta_n$ be the set of all subspace rank inequalities involving n variables (or subspaces). Then for sufficiently large n, $\Delta_n$ is not truncation-preserving.

Determinantal Inequalities
Information inequalities for Gaussian random variables are another interesting class of information inequalities. As we shall see, they are equivalent to determinantal inequalities.
Definition 11 (Gaussian polymatroid) Let h be a polymatroid over N. It is called Gaussian if there exists a set of jointly Gaussian random variables $\{Y_j, j \in \mathcal{K}\}$ with a $|\mathcal{K}| \times |\mathcal{K}|$ covariance matrix and a partition of $\mathcal{K}$ into n disjoint nonempty subsets $\beta_1, \ldots, \beta_n$ such that for any $\alpha \subseteq N$,

where $X_i = (Y_j, j \in \beta_i)$ for all $i \in N$. Furthermore, h is called weakly Gaussian if there exists $\delta > 0$ such that $\delta h$ is Gaussian, and almost Gaussian if h is the limit of a sequence of weakly Gaussian functions.
It is straightforward to prove that the weakly Gaussian property is closed under addition. In other words, if h and g are weakly Gaussian, then their sum h + g is also weakly Gaussian. Furthermore, as with information inequalities for continuous random variables in general, if an inequality holds for all Gaussian random variables $\{X_i, i \in N\}$, then it must be balanced [25]. Therefore, in the following, we will only consider balanced information inequalities.

Let $\{Y_j, j \in \mathcal{K}\}$ be a set of jointly Gaussian random variables with covariance matrix K, which is a $|\mathcal{K}| \times |\mathcal{K}|$ positive definite matrix. Suppose $\mathcal{K}$ is partitioned into n disjoint nonempty subsets $\beta_1, \ldots, \beta_n$. A very compelling property of a set of Gaussian random variables is that its entropy and the determinant of its covariance matrix are related as follows: for any $\beta \subseteq \mathcal{K}$,
\[ H(Y_j, j \in \beta) = \frac{1}{2} \log\left( (2\pi e)^{|\beta|} \det K_\beta \right), \]
with $K_\beta$ the principal submatrix of K indexed by $\beta$. As a direct consequence, for any valid information inequality, we can use the above relation to derive a corresponding determinantal inequality. For example, the following well-known determinantal inequalities can all be proved using this "information-theoretical method".
1. (Hadamard inequality) Let K be an $n \times n$ positive definite matrix. Then
\[ \det K \le \prod_{i=1}^{n} K_{i,i}, \]
where $K_{i,i}$ is the i-th diagonal entry of K. This inequality follows from the following information inequality
\[ H(X_1, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i). \]

This determinantal inequality follows from the following information inequality

Finally, we will conclude this section with the following open question: While a Gaussian polymatroid is clearly almost entropic, is it true that an almost entropic polymatroid is almost Gaussian? In other words, for any almost entropic polymatroid h, can we construct a sequence of Gaussian polymatroids $\{g_i, i = 1, \ldots\}$ such that $\lim_{i \to \infty} \delta_i g_i = h$ for some $\delta_i > 0$ for all i?
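The "information-theoretical method" of this section can be illustrated numerically for the Hadamard inequality. The sketch below assumes the standard Gaussian differential entropy formula $\frac{1}{2}\log((2\pi e)^m \det K)$ in nats for an m-dimensional Gaussian vector; the function name `gaussian_entropy` and the randomly generated covariance matrix are illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian_entropy(K, idx):
    """Differential entropy (in nats) of the Gaussian sub-vector indexed by idx."""
    idx = list(idx)
    sub = K[np.ix_(idx, idx)]              # principal submatrix of K
    _, logdet = np.linalg.slogdet(sub)     # log-determinant, numerically stable
    return 0.5 * (len(idx) * np.log(2 * np.pi * np.e) + logdet)

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
K = A @ A.T + 0.1 * np.eye(5)              # a positive definite covariance matrix

# Information inequality: the joint entropy is at most the sum of the marginals.
joint = gaussian_entropy(K, range(5))
marginals = sum(gaussian_entropy(K, [i]) for i in range(5))
entropy_gap = marginals - joint

# Hadamard inequality in the log domain: sum_i log K_ii - log det K >= 0.
hadamard_gap = np.sum(np.log(np.diag(K))) - np.linalg.slogdet(K)[1]

print(entropy_gap, hadamard_gap)
```

Because the inequality is balanced, the $(2\pi e)$ terms cancel and the entropy gap is exactly half of the Hadamard gap in the log domain, which is what the two printed numbers reflect.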

Summary and Conclusions
In this paper, we have reviewed some of the recent progress in the characterisation of information inequalities. We first began with a geometric framework for information inequalities, which has simplified the understanding of information inequalities. We also reviewed how the first non-Shannon inequality was proved and highlighted the general idea behind the proof. Next, we studied the infinite series of inequalities over $N_4$ and considered a nonlinear relaxation of the series of inequalities.
We have also reviewed how information inequalities are related to Kolmogorov complexity inequalities, group-theoretic inequalities and inequalities for box assignments. Based on these relations, we demonstrated non-traditional approaches to proving information inequalities.
Finally, we investigated two constrained classes of information inequalities. The first class is when the random variables are induced by vector spaces. In this case, the constrained inequalities are equivalent to subspace rank inequalities. We showed that Ingleton and DFZ inequalities are insufficient to characterise all subspace rank inequalities in general, since the set of all subspace rank inequalities is not truncation-preserving. The second constrained class of inequalities is when the random variables are Gaussian. We have shown that these constrained inequalities are in fact determinantal inequalities.
As a final remark, we would like to emphasise that this survey paper does not aim to cover every aspect of information inequalities. In fact, there are many interesting pieces of work that we did not cover. For example, as pointed out by one of the reviewers, one very interesting area concerns the relation between convex body inequalities and information inequalities [26,27]. We strongly encourage interested readers to explore those relevant areas further.

2. (Szasz inequality) For any $1 \le l < k$,

uniformly distributed over its support. In other words, the value of the probability distribution function of $(X_i, i \in \alpha)$ is either zero or a constant.

Definition 4 A function $h \in H[N]$ is called group characterisable if it is the entropy function of a set of random variables $\{X_1, \ldots, X_n\}$ induced by a finite group G and its subgroups $\{G_1, \ldots, G_n\}$. Furthermore, h is
1. representable if $\{G, G_1, \ldots, G_n\}$ are all vector spaces, and
2. abelian if G is abelian.
where $K_\beta$ is the principal submatrix of K obtained by deleting the rows and columns that are not indexed by $\beta$. Substituting (63) back into (62), the inequality (62) is satisfied by all Gaussian random variables $\{X_i, i \in N\}$ if and only if

$= \bigcup_{i \in \alpha} \beta_i$. Since the inequality (62) is balanced, $\sum_{\alpha \subseteq N : j \in \alpha}$
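The substitution of the Gaussian entropy formula into a balanced inequality can be written out explicitly. The sketch below assumes that (62) has the generic linear form $\sum_{\alpha \subseteq N} c_\alpha H(X_i, i \in \alpha) \ge 0$ and that (63) is the standard Gaussian entropy formula; the coefficients $c_\alpha$ and the shorthand $\beta_\alpha$ are notation adopted here for illustration and may differ from the original equations.

```latex
% Assumed notation: \beta_\alpha = \bigcup_{i \in \alpha} \beta_i, so that
% |\beta_\alpha| = \sum_{j \in \alpha} |\beta_j| because the \beta_j are disjoint.
\begin{align*}
\sum_{\alpha \subseteq N} c_\alpha H(X_i, i \in \alpha)
  &= \frac{1}{2} \sum_{\alpha \subseteq N} c_\alpha
     \Big( |\beta_\alpha| \log(2\pi e) + \log \det K_{\beta_\alpha} \Big) \\
  &= \frac{\log(2\pi e)}{2} \sum_{j \in N} |\beta_j|
     \sum_{\alpha \subseteq N : j \in \alpha} c_\alpha
     + \frac{1}{2} \sum_{\alpha \subseteq N} c_\alpha \log \det K_{\beta_\alpha} \\
  &= \frac{1}{2} \sum_{\alpha \subseteq N} c_\alpha \log \det K_{\beta_\alpha},
\end{align*}
% The last step uses balancedness:
% \sum_{\alpha \subseteq N : j \in \alpha} c_\alpha = 0 for every j \in N.
```

Under these assumptions, the information inequality holds for all Gaussian random variables exactly when $\sum_{\alpha \subseteq N} c_\alpha \log \det K_{\beta_\alpha} \ge 0$ for every positive definite K, which is a purely determinantal statement.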