An Information Theoretic Condition for Perfect Reconstruction

A new information theoretic condition is presented for reconstructing a discrete random variable X based on the knowledge of a set of discrete functions of X. The reconstruction condition is derived from Shannon's 1953 lattice theory, with two entropic metrics of Shannon and Rajski. Because this theoretical material is relatively unknown and appears quite dispersed in different references, we first provide a synthetic description (with complete proofs) of its concepts, such as total, common, and complementary information. The definitions and properties of the two entropic metrics are also fully detailed and shown to be compatible with the lattice structure. A new geometric interpretation of this lattice structure is then investigated, which leads to a necessary (and sometimes sufficient) condition for reconstructing the discrete random variable X given a set {X1, …, Xn} of elements in the lattice generated by X. Intuitively, the components X1, …, Xn of the original source of information X should not be globally "too far away" from X in the entropic distance for X to be reconstructible. In other words, these components should not, overall, have too low a dependence on X; otherwise, reconstruction is impossible. These geometric considerations constitute a starting point for a possible novel "perfect reconstruction theory", which needs to be further investigated and improved along these lines. Finally, this condition is illustrated in five specific examples of perfect reconstruction problems: the reconstruction of a symmetric random variable from the knowledge of its sign and absolute value, the reconstruction of a word from a set of linear combinations, the reconstruction of an integer from its prime signature (fundamental theorem of arithmetic) and from its remainders modulo a set of coprime integers (Chinese remainder theorem), and the reconstruction of the sorting permutation of a list from a minimal set of pairwise comparisons.


Introduction
We consider the problem of perfectly reconstructing a discrete random variable X, based on the knowledge of a finite set X1, X2, ..., Xn of deterministic processings or transformations of X, denoted fi, such that Xi = fi(X). Intuitively, the components Xi are assumed to carry only a partial amount of the "information" present in X, and perfect reconstruction of X would only be possible if the combination of the "informations" in X1, X2, ..., Xn is enough to contain all the original "information" in X. Such intuitive considerations, expressed in the language of information, are very common in signal processing and in many other scientific fields; but to the best of our knowledge, they have never been mathematically formalized. This article aims at formalizing precisely this simple and vague intuition. Such a task implies, in particular, an accurate definition of "information".
Shannon's classical 1948 information theory [14] cannot really answer this question, as it is a theory of the measure of information rather than of the information itself. Fortunately, a "true information" theory was also developed by Claude Shannon in a relatively unknown 1953 paper [16]. In this theory, two random variables X and Y are considered equivalent, denoted X ≡ Y, when each is a deterministic function of the other almost surely (a.s.): X = f(Y) and Y = g(X) a.s. for two deterministic functions f and g.
The relation ≡ is indeed an equivalence relation.

Proof. The relation ≡ is evidently reflexive (take f, g to be the identity function) and symmetric (by permuting the roles of f and g in the definition). It is also transitive by composition: if X ≡ Y and Y ≡ Z, there exist f, g, h and k such that Y = f(X), X = g(Y) and Y = h(Z), Z = k(Y) a.s.; then X = g(h(Z)) = g ∘ h(Z) and Z = k ∘ f(X) a.s.

Proposition 1. X ≡ Y iff (if and only if) there exists a bijective function h such that Y = h(X) a.s.

Proof. If X ≡ Y, then there exist two deterministic functions f and g such that X = f(Y) and Y = g(X) a.s. Thus, X = f(g(X)) a.s. Then, for every value X = x with non-zero probability, f ∘ g(x) = x. Hence, f ∘ g coincides with the identity function a.s. Since the problem is symmetric in X and Y, g ∘ f also coincides with the identity function a.s. Thus, h = g is bijective from the set of values that X can take with non-zero probability to the set of values that Y can take with non-zero probability, and we have Y = g(X) = h(X) a.s.

As suggested by Rajski [11], the equivalence between X and Y can be characterized by means of their joint probability matrix:

Proposition 2 (Matrix Characterization). If we restrict Ω to the elements of non-zero probability measure, then X ≡ Y iff the matrix of joint probabilities P(X = x, Y = y) is (the support of) a permutation matrix, i.e., has exactly one non-zero entry in each row and each column.

Proof. By Proposition 1, X ≡ Y iff there exists a bijective function h such that Y = h(X) a.s. Thus, to each outcome of X corresponds exactly one outcome of Y and vice versa, which is equivalent to saying that the matrix of joint probabilities has exactly one non-zero entry in each row and each column.
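For concreteness, Proposition 2 can be checked numerically: after discarding zero-probability rows and columns, each remaining row and column must contain exactly one non-zero entry. The following Python sketch (the function name is ours, for illustration only) implements this test.

```python
import numpy as np

def are_equivalent(P):
    """Proposition 2: X ≡ Y iff, after removing zero-probability rows/columns,
    the joint probability matrix has exactly one non-zero entry per row and
    per column (i.e., its support is a permutation matrix)."""
    P = np.asarray(P, dtype=float)
    rows = P.sum(axis=1) > 0               # values of X with non-zero probability
    cols = P.sum(axis=0) > 0               # values of Y with non-zero probability
    Q = P[np.ix_(rows, cols)]
    return (np.count_nonzero(Q, axis=1) == 1).all() and \
           (np.count_nonzero(Q, axis=0) == 1).all()

print(are_equivalent([[0.3, 0.0], [0.0, 0.7]]))       # True:  Y is a bijective function of X
print(are_equivalent([[0.25, 0.25], [0.25, 0.25]]))   # False: X and Y are independent
```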
In the following, we shall denote (without risk of confusion) by X the equivalence class of the variable X, and write X = Y for the equality of the two classes X and Y (rather than X ≡ Y).

With this definition, it is clear that the equivalence relation is compatible with any functional relation Y = f(X). If f is not bijective, it is tempting to say that there is less information in Y than in X. Hence the following partial order.

Definition 2 (Partial Order). We write X ≥ Y when there exists a deterministic function f such that Y = f(X) a.s.

We also write Y ≤ X. We are not necessarily considering real-valued variables, so the order X ≥ Y has nothing to do with the usual order on R.
Proposition 3. The relation ≥ is indeed a partial order on the set of equivalence classes of the relation ≡ defined above.
Proof. We first show that the relation ≡ is compatible with the relation ≥. Let X1, X2 and Y1, Y2 be such that X1 ≡ X2 and Y1 ≡ Y2. If X1 ≥ Y1, there exists a deterministic function f such that Y1 = f(X1) a.s. Since X1 ≡ X2, there exists a bijective h such that X1 = h(X2) a.s., hence Y1 = f ∘ h(X2) a.s. and X2 ≥ Y1. Likewise, since Y1 ≡ Y2, there exists a bijective g such that Y2 = g(Y1) a.s., so Y2 = g ∘ f ∘ h(X2) a.s., hence X2 ≥ Y2. This shows that the relation ≥ is well defined on the set of equivalence classes of the relation ≡.
We now show that ≥ is indeed a partial order:

• Reflexivity: X = Id(X), so X ≥ X.

• Antisymmetry: If X ≥ Y and Y ≥ X, then X = f(Y) a.s. and Y = g(X) a.s. for deterministic functions f and g, so X ≡ Y.

• Transitivity: If X ≥ Y and Y ≥ Z, then there exist two deterministic functions f and g such that Z = g(Y) a.s. and Y = f(X) a.s. Then Z = g(f(X)) a.s., hence X ≥ Z.

Structure of the Information Lattice: Joint Information; Common Information
Beyond the partial order, Shannon [16] established the natural mathematical structure of information: it is a lattice, i.e., two variables X, Y always admit a maximum X ∨ Y and a minimum X ∧ Y. Let us recall that these quantities (necessarily unique if they exist) are defined by the relations: for every Z, Z ≥ X ∨ Y if and only if (Z ≥ X and Z ≥ Y), and Z ≤ X ∧ Y if and only if (Z ≤ X and Z ≤ Y). Shannon, in his paper [16], used Boolean notations instead: X + Y for X ∨ Y and X·Y for X ∧ Y.
Proposition 4 (Joint Information). The joint information X ∨ Y of X and Y is the random pair X ∨ Y = (X, Y).

Proof. If X and Y are functions of Z, then the pair (X, Y) is also a function of Z. Conversely, since X and Y are functions of (X, Y), if (X, Y) is a function of Z then so are X and Y.
The definition of X ∧ Y (common information) is more difficult and was not made explicit by Shannon. Following Gács and Körner [5], let us adopt the following definition:

Definition 3. We say that x ∈ X and y ∈ Y communicate, denoted by x ∼ y, if there exists a path x y1 x1 y2 ... yn xn y in which all transitions have non-zero probability: P(X = x, Y = y1) > 0, P(X = x1, Y = y1) > 0, P(X = x1, Y = y2) > 0, ..., P(X = xn, Y = y) > 0.

Proposition 5. The relation ∼ is an equivalence relation on the set of pairs (x, y) for which P(X = x) > 0 and P(Y = y) > 0.
Proof.

• Reflexivity is obvious.

• Symmetry: If x ∼ y, taking the path x . . . y in the other direction gives y ∼ x.

• Transitivity: If x1 ∼ y1, y1 ∼ x2 and x2 ∼ y2, then there exists a path from x1 to y1, another from y1 to x2 and a last one from x2 to y2, all of whose transitions have non-zero probabilities. The concatenated path from x1 to y2 has non-zero transition probabilities, hence x1 ∼ y2.
Definition 4 (Communication Class). The communication class C(x, y) is the equivalence class of (x, y) under ∼.
Remark 1. In order to compute the common information X ∧ Y = C(X, Y) between X and Y in practice, one has to fully determine the communication classes, which is only possible if there is a finite number of classes, each containing a finite number of elements. In other words, X and Y should take a finite number of values. This is the reason why we restrict ourselves to finitely valued variables in this paper.

Remark 2. As in any lattice, X ≤ Y is equivalent to saying that X ∨ Y = Y or that X ∧ Y = X.

Computing Common Information
As shown in the previous Section, the definition of common information is not a simple one, but one can compute it efficiently using the following algorithm. Given two variables X and Y, this algorithm turns the joint probability matrix of (X, Y) into a block-diagonal matrix where each block corresponds to a communication class.

Let X and Y be two random variables taking values in X and Y, respectively. Consider the graph G = (V, E) whose vertex set V is X ∪ Y, and such that vertices x and y of V are connected by an edge if and only if P(X = x, Y = y) > 0. Hence, G is fully described by the joint probability matrix PX,Y. Furthermore, G is a bipartite graph (no edge connects two vertices x1, x2 belonging to X or two vertices y1, y2 belonging to Y).

Then, the communication classes C(X, Y) correspond to the connected components of G. Indeed, a connected component C is a subset of V such that each of its elements is accessible from all the others by a path in the subgraph (C, E). So for any two vertices x, y in the connected component C, there exist y1, x1, ..., yk, xk such that all the edges (x, y1), (y1, x1), ..., (yk, xk), (xk, y) belong to E, that is, all the transition probabilities between these vertices are non-zero, which is equivalent to saying that they belong to the same communication class. Now, it is known that the connected components of G can be determined by a depth-first search.

We propose an algorithm, whose pseudo-code is given in Fig. 1, that takes as input the joint probability matrix PX,Y and outputs a block-diagonal form of PX,Y representing the common information X ∧ Y, together with an array storing the permutation of the columns of PX,Y and an array storing the permutation of the rows of PX,Y. Since the matrix PX,Y is sufficient to fully describe G, we adapt the depth-first search algorithm to browse the rows and columns of the matrix PX,Y to find which of its rows and columns must be swapped in order to write this matrix in block-diagonal form. In this algorithm (Fig. 1), the ith row of PX,Y is represented by the pair (r, i) and the jth column by the pair (c, j).

The complexity of this algorithm can be determined as follows. Let n = Card(X) + Card(Y) be the sum of the alphabet sizes on which X and Y take their values, i.e., the sum of the number of rows and the number of columns of PX,Y. The algorithm passes through each row and column at most once. Indeed, for the index of a row or column to enter the stack, it must be unmarked, but as soon as we put it on the stack, we mark it. Then, each time the index of a row or a column is unstacked, we look at each coefficient of the corresponding row or column. Therefore, our algorithm looks at each coefficient of the joint probability matrix PX,Y exactly once. Four elementary operations are performed each time we cross a non-zero coefficient. Thus, the algorithm complexity is quadratic in n.

Notice that the output of our algorithm gives a visualization of the common information: the stochastic matrix P(X = x, Y = y) is written, after permutation of rows/columns, in a "block diagonal" form where k, the number of blocks, is maximal. The k rectangular blocks then represent the k different equivalence classes, the probability P(C(X, Y) = i) being the sum of all entries in block Ci.
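As an illustration, the depth-first search described above can be sketched as follows in Python. This is only a minimal transcription of the idea of Fig. 1 (not the exact pseudo-code of the figure): it returns the class index of each row and column rather than the permuted matrix.

```python
import numpy as np

def communication_classes(P):
    """Depth-first search over rows and columns of the joint matrix P_{X,Y}:
    the communication classes are the connected components of the bipartite
    graph whose edges are the non-zero entries of P."""
    P = np.asarray(P, dtype=float)
    m, n = P.shape
    row_class = [-1] * m
    col_class = [-1] * n
    k = 0                                          # number of classes found so far
    for start in range(m):
        if row_class[start] != -1 or P[start].sum() == 0:
            continue
        stack = [('r', start)]
        row_class[start] = k
        while stack:
            kind, idx = stack.pop()
            if kind == 'r':                        # visit all columns linked to this row
                for j in range(n):
                    if P[idx, j] > 0 and col_class[j] == -1:
                        col_class[j] = k
                        stack.append(('c', j))
            else:                                  # visit all rows linked to this column
                for i in range(m):
                    if P[i, idx] > 0 and row_class[i] == -1:
                        row_class[i] = k
                        stack.append(('r', i))
        k += 1
    return row_class, col_class, k

# Two blocks: the common information X ∧ Y takes two values, each of probability 0.5.
P = [[0.25, 0.25, 0.0],
     [0.0,  0.0,  0.5]]
print(communication_classes(P))   # ([0, 1], [0, 0, 1], 2)
```

Grouping the rows and columns that share the same class index then yields the block-diagonal form discussed above.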

Proposition 7. The information lattice is bounded:

• The minimal element 0 ("null information") is the equivalence class of all deterministic variables. Thus X = 0 means that X is a deterministic variable.

• The maximal element 1 ("total information") of the lattice is the equivalence class of the identity function Id on Ω.

Proof. If X is any random variable and Z = c a.s. is any deterministic variable, then it is clear that Z = f(X) where f is the constant function equal to c. Letting 0 be the equivalence class of constant variables, one has 0 ≤ X for all X. Also, for any random variable X, X = X ∘ Id, hence X ≤ Id. Letting 1 be the equivalence class of the identity function on Ω, one has X ≤ 1 for all X.
Proposition 8 (Complementary Information). The information lattice is complemented, i.e., any X ≤ Y admits a complement Z ("complementary information") such that X ∨ Z = Y and X ∧ Z = 0. This Z is the information missing from X to obtain Y: it allows Y to be reconstructed from X without requiring more information than necessary. Shannon in [16] did not say how to determine it. The following proof gives an explicit construction.

Proof. Since X ≤ Y, we simply have X = X ∧ Y = C(X, Y). Thus, a given class C(X, Y) = x contains only one value X = x, corresponding in general to several values of Y, say y1, y2, ... (an indexing that depends on the class x). Define Z as the index of the value taken by Y within its class: Z = j when Y = yj. Then Z is a deterministic function of Y, and the pair (X, Z) determines Y, so that X ∨ Z = Y. Finally, the value Z = 1 connects each pair (x, z): every value X = x communicates with Z = 1, so there is only one class according to (X, Z), i.e., X ∧ Z = 0.

This construction can be visualized on the stochastic tensor of (X, Y, Z) described in Fig. 2.

Remark 3. The complementary information Z is not uniquely determined by X and Y. In the above construction, it depends on how the values of Y are indexed within the class X = x.

Computing the Complementary Information
Given X ≤ Y, the algorithm of Fig. 3 determines a random variable Z corresponding to the complementary information from X to Y. This algorithm takes as input the joint probability matrix PX,Y in its block-diagonal form and outputs the tensor of the joint probability PX,Y,Z, where X ∨ Z = Y and X ∧ Z = 0. The tensor is built by spreading the non-zero coefficients of the joint probability matrix PX,Y along the Z-axis, as shown in Fig. 2.

The algorithm looks at each coefficient of the joint probability matrix PX,Y exactly once and performs at most two elementary operations for each coefficient it processes. Therefore, it is quadratic in n = Card(X) + Card(Y) (since the number of coefficients in the matrix PX,Y is quadratic in n).
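A minimal Python sketch of this construction is given below (our own transcription, assuming as in Proposition 8 that X ≤ Y, so that every column of PX,Y has a single non-zero entry); it spreads the non-zero entries of PX,Y along the Z axis as in Fig. 2.

```python
import numpy as np

def complementary_information(P):
    """Sketch of the construction in the proof of Proposition 8, assuming X <= Y.
    Z indexes, within each row X = x, the values of Y compatible with x;
    the non-zero entries of P_{X,Y} are spread along the Z axis."""
    P = np.asarray(P, dtype=float)
    m, n = P.shape
    zmax = max(np.count_nonzero(P[i]) for i in range(m))
    T = np.zeros((m, n, zmax))             # joint tensor P_{X,Y,Z}
    for i in range(m):
        z = 0
        for j in range(n):
            if P[i, j] > 0:
                T[i, j, z] = P[i, j]       # Z = z when X = i and Y = j
                z += 1
    return T

# Example: Y uniform on {0,1,2,3}, X = parity of Y.
P = np.array([[0.25, 0.0, 0.25, 0.0],      # X = 0 (even values of Y)
              [0.0, 0.25, 0.0, 0.25]])     # X = 1 (odd values of Y)
T = complementary_information(P)
print(T.sum(axis=(0, 1)))                  # marginal of Z: [0.5, 0.5]
```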

Is the Information Lattice a Boolean Algebra?
Interestingly, it was Shannon who, as early as 1938 in his master's thesis, used Boolean algebra to study relay-based circuits, "the most important master's thesis of the century", for which Shannon received the Alfred Noble Prize (not to be confused with the Alfred Nobel Prize!) in 1940. But alas, as Shannon noted, his information lattice is not a Boolean algebra.

Proposition 9. The information lattice is not a Boolean algebra.

Indirect Proof. In any Boolean algebra, the complement is unique. As seen above, this is not the case for the information lattice.

Direct Proof. As a second, direct proof, we provide an explicit counterexample to distributivity. Consider the probability space (Ω, P(Ω), P), where Ω = {0, 1, 2, 3} and P is the uniform probability measure, and define three binary random variables X, Y and Z on Ω for which distributivity fails.
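As a concrete (hypothetical) instance of such a counterexample, one may take X and Y to be the two bits of ω and Z their modulo-2 sum; this particular choice is ours and is not necessarily the one used in the original construction, but it exhibits the same failure of distributivity, as the following Python sketch verifies.

```python
omega = range(4)                      # uniform measure on {0, 1, 2, 3}
X = lambda w: w >> 1                  # high bit
Y = lambda w: w & 1                   # low bit
Z = lambda w: (w >> 1) ^ (w & 1)      # modulo-2 sum of the two bits

def common(f, g):
    """Common information f ∧ g as a function on Omega: each outcome is mapped
    to its communication class (connected component of the bipartite graph
    whose edges are the pairs (f(w), g(w)) of non-zero probability)."""
    parent = {}
    def find(a):
        while parent.setdefault(a, a) != a:
            a = parent[a]
        return a
    for w in omega:                               # union the two endpoints of each edge
        a, b = find(('f', f(w))), find(('g', g(w)))
        parent[a] = b
    return lambda w: find(('f', f(w)))

def joint(f, g):
    """Joint information f ∨ g."""
    return lambda w: (f(w), g(w))

def n_values(f):
    return len({f(w) for w in omega})

# X ∧ (Y ∨ Z) is equivalent to X (2 values), while (X ∧ Y) ∨ (X ∧ Z) is
# deterministic (1 value): distributivity fails.
print(n_values(common(X, joint(Y, Z))))               # 2
print(n_values(joint(common(X, Y), common(X, Z))))    # 1
```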

Information and Information Measures
First of all, it is immediate to check the compatibility of the information lattice with the entropy or the mutual information as measures of information.

Proposition 10. Entropy, conditional entropy and mutual information are compatible with the definition of information as an equivalence class.

Proof.

• Entropy: If X ≡ Y, there exist functions f and g such that Y = f(X) and X = g(Y) a.s., hence H(Y) ≤ H(X) and H(X) ≤ H(Y), so that H(X) = H(Y).

• Conditional entropy: If X ≡ X′ and Y ≡ Y′, then (X, Y) ≡ (X′, Y′), hence H(X|Y) = H(X, Y) − H(Y) = H(X′, Y′) − H(Y′) = H(X′|Y′).

• Mutual information: compatibility follows from the two previous cases.
We then have some obvious connections:

Proposition 11 (Partial Order and Conditional Entropy). X ≤ Y if and only if H(X|Y) = 0.

In particular, H is "order-preserving" (greater information implies higher entropy): X ≤ Y implies H(X) ≤ H(Y). Finally, H(X) ≥ 0 for all X, with equality H(X) = 0 iff X = 0.

Proof. H(X|Y) = 0 means that H(X|Y = y) = 0 for all y ∈ Y, which amounts to saying that X is deterministic, equal to some f(y), given Y = y. In other words, X = f(Y) a.s. We then have, whenever X ≤ Y, H(Y) = H(X, Y) = H(X) + H(Y|X) ≥ H(X). Finally, since X ≥ 0 for all X, H(X) ≥ H(0) = 0, and it is well known that the entropy H(X) is zero if and only if the variable X is deterministic, that is, X = 0.

Common Information vs. Mutual Information
Proposition 12. The entropy of the joint information is the joint entropy, i.e., H(X ∨ Y) = H(X, Y).

One may wonder, by analogy with the usual Venn diagram of information theory (Fig. 4), whether the entropy of the common information is equal to the mutual information: is it true that H(X ∧ Y) = I(X; Y)?

The answer is no, as shown next. Proposition 13 is implicit in [5], and was made explicit by Wyner in [18], who credits a private communication from Kaplan.

Proposition 13. For any random variables U and V and any W such that W ≤ U and W ≤ V, one has H(W) ≤ I(U; V), with equality iff U and V are conditionally independent given W.

Remark 4. In particular, if X and Y are independent, they have null common information: X ∧ Y = 0. However, the entropy of the common information H(X ∧ Y) can be far less [5] than the mutual information I(X; Y).

Remark 5. Notice that the case of equality corresponds to the case where the matrix blocks Ci in (4) are stochastic matrices of two independent variables X, Y knowing W = i, i.e., matrices of rank 1.

Remark 6. Shannon's notion of common information should not be confused with the well-known Wyner "common information", which is defined as the minimum of I(X, Y; W) over all W such that X and Y are conditionally independent given W. This quantity is not less but greater than the mutual information I(X; Y) [18].
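A small numerical illustration of Remark 4: the joint distribution below has a single communication class (so H(X ∧ Y) = 0) yet strictly positive mutual information. The computation is a straightforward Python sketch.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Joint distribution with a single communication class (X ∧ Y = 0)
# but strictly positive mutual information I(X; Y).
P = np.array([[1/3, 1/3],
              [1/3, 0.0]])
I = entropy(P.sum(axis=1)) + entropy(P.sum(axis=0)) - entropy(P)
print(I)   # ≈ 0.25 bits, yet H(X ∧ Y) = 0
```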

Submodularity of Entropy on the Information Lattice
From the results in [9] we can show that entropy is submodular on the information lattice:

Proposition 14 (Submodularity). For any X, Y in the information lattice, H(X ∨ Y) + H(X ∧ Y) ≤ H(X) + H(Y).

Proof. On one hand, H(X ∨ Y) = H(X, Y) = H(X) + H(Y) − I(X; Y) by Proposition 12; on the other hand, H(X ∧ Y) ≤ I(X; Y) by Proposition 13. Combining gives the announced inequality.

Remark 8. The submodularity property of entropy that is generally studied in the information theory literature is with respect to the set lattice (or algebra), where the entropy is that of a collection of random variables indexed by some index set (thus considered as a set function). Such considerations have been greatly developed in recent years, see, e.g., [19]. By contrast, it is the information lattice that is considered here. It can be easily shown using Proposition 13 that the two notions of submodularity coincide for collections of independent random variables.

Two Entropic Metrics: Shannon Distance; Rajski Distance
Since X = Y ⇐⇒ (X ≤ Y and X ≥ Y), according to Proposition 11, it suffices that H(X|Y) + H(Y|X) = 0 in order for X, Y to be equivalent: X = Y. Shannon [16] noted that this defines a distance which makes the information lattice a metric space:

Proposition 15 (Shannon's Entropic Distance). D(X, Y) = H(X|Y) + H(Y|X) is a distance over the information lattice.

Proof.

• Positivity: As just noted above, D(X, Y) ≥ 0, and it vanishes only when X = Y (as equivalence classes).

• Symmetry: D(X, Y) = D(Y, X) by definition.

• Triangular inequality: H(X|Z) ≤ H(X, Y|Z) = H(X|Y, Z) + H(Y|Z) ≤ H(X|Y) + H(Y|Z); permuting the roles of X and Z and summing the two resulting inequalities gives D(X, Z) ≤ D(X, Y) + D(Y, Z).

It is interesting to note that this is not the only distance (nor the only topology). By normalizing D(X, Y) by the joint entropy H(X, Y), we obtain another distance metric:

Proposition 16 (Rajski's Entropic Distance). d(X, Y) = D(X, Y)/H(X, Y) = (H(X|Y) + H(Y|X))/H(X, Y) is a distance on the non-deterministic elements of the information lattice.

Notice that normalization by H(X, Y) is valid when X and Y are non-deterministic, since X ≠ 0 and Y ≠ 0 implies H(X, Y) > 0.
Proof. First of all, symmetry d(X, Y) = d(Y, X) is obvious and positivity follows from that of D. We follow Horibe [6] to prove the triangular inequality. One may always assume non-deterministic random variables. Observe that

H(X|Y)/(H(X|Y) + H(Y, Z)) ≤ H(X|Y)/H(X, Y)    (7)

and

H(Y|Z)/(H(X|Y) + H(Y, Z)) ≤ H(Y|Z)/H(Y, Z),    (8)

since H(X|Y) + H(Y, Z) = H(X, Y) + H(Z|Y) ≥ H(X, Y) and H(X|Y) + H(Y, Z) ≥ H(Y, Z). Summing (7) and (8) yields

(H(X|Y) + H(Y|Z))/(H(X|Y) + H(Y, Z)) ≤ H(X|Y)/H(X, Y) + H(Y|Z)/H(Y, Z).    (9)

Now, from the above proof of the triangular inequality of D, one has H(X|Y) + H(Y|Z) ≥ H(X|Z). Noting that a ≥ b > 0 and c ≥ 0 imply a/(a + c) ≥ b/(b + c) (here with a = H(X|Y) + H(Y|Z), b = H(X|Z) and c = H(Z)), we obtain

(H(X|Y) + H(Y|Z))/(H(X|Y) + H(Y, Z)) ≥ H(X|Z)/H(X, Z).    (10)

Therefore,

H(X|Z)/H(X, Z) ≤ H(X|Y)/H(X, Y) + H(Y|Z)/H(Y, Z).    (11)

Permuting the roles of X and Z gives

H(Z|X)/H(X, Z) ≤ H(Z|Y)/H(Y, Z) + H(Y|X)/H(X, Y).    (12)

Summing (11) and (12), we conclude that d(X, Z) ≤ d(X, Y) + d(Y, Z).

Remark 9. Rajski's distance between two variables X and Y can be visualized as the Jaccard distance between the region corresponding to X and the region corresponding to Y in the Venn diagram of Fig. 4. The Jaccard (or Jaccard-Tanimoto) distance [7] between two sets A and B is defined by dJ(A, B) = |A Δ B|/|A ∪ B|, where Δ is the symmetric difference between A and B. Thus, if A and B are respectively the regions corresponding to X and to Y in the Venn diagram, with "areas" measured by entropy (|A ∪ B| = H(X, Y) and |A Δ B| = H(X|Y) + H(Y|X)), we have d(X, Y) = dJ(A, B).
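Both metrics are straightforward to compute from a joint probability matrix, as the following Python sketch shows (function names are ours).

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def shannon_distance(P):
    """D(X, Y) = H(X|Y) + H(Y|X), computed from the joint matrix P_{X,Y}."""
    P = np.asarray(P, dtype=float)
    HXY = entropy(P)
    HX = entropy(P.sum(axis=1))
    HY = entropy(P.sum(axis=0))
    return (HXY - HY) + (HXY - HX)

def rajski_distance(P):
    """d(X, Y) = D(X, Y) / H(X, Y), defined for non-deterministic X, Y."""
    return shannon_distance(P) / entropy(P)

# X uniform on two bits (rows), Y = first bit (columns): d(X, Y) = 1/2.
P = np.array([[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]])
print(rajski_distance(P))   # 0.5
```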

Dependency Coefficient
From the Rajski distance, we can define a quantity which measures the dependence between two non-deterministic (i.e., nonzero) random variables X and Y.
Definition 5 (Dependency Coefficient). For all non-zero elements X, Y of the information lattice, ρ(X, Y) = 1 − d(X, Y).

Proposition 17. The dependency coefficient can be seen as a normalized mutual information: ρ(X, Y) = I(X; Y)/H(X, Y).

Proposition 18. 0 ≤ ρ(X, Y) ≤ 1; moreover, ρ(X, Y) = 0 if and only if X and Y are independent, and ρ(X, Y) = 1 if and only if X = Y.

Remark 10. The property of ρ in Proposition 18 is similar to the usual property of the linear correlation coefficient. However, while two independent random variables have zero correlation (but not conversely), the corresponding converse property holds for the dependency coefficient, since two random variables are independent if and only if ρ(X, Y) = 0.
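Equivalently, ρ can be computed directly as a normalized mutual information, e.g., with the following minimal Python sketch.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def dependency_coefficient(P):
    """rho(X, Y) = 1 - d(X, Y) = I(X; Y) / H(X, Y), from the joint matrix P."""
    P = np.asarray(P, dtype=float)
    I = entropy(P.sum(axis=1)) + entropy(P.sum(axis=0)) - entropy(P)
    return I / entropy(P)

print(dependency_coefficient([[0.25, 0.25], [0.25, 0.25]]))   # 0.0: independent
print(dependency_coefficient([[0.5, 0.0], [0.0, 0.5]]))       # 1.0: X = Y
```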

Discontinuity and Continuity Properties
Perhaps the biggest flaw in Shannon's lattice information theory [16] is that the different constructions of elements in the lattice (e.g., common and complementary information) do not actually depend on the values of the probabilities involved, but only on whether they are equal to or different from zero. Thus, a small perturbation of the probabilities can greatly influence the results. As an illustration we have the following

Proposition 19 (Discontinuity of common information). The mapping (X, Y) → X ∧ Y is discontinuous in the metric lattice with distance D (or d).
Proof. Let (Xε, Yε) be defined by a joint probability matrix all of whose entries are non-zero for every ε > 0 and which tends, as ε → 0, to a matrix with two communication classes. Since for ε > 0 there is a single communication class, the common information Xε ∧ Yε is zero, whereas the common information of the limit at ε = 0 is non-zero, even though D((Xε, Yε), (X0, Y0)) → 0.

However, it should be noted that the joint information ∨ is continuous with respect to Shannon's distance. In fact we have the following

Proposition 20 (Continuity of joint information). D(X ∨ Y, X′ ∨ Y′) ≤ D(X, X′) + D(Y, Y′).

Proof. One has D(X ∨ Y, X′ ∨ Y′) = H(X, Y | X′, Y′) + H(X′, Y′ | X, Y). Now H(X, Y | X′, Y′) =(a) H(X | X′, Y′) + H(Y | X, X′, Y′) ≤(b) H(X | X′) + H(Y | Y′), where (a) is the consequence of the chain rule and (b) is due to the fact that conditioning reduces entropy. Since X, X′ and Y, Y′ play a symmetrical role in (b), we can permute the roles of X, X′ and Y, Y′, which gives H(X′, Y′ | X, Y) ≤ H(X′ | X) + H(Y′ | Y). Summing both inequalities yields the result.
Remark 11. In particular, for X = X′ one obtains, for any X, Y, Z, D(X ∨ Y, X ∨ Z) ≤ D(Y, Z). In other words, joining with the same X can only reduce the Shannon distance: in this respect, the joining operator Y → X ∨ Y is a contraction operator.
Furthermore, the entropy, the conditional entropy, and the mutual information are continuous with respect to the entropic distance of Shannon. Indeed, we have the following inequalities (see Problem 3.5 in [2]):

Proof. i) By the chain rule, H(X) − H(X′) = H(X|X′) − H(X′|X), hence |H(X) − H(X′)| ≤ D(X, X′). iii) From the continuity of joint information (Proposition 20), one can further bound D((X, Y), (X′, Y′)) ≤ D(X, X′) + D(Y, Y′); the conclusion now follows from i) and ii). iv) By the chain rule, the difference I(X; Y) − I(X′; Y′) decomposes into three entropy differences; the conclusion follows from bounding each of the three terms in the sum using i) and ii).
In the remainder of this paper, we only consider quantities that are continuous with respect to the entropic metrics (Shannon and Rajski distance).As a result, the discontinuity of the ∧ operator will not hinder our derivations in the sequel.

Alignments of Random Variables
Definition 6 (Alignment). Let δ be any distance on the information lattice. The random variables X, Y and Z are said to be aligned (in this order) with respect to δ if the triangular inequality is met with equality: δ(X, Z) = δ(X, Y) + δ(Y, Z).

Proposition 22 (Alignment w.r.t. the Shannon distance D). The random variables X, Y and Z are aligned w.r.t. D if and only if X − Y − Z is a Markov chain and Y ≤ X ∨ Z. This alignment condition is illustrated in Fig. 5.

Proof. From the proof of the triangular inequality for D (Proposition 15), equality holds iff equality holds in both inequalities H(X|Z) ≤ H(X, Y|Z) = H(X|Y, Z) + H(Y|Z) ≤ H(X|Y) + H(Y|Z), and in those inequalities obtained by permuting the roles of X and Z. Since H(X, Y|Z) − H(X|Z) = H(Y|X, Z) and H(X|Y) − H(X|Y, Z) = I(X; Z|Y), equality holds iff H(Y|X, Z) = 0 and I(X; Z|Y) = 0, both conditions being symmetric in (X, Z). Now H(Y|X, Z) = 0 means that Y is a function of (X, Z), i.e., Y ≤ X ∨ Z. Also, I(X; Z|Y) = 0 means that X and Z are conditionally independent given Y, which characterizes the fact that X − Y − Z forms a Markov chain.

Proposition 23 (Alignment w.r.t. Rajski's distance d). The random variables X, Y and Z are aligned w.r.t. d if and only if Y = X ∨ Z. This alignment condition is illustrated in Fig. 6.

Proof. From the proof of the triangular inequality for d (Proposition 16), equality holds iff equality holds in all inequalities (7), (8), and (10), and in those inequalities obtained by permuting the roles of X and Z. Now a close inspection of (7) shows that it achieves equality iff H(Z|Y) = 0, that is, Z ≤ Y. Similarly, (8) achieves equality iff H(X|Y) = 0, that is, X ≤ Y. Both conditions together write X ∨ Z ≤ Y, which is symmetric in (X, Z). Finally, (10) achieves equality iff H(X|Y) + H(Y|Z) = H(X|Z), together with the corresponding equality obtained by permuting the roles of X and Z. This means that X, Y and Z are aligned w.r.t. D, that is, X − Y − Z is a Markov chain and Y ≤ X ∨ Z. Overall, Y = X ∨ Z, which conversely already implies that X and Z are conditionally independent given Y = (X, Z), i.e., that X − Y − Z is a Markov chain.
Remark 12. Note that if X, Y and Z are aligned in the sense of Rajski's distance, then they are also aligned in the sense of Shannon's entropic distance, since Y = (X, Z) implies that X − Y − Z is a Markov chain. Thus, the alignment condition is stronger in the case of the Rajski distance.

Remark 13. The alignment condition w.r.t. the Rajski distance is simpler and is expressed using only the operators of the information lattice, whereas that w.r.t. the Shannon distance requires the additional notion of Markov chain. Therefore, in the sequel, we develop some geometrical aspects of the information lattice based essentially on the Rajski distance.

Convex Sets of Random Variables in the Information Lattice
Definition 7 (Convexity). Given two random variables X and Y, we define the segment [X, Y] of endpoints X and Y as the set of all random variables Z such that X, Z and Y are aligned with respect to the Rajski distance, i.e., such that d(X, Z) + d(Z, Y) = d(X, Y). A set C of points (random variables) in the information lattice is convex if for all points X, Y ∈ C, the segment [X, Y] ⊆ C. If S is any set of points of the information lattice, its convex envelope is the smallest convex set containing S.

By its very definition, the convex envelope of the two-element set {X, Y} is the segment [X, Y]. We have the following simple characterization.

Proposition 24 (Segment Characterization). For any two elements X, Y of the information lattice, the segment [X, Y] is the three-element set [X, Y] = {X, (X, Y), Y}, with respective distances to the endpoints given by d(X, (X, Y)) = H(Y|X)/H(X, Y) and d(Y, (X, Y)) = H(X|Y)/H(X, Y).

Proof. X and Y do belong to the segment [X, Y], since d(X, X) + d(X, Y) = d(X, Y) = d(X, Y) + d(Y, Y). If Z ∈ [X, Y] is distinct from X and Y, then X, Z and Y are aligned with respect to the Rajski distance, so that necessarily Z = (X, Y) by Proposition 23. One calculates d(X, (X, Y)) = H(Y|X)/H(X, Y), and similarly for d(Y, (X, Y)) by permuting the roles of X and Y.

Remark 14. By the above Proposition, segments in the information lattice are intrinsically discrete objects. In the case where X ≤ Y or Y ≤ X, the segment [X, Y] contains only two distinct points, X and Y. Obviously, if X = Y, then [X, Y] is a singleton. This gives three possible cases, as illustrated in Fig. 7.

Remark 15. As a result of this characterization, four or more distinct points cannot be aligned w.r.t. the Rajski distance, because a segment cannot contain more than three distinct points.
Proposition 25. C is convex iff it is closed under the ∨ operator.

Beyond the case of a two-element set, we now characterize the convex envelope of any n-element set in the information lattice, that is, the convex envelope of n random variables X1, X2, ..., Xn. We adopt the following usual convention. For any tuple of indices I = (i1, i2, ..., ik), the random vector (Xi1, Xi2, ..., Xik) is denoted by XI. Again by convention, for the empty set, X∅ = 0, so that one always has XI∪J = XI ∨ XJ for any two finite sets of indices I and J.

Proposition 26. Let I be a finite index set and (Xi)i∈I be random variables. The convex envelope of (Xi)i∈I is {XJ : ∅ ≠ J ⊆ I}, that is, the set of all sub-tuples of the Xi.

Proof. Since it contains every Xi (i ∈ I), the convex envelope in question must be closed under the ∨ operator, hence contain every tuple ∨j∈J Xj for any nonempty J ⊆ I. Now C = {∨j∈J Xj = XJ | ∅ ≠ J ⊆ I} contains all Xi for i ∈ I and is already convex. Indeed, for all XJ ∈ C and XK ∈ C, XJ ∨ XK = XJ∪K ∈ C.

Remark 16. Given a finite index set I of cardinality |I| = n, the convex envelope of (Xi)i∈I contains at most 2^n − 1 distinct elements, since there are 2^n − 1 nonempty subsets of I. The number 2^n − 1 is only an upper bound, since it might happen that two different subsets J and K of I are such that XJ = XK.
An example of the convex envelope of a family of three random variables is shown in Fig. 8.

Figure 8. 7-element convex envelope of three random variables X0, X1 and X2. These three random variables are represented as vertices of an (equilateral) triangle. The other points in the convex envelope are obtained as intersections of medians and edges, and the joint information X0 ∨ X1 ∨ X2 is the center of gravity (intersection of the three medians). Similarly, the 15-element convex envelope of four distinct points can be visualized in a tetrahedron, etc.
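Enumerating the elements of the convex envelope thus amounts to enumerating the nonempty subsets of I, as in the following Python sketch (illustrative only; equal sub-tuples XJ = XK are not merged here).

```python
from itertools import combinations

def convex_envelope_indices(n):
    """Enumerate the (at most 2**n - 1) elements X_J of the convex envelope
    of X_1, ..., X_n: one joint variable X_J per nonempty subset J."""
    return [J for r in range(1, n + 1) for J in combinations(range(1, n + 1), r)]

print(convex_envelope_indices(3))
# [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]  -> 7 elements, as in Fig. 8
```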
It is also interesting to note that any sublattice of the information lattice has some convexity properties:

Proposition 27. Any sublattice of the information lattice (including the information lattice itself) is convex. With every Xi (i ∈ I), the sublattice also contains the convex envelope of (Xi)i∈I.

Proof. With every two points X, Y, the sublattice contains their maximum X ∨ Y, hence the whole segment [X, Y]. It is, therefore, convex. Now every convex set contains the convex envelope of any of its subsets.

The Lattice Generated by a Random Variable
In the sequel, we are interested in all possible deterministic functions of a given random variable X. In fact, their set constitutes a sublattice of the information lattice:

Proposition 28 (Sublattice generated by a random variable). Let X be any random variable in the information lattice. The set of all random variables Y ≤ X is a sublattice, which we call the lattice generated by X and denote ⟨X⟩. It is a bounded lattice with maximum (total information) X and minimum 0.

Proof. Let Y ≤ X and Z ≤ X. There exist deterministic functions f and g such that Y = f(X) a.s. and Z = g(X) a.s. Then Y ∨ Z = (f(X), g(X)) is a deterministic function of X, hence Y ∨ Z ≤ X; moreover, Y ∧ Z ≤ Y ≤ X.

Therefore, the set of random variables ≤ X forms a sublattice. Clearly, X is its maximum and 0 (a deterministic random variable, seen as a constant function of X) is its minimum.

Remark 17. One may also define the sublattice ⟨X1, X2, ..., Xn⟩ generated by several random variables X1, X2, ..., Xn simply as the sublattice generated by the single variable X = X1 ∨ X2 ∨ ... ∨ Xn = (X1, X2, ..., Xn). Therefore, it is enough to restrict ourselves to one random variable X as the lattice generator.

Proposition 29. The sublattice ⟨X⟩ is relatively complemented: for any Y ≤ Z ≤ X, Y admits a complement with respect to Z that belongs to ⟨X⟩.

Proof. Let Y ≤ Z ≤ X, so that both Y, Z ∈ ⟨X⟩. By Proposition 8, Y admits at least one complementary information W w.r.t. Z in the information lattice, such that Y ∧ W = 0 and Y ∨ W = Z. Now W ≤ Y ∨ W = Z ≤ X, hence the complement W belongs to ⟨X⟩, the sublattice generated by X.

Properties of Rajski and Shannon Distances in the Lattice Generated by a Random Variable
We now investigate the metric properties of the sublattice ⟨X⟩ generated by a random variable X.
To avoid the trivial case ⟨0⟩ = {0}, we assume that X is non-deterministic. First of all, we observe that the entropy of an element of the sublattice increases as it gets closer to X (in terms of either Shannon's or Rajski's distance):

Proposition 30. For any Y ∈ ⟨X⟩, D(X, Y) = H(X|Y) = H(X) − H(Y) and d(X, Y) = H(X|Y)/H(X) = 1 − H(Y)/H(X). In particular, d(X, Y) = 1 if and only if Y = 0, and d(X, Y) = 0 if and only if Y = X.

Proof. One has D(X, Y) = H(X|Y) + H(Y|X) = (a) H(X|Y) = (b) H(X) − H(Y), where (a) is because Y ≤ X (so that H(Y|X) = 0), and (b) is a consequence of the chain rule: H(X|Y) = H(X, Y) − H(Y) = H(X) − H(Y) since H(X, Y) = H(X). Dividing by H(X, Y) = H(X) gives the expression of d(X, Y).

Remark 18. In the language of data compression, d(X, Y) = H(X|Y)/H(X) can be seen as the relative entropic redundancy of X when it is represented ("encoded") by Y.

Remark 19. The maximum distance case in the Proposition can be stated as follows: the only random variables Y that can be obtained as functions of X (Y ∈ ⟨X⟩) while being also independent of X (d(X, Y) = 1) are the constant (deterministic) random variables.
In Euclidean geometry, Apollonius's theorem allows one to calculate the length of the median of a triangle XYZ given the lengths of its three sides. In the information lattice context, Y ∨ Z plays the role of the median point of the segment [Y, Z] (the only possible point in the segment that is not an endpoint). Thus, Apollonius's theorem gives a formula for the distance D(X, Y ∨ Z) in terms of D(X, Y), D(X, Z), and D(Y, Z). The following is the analogue of Apollonius's theorem for the Shannon distance in the information lattice generated by X: for any Y, Z ∈ ⟨X⟩,

D(X, Y) + D(X, Z) = 2 D(X, Y ∨ Z) + D(Y, Z).

This can also be written as

D(X, Y ∨ Z) = (D(X, Y) + D(X, Z) − D(Y, Z))/2.

This is illustrated in Fig. 9. Note that when X = Y ∨ Z, one recovers that Y, X, Z (in this order) are aligned.

Proof. From Proposition 30, D(X, Y) = H(X) − H(Y), D(X, Z) = H(X) − H(Z) and D(X, Y ∨ Z) = H(X) − H(Y, Z), while D(Y, Z) = H(Y|Z) + H(Z|Y) = 2H(Y, Z) − H(Y) − H(Z); substituting these expressions on both sides gives the identity.

From Lemma 1 we derive the following Lemma 2: for any Y, Z ∈ ⟨X⟩,

d(X, Y ∨ Z) ≥ d(X, Y) + d(X, Z) − 1,

with equality if and only if Y and Z are independent.

Proof. One has H(Y, Z) ≤ H(Y) + H(Z), with equality iff Y and Z are independent. Now by Lemma 1, D(X, Y) + D(X, Z) − H(X) ≤ D(X, Y ∨ Z); dividing by H(X, Y ∨ Z) = H(X) (Proposition 30) yields the announced inequality.
In the other direction we have the following Lemma 3: for any Y, Z ∈ ⟨X⟩,

d(X, Y ∨ Z) ≤ d(X, Y) + d(X, Z),

with equality if and only if X = Y = Z.

Proof. By the triangular inequality, d(X, Y ∨ Z) ≤ d(X, Y) + d(Y, Y ∨ Z) and d(X, Y ∨ Z) ≤ d(X, Z) + d(Z, Y ∨ Z). Summing, and using d(Y, Y ∨ Z) + d(Z, Y ∨ Z) = d(Y, Z) (Proposition 24) together with d(Y, Z) ≤ d(Y, X) + d(X, Z), yields the announced inequality after dividing by two.

Remark 20. In the course of the proof, we have proved the following stronger inequality: for any Y, Z ∈ ⟨X⟩, d(X, Y ∨ Z) ≤ (d(X, Y) + d(X, Z) + d(Y, Z))/2, with the same equality condition X = Y = Z.

Remark 21. By Lemmas 2 and 3, we see that in terms of Rajski distances to the generator X, d(X, Y ∨ Z) lies between d(X, Y) + d(X, Z) − 1 and d(X, Y) + d(X, Z), where the lower and upper bounds differ by 1 and the minimum value is achieved in the case of independence. These two Lemmas are instrumental in the derivations of the next Section.

Problem Statement
Suppose one is faced with the following reconstruction problem. We are given a (discrete) source of information X (e.g., a digital signal, some text document, or any type of data), which is processed using deterministic functions into several "components" X1 = f1(X), X2 = f2(X), ..., Xn = fn(X) (e.g., different filtered versions of the signal at various frequencies, translated parts of the document, or some nonlinear transformations of the data). The natural question is: Did one lose information when processing X into its n components X1, X2, ..., Xn? Or else, can we perfectly reconstruct the original X from its n components using some (unknown) deterministic function f, i.e., X = f(X1, X2, ..., Xn)? We emphasize that all involved functions must be deterministic (no noise is involved); otherwise, perfect reconstruction (without error) would not be possible. Yet we do not require any precise form for the reconstruction function f, only that such a reconstruction exists. To our knowledge, the first occurrence of such a problem (for n = 2) is Exercise 6 of the textbook [12].

Stated in the information lattice language, the perfect reconstruction problem is as follows. Suppose we are given X1, X2, ..., Xn in ⟨X⟩, the sublattice generated by X. Is it true that X ≤ X1 ∨ X2 ∨ ... ∨ Xn? Since the sublattice is convex (Proposition 27), i.e., stable under the ∨ operator (Proposition 25), one always has, by assumption, X1 ∨ X2 ∨ ... ∨ Xn ≤ X, so that perfect reconstruction amounts to the equality X = X1 ∨ X2 ∨ ... ∨ Xn. Thus, when n = 2, perfect reconstruction is possible iff X lies in the segment [X1, X2]. When n = 3, perfect reconstruction is possible iff for all distinct indices i, j, k ∈ {1, 2, 3}, Xi, X and Xj,k are aligned w.r.t. the Rajski distance, as illustrated in Fig. 10. Intuitively, the processed components Xi should not (on the whole) be too "far away" from the original source X in order that perfect reconstruction be possible. In other words, at least some of the distances d(X, Xi) should not be too high. Such distances can, in principle, be evaluated when processing the source X into each of its components. In the following Subsection, we give a simple necessary condition on the sum d(X, X1) + d(X, X2) + ... + d(X, Xn) to allow for perfect reconstruction.

A Necessary Condition for Perfect Reconstruction
The main result of this paper is the following

Theorem 1 (Necessary Condition for Perfect Reconstruction). Let X be a random variable and let X1, X2, ..., Xn ∈ ⟨X⟩. If perfect reconstruction is possible, that is, if X = X1 ∨ X2 ∨ ... ∨ Xn, then

d(X, X1) + d(X, X2) + ... + d(X, Xn) ≤ n − 1,    (25)

with equality iff X1, X2, ..., Xn are independent.

Proof. By repeated use of Lemma 2, each joining operation of two components in the sum (e.g., passing from d(X, Xi) + d(X, Xj) to d(X, Xi ∨ Xj)) decreases this sum by at most 1. Thus,

d(X, X1 ∨ X2 ∨ ... ∨ Xn) ≥ d(X, X1) + d(X, X2) + ... + d(X, Xn) − (n − 1).

Since perfect reconstruction means that the left-hand side equals d(X, X) = 0, inequality (25) follows. Equality holds iff all the above n − 1 inequalities are equalities. By the equality condition of Lemma 2, this means by induction that X1 is independent from X2, then X1 ∨ X2 is independent from X3, and so on until X1 ∨ ... ∨ Xn−1 is independent from Xn. Overall, this is equivalent to saying that all components X1, X2, ..., Xn are mutually independent.
Remark 23. To illustrate Theorem 1, consider a uniformly distributed two-bit random variable X (i.e., the result of two independent coin flips) and let X1 be the result of the first coin toss and X2 be that of the second coin toss. Clearly, reconstruction is possible since X = (X1, X2). Now a simple calculation gives d(X, X1) = H(X|X1)/H(X) = log 2/log 4 = 1/2, and similarly d(X, X2) = 1/2, which shows that (25) is achieved with equality: d(X, X1) + d(X, X2) = 2 − 1 = 1. This is not surprising since X1 and X2 are independent, as can be checked directly. Now consider X3 = 0 or 1 depending on whether X1 = X2 or not. Clearly, X can also be reconstructed from X1, X2, X3, since it can already be reconstructed from X1, X2. Again one computes d(X, X3) = log 2/log 4 = 1/2, so in this case the sum of distances to X is d(X, X1) + d(X, X2) + d(X, X3) = 3/2 < 3 − 1 = 2. This shows that (25) is still satisfied, but not with equality. In fact, it can easily be proved that even though X1, X2, X3 are pairwise independent, they are not mutually independent.
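The computations of Remark 23 are easy to reproduce numerically; the following Python sketch (our own helper functions) evaluates the three Rajski distances on the common probability space of the two coin flips.

```python
import numpy as np
from itertools import product

def entropy(labels, probs):
    """Entropy (in bits) of a discrete variable given outcome labels and probabilities."""
    dist = {}
    for lab, p in zip(labels, probs):
        dist[lab] = dist.get(lab, 0.0) + p
    return -sum(p * np.log2(p) for p in dist.values() if p > 0)

def rajski(f, g, outcomes, probs):
    """d(f(W), g(W)) computed on a common probability space (outcomes, probs)."""
    Hf = entropy([f(w) for w in outcomes], probs)
    Hg = entropy([g(w) for w in outcomes], probs)
    Hfg = entropy([(f(w), g(w)) for w in outcomes], probs)
    return (2 * Hfg - Hf - Hg) / Hfg

# X = (B1, B2): two independent fair bits; X1 = B1, X2 = B2, X3 = B1 XOR B2.
outcomes = list(product([0, 1], repeat=2))
probs = [0.25] * 4
X  = lambda w: w
X1 = lambda w: w[0]
X2 = lambda w: w[1]
X3 = lambda w: w[0] ^ w[1]
print([round(rajski(X, Xi, outcomes, probs), 3) for Xi in (X1, X2, X3)])
# [0.5, 0.5, 0.5]  ->  sum = 1.5 < 3 - 1 = 2, consistent with Remark 23
```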
Remark 24. In practice, Theorem 1 gives an impossibility condition for perfect reconstruction of the random variable X from components X1, X2, ..., Xn. Indeed, if the latter are such that

d(X, X1) + d(X, X2) + ... + d(X, Xn) > n − 1,    (27)

then perfect reconstruction is impossible, however complex the reconstruction function f could have been.

In other words, X1 ∨ X2 ∨ ... ∨ Xn < X: information was lost by processing. That perfect reconstruction is impossible does not mean that it would never be possible to deduce one particular value of X from some particular values of X1, X2, ..., Xn. It means that such a deduction is not possible in general, for all possible values taken by X1, X2, ..., Xn. In other words, there is at least one set of values X1 = x1, X2 = x2, ..., Xn = xn for which X cannot be reconstructed unambiguously.

Remark 25. Another look at Theorem 1 can be made using the dependency coefficient ρ = 1 − d in place of the Rajski distance. The impossibility condition (27) then simply writes

ρ(X, X1) + ρ(X, X2) + ... + ρ(X, Xn) < 1.    (28)

In other words, perfect reconstruction can only occur if the components are (as a whole) sufficiently dependent on the original X. Otherwise, (28) precludes perfect reconstruction.
Remark 26. Since the Rajski distance is always upper bounded by 1, if the impossibility condition (27) is met, then the actual value of the sum ∑ d(X, Xi) necessarily lies in the interval (n − 1, n]. In the worst situation, ∑ d(X, Xi) = n, all terms equal one: d(X, Xi) = 1 for every i. This means that all components are independent of X. By Proposition 30, the components Xi = 0 are then all constants: in this case, all information is lost.

Remark 27. By Theorem 1, for perfect reconstruction to be possible, the components Xi should be (at least slightly) tightened around X, in the sense that (25) is satisfied. The example of Remark 23 shows that under this condition (even when the inequality is strict), it may actually be possible to reconstruct X. However, proximity may not be enough: the necessary condition of Theorem 1 is not sufficient in general.
To see this, consider X uniformly distributed in the integer interval {0, 1, ..., 11} and define X1 = ⌊X/2⌋ and X2 = ⌊X/3⌋. In other words, X1 is the integer division of X by 2 and X2 is the integer division of X by 3. One easily computes d(X, X1) = log 2/log 12 and d(X, X2) = log 3/log 12, so that d(X, X1) + d(X, X2) = log 6/log 12 ≈ 0.72 ≤ 1. While the necessary condition (25) of Theorem 1 is met, the value of X cannot be unambiguously determined from those of X1 and X2. For example, X1 = X2 = 0 leaves two possibilities, X = 0 or 1. Therefore, perfect reconstruction is not possible. Another way to see this is to observe that perfect reconstruction is equivalent to saying that X1, X, X2 are aligned, which in terms of the Shannon distance would write D(X1, X2) = D(X, X1) + D(X, X2). But while D(X, X1) + D(X, X2) = log 6, one computes D(X1, X2) = H(X1|X2) + H(X2|X1) ≈ 1.25 bits, which is clearly less than log 6 ≈ 2.58 bits. Therefore, perfect reconstruction is impossible in our example, because X1 and X2 are too close together, i.e., there is too much redundant information between them.

A slight modification of the above example, where X takes values in {0, 1, ..., 12m − 1} for arbitrarily large m, shows that the sum d(X, X1) + d(X, X2) = log 6/log(12m) can actually be made as small as desired, while perfect reconstruction is still impossible. Therefore, there can be no condition of the form ∑ d(X, Xi) < c (or any condition based only on the value of this sum) that ensures perfect reconstruction. Such a sufficient condition cannot be established without assuming some other property of the components Xi, as seen in the next Subsection.
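This counterexample is also easy to verify numerically, e.g., with the following Python sketch (which uses the fact that, for Y ∈ ⟨X⟩ and X uniform, d(X, Y) = 1 − H(Y)/H(X) by Proposition 30).

```python
import numpy as np

def entropy(labels):
    """Entropy in bits of a uniformly weighted list of labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# X uniform on {0, ..., 11}, X1 = X // 2, X2 = X // 3 (the counterexample above).
X = list(range(12))
X1 = [x // 2 for x in X]
X2 = [x // 3 for x in X]

HX = entropy(X)                         # log2 12
d1 = (HX - entropy(X1)) / HX            # d(X, X1) = H(X|X1)/H(X)
d2 = (HX - entropy(X2)) / HX            # d(X, X2) = H(X|X2)/H(X)
print(d1 + d2)                          # ≈ 0.72 <= 1, so (25) is met ...
# ... yet X1 = X2 = 0 is compatible with both X = 0 and X = 1:
print({x for x in X if x // 2 == 0 and x // 3 == 0})   # {0, 1}
```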

A Sufficient Condition for Perfect Reconstruction
For independent components X1, X2, ..., Xn (with no redundant information between them), the necessary condition of Theorem 1 also becomes a sufficient condition:

Theorem 2 (Sufficient Condition for Perfect Reconstruction). Let X be a random variable and let X1, X2, ..., Xn ∈ ⟨X⟩ be independent. If inequality (25) holds, then it necessarily holds with equality:

d(X, X1) + d(X, X2) + ... + d(X, Xn) = n − 1,

and perfect reconstruction is possible: X = X1 ∨ X2 ∨ ... ∨ Xn.

Proof. A closer look at the proof of Theorem 1 shows that we have established (without the perfect reconstruction assumption) the general inequality

d(X, X1) + d(X, X2) + ... + d(X, Xn) ≤ n − 1 + d(X, X1 ∨ X2 ∨ ... ∨ Xn),    (32)

which holds with equality iff X1, X2, ..., Xn are independent. Therefore, by the independence assumption, (25) writes ∑ d(X, Xi) = n − 1 + d(X, X1 ∨ ... ∨ Xn) ≤ n − 1. Since the distance is nonnegative, this necessarily implies that the inequality holds with equality and that d(X, X1 ∨ ... ∨ Xn) = 0, i.e., X = X1 ∨ X2 ∨ ... ∨ Xn.

Remark 28. Following Remark 26, we see that for independent X1, X2, ..., Xn, the sum of distances to X, ∑ d(X, Xi), can only take values in the interval [n − 1, n], with two possibilities: either it equals n − 1 and perfect reconstruction is possible, or it exceeds n − 1 and perfect reconstruction is impossible. In other words, independent components cannot be arbitrarily tightly packed around X.

Following Remark 25, in terms of dependency coefficients, for independent X1, X2, ..., Xn, one has ∑ ρ(X, Xi) ≤ 1, with equality iff perfect reconstruction is possible.

Remark 29. Following Remark 22 and Fig. 10, in the case of three independent components X1, X2, X3, one should have d(X, X1) + d(X, X2) + d(X, X3) = 2 for perfect reconstruction to hold. Incidentally, the graphical Euclidean illustration of Fig. 10 is faithful in this case, since for an equilateral triangle X1X2X3 with sides of length 1, the sum of the Euclidean distances from the center of gravity to the three vertices equals twice the common length of the medians.

Approximate Reconstruction
Suppose we encode the information source X by n components X1, X2, ..., Xn but do not particularly insist that perfect reconstruction be possible. Rather, we assume that the encoding removes a fraction of redundancy in X equal to

d(X, X1 ∨ X2 ∨ ... ∨ Xn) = δ    (33)

(see Remark 18). Since the case δ = 0 corresponds to the previous case of perfect reconstruction, we assume that δ > 0 in the sequel. Thus, in what follows, reconstruction of X can only be approximate (up to a certain distance tolerance δ). We then have the following

Theorem 3 (Approximate Reconstruction). Let X be a random variable and let X1, X2, ..., Xn ∈ ⟨X⟩ be such that (33) holds with redundancy δ > 0. Then

δ < d(X, X1) + d(X, X2) + ... + d(X, Xn) ≤ n − 1 + δ,    (34)

with equality in the second inequality iff the components X1, X2, ..., Xn are independent.

Proof. The rightmost inequality in (34) is just (32) (with the announced case of equality), which was established by repeated application of Lemma 2. Similarly, repeated application of Lemma 3 gives

d(X, X1) + d(X, X2) + ... + d(X, Xn) ≥ d(X, X1 ∨ X2 ∨ ... ∨ Xn) = δ,

with equality iff all Xi = X (i = 1, ..., n). But such an equality condition would yield δ = d(X, X) = 0, contrary to the assumption δ > 0. This shows that the leftmost inequality in (34) is strict.
Remark 30. Similarly as in the above two Subsections, one can deduce from Theorem 3 that for independent components X1, X2, ..., Xn, one necessarily has ∑ d(X, Xi) = n − 1 + δ, and that in general, approximate reconstruction within distance tolerance ≤ δ will be impossible if ∑ d(X, Xi) > n − 1 + δ.

As an illustration, consider the reconstruction of an integer from its remainders modulo a set of pairwise coprime integers (Chinese remainder theorem): let m1, m2, ..., mn be pairwise coprime moduli, let X be uniformly distributed over {0, 1, ..., m1 m2 ... mn − 1}, and let Xi = X mod mi for i = 1, ..., n. Clearly, since X is uniformly distributed, Xi is likewise uniformly distributed over {0, 1, ..., mi − 1}, so that d(X, Xi) = 1 − (log mi)/(log m1 m2 ... mn) and ∑ d(X, Xi) = n − 1.

Thus, inequality (25) of Theorem 1 is achieved with equality, which proves that X1, X2, ..., Xn are independent. Had we proved this independence directly, Theorem 2 would have shown that perfect reconstruction is possible. Thus, an information theoretic proof of the Chinese remainder theorem using this method amounts to proving such an independence; but this can be done quite similarly to the way the Chinese remainder theorem is classically proved. With our present method, however, it can easily be seen that perfect reconstruction would not be possible if we did not use all components X1, X2, ..., Xn. Indeed, suppose without loss of generality that one tries to reconstruct X only from X1, X2, ..., Xn−1. Then, by the above calculation,

d(X, X1) + ... + d(X, Xn−1) = n − 2 + (log mn)/(log m1 m2 ... mn) > (n − 1) − 1,

which shows by Theorem 1 that perfect reconstruction of X from fewer than n remainders is impossible.
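For pairwise coprime moduli, the distances involved in this argument can be computed directly, as the following Python sketch illustrates (the moduli 3, 4, 5 are an arbitrary example of ours).

```python
import math

def crt_rajski_distances(moduli):
    """Rajski distances d(X, X_i) for X uniform on {0, ..., m_1*...*m_n - 1}
    and X_i = X mod m_i, assuming pairwise coprime moduli, so that
    H(X) = log(m_1*...*m_n) and H(X | X_i) = H(X) - log(m_i)."""
    N = math.prod(moduli)
    return [1 - math.log(m) / math.log(N) for m in moduli]

d = crt_rajski_distances([3, 4, 5])    # X uniform on {0, ..., 59}
print(sum(d))                          # 2.0 = n - 1: equality in (25)
print(sum(d[:2]))                      # ≈ 1.39 > 1: two remainders cannot suffice
```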

Optimal Sort
In this Subsection, we provide a new information theoretic proof of the following

Theorem 4. Any pairwise comparison-based sorting algorithm has worst-case computational complexity at least log2 k! = Ω(k log2 k), where k is the length of the list to be sorted.

Proof. Consider a finite, totally ordered list of k elements. It can be seen as a permutation of the uniquely sorted elements, and sorting this list amounts to finding this permutation. Let X = (X1, X2, ..., Xk) be a (uniformly chosen) random permutation of {1, 2, ..., k}.

For i, j ∈ {1, ..., k} with i ≠ j, let Xi,j be the binary random variable taking the value 1 if Xi < Xj and 0 otherwise. Clearly Xi,j ≤ X for any i, j.

Since there are as many permutations such that Xi < Xj as such that Xi > Xj, every Xi,j is a Bernoulli(1/2) variable (equiprobable bit). Therefore, d(X, Xi,j) = H(X|Xi,j)/H(X) = (log2 k! − 1)/log2 k! = 1 − 1/log2 k!. Assuming n pairwise comparisons are made to sort the complete list, this gives ∑i,j d(X, Xi,j) = n(1 − 1/log2 k!) (n terms in the sum). By Theorem 1, it is necessary that this value does not exceed n − 1, i.e., that n ≥ log2 k!, for perfect reconstruction to hold. In other words, the worst-case complexity to achieve the complete sort for any possible realization of the initial unsorted list requires at least ⌈log2 k!⌉ pairwise comparisons.
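The resulting bound ⌈log2 k!⌉ is easily tabulated, e.g., with the following Python sketch.

```python
import math

def min_comparisons(k):
    """Lower bound ceil(log2(k!)) on the number of pairwise comparisons
    needed to sort k elements in the worst case (Theorem 4)."""
    return math.ceil(math.log2(math.factorial(k)))

for k in (4, 8, 16, 1000):
    print(k, min_comparisons(k))
# 4 -> 5, 8 -> 16, 16 -> 45, 1000 -> 8530   (of order k log2 k for large k)
```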
Remark 31. This example illustrates a method to find a lower bound on the worst-case complexity of a problem. The first step is to express the instance of the problem as a random variable X. Second, one determines which pieces of information one is allowed to extract from X, and models them as "observed" random variables Xi ≤ X. Third, for each i, we compute the Rajski distance d(X, Xi).

Finally, we use Theorem 1 to find a lower bound on the number of "observed" variables Xi that are required to reconstruct X. We feel that such a method is interesting because it is often harder to find a lower bound on the complexity of a problem than to find an upper bound on it.

Conclusion and Perspectives
It is an understatement to say that the "true" information theory of 1953 was not as popular as the classical theory of 1948. John Pierce, a colleague of Shannon, wrote that "apparently the structure was not good enough to lead to anything of great value" [10]. We see two possible reasons for this pessimism: the fact that the lattice is not Boolean, which does not facilitate calculations; and the discontinuous nature of the common information with respect to the entropic metric.

However, as we have shown in this paper, this lattice structure is quite helpful to understand reconstruction problems. As shown in Section 6, the implications of the resolution of perfect reconstruction problems go beyond signal processing, since the concept of information is pervasive in all fields of mathematics and science. Thus, we believe it is important to deepen this theory, defining information per se, and to further generalize the reconstruction problems. It would indeed be of great interest to find a simple sufficient condition to reconstruct a variable X from (not necessarily independent) components X1, X2, ..., Xn.

One may legitimately argue that most examples (except in Subsection 6.1) assume uniform distributions, where the entropy is just a logarithmic measure of the alphabet size; since all considered processings are deterministic, the essence of the present reconstruction problem appears more combinatorial than probabilistic. Indeed, a desirable perspective is to go beyond perfect reconstruction of discrete quantities by considering the possibility of noisy reconstruction of discrete and/or continuous sources of information.

In a perspective closer to computer science, we have used our theorems to find a lower bound on the complexity of the comparison-based sorting problem. It would be interesting to find other problems for which a lower bound on complexity can be found using our technique, especially for decision problems that are not known to be in P.

Finally, as another practical perspective for security problems, one may assume that X models all the possible values that a secret key can take in a given cryptographic device, and that an attacker can observe k random values that are deterministically obtained from X. Such important problems have been studied, e.g., in [8] to evaluate information leakage in the execution of deterministic programs. One may use the theorems of Section 5 to find a lower bound on k for the attacker to be able to reconstruct the secret.

Figure 1 .
Figure 1. Algorithm to compute the common information.

Figure 2 .
Figure 2. Construction of the complementary information Z allowing one to pass from X to Y. The stochastic tensor of (X, Y, Z) representing PX,Y,Z has non-zero entries marked in red. The distribution PZ of Z is obtained by marginalizing the tensor along the Z axis.

Figure 3 .
Figure 3. Algorithm for computing the complementary information

Figure 4 .
Figure 4. Usual Venn diagram in information theory.

Figure 6 .
Figure 6. Venn diagram illustrating the alignment condition for the Rajski distance.

Figure 7 .
Visualization of the segment [X, Y] for three possible cases.

Figure 10 .
Figure 10. Geometric Illustration of the Three-Component Reconstruction Problem.

19. R. W. Yeung, Information Theory and Network Coding. Springer, 2008.
The pop() operation removes the top stack element and returns it.
Figure 5. Venn diagram illustrating the alignment condition for the Shannon distance.