The Homological Nature of Entropy

Baudot, Pierre; Bennequin, Daniel

doi:10.3390/e17053253


The Homological Nature of Entropy^†

by Pierre Baudot ¹,* and Daniel Bennequin ²

¹ Max Planck Institute for Mathematics in the Sciences, Inselstrasse 22, 04103 Leipzig, Germany

² Université Paris Diderot-Paris 7, UFR de Mathématiques, Equipe Géométrie et Dynamique, Bâtiment Sophie Germain, 5 rue Thomas Mann, 75205 Paris Cedex 13, France

* Author to whom correspondence should be addressed.

† This paper is an extended version of our paper published in Proceedings of the MaxEnt 2014 Conference on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Amboise, France, 21–26 September 2014.

Entropy 2015, 17(5), 3253-3318; https://doi.org/10.3390/e17053253

Submission received: 31 January 2015 / Revised: 3 May 2015 / Accepted: 5 May 2015 / Published: 13 May 2015

(This article belongs to the Special Issue Information, Entropy and Their Geometric Structures)


Abstract:

We propose that entropy is a universal co-homological class in a theory associated to a family of observable quantities and a family of probability distributions. Three cases are presented: (1) classical probabilities and random variables; (2) quantum probabilities and observable operators; (3) dynamic probabilities and observation trees. This gives rise to a new kind of topology for information processes, that accounts for the main information functions: entropy, mutual-informations at all orders, and Kullback–Leibler divergence and generalizes them in several ways. The article is divided into two parts, that can be read independently. In the first part, the introduction, we provide an overview of the results, some open questions, future results and lines of research, and discuss briefly the application to complex data. In the second part we give the complete definitions and proofs of the theorems A, C and E in the introduction, which show why entropy is the first homological invariant of a structure of information in four contexts: static classical or quantum probability, dynamics of classical or quantum strategies of observation of a finite system.

Keywords:

Shannon information; homology theory; entropy; quantum information; homotopy of links; mutual informations; Kullback–Leibler divergence; trees; monads; partitions

1. Introduction

1.1. What is Information?

“What is information?” is a question that has received several answers, depending on the problem investigated. The best known definition was given by Shannon [1], using random variables and a probability law, for the problem of optimal message compression. However, the first definition was given by Fisher, as a metric associated to a smooth family of probability distributions, for optimal discrimination by statistical tests; it is a limit of the Kullback–Leibler divergence, which was introduced to estimate the accuracy of a statistical model of empirical data, and which can also be viewed as a quantity of information. More generally, Kolmogorov considered that the concept of information must precede probability theory (cf. [2]). However, Évariste Galois saw the application of group theory for discriminating solutions of an algebraic equation as a first step toward a general theory of ambiguity, which was developed further by Riemann, Picard, Vessiot, Lie, Poincaré and Cartan, for systems of differential equations; it is also a theory of information. In another direction, René Thom claimed that information must have a topological content (see [3]); he gave the example of the unfolding of the coupling of two dynamical systems, but he had in mind the whole domain of algebraic or differential topology.

All these approaches have in common the definition of secondary objects, either functions, groups or homology cycles, for measuring in what sense a pair of objects departs from independence. For instance, in the case of Shannon, the mutual information is I(X; Y) = H(X) + H(Y) − H(X, Y), where H denotes the usual Gibbs entropy (H(X) = − Σ_x P(X = x) log₂ P(X = x)), and for Galois it is the quotient set IGal(L₁; L₂|K) = (Gal(L₁|K) × Gal(L₂|K))/Gal(L|K), where L₁, L₂ are two fields containing a field K in an algebraic closure Ω of K, where L is the field generated by L₁ and L₂ in Ω, and where Gal(L_i|K) (for L_i = L, L₁, L₂) denotes the group introduced by Galois, made of the field automorphisms of L_i fixing the elements of K.
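As a concrete illustration of the Shannon side of this comparison, the following minimal sketch computes I(X; Y) = H(X) + H(Y) − H(X, Y) for a joint law on pairs. The function names and the two example laws are hypothetical choices made for illustration only; they do not come from the article.

```python
from collections import Counter
from math import log2

def entropy(law):
    """Shannon entropy in bits of a law given as {value: probability}."""
    return -sum(p * log2(p) for p in law.values() if p > 0)

def marginal(joint, keep):
    """Image law of a joint law {(x, y, ...): p} under the coordinates in `keep`."""
    out = Counter()
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return dict(out)

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for a joint law on pairs (x, y)."""
    return entropy(marginal(joint, [0])) + entropy(marginal(joint, [1])) - entropy(joint)

# Two hypothetical laws on pairs of bits: independent bits, and Y a copy of X.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
copied = {(0, 0): 0.5, (1, 1): 0.5}

print(mutual_information(independent))  # vanishes for independent variables
print(mutual_information(copied))       # equals H(X) when Y determines X
```

The quantity vanishes exactly when the pair is independent, which is the sense in which it measures a departure from independence.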

We suggest that all information quantities are of co-homological nature, in a setting which depends on a pair of categories (cf. [4,5]); one for the data on a system, like random variables or functions of solutions of an equation, and one for the parameters of this system, like probability laws or coefficients of equations; the first category generates an algebraic structure like a monoid, or more generally a monad (cf. [4]), and the second category generates a representation of this structure, as do for instance conditioning, or adding new numbers; then information quantities are co-cycles associated with this module.

We will see that, given a set of random variables on a finite set Ω and a simplicial subset of probabilities on Ω, the entropy appears as the unique universal co-homology class of degree one. The higher mutual information functions that were defined by Shannon are co-cycles (or twisted co-cycles for even orders), and they correspond to higher homotopical constructions. In fact this description is equivalent to the theorem of Hu Kuo Ting [6], who gave a set theoretical interpretation of the mutual information decomposition of the total entropy of a system. Then we can use information co-cycles to describe forms of the information distribution between a set of random data; figures like ordinary links, or chains or Borromean links appear in this context, giving rise to a new kind of topology.

1.2. Information Homology

Here we call random variables (r.v.) on a finite set Ω congruent when they define the same partition (recall that a partition of Ω is a family of disjoint non-empty subsets covering Ω, and that the partition associated to a r.v. X is the family of subsets Ω_x of Ω defined by the equations X(ω) = x); the join r.v. YZ, also denoted by (Y, Z), corresponds to the least fine partition that is finer than both Y and Z. This defines a monoid structure on the set Π(Ω) of partitions of Ω, with 1 as a unit, and where each element is idempotent, i.e., ∀X, XX = X. An information category is a set S of r.v. such that, for any Y, Z ∊ S less fine than some U ∊ S, the join YZ belongs to S, cf. [7]. An ordering on S is given by Y ≤ Z when Z refines Y, which also defines the morphisms Z → Y in the category S. In what follows we always assume that 1 belongs to S. The simplex ∆(Ω) is defined as the set of families of numbers {p_ω; ω ∊ Ω}, such that ∀ω, 0 ≤ p_ω ≤ 1 and Σ_ω p_ω = 1; it parameterizes all probability laws on Ω. We choose a simplicial sub-complex 𝒫 in ∆(Ω), which is stable under all the conditioning operations by elements of S. By definition, for N ∊ ℕ, an information N-cochain is a family of measurable functions of ℙ ∊ 𝒫, with values in ℝ or ℂ, indexed by the sequences (S₁; …; S_N) in S majored by an element of S, whose values depend only on the image law (S₁, …, S_N)_*ℙ. This condition is natural from a topos point of view, cf. [4]; we interpret it as a “locality” condition. Note that we write (S₁; …; S_N) for a sequence, because (S₁, …, S_N) designates the joint variable. For N = 0 this gives only the constants. We denote by C^N the vector space of N-cochains of information. The following formula corresponds to the averaged conditioning of Shannon [1]:

S_{0}.F(S_{1}; \dots; S_{N}; \mathbb{P}) = \sum_{j} \mathbb{P}(S_{0} = v_{j}) \, F(S_{1}; \dots; S_{N}; \mathbb{P} | S_{0} = v_{j}),

(1)

where the sum is taken over all values v_j of S₀, and the vertical bar is ordinary conditioning. It satisfies the associativity condition (S′₀S₀).F = S′₀.(S₀.F).

The coboundary operator δ is defined by

\begin{array}{l} \delta F (S_{0}; \dots; S_{N}; \mathbb{P}) \\ = S_{0} . F (S_{1}; \dots; S_{N}; \mathbb{P}) + \sum_{i = 0}^{N - 1} {(- 1)}^{i + 1} F (\dots; (S_{i}, S_{i + 1}); \dots; S_{N}; \mathbb{P}) + {(- 1)}^{N + 1} F (S_{0}; \dots; S_{N - 1}; \mathbb{P}). \end{array}

(2)

It corresponds to a standard non-homogeneous bar complex (cf. [5]). Another co-boundary operator on C^N is δ_t (t for twisted or trivial action or topological complex), which is defined by the above formula with the first term S₀.F(S₁; …; S_N; ℙ) replaced by F(S₁; …; S_N; ℙ). The corresponding co-cycles are defined by the equations δF = 0 or δ_tF = 0, respectively. We easily verify that δ ○ δ = 0 and δ_t ○ δ_t = 0; then the co-homology H*(S; 𝒫), resp. H_t*(S; 𝒫), is defined by taking co-cycles modulo the elements of the image of δ, resp. δ_t, called co-boundaries. The fact that classical entropy H(X; ℙ) = − Σ_i p_i log₂ p_i is a 1-co-cycle is the fundamental equation H(X, Y) = H(X) + X.H(Y).
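The fundamental equation can be checked numerically. The sketch below implements the averaged conditioning of Equation (1) for a cochain that depends only on the law, and compares H(X, Y) with H(X) + X.H(Y). The law on four points and all function names are assumptions made for illustration; variables are represented as plain functions on Ω.

```python
from math import log2

def entropy(law):
    """Shannon entropy in bits of a law {value: probability}."""
    return -sum(p * log2(p) for p in law.values() if p > 0)

def image_law(law, variable):
    """Push a law {omega: p} forward along a variable (a function on Omega)."""
    out = {}
    for omega, p in law.items():
        out[variable(omega)] = out.get(variable(omega), 0.0) + p
    return out

def condition(law, variable, value):
    """Ordinary conditioning P | (variable = value); returns (mass, conditioned law)."""
    mass = sum(p for omega, p in law.items() if variable(omega) == value)
    if mass == 0:
        return 0.0, {}
    return mass, {omega: p / mass for omega, p in law.items() if variable(omega) == value}

def H(variable, law):
    """Entropy of the image law, so that H depends on the law only through the variable."""
    return entropy(image_law(law, variable))

def averaged_conditioning(s0, cochain, law):
    """(S0.F)(P) = sum_j P(S0 = v_j) F(P | S0 = v_j), as in Equation (1)."""
    total = 0.0
    for value in image_law(law, s0):
        mass, conditioned = condition(law, s0, value)
        if mass > 0:
            total += mass * cochain(conditioned)
    return total

# Hypothetical law on Omega = {0,1} x {0,1}; X and Y are the two projections.
law = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
X = lambda w: w[0]
Y = lambda w: w[1]
XY = lambda w: w

lhs = H(XY, law)
rhs = H(X, law) + averaged_conditioning(X, lambda q: H(Y, q), law)
print(lhs, rhs)  # the two sides agree up to rounding
```

In terms of Equation (2) with N = 1, the difference rhs − lhs is exactly δH(X; Y; ℙ), so the agreement of the two sides is the cocycle condition for this particular law.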

Theorem A. (cf. Theorem 1 section 2.3, [7]): For the full simplex ∆(Ω), and if S is the monoid generated by a set of at least two variables, such that each pair takes at least four values, then the information co-homology space of degree one is one-dimensional and generated by the classical entropy.

Problem 1. Compute the homology of higher degrees.

We conjecture that for binary variables it is zero, but that in general non-trivial classes appear, deduced from polylogarithms. This could require us to connect with the works of Dupont, Bloch, Goncharov, Elbaz-Vincent, Gangl et al. on motives (cf. [8]), which started from the discovery of Cathelineau (1988) that entropy appears in the computation of the degree one homology of the discrete group SL₂ over ℂ with coefficients in the adjoint action (cf. [9]).

Suppose S is the monoid generated by a finite family of partitions. The higher mutual informations were defined by Shannon as alternating sums:

I_{N}(S_{1}; \dots; S_{N}; \mathbb{P}) = \sum_{k = 1}^{N} {(- 1)}^{k - 1} \sum_{I \subset [N]; \, \mathrm{card}(I) = k} H(S_{I}; \mathbb{P}),

(3)

where S_I denotes the join of the S_i such that i ∊ I. We have I₁ = H and I₂ = I is the usual mutual information: I(S; T) = H(S) + H (T) − H(S, T).

Theorem B. (cf. section 3, [7]): I_{2m} = δ_tδδ_t…δδ_tH, I_{2m+1} = −δδ_tδδ_t…δδ_tH, where there are m − 1 δ and m δ_t factors for I_{2m} and m δ and m δ_t factors for I_{2m+1}.

Thus odd information quantities are information co-cycles, because they are in the image of δ, and even information quantities are twisted (or topological) co-cycles, because they are in the image of δ_t.
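For the lowest orders the statement can be unwound by hand. The following worked computation is a sketch of the case m = 1, using only Equation (2), the definition of δ_t (first term S₀.F replaced by F), the cocycle equation H(X, Y) = H(X) + X.H(Y), and the chain rules for I₂ that follow from it; the law ℙ is omitted from the notation.

```latex
\begin{aligned}
\delta_t H(X;Y) &= H(Y) - H(X,Y) + H(X) = I_2(X;Y),\\[4pt]
-\delta I_2(X;Y;Z) &= -X.I_2(Y;Z) + I_2((X,Y);Z) - I_2(X;(Y,Z)) + I_2(X;Y)\\
&= -X.I_2(Y;Z) + \bigl(I_2(X;Z) + X.I_2(Y;Z)\bigr) - \bigl(I_2(X;Y) + Y.I_2(X;Z)\bigr) + I_2(X;Y)\\
&= I_2(X;Z) - Y.I_2(X;Z)\\
&= H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z)\\
&= I_3(X;Y;Z).
\end{aligned}
```

The last equality is Equation (3) for N = 3, so that I₂ = δ_tH and I₃ = −δδ_tH, in agreement with the theorem.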

In [7] we show that this description is equivalent to the theorem of Hu Kuo Ting (1962) [6], giving a set theoretical interpretation of the mutual information decomposition of the total entropy of a system: mutual information, join and averaged conditioning correspond respectively to intersection, union and difference A\B = A ⋂ B^c. In special cases we can interpret I_N as homotopical algebraic invariants. For instance for N = 3, suppose that I(X; Y) = I(Y; Z) = I(Z; X) = 0, then I₃(X; Y; Z) = −I ((X,Y); Z) can be defined as a Milnor invariant for links, generalized by Massey, as they are presented in [10] (cf. page 284), through the 3-ary obstruction to associativity of products in a subcomplex of a differential algebra, cf. [7]. The absolute minima of I₃ correspond to Borromean links, interpreted as synergy, cf. [11,12].
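A standard example of this situation is given by two independent uniform bits X, Y and their sum Z modulo 2. The sketch below evaluates the alternating sum of Equation (3) on this law; the function names and the encoding of the law as a dictionary on triples are illustrative assumptions.

```python
from itertools import combinations
from math import log2

def entropy_of_coordinates(law, coords):
    """Entropy in bits of the image law under the chosen coordinates."""
    image = {}
    for outcome, p in law.items():
        key = tuple(outcome[i] for i in coords)
        image[key] = image.get(key, 0.0) + p
    return -sum(p * log2(p) for p in image.values() if p > 0)

def mutual_information_N(law, coords):
    """I_N as the alternating sum of joint entropies, Equation (3)."""
    total = 0.0
    for k in range(1, len(coords) + 1):
        for subset in combinations(coords, k):
            total += (-1) ** (k - 1) * entropy_of_coordinates(law, subset)
    return total

# X, Y independent uniform bits, Z = X xor Y.
xor_law = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}

pairwise = [mutual_information_N(xor_law, pair) for pair in combinations(range(3), 2)]
i3 = mutual_information_N(xor_law, (0, 1, 2))
print(pairwise, i3)  # pairwise informations vanish, I_3 is negative
```

Here the three pairwise mutual informations vanish while I₃ = −1 bit = −I((X, Y); Z): no two variables are linked, but the three together are, which is the pattern of the Borromean rings.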

1.3. Extension to Quantum Information

Positive hermitian n × n-matrices ρ, normalized by Tr(ρ) = 1, are called densities of states (or density operators) and are considered as quantum probabilities on E = ℂⁿ. Real quantum observables are n × n hermitian matrices, and, by definition, the amplitude, or expectation, of the observable Z in the state ρ is given by the formula 𝔼(Z) = Tr(Zρ) (see e.g., [13]). Two real observables Y, Z are said to be congruent if their eigenspaces are the same; thus orthogonal decompositions of E are the quantum analogs of partitions. The join is well defined for commuting observables. An information structure S is given by a subset of observables, such that, if Y, Z have a common refined eigenspace decomposition in S, their join (Y, Z) belongs to S. We assume that {E} belongs to S. What plays the role of a probability functor is a map Q from S to sets of positive hermitian forms on E, which behaves naturally with respect to the quantum direct image; thus Q is a covariant functor. We define information N-cochains as in the classical case, starting with the numerical functions on the sets Q_X; X ∊ S, which behave naturally under direct images.

The restriction of a density ρ by an observable Y is ρ_Y = Σ_A E_A^* ρ E_A, where the E_A’s are the spectral projectors of the observable Y. The functor Q is said to match S (or to be complete and minimal with respect to S) if, for each X ∊ S, the set Q_X is the set of all possible densities of the form ρ_X.

The action of a variable on the cochain space C_Q* is given by the quantum averaged conditioning:

Y.F(Y_{0}; \dots; Y_{m}; \rho) = \sum_{A} \mathrm{Tr}(E_{A}^{*} \rho E_{A}) \, F(Y_{0}; \dots; Y_{m}; E_{A}^{*} \rho E_{A})

(4)

From here we define coboundary operators δ_q and δ_Qt by the formula (22), then the notions of co-cycles, co-boundaries and co-homology classes follow. We have δ_q ○ δ_q = 0 and δ_Qt ○ δ_Qt = 0; cf. [7].

When the unitary group U_n acts transitively on S and Q, there is a notion of invariant cochains, forming a subcomplex of information cochains, and giving a more computable co-homology than the raw information co-homology. We call it the invariant information co-homology and denote it by H_U*(S; Q).

The von Neumann entropy of ρ is S(ρ) = 𝔼_ρ(−log₂(ρ)) = −Tr(ρ log₂(ρ)); it defines a 0-cochain S_X by restricting S to the sets Q_X. The classical entropy is H(Y; ρ) = −Σ_A Tr(E_A^* ρ E_A) log₂(Tr(E_A^* ρ E_A)). Both these co-chains are invariant. It is well known that S_{(X,Y)}(ρ) = H(X; ρ) + X.S_Y(ρ) when X, Y commute, cf. [13]. In particular, by taking Y = 1_E we see that classical entropy measures the defect of equivariance of the quantum entropy, i.e., H(X; ρ) = S_X(ρ) − (X.S)(ρ). But using the case where X refines Y, we obtain that the entropy of Shannon is the co-boundary of (minus) the von Neumann entropy.

Theorem C. (cf. Theorem 3 section 4.3): For n ≥ 4 and when S is generated by at least two decompositions such that each pair has at least four subspaces, and when Q is matching S, the invariant co-homology H_U^1 of δ_q in degree one is zero, and the space H_U^0 is of dimension one. In particular, the only invariant 0-cochain such that δS = −H is the von Neumann entropy.

(This statement, which will be proved below, corrects a similar statement which was made in the announcement [14].)

1.4. Concavity and Convexity Properties of Information Quantities

The simplest classical information structure S is the monoid generated by a family of “elementary” binary variables S₁,…,S_n. It is remarkable that in this case, the information functions I_N,J = I_N(S_{j₁}; …; S_{j_N}) over all the subsets J = {j₁,…,j_N} of [n] = {1,…, n}, different from [n] itself, give algebraically independent functions on the probability simplex ∆(Ω) of dimension 2ⁿ − 1. They form coordinates on the quotient of ∆(Ω) by a finite group.

Let ℒ_d denote the Lie derivative with respect to d = (1,…,1) in the vector space ℝ^{2ⁿ}, and ∆₀ the Euclidean Laplace operator on ℝ^{2ⁿ}; then ∆ = ∆₀ − 2⁻ⁿ ℒ_d ○ ℒ_d is the Laplace operator on the simplex ∆(Ω) defined by equating the sum of coordinates to 1.

Theorem D. (cf. [15]): On the affine simplex ∆(Ω) the functions I_N,J with N odd (resp. even) satisfy the inequality ∆I_N ≥ 0 (resp. ∆I_N ≤ 0).

In other terms, for N odd the I_N,J are super-harmonic which is a kind of weak concavity and for N even they are sub-harmonic which is a kind of weak convexity. In particular, when N is even (resp. odd) I_N,J has no local maximum (resp. minimum) in the interior of ∆(Ω).

Problem 2. What can be said of the other critical points of I_N,J? What can be said of the restriction of one information function on the intersection of levels of other information functions? Information topology depends on the shape of these intersections and on the Morse theory for them.

1.5. Monadic Cohomology of Information

Now we consider the category S* of generalized ordered partitions of Ω over S: they are sequences S = (E₁,…,E_m) of subsets of Ω such that ⋃_j E_j = Ω and E_i ∩ E_j = ∅ as soon as i ≠ j. The number m is named the degree of S. Note the important technical point that some of the sets E_j can be the empty set. In the same spirit we introduce generalized ordered orthogonal decompositions of E for the quantum case; but in this summary, for simplicity, we restrict ourselves to the classical case. Also, from now on in this summary we omit the word “generalized” before “ordered”. A rooted tree decorated by S* is an oriented finite planar tree Γ, with a marked initial vertex s₀, named the root of Γ, where each vertex s is equipped with an element F_s of S*, such that the edges issued from s correspond to the values of F_s. When we want to mention that we restrict to partitions less fine than a partition X we put an index X, as in S_X*.

The notation μ(m; n₁,…,n_m) denotes the operation which associates to an ordered partition S of degree m and to m ordered partitions S_i of respective degrees n_i, the ordered partition that is obtained by cutting the pieces of S using the pieces of S_i and respecting the order. An evident unit element for this operation is the unique partition π₀ of degree 1. The symbol μ_m denotes the collection of those operations for m fixed. The introduction of empty subsets in ordered partitions ensures that the result of μ(m; n₁,…,n_m) is a partition of length n₁ + … + n_m; thus the μ_m do define what is named an operad; cf. [10,16]. The axioms of unity, associativity and covariance for permutations are satisfied. See [10,16–18] for the definition of operads.
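The composition can be sketched in a few lines. The code below assumes one reading of "cutting the pieces of S using the pieces of S_i": the i-th piece E_i of S is replaced, in order, by its intersections with the pieces of S_i. Ordered partitions are represented as tuples of frozensets, and the name `compose` and the example partitions are hypothetical.

```python
def compose(S, parts):
    """mu(m; n_1, ..., n_m), under the assumption stated above:
    cut the i-th piece of S by the pieces of parts[i], keeping the order."""
    if len(S) != len(parts):
        raise ValueError("need one ordered partition per piece of S")
    return tuple(piece & cut for piece, T in zip(S, parts) for cut in T)

omega = frozenset(range(4))
unit = (omega,)  # the partition of degree 1

S = (frozenset({0, 1}), frozenset({2, 3}))                 # degree 2
T1 = (frozenset({0, 2}), frozenset({1, 3}))                # degree 2
T2 = (frozenset({0, 1, 2}), frozenset({3}), frozenset())   # degree 3, with an empty piece

result = compose(S, (T1, T2))
print(len(result), result)  # degree 2 + 3 = 5, empty pieces are kept
```

Keeping the empty intersections is what makes the degree of the result equal to n₁ + … + n_m in every case, and the partition of degree 1 acts as a unit on both sides.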

The most important algebraic object which is associated to an operad is a monad (cf. [4,16]), i.e., a functor V from a category A to itself, equipped with two natural transformations μ : V ○ V → V and η : R → V, which satisfy the following axioms:

\begin{matrix} \mu \circ (V \mu) = \mu \circ (\mu V), & \mu \circ (V \eta) = \mathrm{Id} = \mu \circ (\eta V) \end{matrix}

(5)

In our situation, we can apply the Schur construction (cf. [16]) to the μ_m to get a monad: take for V the real vector space freely generated by S*; it is naturally graded, so it is the direct sum of spaces V(m); m ≥ 1, on which the symmetric group 𝔖_m acts naturally to the right; then introduce, for any real vector space W, the real vector space V(W) = ⊕_{m≥0} V(m) ⊗_{𝔖_m} W^{⊗m}; the Schur composition is defined by V ○ V = ⊕_{m≥0} V(m) ⊗_{𝔖_m} V^{⊗m}. It is easy to verify that the collection (μ_m; m ∊ ℕ) defines a natural transformation μ : V ○ V → V, and the trivial partition π₀ defines a natural transformation η : R → V, which satisfy the axioms of a monad.

Also we fix a functor of probability laws Q_X over the category S. Let ℳ_X(m) be the vector space freely generated over ℝ by the symbols (P, i, m), where P belongs to Q_X and 1 ≤ i ≤ m. In the last section of the second part we show how this space arises from the consideration of divided probabilities. This is apparent in the following definition of the right action of the operad V on the family ℳ_X(m); m ∊ ℕ*: a sequence S₁,…,S_m of ordered partitions in S_X* acts on a generator (P, i, m) by giving the vector Σ_j p_j (P_j, (i, j), n), where p_j is the probability P(S_i = j) and P_j is the conditioned probability P|(S_i = j). We denote by θ_m((P, i, m), (S₁,…,S_m)) this vector.

Now we consider the Schur functor ℳ_X(W) = ⊕_m ℳ_X(m) ⊗_{𝔖_m} W^{⊗m}; the operations θ_m define a natural transformation θ : ℳ ○ V → ℳ, which is an action to the right in the sense of monads, i.e., θ ○ (ℱμ) = θ ○ (θV); θ ○ (ℱη) = Id. (We have dropped the index X for simplicity.)

Now we consider the bar resolution of ℳ: … → ℳ ○ V^{○(k+1)} → ℳ ○ V^{○k} → …, as in Beck (triples,…) [19] and Fresse [16], with its simplicial structure deduced from θ and μ, and the complex of natural transformations of V-right modules C*(ℳ) = Hom_V(ℳ ○ V^{○*}, R), where R is the trivial right module given by R(m) = ℝ. As in the classical case, we restrict ourselves to co-chains that are measurable in the probability (P, i, m).

The co-boundary is defined by the Hochschild formula, extended by MacLane and Beck to monads (see Beck [19]):

\delta F = F \circ (\theta V^{\circ k}) - \sum_{i = 0, \dots, k - 1} {(- 1)}^{i} F \circ \mathcal{M} V^{\circ i} \mu V^{\circ (k - i - 1)} - {(- 1)}^{k} F \circ \mathcal{M} V^{\circ k} \varepsilon .

(6)

The cochains are described by families of scalar measurable functions F_X(S₁; …; S_k; (P, i, m)), where S₁; …; S_k is a forest of m trees of level k labelled by S_X*, and where the value on (P, i, m) depends only on the tree S₁^i; S₂^i; …; S_k^i.

We impose now the condition, named regularity, that F_X(S₁; …; S_k; (P, i, m)) = F_X(S₁^i; S₂^i; …; S_k^i; P). The regular co-chains form a sub-complex C_r*(ℳ); by definition, its homology is the arborescent information co-homology.

The regular cochains of degree k are determined by their values for m = 1 and decorated trees of level k, where the co-boundary takes the form:

\begin{array}{l} δ F (S; S_{1}; \dots; S_{k}; P) \\ = \sum_{i} P (S = i) F (S_{1}^{i}; \dots; S_{k}^{i}; P | (S = i)) + \sum_{i = 1}^{i = k} {(- 1)}^{i} F (S; \dots; μ (S_{i - 1} \circ S_{i}); S_{i + 1}; \dots; S_{k}; P) \\ + {(- 1)}^{k + 1} F (S; \dots; S_{k - 1}; P) \end{array}

(7)

This gives co-homology groups H_τ*(S, 𝒫), τ for tree. The fact that entropy H(S_*ℙ) = H(S; ℙ) defines a 1-cocycle is a result of an equation of Faddeev, generalized by Baez, Fritz and Leinster [20], who gave another interpretation, based on the operad structure of the set of all finite probability laws. See also Marcolli and Thorngren [21].

Theorem E. (cf. Theorem 4 section 6.3, [22]): If Ω has more than four points, H_τ^1(Π(Ω), ∆(Ω)) is the one dimensional vector space generated by the entropy.

Another co-boundary δ_t on C_r*(ℳ) corresponds to another right action of the monad V_X, which is deduced from the maps θ_t that send (P, i, m) ⊗ S₁ ⊗ … ⊗ S_m to the sum of the vectors (P, (i, j), n) for j = 1,…, n_i that are associated to the end branches of S_i. It gives a twisted version of information co-homology, as we have done in the first paragraph. This allows us to define higher information quantities for strategies: for N = 2M + 1 odd, I_τ,N = −(δδ_t)^M H, and for N = 2M + 2 even, I_τ,N = δ_t(δδ_t)^M H.

This gives, for N = 2, a notion of mutual information between a variable S of length m and a collection T of m variables T₁,…,T_m:

I_{\tau}(S; T_{i}; \mathbb{P}) = \sum_{i = 1}^{m} \left( H(T_{i}; \mathbb{P}) - \mathbb{P}(S = i) \, H(T_{i}; \mathbb{P} | S = i) \right) .

(8)

When all the T_i are equal we recover the ordinary mutual information of Shannon plus a multiple of the entropy of T_i.
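The multiple can be made explicit. The following short computation is a sketch that takes T₁ = … = T_m = T in Equation (8) and uses the averaged conditioning of Equation (1) together with the identity I(S; T) = H(T) − S.H(T), which follows from H(S, T) = H(S) + S.H(T).

```latex
\begin{aligned}
I_{\tau}(S; T, \dots, T; \mathbb{P})
 &= \sum_{i=1}^{m} H(T; \mathbb{P}) - \sum_{i=1}^{m} \mathbb{P}(S = i)\, H(T; \mathbb{P} \mid S = i)\\
 &= m\, H(T; \mathbb{P}) - S.H(T; \mathbb{P})\\
 &= (m - 1)\, H(T; \mathbb{P}) + I(S; T; \mathbb{P}).
\end{aligned}
```

Under these assumptions the multiple of the entropy of T is therefore m − 1, where m is the length of S.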

1.6. The Forms of Information Strategies

A rooted tree Γ decorated by S_* can be seen as a strategy to discriminate between points in Ω. For each vertex s there is a minimal set of chained edges α₁,…,α_k connecting s₀ to s; the cardinal k is named the level of s; this chain defines a sequence (F₀,v₀; F₁,v₁; …; F_{k−1},v_{k−1}) of observables and values of them; then we can associate to s the subset Ω_s of Ω where each F_j takes the value v_j. At a given level k the sets Ω_s form a partition π_k of Ω; the first one π₀ is the unit partition of length 1, and π_l is finer than π_{l−1} for any l. By recurrence over k it is easy to deduce from the orderings of the values of F_s an embedding in the Euclidean plane of the subtrees Γ(k) at level k such that the values of the variables issued from each vertex are oriented in the direct trigonometric sense; thus π_k has a canonical ordering ω_k. Remark that many branches of the tree give the empty set for Ω_s after some level; we name them dead branches. It is easy to prove that the set Π(S)_* of ordered partitions that can be obtained as a (π_k, ω_k) for some tree Γ and some level k is closed under the natural ordered join operation, and, as Π(S)_* contains π₀, it forms a monoid, which contains the monoid M(S_*) generated by S_*.

Complete discrimination of Ω by S_* exists when the final partition of Ω by singletons is attainable as a π_k; optimal discrimination corresponds to minimal level k. When the set Ω is a subset of the set of words x₁,…,x_N with letters x_i belonging to given sets M_i of respective cardinalities m_i, the problem of optimal discrimination by observation strategies Γ decorated by S_* is equivalent to a problem of minimal rewriting by words of type (F₀,v₀), (F₁,v₁), …, (F_k,v_k); it is a variant of optimal coding, where the alphabet is given. The topology of the poset of discriminating strategies can be computed in terms of the free Lie algebra on Ω, cf. [16].

Probabilities ℙ in 𝒫 correspond to a priori knowledge on Ω. In many problems 𝒫 is reduced to one element, that is the uniform law. Let s be a vertex in a strategic tree Γ, and let 𝒫_s be the set of probability laws that are obtained by conditioning through the equations F_i = v_i; i = 0,…,k − 1 for a minimal chain leading from s₀ to s. We can consider that the sets 𝒫_s for different s along a branch measure the evolution of knowledge when applying the strategy. The entropy H(F; ℙ_s) for F in S_* and ℙ_s in 𝒫_s gives a measure of the information we hope to obtain when applying F at s in the state ℙ_s. The maximum entropy algorithm consists in choosing at each vertex s a variable that has the maximal conditioned entropy H(F; ℙ_s).
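A toy version of this greedy rule can be sketched as follows, under simplifying assumptions that are not in the article: 𝒫 is reduced to the uniform law, so that conditioning at a vertex s is just the uniform law on Ω_s, and the variables are arbitrary functions on Ω. The tree encoding, the function names and the example (three bit projections plus a less informative indicator on eight points) are all hypothetical.

```python
from math import log2

def entropy(variable, points):
    """Entropy in bits of `variable` under the uniform law on `points`."""
    counts = {}
    for omega in points:
        value = variable(omega)
        counts[value] = counts.get(value, 0) + 1
    total = len(points)
    return -sum(c / total * log2(c / total) for c in counts.values())

def max_entropy_tree(points, variables):
    """Greedy strategy: at each vertex apply a variable of maximal conditioned entropy."""
    points = tuple(points)
    if len(points) <= 1:
        return {"leaf": points}
    best = max(range(len(variables)), key=lambda i: entropy(variables[i], points))
    branches = {}
    for omega in points:
        branches.setdefault(variables[best](omega), []).append(omega)
    if len(branches) == 1:
        return {"leaf": points}  # no variable separates the remaining points
    return {"ask": best,
            "branches": {value: max_entropy_tree(subset, variables)
                         for value, subset in branches.items()}}

def depth(tree):
    """Maximal level of a vertex in the strategy tree."""
    if "leaf" in tree:
        return 0
    return 1 + max(depth(subtree) for subtree in tree["branches"].values())

def leaves(tree):
    """The sets Omega_s at the end of each branch."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return [leaf for subtree in tree["branches"].values() for leaf in leaves(subtree)]

omega = range(8)
bits = [lambda w, k=k: (w >> k) & 1 for k in range(3)]
is_zero = lambda w: int(w == 0)  # a less informative variable at the root
tree = max_entropy_tree(omega, [is_zero] + bits)
print(depth(tree), leaves(tree))
```

In this example the greedy rule ignores the indicator at the root, discriminates the eight points completely, and does so in three levels, which is the minimum for binary variables on eight points. The weighing problem of Theorem F involves ternary outcomes and a different family of variables, and is not reproduced here.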

Theorem F. (cf. [22]): To find one false piece of different weight among N pieces for N ≥ 3, when knowing the false piece is unique, in the minimal number of weighings, one can use the maximal entropy algorithm.

However, we have another measure of the information of the remaining ambiguity at s, by taking for the Galois group G_s the set of permutations of Ω_s which respect globally the set 𝒫_s and the set of restrictions of elements of S_* to Ω_s, and which preserve one by one the equations F_i = v_i. Along branches of Γ this gives a decreasing sequence of groups, whose successive quotients measure the evolution of acquired information in an algebraic sense.

Problem 3. Generalize Theorem F. Can we use algorithms based on the Galoisian measure of information? Can we use higher information quantities associated to trees for optimal discrimination?

1.7. Conclusion and Perspective

Concepts of algebraic topology were recently applied to information theory by several researchers. In particular, notions coming from category theory, homological algebra and differential geometry were used for revisiting the nature and scope of entropy, cf. for instance Baez et al. [20], Marcolli and Thorngren [21] and Gromov [23]. In the present note we interpreted entropy and Shannon information functions as co-cycles in a natural co-homology theory of information, based on categories of observables and complexes of probabilities. This allowed us to associate topological figures, like Borromean links, with particular configurations of mutual dependency of several observable quantities. Moreover, we extended these results to a dynamical setting of system observation, and we connected probability evolutions with the measures of ambiguity given by Galois groups. All those results provide only the first steps toward a developed Information Topology. However, even at this preliminary stage, this theory can be applied to the study of the distribution and evolution of information in concrete physical and biological systems. This kind of approach has already proved its efficiency for detecting collective synergic dynamics in neural coding [12], in genetic expression [24], in cancer signature [25], or in signaling pathways [26]. In particular, information topology could provide the principles accounting for the structure of information flows in biological systems and notably in the central nervous system of animals.

2. Classical Information Topos. Theorem One

2.1. Information Structures and Probability Families

Let Ω be a finite set. The set Π(Ω) of all partitions of Ω constitutes a category with one arrow Y → Z from Y to Z when Y is finer than Z; we also say in this case that Y divides Z. In Π(Ω) we have an initial element, which is the partition by points, denoted ω, and a final element, which is Ω itself and is denoted by 1. The joint partition YZ or (Y, Z) of two partitions Y, Z of Ω is the least fine partition that divides Y and Z, i.e., their gcd. For any X we get XX = X, ωX = ω and 1.X = X.
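The joint partition and the three identities can be made concrete with a small sketch. Partitions are represented here as frozensets of frozensets, which is an implementation choice, and the names `partition`, `join` and `refines` are hypothetical.

```python
def partition(blocks):
    """Normalize a partition to a frozenset of non-empty frozensets."""
    return frozenset(frozenset(block) for block in blocks if block)

def join(Y, Z):
    """Joint partition YZ: the non-empty intersections of a block of Y with a block of Z."""
    return partition(a & b for a in Y for b in Z)

def refines(Y, Z):
    """True when Y divides Z, i.e., every block of Y lies inside a block of Z."""
    return all(any(a <= b for b in Z) for a in Y)

one = partition([range(4)])                   # final element 1: a single block
points = partition([w] for w in range(4))     # initial element: partition by points
X = partition([{0, 1}, {2, 3}])
Y = partition([{0, 2}, {1, 3}])

print(join(X, X) == X, join(points, X) == points, join(one, X) == X)
print(refines(join(X, Y), X) and refines(join(X, Y), Y))
```

The join computed this way divides both factors, and any partition dividing both factors divides it, which is the sense in which it is their gcd.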

By definition an information structure S on Ω is a subset of Π(Ω), such that for any element X of S, and any pair of elements Y, Z in S that X refines, the joint partition YZ also belongs to S.

In addition we will always assume that the final partition 1 belongs to S. In terms of observations, it means that at least something is a certitude.

Examples: start with a set Σ = {S_i; 1 ≤ i ≤ n} of partitions of Ω. For any subset I = {i₁,…, i_k} of [n] = {1,…, n}, the joint (S_{i₁},…, S_{i_k}), also denoted S_I, divides each S_{i_j}. The set W = W(Σ) of all the S_I, when I describes the subsets of [n], is an information structure. It is even a commutative monoid, because any product of elements of W belongs to W, and the partition associated to Ω itself gives the identity element of W. The product S_[n] of all the S_i is maximal; it divides all the other elements. Like Π(Ω), the monoid W(Σ) is idempotent, i.e., for any X we have XX = X.

By definition, the faces of the abstract simplex ∆([n]) are the subsets of [n]; its vertices are the singletons. Thus the monoid W(Σ) can be identified with the first barycentric subdivision of the simplex ∆([n]).

Recall that a simplicial subcomplex of ∆([n]) is a subset of faces that contains all faces of any of its elements. Then any simplicial sub-complex K of ∆([n]) gives a simplicial information structure S(K), embedded in W(Σ). In fact, if Y and Z are faces of a simplex X belonging to K, YZ is also a face in X, thus it belongs to K. The maximal faces Σ_a; a ∊ A of K correspond to the finest elements in S(K); the vertices of a face Σ_a give a family of partitions, which generates a sub-monoid W_a = W(Σ_a) of W; it is a sub-information structure (full sub-category) of S(K), having the same unit, but having its own initial element ω_a. These examples arise naturally when formalizing measurements if some obstructions or a priori decisions forbid a set of joint measurements.

Examples of this kind were considered by Han [27]; see also McGill [28].

Example 1. Ω has four elements (00), (01), (10), (11); the variable S₁ (resp. S₂) is the projection pr₁ (resp. pr₂), on E₁ = E₂ = {0, 1}; Σ is the set {S₁, S₂}. The monoid W(Σ) has four elements 1, S₁, S₂, S₁ S₂. The partition S₁S₂ = S₂S₁ corresponds to the variable Id : Ω → Ω.

Example 2. Same Ω as before, with the same names for the elements, but we take all the partitions of Ω in S. In addition to 1, S₁, S₂ and S = S₁S₂, there is S₃, the last partition in two subsets of cardinal two, which can be represented by the sum of the indices: S₃(00) = 0, S₃(11) = 0, S₃(01) = 1, S₃(10) = 1, the four partitions Y_ω, for ω ∊ Ω, formed by a singleton {ω} and its complement, and finally the six partitions X_μν = Y_μY_ν, indexed by pairs of points in Ω satisfying μ < ν in the lexical order. The product of two distinct Y is a X, the product of two distinct X or two distinct S_i is S, the product of one Y and a S_i is a X, of one Y and a X is this X or S, of one S and a X is this X or S. In particular the monoid W is also generated by the three S_i and the four Y_ω; it is called the monoid of partitions of Ω, and the associative algebra Λ(S) of this monoid is called the partition algebra of Ω.

Example 3. Same Ω as before, that is Ω = ∆(4), with the notations of Example 2 for the partitions; but we choose as generating family the set ϒ of the four partitions Y_μ; μ ∊ Ω; the joint product of two such partitions is either a Y_μ (when they coincide) or an X_μν (when they are different). The monoid W(ϒ) has twelve elements.

Example 4. Ω has 8 elements, noted (000),…,(111), and we consider the family Σ of the three binary variables S₁, S₂, S₃ given by the three projections. If we take all the joints, we have a monoid of eight elements. However, if we forbid the maximal face (S₁, S₂, S₃), we have a structure

S

which is not a monoid; it is the set formed by 1, S₁, S₂, S₃ and the three joint pairs (S₁, S₂), (S₁, S₃), (S₂, S₃).
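The element counts quoted in Examples 1, 3 and 4 can be checked by brute force. The following Python sketch is only an illustration: it encodes a partition of Ω as a tuple of block labels, and the helper names `canon`, `join` and `generated_monoid` are our own, not notation from the text.

```python
from itertools import product

def canon(labels):
    """Relabel blocks by order of first appearance, so equal partitions compare equal."""
    seen = {}
    return tuple(seen.setdefault(lab, len(seen)) for lab in labels)

def join(X, Y):
    """Joint partition XY: two points share a block iff they do so in both X and Y."""
    return canon(zip(X, Y))

def generated_monoid(generators, n_points):
    """Close the generators under the joint product, starting from the trivial partition 1."""
    unit = canon([0] * n_points)
    elements = {unit}
    frontier = {unit}
    while frontier:
        new = {join(W, S) for W in frontier for S in generators} - elements
        elements |= new
        frontier = new
    return elements

# Example 1: the two projections on a set of four points.
omega2 = list(product([0, 1], repeat=2))
S1 = canon(w[0] for w in omega2)
S2 = canon(w[1] for w in omega2)
size_example_1 = len(generated_monoid([S1, S2], 4))

# Example 3: the four partitions "singleton and its complement".
Ys = [canon(int(w == mu) for w in range(4)) for mu in range(4)]
size_example_3 = len(generated_monoid(Ys, 4))

# Example 4: the three projections on a set of eight points (all joints allowed).
omega3 = list(product([0, 1], repeat=3))
projections = [canon(w[i] for w in omega3) for i in range(3)]
size_example_4 = len(generated_monoid(projections, 8))

print(size_example_1, size_example_3, size_example_4)  # 4 12 8
```

Removing the single finest element from the last monoid leaves the seven partitions listed in Example 4, which no longer form a monoid because the joint of the three pairs is missing.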

On the side of probabilities, we choose a Boolean algebra

B

of sets in Ω, i.e., a subset

B

of the set

P (Ω)

of subsets of Ω that contains the empty set

0

and the full set Ω, and is closed by union and intersection. In this finite context, it is easy to prove that

B

consists of all the unions of its minimal elements (called atoms). In this setting, we consider only information structures made of partitions each of whose elements belongs to

B

. Consequently we could replace everywhere Ω by the finite set

Ω_{B}

of the atoms of

B

, but we will see that several Boolean sub-algebras appear naturally in the process of observation, thus we prefer to mention the choice of

B

at the beginning of observations. Then we consider the set

Δ (Ω_{B})

, or

Δ (B)

, of all probability laws on

(Ω, B)

, i.e., all real functions p_x of the atoms x of

B

(the points of

Ω_{B}

), satisfying p_x ≥ 0 and Σ_x p_x = 1. We see that this set of probabilities is also a simplex ∆([N]), where N is the cardinality of

Ω_{B}

.

As on the side of partitions, we will consider more generally any simplicial sub-complex

Q

of

Δ (B)

, and call it a probability complex. In the appendix, we show that this kind of examples correspond to natural forbidding rules, that can express physical constraints on the observed system.

A partition Y which is measurable with respect to

B

is made by elements Y_i for i = 1, …, m, belonging to

B

. Let P be an element of

Δ (B)

; the conditioning of P by the element Y_i is defined only if P(Y_i) ≠ 0, and given by the formula P(B|Y = y_i) = P(B ⋂ Y_i)/P(Y_i). We will consider it as a probability on Ω equipped with

B

, not as a probability on Y_i. Remark that if P belongs to a simplicial family

Q

, the probability P(B|Y = y_i) is also contained in

Q

. In fact, if the smallest face of

Q

which contains P is the simplex σ on the vertices x₁,…,x_k, then the conditioning of P by Y_i, being equal to 0 for the other atoms x, belongs to a face of σ, which is in

Q

, because

Q

is a complex.

For a probability family

Q

, i.e., a set of probabilities on Ω, and a set of partitions

S

, we say that

Q

and

S

are adapted to each other if the conditioning of every element of

Q

by every element of S belongs to

Q

.

By definition, the algebra

B_{Y}

is the set of unions of elements of the partition Y. We can consider it as a Boolean algebra on Ω contained in

B

or as Boolean algebra on the quotient set Ω/Y. The image Y_*Q of a probability Q for

B

by the partition Y is the probability on Ω for the sub-algebra

B_{Y}

, that is given by Y_*Q(t) = Q(t) for t ∊

B_{Y}

. It is the forgetting operation, also frequently named marginalization by Y.

By definition, the set

Q_{Y}

is the image of Y_*. Let us prove that it is a simplicial sub-complex of

Δ (B_{Y})

: take a simplex σ of

Q

, denote its vertices by x₁,…,x_k, note δ_j the Dirac mass of x_j, and look at the partition σ_i = Y_i ⋂ σ of σ induced by Y, then for all the x_j ∊ σ_i the images Y_* δ_j coincide. Let us denote this image by δ(Y, σ_i); it is an element of

Q_{Y}

. For every law Q in σ, the image Y_*Q belongs to the simplex on the laws δ(Y, σ_i), and any point in this simplex belongs to

Q_{Y}

. Q.E.D.

If X → Y is an arrow in

Π (Ω_{B})

, the above argument shows that the map

Q_{X} \to Q_{Y}

is a simplicial mapping.

Conditioning by Y and marginalization by Y_* are related by the barycentric law (or theorem of total probability, Kolmogorov 1933 [29]): for any measurable set A in

B

we have

P (A) = P (Y = y_{1}) P | (Y = y_{1}) (A) + \dots + P (Y = y_{m}) P | (Y = y_{m}) (A) .

(9)
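As a small numerical illustration of Equation (9), the following Python sketch conditions a law on each block of a partition and recombines the results with the weights given by the image law. The encoding (a law as a list of atom masses, a partition as a list of block labels) and the helper names `marginal`, `condition` and `prob` are assumptions made for this example only.

```python
def marginal(P, Y):
    """Image law Y_*P: total mass of each block of the partition Y."""
    masses = {}
    for label, p in zip(Y, P):
        masses[label] = masses.get(label, 0.0) + p
    return masses

def condition(P, Y, y):
    """Conditional law P|(Y = y), kept as a law on all of Omega (zero outside the block)."""
    mass = sum(p for label, p in zip(Y, P) if label == y)
    if mass == 0:
        raise ValueError("conditioning is undefined on a block of probability zero")
    return [p / mass if label == y else 0.0 for label, p in zip(Y, P)]

def prob(P, A):
    """Probability of the event A, given as a set of atom indices."""
    return sum(P[w] for w in A)

P = [0.1, 0.2, 0.3, 0.4]   # a law on four atoms
Y = [0, 0, 1, 1]           # a partition into two blocks
A = {1, 2}                 # an event

# Right-hand side of the barycentric law: sum over blocks of P(Y = y) * P|(Y = y)(A).
total = sum(m * prob(condition(P, Y, y), A)
            for y, m in marginal(P, Y).items() if m > 0)
print(prob(P, A), total)
```

Blocks of probability zero are skipped in the sum, in agreement with the remark that conditioning is defined only when P(Y_i) ≠ 0.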

Remark that the notions of information structures and probability complexes extend to infinite sets; this is developed in paper [7].

In this context, we have a formula for any integrable function φ on Ω with respect to P:

\int_{Ω} φ (ω) d P (ω) = \int_{Ω / Y} d (Y_{*} P) (ω^{'}) \int_{Ω} φ (ω) d (P | (Y = ω^{'})) (ω) .

(10)

Consider a finite set Ω, equipped with a Boolean algebra

B

, a probability family

Q

for it and an information structure

S

adapted to

B

.

For each object X in

S

, the set

S_{X}

made by the partitions Y that are divided by X is a closed sub-category, possessing an internal law of monoid. The object X is initial. To any arrow X → Y is associated the inclusion

S_{Y} \to S_{X}

, thus we get a contra-variant functor from

S

to the category of monoids.

On the other side we have a natural co-variant functor of

S

to the category of sets, which associates to each partition

X \in S

the set

Q_{X}

of probability laws in the image of

Q

on the quotient set Ω/X, and which associates to each arrow X → Y the surjection

Q_{X} \to Q_{Y}

which is given by direct image P_X ↦ Y_*P_X. If

Q

is simplicial the functor goes to the category of simplicial complexes.

Definition 1. For

X \in S

, the functional module

F_{X} (Q)

is the real vector space of measurable functions on the space

Q_{X}

; for each arrow of divisibility X → Y, we have an injective linear map f ↦ f^{Y|X} from

ℱ

_Y to

ℱ

_X given by

f^{Y | X} (P_{X}) = f (Y_{*} P_{X}) .

(11)

In this manner, we obtain a contra-variant functor

ℱ

from the category

S

to the category of real vector spaces.

If

Q

and

S

are adapted to each other, the functor

ℱ

admits a canonical action of the monoid functor

X \mapsto S_{X}

, given by the average formula

(Y . f) (P) = \int d Y_{*} P (y) f (P | (Y = y)) .

(12)

To verify this is an action of monoid, we must verify that for any Z which divides Y, and any f ∊

ℱ

_Y, we have, in

ℱ

_X the identity

{(Z . f)}^{Y | X} = Z . (f^{Y | X});

(13)

that means, for any

P \in Q_{X}

:

\int_{E_{Z}} d Z_{*} P (z) f^{Y | X} (P | (Z = z)) = \int_{E_{Z}} d Z_{*} P (z) f ((Y_{*} P) | (Z = z)) .

(14)

But this results from the identity Y_*(P|(Z = z)) = (Y_*P)|(Z = z) due to Y_*P(Z = z) = P(Z = z). The arrows of direct images and the action of averaged conditioning satisfy the axiom of distributivity: if Y and Z divide X, but not necessarily Z divides Y, we have

Z . (f^{Y}) (P, X) = (Z, Y) ({(Z, Y)}_{*} P, (Y, Z)) = {(Z . f)}^{(Z, Y)} (P, X) .

(15)

Proof. The first identity comes from the fact that (Z,Y)_*(P|(Z = z)) = Y_*(P|(Z = z)); the second one follows from the fact that we have an action of the monoid

S_{X}

.

As formula (12) is central in our work, we dwell on it and comment on its meaning, at least in this finite setting:

Let P ↦ f (P) be an element of

ℱ

_X, and Y be the target of an arrow X → Y, we have

Y . f (P) = \sum_{j} P (Y = y_{j}) f (P | (Y = y_{j})) .

(16)

where j runs over the indices of the partition Y.

We will see when discussing functions of several partitions that this formula is due to Shannon and corresponds to conditional information.

Lemma 1. For any pair (Y, Z) of variables in

S_{X}

, and any F for which the integrals converge, we have (Y,Z).F = Y.(Z.F).

Proof. We note p_i the probability that Y = y_i, π_ij the joint probability of (Y = y_i, Z = z_j), and q_ij the conditional probability of Z = z_j knowing that Y = y_i, then

\begin{array}{l} (Y, Z) . F (P) = \sum_{i} \sum_{j} π_{i j} F (P | (Y = y_{i}, Z = z_{j})) \\ = \sum_{i} p_{i} (\sum_{j} q_{i j} F (P | (Y = y_{i}, Z = z_{j}))) \\ = \sum_{i} p_{i} (\sum_{j} q_{i j} F ((P | (Y = y_{i})) | (Z = z_{j}))) \\ = \sum_{i} p_{i} (Z . F) (P | (Y = y_{i})) \\ = Y . (Z . F) (P) . \end{array}
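The identity of Lemma 1 can be tested numerically in the finite setting. The sketch below is a hypothetical illustration, not part of the proof: laws are lists of atom masses, partitions are lists of labels, and the test function F (a sum of cubes) is an arbitrary choice.

```python
def law(Y, P):
    """Image law Y_*P as a dictionary {block label: mass}."""
    masses = {}
    for label, p in zip(Y, P):
        masses[label] = masses.get(label, 0.0) + p
    return masses

def condition(P, Y, y):
    """Conditional law P|(Y = y) on Omega."""
    mass = sum(p for label, p in zip(Y, P) if label == y)
    return [p / mass if label == y else 0.0 for label, p in zip(Y, P)]

def act(Y, F):
    """Averaged conditioning: (Y.F)(P) = sum_y P(Y = y) F(P|(Y = y))."""
    return lambda P: sum(m * F(condition(P, Y, y))
                         for y, m in law(Y, P).items() if m > 0)

P = [0.05, 0.15, 0.10, 0.20, 0.25, 0.05, 0.12, 0.08]
Y = [0, 0, 0, 0, 1, 1, 1, 1]
Z = [0, 0, 1, 1, 0, 0, 1, 1]
YZ = list(zip(Y, Z))                   # the joint variable (Y, Z)

F = lambda Q: sum(q ** 3 for q in Q)   # an arbitrary function of the law

lhs = act(YZ, F)(P)                    # (Y, Z).F
rhs = act(Y, act(Z, F))(P)             # Y.(Z.F)
print(lhs, rhs)
```

The two values agree up to rounding because conditioning first on Y = y_i and then on Z = z_j gives the same law as conditioning directly on the joint event.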

Remark 1. In the general case, where Ω is not necessarily finite and

B

is any sigma-algebra, the Lemma 1 is a version of the Fubini theorem.

Let us consider the category

S

equipped with the discrete topology, to get a site (cf. SGA [30]). Over a discrete site every presheaf is a sheaf. The contravariant functor

X \mapsto S_{X}

gives a structural sheaf of monoids, and by passing to the algebras

A_{X}

over ℝ which are generated by the (finite) monoids, we get a sheaf in rings, thus S becomes a ringed site. Moreover, by considering all contra-variant functors

X \mapsto N_{X}

from

S

to modules over the algebra functor

A

, we obtain a ringed topos, that we name the information topos associated to

Ω, B, S

. This ringed topos concerns only the observables given by partitioning.

Now take into account a probability family

Q

which is adapted to

S

, for instance a simplicial family; we obtain a functor X ↦ Q_X translating the marginalization by the partitions, considered as observable quantities, and the conditioning by observables is translated by a special element X ↦

ℱ

_X of the information topos.

In this way it is natural to expect that topos co-homology, as introduced by Grothendieck, Verdier and their collaborators (see SGA 4 [30]), captures the invariant structure of observation, and defines in this context what information is. This is the main outcome of our work.

As a consequence of Grothendieck’s article (Tohoku, 1957 [31]), a ringed topos possesses enough injective objects, i.e., any object is the sub-object of an injective object, moreover, up to isomorphism, there is a unique minimal injective object containing a given object, called its injective envelope (cf. Gabriel, seminaire Dubreil, exp. 17 [32]). Thus each object in the category

D_{S}

of modules over a ringed site

S

possesses a canonical injective resolution I_*(N); then the group

Ext_{D}^{n} (M, N)

can be defined as the homology of the complex

Hom_{D} (M, I_{n} (N))

. Those groups are denoted by Hⁿ(M; N).

The “comparison theorem” (cf. Bourbaki, Alg.X Th1, p. 100 [33], or MacLane 1975, p. 261 [5]) asserts that, for any projective (resp. injective) resolution of M (resp. N) there exists a natural map of complexes between the resulting complex of homomorphisms and the above canonical complex, and that this map induces an isomorphism in co-homology.

In our context, we take for M the trivial constant module

R_{S}

over

S

, and we take for N the functional module

F (Q)

.

The existence of free resolutions of

R_{S}

makes things easier to handle.

Hence we propose that the natural information quantities are classes in the co-homology groups

H * (R_{S}, F (Q))

.

This is reminiscent of Galois co-homology (see SGA [30]), where M is also taken as the constant sheaf over the category of G-objects seen as a site.

In [7] we develop further this more geometric approach, by considering several resolutions. But in this paper, in order to be concrete, we will only focus on a more elementary approach, associated to a special resolution, called the non-homogeneous bar-resolution, which also leads to the general result. This is the object of the next section.

2.2. Non-Homogeneous Information Co-Homology

For each integer m ≥ 0, and each object

X \in S

, we consider the real vector space S_m(X), freely generated by the m-tuples of elements of the monoid

S_{X}

, and we define C^m(X) as the real vector space of linear functions from S_m(X) to the space

ℱ

_X of measurable functions from

Q_{X}

to ℝ.

Then we define the set

C^{m}

of m-cochains as the set of collections F_X ∊ C^m(X) satisfying the following condition, named joint locality:

For each Y divided by X, when each variable X_j is divided by Y, we must have

F_{Y} (X_{1}; \dots; X_{m}; Y_{*} P) = F_{X} (X_{1}; \dots; X_{m}; P) .

(17)

Thus a co-chain F is a natural transformation from the functor S_m (X) from

S

to the category of real vector spaces to the functor

ℱ

of measurable functions on

Q_{X}

. Hence, F is not an ordinary numerical function of probability laws ℙ and a set (X₁,…,X_m) of m random variables, but we can speak of its value F_X(X₁;…;X_m; ℙ) for each X in

S

. For X given the co-chains form a sub-vector space

C^{m} (X)

of C^m(X).

If we apply the condition to Y = (X₁,…,X_m) we find that F(X₁;…; X_m; ℙ) depends only on the direct image of ℙ by the joint variable of the X_i’s. This implies that, if F belongs to

C^{m} (X)

, we have

F (X_{1}; \dots; X_{m}; P) = F (X_{1}; \dots; X_{m}; {(X_{1}, \dots, X_{m})}_{*} P) .

(18)

Conversely, suppose that F satisfies the conditions (18) and consider X, Y two variables such that X divides Y, and that Y divides each X_j, and let P be a probability in

Q_{X}

; then the joint variable Z = (X₁,…,X_m) is divided by Y and by X, thus we have Z_*P = Z_*(X_*P) = Z_*(Y_*P), and

F (X_{1}; \dots; X_{m}; Y_{*} P) = F (X_{1}; \dots; X_{m}; Z_{*} P) = F (X_{1}; \dots; X_{m}; X_{*} P) .

(19)

This proves that F belongs to

C^{m} (X)

.

Let F be an element of

C^{m} (X)

, and Y an element of

S_{X}

; then we define

Y . F (X_{1}; \dots; X_{m}; P) = \sum_{j} P (Y = y_{j}) F (X_{1}; \dots; X_{m}; P | (Y = y_{j})) .

(20)

It follows from the equivalent condition (18) that Y.F also belongs to

C^{m} (X)

.

Moreover, the proof of Lemma 1 applies and gives that, for any pair (Y, Z) of variables in

S_{X}

, and any F in

C^{m} (X)

, we have (Y, Z).F = Y.(Z.F).

Thus (20) defines an action of the semigroup

S_{X}

on the vector spaces

C^{m} (X)

.

Remark 2. The operation of

S_{X}

can be rewritten more compactly by using integrals:

Y . F (X_{1}; \dots; X_{m}; ℙ) = \int_{Ω} F (X_{1}; \dots; X_{m}; ℙ | (Y = Y (ω))) d ℙ (ω) .

(21)

The differential δ for computing co-homology is given by the Eilenberg-MacLane formula (1943):

\begin{array}{l} δ^{m} F (Y_{1}; \dots; Y_{m + 1}; P) \\ = Y_{1} . F (Y_{2}; \dots; Y_{m + 1}; P) + \sum_{i = 1}^{m} {(- 1)}^{i} F (\dots; (Y_{i}, Y_{i + 1}); \dots; Y_{m + 1}; P) + {(- 1)}^{m + 1} F (Y_{1}; \dots; Y_{m}; P) . \end{array}

(22)

Since this formula corresponds to the standard inhomogeneous bar-resolution in the case of semi-groups and algebras (Cf. MacLane p. 115 [4] and Cartan-Eilenberg pp. 174–175. [34]), we name δ the Hochschild co-boundary, as in the case of semi-groups, and algebras.

Remark that a function F satisfying the joint locality condition (i.e., the hypothesis that F(Y₁;…; Y_m; P) depends only on (Y₁,…, Y_m)_*P) has a co-boundary which is also jointly local, because the variables appearing in the definition are all joint variables of the Y_j. (This would not have been true for the stronger locality hypothesis asking that F depends only on the collection (Y_j)_*P; j = 1,…,m.)

It is easy to verify that δ^m ○ δ^{m−1} = 0. We denote by Z^m the kernel of δ^m and by B^m the image of δ^{m−1}. The elements of Z^m are named m-cocycles; we consider them as information quantities. The elements of B^m are named m-coboundaries.
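The relation δ ○ δ = 0 can be checked numerically on a small example. The following Python sketch implements formula (22) for finite partitions under assumptions of our own: a variable is a tuple of labels on Ω, a co-chain is a Python function taking m variables followed by a law, and the test co-chain g (sum of squared masses of the image law) is an arbitrary jointly local function chosen for illustration.

```python
def joint(*variables):
    """Joint variable: the label of a point is the tuple of its labels."""
    return tuple(zip(*variables))

def law(Y, P):
    masses = {}
    for label, p in zip(Y, P):
        masses[label] = masses.get(label, 0.0) + p
    return masses

def condition(P, Y, y):
    mass = sum(p for label, p in zip(Y, P) if label == y)
    return [p / mass if label == y else 0.0 for label, p in zip(Y, P)]

def act(Y, F):
    """(Y.F)(X_1; ...; X_m; P) = sum_y P(Y = y) F(X_1; ...; X_m; P|(Y = y))."""
    def YF(*args):
        *Xs, P = args
        return sum(m * F(*Xs, condition(P, Y, y))
                   for y, m in law(Y, P).items() if m > 0)
    return YF

def delta(F, m):
    """Co-boundary of an m-cochain F, following formula (22)."""
    def dF(*args):
        *Ys, P = args
        assert len(Ys) == m + 1
        total = act(Ys[0], F)(*Ys[1:], P)
        for i in range(1, m + 1):
            merged = Ys[:i - 1] + [joint(Ys[i - 1], Ys[i])] + Ys[i + 1:]
            total += (-1) ** i * F(*merged, P)
        total += (-1) ** (m + 1) * F(*Ys[:m], P)
        return total
    return dF

def g(Y, P):
    """A jointly local 1-cochain that is not a co-cycle."""
    return sum(p * p for p in law(Y, P).values())

P = [0.05, 0.15, 0.10, 0.20, 0.25, 0.05, 0.12, 0.08]
Y1 = (0, 0, 0, 0, 1, 1, 1, 1)
Y2 = (0, 0, 1, 1, 0, 0, 1, 1)
Y3 = (0, 1, 0, 1, 0, 1, 0, 1)

dg = delta(g, 1)                  # a 2-cochain
ddg = delta(dg, 2)                # a 3-cochain, expected to vanish
not_a_cocycle = dg(Y1, Y2, P)
vanishes = ddg(Y1, Y2, Y3, P)
print(not_a_cocycle, vanishes)
```

The cancellation in the second co-boundary relies on the monoid-action property (Y, Z).F = Y.(Z.F) established above; all other terms cancel in pairs.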

Definition 2. For m ≥ 0, the quotient

H^{m} (C *) = Z^{m} / B^{m}

(23)

is the m-th cohomology group of information of the information structure

S

on the simplicial family of probabilities

Q

. We denote it by

H^{m} (S; Q)

.

The information co-homology satisfies functoriality properties:

Consider two pairs of information structures and probability families,

(S, Q)

and

(S^{'}, Q^{'})

on two sets Ω, Ω′ equipped with the σ-algebras

ℬ, ℬ'

respectively, and φ a surjective measurable map from

(Ω, ℬ)

to

(Ω^{'}, ℬ')

, such that

φ_{*} (Q) \subseteq Q^{'}

(i.e.,

φ_{*} (Q) \in Q^{'}

for every

Q \in Q

), and such that

S \subseteq φ^{*} S^{'}

(i.e.,

\forall X \in S, \exists X^{'} \in S^{'}, X = X^{'} \circ φ

); then we have the following construction:

Proposition 1. For each integer m ≥ 0, a natural linear map

φ^{*} : H^{m} (Q^{'}; S^{'}) \to H^{m} (Q; S),

(24)

is defined by the following application at the level of local co-chains:

φ^{*} (F^{'}) (X_{1}; \dots; X_{m}; P) = F^{'} ({X^{'}}_{1}; \dots; {X^{'}}_{m}; φ_{*} (P)),

(25)

for a collection of variables

{X^{'}}_{j}; j = 1, \dots, m

satisfying

X_{j} = {X^{'}}_{j} \circ φ

for each j.

Proof. First, remark that X_j = X″_j ○ φ implies

{X^{'}}_{j} = {X^{''}}_{j}

because φ is surjective. As F′ is (jointly) local, the co-chain F = φ* (F′) is also (jointly) local. Finally, it is evident that the map F′ ↦ F commutes with the co-boundary operator. Therefore the proposition follows.

Another co-homological construction works in the reversed direction:

Consider two information structures

(S, Q)

and

(S^{'}, Q^{'})

on two sets Ω, Ω′ equipped with σ-algebras

ℬ, ℬ'

respectively, and φ a measurable map from

(Ω, ℬ)

to

(Ω^{'}, ℬ^{'})

, such that

Q^{'} \subseteq φ_{*} (Q)

(i.e.,

\forall Q^{'} \in Q^{'}, \exists Q \in Q, Q^{'} = φ_{*} (Q)

), and such that

φ^{*} S^{'} \subseteq S

(i.e.,

\forall X^{'} \in S^{'}, X^{'} \circ φ \in S

); then the following result is true:

Proposition 2. For each integer m ≥ 0, a natural linear map

φ_{*} : H^{m} (Q; S) \to H^{m} (Q^{'}; S^{'}),

(26)

is defined by the following application at the level of co-chains:

φ_{*} (F) ({X^{'}}_{1}; \dots; {X^{'}}_{m}; P^{'}) = F ({X^{'}}_{1} \circ φ; \dots; {X^{'}}_{m} \circ φ; P),

(27)

for a probability law

P \in Q

and its image P′ = φ_* (P).

Proof. First, remark that, if Q also satisfies P′ = φ_*(Q), we have

F ({X^{'}}_{1} \circ φ; \dots; {X^{'}}_{m} \circ φ; P) = F ({X^{'}}_{1} \circ φ; \dots; {X^{'}}_{m} \circ φ; Q)

. To establish that point, let us denote

X_{j} = {X^{'}}_{j} \circ φ; j = 1, \dots, m

, and

X^{'} = ({X^{'}}_{1}, \dots, {X^{'}}_{m})

, X= (X₁,…,X_m) the joint variables; the quantity

F ({X^{'}}_{1} \circ φ; \dots; {X^{'}}_{m} \circ φ; P)

depends only on X_*P, but this law can be rewritten

{X^{'}}_{*} P^{'}

, which is also equal to X_*Q. In particular, if F is local, then F′ = φ_* F is local.

As it is evident that the map F ↦ F′ commutes with the co-boundary operator, the proposition follows.

Remark that this form of functoriality uses the locality of co-cycles.

Corollary 1. In the case where

Q^{'} = φ_{*} (Q)

and

S = φ^{*} S^{'}

, the maps φ* and φ_* in information co-homology are inverse to each other.

This is our formulation of the invariance of the information co-homology for equivalent information structures.

When m = 0, co-chains are functions f of P_X in

Q_{X}

such that f(Y_* P_X) = f (P_X) for any Y multiple of X (i.e., coarser than X). As we assume 1 belongs to

S

, and the set Q₁ has only one element, f must be a constant. And every constant is a co-cycle, because

δ . f (X_{0}; P) = X_{0} . f (P) - f (P) = \sum_{j} P (X_{0} = x_{j}) f (P | X_{0} = x_{j}) - f (P) = f (1) (1 - 1) = 0 .

(28)

Consequently H⁰ is ℝ. This corresponds to the hypothesis

1 \in S

, meaning connectedness of the category. If the category has m connected components, we recover them in the same way and H⁰ is isomorphic to ℝ^m.

We now consider the case m = 1. From what precedes we know that there is no non-trivial co-boundary.

Non-homogeneous 1-cocycles of information are families of functions f_X (Y; P_X), measurable in the variable P in

Q

, labelled by elements

Y \in S_{X}

, which satisfy the locality condition, stating that each time we have Z → X → Y in

S

, we have

f_{X} (Y; X_{*} P_{Z}) = f_{Z} (Y; P_{Z})

(29)

and the co-cycle equation, stating that for two elements Y, Y′ of

S_{X}

, we have

f ((Y, Y^{'}); P) = f (Y; P) + Y . f (Y^{'}; P) .

(30)

Remark that locality implies that it is sufficient to know the f_Y(Y; Y_*P) to recover f_X(Y; P) for every partition X in

S

that divides Y.

It is in this sense that we frequently omit the index X in f_X.

Remark also that for any 1-cocycle f we have f (1; P) = 0.

In fact, the co-cycle equation tells that

f ((1, 1); P) = f (1; P) + 1 . f (1; P) .

(31)

but

1 . f (1; P) = f (1; P | 1 = 1) = f (1; P),

(32)

and (1, 1) = 1, thus f (1; P) = 0.

More generally, for any X, and any value x_i of X, we have

f (X; P | (X = x_{i})) = 0,

(33)

In fact a special case of Equation (30) is

f ((X, X); P) = f (X; P) + X . f (X; P) .

(34)

which implies X.f (X; P) = 0; however, by definition,

X . f (X; P) = \sum_{i} P (X = x_{i}) f (X; P | (X = x_{i})),

(35)

thus for every i we must have f(X; P|(X = x_i)) = 0, due to P ≥ 0. This generalizes f (1; P) = 0 for any P, because, for a probability conditioned by X = x_i, the partition X appears the same as 1, that is a certitude.

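Remark also that for each pair of variables (Y, Z), a 1-cocycle must satisfy the following symmetric relation, obtained by writing the co-cycle equation (30) for (Y, Z) and for (Z, Y) and using (Y, Z) = (Z, Y):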

f (Y; ℙ) - Z . f (Y; ℙ) = f (Z; ℙ) - Y . f (Z; ℙ) .

(36)

2.3. Entropy

Any multiple of the Shannon entropy is a non-homogeneous information co-cycle. Recall that entropy H is defined for one partition X by the formula

H (X; ℙ) = - \sum_{i} p_{i} \log p_{i},

(37)

where the p_i denote the values of ℙ on the elements of the partition X. In particular the function H depends only on X_*(ℙ), which is locality. The co-cycle equation expresses the fundamental property for an information quantity, written by Shannon:

H (X, Y) = H (X) + H_{X} (Y)

(38)

Thus every constant multiple f = λH of H defines a co-cycle. Remark that the corresponding “homogeneous 1-cocycle” is the entropy variation:

F (X; Y; ℙ) = H (X; ℙ) - H (Y; ℙ) .

(39)

This means that it satisfies the “invariance property”:

\begin{array}{l} F ((Z, X); (Z, Y)) = H (Z, X) - H (Z, Y) \\ = H (Z) + H_{Z} (X) - H (Z) - H_{Z} (Y) \\ = Z . F (X; Y), \end{array}

and the “simplicial equation”:

F (Y; Z) - F (X; Z) + F (X; Y) = 0

(40)
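Both the Shannon relation (38) and the invariance property of the entropy variation are easy to test numerically. The sketch below is an illustration under our own conventions (laws as lists of atom masses, partitions as tuples of labels, natural logarithm); the names `entropy` and `averaged` are hypothetical helpers, with `averaged(Z, f, P)` standing for (Z.f)(P).

```python
import math

def law(Y, P):
    masses = {}
    for label, p in zip(Y, P):
        masses[label] = masses.get(label, 0.0) + p
    return masses

def condition(P, Y, y):
    mass = sum(p for label, p in zip(Y, P) if label == y)
    return [p / mass if label == y else 0.0 for label, p in zip(Y, P)]

def entropy(Y, P):
    """H(Y; P) = -sum_i p_i log p_i over the image law Y_*P."""
    return -sum(p * math.log(p) for p in law(Y, P).values() if p > 0)

def averaged(Z, f, P):
    """(Z.f)(P): average of f over the conditional laws P|(Z = z)."""
    return sum(m * f(condition(P, Z, z)) for z, m in law(Z, P).items() if m > 0)

P = [0.05, 0.15, 0.10, 0.20, 0.25, 0.05, 0.12, 0.08]
Z = (0, 0, 0, 0, 1, 1, 1, 1)
X = (0, 0, 1, 1, 0, 0, 1, 1)
Y = (0, 1, 0, 1, 0, 1, 0, 1)
XY, ZX, ZY = tuple(zip(X, Y)), tuple(zip(Z, X)), tuple(zip(Z, Y))

# Co-cycle equation: H(X, Y) = H(X) + X.H(Y).
chain_gap = entropy(XY, P) - entropy(X, P) - averaged(X, lambda Q: entropy(Y, Q), P)

# Invariance: H(Z, X) - H(Z, Y) = Z.(H(X) - H(Y)).
invariance_gap = (entropy(ZX, P) - entropy(ZY, P)) - averaged(
    Z, lambda Q: entropy(X, Q) - entropy(Y, Q), P)

print(chain_gap, invariance_gap)
```

Both gaps vanish up to rounding; the simplicial equation (40) holds trivially for any difference of the form H(X) − H(Y).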

Note that the entropy variation H(X; P) − H(Y; P) exists in a wider range of conditions, i.e., when Ω is infinite, if the laws of X and Y are absolutely continuous with respect to the same probability law ℙ₀: we only have to replace the finite sum by the integral of the function −φ log φ where φ denotes the density with respect to ℙ₀. Changing the reference law ℙ₀ changes the quantities H(X) and H(Y) by the same constant, thus does not change the variation H(X; P) − H(Y; P).

We will prove now that, for many simplicial structures

S

, and sufficiently large adapted probability complexes

Q

, any information co-homology class of degree one is a multiple of the entropy class.

In particular this would be true for

S = W (Σ)

and

Q = Δ (Ω)

, when Σ has more than two elements and Ω more than four elements, but this is also true in more refined situations, as we will see.

We assume that the functor of probabilities

Q_{X}

contains all the laws on Ω/X, when X belongs to

S

. In such a case, by definition, we say that

Q

is complete with respect to

S

.

Let us consider a probability law P in

Q

and two partitions X, Y in the structure

S

, such that the joint XY belongs to

S

. We denote by Greek letters α, β, … the indices labelling the partition Y and by Latin letters k, l, … the indices of the partition X; the probability that X = ξ_k, Y = η_α is denoted p_{k,α}, so the probability of X = ξ_k is p_k = Σ_α p_{k,α} and the probability of Y = η_α is q_α = Σ_k p_{k,α}. To simplify the notations, let us write F = f(X; ℙ), G = f((Y, X); ℙ), H = f(Y; ℙ), F_α = f(X; ℙ|(Y = η_α)), H_k = f(Y; ℙ|(X = ξ_k)).

The Hochschild co-cycle equation gives

\sum_{α} q_{α} F_{α} (\frac{p_{k_{1}, α}}{q_{α}}, \dots, \frac{p_{k_{m}, α}}{q_{α}}) = G ((p_{k, α})) - H (q_{α_{1}}, \dots, q_{α_{n}})

(41)

But we also have the relation obtained by exchanging X and Y, which gives

\sum_{k} p_{k} H_{k} (\frac{p_{k, α_{1}}}{p_{k}}, \dots, \frac{p_{k, α_{n}}}{p_{k}}) = G ((p_{k, α})) - F (p_{k_{1}}, \dots, p_{k_{m}}) .

(42)

Suppose that p_k,α = 0 except when α = α₁ and k = k₂, k₃,…,k_m or α = α₂ and k = k₁; we put

p_{k_{i}, α_{1}} = x_{i}

; i = 2,…,m and

p_{k_{1}, α_{2}} = x_{1}

, which implies that we have x₁ + x₂ +… + x_m = 1. Then Equation (33) implies that each term H in Equation (42) is zero, because only one value of the image law is non-zero, thus we can replace the only term G by

F (p_{k_{1}}, \dots, p_{k_{m}})

, and we get from Equation (41):

H (1 - x_{1}, x_{1}, 0, \dots, 0) = F (x_{1}, x_{2}, \dots, x_{m}) - (1 - x_{1}) F_{α_{1}} (0, \frac{x_{2}}{1 - x_{1}}, \dots, \frac{x_{m}}{1 - x_{1}}) .

(43)

Only the term F for α₁ remains, because the only other possible term, for α₂, concerns a certainty.

Consequently, by imposing x₂ = 1 − x₁ = a, x₃ =… = x_m = 0, we deduce the identity H (a, 1 − a, 0,…, 0) = F(1 − a, a, 0,…, 0). This gives a recurrence equation to calculate F from the binomial case:

F (x_{1}, x_{2}, \dots, x_{m}) = F (x_{1}, 1 - x_{1}, 0, \dots, 0) + (1 - x_{1}) F (0, \frac{x_{2}}{1 - x_{1}}, \dots, \frac{x_{m}}{1 - x_{1}}) .

(44)

That is due to the fact that F_α₁ is a special case of F, thus independent from Y and α₁.

Then coming back to the co-cycle equation, we obtain in particular a functional equation for the binomial variables.

Lemma 2. With the notations of Example 1, Ω = {(00), (01), (10), (11)}, S₁ (resp. S₂) the projection pr₁ (resp. pr₂) on E₁ = E₂ = {0,1}, Σ = {S₁, S₂}; then the (measurable) information co-homology of degree one is generated by the entropy, i.e., there exists a constant C such that, for any X in

W (Σ), P \in P, f (X; P) = C H (X; P)

.

Proof. We consider a 1-cocycle f. We have f(1; P) = 0. Let us note f_i(P) = f(S_i; P), and f_ijk (u) the function f (S_i; P|(S_j = k)), the variable u representing the probability of the first point in the fiber S_j = k in the lexicographic order. For each 2 × 2 table P = (p₀₀, p₀₁, p₁₀, p₁₁), the symmetry formula (36) gives

\begin{array}{l} (p_{00} + p_{10}) f_{120} (\frac{p_{00}}{p_{00} + p_{10}}) + (p_{01} + p_{11}) f_{121} (\frac{p_{01}}{p_{01} + p_{11}}) - f_{1} (P) \\ = (p_{00} + p_{01}) f_{210} (\frac{p_{00}}{p_{00} + p_{01}}) + (p_{10} + p_{11}) f_{211} (\frac{p_{10}}{p_{10} + p_{11}}) - f_{2} (P) \end{array}

(45)

imposing p₁₀ = 0,p₀₀ = u,p₁₁ = v,p₀₁ = 1 − u − v in this relation, we obtain the equation:

\begin{array}{l} (1 - u) f_{1} (0, \frac{1 - u - v}{1 - u}, 0, \frac{v}{1 - u}) - f_{1} (u, 1 - u - v, 0, v) \\ = (1 - v) f_{2} (\frac{u}{1 - v}, \frac{1 - u - v}{1 - v}, 0, 0) - f_{2} (u, 1 - u - v, 0, v) . \end{array}

(46)

By hypothesis, f₁, f₂ depend only on the image law by S₁, S₂ respectively, thus, again by noting a binomial probability from the value of the first element in lexicographic order, we get

(1 - u) f_{1} (\frac{1 - u - v}{1 - u}) - f_{1} (1 - v) = (1 - v) f_{2} (\frac{u}{1 - v}) - f_{2} (u) .

(47)

By equating u to 1 − v, we find that f₁(u) = f₂(u); then we arrive to the following functional equation for h = f₁ = f₂:

h (u) - h (v) = (1 - v) h (\frac{u}{1 - v}) - (1 - u) h (\frac{v}{1 - u})

(48)

This is the functional equation which was considered by Tverberg in 1958 [35]. As a result of the works of Tverberg [35], Kendall [36] and Lee (1964, [37]), (see also Kontsevich, 1995 [38]), it is known that every measurable solution of this equation is a multiple of the entropy function:

h (x) = C (x \log (x) + (1 - x) \log (1 - x)) .

(49)
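One direction of this statement, namely that the function in (49) does satisfy the functional Equation (48), can be confirmed numerically. The Python sketch below evaluates the residual of (48) on a grid of pairs (u, v) with u + v < 1; the grid, the convention 0 log 0 = 0, and the quadratic comparison function are choices made for this illustration.

```python
import math

def xlogx(x):
    """x log x with the convention 0 log 0 = 0."""
    return x * math.log(x) if x > 0 else 0.0

def h(x):
    """The solution (49) with C = 1."""
    return xlogx(x) + xlogx(1.0 - x)

def residual(func, u, v):
    """Left-hand side minus right-hand side of Equation (48)."""
    lhs = func(u) - func(v)
    rhs = (1 - v) * func(u / (1 - v)) - (1 - u) * func(v / (1 - u))
    return lhs - rhs

grid = [(i / 20, j / 20) for i in range(1, 19) for j in range(1, 19) if i + j <= 19]

worst_entropy = max(abs(residual(h, u, v)) for u, v in grid)
worst_quadratic = max(abs(residual(lambda x: x * (1 - x), u, v)) for u, v in grid)
print(worst_entropy, worst_quadratic)
```

The residual is at rounding level for h, while a generic symmetric function such as x(1 − x) fails the equation; the converse (uniqueness among measurable solutions) is the content of the cited theorems and is not tested here.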

From here it follows that, for any m-tuple (x₁, …, x_m) of real numbers such that x₁ + … + x_m = 1,

F (x_{1}, x_{2}, \dots, x_{m}) = C \sum_{i} x_{i} \log (x_{i}) .

(50)

The same is true for H and G with the appropriate number of variables.
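The passage from the binomial solution (49) to the general formula (50) through the recurrence (44) can be illustrated as follows. This Python sketch makes one assumption that should be stated plainly: the term F(0, x₂/(1 − x₁), …) is evaluated as the same function applied to the remaining m − 1 coordinates, which is consistent with (50) since zero entries contribute nothing. The function names are hypothetical.

```python
import math

def xlogx(x):
    return x * math.log(x) if x > 0 else 0.0

def h(x, C=1.0):
    """Binomial case (49): F(x, 1 - x, 0, ..., 0)."""
    return C * (xlogx(x) + xlogx(1.0 - x))

def F_recursive(xs, C=1.0):
    """Unfold the recurrence (44) down to the binomial case."""
    if len(xs) <= 1:
        return 0.0                      # a single atom is a certainty
    head, rest = xs[0], xs[1:]
    tail_mass = 1.0 - head
    if tail_mass <= 0:
        return 0.0                      # all the mass sits on the first atom
    return h(head, C) + tail_mass * F_recursive([x / tail_mass for x in rest], C)

def F_closed(xs, C=1.0):
    """Closed form (50): C * sum_i x_i log x_i."""
    return C * sum(xlogx(x) for x in xs)

xs = [0.1, 0.2, 0.3, 0.4]
value_recursive = F_recursive(xs, C=2.5)
value_closed = F_closed(xs, C=2.5)
print(value_recursive, value_closed)
```

The agreement reflects the identity x₁ log x₁ + (1 − x₁) log(1 − x₁) + (1 − x₁) Σ_{i≥2} y_i log y_i = Σ_i x_i log x_i with y_i = x_i/(1 − x₁).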

A pair of variables X, Y, such that X, Y, (XY) belong to S, is called an edge of S; we say this edge is rich if X and Y contain at least two elements and (X, Y) at least four elements which cross the elements of X and Y, in such a manner that Lemma 2 applies if

Q

is complete. We say that

S

is connected, if every pair of elements X, X′ in

S

can be joined by a sequence of edges. We say that

S

is sufficiently rich if each vertex belongs to at least one rich edge. By the recurrence Equation (44), these two conditions guarantee that the constant C which appears in Lemma 2 is the same for all rich edges. Then the same recurrence Equation (44) implies that the whole co-cycle is equal to CH. If

S

has m connected components, we get necessarily m independent constants.

Thus we have established the following result:

Theorem 1. For every connected structure of information

S

, which is sufficiently rich, and every set of probabilities

Q

, which is complete with respect to

S

, the information co-homology group of degree one is one-dimensional and generated by the classical entropy.

The theorem applies to rich simplicial complexes, in particular to the full simplex

S = W (Σ)

, which is generated by a family Σ of partitions S₁, …, S_n, when n ≥ 2, such that, for every i, at least one of the pairs (S_i, S_j) is rich.

Note that most of the axiomatic characterizations of entropy have used convexity, and recurrence over the dimension, see Khintchin [39], Baez et al. [20].

In our characterization, we assumed no symmetry hypothesis, this was a consequence of co-homology. Moreover, we do not assume any stability property relating to a higher dimensional simplex, this was also a consequence of the homological definition.

There exists a notion of symmetric information co-homology:

The group of permutations

S (Ω, ℬ)

, made by the permutations of Ω that respect the algebra

ℬ

, acts naturally on the set of partitions Π(Ω); in fact, if X ∈ Π(Ω) is made by the subsets Ω₁, …, Ω_k, the partition σ^∗X is made by the subsets σ⁻¹(Ω₁), …, σ⁻¹(Ω_k), in such a manner that, if σ, τ are two permutations of Ω, we have τ^∗(σ^∗X) = (σ ○ τ)^∗X.

We say that a classical information structure

S

on

(Ω, ℬ)

is symmetric if it is closed by the action of the group of permutations

S (Ω, ℬ)

, i.e., if X ∈ S, and σ ∈ S(Ω), the partition σ^∗X also belongs to

S

.

In the same way, we say that a probability functor

Q

is symmetric, if it is stable under local permutations, i.e., if

X \in S

and

P \in Q_{X}

, and if

σ \in S (Ω / X)

, then the probability law σ^∗P = P ○ σ on Ω/X also belongs to

Q_{X}

.

Remark that we also have τ^∗σ^∗P = (σ ○ τ)^∗P. Thus the actions of symmetric groups are defined here on the right. However, we have actions to the left by taking σ_∗ = (σ⁻¹)^∗. For the essential role of symmetries in information theory, see the article of Gromov in this volume.

An m-cochain

F_{X} : S^{m} \times Q_{X} \to ℝ

is said to be symmetric when, for every

X \in S

, every probability

P \in Q_{X}

, every collection of partitions Y₁, …, Y_m in

S_{X}

, we have

F_{σ_{*} X} (σ_{*} Y_{1}; \dots; σ_{*} Y_{m}; σ_{*} P) = F_{X} (Y_{1}; \dots; Y_{m}; P) .

(51)

It is evident that symmetric cochains form a subcomplex of the information cochain complex, i.e., the coboundary of a symmetric cochain is a symmetric cochain. Consequently we get a symmetric information co-homology, that we name

H_{S}^{*} (S; Q)

.

In particular the entropy is a symmetric 1-cocycle.
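The symmetry condition (51) for the entropy can be illustrated on a small example. In the Python sketch below, a permutation σ of Ω is encoded as a list with σ(w) = sigma[w], and `push` is a hypothetical helper implementing σ_∗ on functions of Ω, so that σ_∗P = P ○ σ⁻¹ and the blocks of σ_∗Y are the images under σ of the blocks of Y.

```python
import math

def law(Y, P):
    masses = {}
    for label, p in zip(Y, P):
        masses[label] = masses.get(label, 0.0) + p
    return masses

def entropy(Y, P):
    return -sum(p * math.log(p) for p in law(Y, P).values() if p > 0)

def push(values, sigma):
    """sigma_* applied to a function on Omega: (sigma_* v)(sigma(w)) = v(w)."""
    out = [None] * len(values)
    for w, v in enumerate(values):
        out[sigma[w]] = v
    return out

P = [0.1, 0.2, 0.3, 0.4]
Y = [0, 0, 1, 1]
sigma = [2, 0, 3, 1]          # a permutation of the four points of Omega

before = entropy(Y, P)
after = entropy(push(Y, sigma), push(P, sigma))
print(before, after)
```

The two values coincide because transporting both the partition and the law by σ leaves the image law, and hence its entropy, unchanged.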

The above proof of Theorem 1 applies to symmetric cocycles as well; thus, under the appropriate hypotheses of connectedness, richness, and completeness for

S

and

Q

we have

H_{S}^{1} (S; Q) = ℝ H

.

Remark that an equivalent way to look at symmetric information cochains, consists in enlarging the category

S

in a “symmetric category”

S^{S}

, by putting an arrow associated to each element

σ_{X} \in S (Ω / X)

from X to σ_∗X, and completing the category by composing the two kinds of arrows, division and permutation. In this case, the probability functor

Q

must behave naturally with respect to permutation, which implies it is symmetric. Moreover, the natural notion of functional sheaf and local cochains are a symmetric sheaf and symmetric cochains.

2.4. Appendix. Complex of Possible Events

In each concrete situation, physical constraints produce exclusion rules between possible events, which select a sub-complex

Q

in the full probability simplex

P = Δ_{N}

on Ω. The aim of this appendix is to make this remark more precise.

Let A⁰, A¹, A², A³, … be the N + 1 vertices of the large simplex Δ_N. A point of Δ_N is interpreted as a probability ℙ on the set of these vertices; each vertex can be seen as an elementary event, and we will say that a general event A is possible for ℙ when ℙ(A) is different from zero. An event A is said to be impossible for ℙ in the other case, that is when ℙ(A) = 0.

The star S(A) of a vertex A of Δ_N is the complement of the face opposite to A, i.e., it is the set of probabilities ℙ in Δ_N such that A is possible, i.e., has non-zero probability. The relative star S(A|K) of A in a subcomplex K is the intersection of the star of A with K.

We denote by F = (A, B, C, D, …) the face of Δ_N whose vertices are A, B, C, D, …. We denote by L(F) the set of points p in Δ_N such that at least one of the points A, B, C, D, … is impossible for p. This is also the union of the faces which are opposite to the vertices A, B, C, D, …. Then L(F) is a simplicial complex. The complement in F of the interior of F, i.e., the boundary of F, is the union of the intersections of F with all the faces opposite to A, B, C, D, …; it is also the set of probabilities p in F such that at least one of the points A, B, C, D, … is impossible for p, thus it is equal to L(F) ∩ F. If G is a face containing F, the complex L(G) contains the complex L(F).

Let K be a simplicial complex contained in an N-simplex; then K is obtained by deleting from Δ_N a set E = E_K of open faces. Let

\dot{F} = F \setminus \partial F

be an element of E; then each face G of Δ_N containing F belongs to E, because K is a complex.

In this case K is contained in L(F). In fact L(F) is the smallest sub-complex of Δ_N which does not contain

\dot{F}

. This can be proved as follows: if a point p in K is such that every vertex of F is possible for p, then p belongs to a face G such that every vertex of F is a vertex of G; thus K contains G, which contains F. So, if K does not contain

\dot{F}

, K is contained in L(F).

Let L = L_K be the intersection of the L(F), where F ranges over the faces in E_K. From what precedes we know that K is contained in L. However, every

\dot{F}

in E is included in the complementary set of L(F), thus it is included in the complementary set of L, which is the union of the complementary sets of the L(F). Consequently the complementary set of K is included in the complementary set of L. Then K = L.

This discussion establishes the following result:

Theorem 2. A subset K of the simplex Δ_N is a simplicial sub-complex if and only if it is defined by a finite number of constraints of the type: “for any p in K, the fact that A, B, C, … are possible for p implies that D is impossible for p”.

In other terms, more vivid but also more ambiguous, every sub-complex K is defined by constraints of the type: “if A, B, C, … are simultaneously allowed it is excluded that D can happen”.

The statement of the theorem is just a rewriting of the discussion, using elementary propositional calculus: let K be a sub-complex of Δ_N, we have shown that K is the intersection of the L(F) where the open face

\dot{F}

is not in K, but if A, B, C, D, … denote the vertices of the face F, a point p belongs to L(F) if and only if “(A is impossible for p) or (B is impossible for p) or …”, and this sentence is equivalent to “if (A is possible for p) and (B is possible for p) and …, then (D is impossible for p)”. This results from the equivalence between “(P implies Q) is true” and “((not P) or Q) is true”. Conversely, any L(F) is a simplicial complex, hence every intersection of sets of the form L(F) is a simplicial complex too.
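The characterization of a sub-complex K as an intersection of sets L(F) can be sketched in a few lines of Python. In this hedged illustration, vertices are labeled by the integers 0, …, N, a point p is a tuple of probabilities, and K is described by a hypothetical list `forbidden_faces` of faces whose interior is excluded; the function names are assumptions of this sketch, not notation of the text.

```python
def support(p):
    """Set of vertices that are possible for p (non-zero probability)."""
    return frozenset(i for i, x in enumerate(p) if x > 0)

def in_L(p, face):
    """p belongs to L(F): at least one vertex of the face F is impossible for p."""
    return not face <= support(p)

def in_complex(p, forbidden_faces):
    """p belongs to K, seen as the intersection of the L(F) for the excluded faces F."""
    return all(in_L(p, face) for face in forbidden_faces)

# Assumed toy constraint on the triangle (N = 2):
# "if vertex 0 is possible, then vertex 1 is impossible".
forbidden_faces = [frozenset({0, 1})]

on_edge_01 = in_complex((0.5, 0.5, 0.0), forbidden_faces)   # interior of the excluded edge
on_edge_02 = in_complex((0.5, 0.0, 0.5), forbidden_faces)   # allowed edge
interior = in_complex((0.2, 0.3, 0.5), forbidden_faces)     # face containing the excluded edge
```

Excluding the open edge (0, 1) automatically excludes the interior of the triangle, in agreement with the fact that every face containing an excluded face is itself excluded.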

3. Higher Mutual Informations. A Sketch

The topological co-boundary operator on C^∗, denoted by δ_t, is defined by the same formula as δ, except that the first term Y₁.F (Y₂; …; Y_n; ℙ) is replaced by the term F(Y₂; …; Y_n; ℙ) without Y₁:

\begin{array}{l} δ_{t}^{m} F (Y_{1}; \dots; Y_{m + 1}; P_{X}) \\ = F (Y_{2}; \dots; Y_{m + 1}; P_{X}) + \sum_{i = 1}^{m} {(- 1)}^{i} F (\dots; (Y_{i}, Y_{i + 1}); \dots; Y_{m + 1}; P_{X}) + {(- 1)}^{m + 1} F (Y_{1}; \dots; Y_{m}; P_{X}) . \end{array}

(52)

It is the coboundary of the bar complex for the trivial module

ℱ_{t}

, which is the same as

ℱ

except no conditioning appears, i.e., Y.F = F . Hence it is the ordinary simplicial co-homology of the complex S with local coefficients in

ℱ

.

Remark that this operator also preserves locality, because all the functions of ℙ which appear in the expansion depend only on (Y₂, …, Y_n)_∗ℙ, (Y₁, …, Y_n)_∗ℙ and (Y₁, …, Y_{n−1})_∗ℙ.

By definition a topological cocycle of information is a cochain F that satisfies δ_tF = 0, and a topological co-boundary is an element in the image of δ_t.

It is easy to show that δ_t ○ δ_t = 0, which allows us to define a co-homology theory that we will name topological co-homology.

Now assume that the information structure

S

is a set W (Σ) = Δ(n) generated by a family Σ of partitions S₁, …, S_n, where n ≥ 2.

Higher mutual information quantities were defined by Hu Kuo Ting [6] (see also Yeung [40]), generalizing the Shannon mutual information.

I_{N} (S_{1}; \dots; S_{N}; ℙ) = \sum_{k = 1}^{k = N} {(- 1)}^{k - 1} H_{k} (S_{1}; \dots; S_{N}; ℙ),

(53)

where

H_{k} (S_{1}; \dots; S_{N}; ℙ) = \sum_{I \subset [N]; \mathrm{card} (I) = k} H (S_{I}; ℙ),

(54)

S_I denoting the joint partition of the S_i such that i ∈ I. We also define I₁ = H.

The definition of I_N makes it evident that it is a symmetric function, invariant under all permutations of the partitions S₁, …, S_N.

For instance I₂(S; T) = H(S) + H(T) − H(S, T) is the usual mutual information.
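The alternating sum of Equations (53) and (54) can be evaluated directly on a finite joint law. The following Python sketch is only an illustration under stated assumptions: the partitions S₁, …, S_N are represented by the coordinates of a joint distribution stored in a dictionary, entropies are in nats, and the function names are hypothetical.

```python
import math
from itertools import combinations

def marginal_entropy(joint, idx):
    """Entropy (nats) of the joint variable (S_i) for i in idx."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in marg.values() if p > 0)

def mutual_information(joint, n):
    """I_N as the alternating sum (53) of the quantities H_k of (54)."""
    total = 0.0
    for k in range(1, n + 1):
        h_k = sum(marginal_entropy(joint, idx)
                  for idx in combinations(range(n), k))
        total += (-1) ** (k - 1) * h_k
    return total

# Assumed toy law: S1 and S2 are independent fair bits, and S3 = S1 XOR S2.
xor_law = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
i3_xor = mutual_information(xor_law, 3)

# Assumed toy law: two independent fair bits.
independent_law = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
i2_independent = mutual_information(independent_law, 2)
```

For the XOR law the value of I₃ is negative (equal to −log 2), which illustrates that higher mutual informations, unlike I₂, need not be positive; for independent variables I₂ vanishes.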

It is easily seen that I₂ = δ_tH. The following formula generalizes this remark to higher mutual informations of even orders:

I_{2 m} = δ_{t} δ δ_{t} \dots δ δ_{t} H,

(55)

where the right-hand side contains 2m − 1 coboundary operators.

And for odd mutual information we have

I_{2 m + 1} = - δ δ_{t} δ δ_{t} \dots δ δ_{t} H,

(56)

where the right-hand side contains 2m coboundary operators.

We deduce from this that higher mutual informations are co-boundaries for δ or δ_t according to whether their order is odd or even, respectively.

The result which proves the two above formulas is the following:

Lemma 3. For N even or odd, we have

I_{N} ((S_{0}, S_{1}); S_{2}; \dots; S_{N}; ℙ) = I_{N} (S_{0}; S_{2}; \dots; S_{N}; ℙ) + S_{0} . I_{N} (S_{1}; S_{2}; \dots; S_{N}; ℙ)

(57)

This lemma can be proved by comparing the completely developed forms of the quantities. It seems to signify that, with respect to one variable, I_N satisfies the equation of an information 1-cocycle, thus I_N seems to be a kind of “partial 1-cocycle”; however this is misleading, because the locality condition is not satisfied. In fact I_N is an N-cocycle, either for δ or for δ_t, depending on the parity of N.
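Lemma 3 can be checked numerically in the case N = 2, where it reduces to the chain rule for the usual mutual information. The sketch below is a hedged illustration: the three variables S₀, S₁, S₂ are the coordinates of an arbitrarily chosen strictly positive law on {0, 1}³, the action S₀.F is implemented as the average of F over the conditional laws given S₀, and all names are hypothetical.

```python
import math

def H(joint, idx):
    """Entropy (nats) of the joint variable with coordinates idx."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in marg.values() if p > 0)

def I2(joint, a, b):
    """Mutual information between the joint variables indexed by a and by b."""
    return H(joint, a) + H(joint, b) - H(joint, a + b)

def mean_conditioning(joint, i, functional):
    """(S_i . F)(P): average of F over the conditional laws given S_i."""
    total = 0.0
    for value in {o[i] for o in joint}:
        mass = sum(p for o, p in joint.items() if o[i] == value)
        law = {o: p / mass for o, p in joint.items() if o[i] == value}
        total += mass * functional(law)
    return total

# Assumed toy law on {0,1}^3 with unequal, strictly positive weights.
outcomes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
joint = {o: w / 36 for o, w in zip(outcomes, [1, 2, 3, 4, 5, 6, 7, 8])}

# Left and right members of Equation (57) for N = 2.
lhs = I2(joint, (0, 1), (2,))
rhs = I2(joint, (0,), (2,)) + mean_conditioning(
    joint, 0, lambda law: I2(law, (1,), (2,)))
```

The values `lhs` and `rhs` coincide up to rounding for this law; this is of course only a numerical check on one example, not a proof.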

For any N-cochain F we have

(δ - δ_{t}) F (S_{0}; S_{1}; \dots; S_{N}; ℙ) = ((S_{0} - 1) . F) (S_{1}; \dots; S_{N}; ℙ),

(58)

where S₀ − 1 denotes the sum of the two operators of mean conditioning and minus identity.

That implies:

(δ δ_{t} - δ_{t} δ) F (S_{0}; S_{1}; S_{2}; \dots; S_{N}; ℙ) = ((1 + S_{0} + S_{1} - S_{0} S_{1}) . F) (S_{2}; \dots; S_{N}; ℙ),

(59)

Remark 3. Conversely, the functions I_N decompose the entropy of the finest joint partition:

H (S_{1}, S_{2}, \dots, S_{N}; ℙ) = \sum_{k = 1}^{k = N} {(- 1)}^{k - 1} \sum_{I \subset [N]; \mathrm{card} (I) = k} I_{k} (S_{i_{1}}; S_{i_{2}}; \dots; S_{i_{k}}; ℙ)

(60)

For example, we have H(S, T) = I₁(S) + I₁(T) − I₂(S; T), and

H (S, T, U) = H (S) + H (T) + H (U) - I_{2} (S; T) - I_{2} (T; U) - I_{2} (S; U) + I_{3} (S; T; U) .

(61)
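The inversion formula (60), and its special case (61), can be verified numerically on a small example. In the following hedged Python sketch, the partitions are again the coordinates of an assumed joint law on {0, 1}³, `information` computes I_k on a subfamily by the alternating sum (53), and the names and weights are hypothetical.

```python
import math
from itertools import combinations

def H(joint, idx):
    """Entropy (nats) of the joint variable with coordinates idx."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in marg.values() if p > 0)

def information(joint, idx):
    """I_k of the subfamily (S_i), i in idx, by the alternating sum (53)."""
    return sum((-1) ** (k - 1) * H(joint, sub)
               for k in range(1, len(idx) + 1)
               for sub in combinations(idx, k))

def entropy_from_informations(joint, n):
    """Right member of Equation (60)."""
    return sum((-1) ** (k - 1) * information(joint, sub)
               for k in range(1, n + 1)
               for sub in combinations(range(n), k))

# Assumed toy law on {0,1}^3 with unequal, strictly positive weights.
outcomes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
joint = {o: w / 36 for o, w in zip(outcomes, [1, 2, 3, 4, 5, 6, 7, 8])}

joint_entropy = H(joint, (0, 1, 2))
reconstructed = entropy_from_informations(joint, 3)
```

Here `reconstructed` expands exactly into the seven terms of Equation (61), and it agrees with `joint_entropy` up to rounding.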

Let us also note the recurrence formula whose proof is left to the reader (cf. Cover and Thomas [41]):

I_{N + 1} (S_{0}; S_{1}; \dots; S_{N}) = I_{N} (S_{1}; \dots; S_{N}) - S_{0} . I_{N} (S_{1}; \dots; S_{N}) .

(62)

4. Quantum Information and Projective Geometry

4.1. Quantum Measure, Geometry of Abelian Conditioning

In finite dimensional quantum mechanics the role of the finite set Ω of atomic events is played by a complex vector space E of finite dimension.

In fact, to each set Ω, of cardinality N, is naturally associated a vector space of dimension N over ℂ, which is the space freely generated over ℂ by the elements of Ω. Then we can identify E with ℂ^N, the canonical basis being the points x of Ω. In this case the canonical positive hermitian metric on E corresponds to the quadratic mean: if f and g are elements of E, we have

h_{0} (f, g) = {〈 f | g 〉}_{0} = \int \bar{f} (ω) g (ω) d ω = \frac{1}{N} \sum_{j} \bar{f_{j}} g_{j}

(63)

Remark that, in the infinite dimensional situation, the space which would play the role of E is the space of L² functions for a fixed probability P₀.

Probability laws ℙ, which are elements of the big simplex Δ(N), give other hermitian structures, the ones which are expressed by diagonal matrices, with positive coefficients, and trace equal to 1.

In the general quantum case, described by E, a quantum probability law is any positive non-zero hermitian product h. If a basis is chosen, h is described by an N × N-matrix ρ. In the physical literature, every such ρ is called a density of states, and it is considered as a full description of the physical states of the finite quantum system. Usually ρ is normalized by Tr(ρ) = 1.

Note that this condition on the trace has no meaning for a positive hermitian form h if no additional structure is given, for instance a non-degenerate form h₀ of reference. Why is it so? Because a priori a hermitian form h on E is a map from E to

{\bar{E}}^{*}

, where ∗ denotes duality and bar denotes conjugation, the conjugate space

\bar{E}

being the same set E, with the same structure of vector space over the real numbers as E, but with structure of vector space over the complex numbers changed by changing the sign of the action of the imaginary unit i. The complexification of the real vector space H of hermitian forms is

\mathrm{Hom}_{ℂ} (E, {\bar{E}}^{*}) ≅ E^{*} \otimes {\bar{E}}^{*}

. The space H is the set of fixed points of the ℂ-anti-linear map u ↦ ^t ū. A trace is defined for an endomorphism of the space E, as a linear invariant quantity on E^* ⊗ E. Here we could take the trace over ℝ, because E and

\bar{E}

are the same over ℝ, but the duality would be an obstacle, because even over the field ℝ, the spaces E and E^* cannot be identified, and there exists no linear invariant in E^* ⊗ E^*, even over ℝ. In fact, a non-degenerate positive h₀ is one of the ways to identify E and

{\bar{E}}^{*}

. A basis is another way, also defining canonically a form h₀. More precisely, when h₀ is given, every hermitian form h diagonalizes in an orthonormal basis for h₀, thus the whole spectrum of h makes sense, not only the trace.

This h₀ is tacitly assumed in most presentations. However it is better to understand the consequences of this choice. In non-relativistic quantum mechanics, the choice is not too serious; in relativistic quantum mechanics, however, it is: for instance, considering the system of two states as a spinor on the Lorentz space of dimension 4, the choice of h₀ is equivalent to the choice of a coordinate of time. See Penrose and Rindler [42].

A much less violent way to proceed is to consider hermitian structures h up to multiplication by a strictly positive number. This would have the same effect as fixing the trace equal to one, without introducing any choice. In quantum mechanics only non-zero positive h are considered, not necessarily positive definite, but non-zero. This indicates that a good space of states is not the set H₊ of all positive non-zero hermitian products but a convex part PH₊ of the real projective space of real lines in the vector space H of hermitian forms. In this space, the complex projective space ℙ(E) of dimension N − 1 over ℂ is naturally embedded; its image consists of the rank one positive hermitian matrices of trace 1; these matrices correspond to the orthogonal projectors on one dimensional directions in E.

When a basis of E is chosen, particular elements of ℙ(E) are given by the generators of ℂ^N; they correspond to the Dirac distributions on classical states. We see here a point defended in particular by Von Neumann, that quantum states are projective objects not linear objects.

The classical random variables, i.e., the measurable functions on Ω with values in ℂ, are generalized in Quantum Mechanics by the operators in E; they are all the endomorphisms, i.e., any N × N-matrix, and they are named observables. Classical observables are recovered by diagonal matrices, their action on E corresponding to the multiplication of functions. Real valued variables are generalized by hermitian operators. Again this supposes that a special probability law h₀ is given. If not, “to be hermitian” has no meaning for an operator. (What could have a meaning for an operator is to be diagonalizable over ℝ, which is something else.)

Then if h₀ is chosen, the only difference between a real observable and a density of states is the absence of the positivity constraint.

By definition, the amplitude, or expectation, of the observable Z in the state ρ is the number given by the formula

E_{ρ} (Z) = \mathrm{Tr} (Z ρ) .

(64)
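As a small illustration of Equation (64), the following Python sketch evaluates Tr(Zρ) for 2 × 2 matrices written as nested lists. It assumes that the canonical basis is orthonormal for h₀, so that densities and observables are ordinary hermitian matrices; the helper names and the numerical values are hypothetical.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(a):
    """Sum of the diagonal entries."""
    return sum(a[i][i] for i in range(len(a)))

def expectation(z, rho):
    """E_rho(Z) = Tr(Z rho), as in Equation (64)."""
    return trace(matmul(z, rho))

# Classical case (assumed values): diagonal density and diagonal observable.
rho_diag = [[0.25, 0.0], [0.0, 0.75]]
z_diag = [[2.0, 0.0], [0.0, -1.0]]
classical_mean = expectation(z_diag, rho_diag)   # 0.25 * 2 + 0.75 * (-1)

# Non-diagonal case (assumed values): a rank one density and a
# non-diagonal hermitian observable.
rho_plus = [[0.5, 0.5], [0.5, 0.5]]
x_obs = [[0.0, 1.0], [1.0, 0.0]]
quantum_mean = expectation(x_obs, rho_plus)
```

In the diagonal case the formula reduces to the ordinary mean of a random variable for a probability law on two points.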

It is important to note that h₀ plays a role in this formula. Consequently the definition of expectation requires fixing an h₀, not only a ρ. This imposes a departure from the relativistic case, which should not be surprising, since considerations in relativistic statistical physics show that the entropy, for instance, depends on the choice of a coordinate for time. Cf. Landau-Lifschitz, Fluid Mechanics, second edition [43].

The partitions of Ω associated to random variables are replaced in the quantum context by the spectral decompositions of the hermitian operators X. As h₀ is given, this decomposition is given by a set of positive hermitian commuting projectors of sum equal to the identity. The additional data for recovering the operator X is one real eigenvalue for each projector. The underlying fact from linear algebra is that every hermitian matrix is diagonalizable in a unitary basis, which means that

Z = \sum_{j} z_{j} E_{j},

(65)

where the numbers z_j are real and pairwise distinct, and where the matrices E_j are hermitian projectors, which satisfy, for any j and k ≠ j,

E_{j}^{2} = E_{j}; E_{j}^{*} = E_{j}; E_{j} E_{k} = E_{k} E_{j} = 0;

(66)

and

\sum_{j} E_{j} = I d_{N}

(67)
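Relations (65) to (67) can be checked by hand on a 2 × 2 example. The sketch below, in plain Python with complex numbers, uses an assumed hermitian matrix with eigenvalues +1 and −1, for which the spectral projectors are (Id ± Z)/2; the helper names are hypothetical and the canonical basis is assumed orthonormal for h₀.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(a):
    """Conjugate transpose."""
    n = len(a)
    return [[complex(a[j][i]).conjugate() for j in range(n)] for i in range(n)]

def combine(c1, a, c2, b):
    """Entrywise linear combination c1 * a + c2 * b."""
    n = len(a)
    return [[c1 * a[i][j] + c2 * b[i][j] for j in range(n)] for i in range(n)]

def close(a, b, tol=1e-12):
    """Entrywise comparison up to a tolerance."""
    n = len(a)
    return all(abs(a[i][j] - b[i][j]) <= tol for i in range(n) for j in range(n))

identity = [[1, 0], [0, 1]]
zero = [[0, 0], [0, 0]]
z_op = [[0, -1j], [1j, 0]]      # assumed example: hermitian, eigenvalues +1 and -1

# Spectral projectors for an operator with Z^2 = Id: E_pm = (Id +/- Z) / 2.
e_plus = combine(0.5, identity, 0.5, z_op)
e_minus = combine(0.5, identity, -0.5, z_op)

is_idempotent = close(matmul(e_plus, e_plus), e_plus)
is_hermitian = close(dagger(e_plus), e_plus)
is_orthogonal = close(matmul(e_plus, e_minus), zero)
sums_to_identity = close(combine(1, e_plus, 1, e_minus), identity)
recovers_z = close(combine(1, e_plus, -1, e_minus), z_op)
```

The last check is Equation (65) with z₁ = 1 and z₂ = −1; the projectors here are not diagonal, so this decomposition is not compatible with the canonical basis.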

When the hermitian operator Z commutes with the canonical projectors on the axes of ℂ^N, its spectral measure gives an ordinary partition of the canonical basis, and we recover the classical situation.

Note that the extension of the notion of partition is given by any decomposition of the vector space E in orthogonal sum, not necessarily compatible with a chosen basis. Again this assumes a given positive definite h₀.

To generalize what we presented in the classical setting, quantum information theory must use only the spectral support of the decomposition, not the eigenvalues.

It would have been tempting to consider any decomposition of E in direct sum as a possible observable, however not every linear operator, or projective transformation, corresponds to such a decomposition, due to the existence of non-trivial nilpotent operators. What could be their role in quantum information? Moreover, the presence of h₀ fully justifies the limitation to orthogonal decompositions.

In the general case, hermitian but not necessarily diagonal, we define the probability of the elementary events Z = z_j by the following formula

ℙ_{ρ} (Z = z_{j}) = \mathrm{Tr} (E_{j}^{*} ρ E_{j})

(68)

And we define the conditional probability ρ|(Z = z_j) by the formula

ρ | (Z = z_{j}) = E_{j}^{*} ρ E_{j} / \mathrm{Tr} (E_{j}^{*} ρ E_{j}) .

(69)
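Formulas (68) and (69) can be illustrated on a two-state system. The following plain-Python sketch uses an assumed diagonal density ρ and the two spectral projectors of a non-diagonal hermitian observable; the names `probability` and `condition` are hypothetical, and the canonical basis is assumed orthonormal for h₀.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(a):
    """Conjugate transpose."""
    n = len(a)
    return [[complex(a[j][i]).conjugate() for j in range(n)] for i in range(n)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def sandwich(e, rho):
    """The matrix E* rho E."""
    return matmul(matmul(dagger(e), rho), e)

def probability(e, rho):
    """P_rho(Z = z_j) = Tr(E_j* rho E_j), Equation (68)."""
    return trace(sandwich(e, rho)).real

def condition(e, rho):
    """rho | (Z = z_j), Equation (69); requires a non-zero probability."""
    p = probability(e, rho)
    return [[x / p for x in row] for row in sandwich(e, rho)]

# Assumed example: diagonal density, projectors on the lines spanned by
# (1, 1) and (1, -1).
rho = [[0.75, 0.0], [0.0, 0.25]]
e_plus = [[0.5, 0.5], [0.5, 0.5]]
e_minus = [[0.5, -0.5], [-0.5, 0.5]]

p_plus = probability(e_plus, rho)
p_minus = probability(e_minus, rho)
rho_given_plus = condition(e_plus, rho)
```

In this example both outcomes have probability 1/2, and the conditioned density `rho_given_plus` is the rank one projector `e_plus` itself, of trace 1.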

One can notice that this definition can be extended to any projector, not necessarily hermitian. By definition, the conditioning of ρ by a projector Y is the matrix Y^*ρY, normalized to be of trace 1. However, here, as it is done in most of the texts on Quantum Mechanics, we will mostly restrict ourselves to the case of hermitian projectors, i.e., Y^* = Y.

Remark 4. What justifies these definitions of probability and conditioning? First, they allow us to recover the classical notions when we restrict to diagonal densities and diagonal observables, i.e., when ρ is diagonal, real, positive, of trace 1, Z is diagonal, and the E_j are diagonal, in which case they give a partition of Ω. The mean of Z is its amplitude. The probability of the event Z = z_j is the sum of the probabilities p(ω) = ρ_ωω for ω in the image of E_j; this is the trace of ρE_j. Moreover, the conditioning by this event is the probability obtained by projection on this image, as prescribed by the above formula.

Second, pure states are defined as rank one hermitian matrices. In this case ρ is the orthogonal projection on a vector ψ of norm equal to 1 (the finite dimensional version of the Schrödinger wave vector); the exact relation is

ρ = | ψ 〉 〈 ψ |

(70)

or, in coordinates, if ψ has for coordinates the complex numbers ψ(ω), we have

ρ_{ω ω^{'}} = \bar{ψ (ω)} ψ (ω^{'}) .

(71)

Let Z be any hermitian operator; the results of quantum experiments indicate that the probability of the event Z = z_j, for the state ψ, is equal to

P_{j} = 〈 ψ | E_{j} ψ 〉 .

(72)

But this quantity can also be written

P_{j} = \mathrm{Tr}_{ℂ} (〈 ψ | E_{j} ψ 〉) = \mathrm{Tr}_{E} (| ψ 〉 〈 ψ | E_{j}) = \mathrm{Tr} (ρ E_{j}) .

(73)

Starting from this formula and the fact that any ρ can be written as a classical mixture of commuting quantum pure states,

ρ = \sum_{a} p_{a} | ψ_{a} 〉 〈 ψ_{a} |,

(74)

we get the general formula of a quantum probability that we recalled.

Moreover, physical experiments indicate that after the measurement of an observable Z, giving the quantity z_j, the system is reduced to the space E_j, and every pure state ψ is reduced to its projection E_jψ, which is compatible with the above definition of conditioning for pure states. Here again, the general formula can be deduced from Equation (74). The division by the probability is performed to normalize the trace to 1. Thus conditioning in general is given by orthogonal projection in E, and it corresponds to the operation of measurement.

However, as claimed in particular by Roger Balian [44], the fact that the decomposition in pure states is non-unique implies that pure states cannot be so pertinent for understanding quantum information.

Definition 3. The density of states associated to a given variable Z and a given density ρ is given by the sum:

ρ_{Z} = \sum_{j} ℙ_{ρ} (Z = z_{j}) ρ | (Z = z_{j}) = \sum_{j} E_{j}^{*} ρ E_{j},

(75)

where (E_j)_{j∈J} designates the spectral decomposition of Z, also named spectral measure of Z. Thus ρ_Z is usually seen as representing the density of states after the measurement of the variable Z. This formula is usually interpreted by saying that the statistical analysis of the repeated measurements of the observable Z transforms the density ρ into the density ρ_Z.

Remark that ρ_Z is better understood as being a collection of conditional probabilities ρ|(Z = z_j), indexed by j.
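The passage from ρ to ρ_Z in Definition 3 can be made concrete on a two-state example. In the hedged Python sketch below, Z is assumed diagonal in the canonical basis (so its spectral projectors are the coordinate projectors, which are hermitian), ρ is an assumed rank one density with non-zero off-diagonal terms, and the function name `measured_density` is hypothetical.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def measured_density(projectors, rho):
    """rho_Z = sum_j E_j rho E_j, for hermitian projectors E_j (E_j* = E_j)."""
    n = len(rho)
    total = [[0.0] * n for _ in range(n)]
    for e in projectors:
        term = matmul(matmul(e, rho), e)
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total

# Assumed example: Z = diag(1, -1), whose spectral measure consists of the
# two coordinate projectors, and a pure state with off-diagonal coherences.
projectors = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
rho = [[0.5, 0.5], [0.5, 0.5]]

rho_z = measured_density(projectors, rho)
```

The result is the diagonal matrix with entries 1/2 and 1/2: the trace is preserved, while the terms of ρ that mix different eigenspaces of Z are removed.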

In quantum physics as in classical physics the symmetries, discrete and continuous, have always played a fundamental role. For example, in quantum mechanics, a fundamental principle is the unitarity of the evolution in time, which claims that the states evolve as ρ_t = U_tρ and that the observables evolve as

Z_{t} = U_{t} Z U_{t}^{- 1}

, with U_t respecting the fundamental scalar product h₀. In fact, as we already mentioned, a deeper principle associates the choice of a time coordinate t to the choice of h₀, which gives birth to a unitary group U(E; h₀), isomorphic to U_N(ℂ). For stationary systems the family (U_t)_{t∈ℝ} forms a one parameter group, i.e., U_{t+s} = U_tU_s = U_sU_t, and there exists a hermitian generator H of U_t in the sense that U_t = exp(2πitH/h); by definition, this particular observable H is the energy, the most important observable. Even if we have a privileged basis, like Ω in the relation with classical probability, the consideration of another basis which makes the energy H diagonal is of great importance. In the stationary case, a symmetry of the dynamical system is defined as any unitary operator which commutes with the energy H. The set of symmetries forms a Lie group G, a closed sub-group in U_N. The infinitesimal generators are considered as hermitian observables (obtained by multiplying the elements of the Lie algebra L(G) by i); in general they do not commute with each other.

All these axioms extend to the infinite dimensional situation when E has the structure of a Hilbert space, but the spectral analysis of the unbounded operators is more delicate and diverse than the analysis in finite dimension. Three kinds of spectrum appear: discrete, absolutely continuous and singular continuous. The symmetries may fail to form a Lie group in general, and so on.

In our simple case of elementary quantum probability, without fixed dynamics, the classical symmetries of the set of probabilities are given by the permutations of Ω, the vertices of Δ(N). They correspond to the unitary matrices which have one and only one non-zero element in each row and each column. They do not diagonalize in the same basis because they do not commute, but they form a group

S_{N}

. Another subgroup of U_N is natural for semi-classical study, it is the diagonal torus

T^{N}

, its elements are the diagonal matrices with elements of modulus 1, they correspond to sets of angles. The group

S_{N}

normalizes the torus

T^{N}

, i.e., for each permutation σ and each diagonal element Z, the matrix σZσ⁻¹ is also diagonal; its elements are the same as the elements of Z but in a different order. The subgroup generated by

S_{N}

and

T^{N}

is the full normalizer of

T^{N}

.
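The statement that permutations normalize the diagonal torus can be checked directly on 3 × 3 matrices. The following Python sketch is only an illustration: the angles and the permutation are arbitrary assumed values, the permutation matrix is taken with entries 0 and 1, and the helper names are hypothetical.

```python
import cmath

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    n = len(a)
    return [[a[j][i] for j in range(n)] for i in range(n)]

def permutation_matrix(perm):
    """Matrix sending the basis vector e_j to e_{perm[j]}."""
    n = len(perm)
    return [[1 if perm[j] == i else 0 for j in range(n)] for i in range(n)]

def is_diagonal(a, tol=1e-12):
    n = len(a)
    return all(abs(a[i][j]) <= tol for i in range(n) for j in range(n) if i != j)

# Assumed element of the torus: diagonal entries of modulus 1.
angles = [0.3, 1.1, 2.5]
torus_element = [[cmath.exp(1j * angles[i]) if i == j else 0 for j in range(3)]
                 for i in range(3)]

perm = [1, 2, 0]                       # assumed cyclic permutation
sigma = permutation_matrix(perm)
sigma_inv = transpose(sigma)           # the inverse of a 0/1 permutation matrix

conjugate = matmul(matmul(sigma, torus_element), sigma_inv)
```

The matrix `conjugate` is again diagonal, and its entry at position perm[j] is the j-th diagonal entry of the original torus element, i.e., the same entries in a different order.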

One of the strengths of the quantum theory, with respect to the classical theory, is that it gives a similar status to the states, the observables and the symmetries. States are hermitian forms, generalizing points in the sphere (or in the projective space) which are pure states, observables are hermitian operators, or better spectral decompositions, and symmetries are unitary operators, infinitesimal symmetries being anti-hermitian matrices.

All classical groups should appear in this framework. First, by choosing a special structure on E we restrict the linear group GL_N(ℂ) to an algebraic subgroup G_ℂ. For instance, by choosing a symmetric invertible bilinear form on E we obtain O_N(ℂ), or, when N is even, by choosing an antisymmetric invertible bilinear form on E we obtain Sp_N(ℂ). In each of these cases there exists a special maximal torus (formed by the complexification of a maximal abelian subgroup T of unitary operators in G_ℂ), and a Weyl group, which is the quotient of the normalizer N(T) by the torus T itself. This Weyl group generalizes the permutation group when more algebraic structures are given in addition to the linear structure. The compact group of symmetries is the intersection G of G_ℂ with U_N. In fact, given any compact Lie group G_c, and any faithful representation r_c of G_c in ℂ^N, we can restrict real observables to generators of elements in G_c, and general observables to complex combinations of these generators, which integrate in a reductive linear group G. The spectral decomposition corresponds to the restriction to parabolic sub-groups of G_ℂ. The densities of states are restricted to the Satake compactification of the symmetric space G_ℂ/G_c [45].

4.2. Quantum Information Structures and Density Functors

To define information quantities in the quantum setting, we have a priori to consider families of operators (Y₁, Y₂, …, Y_m) as joint variables. However, the efforts made in Physics and Mathematics were not sufficient to attribute a clear probability to the joint events (Y₁ = y₁, Y₂ = y₂, …, Y_m = y_m), when Y₁, …, Y_m do not commute; we even suspect that this difficulty reveals a principle, namely that information requires a form of commutativity. Thus, in our study, we will adopt the convention that every time we consider joint observables, they do commute. Hence we will consider only collections of commuting hermitian observables; their natural amplitudes in a given state are vectors in ℝ^m. However we do not exclude the consideration in our theory of sequences (Y₁; …; Y_m) such that the Y_i do not commute.

A joint observable (Y₁, Y₂, …, Y_m) defines a linear decomposition of the total space E in direct orthogonal sum

E = \underset{α \in A}{\oplus} E_{α},

(76)

where E_α; α ∈ A is the collection of joint eigenspaces of the operators Y_j. Note that any orthogonal decomposition can be defined by a unique operator.

Another manner to handle the joint variables is to consider linear families of commuting operators

Y (λ_{1}, \dots, λ_{m}) = λ_{1} Y_{1} + \dots + λ_{m} Y_{m},

(77)

or in equivalent terms, linear maps from ℝ^m to End(E). Then assigning a probability and performing conditioning can be seen as functorial operations.

In what follows we denote indifferently by E_α the subspace of E or the orthogonal projection on this subspace.

From the point of view of information, two sets of observables are equivalent if they give the same linear decomposition of E. We say that a decomposition E_α; α ∈ A refines a decomposition E′_β; β ∈ B, when each E′_β is a sum of spaces E_α for α in a subset A_β of A. In such a case, we say that E_α; α ∈ A divides E′_β; β ∈ B.

For instance, for commuting decompositions Y, Z it is possible to define the joint variable, as the coarsest decomposition which is finer than both Y and Z.
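For commuting orthogonal decompositions, which diagonalize in a common orthonormal basis, refinement and joint variables reduce to operations on partitions of the index set of that basis. The Python sketch below relies on this assumption: each subspace E_α is encoded by the frozenset of indices of the basis vectors spanning it, and the function names `joint` and `refines` are hypothetical.

```python
def joint(decomp_y, decomp_z):
    """Coarsest decomposition finer than both: the non-empty intersections."""
    return {a & b for a in decomp_y for b in decomp_z if a & b}

def refines(fine, coarse):
    """True when every block of `fine` lies inside a block of `coarse`
    (for partitions of the same index set, each coarse block is then
    a union of fine blocks)."""
    return all(any(f <= c for c in coarse) for f in fine)

# Assumed example in dimension 4, basis vectors indexed by 0, 1, 2, 3.
Y = {frozenset({0, 1}), frozenset({2, 3})}
Z = {frozenset({0, 1, 2}), frozenset({3})}

YZ = joint(Y, Z)
```

Here the joint decomposition (Y, Z) consists of the three subspaces indexed by {0, 1}, {2} and {3}; it divides both Y and Z, and any decomposition dividing both Y and Z also divides it.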

We insist that only decompositions have a role in information study at this moment. We will see that the observation trees of the last section require us to consider a supplementary structure, which consists in an ordering of the factors in the decomposition.

An information structure on E is a set S of decompositions X of E in direct sum, such that when Y and Z are elements of S which refine X ∈ S, then Y, Z commute and the finer decomposition (Y, Z) they generate belongs to S. In this text, we will only consider orthogonal decompositions.

Remark: in fact, the necessity of this condition in the quantum context was the original motivation to introduce the definition of classical information structure, as exposed in the first section. This can be seen as a comfortable flexibility in the classical context, or as a step from classical to quantum information theory.

As in the classical case, an information structure gives a category, denoted by the letter S, whose objects are the elements of S, and whose arrows X → Y are given by the divisions X|Y between the decompositions in S.

In what follows we always assume that 1, which corresponds to the trivial partition E, belongs to S, and is a final object. If not we will not get a topos.

Note that we are not the first to use categories and topos to formulate quantum or classical probability. In particular Döring and Isham propose a reformulation of the whole quantum and classical physics by using topos theory, see [46] and references therein. This theory followed remarkable works of Isham, Butterfield and Hamilton, made between 1998 and 2002, and was further developed by Flori, Heunen, Landsman, Spitters, especially in the direction of a quantum logic. A common point between these works and our work is the consideration of sheaves over the category made by the partial ordering in commutative subalgebras. However, Döring et al. consider only the set of maximal algebras, and do not look at decompositions, i.e., they consider also the spectral values. In [46], Döring and Isham defined topos associated to quantum and classical probabilities. However, they focused on the definition of truth values in this context. For instance, in the classical setting, the topos they define is the topos of ordinary topological sheaves over the space (0, 1)_L which has for open sets the intervals ]0, r[ for 0 ≤ r ≤ 1, and particular points in their topos are given by arbitrary probabilized spaces, which is far from the objects we consider, because our classical topos are attached to sigma-algebras over a given set. In fact, our aim is more to develop a kind of geometry in this context, by using homological algebra, in the spirit of Artin, Grothendieck, Verdier, when they developed topos for studying the geometry of schemes.

Example 5. The most interesting structures S seem to be provided by the quantum generalization of the simplicial information structure in classical finite probability. A finite family of commuting decompositions Σ = {S₁, …, S_n} is given; they diagonalize in a common orthogonal basis, but it can happen that not all diagonal decompositions associated to the maximal torus belong to the set of joints W (Σ). In such a case a subgroup G_Σ appears, which corresponds to the stabilizer of the finest decomposition S_[n] = (S₁…S_n). This group is in general larger than a maximal torus of U_N; it is a product of unitary groups (corresponding to common eigenvalues of observables in W (Σ)), and it is named a Levi subgroup of the unitary group. In addition we consider a closed subgroup G in the group U(E; h₀) (which could be identified with U_N), and all the conjugates gY g⁻¹ of elements of W (Σ) by elements of G; this gives a manifold of commutative observable families Σ_g; g ∈ G. More generally we could consider several families Σ_γ; γ ∈ Γ of commuting observables, where Γ is any set. It can happen that an element of Σ_γ is also an element of Σ_λ for λ ≠ γ. The family Γ ∗ Σ of the Σ_γ when γ describes the set Γ forms a quantum information structure. The elements of this structure are (perhaps ambiguously) parameterized by the product of an abstract simplex ∆(n) with the set Γ (in particular Γ = G for conjugated families).

A simplicial information structure is a subset of Γ ∗ Σ which corresponds to a family K_γ of simplicial sub-complexes of ∆(n). In the invariant case, when Γ = G, several restrictions could be useful, for instance using the structure of the manifold of the conjugation classes of G_Σ under G. The simplest case is given by taking the same complex K for all conjugates gΣg⁻¹. By definition this latter case is a simplicial invariant family of quantum observables.

An event associated to S is a subspace E_A, which is an element of one of the decompositions X ∈ S. For instance, if Y = (Y₁, …, Y_m), the joint event A = (Y₁ = y₁, Y₂ = y₂, …, Y_m = y_m) gives the space E_A which is the maximal vector subspace of E where A happens, i.e.,

(f \in E_{A}) \Leftrightarrow (Y_{1} (f) = y_{1} f, Y_{2} (f) = y_{2} f, \dots, Y_{m} (f) = y_{m} f) .

(78)

We say that A is measurable for a decomposition Y whenever it is obtained by unions of elements of Y.

The role of the Boolean algebra

B

introduced in the first section could have been played here by a given decomposition B of E such that any decomposition in S is divided by B.

However this choice of B is too rigid, in particular it forbids invariance by the unitary group U(h₀). Thus we decided that a better analog of the Boolean algebra

B

is the set UB of all decompositions that are deduced from a given B by unitary transformations.

On the side of density of states, i.e., quantum probabilities, we can consider a subspace Q₁ of the space P = ℙH₊ of hermitian positive matrices modulo multiplication by a constant. Concretely, we identify the elements of Q₁ with positive hermitian operators ρ such that Tr ρ = 1. The space P is naturally stratified by the rank of the form; the largest cell ℙH₊₊ corresponds to the non-degenerate forms; the smallest cells correspond to the rank one forms, which are called pure states in Quantum Mechanics.

We will only consider subsets Q₁ of P which are adapted to S, i.e., which satisfy that if ρ belongs to Q₁, the conditioning of ρ by elements of S also belongs to Q₁. This means that Q₁ is closed by orthogonal projections on all the elements E_A of the orthogonal decompositions X belonging to S. Note that a subset of P which is closed by all orthogonal projections is automatically adapted to any information category S.

Recall that, if ρ is a density of states and E_A is an elementary event (i.e., a subspace of E), we define the conditioning of ρ by A as the hermitian matrix

ρ | A = E_{A}^{*} ρ E_{A} / \mathrm{Tr} (E_{A}^{*} ρ E_{A}) .

(79)

And we define the probability of the event E_A for ρ as the trace:

ℙ_{ρ} (A) = \mathrm{Tr} (E_{A}^{*} ρ E_{A}) .

(80)

In the same manner we define the density of a joint observable by

ρ_{Y} = \sum_{A} ℙ_{ρ} (A) ρ | A = \sum_{A} E_{A}^{*} ρ E_{A} .

(81)
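As a concrete illustration of Equations (79) to (81), the following Python sketch works on E = ℂ² with real symmetric matrices stored as nested lists. The helper names (`project`, `probability`, `condition`, `density_of_observable`) and the chosen state are assumptions made for this example only; they are not notation from the text.

```python
def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def project(rho, e):
    """E_A^* rho E_A; e is an orthogonal projector, so e^* = e."""
    return matmul(matmul(e, rho), e)

def probability(rho, e):
    """Equation (80): probability of the event E_A for rho."""
    return trace(project(rho, e))

def condition(rho, e):
    """Equation (79): conditioning of rho by the event E_A."""
    p = probability(rho, e)
    if p == 0:
        raise ValueError("conditioning on an event of probability zero")
    return [[x / p for x in row] for row in project(rho, e)]

def density_of_observable(rho, projectors):
    """Equation (81): sum of the blocks E_A^* rho E_A over the events of Y."""
    n = len(rho)
    total = [[0.0] * n for _ in range(n)]
    for e in projectors:
        block = project(rho, e)
        total = [[total[i][j] + block[i][j] for j in range(n)] for i in range(n)]
    return total

# Hypothetical example: the pure state (e0 + e1)/sqrt(2) and the
# decomposition Y of C^2 into the two coordinate lines.
rho = [[0.5, 0.5], [0.5, 0.5]]
E0 = [[1.0, 0.0], [0.0, 0.0]]
E1 = [[0.0, 0.0], [0.0, 1.0]]

p0 = probability(rho, E0)
rho_given_0 = condition(rho, E0)
rho_Y = density_of_observable(rho, [E0, E1])
```

In this example the off-diagonal coefficients of ρ disappear in ρ_Y, while the probabilities of the two events of Y are the same for ρ and for ρ_Y.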

A nice reference studying important examples is Paul-Andre Meyer, Quantum probability for probabilists [47].

If X is an orthogonal decomposition of E, we can associate to it a subset Q_X of Q₁, which contains at least all the forms ρ_X where ρ belongs to Q₁. The natural axiom that we assume for the function X ↦ Q_X is that for each arrow of division X → Y, the set Q_Y contains the set Q_X; we then denote by Y_∗ the injection from Q_X to Q_Y. The fact that Q_X is stable by conditioning by every element of a decomposition Y which is less fine than X is automatic; it follows from the fact that Q₁ is adapted to S. We will use conditioning in this way.

In what follows we denote by the letter Q such a functor X ↦ Q_X from the category S to the category of quantum probabilities, with the arrows given by direct images. The set Q₁ is the value of the functor Q for the certitude 1. We must keep in mind that many choices are possible for the functor when Q₁ is given; the two extremes are the functor Q^max, where Q_X = Q₁ for every X, and the functor Q^min, where Q_X is restricted to the set of forms ρ_X where ρ ranges over Q₁; in this last case the elements of Q_X are positive hermitian forms on E which are decomposed in blocks according to X.

From the physical point of view, Q^min appears to make more sense than Q^max, but we prefer to consider both of them.

A special probability functor, which will be denoted Q^can(S), is canonically associated to a quantum information structure S:

Definition 4. The canonical density functor Q_X^can(S) is made of all positive hermitian forms matched to X, i.e., all the forms ρ_X when ρ ranges over ℙH₊.

It is equal to the functor Q^min associated to the full set Q₁ = ℙH₊. When the context is clear, we will simply write Q^can.

An important difference appears between the quantum and the classical frameworks: if X divides Y, there exist more (quantum) probability laws in Q_Y than in Q_X, but there exist fewer classical laws at the place Y than at the place X, because classical laws are defined on smaller sigma-algebras.

In particular, the trivial partition has only one classical state, which is Tr(ρ) = 1, but it has the richest structure in terms of quantum laws, namely any hermitian positive form.

Let us consider the classical probabilities, i.e., the maps that associate the number ℙ_ρ(A) to an event A; then, for an event which is measurable for Y, the law Y_∗ρ_X gives the same result as the law ρ_X.

Remark: This points to a generalized notion of direct image, which is a correspondence q_XY_∗ between Q_X and Q_Y , not a map: we say that the pair (ρ_X, ρ_Y) in Q_X × Q_Y belongs to q_XY_∗, if for any event which is measurable for Y, we have the equality of probabilities

ℙ_{ρ_{X}} (A) = ℙ_{ρ_{Y}} (A) .

(82)

Let us look at the relation of quantization, between a classical information structure and a quantum one:

Consider a maximal family of commuting observables 𝒮 in the quantum information structure S, i.e., the full subcategory associated to an initial object X₀. This family is a classical information structure. Conversely, if we start with a classical information structure 𝒮, made by partitions of a finite set Ω, we can always consider it as a quantum structure associated to the vector space E = ℂ^Ω freely generated over ℂ by the elements of Ω. Note that E comes with a canonical positive definite form h₀, and, to be interesting from the quantum point of view, it is better to extend 𝒮 by applying to it all unitary transformations of E, generating a quantum structure S = U𝒮.

Remark 5. Suppose that S is unitary invariant. We can define a larger category S^U by taking as arrows the isomorphisms of ordered decompositions, and closing under all compositions of arrows of S with them. Such an invariant extended category S^U is not far from being equivalent to the category 𝒮^𝔖, made by adding arrows for permutations of the sets Ω/X (cf. above section), from the point of view of category theory: let us work for a moment, as we will do in the last part of this paper, with ordered partitions of Ω, Ω being itself equipped with an order, and ordered orthogonal decompositions of E. In this case we can associate to any ordered decomposition X = (E₁, …, E_m) of E the unique ordered partition of Ω compatible with the sequence of dimensions and the order of Ω. It gives a functor τ from S to 𝒮 such that τ ∘ ι = Id_𝒮, where ι denotes the inclusion of 𝒮 in S. These two functors are extended, preserving this property, to the categories S^U and 𝒮^𝔖. In fact, the functor ι sends a permutation to the unitary map which acts by this permutation on the canonical basis, and the functor τ sends a unitary transformation g between X ∈ S and gXg^∗ ∈ S to the permutation it induces on the orthogonal decompositions. Moreover, consider the map f which associates to any X ∈ S^U the unique morphism from the decomposition ι ∘ τ(X) to X; it is a natural transformation from the functor ι ∘ τ to the functor Id_{S^U}, which is invertible, so it defines an equivalence of categories between 𝒮^𝔖 and S^U. However a big difference begins with probability functors.

Let Q be a quantum density functor adapted to S, and denote by ι^∗Q the composite functor on 𝒮; we can consider the map 𝒬 which associates to X ∈ 𝒮 the set of classical probabilities ℙ_ρ for ρ ∈ Q_X. If X divides Y, the fact that the direct image Y_∗(ℙ_ρ) of the law ℙ_ρ, for ρ ∈ Q_X, coincides with the law ℙ_{Y_∗(ρ)} gives the following result:

Lemma 4. ρ ↦ ℙ_ρ is a natural transformation from the functor ι^∗Q to the functor 𝒬.

Definition 5. This natural transformation is called the Trace, and we denote by Tr_X its value in X, i.e., Tr_X(ρ) = ℙ_ρ, seen as a map from Q_X to 𝒬_X.

In general there is no natural transformation in the other direction, from 𝒬_X to Q_X.

Note that the trace sends a unitary invariant functor to a symmetric functor.

4.3. Quantum Information Homology

As in the classical case, we can consider the ringed site given by the category S, equipped with the sheaf of monoids {S_X; X ∈ S}. In the ringed topos of sheaves of S-modules, the choice of a probability functor Q generates remarkable elements in this topos, formed by the functional space F of measurable functions on Q with values in ℝ. The action of the monoid (or of the generated ring) is given by averaged conditioning, and the arrows are given by transposition of direct images. Then, the quantum information co-homology is the topos co-homology:

H^{m} (S, Q) = \mathrm{Ext}_{S}^{m} (ℝ; F) .

(83)

However, as in the classical case, we can define directly the co-homology with a bar resolution of the constant sheaf, as follows:

A set of functions F_X of m observables Y₁, …, Y_m divided by X, and one density ρ indexed by X ∈ S, is said to be local when, for any decomposition X dividing a decomposition Y which itself divides Y₁, …, Y_m, we have, for each ρ in Q_X,

F_{X} (Y_{1}; \dots; Y_{m}; ρ) = F_{Y} (Y_{1}; \dots; Y_{m}; Y_{*} (ρ)) .

(84)

For m = 0 this equation expresses that the family F_X is an element of the topos.

For every m, a collection F_X, X ∈ S is a natural transformation F from a free functor S_m to the functor F.

Be careful that in the quantum context, it is not true in general that locality is equivalent to the condition saying that the value F_X(Y₁; …; Y_m; ρ) depends only on the family of conditioned densities E_{A_i}^∗ ρ E_{A_i}; i = 1, …, m, where A_i is one of the possible events defined by Y_i.

In fact it depends on the choice of Q; for instance it is false for a Q^max, but it is true for a Q^min.

The counter-example in the case of Q^max is given by a function F(ρ) which is independent of X. It is local (in the sense of topos that we adopt) but it is not local in the apparently more natural sense of depending only on ρ_X. It is important to keep this quantum particularity in mind for understanding the following discussion.

As in the classical case, the action of observables on local functions is given by the average of conditioning, in the manner of Shannon, but using the Von Neumann conditioning:

Y . F (Y_{0}; \dots; Y_{m}; ρ) = \sum_{A} \mathrm{Tr} (E_{A}^{*} ρ E_{A}) F (Y_{0}; \dots; Y_{m}; ρ | A)

(85)

where the E_A’s are the spectral projectors of the bundle Y. In this definition there is no necessity to assume that Y commutes with the Y_j’s.

Recall that, when E^∗_AρE_A is non-zero, ρ|A is equal to E^∗_AρE_A/Tr(E^∗_AρE_A), and satisfies the normalization condition that its trace equals one. When E^∗_AρE_A is equal to zero, the factor Tr(E^∗_AρE_A) is zero, and by convention the corresponding term in F is absent.

The proof of Lemma 1 applies without significant change to prove that the above formula defines an action of the monoid functor S_X.

Then, the definition of co-homology is given exactly as we have done for the classical case, by introducing the Hochschild operator:

\begin{array}{l} {\hat{δ}}^{m} F (Y_{1}; \dots; Y_{m + 1}; ρ) \\ = Y_{1} . F (Y_{2}; \dots; Y_{m + 1}; ρ) + \sum_{i = 1}^{m} {(- 1)}^{i} F (\dots; (Y_{i}, Y_{i + 1}); \dots; Y_{m + 1}; ρ) + {(- 1)}^{m + 1} F (Y_{1}; \dots; Y_{m}; ρ) . \end{array}

(86)

The Von-Neumann entropy is defined by the following formula

S (ρ) = \mathbb{E}_{ρ} (- \log_{2} (ρ)) = - \mathrm{Tr} (ρ \log_{2} (ρ)) .

(87)

For any density functor Q which is adapted to S, the Von-Neumann entropy defines a local 0-cochain, which we will call S_X, and which is simply the restriction of S to the set Q_X. If ρ belongs to Q_X and if X divides Y, the law Y_∗ρ, which is the same hermitian form as ρ, belongs to Q_Y by functoriality; thus S(Y_∗ρ) = S(ρ) translates into S_X(ρ) = S_Y(Y_∗ρ). This 0-cochain will simply be called the Von Neumann entropy.

In the case of Q^max, S_X gives the same value at all places X. In the case of Q^min it coincides with S(ρ_X), where ρ_X denotes the restriction to the decomposition X.
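The difference between S(ρ) and S(ρ_X) can be checked numerically on a qubit. The sketch below is a minimal illustration restricted to real symmetric 2×2 densities, for which the eigenvalues have a closed form; the function names and the chosen state are assumptions of this example.

```python
import math

def eigenvalues_2x2(m):
    """Eigenvalues of a real symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = m
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return [mean + radius, mean - radius]

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log2 rho), with the convention 0 log 0 = 0."""
    return -sum(lam * math.log2(lam)
                for lam in eigenvalues_2x2(rho) if lam > 1e-12)

def restrict_to_diagonal(rho):
    """rho_X for the decomposition X of C^2 into the two coordinate lines."""
    return [[rho[0][0], 0.0], [0.0, rho[1][1]]]

# Hypothetical example: the pure state (e0 + e1)/sqrt(2).
rho = [[0.5, 0.5], [0.5, 0.5]]

entropy_of_rho = von_neumann_entropy(rho)                          # S(rho)
entropy_of_rho_X = von_neumann_entropy(restrict_to_diagonal(rho))  # S(rho_X)
```

For this pure state S(ρ) vanishes, whereas S(ρ_X) equals one bit, so the two functions ρ ↦ S(ρ) and ρ ↦ S(ρ_X) genuinely differ on a set of densities that is not already block diagonal.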

Be careful: ρ ↦ S(ρ_X) is not a local 0-cochain for Q^max. In fact, in the case of Q^max we have the same set Q = Q_X for every place X; thus, if we take for X a strict divisor of Y and a density ρ such that the spectra of the restrictions ρ_Y and ρ_X are different, then, in general, we do not have S_X(ρ) = S_Y(Y_∗ρ), even if, as is the case in the quantum context, Y_∗ρ = ρ.

Note that in the case of Q^max, where every function of ρ independent of X is a cochain of degree zero, the particular functions which depend only on the spectrum of ρ are invariant under the action of the unitary group, and they are the only 0-cochains which are invariant by this group.

Definition 6. Suppose that S and Q are invariant by the unitary group, as is UB, we say that an m-cochain F is invariant, if for every X in S dividing Y₁, …, Y_m in S, every ρ in Q_X and every g in the group U(h₀), we have

F_{g . X} (g . Y_{1}, \dots, g . Y_{m}; g . ρ) = F_{X} (Y_{1}; \dots; Y_{m}; ρ);

(88)

where g.X = gXg^∗, g.Y_i = gY_ig^∗; i = 1, …, m and g.ρ = gρg^∗.

This is compatible with the naturality assumption (functoriality by direct images), because direct image is a covariant operation.

Note that conditioning is also covariant if we change all variables and laws coherently. Thus the action of the monoids S_X on cochains respects the invariance.

Then the coboundary δ̂ preserves invariance. Thus the co-homology of the invariant co-chains is well defined. We call it the invariant information co-homology, and we will denote it by H_U^∗(S; Q), U for unitary.

Invariant co-chains form a subcomplex of ordinary cochains, so we have a well defined map from H_U^∗(S; Q) to H^∗(S; Q).

The invariant 0-co-chains depend only on the spectrum of ρ in the sets Q_X.

The invariant co-homology is probably a more natural object from the point of view of Physics. It is also on this co-homology that we were able to obtain constructive results.

The classical entropy of the decomposition {E_j} and the quantum law ρ is

H (X; ρ) = - \sum_{j} \mathrm{Tr} (E_{j}^{*} ρ E_{j}) \log_{2} (\mathrm{Tr} (E_{j}^{*} ρ E_{j}))

(89)
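To see how Equation (89) depends on the decomposition, the following sketch evaluates it for one density and two different orthonormal bases of ℝ² ⊂ ℂ². It is a minimal example restricted to decompositions into lines, where Tr(E_j^∗ ρ E_j) reduces to the quadratic form ⟨v_j, ρ v_j⟩; the function names and the chosen bases are assumptions of this illustration.

```python
import math

def quadratic_form(rho, v):
    """<v, rho v> for a real unit vector v, i.e. Tr(E rho E) with E = |v><v|."""
    n = len(v)
    return sum(v[i] * rho[i][j] * v[j] for i in range(n) for j in range(n))

def entropy_of_decomposition(rho, basis):
    """Equation (89) for a decomposition of E into the lines spanned by `basis`."""
    probs = [quadratic_form(rho, v) for v in basis]
    return -sum(p * math.log2(p) for p in probs if p > 1e-12)

# Hypothetical example: the pure state (e0 + e1)/sqrt(2).
rho = [[0.5, 0.5], [0.5, 0.5]]

s = 1 / math.sqrt(2)
coordinate_basis = [[1.0, 0.0], [0.0, 1.0]]
rotated_basis = [[s, s], [s, -s]]   # contains the support of rho

h_coordinate = entropy_of_decomposition(rho, coordinate_basis)
h_rotated = entropy_of_decomposition(rho, rotated_basis)
```

The same density gives one bit for the coordinate decomposition and zero for the rotated one, which illustrates that H(X; ρ) is a function of the pair (X, ρ) and not of ρ alone.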

In general it is not true that H(X; ρ) = H(Y ; Y_∗ρ) when X divides Y . Thus the Shannon (or Gibbs) entropy is not a local 0-cochain, but it is a local 1-cochain, i.e., if X → Y → Z we have

H_{X} (Z; ρ_{X}) = H_{Y} (Z; Y_{*} ρ_{X}) .

(90)

Moreover it is a spectral 1-cochain for any Q^min.

The following result is well known, cf. Nielsen and Chuang [13].

Lemma 5. Let X, Y be two commuting families of observables; we have

S_{(X, Y)} (ρ) = H (Y; ρ) + Y . S_{X} (ρ)

(91)

Proof. We denote by α, β, … the indices of the different values of X, by k, l, … the indices of the different values of Y, and by i, j, … the indices of a basis I_{k,α} of eigenvectors of the conditioned density ρ_{k,α} = E_{k,α}^∗ ρ E_{k,α} constrained by the projectors E_{k,α} of the pair (Y, X). The probability p_k = ℙ_ρ(Y = ξ_k) is equal to the sum over i, α of the eigenvalues λ_{i,k,α} of ρ_{k,α}. We have

\begin{array}{l} Y . S_{X} (ρ) = - \sum_{k} p_{k} \sum_{i, α} \frac{λ_{i, k, α}}{p_{k}} \log_{2} (\frac{λ_{i, k, α}}{p_{k}}) \\ = - \sum_{i, k, α} λ_{i, k, α} \log_{2} (λ_{i, k, α}) + \sum_{i, k, α} λ_{i, k, α} \log_{2} (p_{k}) \\ = - \sum_{i, k, α} λ_{i, k, α} \log_{2} (λ_{i, k, α}) + \sum_{k} p_{k} \log_{2} (p_{k}) \\ = S_{(X, Y)} (ρ) - H (Y; ρ) . \end{array}

Remark 6. Taking X = 1, or any scalar matrix, the preceding Lemma 5 expresses the fact that classical entropy is a derived quantity measuring the defect of equivariance of the quantum entropy:

H (Y; ρ) = S_{Y} (ρ) - (Y . S_{Y}) (ρ) .

(92)

Lemma 6. For any X ∈ S, dividing Y ∈ S and ρ ∈ Q_X,

\hat{δ} (S_{X}) (Y; ρ) = - H_{X} (Y; ρ) .

(93)

Proof. This is exactly what Lemma 5 says in this particular case, because in this case (X, Y) = X, and, by definition, we have δ̂(S_X)(Y; ρ) = Y.S_X(ρ) − S_X(ρ).

For emphasis, we give a direct proof with fewer indices for this case:

\begin{array}{l} Y . S_{X} (ρ) = - \sum_{i} p_{i} \sum_{k} \frac{λ_{i k}}{p_{i}} \log_{2} \frac{λ_{i k}}{p_{i}} \\ = - \sum_{i k} λ_{i k} \log_{2} λ_{i k} + \sum_{i k} λ_{i k} \log_{2} p_{i} \\ = S_{X} (ρ) + \sum_{i} \log_{2} p_{i} \sum_{k} λ_{i k} = S_{X} (ρ) + \sum_{i} (\log_{2} p_{i}) p_{i} \\ = S_{X} (ρ) - H_{X} (Y; ℙ_{ρ}) = S_{X} (ρ) - H_{X} (Y; ρ) . \end{array}

Lemma 6 says that (up to the sign) the Shannon entropy is the co-boundary of the Von-Neumann entropy. This implies that the Shannon entropy is a 1-co-cycle, as in the classical case, but now it gives zero in co-homology.
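The identity of Lemma 6 can be checked numerically in a small case. The sketch below assumes E = ℂ³, a decomposition Y with one summand of dimension two and one of dimension one, and a density ρ that is block diagonal with respect to Y (so that X = Y); ρ is stored as its list of diagonal blocks, and all names and numerical values are hypothetical choices for this example.

```python
import math

def eigenvalues(block):
    """Eigenvalues of a 1x1 block or of a real symmetric 2x2 block."""
    if len(block) == 1:
        return [block[0][0]]
    (a, b), (_, d) = block
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return [mean + radius, mean - radius]

def entropy_bits(weights):
    """-sum w log2 w, with the convention 0 log 0 = 0."""
    return -sum(w * math.log2(w) for w in weights if w > 1e-12)

def block_trace(block):
    return sum(block[i][i] for i in range(len(block)))

def von_neumann(blocks):
    """S(rho) for a block-diagonal rho given by its diagonal blocks."""
    return entropy_bits([lam for blk in blocks for lam in eigenvalues(blk)])

def averaged_conditioning(blocks):
    """Y.S(rho) = sum over events A of P(A) * S(rho | A); zero terms are absent."""
    total = 0.0
    for blk in blocks:
        p = block_trace(blk)
        if p > 0:
            conditioned = [[x / p for x in row] for row in blk]
            total += p * von_neumann([conditioned])
    return total

# Hypothetical density: a 2x2 block of trace 0.7 and a 1x1 block of trace 0.3.
blocks = [[[0.4, 0.2], [0.2, 0.3]], [[0.3]]]

coboundary = averaged_conditioning(blocks) - von_neumann(blocks)
shannon = entropy_bits([block_trace(b) for b in blocks])
```

Up to rounding, `coboundary` equals `-shannon`, in agreement with Equation (93).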

Note that the result is true for any Q, thus for Q^min and for Q^max as well.

Consider a maximal observable X₀ in S, i.e., a maximal set of commuting observables in S; the elements of this maximal partition form a finite set Ω₀. If S is invariant by the group U(E; h₀), all the maximal observables are deduced from X₀ by applying a unitary base change. Suppose that the functor Q is invariant also; then we get automatically a symmetric classical structure of information 𝒮 on Ω₀, given by the elements of S divided by X₀. And 𝒮 is equipped with a symmetric classical functor of probability, given by the probability laws associated to the elements of 𝒮.

Recall that we defined the trace from quantum probabilities to classical probabilities, by taking the classical ℙ_ρ for each ρ, and we noticed that the trace is compatible with invariance and symmetry by permutations.

Definition 7. To each classical co-chain F⁰ we can associate a quantum co-chain F = tr^∗F⁰ by putting

\mathrm{tr}^{*} {(F^{0})}_{X} (Y_{1}; \dots; Y_{m}; ρ) = F_{X}^{0} (Y_{1}; \dots; Y_{m}; \mathrm{tr}_{X} (ρ)) .

(94)

The following result is straightforward:

Proposition 3. (i) The trace of co-chains defines a map tr^∗ from the classical information Hochschild complex to the quantum Hochschild complex, which commutes with the co-boundaries; (ii) this map sends symmetric cochains to invariant cochains; it induces a natural map from the symmetric classical information co-homology H_𝔖^∗(𝒮, 𝒬) to the invariant quantum information co-homology H_U^∗(S; Q).

Lemma 6 says that the entropy class goes to zero.

Remark 7. In a preliminary version of these notes, we considered the expression s(X; ρ) = S(ρ_X) − S(ρ) and showed it satisfies formally the 1-cocycle equation. But we suppress this consideration now, because s is not local, thus it plays no interesting role in homology. For instance in Q^min, S(ρ_X) is local but S(ρ) is not and in Q^max, S(ρ) is local but S(ρ_X) is not.

Definition 8. In an information structure S we call edge a pair of decompositions (X, Y) such that X, Y and XY belong to S; we say that an edge is rich when both X and Y have at least two elements and XY cuts those two in four distinct subspaces of E. The structure S is connected if every two points are joined by a sequence of edges, and it is sufficiently rich when every point belongs to a rich edge. We assume a maximal set of subspaces UB is given in the Grassmannian of E, in such a way that the maximal elements X₀ of S (i.e., initial in the category) are made by pieces in UB. The density functor Q is said to be complete with respect to S (or UB) if for every X, the set Q_X contains the positive hermitian forms on the blocks of X that give scalar blocks ρ_αβ for two elements E_α, E_β of a maximal decomposition. (All that is simplified when we choose a basis, and take maximal commutative subalgebras of operators, but we want to be free to consider simplicial complexes.)

Theorem 3. (i) For any unitary invariant quantum information structure S which is connected and sufficiently rich, and for the canonical invariant density functor Q^can(S) (i.e., the density functor which is minimal and complete with respect to S), the invariant information co-homology of degree one H_U^1(S; Q) is zero. (ii) Under the same hypothesis, the invariant co-homology of degree zero has dimension one, and is generated by the constants. Then, up to an additive constant, the only invariant 0-cochain which has the Shannon entropy as co-boundary is (minus) the Von-Neumann entropy.

Proof. (I) Let X, Y be two orthogonal decompositions of E belonging to S such that (X, Y) belongs to S, and ρ an element of Q. We name A_{k_i}; i = 1, …, m the summands of X, and B_{α_j}; j = 1, …, l the summands of Y; the projections E_{k_i} ρ E_{k_i}; i = 1, …, m, resp. E_{α_j} ρ E_{α_j}; j = 1, …, l, of ρ on the summands of X, resp. Y, are denoted by ρ_{k_i}; i = 1, …, m and ρ_{α_j}; j = 1, …, l respectively. The projections by the commutative products E_{k_i} E_{α_j} are denoted by ρ_{k_i,α_j}; i = 1, …, m, j = 1, …, l.

Let f be a 1-cocycle; we write f(X; ρ) = F(ρ), f(Y; ρ) = H(ρ) and G(ρ) = f(X, Y; ρ). Note that in Q^min, F is a function of the ρ_{k_i}, H a function of the ρ_{α_j} and G a function of the ρ_{k_i,α_j}, but there is no necessity to assume this property; we can always consider these functions restricted to diagonal blocks, which are arbitrary due to the completeness hypothesis.

For any positive hermitian ρ′, we write ρ′|α, resp. ρ′|i the form conditioned by the event B_α resp. A_i. The co-cycle equation gives the two following equations, that are exchanged by permuting X and Y:

\sum_{α_{j}} \mathrm{Tr} (ρ_{α_{j}}) F ((ρ_{k_{i}} | α_{j}); i = 1, \dots, m) = G ((ρ_{k_{i}, α_{j}}); i, j) - H ((ρ_{α_{j}}); j),

(95)

\sum_{i} \mathrm{Tr} (ρ_{k_{i}}) H ((ρ_{α_{j}} | k_{i}); j) = G ((ρ_{k_{i}, α_{j}}); i, j) - F ((ρ_{k_{i}}); i) .

(96)

Now we consider a particular case, where the small blocks ρ_{k,α} are zero except for (k₁, α₂) and (k_i, α₁) for i = 2, …, m. We denote by h₁ the form ρ_{k₁,α₂} and by h_i the form ρ_{k_i,α₁}, for i = 2, …, m. Note that Tr(h₁ + h₂ + … + h_m) = 1.

(II) As in the classical case, it is a general fact that, for a 1-cocycle f and any variable Z, the value f(Z; ρ) is zero if ρ is zero outside one of the orthogonal summands C_a of Z. Indeed, since (Z, Z) = Z, the equation f_X(Z, Z; ρ) = f_X(Z; ρ) + Z.f_X(Z; ρ) implies Z.f_X(Z; ρ) = 0, and if ρ has only one non-zero factor ρ_a, we have

Z . f (Z; ρ) = \sum_{b} \mathrm{Tr} (ρ_{b}) f (Z; ρ_{b} / \mathrm{Tr} (ρ_{b})) = \mathrm{Tr} (ρ_{a}) f (Z; ρ_{a} / \mathrm{Tr} (ρ_{a})) = 1 . f (Z; ρ_{a}) .

(97)

Therefore in the particular case that we consider, we get for any i that H((ρ_{α_j}|k_i); j) = 0. Consequently Equation (96) equates the term in G to the term in F, and we can substitute this equality into the first equation. By denoting 1 − x₁ = Tr(ρ_{α₁}), this gives

H ((ρ_{α_{j}}); j = 1, 2) = F ((ρ_{k_{i}}); i = 1, \dots, m) - (1 - x_{1}) F ((0, \frac{h_{2}}{1 - x_{1}}, \dots, \frac{h_{m}}{1 - x_{1}})) .

(98)

Now if we add the condition h₃ = … = h_m = 0 we have F(0, h₂/(1−x₁), 0, …, 0) = 0 for the reason which eliminated the H((ρ_{α_j}|k_i); j); thus we obtain

H ((ρ_{α_{j}}); j = 1, 2) = F ((ρ_{k_{i}}); i = 1, 2) .

(99)

This constraint is strong enough to imply that both terms are functions of h₁, h₂ only, and that of course they coincide as functions of these small blocks.

First this gives a recurrence equation which, as in the classical case, is able to reconstruct F((ρ_{k_i}); i = 1, …, m) from the case of two blocks:

F (X; (ρ_{k_{i}}); i = 1, \dots, m) = F (X; (ρ_{k_{1}}, ρ_{k_{2}}, 0, \dots, 0)) + (1 - x_{1}) F (X; (0, \frac{h_{2}}{1 - x_{1}}, \dots, \frac{h_{m}}{1 - x_{1}})) .

(100)
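The classical model for a recurrence of this kind is the grouping identity of the Shannon entropy, H(p₁, …, p_m) = H(p₁, 1 − p₁) + (1 − p₁) H(p₂/(1 − p₁), …, p_m/(1 − p₁)), which rebuilds the entropy of m values from the binary case. The following sketch checks this well-known identity numerically; it illustrates the classical analogue only, not the quantum statement itself, and the function names and the sample law are assumptions of the example.

```python
import math

def shannon(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def grouped(p):
    """Right-hand side of the grouping identity (requires p[0] < 1)."""
    x1 = p[0]
    rest = [x / (1 - x1) for x in p[1:]]   # law conditioned on "not the first value"
    return shannon([x1, 1 - x1]) + (1 - x1) * shannon(rest)

# Hypothetical probability law on three values.
p = [0.2, 0.3, 0.5]

direct = shannon(p)
by_recurrence = grouped(p)
```

Applying `grouped` repeatedly to the conditioned law reduces any finite law to a sum of binary entropies, which is the mechanism used to reduce the proof to the case of two binary variables.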

(III) We are left with the study of two binary variables Y, Z, forming a rich edge.

The blocks of ρ adapted to the joint ZY are denoted by ρ₀₀, ρ₀₁, ρ₁₀, ρ₁₁, where the first index refers to Y and the second index refers to Z, but the blocks that are allowed for Y and Z are more numerous than four; there exist off-diagonal blocks, and their role will be important in our analysis. For Y we have matrices ρ_0^0 and ρ_1^0, and for Z we have matrices ρ_0^1 and ρ_1^1:

\begin{matrix} ρ_{0}^{0} = (\begin{matrix} ρ_{00} & ρ_{001}^{0} \\ ρ_{010}^{0} & ρ_{01} \end{matrix}) & ρ_{1}^{0} = (\begin{matrix} ρ_{10} & ρ_{101}^{0} \\ ρ_{111}^{0} & ρ_{11} \end{matrix}) \end{matrix}

(101)

\begin{matrix} ρ_{0}^{1} = (\begin{matrix} ρ_{00} & ρ_{001}^{1} \\ ρ_{010}^{1} & ρ_{01} \end{matrix}) & ρ_{1}^{1} = (\begin{matrix} ρ_{10} & ρ_{101}^{1} \\ ρ_{111}^{1} & ρ_{11} \end{matrix}) \end{matrix}

(102)

They are arranged in sixteen blocks for ρ, but some of them, marked with stars, cannot be seen from ρ_Y or ρ_Z:

ρ = (\begin{matrix} ρ_{00} & ρ_{001}^{0} & ρ_{001}^{1} & ρ_{001}^{*} \\ ρ_{010}^{0} & ρ_{01} & ρ_{101}^{*} & ρ_{001}^{1} \\ ρ_{010}^{1} & ρ_{010}^{*} & ρ_{10} & ρ_{101}^{0} \\ ρ_{111}^{*} & ρ_{111}^{1} & ρ_{111}^{0} & ρ_{11} \end{matrix})

(103)

Now the co-cycle equations are

F (Y, Z; ρ) = Y . F (Z; ρ) + F (Y; ρ) = Z . F (Y; ρ) + F (Z; ρ),

(104)

giving the symmetrical relation:

Y . F (Z; ρ) - F (Z; ρ) = Z . F (Y; ρ) - F (Y; ρ) .

(105)

The conditioning makes many blocks disappear. Then, by denoting with latin letters the corresponding traces, and taking into account explicitly the blocks that must count, the symmetrical identity gives, for any ρ, the following developed equation:

\begin{array}{l} (p_{00} + p_{01}) F_{Z} (\frac{ρ_{00}}{p_{00} + p_{01}}, \frac{ρ_{01}}{p_{00} + p_{01}}, 0, 0) + (p_{10} + p_{11}) F_{Z} (0, 0, \frac{ρ_{10}}{p_{10} + p_{11}}, \frac{ρ_{11}}{p_{10} + p_{11}}) \\ - F_{Z} (ρ_{00}, ρ_{001}^{1}, ρ_{01}, ρ_{001}^{1}, ρ_{010}^{1}, ρ_{10}, ρ_{111}^{1}, ρ_{11}) \\ = (p_{00} + p_{10}) F_{Y} (\frac{ρ_{00}}{p_{00} + p_{10}}, 0, \frac{ρ_{10}}{p_{00} + p_{10}}, 0) + (p_{01} + p_{11}) F_{Y} (0, \frac{ρ_{01}}{p_{01} + p_{11}}, 0, \frac{ρ_{11}}{p_{01} + p_{11}}) \\ - F_{Y} (ρ_{00}, ρ_{001}^{0}, ρ_{01}, ρ_{001}^{0}, ρ_{010}^{0}, ρ_{10}, ρ_{111}^{0}, ρ_{11}) . \end{array}

(106)

(IV) Now we appeal to the invariance hypothesis: let us apply a unitary transformation g which respects the two summands of Y but does not necessarily respect the summands of Z; if we replace Z by gZg^∗ and ρ by gρg^∗, the value of F_Y(ρ_0^0, ρ_1^0) does not change. Our claim is that the only functions F_Y which are compatible with Equation (106) for every ρ are functions of the traces of the blocks.

For the proof, we assume that all the blocks are zero except the eight blocks concerning Y. In this case, we see that the last function −F_Y of the right-hand side involves the eight blocks, but all the other functions involve only the four diagonal blocks. Thus our claim follows from the following result:

Lemma 7. A measurable function f on the set H of hermitian matrices which is invariant under conjugation by the unitary group U_n and invariant by the change of the coefficient a_{1n}, the farthest from the diagonal, is a function of the trace.

Proof. An invariant function for the adjoint representation is a function of the traces of the exterior powers Λ^k(ρ), but these traces are coefficients in the basis e_{i_1} ∧ e_{i_2} ∧ … ∧ e_{i_k}, and the elements divisible by e₁ ∧ e_n cannot be neglected as soon as k ≥ 2.

Therefore the co-cycle F_Y, F_Z comes from the image of tr^∗ in Proposition 3. Then the recurrence relation (100) implies that the same is true for the whole co-cycle F.

(V) To conclude the proof of (i), we appeal to Theorem 1, which says that the only non-zero cocycles in this context, connected and sufficiently rich, are multiples of the classical entropy. However, Lemma 6 says that the entropy is a co-boundary.

(VI) To prove (ii), we have to show that every 0-cocycle X ↦ f_X(ρ), which depends only on the spectrum of ρ, is a constant. We know that a spectral function is a measurable function φ(σ₁; σ₂; …) of the elementary symmetric functions σ₁ = Σ_i λ_i, σ₂ = Σ_{i<j} λ_i λ_j, ….

And, to be a 0-cocycle, f must verify, for every pair of decompositions, X → Y, the equation

f_{X} (ρ) = \sum_{i} ℙ_{ρ} (Y = i) f_{X} (ρ | (Y = i)) .

(107)

Explicitly, if f_X(ρ) = φ_X(σ₁, σ₂, …),

φ_{X} (σ_{1}, σ_{2}, \dots) = \sum_{i} σ_{1} (λ_{k, i}) φ_{X} (σ_{1} (λ_{k, i}), \dots)

(108)

where each block ρ|i has the spectrum {λ_{k,i}; k ∈ J_i}. For a sufficiently rich edge X = YZ, with four eigenvalues repeated as needed to fill the dimensions, we have:

\begin{array}{l} f (λ_{00}^{(n_{00})}, λ_{01}^{(n_{01})}, λ_{10}^{(n_{10})}, λ_{11}^{(n_{11})}) \\ = (n_{00} λ_{00} + n_{01} λ_{01}) f (\frac{λ_{00}^{(n_{00})}}{n_{00} λ_{00} + n_{01} λ_{01}}, \frac{λ_{01}^{(n_{01})}}{n_{00} λ_{00} + n_{01} λ_{01}}) \\ + (n_{10} λ_{10} + n_{11} λ_{11}) f (\frac{λ_{10}^{(n_{10})}}{n_{10} λ_{10} + n_{11} λ_{11}}, \frac{λ_{11}^{(n_{11})}}{n_{10} λ_{10} + n_{11} λ_{11}}), \end{array}

(109)

and

\begin{array}{l} f (λ_{00}^{(n_{00})}, λ_{01}^{(n_{01})}, λ_{10}^{(n_{10})}, λ_{11}^{(n_{11})}) \\ = (n_{00} λ_{00} + n_{10} λ_{10}) f (\frac{λ_{00}^{(n_{00})}}{n_{00} λ_{00} + n_{10} λ_{10}}, \frac{λ_{10}^{(n_{10})}}{n_{00} λ_{00} + n_{10} λ_{10}}) \\ + (n_{01} λ_{01} + n_{11} λ_{11}) f (\frac{λ_{01}^{(n_{01})}}{n_{01} λ_{01} + n_{11} λ_{11}}, \frac{λ_{11}^{(n_{11})}}{n_{01} λ_{01} + n_{11} λ_{11}}) . \end{array}

(110)

By equating the two right-hand sides, taking λ₀₁ = λ₀₀ = 0, and varying λ₁₀, λ₁₁, we find that f(x, y) is the sum of a constant and a linear function.

At the end, f_X must be the sum of a constant and a linear function for every X. However, a linear symmetric function is a multiple of σ₁. As ρ is normalized by the condition Tr(ρ) = 1, only the constant survives.

Remark 8. In his book “Structure des Systemes Dynamiques”, J-M. Souriau [48] showed that the mass of a mechanical system is a degree one class of co-homology of the relativity group with values in its adjoint representation; this class being non-trivial for classical Mechanics, with the Galileo group, and becoming trivial for Einstein relativistic Mechanics, with the Lorentz-Poincare group. Even if we are conscious of the big difference with our construction, the above result shows the same thing happens for the entropy, but going from classical statistics to quantum statistics.

From the philosophical point of view, it is important to mention that the main difference between classical and quantum information co-homology in degree less than one is the fact that the certitude, 1, becomes highly non-trivial in the quantum context. This point is discussed in particular by Gabriel Catren [49]. In geometric quantization the first ingredient, discovered by Kirillov, Kostant and Souriau in the sixties, is a circular bundle over the phase space that allows a non-trivial representation of the constants. The second ingredient, also discovered by the same authors, is the necessity to choose a polarization, which corresponds to the choice of a maximal commutative Poisson sub-algebra of observable quantities. This second ingredient appears in our framework through the limitation of information categories to collections of commutative Boolean algebras, coming from the impossibility to define manageable joints for arbitrary pairs of observables.

5. Product Structures, Kullback–Leibler Divergence, Quantum Version

In this short section, we use both the homogeneous bar-complex and the non-homogeneous complex. A natural extension of the information co-cycles is to look at the measurable functions

F (X_{0}; X_{1}; \dots; X_{m}; P_{0}; P_{1}, P_{2}, \dots, P_{n}; X),

(111)

of several probability laws P_j (or densities of states respectively) on Ω (or E respectively) belonging to the space 𝒬_X that are absolutely continuous with respect to P₀, and several decompositions X_i less fine than X. To be homogeneous co-chains these functions have to behave naturally under direct image Y_∗(P_i), and to satisfy the equivariance relation:

\begin{array}{l} F ((Y, X_{0}); (Y, X_{1}); \dots; (Y, X_{m}); P_{0}; P_{1}, P_{2}, \dots, P_{n}; X) \\ = Y . F (X_{0}; X_{1}; \dots; X_{m}; P_{0}; P_{1}, P_{2}, \dots, P_{n}; X), \end{array}

(112)

for any Y ∈ 𝒮_X (resp. S_X), where

\begin{array}{l} Y . F (X_{0}; X_{1}; \dots; X_{m}; P_{0}; P_{1}, P_{2}, \dots, P_{n}; X) \\ = \int_{E_{Y}} d Y_{*} P_{0} (y) F (X_{0}; X_{1}; \dots; X_{m}; P_{0} | Y = y; P_{1} | Y = y, \dots, P_{n} | Y = y; X) . \end{array}

(113)

Note that a special role is played by the law P₀, which justifies the comma notation.

The proof of Lemma 1 in Section 2.1 extends without modification to show that this defines a semigroup action.

Then we define the homogeneous co-boundary operator by

\begin{array}{l} δ F (X_{0}; X_{1}; \dots; X_{m}; X_{m + 1}; P_{0}; P_{1}, P_{2}, \dots, P_{n}; X) \\ = \sum_{i} {(- 1)}^{i} F (X_{0}; \dots; {\hat{X}}_{i}; \dots; X_{m}; X_{m + 1}; P_{0}; P_{1}, P_{2}, \dots, P_{n}; X) . \end{array}

(114)

The co-cycles are the elements of the kernel of δ and the co-boundaries the elements of the image of δ (with a shift of degree). The co-homology groups are the quotients of the spaces of co-cycles by the spaces of co-boundaries.

This co-homology is the topos co-homology H_𝒮^∗(ℝ, F_n) of the module functor F_n of measurable functions of (n + 1)-tuples of probabilities, in the ringed topos 𝒮 (resp. S in the quantum case).

There is also the non-homogeneous version: an m-cochain is a family of functions F_X(X₁; …; X_m; P₀; P₁, P₂, …, P_n) which behave naturally under direct images, without equivariance condition.

The co-boundary operator is modeled on the Hochschild operator: we define the non-homogeneous co-boundary operator by

\begin{array}{l} \hat{δ} F_{X} (X_{0}; X_{1}; \dots; X_{m}; P_{0}; P_{1}, P_{2}, \dots, P_{n}) \\ = (X_{0} . F_{X}) (X_{1}; \dots; X_{m}; P_{0}; P_{1}, P_{2}, \dots, P_{n}) \\ + \sum_{i} {(- 1)}^{i + 1} F (X_{0}; \dots; {\hat{X}}_{i}; \dots; X_{m}; P_{0}; P_{1}, P_{2}, \dots, P_{n}; X) . \end{array}

(115)

Let us recall the definition of the Kullback–Leibler divergence (or relative entropy) between two classical probability laws P, Q on the same space Ω, in the finite case:

H (P; Q) = - \sum_{i} p_{i} \log \frac{q_{i}}{p_{i}} .

(116)

Over an infinite set, it is required that Q is absolutely continuous with respect to P with an L¹-density dQ/dP, and the definition is

H (P; Q) = - \int_{Ω} d P (ω) \log \frac{d Q (ω)}{d P (ω)} .

(117)

When dQ(ω)/dP(ω) = 0, the logarithm is −∞ and, because of the minus sign, the contribution to H is +∞; thus, if this happens with non-zero probability for P, the divergence is +∞. To get a finite number we must also suppose that P is absolutely continuous with respect to Q, i.e., that P and Q are equivalent.
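For the finite case of Equation (116), the behaviour described above can be illustrated by a short computation. This is a minimal sketch, assuming laws given as lists of probabilities on a common finite set and the usual convention 0 log 0 = 0; the function name `kl_divergence` is illustrative.

```python
import math


def kl_divergence(p, q):
    """H(P; Q) = -sum_i p_i log(q_i / p_i), with value +inf when q_i = 0 < p_i."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 log 0 = 0
        if qi == 0.0:
            return math.inf  # P is not absolutely continuous with respect to Q
        total -= pi * math.log(qi / pi)
    return total


p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
h_pq = kl_divergence(p, q)                            # finite: P and Q are equivalent
h_pp = kl_divergence(p, p)                            # zero: the laws coincide
h_inf = kl_divergence([0.5, 0.5], [1.0, 0.0])         # infinite: q_2 = 0 but p_2 > 0
```

Here `h_pq` equals (1/4) log 2, computed with natural logarithms.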

The analogous formula defines the quantum Kullback–Leibler divergence (or quantum relative entropy), cf. Nielsen-Chuang [13], between two densities of states ρ, σ on the same Hilbert space E, in the finite dimensional case:

S (ρ; σ) = - T r (ρ (\log σ - \log ρ)) .

(118)

In the case of an infinite dimensional Hilbert space, it is required that the trace is well defined.

These quantities are positive or zero, and they are zero only in the case of equality of the measures (resp. the densities of states). This is why they are frequently used as a measure of distance between two laws.

Proposition 4. The map which associates to X in 𝒮, Y divided by X, and two laws P, Q the quantity H(Y_∗P; Y_∗Q) defines a non-homogeneous 1-cocycle, denoted H_X(Y; P; Q).

Proof. As we already know that the classical Shannon entropy is a non-homogeneous 1-cocycle, it is sufficient to prove the Hochschild relation for the new function

H_{m} (Y; P; Q) = - \sum_{i} p_{i} \log q_{i} .

(119)

Let us denote by p_ij (resp. q_ij) the probability for P (resp. Q) of the event Y = x_i, Z = y_j, and by p^j (resp. q^j) the probability for P (resp. Q) of the event Z = y_j; then the probability p_i^j (resp. q_i^j) of Y = x_i knowing that Z = y_j for P (resp. for Q) is equal to p_ij/p^j (resp. q_ij/q^j), and we have

H_{m} ((Z, Y); P, Q) = - \sum_{i} \sum_{j} p_{i j} \log q_{i j}

(120)

= - \sum_{j} p^{j} \sum_{i} p_{i}^{j} \log (q^{j} q_{i}^{j})

(121)

= - \sum_{j} p^{j} \log q^{j} (\sum_{i} p_{i}^{j}) - \sum_{j} p^{j} \sum_{i} p_{i}^{j} \log q_{i}^{j}

(122)

= - \sum_{j} p^{j} \log q^{j} - \sum_{j} p^{j} \sum_{i} p_{i}^{j} \log q_{i}^{j};

(123)

the first term on the right is H_m(Z; P ; Q) and the second is (Z.H_m)(Y ; P ; Q), Q.E.D.
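The identity (120)–(123) can be verified numerically. The sketch below assumes two arbitrary strictly positive joint laws on a 2 × 2 table, indexed as `[j][i]` for the event Z = y_j, Y = x_i; the numerical values and the helper name `mixed_entropy` are illustrative only.

```python
import math


def mixed_entropy(p, q):
    """H_m(P; Q) = -sum_i p_i log q_i for two laws on the same finite set."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


# Joint laws of (Z, Y) for P and Q; rows are values of Z, columns values of Y.
P = [[0.10, 0.20], [0.30, 0.40]]
Q = [[0.25, 0.25], [0.40, 0.10]]

p_Z = [sum(row) for row in P]  # marginals p^j
q_Z = [sum(row) for row in Q]  # marginals q^j

# Left-hand side: H_m((Z, Y); P; Q) computed on the joint laws.
lhs = mixed_entropy([x for row in P for x in row], [x for row in Q for x in row])

# Right-hand side: H_m(Z; P; Q) plus the conditional term (Z.H_m)(Y; P; Q).
conditional = sum(
    pj * mixed_entropy([x / pj for x in p_row], [x / qj for x in q_row])
    for pj, qj, p_row, q_row in zip(p_Z, q_Z, P, Q)
)
rhs = mixed_entropy(p_Z, q_Z) + conditional
```

The two sides agree up to rounding, as the Hochschild relation requires.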

This defines a homogeneous co-cycle for pairs of probability laws H_X(Y ; Z; P ; Q) = H_X(Y ; P ; Q)− H_X(Z; P ; Q), named Kullback-divergence variation.

In the quantum case, for two densities of states ρ, σ we define in the same manner a classical Kullback–Leibler divergence H_X(Y ; ρ; σ) by the formula

H_{X} (Y; ρ; σ) = \sum_{k} T r (ρ_{k}) (\log (T r (ρ_{k})) - \log (T r (σ_{k})));

(124)

where the index k parameterizes the orthogonal decomposition E_k associated to Y and where ρ_k (resp. σ_k) denotes the matrix E_k^∗ρE_k (resp. E_k^∗σE_k). It is the Kullback–Leibler divergence of the classical laws obtained as direct images of ρ and σ respectively.

But in the case of quantum information theory, we can also define a quantum divergence, for any pair of densities of states (ρ, σ) in Q_X,

S_{X} (ρ; σ) = - T r (ρ \log σ) .

(125)

Lemma 8. For any pair (X, Y) of commuting hermitian operators, such that Y divides X, the function S_X satisfies the relation

S_{(X, Y)} (ρ; σ) = H_{Y} (X; ρ; σ) + X . S_{Y} (ρ; σ);

(126)

where H_X of two variables denotes the mixed entropy, defined by Equation (119).

Proof. As in the proof of Lemma 4, we denote by α, β, … (resp. k, l, …) the indices of the orthogonal decomposition Y (resp. X), and by i, j, … the indices of a basis φ_i,k,α of the space E_k,α made of eigenvectors of the matrix

ρ_{k, α} = E_{k, α}^{*} ρ E_{k, α}

belonging to the joint operator (X, Y). In general, if M is an endomorphism of E_k,α we denote by M_i,k,α its diagonal coefficient of index (i, k, α). The probability p_k (resp. q_k) for ρ (resp. σ) of the event X = ξ_k is equal to the sum over i, α of the eigenvalues λ_i,k,α of ρ_k,α (resp. µ_i,k,α of σ_k,α). And the restricted density ρ^{Y_k} (resp. σ^{Y_k}), conditioned by X = ξ_k, is the sum over α of ρ_k,α (resp. of σ_k,α) divided by p_k (resp. q_k). We have

X . S_{Y} (ρ; σ) = - \sum_{k} p_{k} T r (ρ^{Y_{k}} \log σ^{Y_{k}})

(127)

= - \sum_{k} p_{k} \sum_{i, α} \frac{λ_{i, k, α}}{p_{k}} {(\log \frac{σ_{k}}{q_{k}})}_{i, k, α}

(128)

= \sum_{i, k, α} λ_{i, k, α} \log q_{k} - \sum_{i, k, α} λ_{i, k, α} {(\log σ_{k})}_{i, k, α}

(129)

= \sum_{k} p_{k} \log q_{k} - \sum_{k, α} T r (ρ_{k, α} \log (σ_{k, α}))

(130)

= - H_{Y} (X; ρ; σ) + S_{(X, Y)} (ρ; σ) .

(131)

As a corollary, with the argument proving Lemma 5 from Lemma 4, we obtain that the classical Kullback divergence is minus the co-boundary of the 0-cochain defined by the quantum divergence.

This shows that the generating function of all the co-cycles we have considered so far is the quantum 0-cochain for pairs S(ρ; σ) = −Tr(ρ log σ).

6. Structure of Observation of a Finite System

Up to now, the structures considered and the roles played by entropy can be regarded as forming a kind of statics in information theory. The aim of this section is to indicate the corresponding elements of dynamics. This more dynamical study could be better adapted to the known role of entropy in the theory of dynamical systems, as defined by Kolmogorov and Sinai.

6.1. Problems of Discrimination

The problem of optimal discrimination consists in separating the various states of a system by using, in the most economical manner, a family of observable quantities. One may also only want to detect a state satisfying a certain chosen property. A possible measure of the cost of discrimination is the number of steps before the process ends.

First, let us define more precisely what we mean by a system, a state, an observable quantity and a strategy for using observations. As before, for simplicity, the setting is finite sets.

The symbol [n] denotes the set {1, …, n}. We have n finite sets M_i of respective cardinalities m_i, and we consider the set M of sequences x₁, …, x_n where x_i belongs to M_i; by definition a system is a subset X of M and a state of the system is an element of X. The set of (classical) observable quantities is a (finite) subset A of the functions from X to ℝ.

A use of observables, named an observation strategy, is an oriented tree Γ, starting at its root, that is the smallest vertex, and such that each vertex is labelled by an element of A, and each arrow (naturally oriented edge) is labelled by a possible value of the observable at the initial vertex of the arrow.

For instance, if F₀ marks the root s₀, it means that we aim to measure F₀(x) for the states; then the branches issued from s₀ are indexed by the values v of F₀, and to each branch F₀ = v corresponds a subset X_v of states, giving a partition of X. If F_1,v is the observable at the final vertex α_v of the branch F₀ = v, the next step in the program is to evaluate F_1,v(x) for x ∈ X_v; then the branches issued from α_v correspond to the values w of F_1,v restricted to X_v, and so on.

For each vertex s in Γ we denote by ν(s) the number of edges that are necessary for joining s to the root s₀. The function ν with values in ℕ is called the level in the tree.

It can happen that a set X_υ consists of one element only; in this case we decide to extend the tree to the next levels by a branch without bifurcation, for instance by labelling with the same observable and the same value, but it could be any labelling, and its value on X_v. In such a way, each level k gives a well defined partition π_k of X.

The level k also defines a sub-tree Γ_k of Γ, such that its final branches bear π_k. This gives a sequence π₀, π₁, …, π_l of finer and finer partitions of X, i.e., an increasing sequence of partitions (if the ordering on partitions is the opposite of the sense of arrows in the information category Π(X)). The tree is said to be fully discriminant if the last partition π_l, which is the finest, is made of singletons.

The minimal number of steps that are necessary for separating the elements of X, or more modestly for detecting a certain part of states, can be seen as a measure of complexity of the system with respect to the observations A. A refined measure could take into account the cost of use of a given observable, for instance the difficulty of computing its values.

Standard examples are furnished by weighting problems: in this case the states are mass distributions among n objects, and the allowed observables are weightings, which are functions of the form

F_{I, J} (x) = \sum_{i \in I} x_{i} - \sum_{j \in J} x_{j}

(132)

where I and J are disjoint subsets of [n].
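To make the observables of Equation (132) concrete, here is a minimal sketch under a stated assumption: the states are those of a system in which exactly one object among n deviates, encoded as a pair (i, s) meaning x_i = s ∈ {+1, −1} and x_j = 0 for j ≠ i. This encoding and the helper names `states`, `weighing` and `partition` are illustrative choices, not notation from the text.

```python
def states(n):
    """All states (i, s): object i is heavier (s = +1) or lighter (s = -1)."""
    return [(i, s) for i in range(1, n + 1) for s in (+1, -1)]


def weighing(I, J):
    """Observable F_{I,J}(x) = sum_{i in I} x_i - sum_{j in J} x_j on these states."""
    I, J = frozenset(I), frozenset(J)
    assert not (I & J), "I and J must be disjoint"

    def F(state):
        i, s = state
        if i in I:
            return s
        if i in J:
            return -s
        return 0

    return F


def partition(X, F):
    """Group the states of X by the value taken by the observable F."""
    pieces = {}
    for x in X:
        pieces.setdefault(F(x), []).append(x)
    return pieces


X = states(4)
pi = partition(X, weighing({1}, {2}))
```

With object 1 on one side and object 2 on the other, the eight states split into three pieces, indexed by the values +1, −1 and 0 of the observable.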

We underline that such a function, which requires the choice of two disjoint subsets in [n], makes use of the definition of M as a set of sequences, not as an abstract finite set.

The kinds of problems that can be posed in this framework were studied for instance in “Problemes plaisants et delectables qui se font par les nombres” by Bachet de Meziriac (1612, 1624) [50].

The starting point of our research in this direction was a particular classical problem pointed out to us by Guillaume Marrelec: given n objects ξ₁, …, ξ_n, if we know that m have the same mass and n − m have another common mass, how many measurements must be performed to separate the two groups and decide which is the heavier?

Even for m = 1 the solution is interesting, and follows a principle of choice by maximum entropy. In the present text we only want to describe the general structures in relation to this kind of problem without developing a specific study; in particular we want to show that the co-homological nature of the entropy extends to a more dynamical context of discrimination in time.
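The maximum entropy principle mentioned above can be illustrated for the first measurement only. The sketch below rests on assumptions that are not in the text: one deviating object among n, heavier or lighter, a uniform law on the 2n states, and a balance comparing k objects against k other objects. It is a greedy illustration of the first step, not the specific study the text refers to.

```python
import math


def outcome_entropy(n, k):
    """Entropy of the outcome (left heavier, right heavier, balanced) of k versus k."""
    # Under the uniform law on 2n states, each tilt has probability k/n.
    probs = [k / n, k / n, (n - 2 * k) / n]
    return -sum(p * math.log(p) for p in probs if p > 0)


def best_first_weighing(n):
    """Number k of objects per side that maximises the entropy of the outcome."""
    return max(range(1, n // 2 + 1), key=lambda k: outcome_entropy(n, k))


k_best = best_first_weighing(12)
h_best = outcome_entropy(12, k_best)
```

For twelve objects the maximum is reached with four objects on each side, where the three outcomes are equiprobable and the entropy is log 3.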

Remark 9. The discrimination problem is connected with the coding problem. In fact a finite system X (as we defined it just before) is nothing else than a particular set of words of length n, where the letter appearing at place i belongs to an alphabet M_i. Distinguishing between different words with a set A of variables f is nothing else than rewriting the words x of X with symbols v_f (labelling the image f(X)). Determining the most economical manner to do that consists in finding the smallest maximal length l of words in the alphabet (f, v_f); f ∈ A, v_f ∈ f(X) translating all the words x in X. This translation, when it is possible, can be read on the branches of a fully discriminating rooted tree, associated to an optimal strategy, of minimal level l. The word that translates x is the sequence (F₀, v₀), (F₁, v₁), …, (F_k, v_k), k ≤ l, of the variables put on the vertices along the branch going from 0 to x, and the values of these variables put along the edges of this branch.

6.2. Observation Trees. Galois Groups and Probability Knowledge

More generally, we consider as in the first part (resp. in the second part) a finite set Ω, equipped with a Boolean algebra ℬ (resp. a finite dimensional complex vector space E equipped with a positive definite hermitian form h₀ and a family of direct decompositions in linear spaces UB). In each situation we have a natural notion of observable quantity: in the case of Ω it is a partition Y compatible with ℬ (i.e., less fine than ℬ) with a numbering of the parts by the integers 1, …, k if Y has k elements; in the case of E it is a decomposition Y compatible with UB (i.e., each summand is a direct sum of elements of one of the decompositions uB, for u ∈ U(h₀)), with a numbering of the summands by the integers 1, …, k if Y has k elements. We also have a notion of probability: in the case of (Ω, Y) it is a classical probability law P_Y on the quotient set Ω/Y; in the case of (E, Y) it is a collection of non-negative hermitian forms h_Y,i on each summand of Y.

We will consider information structures, denoted by the symbol S, for both cases (which could be distinguished by the typography, 𝒮 or S, if necessary): they are categories made of objects that are observables and arrows that are divisions, satisfying the condition that if X ∈ S divides Y and Z in S, then the joint (Y, Z) belongs to S.

We will also consider probability families adapted to these information structures; they form a covariant functor X ↦ Q_X (which can be typographically distinguished in the two cases by 𝒬_X and Q_X) of direct images. When 𝒮 is a classical subcategory of the quantum structure S, we suppose that we have a trace transformation from ι^∗Q to 𝒬, and if S and Q are unitary invariant, we recall that, thanks to the ordering, we have an equivalence of categories between S^U and 𝒮, and a compatible morphism from the functional module ℱ_{Q} to the functional module ℱ_{Q}.

Except for the new ingredient of orderings, these are familiar objects for our reader. The letter X will denote both cases Ω and E; then the letters S, B, Q will denote respectively 𝒮, ℬ, 𝒬 or S, UB, Q. Be careful that now all observable quantities are ordered, whether partitions or direct decompositions. We will always assume the compatibility condition between Q and S, meaning that every conditioning of P ∈ Q by an event associated to an element of S belongs to Q.

In addition we choose a subset A of observables in S, which play the role of allowed elementary observations.

We say that a bijection σ from Ω to itself, measurable for ℬ, respects a set of observables 𝒜 if for any Y ∈ 𝒜, there exists Z ∈ 𝒜 such that Y ○ σ = Z. It means that σ establishes an ordered bijection between the pieces Y(i) and the pieces Z(i), i.e., x ∈ Z(i) if and only if σ(x) ∈ Y(i). In other words the permutation σ respects 𝒜 when the map σ^* which associates the partition Y ○ σ to any partition Y, sends 𝒜 into 𝒜.

In the same way, we say that σ respects a family of probabilities 𝒬 if the associated map σ_* sends an element of 𝒬 to an element of 𝒬.

In the quantum case, with E, h₀ and UB, we do the same by asking in addition that σ is a linear unitary automorphism of E.

Definition 9. If X, S, Q, B and A are given, the Galois group G₀ is the set of permutations of X (resp. linear maps) that respect S, Q, B and A.

Example 6. Consider the system X associated to the simple classical weighting problem: states are parameterized by points with coordinates 0, 1 or −1 in the sphere Sⁿ⁻¹ of radius 1 in ℝⁿ, according to their weights, either normal, heavier or lighter. Thus in this case Ω = X possesses 2n points. The set A of elementary observables is given by the weighting operations F_I,J, Equation (132). For 𝒮 we take the set 𝒮(A) of all ordered partitions π_k obtained by applications of discrimination trees labelled by A. And we consider only the uniform probability P₀ on X; in 𝒬 this gives the images of this law by the elements of 𝒮, and the conditioning by all the events associated to 𝒮.

Then the Galois group G₀ is the subgroup S_n × C₂ of S_{2n} made by the product of the permutation group of n symbols by the group changing the signs of all the x_i for i in [n].

Proof: the elements of S_n respect A, and the uniform law. Moreover if σ changes the sign of all the x_i, one can compensate the effect of σ on F_I,J by taking G_I,J = F_J,I, i.e., by exchanging the two sides of the balance.

To finish we have to show that permutations of X outside S_n × C₂ do not respect A. First, consider a permutation σ that does not respect the indices i. In this case there exists an index i ∈ [n] such that σ(i⁺) and σ(i⁻) are states associated to different coins, for instance σ(i⁺) = j⁺ and σ(i⁻) = k⁺, with j ≠ k, or σ(i⁺) = j⁺ and σ(i⁻) = k⁻, with j ≠ k. Two cases are possible: these states have the same mass, or they have opposite mass. In both cases let us consider a weighting F_j,h(x) = x_j − x_h, where h ≠ k; by applying σ^*F_j,h to i⁺, i.e., by evaluating F_j,h at σ(i⁺), we find +1 (or −1), and by applying σ^*F_j,h to i⁻ we find 0. However, this cannot happen for a weighting, because for a weighting the change of i⁺ into i⁻ either has no effect or exchanges the results +1 and −1. Finally, consider a permutation σ that respects the indices but exchanges the signs of a subset I = {i₁, …, i_k}, with 0 < k < n. In this case let us consider a weighting F_i,j(x) = x_i − x_j with i ∈ I and j ∈ [n]\I; the function F_i,j ○ σ takes the value +1 for the states i⁻, j⁻, the value −1 for i⁺, j⁺ and the value 0 for the other states, which cannot happen for any weighting, because this weighting must involve both i and j, but it cannot be F_j,i(x) = x_j − x_i, which takes the value −1 for j⁻, and it cannot be F_i,j, which takes the value +1 for i⁺.
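The statement of Example 6 can be explored by brute force for n = 3. The sketch below checks only the condition of respecting A (the uniform law is invariant under every permutation), and it assumes that a weighting puts the same number of objects on each side of the balance, |I| = |J|; this balance restriction is an assumption of the sketch and is not stated in Equation (132). All names are illustrative.

```python
from itertools import permutations, product

n = 3
# The 2n states: object i is heavier (s = +1) or lighter (s = -1).
X = [(i, s) for i in range(n) for s in (+1, -1)]


def weighing_values(signs):
    """Values on X of F_{I,J}, where signs[i] is +1 on I, -1 on J and 0 elsewhere."""
    return tuple(s * signs[i] for (i, s) in X)


# Assumption: balanced weightings only, i.e. as many +1 as -1 in the sign vector.
A = {weighing_values(sg) for sg in product((-1, 0, 1), repeat=n) if sum(sg) == 0}


def respects(sigma):
    """True if Y o sigma belongs to A for every observable Y in A."""
    return all(tuple(Y[sigma[k]] for k in range(len(X))) in A for Y in A)


# Enumerate all permutations of the 2n states and keep those respecting A.
G0 = [sigma for sigma in permutations(range(len(X))) if respects(sigma)]
order = len(G0)
```

Under these assumptions the enumeration finds a group of order 12 = 3! · 2, in agreement with S_n × C₂ for n = 3.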

The probability laws we are considering express the beliefs in initial knowledge on the system; in this case it is legitimate to consider that they constrain the initial Galois group G₀. This corresponds to the Jaynes principle [51,52].

We define in this framework the notion of observation tree adapted to a given subset A of S: it is a finite oriented rooted tree Γ where each vertex s is labelled by an observable F_s belonging to A and each arrow α beginning at s is labelled by an element F_s(i) of F_s. A priori we introduce as many branches as there exist elements in F_s. Arranging the arrows in the trigonometric (counter-clockwise) circular order embeds the tree Γ in the Euclidean plane, up to homotopy.

A branch γ in the tree Γ is a sequence α₁, …, α_k of oriented edges, such that, for each i the initial extremity of α_i₊₁ is the terminal extremity of α_i. Then α_i₊₁ starts with the label F_i and ends with the label F_i₊₁. We will say that γ starts with the root if the initial extremity of α₁ is the root s₀, with a label F₀.

For any edge α in Γ, there exists a unique branch γ(α) starting from the root and ending at α. Along this branch, the vertices are decorated with the variables F_i; i = 0, …, k and the edges are decorated with values v_i of these functions; we write

S (α) = (F_{0}, v_{0}; F_{1}, v_{1}; \dots; F_{k - 1}, v_{k - 1}; F_{k})

(133)

By definition, the set X(α) of states which are compatible with α is the subset of elements of X such that F₀(x) = v₀, …, F_k−₁(x) = v_k−₁.

At any level k the sets X(α) form a partition π_k of X.
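The set X(α) and the partition at a given level can be computed directly from their definitions. This is a minimal sketch on a toy system, assuming for simplicity a non-adaptive tree in which every vertex of a given level carries the same observable; the system X, the observables F0, F1 and the function names are hypothetical.

```python
from collections import defaultdict


def compatible_states(X, branch):
    """X(alpha): the states x with F_i(x) = v_i for every pair (F_i, v_i) on the branch."""
    return [x for x in X if all(F(x) == v for F, v in branch)]


def level_partition(X, observables):
    """Partition of X indexed by the joint values of the observables used so far."""
    pieces = defaultdict(list)
    for x in X:
        pieces[tuple(F(x) for F in observables)].append(x)
    return dict(pieces)


# Toy system: pairs (a, b) with a in {0, 1} and b in {0, 1, 2}.
X = [(a, b) for a in range(2) for b in range(3)]
F0 = lambda x: x[0]        # observable at the root
F1 = lambda x: x[1] % 2    # observable at every vertex of level 1

X_alpha = compatible_states(X, [(F0, 1), (F1, 0)])
pi_2 = level_partition(X, [F0, F1])
```

Each piece of `pi_2` is the set X(α) of one edge at level 2, and the pieces are pairwise disjoint and cover X.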

Definition 10. We say that an observation tree Γ labelled by A is allowed by S, if every joint observable along each branch belongs to S.

We say simply allowed if there is no risk of confusion.

In what follows this restriction is imposed on all the trees considered. Of course if we start with the algebra of all ordered partitions this gives no restriction, but this would exclude the quantum case, where the best we can do is to take maximal commutative families.

Definition 11. Let α be an edge of Γ; we denote by 𝒬(α) the set of probability laws on X(α) which are obtained by conditioning by the values v₀, v₁, …, v_k−₁ of the observables F₀, F₁, …, F_k−₁ along the branch γ(α) starting in the root and ending with α.

Definition 12. The Galois group G(α) is the set of permutations of elements of X(α) that belong to G₀, preserve all the equations F_i(x) = v_i (resp. all the summands of the orthogonal decomposition F_i labelling the edges) and preserve the sets of probability Q(α) (resp. quantum probabilities).

We consider G(α) as embedded in G₀ by fixing point by point all the elements of X outside X(α).

Remark 10. Let P be a probability law (either classical or quantum) on X, Φ = (F_i; i ∈ I) a collection of observables, and φ = (v_i; i ∈ I) a vector of possible values of Φ; the law P |(Φ = φ) obtained by conditioning P by the equations Φ(x) = φ, is defined only if the set X_φ of all solutions of the system of equations Φ(x) = φ has a non-zero probability p_φ = P (X_φ). It can be viewed either as a law on X_φ, or as a law on the whole X by taking the image by the inclusion of X_φ in X.

Definition 13. The edge α is said to be Galoisian if the set of equations and probabilities that are invariant by G(α) coincide respectively with X(α) and 𝒬(α).

A tree Γ is said to be Galoisian when all its edges are Galoisian.

At each level k we define the group G_k which is the product of the groups G(α) for the free edges at level k; it is a subgroup of G₀ preserving elements by elements the pieces of the partition π_k.

Along the path γ the partition (or decomposition) π_l, l ≤ k of X is increasing (finer and finer) and the sequence of groups G_l, l ≤ k is decreasing.

Along a branch the sets X(α) are decreasing and the sequence of groups G₀, G(α₁), …, G(α_k) is decreasing. We propose that the quotient G(α_i₊₁)/G(α_i) gives a measure of the Galoisian information gained by applying F_i and obtaining the value v_i.

On each set X(α) the images of the elements of the probability family 𝒬 form sets 𝒬(α) of probabilities on X(α).

Thus we also require the group G(α) to preserve the set 𝒬(α).

Remark 11. In terms of coding, introducing probabilities on the X(α) makes it possible to formulate the principle that it is more efficient to choose, after the edge α, the observation having the largest conditional entropy in Q(α). In what circumstances this gives the optimal discrimination tree is a difficult problem, even if folklore accepts it as a theorem. It is the problem of optimal coding.

By virtue of Shannon’s theorem, the minimal length is bounded below by the entropy of the law on X if this law is unique. We found that the principle works in a simple example of weighting (cf. paper 3 [22]).
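As an illustration of this lower bound, the following sketch computes it in the simplest situation, under assumptions that are not in the text: a uniform law on the states, and observables that each take at most three values (as for a balance). The function name and the default number of outcomes are illustrative.

```python
import math


def min_length_lower_bound(num_states, outcomes=3):
    """Smallest integer l with outcomes**l >= num_states, via the entropy of the uniform law."""
    entropy = math.log(num_states)           # entropy of the uniform law, in nats
    per_step = math.log(outcomes)            # maximal entropy of one observation
    return math.ceil(entropy / per_step - 1e-12)  # small tolerance for exact powers


bound_12_objects = min_length_lower_bound(24)   # 12 objects, heavier or lighter
bound_27_states = min_length_lower_bound(27)
bound_28_states = min_length_lower_bound(28)
```

For 24 equiprobable states and ternary outcomes the bound is 3, since 3² = 9 < 24 ≤ 27 = 3³. The bound says nothing about whether a tree of that length exists for a given set A of allowed observables.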

Note however important differences between our approach and the traditional one for coding: for us A is given and 𝒬 is given; they correspond respectively to an a priori limitation of possible codes for use (like a natural language), and to a set of possible a priori knowledge, for instance taking into account the Galois ambiguity in the system (Jaynes principle). All that is Bayesian in spirit.

Definition 14. We say that an observation tree Γ labelled by A is allowed by S and by X ∈ S, if it is allowed by S_X, which means that every joint observable along each branch is divided by X.

Definition 15. S(A) is the set of (ordered) observables π_k which can be obtained by allowed observation trees. For X ∈ S we note S_X(A) the set of (ordered) observables π_k which can be obtained by observation trees that are allowed by S and X.

Lemma 9. The joint product defines a structure of monoid on the set S_X(A).

Proof. Let Γ, Γ′ be two observation trees allowed by A, S and X ∈ S, of respective lengths k, k′, giving final decompositions S, S′. To establish the lemma we must show that the joint SS′ is obtained by a tree associated with A, allowed by S and X.

For that we just graft one copy of Γ′ on each free edge of Γ. This new tree ΓΓ′ is associated with A, and its final partition is clearly finer than S. It is also finer than S′, because at the end of any branch of ΓΓ′ we have an X(β) which is contained in the corresponding element of the final partition π_k′(Γ′). To finish the proof we have to show that each element of π_k₊_k′(ΓΓ′) is the intersection of one element of π_k(Γ) with one element of π_k′(Γ′), because we know these observables are in S_X, which is a monoid, by the definition of information structure. But a complete branch γ.γ′ in ΓΓ′, going from the root to a terminal edge at level k + k′, corresponds to a word (F₀, v₀, F₁, v₁, …, F_k−₁, v_k−₁, F′₀, v′₀, …, F′_k′−₁, v′_k′−₁), thus the final set of the branch γ.γ′ is defined by the equations F_i = v_i; i = 0, …, k−1 and F′_j = v′_j; j = 0, …, k′−1, and is the intersection of the sets respectively defined by the first and second groups of equations, that belong respectively to π_k(Γ) and π_k′(Γ′).

Then S(A) forms an information structure. In particular there is a unique maximal partition, initial element for each subcategory S_X(A) in the information structure S(A).

But on S(A) the operation of grafting, that we will describe now, is much richer than what we used in the above Lemma 9: we can graft an allowed tree on each free edge of an allowed tree, and this leads to a theory of operads and monads for information theory.

6.3. Co-Homology of Observation Strategies

Remember that the elements of the partitions or decompositions Y we are considering, are now numbered by the ordered set {1, …, L(Y)}, where L(Y) is the number of elements in the partition, or the decomposition, also called its length. In particular we consider as different two partitions which are labelled differently by the integers. This was already taken into account in the definition of the Galois groups.

We define the multi-products µ(m; n₁, …, n_m) on the set of ordered partitions:

They are defined between a partition equipped with an ordering (π, ω) with m pieces and m ordered partitions (π₁, ω₁), …, (π_m, ω_m) of respective lengths n₁, …, n_m; the result is the ordered partition obtained by cutting each piece X_i of π by the corresponding decomposition π_i and renumbering the non-empty pieces by integers in the unique way compatible with the orderings ω, ω₁, …, ω_m. Observe the important fact that the result has in general fewer than n = n₁ + … + n_m pieces. This is a strong departure from usual multi-products (cf. P. May [17,53], Loday-Vallette [10]). We do not have an operad: when introducing vector spaces V(m) generated by decompositions of length m, we get filtered but not graded structures. However, a form of associativity and a neutral element are preserved, hence we propose to name this structure a filtered operad.
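The multi-product just described can be sketched in a few lines. The example assumes that an ordered partition is represented as a Python list of disjoint sets, the list order playing the role of the ordering ω; the function name `multi_product` and the sample partitions are illustrative.

```python
def multi_product(pi, parts):
    """Cut the i-th piece of pi by the i-th ordered partition and drop empty pieces.

    pi    : list of m disjoint sets covering X (an ordered partition)
    parts : list of m ordered partitions of X, each a list of disjoint sets
    """
    assert len(pi) == len(parts)
    result = []
    for piece, pi_i in zip(pi, parts):
        for block in pi_i:
            cut = piece & block
            if cut:  # empty intersections are discarded, then pieces are renumbered
                result.append(cut)
    return result


pi = [{0, 1, 2}, {3, 4, 5}]
pi_1 = [{0, 1}, {2, 3, 4, 5}]       # length n_1 = 2
pi_2 = [{0, 1, 2}, {3, 4}, {5}]     # length n_2 = 3
mu = multi_product(pi, [pi_1, pi_2])
```

Here n₁ + n₂ = 5, but the first block of `pi_2` does not meet the second piece of `pi`, so the product has only 4 pieces: the length is not additive.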

There exists an evident unit to the right which is the unique decomposition of length 1.

The action of the symmetric group S_m on the products is evident, and does not respect the length of the result. We will designate by µ_m the collection of products for the same length m.

The numbers m_i between 1 and n_i that count the pieces of the decomposition of the element X_i of π are functions m_i(π, ω, π_i, ω_i). There exists an increasing injection η_i : [m_i] → [n_i], which depends only on (π, ω, π_i, ω_i), telling which indices of (π_i, ω_i) survive in the product. These injections are integral parts of the structure of filtered operad. In particular, if we apply a permutation σ_i to [n_i], i.e., if we replace ω_i by ω_i ○ σ_i, the number can change.

The axioms of operadic unity and associativity, conveniently modified, are easy to verify (cf. [22]). The reference we follow here is Fresse “Basic concepts of operads” [16]. For unity nothing has to be modified. For associativity (Figure 1.3 in Fresse [16]), we modify by saying that if the (π_i, ω_i) of lengths n_i, for i between 1 and k, are composed from µ(n_i; n_i^1, …, n_i^{n_i}) with the n_i-tuples (…, (π_i^j, ω_i^j), …) whose respective lengths are n_i^j, and if the result µ_i for each i has length (m_i^1 + … + m_i^{n_i}) where m_i^j is a function of (π_i, ω_i) and (π_i^j, ω_i^j), then the product of (π, ω) of length k with the µ_i is the same as the one we would have obtained by composing µ(k; n₁, …, n_k)((π, ω); (π₁, ω₁), …) with the m = m₁ + … + m_k ordered decompositions (π_i^j, ω_i^j) for j belonging to the image of η_i : [m_i] → [n_i]. This result is more complicated to write than to prove, because it only expresses the associativity of the ordinary join of three partitions, from which the ordering follows.

Moreover, the first axiom concerning permutations (Figure 1.1 in Fresse [16]), can be modified, by considering only permutations of n_i letters which preserve the images of the maps η_i.

The second axiom, which concerns a permutation σ of k elements in π, and the inverse permutation of the partitions π_i, can be reformulated by saying that the effect of σ on the multiple product µ is the same as the effect of σ on the indices of the (π_i, ω_i). In other terms, the effect of σ on ω is compensated by the action of σ⁻¹ on the indices of the (π_i, ω_i). One has to be careful, because the result of µ applied to (π, ω ○ σ) does not in general have the same length as µ applied to (π, ω). However the compensation implies that µ_k is well defined on the quotient of the set of sequences ((π, ω), (π₁, ω₁), …) by the diagonal action of S_k, which permutes the k pieces of π and which permutes the indices i of the n_i in the other factors.

Geometrically, if the partition (π, ω) in S(A) is generated by an observation tree Γ with m ending edges and the partitions (π_i, ω_i); i = 1, …, m are generated by a collection of observation trees Γ_i; then the result of the application of µ(m; n₁, …, n_m) to (π, ω) and (π_i, ω_i); i = 1, …, m is generated by the observation tree that is obtained by grafting each Γ_i on the vertex number i. Drawing the planar trees associated to three successive sets of decompositions for two successive grafting operations helps to understand the associativity property.

The fact that in general this does not give a tree with n₁ + … + n_m free edges, where n_i denotes the number of free edges of Γ_i comes from the possibility to find an empty set X(β) at some moment along a branch of the grafted tree; this we call a dead branch. It expresses the fact that the empty set is excluded from the elements of a partition in the classical context, and the zero space excluded from the orthogonal decomposition in the quantum context. When computing conditioned probabilities we encounter the same problem if a set X(β) at some place in a branch has measure zero.

The dead branches and the lack of grading cause a lot of difficulties for studying algebraically the operations µ_m, thus we introduce more flexible objects, which are the ordered partitions with empty parts of Ω, resp. ordered orthogonal decompositions with zero summands of E: such a partition π^* (resp. decomposition) is a family (E₁, …, E_m) of disjoint subsets of Ω (resp. orthogonal subspaces of E), such that their union (resp. sum) is Ω (resp. E). The only difference with respect to ordered partitions, resp. decompositions, is that we accept to repeat ø (resp. 0) an arbitrarily high number of times. For brevity, we call these new objects generalized decompositions. The number m is named the degree of π^*. These objects are the natural results of applying rooted observation trees embedded in an oriented half plane.

The notions of adaptation to A, S and X in S concerning the trees, apply to the generated generalized decompositions. The corresponding sets of generalized objects are written S^*(A) and S_X^*(A).

The multi-product µ(m; n₁, …, n_m) extends naturally to generalized decompositions, and in this case the degrees are respected, i.e., the result of this operation is a generalized decomposition of degree n₁ + n₂ + … + n_m.
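The additivity of degrees for generalized decompositions can be seen on a small example. This sketch again represents an ordered family as a Python list of sets, now allowing empty sets; the names `multi_product_generalized` and `forget` are illustrative.

```python
def multi_product_generalized(pi, parts):
    """Cut each piece of pi by the corresponding family, keeping empty intersections."""
    assert len(pi) == len(parts)
    return [piece & block for piece, pi_i in zip(pi, parts) for block in pi_i]


def forget(decomposition):
    """Drop the empty sets, keeping the order of the remaining pieces."""
    return [piece for piece in decomposition if piece]


pi = [{0, 1, 2}, {3, 4, 5}]
pi_1 = [{0, 1}, {2, 3, 4, 5}]       # degree 2
pi_2 = [{0, 1, 2}, {3, 4}, {5}]     # degree 3
mu_star = multi_product_generalized(pi, [pi_1, pi_2])
mu = forget(mu_star)
```

The generalized product has degree 2 + 3 = 5, with one empty piece recording the dead branch, and dropping the empty piece gives the ordinary ordered partition with 4 pieces.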

Remark that we could write µ^*(m; n₁, …, n_m) for the multi-products extended to generalized decompositions, however we prefer to keep the same notation µ(m; n₁, …, n_m); this is justified by the following observation: to a generalized decomposition π^* is associated a unique ordered decomposition (π, ω), by forgetting the empty sets (resp. zero spaces) in the family, and the multi-product is compatible with this forgetting application. The gain of the extension is the easy construction of a monad we expose now.

The definition of operad was introduced by P. May [17] as the right tool for studying the homology of infinite loop spaces; then it was recognized as a fundamental tool for algebraic topology and many other topics, see Loday and Vallette [10], Fresse [16].

We will encounter only “symmetric” operads.

The multiple products μ_m on generalized decompositions can be assembled in a structure of monad by using the standard Schur construction (cf. Loday and Vallette [10], or Fresse, “on partitions” [16]): for each X ∈ S, we introduce the real vector space V_X = V_X(A) freely generated by the set S_X^*(A) of generalized decompositions obtained by observation trees that are allowed by A, S and X; the length m defines a grading V_X(m) of V_X. We put V_X(0) = 0.

The maps µ_m generate m-linear maps from products of these spaces to themselves which respect the grading; these maps, also denoted by µ_m, are parameterized by the sets

S_{X}^{*} (m)

, whose elements are the generalized decompositions of degree m which are divided by X:

μ_{m} : V_{X} (m) \otimes_{S_{m}} V_{X}^{\otimes m} \to V_{X}

(134)

The linear Schur functor from the category of real vector spaces to itself, is defined by the direct sum of symmetric co-invariants:

V_{X} (W) = \underset{m \geq 0}{\oplus} V_{X} (m) \otimes_{S_{m}} W^{\otimes m}

(135)

The composition of Schur functors is defined by

V_{X} \circ V_{X} = \underset{m \geq 0}{\oplus} V_{X} (m) \otimes_{S_{m}} V_{X}^{\otimes m} .

(136)

i.e., for each real vector space W:

V_{X} \circ V_{X} (W) = \underset{m \geq 0}{\oplus} \underset{l \geq m}{\oplus} \underset{n_{1}, \dots, n_{m}; \sum_{i} n_{i} = l}{\oplus} V_{X} (m) \otimes_{S_{m}} \underset{i}{\otimes} V_{X} (n_{i}) \otimes_{S_{n_{i}}} W^{\otimes n_{i}}

(137)

= \underset{l \geq 0}{\oplus} \underset{m \geq 0}{\oplus} \underset{n_{1}, \dots, n_{m}; \sum_{i} n_{i} = l}{\oplus} V_{X} (m) \otimes_{S_{m}} \underset{i}{\otimes} V_{X} (n_{i}) \otimes_{S_{n_{1}, \dots, n_{m}}} W^{\otimes l};

(138)

where

S_{n_{1}, \dots, n_{m}}

denotes the group of permutations by blocks.

Proposition 5. For each X in S, the collection of operations µ_m defines a linear natural transformation of functors µ_X : V_X ◦ V_X → V_X; and the trivial partition defines a linear natural transformation of functors η_X : R → V_X, which satisfy the axioms of a monad (cf. Mac Lane, “Categories for the Working Mathematician”, 2nd ed. [4], and Alain Proute, Introduction a la Logique Categorique, 2013, Prepublications [54]):

μ_{X} \circ (V_{X} μ_{X}) = μ_{X} \circ (μ_{X} V_{X}), \quad μ_{X} \circ (V_{X} η_{X}) = \mathrm{Id} = μ_{X} \circ (η_{X} V_{X})

(139)

Proof. The argument is the same as the argument given in Fresse (partitions …). The fact that the natural transformation µ_X is well defined on the quotient by the diagonal action of the symmetric group

S_{m}

on

V_{X} (m) \otimes \otimes_{i} V_{X} (n_{i}) \otimes_{S_{n_{1}, \dots, n_{m}}} W^{\otimes s}

follows from the symmetry axiom, and the properties of associativity and neutral element follow from the corresponding axioms.

Moreover all these operations are natural for the functor of inclusion from the category S_Y to the category S_X of observables divided by Y and X respectively when X divides Y; therefore we have the following result:

Proposition 6. To each arrow X → Y in the category S is associated a natural transformation of functors

ρ_{X, Y} : V_{Y} \to V_{X}

, making a morphism of monads; this defines a contravariant functor

V

from the category S to the category of monads, that we name the arborescent structural sheaf of S and A.

Considering the discrete topology on S, we introduce the topos of sheaves of modules over the functor in monads

V

, which we call the arborescent information topos associated to S and A.

As explained in Proute loc.cit. [54] a monad in a category

C

becomes a monoid in the category of endo-functors of

C

, thus the topos we introduce is equivalent to an ordinary ringed topos.

The monad

V_{X}

, and the contravariant monadic functor

V

on S, are better understood by considering trees, cf. Getzler-Jones [55], Ginzburg-Kapranov [56] and Fresse [16]; in our context we consider all observation trees labelled by elements of

S_{X}^{*} (A)

: if Γ is an oriented rooted tree of level k, each vertex v of Γ gives birth to m_v edges; we define

V_{X} (Γ) (W) = \underset{v \in Γ}{\otimes} V_{X} (m_{v}) \otimes_{S_{m_{v}}} W^{\otimes m_{v}} .

(140)

The space V (Γ)(W) is the direct sum of spaces V_X(Γ_Y)(W) associated to trees which are decorated by a subset Y in

S_{X}^{*} (A)

, with one element Y_v of S_X(m) for each vertex v which gives birth to m_v edges.

Then the iterated functors

V^{\circ k} = V \circ \dots \circ V

for k ≥ 1 are the direct sums of the functors V (Γ) of level k. Note that we could have worked directly with observation trees labelled by elements of A instead of working with generalized partitions; this would have given a strictly larger monad but equivalent results.

Associated to probability families we define now a right

V_{X}

-module (in the terms of Fresse, Partitions, the term

V_{X}

-algebra being reserved to a structure of left module on a constant functor).

For that we introduce the notion of divided probability.

Definition 16. A divided probability law of degree m is a sequence of triples (p, P, U) = (p₁, P₁, U₁; …; p_m, P_m, U_m), where the p_i; i = 1, …, m are non-negative numbers of sum one, i.e., p₁+…+p_m = 1, where each P_i; i = 1, …, m is a classical (resp. quantum) probability law when the corresponding p_i is strictly positive, and a probability law or the empty set when the corresponding p_i is equal to 0, and where each U_i; i = 1, …, m is the support in X of P_i; moreover the U_i are assumed to be disjoint in the classical case (resp. orthogonal in the quantum case). The letter P will designate the probability p₁P₁ +…+p_mP_m, with the convention 0.∅ = 0.

The symbol

D (m)

designates the set of divided probabilities of degree m on X, and

D_{X} (m)

denotes the subset made with probability laws in Q_X adapted to a variable X.

The vector space generated by

D_{X} (m)

will be written

ℒ_{X} (m)

. We put

ℒ_{X} (0) = 0

.

We also introduce the subspace

K (m)

of

ℒ_{X} (m)

which is generated by two families of vectors in

ℒ_{X} (m)

:

First the vectors

L (λ, p^{'}, p^{″}, P, U) = λ (p^{'}, P, U) + (1 - λ) (p^{″}, P, U) - (λ p^{'} + (1 - λ) p^{″}, P, U),

(141)

where λ is any real number between 0 and 1, and (p′, P, U), (p″, P, U) two divided probabilities associated to the same sequence of probability laws (P₁,…, P_m) and the same supports (U₁, …, U_m);

Second the vectors

D (p, P, U, P^{'}, U^{'}) = (p, P, U) - (p, P^{'}, U^{'}),

(142)

where for each index i between 1 and m, such that p_i > 0 we have

P_{i} = {P^{'}}_{i}

, and consequently

U_{i} = {U^{'}}_{i}

.

Then we define the space of classes of divided probabilities as the quotient real vector space

M_{X} (m) = ℒ_{X} (m) / K (m)

. In particular M_X (0) = 0, M_X (1) is freely generated over ℝ by the elements of Q_X.

Lemma 10. The space

M_{X} (m)

is freely generated over ℝ by the vectors (∅, …, ∅, P_i, ∅, …, ∅) of length m, where at the rank i, P_i is an element of Q_X.

Proof. Let D = (p₁, P₁, U₁), …, (p_m, P_m, U_m) be a divided probability; we consider for each i between 1 and m the divided probability

D_{i} = (0, P_{1}, U_{1}), \dots, (0, P_{i - 1}, U_{i - 1}), (1, P_{i}, U_{i}), (0, P_{i + 1}, U_{i + 1}), \dots, (0, P_{m}, U_{m}),

then the vector

D - \sum_{i} p_{i} D_{i}

is a sum of vectors of type L in

K_{X} (m)

. However, for each i, the vector D_i − (∅, …, ∅, P_i, ∅, …, ∅) is of type D, thus the particular vectors of the Lemma 10 generate

M_{X} (m)

.

Now, we prove that, if a linear combination of r of these vectors belongs to

K_{X}

, the coefficients of this combination must all be equal to 0. We proceed by induction on r, the result being evident for r = 1. We can also suppose that at least two of the vectors involved have a non-empty element at the same place, which we can suppose to be i = 1. All vectors with p₁ = 0 can be replaced by a vector where P₁ = ∅ using an element of type D in

K_{X} (m)

, then we can assume that at least one of the vectors has a strictly positive p₁, i.e., equal to 1. Let us consider all these vectors D₁, …, D_s, for 2 ≤ s ≤ r; their other numbers p_i for i > 1 are zero. The other vectors D_j, for j > s, have the coordinate p₁ equal to zero. Let ∑_j λ_j D_j be the linear combination of length r belonging to

K_{X} (m)

; this vector is a linear combination of vectors of type L and D. We can suppose that every λ_j is non-zero. Let us consider an element Q of Q_X which appears in at least one of the D_j, j ≤ s; this Q cannot appear in only one D_j, because the sum of coefficients λ multiplied by the first p₁ in front of any given Q in a vector L or D is zero. Thus we have at least two D_j with the same P₁. We can replace the sum of those with λ_j positive (resp. negative) by only one special vector of Lemma 10, using a sum of multiples of vectors of type L. Then we are left with the case of two vectors D₁, D₂ having P₁ = Q such that λ₁ + λ₂ = 0, which means that λ₁D₁ + λ₂D₂ is a multiple of a vector of type D. Subtracting it, we can apply the induction hypothesis and conclude that the considered linear relation is trivial.

As a corollary an equivalent definition of the spaces

M_{X} (m)

would be the real vector space freely generated by pairs (P, i) where P ∈ Q_X and i ∈ [m]. Such a vector, identified with (∅, .., P, …, ∅) in

ℒ_{X} (m)

, where only the place i is non-empty, will be named a simple vector of degree m.
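The reduction used in the proof of Lemma 10 can be sketched in a few lines of Python for the classical case. The sketch assumes that a law is encoded as a dictionary from outcomes to probabilities, with `None` standing for the empty set, and that supports are left implicit; the function name `to_simple_vectors` is hypothetical. It returns the coordinates of the class of a divided probability in the basis of simple vectors (P, i).

```python
from typing import Dict, Hashable, List, Optional, Tuple

Law = Optional[Dict[Hashable, float]]  # None encodes the empty set (assumption)
SimpleVector = Tuple[Tuple[Tuple[Hashable, float], ...], int]  # (P, i)


def to_simple_vectors(divided: List[Tuple[float, Law]]) -> Dict[SimpleVector, float]:
    """Coordinates of the class of a divided probability on simple vectors.

    Relations of type L make the class linear in the external law p, and
    relations of type D erase the laws sitting at places where p_i = 0.
    """
    coords: Dict[SimpleVector, float] = {}
    for i, (p_i, law) in enumerate(divided, start=1):
        if p_i > 0 and law is not None:
            key = (tuple(sorted(law.items())), i)  # hashable encoding of (P_i, i)
            coords[key] = coords.get(key, 0.0) + p_i
    return coords


# Hypothetical divided probability of degree 3; the middle entry has p_2 = 0.
divided = [(0.25, {0: 1.0}), (0.0, {3: 1.0}), (0.75, {1: 0.5, 2: 0.5})]
coords = to_simple_vectors(divided)
```

Here the law at place 2 disappears from the class, and the two remaining coefficients 0.25 and 0.75 are the external probabilities p₁ and p₃.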

Let S = (S₁, …, S_m) be a sequence of generalized decompositions in

S_{X}^{*} (A)

, of respective degrees n₁, …, n_m, with n = n₁ + … + n_m, and let (p, P, U) be an element of

D_{X} (m)

, we define θ((p, P, U), S) as the following divided probability of degree n: if, for i = 1, …, m the decomposition S_i is made of pieces

E_{i}^{j_{i}}

where j_i varies between 1 and n_i, we take for

p_{i}^{j_{i}}

the classical probability

ℙ (E_{i}^{j_{i}} \cap U_{i})

; we take for

P_{i}^{j_{i}}

the law P_i conditioned by the event S_i = j_i which corresponds to

E_{i}^{j_{i}}

; and we take for

U_{i}^{j_{i}}

the support of

P_{i}^{j_{i}}

. Then we order the obtained family of triples

{(p_{i}^{j_{i}}, P_{i}^{j_{i}}, U_{i}^{j_{i}})}_{i = 1, \dots, m; j_{i} = 1, \dots, n_{i}}

by the lexicographic ordering. It is easy to verify that the resulting sequence is a divided probability.
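The construction of θ can be made concrete by the following minimal Python sketch, in the classical case only. It assumes finite outcome sets, laws encoded as dictionaries (with `None` for the empty set), and uses the fact that, since U_i is the support of P_i, the probability ℙ(E ∩ U_i) equals p_i P_i(E); the name `theta` and the data layout are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Sequence, Tuple

Law = Optional[Dict[int, float]]            # None encodes the empty set (assumption)
Triple = Tuple[float, Law, FrozenSet[int]]  # (p_i, P_i, U_i)


def theta(divided: Sequence[Triple],
          decomps: Sequence[Sequence[FrozenSet[int]]]) -> List[Triple]:
    """Divide each (p_i, P_i, U_i) by the generalized decomposition decomps[i]."""
    out: List[Triple] = []
    for (p_i, law, _support), s_i in zip(divided, decomps):
        for part in s_i:  # lexicographic order: first i, then j_i
            mass = 0.0 if law is None else sum(q for x, q in law.items() if x in part)
            if p_i * mass > 0:
                # Law P_i conditioned on the event "S_i = j_i", and its support.
                cond = {x: q / mass for x, q in law.items() if x in part and q > 0}
                out.append((p_i * mass, cond, frozenset(cond)))
            else:
                # Dead branch: zero weight and empty law.
                out.append((0.0, None, frozenset()))
    return out


# Hypothetical example on the set {0, 1, 2, 3}.
divided = [
    (0.5, {0: 0.5, 1: 0.5}, frozenset({0, 1})),
    (0.5, {2: 1.0}, frozenset({2})),
]
decomps = [
    [frozenset({0}), frozenset({1})],
    [frozenset({2}), frozenset({3})],
]
refined = theta(divided, decomps)
```

The output has degree 2 + 2 = 4, its weights still sum to one, and the last entry is a dead branch carrying the empty law.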

Extending by linearity we get a linear map,

λ_{m} : ℒ_{X} (m) \otimes V_{X} (n_{1}) \otimes \dots \otimes V_{X} (n_{m}) \to ℒ_{X} (n_{1} + \dots + n_{m}),

(143)

By linearity a vector of type L in

ℒ_{X} (m)

, tensorized with S₁⊗…⊗S_m goes to a linear combination of vectors of type L in

ℒ_{X} (n)

. Moreover, if p_i = 0 for an index i in [m], all the

p_{i}^{j_{i}}

are zero, thus a vector of type D goes to a vector of type D. Then the map λ_m sends the subspace

K_{X} (m) \otimes V_{X} (n_{1}) \otimes \dots \otimes V_{X} (n_{m})

into the subspace

K_{X} (n_{1} + \dots + n_{m})

, thus it defines a linear map

θ_{m} : ℳ_{X} (m) \otimes V_{X} (n_{1}) \otimes \dots \otimes V_{X} (n_{m}) \to ℳ_{X} (n_{1} + \dots + n_{m}),

(144)

On a simple vector (P, i), the operation θ_m is independent of the S_j for j ≠ i.

Now we introduce the Schur functor

M_{X}

of symmetric co-invariant spaces

ℳ_{X} (W) = \oplus_{m} ℳ_{X} (m) \otimes_{S_{m}} W^{\otimes m}

from the category of real vector spaces to itself, associated to the

S - module

ℳ_{X}^{*}

(cf. Loday and Valette [10], Fresse [16]), formed by the graded family

ℳ_{X} (m); m \in ℕ

.

Then the maps θ_m define a natural transformation of functors:

θ_{X} : ℳ_{X} \circ V \to ℳ_{X} .

(145)

In addition, this set of transformations behaves naturally with respect to X in the information category S. Note that it defines a co-variant functor, not a presheaf.

For simplicity, we will in general write θ, µ, ℱ, V, … instead of θ_X, µ_X, ℱ_X, V_X, …, keeping in mind that this is an abuse of notation.

Then the composite functor

ℳ \circ V (W)

is given by

\begin{array}{l} ℳ_{X} \circ V_{X} (W) = \underset{m \geq 0}{\oplus} ℳ_{X} (m) \otimes_{S_{m}} \underset{i}{\otimes} (V_{X} (n_{i}) \otimes_{S_{n_{i}}} W^{\otimes n_{i}}) \\ = \underset{n \geq 0}{\oplus} \underset{m \geq 0}{\oplus} \underset{n_{1}, \dots, n_{m}; \sum_{i} n_{i} = n}{\oplus} ℳ_{X} (m) \otimes_{S_{m}} \underset{i}{\otimes} V_{X} (n_{i}) \otimes_{S_{n_{1}, \dots, n_{m}}} W^{\otimes n}; \end{array}

where

S_{n_{1}, \dots, n_{m}}

denotes the group of permutations by blocks.

Proposition 7. The natural transformation θ defines a right action in the sense of monads, i.e., we have

θ \circ (ℱ μ) = θ \circ (θ V); θ \circ (ℱ η) = I d .

(146)

Proof. The proof is the same as for proposition 5, by using the associativity of conditioning, and the Bayes identity P (A ∩ B) = P (A|B)P (B).

Ginzburg and Kapranov [56] gave a construction of the (co)bar complex of an operad based on decorated trees. It is a graded complex of operads, with a differential operator of degree −1. The dual construction can be found in Getzler and Jones [55]; it gives a graded complex of co-operads with a differential operator of degree +1. The link with quasi-free co-operads and operads (Quillen’s construction) is developed by Fresse (in “partitions” [16]); in this article Fresse also shows that these constructions correspond to the simplicial bar construction for the monads (Mac Lane) and to the natural notions of derived functors in this context.

In our case, with two right modules, the easiest way is to use the bar construction of Beck (1967) [19], made more explicit by Fresse with decorated trees in the case of monads coming from operads.

A morphism from a right module

ℳ

over

V

to a right module

ℛ

over

V

is a natural transformation f from the first functor to the second such that

f \circ θ_{M} = θ_{R} \circ f V

.

In what follows we will use the module R which comes from the functor of symmetric powers:

R (W) = \underset{m}{\oplus} S^{m} (W);

(147)

it is the Schur functor associated to the trivial

S_{*} - module

,

ℛ (m) = ℝ

, i.e., the action of

S_{m}

on

ℛ (m)

is trivial. We put

ℛ (0) = ℝ

.

The right action of

V_{X}

is given by the map

ρ_{m} : ℛ_{X} (m) \otimes V_{X} (n_{1}) \otimes \dots \otimes V_{X} (n_{m}) \to ℛ_{X} (n_{1} + \dots + n_{m}),

(148)

which sends each generator (1, S₁, …, S_m) to 1 in

ℛ (n) = ℝ

.

The axioms of a right module are easy to verify.

This

V

-module

ℛ

will play the dual role of the trivial module in the case of information structure co-homology.

Following Beck (Triples, Algebras, Cohomology, 1967, 2002 [19]), we consider the simplicial bar complex

ℳ_{X} \circ V_{X}^{*}

extending the right module

ℳ

on

V

by the sequence of modules

\dots . \to ℳ_{X} \circ V_{X}^{\circ (k + 1)} \to ℳ_{X} \circ V_{X}^{\circ k} \to \dots

. Then we introduce the growing complex

C^{*} (ℳ_{X})

of measurable morphisms from

ℳ_{X} \circ V_{X}^{*}

to the symmetric right module R.

For a given k ≥ 0, a morphism F from

ℳ_{X} \circ V_{X}^{\circ k}

to R is defined by a family of maps F (N) :

ℳ_{X} \circ V_{X}^{\circ k} (N) \to ℛ (N) = ℝ

for N ∈ ℕ.

This gives a family of measurable numerical functions of a divided probability law (p, P, U), of degree m ≤ N, indexed by forests having m components trees of height k and having total number of ending branches N.

We denote such a family of functions by the symbol F_X(S₁; S₂; …; S_k; (p, P, U)), indexed by X in S, where S₁; …; S_k here designates the sets of decompositions present in the trees at each level from 1 to k.

First we remark that the compatibility with the action of

V_{X}

to the right imposes that for any allowed set of variables S_k₊₁ we must have

F_{X} (S_{1}; S_{2}; \dots; μ (S_{k}, S_{k + 1}); (p, P, U)) = F_{X} (S_{1}; S_{2}; \dots; S_{k}; (p, P, U)) .

(149)

By taking for S_k the collection (π₀, …, π₀), we deduce that F_X is independent of the last variable.

This has the effect of decreasing the degree in k by one, in keeping with the preceding conventions on information cochains; i.e., we set

C^{k} (ℳ_{X}) = \mathrm{Hom} (ℳ_{X} \circ V^{\circ (k + 1)}, ℛ)

.

Secondly, as we are working with the quotient of the space generated by divided probabilities (p, P, U) by the space generated by linearity relations on the external law p, for (p, P, U) of degree m, we have

F_{X} (S_{1}; S_{2}; \dots; S_{k}; (p, P, U)) = \sum_{i = 1}^{m} p_{i} F_{X} (S_{1}; S_{2}; \dots; S_{k}; (P_{i}; i, m));

(150)

where (Q; i, m) designates the divided probability of degree m where all the laws in the sequence are empty except for the number i where it is equal to Q.

Moreover, from the definition of θ and the rule of composition of functors, for any m ≥ 1 and i ∈ [m], and any simple vector (Q, i, m), the value of F on any forest depends only on the tree component of index i; that we can summarize by the following identity:

F_{X} (S_{1}; S_{2}; \dots; S_{k}; (Q; i, m)) = F_{X} (T (S_{1}^{i}; S_{2}^{i}; \dots; S_{k}^{i}); (Q; i, m));

(151)

where

T (S_{1}^{i}; S_{2}^{i}; \dots; S_{k}^{i})

designates the tree numbered by i, prolonged in any manner at all the places j ≠ i.

Definition 17. An element of

C^{k} (M_{X})

is said to be regular when for each degree m and each index i between 1 and m, we have, for each ordered forest S₁; S₂; …; S_k of m trees, and each probability Q,

F_{X} (S_{1}; S_{2}; \dots; S_{k}; (Q; i, m)) = F_{X} (S_{1}^{i}; S_{2}^{i}; \dots; S_{k}^{i}; Q);

(152)

where

S_{1}^{i}; S_{2}^{i}; \dots; S_{k}^{i}

designates the tree number i.

Due to Equation (150), it follows that regular elements are determined by their values on trees and ordinary (not divided) probabilities.

The adjective regular can be better interpreted as “local in the sense of observation trees”.

The vector space

C_{X}^{k} (N)

is generated by families of functions of divided probabilities F_X(S₁; S₂; …; S_k; (p, P, U)), indexed by X in S and forests S₁; …; S_k of level k. These families are assumed to be local with respect to X, which means that they are compatible with direct images of probabilities under observables in S^∗.

Remark 12. As we showed in the static case, in the classical context, locality is equivalent to the fact that the values of the functions depend on ℙ through the direct images of ℙ by the joint of all the ordered observables which decorate the tree (the joint of the joints along branches); but this is not necessarily true in the quantum context, where it depends on Q. However it is true for Q^min, in particular Q^can which is the most natural choice.

The spaces

C^{k} (M_{X})

form a natural degree one complex:

The faces

δ_{i}^{(k)}; 1 \leq i \leq k

are given by applying µ on

V \circ V

at the places (i, i + 1); the last face

δ_{k + 1}^{(k)}

consists in forgetting the last functor, the operation denoted by ϵ; and the zero face is given by the action θ. Then the coboundary δ^{(k)} is the alternating sum of the operators

δ_{i}^{(k)}; 0 \leq i \leq k + 1

: if F is a measurable morphism from

ℳ \circ V^{\circ k}

to ℝ, then

δ F = F \circ (θ V^{\circ k}) - \sum_{i = 0}^{k - 1} {(- 1)}^{i} F \circ (ℳ V^{\circ i} μ V^{\circ (k - i - 1)}) - {(- 1)}^{k} F \circ (ℳ V^{\circ k} ϵ) .

(153)

The zero face in the complex

C_{X}^{*}

corresponds to the right action of the monad V_X on divided probabilities; on regular cochains it is expressed by a generalization of the formula (20): if (P, i, m) is a simple vector of degree m and S₀; S₁; …; S_k a forest of level k + 1, with m component trees, then

\begin{array}{l} F_{S_{0}} (S_{1}; \dots; S_{k}; (P, i, m)) = F (S_{1}; \dots; S_{k}; θ ((P, i, m), S_{0})) \\ = \sum_{j_{i} = 1, \dots, n_{i}} ℙ (S_{0}^{i} = j_{i}) F (S_{1}^{j_{i}}; S_{2}^{j_{i}}; \dots; S_{k}^{j_{i}}; (P | S_{0}^{i} = j_{i})), \end{array}

(154)

where

S_{1}^{j_{i}}; S_{2}^{j_{i}}; \dots; S_{k}^{j_{i}}

designates the tree number j_i grafted on the branch j_i of the variable S₀^i at the place i in the collection S₀.

The formula (154) is compatible with the existence of dead branches.

Note that natural integers come into play in two different ways: m is the internal monadic degree and counts the number of components, or the length of partitions, while k is the height of the trees in the forest. The number k gives the degree in co-homology.

The coboundary δ of

C^{*}

is of degree +1 with respect to k and degree 0 with respect to m. For any m ∈ ℕ, the operator δ has the formula of the coboundary given by the simplicial structure associated to θ and µ:

\begin{array}{l} δ F (S_{0}; S_{1}; \dots; S_{k}; (p, P, U)) = F_{S_{0}} (S_{1}; \dots; S_{k}; (p, P, U)) \\ + \sum_{i = 1}^{i = k} {(- 1)}^{i} F (S_{0}; \dots; μ (S_{i - 1} \otimes S_{i}); S_{i + 1}; \dots; S_{k}; (p, P, U)) \\ + {(- 1)}^{k + 1} F (S_{0}; \dots; S_{k - 1}; (p, P, U)) \end{array}

(155)

We observe that locality is preserved by δ.

Lemma 11. If the transformation F is regular, then δF is regular; in other terms, the regular elements form a sub-complex

C_{r}^{k} (ℳ_{X})

.

Proof. Let (P, i, m) be a simple vector and S₀; …; S_k a forest with m components; let us denote by

S_{0}^{j}

the variable number j having degree n_j, and n = n₁ + … + n_m; we have

\begin{array}{l} δ F (S_{0}; \dots; S_{k}; (P, i, m)) \\ = F (S_{1}; \dots; S_{k}; θ ((P, i, m), S_{0}^{i})) - F (μ (S_{0}, S_{1}); \dots; S_{k}; (P, i, m)) - \dots \\ + {(- 1)}^{k} F (S_{0}; \dots; μ (S_{k - 1}, S_{k}); (P, i, m)) + {(- 1)}^{k + 1} F (S_{0}; \dots; S_{k - 1}; (P, i, m)) . \end{array}

(156)

The first term on the right is a combination of the image of F for the n simple vectors

P . S_{0}^{i, j_{i}}

of degree n = n₁ + … + n_m which result from the division of (P, i, m) by

S_{0}^{i}

. If F is regular, this combination is the same as the combination of the simple vectors of degree n_i constituting the division of (P, i, m) by

S_{0}^{i}

, which gives the same result as the first term on the right in the formula

\begin{array}{l} δ F (S_{0}^{i}; \dots; S_{k}^{i}; (P, 1, 1)) = F (S_{1}^{i}; \dots; S_{k}^{i}; θ (P, S_{0}^{i})) - F (μ (S_{0}^{i}, S_{1}^{i}); \dots; S_{k}^{i}; P) - \dots \\ + {(- 1)}^{k} F (S_{0}^{i}; \dots; μ (S_{k - 1}^{i}, S_{k}^{i}); P) + {(- 1)}^{k + 1} F (S_{0}^{i}; \dots; S_{k - 1}^{i}; P) . \end{array}

(157)

If F is regular the term number l > 1 on the right of the equation (156) coincides with the corresponding term on the right of the Equation (157).

Therefore the left-hand side of Equation (156) coincides with the left-hand side of Equation (157), which establishes the lemma.

We define

C_{r}^{*} (ℳ_{X})

as the sub-complex of regular vectors in

C^{*} (ℳ_{X})

. Its elements are named tree information cochains or arborescent information cochains.

By definition, the tree information co-homology is the homology of this regular complex, considered as a sheaf of complexes over the category S(A), i.e., a contravariant functor. This corresponds to the topos information co-homology in the monadic context.

To recover the case of the ordinary algebra of partitions, and the formulas of the bar construction in the first sections of this article, we have to take the special case where all the decompositions of the same level coincide at every level of the forests. In this case, we can replace the quotient

ℳ_{X}

by the modules of conditioning by a redefinition of the action on functions

ℱ_{X}

. However the notion of divided probabilities for observation trees and the definition of co-homology in the monadic context can be seen as the natural basis of information co-homology.

When k = 0, in the classical case, a cochain is a function f(ℙ); the locality condition says that it is a constant, and in this case it is a cocycle, because the fact that the probabilities sum to one implies f(ℙ) = f_S(ℙ). Then

H_{τ}^{0}

has dimension one.

When k = 0, in the quantum case, the spectral functions of ρ in the Q_X give invariant information co-chains. Among them the Von Neumann entropy is especially relevant because its co-boundary gives the classical entropy. However, only the constant function is an invariant zero degree co-cycle. Thus again

H_{U}^{0}

has dimension one.

For k = 1, a cochain is given by a function F_X(S; P), such that, whenever we have X → Y → S and the elements of Y refine S, we have F_X(S; P) = F_Y (S; Y_∗P). It is a cocycle when for every collection S₁, …, S_m of m observables, where m is the length of S, we have

F (μ_{m} (S, (S_{1}, \dots, S_{m})); P) = F (S; P) + \sum_{i} ℙ (S = i) F (S_{i}; P | S = i) .

(158)

Note that the partition µ_m(S, (S₁, …, S_m)) is not the joint of S and the S_i for i ≥ 1, except when all the S_i coincide. Thus it is remarkable that the ordinary entropy also satisfies this functional equation, which is finer than Shannon’s identity:

Proposition 8. The usual entropy H(S_∗ℙ) = H(S; ℙ) is an arborescent co-cycle.

Proof. By linearity on the module of divided probabilities

ℳ_{X}

, we can decompose the probability ℙ into the conditional probabilities ℙ|(S = s); thus we can restrict the proof to the case where S = π₀ is the trivial partition, i.e., m = 1.

Let X_i; i = 1, …, m denote the elements of the partition associated to S₀ and

X_{i}^{j}; j = 1, \dots, n_{i}

the pieces of the intersection of X_i with the elements of the partition associated to S_i; denote by p_i the probability of the event X_i and by

p_{i}^{j}

the probability of the event

X_{i}^{j}

; we have

H (μ_{m} (S_{0}; (S_{1}, \dots, S_{m})); ℙ) = - \sum_{i = 1}^{i = m} \sum_{j = 1}^{j = n_{i}} p_{i}^{j} \log_{2} p_{i}^{j},

(159)

and

H_{S_{0}} (S_{1}; \dots; S_{m}; ℙ) = - \sum_{i = 1}^{i = m} p_{i} \sum_{j = 1}^{j = n_{i}} \frac{p_{i}^{j}}{p_{i}} \log_{2} \frac{p_{i}^{j}}{p_{i}}

(160)

= - \sum_{i = 1}^{i = m} \sum_{j = 1}^{j = n_{i}} p_{i}^{j} (\log_{2} p_{i}^{j} - \log_{2} p_{i})

(161)

= - \sum_{i = 1}^{i = m} \sum_{j = 1}^{j = n_{i}} p_{i}^{j} \log_{2} p_{i}^{j} + \sum_{i = 1}^{i = m} \log_{2} p_{i} \sum_{j = 1}^{j = n_{i}} p_{i}^{j}

(162)

= - \sum_{i = 1}^{i = m} \sum_{j = 1}^{j = n_{i}} p_{i}^{j} \log_{2} p_{i}^{j} + \sum_{i = 1}^{i = m} p_{i} \log_{2} p_{i},

(163)

then

H (μ_{m} (S_{0}; (S_{1}, \dots, S_{m})); ℙ) - H_{S_{0}} (S_{1}; \dots; S_{m}; ℙ) = H (S_{0}; ℙ) .

(164)

Q.E.D.
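The identity (164) is easy to check numerically. The following Python sketch does so on a hypothetical example with m = 2, n₁ = 2 and n₂ = 3; the table `joint` holds the numbers p_i^j, which are assumed to be given directly rather than derived from partitions of a set, and the helper `entropy` is an illustrative name.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# joint[i][j] is the probability p_i^j of the piece X_i^j (hypothetical values).
joint: List[List[float]] = [[0.25, 0.25], [0.125, 0.125, 0.25]]

# p_i is the probability of X_i, i.e., the sum of the p_i^j over j.
p = [sum(row) for row in joint]

# Entropy of the multi-product partition, as in Equation (159).
h_product = entropy([q for row in joint for q in row])

# Conditioned term H_{S_0}(S_1; ...; S_m; P), as in Equation (160).
h_conditional = sum(p_i * entropy([q / p_i for q in row])
                    for p_i, row in zip(p, joint))

# Entropy of the root partition S_0.
h_root = entropy(p)
```

With these values, `h_product` is 2.25 bits, `h_conditional` is 1.25 bits and `h_root` is 1 bit, so that the difference of the first two is the third, as stated in Equation (164).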

This identity was discovered by Faddeev, Baez, Fritz and Leinster; see [20]. However, we propose that information homology explains its significance.

When the category of quantum information S, the set A and the probability functor Q are invariant under the unitary group, and if we choose a classical full subcategory

S

, there is a trace map from Q to

Q

, which induces a morphism from the classical arborescent co-homology of

S

, A and

Q

to the invariant quantum arborescent co-homology of S, A and Q.

As a corollary of the Lemma 10 and the Theorems 1 and 3, we obtain the following result:

Theorem 4. (i) both in the classical and the invariant quantum context, if S(A) is connected, sufficiently rich, and if Q is canonical, every 1-co-cycle is co-homologous to the entropy of Shannon; (ii) in the classical case H¹(

S

, A,

Q

) is the vector space of dimension 1 generated by the entropy; (iii) in the quantum case

H_{U}^{1} (S, A, Q) = 0

, and the only invariant 0-cochain which has for co-boundary the Shannon entropy is (minus) the Von-Neumann entropy.

6.4. Arborescent Mutual Information

For k = 2, a cochain is given by a local function of a probability and a rooted decorated tree of level 2. It is a cocycle when the following functional equation is satisfied

\begin{array}{l} \sum_{i} ℙ (S = i) F (T_{i}; U_{i}; P | S = i) - F (S; T; P) \\ = F (μ_{m} (S \circ T); U; P) - F (S; (μ_{n_{i}} (T_{i} \circ U_{i}); i \in [m]); P), \end{array}

(165)

where S denotes a variable of length m, T a collection of m variables T₁, …, T_m of respective lengths n₁, …, n_m and U a collection of variables

U_{i, j}^{k}

of respective lengths n_i,j, with i going from 1 to m, j going from 1 to n_i and k going from 1 to n_i,j; the notation U_i denoting the collection of variables

U_{i, j}^{k}

of index i.

Our aim is to extend in the monadic context the topological action of the ordinary information structure on functions of probability used in the discussion of mutual information.

For that, we define another structure of

V_{X}

-right module on the functor

ℳ_{X}

associated to probabilities, by defining the following map θ_t(m) from

ℳ_{X} (m)

tensorized with V_X(n₁)⊗…⊗V_X(n_m) to

ℳ_{X} (n)

, for n = n₁ + … + n_m:

θ_{t} ((P, i, m) \otimes S_{1} \otimes \dots \otimes S_{m}) = \sum_{j = 1, \dots, n_{i}} (P, (i, j), n) .

(166)

Remark that the generalized decompositions S_j are used only through the orders on their elements.
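In terms of the basis of simple vectors, the map θ_t only moves the law P to new places. The following Python sketch computes these places, under the assumption that the pair (i, j) is identified with the flat index n₁ + … + n_{i−1} + j of the lexicographic ordering; the function name `theta_t` and the 1-based indexing are illustrative choices.

```python
from typing import List, Sequence


def theta_t(i: int, degrees: Sequence[int]) -> List[int]:
    """Places, in [n], of the simple vectors (P, (i, j), n) for j = 1, ..., n_i.

    `degrees` is the list (n_1, ..., n_m) and `i` is a 1-based index in [m].
    Assumption: (i, j) is flattened by the lexicographic ordering.
    """
    if not 1 <= i <= len(degrees):
        raise ValueError("i must lie between 1 and m")
    offset = sum(degrees[: i - 1])  # n_1 + ... + n_{i-1}
    return [offset + j for j in range(1, degrees[i - 1] + 1)]


# Hypothetical example: m = 3, degrees (2, 3, 1), law placed at i = 2.
positions = theta_t(2, [2, 3, 1])
```

In this example the simple vector (P, 2, 3) is sent to the sum of the three simple vectors of degree 6 sitting at the places 3, 4 and 5; only the degrees n_i enter the computation.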

As for

ℛ

, it is easy to verify that the collection of maps θ_t(m) defines a right action of the monad V_X on the Schur functor

ℳ_{X}

.

Then we consider as before, the graded vector space

C^{*} (ℳ_{X})

of homomorphisms of

V

-modules from the functors

ℳ \circ V^{\circ k}; k \geq 0

to the functor

ℛ

which are measurable in the probabilities P . As before, on

C^{*} (ℳ_{X})

, we shift the degree by one, because of the independency with respect to the last stage of the forest, which follows from the trivial action on

ℛ

.

The topological coboundary operator δ_t is defined in every degree by the formula of the simplicial bar construction, as in Equation (153) for δ, but with θ_t replacing θ. It corresponds to the usual simplicial complex of the family

V^{\circ k}

. A cochain is represented by a family of functions of probability laws F_X(S₁; …; S_k; (P, i, m)), where S₁; …; S_k denotes a forest with m trees of level k. The operator δ_t is given by

\begin{array}{l} δ_{t} F (S_{0}; \dots; S_{k}; (P, i, m)) = F (S_{1}; \dots; S_{k}; θ_{t} ((P, i, m), S_{0})) \\ - F (μ (S_{0}, S_{1}); \dots; S_{k}; (P, i, m)) - \dots + {(- 1)}^{k} F (S_{0}; \dots μ (S_{k - 1}, S_{k}); (P, i, m)) \\ + {(- 1)}^{k + 1} F (S_{0}; \dots; S_{k - 1}; (P, i, m)) . \end{array}

(167)

where n = n₁ + … + n_m is the sum of numbers of branches of the generalized decompositions

S_{0}^{i}

for i = 1, …, m.

As for δ, a value F (S₁; …; S_k; (P, j, n)) depends only on the tree

S_{1}^{j}; \dots; S_{k}^{j}

rooted at the place numbered by j in the forest S₁; …; S_k.

Lemma 12. The coboundary δ_t sends a regular cochain to a regular cochain.

Proof. Consider a simple vector (P, i, m) in ℳ_X(m) and a forest S₀; …; S_k with m components; we denote by

S_{0}^{j}

the variable number j having degree n_j, and n = n₁ + … + n_m, and we consider the formula (167).

If F is regular the first term on the right is the sum of the images by F for P and the n trees

S_{1}^{i, j_{i}}

which result from the forgetting of the first branches

S_{0}^{j}

, and the other terms on the right are equal to the value of F for P and the tree rooted at i in S₀. On the other side for the tree

S_{0}^{i}; \dots; S_{k}^{i}

, if F is regular, we have

\begin{array}{l} δ_{t} F (S_{0}^{i}; \dots; S_{k}^{i}; (P, 1)) = \sum_{j} F (S_{1}^{i, j}; \dots; S_{k}^{i, j}; (P, 1)) - F (μ (S_{0}^{i}, S_{1}^{i}); \dots; S_{k}^{i}; (P, 1)) - \dots \\ + {(- 1)}^{k} F (S_{0}^{i}; \dots; μ (S_{k - 1}^{i}, S_{k}^{i}); (P, 1)) + {(- 1)}^{k + 1} F (S_{0}^{i}; \dots; S_{k - 1}^{i}; (P, 1)) . \end{array}

(168)

Thus δ_t F is topologically regular.

Consequently we can restrict δ_t to the subcomplex

C_{r}^{*} (N_{X})

, and name its homology the arborescent, or tree, topological information co-homology, written H_τ,t^∗(S^∗, A, Q).

We now propose to extend the notion of mutual information I(X; Y ; ℙ) in such a way that it is a cocycle for this co-homology, as was the case for the Shannon mutual information in the ordinary topological information complex. We propose to adopt the formulas using δ and δ_t, as in the standard case:

Definition 18. Let H(T ; (P, i, m)) denote the regular extension to forests of the usual entropy; then the mutual arborescent information between a partition S of length m and a collection T of m partitions T₁, …, T_m is defined by

I_{α} (S; T; ℙ) = δ_{t} H (S; T; ℙ) .

(169)

The identity δH = 0 implies

I_{α} (S; T; ℙ) = \sum_{i = 1}^{i = m} \left( H (T_{i}; ℙ) - ℙ (S = i) H (T_{i}; ℙ | S = i) \right) .

(170)

In the particular case where all the T_i are equal to a variable T , it gives

\begin{array}{l} I_{α} (S; T; ℙ) = \sum_{i = 1}^{i = m} ℙ (S = i) (H (T; ℙ) - H (T; ℙ | S = i)) + (m - 1) H (T; ℙ) \\ = H (T; ℙ) - \sum_{i = 1}^{i = m} ℙ (S = i) H (T; ℙ | (S = i)) + (m - 1) H (T; ℙ) \\ = H (T; ℙ) - H_{S} (T; ℙ) + (m - 1) H (T; ℙ), \end{array}

(171)

then

I_{α} (S; T; ℙ) = I (S; T; ℙ) + (m - 1) H (T; ℙ) .

For

S (A (

, the function I_α is an arborescent topological 2-cocycle.

It satisfies Equation (165), where ℙ replaces the conditional probabilities ℙ|(S = i) and where the factors ℙ(S = i) disappear. Note that, in this manner, maximizing I_α(S; T; ℙ) involves maximizing both the usual mutual information I(S; T; ℙ) and the unconditioned entropies H(T_i; ℙ).
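The relation between I_α and the usual mutual information can be checked numerically on a small example. The following is a minimal sketch, assuming a finite sample space whose points carry the probabilities `probs`, and partitions encoded as lists of labels (one label per point). The function names (`entropy`, `arborescent_mi`, `mutual_information`) and the toy data are illustrative choices, not notation from the paper. It evaluates the right-hand side of Equation (170) and compares it, in the case where all T_i equal T, with I(S; T; ℙ) + (m − 1) H(T; ℙ).

```python
import math
from collections import defaultdict


def entropy(labels, probs):
    """Shannon entropy (in nats) of the partition encoded by `labels` under `probs`."""
    mass = defaultdict(float)
    for lab, p in zip(labels, probs):
        mass[lab] += p
    return -sum(q * math.log(q) for q in mass.values() if q > 0)


def arborescent_mi(S, Ts, probs):
    """Right-hand side of Equation (170): sum_i [H(T_i; P) - P(S=i) H(T_i; P | S=i)]."""
    values = sorted(set(S))
    assert len(values) == len(Ts), "one partition T_i per value of S"
    total = 0.0
    for value, T in zip(values, Ts):
        weight = sum(p for lab, p in zip(S, probs) if lab == value)
        total += entropy(T, probs)
        if weight > 0:
            # Conditional law P | (S = value): renormalize on the block, zero elsewhere.
            cond = [p / weight if lab == value else 0.0 for lab, p in zip(S, probs)]
            total -= weight * entropy(T, cond)
    return total


def mutual_information(S, T, probs):
    """Usual mutual information, computed as H(S) + H(T) - H(S, T)."""
    joint = list(zip(S, T))
    return entropy(S, probs) + entropy(T, probs) - entropy(joint, probs)


# Toy data (assumed for illustration): six points, S has m = 3 blocks, T has 2.
probs = [0.1, 0.2, 0.15, 0.25, 0.05, 0.25]
S = [0, 0, 1, 1, 2, 2]
T = [0, 1, 0, 1, 1, 0]
m = len(set(S))

i_alpha = arborescent_mi(S, [T] * m, probs)
i_std = mutual_information(S, T, probs)
print(i_alpha, i_std + (m - 1) * entropy(T, probs))
```

The two printed numbers should agree up to floating-point rounding. Passing a list of distinct partitions as `Ts` evaluates the general formula (170) instead of the special case.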

Pursuing the homological interpretation of higher mutual information quantities given by the Formulas (55) and (56), we suggest the following definition:

Definition 19. The mutual arborescent informations of higher orders are given by I_{α,N} = −(δδ_t)^M H for N = 2M + 1 odd and by I_{α,N} = δ_t(δδ_t)^M H for N = 2M + 2 even.
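As a reading aid, the lowest orders of Definition 19 can be unrolled by substituting M = 0 and M = 1 into the two stated formulas. This is only a direct substitution, not an additional result:

```latex
\begin{aligned}
N = 2 \ (M = 0):&\quad I_{\alpha,2} = \delta_t H, \\
N = 3 \ (M = 1):&\quad I_{\alpha,3} = -\,\delta \delta_t H, \\
N = 4 \ (M = 1):&\quad I_{\alpha,4} = \delta_t \delta \delta_t H.
\end{aligned}
```

The case N = 2 coincides with the arborescent mutual information I_α = δ_t H of Definition 18.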

Acknowledgments

We thank MaxEnt14 for the opportunity to present this research to the information science community. We thank Guillaume Marrelec for discussions and notably for his participation in the research of the last part on optimal discrimination. We thank Frederic Barbaresco, Alain Chenciner, Alain Proute and Juan-Pablo Vigneaux for discussions and comments on the manuscript. We thank the "Institut des Systemes complexes" (ISC-PIF), region Ile-de-France, and the Max Planck Institute for Mathematics in the Sciences for the financial support and hosting of P. Baudot.

Author Contributions

Both authors contributed equally to the research; the second author wrote the manuscript. Both authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J 1948, 27, 379–423. [Google Scholar]
Kolmogorov, A. Combinatorial foundations of information theory and the calculus of probabilities. Russ. Math. Surv. 1983, 38. [Google Scholar] [CrossRef]
Thom, R. Stabilité struturelle et morphogénèse; deuxième ed.; Dunod: Paris, France, 1977; in French. [Google Scholar]
Mac Lane, S. Categories for the Working Mathematician; Springer: Berlin/Heidelberg, Germany, 1998. [Google Scholar]
Mac Lane, S. Homology; Springer: Berlin/Heidelberg, Germany, 1975. [Google Scholar]
Hu, K.T. On the Amount of Information. Theory Probab. Appl. 1962, 7, 439–447. [Google Scholar]
Baudot, P.; Bennequin, D. Information Topology I, in preparation.
Elbaz-Vincent, P.; Gangl, H. On poly(ana)logs I. Compos. Math. 2002, 130, 161–214. [Google Scholar]
Cathelineau, J. Sur l’homologie de sl2 a coefficients dans l’action adjointe. Math. Scand. 1988, 63, 51–86. [Google Scholar]
Loday, J.L.; Valette, B. Algebraic Operads; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
Matsuda, H. Information theoretic characterization of frustrated systems. Physica A 2001, 294, 180–190. [Google Scholar]
Brenner, N.; Strong, S.; Koberle, R.; Bialek, W. Synergy in a Neural Code. Neural Comput. 2000, 12, 1531–1552. [Google Scholar]
Nielsen, M.; Chuang, I. Quantum Computation and Quantum Information; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
Baudot, P.; Bennequin, D. Topological forms of information. AIP Conf. Proc. 2015, 1641, 213–221. [Google Scholar]
Baudot, P.; Bennequin, D. Information Topology II, in preparation.
Fresse, B. Koszul duality of operads and homology of partitionn posets. Contemp. Math. Am. Math. Soc. 2004, 346, 115–215. [Google Scholar]
May, J.P. The Geometry of Iterated Loop Spaces; Springer: Berlin/Heidelberg, Germany, 1972. [Google Scholar]
May, J.P. Einfinite Ring Spaces and Einfinite Ring Spectra; Springer: Berlin/Heidelberg, Germany, 1977. [Google Scholar]
Beck, J. Triples, Algebras and Cohomology. Ph.D. Thesis, Columbia University, New York, NY, USA, 1967. [Google Scholar]
Baez, J.; Fritz, T.; Leinster, T. A Characterization of Entropy in Terms of Information Loss. Entropy 2011, 13, 1945–1957. [Google Scholar]
Marcolli, M.; Thorngren, R. Thermodynamic Semirings 2011, arXiv. [CrossRef]
Baudot, P.; Bennequin, D. Information Topology III, in preparation.
Gromov, M. In a Search for a Structure, Part 1: On Entropy. 2013. Available online: http://www.ihes.fr/gromov/PDF/structre-serch-entropy-july5-2012.pdf accessed on 6 May 2015.
Watkinson, J.; Liang, K.; Wang, X.; Zheng, T.; Anastassiou, D. Inference of Regulatory Gene Interactions from Expression Data Using Three-Way Mutual Information. Chall. Syst. Biol. Ann. N.Y. Acad. Sci. 2009, 1158, 302–313. [Google Scholar]
Kim, H.; Watkinson, J.; Varadan, V.; Anastassiou, D. Multi-cancer computational analysis reveals invasion-associated variant of desmoplastic reaction involving INHBA, THBS2 and COL11A1. BMC Med. Genomics. 2010, 3. [Google Scholar] [CrossRef]
Uda, S.; Saito, T.H.; Kudo, T.; Kokaji, T.; Tsuchiya, T.; Kubota, H.; Komori, Y.; ichi Ozaki, Y.; Kuroda, S. Robustness and Compensation of Information Transmission of Signaling Pathways. Science 2013, 341, 558–561. [Google Scholar]
Han, T.S. Linear dependence structure of the entropy space. Inf. Control. 1975, 29, 337–368. [Google Scholar]
McGill, W. Psychometrika. Multivar. Inf. Transm. 1954, 19, 97–116. [Google Scholar]
Kolmogorov, A.N. Grundbegriffe der Wahrscheinlichkeitsrechnung; Springer: Berlin/Heidelberg, Germany, 1933; in German. [Google Scholar]
Artin, M.; Grothendieck, A.; Verdier, J. Théorie des topos et cohomologie étale des schémas—(SGA 4) Tome I,II,III; Springer: Berlin/Heidelberg, Germany, in French.
Grothendieck, A. Sur quelques points d’algèbre homologique, I. Tohoku Math. J 1957, 9, 119–221. [Google Scholar]
Gabriel, P. Objets injectifs dans les catégories ab liennes. Séminaire Dubreil. Algèbre et théorie des nombres 12, 1–32.
Bourbaki, N. Algèbre, chapitre 10, Algèbre homologique; Masson: Paris, France, 1980; in French. [Google Scholar]
Cartan, H.; Eilenberg, S. Homological Algebra; The Princeton University Press: Princeton, NJ, USA, 1956. [Google Scholar]
Tverberg, H. A new derivation of information function. Math. Scand. 1958, 6, 297–298. [Google Scholar]
Kendall, D. Functional Equations in Information Theory. Z. Wahrscheinlichkeitstheorie 1964, 2, 225–229. [Google Scholar]
Lee, P. On the Axioms of Information Theory. Ann. Math. Stat. 1964, 35, 415–418. [Google Scholar]
Kontsevitch, M. The 1+1/2 logarithm. Unpublished note. Reproduced in Elbaz-Vincent & Gangl, 2002 On poly(ana)logs I. Compositio Mathematica, 1995; e-print math.KT/0008089. [Google Scholar]
Khinchin, A. Mathematical Foundations of Information Theory; Dover: New York, NY, USA; Silverman, R.A.; Friedman, M.D., Translators; From two Russian articles in Uspekhi Matematicheskikh Nauk; 1957; pp. 17–75. [Google Scholar]
Yeung, R. Information Theory and Network Coding; Springer: Berlin/Heidelberg, Germany, 2007. [Google Scholar]
Cover, T.M.; Thomas, J. Elements of Information Theory; Wiley: Weinheim, Germany, 1991. [Google Scholar]
Rindler, W.; Penrose, R. Spinors and Spacetime, 2nd ed; Cambridge University Press: Cambridge, UK, 1986. [Google Scholar]
Landau, L.D.; Lifshitz, E.M. Fluid Mechanics, 2nd ed; Volume 6 of a Course of Theoretical Physics; Pergamon Press, 1959. [Google Scholar]
Balian, R. Emergences in Quantum Measurement Processes. KronoScope 2013, 13, 85–95. [Google Scholar]
Borel, A.; Ji, L. Compactifications of Symmetric and Locally Symmetric Spaces. In Unitary Representations and Compactifications of Symmetric Spaces; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
Doering, A.; Isham, C. Classical and quantum probabilities as truth values. J. Math. Phys. 2012, 53. [Google Scholar] [CrossRef]
Meyer, P. Quantum Probability for Probabilists; Springer: Berlin, Germany, 1993. [Google Scholar]
Souriau, J. Structure des Systemes Dynamiques; Jacques Gabay: Paris, France, 1970; in French. [Google Scholar]
Catren, G. Towards a Group-Theoretical Interpretation of Mechanics. Philos. Sci. Arch. 2013. http://philsci-archive.pitt.edu/10116/.
Bachet Claude-Gaspar, Problèmes plaisans et délectables, qui se font par les nombres; A. Blanchard: Paris, France, 1993; p. 1612, in French.
Jaynes, E.T.; Information, Theory. Statistical Mechanics. In Statistical Physics; Ford, K., Ed.; Benjamin: New York, NY, USA, 1963; p. 181. [Google Scholar]
Jaynes, E.T. Prior Probabilities. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 227–241. [Google Scholar]
Cohen, F.; Lada, T.; May, J. The Homology of Iterated Loop Spaces; Springer: Berlin, Germany, 1976. [Google Scholar]
Prouté, A. Introduction la Logique Catégorique. 2013. Available online: www.logique.jussieu.fr/~alp/ accessed on 6 May 2015.
Getzler, E.; Jones, J.D.S. Operads, homotopy algebra and iterated integrals for double loop spaces 1994, arXiv. hep-th/9403055v1.
Ginzburg, V.; Kapranov, M.M. Koszul duality for operads. Duke Math. J 1994, 76, 203–272. [Google Scholar]

© 2015 by the authors; licensee MDPI, Basel, Switzerland This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).
